I am an incoming Ph.D. student advised by Prof. Tao Yu at XLANG Lab, The University of Hong Kong, and a research intern on the Qwen Team at Alibaba Group. My research focuses on Embodied AI (VLA and WAM). I received my B.S. in Computer Science and Technology from Zhejiang University, where I was fortunate to be advised by Prof. Zhou Zhao. Feel free to reach out if you are interested in my work or have questions to discuss!
🔥 News
- 2026.05: 🎉🎉 “FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies” is released on arXiv.
- 2026.05: 🎉🎉 “Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments” is released on arXiv.
- 2025.09: 🎉🎉 “MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations” is accepted by NeurIPS 2025.
- 2025.09: 🎉🎉 “Tree of Preferences for Diversified Recommendation” is accepted by NeurIPS 2025.
📝 Publications
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
Xintong Hu*, Xuhong Huang*, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu
🌐 Project Page
Code
RoboFine-VLM
Benchmark
An open framework for fine-grained VLA supervision, comprising: (1) FineVLA-Data and Pipeline, which unifies 972K trajectories from 10 robot datasets into 47K human-verified fine-grained trajectories; (2) RoboFine-Bench, a 500-video benchmark with 10K+ atomic facts and 1K VQA questions; (3) RoboFine-VLM, a robotics-specialized VLM annotator for scalable trajectory annotation; (4) FineVLA-Policy, a steerable VLA policy achieving 86.8%/82.5% on RoboTwin and 62.7/100 in real-world dual-arm manipulation.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen Team (Xintong Hu is a core contributor)
- A unified embodied foundation model extending Qwen’s VL stack to action and trajectory generation via a DiT-based decoder, unifying manipulation, navigation, and trajectory prediction.
- Achieves 97.9% on LIBERO, 86.1%/87.2% on RoboTwin, 69.0% OSR on R2R, 76.9% OOD success on real-world ALOHA.
MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations
Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Xintong Hu*, Yu Zhang*, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao†
- Dataset: MRSAudio, a 500-hour multimodal spatial audio dataset with ambisonic audio, synchronized video, motion trajectories, and fine-grained annotations (transcripts, lyrics, scores), covering 4 real-world scenarios (daily life, speech, singing, music).
- Benchmark: A unified benchmark for 5 spatial audio tasks (spatialization, text-to-speech, singing synthesis, music generation, sound localization), enabling 3D-aware audio modeling.

Tree of Preferences for Diversified Recommendation
Hanyang Yuan, Ning Tang, Tongya Zheng, Jiarong Xu, Xintong Hu, Renhong Huang, Shunyu Liu, Jiacong Hu, Jiawei Chen, Mingli Song†
- Abstract: Uses an agent to complete missing user information and thereby improve the diversity of recommendations, addressing the filter bubble problem in traditional recommendation algorithms.
🎖 Honors and Awards
- 2025.10 National Scholarship (Top 1%).
- 2024.10 National Scholarship (Top 1%).
- 2025.06 2025 IEEE ASRU AudioMos Challenge Second Prize.
- 2024.11 Zhejiang Province “Shangde Scholar” Award (Single Recipient).
- 2024.11 Zhejiang University CS “Campus Star” Honor (Top 10 Students).
- 2025.09 Zhejiang University First-Class Scholarship (Top 3%).
- 2024.09 Zhejiang University First-Class Scholarship (Top 3%).
📖 Education
- 2022.09 - Now, B.S., Zhejiang University, School of Computer Science and Technology.
💻 Internships
Alibaba Group, Qwen Team (Hangzhou)
Research Intern (01/2026 – Present)
Advisor: Shuai Bai
Research Topic: Vision-Language-Action (VLA)
XLANG NLP Lab, The University of Hong Kong (Hong Kong)
Research Assistant (06/2025 – Present)
Advisor: Prof. Tao Yu
Research Topic: Embodied AI
YiWise Lab, Zhejiang University (Hangzhou)
Research Assistant (02/2025 – 06/2025)
Advisor: Prof. Zhou Zhao
Research Topic: Spatial Audio
VIPA Lab, Zhejiang University (Hangzhou)
Research Assistant (07/2024 – 02/2025)
Advisor: Prof. Mingli Song
Research Topic: Recommendation
DemoPage