I am an incoming Ph.D. student of Prof. Tao Yu at XLANG Lab, The University of Hong Kong, and a research intern at the Qwen Team, Alibaba Group. My research focuses on Embodied AI (VLA and WAM). I received my B.S. in Computer Science and Technology from Zhejiang University, where I was fortunate to be advised by Prof. Zhou Zhao. Feel free to reach out if you are interested in my work or have any questions to discuss!

🔥 News

2026.05: 🎉🎉 “FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies” is released on arXiv.
2026.05: 🎉🎉 “Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments” is released on arXiv.
2025.09: 🎉🎉 “MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations” is accepted by NeurIPS 2025.
2025.09: 🎉🎉 “Tree of Preferences for Diversified Recommendation” is accepted by NeurIPS 2025.

📝 Publications

Preprint 2026

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Xintong Hu^*, Xuhong Huang^*, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

🌐 Project Page Code RoboFine-VLM Benchmark

An open framework for fine-grained VLA supervision, comprising: (1) FineVLA-Data and Pipeline, which unifies 972K trajectories from 10 robot datasets into 47K human-verified fine-grained trajectories; (2) RoboFine-Bench, a 500-video benchmark with 10K+ atomic facts and 1K VQA questions; (3) RoboFine-VLM, a robotics-specialized VLM annotator for scalable trajectory annotation; (4) FineVLA-Policy, a steerable VLA policy achieving 86.8%/82.5% on RoboTwin and 62.7/100 in real-world dual-arm manipulation.

Preprint 2026

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Qwen Team (Xintong Hu is a core contributor)

Code 🌐 Project Page

A unified embodied foundation model extending Qwen’s VL stack to action and trajectory generation via a DiT-based decoder, unifying manipulation, navigation, and trajectory prediction.
Achieves 97.9% on LIBERO, 86.1%/87.2% on RoboTwin, 69.0% OSR on R2R, and 76.9% OOD success on real-world ALOHA.

NeurIPS 2025 Poster

MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo^*, Changhao Pan^*, Zhiyuan Zhu^*, Xintong Hu^*, Yu Zhang^*, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao^†

DemoPage Code Dataset

- Dataset: Establishes MRSAudio, a 500-hour multimodal spatial audio dataset with ambisonic audio, synchronized video, motion trajectories, and fine-grained annotations (transcripts, lyrics, scores), covering 4 real-world scenarios (daily life, speech, singing, and music).
- Benchmark: Provides a unified benchmark for 5 spatial audio tasks (spatialization, text-to-speech, singing synthesis, music generation, sound localization), enabling 3D-aware audio modeling.

NeurIPS 2025 Poster

Tree of Preferences for Diversified Recommendation

Hanyang Yuan, Ning Tang, Tongya Zheng, Jiarong Xu, Xintong Hu, Renhong Huang, Shunyu Liu, Jiacong Hu, Jiawei Chen, Mingli Song^†

Code

- Abstract: Uses an agent to complete missing user information and thereby improve recommendation diversity, helping to address the filter bubble problem in traditional recommendation algorithms.

🎖 Honors and Awards

2025.10 National Scholarship (Top 1%).
2024.10 National Scholarship (Top 1%).
2025.06 2025 IEEE ASRU AudioMos Challenge Second Prize.
2024.11 Zhejiang Province “Shangde Scholar” Award (Single Recipient).
2024.11 Zhejiang University CS “Campus Star” Honor (Top 10 Students).
2025.09 Zhejiang University First-Class Scholarship (Top 3%).
2024.09 Zhejiang University First-Class Scholarship (Top 3%).

📖 Education

2022.09 - Present, B.S., Zhejiang University, School of Computer Science and Technology.

💻 Internships

Alibaba Group, Qwen Team (Hangzhou) Research Intern (01/2026 – Present)
Advisor: Shuai Bai
Research Topic: Vision-Language-Action (VLA)

XLANG NLP Lab, The University of Hong Kong (Hong Kong) Research Assistant (06/2025 – Present)
Advisor: Prof. Tao Yu
Research Topic: Embodied AI

YiWise Lab, Zhejiang University (Hangzhou) Research Assistant (02/2025 – 06/2025)
Advisor: Prof. Zhou Zhao
Research Topic: Spatial Audio

VIPA Lab, Zhejiang University (Hangzhou) Research Assistant (07/2024 – 02/2025)
Advisor: Prof. Mingli Song
Research Topic: Recommendation

Xintong Hu (胡鑫通)
