[Google Scholar, 40k+ as of 05/2026] [GitHub] [LinkedIn] [Twitter] [Medium]

About Me

Hello. I am a research scientist at Facebook AI Research (FAIR), working on multi-modal pre-training, scalability, self-supervised learning and world modeling. I lead the MetaCLIP series (1, 2 and S) and the Llama 3 vision encoder, and also contribute to foundation models such as DINOv2, Perception Encoder and SAM 3.

Although scaling may seem like a solved problem in the age of LLMs, many of today's hardest challenges are fundamentally scalability issues (bottlenecks that prevent scaling from letting better intelligence emerge): the Internet's data wall for LLMs, scalability across modalities (e.g., the separation of LLMs and diffusion models), and the curse of multilinguality, where English-only models often outperform multilingual models even on English tasks (as in our work on MetaCLIP 2).

[Figure: scalability of data]

I see scalability and the Bitter Lesson as two sides of the same coin: scalability exposes current bottlenecks that await scientific breakthroughs, while the Bitter Lesson reminds us of past approaches that ran into scalability issues.

I am fortunate to work with many exceptional researchers who are leading the progress of AI, including Prof. Saining Xie (AMI, NYU), Prof. Luke Zettlemoyer (UW, Meta), Prof. Zhuang Liu (Princeton), Dr. Scott Yih (Meta), Dr. Xinlei Chen (xAI) and many others. I have also advised the following awesome students: Jeff Cui (first author of DynaMo, NYU), Lihe Yang (first author of Depth Anything 1/2, HKU), Yung-Sung Chuang (CSAIL, MIT, OpenAI), Shuming Liu (KAUST), Xiaoqian Shen (KAUST), Jiawei Ma (Columbia University), Wei Chen (JHU), Max Bain (VGG Group, Deepmind), Zhiyu Chen (UCSB) and others.

Previously, I received my Ph.D. in computer science from the University of Illinois at Chicago, advised by Prof. Philip S. Yu and Prof. Bing Liu. I received my master's degree in microelectronics from Peking University. During my Ph.D. study I also worked as a research intern at Facebook AI, Amazon AI Lab and WeChat AI Lab. I am the winner of the Yelp Dataset Challenge.

News

I'm open to invited talks about insights from my research; feel free to reach out.
[Sep. 2025] Gave a talk at Cohere Lab on MetaCLIP 2.
[Feb. 2020] I defended and open-sourced my Ph.D. thesis.
[Sep. 2019] Ph.D. proposal on lifelong representation learning for NLP.

Publication

2026

(Pixio) In Pursuit of Pixel Supervision for Visual Pre-training
Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao and Hu Xu.
[arxiv], [code]

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
CVPR 2026
Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra and Yunyang Xiong.
[arxiv]

Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
MLSys 2026
Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu and Shang-Wen Li.
[arxiv]

2025

Meta CLIP 2: A Worldwide Scaling Recipe
NeurIPS 2025 Spotlight
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li and Hu Xu.
[arxiv], [code]

Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network
CVPR 2026
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár and Christoph Feichtenhofer.
[arxiv], [code]

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
arXiv 2024
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny and Vikas Chandra.
[arxiv]

DINOv2 Meets Text: A Unified Framework for Image-and Pixel-Level Vision-Language Alignment
CVPR 2025
Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut and Piotr Bojanowski.
[arxiv]

DepthLM: Metric Depth From Vision Language Models
ICLR 2025 Oral
Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra and Yangyang Shi.
[arxiv], [code]

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
ICML 2025
Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li and Wen-tau Yih.
[arxiv]

2024

Demystifying CLIP Data (MetaCLIP)
ICLR Spotlight
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer and Christoph Feichtenhofer.
[arxiv], [code]

The Llama 3 Herd of Models
(Core Author)
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone et al.
[arxiv]

Chameleon: Mixed-modal early-fusion foundation models
Chameleon Team
[arxiv], [code]

MoDE: CLIP Data Experts via Clustering
CVPR 2024
(Project Lead)
Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu
[arxiv], [code]

An introduction to vision-language modeling
Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, Pietro Astolfi, Reyhane Askari Hemmat, Jun Chen, Kushal Tirumala, Rim Assouel, Mazda Moayeri, Arjang Talattof, Kamalika Chaudhuri, Zechun Liu, Xilun Chen, Quentin Garrido, Karen Ullrich, Aishwarya Agrawal, Kate Saenko, Asli Celikyilmaz, Vikas Chandra
[arxiv]

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
COLM
Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li
[arxiv]

2023

CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023
Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer.
[paper], [code]

DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
[paper], [code]

MAViL: Masked Audio-Video Learners
NeurIPS 2023
Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer.
[paper], [code]

Diffusion Models as Masked Autoencoders
ICCV 2023
Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer.
[paper], [website]

2022

Adapting a Language Model While Preserving its General Knowledge
EMNLP 2022
Zixuan Ke, Yijia Shao, Haowei Lin, Hu Xu, Lei Shu and Bing Liu.
[paper], [code]

Continual Training of Language Models for Few-Shot Learning
EMNLP 2022
Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu and Bing Liu.
[paper], [code]

Masked Autoencoders that Listen
NeurIPS 2022
Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer.
[paper], [code]

CM3: A Causal Masked Multimodal Model of the Internet
Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin*, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer.
[arxiv]

2021

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
EMNLP 2021
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
[arxiv], [code]

HTLM: Hyper-text pre-training and prompting of language models
Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer
[arxiv]

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
ACL Findings 2021
Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
[arxiv], [code]

Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning
NeurIPS 2021
Zixuan Ke, Bing Liu, Nianzu Ma, Hu Xu and Lei Shu
[code]

CLASSIC: Continual and Contrastive Learning of Aspect Sentiment Classification Tasks
EMNLP 2021
Zixuan Ke, Bing Liu, Hu Xu and Lei Shu
[code]

NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions
EMNLP Findings 2021
Zhiyu Chen, Honglei Liu, Hu Xu, Seungwhan Moon, Hao Zhou, Bing Liu
[arxiv], [code and data]

Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks
NAACL 2021
Zixuan Ke, Hu Xu and Bing Liu
[paper], [code]

2020

Understanding Pre-trained BERT for Aspect-based Sentiment Analysis
COLING 2020
Hu Xu, Lei Shu, Philip S. Yu, Bing Liu
[arxiv], [code]

User Memory Reasoning for Conversational Recommendation
COLING 2020
Hu Xu, Seungwhan Moon, Honglei Liu, Bing Liu, Pararth Shah, Bing Liu, Philip S. Yu
[arxiv]

DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis
EMNLP Findings 2020
Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
[arxiv]

Controllable Text Generation with Focused Variation
EMNLP Findings 2020
Lei Shu, Alexandros Papangelis, Yi-Chia Wang, Gokhan Tur, Hu Xu, Zhaleh Feizollahi, Bing Liu, Piero Molino
[arxiv]

2019

Open-world Learning and Application to Product Classification
The Web Conference (WWW 2019)
Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
[arxiv], [code]

BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis
(using BERT for review-based tasks)
2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
[paper], [arxiv], [code], [dataset]

Flexibly-Structured Model for Task-Oriented Dialogues
SIGDIAL 2019
Lei Shu, Piero Molino, Mahdi Namazifar, Bing Liu, Hu Xu, Huaixiu Zheng and Gokhan Tur
[paper], [code]

Modeling Multi-Action Policy for Task-Oriented Dialogues
2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)
Lei Shu, Hu Xu, Bing Liu and Piero Molino

Review Conversational Reading Comprehension
arXiv 1902.00821
Hu Xu, Bing Liu, Lei Shu and Philip S. Yu
[paper]

2018

Double Embeddings and CNN-based Sequence Labeling for Aspect Extraction
the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)
(This paper won Yelp Dataset Challenge Round 12 Grand Prize Award)
Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
[paper], [code], [domain embedding], [bib], [poster]

Lifelong Domain Word Embedding via Meta-Learning
International Joint Conference on Artificial Intelligence (IJCAI 2018)
Hu Xu, Bing Liu, Lei Shu, Philip S. Yu
[arxiv], [code], [bib], [slides]

Dual Attention Network for Product Compatibility and Function Satisfiability Analysis
AAAI Conference on Artificial Intelligence (AAAI 2018)
(This paper focuses on complementary aspect extraction and polarity classification from question-answering pairs)
Hu Xu, Sihong Xie, Lei Shu, Philip S. Yu
[paper], [dataset], [bib], [slides]

Incorporating the Structure of the Belief State in End-to-End Task-Oriented Dialogue Systems
NeurIPS 2018 Conversational AI Workshop
Lei Shu, Piero Molino, Mahdi Namazifar, Bing Liu, Hu Xu, Huaixiu Zheng, Gokhan Tur
[paper]

Unseen Class Discovery in Open-world Classification
preprint arXiv:1801.05609
Lei Shu, Hu Xu, Bing Liu
[arxiv]

2017

Product Function Need Recognition via Semi-supervised Attention Network
IEEE International Conference on Big Data 2017 (IEEE Bigdata 2017)
Hu Xu, Sihong Xie, Lei Shu, Philip S. Yu
[paper], [dataset], [bib]

DOC: Deep Open Classification of Text Documents
2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)
Lei Shu, Hu Xu, Bing Liu
[paper], [bib], [code: EMNLP2017, www2019]

Lifelong Learning CRF for Supervised Aspect Extraction
the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017)
Lei Shu, Hu Xu, Bing Liu
[paper], [bib]

2016

Mining Compatible/Incompatible Entities from Question and Answering via Yes/No Answer Classification using Distant Label Expansion
arXiv preprint arXiv:1612.04499
Hu Xu, Lei Shu, Jingyuan Zhang, Philip S. Yu
[paper], [dataset], [bib]

CER: Complementary Entity Recognition via Knowledge Expansion on Large Unlabeled Product Reviews
(Previous title: Sentence-level Extraction of Complementary Entities using Large Unlabeled Product Reviews)
IEEE International Conference on Big Data 2016 (IEEE Bigdata 2016)
Hu Xu, Sihong Xie, Lei Shu, Philip S. Yu
[paper], [slides], [data], [bib]

Lifelong-RL: Lifelong Relaxation Labeling for Separating Entities and Aspects in Opinion Targets
(Previous title: Separating entities and aspects in opinion targets using lifelong graph labeling)
2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016)
Lei Shu, Bing Liu, Hu Xu, and Annice Kim
[paper], [slides], [bib]

2013

Planning Paths with Fewer Turns on Grid Maps
AAAI Sixth Annual Symposium on Combinatorial Search
Hu Xu, Lei Shu, May Huang
[paper], [dataset], [bib]

High-speed and accurate laser scan matching using classified features
IEEE International Symposium on Robotic and Sensors Environments (ROSE), 2013
Lei Shu, Hu Xu, May Huang
[check IEEE database], [bib]

2011

Accuracy analysis of power characterization and modeling
Convergence and Hybrid Information Technology Springer Berlin Heidelberg
Xiaolan Bai, Hu Xu and May Huang
[check Springer Database]


Service

PC Members: ACL 2020, EMNLP 2020/2019, AACL 2020, AAAI 2020/2019, IJCAI 2018-2020, NAACL 2019, COLING 2020, WWW 2019
External Reviewer: ACL 2019, AAAI 2018, IEEE DSAA 2016
Journal Reviewer: TPAMI, JAIR, TKDD, Natural Language Engineering, IEEE Transactions on Affective Computing, Transactions on Asian and Low-Resource Language Information Processing


Award

Yelp Dataset Challenge Grand Prize Award
AAAI Scholarship
Presenter Award, University of Illinois at Chicago
Travel Award, IEEE International Conference on Big Data
May 4th Scholarship, Peking University


Talk

Google AI: Learning for Open-world, Host: Dr. Qi Li, 2019
Facebook Conversational AI Summit, Host: Dr. Alborz Geramifard, 2019
Amazon Alexa AI: Learning for Open-world, Host: Dr. Young-bum Kim, 2019
AAAI 2018: Dual Attention Network for Product Compatibility and Function Satisfiability Analysis, 2018
Bigdata 2016: CER: Complementary Entity Recognition via Knowledge Expansion on Large Unlabeled Product Reviews, 2016
SoCS 2013: Planning Paths with Fewer Turns on Grid Maps, 2013