Yuanhan Zhang

Yuanhan (John) Zhang
Hi! I'm Yuanhan Zhang (here is the standard Chinese pronunciation of my first name: Yuanhan), a third-year PhD student at MMLab@NTU, supervised by Prof. Ziwei Liu. My research interests lie in computer vision and deep learning. In particular, I focus on adapting foundation models, from vision to multi-modal, for real-world exploration. This involves benchmarking model performance and adapting models through parameter-efficient tuning, in-context learning, and instruction tuning.
Email (yuanhan002@e.ntu.edu.sg) / Google Scholar / Twitter / Github
LLaVA-Video: Video Instruction Tuning With Synthetic Data Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li arXiv Preprint, 2024 PDF / Dataset, Model and Code A fully open-sourced video LMM with competitive performance, including code, model, and data.
Otter: A multi-modal model with in-context instruction tuning Bo Li*, Yuanhan Zhang*, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, Ziwei Liu arXiv Preprint, 2023 PDF / Dataset and Code A vision-language model with in-context instruction tuning.
LLaVA-OneVision: Easy Visual Task Transfer Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li TMLR, 2025 PDF / Dataset and Code A family of LMMs developed by consolidating insights into data, models, and visual representations.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Feng Li*, Renrui Zhang*, Hao Zhang*, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma ICLR, 2025 (Spotlight) PDF / Dataset and Code Tackling multi-image, video, and 3D inputs in large multimodal models.
MMBench: Is Your Multi-modal Model an All-around Player? Yuan Liu*, Haodong Duan*, Yuanhan Zhang*, Bo Li*, Songyang Zhang*, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, Dahua Lin ECCV, 2024 (Oral) PDF / Dataset and Code Benchmarking 20 abilities of vision-language models.
Octopus: Embodied Vision-Language Programmer from Environmental Feedback Jingkang Yang, Yuhan Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu ECCV, 2024 PDF / Dataset and Code An embodied vision-language model trained with RLEF that excels at embodied visual planning and programming.
FunQA: Towards Surprising Video Comprehension Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu ECCV, 2024 PDF / Dataset and Code A benchmark of humorous, creative, and magic videos for challenging video-comprehension tasks.
Knowledge Augmented Instruction Tuning for Zero-shot Animal Species Recognition Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres Instruction Tuning and Instruction Following Workshop @ NeurIPS, 2023 PDF A knowledge-augmented vision-language model for AI-driven wildlife conservation.
What Makes Good Examples for Visual In-Context Learning? Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu NeurIPS, 2023 PDF / Code Retrieving prompts for visual in-context learning.
Learning without Forgetting for Vision-Language Models Da-Wei Zhou, Yuanhan Zhang, Yan Wang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu TPAMI PDF / Code Learning without forgetting for vision-language models.
Neural Prompt Search Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu TPAMI PDF / Project Page / Code Searching prompt modules for parameter-efficient transfer learning.
3D Point Cloud Pre-training with Knowledge Distillation from 2D Images? Yuan Yao, Yuanhan Zhang, Zhenfei Yin, Jiebo Luo, Wanli Ouyang, Xiaoshui Huang ICME, 2023 PDF / Code 3D point cloud pre-training with knowledge distillation from 2D images.
Benchmarking Omni-Vision Representation through the Lens of Visual Realms Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu ECCV, 2022 PDF / Project Page / Leaderboard / Challenge: ImageNet1k-Pretrain Track / Challenge: Open-Pretrain Track / Dataset and Code A new benchmark for evaluating vision foundation models; a new supervised contrastive learning framework.
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He, Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu IJCV, 2025 PDF / Project Page / Demo / Code 4 times larger than ImageNet and 2 times larger than Objects365; built with active learning.
CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations Yuanhan Zhang, Zhenfei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, Ziwei Liu ECCV, 2020 PDF / Dataset / Demo / Code A large-scale face anti-spoofing dataset with rich annotations.

Last updated in Jan. 2025.

Homepage credits: Jon Barron.