Yongshuo Zong


EH8 9AB

Edinburgh, UK

I am currently a PhD student at the University of Edinburgh, supervised by Prof. Timothy Hospedales and Dr. Yongxin Yang, and funded by the UKRI CDT in Biomedical AI. I obtained my BSc in Computer Science from Tongji University in 2021.

I am broadly interested in machine learning and its applications in healthcare, with a particular focus on multi-modal learning and large vision-language models.

Feel free to drop me an email for potential collaborations!

news

Sep 03, 2024 Started my internship at Amazon AWS AI!
Jul 11, 2024 Survey on Self-supervised Multimodal Learning is accepted to IEEE T-PAMI!
May 01, 2024 Both VLGuard and Fool your (V)LLMs are accepted to ICML’24!
Apr 24, 2024 Giving a talk about VLGuard at the BMVA Trustworthy Multimodal Foundation Models Symposium!
Feb 27, 2024 C-VQA is accepted to CVPR’24!
Jan 17, 2024 Giving a talk about Fool your (V)LLMs at the BMVA Vision-Language Symposium!
Nov 06, 2023 Invited talk about MEDFAIR at the FAIMI workshop!
Feb 27, 2023 Meta-Omnium is accepted to CVPR’23!
Jan 21, 2023 MEDFAIR is accepted to ICLR’23 as a spotlight!

selected publications/preprints

  1. Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
    Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales
    ICML, 2024
    TL;DR: VLLM fine-tuning breaks LLM safety, but our VLGuard can fix this.
  2. Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations
    Yongshuo Zong, Tingyang Yu, Bingchen Zhao, Ruchika Chavhan, and Timothy Hospedales
    ICML, 2024
    TL;DR: (V)LLM-based multiple-choice QA is not robust to permutations of the answer options.
  3. What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
    Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, and 1 more author
    CVPR, 2024
    TL;DR: Vision large language models do not understand counterfactual conditions well.
  4. Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn
    Ondrej Bohdal, Yinbing Tian, Yongshuo Zong, Ruchika Chavhan, Da Li, and 3 more authors
    CVPR, 2023
    TL;DR: A framework for consistently evaluating meta-learners across a variety of vision tasks.
  5. MEDFAIR: Benchmarking Fairness for Medical Imaging
    Yongshuo Zong, Yongxin Yang, and Timothy Hospedales
    ICLR, 2023
    TL;DR: We develop a fairness benchmark for medical imaging and find that state-of-the-art bias mitigation algorithms do not significantly outperform ERM.
  6. VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning
    Yongshuo Zong, Ondrej Bohdal, and Timothy Hospedales
    arXiv preprint, 2024
    TL;DR: VL-ICL Bench is a better multimodal ICL benchmark than VQA and captioning.
  7. Self-Supervised Multimodal Learning: A Survey
    Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales
    IEEE T-PAMI, 2024
    TL;DR: Systematic review of self-supervised multimodal learning methods.
  8. conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics
    Yongshuo Zong, Tingyang Yu, Xuesong Wang, Yixuan Wang, Zhihang Hu, and 1 more author
    bioRxiv preprint, 2022
    TL;DR: A contrastive SSL method for spatial transcriptomics representation learning.