Background
Before graduate school, I spent ~3 years as a software engineer and early member of the engineering / data science organizations at two high-growth startups: Roadie (acquired by UPS for $500M) and Intelligent Flying Machines / OneTrack.AI. There, I led efforts to scale data infrastructure to match growth and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, real-time adaptive pricing, and data pipelines.
Outside of research, I lift weights, read (here's my Goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.
Research
Current Work
- Automatic skill-targeted data / environment generation: DataEnvGym frames data generation as an RL-style sequential decision-making problem.
The goal is to build agents that automate the process of identifying a model's weak skills and generating training data to improve them (a minimal sketch follows this list).
It builds on EnvGen, which generates training environments that help an agent learn the skills it is weak at.
- LLM-driven exploration and planning: MutaGReP explores large code repositories to find realizable plans for complex, multi-step user requests. We use LLM-guided mutations and informed tree search to explore plan space, and symbol retrieval to keep plans grounded in the codebase.
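The DataEnvGym loop above can be pictured as a short teacher/student episode. The sketch below is illustrative only, assuming hypothetical placeholders (student, data_agent, evaluate_by_skill) rather than the actual DataEnvGym interfaces.

```python
# A minimal sketch of the teacher/student loop described above; `student`,
# `data_agent`, and `evaluate_by_skill` are hypothetical placeholders, not
# the actual DataEnvGym API.

def data_generation_episode(student, data_agent, evaluate_by_skill, n_rounds=5):
    """One episode: the agent targets weak skills, generates data, and is
    rewarded by the student's improvement on those skills."""
    history = []
    skill_scores = evaluate_by_skill(student)           # e.g. {"counting": 0.41, ...}
    for _ in range(n_rounds):
        weak_skills = sorted(skill_scores, key=skill_scores.get)[:3]
        new_data = data_agent(weak_skills, history)      # action: generate training examples
        student = student.train(new_data)                # student consumes the data
        new_scores = evaluate_by_skill(student)          # re-evaluate per-skill performance
        reward = sum(new_scores[s] - skill_scores[s] for s in weak_skills)
        history.append((weak_skills, new_data, reward))  # feedback returned to the agent
        skill_scores = new_scores
    return student, history
```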
Prior Work
- Self-training / self-improvement
- Using uncertainty during reasoning and decision-making
- Vision-language representation learning
Some papers are highlighted.
MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use
Zaid Khan,
Ali Farhadi,
Ranjay Krishna,
Luca Weihs,
Mohit Bansal,
Tanmay Gupta
arXiv, 2025
project page
/
arXiv
Neural tree search for repo-level code-use planning. MutaGReP explores plan space through LLM-guided mutations, while grounding plans to functionality in the codebase with a symbol retriever.
Learning to Generate Unit Tests for Automated Debugging
Archiki Prasad*,
Elias Stengel-Eskin*,
Justin Chih-Yao Chen,
Zaid Khan,
Mohit Bansal
arXiv, 2025
code
/
arXiv
Testing is a critical part of software engineering: what if we could automatically discover inputs that break your code? We show how to train small language models (Qwen2.5-7B and Llama3.1-8B) to generate unit tests that break code and are useful for debugging.
* Equal contribution
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Zaid Khan,
Elias Stengel-Eskin,
Jaemin Cho,
Mohit Bansal
ICLR, 2025   (Spotlight Presentation)
project page
/
arXiv
A testbed of teaching environments for RL-style data generation agents, aimed at automating post-training: the process of improving a model on diverse, open-ended tasks based on automatically discovered model skills and weaknesses.
Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Yun Fu,
Manmohan Chandraker
CVPR, 2024
project page
/
arXiv
We show how to improve an LLM's program synthesis ability using execution feedback, and apply this to create a 7B model that writes programs orchestrating other models to solve computer vision tasks.
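Roughly, the recipe is a generate-execute-filter-finetune loop. The sketch below is a simplification under assumed helper functions (generate_programs, execute, finetune), not the paper's implementation.

```python
# A minimal sketch of self-training from execution feedback, assuming
# hypothetical helpers `generate_programs`, `execute`, and `finetune`;
# this is not the paper's actual training code.

def self_train_on_execution_feedback(llm, tasks, generate_programs, execute,
                                     finetune, n_iters=3, n_samples=8):
    for _ in range(n_iters):
        accepted = []
        for task in tasks:
            for program in generate_programs(llm, task, n_samples=n_samples):
                result = execute(program, task)   # run the program (e.g. calling vision models)
                if result.success:                # keep only programs whose execution succeeds
                    accepted.append((task, program))
        llm = finetune(llm, accepted)             # fine-tune on self-generated successes
    return llm
```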
Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan,
Yun Fu
CVPR, 2024
arXiv
We show how to identify unreliable responses from multimodal LLMs by examining the consistency of their responses over the neighborhood of a visual question, without requiring access to the model's internals.
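The idea can be summarized as: answer only when the model agrees with itself across rephrasings of the question. A rough sketch, with ask_model and rephrase_question as assumed black-box helpers (the paper's neighborhood construction and scoring are more involved than simple majority agreement):

```python
# A minimal sketch of consistency-based abstention; `ask_model` and
# `rephrase_question` are hypothetical helpers, not the paper's code.
from collections import Counter

def answer_or_abstain(image, question, ask_model, rephrase_question,
                      n_neighbors=5, min_agreement=0.8):
    """Answer only if the model is consistent over a neighborhood of the question."""
    neighborhood = [question] + rephrase_question(question, n=n_neighbors)
    answers = [ask_model(image, q) for q in neighborhood]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return top_answer   # consistent neighborhood -> treat the response as reliable
    return None             # inconsistent -> abstain rather than risk a wrong answer
```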
Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Manmohan Chandraker,
Yun Fu
NeurIPS, 2023
project page
/
arXiv
We show how to selectively decompose complex questions into simpler sub-questions to improve zero-shot performance on challenging multimodal reasoning tasks.
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan,
Vijay Kumar BG,
Samuel Schulter,
Xiang Yu,
Yun Fu,
Manmohan Chandraker
CVPR, 2023
code
/
arXiv
Getting labels for a multimodal dataset can be expensive. We show how you can use unlabeled images to improve performance on data-scarce multimodal tasks.
Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Zaid Khan,
Yun Fu
ICLR, 2023
code
/
arXiv
We explore creating CLIP-like models by minimally updating already-trained vision and language models, finding that updating less than 7% of parameters can match full model training.
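As a rough picture of this family of recipes (not the paper's exact parameterization), one can freeze both pretrained towers and train only a small alignment module. The sketch below assumes PyTorch and uses linear projection heads as an illustrative stand-in for the actual trainable parameter subset.

```python
# A minimal sketch: frozen pretrained towers plus a small trainable alignment
# module. The projection heads are an assumed stand-in, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LockedTowerAlignment(nn.Module):
    def __init__(self, vision_model, language_model, vision_dim, text_dim, embed_dim=512):
        super().__init__()
        self.vision_model, self.language_model = vision_model, language_model
        for tower in (self.vision_model, self.language_model):
            for p in tower.parameters():
                p.requires_grad = False               # freeze the pretrained towers
        self.vision_proj = nn.Linear(vision_dim, embed_dim)   # small trainable pieces
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, images, texts):
        with torch.no_grad():                         # towers stay fixed
            v = self.vision_model(images)
            t = self.language_model(texts)
        # contrastive alignment happens in the shared embedding space
        return F.normalize(self.vision_proj(v), dim=-1), F.normalize(self.text_proj(t), dim=-1)
```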
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan,
Vijay Kumar BG,
Xiang Yu,
Samuel Schulter,
Manmohan Chandraker,
Yun Fu
ECCV, 2022
project page
/
arXiv
We demonstrate a very data-efficient way to align vision and language by learning to reconstruct each modality from the other.
Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation
Zaid Khan,
Yun Fu
ACM MM, 2021   (Oral Presentation)
code
/
arXiv
Understanding the emotional content of social media posts is difficult for traditional sentiment analysis models.
We show that language models handle this well when the post is translated into a natural input space for them.
One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision
Zaid Khan,
Yun Fu
ACM FAccT, 2021
arXiv
Are notions of algorithmic fairness based on racial categories meaningful?
We study computer vision datasets that use racial categories, and empirically show that the racial categories encoded in each dataset are often highly inconsistent with each other and with human intuitions.