Zaid Khan

I'm a first-year PhD student in Mohit Bansal's group (MURGe-Lab) at UNC Chapel Hill.

Most recently, I was a research intern at Ai2 working with Tanmay Gupta, Luca Weihs, and Ranjay Krishna. Before joining UNC, I was a student researcher in the Media Analytics Group at NEC Laboratories America under Manmohan Chandraker. I completed my BS+MS at Northeastern, where I worked with Raymond Fu.



News

  • [Feb 2025] MutaGReP is out! Let an LLM explore a repo to find a plan for a complex user request. Tree search + LLM-guided mutations + code retrieval + planning.
  • [Feb 2025] DataEnvGym has been accepted as a spotlight presentation at ICLR 2025!
  • [Oct 2024] DataEnvGym is out! Can we automate the process of generating data to improve a model on diverse, open-ended tasks, based on automatically discovered model weaknesses? DataEnvGym is a testbed for data-generation agents + teaching environments. Twitter thread
  • [Mar 2024] Becoming a member of Mohit Bansal's group (MURGe-Lab) at UNC Chapel Hill as a PhD student, where I'll be working on multimodal agents, grounded language reasoning, and other exciting vision/language topics!
  • [Feb 2024] Two papers accepted to CVPR 2024, on self-training agents to solve computer vision tasks via program synthesis (summer internship work with NEC Laboratories) and black-box predictive uncertainty for multimodal LLMs.
  • [Feb 2024] Joining the PRIOR team at AllenAI this summer.
  • [Sep 2023] 1 paper accepted to NeurIPS 2023 on improving the reasoning abilities of open multimodal LLMs with question decomposition. (Collaboration with NEC Laboratories America).
  • [Aug 2023] Received a fellowship award from NEC Laboratories America.
  • [Jun 2023] 1 paper accepted to CVPR 2023 on self-training with synthetic data for visual question answering. (Summer internship work with NEC Laboratories America).
  • [Jun 2023] Joining the Media Analytics Group of NEC Laboratories America in San Jose again this summer to work on agentic foundation models for computer vision.
  • [May 2023] Completed my Master's in CompE (concentration in Computer Vision and Learning Algorithms) in Raymond Fu's lab at Northeastern University.
  • [Jan 2023] 1 paper accepted to ICLR 2023 on efficient vision-language pretraining.
  • [Jul 2022] 1 paper accepted to ECCV 2022 on data-efficient vision-language alignment (collaboration with NEC Laboratories America).
  • [Feb 2022] Joining the Media Analytics Group of NEC Laboratories America in San Jose this summer.
  • [Jul 2021] 1 paper (oral) accepted to ACM Multimedia 2021 on using language models for multimodal affective computing.
  • [May 2021] Received Northeastern's 2021 Outstanding Graduate Student Award!
  • [Feb 2021] 1 paper accepted to FAccT 2021 on why racial categories don't work for fair computer vision. Media coverage: Scroll.in, News@Northeastern.
  • [Sep 2020] Becoming a full-time MS student at Northeastern after wrapping up a 2-year stint at Roadie.

Background

Before graduate school, I spent ~3 years as an early software engineer on the engineering / data science teams at two high-growth startups: Roadie (acquired by UPS for $500M) and Intelligent Flying Machines / OneTrack.AI. I led efforts to scale data infrastructure to match growth and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, real-time adaptive pricing, and data pipelines.

Outside of research, I lift weights, read (here's my goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.

Research

Current Work

  • automatic skill-targeted data / environment generation: DataEnvGym frames data generation as an RL-style sequential decision-making problem. The goal is to build agents that can automatically identify a model's weak skills and generate training data to improve them. It builds on EnvGen, which generates training environments that help an agent learn the skills it is weak at. (A toy sketch of this teacher-student loop follows this list.)
  • LLM-driven exploration and planning: MutaGReP explores large code repositories to find realizable plans for complex, multi-step user requests. We use LLM-guided mutations and informed tree search to explore plan space, and a symbol retriever to keep plans grounded in the codebase. (A minimal sketch of the search loop appears after the teacher-student sketch below.)
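
Below is a toy, runnable sketch of the teacher-student loop described above, under illustrative assumptions: the Student, TeacherEnv, and agent_policy names, the skill set, and the update rule are hypothetical stand-ins, not the actual DataEnvGym API.

```python
# Toy sketch of data generation as sequential decision-making.
# All names and numbers here are illustrative, not the DataEnvGym API.

class Student:
    """Stand-in for a trainable model with per-skill accuracy."""
    def __init__(self):
        self.skills = {"counting": 0.4, "spatial": 0.7, "ocr": 0.5}

    def finetune(self, data):
        # Pretend training on skill-targeted examples nudges that skill upward.
        for skill in data:
            self.skills[skill] = min(1.0, self.skills[skill] + 0.05)

class TeacherEnv:
    """State = the student's per-skill evaluation; action = a data batch."""
    def __init__(self, student):
        self.student = student

    def reset(self):
        return dict(self.student.skills)

    def step(self, data):
        self.student.finetune(data)                # apply the action
        state = dict(self.student.skills)          # re-evaluate the student
        reward = sum(state.values()) / len(state)  # feedback: mean accuracy
        return state, reward

def agent_policy(state, batch_size=8):
    """Trivial 'data-generation agent': target the weakest discovered skill."""
    weakest = min(state, key=state.get)
    return [weakest] * batch_size                  # stand-in for generated data

env = TeacherEnv(Student())
state = env.reset()
for _ in range(5):                                 # one RL-style episode
    state, reward = env.step(agent_policy(state))
print(f"mean accuracy after 5 rounds: {reward:.2f}")
```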

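And a minimal sketch of the repository-grounded plan-search idea: best-first search over plans, where mutate stands in for an LLM proposing successor plans and retrieve_symbols stands in for a symbol retriever. The repo symbols, mutation proposals, and scoring heuristic are hypothetical placeholders, not the MutaGReP implementation.

```python
# Minimal sketch of plan search with mutations + symbol retrieval.
# The symbols, mutations, and scoring below are illustrative placeholders.
import heapq

REPO_SYMBOLS = ["load_dataset", "train_model", "evaluate", "plot_results"]

def retrieve_symbols(step):
    """Ground a natural-language plan step to repo symbols (toy retriever)."""
    return [s for s in REPO_SYMBOLS if any(word in s for word in step.split())]

def mutate(plan):
    """Stand-in for an LLM proposing refined successor plans."""
    return [plan + [step] for step in
            ("load dataset", "train model", "evaluate", "plot results")]

def score(plan):
    """Prefer plans whose steps are grounded in retrievable symbols."""
    return sum(1 for step in plan if retrieve_symbols(step))

def plan_search(max_expansions=20):
    frontier = [(0, 0, [])]            # (negated score, tiebreak, plan)
    best, tick = [], 0
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, plan = heapq.heappop(frontier)   # best-first expansion
        if score(plan) > score(best):
            best = plan
        for child in mutate(plan):             # LLM-guided mutations
            tick += 1
            heapq.heappush(frontier, (-score(child), tick, child))
    return best

print(plan_search())   # e.g. ['load dataset', 'train model', ...]
```
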
Prior Work


Publications

MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use
Zaid Khan, Ali Farhadi, Ranjay Krishna, Luca Weihs, Mohit Bansal, Tanmay Gupta
arXiv, 2025
project page / arXiv

Neural tree search for repo-level code-use planning. MutaGReP explores plan space through LLM-guided mutations while grounding plans in codebase functionality using a symbol retriever.

Learning to Generate Unit Tests for Automated Debugging
Archiki Prasad*, Elias Stengel-Eskin*, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
arXiv, 2025
code / arXiv

Testing is a critical part of software engineering: what if we could automatically discover inputs that break your code? We show how to train SLMs (Qwen2.5-7B and Llama3.1-8B) to generate unit tests that break code and are useful for debugging.

* Equal contribution

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
ICLR, 2025   (Spotlight Presentation)
project page / arXiv

A testbed for RL-style data-generation agents + teaching environments that automates post-training: improving a model on diverse, open-ended tasks based on automatically discovered skill weaknesses.

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker
CVPR, 2024
project page / arXiv

We show how to improve the program synthesis ability of an LLM using execution feedback, and apply it to create a 7B model that writes programs which orchestrate other models to solve computer vision tasks.

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Zaid Khan, Yun Fu
CVPR, 2024
arXiv

We show how to identify unreliable responses from multimodal LLMs by examining the consistency of their responses over the neighborhood of a visual question, without requiring access to the model's internals.

Exploring Question Decomposition for Zero-Shot VQA
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu
NeurIPS, 2023
project page / arXiv

We show how to selectively decompose complex questions into simpler sub-questions to improve zero-shot performance on challenging multimodal reasoning tasks.

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker
CVPR, 2023
code / arXiv

Getting labels for a multimodal dataset can be expensive. We show how you can use unlabeled images to improve performance on data-scarce multimodal tasks.

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
Zaid Khan, Yun Fu
ICLR, 2023
code / arXiv

We explore creating CLIP-like models by minimally updating already-trained vision and language models, finding that updating less than 7% of parameters can match full model training.

Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan, Vijay Kumar BG, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu
ECCV, 2022
project page / arXiv

We demonstrate a highly data-efficient way to align vision and language by learning to reconstruct each modality from the other.

Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation
Zaid Khan, Yun Fu
ACM MM, 2021   (Oral Presentation)
code / arXiv

Understanding the emotional content of social media posts is difficult for traditional sentiment analysis models. We show that language models do a good job of this if the post can be translated into a natural input space for them.

One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision
Zaid Khan, Yun Fu
ACM FAccT, 2021
arXiv

Are notions of algorithmic fairness based on racial categories meaningful? We study computer vision datasets that use racial categories, and empirically show that the racial categories encoded in each dataset are often highly inconsistent with each other and with human intuitions.


Adapted from Jon Barron's website.