Zaid Khan

brain@zaidkhan.me

My goal is to build trustworthy, teachable multimodal language-driven agents that can reason and write code.

I’m an incoming PhD student in Mohit Bansal’s group (MURGe-Lab) at UNC Chapel Hill. I’m also a student researcher in the Media Analytics Group at NEC Laboratories America with Manmohan Chandraker (since 2022). I completed my BS+MS at Northeastern, where I was fortunate to be advised by Raymond Fu.

research interests

Grounded, complex reasoning tasks, such as interactive theorem proving or open-world image understanding. For example, GPT-4V is still far from solving Winoground, and EncyclopedicVQA is difficult even for PaLI.
Foundation model-driven agents that learn from grounded interaction. Systems like Voyager and ViperGPT use LLMs as planners, but keep them frozen. Can we improve the LLM from interactive feedback? This has been done in formal environments like LeanDojo for theorem-proving, but how do we construct virtual environments with feedback for tasks like open-world understanding?
Neurosymbolic systems in general, but especially approaches using program synthesis as a way to represent reasoning formally, disentangle reasoning from perception, and impose constraints on behavior.
Uncertainty quantification and reliable models. A requirement for high-stakes applications (and even personal use) for any AI system is an ability to say “I don’t know”. A problem I’ve been thinking about is selective prediction for open-ended visual question answering, because the uncertainty can come from both the language model itself and the binding between vision and language.

about

I completed my Masters in Computer Vision and Learning Algorithms at Northeastern University in Boston under Raymond Fu in close collaboration with the Media Analytics Group at NEC Laboratories America under Manmohan Chandraker, where I worked on grounded language understanding and reasoning. During my Masters, I won a university-wide Outstanding Graduate Student award for my work on the (mis)use of racial categories in computer vision (Scroll.IN reporting, News@Northeastern reporting). Before graduate school, I spent ~3 years as a software engineer and early member of the engineering / data science organizations at two high-growth startups: Roadie (acquired by UPS for $500m) and Intelligent Flying Machines / OneTrack.AI. There I led efforts to scale data infrastructure to match growth, and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, realtime adaptive pricing, and data pipelines.

Outside of research, I lift weights, read (here’s my goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.

news

Mar 7, 2024	Becoming a member of Mohit Bansal’s group (MURGe-Lab) at UNC Chapel Hill as a PhD student, where I’ll be working on multimodal agents, grounded language reasoning, and other exciting vision/language topics!
Feb 29, 2024	Two papers accepted to CVPR 2024, on self-training agents to solve computer vision tasks via program synthesis (summer internship work with NEC Laboratories) and black-box predictive uncertainty for multimodal LLMs.
Feb 22, 2024	Joining the PRIOR team at AllenAI this summer.
Sep 24, 2023	1 paper accepted to NeurIPS 2023 on improving the reasoning abilities of open multimodal LLMs with question decomposition. (Collaboration with NEC Laboratories America).
Aug 27, 2023	Completed my Masters in CompE (concentration in Computer Vision and Learning Algorithms) at Northeastern University in Raymond Fu’s lab.
Aug 25, 2023	Received a PhD Fellowship from NEC Laboratories America.
Jun 16, 2023	1 paper accepted to CVPR 2023 on self-training with synthetic data for visual question answering. (Summer internship work with NEC Laboratories America).
May 22, 2023	Joining the Media Analytics Group of NEC Laboratories America in San Jose again this summer to work on agentic foundation models for computer vision.
Jan 27, 2023	1 paper accepted to ICLR 2023 on efficient vision-language pretraining.
Jul 4, 2022	1 paper accepted to ECCV 2022 on data-efficient vision-language alignment (collaboration with NEC Laboratories America).
Feb 4, 2022	Joining the Media Analytics Group of NEC Laboratories America in San Jose this summer.
Jul 4, 2021	1 paper (oral) accepted to ACM Multimedia 2021 on using language models for multimodal affective computing.
May 3, 2021	Received Northeastern’s 2021 Outstanding Graduate Student Award!
Feb 22, 2021	1 paper accepted to FAccT 2021 on why racial categories don’t work for fair computer vision. (Media coverage: Scroll.IN, News@Northeastern reporting)

selected publications

CVPR

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Fu, Yun, and Chandraker, Manmohan

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method in which we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger.
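As a rough illustration of the idea in this abstract (not the paper's implementation), the data-collection step of reinforced self-training can be sketched as: sample programs from the policy, execute them, and keep only those whose output matches the existing annotation. The names `sample_program`, `execute`, and `collect_rewarded_programs` are hypothetical, the image input is omitted for brevity, and the binary exact-match reward is an assumed stand-in for the "coarse reward signal".

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (question, annotated answer)


def collect_rewarded_programs(
    sample_program: Callable[[str], str],  # hypothetical: the LLM policy
    execute: Callable[[str], str],         # hypothetical: program interpreter
    dataset: Sequence[Example],
    samples_per_question: int = 4,
) -> List[Tuple[str, str]]:
    """Return (question, program) pairs whose execution matched the annotation."""
    kept: List[Tuple[str, str]] = []
    for question, answer in dataset:
        for _ in range(samples_per_question):
            program = sample_program(question)
            try:
                prediction = execute(program)
            except Exception:
                continue  # a program that crashes earns no reward
            if prediction == answer:  # assumed coarse, binary reward
                kept.append((question, program))
    return kept
```

The returned pairs would then serve as finetuning data for the policy, and the sample/filter/finetune loop could be repeated for several rounds.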
CVPR

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Khan, Zaid, and Fu, Yun

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model’s responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.
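A minimal sketch of the neighborhood-consistency idea, under stated assumptions rather than as the paper's method: the black-box model is a hypothetical callable `ask` (with the image held fixed inside it), the neighbors are question variants assumed to come from some proxy model, and agreement is measured by exact string match after normalization. The threshold value is arbitrary.

```python
from typing import Callable, Optional, Sequence, Tuple


def consistency_score(neighbor_answers: Sequence[str], original_answer: str) -> float:
    """Fraction of neighbor answers that agree with the original answer."""
    if not neighbor_answers:
        return 0.0
    target = original_answer.strip().lower()
    agree = sum(1 for a in neighbor_answers if a.strip().lower() == target)
    return agree / len(neighbor_answers)


def selective_answer(
    ask: Callable[[str], str],   # hypothetical black-box VQA model
    question: str,
    neighbors: Sequence[str],    # assumed: variants from a proxy model
    threshold: float = 0.7,      # assumed abstention threshold
) -> Tuple[Optional[str], float]:
    """Answer only if the model is consistent over the neighborhood; else abstain (None)."""
    original = ask(question)
    answers = [ask(q) for q in neighbors]
    score = consistency_score(answers, original)
    return (original if score >= threshold else None), score
```

Only black-box queries are used here, so no access to model internals or retraining is needed; a softer agreement measure than exact match may be preferable for open-ended answers.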
NeurIPS

Exploring Question Decomposition for Zero-Shot VQA

Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Chandraker, Manmohan, and Fu, Yun

In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS) 2023

Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/
CVPR

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Yu, Xiang, Fu, Yun, and Chandraker, Manmohan

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA
ICLR

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Khan, Zaid, and Fu, Yun

In The Eleventh International Conference on Learning Representations 2023

Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training, and updating specific components (<1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient scaling scales with model and dataset size. Where paired-image text data is scarce but strong multilingual language models exist (e.g. low resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.
ECCV

Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Khan, Zaid, BG, Vijay Kumar, Yu, Xiang, Schulter, Samuel, Chandraker, Manmohan, and Fu, Yun

In European Conference on Computer Vision 2022

Recent progress in large-scale vision-language pre-training has shown the importance of aligning the visual and text modalities for downstream vision-language tasks. Many methods use a dual-stream architecture that fuses visual tokens and language tokens after representation learning, which aligns only at a global level and cannot extract finer-scale semantics. In contrast, we propose a single stream model that aligns the modalities at multiple levels: i) instance level, ii) fine-grained patch level, iii) conceptual semantic level. We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction. In the former, we mask the input tokens from one of the modalities and use the cross-modal information to reconstruct the masked token, thus improving fine-grained alignment between the two modalities. In the latter, we parse the caption to select a few key words and feed them together with the momentum encoder pseudo signal to self-supervise the visual encoder, forcing it to learn rich semantic concepts that are essential for grounding a textual token to an image region. We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA. We also demonstrate how the proposed models can align the modalities at multiple levels.
ACM MM

Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation

Khan, Zaid, and Fu, Yun

In ACM Conference on Multimedia 2021

Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. We make our code publicly available.
ACM FAccT

One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision

Khan, Zaid, and Fu, Yun

In ACM Conference on Fairness, Accountability, and Transparency 2021

Computer vision is widely deployed, has highly visible, society-altering applications, and has documented problems with bias and representation. Datasets are critical for benchmarking progress in fair computer vision, and often employ broad racial categories as population groups for measuring group fairness. Similarly, diversity is often measured in computer vision datasets by ascribing and counting categorical race labels. However, racial categories are ill-defined, unstable temporally and geographically, and have a problematic history of scientific use. Although the racial categories used across datasets are superficially similar, the complexity of human race perception suggests the racial system encoded by one dataset may be substantially inconsistent with another. Using the insight that a classifier can learn the racial system encoded by a dataset, we conduct an empirical study of computer vision datasets supplying categorical race labels for face images to determine the cross-dataset consistency and generalization of racial categories. We find that each dataset encodes a substantially unique racial system, despite nominally equivalent racial categories, and some racial categories are systemically less consistent than others across datasets. We find evidence that racial categories encode stereotypes, and exclude ethnic groups from categories on the basis of nonconformity to stereotypes. Representing a billion humans under one racial category may obscure disparities and create new ones by encoding stereotypes of racial systems. The difficulty of adequately converting the abstract concept of race into a tool for measuring fairness underscores the need for a method more flexible and culturally aware than racial categories.