Zaid Khan

brain@zaidkhan.me

My goal is to build trustworthy, teachable multimodal language-driven agents that can reason and write code.

I’m an incoming PhD student in Mohit Bansal’s group (MURGe-Lab) at UNC Chapel Hill. I’m also a student researcher in the Media Analytics Group at NEC Laboratories America with Manmohan Chandraker (since 2022). I completed my BS+MS at Northeastern, where I was fortunate to be advised by Raymond Fu.

research interests

  • Grounded, complex reasoning tasks, such as interactive theorem proving or open-world image understanding. For example, GPT-4V is still far from solving Winoground, and Encyclopedic-VQA is difficult even for PaLI.
  • Foundation model-driven agents that learn from grounded interaction. Systems like Voyager and ViperGPT use LLMs as planners but keep them frozen. Can we instead improve the LLM from interactive feedback? This has been done in formal environments like LeanDojo for theorem proving, but how do we construct virtual environments with feedback for tasks like open-world understanding?
  • Neurosymbolic systems in general, but especially approaches that use program synthesis to represent reasoning formally, disentangle reasoning from perception, and impose constraints on behavior.
  • Uncertainty quantification and reliable models. Any AI system deployed in high-stakes applications (or even personal use) needs the ability to say “I don’t know”. A problem I’ve been thinking about is selective prediction for open-ended visual question answering, where the uncertainty can come from both the language model itself and the binding between vision and language (see the sketch after this list).
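
To make the selective prediction idea concrete, here is a minimal sketch of one consistency-based recipe: sample several answers from a black-box vision-language model and abstain when they disagree. This is a toy illustration, not the method from any particular paper; `query_vlm`, the sample count, and the abstention threshold are all hypothetical placeholders.

```python
from collections import Counter

def selective_vqa(query_vlm, image, question, n_samples=10, threshold=0.7):
    """Answer-or-abstain for a black-box VLM via self-consistency.

    `query_vlm(image, question)` is a hypothetical stochastic sampler
    (e.g., decoding with temperature > 0) that returns a short answer string.
    """
    answers = [query_vlm(image, question) for _ in range(n_samples)]
    # Normalize lightly so trivial variants ("A dog." vs "a dog") agree.
    normalized = [a.strip().lower().rstrip(".") for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    confidence = count / n_samples  # agreement rate as a reliability proxy
    if confidence < threshold:
        return None, confidence  # abstain: "I don't know"
    return top_answer, confidence
```

Agreement across samples is only a proxy: it can capture uncertainty in the language model’s decoding, but a model that consistently mis-binds vision to language will be confidently wrong, which is exactly what makes the open-ended setting interesting.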

about

I completed my Master’s in Computer Vision and Learning Algorithms at Northeastern University in Boston under Raymond Fu, in close collaboration with the Media Analytics Group at NEC Laboratories America under Manmohan Chandraker, where I worked on grounded language understanding and reasoning. During my Master’s, I won a university-wide Outstanding Graduate Student award for my work on the (mis)use of racial categories in computer vision (Scroll.IN reporting, News@Northeastern reporting).

Before graduate school, I spent ~3 years as an early member of the engineering / data science organizations at two high-growth startups: Roadie (acquired by UPS for $500m) and Intelligent Flying Machines / OneTrack.AI. As a software engineer, I led efforts to scale data infrastructure to match growth and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, realtime adaptive pricing, and data pipelines.

Outside of research, I lift weights, read (here’s my goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.

news

Mar 7, 2024 Joining Mohit Bansal’s group (MURGe-Lab) at UNC Chapel Hill as a PhD student, where I’ll be working on multimodal agents, grounded language reasoning, and other exciting vision/language topics!
Feb 29, 2024 Two papers accepted to CVPR 2024, on self-training agents to solve computer vision tasks via program synthesis (summer internship work with NEC Laboratories) and black-box predictive uncertainty for multimodal LLMs.
Feb 22, 2024 Joining the PRIOR team at AllenAI this summer.
Sep 24, 2023 1 paper accepted to NeurIPS 2023 on improving the reasoning abilities of open multimodal LLMs with question decomposition. (Collaboration with NEC Laboratories America).
Aug 27, 2023 Completed my Master’s in CompE (concentration in Computer Vision and Learning Algorithms) in Raymond Fu’s lab at Northeastern University.
Aug 25, 2023 Received a PhD Fellowship from NEC Laboratories America.
Jun 16, 2023 1 paper accepted to CVPR 2023 on self-training with synthetic data for visual question answering. (Summer internship work with NEC Laboratories America).
May 22, 2023 Joining the Media Analytics Group of NEC Laboratories America in San Jose again this summer to work on agentic foundation models for computer vision.
Jan 27, 2023 1 paper accepted to ICLR 2023 on efficient vision-language pretraining.
Jul 4, 2022 1 paper accepted to ECCV 2022 on data-efficient vision-language alignment (collaboration with NEC Laboratories America).
Feb 4, 2022 Joining the Media Analytics Group of NEC Laboratories America in San Jose this summer.
Jul 4, 2021 1 paper (oral) accepted to ACM Multimedia 2021 on using language models for multimodal affective computing.
May 3, 2021 Received Northeastern’s 2021 Outstanding Graduate Student Award!
Feb 22, 2021 1 paper accepted to FAccT 2021 on why racial categories don’t work for fair computer vision. Media coverage: Scroll.IN, News@Northeastern.

selected publications

  1. CVPR
    Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement
    Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Fu, Yun, and Chandraker, Manmohan
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024
  2. CVPR
    Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
    Khan, Zaid, and Fu, Yun
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024
  3. NeurIPS
    Exploring Question Decomposition for Zero-Shot VQA
    Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Chandraker, Manmohan, and Fu, Yun
    In Advances in Neural Information Processing Systems (NeurIPS) 2023
  4. CVPR
    Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
    Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Yu, Xiang, Fu, Yun, and Chandraker, Manmohan
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023
  5. ICLR
    Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning
    Khan, Zaid, and Fu, Yun
    In International Conference on Learning Representations (ICLR) 2023
  6. ECCV
    Single-Stream Multi-Level Alignment for Vision-Language Pretraining
    Khan, Zaid, BG, Vijay Kumar, Yu, Xiang, Schulter, Samuel, Chandraker, Manmohan, and Fu, Yun
    In European Conference on Computer Vision (ECCV) 2022
  7. ACM MM
    Exploiting BERT for Multimodal Target Sentiment Classification Through Input Space Translation
    Khan, Zaid, and Fu, Yun
    In ACM International Conference on Multimedia (ACM MM) 2021
  8. ACM FAccT
    One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision
    Khan, Zaid, and Fu, Yun
    In ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2021