I'm a 2nd-year PhD student in Mohit Bansal's group (MURGe-Lab) at UNC Chapel Hill, supported by a DoD NDSEG fellowship.
Most recently, I was a research intern at Ai2 working with Tanmay Gupta, Luca Weihs, and Ranjay Krishna.
Before joining UNC, I was a student researcher in the Media Analytics Group at NEC Laboratories America under Manmohan Chandraker.
I completed my BS+MS at Northeastern, where I worked with Raymond Fu.
News
[May 2026] New preprints: selective outcome forecasting in evolutionary kernel search (GPU Forecasters), uncertainty-aware belief states for long-horizon agents (Agent-BRACE), long-horizon memory evals (MINTEval), and dense token-level RL rewards for reasoning (AVSD).
[Apr 2026] Two new preprints: ToM-SB (an environment where defender LLMs learn via RL to fool attackers seeking sensitive information) and Cog-DRIFT (adaptive curriculum reformulation that improves exploration on hard reasoning problems).
[Oct 2025] One Life to Learn is out! We infer probabilistic world models in Python code for unknown, complex stochastic environments and use them for planning / simulation. Twitter thread
[Apr 2025] EFAGen is out! EFAGen infers the data-generating abstraction underlying a static math problem as a program, and executes it to generate diverse, verifiable problem variants.
[Feb 2025] MutaGReP is out! Let an LLM explore a repo to find a plan for a complex user request you give it. Tree search + LLM-guided mutations + code retrieval + planning.
[Feb 2025] DataEnvGym has been accepted as a spotlight presentation at ICLR 2025!
[Oct 2024] DataEnvGym is out! Can we automate the process of generating data to improve a model on diverse, open-ended tasks, based on automatically-discovered model weaknesses? DataEnvGym is a testbed for data-generation agents + teaching environments. Twitter thread
[Mar 2024] Becoming a member of Mohit Bansal's group (MURGe-Lab) at UNC Chapel Hill as a PhD student, where I'll be working on multimodal agents, grounded language reasoning, and other exciting vision/language topics!
[Feb 2024] Two papers accepted to CVPR 2024, on self-training agents to solve computer vision tasks via program synthesis (summer internship work with NEC Laboratories) and black-box predictive uncertainty for multimodal LLMs.
[Sep 2023] 1 paper accepted to NeurIPS 2023 on improving the reasoning abilities of open multimodal LLMs with question decomposition. (Collaboration with NEC Laboratories America).
[Jun 2023] 1 paper accepted to CVPR 2023 on self-training with synthetic data for visual question answering. (Summer internship work with NEC Laboratories America).
[Jun 2023] Joining the Media Analytics Group of NEC Laboratories America in San Jose again this summer to work on agentic foundation models for computer vision.
[May 2023] Completed my Master's in CompE (concentration in Computer Vision and Learning Algorithms) at Northeastern University in Raymond Fu's lab.
[Jan 2023] 1 paper accepted to ICLR 2023 on efficient vision-language pretraining.
[Jul 2022] 1 paper accepted to ECCV 2022 on data-efficient vision-language alignment (collaboration with NEC Laboratories America).
[Feb 2022] Joining the Media Analytics Group of NEC Laboratories America in San Jose this summer.
[Jul 2021] 1 paper (oral) accepted to ACM Multimedia 2021 on using language models for multimodal affective computing.
[Sep 2020] Becoming a full-time MS student at Northeastern after wrapping up a 2-year stint at Roadie.
Background
Before graduate school, I spent ~3 years as a software engineer and early member of the engineering / data science organizations at two high-growth startups: Roadie (acquired by UPS for $500m) and OneTrack.AI. There, I led efforts to scale data infrastructure to match growth and worked on a range of challenging problems, including embedded deep learning, fault-tolerant distributed systems, realtime adaptive pricing, and data pipelines for time series and computer vision tasks.
Outside of research, I lift weights, read (here's my goodreads profile), watch mixed martial arts, and sometimes wonder whether randomness is real.
Can a reasoning LLM act as an approximate world model of a GPU? We train an LLM surrogate to forecast a kernel's runtime by reasoning about its code, then use RL to teach it what it doesn't know, so it sends only the kernels it can't reason about to a real GPU. In evolutionary kernel search, leaning mostly on the cheap surrogate and only selectively on the GPU finds faster kernels than the GPU alone, since the same budget explores far more of the program space.
We've been working on a way to get better on-policy token-level rewards for LLMs + RL! Self-distillation gives token-level rewards, using divergence against a teacher policy given privileged info (i.e., the true final answer). What if you could use multiple forms of privileged info?
Continually updated envs (e.g. Git repo histories, evolving docs) are central to knowledge work. Reasoning about these requires long context understanding + resolving temporally distributed / interfering changes to the env state. How well do LLM agents / memory systems do?
How can LLM agents solve long-horizon tasks or explore an environment with many details while using only a fixed amount of context window? In Agent-BRACE, we use RL to build+use belief states that decompose into natural-language claims about the world and their epistemic uncertainty.
ToM-SB is an environment where defender LLMs compete against attacker LLMs seeking sensitive information. To win, the defender must fool the attacker into leaving with the wrong information. RL in ToM-SB results in bidirectional emergence of theory-of-mind (ToM) and fooling ability.
How do we enable RLVR on hard problems where rollouts yield zero reward? Imitating expert trajectories is off-policy; instead, we reformulate each hard problem into a multi-level curriculum. Skills learned on the easier variants transfer back to the hard set where standard RLVR fails.
How can an agent reverse engineer the underlying laws of an unknown, hostile & stochastic environment in "one life", without millions of steps + human-provided goals / rewards? We infer a world model in Python for an unknown environment from a single episode!
The fully-open OpenThoughts3 dataset consists of 1.2M reasoning traces and problems constructed by a pipeline designed through 1,000+ controlled experiments taking 40k H100/A100 hours.
The resulting OpenThinker3-7B model achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B model.
A generative Process Reward Model for long-horizon information seeking. PRINTS learns to verbally estimate the information gain of tool calls, enabling a 4B model to effectively guide 30B+ agents through noisy, complex web search tasks like GAIA Level 3.
What if we could transform advanced math problems into abstract programs that can generate endless, verifiable problem variants? EFAGen uses test-time search with execution feedback to infer executable functional abstractions (EFAs) in Python for diverse math problems, including Olympiad-level problems.
Neural tree search for repo-level code-use planning. MutaGReP explores plan space through LLM-guided mutations, while grounding the plan to functionality in the codebase using a symbol retriever.
Testing is a critical part of software engineering — what if we could automatically discover inputs which break your code? We show how to train SLMs (Qwen2.5-7B + Llama3.1-8B) to generate unit tests that break code and are useful for debugging.
In composable AI systems for visual reasoning, agents must reliably orchestrate noisy, black-box models as tools. We propose a method that trains the agent to perform online error recovery: it learns to identify potential failures in tool outputs and iteratively refine its queries to extract the correct information.
A testbed for RL-style data generation agents + teaching environments to automate post-training: the process of improving a model on diverse, open-ended tasks, based on automatically-discovered model skills / weaknesses.
We show how to improve the program synthesis ability of an LLM from execution feedback and apply it to create a 7B model that writes programs that orchestrate other models to solve computer vision tasks.
We show how to identify unreliable responses from multimodal LLMs by examining the consistency of their responses over the neighborhood of a visual question, without requiring access to the model's internals.
We show how to selectively decompose complex questions into simpler sub-questions to improve zero-shot performance on challenging multimodal reasoning tasks.
Getting labels for a multimodal dataset can be expensive. We show how you can use unlabeled images to improve performance on data-scarce multimodal tasks.
We explore creating CLIP-like models by minimally updating already-trained vision and language models, finding that updating less than 7% of parameters can match full model training.
Understanding the emotional content of social media posts is difficult for traditional sentiment analysis models.
We show that language models do a good job of this if the post can be translated into a natural input space for them.
Are notions of algorithmic fairness based on racial categories meaningful?
We study computer vision datasets that use racial categories, and empirically show that the racial categories encoded in each dataset are often highly inconsistent with each other and with human intuitions.