Here's an example of a plan created by MutaGReP for a user query in the DeepMind/ACME repository.

When a human requests an LLM to complete a coding task using functionality from a large code repository, how do we provide context from the repo to the LLM?
One approach is to add the entire repo to the LLM's context window. However, most tasks involve only a small fraction of the symbols in a repo, longer contexts degrade an LLM's reasoning ability, and context windows are not unlimited. Alternatively, we could emulate the human ability to navigate a large repo, pick out the right functionality, and form a plan to solve the task.
We propose MutaGReP (Mutation-Guided Grounded Repository Plan Search), an approach to search for plans that decompose a user request into natural language steps grounded in the codebase. MutaGReP performs neural tree search in plan space, exploring by mutating plans and using a symbol retriever for grounding. On the challenging LongCodeArena benchmark, our plans use less than 5% of a 128K context window for GPT-4o but rival the coding performance of GPT-4o with a context window filled with the repository. Plans produced by MutaGReP allow Qwen 2.5 Coder 32B and 72B to match the performance of GPT-4o with full repo context and enable progress on the hardest LongCodeArena tasks.
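To make the notion of a repo-grounded plan concrete before the figures below, here is a minimal sketch of the object being searched over. The structure, step wording, and symbol names are illustrative only and are not taken from the released code.

# A repo-grounded plan: an ordered list of natural-language intents, each grounded
# to repository symbols that might be used to implement it (illustrative sketch).
plan = [
    {
        "intent": "Build the environment loop that drives agent-environment interaction",
        "symbols": ["acme.environment_loop.EnvironmentLoop"],   # illustrative grounding
    },
    {
        "intent": "Construct a DQN agent to train on the environment",
        "symbols": ["acme.agents.tf.dqn.DQN"],                  # illustrative grounding
    },
]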
Each node in the tree is a repo-grounded plan. At every time step, a node is selected for expansion, and successors are created by mutating the selected plan. We use an LLM to implement the successor function.
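A hedged sketch of that search loop is below. Here, successor_fn stands in for the LLM-backed mutation step and score_fn for the plan-ranking component; the budget, tie-breaking, and data structures are assumptions for illustration, not the paper's implementation.

import heapq
from itertools import count

def plan_search(root_plan, successor_fn, score_fn, budget=64, branching_factor=3):
    """Best-first tree search over repo-grounded plans (illustrative sketch).

    successor_fn(plan) -> list of mutated plans: an LLM rewrites the plan's
        intents and a symbol retriever re-grounds each modified intent.
    score_fn(plan) -> float: higher means more promising; guides node selection.
    """
    tie = count()                                   # breaks ties between equal scores
    frontier = [(-score_fn(root_plan), next(tie), root_plan)]  # max-heap via negation
    best = root_plan
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)        # pick the most promising plan
        for child in successor_fn(node)[:branching_factor]:
            if score_fn(child) > score_fn(best):
                best = child
            heapq.heappush(frontier, (-score_fn(child), next(tie), child))
    return best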
The successor function mutates a plan (left-most column) to generate new plans (right-most column). For each modified intent, the grounding function maps the intent to symbols that might be used to implement the intent.
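One simple way to implement such a grounding function is dense retrieval over the repo's symbols. The sketch below assumes a sentence-transformers embedder and a pre-extracted list of (qualified name, description) pairs; the retriever actually used may differ.

import numpy as np
from sentence_transformers import SentenceTransformer

class SymbolRetriever:
    """Grounding function: maps a natural-language intent to candidate repo symbols."""

    def __init__(self, symbols, model_name="all-MiniLM-L6-v2"):
        # symbols: list of (qualified_name, short_description) pairs mined from the repo.
        self.symbols = symbols
        self.model = SentenceTransformer(model_name)
        texts = [f"{name}: {desc}" for name, desc in symbols]
        self.embeddings = self.model.encode(texts, normalize_embeddings=True)

    def ground(self, intent, top_k=5):
        """Return the qualified names of the top-k symbols most similar to the intent."""
        query = self.model.encode([intent], normalize_embeddings=True)[0]
        scores = self.embeddings @ query            # cosine similarity (unit vectors)
        order = np.argsort(-scores)[:top_k]
        return [self.symbols[i][0] for i in order]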
Using a fraction of the context, Plan Search (driven by MutaGReP) is competitive with adding the entire codebase to the LLM's context and significantly outperforms ReAct-based planning.
Plans produced by MutaGReP consistently improve performance across all models. Qwen 2.5 Coder 32B with our plans exceeds GPT-4o's full-repo performance despite conditioning on 120k fewer context tokens. Even models stronger than GPT-4o (e.g., O1) benefit from our GPT-4o-generated plans.
Plans found by MutaGReP enable progress on hard tasks where even full-repo context performs poorly. Conditioning on MutaGReP plans yields gains on the hardest 10% of tasks, on which GPT-4o with a context window filled with the repository finds less than 20% of the symbols used in the reference code.
Unconstrained mutation outperforms monotonic mutation, especially at lower budgets. The plot shows the symbol recall of each mutation strategy using best-first search with the oracle scoring function and a branching factor of 3.
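For reference, the distinction between the two mutation strategies can be sketched as follows. The helpers llm.propose_next_step and llm.rewrite_plan are hypothetical stand-ins for the actual prompting, and the step format matches the illustrative plan structure above.

def monotonic_successors(plan, llm, retriever, branching_factor=3):
    """Monotonic mutation (sketch): existing steps are frozen; new steps may only be appended."""
    children = []
    for _ in range(branching_factor):
        intent = llm.propose_next_step(plan)        # hypothetical LLM helper
        step = {"intent": intent, "symbols": retriever.ground(intent)}
        children.append(plan + [step])
    return children

def unconstrained_successors(plan, llm, retriever, branching_factor=3):
    """Unconstrained mutation (sketch): the LLM may add, edit, remove, or reorder any step."""
    children = []
    for _ in range(branching_factor):
        intents = llm.rewrite_plan(plan)            # hypothetical LLM helper
        children.append(
            [{"intent": i, "symbols": retriever.ground(i)} for i in intents]
        )
    return children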
Informed (best-first) search outperforms uninformed (depth-first) and linear search strategies, and performance improves with branching factor (BF), especially for informed search.
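The oracle scoring function referenced above can reasonably be read as symbol recall against the reference solution. A sketch, assuming the plan structure used in the earlier snippets:

def oracle_symbol_recall(plan, reference_symbols):
    """Oracle plan score (analysis only, sketch): fraction of symbols used in the
    reference solution that appear among the plan's grounded symbols. It requires
    the reference code, so it is used to study search behavior, not at inference time."""
    reference = set(reference_symbols)
    if not reference:
        return 0.0
    grounded = {name for step in plan for name in step["symbols"]}
    return len(grounded & reference) / len(reference)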
@article{khan2025mutagrep,
  title={MutaGReP: Execution-Free Repository-Grounded Plan Search for Code-Use},
  author={Khan, Zaid and Farhadi, Ali and Krishna, Ranjay and Weihs, Luca and Bansal, Mohit and Gupta, Tanmay},
  journal={arXiv preprint arXiv:xxxx.xxxx},
  year={2025}
}