Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

University of North Carolina at Chapel Hill
EFA Overview
Left: The generative process underlying computational math problems, where different instances share the same underlying problem-solving logic (function) but differ in parameter values. We introduce executable functional abstractions (EFAs) to model this latent structure. Right: We study the task of inferring EFAs, i.e., recovering the underlying problem-solving function and parameters from math problems expressed in natural language.

Abstract

Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions that execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions a large language model (LLM) on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. Through experiments, we demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, that they produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs, such as finding problem variations that are harder or easier for a learner to solve, as well as data generation.

EFAGen Overview

Method Overview
Left: Representation of an executable functional abstraction (EFA) as a Python class. Right: Overview of EFAGen, a method for automatically inferring EFAs from a math problem. In EFAGen, we (a) over-generate multiple EFA candidates with an LLM and (b) filter out invalid candidates that fail automated tests. The EFA can generate new problem variants by sampling parameters and executing the solver.

Given a problem statement and its solution procedure (typically expressed as chain-of-thought reasoning), EFAGen uses a large language model (LLM) to generate a candidate EFA implementation that captures the logic and structure of the original problem.
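As a concrete illustration, an EFA can be pictured as a Python class exposing `sample()` and `solve()` methods, matching the interface shown in the figure. The class, problem family, and method names below are a minimal hypothetical sketch, not the paper's exact representation:

```python
import random

class SumOfMultiplesEFA:
    """Hypothetical EFA for the problem family:
    'What is the sum of all positive multiples of d below n?'"""

    def __init__(self, d, n):
        # Parameters: the divisor d and the exclusive upper bound n.
        self.d, self.n = d, n

    @classmethod
    def sample(cls, rng=random):
        # Sample fresh parameters to instantiate a new problem variant.
        return cls(d=rng.randint(2, 9), n=rng.randint(50, 500))

    def problem(self):
        # Render the parameterized problem as natural language.
        return (f"What is the sum of all positive multiples of "
                f"{self.d} below {self.n}?")

    def solve(self):
        # Closed-form solver: d * (1 + 2 + ... + k), where k is the
        # number of multiples of d strictly below n.
        k = (self.n - 1) // self.d
        return self.d * k * (k + 1) // 2
```

Instantiating the class with the seed problem's parameters reproduces the original problem, while `sample()` draws new parameterizations and hence new problem variants.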

For each problem, we sample N (e.g., 50) EFA candidates and apply a suite of automated tests to discard invalid abstractions. EFAGen uses the following tests to validate candidate EFAs:

  • is_extractable(response): Verifies that a class implementing all required methods can be extracted from the model's response.
  • is_executable(EFA): Confirms that the class can be instantiated and that methods like EFA.sample() and EFA.solve() run without errors.
  • has_dof(EFA): Ensures that sampled parameters differ, rejecting EFAs with zero degrees of freedom that cannot produce new problems.
  • is_single_valued(EFA): Confirms that identical parameters yield equivalent solutions, rejecting impermissible implementations such as multivalued functions or logically incoherent abstractions.
  • matches_original(EFA, orig_params, orig_sol): Validates that the abstraction, when instantiated with the original parameters, produces the original problem and solution. This serves as a cycle-consistency or soundness check.

Any program that fails these tests cannot logically be a valid implementation of an EFA. EFAGen enables generation of EFAs at scale, as large numbers of candidate EFAs can be generated and filtered automatically.
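The filtering stage can be sketched as plain Python predicates over a candidate class. The implementations below, including the `ToyEFA` candidate used to exercise them, are illustrative assumptions about the interface rather than EFAGen's actual test code:

```python
import random

class ToyEFA:
    # Trivial illustrative candidate with the assumed sample/solve interface.
    def __init__(self, a, b):
        self.a, self.b = a, b
    @classmethod
    def sample(cls, rng=random):
        return cls(a=rng.randint(1, 100), b=rng.randint(1, 100))
    def problem(self):
        return f"What is {self.a} + {self.b}?"
    def solve(self):
        return self.a + self.b

def is_executable(EFA):
    # Instantiation, rendering, and solving must all run without errors.
    try:
        inst = EFA.sample()
        inst.problem()
        inst.solve()
        return True
    except Exception:
        return False

def has_dof(EFA, trials=8):
    # Repeated sampling should yield distinct parameterizations.
    seen = {repr(vars(EFA.sample())) for _ in range(trials)}
    return len(seen) > 1

def is_single_valued(EFA, trials=4):
    # Re-solving the same instance must give one and the same answer.
    inst = EFA.sample()
    return len({inst.solve() for _ in range(trials)}) == 1

def matches_original(EFA, orig_params, orig_sol):
    # Cycle consistency: the original parameters must reproduce
    # the seed problem's solution.
    return EFA(**orig_params).solve() == orig_sol

def filter_candidates(candidates, orig_params, orig_sol):
    return [c for c in candidates
            if is_executable(c) and has_dof(c)
            and is_single_valued(c)
            and matches_original(c, orig_params, orig_sol)]
```

Because each predicate only needs to execute the candidate program, the whole filter runs automatically over arbitrarily many sampled candidates.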

Self-Improvement: LMs Can Improve at Inferring EFAs With Execution Feedback

Self-training iterations
LLMs can use our tests to self-improve at inferring EFAs. We plot the percentage of constructed EFAs passing all tests across iterations of self-training, grouped by MATH problem difficulty (left) and by problem category (right). EFAs are harder to infer for higher difficulty levels and for certain categories, and these harder cases improve the most during training.
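One natural way to turn the tests into a verifiable reward is a binary signal: a generated program earns reward 1.0 only if it passes every automated check. The sketch below assumes candidates arrive as raw Python source defining a class named `EFA`; the function and parameter names are illustrative, not the paper's implementation:

```python
def efa_reward(program_src, checks):
    """Binary verifiable reward for self-training (illustrative sketch).

    program_src: a candidate EFA program as Python source text.
    checks: predicates over the extracted class, e.g. is_executable,
            has_dof, is_single_valued, matches_original.
    """
    ns = {}
    try:
        exec(program_src, ns)   # run the candidate program
    except Exception:
        return 0.0              # source does not even execute
    EFA = ns.get("EFA")
    if EFA is None:
        return 0.0              # required class is not extractable
    return 1.0 if all(check(EFA) for check in checks) else 0.0
```

Because the reward is computed purely by executing code, it needs no human labels, which is what makes large-scale self-training on model-written EFAs feasible.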

Augmentation: EFAs are Effective at Expanding Static Math Datasets

Data Augmentation Results
EFA-based data augmentation is consistently effective. Comparison with and without synthetic data augmentation using problems drawn from generated EFAs. The table shows performance across MATH-500 and FnEval benchmarks (November and December snapshots). When augmenting, we use a 1:1 ratio of examples drawn from training data vs. from an EFA, and report results using 33% and 100% of the MATH train set.

Generality: EFAGen Can Work Across Diverse Math Domains

NuminaMath Results
EFAGen can infer EFAs for diverse sources of math problems. Here, we show the results of applying EFAGen to the NuminaMath dataset, which contains math problems from a diverse range of sources, from grade-school mathematics (GSM8K) to national and international olympiads (olympiads). EFAGen achieves a nonzero success rate across all sources of problems.

Adversarial Search: EFAGen Can Find Hard Problem Variants

Hard Problem Variants
EFAs can find harder variants of problems. We infer an EFA for a sample of Level 1 (easiest) and Level 5 (hardest) seed problems that GPT-4o solves correctly, and generate k variants of each problem. We plot the percentage of seed problems for which we found a variant that GPT-4o solves incorrectly.
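This adversarial search can be sketched as rejection sampling over an EFA's variants: keep sampling until the learner slips. The `learner` interface (a problem string in, an answer out) and the toy EFA below are assumptions for illustration; in the paper's setting the learner is GPT-4o:

```python
import random

class DoublingEFA:
    # Toy EFA standing in for an inferred abstraction.
    def __init__(self, x):
        self.x = x
    @classmethod
    def sample(cls, rng=random):
        return cls(x=rng.randint(1, 9))
    def problem(self):
        return f"What is {self.x} + {self.x}?"
    def solve(self):
        return 2 * self.x

def find_hard_variant(efa_cls, learner, k=32, rng=random):
    # Sample up to k variants of the EFA; return the first one the
    # learner answers incorrectly, or None if the budget runs out.
    for _ in range(k):
        inst = efa_cls.sample(rng)
        if learner(inst.problem()) != inst.solve():
            return inst
    return None
```

The same loop, run with the inequality flipped, would instead search for variants that are easier for the learner.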

Citation

@article{khan2025executable,
  title={Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems},
  author={Khan, Zaid and Stengel-Eskin, Elias and Prasad, Archiki and Cho, Jaemin and Bansal, Mohit},
  journal={arXiv preprint arXiv:2504.09763},
  year={2025}
}