Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

University of North Carolina at Chapel Hill
EFA Overview
Left: The generative process underlying computational math problems, where different instances share the same underlying problem-solving logic (function) but differ in parameter values. We introduce executable functional abstractions (EFAs) to model this latent structure. Right: We study the task of inferring EFAs, i.e., recovering the underlying problem-solving function and parameters from math problems expressed in natural language.

Abstract

Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from reinforcement learning (procedural environments) to physics (simulation engines). These programs can be seen as functions that execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for mathematical reasoning as problem generators for stress-testing models. However, prior work has been limited to automatically constructing abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced mathematics problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions a large language model (LLM) on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. Through experiments, we demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, that they produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs, such as finding problem variations that are harder or easier for a learner to solve, as well as data generation.

EFAGen Overview

Method Overview
Left: Representation of an executable functional abstraction (EFA) as a Python class. Right: Overview of EFAGen, a method for automatically inferring EFAs from a math problem. In EFAGen, we (a) over-generate multiple EFA candidates with an LLM and (b) filter out invalid candidates that fail automated tests. The EFA can generate new problem variants by sampling parameters and executing the solver.

Given a problem statement and its solution procedure (typically expressed as chain-of-thought reasoning), EFAGen uses a large language model (LLM) to generate a candidate EFA implementation that captures the logic and structure of the original problem.
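As a concrete illustration, an EFA can be pictured as a Python class exposing `sample()` and `solve()` methods, matching the interface shown in the figure. The class, problem family, and method names below are a minimal hypothetical sketch, not the paper's exact representation:

```python
import random

class SumOfMultiplesEFA:
    """Hypothetical EFA for the problem family:
    'What is the sum of all positive multiples of d below n?'"""

    def __init__(self, d, n):
        # Parameters: the divisor d and the exclusive upper bound n.
        self.d, self.n = d, n

    @classmethod
    def sample(cls, rng=random):
        # Sample fresh parameters to instantiate a new problem variant.
        return cls(d=rng.randint(2, 9), n=rng.randint(50, 500))

    def problem(self):
        # Render the parameterized problem as natural language.
        return (f"What is the sum of all positive multiples of "
                f"{self.d} below {self.n}?")

    def solve(self):
        # Closed-form solver: d * (1 + 2 + ... + k), where k is the
        # number of multiples of d strictly below n.
        k = (self.n - 1) // self.d
        return self.d * k * (k + 1) // 2
```

Instantiating the class with the seed problem's parameters reproduces the original problem, while `sample()` draws new parameterizations and hence new problem variants.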

For each problem, we sample N (e.g., 50) EFA candidates and apply a suite of automated tests to discard invalid abstractions. EFAGen uses the following tests to validate candidate EFAs:

  • is_extractable(response): Verifies that a class implementing all required methods can be extracted from the model's response.
  • is_executable(EFA): Confirms that the class can be instantiated and that methods like EFA.sample() and EFA.solve() run without errors.
  • has_dof(EFA): Ensures that sampled parameters differ, rejecting EFAs with zero degrees of freedom that cannot produce new problems.
  • is_single_valued(EFA): Confirms that identical parameters yield equivalent solutions, rejecting impermissible implementations such as multivalued functions or logically incoherent abstractions.
  • matches_original(EFA, orig_params, orig_sol): Validates that the abstraction, when instantiated with the original parameters, produces the original problem and solution. This serves as a cycle-consistency or soundness check.

Any program that fails these tests cannot logically be a valid implementation of an EFA. EFAGen enables generation of EFAs at scale, as large numbers of candidate EFAs can be generated and filtered automatically.
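The filtering stage can be sketched as plain Python predicates over a candidate class. The implementations below, including the `ToyEFA` candidate used to exercise them, are illustrative assumptions about the interface rather than EFAGen's actual test code:

```python
import random

class ToyEFA:
    # Trivial illustrative candidate with the assumed sample/solve interface.
    def __init__(self, a, b):
        self.a, self.b = a, b
    @classmethod
    def sample(cls, rng=random):
        return cls(a=rng.randint(1, 100), b=rng.randint(1, 100))
    def problem(self):
        return f"What is {self.a} + {self.b}?"
    def solve(self):
        return self.a + self.b

def is_executable(EFA):
    # Instantiation, rendering, and solving must all run without errors.
    try:
        inst = EFA.sample()
        inst.problem()
        inst.solve()
        return True
    except Exception:
        return False

def has_dof(EFA, trials=8):
    # Repeated sampling should yield distinct parameterizations.
    seen = {repr(vars(EFA.sample())) for _ in range(trials)}
    return len(seen) > 1

def is_single_valued(EFA, trials=4):
    # Re-solving the same instance must give one and the same answer.
    inst = EFA.sample()
    return len({inst.solve() for _ in range(trials)}) == 1

def matches_original(EFA, orig_params, orig_sol):
    # Cycle consistency: the original parameters must reproduce
    # the seed problem's solution.
    return EFA(**orig_params).solve() == orig_sol

def filter_candidates(candidates, orig_params, orig_sol):
    return [c for c in candidates
            if is_executable(c) and has_dof(c)
            and is_single_valued(c)
            and matches_original(c, orig_params, orig_sol)]
```

Because each predicate only needs to execute the candidate program, the whole filter runs automatically over arbitrarily many sampled candidates.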

Self-Improvement: LMs Can Improve at Inferring EFAs With Execution Feedback

Self-training iterations
LLMs can use our tests to self-improve at inferring EFAs. We plot the percentage of constructed EFAs passing all tests across iterations of self-training, grouped by MATH problem difficulty (left) and by problem category (right). EFAs are harder to infer for higher difficulty levels and for certain categories, and these harder cases improve the most during training.
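One natural way to turn the tests into a verifiable reward is a binary signal: a generated program earns reward 1.0 only if it passes every automated check. The sketch below assumes candidates arrive as raw Python source defining a class named `EFA`; the function and parameter names are illustrative, not the paper's implementation:

```python
def efa_reward(program_src, checks):
    """Binary verifiable reward for self-training (illustrative sketch).

    program_src: a candidate EFA program as Python source text.
    checks: predicates over the extracted class, e.g. is_executable,
            has_dof, is_single_valued, matches_original.
    """
    ns = {}
    try:
        exec(program_src, ns)   # run the candidate program
    except Exception:
        return 0.0              # source does not even execute
    EFA = ns.get("EFA")
    if EFA is None:
        return 0.0              # required class is not extractable
    return 1.0 if all(check(EFA) for check in checks) else 0.0
```

Because the reward is computed purely by executing code, it needs no human labels, which is what makes large-scale self-training on model-written EFAs feasible.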

Augmentation: EFAs are Effective at Expanding Static Math Datasets

Data Augmentation Results
EFA-based data augmentation is consistently effective. Comparison with and without synthetic data augmentation using problems drawn from generated EFAs. The table shows performance across MATH-500 and FnEval benchmarks (November and December snapshots). When augmenting, we use a 1:1 ratio of examples drawn from training data vs. from an EFA, and report results using 33% and 100% of the MATH train set.

Generality: EFAGen Can Work Across Diverse Math Domains

NuminaMath Results
EFAGen can infer EFAs for diverse sources of math problems. Here, we show the results of applying EFAGen to the NuminaMath dataset, which contains math problems from a diverse range of sources, from grade-school mathematics (GSM8K) to national and international olympiads (olympiads). EFAGen achieves a nonzero success rate across all sources of problems.

Adversarial Search: EFAGen Can Find Hard Problem Variants

Hard Problem Variants
EFAs can find harder variants of problems. We infer an EFA for a sample of Level 1 (easiest) and Level 5 (hardest) seed problems that GPT-4o solves correctly, and generate k variants of each problem. We plot the percentage of seed problems for which we found a variant that GPT-4o solves incorrectly.
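This adversarial search can be sketched as rejection sampling over an EFA's variants: keep sampling until the learner slips. The `learner` interface (a problem string in, an answer out) and the toy EFA below are assumptions for illustration; in the paper's setting the learner is GPT-4o:

```python
import random

class DoublingEFA:
    # Toy EFA standing in for an inferred abstraction.
    def __init__(self, x):
        self.x = x
    @classmethod
    def sample(cls, rng=random):
        return cls(x=rng.randint(1, 9))
    def problem(self):
        return f"What is {self.x} + {self.x}?"
    def solve(self):
        return 2 * self.x

def find_hard_variant(efa_cls, learner, k=32, rng=random):
    # Sample up to k variants of the EFA; return the first one the
    # learner answers incorrectly, or None if the budget runs out.
    for _ in range(k):
        inst = efa_cls.sample(rng)
        if learner(inst.problem()) != inst.solve():
            return inst
    return None
```

The same loop, run with the inequality flipped, would instead search for variants that are easier for the learner.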

Citation

@article{khan2025executable,
  title={Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems},
  author={Khan, Zaid and Stengel-Eskin, Elias and Prasad, Archiki and Cho, Jaemin and Bansal, Mohit},
  journal={arXiv preprint arXiv:2504.09763},
  year={2025}
}