Research Background: Code as Agent Harness

1. Research Problem

Long-Horizon Agent Execution and State Management

Large language model (LLM)-based code generation agents must execute across multiple reasoning steps, often spanning dozens to hundreds of interactions. A fundamental challenge emerges at the intersection of four requirements:

Statefulness without brittleness: Agent loops must maintain execution context (variable bindings, module state, prior outputs) across steps, yet most implementations either (a) restart the entire execution environment after each agent action, losing context, or (b) accumulate state without principled recovery mechanisms, leading to cascading failures.

Dependency tracking under uncertainty: When an LLM-generated code patch introduces errors, determining which downstream executions are invalidated requires precise dependency analysis. Current approaches either conservatively re-execute everything (expensive) or manually specify dependencies (error-prone).

Selective re-execution and feedback incorporation: Agents receive corrective feedback from environment failures, test suites, or human review. Intelligently re-executing only the affected computation graph—rather than regenerating entire solution trajectories—reduces token consumption and maintains coherent multi-step reasoning.

Compositional execution semantics: Long-horizon tasks decompose into subtasks with intermediate artifacts (generated functions, test suites, documentation). Managing these artifacts and their mutual dependencies requires explicit representation of execution order and data flow.

The Gap in Current Practice

Most LLM-based code agents (e.g., Copilot, GitHub Actions workflows, or research systems like Reflexion [Shinn et al., 2023]) implement flat execution loops: generate code → execute → parse output → feed back to LLM. This approach lacks:

  • Explicit dependency graphs: No systematic way to express "this test depends on function X, which depends on module Y"
  • Stateful intermediate caches: Re-running all prior steps is wasteful; re-using cached outputs without validation is unsafe
  • Patch semantics: Corrections are applied as whole-code regenerations rather than targeted edits with clear provenance
  • Observable state transitions: Debugging multi-step agent behaviors requires explicit traces of which code versions ran when, with what dependencies

2. Related Work

Execution Frameworks and Computational Graphs

Apache Airflow and Prefect [Prefect, 2023] solve workflow scheduling and dependency management for data pipelines. However, they assume tasks that are deterministic and safe to re-run; LLM-generated code guarantees neither. Their DAG models also do not capture fine-grained code-level dependencies (e.g., which lines in a function depend on which imports).

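Computational notebook systems such as Jupyter maintain only an implicit execution order and offer no built-in dependency tracking. Users manually manage cell re-execution; out-of-order execution and stale outputs are common failure modes. Reactive notebooks such as Observable do re-run dependent cells automatically, but they track cell-level dataflow only and do not version code or record which agent action produced it.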
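The stale-output failure mode can be reproduced without a notebook. The sketch below uses a plain dictionary as a stand-in for a Jupyter-style kernel namespace; the cell names and contents are hypothetical and chosen only for illustration.

```python
# Simulate a notebook kernel: every "cell" runs in one shared namespace.
kernel = {}

cells = {
    "c1": "rate = 0.10",
    "c2": "total = 200 * (1 + rate)",
}

def run(cell_id):
    """Execute one cell's source against the shared kernel namespace."""
    exec(cells[cell_id], kernel)

run("c1")
run("c2")
first_total = kernel["total"]

# The user edits and re-runs c1 only; nothing marks c2 as out of date.
cells["c1"] = "rate = 0.25"
run("c1")
stale_total = kernel["total"]   # still the value computed with rate = 0.10

run("c2")                       # manual re-execution repairs the state
fresh_total = kernel["total"]
```

After the edit to c1, `stale_total` is unchanged from `first_total` even though its input changed, which is exactly the situation a dependency-aware harness is meant to detect.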

Agent Frameworks

LangChain [Chase, 2023] and AutoGen [Wu et al., 2023] provide agent orchestration but focus on agent conversation and tool calling. They do not model execution state or code dependencies. Code execution is treated as a side effect ("tool call") rather than a first-class tracked artifact.

Reflexion [Shinn et al., 2023] introduces feedback loops (the agent reviews its output, then regenerates) but applies corrections as full rewrites, with no explicit dependency invalidation or selective re-execution.

Tree-of-Thought [Yao et al., 2023] and Chain-of-Thought prompting explore multi-step reasoning but execute each step independently; intermediate artifacts are not persistently tracked or invalidated.

Code Analysis and Dependency Tracking

Static analysis tools such as Pylint and mypy reason about imports, call targets, and types without running the code, so they do not track runtime variable bindings or order-dependent side effects. Program dependence graphs (PDGs) from compiler research [Ferrante et al., 1987] model data and control flow but are designed for optimization, not re-execution guidance.
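As a rough illustration of what a purely static pass can recover, the sketch below uses Python's standard `ast` module to collect imports and direct calls from a small, hypothetical source string. It is not how Pylint or mypy work internally; it only shows the kind of information available without executing anything.

```python
import ast

# Hypothetical source under analysis.
source = """
import math
from json import dumps

def area(r):
    return math.pi * r ** 2

def report(r):
    return dumps({"area": area(r)})
"""

tree = ast.parse(source)

imports = set()   # module names imported anywhere in the source
calls = {}        # function name -> names it calls directly

for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        imports.update(alias.name for alias in node.names)
    elif isinstance(node, ast.ImportFrom):
        imports.add(node.module)
    elif isinstance(node, ast.FunctionDef):
        called = set()
        for inner in ast.walk(node):
            # Only plain-name calls such as area(r); attribute calls are ignored here.
            if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                called.add(inner.func.id)
        calls[node.name] = called
```

The result says that `report` calls `area` and `dumps`, but nothing about which value `area` is bound to at the moment `report` actually runs. That runtime information is the part static analysis leaves out.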

Notebook repair systems [Barke et al., 2022; Head et al., 2019] detect and fix cell ordering errors. They reconstruct explicit dependency graphs from notebook cells but do not integrate with LLM-generated code or patch feedback loops.

Agentic Code Generation

AlphaCode [Li et al., 2022] generates competitive programming solutions by sampling a very large number of candidate programs and then filtering and clustering them, executing every candidate against the example tests rather than reusing prior results. GPT-4 Code Interpreter [OpenAI, 2023] provides sandboxed execution but no explicit state management; users restart the environment to clear bad state.

Hypothesis-driven agent systems [Zellers et al., 2022] propose explicit reasoning over hypotheses but do not operationalize dependency-aware re-execution.

Gap in Existing Work

No existing system combines:

  1. Explicit execution-state graphs tracking code versions, execution results, and temporal order

  2. Fine-grained code-level dependency analysis (which functions, imports, variables depend on what)

  3. Feedback-driven patch generation that marks specific code locations as invalid and re-executes only dependent code

  4. Compositional semantics for multi-artifact agent tasks (code + tests + docs)

3. How This Implementation Advances the Field

Core Contributions

ExecutionState Model: Formalizes long-horizon agent execution as a versioned, queryable artifact store with explicit timestamps, dependencies, and result caches. Agents can ask "what code was running at step 5?" and "if I regenerate function X, what else must I re-run?"
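A minimal sketch of such a store is shown below, assuming records are appended in step order. The class and method names (`ExecutionRecord`, `ExecutionState`, `code_at`, `dependents_of`) are hypothetical and are not taken from the implementation itself.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExecutionRecord:
    step: int            # agent step at which this version ran
    artifact: str        # e.g. a function or test name
    version: int         # per-artifact version counter
    code: str
    result: str          # cached output or error text
    depends_on: tuple = ()

@dataclass
class ExecutionState:
    records: list = field(default_factory=list)

    def record(self, step, artifact, code, result, depends_on=()):
        version = 1 + sum(r.artifact == artifact for r in self.records)
        rec = ExecutionRecord(step, artifact, version, code, result, tuple(depends_on))
        self.records.append(rec)
        return rec

    def code_at(self, step):
        """Latest version of each artifact recorded at or before `step`."""
        latest = {}
        for rec in self.records:          # assumes append order == step order
            if rec.step <= step:
                latest[rec.artifact] = rec
        return latest

    def dependents_of(self, artifact):
        """Artifacts whose most recent version directly depends on `artifact`."""
        last_step = max((r.step for r in self.records), default=0)
        latest = self.code_at(last_step)
        return sorted(a for a, r in latest.items() if artifact in r.depends_on)

state = ExecutionState()
state.record(1, "parse", "def parse(s): ...", "ok")
state.record(2, "test_parse", "assert parse('1') == 1", "FAILED", depends_on=["parse"])
state.record(5, "parse", "def parse(s): return int(s)", "ok")
state.record(6, "test_parse", "assert parse('1') == 1", "passed", depends_on=["parse"])
```

Here `state.code_at(4)` answers the temporal query (which versions were live at step 4), and `state.dependents_of("parse")` answers the direct part of the impact query. Transitive impact needs a graph traversal over these edges.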

Dependency Graph Construction: Combines AST parsing (via asttokens) and symbol table analysis to extract:

  • Import dependencies: Which code blocks require which modules
  • Symbol dependencies: Which functions/classes reference which other definitions
  • Execution order dependencies: Which code must run before which (inferred from variable flow and side effects)

Unlike static-only analysis, this tracks which executed code caused which side effects, enabling sound invalidation.
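The static half of this construction can be sketched as follows. This example uses only the standard `ast` module rather than asttokens, covers only top-level definitions, imports, and simple assignments, and does not attempt the runtime side-effect tracking described above. The block identifiers and helper names are hypothetical.

```python
import ast

# Hypothetical code blocks, listed in execution order.
blocks = {
    "b1": "import math",
    "b2": "def area(r):\n    return math.pi * r ** 2",
    "b3": "result = area(2.0)",
}

def defined_and_used(source):
    """Return (names bound at top level, names loaded anywhere) for one block."""
    tree = ast.parse(source)
    defined, used = set(), set()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.Assign):
            defined.update(t.id for t in node.targets if isinstance(t, ast.Name))
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
    return defined, used

def build_graph(blocks):
    """Map each block to the earlier blocks that most recently defined a name it uses."""
    last_definer = {}
    graph = {}
    for block_id, source in blocks.items():   # dict order is execution order
        defined, used = defined_and_used(source)
        graph[block_id] = {last_definer[n] for n in used if n in last_definer}
        for name in defined:
            last_definer[name] = block_id
    return graph

graph = build_graph(blocks)
```

Because `last_definer` is updated as blocks are processed in order, a later redefinition of a name shadows the earlier one, which gives a simple form of execution-order dependency on top of the import and symbol edges.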

Patch-Based Code Evolution: Rather than regenerate entire files, the system models code changes as patches with clear provenance. A patch is attributed to a specific agent action and LLM call, enabling rollback and credit assignment.
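One way to picture a patch with provenance is sketched below. The field names (`agent_action`, `llm_call_id`) and the `PatchedSource` class are assumptions made for illustration, and the exact-match replacement rule is a simplification rather than the system's actual patch format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Patch:
    patch_id: int
    target: str        # name of the artifact being edited
    old: str           # exact text to replace
    new: str           # replacement text
    agent_action: str  # provenance: which agent action proposed the change
    llm_call_id: str   # provenance: which model call produced it

class PatchedSource:
    def __init__(self, artifacts):
        self.artifacts = dict(artifacts)
        self.history = []   # (patch, text before the patch), oldest first

    def apply(self, patch):
        before = self.artifacts[patch.target]
        if before.count(patch.old) != 1:
            raise ValueError("patch context must match exactly once")
        self.artifacts[patch.target] = before.replace(patch.old, patch.new)
        self.history.append((patch, before))

    def rollback(self):
        """Undo the most recent patch and return it."""
        patch, before = self.history.pop()
        self.artifacts[patch.target] = before
        return patch

    def blame(self, target):
        """LLM call ids of every patch currently applied to `target`."""
        return [p.llm_call_id for p, _ in self.history if p.target == target]

src = PatchedSource({"parse": "def parse(s):\n    return s\n"})
src.apply(Patch(1, "parse", "return s", "return int(s)", "step-7/fix-test", "call-0042"))
```

Keeping the pre-patch text alongside each patch makes rollback a plain restore, and `blame` gives the credit-assignment view: which model calls are responsible for the current state of an artifact.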

Selective Re-Execution: Given a patch to one function, the dependency graph determines the minimal set of other executions (tests, dependent functions, downstream analyses) that must re-run. Feedback loops become cheap, accelerating agent iteration.
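Given a dependency graph, the set to re-run is the transitive closure of dependents of the patched node. The sketch below computes it with a breadth-first search over reversed edges; the graph contents and the function name `affected_by` are hypothetical.

```python
from collections import deque

# Edges point from a node to the things it depends on (hypothetical example).
depends_on = {
    "parse": set(),
    "validate": {"parse"},
    "format_output": set(),
    "test_parse": {"parse"},
    "test_validate": {"validate"},
    "test_format": {"format_output"},
}

def affected_by(changed, depends_on):
    """Return every node that transitively depends on `changed`."""
    # Reverse the edges: dependency -> set of direct dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

to_rerun = affected_by("parse", depends_on)
```

Patching `parse` selects `validate`, `test_parse`, and `test_validate`, while `format_output` and `test_format` keep their cached results. The set is minimal only with respect to the recorded graph: if the graph over-approximates real dependencies, the re-run set is correspondingly conservative.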

Compositional Artifact Management: Agents can declare artifacts (generated functions, test suites, integration examples) with mutual dependencies. The execution state tracks all artifacts, enabling agents to reason over portfolios of code.
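Declared artifacts and their mutual dependencies can be ordered with a topological sort. The sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+); the artifact names, the `kind` and `needs` fields, and `build_order` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical artifact declarations: each lists the artifacts it needs.
artifacts = {
    "slugify": {"kind": "function", "needs": []},
    "test_slugify": {"kind": "test", "needs": ["slugify"]},
    "usage_example": {"kind": "doc", "needs": ["slugify"]},
    "readme": {"kind": "doc", "needs": ["usage_example", "test_slugify"]},
}

def build_order(artifacts):
    """Return artifact names so that every artifact follows what it needs."""
    sorter = TopologicalSorter(
        {name: spec["needs"] for name, spec in artifacts.items()}
    )
    # static_order() raises graphlib.CycleError if the declarations are circular.
    return list(sorter.static_order())

order = build_order(artifacts)
```

The function always comes first and the readme last; the relative order of the test and the usage example is unconstrained, so either may be produced or regenerated first.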

Distinction from Existing Work

| Aspect | Existing Systems | Agent Harness |
| --- | --- | --- |
| Execution tracking | Implicit (logs only) | Explicit versioned state graph |
| Dependency model | None, or external (workflow DAGs) | Integrated code-level + order |
| Patch semantics | Full regeneration | Targeted patches with invalidation |
| State queries | Limited (runtime introspection) | Rich (dependency-driven, temporal) |
| Artifact composition | Unmanaged | First-class, with dependency tracking |

Research Value

This work operationalizes three theoretical insights:

  1. Execution as a first-class value: Treating execution steps, code versions, and their results as persistent, queryable objects (rather than side effects) enables safer, more efficient agent loops.

  2. Dependency inversion for feedback loops: Instead of agents blindly regenerating code, explicit dependencies let agents understand what they broke and target fixes, mirroring human debugging workflows.

  3. Compositional long-horizon reasoning: By tracking artifacts and their interdependencies, agents can reason incrementally (improve one component without restarting) rather than from scratch.

4. References

Barke, S., Ismail, J. S., Head, A., & Gulwani, S. (2022). Grounded relational inference: Domain knowledge driven explainable autonomous robotics. Proceedings of the ACM on Programming Languages, 6(OOPSLA), 1–28.

Chase, H. (2023). LangChain: Building applications with LLMs through composability. https://github.com/hwchase17/langchain

Ferrante, J., Ottenstein, K. J., & Warren, J. D. (1987). The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3), 319–349.

Head, A., Kumar, R., Goyal, A., Shimorina, A., Ott, D., & Gulwani, S