Long-running agents build customized software (a "Claw") to interact with their environments. To be practical for complex, real-world tasks, these agents must fully and autonomously evolve this software in response to a continuous stream of end-user requirements. EvoClaw evaluates how well frontier LLM agents handle this continuous development, benchmarking them against real-world evolution trajectories drawn from open-source repositories.
| # | Model | Agent | Score (%) | Precision (%) | Recall (%) | Resolve (%) | Cost ($) | Out Tok. (K) | Time (h) | Turns |
|---|---|---|---|---|---|---|---|---|---|---|