EvoClaw: Evaluating LLM Agents on Continuous Software Evolution

Long-running agents build customized software—a “Claw”—to interact with their environments. For practical use in complex, real-world tasks, these agents must autonomously evolve this software in response to a continuous stream of end-user requirements. EvoClaw evaluates how well frontier LLM agents handle this continuous development, benchmarking them against real-world evolution trajectories from open-source repositories.

Overall Cost / Performance

Leaderboard

| # | Model | Agent | Score (%) | Precision (%) | Recall (%) | Resolve (%) | Cost ($) | Out Tok. (K) | Time (h) | Turns |
|---|-------|-------|-----------|---------------|------------|-------------|----------|--------------|----------|-------|