Long-running agents build customized software (a "Claw") to interact with their environments. To be practical for complex, real-world tasks, these agents must fully and autonomously evolve this software in response to a continuous stream of end-user requirements. EvoClaw evaluates how well frontier LLM agents handle this continuous development, benchmarking them against real-world evolution trajectories drawn from open-source repositories.
| # | Model | Agent | Score (%) | Precision (%) | Recall (%) | Resolve (%) | Cost ($) | Out Tok. (K) | Time (h) | Turns |
|---|---|---|---|---|---|---|---|---|---|---|