Live-SWE-agent
Can Software Engineering Agents Self-Evolve on the Fly?
Leaderboard
All results are evaluated using our Live-SWE-agent scaffold.
The ✓ badge indicates that results have been verified.
| # | Model | % Resolved | Avg. Cost ($) | Organization | Date |
|---|---|---|---|---|---|
| | Claude 4.5 Sonnet (20250929) ✓ | | $0.680 | Anthropic | 2025-11-15 |
| | GPT-5 (2025-08-07) ✓ | | $0.270 | OpenAI | 2025-11-15 |
| | GPT-5-mini (2025-08-07) ✓ | | $0.050 | OpenAI | 2025-11-15 |
Live-SWE-agent
Live-SWE-agent is the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems.
The figure below presents an overview of Live-SWE-agent.
First, (1) the agent takes in both the project codebase and the description of the issue to be solved. It starts with only a limited set of tools (e.g., bash commands) and aims to generate and use its own tools on the fly while solving the issue.
During execution, at each step, it can choose to either (2) output a command (e.g., to use a tool) or (3) create a custom tool that can help it solve the issue. In Live-SWE-agent, we define a custom tool as a script that can be executed in the environment.
Next, based on the (4) environmental feedback message, (5) we specifically ask the agent to reflect on the past steps and decide whether a tool should be created.
This loop repeats until the agent (6) submits a solution to the initial problem.
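The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Live-SWE-agent implementation: the `Action` type, the `run_agent` function, the wording of `REFLECTION`, and the `scripted_policy` stand-in for a language model are all hypothetical names introduced here.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str        # "command", "create_tool", or "submit"
    content: str     # shell command, tool source code, or final answer
    name: str = ""   # file name, used only when kind == "create_tool"


# Hypothetical wording; the actual reflection prompt is not given in the text.
REFLECTION = "Reflect on the steps so far: would creating a custom tool help?"


def run_agent(policy: Callable[[List[str]], Action],
              workdir: str,
              max_steps: int = 20) -> Optional[str]:
    """Run the step loop until the policy submits or the step budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(history)
        if action.kind == "submit":
            # (6) the agent submits its solution and the loop ends
            return action.content
        if action.kind == "create_tool":
            # (3) a custom tool is just a script written into the environment
            with open(os.path.join(workdir, action.name), "w") as f:
                f.write(action.content)
            feedback = f"created tool {action.name}"
        elif action.kind == "command":
            # (2) run a shell command, possibly invoking a created tool
            proc = subprocess.run(action.content, shell=True, cwd=workdir,
                                  capture_output=True, text=True, timeout=60)
            feedback = proc.stdout + proc.stderr
        else:
            feedback = f"unknown action kind: {action.kind}"
        # (4) environmental feedback plus (5) the reflection request
        history.append(feedback + "\n" + REFLECTION)
    return None


def scripted_policy(history: List[str]) -> Action:
    """Stand-in for a language model: create a tool, use it, then submit."""
    step = len(history)
    if step == 0:
        source = "import sys\nprint(len(sys.argv[1]))\n"
        return Action("create_tool", source, "strlen.py")
    if step == 1:
        return Action("command", f'"{sys.executable}" strlen.py hello')
    # Submit the first line of the last feedback message as the answer.
    return Action("submit", history[-1].splitlines()[0])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_agent(scripted_policy, workdir))
```

In this sketch the reflection request is simply appended to every feedback message, so the policy sees it before choosing its next action. A real agent would replace `scripted_policy` with a model call and would run inside an isolated environment rather than a temporary directory.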
SWE-bench Verified
The SWE-bench Verified benchmark contains 500 software development problems where the goal is to successfully modify the repository given a problem description. SWE-bench Verified is validated by human developers to ensure each problem description has a sufficient amount of information to solve the issue.
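To illustrate how the leaderboard columns could be derived from per-problem outcomes, here is a small sketch. It assumes that "% Resolved" is the share of problems resolved and that "Avg. Cost ($)" is the mean cost per problem. The page does not spell out either definition, and `InstanceResult` and `summarize` are hypothetical names, not part of any SWE-bench tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceResult:
    instance_id: str   # hypothetical identifier for one benchmark problem
    resolved: bool     # whether the submitted patch resolved the issue
    cost_usd: float    # model cost spent on this problem


def summarize(results: List[InstanceResult]) -> Dict[str, float]:
    """Aggregate per-problem results into leaderboard-style metrics."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    resolved = sum(1 for r in results if r.resolved)
    return {
        # Assumed definition: resolved problems as a percentage of all problems.
        "resolved_pct": 100.0 * resolved / n,
        # Assumed definition: mean cost per problem.
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
    }
```

For the full benchmark, `results` would hold 500 entries, one per problem.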