Leaderboard

All results are evaluated using our Live-SWE-agent scaffold.
The badge indicates that results have been verified.

| # | Model | % Resolved | Avg. Cost ($) | Organization | Date |
|---|-------|------------|---------------|--------------|------|
| 1 | Claude 4.5 Sonnet (20250929) | 75.4% | $0.680 | Anthropic | 2025-11-15 |
| 2 | GPT-5 (2025-08-07) | 68.4% | $0.270 | OpenAI | 2025-11-15 |
| 3 | GPT-5-mini (2025-08-07) | 63.0% | $0.050 | OpenAI | 2025-11-15 |

Live-SWE-agent

Live-SWE-agent is the first live software agent: it autonomously and continuously evolves itself on the fly, at runtime, while solving real-world software problems.
The figure below presents an overview of Live-SWE-agent.
[Figure: Live-SWE-agent Overview]
First, (1) the agent takes in both the project codebase and the description of the issue to be solved, equipped with only a limited set of tools (e.g., bash commands); the goal is for the agent to generate and use its own tools on the fly while solving the issue.
During execution, at each step, it can choose to either (2) output a command (e.g., to use a tool) or (3) create a custom tool that can help it solve the issue. In Live-SWE-agent, we define a custom tool as a script that can be executed in the environment.
Next, based on the (4) environmental feedback message, (5) we specifically ask the agent to reflect on its past steps and decide whether a tool should be created.
This loop repeats until the agent (6) submits a solution to the initial problem.
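The loop above can be sketched in Python. This is a toy illustration, not Live-SWE-agent's actual implementation: all names (`create_tool`, `run_in_env`, `solve`) are hypothetical, and the "agent policy" is a scripted list of actions rather than model output.

```python
import os
import stat
import subprocess
import tempfile


def create_tool(tool_dir, name, script):
    """Step (3): a custom tool is just an executable script in the environment."""
    path = os.path.join(tool_dir, name)
    with open(path, "w") as f:
        f.write(script)
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return path


def run_in_env(command):
    """Steps (2)/(4): run a shell command and capture the environment feedback."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr


def solve(actions):
    """Toy agent loop consuming a scripted policy of (kind, payload) actions.
    A real agent would instead pick each action from model output after the
    reflection step (5)."""
    tool_dir = tempfile.mkdtemp(prefix="live_tools_")
    feedback = None
    for kind, payload in actions:
        if kind == "create_tool":        # step (3): write a new tool...
            name, script = payload
            tool = create_tool(tool_dir, name, script)
            feedback = run_in_env(tool)  # ...and immediately try it out
        elif kind == "command":          # step (2): plain shell command
            feedback = run_in_env(payload)
        elif kind == "submit":           # step (6): final answer ends the loop
            return payload, feedback
    return None, feedback


# Scripted run: create a tiny line-counting tool, then submit a (dummy) patch.
patch, last_feedback = solve([
    ("create_tool", ("count_lines.sh",
                     "#!/bin/sh\nprintf 'hello\\nworld\\n' | wc -l\n")),
    ("submit", "diff --git a/fix.py b/fix.py"),
])
print(patch)
```

The key design point mirrored here is that a tool is nothing more than an executable script dropped into the environment, so the agent can extend its own toolset with the same mechanism it uses to run commands.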

SWE-bench Verified

The SWE-bench Verified benchmark contains 500 software development problems, where the goal is to successfully modify the repository given a problem description. SWE-bench Verified is validated by human developers to ensure that each problem description contains a sufficient amount of information to solve the issue.
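Since the benchmark has a fixed size of 500 problems, a leaderboard "% Resolved" score maps directly to a problem count. A minimal sketch (the helper name is illustrative, and the scores are the ones from the leaderboard above):

```python
TOTAL_PROBLEMS = 500  # size of SWE-bench Verified


def resolved_count(pct):
    """Convert a '% Resolved' score into a count of resolved problems."""
    return round(pct / 100 * TOTAL_PROBLEMS)


# Leaderboard scores from above:
for model, pct in [("Claude 4.5 Sonnet", 75.4),
                   ("GPT-5", 68.4),
                   ("GPT-5-mini", 63.0)]:
    print(f"{model}: {resolved_count(pct)}/{TOTAL_PROBLEMS} problems resolved")
```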