add readme
README.md
ADDED
@@ -0,0 +1,92 @@
---
title: SWE Agent PR Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
hf_oauth: true
pinned: false
short_description: Track and compare GitHub pull request statistics for SWE agents
---

# SWE Agent PR Leaderboard

A lightweight platform for tracking real-world GitHub pull request statistics for software engineering agents. No benchmarks. No simulations. Just actual code that got merged.

## Why This Exists

Most AI coding agent benchmarks rely on human-curated test suites and simulated environments. They're useful, but they don't tell you what happens when an agent meets real repositories, real maintainers, and real code review standards.

This leaderboard flips that approach. Instead of synthetic tasks, we measure what matters: did the PR get merged? How long did it take? How many actually made it through? These are the signals that reflect genuine software engineering impact - the kind you'd see from a human contributor.

If an agent can consistently get pull requests accepted across different projects, that tells you something no benchmark can.

## What We Track

The leaderboard pulls data directly from GitHub's PR history and shows you four key metrics:

- **Total PRs**: How many pull requests the agent has opened
- **Merged PRs**: How many actually got merged (not just closed)
- **Acceptance Rate**: Percentage of PRs that made it through review and got merged
- **Median Merge Duration**: Typical time from PR creation to merge, in minutes

These aren't fancy metrics, but they're honest ones. They show which agents are actually contributing to real codebases.
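
For concreteness, here's how those four numbers can be derived from raw PR records (a minimal sketch; the `created_at`/`merged_at` field names and ISO timestamp format are assumptions, not necessarily the app's exact schema):

```python
from datetime import datetime
from statistics import median

def summarize(prs):
    """Compute the four leaderboard metrics from a list of PR records."""
    # Assumed schema: each record has ISO-8601 'created_at' and 'merged_at',
    # with 'merged_at' set to None for PRs that were never merged.
    merged = [pr for pr in prs if pr["merged_at"] is not None]
    minutes = [
        (datetime.fromisoformat(pr["merged_at"])
         - datetime.fromisoformat(pr["created_at"])).total_seconds() / 60
        for pr in merged
    ]
    return {
        "total_prs": len(prs),
        "merged_prs": len(merged),
        "acceptance_rate": 100 * len(merged) / len(prs) if prs else 0.0,
        "median_merge_minutes": median(minutes) if minutes else None,
    }
```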

## How It Works

Behind the scenes, we're doing a few things:

**Data Collection**
We search GitHub using multiple query patterns to catch all PRs associated with an agent (the sketch after this list shows the simplest pattern):
- Direct authorship (`author:agent-name`)
- Branch-based PRs (`head:agent-name/`)
- Co-authored commits (because some agents work collaboratively)
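
Assuming GitHub's REST search endpoint, the direct-authorship query might be issued like this (a minimal sketch; the real app adds retry logic and pagination):

```python
import requests

def search_author_prs(identifier, token=None):
    """Fetch one page of PRs authored by the agent via GitHub's search API."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    params = {"q": f"type:pr author:{identifier}", "per_page": 100}
    resp = requests.get("https://api.github.com/search/issues",
                        headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]
```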

**Regular Updates**
The leaderboard refreshes every 24 hours automatically. You can also hit the refresh button if you want fresh data right now.
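
A minimal sketch of that daily cycle, assuming a hypothetical `refresh_leaderboard()` helper that recomputes every agent's stats:

```python
import threading

REFRESH_INTERVAL_SECONDS = 24 * 60 * 60  # refresh once every 24 hours

def schedule_refresh():
    """Run one refresh, then re-arm the timer for the next day."""
    try:
        refresh_leaderboard()  # hypothetical helper: recompute all agent stats
    finally:
        timer = threading.Timer(REFRESH_INTERVAL_SECONDS, schedule_refresh)
        timer.daemon = True  # don't keep the process alive just for the timer
        timer.start()
```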

**Community Submissions**
Anyone can submit an agent to track. We store agent metadata on HuggingFace datasets (`SWE-Arena/pr_agents`) and the computed leaderboard data in another dataset (`SWE-Arena/pr_leaderboard`).
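
Storing a submission then comes down to one `upload_file` call against the agents dataset (a sketch using the dataset names above; the app's actual helper may differ in detail):

```python
import io
import json
from huggingface_hub import HfApi

def save_agent(data, token):
    """Upload agent metadata as {identifier}.json to the agents dataset."""
    payload = json.dumps(data, indent=2).encode("utf-8")
    HfApi().upload_file(
        path_or_fileobj=io.BytesIO(payload),
        path_in_repo=f"{data['github_identifier']}.json",
        repo_id="SWE-Arena/pr_agents",
        repo_type="dataset",
        token=token,
    )
```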

## Using the Leaderboard

**Just Browsing?**
Head to the Leaderboard tab. You can search by agent name or organization, and filter by acceptance rate or merge duration. Click refresh if you want the latest numbers.

**Want to Add Your Agent?**
Go to the Submit Agent tab and fill in:
- GitHub identifier (agent account)
- Agent name
- Organization name
- Description (optional but helpful)
- Website URL (optional)

Hit submit. We'll validate the GitHub identifier, fetch the PR history, and add it to the board. The whole process takes a few seconds.
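
Once accepted, the stored record for an agent might look something like this (illustrative values only; field names assumed from the form above):

```json
{
  "github_identifier": "example-agent[bot]",
  "agent_name": "Example Agent",
  "organization": "Example Org",
  "description": "An illustrative entry, not a real submission",
  "website": "https://example.com"
}
```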

## Understanding the Metrics

**Total PRs vs Merged PRs**
Not every PR should get merged. Sometimes agents propose changes that don't fit the project's direction, or they might be experiments. But a consistently low merge rate might signal that an agent isn't quite aligned with what maintainers want.

**Acceptance Rate**
This is the percentage of PRs that got merged. Higher is generally better, but context matters. An agent opening 100 PRs with a 20% acceptance rate is different from one opening 10 PRs at 80%: the first lands more changes (20 merged versus 8), but the two say very different things about reliability.

**Median Merge Duration**
How long it typically takes from opening a PR to seeing it merged. Faster isn't always better - some PRs need time for discussion and iteration. But extremely long merge times might indicate PRs that sat idle or needed extensive back-and-forth.

## What's Next

We're planning a few additions:

- **Historical trends**: Track how agents improve over time
- **Repository breakdowns**: See which projects an agent contributes to
- **Time-series visualizations**: Watch acceptance rates and merge times evolve
- **Extended metrics**: Review round-trips, conversation depth, files changed per PR

The goal isn't to build the most sophisticated leaderboard. It's to build the most honest one.

## Questions or Issues?

If something breaks, you want to suggest a feature, or you're seeing weird data for your agent, [open an issue](https://github.com/SE-Arena/SWE-Merge/issues) and we'll take a look.
app.py
CHANGED
@@ -6,7 +6,7 @@ import time
 import requests
 from datetime import datetime, timezone
 from collections import defaultdict
-from huggingface_hub import HfApi,
+from huggingface_hub import HfApi, hf_hub_download
 from datasets import load_dataset, Dataset
 import threading
 from dotenv import load_dotenv

@@ -104,8 +104,7 @@ def normalize_date_format(date_string):
 # GITHUB API OPERATIONS
 # =============================================================================

-def request_with_backoff(method, url, *, headers=None, params=None, json_body=None, data=None,
-                         max_retries=10, timeout=60):
+def request_with_backoff(method, url, *, headers=None, params=None, json_body=None, data=None, max_retries=10, timeout=30):
     """
     Perform an HTTP request with exponential backoff and jitter for GitHub API.
     Retries on 403/429 (rate limits), 5xx server errors, and transient network exceptions.

@@ -241,7 +240,7 @@ def fetch_all_prs(identifier, token=None):
         }

         try:
-            response = request_with_backoff('GET', url, headers=headers, params=params
+            response = request_with_backoff('GET', url, headers=headers, params=params)
             if response is None:
                 print(f"Error fetching PRs for query '{query}': retries exhausted")
                 break

@@ -419,14 +418,22 @@ def load_leaderboard_dataset():
     return None


+def get_hf_token():
+    """Get HuggingFace token from environment variables."""
+    token = os.getenv('HF_TOKEN')
+    if not token:
+        print("Warning: HF_TOKEN not found in environment variables")
+    return token
+
+
 def save_agent_to_hf(data):
     """Save a new agent to HuggingFace dataset as {identifier}.json in root."""
     try:
         api = HfApi()
-        token =
+        token = get_hf_token()

         if not token:
-            raise Exception("No HuggingFace token found")
+            raise Exception("No HuggingFace token found. Please set HF_TOKEN in your Space settings.")

         identifier = data['github_identifier']
         filename = f"{identifier}.json"

@@ -458,9 +465,9 @@
 def save_leaderboard_to_hf(cache_dict):
     """Save complete leaderboard to HuggingFace dataset as CSV."""
     try:
-        token =
+        token = get_hf_token()
         if not token:
-            raise Exception("No HuggingFace token found")
+            raise Exception("No HuggingFace token found. Please set HF_TOKEN in your Space settings.")

         # Convert to DataFrame
         data_list = dict_to_cache(cache_dict)