zhimin-z committed · Commit 6435782 · Parent(s): 73aa8ef · refine

README.md
CHANGED
@@ -15,105 +15,88 @@ short_description: Track GitHub PR statistics for SWE agents
(Old version, abridged; most lines were truncated in extraction. Removed text that survives intact:)

- Currently, the leaderboard tracks public GitHub PRs across open-source repositories where the agent has contributed.
- This leaderboard flips that approach. Instead of synthetic tasks, we measure what matters: did the PR get merged? How many actually made it through? Is the agent improving over time? These are the signals that reflect genuine software engineering impact - the kind you'd see from a human contributor.
- **Why 6 Months?** We focus on recent performance (last 6 months) to highlight active agents and current capabilities. This ensures the leaderboard reflects the latest versions of agents rather than outdated historical data, making it more relevant for evaluating current performance.
- Behind the scenes, we're doing a few things:
- Click Submit. We'll validate the GitHub account, fetch the PR history, and add your agent to the board. Initial data loading takes a few seconds.
- Not every PR should get merged. Sometimes agents propose changes that don't fit the project's direction, or they might be experiments. But a consistently low merge rate might signal that an agent isn't quite aligned with what maintainers want.
- Our goal is to make leaderboard data as transparent and reflective of real-world engineering outcomes as possible.
SWE-PR ranks software engineering agents by their real-world GitHub pull request performance.

No benchmarks. No sandboxes. Just real code that got merged.

## Why This Exists

Most AI coding agent benchmarks use synthetic tasks and simulated environments. This leaderboard measures real-world performance: did the PR get merged? How many made it through? Is the agent improving?

If an agent can consistently get pull requests accepted across different projects, that tells you something no benchmark can.

## What We Track

Key metrics from the last 180 days:

**Leaderboard Table**
- **Total PRs**: Pull requests the agent has opened
- **Merged PRs**: PRs that got merged (not just closed)
- **Acceptance Rate**: Percentage of concluded PRs that got merged

**Monthly Trends**
- Acceptance rate trends (line plots)
- PR volume over time (bar charts)

We focus on 180 days to highlight current capabilities and active agents.

## How It Works

**Data Collection**
We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking:
- PRs opened by the agent (`PullRequestEvent`)
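GHArchive publishes each hour of public GitHub events as gzipped newline-delimited JSON. As a rough illustration of the filter described above (not the leaderboard's actual pipeline), here is a minimal sketch that counts PRs a given account opened; the sample events and the `example-bot` login are made up:

```python
import json

# Hypothetical sample of GHArchive-style events (one JSON object per line).
# Real archives are hourly .json.gz files from https://www.gharchive.org/.
lines = [
    '{"type": "PullRequestEvent", "payload": {"action": "opened", '
    '"pull_request": {"number": 7, "user": {"login": "example-bot"}}}}',
    '{"type": "PushEvent", "payload": {}}',
    '{"type": "PullRequestEvent", "payload": {"action": "closed", '
    '"pull_request": {"number": 7, "user": {"login": "example-bot"}}}}',
]

def opened_prs(event_lines, login):
    """Count PRs that `login` opened, mirroring the PullRequestEvent filter."""
    count = 0
    for line in event_lines:
        event = json.loads(line)
        if event.get("type") != "PullRequestEvent":
            continue
        payload = event.get("payload", {})
        pr_user = payload.get("pull_request", {}).get("user", {}).get("login")
        if payload.get("action") == "opened" and pr_user == login:
            count += 1
    return count

print(opened_prs(lines, "example-bot"))  # → 1
```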
**Regular Updates**
The leaderboard refreshes every Wednesday at 00:00 UTC.

**Community Submissions**
Anyone can submit an agent. We store metadata in `SWE-Arena/bot_data` and results in `SWE-Arena/leaderboard_data`. All submissions are validated via the GitHub API.

## Using the Leaderboard

### Browsing
The Leaderboard tab features:
- Searchable table (by agent name or website)
- Filterable columns (by acceptance rate)
- Monthly charts (acceptance trends and activity)

### Adding Your Agent
The Submit Agent tab requires:
- **GitHub identifier**: Agent's GitHub username
- **Agent name**: Display name
- **Developer**: Your name or team
- **Website**: Link to homepage or docs

Submissions are validated and data loads within seconds.

## Understanding the Metrics

**Acceptance Rate**
Percentage of concluded PRs that got merged:

```
Acceptance Rate = Merged PRs ÷ (Merged PRs + Closed-Unmerged PRs) × 100
```

Open PRs are excluded. We only count PRs where a decision has been made (merged or closed).

Context matters: 100 PRs at 20% acceptance differs from 10 PRs at 80%. Consider both rate and volume.
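The formula above can be sketched as a small helper. The function name and the None-for-no-data convention are our illustration, not the leaderboard's actual code:

```python
def acceptance_rate(merged, closed_unmerged):
    """Acceptance Rate = Merged PRs / (Merged PRs + Closed-Unmerged PRs) * 100."""
    concluded = merged + closed_unmerged
    if concluded == 0:
        return None  # no concluded PRs yet, so no rate to report
    return 100.0 * merged / concluded

# 100 PRs at 20% acceptance vs. 10 PRs at 80%: same formula, very different signal.
print(acceptance_rate(20, 80))  # → 20.0
print(acceptance_rate(8, 2))    # → 80.0
```

Note that open PRs never enter the denominator, matching the "concluded only" rule.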
**Monthly Trends**
- **Line plots**: Acceptance rate changes over time
- **Bar charts**: PR volume per month

Patterns to watch:
- Consistent high rates = reliable code quality
- Increasing trends = improving agents
- High volume + good rates = productivity + quality
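The monthly aggregation behind these charts amounts to bucketing PRs by their creation month. A minimal sketch, assuming hypothetical `(created_at, merged)` records rather than the leaderboard's real data model:

```python
from collections import defaultdict
from datetime import datetime

def monthly_buckets(prs):
    """Group (created_at, merged) PR records into per-month totals and merge counts."""
    buckets = defaultdict(lambda: {"total": 0, "merged": 0})
    for created_at, merged in prs:
        # GHArchive-style timestamps end in 'Z'; normalize for fromisoformat
        month = datetime.fromisoformat(created_at.replace("Z", "+00:00")).strftime("%Y-%m")
        buckets[month]["total"] += 1
        buckets[month]["merged"] += int(merged)
    return dict(buckets)

# Hypothetical PR records: (created_at, was it merged?)
prs = [
    ("2025-09-03T10:00:00Z", True),
    ("2025-09-18T12:30:00Z", False),
    ("2025-10-02T08:15:00Z", True),
]
print(monthly_buckets(prs))
# → {'2025-09': {'total': 2, 'merged': 1}, '2025-10': {'total': 1, 'merged': 1}}
```

The per-month acceptance rate for the line plot then follows from each bucket's `merged / total`.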
## What's Next

Planned improvements:
- Repository-based analysis
- Extended metrics (review round-trips, conversation depth, files changed)
- Merge time tracking
- Contribution patterns (bugs, features, docs)

## Questions or Issues?

[Open an issue](https://github.com/SE-Arena/SWE-PR/issues) for bugs, feature requests, or data concerns.
msr.py
CHANGED
@@ -397,53 +397,31 @@ def fetch_all_pr_metadata_streaming(conn, identifiers, start_date, end_date):
(Old version, abridged; most lines were truncated in extraction. The removed query built a `pr_timeline` CTE that grouped PR events by URL, taking `MAX(pr_author)`, `MIN` of the 'opened' event time as `created_at`, `MAX` of the 'closed' event time as `closed_at`, and `MAX(merged_at)`, with a comment noting that newer data (Oct 2025+) has stripped-down objects and that `TRY()` handles both shapes.)
```python
    file_patterns_sql = '[' + ', '.join([f"'{fp}'" for fp in file_patterns]) + ']'

    # Query for this batch
    # Extract all PR metadata from payload, which is available in any PullRequestEvent
    query = f"""
        SELECT DISTINCT
            CONCAT(
                REPLACE(repo.url, 'api.github.com/repos/', 'github.com/'),
                '/pull/',
                CAST(payload.pull_request.number AS VARCHAR)
            ) as url,
            TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.user.login') AS VARCHAR) as pr_author,
            TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.created_at') AS VARCHAR) as created_at,
            TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.merged_at') AS VARCHAR) as merged_at,
            TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.closed_at') AS VARCHAR) as closed_at
        FROM read_json(
            {file_patterns_sql},
            union_by_name=true,
            filename=true,
            compression='gzip',
            format='newline_delimited',
            ignore_errors=true,
            maximum_object_size=2147483648
        )
        WHERE type = 'PullRequestEvent'
            AND payload.pull_request.number IS NOT NULL
            AND TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.created_at') AS VARCHAR) IS NOT NULL
            AND TRY_CAST(json_extract_string(to_json(payload), '$.pull_request.user.login') AS VARCHAR) IN ({identifier_list})
    """

    try:
```
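The query above returns raw `merged_at` / `closed_at` timestamp strings; a PR's final outcome can then be classified downstream. A hedged sketch of that classification step (our illustration of the idea, not the repository's code):

```python
def pr_status(merged_at, closed_at):
    """Classify a PR from its timestamp fields (None means the field was absent)."""
    if merged_at:
        return "merged"
    if closed_at:
        return "closed-unmerged"
    return "open"  # no decision yet, excluded from acceptance rate

print(pr_status("2025-10-01T12:00:00Z", "2025-10-01T12:00:00Z"))  # → merged
print(pr_status(None, "2025-10-02T09:00:00Z"))                    # → closed-unmerged
print(pr_status(None, None))                                      # → open
```

Checking `merged_at` first matters: GitHub sets both `merged_at` and `closed_at` when a PR is merged.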
|