---
title: SWE-Review
emoji: 👁️
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
hf_oauth: true
pinned: false
short_description: Track GitHub review statistics for SWE assistants
---
# SWE Assistant Review Leaderboard
SWE-Review ranks software engineering assistants by their real-world GitHub review performance.
No benchmarks. No sandboxes. Just real PR reviews from actual repositories.
## Why This Exists
Most AI assistant benchmarks use synthetic tasks and simulated environments. This leaderboard measures real-world performance: how many PRs did the assistant review? What percentage were merged? Were the reviews valuable?
If an assistant can consistently provide valuable reviews across different projects, that tells you something no benchmark can.
## What We Track
Key metrics from the last 180 days:
**Leaderboard Table**
- **Assistant**: Display name of the assistant
- **Website**: Link to the assistant's homepage or documentation
- **Total Reviews**: PR reviews the assistant has made
- **Merged PRs**: PRs reviewed by the assistant that were merged
- **Acceptance Rate**: Percentage of reviewed PRs that were merged
**Monthly Trends**
- Acceptance rate trends (line plots)
- Review volume over time (bar charts)
We focus on the last 180 days so the leaderboard reflects current capabilities and active assistants.
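As a rough sketch, the per-assistant record behind the table could look like the following. The field names mirror the columns above but are illustrative, not the Space's actual schema.
```python
from dataclasses import dataclass

@dataclass
class AssistantStats:
    """Illustrative per-assistant record behind the leaderboard columns above."""
    assistant: str       # display name
    website: str         # homepage or documentation URL
    total_reviews: int   # PR reviews made in the last 180 days
    merged_prs: int      # reviewed PRs that were merged
    rejected_prs: int    # reviewed PRs closed without merging
    pending_prs: int     # reviewed PRs still open (excluded from the acceptance rate)
```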
## How It Works
**Data Collection**
We mine GitHub activity from [GHArchive](https://www.gharchive.org/), tracking:
- PR reviews by the assistant (`PullRequestReviewEvent`)
- PR review comments by the assistant (`PullRequestReviewCommentEvent`)
For each reviewed PR, we determine its final status: Merged, Rejected (closed without merging), or Pending (still open).
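A minimal sketch of this pipeline is below. GHArchive really does serve hourly gzipped JSON-lines dumps at `https://data.gharchive.org/YYYY-MM-DD-H.json.gz`, and the event names match the ones listed above; the function names, filtering, and status rules are an illustration, not the Space's actual code.
```python
import gzip
import io
import json
import urllib.request

REVIEW_EVENTS = {"PullRequestReviewEvent", "PullRequestReviewCommentEvent"}

def reviewed_prs_in_hour(date: str, hour: int, login: str):
    """Yield URLs of PRs that `login` reviewed during one GHArchive hour file."""
    url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
    with urllib.request.urlopen(url) as resp:
        buffer = io.BytesIO(resp.read())          # hour files are modest in size
    for line in gzip.open(buffer, mode="rt", encoding="utf-8"):
        event = json.loads(line)
        if event["type"] in REVIEW_EVENTS and event["actor"]["login"] == login:
            yield event["payload"]["pull_request"]["html_url"]

def pr_status(pr: dict) -> str:
    """Classify a GitHub API pull-request object as Merged, Rejected, or Pending."""
    if pr.get("merged_at"):
        return "Merged"
    if pr.get("state") == "closed":
        return "Rejected"   # closed without being merged
    return "Pending"        # still open
```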
**Regular Updates**
The leaderboard refreshes weekly, every Wednesday at 00:00 UTC.
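The README does not describe the refresh mechanism itself; as one possible sketch, a scheduler such as APScheduler could trigger the rebuild on that cadence:
```python
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

def refresh_leaderboard():
    # Placeholder: re-mine GHArchive and rebuild the leaderboard tables.
    ...

scheduler = BlockingScheduler(timezone="UTC")
# Wednesday at 00:00 UTC, matching the weekly refresh described above.
scheduler.add_job(refresh_leaderboard, CronTrigger(day_of_week="wed", hour=0, minute=0))
scheduler.start()
```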
**Community Submissions**
Anyone can submit an assistant. Submission metadata is stored in `SWE-Arena/bot_data` and computed results in `SWE-Arena/leaderboard_data`; every submission is validated via the GitHub API.
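Validation could be as simple as confirming the submitted login exists, as in the sketch below against the public GitHub Users endpoint; the Space's real checks may go further before writing to `SWE-Arena/bot_data`.
```python
import urllib.error
import urllib.request

def github_account_exists(login: str) -> bool:
    """Check a submitted assistant's login against the public GitHub Users API."""
    req = urllib.request.Request(
        f"https://api.github.com/users/{login}",
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False    # no such account
        raise
```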
## Understanding the Metrics
**Acceptance Rate**
Percentage of reviewed PRs ultimately merged:
```
Acceptance Rate = Merged PRs ÷ (Merged PRs + Rejected PRs) × 100
```
Pending PRs (still open) are excluded to measure only completed reviews.
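In code, the calculation is straightforward; this small helper just restates the formula above:
```python
def acceptance_rate(merged: int, rejected: int) -> float:
    """Merged PRs ÷ (Merged PRs + Rejected PRs) × 100; pending PRs are excluded."""
    decided = merged + rejected
    return 100.0 * merged / decided if decided else 0.0

# Example: 70 merged, 30 rejected, 15 still pending -> 70.0
print(acceptance_rate(merged=70, rejected=30))
```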
What this tells us:
- High rates = valuable reviews identifying quality PRs
- Balanced rates = thorough, critical review practices
- Very low rates = potentially harsh or inaccurate reviews
Context matters: 100 reviews at 70% acceptance differs from 10 reviews at 100%. Consider both rate and volume.
## What's Next
Planned improvements:
- Repository-based analysis
- Extended metrics (response time, depth, message quality)
- Review sentiment analysis
- Review patterns (security, code quality, architecture)
- PR characteristics (size, complexity, type)
## Questions or Issues?
[Open an issue](https://github.com/SE-Arena/SWE-Review/issues) for bugs, feature requests, or data concerns.