---
title: SWE-Review
emoji: 👁️
colorFrom: red
colorTo: pink
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
hf_oauth: true
pinned: false
short_description: Track GitHub review statistics for SWE assistants
---
# SWE Assistant Review Leaderboard
SWE-Review ranks software engineering assistants by their real-world GitHub review performance.
No benchmarks. No sandboxes. Just real PR reviews from actual repositories.
## Why This Exists
Most AI assistant benchmarks use synthetic tasks and simulated environments. This leaderboard measures real-world performance: how many PRs did the assistant review? What percentage were merged? Were the reviews valuable?
If an assistant can consistently provide valuable reviews across different projects, that tells you something no benchmark can.
## What We Track
Key metrics from the last 180 days:
### Leaderboard Table
- Assistant: Display name of the assistant
- Website: Link to the assistant's homepage or documentation
- Total Reviews: PR reviews the assistant has made
- Merged PRs: PRs reviewed by the assistant that were merged
- Acceptance Rate: Percentage of reviewed PRs that were merged
### Monthly Trends
- Acceptance rate trends (line plots)
- Review volume over time (bar charts)
We focus on 180 days to highlight current capabilities and active assistants.
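As an illustration, here is a minimal sketch of how per-PR review records could be rolled up into these monthly series; the record layout, column names, and sample values are assumptions for the example, not the Space's actual schema:

```python
import pandas as pd

# Hypothetical per-PR records: one row per PR reviewed by an assistant.
records = pd.DataFrame(
    [
        {"assistant": "bot-a", "reviewed_at": "2025-06-03", "status": "merged"},
        {"assistant": "bot-a", "reviewed_at": "2025-06-18", "status": "rejected"},
        {"assistant": "bot-a", "reviewed_at": "2025-07-02", "status": "merged"},
        {"assistant": "bot-a", "reviewed_at": "2025-07-21", "status": "pending"},
    ]
)
records["month"] = pd.to_datetime(records["reviewed_at"]).dt.to_period("M")

# Review volume per month (every review counts, regardless of PR outcome).
volume = records.groupby(["assistant", "month"]).size().rename("total_reviews")

# Acceptance rate per month: merged / (merged + rejected), pending excluded.
closed = records[records["status"].isin(["merged", "rejected"])]
acceptance = (
    closed.assign(merged=closed["status"].eq("merged"))
    .groupby(["assistant", "month"])["merged"]
    .mean()
    .mul(100)
    .rename("acceptance_rate")
)

print(pd.concat([volume, acceptance], axis=1))
```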
## How It Works
### Data Collection
We mine GitHub activity from GHArchive, tracking:
- PR reviews by the assistant (`PullRequestReviewEvent`)
- PR review comments by the assistant (`PullRequestReviewCommentEvent`)
For each reviewed PR, we determine status: Merged, Rejected (closed without merge), or Pending (still open).
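For illustration, a minimal sketch of that pipeline over a single GHArchive hourly file; the assistant login, the chosen hour, and the absence of rate-limit handling are simplifying assumptions, not the Space's actual code:

```python
import gzip
import json
import urllib.request

ASSISTANT_LOGIN = "example-review-bot"  # placeholder assistant account

# One GHArchive hourly dump: gzipped JSON Lines, one GitHub event per line.
url = "https://data.gharchive.org/2025-07-01-12.json.gz"
raw = urllib.request.urlopen(url).read()
events = [json.loads(line) for line in gzip.decompress(raw).splitlines()]

# Keep the (repo, PR number) pairs the assistant reviewed or commented on.
review_types = {"PullRequestReviewEvent", "PullRequestReviewCommentEvent"}
reviewed_prs = {
    (e["repo"]["name"], e["payload"]["pull_request"]["number"])
    for e in events
    if e["type"] in review_types and e["actor"]["login"] == ASSISTANT_LOGIN
}

# Each pair is then looked up via the GitHub REST API
# (GET https://api.github.com/repos/{repo}/pulls/{number}) and classified:
def pr_status(pr: dict) -> str:
    """Classify a GitHub API pull request object as Merged / Rejected / Pending."""
    if pr.get("merged_at"):
        return "Merged"
    if pr["state"] == "closed":
        return "Rejected"
    return "Pending"
```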
### Regular Updates
The leaderboard refreshes weekly, every Wednesday at 00:00 UTC.
### Community Submissions
Anyone can submit an assistant. We store metadata in SWE-Arena/bot_data and results in SWE-Arena/leaderboard_data. All submissions are validated via the GitHub API.
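A minimal sketch of what that validation step might look like, assuming the check is simply that the submitted GitHub account exists (the function name and the exact checks are assumptions, not the Space's actual logic):

```python
import urllib.error
import urllib.request

def github_account_exists(login: str) -> bool:
    """Return True if `login` resolves to a GitHub user or organization."""
    req = urllib.request.Request(
        f"https://api.github.com/users/{login}",
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

# Example: a submission whose GitHub login does not resolve would be rejected.
print(github_account_exists("octocat"))  # True for GitHub's demo account
```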
## Understanding the Metrics
### Acceptance Rate
Percentage of reviewed PRs ultimately merged:
Acceptance Rate = Merged PRs ÷ (Merged PRs + Rejected PRs) × 100
Pending PRs (still open) are excluded to measure only completed reviews.
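In code, the same calculation might look like the sketch below, using the status labels defined above (the function name is illustrative):

```python
def acceptance_rate(statuses: list[str]) -> float | None:
    """Acceptance rate over completed reviews; None if no reviewed PR has been decided yet."""
    merged = statuses.count("Merged")
    rejected = statuses.count("Rejected")
    decided = merged + rejected  # Pending PRs are excluded
    return 100 * merged / decided if decided else None

print(acceptance_rate(["Merged", "Merged", "Rejected", "Pending"]))  # 66.67%
```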
What this tells us:
- High rates = valuable reviews identifying quality PRs
- Balanced rates = thorough, critical review practices
- Very low rates = potentially harsh or inaccurate reviews
Context matters: 100 reviews at 70% acceptance differs from 10 reviews at 100%. Consider both rate and volume.
## What's Next
Planned improvements:
- Repository-based analysis
- Extended metrics (response time, depth, message quality)
- Review sentiment analysis
- Review patterns (security, code quality, architecture)
- PR characteristics (size, complexity, type)
## Questions or Issues?
Open an issue for bugs, feature requests, or data concerns.