SWE-bench Contributor Analyzer
A tool for analyzing the SWE-bench and SWE-bench-verified datasets to identify GitHub issues and pull requests that any developer has contributed to, with comprehensive visualization of results. You can use this tool to discover your own contributions to codebases in the SWE-bench datasets.
📊 Sample Visualizations
View my personal SWE-bench contribution report to see an example of the visualization output from this tool.
This sample shows my personal contribution data, including:
- Distribution of contributions by repository
- Contribution types (issue author, PR author, commenter)
- Timeline of contributions over time
- Repository-type correlation heatmaps
- Detailed list of all contributions with links to GitHub
You can run this tool on your own GitHub username to generate similar visualizations for your SWE-bench contributions!
Overview
This repository contains two main components:
- SWE-bench Analyzer: Finds instances in the SWE-bench datasets where you are an author, commenter, or assignee, or where you are mentioned in issues/PRs.
- Visualizer: Creates interactive charts and a comprehensive HTML report to visualize your contributions.
The analyzer can operate in two modes:
- Offline mode: Scans the dataset files directly for mentions of your username
- GitHub API mode (default): Makes API calls to GitHub to check for your contributions in more detail
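As a rough illustration of offline mode, the sketch below checks one dataset instance for an `@username` mention. The helper name `find_mentions` and the matching rule are assumptions made for this example; the analyzer's actual matching logic may differ. Only the field names `problem_statement` and `hints_text` come from the dataset description.

```python
import re

# Text fields described in the "Dataset Structure" section.
TEXT_FIELDS = ("problem_statement", "hints_text")


def find_mentions(instance, username):
    """Return the text fields of one instance that mention @username.

    Hypothetical helper: the matching rule is an assumption, not the
    analyzer's real implementation.
    """
    # GitHub handles are case-insensitive. The negative lookahead keeps
    # "@octo" from matching inside a longer handle such as "@octocat".
    pattern = re.compile(
        r"@" + re.escape(username) + r"(?![A-Za-z0-9-])", re.IGNORECASE
    )
    return [f for f in TEXT_FIELDS if pattern.search(instance.get(f) or "")]


# Illustrative instance; the mention is made up for this example.
instance = {
    "instance_id": "sympy__sympy-11400",
    "problem_statement": "ccode(sinc(x)) doesn't work",
    "hints_text": "Thanks @octocat, the relevant file is sympy/printing/ccode.py",
}
print(find_mentions(instance, "octocat"))  # ['hints_text']
```

A text scan like this can only detect mentions that appear in the dataset itself, which is why the GitHub API mode is described as the more detailed check.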
Features
- Automatic dataset downloading from Hugging Face
- Secure credential storage using your system’s keyring
- Configuration caching for streamlined repeat usage
- GitHub API response caching to avoid rate limits and speed up repeat analyses
- Comprehensive analysis across both SWE-bench datasets
- Detailed reporting with contribution breakdowns by repository and type
- Results caching for viewing previous analyses without rerunning
- Interactive visualizations including:
  - Repository distribution charts
  - Contribution type analysis
  - Timeline of contributions
  - Repository and contribution type heatmaps
  - Dataset comparisons
- Exportable HTML reports for easy sharing and presenting results
About the SWE-bench Datasets
SWE-bench is a benchmark dataset created by Princeton NLP to evaluate language models on real-world software engineering tasks.
- SWE-bench: 2,294 real-world GitHub Issue-Pull Request pairs from 12 popular Python repositories including sympy, scikit-learn, matplotlib, pandas, and more. Each instance contains an issue description, code context, and a corresponding PR that fixes the issue.
- SWE-bench-verified: 500 human-validated samples from the full dataset, carefully selected and verified to ensure high-quality examples.
Dataset Structure
Each SWE-bench instance includes:
- instance_id: A unique identifier in the format {repo_name}__{issue_number} (e.g., “sympy__sympy-11400”)
- repo: The GitHub repository path (e.g., “sympy/sympy”)
- problem_statement: The original issue description from GitHub
- hints_text: Comments and discussion from the GitHub issue that provide guidance
- Other metadata such as creation date, URL, and code context
Example instance:
{
"instance_id": "sympy__sympy-11400",
"repo": "sympy/sympy",
"title": "ccode(sinc(x)) doesn't work",
"url": "https://github.com/sympy/sympy/issues/11400",
"created_at": "2016-07-15T21:40:49Z",
"problem_statement": "ccode(sinc(x)) doesn't work\n...",
"hints_text": "The relevant file is sympy/printing/ccode.py\n..."
}
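The instance_id format can be split back into a repository path and a number. The sketch below assumes the layout seen in the example (`owner__repo-number`); the function name `parse_instance_id` is hypothetical and not part of the tool.

```python
def parse_instance_id(instance_id):
    """Split an id such as 'sympy__sympy-11400' into ('sympy/sympy', 11400).

    Assumes the 'owner__repo-number' layout shown in the example instance.
    """
    owner, sep, rest = instance_id.partition("__")
    # Split on the LAST hyphen so repository names that contain hyphens
    # (for example scikit-learn) stay intact.
    repo, dash, number = rest.rpartition("-")
    if not (sep and dash and owner and repo and number.isdigit()):
        raise ValueError(f"unexpected instance_id format: {instance_id!r}")
    return f"{owner}/{repo}", int(number)


repo, number = parse_instance_id("sympy__sympy-11400")
print(repo, number)  # sympy/sympy 11400
# Same URL shape as the "url" field in the example instance.
print(f"https://github.com/{repo}/issues/{number}")
```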
The datasets can be downloaded from Hugging Face.
The SWE-bench Contributor Analyzer helps you identify instances in these datasets where you’ve contributed as an author or commenter, or where you are mentioned in the issue/PR.
Installation
Prerequisites
- Python 3.6+
- Required packages:
pip install -r requirements.txt
This will install all necessary dependencies, including:
- requests, tqdm, keyring - For the analyzer
- datasets - For downloading SWE-bench datasets
- matplotlib, seaborn, pandas, numpy - For visualizations
Setup
- Clone or download this repository
- Install the required dependencies:
pip install -r requirements.txt
- Make the scripts executable (optional):
chmod +x swebench_analyzer.py
chmod +x visualize_results.py
Usage
Basic Usage
For first-time use, simply run:
python swebench_analyzer.py
The script will:
- Prompt for your GitHub username
- Securely ask for a GitHub API token
- Download both SWE-bench datasets from Hugging Face
- Analyze them using the GitHub API
- Save the results to user_contributions.json
Command Line Options
usage: swebench_analyzer.py [-h] [--username USERNAME] [--dataset DATASET]
[--output OUTPUT] [--token TOKEN] [--no-github]
[--cache-dir CACHE_DIR] [--github-cache-dir GITHUB_CACHE_DIR]
[--no-cache] [--clear-cache] [--cache-expiry CACHE_EXPIRY]
[--load-results LOAD_RESULTS] [--refresh-token]
Analyze SWE-bench datasets for user contributions
optional arguments:
-h, --help show this help message and exit
--username USERNAME GitHub username to check
--dataset DATASET Path to dataset or dataset name (swe-bench, swe-bench-verified, both)
--output OUTPUT Output file path
--token TOKEN GitHub API token (optional, will use cached or prompt)
--no-github Disable GitHub API analysis (offline mode only)
--cache-dir CACHE_DIR
Directory to cache downloaded datasets
--github-cache-dir GITHUB_CACHE_DIR
Directory to cache GitHub API responses
--no-cache Disable GitHub API response caching
--clear-cache Clear GitHub API response cache before running
--cache-expiry CACHE_EXPIRY
Number of days after which to expire cached GitHub data
--load-results LOAD_RESULTS
Load previously saved results instead of running analysis
--refresh-token Force refresh of GitHub token
Analyzer Examples
Analyze both datasets (default)
python swebench_analyzer.py --username your-github-username
Analyze only one dataset
python swebench_analyzer.py --dataset swe-bench-verified
Use offline mode (no GitHub API)
python swebench_analyzer.py --no-github
Load previously saved results
python swebench_analyzer.py --load-results user_contributions.json
Update your GitHub API token
python swebench_analyzer.py --refresh-token
Clear the GitHub API cache
python swebench_analyzer.py --clear-cache
Disable GitHub API caching
python swebench_analyzer.py --no-cache
Set custom GitHub API cache directory
python swebench_analyzer.py --github-cache-dir ~/.github-cache
Set custom cache expiration time
python swebench_analyzer.py --cache-expiry 14 # Cache valid for 14 days
Visualizer Usage
After running the analyzer and generating the user_contributions.json file, visualize the results:
python visualize_results.py
This will:
- Load the data from user_contributions.json
- Create various visualizations
- Generate an HTML report in the visualizations/ directory
- Save all chart images in the same directory
Visualizer Command Line Options
usage: visualize_results.py [-h] [--input INPUT] [--output-dir OUTPUT_DIR] [--show]
Visualize SWE-bench analyzer results
optional arguments:
-h, --help show this help message and exit
--input INPUT Input JSON file from swebench_analyzer.py (default: user_contributions.json)
--output-dir OUTPUT_DIR
Directory to save visualizations (default: visualizations)
--show Show visualizations instead of saving to files
Visualizer Examples
Specify a custom input file
python visualize_results.py --input path/to/results.json
Show visualizations interactively instead of saving them
python visualize_results.py --show
Specify a custom output directory
python visualize_results.py --output-dir my-visualizations
GitHub API Authentication
The script uses the GitHub API to check for your contributions. You’ll need a Personal Access Token with repository read permissions.
To create a token:
- Go to GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens
- Generate a new token with repository read access permissions
- Provide this token when prompted by the script
For security, the token is stored in your system’s secure keyring and not in plain text.
Configuration
The script automatically creates a configuration file at ~/.swe-bench-analyzer.ini to store:
- Your GitHub username
- Dataset cache locations
- GitHub API cache settings
- Previous output file paths
- Most recent analysis timestamp
- Cache expiry configuration
This ensures that subsequent runs require minimal input from you.
The GitHub API cache is stored in ~/.swe-bench-cache/github by default, but can be changed via the configuration file or command-line arguments.
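For readers who want to inspect the configuration programmatically, here is a minimal sketch using Python's standard `configparser`. The section name `analyzer` and the key names are assumptions made for illustration; check your own ~/.swe-bench-analyzer.ini for the names the script actually writes. The defaults mirror the values stated in this README.

```python
import configparser
import os

CONFIG_PATH = os.path.expanduser("~/.swe-bench-analyzer.ini")
DEFAULT_GITHUB_CACHE = os.path.expanduser("~/.swe-bench-cache/github")


def load_settings(path=CONFIG_PATH):
    """Read a few settings, falling back to defaults when absent.

    The "analyzer" section and the key names are hypothetical.
    """
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    section = parser["analyzer"] if parser.has_section("analyzer") else {}
    return {
        "username": section.get("username", ""),
        "github_cache_dir": section.get("github_cache_dir", DEFAULT_GITHUB_CACHE),
        # 7 days is the default expiry stated in the caching section.
        "cache_expiry_days": int(section.get("cache_expiry_days", 7)),
    }
```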
Output Format
Results are saved as a JSON file with the following structure:
{
"metadata": {
"username": "your-username",
"datasets": ["swe-bench", "swe-bench-verified"],
"analysis_mode": "github_api",
"date_analyzed": "2025-02-26 14:30:22",
"count": 5,
"cache_stats": {
"enabled": true,
"hits": 12,
"misses": 3,
"hit_rate": "80.0%"
}
},
"results": [
{
"instance_id": "sympy__sympy-22914",
"repo": "sympy/sympy",
"contribution_types": ["author", "mentioned_in_problem"],
"title": "Fix precision handling in Number class",
"url": "https://github.com/sympy/sympy/issues/22914",
"created_at": "2022-01-15T12:34:56Z",
"dataset_name": "swe-bench",
"dataset_info": {
"problem_statement": "...",
"hints_text": "..."
},
"github_info": {
"issue_found": true,
"comment_count": 5,
"from_cache": true
}
},
...
]
}
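Because the output is plain JSON, it is easy to post-process outside the visualizer. The sketch below tallies results by repository and by contribution type; the `summarize` helper and the sample records are illustrative only, using the field names from the structure above.

```python
import json
from collections import Counter


def summarize(results_doc):
    """Count results per repository and per contribution type."""
    results = results_doc["results"]
    by_repo = Counter(r["repo"] for r in results)
    by_type = Counter(t for r in results for t in r["contribution_types"])
    return by_repo, by_type


# Illustrative data. In practice, load the analyzer's output instead:
#   with open("user_contributions.json", encoding="utf-8") as fh:
#       doc = json.load(fh)
doc = json.loads("""
{
  "metadata": {"username": "your-username", "count": 2},
  "results": [
    {"instance_id": "sympy__sympy-22914", "repo": "sympy/sympy",
     "contribution_types": ["author", "mentioned_in_problem"]},
    {"instance_id": "sympy__sympy-11400", "repo": "sympy/sympy",
     "contribution_types": ["mentioned_in_problem"]}
  ]
}
""")

by_repo, by_type = summarize(doc)
print(dict(by_repo))  # {'sympy/sympy': 2}
print(dict(by_type))  # {'author': 1, 'mentioned_in_problem': 2}
```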
Troubleshooting
GitHub API Caching
The script now caches GitHub API responses locally to:
- Speed up subsequent runs
- Reduce GitHub API rate limit usage
- Allow offline analysis of previously fetched data
By default, cached responses expire after 7 days. You can:
- Change the expiry time with --cache-expiry
- Clear the cache with --clear-cache
- Disable caching with --no-cache
- Specify a custom cache location with --github-cache-dir
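To make the expiry behaviour concrete, here is one way a file-based response cache with a day-based expiry could work. This is a sketch under assumptions, not the script's actual cache layout: the file naming (a SHA-256 of the URL), the `fetched_at` field, and the function names are all hypothetical.

```python
import hashlib
import json
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def cache_path(cache_dir, url):
    """Map a URL to a file name inside cache_dir (hypothetical layout)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".json")


def write_cache(cache_dir, url, payload, now=None):
    """Store payload together with the time it was fetched."""
    os.makedirs(cache_dir, exist_ok=True)
    entry = {"fetched_at": time.time() if now is None else now, "payload": payload}
    with open(cache_path(cache_dir, url), "w", encoding="utf-8") as fh:
        json.dump(entry, fh)


def read_cache(cache_dir, url, expiry_days=7, now=None):
    """Return the cached payload, or None if it is missing or expired."""
    try:
        with open(cache_path(cache_dir, url), encoding="utf-8") as fh:
            entry = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return None  # treat unreadable or corrupt entries as a cache miss
    now = time.time() if now is None else now
    if now - entry["fetched_at"] > expiry_days * SECONDS_PER_DAY:
        return None
    return entry["payload"]


with tempfile.TemporaryDirectory() as tmp:
    url = "https://api.github.com/repos/sympy/sympy/issues/11400"
    write_cache(tmp, url, {"comment_count": 5})
    print(read_cache(tmp, url))  # {'comment_count': 5}
```

Storing the fetch time inside each entry, rather than relying on file modification times, keeps the expiry check independent of how the cache directory is copied or restored.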
Rate Limiting
If you encounter GitHub API rate limits, the script will:
- First use cached responses if available
- Automatically pause and wait for the rate limit to reset if needed
- Report cache hit/miss statistics to help you optimize
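The pause-and-wait step can be derived from GitHub's rate limit response headers, `x-ratelimit-remaining` and `x-ratelimit-reset` (the latter is a UTC epoch timestamp in seconds). The sketch below shows the calculation only; the function name is hypothetical and the script's own handling may differ.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before retrying, or 0 if requests remain.

    `headers` is a mapping with lowercase keys. A plain dict is
    case-sensitive, so normalize the keys first if your HTTP client
    does not do so for you.
    """
    remaining = int(headers.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("x-ratelimit-reset", 0))
    now = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - now)


# Illustrative header values, not taken from a real response.
headers = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1700000060"}
print(seconds_until_reset(headers, now=1700000000))  # 60
```

A caller would typically pass the result to `time.sleep()` and then retry the request.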
Missing Dependencies
If you’re missing required packages, the script will show an error message with installation instructions.
Dataset Download Issues
If the script can’t download the datasets:
- Check your internet connection
- Try specifying a different cache directory with --cache-dir
- Download the datasets manually and provide the file path with --dataset
License
This script is provided as-is under the MIT License.
Acknowledgements
This entire project, including both the analyzer and visualization code, was generated with Claude Code with some initial help from Grok DeepSearch to explore the SWE-bench dataset structure. The visualizations, data processing logic, and GitHub API integration were all created through conversations with Claude.