# Cli

## `cli`

**Functions:**

| Name | Description |
|---|---|
| `extract` | Extract training data from GitHub repository pull requests asynchronously. |
| `main` | |
### `extract`

Extract training data from GitHub repository pull requests asynchronously.
This function serves as the main CLI entry point for extracting training data from merged pull requests in a GitHub repository. It creates an AsyncGitHubPRExtractor instance and orchestrates the complete data extraction process, automatically saving results to a JSON file in the ./data/{owner}/{repo}/ directory.
**Parameters:**

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_url` | `str` | GitHub repository URL or `Mai0313/SWEBenchV2` format, e.g., `https://github.com/Mai0313/SWEBenchV2`. | *required* |
| `max_page` | `int` | Maximum number of pages to fetch for PRs. Defaults to 1. | `1` |
| `per_page` | `int` | Number of PRs to fetch per API request page. Defaults to 1. | `1` |
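Both accepted `repo_url` forms can be normalized to an `(owner, repo)` pair. The sketch below is illustrative only (`parse_repo_url` is a hypothetical helper, not part of the documented API); it raises `ValueError` on malformed input, matching the documented behavior.

```python
def parse_repo_url(repo_url: str) -> tuple[str, str]:
    # Strip a leading GitHub URL prefix, if present, then split owner/repo.
    for prefix in ("https://github.com/", "http://github.com/"):
        if repo_url.startswith(prefix):
            repo_url = repo_url[len(prefix):]
            break
    parts = [p for p in repo_url.strip("/").split("/") if p]
    if len(parts) != 2:
        # Mirrors the documented ValueError for invalid repo_url formats.
        raise ValueError(f"Invalid repository reference: {repo_url!r}")
    owner, repo = parts
    return owner, repo
```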
**Returns:**

| Name | Type | Description |
|---|---|---|
| `None` | `None` | Results are automatically saved to a JSON file in the ./data/ directory. |
**Raises:**

| Type | Description |
|---|---|
| `ValueError` | If `repo_url` format is invalid or repository doesn't exist. |
| `HTTPError` | If GitHub API requests fail or authentication is invalid. |
| `RateLimitError` | If GitHub API rate limit is exceeded (handled with automatic waiting). |
**Example:**

Basic usage with minimal extraction:

```python
await extract("Mai0313/SWEBenchV2")
```

Complete repository extraction:

```python
await extract("https://github.com/Mai0313/SWEBenchV2", max_page=None, per_page=100)
```

Custom pagination for large repositories:

```python
await extract("Mai0313/SWEBenchV2", max_page=10, per_page=50)
```
**Note:**

- Requires `GITHUB_TOKEN` environment variable for API authentication
- Output files are saved to `./data/{owner}/{repo}/log_{timestamp}.json`
- Uses asynchronous processing for optimal performance with concurrent API calls
- Automatically handles GitHub API rate limits with intelligent waiting