GitHub
GitHubAPISettings
Bases: BaseSettings
repo_url
repo_url: str = Field(..., title='Github Repository URL', description='This should be a full url to the repository, e.g. `https://github.com/Mai0313/SWEBenchV2` or `Mai0313/SWEBenchV2`', frozen=False, deprecated=False)
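Since `repo_url` accepts either a full URL or an `owner/repo` shorthand, both forms have to be reduced to the same pair before an API path can be built. The following is a minimal sketch of one way to do that; the helper name `parse_repo` is hypothetical and is not taken from the library.

```python
from urllib.parse import urlparse


def parse_repo(repo_url: str) -> tuple[str, str]:
    """Split a full GitHub URL or an `owner/repo` shorthand into (owner, repo).

    Hypothetical helper for illustration; not part of the documented API.
    """
    text = repo_url.strip()
    # Full URLs carry a scheme; the shorthand form is already just a path.
    path = urlparse(text).path if "://" in text else text
    parts = [p for p in path.strip("/").split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected 'owner/repo' or a full URL, got {repo_url!r}")
    owner, repo = parts[0], parts[1]
    # Tolerate clone-style URLs that end in ".git".
    return owner, repo.removesuffix(".git")
```

Both `parse_repo("https://github.com/Mai0313/SWEBenchV2")` and `parse_repo("Mai0313/SWEBenchV2")` yield the same `(owner, repo)` pair.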
GitHubPRExtractorBase
Bases: GitHubAPISettings
token
token: str | None = Field(default=None, validation_alias='GITHUB_TOKEN', description='GitHub API token for authentication', frozen=False, deprecated=False)
base_url
base_url: str = Field(default='https://api.github.com', validation_alias='GITHUB_API_BASE_URL', description='Base URL for GitHub API', frozen=False, deprecated=False)
repo_url
repo_url: str = Field(..., title='Github Repository URL', description='This should be a full url to the repository, e.g. `https://github.com/Mai0313/SWEBenchV2` or `Mai0313/SWEBenchV2`', frozen=False, deprecated=False)
GitHubPRExtractor
Bases: GitHubPRExtractorBase
Methods:
| Name | Description |
|---|---|
| `get_rate_limit` | Retrieve current GitHub API rate limit information. |
| `get_merged_prs` | Fetch all merged pull requests from the GitHub repository. |
| `get_pr_files` | Retrieve all files modified in a specific pull request. |
| `get_file_content` | Retrieve the content of a file at a specific commit SHA. |
| `extract_pr_data` | Extract complete training data for a single pull request. |
| `extract_all_pr_data` | Extract training data from all merged pull requests in the repository. |
repo_url
repo_url: str = Field(..., title='Github Repository URL', description='This should be a full url to the repository, e.g. `https://github.com/Mai0313/SWEBenchV2` or `Mai0313/SWEBenchV2`', frozen=False, deprecated=False)
max_page
max_page: int | None = Field(default=None, title='Max Page', description='Maximum number of pages to fetch for PRs', frozen=False, deprecated=False)
per_page
per_page: int | None = Field(default=None, title='Per Page', description='Number of PRs to fetch per page', frozen=False, deprecated=False)
token
token: str | None = Field(default=None, validation_alias='GITHUB_TOKEN', description='GitHub API token for authentication', frozen=False, deprecated=False)
base_url
base_url: str = Field(default='https://api.github.com', validation_alias='GITHUB_API_BASE_URL', description='Base URL for GitHub API', frozen=False, deprecated=False)
get_rate_limit
Retrieve current GitHub API rate limit information.
Makes a synchronous request to GitHub's rate limit endpoint to check current usage and remaining quota for API calls.
Returns:
| Name | Type | Description |
|---|---|---|
| `RateLimit` | `RateLimit` | Current rate limit status including remaining calls and reset time. |
Source code in src/swebenchv2/datamodule/github.py
get_merged_prs
Fetch all merged pull requests from the GitHub repository.
Synchronously retrieves all merged pull requests by paginating through the GitHub API, handling rate limits and filtering for merged PRs only.
Returns:
| Type | Description |
|---|---|
| `list[PullRequest]` | List of all merged pull request objects from the repository. |
get_pr_files
Retrieve all files modified in a specific pull request.
Fetches the list of files that were changed in the given pull request, including metadata about additions, deletions, and file status.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | The pull request number to fetch files for. | required |
Returns:
| Type | Description |
|---|---|
| `list[FileData]` | List of file data objects representing all modified files. |
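As an illustration of the per-file metadata mentioned above, the sketch below aggregates additions, deletions, and status counts from raw file entries shaped like GitHub's "list pull request files" response. The helper `summarize_files` is hypothetical and works on plain dicts rather than the library's `FileData` objects.

```python
from collections import Counter


def summarize_files(files: list[dict]) -> dict:
    """Aggregate per-file change metadata into PR-level totals.

    Hypothetical helper; expects dicts with `status`, `additions`, `deletions`.
    """
    return {
        "files": len(files),
        "additions": sum(f.get("additions", 0) for f in files),
        "deletions": sum(f.get("deletions", 0) for f in files),
        # Typical statuses include "added", "modified", "removed", and "renamed".
        "by_status": dict(Counter(f.get("status", "unknown") for f in files)),
    }
```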
get_file_content
Retrieve the content of a file at a specific commit SHA.
Fetches the raw content of a file from the repository at the specified commit, handling base64 decoding when necessary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `str` | Path to the file within the repository. | required |
| | `str` | Git commit SHA to retrieve the file content from. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The decoded file content as a string, or an empty string if not found. |
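The base64 handling mentioned above can be illustrated with a short sketch. GitHub's contents endpoint returns a JSON object whose `content` field is base64 text (usually wrapped with newlines) when `encoding` is `"base64"`. The function `decode_content` is a hypothetical helper, and falling back to an empty string for a missing payload is an assumption modeled on the documented return value.

```python
import base64


def decode_content(payload: dict) -> str:
    """Decode the `content` field of a contents-API style payload.

    Hypothetical helper; returns "" when no content is present.
    """
    raw = payload.get("content") or ""
    if payload.get("encoding") == "base64":
        # b64decode ignores the embedded newlines by default.
        return base64.b64decode(raw).decode("utf-8", errors="replace")
    # Any other encoding is passed through unchanged.
    return raw
```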
extract_pr_data
Extract complete training data for a single pull request.
Processes a pull request to gather all modified files and their content before and after changes, creating a structured training data object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `PullRequest` | Pull request object containing metadata and references. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `TrainingData` | `TrainingData` | Complete training data including PR info, formatted question, and file changes. |
extract_all_pr_data
Extract training data from all merged pull requests in the repository.
Orchestrates the complete extraction process by fetching all merged PRs and processing each one to create comprehensive training datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `bool` | Whether to save the extraction results to a JSON file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `ExtractionResult` | `ExtractionResult` | Complete extraction results with all PR training data and metadata. |
AsyncGitHubPRExtractor
Bases: GitHubPRExtractorBase
Methods:
| Name | Description |
|---|---|
| `get_rate_limit` | Retrieve current GitHub API rate limit information asynchronously. |
| `get_merged_prs` | Fetch all merged pull requests from the GitHub repository asynchronously. |
| `get_pr_files` | Retrieve all files modified in a specific pull request asynchronously. |
| `get_file_content` | Retrieve the content of a file at a specific commit SHA asynchronously. |
| `extract_pr_data` | Extract complete training data for a single pull request asynchronously. |
| `extract_all_pr_data` | Extract training data from all merged pull requests in the repository asynchronously. |
repo_url
repo_url: str = Field(..., title='Github Repository URL', description='This should be a full url to the repository, e.g. `https://github.com/Mai0313/SWEBenchV2` or `Mai0313/SWEBenchV2`', frozen=False, deprecated=False)
max_page
max_page: int | None = Field(default=None, title='Max Page', description='Maximum number of pages to fetch for PRs', frozen=False, deprecated=False)
per_page
per_page: int | None = Field(default=None, title='Per Page', description='Number of PRs to fetch per page', frozen=False, deprecated=False)
token
token: str | None = Field(default=None, validation_alias='GITHUB_TOKEN', description='GitHub API token for authentication', frozen=False, deprecated=False)
base_url
base_url: str = Field(default='https://api.github.com', validation_alias='GITHUB_API_BASE_URL', description='Base URL for GitHub API', frozen=False, deprecated=False)
get_rate_limit
Retrieve current GitHub API rate limit information asynchronously.
Makes an asynchronous request to GitHub's rate limit endpoint to check current usage and remaining quota for API calls.
Returns:
| Name | Type | Description |
|---|---|---|
| `RateLimit` | `RateLimit` | Current rate limit status including remaining calls and reset time. |
get_merged_prs
Fetch all merged pull requests from the GitHub repository asynchronously.
Asynchronously retrieves all merged pull requests by paginating through the GitHub API, handling rate limits and filtering for merged PRs only.
Returns:
| Type | Description |
|---|---|
| `list[PullRequest]` | List of all merged pull request objects from the repository. |
get_pr_files
Retrieve all files modified in a specific pull request asynchronously.
Asynchronously fetches the list of files that were changed in the given pull request, including metadata about additions, deletions, and file status.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `int` | The pull request number to fetch files for. | required |
Returns:
| Type | Description |
|---|---|
| `list[FileData]` | List of file data objects representing all modified files. |
get_file_content
Retrieve the content of a file at a specific commit SHA asynchronously.
Asynchronously fetches the raw content of a file from the repository at the specified commit, handling base64 decoding when necessary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `str` | Path to the file within the repository. | required |
| | `str` | Git commit SHA to retrieve the file content from. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `str` | `str` | The decoded file content as a string, or an empty string if not found. |
extract_pr_data
Extract complete training data for a single pull request asynchronously.
Asynchronously processes a pull request to gather all modified files and their content before and after changes, issuing the underlying requests concurrently rather than one at a time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `PullRequest` | Pull request object containing metadata and references. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `TrainingData` | `TrainingData` | Complete training data including PR info, formatted question, and file changes. |
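To make the "before and after" gathering concrete, the sketch below fetches both versions of every file concurrently with `asyncio.gather`. It is only an illustration under stated assumptions: `fetch_before_after` is a hypothetical helper, `get_content` stands in for a coroutine such as `get_file_content`, and the use of a base and head SHA to mean "before" and "after" is assumed rather than confirmed by this page.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def fetch_before_after(
    get_content: Callable[[str, str], Awaitable[str]],
    paths: list[str],
    base_sha: str,
    head_sha: str,
) -> dict[str, tuple[str, str]]:
    """Map each path to its (before, after) content, fetched concurrently.

    Hypothetical helper; `get_content(path, sha)` is an assumed coroutine.
    """

    async def one(path: str) -> tuple[str, str]:
        # Both versions of a file are requested at the same time.
        before, after = await asyncio.gather(
            get_content(path, base_sha),
            get_content(path, head_sha),
        )
        return before, after

    # All files are processed concurrently; gather preserves input order.
    pairs = await asyncio.gather(*(one(p) for p in paths))
    return dict(zip(paths, pairs))
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the `zip` with `paths` pairs each file with its own contents.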
extract_all_pr_data
Extract training data from all merged pull requests in the repository asynchronously.
Orchestrates the complete extraction process by fetching all merged PRs and processing them concurrently to build the training dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | `bool` | Whether to save the extraction results to a JSON file. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `ExtractionResult` | `ExtractionResult` | Complete extraction results with all PR training data and metadata. |
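When many PRs are processed concurrently, it is often useful to cap the number of requests in flight so the API quota is not exhausted at once. The sketch below shows one common pattern using `asyncio.Semaphore`. Whether the library applies such a cap is not stated on this page; `process_all`, `worker`, and the `limit` default are hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_all(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 5,
) -> list[R]:
    """Run `worker` over all items with at most `limit` running at once.

    Hypothetical helper; the concurrency cap is an assumption for illustration.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # Results come back in the same order as `items`.
    return await asyncio.gather(*(guarded(item) for item in items))
```

In this pattern `worker` would be a coroutine that handles a single pull request, such as `extract_pr_data`.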