
feat: vcspull import #510

Merged
tony merged 109 commits into master from scraper
Feb 15, 2026

@tony tony commented Feb 1, 2026

Summary

Adds a new vcspull import command to search and import repositories from remote services into vcspull configuration.

Closes #416

Features

  • Supported services: GitHub, GitLab, Codeberg, Gitea, Forgejo, AWS CodeCommit
  • Service aliases: gh, gl, cb, cc, aws for convenience
  • Import modes: user, org, search
  • Filtering: --language, --topics, --min-stars, --archived, --forks
  • Output: Human-readable (default), --json, --ndjson
  • Safety: --dry-run preview, --yes to skip confirmation

Usage Examples

Import a user's repositories:

$ vcspull import github torvalds -w ~/repos/linux --mode user

Import an organization's repositories:

$ vcspull import github django -w ~/study/python --mode org

Search and import repositories:

$ vcspull import github "machine learning" -w ~/ml-repos --mode search --min-stars 1000

Use with self-hosted GitLab:

$ vcspull import gitlab myuser -w ~/work --url https://gitlab.company.com

Preview without writing (dry run):

$ vcspull import codeberg user -w ~/oss --dry-run

Import from AWS CodeCommit:

$ vcspull import codecommit -w ~/work/aws --region us-east-1

Architecture

  • src/vcspull/_internal/remotes/: New package with service importers
    • base.py: RemoteRepo dataclass, ImportOptions, HTTPClient, error hierarchy
    • github.py, gitlab.py, gitea.py, codecommit.py: Service-specific implementations
  • src/vcspull/cli/import_repos.py: CLI command handler
  • No new dependencies: Uses stdlib urllib for HTTP, subprocess for AWS CLI

Test Plan

Automated Tests

  • 53 tests for remotes package (all importers, filtering, error handling)
  • 42 tests for CLI (argument parsing, output modes, edge cases)
  • All 839 project tests passing
  • Linting (ruff) passing
  • Type checking (mypy) passing

Authentication Requirements

| Service | Mode | Auth Required |
|---|---|---|
| GitHub | user, org, search | No (for public repos) |
| GitLab | user, org, search | Yes (GITLAB_TOKEN) |
| Codeberg/Gitea | org | No |
| Codeberg/Gitea | user, search | Yes (CODEBERG_TOKEN/GITEA_TOKEN) |
| CodeCommit | - | Yes (AWS CLI configured) |

Setup for Testing

Option A: Test via uvx (no clone required)

Note: --with typing_extensions is needed because package dependencies aren't fully resolved in isolated uvx environments.

Option B: Test from cloned branch

git clone --branch scraper https://github.com/vcs-python/vcspull.git vcspull-test
cd vcspull-test
uv sync
uv run pytest  # Run automated tests (839 should pass)

Manual Test Commands

Show help:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import --help
uv run vcspull import --help

Show help (no args is equivalent to --help):

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import
uv run vcspull import

GitHub - user repos:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github torvalds -w ~/test --mode user --dry-run --limit 10
uv run vcspull import github torvalds -w ~/test --mode user --dry-run --limit 10

GitHub - org repos:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github django -w ~/test --mode org --dry-run --limit 10
uv run vcspull import github django -w ~/test --mode org --dry-run --limit 10

GitHub - search with min-stars filter:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github "machine learning" -w ~/test --mode search --dry-run --limit 5 --min-stars 1000
uv run vcspull import github "machine learning" -w ~/test --mode search --dry-run --limit 5 --min-stars 1000

Codeberg - org repos:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import codeberg forgejo -w ~/test --mode org --dry-run --limit 10
uv run vcspull import codeberg forgejo -w ~/test --mode org --dry-run --limit 10

GitLab - org/group (requires token):

export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import gitlab gitlab-org -w ~/test --mode org --dry-run --limit 10
export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
uv run vcspull import gitlab gitlab-org -w ~/test --mode org --dry-run --limit 10

GitLab - subgroup with slash notation (requires token):

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import gitlab gitlab-org/ci-cd -w ~/test --mode org --dry-run --limit 10
uv run vcspull import gitlab gitlab-org/ci-cd -w ~/test --mode org --dry-run --limit 10

JSON output:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github torvalds -w ~/test --dry-run --limit 3 --json
uv run vcspull import github torvalds -w ~/test --dry-run --limit 3 --json

NDJSON output:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github torvalds -w ~/test --dry-run --limit 3 --ndjson
uv run vcspull import github torvalds -w ~/test --dry-run --limit 3 --ndjson

Language filter:

uvx --with typing_extensions --from "git+https://github.com/vcs-python/vcspull@scraper" vcspull import github tony -w ~/test --dry-run --limit 5 --language Python
uv run vcspull import github tony -w ~/test --dry-run --limit 5 --language Python

codecov bot commented Feb 1, 2026

Codecov Report

❌ Patch coverage is 84.15385% with 103 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.10%. Comparing base (e4d1e88) to head (90bbd7d).
⚠️ Report is 110 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/vcspull/_internal/remotes/base.py | 86.00% | 16 Missing and 5 partials ⚠️ |
| src/vcspull/_internal/remotes/gitea.py | 78.82% | 6 Missing and 12 partials ⚠️ |
| src/vcspull/_internal/remotes/gitlab.py | 84.09% | 5 Missing and 9 partials ⚠️ |
| src/vcspull/_internal/remotes/github.py | 87.87% | 4 Missing and 8 partials ⚠️ |
| src/vcspull/cli/import_cmd/codecommit.py | 58.33% | 10 Missing ⚠️ |
| src/vcspull/cli/import_cmd/codeberg.py | 58.33% | 5 Missing ⚠️ |
| src/vcspull/cli/import_cmd/forgejo.py | 64.28% | 5 Missing ⚠️ |
| src/vcspull/cli/import_cmd/gitea.py | 64.28% | 5 Missing ⚠️ |
| src/vcspull/cli/import_cmd/github.py | 64.28% | 5 Missing ⚠️ |
| src/vcspull/cli/import_cmd/gitlab.py | 66.66% | 5 Missing ⚠️ |

... and 1 more
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
+ Coverage   81.54%   82.10%   +0.55%
==========================================
  Files          16       27      +11
  Lines        2254     2900     +646
  Branches      473      581     +108
==========================================
+ Hits         1838     2381     +543
- Misses        266      333      +67
- Partials      150      186      +36

tony added a commit that referenced this pull request Feb 1, 2026
why: Document the new import feature for the changelog.
what:
- Add New features section for v1.51.x unreleased
- Document vcspull import command with usage examples
- List supported services, aliases, and filtering options
@tony tony requested a review from Copilot February 1, 2026 16:19
Copilot AI left a comment

Pull request overview

Adds a new vcspull import CLI command and supporting “remote service importer” implementations to discover repositories from hosted services and write them into a vcspull config.

Changes:

  • Introduces vcspull import command with service selection, filtering, output modes (human/json/ndjson), confirmation, and dry-run.
  • Adds a new internal remotes package implementing GitHub/GitLab/Gitea(Codeberg/Forgejo)/CodeCommit importers plus shared HTTP/filtering primitives.
  • Adds comprehensive unit tests for the CLI command and each importer, plus changelog and logger name coverage updates.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| tests/test_log.py | Adds the new CLI module logger name to the logger discovery test. |
| tests/cli/test_import_repos.py | Adds CLI-level tests for importer selection, config resolution, output modes, dry-run/confirmation, and error handling. |
| tests/_internal/remotes/test_gitlab.py | Adds GitLab importer tests including auth-required search behavior. |
| tests/_internal/remotes/test_github.py | Adds GitHub importer tests including filtering and limit handling. |
| tests/_internal/remotes/test_gitea.py | Adds Gitea/Codeberg importer tests including search response variants. |
| tests/_internal/remotes/test_base.py | Adds tests for shared base models/utilities (RemoteRepo, ImportOptions, filter_repo). |
| tests/_internal/remotes/conftest.py | Adds shared HTTP mocking helpers and sample API payload fixtures for remotes tests. |
| tests/_internal/remotes/__init__.py | Marks the remotes tests package. |
| src/vcspull/cli/import_repos.py | Implements the vcspull import command handler and argument parsing. |
| src/vcspull/cli/__init__.py | Registers the new import subcommand and help text/examples. |
| src/vcspull/_internal/remotes/gitlab.py | Implements GitLab repository discovery (user/org/search) via GitLab REST API. |
| src/vcspull/_internal/remotes/github.py | Implements GitHub repository discovery (user/org/search) via GitHub REST API. |
| src/vcspull/_internal/remotes/gitea.py | Implements Gitea/Forgejo/Codeberg discovery via Gitea-compatible REST API. |
| src/vcspull/_internal/remotes/codecommit.py | Implements CodeCommit discovery via AWS CLI subprocess calls. |
| src/vcspull/_internal/remotes/base.py | Adds shared dataclasses, filtering logic, error hierarchy, and a small urllib-based HTTP client. |
| src/vcspull/_internal/remotes/__init__.py | Exposes the remotes package public API (__all__). |
| CHANGES | Documents the new vcspull import feature and usage examples. |



tony commented Feb 1, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

The new code follows existing patterns in the codebase for docstrings, imports, and test structure.

🤖 Generated with Claude Code


tony commented Feb 1, 2026

Code review

Found 1 issue:

  1. Missing filter_repo() call in CodeCommit importer - All other importers (GitHub, GitLab, Gitea) call filter_repo(repo, options) before yielding repositories, but CodeCommit yields directly without filtering. This means --language, --topics, --min-stars, --archived, and --forks options are silently ignored for CodeCommit imports.

repo = self._parse_repo(repo_metadata)
yield repo
count += 1

Compare to GitHub importer which correctly filters:

repo = self._parse_repo(item)
if filter_repo(repo, options):
    yield repo
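
The difference between the two loops can be sketched in isolation. This is a minimal illustration only: the `RemoteRepo` and `ImportOptions` fields below are assumed for the example and are not the actual definitions in base.py, and `fetch_filtered` is a hypothetical helper name.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from dataclasses import dataclass


@dataclass
class RemoteRepo:
    """Simplified stand-in for RemoteRepo (field names are assumptions)."""

    name: str
    language: str | None = None
    stars: int = 0
    is_fork: bool = False
    is_archived: bool = False


@dataclass
class ImportOptions:
    """Simplified stand-in for ImportOptions (field names are assumptions)."""

    language: str | None = None
    min_stars: int = 0
    include_forks: bool = False
    include_archived: bool = False
    limit: int = 100


def filter_repo(repo: RemoteRepo, options: ImportOptions) -> bool:
    """Return True when the repo passes every client-side filter."""
    if options.language and (repo.language or "").lower() != options.language.lower():
        return False
    if repo.stars < options.min_stars:
        return False
    if repo.is_fork and not options.include_forks:
        return False
    if repo.is_archived and not options.include_archived:
        return False
    return True


def fetch_filtered(
    repos: Iterable[RemoteRepo], options: ImportOptions
) -> Iterator[RemoteRepo]:
    """Yield at most options.limit repos, counting only those that pass the filter."""
    count = 0
    for repo in repos:
        if count >= options.limit:
            break
        if filter_repo(repo, options):
            yield repo
            count += 1
```

Counting only after the filter passes keeps `--limit` meaning "up to N matching repositories" rather than "the first N repositories before filtering".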

🤖 Generated with Claude Code


@aschleifer

The import itself seems to be working, but it imports the repositories with git+https://gitlab.com/...git URLs. Is it possible to get an option to use SSH URLs? Otherwise I would have to manually edit the config after running the import command to get this fully usable.

@aschleifer

The SSH-based URLs work now, but I noticed that it flattens the structure.

So basically I run vcspull import gitlab a/b -w ~/tmp --mode org -f vcspull.yaml, and under group a/b I have a tree-like structure with multiple groups, each with a different number of repos under it. In the resulting vcspull.yaml, however, I only get one level with the repos and not the groups in between. This won't work for me, as I need the exact group/repository structure as it is on GitLab.


tony commented Feb 9, 2026

@aschleifer Yep - right now the GitLab importer flattens everything under a single workspace key (e.g. ~/tmp:), so subgroup paths get lost.

To implement “preserve GitLab structure” correctly, can you paste a small PII-redacted example of:

  1. The GitLab namespace tree under a/b (groups/subgroups + a few repo names)
  2. The vcspull.yaml you’d expect.

For the YAML, should this be represented as multiple workspace roots that mirror namespaces, e.g.

~/tmp/a/b:
  repo1: ...
~/tmp/a/b/subgroup:
  repo2: ...

or keep a single ~/tmp: and encode the namespace into the repo key (e.g. a/b/subgroup/repo)?

Also confirm the desired on-disk layout relative to -w (mirror GitLab path exactly vs a different local structure?). Perhaps for these nested structures, we should do nested by default unless otherwise stated?


aschleifer commented Feb 9, 2026

Example structure:

a (group)
- b (group)
- - c (group)
- - - d (repository)
- - - e (repository)
- - f (group)
- - - g (repository)
- - h (repository)
- z (group)

Command: vcspull import gitlab a/b -w ~/tmp --mode org -f vcspull.yaml

Expected content of vcspull.yaml:

~/tmp:
  h: ...
~/tmp/c:
  d: ...
  e: ...
~/tmp/f:
  g: ...

So basically the full tree structure under the targeted group.
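
One way to read the requested layout is as a mapping from each repository's full GitLab path to a workspace root under the `-w` directory. The following is a rough sketch under that assumption; `group_by_namespace` is a hypothetical helper, not a function in vcspull, and it only tracks repository names rather than full config entries.

```python
from __future__ import annotations

import posixpath


def group_by_namespace(
    repo_paths: list[str], target_group: str, workspace: str
) -> dict[str, list[str]]:
    """Map full GitLab paths (e.g. "a/b/c/d") to workspace roots.

    The subgroup path relative to ``target_group`` is appended to
    ``workspace``; repos directly in the target group land in the
    workspace root itself.
    """
    prefix = target_group.strip("/") + "/"
    result: dict[str, list[str]] = {}
    for full_path in repo_paths:
        if not full_path.startswith(prefix):
            continue  # outside the targeted group; ignore
        relative = full_path[len(prefix):]  # e.g. "c/d"
        subgroup, name = posixpath.split(relative)  # ("c", "d") or ("", "h")
        root = posixpath.join(workspace, subgroup) if subgroup else workspace
        result.setdefault(root, []).append(name)
    return result


layout = group_by_namespace(
    ["a/b/c/d", "a/b/c/e", "a/b/f/g", "a/b/h"], "a/b", "~/tmp"
)
print(layout)
```

For the example tree above, this groups `d` and `e` under `~/tmp/c`, `g` under `~/tmp/f`, and `h` under `~/tmp`, matching the expected vcspull.yaml.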

@tony tony force-pushed the scraper branch 3 times, most recently from a8915f5 to 26d91ae February 14, 2026 16:29
tony added 3 commits February 14, 2026 11:03
why: Enable importing repositories from GitHub, GitLab, Codeberg/Gitea/Forgejo, and AWS CodeCommit into vcspull configuration.
what:
- Add base.py with RemoteRepo dataclass, ImportOptions, ImportMode enum
- Add HTTPClient for stdlib-only HTTP requests (urllib)
- Add error hierarchy: AuthenticationError, RateLimitError, NotFoundError, etc.
- Add GitHubImporter with user/org/search modes
- Add GitLabImporter with group/search support (auth required for search)
- Add GiteaImporter supporting Codeberg, Gitea, Forgejo instances
- Add CodeCommitImporter using AWS CLI subprocess calls
- Add filter_repo() for client-side filtering by language, topics, stars

why: Allow users to import repositories from remote services directly into their vcspull configuration without manual entry.
what:
- Add create_import_subparser() for CLI argument handling
- Add import_repos() main function with full import workflow
- Support services: github, gitlab, codeberg, gitea, forgejo, codecommit
- Add service aliases (gh, gl, cb, cc, aws)
- Add filtering: --language, --topics, --min-stars, --archived, --forks
- Add output modes: human-readable, --json, --ndjson
- Add --dry-run and --yes options for confirmation control
- Require --workspace flag (no default guessing)

why: Make the import command accessible via vcspull CLI.
what:
- Import create_import_subparser, import_repos from import_repos module
- Add IMPORT_DESCRIPTION with usage examples
- Add import subparser to CLI
- Add handler for import subparser in cli() function
tony added 13 commits February 14, 2026 20:10
why: GitHub Enterprise requires /api/v3 path prefix but the importer used the base URL as-is, unlike Gitea which correctly appends /api/v1.
what:
- Auto-append /api/v3 when base_url is provided and lacks /api/ path
- Skip normalization for default api.github.com and pre-suffixed URLs
- Add tests for GHE normalization, idempotency, and public URL
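
The normalization rule in this commit can be sketched roughly as follows. The function name `normalize_github_api_url` and the exact path check are assumptions made for illustration; the importer's actual code may differ.

```python
from __future__ import annotations

from urllib.parse import urlsplit

DEFAULT_API_URL = "https://api.github.com"


def normalize_github_api_url(base_url: str | None) -> str:
    """Return an API base URL, appending /api/v3 for GitHub Enterprise hosts."""
    if not base_url:
        return DEFAULT_API_URL
    url = base_url.rstrip("/")
    parts = urlsplit(url)
    # Leave the public API host and already-suffixed URLs untouched.
    if parts.netloc == "api.github.com" or "/api/" in parts.path + "/":
        return url
    return url + "/api/v3"
```

Because an already-suffixed URL is returned unchanged, applying the function twice gives the same result as applying it once, which is the idempotency property the commit's tests mention.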
…pace
why: When a workspace section in the config is not a dict, the import loop logged an error but returned exit 0 with a misleading success message ("All repositories already exist").
what:
- Track workspace sections that fail validation in error_labels set
- Return exit 1 before the "all exist" message when errors occurred
- Add test asserting non-mapping workspace returns exit code 1

…bort
why: When stdin is not a TTY and --yes is not provided, _run_import returned 0 (success) even though no import occurred. CI/automation scripts chaining on exit codes would incorrectly proceed.
what:
- Change return 0 to return 1 at the non-interactive abort path
- Add return value assertion to test_import_repos_non_tty_aborts

…sponses
why: dict.get("key", {}) returns None when the key exists with JSON null value, causing AttributeError on subsequent .get() calls. APIs may return null for deleted accounts, system repos, or self-hosted edge cases.
what:
- Change data.get("namespace", {}) to data.get("namespace") or {} in gitlab.py
- Change data.get("owner", {}) to (data.get("owner") or {}) in github.py
- Change data.get("owner", {}) to data.get("owner") or {} in gitea.py
- Add test_github_parse_repo_null_owner
- Add test_gitlab_parse_repo_null_namespace
- Add test_gitea_parse_repo_null_owner
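
The pitfall is easy to reproduce with the standard library alone. The payload below is made up for illustration and is not an actual API response.

```python
import json

# A payload where the key is present but its value is JSON null.
payload = json.loads('{"name": "repo", "owner": null}')

# The default only applies when the key is missing, so this is None,
# and a chained .get() on it would raise AttributeError.
assert payload.get("owner", {}) is None

# `or {}` also covers an explicit null, so the chained lookup is safe.
owner = payload.get("owner") or {}
login = owner.get("login", "")
```

The same `or {}` idiom also replaces other falsy values (such as an empty list) with an empty dict, which is harmless here because only mapping lookups follow.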
…g filter
why: Help said "prefix filter" but the implementation uses substring matching (the `in` operator), which matches anywhere in the name.
what:
- Change help text from "prefix" to "substring" at codecommit.py

…ging
why: Naive f"{url}?{urlencode(params)}" would produce a malformed URL with double question marks if the endpoint already contained query parameters.
what:
- Replace string concatenation with urllib.parse.urlsplit/urlunsplit to properly merge existing and new query parameters
- Add test_http_client_get_merges_query_params to verify correct behavior
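
A minimal sketch of merging query parameters with `urllib.parse`, in the spirit of this commit. The helper name `merge_query_params` is hypothetical and the actual HTTPClient code may be structured differently.

```python
from __future__ import annotations

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_query_params(url: str, params: dict[str, str]) -> str:
    """Append params to url, preserving any query string it already has."""
    parts = urlsplit(url)
    # Keep existing pairs (including blank values), then add the new ones.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


print(merge_query_params("https://example.com/api/v1/repos?sort=updated", {"page": "2"}))
```

With naive concatenation the same input would yield `...repos?sort=updated?page=2`; splitting and re-joining keeps a single `?` and joins the pairs with `&`.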
why: Authorization tokens sent via HTTP are visible to network observers. Users who provide http:// URLs with --url should be warned about the security risk.
what:
- Add warning log in HTTPClient.__init__ when token + non-HTTPS base URL
- Add test_http_client_warns_on_non_https_with_token
- Add test_http_client_no_warning_on_https_with_token

why: save_config_json had zero test coverage, and no integration test exercised the JSON config write path through _run_import.
what:
- Add test_save_config_json_write_and_readback
- Add test_save_config_json_atomic_write
- Add test_save_config_json_atomic_preserves_permissions
- Add test_import_repos_json_config_write integration test
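
The test names above suggest an atomic write that preserves file permissions. The commit does not show how save_config_json does this, so the following is only one common way to get that behavior (write to a temporary file in the same directory, then `os.replace`), with `save_json_atomic` as a hypothetical name.

```python
import json
import os
import pathlib
import tempfile


def save_json_atomic(path: pathlib.Path, data: dict) -> None:
    """Write data as JSON so readers never observe a partially written file."""
    # Remember the existing permission bits, if the file is already there.
    mode = path.stat().st_mode & 0o777 if path.exists() else None
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.write("\n")
        if mode is not None:
            os.chmod(tmp_name, mode)
        # Rename over the target; atomic when both paths share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temporary file next to the target keeps both on the same filesystem, which is what allows the final rename to be atomic.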
why: GitHub search API returns HTTP 422 when requesting results beyond offset 1000. Without a guard, the pagination loop would crash after partial progress when --limit exceeds 1000.
what:
- Add SEARCH_MAX_RESULTS = 1000 constant
- Break pagination when page * DEFAULT_PER_PAGE >= SEARCH_MAX_RESULTS
- Add test_github_search_caps_at_1000_results
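
The guard can be sketched as a small pagination loop. The constant names come from the commit; the value of `DEFAULT_PER_PAGE`, the `fetch_page` callback, and the `search_pages` function are assumptions made for this illustration.

```python
SEARCH_MAX_RESULTS = 1000
DEFAULT_PER_PAGE = 100  # assumed page size for this sketch


def search_pages(fetch_page, limit):
    """Collect up to `limit` items without requesting past the 1000-result cap.

    `fetch_page(page, per_page)` is a caller-supplied function returning a
    list of items for that page (empty when there are no more results).
    """
    results = []
    page = 1
    while len(results) < limit:
        items = fetch_page(page, DEFAULT_PER_PAGE)
        if not items:
            break
        results.extend(items[: limit - len(results)])
        # Stop before asking for a page that would exceed the search cap.
        if page * DEFAULT_PER_PAGE >= SEARCH_MAX_RESULTS:
            break
        page += 1
    return results
```

With a page size of 100, the loop stops after page 10 even if the caller asked for more, so a page that would trigger the HTTP 422 is never requested.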
why: subprocess.run without timeout blocks indefinitely if the AWS CLI hangs due to network issues or broken credential providers. HTTP-based importers already have a 30-second timeout via HTTPClient.
what:
- Add timeout=60 to subprocess.run in _run_aws_command
- Catch subprocess.TimeoutExpired and raise ServiceUnavailableError
- Add ServiceUnavailableError to imports
- Add test_codecommit_timeout_raises_service_unavailable
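
The timeout handling described here follows a standard `subprocess` pattern. In the sketch below, `ServiceUnavailableError` is a local stand-in for the class named in the commit and `run_command` is a hypothetical helper, not the actual `_run_aws_command`.

```python
from __future__ import annotations

import subprocess


class ServiceUnavailableError(Exception):
    """Local stand-in for the error class named in the commit."""


def run_command(args: list[str], timeout: float = 60) -> str:
    """Run a command and return its stdout, failing fast if it hangs."""
    try:
        result = subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except subprocess.TimeoutExpired as exc:
        # Translate the low-level timeout into a domain-level error.
        msg = f"command timed out after {timeout}s: {args[0]}"
        raise ServiceUnavailableError(msg) from exc
    return result.stdout
```

Raising a domain-level error lets the CLI report a hung AWS CLI the same way it would report an unreachable HTTP service.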
… files
why: Each file defined log = logging.getLogger(__name__) but never used it. The logging import and log variable are dead code.
what:
- Remove import logging and log variable from github.py, gitlab.py, codeberg.py, forgejo.py, and gitea.py CLI handlers

why: yaml.safe_load was used for all config files regardless of extension. While YAML is a superset of JSON, dispatching on file extension is semantically correct and produces more specific error messages for JSON parse failures.
what:
- Dispatch on config file suffix: json.loads for .json, yaml.safe_load for .yaml/.yml
- Use broad except to catch both json.JSONDecodeError and yaml.YAMLError

why: The lambda-based mock caused a mypy type inference error.
what:
- Replace inline io.BytesIO mock with shared MockHTTPResponse fixture
tony added 6 commits February 15, 2026 06:32
…onfig loading
why: The inline JSON/YAML dispatch duplicated what ConfigReader._from_file() already provides, creating an asymmetry with the save path that already uses ConfigReader._dump() via save_config_yaml/save_config_json.
what:
- Replace 12-line inline JSON/YAML dispatch block with ConfigReader._from_file()
- Remove lazy imports of json and yaml that were only needed for inline dispatch

why: Project style guide requires one command per code block for copyability.
what:
- Split combined auth+import code blocks into separate blocks in 6 files
- Add explanatory text between the blocks (github, gitlab, codeberg, gitea, forgejo, codecommit)

why: Consecutive code blocks without explanatory text leave the reader guessing.
what:
- Add "SSH (default):" label before the first block
- Add "Use --https for HTTPS clone URLs:" before the second block

why: README omitted vcspull import despite it being a major v1.55 feature.
what:
- Add vcspull import to the config-creation sentence at line 71
- Add "Import from remote services" subsection with example commands

…commands
why: Shortform flags are cryptic in user-facing docs; multi-flag one-liners are hard to scan and copy-paste.
what:
- Add "Prefer longform flags" rule to Documentation Standards
- Add "Split multi-flag commands" rule with \-continuation style
- Include Good/Bad examples showing both rules together

why: Shortform flags (-w, -f, -S, -v) are cryptic in user-facing docs; multi-flag one-liners are hard to scan and copy-paste.
what:
- Replace -w with --workspace in all doc code blocks
- Replace -f with --file in all doc code blocks
- Replace -S with --smart-case and -v with --invert-match in search docs
- Split multi-flag commands onto \-continuation lines
- Update prose references to prefer longform names
- Remove redundant "Short form" examples from fmt.md

tony commented Feb 15, 2026

Addressed in 12899cf — all 6 service pages now have separate code blocks with prose between them.


tony commented Feb 15, 2026

Feasible doctests were added in f84a7a5. The remaining methods without doctests (fetch_repos(), HTTPClient.get(), CodeCommitImporter.is_authenticated, create_import_subparser()) are infeasible — they make live HTTP/API/subprocess calls. All are covered by regular pytest tests instead.

tony added 2 commits February 15, 2026 07:50
why: `vcspull import gitlab` (PR #510) fully replaces these community scripts with built-in pagination, dry-run, filtering, and config merging.
what:
- Remove scripts/generate_gitlab.py
- Remove scripts/generate_gitlab.sh

…edirect
why: The generation page referenced the now-removed gitlab scripts. Redirect readers to `vcspull import` which is the supported approach.
what:
- Replace generation.md content with stub pointing to {ref}cli-import
- Update quickstart.md seealso to reference cli-import instead of config-generation
- Remove generation toctree entry from configuration/index.md
@tony tony changed the title from "feat: Add vcspull import command for remote repository discovery" to "feat: vcspull import" Feb 15, 2026
@tony tony merged commit a65ae03 into master Feb 15, 2026
9 checks passed
@tony tony deleted the scraper branch February 15, 2026 15:10