An end-to-end tool for curating GitHub repositories into structured code datasets.
- Fast parallel processing - Download and extract with configurable workers
- Smart filtering - Only processes programming files using GitHub Linguist
- GPT-2 tokenization - Ready-to-use token counts for ML workflows
- Efficient caching - Uses ETags to avoid re-downloading unchanged repos
Perfect for curating training data, running code analysis, or creating repository archives.
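The ETag caching mentioned above can be pictured with curl's conditional-request flags. This is a hypothetical sketch (assuming curl 7.68+), not codecurator's actual implementation:

```shell
# Hypothetical sketch of ETag-based caching (illustration only).
# --etag-compare sends If-None-Match with the stored ETag, so an
# unchanged resource returns 304 and nothing is re-downloaded;
# --etag-save records the new ETag for next time.
cached_fetch() {
  url="$1"; out="$2"
  [ -f "$out.etag" ] || touch "$out.etag"   # first run: no stored ETag yet
  curl -fsSL --etag-compare "$out.etag" --etag-save "$out.etag" -o "$out" "$url"
}
```

On repeat runs the server answers 304 Not Modified when the archive is unchanged, which is the behavior the caching bullet describes.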
Install with Cargo:

```
cargo install --path .
```

Create an input file with one GitHub repository per line:
```
"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"
```

Download repositories:
```
codecurator download ./configs/repos.jsonl
```

This creates ZIP files in the /zip/ directory. It downloads from the main branch first, falling back to master if needed.
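The main-to-master fallback can be pictured as a small loop over GitHub's codeload archive URLs. A hypothetical sketch (not the tool's actual download code):

```shell
# Hypothetical sketch of the main -> master branch fallback (illustration only).
fetch_zip() {
  repo="$1"                                        # e.g. "rust-lang/rust"
  out="zip/$(printf '%s' "$repo" | tr '/' '_').zip"
  mkdir -p zip
  for branch in main master; do
    # GitHub serves branch archives from codeload.github.com
    url="https://codeload.github.com/$repo/zip/refs/heads/$branch"
    if curl -fsSL -o "$out" "$url"; then
      echo "fetched $repo@$branch"
      return 0
    fi
  done
  echo "failed: $repo" >&2
  return 1
}
```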
Extract and process:
```
codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog
```

This processes all programming files (restricted here to the listed languages), tokenizes their content, and writes structured records to the /jsonl/ directory.
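Assuming the output follows JSONL conventions (one JSON record per line), record counts per output file can be inspected directly from the shell:

```shell
# Count records in each output file; in JSONL, one line = one record.
count_records() {
  dir="$1"
  for f in "$dir"/*.jsonl; do
    [ -f "$f" ] || continue
    printf '%s\t%s\n' "$(wc -l < "$f")" "$f"
  done
}
```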
Deduplication:
```
codecurator dedupe ./configs/repos.jsonl
```

This hashes the contents of every file and removes duplicates, writing the deduplicated data to /dedup/ by default.
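Conceptually, the dedupe step keeps the first file seen for each distinct content hash. A minimal shell analogue (not codecurator's implementation, which works on the extracted records):

```shell
# Hypothetical illustration of content-hash dedup: within a directory,
# keep the first file seen for each distinct content hash, drop the rest.
dedupe_files() {
  dir="$1"
  seen=""
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    h=$(sha256sum "$f" | cut -d' ' -f1)
    case " $seen " in
      *" $h "*) rm "$f" ;;          # hash already seen: duplicate content
      *) seen="$seen $h" ;;         # first occurrence: keep the file
    esac
  done
}
```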
Statistics:
```
$ bash stats/count_records.sh ./jsonl/
Total records: 110645
$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
```