ggcr/codecurator


CodeCurator

An end-to-end tool for curating GitHub repositories into structured code datasets.

  • Fast parallel processing - Download and extract with configurable workers
  • Smart filtering - Only processes programming files using GitHub Linguist
  • GPT-2 tokenization - Ready-to-use token counts for ML workflows
  • Efficient caching - Uses ETags to avoid re-downloading unchanged repos
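The "configurable workers" idea can be approximated at the shell level with `xargs -P`; a minimal sketch, assuming the README's one-quoted-repo-per-line input format (CodeCurator itself does this with Rust worker threads, not a shell pipeline):

```shell
# Hedged sketch: fan one unit of work per input line out over N workers.
WORKERS=4

# Sample input in the README's one-quoted-repo-per-line format.
printf '"microsoft/vscode"\n"rust-lang/rust"\n' > /tmp/repos.jsonl

# Strip quotes, then run up to $WORKERS jobs concurrently.
tr -d '"' < /tmp/repos.jsonl |
  xargs -P "$WORKERS" -I{} echo "downloading {}"
```

With `-P`, output order is nondeterministic, but every input line is processed exactly once.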

Perfect for curating training data, running code analysis, or creating repository archives.

Installation

cargo install --path .

Usage

Create an input file with one GitHub repository per line:

"microsoft/vscode"
"vercel/next.js"
"tensorflow/tensorflow"
"bitcoin/bitcoin"
"rust-lang/rust"
"kubernetes/kubernetes"
"facebook/react"
"docker/compose"
"ansible/ansible"
"elastic/elasticsearch"

Download repositories:

codecurator download ./configs/repos.jsonl

This creates ZIP files in the /zip/ directory. Downloads are attempted from the main branch first, falling back to master if needed.
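The main-to-master fallback can be sketched without the GitHub API, since branch archives are served from codeload.github.com; the helper names below are hypothetical, not CodeCurator's internals:

```shell
# Build the no-API archive URL for a given repo and branch.
zip_url() {  # usage: zip_url <owner/repo> <branch>
  printf 'https://codeload.github.com/%s/zip/refs/heads/%s\n' "$1" "$2"
}

# Try main first, then fall back to master (hypothetical helper).
fetch_zip() {  # usage: fetch_zip <owner/repo> <outdir>
  out="$2/$(printf '%s' "$1" | tr / _).zip"
  mkdir -p "$2"
  curl -sfL "$(zip_url "$1" main)" -o "$out" ||
    curl -sfL "$(zip_url "$1" master)" -o "$out"
}

zip_url rust-lang/rust main   # https://codeload.github.com/rust-lang/rust/zip/refs/heads/main
```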

Extract and process:

codecurator extract ./configs/repos.jsonl --languages Python Rust Verilog

This processes all programming files, tokenizes their content, and writes structured records to the /jsonl/ directory.
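GitHub Linguist's detection is far richer than a suffix match, but the filtering idea behind `--languages` can be illustrated with a toy extension-to-language map (the mapping and file names below are illustrative assumptions):

```shell
# Toy extension-to-language map illustrating the --languages filter idea.
lang_of() {  # usage: lang_of <path>
  case "$1" in
    *.py) echo Python ;;
    *.rs) echo Rust ;;
    *.v)  echo Verilog ;;
    *)    echo Other ;;
  esac
}

# Keep only files whose detected language is in the requested set.
wanted="Python Rust Verilog"
for f in main.rs model.py top.v README.md; do
  case " $wanted " in
    *" $(lang_of "$f") "*) echo "keep $f" ;;
  esac
done
```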

Deduplication:

codecurator dedupe ./configs/repos.jsonl

This hashes the contents of every file and removes duplicates, storing the final data to /dedup/ by default.
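Content-hash deduplication of this kind can be sketched with `sha256sum` (the directory layout and choice of hash here are assumptions, not what CodeCurator actually uses):

```shell
# Copy each file to dest, skipping any whose content hash was already seen.
dedupe_files() {  # usage: dedupe_files <src_dir> <dest_dir>
  mkdir -p "$2"
  seen=" "
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    h=$(sha256sum "$f" | cut -d' ' -f1)
    case "$seen" in
      *" $h "*) ;;                      # duplicate content: drop
      *) seen="$seen$h "; cp "$f" "$2/" ;;
    esac
  done
}

# Demo: two identical files and one distinct file -> two survivors.
rm -rf /tmp/src /tmp/dedup && mkdir -p /tmp/src
printf 'a\n' > /tmp/src/one
printf 'a\n' > /tmp/src/two
printf 'b\n' > /tmp/src/three
dedupe_files /tmp/src /tmp/dedup
ls /tmp/dedup | wc -l   # 2
```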

Statistics:

$ bash stats/count_records.sh ./jsonl/
Total records: 110645
$ bash stats/count_tokens.sh ./dedup/
Total tokens: 346574283
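The stats scripts themselves aren't reproduced here; a plausible reading of count_records.sh (an assumption about the script, since each JSONL record is one line) is a line count over the output directory:

```shell
# One JSONL record per line, so total records == total lines.
count_records() {  # usage: count_records <dir>
  cat "$1"/*.jsonl 2>/dev/null | wc -l
}

# Demo on a small synthetic output directory.
rm -rf /tmp/jsonl && mkdir -p /tmp/jsonl
printf '{"a":1}\n{"a":2}\n' > /tmp/jsonl/x.jsonl
printf '{"a":3}\n' > /tmp/jsonl/y.jsonl
count_records /tmp/jsonl   # 3
```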

About

Rust tool to massively download, process and tokenize GitHub repos as datasets without the GitHub API 🦀
