The first open-source framework for holistic, structured repository-level documentation across multilingual codebases
Developers spend nearly 58% of their time understanding codebases, yet maintaining comprehensive documentation remains challenging. While recent Large Language Models (LLMs) show promise for function-level documentation, they fail at the repository level, where capturing architectural patterns and cross-module interactions is essential.
CodeWiki is the first open-source framework for holistic repository-level documentation across seven programming languages, introducing innovations in hierarchical decomposition, recursive agentic processing, and multi-modal synthesis.
Figure 1: The CodeWiki framework operates in three main phases: (1) repository analysis and hierarchical decomposition, (2) recursive documentation generation with dynamic delegation, and (3) hierarchical assembly and synthesis.
Hierarchical decomposition: A dynamic programming-inspired strategy breaks complex repositories into manageable modules while preserving architectural coherence. It handles codebases from 86K to 1.4M lines of code.
Recursive agentic processing: A multi-agent architecture with dynamic delegation adapts processing to module complexity, maintaining quality at repository-level scope.
Multi-modal synthesis: CodeWiki generates documentation that combines textual descriptions, architecture diagrams, data flows, and sequence diagrams for a holistic understanding of the repository.
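The decomposition and delegation ideas above can be illustrated with a minimal Python sketch. It is not CodeWiki's implementation: the `Module` class, the `document` function, and the line-count `budget` used as a stand-in for module complexity are all hypothetical, and the summary strings stand in for what LLM agents would actually write.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical node in a module tree; only leaves carry code."""
    name: str
    loc: int = 0  # lines of code held directly by a leaf
    children: list["Module"] = field(default_factory=list)


def total_loc(module: Module) -> int:
    """Lines of code in this node plus everything beneath it."""
    return module.loc + sum(total_loc(child) for child in module.children)


def document(module: Module, budget: int) -> dict[str, str]:
    """Return {module name: summary}, with children documented before parents."""
    size = total_loc(module)
    if size <= budget or not module.children:
        # Small enough (or indivisible): one agent handles it directly.
        return {module.name: f"{module.name}: documented directly ({size} LOC)"}

    docs: dict[str, str] = {}
    # Too large: delegate each child to its own recursive call (a "sub-agent").
    for child in module.children:
        docs.update(document(child, budget))
    # Assemble the parent's page from the children's results.
    parts = ", ".join(child.name for child in module.children)
    docs[module.name] = f"{module.name}: synthesized from {parts}"
    return docs


# Toy repository (assumed sizes, for illustration only).
repo = Module("repo", children=[
    Module("core", children=[Module("core.parser", 900), Module("core.graph", 700)]),
    Module("cli", 300),
])

for name, summary in document(repo, budget=1000).items():
    print(summary)
```

With a budget of 1000 lines, `repo` and `core` are too large and get delegated, while `core.parser`, `core.graph`, and `cli` are documented directly. Raising the budget above the repository size collapses everything into a single direct pass, which is the adaptive behaviour the sketch is meant to show.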
CodeWiki is evaluated on CodeWikiBench, the first benchmark designed specifically for assessing repository-level documentation quality, which covers 21 repositories.
| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
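The Improvement column is an absolute difference in percentage points (CodeWiki score minus DeepWiki score), not a relative gain. The short Python check below recomputes it from the category rows of the table above; the function names are illustrative, and the relative figure is shown only for contrast.

```python
# (CodeWiki score, DeepWiki score) per row, copied from the table above.
scores = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}


def absolute_gain(codewiki: float, deepwiki: float) -> float:
    """Difference in percentage points, as reported in the table."""
    return round(codewiki - deepwiki, 2)


def relative_gain(codewiki: float, deepwiki: float) -> float:
    """Gain relative to the DeepWiki score, in percent (not what the table reports)."""
    return round(100 * (codewiki - deepwiki) / deepwiki, 2)


for category, (cw, dw) in scores.items():
    print(f"{category}: {absolute_gain(cw, dw):+.2f} points absolute, "
          f"{relative_gain(cw, dw):+.2f}% relative")
```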
| Repository | Language | LOC | CodeWiki | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |
View comprehensive results for all 21 repositories in our paper.
Watch CodeWiki in action as it generates comprehensive documentation for a real repository:
CLI Usage Example: Generating documentation with CodeWiki
```bash
# Install from source
pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

# Verify installation
codewiki --version
```

1. Configure CodeWiki:
```bash
codewiki config set \
  --api-key YOUR_API_KEY \
  --base-url https://api.anthropic.com \
  --main-model claude-sonnet-4 \
  --cluster-model claude-sonnet-4

# Verify configuration
codewiki config show
codewiki config validate
```

2. Generate Documentation:
```bash
# Navigate to your project
cd /path/to/your/project

# Generate documentation (saved to ./docs/)
codewiki generate

# Generate with GitHub Pages HTML viewer
codewiki generate --github-pages

# Full-featured generation
codewiki generate --create-branch --github-pages --verbose
```

```
./docs/
├── overview.md              # Repository overview (start here!)
├── module1.md               # Module documentation
├── module2.md               # Additional modules...
├── module_tree.json         # Hierarchical module structure
├── first_module_tree.json   # Initial clustering result
├── metadata.json            # Generation metadata
└── index.html               # Interactive viewer (with --github-pages)
```

If you use CodeWiki in your research, please cite our paper:
```bibtex
@misc{hoang2025codewikievaluatingaisability,
  title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
  author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
  year={2025},
  eprint={2510.24428},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2510.24428},
}
```