syntax-syndicate/CodeWiki


CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

AI-Powered Repository Documentation Generation • Multi-Language Support • Architecture-Aware Analysis

Generate holistic, structured documentation for large-scale codebases • Cross-module interactions • Visual artifacts and diagrams



Quick Start

1. Install CodeWiki

    # Install from source
    pip install git+https://github.com/FSoft-AI4Code/CodeWiki.git

    # Verify installation
    codewiki --version

2. Configure Your Environment

    codewiki config set \
      --api-key YOUR_API_KEY \
      --base-url https://api.anthropic.com \
      --main-model claude-sonnet-4 \
      --cluster-model claude-sonnet-4

3. Generate Documentation

    # Navigate to your project
    cd /path/to/your/project

    # Generate documentation
    codewiki generate

    # Generate with HTML viewer for GitHub Pages
    codewiki generate --github-pages --create-branch

That's it! Your documentation will be generated in ./docs/ with comprehensive repository-level analysis.


What is CodeWiki?

CodeWiki is an open-source framework for automated repository-level documentation across seven programming languages. It generates holistic, architecture-aware documentation that captures not only individual functions but also their cross-file, cross-module, and system-level interactions.

Key Innovations

| Innovation | Description | Impact |
|---|---|---|
| Hierarchical Decomposition | Dynamic programming-inspired strategy that preserves architectural context | Handles codebases of arbitrary size (86K-1.4M LOC tested) |
| Recursive Agentic System | Adaptive multi-agent processing with dynamic delegation capabilities | Maintains quality while scaling to repository-level scope |
| Multi-Modal Synthesis | Generates textual documentation, architecture diagrams, data flows, and sequence diagrams | Comprehensive understanding from multiple perspectives |

Supported Languages

🐍 Python • ☕ Java • 🟨 JavaScript • 🔷 TypeScript • ⚙️ C • 🔧 C++ • 🪟 C#


CLI Commands

Configuration Management

    # Set up your API configuration
    codewiki config set \
      --api-key <your-api-key> \
      --base-url <provider-url> \
      --main-model <model-name> \
      --cluster-model <model-name>

    # Show current configuration
    codewiki config show

    # Validate your configuration
    codewiki config validate

Documentation Generation

    # Basic generation
    codewiki generate

    # Custom output directory
    codewiki generate --output ./documentation

    # Create git branch for documentation
    codewiki generate --create-branch

    # Generate HTML viewer for GitHub Pages
    codewiki generate --github-pages

    # Enable verbose logging
    codewiki generate --verbose

    # Full-featured generation
    codewiki generate --create-branch --github-pages --verbose
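When generation is driven from a script (for example in CI), the flags listed above can be assembled programmatically. The following is a minimal sketch, assuming `codewiki` is installed and on the PATH. The helper names `build_generate_command` and `run_generate` are hypothetical and not part of CodeWiki; only flags shown in this section are used.

```python
import subprocess
from typing import List, Optional


def build_generate_command(
    output: Optional[str] = None,
    create_branch: bool = False,
    github_pages: bool = False,
    verbose: bool = False,
) -> List[str]:
    """Assemble a `codewiki generate` argument list from the documented flags."""
    cmd = ["codewiki", "generate"]
    if output is not None:
        cmd += ["--output", output]
    if create_branch:
        cmd.append("--create-branch")
    if github_pages:
        cmd.append("--github-pages")
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_generate(**options) -> int:
    """Run the command in the current directory and return its exit code."""
    return subprocess.run(build_generate_command(**options)).returncode
```

Passing the arguments as a list avoids shell quoting issues when the output path contains spaces.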

Configuration Storage

  • API keys: Securely stored in system keychain (macOS Keychain, Windows Credential Manager, Linux Secret Service)
  • Settings: ~/.codewiki/config.json

Documentation Output

Generated documentation includes both textual descriptions and visual artifacts for comprehensive understanding.

Textual Documentation

  • Repository overview with architecture guide
  • Module-level documentation with API references
  • Usage examples and implementation patterns
  • Cross-module interaction analysis

Visual Artifacts

  • System architecture diagrams (Mermaid)
  • Data flow visualizations
  • Dependency graphs and module relationships
  • Sequence diagrams for complex interactions

Output Structure

    ./docs/
    ├── overview.md               # Repository overview (start here!)
    ├── module1.md                # Module documentation
    ├── module2.md                # Additional modules...
    ├── module_tree.json          # Hierarchical module structure
    ├── first_module_tree.json    # Initial clustering result
    ├── metadata.json             # Generation metadata
    └── index.html                # Interactive viewer (with --github-pages)
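After a run, it can be useful to confirm that the fixed-name files from the layout above were actually written. The sketch below is illustrative only: the function `missing_outputs` is a hypothetical helper, and it checks file names only, since the contents and JSON schemas are not described here.

```python
from pathlib import Path
from typing import List

# File names taken from the layout above. Per-module files (module1.md, ...)
# vary by repository, so they are not checked.
EXPECTED_FILES = [
    "overview.md",
    "module_tree.json",
    "first_module_tree.json",
    "metadata.json",
]


def missing_outputs(docs_dir: str, github_pages: bool = False) -> List[str]:
    """Return the expected file names that are absent from docs_dir."""
    expected = list(EXPECTED_FILES)
    if github_pages:
        # index.html is only produced when --github-pages is passed.
        expected.append("index.html")
    root = Path(docs_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means every expected file is present; for example, `missing_outputs("./docs", github_pages=True)` could gate a CI publish step.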

Experimental Results

CodeWiki has been evaluated on CodeWikiBench, the first benchmark specifically designed for repository-level documentation quality assessment.

Performance by Language Category

| Language Category | CodeWiki (Sonnet-4) | DeepWiki | Improvement |
|---|---|---|---|
| High-Level (Python, JS, TS) | 79.14% | 68.67% | +10.47% |
| Managed (C#, Java) | 68.84% | 64.80% | +4.04% |
| Systems (C, C++) | 53.24% | 56.39% | -3.15% |
| Overall Average | 68.79% | 64.06% | +4.73% |
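The Improvement column is the difference between the two scores in percentage points, not a relative change. A short sketch that recomputes it from the values in the table (the function name `improvement` is chosen here for illustration):

```python
# (CodeWiki Sonnet-4 score, DeepWiki score) in percent, copied from the table.
scores = {
    "High-Level (Python, JS, TS)": (79.14, 68.67),
    "Managed (C#, Java)": (68.84, 64.80),
    "Systems (C, C++)": (53.24, 56.39),
    "Overall Average": (68.79, 64.06),
}


def improvement(codewiki: float, deepwiki: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(codewiki - deepwiki, 2)


for category, (codewiki, deepwiki) in scores.items():
    print(f"{category}: {improvement(codewiki, deepwiki):+.2f}")
```

The negative value for the Systems category reflects that DeepWiki scored higher on C and C++ repositories.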

Results on Representative Repositories

| Repository | Language | LOC | CodeWiki-Sonnet-4 | DeepWiki | Improvement |
|---|---|---|---|---|---|
| All-Hands-AI--OpenHands | Python | 229K | 82.45% | 73.04% | +9.41% |
| puppeteer--puppeteer | TypeScript | 136K | 83.00% | 64.46% | +18.54% |
| sveltejs--svelte | JavaScript | 125K | 71.96% | 68.51% | +3.45% |
| Unity-Technologies--ml-agents | C# | 86K | 79.78% | 74.80% | +4.98% |
| elastic--logstash | Java | 117K | 57.90% | 54.80% | +3.10% |

View comprehensive results: See the paper for the complete evaluation on 21 repositories spanning all supported languages.


How It Works

Architecture Overview

CodeWiki employs a three-stage process for comprehensive documentation generation:

  1. Hierarchical Decomposition: Uses dynamic programming-inspired algorithms to partition repositories into coherent modules while preserving architectural context across multiple granularity levels.

  2. Recursive Multi-Agent Processing: Implements adaptive multi-agent processing with dynamic task delegation, allowing the system to handle complex modules at scale while maintaining quality.

  3. Multi-Modal Synthesis: Integrates textual descriptions with visual artifacts including architecture diagrams, data-flow representations, and sequence diagrams for comprehensive understanding.
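To make the first stage more concrete, here is a simplified sketch of budget-driven recursive decomposition. It is not CodeWiki's actual algorithm, which this README does not specify; the `Module` type, the `decompose` function, and the lines-of-code budget are all assumptions made for illustration. The idea shown is only that a module small enough to handle in one pass becomes a single documentation unit, while a larger one is split into its children and processed recursively.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """Hypothetical node in a module tree; only leaf nodes carry code here."""
    name: str
    loc: int = 0
    children: List["Module"] = field(default_factory=list)

    def total_loc(self) -> int:
        return self.loc + sum(child.total_loc() for child in self.children)


def decompose(module: Module, budget: int) -> List[Module]:
    """Return the modules to document as single units.

    A module that fits the budget, or that has no children to split into,
    is one unit; otherwise each child is decomposed recursively.
    """
    if module.total_loc() <= budget or not module.children:
        return [module]
    units: List[Module] = []
    for child in module.children:
        units.extend(decompose(child, budget))
    return units


repo = Module("repo", children=[
    Module("core", children=[
        Module("core.parser", loc=400),
        Module("core.graph", loc=700),
    ]),
    Module("cli", loc=300),
])
print([unit.name for unit in decompose(repo, budget=800)])
```

With a budget of 800 lines, `repo` and `core` are both too large and get split, so the units are `core.parser`, `core.graph`, and `cli`; with a budget of 2000 the whole repository is a single unit.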

Data Flow

    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │   Codebase      │───▶│  Hierarchical    │───▶│  Multi-Agent    │
    │   Analysis      │    │  Decomposition   │    │  Processing     │
    └─────────────────┘    └──────────────────┘    └─────────────────┘
                                    │                       │
                                    ▼                       ▼
    ┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
    │  Visual         │◀───│  Multi-Modal     │◀───│  Structured     │
    │  Artifacts      │    │  Synthesis       │    │  Content        │
    └─────────────────┘    └──────────────────┘    └─────────────────┘

Requirements

  • Python 3.12+
  • Node.js (for Mermaid diagram validation)
  • LLM API access (Anthropic Claude, OpenAI, etc.)
  • Git (for branch creation features)
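The local prerequisites in this list can be checked before installing. This is a small sketch using only the Python standard library; the function name `check_requirements` is hypothetical, and it does not verify LLM API access, which depends on your provider and key.

```python
import shutil
import sys
from typing import Dict


def check_requirements() -> Dict[str, bool]:
    """Report whether each local prerequisite appears to be available."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        "node": shutil.which("node") is not None,  # Mermaid diagram validation
        "git": shutil.which("git") is not None,    # branch creation features
    }


for name, ok in check_requirements().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```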

Additional Resources

Documentation & Guides

Academic Resources

  • Paper - Full research paper with detailed methodology and results
  • Citation - How to cite CodeWiki in your research

Citation

If you use CodeWiki in your research, please cite:

    @misc{hoang2025codewikievaluatingaisability,
      title={CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases},
      author={Anh Nguyen Hoang and Minh Le-Anh and Bach Le and Nghi D. Q. Bui},
      year={2025},
      eprint={2510.24428},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2510.24428},
    }


License

This project is licensed under the MIT License - see the LICENSE file for details.
