Evaluate AI skills with scientific rigor. Compare prompts with and without injected context using A/B testing, multiple LLM providers, and production-grade evaluation techniques.
Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM.
Inspired by LangChain skills-benchmarks.
📚 Full Documentation | Quick Start | GitHub Models Guide | Examples
Building AI applications that work reliably requires scientific validation. md-evals makes it easy:
| Challenge | Solution |
|---|---|
| 🤔 "Does my skill actually help?" | A/B test Control vs Treatment automatically |
| 💰 "Can't afford to evaluate with expensive APIs?" | Use free GitHub Models (Claude, GPT-4, DeepSeek) |
| 📊 "How do I know if my results are real?" | Hybrid regex + LLM-as-judge evaluation |
| 🔄 "Evaluating 100+ test cases manually is tedious" | Parallel workers, beautiful terminal output, JSON/Markdown export |
| ✅ "How do I prevent bad skills from merging?" | Built-in linter (400-line limit, best practices) |
| 🏗️ "Will this integrate with my CI/CD?" | Simple YAML config, exit codes for automation |
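The CI/CD row above can be sketched as a minimal GitHub Actions job that gates merges on the tool's exit codes. This is a hypothetical workflow, not shipped with the repo: the workflow and job names are placeholders, and the secret wiring is an assumption.

```yaml
# Hypothetical CI gate: lint the skill, then fail the job if the
# evaluation run exits non-zero. Names below are placeholders.
name: skill-eval
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - run: md-evals lint
      - run: md-evals run --provider github-models --output json > results.json
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Because both `md-evals lint` and `md-evals run` exit non-zero on failure, no extra scripting is needed to block a bad merge.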
- ✨ A/B Testing: Compare Control (no skill) vs Treatment (with skill) prompts side-by-side
- 🎯 Multiple Treatments: Run wildcards like `LCC_*` to test different skill variations in one go
- 🧠 Hybrid Evaluation: Combine regex pattern matching + LLM-as-a-judge for flexible validation
- 🚀 Multiple LLM Providers: GitHub Models (free!), OpenAI, Anthropic, LiteLLM, and more
- 📋 Linter: Enforce 400-line limit, quality checks, and best practices for SKILL.md
- 📊 Rich Output: Beautiful terminal tables with pass rates, comparisons, and statistics
- 💾 Export: JSON, Markdown, or table format for reporting and analysis
- ⚡ Parallel Execution: Run multiple tests concurrently for faster feedback
- 🎉 GitHub Models Support: Use free/low-cost models (Claude 3.5, GPT-4, DeepSeek, Grok)
```bash
# Clone the repository
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install with uv (fastest)
uv sync

# Activate virtual environment
source .venv/bin/activate
```

Or with pip:

```bash
git clone https://github.com/JNZader/md-evals.git
cd md-evals

# Install dependencies
pip install -e .
```

Requirements: Python 3.12+
```bash
md-evals init
```

This creates:

- `eval.yaml` - Your evaluation config
- `SKILL.md` - Template for your AI skill
```bash
md-evals run
```

```bash
md-evals lint   # Validate SKILL.md
md-evals list   # List treatments and tests
```

```bash
# 1. Create evaluation
md-evals init

# 2. Preflight auth (env var first, gh login fallback)
md-evals smoke --provider github-models

# 3. Run with GitHub Models (free!)
export GITHUB_TOKEN="github_pat_..."
md-evals run --provider github-models --model claude-3.5-sonnet --config eval.yaml

# 4. View results
# → Beautiful table with Control vs Treatment comparison
# → Pass rates and statistics
```

Evaluate your skills completely free using GitHub's Models API (public preview):
```bash
# Preferred: set GITHUB_TOKEN directly
export GITHUB_TOKEN="github_pat_..."

# Fallback for users already logged in with GitHub CLI
gh auth login

# Verify auth preflight before first run
md-evals smoke --provider github-models --config examples/eval_with_github_models.yaml
```

```bash
# Use Claude 3.5 Sonnet (200k context, free!)
md-evals run --config eval.yaml --provider github-models --model claude-3.5-sonnet

# Or use GPT-4o
md-evals run --config eval.yaml --provider github-models --model gpt-4o

# Or use DeepSeek R1 (fastest)
md-evals run --config eval.yaml --provider github-models --model deepseek-r1
```

| Model | Context | Best For | Cost |
|---|---|---|---|
| `claude-3.5-sonnet` | 200k | Reasoning, complex tasks | 🟢 Free |
| `gpt-4o` | 128k | General-purpose, balanced | 🟢 Free |
| `deepseek-r1` | 64k | Speed, cost efficiency | 🟢 Free |
| `grok-3` | 128k | Latest, edge cases | 🟢 Free |
Rate Limits: 15 requests/min (public preview) · Full Guide →
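At 15 requests/min, throughput rather than cost is the practical bottleneck. A back-of-the-envelope sketch of a run's request budget, under assumed accounting (one completion per treatment × test × repeat, plus one judge call per `llm` evaluator; retries and smoke checks ignored -- these assumptions are not from the tool's docs):

```python
# Rough request budget under the 15 req/min public-preview limit.
# Assumptions (not from md-evals docs): one completion per
# treatment/test/repeat, plus one judge call per llm evaluator.
def estimate_requests(treatments: int, tests: int, count: int = 1,
                      llm_evals_per_test: int = 0) -> int:
    completions = treatments * tests * count
    judge_calls = treatments * tests * count * llm_evals_per_test
    return completions + judge_calls

def minutes_at_limit(requests: int, limit_per_min: int = 15) -> float:
    return requests / limit_per_min

# Two treatments (CONTROL vs WITH_SKILL), 10 tests, 1 llm judge each:
reqs = estimate_requests(treatments=2, tests=10, count=1, llm_evals_per_test=1)
print(reqs)  # 40 requests, i.e. roughly 3 minutes of wall clock at 15/min
```

If a run would blow the window, dropping to `-n 1` and trimming `--count` keeps it inside the limit.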
Create eval.yaml to define your evaluation. Here's a complete example:
```yaml
name: "My AI Skill Evaluation"
version: "1.0"
description: "Evaluate skill effectiveness with Control vs Treatment"

defaults:
  model: "claude-3.5-sonnet"
  provider: "github-models"  # Free! (or: openai, anthropic, etc.)
  temperature: 0.7
  max_tokens: 500

treatments:
  CONTROL:
    description: "Baseline: No skill injected"
    skill_path: null
  WITH_SKILL:
    description: "Treatment: With skill injected"
    skill_path: "./SKILL.md"
  WITH_SKILL_V2:
    description: "Alternative skill variant"
    skill_path: "./SKILL_V2.md"

tests:
  - name: "test_basic_greeting"
    prompt: "Greet {name} and ask how they're doing."
    variables:
      name: "Alice"
    evaluators:
      - type: "regex"
        name: "has_greeting"
        pattern: "(hello|hi|greetings)"
      - type: "llm"
        name: "is_friendly"
        criteria: "Does the response feel warm and friendly?"

  - name: "test_complex_reasoning"
    prompt: "Explain {concept} to a {audience}."
    variables:
      concept: "quantum computing"
      audience: "5-year-old child"
    evaluators:
      - type: "llm"
        name: "is_age_appropriate"
        criteria: "Is the explanation suitable for a 5-year-old?"
```

| Section | Purpose |
|---|---|
| `defaults` | LLM model, provider, temperature, token limits |
| `treatments` | Different skill configurations to compare |
| `tests` | Test cases with prompts, variables, and evaluators |
- `type: regex` - Pattern matching (fast, deterministic)
- `type: llm` - LLM-as-judge (flexible, intelligent)
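As an illustration of the `regex` type, the `has_greeting` pattern from the config example above behaves roughly like the following sketch (the evaluator's actual matching semantics -- e.g. case handling and anchoring -- may differ; check the docs):

```python
import re

# Sketch of a regex evaluator: pass if the pattern matches anywhere
# in the model's response. Case-insensitivity is an assumption here.
pattern = re.compile(r"(hello|hi|greetings)", re.IGNORECASE)

def regex_pass(response: str) -> bool:
    return pattern.search(response) is not None

print(regex_pass("Hi Alice, how are you doing?"))  # True
print(regex_pass("Good morning, Alice."))          # False
```

Because regex checks are deterministic and free, run them first and reserve `llm` evaluators for the qualities a pattern can't capture.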
| Command | Purpose |
|---|---|
| `md-evals init` | 🚀 Scaffold `eval.yaml` and `SKILL.md` templates |
| `md-evals run` | 🏃 Run the full evaluation (all treatments and tests) |
| `md-evals run --treatment WITH_SKILL` | 🎯 Run specific treatment |
| `md-evals lint` | ✅ Validate SKILL.md (400-line limit, best practices) |
| `md-evals list` | 📋 List available treatments and tests |
| `md-evals list-models` | 🤖 List available models per provider |
| `md-evals smoke --provider github-models --config eval.yaml` | 🧪 Local preflight (provider, config, auth) |
```bash
# Evaluate with default provider
md-evals run

# Use specific provider and model
md-evals run --provider github-models --model claude-3.5-sonnet

# Run only specific treatment
md-evals run --treatment WITH_SKILL

# Export results as JSON
md-evals run --output json > results.json

# Run with 4 parallel workers
md-evals run -n 4

# Repeat each test 5 times (for statistical significance)
md-evals run --count 5

# Export to Markdown report
md-evals run --output markdown > report.md

# Validate before running
md-evals lint
```

- `-c, --config FILE` - Config file (default: `eval.yaml`)
- `-t, --treatment TREATMENT` - Run specific treatment(s)
- `-m, --model MODEL` - Override model
- `-p, --provider PROVIDER` - Provider: `github-models`, `openai`, `anthropic`, etc.
- `-n WORKERS` - Parallel workers (default: 1)
- `--count N` - Repeat tests N times for statistical validation
- `-o, --output FORMAT` - Output format: `table` (default), `json`, `markdown`
- `--no-lint` - Skip SKILL.md linting
- `--debug` - Enable debug logging
- `-p, --provider PROVIDER` - Filter by provider
- `-v, --verbose` - Show metadata (temperature ranges, costs, rate limits)
```bash
# Install with dev dependencies
uv sync --extra dev

# Activate virtual environment
source .venv/bin/activate
```

md-evals has a comprehensive test suite with 94.95% code coverage and 321 passing tests.
```bash
# Run all tests
pytest

# Run tests in parallel (73% faster)
pytest -n 4

# View coverage report
pytest --cov=md_evals --cov-report=html
open htmlcov/index.html
```

Complete testing guides for different audiences:
| Guide | Audience | Purpose |
|---|---|---|
| TESTING.md | Everyone | How to run tests, markers, parallel execution |
| TEST_DEVELOPMENT_GUIDE.md | Developers | Writing new tests, fixtures, mocking strategies |
| TEST_ARCHITECTURE.md | Tech Leads | Test organization, fixture hierarchy, isolation patterns |
| TEST_CI_INTEGRATION.md | DevOps/CI Engineers | CI/CD setup, Docker, reporting, multiple platforms |
| TEST_QUICK_REFERENCE.md | All | Command cheat sheet, one-liners, common patterns |
| TEST_COVERAGE_ANALYSIS.md | Maintainers | Coverage gaps, improvement roadmap, module analysis |
```bash
# Run only unit tests (fast feedback)
pytest -m unit

# Run only integration tests
pytest -m integration

# Run specific test file
pytest tests/test_github_models_provider.py -v

# Debug a specific test
pytest tests/test_engine.py::TestExecutionEngine::test_run_basic -vvv --pdb

# Run tests that match pattern
pytest -k "github_models"

# Skip slow tests (faster local development)
pytest -m "not slow"

# Generate all reports
pytest -n 4 \
  --cov=md_evals \
  --cov-report=html \
  --cov-report=xml \
  --cov-report=json \
  --junit-xml=test-results.xml
```

- Overall: 94.95% (production standard: 90%)
- Critical modules: >95% (engine, evaluators, config)
- Test count: 321 tests (unit, integration, E2E, performance)
- Execution time: 6.63s parallel / 22.09s serial
```
tests/
├── conftest.py                      # Shared fixtures and config
├── test_cli.py                      # CLI command tests (100+ tests)
├── test_engine.py                   # Core evaluation engine
├── test_evaluator.py                # Regex & LLM evaluators
├── test_github_models_provider.py   # Provider tests (43 tests)
├── test_e2e_workflow.py             # End-to-end workflow tests
├── test_linter.py                   # SKILL.md validation
├── test_reporter.py                 # Report generation
└── ...                              # (10+ test files total)
```

| Configuration | Time | Speedup |
|---|---|---|
| Serial | 22.09s | — |
| Parallel (4 workers) | 6.63s | 73% |
| Unit tests only | ~5s | 78% |
| Fast tests (no slow) | ~10s | 55% |
For more details, see TESTING.md.
```
md_evals/
├── cli.py                    # Command-line interface
├── engine.py                 # Evaluation engine (A/B testing)
├── llm.py                    # LLM provider interface
├── providers/                # LLM provider implementations
│   ├── github_models.py      # GitHub Models (free!)
│   ├── openai_provider.py
│   ├── anthropic_provider.py
│   └── litellm_provider.py
├── evaluators/               # Evaluation strategies
│   ├── regex_evaluator.py
│   └── llm_evaluator.py
└── config.py                 # YAML config parsing

tests/
├── test_engine.py
├── test_github_models_provider.py   # 43 tests
├── test_provider_registry.py        # 11 tests
└── ...
```

- Full Guide - Installation, tutorials, API reference
- GitHub Models Setup - Free LLM evaluation guide
- Examples - Real-world usage examples
We welcome contributions! Please see CONTRIBUTING.md for:
- Fork → Branch → Pull Request workflow
- Code style guidelines (Ruff, 100 char lines)
- Testing requirements (>80% coverage)
- Conventional Commits format
- CODE_OF_CONDUCT.md - Our community standards
- SECURITY.md - Vulnerability disclosure process
- Issues - Report bugs or request features
- Discussions - Ask questions and share ideas
MIT