Skip to content

JNZader/md-evals

md-evals

GitHub License Python 3.12+ Tests GitHub Models

Evaluate AI skills with scientific rigor. Compare prompts with and without injected context using A/B testing, multiple LLM providers, and production-grade evaluation techniques.

Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM.

Inspired by LangChain skills-benchmarks.

📚 Full Documentation | Quick Start | GitHub Models Guide | Examples

Why md-evals?

Building AI applications that work reliably requires scientific validation. md-evals makes it easy:

Challenge Solution
🤔 "Does my skill actually help?" A/B test Control vs Treatment automatically
💰 "Can't afford to evaluate with expensive APIs?" Use free GitHub Models (Claude, GPT-4, DeepSeek)
📊 "How do I know if my results are real?" Hybrid regex + LLM-as-judge evaluation
🔄 "Evaluating 100+ test cases manually is tedious" Parallel workers, beautiful terminal output, JSON/Markdown export
✅ "How do I prevent bad skills from merging?" Built-in linter (400-line limit, best practices)
🏗️ "Will this integrate with my CI/CD?" Simple YAML config, exit codes for automation

Features

  • A/B Testing: Compare Control (no skill) vs Treatment (with skill) prompts side-by-side
  • 🎯 Multiple Treatments: Run wildcards like LCC_* to test different skill variations in one go
  • 🧠 Hybrid Evaluation: Combine regex pattern matching + LLM-as-a-judge for flexible validation
  • 🚀 Multiple LLM Providers: GitHub Models (free!), OpenAI, Anthropic, LiteLLM, and more
  • 📋 Linter: Enforce 400-line limit, quality checks, and best practices for SKILL.md
  • 📊 Rich Output: Beautiful terminal tables with pass rates, comparisons, and statistics
  • 💾 Export: JSON, Markdown, or table format for reporting and analysis
  • Parallel Execution: Run multiple tests concurrently for faster feedback
  • 🎉 GitHub Models Support: Use free/low-cost models (Claude 3.5, GPT-4, DeepSeek, Grok)

Installation

Using uv (Recommended)

# Clone the repository git clone https://github.com/JNZader/md-evals.git cd md-evals # Install with uv (fastest) uv sync # Activate virtual environment source .venv/bin/activate

Using pip

git clone https://github.com/JNZader/md-evals.git cd md-evals # Install dependencies pip install -e .

Requirements: Python 3.12+

Quick Start

1. Initialize your evaluation

md-evals init

This creates:

  • eval.yaml - Your evaluation config
  • SKILL.md - Template for your AI skill

2. Run evaluation

md-evals run

3. Check your skill

md-evals lint # Validate SKILL.md md-evals list # List treatments and tests

⏱️ Complete example in 2 minutes

# 1. Create evaluation md-evals init # 2. Preflight auth (env var first, gh login fallback) md-evals smoke --provider github-models # 3. Run with GitHub Models (free!) export GITHUB_TOKEN="github_pat_..." md-evals run --provider github-models --model claude-3.5-sonnet --config eval.yaml # 4. View results # → Beautiful table with Control vs Treatment comparison # → Pass rates and statistics

🎉 GitHub Models: Free LLM Evaluation

Evaluate your skills completely free using GitHub's Models API (public preview):

Setup (One-time)

# Preferred: set GITHUB_TOKEN directly export GITHUB_TOKEN="github_pat_..." # Fallback for users already logged in with GitHub CLI gh auth login # Verify auth preflight before first run md-evals smoke --provider github-models --config examples/eval_with_github_models.yaml

Run Evaluation with Free Models

# Use Claude 3.5 Sonnet (200k context, free!) md-evals run --config eval.yaml --provider github-models --model claude-3.5-sonnet # Or use GPT-4o md-evals run --config eval.yaml --provider github-models --model gpt-4o # Or use DeepSeek R1 (fastest) md-evals run --config eval.yaml --provider github-models --model deepseek-r1

Available Models

Model Context Best For Cost
claude-3.5-sonnet 200k Reasoning, complex tasks 🟢 Free
gpt-4o 128k General-purpose, balanced 🟢 Free
deepseek-r1 64k Speed, cost efficiency 🟢 Free
grok-3 128k Latest, edge cases 🟢 Free

Rate Limits: 15 requests/min (public preview) · Full Guide →

Configuration

Create eval.yaml to define your evaluation. Here's a complete example:

name: "My AI Skill Evaluation" version: "1.0" description: "Evaluate skill effectiveness with Control vs Treatment" defaults: model: "claude-3.5-sonnet" provider: "github-models" # Free! (or: openai, anthropic, etc.) temperature: 0.7 max_tokens: 500 treatments: CONTROL: description: "Baseline: No skill injected" skill_path: null WITH_SKILL: description: "Treatment: With skill injected" skill_path: "./SKILL.md" WITH_SKILL_V2: description: "Alternative skill variant" skill_path: "./SKILL_V2.md" tests: - name: "test_basic_greeting" prompt: "Greet {name} and ask how they're doing." variables: name: "Alice" evaluators: - type: "regex" name: "has_greeting" pattern: "(hello|hi|greetings)" - type: "llm" name: "is_friendly" criteria: "Does the response feel warm and friendly?" - name: "test_complex_reasoning" prompt: "Explain {concept} to a {audience}." variables: concept: "quantum computing" audience: "5-year-old child" evaluators: - type: "llm" name: "is_age_appropriate" criteria: "Is the explanation suitable for a 5-year-old?"

Key Sections

Section Purpose
defaults LLM model, provider, temperature, token limits
treatments Different skill configurations to compare
tests Test cases with prompts, variables, and evaluators

Evaluators

  • type: regex - Pattern matching (fast, deterministic)
  • type: llm - LLM-as-judge (flexible, intelligent)

Commands

Command Purpose
md-evals init 🚀 Scaffold eval.yaml and SKILL.md templates
md-evals run ▶️ Run evaluations (Control vs Treatment)
md-evals run --treatment WITH_SKILL 🎯 Run specific treatment
md-evals lint ✅ Validate SKILL.md (400-line limit, best practices)
md-evals list 📋 List available treatments and tests
md-evals list-models 🤖 List available models per provider
md-evals smoke --provider github-models --config eval.yaml 🧪 Local preflight (provider, config, auth)

Common Workflows

# Evaluate with default provider md-evals run # Use specific provider and model md-evals run --provider github-models --model claude-3.5-sonnet # Run only specific treatment md-evals run --treatment WITH_SKILL # Export results as JSON md-evals run --output json > results.json # Run with 4 parallel workers md-evals run -n 4 # Repeat each test 5 times (for statistical significance) md-evals run --count 5 # Export to Markdown report md-evals run --output markdown > report.md # Validate before running md-evals lint

Full Options Reference

run

  • -c, --config FILE - Config file (default: eval.yaml)
  • -t, --treatment TREATMENT - Run specific treatment(s)
  • -m, --model MODEL - Override model
  • -p, --provider PROVIDER - Provider: github-models, openai, anthropic, etc.
  • -n WORKERS - Parallel workers (default: 1)
  • --count N - Repeat tests N times for statistical validation
  • -o, --output FORMAT - Output format: table (default), json, markdown
  • --no-lint - Skip SKILL.md linting
  • --debug - Enable debug logging

list-models

  • -p, --provider PROVIDER - Filter by provider
  • -v, --verbose - Show metadata (temperature ranges, costs, rate limits)

Development

Setup

# Install with dev dependencies uv sync --extra dev # Activate virtual environment source .venv/bin/activate

Testing

md-evals has a comprehensive test suite with 94.95% code coverage and 321 passing tests.

Quick Start

# Run all tests pytest # Run tests in parallel (73% faster) pytest -n 4 # View coverage report pytest --cov=md_evals --cov-report=html open htmlcov/index.html

Test Documentation

Complete testing guides for different audiences:

Guide Audience Purpose
TESTING.md Everyone How to run tests, markers, parallel execution
TEST_DEVELOPMENT_GUIDE.md Developers Writing new tests, fixtures, mocking strategies
TEST_ARCHITECTURE.md Tech Leads Test organization, fixture hierarchy, isolation patterns
TEST_CI_INTEGRATION.md DevOps/CI Engineers CI/CD setup, Docker, reporting, multiple platforms
TEST_QUICK_REFERENCE.md All Command cheat sheet, one-liners, common patterns
TEST_COVERAGE_ANALYSIS.md Maintainers Coverage gaps, improvement roadmap, module analysis

Common Testing Tasks

# Run only unit tests (fast feedback) pytest -m unit # Run only integration tests pytest -m integration # Run specific test file pytest tests/test_github_models_provider.py -v # Debug a specific test pytest tests/test_engine.py::TestExecutionEngine::test_run_basic -vvv --pdb # Run tests that match pattern pytest -k "github_models" # Skip slow tests (faster local development) pytest -m "not slow" # Generate all reports pytest -n 4 \ --cov=md_evals \ --cov-report=html \ --cov-report=xml \ --cov-report=json \ --junit-xml=test-results.xml

Test Coverage

  • Overall: 94.95% (production standard: 90%)
  • Critical modules: >95% (engine, evaluators, config)
  • Test count: 321 tests (unit, integration, E2E, performance)
  • Execution time: 6.63s parallel / 22.09s serial

Test Structure

tests/ ├── conftest.py # Shared fixtures and config ├── test_cli.py # CLI command tests (100+ tests) ├── test_engine.py # Core evaluation engine ├── test_evaluator.py # Regex & LLM evaluators ├── test_github_models_provider.py # Provider tests (43 tests) ├── test_e2e_workflow.py # End-to-end workflow tests ├── test_linter.py # SKILL.md validation ├── test_reporter.py # Report generation └── ... (10+ test files total) 

Performance

Configuration Time Speedup
Serial 22.09s
Parallel (4 workers) 6.63s 73%
Unit tests only ~5s 78%
Fast tests (no slow) ~10s 55%

For more details, see TESTING.md.

Project Structure

md_evals/ ├── cli.py # Command-line interface ├── engine.py # Evaluation engine (A/B testing) ├── llm.py # LLM provider interface ├── providers/ # LLM provider implementations │ ├── github_models.py # GitHub Models (free!) │ ├── openai_provider.py │ ├── anthropic_provider.py │ └── litellm_provider.py ├── evaluators/ # Evaluation strategies │ ├── regex_evaluator.py │ └── llm_evaluator.py └── config.py # YAML config parsing tests/ ├── test_engine.py ├── test_github_models_provider.py # 43 tests ├── test_provider_registry.py # 11 tests └── ... 

Community & Support

📖 Documentation

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Fork → Branch → Pull Request workflow
  • Code style guidelines (Ruff, 100 char lines)
  • Testing requirements (>80% coverage)
  • Conventional Commits format

📋 Community

📝 License

MIT

About

Lightweight CLI tool for evaluating AI skills (SKILL.md) with Control vs Treatment testing using LiteLLM

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors