Generate and evaluate agent skills from task descriptions or agent traces. Create skills with teacher models (expensive/slow) that student models (cheap/fast) can use to perform harder tasks reliably.
Install upskill:
    pip install upskill
    # or just use uv
    uvx upskill

Create a new skill:

    upskill generate "write good git commit messages"
    # or based on previous agent traces
    upskill generate "document the pattern" --from ./trace.md
    # Skills are saved to ./skills/{skill-name}/ by default

Generate a skill with a teaching model and evaluate it on a student model.

    upskill generate "write good git commit messages" --model sonnet --eval-model haiku

Benchmark a set of models against a skill.

    upskill eval ./skills/git-commit-messages/ -m haiku -m sonnet
    # logs pretty printed to the terminal

View the results later.

    upskill runs --skill git-commit-messages

Generate a skill from a task description with automatic evaluation and refinement.
    upskill generate TASK [OPTIONS]

Arguments:

- `TASK` - Description of what the skill should teach
Options:
- `-e, --example` - Input -> output example (can be repeated)
- `--tool` - Generate from MCP tool schema (path#tool_name)
- `-f, --from PATH` - Improve from existing skill dir or agent trace file (auto-detected)
- `-m, --model MODEL` - Model for generation (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
- `-o, --output PATH` - Output directory for skill
- `--no-eval` - Skip evaluation and refinement
- `--eval-model MODEL` - Different model to evaluate skill on
- `--runs-dir PATH` - Directory for run logs (default: ./runs)
- `--log-runs / --no-log-runs` - Log run data (default: enabled)
Examples:
    # Basic usage
    upskill generate "parse JSON Schema files"

    # Make and evaluate skills for less powerful models
    upskill generate "write git commits" --model sonnet --eval-model haiku

    # Improve an existing skill (auto-detected as directory)
    upskill generate "add more error handling examples" --from ./skills/api-errors/

    # Generate from an agent trace file (auto-detected as file)
    upskill generate "document the pattern" --from ./trace.json

    # Skip evaluation during generation (evaluate separately with upskill eval)
    upskill generate "parse YAML" --no-eval

Output:
    Generating skill with sonnet...
    Generating test cases...
    Evaluating on sonnet... (attempt 1)
      60% -> 100% (+40%) OK

      git-commit-messages
      Write clear, conventional commit messages that follow best practices.

      SKILL.md
      ~450 tokens

      baseline   ████████████░░░░░░░░   60%
      with skill ████████████████████  100%  (+40%)

      tokens: 1200 → 800  (-33%)

    Saved to ./skills/git-commit-messages

Evaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.
    upskill eval SKILL_PATH [OPTIONS]

Arguments:

- `SKILL_PATH` - Path to skill directory containing SKILL.md
Options:
- `-t, --tests PATH` - Test cases JSON file
- `-m, --model MODEL` - Model(s) to evaluate against (repeatable for multi-model benchmarking)
- `--runs N` - Number of runs per model (default: 1)
- `--provider [anthropic|openai|generic]` - API provider (auto-detected as 'generic' when --base-url is provided)
- `--base-url URL` - Custom API endpoint for local models
- `--no-baseline` - Skip baseline comparison
- `-v, --verbose` - Show per-test results
- `--log-runs / --no-log-runs` - Log run data (default: enabled)
- `--runs-dir PATH` - Directory for run logs
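
The `--tests` option takes a JSON file in the test-case format shown later in this section. As a minimal sketch, such a file could be built and sanity-checked from Python before being handed to `upskill eval`. The helper names `validate_test_cases` and `write_test_cases` are hypothetical and not part of upskill; only the `input` / `expected.contains` shape comes from the documented example, and other keys under `expected` (if any exist) are simply left unchecked.

```python
import json
from pathlib import Path


def validate_test_cases(cases):
    """Check cases against the documented shape (hypothetical helper).

    Raises ValueError on the first malformed case and returns the case count.
    Only "input" and "expected.contains" are checked, because those are the
    only fields shown in the documented example.
    """
    if not isinstance(cases, list):
        raise ValueError("test cases must be a JSON array")
    for i, case in enumerate(cases):
        if not isinstance(case, dict) or not isinstance(case.get("input"), str):
            raise ValueError(f"case {i}: 'input' must be a string")
        expected = case.get("expected")
        if not isinstance(expected, dict):
            raise ValueError(f"case {i}: 'expected' must be an object")
        contains = expected.get("contains", [])
        if not isinstance(contains, list) or not all(
            isinstance(item, str) for item in contains
        ):
            raise ValueError(f"case {i}: 'contains' must be a list of strings")
    return len(cases)


def write_test_cases(cases, path):
    """Validate the cases, then write them as pretty-printed JSON."""
    validate_test_cases(cases)
    Path(path).write_text(json.dumps(cases, indent=2), encoding="utf-8")


# The two cases from the documented example.
cases = [
    {"input": "Write a commit for adding login",
     "expected": {"contains": ["feat", "login"]}},
    {"input": "Fix the null pointer bug",
     "expected": {"contains": ["fix", "bug"]}},
]
```

Calling `write_test_cases(cases, "tests.json")` produces a file that can then be passed as `--tests ./tests.json`.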
Examples:
    # Basic evaluation with baseline comparison
    upskill eval ./skills/my-skill/

    # With verbose output
    upskill eval ./skills/my-skill/ -v

    # Custom test cases
    upskill eval ./skills/my-skill/ --tests ./tests.json

    # Evaluate on specific model
    upskill eval ./skills/my-skill/ -m haiku

    # Multi-model benchmarking (compare models)
    upskill eval ./skills/my-skill/ -m haiku -m sonnet

    # Multiple runs per model for statistical significance
    upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5

    # Evaluate on local model (llama.cpp server)
    upskill eval ./skills/my-skill/ \
        -m "unsloth/GLM-4.7-Flash-GGUF:Q4_0" \
        --base-url http://localhost:8080/v1

    # Skip baseline (just test with skill)
    upskill eval ./skills/my-skill/ --no-baseline

    # Disable run logging
    upskill eval ./skills/my-skill/ --no-log-runs

Benchmark output:
    Evaluating my-skill across 2 model(s)
    3 test case(s), 5 run(s) per model

    haiku
      Pass rate: 4/5 (80%)  Avg assertions: 2.8/3
    sonnet
      Pass rate: 5/5 (100%)  Avg assertions: 3.0/3

    ┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
    ┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ haiku  │ 4/5       │ 2.8/3          │ 1250       │
    │ sonnet │ 5/5       │ 3.0/3          │ 1890       │
    └────────┴───────────┴────────────────┴────────────┘

Test cases JSON format:
    [
      {"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
      {"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
    ]

List all generated skills in a tree view.
    upskill list [OPTIONS]

Options:

- `-d, --dir PATH` - Skills directory to list
- `-v, --verbose` - Show skill contents preview
Examples:
    # List skills in default directory
    upskill list

    # List from custom directory
    upskill list -d ./my-skills/

    # Show preview of skill contents
    upskill list -v

Output:
    ./skills
    ├── git-commit-messages
    │   ├── Write clear, conventional commit messages...
    │   └── files
    │       └── SKILL.md
    ├── api-error-handling
    │   ├── Handle API errors gracefully with proper logging...
    │   └── files
    │       ├── SKILL.md
    │       └── references/error-codes.md
    └── yaml-parsing
        ├── Parse YAML files safely with schema validation...
        └── files
            ├── SKILL.md
            └── scripts/validate.py

View run results as a plot, or export to CSV. By default, shows a visual comparison of baseline vs with-skill performance.
    upskill runs [OPTIONS]

Options:

- `-d, --dir PATH` - Runs directory
- `-s, --skill TEXT` - Filter by skill name(s) (repeatable)
- `-m, --model TEXT` - Filter by model(s) (repeatable)
- `--metric [success|tokens]` - Metric to display (default: success)
- `--csv PATH` - Export to CSV instead of plot
Examples:
    # View results plot (default)
    upskill runs

    # Filter by skill and models
    upskill runs -s my-skill -m haiku -m sonnet

    # Show token usage instead of success rate
    upskill runs --metric tokens

    # Export to CSV
    upskill runs --csv ./results.csv

    # Custom runs directory
    upskill runs -d ./my-runs/

Plot output:
    skill: git-commit-messages

    haiku
      baseline   ████████████░░░░░░░░   60%
      with skill ████████████████░░░░   80%  (+20%)

    sonnet
      baseline   ████████████░░░░░░░░   60%
      with skill ████████████████████  100%  (+40%)

Matrix view (multiple skills and models):
    ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
    ┃ skill               ┃ haiku        ┃ sonnet       ┃
    ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
    │ git-commit-messages │ 60%→80%      │ 60%→100%     │
    │ api-error-handling  │ 40%→70%      │ 50%→90%      │
    │ yaml-parsing        │ 70%→90%      │ 80%→100%     │
    └─────────────────────┴──────────────┴──────────────┘

Skills are saved in a standard directory format:
    ./skills/{skill-name}/
    ├── SKILL.md          # Main skill instructions
    ├── references/       # Supporting documents (optional)
    └── scripts/          # Executable scripts (optional)

Example SKILL.md:
    # git-commit-messages

    Write clear, conventional commit messages that follow best practices.

    ## Instructions

    This skill teaches how to write effective git commit messages
    following the Conventional Commits specification.

    ## Format

    Commit messages should follow this structure:

        <type>(<scope>): <subject>

        <body>

        <footer>

    ## Types

    - `feat`: New feature
    - `fix`: Bug fix
    - `docs`: Documentation changes
    ...

    ## Examples

    ### Simple feature commit

        feat(auth): add password reset functionality

    ### Bug fix with explanation

        fix(api): handle null response from user service

        The user service can return null when not found.
        Added proper null checking to prevent crashes.

        Closes #123

By default, upskill logs all runs to ./runs/. Each run creates:
    ./runs/
    ├── 2025_01_21_15_30/           # Batch folder (timestamp)
    │   ├── run_1/
    │   │   ├── run_metadata.json   # Model, task, timing
    │   │   └── run_result.json     # Pass/fail, assertions, tokens
    │   ├── run_2/
    │   │   └── ...
    │   └── batch_summary.json      # Aggregate results
    └── results.csv                 # Summary CSV (after `upskill runs`)

Disable with --no-log-runs.
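
Because the run logs are plain JSON files in a predictable layout, they can also be inspected outside of `upskill runs`. The following is a rough sketch, assuming only the directory layout shown above. The function name `load_run_results` is hypothetical, and since the key names inside `run_result.json` are not documented here, each payload is kept as a raw dictionary instead of being interpreted.

```python
import json
from pathlib import Path


def load_run_results(runs_dir="./runs"):
    """Collect every run_result.json found under <runs_dir>/<batch>/run_*/.

    Returns a list of dicts holding the batch folder name, the run folder
    name, and the parsed JSON payload (left uninterpreted on purpose).
    """
    results = []
    root = Path(runs_dir)
    # sorted() gives a stable, lexicographic ordering by path.
    for result_path in sorted(root.glob("*/run_*/run_result.json")):
        with result_path.open(encoding="utf-8") as fh:
            payload = json.load(fh)
        results.append(
            {
                "batch": result_path.parent.parent.name,
                "run": result_path.parent.name,
                "result": payload,
            }
        )
    return results
```

For example, `len(load_run_results("./runs"))` gives the number of logged runs across all batches.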
upskill configuration:

    model: sonnet              # Default generation model
    eval_model: haiku          # Default evaluation model (optional)
    skills_dir: ./skills       # Where to save skills
    runs_dir: ./runs           # Where to save run logs
    max_refine_attempts: 3     # Refinement iterations

Place in your project directory to customize FastAgent settings:
    default_model: sonnet

    logger:
      progress_display: true
      show_chat: false
      streaming: markdown

    # MCP servers (optional)
    mcp:
      servers:
        fetch:
          command: "uvx"
          args: ["mcp-server-fetch"]

Environment variables:

    # Required for Anthropic models
    ANTHROPIC_API_KEY=sk-ant-...

    # Required for OpenAI models
    OPENAI_API_KEY=sk-...

    # Optional: custom endpoints
    ANTHROPIC_BASE_URL=http://localhost:8080
    OPENAI_API_BASE=http://localhost:11434/v1

    # For local models (generic provider)
    GENERIC_BASE_URL=http://localhost:8080/v1
    GENERIC_API_KEY=local  # Optional, defaults to "local"

Python API:

    import asyncio

    from upskill import (
        Config,
        evaluate_skill,
        generate_skill,
        generate_tests,
        refine_skill,
    )
    from upskill.evaluate import get_failure_descriptions


    async def main():
        # Load configuration
        config = Config.load()

        # Generate a skill
        skill = await generate_skill(
            "parse JSON Schema files",
            model="sonnet",
            config=config,
        )

        # Generate test cases
        tests = await generate_tests("parse JSON Schema files")

        # Evaluate the skill
        results = await evaluate_skill(
            skill,
            tests,
            model="haiku",
            config=config,
        )
        print(f"Skill lift: {results.skill_lift:.0%}")
        print(f"Token savings: {results.token_savings:.0%}")
        print(f"Is beneficial: {results.is_beneficial}")

        # Refine based on failures
        if not results.is_beneficial:
            failures = get_failure_descriptions(results)
            improved_skill = await refine_skill(skill, failures)


    # The API functions are coroutines, so run them inside an event loop
    asyncio.run(main())

upskill uses FastAgent model format:
    <provider>.<model>.<reasoning_effort?>

Examples:

- `sonnet` - Anthropic Claude Sonnet (alias)
- `haiku` - Anthropic Claude Haiku (alias)
- `opus` - Anthropic Claude Opus (alias)
- `anthropic.claude-sonnet-4-20250514` - Full model name
- `openai.gpt-4.1` - OpenAI GPT-4.1
- `openai.o3-mini.low` - OpenAI o3-mini with low reasoning effort
- `generic.llama3.2:latest` - Local model via Ollama
- `generic.my-model` - Local model via llama.cpp or other OpenAI-compatible server
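
The examples above show that model names can themselves contain dots (`gpt-4.1`, `llama3.2:latest`), so the format cannot be split on every dot. The sketch below illustrates one way to read such strings; it is not FastAgent's actual parser. The names `ModelSpec` and `parse_model_string` are hypothetical, and the set of accepted effort suffixes is an assumption, since only `low` appears in the examples.

```python
from typing import NamedTuple, Optional

# Assumed effort suffixes; the documented examples only show "low".
ASSUMED_EFFORTS = {"low", "medium", "high"}


class ModelSpec(NamedTuple):
    provider: Optional[str]          # None for bare aliases such as "sonnet"
    model: str
    reasoning_effort: Optional[str]  # None when no effort suffix is present


def parse_model_string(spec: str) -> ModelSpec:
    """Split '<provider>.<model>.<reasoning_effort?>' (illustrative only)."""
    if "." not in spec:
        # Bare alias: no provider prefix, no effort suffix.
        return ModelSpec(None, spec, None)
    # The provider is everything before the first dot.
    provider, rest = spec.split(".", 1)
    effort = None
    if "." in rest:
        head, tail = rest.rsplit(".", 1)
        # Treat the last segment as an effort only if it is a known value,
        # so that "gpt-4.1" keeps its trailing ".1".
        if tail in ASSUMED_EFFORTS:
            rest, effort = head, tail
    return ModelSpec(provider, rest, effort)
```

A caveat of this approach: a bare model name that contains a dot, such as `llama3.2:latest` passed together with `--base-url`, would be mis-split here, so the sketch only makes sense for aliases and fully qualified strings.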
upskill supports local models through any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM, etc.).
Quick start with Ollama:
    # Start Ollama (default port 11434)
    ollama serve

    # Evaluate with a local model
    upskill eval ./skills/my-skill/ \
        --model llama3.2:latest \
        --base-url http://localhost:11434/v1

With llama.cpp server:
    # Start llama.cpp server
    ./llama-server -m model.gguf --port 8080

    # Evaluate with the local model
    upskill eval ./skills/my-skill/ \
        --model my-model \
        --base-url http://localhost:8080/v1

When --base-url is provided, the provider is automatically set to generic unless you specify --provider explicitly.
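
The provider rule just described can be written down as a small decision function. This is only an illustration of the stated precedence, not upskill's implementation: the name `resolve_provider` is hypothetical, and returning `None` when neither option is given is an assumption standing in for whatever default detection the tool applies.

```python
from typing import Optional


def resolve_provider(provider: Optional[str], base_url: Optional[str]) -> Optional[str]:
    """Mirror the documented precedence between --provider and --base-url."""
    if provider is not None:
        # An explicit --provider always wins.
        return provider
    if base_url:
        # --base-url without --provider implies the generic provider.
        return "generic"
    # Neither option given: left to the tool's own default (assumption).
    return None
```

Under this sketch, `resolve_provider(None, "http://localhost:8080/v1")` yields `"generic"`, while `resolve_provider("openai", "http://localhost:11434/v1")` keeps `"openai"`.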