Automated search framework for rubric-based reward modeling, inspired by *Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training*.

This repository keeps the `iterative` searcher as a baseline and uses the `evolutionary` searcher as the default search mode.
- Unified runtime configuration with typed enums (`autosr.types`) and layered config dataclasses (`autosr.config`)
- Composition-root factory (`ComponentFactory`) for backend-aware dependency wiring
- Canonical domain models in `autosr.data_models`, with a compatibility re-export in `autosr.models`
- Search extensibility:
  - Parent selection: `rank`, `tournament`, `top_k`
  - Adaptive mutation: `fixed`, `success_feedback`, `exploration_decay`, `diversity_driven`
  - Iteration scope: `global_batch` (dataset-level) and `prompt_local` (prompt-level independent evolution)
- LLM architecture split into transport config (`autosr.llm_config`) and runtime config (`autosr.config`)
- Reproducibility outputs: a `run_manifest` embedded in the output JSON, plus an archived manifest and replay script under `<output_parent>/run_records/`
- `autosr/cli.py`
  - Parses CLI args only
  - Builds `RuntimeConfig`
  - Delegates runtime wiring to `ComponentFactory`
- `autosr/factory.py`
  - Single composition root for backend selection and component assembly
  - Auto-resolves the rank-based judge when all candidates provide `metadata.rank`
- `autosr/config.py`
  - Runtime-level configuration: `RuntimeConfig`, `LLMBackendConfig`, `SearchAlgorithmConfig`, `ObjectiveConfig` (compat alias: `ObjectiveFunctionConfig`), `InitializerStrategyConfig`, `ContentExtractionConfig`, `VerifierConfig`
- `autosr/llm_config.py`
  - Low-level LLM transport/model config (`LLMConfig`, `RoleModelConfig`)
- `autosr/types.py`
  - Shared enums: `BackendType`, `SearchMode`, `EvolutionIterationScope`, `SelectionStrategy`, `AdaptiveMutationSchedule`, `InitializerStrategy`, `ExtractionStrategy`, `LLMRole`
- `autosr/data_models.py`: canonical domain entities (`Rubric`, `Criterion`, `PromptExample`, ...)
- `autosr/models.py`: compatibility import layer
- `autosr/exceptions.py`: shared LLM exceptions (`LLMCallError`, `LLMParseError`)
- `autosr/io_utils.py`: dataset/rubric I/O and run-record persistence
- `autosr/run_records/use_cases.py`: run manifest + reproducible shell script generation
- `autosr/search/config.py`: `IterativeConfig`, `EvolutionaryConfig`, `SearchResult`
- `autosr/search/iterative.py`: iterative baseline implementation
- `autosr/search/evolutionary.py`: evolutionary algorithm implementation
- `autosr/search/strategies.py`: reusable search helpers
- `autosr/search/selection_strategies.py`: parent selection policies
- `autosr/search/adaptive_mutation.py`: mutation scheduler and diversity metrics
- `autosr/search/use_cases.py`: searcher entrypoint exports
- `autosr/llm_components/base.py`: request/retry base + prompt rendering fallback
- `autosr/llm_components/parsers.py`: response normalization/validation
- `autosr/llm_components/use_cases.py`: initializer/proposer/verifier/judge implementations
- `autosr/llm_components/factory.py`: legacy helper kept for compatibility
- `autosr/content_extraction/strategies.py`: `tag`/`regex`/`identity` extraction
- `autosr/content_extraction/use_cases.py`: extraction-decorated verifier
- `autosr/prompts/loader.py` + `autosr/prompts/constants.py`: file templates and constant fallback
- `autosr/`: core package
- `prompts/`: prompt templates (supports locale folders such as `prompts/zh/` and `prompts/en/`)
- `tests/`: `unittest` test suite
- `scripts/`: unit/integration/formal run scripts
- `examples/`: demo datasets and examples
- `artifacts/`: default output directory
Requires Python >=3.11 and uv.
Install dependencies:

```bash
uv sync
```

Run commands with `uv run`:

```bash
uv run python -m autosr.cli --help
```

Default (evolutionary):

```bash
uv run python -m autosr.cli \
  --dataset examples/demo_dataset.json \
  --mode evolutionary \
  --output artifacts/best_rubrics.json
```

Iterative baseline:

```bash
uv run python -m autosr.cli \
  --dataset examples/single_case.json \
  --mode iterative \
  --output artifacts/best_rubrics_iterative.json
```

Evolutionary with custom strategy and prompt locale:

```bash
uv run python -m autosr.cli \
  --dataset examples/single_case_with_rank.json \
  --mode evolutionary \
  --output artifacts/best_rubrics_rank.json \
  --selection-strategy top_k \
  --adaptive-mutation diversity_driven \
  --prompt-language zh
```

`--backend {auto,mock,llm}`:

- `auto` (default): resolves to `llm` when an API key exists, otherwise `mock`
- `llm`: requires an API key (`LLM_API_KEY` by default, configurable via `--api-key-env`)
- `mock`: local heuristic components only
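Conceptually, the `auto` resolution rule reduces to a few lines. The sketch below is illustrative only; `resolve_backend` and its signature are assumptions, not the project's actual API:

```python
import os

def resolve_backend(requested: str, api_key_env: str = "LLM_API_KEY") -> str:
    """Resolve the effective backend: `auto` picks `llm` only when the key is set."""
    if requested == "auto":
        return "llm" if os.environ.get(api_key_env) else "mock"
    if requested == "llm" and not os.environ.get(api_key_env):
        # mirror the documented requirement: `llm` cannot run without a key
        raise RuntimeError(f"backend 'llm' requires the {api_key_env} environment variable")
    return requested
```

This keeps local development friction-free: without a key, everything silently falls back to the heuristic `mock` components.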
Default endpoint/model:

- `--base-url https://openrouter.ai/api/v1`
- `--model-default stepfun/step-3.5-flash:free`

Role-specific model overrides are supported:

- `--model-initializer`, `--model-proposer`, `--model-verifier`, `--model-judge`

Prompt locale loading order:

1. `prompts/<language>/` (when `--prompt-language` is set)
2. `prompts/`
3. built-in constants in code
Formal LLM-backed flow:

```bash
export LLM_API_KEY="..."
./scripts/run_formal_search.sh \
  examples/call_summary_dataset_with_rank_single.json \
  evolutionary \
  artifacts/best_rubrics_formal_call_summary.json
```

Note:

- `scripts/run_formal_search.sh` now defaults to `--evolution-iteration-scope prompt_local`
- Override with an environment variable if needed:

```bash
EVOLUTION_ITERATION_SCOPE=global_batch ./scripts/run_formal_search.sh
```
Objective:

```
score = TailAcc - lambda_var * TailVar + mu_diverse * DiverseTailAcc
```
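As a rough illustration, a toy version of this objective might look like the following. The exact definitions of `TailAcc`, `TailVar`, and `DiverseTailAcc` live in the searcher; here the tail is assumed to be the hardest `tail_fraction` of per-prompt accuracies, and `DiverseTailAcc` is taken as a precomputed input:

```python
from statistics import mean, pvariance

def objective_score(per_prompt_acc, tail_fraction=0.2, lambda_var=0.1,
                    mu_diverse=0.1, diverse_tail_acc=0.0):
    """Toy objective: reward accuracy on the hardest prompts, penalize its variance."""
    k = max(1, int(len(per_prompt_acc) * tail_fraction))
    tail = sorted(per_prompt_acc)[:k]          # lowest-scoring (hardest) prompts
    tail_acc = mean(tail)
    tail_var = pvariance(tail) if len(tail) > 1 else 0.0
    return tail_acc - lambda_var * tail_var + mu_diverse * diverse_tail_acc
```

The variance penalty discourages rubrics that do well on some hard prompts only by collapsing on others.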
Common flags:

- `--generations`, `--population-size`, `--mutations-per-round`, `--batch-size`
- `--mutation-parent-count` (number of parent rubrics used for mutation each generation)
- `--tail-fraction`, `--lambda-var`, `--mu-diverse`
- `--pair-confidence-prior` (pairwise confidence shrinkage; set `0` to disable)
- `--selection-strategy {rank,tournament,top_k}`
- `--adaptive-mutation {fixed,success_feedback,exploration_decay,diversity_driven}`
- `--evolution-iteration-scope {global_batch,prompt_local}`
- `--stop-when-distinguished` / `--no-stop-when-distinguished` (prompt-local early stop)
- `--distinguish-margin` (overrides the top-margin threshold; default uses the objective tie tolerance)
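The three parent-selection strategies can be sketched over a list of scored rubrics. This is a conceptual sketch under common interpretations of these strategies, not the code in `autosr/search/selection_strategies.py`:

```python
import random

def select_parents(scored, strategy="top_k", k=2, rng=None):
    """scored: list of (rubric, score) pairs; returns k parent rubrics.

    top_k      - deterministically take the k best by score
    rank       - sample with weight proportional to rank position (best = heaviest)
    tournament - repeatedly keep the best of a small random subset
    """
    rng = rng or random.Random(0)
    ordered = sorted(scored, key=lambda rs: rs[1], reverse=True)
    if strategy == "top_k":
        return [r for r, _ in ordered[:k]]
    if strategy == "rank":
        weights = [len(ordered) - i for i in range(len(ordered))]
        return [r for r, _ in rng.choices(ordered, weights=weights, k=k)]
    if strategy == "tournament":
        parents = []
        for _ in range(k):
            entrants = rng.sample(ordered, min(3, len(ordered)))
            parents.append(max(entrants, key=lambda rs: rs[1])[0])
        return parents
    raise ValueError(f"unknown strategy: {strategy}")
```

`top_k` exploits hardest; `rank` and `tournament` trade some exploitation for diversity in the parent pool.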
Verifier grading scale:

- Supports continuous criterion scores in `0-5` (preferred) and `0-1` (compatible).
- The final rubric score is normalized to `0-1` before objective computation.
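A minimal sketch of this normalization, assuming a simple weighted mean over criteria (the actual verifier's aggregation may differ):

```python
def normalize_rubric_score(criterion_scores, weights=None, scale_max=5.0):
    """Map per-criterion scores on a 0..scale_max scale to a single 0-1 rubric score."""
    weights = weights or [1.0] * len(criterion_scores)
    # normalize each criterion to 0-1, then take the weighted mean
    total = sum(w * (s / scale_max) for w, s in zip(weights, criterion_scores))
    return total / sum(weights)
```

Pass `scale_max=1.0` for datasets already graded on the compatible `0-1` scale.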
Iteration behavior:

- `global_batch`:
  - original dataset-level generations
  - each generation evolves only the selected hard prompts (`batch_size`)
- `prompt_local`:
  - each prompt evolves independently for up to `generations` generations
  - no cross-prompt batching dependency
  - can stop early per prompt when the top candidates are already distinguished
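The `prompt_local` loop with early stopping can be pictured roughly as follows. This is a toy sketch: `evolve_step` is a hypothetical callback, and the top-two-margin stop rule is an assumption standing in for the real `--distinguish-margin` logic:

```python
def evolve_prompt_local(prompts, evolve_step, generations=5, distinguish_margin=0.05):
    """Run an independent evolution loop per prompt, stopping early when distinguished.

    evolve_step(prompt, generation) -> list of (candidate_id, score), best first.
    """
    results = {}
    for prompt in prompts:
        ranking = []
        for gen in range(generations):
            ranking = evolve_step(prompt, gen)
            # stop once the best candidate clearly beats the runner-up
            if len(ranking) >= 2 and ranking[0][1] - ranking[1][1] > distinguish_margin:
                break
        results[prompt] = ranking
    return results
```

Because prompts never wait on each other, early stops translate directly into saved LLM calls.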
Input JSON must contain a top-level `prompts` array:

```json
{
  "prompts": [
    {
      "prompt_id": "p1",
      "prompt": "Write ...",
      "candidates": [
        {
          "candidate_id": "c1",
          "text": "response text",
          "source": "strong",
          "metadata": { "quality": 0.91, "rank": 1 }
        }
      ]
    }
  ]
}
```

Notes:

- `prompt_id` and `prompt` are required
- each prompt must provide at least 2 candidates
- `metadata.rank` is optional (`1` is best); if it is present for all candidates, the rank-based judge is auto-selected
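The invariants above can be checked before a run. A minimal sketch (the `validate_dataset` helper is illustrative; the project's own loader lives in `autosr/io_utils.py`):

```python
import json

def validate_dataset(raw: str):
    """Check the documented dataset invariants; returns (data, all_ranked)."""
    data = json.loads(raw)
    prompts = data.get("prompts")
    if not isinstance(prompts, list) or not prompts:
        raise ValueError("dataset must contain a non-empty top-level 'prompts' array")
    for entry in prompts:
        for field in ("prompt_id", "prompt"):
            if field not in entry:
                raise ValueError(f"prompt entry is missing required field {field!r}")
        if len(entry.get("candidates", [])) < 2:
            raise ValueError(f"{entry['prompt_id']}: each prompt needs at least 2 candidates")
    # the rank-based judge is auto-selected only when every candidate carries metadata.rank
    all_ranked = all(
        "rank" in (cand.get("metadata") or {})
        for entry in prompts
        for cand in entry["candidates"]
    )
    return data, all_ranked
```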
The main output JSON (`--output`) includes:

- `best_rubrics` (array; each item may include `best_candidate_id` and `candidate_scores`)
- `best_objective_scores`
- `best_scores` (legacy alias of `best_objective_scores`)
- optional `run_manifest`

Per-run reproducibility files are written to:

- `<output_parent>/run_records/<output_stem>_<run_id>.manifest.json`
- `<output_parent>/run_records/<output_stem>_<run_id>.reproduce.sh`
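The record paths derive mechanically from the `--output` path, as in this sketch (`run_record_paths` is an illustrative helper, not the project's API):

```python
from pathlib import Path

def run_record_paths(output: str, run_id: str) -> tuple[Path, Path]:
    """Derive the manifest and replay-script paths from the --output path."""
    out = Path(output)
    records = out.parent / "run_records"   # <output_parent>/run_records/
    stem = out.stem                        # output filename without extension
    return (records / f"{stem}_{run_id}.manifest.json",
            records / f"{stem}_{run_id}.reproduce.sh")
```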
Unit tests:

```bash
./scripts/run_tests_unit.sh
```

Integration tests (requires an API key):

```bash
export LLM_API_KEY="..."
./scripts/run_tests_integration.sh
```

Aggregate entrypoint:

```bash
./scripts/run_tests.sh
```

Run all tests directly:

```bash
uv run python -m unittest discover -s tests -p "test_*.py"
```

Architecture-focused regression set:

```bash
uv run python -m unittest \
  tests.test_architecture_refactor \
  tests.test_cli_backend_selection \
  tests.test_cli_best_candidates \
  tests.test_io_utils \
  tests.test_search_config_enum_unification \
  tests.test_data_models_compat \
  tests.test_exceptions_module \
  tests.test_evolutionary_decoupling
```

- Import domain entities from `autosr.data_models` in new code.
- Prefer `ComponentFactory(RuntimeConfig(...))` over manual runtime wiring.
- Keep secrets in environment variables only (`LLM_API_KEY`, optional `LLM_BASE_URL`, `LLM_MODEL`).