A benchmarking suite for comparing request router implementations in Ray's LLM serving system. It compares Power-of-2 routers against Prefix-Aware routers across a range of replica counts.
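For readers unfamiliar with the two strategies, here is a minimal, generic sketch of the ideas behind them. It is not Ray's implementation: the function names, the `queue_lens` and `cached_prefixes` dictionaries, and the character-level prefix match are all assumptions made for illustration.

```python
import random
from typing import Dict


def pick_power_of_two(queue_lens: Dict[str, int], rng: random.Random) -> str:
    """Sample two distinct replicas at random and return the less loaded one."""
    replicas = list(queue_lens)
    if len(replicas) == 1:
        return replicas[0]
    a, b = rng.sample(replicas, 2)
    return a if queue_lens[a] <= queue_lens[b] else b


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_prefix_aware(
    prompt: str,
    cached_prefixes: Dict[str, str],  # hypothetical: replica -> text it has cached
    queue_lens: Dict[str, int],
    rng: random.Random,
) -> str:
    """Prefer the replica sharing the longest prefix with the prompt.

    Falls back to power-of-two selection when no replica shares any prefix.
    """
    if cached_prefixes:
        best = max(
            cached_prefixes,
            key=lambda r: shared_prefix_len(prompt, cached_prefixes[r]),
        )
        if shared_prefix_len(prompt, cached_prefixes[best]) > 0:
            return best
    return pick_power_of_two(queue_lens, rng)
```

The trade-off the benchmark explores follows from this shape: power-of-two selection balances load using only two random probes per request, while prefix-aware selection trades some balance for a better chance of reusing cached prompt prefixes on a replica.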

Directory layout:

    custom-router-api-benchmarks/
    ├── scripts/
    │   ├── engine_metrics.py            # Metrics collection and parsing from Ray /metrics endpoint
    │   ├── sweep_replicas.py            # Main replica scaling benchmark script
    │   └── visualize_replica_sweep.py   # Visualization and analysis of results
    └── results/
        ├── replica_sweep/               # Current benchmark results (JSON + raw metrics)
        └── visualizations/              # Generated plots and charts

Run the replica sweep:

    cd scripts/
    python sweep_replicas.py

Visualize the results:

    cd scripts/
    python visualize_replica_sweep.py

Setup:

- Ray cluster, see k8s install steps here
- Docker image: rayproject/ray-llm:nightly-py311-cu128
- Ray nightly wheel
- ANYSCALE_RAY_SERVE_THROUGHPUT_OPT=1
- RAYLLM_ROUTER_TO_MODEL_REPLICA_RATIO=8