feat(router): health-check-driven routing #24678
Sameerlite wants to merge 3 commits into BerriAI:litellm_team-model-group-name-routing-fix from
Conversation
Background health checks now feed deployment health state into the router candidate-filtering pipeline. Unhealthy deployments are excluded proactively instead of waiting for request failures to trigger cooldown. Gated by `enable_health_check_routing: true` in general_settings. Off by default — zero behavior change for existing users. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR introduces opt-in health-check-driven routing.

Confidence Score: 4/5 — safe to merge with one fix. The P1 staleness default bug causes the feature to behave differently than documented for most common configurations, though it never causes an outage. One P1 defect remains: the staleness threshold falls back to the constant default interval (300s × 2 = 600s) instead of the configured interval, producing a 5–10× longer staleness window than documented. All prior P0/P1 feedback has been addressed. The P2 span propagation gap is observability-only and does not affect routing correctness.

Files to review: litellm/proxy/proxy_server.py (staleness default derivation at lines 3311–3314) and docs/my-website/docs/proxy/health.md
| Filename | Overview |
|---|---|
| litellm/router_utils/health_state_cache.py | New DeploymentHealthCache class — stores per-deployment health state with staleness enforcement; parent_otel_span accepted but not forwarded to cache calls (P2) |
| litellm/router.py | Adds health-check filter at 3 routing paths (async, sync, pass-through); default staleness falls back to DEFAULT_HEALTH_CHECK_INTERVAL * 2 (600s) instead of the configured interval |
| litellm/proxy/proxy_server.py | Config parsing reads health_check_interval but never passes it to the router when health_check_staleness_threshold is absent, causing the documented default to be wrong by up to 5× |
| litellm/proxy/health_check.py | Adds model_id re-attachment after _clean_endpoint_data (fixes prior review issue) and new build_deployment_health_states() helper; logic is sound |
| docs/my-website/docs/proxy/health.md | New docs section for health-check-driven routing; the documented default staleness ("health_check_interval * 2") does not match the current implementation default (600s fixed) |
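The P1 flagged in the table can be illustrated with a small sketch of the intended derivation: when `health_check_staleness_threshold` is absent, the threshold should fall back to the *configured* `health_check_interval`, not the constant default. The function and constant values below are illustrative assumptions, not litellm's actual API; only the config field names come from this PR.

```python
# Hypothetical sketch of the staleness-threshold derivation the review asks for.
DEFAULT_HEALTH_CHECK_INTERVAL = 300  # assumed constant default, in seconds
DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = 2

def resolve_staleness_threshold(general_settings: dict) -> float:
    """Derive the staleness threshold (seconds) from proxy config."""
    explicit = general_settings.get("health_check_staleness_threshold")
    if explicit is not None:
        return float(explicit)
    # Fall back to the configured interval, not the constant default, so
    # health_check_interval: 60 yields a 120s window instead of 600s.
    interval = general_settings.get(
        "health_check_interval", DEFAULT_HEALTH_CHECK_INTERVAL
    )
    return float(interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER)
```

Under this sketch, `health_check_interval: 60` gives a 120s staleness window; the reviewed implementation returns 600s regardless, which is the documented-vs-actual mismatch.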
Sequence Diagram

```mermaid
sequenceDiagram
    participant BG as Background Health Check Loop
    participant HC as health_check.py
    participant PS as proxy_server.py
    participant DC as DeploymentHealthCache
    participant Router as Router filter methods
    BG->>HC: _perform_health_check(model_list)
    HC-->>BG: healthy_endpoints, unhealthy_endpoints (with model_id)
    BG->>PS: _write_health_state_to_router_cache(healthy, unhealthy)
    PS->>HC: build_deployment_health_states(healthy, unhealthy)
    HC-->>PS: states dict {model_id -> {is_healthy, timestamp, reason}}
    PS->>DC: set_deployment_health_states(states)
    DC->>DC: cache.set_cache(CACHE_KEY, states, ttl=staleness*1.5)
    Note over Router: On each routing request
    Router->>DC: get_unhealthy_deployment_ids()
    DC->>DC: cache.get_cache(CACHE_KEY)
    DC->>DC: _extract_unhealthy_ids (staleness filter)
    DC-->>Router: Set[unhealthy_model_ids]
    Router->>Router: filter out unhealthy deployments
    alt All candidates unhealthy
        Router->>Router: Safety net: return all candidates
    end
    Router-->>Router: filtered healthy_deployments to cooldown filter
```

Reviews (3): Last reviewed commit: "fix: revert accidental _litellm_uuid imp..."
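The routing-side steps in the diagram can be sketched as below. The function name, the candidate dict shape (`model_info.id`), and the per-deployment state shape (`is_healthy`, `timestamp`) are assumptions taken from the diagram, not the PR's actual implementation.

```python
import time

def filter_unhealthy_deployments(candidates, health_states, staleness_threshold):
    """Drop candidates marked unhealthy by a sufficiently fresh health check."""
    now = time.time()
    unhealthy = {
        model_id
        for model_id, state in health_states.items()
        if not state["is_healthy"]
        and (now - state["timestamp"]) <= staleness_threshold
    }
    healthy = [d for d in candidates if d["model_info"]["id"] not in unhealthy]
    # Safety net from the diagram: if every candidate is marked unhealthy,
    # return all of them rather than failing the request outright.
    return healthy or candidates
```

The filtered list then feeds the existing cooldown filter, so health-check routing narrows the candidate set before any failure-driven logic runs.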
litellm/proxy/health_check.py Outdated
```diff
             healthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
         elif isinstance(is_healthy, dict):
-            unhealthy_endpoints.append(
-                _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-            )
+            endpoint_data = {**litellm_params, **is_healthy}
+            if _model_id:
+                endpoint_data["model_id"] = _model_id
+            unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
         else:
-            unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))
+            endpoint_data = {**litellm_params}
+            if _model_id:
+                endpoint_data["model_id"] = _model_id
+            unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
```
model_id is silently stripped when health_check_details: False
model_id is added to endpoint_data and then immediately passed through _clean_endpoint_data(endpoint_data, details). When details=False (i.e., health_check_details: False in config), _clean_endpoint_data only keeps fields in MINIMAL_DISPLAY_PARAMS = ["model", "mode_error"] — so model_id is silently dropped.
Downstream, build_deployment_health_states() looks for model_id in each endpoint dict; when it's missing, every deployment is skipped and states is empty. The router cache is never written, and health-check-driven routing silently does nothing — with no warning logged to the user.
The fix is to re-attach model_id after the clean call:
```python
if isinstance(is_healthy, dict) and "error" not in is_healthy:
    endpoint_data = {**litellm_params, **is_healthy}
    cleaned = _clean_endpoint_data(endpoint_data, details)
    if _model_id:
        cleaned["model_id"] = _model_id  # re-attach after cleaning
    healthy_endpoints.append(cleaned)
```
Fixed in 62ebcde — model_id is now re-attached after _clean_endpoint_data() so it survives health_check_details: False.
```python
        except Exception as e:
            verbose_proxy_logger.debug(
                "Failed to write health state to router cache: %s", str(e)
            )
```
Health state write failures silently swallowed at debug level
Errors in _write_health_state_to_router_cache are caught and logged only at debug level, making any failure completely invisible during normal operation.
```diff
         except Exception as e:
-            verbose_proxy_logger.debug(
+            verbose_proxy_logger.warning(
                 "Failed to write health state to router cache: %s", str(e)
             )
```
Fixed in 62ebcde — bumped to warning level.
```python
    are treated as healthy (unknown).
    """

    CACHE_KEY = "litellm:health_check:deployment_health_state"
```
Single global cache key causes last-writer-wins collision in shared-Redis multi-instance deployments
CACHE_KEY is a fixed string. When multiple proxy instances share a Redis-backed DualCache, each instance's write replaces the previous instance's data. Each pod's health state view will only reflect whichever pod wrote last, making the feature unreliable in Redis-coordinated multi-pod deployments.
Consider namespacing the key with a pod/instance identifier, or merging into the existing dict rather than replacing it.
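One possible shape of the namespacing suggestion, with a hypothetical key builder (the hostname/pid choice is illustrative; any stable per-pod identifier would do):

```python
import os
import socket

CACHE_KEY_PREFIX = "litellm:health_check:deployment_health_state"

def instance_cache_key() -> str:
    # Namespace the key per instance so pods sharing a Redis-backed
    # DualCache don't overwrite each other's health-state snapshots.
    return f"{CACHE_KEY_PREFIX}:{socket.gethostname()}:{os.getpid()}"
```

Readers on each pod would then union the unhealthy ids across all instance keys (or, per the alternative above, merge into one dict under the shared key instead of replacing it).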
This is intentional — same pattern as the existing shared health check manager which writes to a single Redis key. When use_shared_health_check: true is enabled, only one pod runs health checks at a time (via distributed lock), so there's no concurrent-write collision. When shared health checks are disabled, each pod uses its own in-memory cache anyway. No change needed here.
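The single-writer argument can be sketched like this: with a distributed lock, only the lock holder runs checks and writes the key, so last-writer-wins never races. The class below is a simplified, non-atomic illustration (a real implementation would use an atomic Redis SET NX with a TTL), not the existing shared health check manager:

```python
class SharedHealthCheckCoordinator:
    """Simplified sketch: only the pod holding the lock writes health state."""

    def __init__(self, cache: dict, lock_key: str = "litellm:health_check:lock"):
        self.cache = cache
        self.lock_key = lock_key

    def try_acquire(self, pod_id: str) -> bool:
        # Non-atomic stand-in for Redis SET NX; re-entrant for the current
        # holder so it can refresh the lock between check cycles.
        holder = self.cache.get(self.lock_key)
        if holder is None or holder == pod_id:
            self.cache[self.lock_key] = pod_id
            return True
        return False
```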
- model_id is now added after _clean_endpoint_data() so it survives health_check_details: False (MINIMAL_DISPLAY_PARAMS filtering)
- Health state write failures logged at warning instead of debug

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
litellm/router.py Outdated
```python
import litellm.litellm_core_utils
import litellm.litellm_core_utils.exception_mapping_utils
from litellm import get_secret_str
from litellm._litellm_uuid import uuid
```
_litellm_uuid module doesn't exist — causes ImportError at startup
litellm._litellm_uuid was never created in this PR (and doesn't exist anywhere in the repo's git history). Both router.py and proxy_server.py now import from it, so the proxy will fail to start entirely: every import litellm.router or import litellm.proxy.proxy_server raises ModuleNotFoundError: No module named 'litellm._litellm_uuid'.
The file that actually exists is litellm/_uuid.py and is used by ~175 other files in the codebase. The import in router.py needs to be reverted to match it:
```diff
-from litellm._litellm_uuid import uuid
+from litellm._uuid import uuid
```
The same fix is also needed in litellm/proxy/proxy_server.py line 40.
Fixed in 5f11c7c — reverted both router.py and proxy_server.py back to from litellm._uuid import uuid. The _litellm_uuid import was accidentally pulled in by the isort pre-commit hook from an unrelated staged rename in the working directory.
The isort hook picked up a stale rename from the working directory. Both router.py and proxy_server.py need litellm._uuid, not _litellm_uuid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Gated by `enable_health_check_routing: true` in `general_settings` — off by default, zero behavior change for existing users.

Changes

- `litellm/router_utils/health_state_cache.py` — new `DeploymentHealthCache` class
- `litellm/router.py` — new filter methods + pipeline insertion at all 3 routing paths
- `litellm/proxy/health_check.py` — `build_deployment_health_states()` + `model_id` in endpoint data
- `litellm/proxy/proxy_server.py` — writes health state after each background check cycle, config parsing
- `litellm/constants.py` — `DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER`
- `docs/my-website/docs/proxy/health.md` — documentation

Config
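A hypothetical config sketch enabling the feature; `enable_health_check_routing`, `health_check_interval`, and `health_check_staleness_threshold` are the field names used in this PR, while the values shown are illustrative only:

```yaml
general_settings:
  enable_health_check_routing: true        # off by default
  health_check_interval: 60                # seconds between background checks
  # health_check_staleness_threshold: 120  # optional; documented default is interval * 2
```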
Test plan

- `DeploymentHealthCache` — staleness, empty cache, malformed entries