
feat(router): health-check-driven routing #24678

Open

Sameerlite wants to merge 3 commits into BerriAI:litellm_team-model-group-name-routing-fix from Sameerlite:litellm_health-check-driven-routing

Conversation


@Sameerlite (Contributor) commented on Mar 27, 2026

Summary

  • Background health checks now feed deployment health state into the router candidate-filtering pipeline
  • Unhealthy deployments are excluded proactively instead of waiting for request failures to trigger cooldown
  • Gated behind enable_health_check_routing: true in general_settings; off by default, zero behavior change for existing users
  • Safety net: if all deployments are unhealthy, filter is bypassed (never causes total outage)
  • Staleness handling: stale health state is ignored, falls back to cooldown-only behavior
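The filtering and safety-net behavior described above can be sketched as follows. This is an illustrative sketch only; the function and field names (filter_unhealthy_deployments, model_info.id) are assumptions, not the actual litellm router API.

```python
# Hypothetical sketch of the candidate-filtering step: drop deployments the
# background health check marked unhealthy, but never empty the candidate
# list entirely (the "safety net" from the summary above).
from typing import Dict, List, Set


def filter_unhealthy_deployments(
    candidates: List[Dict], unhealthy_ids: Set[str]
) -> List[Dict]:
    """Remove candidates whose deployment id is marked unhealthy.

    Safety net: if every candidate is unhealthy, return the full list
    unchanged so the filter can never cause a total outage.
    """
    healthy = [
        d for d in candidates
        if d.get("model_info", {}).get("id") not in unhealthy_ids
    ]
    return healthy if healthy else candidates
```

The cooldown filter then runs on the (possibly reduced) candidate list as before.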

Changes

  • litellm/router_utils/health_state_cache.py — new DeploymentHealthCache class
  • litellm/router.py — new filter methods + pipeline insertion at all 3 routing paths
  • litellm/proxy/health_check.py — build_deployment_health_states() + model_id in endpoint data
  • litellm/proxy/proxy_server.py — writes health state after each background check cycle, config parsing
  • litellm/constants.py — DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER
  • docs/my-website/docs/proxy/health.md — documentation

Config

```yaml
general_settings:
  background_health_checks: true
  health_check_interval: 60
  enable_health_check_routing: true      # opt-in
  health_check_staleness_threshold: 120  # optional, default = interval * 2
```

Test plan

  • 8 unit tests for DeploymentHealthCache (staleness, empty cache, malformed entries)
  • 8 unit tests for router filter (unhealthy removal, safety net, disabled flag, async)
  • Existing health check tests pass (21 + 22)
  • Existing router tests pass (55)
  • Manual proxy test with bad API key deployment — verify exclusion after first health check cycle
  • Manual test with all bad keys — verify safety net bypass
vercel bot commented Mar 27, 2026

The latest updates on your projects.

Project: litellm | Deployment: Ready | Updated (UTC): Mar 27, 2026 10:05am



greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR introduces opt-in health-check-driven routing (enable_health_check_routing: true), wiring the existing background health check loop into the router's candidate-filtering pipeline. The new DeploymentHealthCache stores per-deployment health state from each check cycle; at routing time the router filters out deployments marked unhealthy before applying cooldown. A safety net ensures no total outage if all candidates are marked unhealthy, and a staleness threshold prevents stale data from permanently blocking a recovered deployment.

Prior review concerns (silent model_id stripping and debug-level error logging) are both addressed in this revision.

Key findings:

  • Staleness default mismatch (P1): when health_check_staleness_threshold is not explicitly configured, proxy_server.py never passes health_check_interval to the router. The router falls back to DEFAULT_HEALTH_CHECK_INTERVAL * 2 = 600s regardless of the configured interval. For a common health_check_interval: 60 setup this means the actual staleness window is 600s, not the documented 120s. The fix is to compute and pass the correct default (health_check_interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER) in proxy_server.py when health_check_staleness_threshold is absent.

  • parent_otel_span silently dropped (P2): DeploymentHealthCache.get_unhealthy_deployment_ids and async_get_unhealthy_deployment_ids accept parent_otel_span but do not forward it to DualCache.get_cache / async_get_cache. Since DualCache propagates the span to Redis, the tracing context is lost on all health-routing cache reads.

  • The docs claim "default staleness = health_check_interval * 2", which is only accurate if the P1 above is fixed.

Confidence Score: 4/5

Safe to merge with one fix — the P1 staleness default bug causes the feature to behave differently than documented for most common configurations, though it never causes an outage.

One P1 defect remains: the staleness threshold falls back to the constant default interval (300s × 2 = 600s) instead of the configured interval, producing a 5–10× longer staleness window than documented. All prior P0/P1 feedback has been addressed. The P2 span propagation gap is observability-only and does not affect routing correctness.

Files needing attention: litellm/proxy/proxy_server.py (staleness default derivation at lines 3311–3314) and docs/my-website/docs/proxy/health.md
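The P1 fix Greptile describes can be sketched as below. Variable and function names here are assumptions for illustration, not the exact proxy_server.py code; only the constants' values (300s default interval, multiplier of 2) come from the review text above.

```python
# Illustrative sketch of the P1 fix: derive the staleness default from the
# *configured* health_check_interval rather than the library-wide constant,
# so a 60s interval yields a 120s staleness window as documented.
DEFAULT_HEALTH_CHECK_INTERVAL = 300          # seconds, litellm's constant default
DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = 2


def resolve_staleness_threshold(general_settings: dict) -> int:
    """An explicit health_check_staleness_threshold wins; otherwise use the
    configured interval * multiplier, falling back to the constant default
    only when no interval is configured at all."""
    explicit = general_settings.get("health_check_staleness_threshold")
    if explicit is not None:
        return explicit
    interval = general_settings.get(
        "health_check_interval", DEFAULT_HEALTH_CHECK_INTERVAL
    )
    return interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER
```

With this derivation, the bug's symptom disappears: a health_check_interval of 60 produces a 120s window instead of the constant-based 600s.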

Important Files Changed

  • litellm/router_utils/health_state_cache.py — new DeploymentHealthCache class; stores per-deployment health state with staleness enforcement; parent_otel_span accepted but not forwarded to cache calls (P2)
  • litellm/router.py — adds health-check filter at 3 routing paths (async, sync, pass-through); default staleness falls back to DEFAULT_HEALTH_CHECK_INTERVAL * 2 (600s) instead of the configured interval
  • litellm/proxy/proxy_server.py — config parsing reads health_check_interval but never passes it to the router when health_check_staleness_threshold is absent, making the documented default wrong by up to 5×
  • litellm/proxy/health_check.py — adds model_id re-attachment after _clean_endpoint_data (fixes prior review issue) and new build_deployment_health_states() helper; logic is sound
  • docs/my-website/docs/proxy/health.md — new docs section for health-check-driven routing; the documented default staleness ("health_check_interval * 2") does not match the current implementation default (600s fixed)

Sequence Diagram

```mermaid
sequenceDiagram
    participant BG as Background Health Check Loop
    participant HC as health_check.py
    participant PS as proxy_server.py
    participant DC as DeploymentHealthCache
    participant Router as Router filter methods
    BG->>HC: _perform_health_check(model_list)
    HC-->>BG: healthy_endpoints, unhealthy_endpoints (with model_id)
    BG->>PS: _write_health_state_to_router_cache(healthy, unhealthy)
    PS->>HC: build_deployment_health_states(healthy, unhealthy)
    HC-->>PS: states dict {model_id -> {is_healthy, timestamp, reason}}
    PS->>DC: set_deployment_health_states(states)
    DC->>DC: cache.set_cache(CACHE_KEY, states, ttl=staleness*1.5)
    Note over Router: On each routing request
    Router->>DC: get_unhealthy_deployment_ids()
    DC->>DC: cache.get_cache(CACHE_KEY)
    DC->>DC: _extract_unhealthy_ids (staleness filter)
    DC-->>Router: Set[unhealthy_model_ids]
    Router->>Router: filter out unhealthy deployments
    alt All candidates unhealthy
        Router->>Router: Safety net: return all candidates
    end
    Router-->>Router: filtered healthy_deployments to cooldown filter
```
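The `_extract_unhealthy_ids` staleness filter in the diagram can be sketched as follows. Field names (is_healthy, timestamp) mirror the states dict shown in the diagram; the helper name and signature are illustrative, not the actual DeploymentHealthCache method.

```python
# Sketch of the staleness filter: entries older than the threshold (or
# malformed) are skipped, i.e. treated as healthy/unknown, so a recovered
# deployment is never blocked by stale data.
import time
from typing import Dict, Set


def extract_unhealthy_ids(
    states: Dict[str, dict], staleness_threshold: float, now: float = None
) -> Set[str]:
    now = time.time() if now is None else now
    unhealthy = set()
    for model_id, state in states.items():
        if not isinstance(state, dict):
            continue  # skip malformed entries
        ts = state.get("timestamp")
        if ts is None or now - ts > staleness_threshold:
            continue  # stale or missing timestamp -> treated as unknown
        if state.get("is_healthy") is False:
            unhealthy.add(model_id)
    return unhealthy
```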

Reviews (3). Last reviewed commit: "fix: revert accidental _litellm_uuid imp..."

Comment on lines +216 to +226
```diff
     healthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
 elif isinstance(is_healthy, dict):
-    unhealthy_endpoints.append(
-        _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-    )
+    endpoint_data = {**litellm_params, **is_healthy}
+    if _model_id:
+        endpoint_data["model_id"] = _model_id
+    unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
 else:
-    unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))
+    endpoint_data = {**litellm_params}
+    if _model_id:
+        endpoint_data["model_id"] = _model_id
+    unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
```

P1 model_id is silently stripped when health_check_details: False

model_id is added to endpoint_data and then immediately passed through _clean_endpoint_data(endpoint_data, details). When details=False (i.e., health_check_details: False in config), _clean_endpoint_data only keeps fields in MINIMAL_DISPLAY_PARAMS = ["model", "mode_error"] — so model_id is silently dropped.

Downstream, build_deployment_health_states() looks for model_id in each endpoint dict; when it's missing, every deployment is skipped and states is empty. The router cache is never written, and health-check-driven routing silently does nothing — with no warning logged to the user.

The fix is to re-attach model_id after the clean call:

```python
if isinstance(is_healthy, dict) and "error" not in is_healthy:
    endpoint_data = {**litellm_params, **is_healthy}
    cleaned = _clean_endpoint_data(endpoint_data, details)
    if _model_id:
        cleaned["model_id"] = _model_id  # re-attach after cleaning
    healthy_endpoints.append(cleaned)
```
Sameerlite (author) replied:

Fixed in 62ebcde — model_id is now re-attached after _clean_endpoint_data() so it survives health_check_details: False.

Comment on lines +2140 to +2143
```python
except Exception as e:
    verbose_proxy_logger.debug(
        "Failed to write health state to router cache: %s", str(e)
    )
```

P2 Health state write failures silently swallowed at debug level

Errors in _write_health_state_to_router_cache are caught and logged only at debug level, making any failure completely invisible during normal operation.

Suggested change
```diff
 except Exception as e:
-    verbose_proxy_logger.debug(
+    verbose_proxy_logger.warning(
         "Failed to write health state to router cache: %s", str(e)
     )
```
Sameerlite (author) replied:

Fixed in 62ebcde — bumped to warning level.

```python
    are treated as healthy (unknown).
    """

    CACHE_KEY = "litellm:health_check:deployment_health_state"
```

P2 Single global cache key causes last-writer-wins collision in shared-Redis multi-instance deployments

CACHE_KEY is a fixed string. When multiple proxy instances share a Redis-backed DualCache, each instance's write replaces the previous instance's data. Each pod's health state view will only reflect whichever pod wrote last, making the feature unreliable in Redis-coordinated multi-pod deployments.

Consider namespacing the key with a pod/instance identifier, or merging into the existing dict rather than replacing it.
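The merge alternative suggested above can be sketched as follows. The function name is illustrative, and a real multi-pod implementation over Redis would also need a lock or atomic update to avoid lost writes between the read and the write.

```python
# Sketch of the reviewer's merge suggestion: read-modify-write the shared
# states dict instead of replacing it wholesale, so each pod's write only
# overwrites the model_ids it actually checked.
def merge_health_states(existing: dict, new_states: dict) -> dict:
    merged = dict(existing or {})
    merged.update(new_states)  # this pod's fresh results win per model_id
    return merged
```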

Sameerlite (author) replied:

This is intentional — same pattern as the existing shared health check manager which writes to a single Redis key. When use_shared_health_check: true is enabled, only one pod runs health checks at a time (via distributed lock), so there's no concurrent-write collision. When shared health checks are disabled, each pod uses its own in-memory cache anyway. No change needed here.

```python
import litellm.litellm_core_utils
import litellm.litellm_core_utils.exception_mapping_utils
from litellm import get_secret_str
from litellm._litellm_uuid import uuid
```

P0 _litellm_uuid module doesn't exist — causes ImportError at startup

litellm._litellm_uuid was never created in this PR (and doesn't exist anywhere in the repo's git history). Both router.py and proxy_server.py now import from it, so the proxy will fail to start entirely: every import litellm.router or import litellm.proxy.proxy_server raises ModuleNotFoundError: No module named 'litellm._litellm_uuid'.

The file that actually exists is litellm/_uuid.py and is used by ~175 other files in the codebase. The import in router.py needs to be reverted to match it:

Suggested change
```diff
-from litellm._litellm_uuid import uuid
+from litellm._uuid import uuid
```

The same fix is also needed in litellm/proxy/proxy_server.py line 40.

Sameerlite (author) replied:

Fixed in 5f11c7c — reverted both router.py and proxy_server.py back to from litellm._uuid import uuid. The _litellm_uuid import was accidentally pulled in by the isort pre-commit hook from an unrelated staged rename in the working directory.

