
feat(router): health-check-driven routing #24678

Open

Sameerlite wants to merge 3 commits into BerriAI:litellm_team-model-group-name-routing-fix from Sameerlite:litellm_health-check-driven-routing

Conversation


@Sameerlite (Contributor) commented on Mar 27, 2026

Summary

  • Background health checks now feed deployment health state into the router candidate-filtering pipeline
  • Unhealthy deployments are excluded proactively instead of waiting for request failures to trigger cooldown
  • Gated behind enable_health_check_routing: true in general_settings; off by default, zero behavior change for existing users
  • Safety net: if all deployments are unhealthy, filter is bypassed (never causes total outage)
  • Staleness handling: stale health state is ignored, falls back to cooldown-only behavior
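The filtering and safety-net behavior described above can be sketched as follows. This is an illustrative sketch only; the function and field names (filter_unhealthy_deployments, model_info.id) are assumptions, not the actual litellm router API.

```python
# Hypothetical sketch of the candidate-filtering step: drop deployments the
# background health check marked unhealthy, but never empty the candidate
# list entirely (the "safety net" from the summary above).
from typing import Dict, List, Set


def filter_unhealthy_deployments(
    candidates: List[Dict], unhealthy_ids: Set[str]
) -> List[Dict]:
    """Remove candidates whose deployment id is marked unhealthy.

    Safety net: if every candidate is unhealthy, return the full list
    unchanged so the filter can never cause a total outage.
    """
    healthy = [
        d for d in candidates
        if d.get("model_info", {}).get("id") not in unhealthy_ids
    ]
    return healthy if healthy else candidates
```

The cooldown filter then runs on the (possibly reduced) candidate list as before.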

Changes

  • litellm/router_utils/health_state_cache.py — new DeploymentHealthCache class
  • litellm/router.py — new filter methods + pipeline insertion at all 3 routing paths
  • litellm/proxy/health_check.py — build_deployment_health_states() + model_id in endpoint data
  • litellm/proxy/proxy_server.py — writes health state after each background check cycle, config parsing
  • litellm/constants.py — DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER
  • docs/my-website/docs/proxy/health.md — documentation

Config

```yaml
general_settings:
  background_health_checks: true
  health_check_interval: 60
  enable_health_check_routing: true      # opt-in
  health_check_staleness_threshold: 120  # optional, default = interval * 2
```

Test plan

  • 8 unit tests for DeploymentHealthCache (staleness, empty cache, malformed entries)
  • 8 unit tests for router filter (unhealthy removal, safety net, disabled flag, async)
  • Existing health check tests pass (21 + 22)
  • Existing router tests pass (55)
  • Manual proxy test with bad API key deployment — verify exclusion after first health check cycle
  • Manual test with all bad keys — verify safety net bypass
vercel bot commented Mar 27, 2026

The latest updates on your projects.

Project: litellm | Deployment: Ready | Updated (UTC): Mar 27, 2026 10:05am



greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR introduces opt-in health-check-driven routing (enable_health_check_routing: true), wiring the existing background health check loop into the router's candidate-filtering pipeline. The new DeploymentHealthCache stores per-deployment health state from each check cycle; at routing time the router filters out deployments marked unhealthy before applying cooldown. A safety net ensures no total outage if all candidates are marked unhealthy, and a staleness threshold prevents stale data from permanently blocking a recovered deployment.

Prior review concerns (silent model_id stripping and debug-level error logging) are both addressed in this revision.

Key findings:

  • Staleness default mismatch (P1): when health_check_staleness_threshold is not explicitly configured, proxy_server.py never passes health_check_interval to the router. The router falls back to DEFAULT_HEALTH_CHECK_INTERVAL * 2 = 600s regardless of the configured interval. For a common health_check_interval: 60 setup this means the actual staleness window is 600s, not the documented 120s. The fix is to compute and pass the correct default (health_check_interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER) in proxy_server.py when health_check_staleness_threshold is absent.

  • parent_otel_span silently dropped (P2): DeploymentHealthCache.get_unhealthy_deployment_ids and async_get_unhealthy_deployment_ids accept parent_otel_span but do not forward it to DualCache.get_cache / async_get_cache. Since DualCache propagates the span to Redis, the tracing context is lost on all health-routing cache reads.

  • The docs claim "default staleness = health_check_interval * 2", which is only accurate if the P1 above is fixed.

Confidence Score: 4/5

Safe to merge with one fix — the P1 staleness default bug causes the feature to behave differently than documented for most common configurations, though it never causes an outage.

One P1 defect remains: the staleness threshold falls back to the constant default interval (300s × 2 = 600s) instead of the configured interval, producing a 5–10× longer staleness window than documented. All prior P0/P1 feedback has been addressed. The P2 span propagation gap is observability-only and does not affect routing correctness.

Files needing attention: litellm/proxy/proxy_server.py (staleness default derivation at lines 3311–3314) and docs/my-website/docs/proxy/health.md
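The P1 fix Greptile describes can be sketched as below. Variable and function names here are assumptions for illustration, not the exact proxy_server.py code; only the constants' values (300s default interval, multiplier of 2) come from the review text above.

```python
# Illustrative sketch of the P1 fix: derive the staleness default from the
# *configured* health_check_interval rather than the library-wide constant,
# so a 60s interval yields a 120s staleness window as documented.
DEFAULT_HEALTH_CHECK_INTERVAL = 300          # seconds, litellm's constant default
DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER = 2


def resolve_staleness_threshold(general_settings: dict) -> int:
    """An explicit health_check_staleness_threshold wins; otherwise use the
    configured interval * multiplier, falling back to the constant default
    only when no interval is configured at all."""
    explicit = general_settings.get("health_check_staleness_threshold")
    if explicit is not None:
        return explicit
    interval = general_settings.get(
        "health_check_interval", DEFAULT_HEALTH_CHECK_INTERVAL
    )
    return interval * DEFAULT_HEALTH_CHECK_STALENESS_MULTIPLIER
```

With this derivation, the bug's symptom disappears: a health_check_interval of 60 produces a 120s window instead of the constant-based 600s.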

Important Files Changed

  • litellm/router_utils/health_state_cache.py — new DeploymentHealthCache class; stores per-deployment health state with staleness enforcement; parent_otel_span accepted but not forwarded to cache calls (P2)
  • litellm/router.py — adds health-check filter at 3 routing paths (async, sync, pass-through); default staleness falls back to DEFAULT_HEALTH_CHECK_INTERVAL * 2 (600s) instead of the configured interval
  • litellm/proxy/proxy_server.py — config parsing reads health_check_interval but never passes it to the router when health_check_staleness_threshold is absent, making the documented default wrong by up to 5×
  • litellm/proxy/health_check.py — adds model_id re-attachment after _clean_endpoint_data (fixes prior review issue) and new build_deployment_health_states() helper; logic is sound
  • docs/my-website/docs/proxy/health.md — new docs section for health-check-driven routing; the documented default staleness ("health_check_interval * 2") does not match the current implementation default (600s fixed)

Sequence Diagram

```mermaid
sequenceDiagram
    participant BG as Background Health Check Loop
    participant HC as health_check.py
    participant PS as proxy_server.py
    participant DC as DeploymentHealthCache
    participant Router as Router filter methods
    BG->>HC: _perform_health_check(model_list)
    HC-->>BG: healthy_endpoints, unhealthy_endpoints (with model_id)
    BG->>PS: _write_health_state_to_router_cache(healthy, unhealthy)
    PS->>HC: build_deployment_health_states(healthy, unhealthy)
    HC-->>PS: states dict {model_id -> {is_healthy, timestamp, reason}}
    PS->>DC: set_deployment_health_states(states)
    DC->>DC: cache.set_cache(CACHE_KEY, states, ttl=staleness*1.5)
    Note over Router: On each routing request
    Router->>DC: get_unhealthy_deployment_ids()
    DC->>DC: cache.get_cache(CACHE_KEY)
    DC->>DC: _extract_unhealthy_ids (staleness filter)
    DC-->>Router: Set[unhealthy_model_ids]
    Router->>Router: filter out unhealthy deployments
    alt All candidates unhealthy
        Router->>Router: Safety net: return all candidates
    end
    Router-->>Router: filtered healthy_deployments to cooldown filter
```
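The `_extract_unhealthy_ids` staleness filter in the diagram can be sketched as follows. Field names (is_healthy, timestamp) mirror the states dict shown in the diagram; the helper name and signature are illustrative, not the actual DeploymentHealthCache method.

```python
# Sketch of the staleness filter: entries older than the threshold (or
# malformed) are skipped, i.e. treated as healthy/unknown, so a recovered
# deployment is never blocked by stale data.
import time
from typing import Dict, Set


def extract_unhealthy_ids(
    states: Dict[str, dict], staleness_threshold: float, now: float = None
) -> Set[str]:
    now = time.time() if now is None else now
    unhealthy = set()
    for model_id, state in states.items():
        if not isinstance(state, dict):
            continue  # skip malformed entries
        ts = state.get("timestamp")
        if ts is None or now - ts > staleness_threshold:
            continue  # stale or missing timestamp -> treated as unknown
        if state.get("is_healthy") is False:
            unhealthy.add(model_id)
    return unhealthy
```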

Reviews (3). Last reviewed commit: "fix: revert accidental _litellm_uuid imp..."

Comment on lines +216 to +226
```diff
     healthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
 elif isinstance(is_healthy, dict):
-    unhealthy_endpoints.append(
-        _clean_endpoint_data({**litellm_params, **is_healthy}, details)
-    )
+    endpoint_data = {**litellm_params, **is_healthy}
+    if _model_id:
+        endpoint_data["model_id"] = _model_id
+    unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
 else:
-    unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))
+    endpoint_data = {**litellm_params}
+    if _model_id:
+        endpoint_data["model_id"] = _model_id
+    unhealthy_endpoints.append(_clean_endpoint_data(endpoint_data, details))
```

P1 model_id is silently stripped when health_check_details: False

model_id is added to endpoint_data and then immediately passed through _clean_endpoint_data(endpoint_data, details). When details=False (i.e., health_check_details: False in config), _clean_endpoint_data only keeps fields in MINIMAL_DISPLAY_PARAMS = ["model", "mode_error"] — so model_id is silently dropped.

Downstream, build_deployment_health_states() looks for model_id in each endpoint dict; when it's missing, every deployment is skipped and states is empty. The router cache is never written, and health-check-driven routing silently does nothing — with no warning logged to the user.

The fix is to re-attach model_id after the clean call:

```python
if isinstance(is_healthy, dict) and "error" not in is_healthy:
    endpoint_data = {**litellm_params, **is_healthy}
    cleaned = _clean_endpoint_data(endpoint_data, details)
    if _model_id:
        cleaned["model_id"] = _model_id  # re-attach after cleaning
    healthy_endpoints.append(cleaned)
```
Sameerlite (author) replied:

Fixed in 62ebcde — model_id is now re-attached after _clean_endpoint_data() so it survives health_check_details: False.

Comment on lines +2140 to +2143
```python
except Exception as e:
    verbose_proxy_logger.debug(
        "Failed to write health state to router cache: %s", str(e)
    )
```

P2 Health state write failures silently swallowed at debug level

Errors in _write_health_state_to_router_cache are caught and logged only at debug level, making any failure completely invisible during normal operation.

Suggested change
```diff
 except Exception as e:
-    verbose_proxy_logger.debug(
+    verbose_proxy_logger.warning(
         "Failed to write health state to router cache: %s", str(e)
     )
```
Sameerlite (author) replied:

Fixed in 62ebcde — bumped to warning level.

```python
    are treated as healthy (unknown).
    """

    CACHE_KEY = "litellm:health_check:deployment_health_state"
```

P2 Single global cache key causes last-writer-wins collision in shared-Redis multi-instance deployments

CACHE_KEY is a fixed string. When multiple proxy instances share a Redis-backed DualCache, each instance's write replaces the previous instance's data. Each pod's health state view will only reflect whichever pod wrote last, making the feature unreliable in Redis-coordinated multi-pod deployments.

Consider namespacing the key with a pod/instance identifier, or merging into the existing dict rather than replacing it.
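The merge alternative suggested above can be sketched as follows. The function name is illustrative, and a real multi-pod implementation over Redis would also need a lock or atomic update to avoid lost writes between the read and the write.

```python
# Sketch of the reviewer's merge suggestion: read-modify-write the shared
# states dict instead of replacing it wholesale, so each pod's write only
# overwrites the model_ids it actually checked.
def merge_health_states(existing: dict, new_states: dict) -> dict:
    merged = dict(existing or {})
    merged.update(new_states)  # this pod's fresh results win per model_id
    return merged
```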

Sameerlite (author) replied:

This is intentional — same pattern as the existing shared health check manager which writes to a single Redis key. When use_shared_health_check: true is enabled, only one pod runs health checks at a time (via distributed lock), so there's no concurrent-write collision. When shared health checks are disabled, each pod uses its own in-memory cache anyway. No change needed here.

```python
import litellm.litellm_core_utils
import litellm.litellm_core_utils.exception_mapping_utils
from litellm import get_secret_str
from litellm._litellm_uuid import uuid
```

P0 _litellm_uuid module doesn't exist — causes ImportError at startup

litellm._litellm_uuid was never created in this PR (and doesn't exist anywhere in the repo's git history). Both router.py and proxy_server.py now import from it, so the proxy will fail to start entirely: every import litellm.router or import litellm.proxy.proxy_server raises ModuleNotFoundError: No module named 'litellm._litellm_uuid'.

The file that actually exists is litellm/_uuid.py and is used by ~175 other files in the codebase. The import in router.py needs to be reverted to match it:

Suggested change
```diff
-from litellm._litellm_uuid import uuid
+from litellm._uuid import uuid
```

The same fix is also needed in litellm/proxy/proxy_server.py line 40.

Sameerlite (author) replied:

Fixed in 5f11c7c — reverted both router.py and proxy_server.py back to from litellm._uuid import uuid. The _litellm_uuid import was accidentally pulled in by the isort pre-commit hook from an unrelated staged rename in the working directory.

