Description
Thanos, Prometheus and Golang version used:
v0.41.0
What happened:
Cluster upgrades appear to occasionally cause a 503 response for Thanos Receive Router whilst running in split routing and ingestion mode.
When a cluster is upgraded, we expect node drains which will cause each of the Pods in the hashring to be rescheduled to a new node. The current configuration (replication factor of 3 and hashring size of 3) should support this disruption affecting error burn budgets.
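The disruption budget above rests on the write quorum. As a minimal sketch (illustrative only, not the actual Thanos Receive code), the usual rule is floor(replicationFactor/2)+1, so with a replication factor of 3 a single drained replica should not block ingestion:

```go
package main

import "fmt"

// writeQuorum sketches the minimum number of successful replica writes
// needed for the router to report success, using the common
// floor(replicationFactor/2)+1 rule. Illustrative, not Thanos source.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	// Replication factor 3 => quorum 2, so losing one replica
	// to a node drain should still allow writes to succeed.
	fmt.Println(writeQuorum(3)) // prints 2
}
```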
However, whilst Pods are being rescheduled, the error aggregation sometimes goes wrong, making Thanos appear down to the Prometheus remote write client even though it is not.
As a consequence, the Prometheus client sends a burst of retries for a batch of requests that can never succeed, because Thanos reports out-of-order samples for the series it tries to append.
In this scenario, ingestion is essentially blocked until the “unavailable” replica becomes ready once again.
Hashring remained static throughout.
What you expected to happen:
Ingestion to continue when one replica is unavailable and the retried request can never succeed.
I see https://github.com/thanos-io/thanos/pull/8720/changes has merged addressing a related issue, but I think the problem described here is still valid.
How to reproduce it (as minimally and precisely as possible):
Have some consistent source of out of order sample ingestion. Upgrade a Kubernetes cluster.
| Date/Time (UTC) | Event |
|---|---|
| 08:03 | Cluster upgrade begins |
| 08:51:45 | gRPC response status code "Unavailable" spikes for router metrics |
| 08:54:45 | Ingest-2 becomes unready due to node drain as per KSM |
| ~08:55:15 | Ingest-2 becomes ready as per KSM |
| ~08:55:15 | gRPC response status code "AlreadyExists" spikes for router metrics |
| ~08:55:15 | Increase in out of order samples across all ingest Pods |
| ~08:55:15 | Increase in 503 across all ingest Pods |
| 08:58:30 | Ingest-0 becomes unready due to node drain as per KSM |
| 08:58:30 | Error rate returns to normal |
| 08:59:30 | Ingest-0 becomes ready as per KSM |
Anything else we need to know:
The failure appears to occur when we get a genuine out-of-order sample, whilst one of the nodes in the hashring is marked as unavailable by Thanos.
Comparing graphs for a cluster where the alert regularly fires against one where it seldom fires, we can correlate that the out-of-order rate on most clusters has long periods at zero, whereas the affected cluster shows a spikier pattern.
Due to the way Thanos handles multiple errors, when we have a mixed error set where one error is deemed retriable and one isn't, we return a 5xx response and force the client to retry. In this case, we are essentially telling the client to keep retrying a request that can never succeed.
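The failure mode above can be sketched as follows. This is an illustrative model of the behaviour described, not the actual `writeErrors` implementation; the error classes and `cause` function are hypothetical names:

```go
package main

import "fmt"

// errClass is a hypothetical stand-in for the gRPC codes the router
// sees from its replica peers.
type errClass int

const (
	errConflict    errClass = iota // out-of-order / AlreadyExists: not retriable
	errUnavailable                 // peer down: retriable
)

// cause sketches the behaviour described in this issue: if any replica
// error is retriable, the merged result is treated as retriable and
// surfaced as a 5xx, even when the remaining live replicas all
// returned a non-retriable conflict.
func cause(errs []errClass) errClass {
	for _, e := range errs {
		if e == errUnavailable {
			return errUnavailable // a single retriable error dominates
		}
	}
	return errConflict
}

func main() {
	// One drained replica plus two genuine out-of-order conflicts:
	// the client is told to retry a request that can never succeed.
	merged := cause([]errClass{errUnavailable, errConflict, errConflict})
	fmt.Println(merged == errUnavailable) // prints true
}
```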
The “AlreadyExists” error is misleading, because the remote peer returns it for appends that fail due to out-of-order samples; Prometheus itself actually handles an exact duplicate entry as a non-error, as per its source code.
Therefore the increase in the out-of-order total metric is valid, but it continues to rise as the failed request is retried over and over, essentially blocking ingestion.
We can only recover from such a situation when all replicas come back online and return the same error response.
I'm suggesting that the fix might be to change the order of priority in
Line 1265 in e88c22b:

```go
func (es *writeErrors) Cause() error {
```
What I mean is: if we have 3 nodes and 1 is unavailable, but the two live members both return a 409, then we don't need to block the client. The unavailable member will likely return a 409 when it comes back online anyway, or a 200 at best, which still fails to reach quorum, so it's not helpful to wait!
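A minimal sketch of the suggested ordering, assuming a quorum-counting approach. The names and structure are illustrative, not a patch against the real `writeErrors` code: if the non-retriable conflicts alone already reach write quorum, return the conflict so the client stops retrying.

```go
package main

import "fmt"

// errClass is a hypothetical stand-in for the replica error kinds.
type errClass int

const (
	errConflict    errClass = iota // 409: the request can never succeed
	errUnavailable                 // peer down: worth retrying
)

// cause sketches the proposed precedence: count replica errors and, if
// conflicts alone reach the write quorum, fail fast with the conflict
// instead of telling the client to retry forever.
func cause(errs []errClass, quorum int) errClass {
	conflicts := 0
	for _, e := range errs {
		if e == errConflict {
			conflicts++
		}
	}
	if conflicts >= quorum {
		return errConflict // enough live members agree: surface 409
	}
	return errUnavailable // otherwise keep the retriable result
}

func main() {
	// 3 replicas, quorum 2: one drained node plus two conflicts now
	// yields a conflict rather than an instruction to retry.
	merged := cause([]errClass{errUnavailable, errConflict, errConflict}, 2)
	fmt.Println(merged == errConflict) // prints true
}
```

With this ordering, the cluster-upgrade scenario in the timeline above would return a 409 to Prometheus as soon as two live members agree, instead of blocking ingestion until the drained replica rejoins.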