Receive in split routing and ingest mode returning 503 during rollouts #8727

@philipgough

Description

Thanos, Prometheus and Golang version used:
v0.41.0

What happened:

Cluster upgrades appear to occasionally cause 503 responses from the Thanos Receive router while it is running in split routing and ingestion mode.

When a cluster is upgraded, we expect node drains, which cause each of the Pods in the hashring to be rescheduled to a new node. The current configuration (replication factor of 3 and a hashring size of 3) should tolerate this disruption without affecting error budgets.

However, while Pods are being rescheduled, we sometimes see an issue with the response calculation that makes Thanos appear down to the Prometheus remote-write client, even though it never actually is.

As a consequence, we receive a burst of retries from the Prometheus client for a batch of requests that will never succeed, because Thanos reports out-of-order samples for the series it tries to append.

In this scenario, ingestion is essentially blocked until the “unavailable” replica becomes ready once again.

The hashring remained static throughout.

What you expected to happen:

Ingestion to continue when one replica is unavailable and the request can never succeed; in that case the client should receive a non-retriable error rather than a 503.

I see that https://github.com/thanos-io/thanos/pull/8720/changes has merged and is related, but I think the problem described here is still valid.

How to reproduce it (as minimally and precisely as possible):

Have a consistent source of out-of-order sample ingestion, then upgrade a Kubernetes cluster.

| Date/Time (UTC) | Event |
|---|---|
| 08:03 | Cluster upgrade begins |
| 08:51:45 | gRPC response status code "Unavailable" spikes for router metrics |
| 08:54:45 | Ingest-2 becomes unready due to node drain, as per KSM |
| ~08:55:15 | Ingest-2 becomes ready, as per KSM |
| ~08:55:15 | gRPC response status code "AlreadyExists" spikes for router metrics |
| ~08:55:15 | Increase in out-of-order samples across all ingest Pods |
| ~08:55:15 | Increase in 503s across all ingest Pods |
| 08:58:30 | Ingest-0 becomes unready due to node drain, as per KSM |
| 08:58:30 | Error rate returns to normal |
| 08:59:30 | Ingest-0 becomes ready, as per KSM |

(Screenshots attached in the original issue.)

Anything else we need to know:

The failure appears to occur when we get a genuine out-of-order sample, whilst one of the nodes in the hashring is marked as unavailable by Thanos.

Comparing graphs between a cluster where the alert regularly fires and one where it seldom fires, we can correlate this with the out-of-order rate: on most clusters it goes through long periods at zero, whereas the affected clusters show a spikier pattern.

Due to the way Thanos handles multiple errors, when we have a mixed error set where one error is deemed retriable and another is not, we return a 5xx response and force the client to retry. In this case, we are essentially telling the client to keep retrying a request that can never succeed.

The "AlreadyExists" error is misleading, because the remote peer also returns it when an append fails due to out-of-order samples; Prometheus actually handles an exact duplicate entry as a non-error, as per the source code.
The increase in the out-of-order total metric is therefore valid, but it keeps rising as the failed request is retried over and over, essentially blocking ingestion.

We can only recover from such a situation when all replicas come back online and return the same error response.

I'm suggesting that the fix might be to change the order of priority in `func (es *writeErrors) Cause() error`, but potentially we should also take into account the number of unavailable replicas in the hashring.

For example, if we have 3 nodes and 1 is unavailable, but the two live members return a 409, we don't need to block the client: the unavailable member will likely return a 409 when it comes back online anyway, or at best a 200, in which case we would still fail to reach a quorum, so there is no point in waiting.
