- Notifications
You must be signed in to change notification settings - Fork 327
Avoid tracestate error with sample rate provided through headers #1808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid tracestate error with sample rate provided through headers #1808
Conversation
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
Trends 🧪💚 Flaky test reportTests succeeded. Expand to view the summary
Test stats 🧪
|
eyalkoren left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this seems OK, I need to understand what we are missing here - the assumption is that TraceState#set() can only be invoked if a the transaction is started as a root, meaning - without traceparent header. And as far as I can see, addTextHeader () should only be invoked if there is a valid traceparent header. Any attempt to create a child transaction when there is already an active transaction should raise a red flag, right?
As for multiple attempts to set the sample rate based on the tracestate header - I can see how this may happen, and we should guard from that and treat it as illegal state or something.
apm-agent-core/src/main/java/co/elastic/apm/agent/impl/transaction/TraceContext.java Outdated Show resolved Hide resolved
| // sample rate could already have been set through headers | ||
| // we have to make sure to only use sampler rate as fallback only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come it could have been set through headers? Isn't asRootSpan() only invoked when no trace context headers are received (i.e. root)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While asRootSpan is only called with root transactions, the tracestate header is set through:
co.elastic.apm.agent.impl.transaction.TraceState#addTextHeader --> co.elastic.apm.agent.impl.transaction.TraceContext#getFromTraceContextTextHeaders ----> co.elastic.apm.agent.opentracingimpl.ExternalSpanContextInstrumentation#parseTextMap Which might just be an anomaly of our OpenTracing bridge implementation that make it try to set tracestate twice, hopefully with the same value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can we simplify and only look at a scenario where TraceState#addTextHeader() is invoked when the sample rate was already set?
And when this happens with a different value from the already-set value, we should:
- not override the sample rate, the first one being set is the one actually being used
- log about that, probably in
debugif it is a valid state - make sure we fix the header to contain the actual sample rate being used, and not the one read from the header - we want to propagate the right one
In addition, can we assume that TraceState#set() should never be invoked when sample rate is already set, and if that happens - log a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- yes, first wins is a sensible strategy here.
- logging with
debugis fine as it's an exceptional case that would only be investigated. - yes, we should only propagate valid header and de-duplicate.
Even if we should not call TraceState#set twice, having a consistent behavior with TraceState.addTextHeader() would be nice, thus logging both with debug seems the right option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's do that then.
Just clarifying my rationale: I don't think consistency is a consideration. If we think something should never happen, we can log it as error, as opposed to something we can expect.
So, besides a scenario of multiple tracestate headers, any other scenario of trying to set a sample rate when it is already set, is either a bug, or a scenario we didn't think of. Either way it may indicate that distributed tracing will be broken for this trace.
| After discussing this with @eyalkoren this morning, we agreed on the following behavior:
(1) while the agent just corrects what could be an invalid upstream tracestate header, we have to log it as it could be an indicator of another bug with an upstream service also instrumented by one of our agents. |
eyalkoren left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small comment/suggestion, but other than that LGTM
| } else if (!Double.isNaN(sampleRate)) { | ||
| log.warn("sample rate already set to {}, trying to set it to {} through header will be ignored", sampleRate, value); | ||
| headerValue = rewriteRemoveInvalidHeader(headerValue, vendorStart, vendorEnd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logically, shouldn't we check that as early as possible? The line above suggests we may have overridden the sample-rate already.
…-tracestate-exception
…-tracestate-exception
What does this PR do?
When using OpenTracing API, there are some code paths that could lead to setting transaction sample rate multiple times.
Because we try to minimize the amount of parsing & memory allocation involved, we only allowed to set the sample rate once using either:
co.elastic.apm.agent.impl.transaction.TraceState#setfor root transaction with value from samplerco.elastic.apm.agent.impl.transaction.TraceState#addTextHeaderfor headers parsingThis PR changes the existing behavior (throw an exception) to:
Checklist