Skip to content

Conversation

@SylvainJuge
Copy link
Member

@SylvainJuge SylvainJuge commented May 10, 2021

What does this PR do?

When using OpenTracing API, there are some code paths that could lead to setting transaction sample rate multiple times.

Because we try to minimize the amount of parsing & memory allocation involved, we only allowed to set the sample rate once using either:

  • co.elastic.apm.agent.impl.transaction.TraceState#set for root transaction with value from sampler
  • co.elastic.apm.agent.impl.transaction.TraceState#addTextHeader for headers parsing

This PR changes the existing behavior (throw an exception) to:

  • ignore when trying to set value more than once with an identical value.
  • throw an exception only when trying to use a different value.
  • only try to set sample rate from sampler only when it's not already set.

Checklist

  • This is a bugfix
@ghost
Copy link

ghost commented May 10, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Started by user eyalkoren

  • Start Time: 2021-05-12T04:24:42.299+0000

  • Duration: 55 min 48 sec

  • Commit: 6589a1e

Test stats 🧪

Test Results
Failed 0
Passed 2131
Skipped 19
Total 2150

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 2131
Skipped 19
Total 2150

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this seems OK, I need to understand what we are missing here - the assumption is that TraceState#set() can only be invoked if a the transaction is started as a root, meaning - without traceparent header. And as far as I can see, addTextHeader () should only be invoked if there is a valid traceparent header. Any attempt to create a child transaction when there is already an active transaction should raise a red flag, right?
As for multiple attempts to set the sample rate based on the tracestate header - I can see how this may happen, and we should guard from that and treat it as illegal state or something.

Comment on lines 432 to 433
// sample rate could already have been set through headers
// we have to make sure to only use sampler rate as fallback only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come it could have been set through headers? Isn't asRootSpan() only invoked when no trace context headers are received (i.e. root)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While asRootSpan is only called with root transactions, the tracestate header is set through:

co.elastic.apm.agent.impl.transaction.TraceState#addTextHeader --> co.elastic.apm.agent.impl.transaction.TraceContext#getFromTraceContextTextHeaders ----> co.elastic.apm.agent.opentracingimpl.ExternalSpanContextInstrumentation#parseTextMap 

Which might just be an anomaly of our OpenTracing bridge implementation that make it try to set tracestate twice, hopefully with the same value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So can we simplify and only look at a scenario where TraceState#addTextHeader() is invoked when the sample rate was already set?
And when this happens with a different value from the already-set value, we should:

  1. not override the sample rate, the first one being set is the one actually being used
  2. log about that, probably in debug if it is a valid state
  3. make sure we fix the header to contain the actual sample rate being used, and not the one read from the header - we want to propagate the right one

In addition, can we assume that TraceState#set() should never be invoked when sample rate is already set, and if that happens - log a warning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes, first wins is a sensible strategy here.
  2. logging with debug is fine as it's an exceptional case that would only be investigated.
  3. yes, we should only propagate valid header and de-duplicate.

Even if we should not call TraceState#set twice, having a consistent behavior with TraceState.addTextHeader() would be nice, thus logging both with debug seems the right option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's do that then.
Just clarifying my rationale: I don't think consistency is a consideration. If we think something should never happen, we can log it as error, as opposed to something we can expect.
So, besides a scenario of multiple tracestate headers, any other scenario of trying to set a sample rate when it is already set, is either a bug, or a scenario we didn't think of. Either way it may indicate that distributed tracing will be broken for this trace.

@SylvainJuge
Copy link
Member Author

After discussing this with @eyalkoren this morning, we agreed on the following behavior:

  • trying to call TraceState#set multiple times is an error and likely a bug in the agent, we should throw an exception
  • trying to call TraceState#set once a valid header has been parsed is also an error and should throw an exception
  • multiple calls to TraceState#addTextHeader are expected due to multi-valued headers (valid for HTTP).
  • if there are multiple values of sample rate in header, then we only take the 1st and ignore others but issue a warning in log (1).
  • if there is any invalid header value, we rewrite the header to avoid propagating an invalid header
  • if there is any rounding difference, then we also rewrite the header

(1) while the agent just corrects what could be an invalid upstream tracestate header, we have to log it as it could be an indicator of another bug with an upstream service also instrumented by one of our agents.

Copy link
Contributor

@eyalkoren eyalkoren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment/suggestion, but other than that LGTM

Comment on lines 136 to 138
} else if (!Double.isNaN(sampleRate)) {
log.warn("sample rate already set to {}, trying to set it to {} through header will be ignored", sampleRate, value);
headerValue = rewriteRemoveInvalidHeader(headerValue, vendorStart, vendorEnd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically, shouldn't we check that as early as possible? The line above suggests we may have overridden the sample-rate already.

@SylvainJuge SylvainJuge merged commit 3b2fdbe into elastic:master May 12, 2021
@SylvainJuge SylvainJuge deleted the avoid-tracestate-exception branch May 12, 2021 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants