
Conversation

@felixbarny
Member

@felixbarny felixbarny commented Apr 2, 2019

This only happens after the APM Server has been unavailable for a while. In that case, the reporter queue fills up and can never really drain, because the reporter blocks its thread to throttle APM Server connection retries. On shutdown, the reporter tries to flush by registering a flush event. The registration blocks until a new slot in the ring buffer becomes available, but due to the throttling an event is picked up from the ring buffer only once every 36 seconds. So a shutdown takes around 512 * 36 seconds.
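
For readers unfamiliar with the Disruptor, here is a minimal, self-contained sketch of the failure mode described above. It is not the agent's actual code; it assumes the LMAX Disruptor API the reporter is built on, and the ReportingEvent class, the 512-slot buffer, and the 36-second sleep are placeholders mirroring the numbers in this description:

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.Executors;

public class BlockingFlushSketch {

    // Hypothetical event type standing in for the reporter's real event class.
    static class ReportingEvent {
        String payload;
    }

    public static void main(String[] args) {
        Disruptor<ReportingEvent> disruptor = new Disruptor<>(
                ReportingEvent::new, 512, Executors.defaultThreadFactory());

        // Consumer that mimics the connection-retry throttling: while the APM Server
        // is unreachable, it handles at most one event every 36 seconds.
        disruptor.handleEventsWith((EventHandler<ReportingEvent>)
                (event, sequence, endOfBatch) -> Thread.sleep(36_000));
        disruptor.start();

        // Fill the ring buffer while the server is down; tryPublishEvent returns
        // false instead of blocking once the buffer is full.
        for (int i = 0; i < 512; i++) {
            disruptor.getRingBuffer().tryPublishEvent((event, seq) -> event.payload = "span");
        }

        // The problematic flush registration: publishEvent blocks this thread until a
        // slot frees up, and with the throttled consumer above a slot only frees up
        // once every 36 seconds. Anything that then waits for the flush event to be
        // handled also has to wait for the ~512 events queued ahead of it,
        // i.e. on the order of 512 * 36 seconds.
        disruptor.getRingBuffer().publishEvent((event, seq) -> event.payload = "flush");
    }
}
```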

@felixbarny felixbarny requested a review from eyalkoren April 2, 2019 08:58
@felixbarny felixbarny self-assigned this Apr 2, 2019
@felixbarny
Member Author

felixbarny commented Apr 2, 2019

The bug was likely introduced in #410, in combination with #397.

@codecov-io

codecov-io commented Apr 2, 2019

Codecov Report

Merging #554 into master will decrease coverage by 2.14%.
The diff coverage is 40%.

Impacted file tree graph

```diff
@@             Coverage Diff              @@
##             master     #554      +/-   ##
============================================
- Coverage     65.25%   63.11%    -2.15%
  Complexity       68       68
============================================
  Files           180      180
  Lines          7209     6815     -394
  Branches        863      780      -83
============================================
- Hits           4704     4301     -403
+ Misses         2251     2249       -2
- Partials        254      265      +11
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...pm/agent/report/IntakeV2ReportingEventHandler.java | 76.23% <ø> (-4.79%) | 0 <0> (ø) |
| ...co/elastic/apm/agent/report/ApmServerReporter.java | 57.84% <40%> (+3.84%) | 0 <0> (ø) ⬇️ |
| .../apm/agent/report/serialize/DslJsonSerializer.java | 81.84% <0%> (-6.41%) | 0% <0%> (ø) |
| ...tic/apm/agent/configuration/CoreConfiguration.java | 96.12% <0%> (-2.14%) | 0% <0%> (ø) |
| ...a/co/elastic/apm/agent/report/ReporterFactory.java | 78.26% <0%> (-0.91%) | 0% <0%> (ø) |
| .../main/java/co/elastic/apm/agent/impl/MetaData.java | 100% <0%> (ø) | 0% <0%> (ø) ⬇️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6482904...7e1e4d7. Read the comment docs.

@eyalkoren
Contributor

eyalkoren commented Apr 3, 2019

On shutdown, the reporter tries to publish a shutdown event and then calls Disruptor.shutdown, which does a busy spin to see whether there are still events to process and times out after 5 seconds.

Then it wakes up the Handler thread, which may try to flush the currently buffered serialized data, but it does not flush all events in the buffer.

This also changes the state of the Handler so that it shouldn't continue trying to send events, but in any case, this thread is a daemon thread, so it shouldn't prevent shutdown.

The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did. So I see how the shutdown is delayed by the publish call (which doesn't guarantee success after 36 seconds if other publishers are trying to send events as well) plus the 5-second wait, but not longer.

What am I missing here?
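
For context, a minimal sketch of the bounded shutdown call described in this comment, assuming the LMAX Disruptor API; the 5-second timeout is taken from the comment, while the class and method names are illustrative rather than the agent's actual code:

```java
import com.lmax.disruptor.TimeoutException;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.TimeUnit;

public class ShutdownTimeoutSketch {

    // Disruptor.shutdown(timeout, unit) spins until all published events have been
    // handled and throws TimeoutException once the timeout elapses, so this call by
    // itself can only delay shutdown by roughly the 5 seconds mentioned above.
    static void shutdownWithTimeout(Disruptor<?> disruptor) {
        try {
            disruptor.shutdown(5, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Events were still pending when the timeout hit; halt the processors anyway.
            disruptor.halt();
        }
    }
}
```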

@felixbarny
Member Author

> The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did.

That's correct. I think with this change we're all good.
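
For illustration, a minimal sketch of a non-blocking flush registration along the lines discussed here, assuming the LMAX Disruptor API; ReportingEvent, the flush flag, and registerFlush are hypothetical placeholders, not the agent's actual code:

```java
import com.lmax.disruptor.RingBuffer;

public class NonBlockingFlushSketch {

    // Hypothetical event type standing in for the reporter's real event class.
    static class ReportingEvent {
        boolean flush;
    }

    // tryPublishEvent only claims a slot if one is free and returns false otherwise,
    // so a full ring buffer no longer blocks the thread that is shutting down.
    static boolean registerFlush(RingBuffer<ReportingEvent> ringBuffer) {
        return ringBuffer.tryPublishEvent((event, sequence) -> event.flush = true);
    }
}
```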

@felixbarny felixbarny merged commit c9149db into elastic:master Apr 3, 2019
@felixbarny felixbarny deleted the avoid-reporter-blocking-shutdown branch April 3, 2019 08:45
@SylvainJuge SylvainJuge added bug Bugs and removed type: bug labels Feb 2, 2021
