
Conversation

@felixbarny
Member

@felixbarny felixbarny commented Apr 2, 2019

This only happens after the APM Server has been unavailable for a while. In that case, the reporter queue fills up and can never really drain, because the reporter blocks its thread to throttle APM Server connection retries. On shutdown, the reporter tries to flush by registering a flush event. The registration blocks until a new slot in the ring buffer becomes available, but due to the throttling an event is picked up from the ring buffer only once every 36 seconds. So a shutdown takes around 512 * 36 seconds.
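
For readers unfamiliar with the Disruptor, here is a minimal, self-contained sketch of the failure mode described above. It is not the agent's actual code; it assumes the LMAX Disruptor API the reporter is built on, and the ReportingEvent class, the 512-slot buffer, and the 36-second sleep are placeholders mirroring the numbers in this description:

```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.Executors;

public class BlockingFlushSketch {

    // Hypothetical event type standing in for the reporter's real event class.
    static class ReportingEvent {
        String payload;
    }

    public static void main(String[] args) {
        Disruptor<ReportingEvent> disruptor = new Disruptor<>(
                ReportingEvent::new, 512, Executors.defaultThreadFactory());

        // Consumer that mimics the connection-retry throttling: while the APM Server
        // is unreachable, it handles at most one event every 36 seconds.
        disruptor.handleEventsWith((EventHandler<ReportingEvent>)
                (event, sequence, endOfBatch) -> Thread.sleep(36_000));
        disruptor.start();

        // Fill the ring buffer while the server is down; tryPublishEvent returns
        // false instead of blocking once the buffer is full.
        for (int i = 0; i < 512; i++) {
            disruptor.getRingBuffer().tryPublishEvent((event, seq) -> event.payload = "span");
        }

        // The problematic flush registration: publishEvent blocks this thread until a
        // slot frees up, and with the throttled consumer above a slot only frees up
        // once every 36 seconds. Anything that then waits for the flush event to be
        // handled also has to wait for the ~512 events queued ahead of it,
        // i.e. on the order of 512 * 36 seconds.
        disruptor.getRingBuffer().publishEvent((event, seq) -> event.payload = "flush");
    }
}
```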

@felixbarny felixbarny requested a review from eyalkoren April 2, 2019 08:58
@felixbarny felixbarny self-assigned this Apr 2, 2019
@felixbarny
Member Author

felixbarny commented Apr 2, 2019

The bug was likely introduced in #410, in combination with #397.

@codecov-io

codecov-io commented Apr 2, 2019

Codecov Report

Merging #554 into master will decrease coverage by 2.14%.
The diff coverage is 40%.

Impacted file tree graph

```diff
@@             Coverage Diff              @@
##             master     #554      +/-   ##
============================================
- Coverage     65.25%   63.11%    -2.15%
  Complexity       68       68
============================================
  Files           180      180
  Lines          7209     6815     -394
  Branches        863      780      -83
============================================
- Hits           4704     4301     -403
+ Misses         2251     2249       -2
- Partials        254      265      +11
```
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...pm/agent/report/IntakeV2ReportingEventHandler.java | 76.23% <ø> (-4.79%) | 0 <0> (ø) |
| ...co/elastic/apm/agent/report/ApmServerReporter.java | 57.84% <40%> (+3.84%) | 0 <0> (ø) ⬇️ |
| .../apm/agent/report/serialize/DslJsonSerializer.java | 81.84% <0%> (-6.41%) | 0% <0%> (ø) |
| ...tic/apm/agent/configuration/CoreConfiguration.java | 96.12% <0%> (-2.14%) | 0% <0%> (ø) |
| ...a/co/elastic/apm/agent/report/ReporterFactory.java | 78.26% <0%> (-0.91%) | 0% <0%> (ø) |
| .../main/java/co/elastic/apm/agent/impl/MetaData.java | 100% <0%> (ø) | 0% <0%> (ø) ⬇️ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6482904...7e1e4d7. Read the comment docs.

@eyalkoren
Contributor

eyalkoren commented Apr 3, 2019

On shutdown, the reporter tries to publish a shutdown event and then calls Disruptor.shutdown, which does a busy spin to see whether there are still events to process and times out after 5 seconds.

Then it wakes up the Handler thread, which may try to flush the currently buffered serialized data, but it does not flush all events in the buffer.

This also changes the state of the Handler so that it shouldn't continue trying to send events, but in any case, this thread is a daemon thread, so it shouldn't prevent shutdown.

The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did. So I see how the shutdown is delayed by the publish call (which doesn't guarantee success after 36 seconds if other publishers are trying to send events as well) plus the 5-second wait, but not longer.

What am I missing here?
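
For context, a minimal sketch of the bounded shutdown call described in this comment, assuming the LMAX Disruptor API; the 5-second timeout is taken from the comment, while the class and method names are illustrative rather than the agent's actual code:

```java
import com.lmax.disruptor.TimeoutException;
import com.lmax.disruptor.dsl.Disruptor;

import java.util.concurrent.TimeUnit;

public class ShutdownTimeoutSketch {

    // Disruptor.shutdown(timeout, unit) spins until all published events have been
    // handled and throws TimeoutException once the timeout elapses, so this call by
    // itself can only delay shutdown by roughly the 5 seconds mentioned above.
    static void shutdownWithTimeout(Disruptor<?> disruptor) {
        try {
            disruptor.shutdown(5, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Events were still pending when the timeout hit; halt the processors anyway.
            disruptor.halt();
        }
    }
}
```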

@felixbarny
Member Author

> The only blocking thing here seems to be the publish, which I assume would be fixed by changing to tryPublish as you did.

That's correct. I think with this change we're all good.
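
For illustration, a minimal sketch of a non-blocking flush registration along the lines discussed here, assuming the LMAX Disruptor API; ReportingEvent, the flush flag, and registerFlush are hypothetical placeholders, not the agent's actual code:

```java
import com.lmax.disruptor.RingBuffer;

public class NonBlockingFlushSketch {

    // Hypothetical event type standing in for the reporter's real event class.
    static class ReportingEvent {
        boolean flush;
    }

    // tryPublishEvent only claims a slot if one is free and returns false otherwise,
    // so a full ring buffer no longer blocks the thread that is shutting down.
    static boolean registerFlush(RingBuffer<ReportingEvent> ringBuffer) {
        return ringBuffer.tryPublishEvent((event, sequence) -> event.flush = true);
    }
}
```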

@felixbarny felixbarny merged commit c9149db into elastic:master Apr 3, 2019
@felixbarny felixbarny deleted the avoid-reporter-blocking-shutdown branch April 3, 2019 08:45
@SylvainJuge SylvainJuge added bug Bugs and removed type: bug labels Feb 2, 2021
