
Restarted the watch after stream reset #19055 #19056

Open
vivek807 wants to merge 3 commits into apache:master from
deep-bi:deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset

Conversation

@vivek807
Contributor

Fixes #19055.

Description

Restarted the watch after a stream reset, fixing bug #19055.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.
Contributor

Copilot AI left a comment


Pull request overview

Fixes issue #19055 by improving resiliency of the Kubernetes service-discovery watch so broker node inventory can recover after watch stream disruptions.

Changes:

  • Updated WatchResult to be usable with try-with-resources by extending AutoCloseable.
  • Refactored NodeRoleWatcher.keepWatching to use try-with-resources and added explicit handling for HTTP/2 stream reset conditions.
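The two changes above can be illustrated with a minimal, self-contained sketch. All names here (WatchResult, openWatch, keepWatching) are hypothetical stand-ins, not the actual Druid classes; the fake watch simulates one stream reset followed by a clean run, so the loop demonstrates both the try-with-resources cleanup and the restart:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a watch handle that is AutoCloseable so the
// loop can use try-with-resources, plus a keepWatching loop that
// restarts the watch after a stream-reset-like IOException.
public class WatchLoopSketch
{
  public interface WatchResult extends AutoCloseable
  {
    // Returns false when the server ends the watch normally;
    // throws IOException on a stream reset.
    boolean next() throws IOException;

    @Override
    void close();
  }

  public static final AtomicInteger opened = new AtomicInteger();
  public static final AtomicInteger closed = new AtomicInteger();

  // Fake watch: the first attempt dies with a simulated stream reset,
  // later attempts deliver two events and end cleanly.
  public static WatchResult openWatch(String resourceVersion)
  {
    final int attempt = opened.incrementAndGet();
    return new WatchResult()
    {
      int events = 0;

      @Override
      public boolean next() throws IOException
      {
        if (attempt == 1) {
          throw new IOException("stream was reset: NO_ERROR");
        }
        return ++events <= 2;
      }

      @Override
      public void close()
      {
        closed.incrementAndGet();
      }
    };
  }

  public static int keepWatching(String resourceVersion)
  {
    int eventsSeen = 0;
    boolean done = false;
    while (!done) {
      // try-with-resources guarantees close() runs even when next() throws
      try (WatchResult watch = openWatch(resourceVersion)) {
        while (watch.next()) {
          eventsSeen++;
        }
        done = true; // server ended the watch cleanly
      }
      catch (IOException e) {
        // Stream reset: loop around and reopen the watch from the
        // same resourceVersion instead of dying silently.
      }
    }
    return eventsSeen;
  }

  public static void main(String[] args)
  {
    int events = keepWatching("12345");
    System.out.println("attempts=" + opened.get() + " closed=" + closed.get() + " events=" + events);
  }
}
```

Note that with try-with-resources, close() is invoked before the catch block runs, so the dead watch is always released before a new one is opened.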

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java | Adjusts the watch iterator contract to support automatic resource management.
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java | Updates the watch loop to auto-close resources and attempt a restart after stream resets/timeouts.


@FrankChen021
Member

@vivek807 Thanks for reporting the issue and submitting this PR for the fix. Can you address the above comments?

@vivek807
Contributor Author

vivek807 commented Mar 4, 2026

@vivek807 Thanks for reporting the issue and submitting this PR for the fix. Can you address the above comments?

updated, please recheck.

vivek807 requested a review from FrankChen021 on March 4, 2026 at 11:00
vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from e79643a to dc86628 on March 4, 2026 at 12:02
vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from dc86628 to 268b1ff on March 4, 2026 at 12:52
Contributor

@capistrant left a comment


Still think we need to push further in removing the okhttp internals from this PR. My initial thought is that we just need to handle all IOExceptions with a full re-list. We can keep the one-off handling of the known-OK IOException for which a simple retry of the watch from the same resource version works, but for all others force a re-list? My biggest fear is that this is an overreaction.

    } else {
      throw ex;
    }
    if (ex.getCause() instanceof StreamResetException) {
Contributor


This is still relying on an okhttp internal. Maybe we can just catch any wrapped IOException and re-throw, then catch IOException in the discovery provider after we catch the socket timeout, and log/return to force a re-list? One fear is that that is too heavy-handed, though, since a full list is expensive, especially in clusters with lots of pods.
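The re-list approach sketched in this comment could look roughly like the following. Everything here (NodeWatcher, handleWatch, the returned action strings) is a hypothetical illustration, assuming a SocketTimeoutException merits a plain retry of the watch while any other IOException, including a wrapped stream reset, forces a full re-list:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch of the okhttp-agnostic alternative: instead of
// matching okhttp's StreamResetException, treat any IOException from
// the watch as a signal to drop out and re-list from scratch.
public class RelistSketch
{
  public interface NodeWatcher
  {
    void watch(String resourceVersion) throws IOException;
  }

  public static String handleWatch(NodeWatcher watcher, String resourceVersion)
  {
    try {
      watcher.watch(resourceVersion);
      return "clean-exit";
    }
    catch (SocketTimeoutException e) {
      // Known-benign timeout: retry the watch from the same resourceVersion.
      // (SocketTimeoutException extends IOException, so it must be caught first.)
      return "retry-watch";
    }
    catch (IOException e) {
      // Anything else, including wrapped stream resets: force a full
      // re-list. Safe, but expensive on clusters with many pods.
      return "full-relist";
    }
  }

  public static void main(String[] args)
  {
    System.out.println(handleWatch(rv -> { throw new SocketTimeoutException("read timed out"); }, "1"));
    System.out.println(handleWatch(rv -> { throw new IOException("stream was reset"); }, "1"));
  }
}
```

The catch ordering is the crux of the design: the narrow, retry-friendly case is handled first, and the broad IOException catch becomes the conservative fallback, which is exactly the trade-off (simplicity versus re-list cost) the comment raises.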
