
Restarted the watch after stream reset #19055 #19056

Open
vivek807 wants to merge 3 commits into apache:master from
deep-bi:deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset

Conversation

@vivek807
Contributor

Fixes #19055.

Description

Restarted the watch after a stream reset, fixing bug #19055.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.
Contributor

Copilot AI left a comment


Pull request overview

Fixes issue #19055 by improving resiliency of the Kubernetes service-discovery watch so broker node inventory can recover after watch stream disruptions.

Changes:

  • Updated WatchResult to be usable with try-with-resources by extending AutoCloseable.
  • Refactored NodeRoleWatcher.keepWatching to use try-with-resources and added explicit handling for HTTP/2 stream reset conditions.
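The two changes above can be illustrated with a minimal, self-contained sketch. All names here (WatchResult, openWatch, keepWatching) are hypothetical stand-ins, not the actual Druid classes; the fake watch simulates one stream reset followed by a clean run, so the loop demonstrates both the try-with-resources cleanup and the restart:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a watch handle that is AutoCloseable so the
// loop can use try-with-resources, plus a keepWatching loop that
// restarts the watch after a stream-reset-like IOException.
public class WatchLoopSketch
{
  public interface WatchResult extends AutoCloseable
  {
    // Returns false when the server ends the watch normally;
    // throws IOException on a stream reset.
    boolean next() throws IOException;

    @Override
    void close();
  }

  public static final AtomicInteger opened = new AtomicInteger();
  public static final AtomicInteger closed = new AtomicInteger();

  // Fake watch: the first attempt dies with a simulated stream reset,
  // later attempts deliver two events and end cleanly.
  public static WatchResult openWatch(String resourceVersion)
  {
    final int attempt = opened.incrementAndGet();
    return new WatchResult()
    {
      int events = 0;

      @Override
      public boolean next() throws IOException
      {
        if (attempt == 1) {
          throw new IOException("stream was reset: NO_ERROR");
        }
        return ++events <= 2;
      }

      @Override
      public void close()
      {
        closed.incrementAndGet();
      }
    };
  }

  public static int keepWatching(String resourceVersion)
  {
    int eventsSeen = 0;
    boolean done = false;
    while (!done) {
      // try-with-resources guarantees close() runs even when next() throws
      try (WatchResult watch = openWatch(resourceVersion)) {
        while (watch.next()) {
          eventsSeen++;
        }
        done = true; // server ended the watch cleanly
      }
      catch (IOException e) {
        // Stream reset: loop around and reopen the watch from the
        // same resourceVersion instead of dying silently.
      }
    }
    return eventsSeen;
  }

  public static void main(String[] args)
  {
    int events = keepWatching("12345");
    System.out.println("attempts=" + opened.get() + " closed=" + closed.get() + " events=" + events);
  }
}
```

Note that with try-with-resources, close() is invoked before the catch block runs, so the dead watch is always released before a new one is opened.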

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java | Adjusts the watch iterator contract to support automatic resource management.
extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java | Updates the watch loop to auto-close resources and attempt a restart after stream resets/timeouts.


@FrankChen021
Member

@vivek807 Thanks for reporting the issue and submitting this PR for the fix. Can you address the above comments?

@vivek807
Contributor Author

vivek807 commented Mar 4, 2026

@vivek807 Thanks for reporting the issue and submitting this PR for the fix. Can you address the above comments?

updated, please recheck.

vivek807 requested a review from FrankChen021 on March 4, 2026 at 11:00
vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from e79643a to dc86628 on March 4, 2026 at 12:02
vivek807 force-pushed the deep/feature/19055-service-discovery-watch-not-recovering-from-stream-reset branch from dc86628 to 268b1ff on March 4, 2026 at 12:52
Contributor

@capistrant left a comment


Still think we need to push further in removing the okhttp internals from this PR. My initial thought is that we just need to handle all IOExceptions with a full re-list. We can keep the one-off handling of the known-OK IOException for which a simple retry of the watch from the same resource version works, but for all others force a re-list? My biggest fear is that this is an overreaction.

    } else {
      throw ex;
    }
    if (ex.getCause() instanceof StreamResetException) {
Contributor


This is still relying on an okhttp internal. Maybe we can just catch any wrapped IOException and re-throw, then catch IOException in the discovery provider after we catch the socket timeout, and log/return to force a re-list? One fear is that that is too heavy-handed, though, since a full list is expensive, especially in clusters with lots of pods.
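The re-list approach sketched in this comment could look roughly like the following. Everything here (NodeWatcher, handleWatch, the returned action strings) is a hypothetical illustration, assuming a SocketTimeoutException merits a plain retry of the watch while any other IOException, including a wrapped stream reset, forces a full re-list:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch of the okhttp-agnostic alternative: instead of
// matching okhttp's StreamResetException, treat any IOException from
// the watch as a signal to drop out and re-list from scratch.
public class RelistSketch
{
  public interface NodeWatcher
  {
    void watch(String resourceVersion) throws IOException;
  }

  public static String handleWatch(NodeWatcher watcher, String resourceVersion)
  {
    try {
      watcher.watch(resourceVersion);
      return "clean-exit";
    }
    catch (SocketTimeoutException e) {
      // Known-benign timeout: retry the watch from the same resourceVersion.
      // (SocketTimeoutException extends IOException, so it must be caught first.)
      return "retry-watch";
    }
    catch (IOException e) {
      // Anything else, including wrapped stream resets: force a full
      // re-list. Safe, but expensive on clusters with many pods.
      return "full-relist";
    }
  }

  public static void main(String[] args)
  {
    System.out.println(handleWatch(rv -> { throw new SocketTimeoutException("read timed out"); }, "1"));
    System.out.println(handleWatch(rv -> { throw new IOException("stream was reset"); }, "1"));
  }
}
```

The catch ordering is the crux of the design: the narrow, retry-friendly case is handled first, and the broad IOException catch becomes the conservative fallback, which is exactly the trade-off (simplicity versus re-list cost) the comment raises.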
