Restarted the watch after stream reset #19056
vivek807 wants to merge 3 commits into apache:master
Conversation
Pull request overview
Fixes issue #19055 by improving resiliency of the Kubernetes service-discovery watch so broker node inventory can recover after watch stream disruptions.
Changes:
- Updated `WatchResult` to be usable with try-with-resources by extending `AutoCloseable`.
- Refactored `NodeRoleWatcher.keepWatching` to use try-with-resources and added explicit handling for HTTP/2 stream reset conditions.
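A minimal sketch of the try-with-resources pattern described above. The names (`SketchWatchResult`, `runWatch`) are illustrative stand-ins, not the extension's actual classes; the point is that once the watch handle implements `AutoCloseable`, the watch loop releases the underlying stream even when iteration fails:

```java
import java.util.Iterator;
import java.util.List;

// Illustrative stand-in for the extension's WatchResult: an event stream
// that must be closed to release the underlying HTTP connection.
class SketchWatchResult implements AutoCloseable
{
  private final Iterator<String> events;
  private boolean closed = false;

  SketchWatchResult(List<String> events)
  {
    this.events = events.iterator();
  }

  boolean hasNext()
  {
    return events.hasNext();
  }

  String next()
  {
    return events.next();
  }

  boolean isClosed()
  {
    return closed;
  }

  @Override
  public void close()
  {
    // In the real extension this would release the underlying HTTP stream.
    closed = true;
  }
}

public class WatchSketch
{
  // try-with-resources guarantees close() runs even if iteration throws,
  // which is what the refactored keepWatching loop relies on.
  public static boolean runWatch(SketchWatchResult watch)
  {
    try (SketchWatchResult w = watch) {
      while (w.hasNext()) {
        w.next();
      }
    }
    return watch.isClosed();
  }

  public static void main(String[] args)
  {
    SketchWatchResult w = new SketchWatchResult(List.of("ADDED", "MODIFIED"));
    System.out.println(runWatch(w)); // prints true: close() ran automatically
  }
}
```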
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/WatchResult.java | Adjusts the watch iterator contract to support automatic resource management. |
| extensions-core/kubernetes-extensions/src/main/java/org/apache/druid/k8s/discovery/K8sDruidNodeDiscoveryProvider.java | Updates the watch loop to auto-close resources and attempt a restart after stream resets/timeouts. |
@vivek807 Thanks for reporting the issue and submitting this PR to fix it. Can you address the above comments?
Updated, please recheck.
Force-pushed from e79643a to dc86628, then from dc86628 to 268b1ff.
capistrant left a comment
I still think we need to push further in removing the okhttp internals from this PR. My initial thought is that we just need to handle all IOExceptions with a full re-list. We can keep the one-off handling of the known-benign IOException, where simply retrying the watch from the same resource version works, but for all others force a re-list. My biggest fear is that this is an overreaction.
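The recovery policy the reviewer sketches can be summarized as a small decision function. This is an illustrative sketch of that suggestion, not the PR's actual code; `onWatchFailure` and the `Action` enum are hypothetical names:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class WatchRecovery
{
  enum Action { RETRY_WATCH, FULL_RELIST }

  // Hypothetical policy: retry the watch for the one known-benign failure,
  // and force a full re-list for every other IOException.
  static Action onWatchFailure(IOException ex)
  {
    // A plain read timeout just means no events arrived before the client
    // timeout fired; the watch can resume from the same resourceVersion.
    if (ex instanceof SocketTimeoutException) {
      return Action.RETRY_WATCH;
    }
    // Anything else (including wrapped HTTP/2 stream resets) falls back to a
    // full re-list, which is safe but expensive in clusters with many pods.
    return Action.FULL_RELIST;
  }

  public static void main(String[] args)
  {
    System.out.println(onWatchFailure(new SocketTimeoutException("read timed out")));
    System.out.println(onWatchFailure(new IOException("stream was reset: NO_ERROR")));
  }
}
```

The trade-off the reviewer raises lives in the second branch: a full re-list is always correct but scales with pod count, which is why the known-benign retry path is worth keeping.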
```java
if (ex.getCause() instanceof StreamResetException) {
  // ...
} else {
  throw ex;
}
```
There was a problem hiding this comment.
This is still relying on an okhttp internal. Maybe we can just catch any wrapped IOException and re-throw, then catch IOException in the discovery provider after we catch the socket timeout, and log/return to force a re-list? One fear is that that is too heavy-handed, though, as a full list is expensive, especially in clusters with lots of pods.
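One way to avoid compiling against the okhttp internal, if stream resets still need special-casing, is to walk the cause chain and match by class name. This is a sketch of that alternative (an assumption, not what the PR or reviewer settled on); `isStreamReset` is a hypothetical helper:

```java
import java.io.IOException;

public class StreamResetCheck
{
  // Detect an HTTP/2 stream reset anywhere in the cause chain by class name,
  // so the caller does not need a compile-time dependency on okhttp's
  // internal StreamResetException type.
  static boolean isStreamReset(Throwable ex)
  {
    for (Throwable t = ex; t != null; t = t.getCause()) {
      if ("okhttp3.internal.http2.StreamResetException".equals(t.getClass().getName())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args)
  {
    // Without okhttp on the classpath we can only exercise the negative path.
    System.out.println(isStreamReset(new IOException(new RuntimeException("boom"))));
  }
}
```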
Fixes #19055.
Description
Restarted the watch after the stream reset, fixing bug #19055.
This PR has: