LoadBalancer keyed on slot instead of primary node, not reset on NodesManager.initialize() #3683
Pull Request check-list
Description of change
- `LoadBalancer` now uses `slot_to_idx` instead of `primary_to_idx`, keying on the slot instead of the primary node's name.
- When `NodesManager` resets in `initialize()`, it no longer calls `LoadBalancer.reset()`, which would clear the `slot_to_idx` dictionary.
- `TestNodesManager.test_load_balancer` has been updated accordingly.

As noted in #3681, resetting the load balancer on `NodesManager.initialize()` causes the index associated with the primary node to reset to 0. If a `ConnectionError` or `TimeoutError` is raised by an attempt to connect to a primary node, `NodesManager.initialize()` is called and the load balancer's index for that node is reset to 0. The next attempt in the retry loop therefore does not move on from the primary node to a replica node (with index > 0) as expected, but instead retries the primary node again (and presumably raises the same error).

Since calling `NodesManager.initialize()` on `ConnectionError` or `TimeoutError` is the intended strategy, and since the primary node's host is often replaced in tandem with the events that cause these errors (e.g. when a primary node is deleted and then recreated in Kubernetes), keying the `LoadBalancer` dictionary on the primary node's name (`host:port`) doesn't feel appropriate. Keying the dictionary on the Redis Cluster slot is a better strategy: the `server_index` for a given slot does not need to be reset to 0 on `NodesManager.initialize()`, because the slot itself is not expected to change; only a `host:port` key would require such a reset. The slot's entry can therefore keep its state even when the `NodesManager` is reinitialized, which resolves #3681.
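For illustration, here is a minimal sketch of the idea, with assumed names (`SlotKeyedLoadBalancer` is hypothetical, not the actual redis-py class or signatures): the round-robin index is keyed on the cluster slot, so reinitializing the nodes manager does not have to clear it.

```python
# Sketch only: approximates the slot-keyed round-robin described above,
# not the actual redis.cluster.LoadBalancer implementation.
class SlotKeyedLoadBalancer:
    def __init__(self, start_index: int = 0) -> None:
        self.slot_to_idx: dict[int, int] = {}
        self.start_index = start_index

    def get_server_index(self, slot: int, list_size: int) -> int:
        # Round-robin over the nodes serving this slot
        # (index 0 is the primary, indices >= 1 are replicas).
        server_index = self.slot_to_idx.setdefault(slot, self.start_index)
        self.slot_to_idx[slot] = (server_index + 1) % list_size
        return server_index

    def reset(self) -> None:
        # Still available, but no longer invoked from NodesManager.initialize(),
        # so a re-initialization does not send every slot back to its primary.
        self.slot_to_idx.clear()
```

Because the key is the slot rather than `host:port`, the entry survives a re-initialization triggered by a connection failure, and the next retry naturally advances to a replica.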
With the fix in this PR implemented, the output of the loop from #3681 becomes what is expected when the primary node goes down (the load balancer continues to the next node on a `TimeoutError`):