Fix/8693 rls control channel state monitoring#9137
Conversation
Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY. Fixes grpc#8693
- Add testOnlyInitialReadyDone channel for proper test synchronization - Signal when monitoring goroutine processes initial READY state - Tests wait for this signal instead of using time.Sleep - All synchronization now uses channels/callbacks - no arbitrary sleeps - Tests pass consistently with race detector Addresses review feedback about removing time.Sleep for state transitions.
…el-state-monitoring
…rt timeouts - Replace goto/label patterns with labeled for loops and break - Replace default cases with time.After(10ms) for proper timing - Remove impossible TRANSIENT_FAILURE handling in IDLE test
Replace 5 identical labeled-for-select loops with a shared helper function that waits for a specific connectivity state on the channel.
…l-channel-state-monitoring
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #9137 +/- ##
==========================================
- Coverage 83.21% 83.09% -0.12%
==========================================
Files 414 417 +3
Lines 33489 33649 +160
==========================================
+ Hits 27868 27961 +93
- Misses 4207 4263 +56
- Partials 1414 1425 +11
🚀 New features to boost your workflow:
|
|
@eshitachandwani : Since you were the first reviewer on the original PR, I'm assigning this to you again for review. Please do take a look my open comments from the original PR and ensure that they are handled here. Thanks. |
| cc.connectivityStateCh.Put(st) | ||
|
|
||
| var callBackToReady bool | ||
| cc.mu.Lock() |
There was a problem hiding this comment.
Why do we need this mutex lock? From what I understand the OnMessage is already locked when executed and also executed serially.See here Is there another reason to have the lock?
| verifyRLSRequest(t, rlsReqCh, true) | ||
|
|
||
| // Verify that the control channel moves to READY. | ||
| wantStates := []connectivity.State{connectivity.Connecting, connectivity.Ready} |
There was a problem hiding this comment.
Why are we not using waitForConnectivityState here?
There was a problem hiding this comment.
It was meant to wait for a sequence of Connecting followed by Ready state, replaced with two subsequent calls of waitForConnectivityState
| // state is reset for cache entries in this scenario. It also verifies that: | ||
| // - Backoff is NOT reset when the control channel first becomes READY (i.e., | ||
| // the initial CONNECTING → READY transition should not trigger a backoff reset) | ||
| // - Backoff is NOT reset for READY → IDLE → READY transitions (benign state changes) |
There was a problem hiding this comment.
This test does not seem to be testing point 2 here i.e.
- Backoff is NOT reset for READY → IDLE → READY transitions (benign state changes)
There was a problem hiding this comment.
yeah, this case is covered by TestControlChannelIdleTransitionNoBackoffReset, removed the confusing point in the comment
| select { | ||
| case <-resetBackoffDone: | ||
| t.Fatal("Backoff reset was triggered for initial READY state, want no reset") | ||
| case <-time.After(10 * time.Millisecond): |
There was a problem hiding this comment.
Can we replace this with the already defined const defaultTestShortTimeout here and below?
This PR is a continuation and finalization of the stale/closed PR #8720.
It addresses the remaining feedback from the maintainers and the code review tools to resolve control channel state monitoring issues in the RLS balancer.
RELEASE NOTES: N/A
Fixes #8693