Fix/8693 rls control channel state monitoring #9137

Open
mswierq wants to merge 20 commits into grpc:master from mswierq:fix/8693-rls-control-channel-state-monitoring

mswierq commented May 20, 2026

This PR is a continuation and finalization of the stale/closed PR #8720.

It addresses the remaining feedback from the maintainers and the code review tools to resolve control channel state monitoring issues in the RLS balancer.

RELEASE NOTES: N/A

Fixes #8693

ulascansenturk and others added 18 commits November 20, 2025 23:39
Fix control channel connectivity monitoring to track TRANSIENT_FAILURE
state explicitly. Only reset backoff timers when transitioning from
TRANSIENT_FAILURE to READY, not for benign state changes like
READY → IDLE → READY.

Fixes grpc#8693
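
The rule in this commit message can be sketched roughly as below. This is a minimal illustration, not the code in the PR: `State` is a local stand-in for `connectivity.State`, and `failureTracker`, `shouldResetBackoff`, and `countResets` are hypothetical names. It also assumes that any READY reached after a TRANSIENT_FAILURE counts as a recovery, even with intermediate states in between.

```go
package main

import "fmt"

// State is a local stand-in for gRPC's connectivity.State so that the sketch
// has no external dependencies.
type State int

const (
	Idle State = iota
	Connecting
	Ready
	TransientFailure
)

// failureTracker is a hypothetical helper (not the type used in the PR). It
// remembers whether TRANSIENT_FAILURE was seen since the last READY.
type failureTracker struct {
	seenTransientFailure bool
}

// shouldResetBackoff records the new state and reports whether it completes a
// recovery, i.e. READY reached after a TRANSIENT_FAILURE.
func (f *failureTracker) shouldResetBackoff(s State) bool {
	switch s {
	case TransientFailure:
		f.seenTransientFailure = true
	case Ready:
		if f.seenTransientFailure {
			f.seenTransientFailure = false
			return true
		}
	}
	return false
}

// countResets feeds a sequence of states through a fresh tracker and returns
// how many backoff resets it would trigger.
func countResets(states []State) int {
	var f failureTracker
	n := 0
	for _, s := range states {
		if f.shouldResetBackoff(s) {
			n++
		}
	}
	return n
}

func main() {
	benign := []State{Connecting, Ready, Idle, Connecting, Ready}
	recovery := []State{Connecting, Ready, TransientFailure, Connecting, Ready}
	fmt.Println(countResets(benign), countResets(recovery)) // prints: 0 1
}
```

Keeping a single boolean avoids comparing against only the immediately preceding state, which would miss a TRANSIENT_FAILURE → CONNECTING → READY recovery.
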
- Add testOnlyInitialReadyDone channel for proper test synchronization
- Signal when monitoring goroutine processes initial READY state
- Tests wait for this signal instead of using time.Sleep
- All synchronization now uses channels/callbacks - no arbitrary sleeps
- Tests pass consistently with race detector

Addresses review feedback about removing time.Sleep for state transitions.
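
A rough sketch of the synchronization idea described above, under stated assumptions: `monitor`, `initialReadyDone`, and `waitInitialReady` are illustrative names (the PR's `testOnlyInitialReadyDone` may be wired differently), and states are plain strings to keep the example dependency-free. The test blocks on a channel that the goroutine closes, rather than sleeping for an arbitrary duration.

```go
package main

import (
	"fmt"
	"time"
)

// monitor is a hypothetical stand-in for the control channel's monitoring
// goroutine. initialReadyDone is closed once the first READY is processed.
type monitor struct {
	states           chan string
	initialReadyDone chan struct{}
}

func newMonitor() *monitor {
	m := &monitor{
		states:           make(chan string, 8),
		initialReadyDone: make(chan struct{}),
	}
	go m.run()
	return m
}

// run processes states until the states channel is closed.
func (m *monitor) run() {
	signalled := false
	for s := range m.states {
		if s == "READY" && !signalled {
			signalled = true
			close(m.initialReadyDone) // wakes up any waiting test
		}
	}
}

// waitInitialReady blocks until the signal fires or the timeout elapses. The
// timeout only bounds a failing test; a passing test returns immediately.
func waitInitialReady(m *monitor, timeout time.Duration) bool {
	select {
	case <-m.initialReadyDone:
		return true
	case <-time.After(timeout):
		return false
	}
}

// processedInitialReady drives the monitor through CONNECTING and READY and
// reports whether the initial READY was observed.
func processedInitialReady() bool {
	m := newMonitor()
	defer close(m.states)
	m.states <- "CONNECTING"
	m.states <- "READY"
	return waitInitialReady(m, 5*time.Second)
}

func main() {
	fmt.Println(processedInitialReady())
}
```
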
…rt timeouts

- Replace goto/label patterns with labeled for loops and break
- Replace default cases with time.After(10ms) for proper timing
- Remove impossible TRANSIENT_FAILURE handling in IDLE test
Replace 5 identical labeled-for-select loops with a shared helper
function that waits for a specific connectivity state on the channel.
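
One possible shape for such a helper is sketched below. The signature is an assumption: the real `waitForConnectivityState` in the PR may take a `*testing.T`, a context, or a different channel type. `State` here is a local string type standing in for `connectivity.State`, and `waitForSequence` is a hypothetical wrapper used only to demonstrate consecutive calls.

```go
package main

import (
	"fmt"
	"time"
)

// State is a local stand-in for connectivity.State.
type State string

// waitForState reads from stateCh until want is observed, the channel is
// closed, or the timeout expires. Other states are skipped.
func waitForState(stateCh <-chan State, want State, timeout time.Duration) error {
	deadline := time.After(timeout)
	for {
		select {
		case got, ok := <-stateCh:
			if !ok {
				return fmt.Errorf("state channel closed before %q was seen", want)
			}
			if got == want {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for state %q", want)
		}
	}
}

// waitForSequence pushes states onto a channel and then waits for each wanted
// state in order, one waitForState call per state.
func waitForSequence(states []State, wants ...State) error {
	ch := make(chan State, len(states))
	for _, s := range states {
		ch <- s
	}
	for _, w := range wants {
		if err := waitForState(ch, w, 50*time.Millisecond); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := waitForSequence([]State{"IDLE", "CONNECTING", "READY"}, "CONNECTING", "READY")
	fmt.Println(err)
}
```

Returning from inside the `select` removes the need for a label on the loop, which is part of what makes a shared helper shorter than the inlined labeled loops.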

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.09%. Comparing base (bb023f8) to head (bfb9b39).
⚠️ Report is 7 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| balancer/rls/control_channel.go | 73.91% | 3 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9137      +/-   ##
==========================================
- Coverage   83.21%   83.09%   -0.12%     
==========================================
  Files         414      417       +3     
  Lines       33489    33649     +160     
==========================================
+ Hits        27868    27961      +93     
- Misses       4207     4263      +56     
- Partials     1414     1425      +11     
| Files with missing lines | Coverage Δ |
|---|---|
| balancer/rls/control_channel.go | 83.13% <73.91%> (-3.83%) ⬇️ |

... and 41 files with indirect coverage changes


mswierq marked this pull request as ready for review May 20, 2026 17:56
easwars added the labels Area: Testing (Includes tests and testing utilities that we have for unit and e2e tests within our repo.), Type: Internal Cleanup (Refactors, etc), and Area: Resolvers/Balancers (Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities.), and removed the label Area: Testing, May 20, 2026
easwars added this to the 1.82 Release milestone May 20, 2026
easwars requested a review from eshitachandwani May 20, 2026 18:40
easwars (Contributor) commented May 20, 2026

@eshitachandwani: Since you were the first reviewer on the original PR, I'm assigning this to you again for review. Please take a look at my open comments from the original PR and ensure that they are handled here. Thanks.

Comment thread balancer/rls/control_channel.go Outdated
cc.connectivityStateCh.Put(st)

var callBackToReady bool
cc.mu.Lock()
Member

Why do we need this mutex lock? From what I understand, OnMessage already runs under a lock and is executed serially. See here. Is there another reason to have the lock?

Author

The lock is unnecessary; fixed.
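
To illustrate why serial execution can make an extra mutex redundant, here is a minimal sketch. It assumes, as the reviewer describes, that the callbacks are delivered one at a time; `serializer` and `countSerially` are hypothetical and are not gRPC's internal types. State touched only inside the callbacks is never accessed concurrently, so it needs no lock of its own.

```go
package main

import (
	"fmt"
	"sync"
)

// serializer is a hypothetical, minimal callback queue: a single goroutine
// runs the scheduled callbacks one at a time, in the order received.
type serializer struct {
	callbacks chan func()
	done      chan struct{}
}

func newSerializer() *serializer {
	s := &serializer{
		callbacks: make(chan func(), 64),
		done:      make(chan struct{}),
	}
	go func() {
		defer close(s.done)
		for cb := range s.callbacks {
			cb()
		}
	}()
	return s
}

// schedule queues cb to run on the serializer goroutine.
func (s *serializer) schedule(cb func()) { s.callbacks <- cb }

// closeAndWait stops accepting callbacks and waits for queued ones to finish.
func (s *serializer) closeAndWait() {
	close(s.callbacks)
	<-s.done
}

// countSerially schedules n increments from n goroutines. The counter is
// modified only inside callbacks, so no mutex guards it.
func countSerially(n int) int {
	s := newSerializer()
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.schedule(func() { counter++ })
		}()
	}
	wg.Wait()        // all callbacks have been queued
	s.closeAndWait() // all callbacks have run; safe to read counter
	return counter
}

func main() {
	fmt.Println(countSerially(1000)) // prints: 1000
}
```

The same reasoning would not hold if the state were also read or written from outside the callbacks; in that case a lock would still be needed.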

Comment thread balancer/rls/balancer_test.go Outdated
verifyRLSRequest(t, rlsReqCh, true)

// Verify that the control channel moves to READY.
wantStates := []connectivity.State{connectivity.Connecting, connectivity.Ready}
Member

Why are we not using waitForConnectivityState here?

Author

It was meant to wait for the sequence Connecting followed by Ready. Replaced with two consecutive calls to waitForConnectivityState.

Comment thread balancer/rls/balancer_test.go Outdated
// state is reset for cache entries in this scenario. It also verifies that:
// - Backoff is NOT reset when the control channel first becomes READY (i.e.,
// the initial CONNECTING → READY transition should not trigger a backoff reset)
// - Backoff is NOT reset for READY → IDLE → READY transitions (benign state changes)
Member

This test does not seem to cover point 2 here, i.e.:
- Backoff is NOT reset for READY → IDLE → READY transitions (benign state changes)

Author

Yes, this case is covered by TestControlChannelIdleTransitionNoBackoffReset. Removed the confusing point from the comment.

Comment thread balancer/rls/balancer_test.go Outdated
select {
case <-resetBackoffDone:
t.Fatal("Backoff reset was triggered for initial READY state, want no reset")
case <-time.After(10 * time.Millisecond):
Member

Can we replace this with the already defined const defaultTestShortTimeout here and below?

Author

done
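
For reference, the "nothing should happen" check from the snippet under review could be factored roughly as follows. This is a sketch under assumptions: the constant is named after the one mentioned in the review, but its value here (10ms, matching the literal it replaces) is a guess, and `firedWithin` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTestShortTimeout mirrors the name of the existing test constant; the
// value used here is an assumption.
const defaultTestShortTimeout = 10 * time.Millisecond

// firedWithin reports whether ch delivers a value (or is closed) within d.
// A negative assertion passes when this returns false.
func firedWithin(ch <-chan struct{}, d time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	resetBackoffDone := make(chan struct{}, 1)

	// Initial READY: no reset expected, so the channel stays quiet.
	fmt.Println(firedWithin(resetBackoffDone, defaultTestShortTimeout))

	// After a simulated reset, the same check observes the signal.
	resetBackoffDone <- struct{}{}
	fmt.Println(firedWithin(resetBackoffDone, defaultTestShortTimeout))
}
```

A named constant keeps every negative check in the test file on the same wait budget, so the timing can be tuned in one place.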

Labels

Area: Resolvers/Balancers (Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities.)
Status: Requires Reporter Clarification
Type: Internal Cleanup (Refactors, etc)

Development

Successfully merging this pull request may close these issues.

rls: Update logic in the control channel connectivity state monitoring goroutine

4 participants