fix: stop memory leak from orphaned CR reflector goroutines on repeated CRD discovery#2920
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes elevated/unbounded memory growth and goroutine leaks when custom resource state config is enabled by making CRD discovery idempotent and ensuring custom-resource reflectors stop on both CRD removal and context cancellation.
Changes:
- Make
CRDiscoverer.AppendToMapidempotent (no duplicate kinds; don’t replace existing stop channels). - Ensure custom-resource reflectors stop when either the GVK stop channel fires or the builder context is cancelled.
- Add/extend tests covering idempotency, channel cleanup, and leak simulations.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/store/builder.go | Updates CR reflector stop behavior to also honor builder context cancellation. |
| internal/store/builder_test.go | Adds unit tests around the combined stop channel behavior for CR reflectors. |
| internal/discovery/types.go | Prevents duplicate kind entries and stop-channel replacement in repeated discovery updates. |
| internal/discovery/types_test.go | Adds deterministic unit tests for Append/Remove idempotency and channel closure. |
| internal/discovery/memleak_test.go | Adds simulation-style tests intended to demonstrate pre/post fix memory & goroutine behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance. |
|
Any idea when this will be released? |
|
@jullianow This will be included in the upcoming release, we are working towards it. Please stay tuned. Thanks. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
The tests introduced here copy the behavior as it is now, and test it. However, since the behavior is not encapsulated (as a testable function), this needs to be updated every time there's a change. I'll suggest doing so and calling that here, or dropping these otherwise.
There was a problem hiding this comment.
Thanks, makes sense. Extracted the bridge goroutine into newCRReflectorStopCh() in builder.go. Both tests now call that directly instead of copying the logic inside of the tests.
There was a problem hiding this comment.
No need to reproduce the current faulty behavior. It'd be fine to just see if the patched behavior works as expected. Tests implore maintenance too, so we'd benefit from testing traits in a way that balances manageable maintenance (PTAL at my other comment regarding reuse) with reasonable coverage, and dropping any extraneous ones that are arguably redundant in the long-term.
There was a problem hiding this comment.
Thanks. Dropped:
- appendToMapBuggy - as you called out, no need to test the existence of bug
- TestGoroutineLeakSimulation - it was duplicating the bridge goroutine logic from the store package and is already covered by the builder_test.go tests
Also, replaced TestMemoryLeakSimulation with TestAppendToMapStability which only asserts fixed behavior.
|
/triage accepted |
|
Hi all, I took a look at this fix and I believe that the issue with the hanging channels would also be fixed by my PR #2872. However, I see that my PR is missing the parent context cancellation logic from builder.go. |
|
Hi @alexandernorth - Thanks for the heads up. Since we already have done a few rounds of review on this one and left with a few test comments, I'd suggest we get this one merged first. Post that, I am happy to review your PR and get you the help needed to move forward on that one. Hope that works! |
|
Hi @rexagod - addressed both your inline comments. PTAL when you get a chance. Thanks! |
…discovery Co-authored-by: Oleg Zaytsev <1511481+colega@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/approve For other maintainers to review, feel free to unhold. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bhope, rexagod The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/unhold |
|
@bhope When is this expected to be released? |
|
@jullianow very soon, we are actively preparing the release. Stay tuned. |
|
thanks you @bhope |
Elevated and unbounded memory growth introduced in v2.18.0 when custom resource state config is in use.
Root Causes
Fix
Also, added tests to cover idempotency and cleanup in the discovery package - verifying no duplicate kinds or channel replacement on repeated AppendToMap calls, and that RemoveFromMap closes channels so reflectors stop cleanly.
Test Results:
TestMemoryLeakSimulation- 5 GVKs × 500 poll cyclesTestGoroutineLeakSimulation- 5 GVKs × 20 store rebuildsFixes #2867