fix: stop memory leak from orphaned CR reflector goroutines on repeated CRD discovery #2920

Merged
k8s-ci-robot merged 1 commit into kubernetes:main from bhope:fix-mem-leak on May 5, 2026

@bhope
Member

@bhope bhope commented Apr 10, 2026

This PR fixes the elevated, unbounded memory growth introduced in v2.18.0, which occurs when custom resource state config is in use.

Root Causes

  1. AppendToMap overwrites stop channels and appends duplicate kinds on every call (internal/discovery/types.go). Since PollForCacheUpdates calls it for every known GVK each cycle, old stop channels were silently replaced, orphaning any reflector goroutine blocking on them.
  2. CR reflectors ignore context cancellation (internal/store/builder.go). Unlike standard reflectors started with reflector.Run(b.ctx.Done()), custom resource reflectors were started with only their GVK-specific stop channel - no context cancellation path at all.
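
To illustrate the first root cause, here is a minimal, self-contained sketch. It is not the kube-state-metrics code: `register` and `isClosed` are hypothetical helpers and the GVK string is made up. It shows how re-creating a stop channel on every call leaves the previous channel open forever, so anything blocked on it never wakes up.

```go
package main

import "fmt"

// register mimics a non-idempotent registration: it always creates a new
// stop channel and overwrites whatever was stored for the key.
// Hypothetical helper, for illustration only.
func register(stops map[string]chan struct{}, key string) chan struct{} {
	ch := make(chan struct{})
	stops[key] = ch
	return ch
}

// isClosed reports whether ch has been closed, without blocking.
func isClosed(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	stops := map[string]chan struct{}{}
	key := "example.com/v1/Widget" // made-up GVK

	// First poll cycle: a reflector would start and block on this channel.
	old := register(stops, key)

	// Next poll cycle: the entry is overwritten with a fresh channel.
	register(stops, key)

	// Removal closes only the channel currently stored in the map.
	close(stops[key])
	delete(stops, key)

	// The first channel is unreachable from the map and was never closed.
	fmt.Println("old channel closed:", isClosed(old)) // prints "old channel closed: false"
}
```

A goroutine waiting on `old` would therefore stay parked for the lifetime of the process, which matches the orphaned-reflector behavior described above.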

Fix

  • AppendToMap: skip the append if the kind already exists; skip make(chan struct{}) if a channel already exists for the GVK.
  • startReflector: wrap the GVK stop channel with a bridge goroutine that also selects on b.ctx.Done(), so CR reflectors stop on both CRD deletion and context cancellation.

Also, added tests to cover idempotency and cleanup in the discovery package - verifying no duplicate kinds or channel replacement on repeated AppendToMap calls, and that RemoveFromMap closes channels so reflectors stop cleanly.
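
For readers who want to see the shape of the idempotency and cleanup properties those tests check, here is a simplified model. The `registry` type and its `appendGVK`/`removeGVK` methods are hypothetical stand-ins, not the actual CRDiscoverer API.

```go
package main

import (
	"fmt"
	"sync"
)

// gvk is a hypothetical stand-in for a group/version/kind identifier.
type gvk struct {
	Group, Version, Kind string
}

// registry is a simplified model of a discovery cache: known kinds per
// group/version plus one stop channel per GVK.
type registry struct {
	mu    sync.Mutex
	kinds map[string][]string
	stops map[gvk]chan struct{}
}

func newRegistry() *registry {
	return &registry{
		kinds: map[string][]string{},
		stops: map[gvk]chan struct{}{},
	}
}

// appendGVK is idempotent: it adds the kind only if it is missing and
// creates a stop channel only if none exists for the GVK.
func (r *registry) appendGVK(k gvk) {
	r.mu.Lock()
	defer r.mu.Unlock()
	gv := k.Group + "/" + k.Version
	found := false
	for _, kind := range r.kinds[gv] {
		if kind == k.Kind {
			found = true
			break
		}
	}
	if !found {
		r.kinds[gv] = append(r.kinds[gv], k.Kind)
	}
	if _, ok := r.stops[k]; !ok {
		r.stops[k] = make(chan struct{})
	}
}

// removeGVK closes the stop channel so that waiters are released, then
// drops the GVK from both maps.
func (r *registry) removeGVK(k gvk) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ch, ok := r.stops[k]; ok {
		close(ch)
		delete(r.stops, k)
	}
	gv := k.Group + "/" + k.Version
	kept := r.kinds[gv][:0]
	for _, kind := range r.kinds[gv] {
		if kind != k.Kind {
			kept = append(kept, kind)
		}
	}
	r.kinds[gv] = kept
}

func main() {
	r := newRegistry()
	k := gvk{Group: "example.com", Version: "v1", Kind: "Widget"}
	for i := 0; i < 500; i++ { // simulate 500 poll cycles
		r.appendGVK(k)
	}
	fmt.Println("kinds:", len(r.kinds["example.com/v1"]), "stop channels:", len(r.stops))
}
```

With this shape, repeated poll cycles leave the map sizes constant, and the channel a reflector originally received is the same one that removal eventually closes.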

Test Results:

TestMemoryLeakSimulation - 5 GVKs × 500 poll cycles

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
|---|---|---|
| Kind entries in map | 2500 | 5 |
| Stop channels live | 5 | 5 |
| Heap growth (KB) | +88 | -8 |

TestGoroutineLeakSimulation - 5 GVKs × 20 store rebuilds

| Metric | Buggy (pre-fix) | Fixed (post-fix) |
|---|---|---|
| Goroutines leaked | 100 | 0 |

Fixes #2867

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 10, 2026
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Instrumentation Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026
@mrueg mrueg requested a review from Copilot April 10, 2026 20:58
Copilot AI left a comment

Pull request overview

Fixes elevated/unbounded memory growth and goroutine leaks when custom resource state config is enabled by making CRD discovery idempotent and ensuring custom-resource reflectors stop on both CRD removal and context cancellation.

Changes:

  • Make CRDiscoverer.AppendToMap idempotent (no duplicate kinds; don’t replace existing stop channels).
  • Ensure custom-resource reflectors stop when either the GVK stop channel fires or the builder context is cancelled.
  • Add/extend tests covering idempotency, channel cleanup, and leak simulations.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| internal/store/builder.go | Updates CR reflector stop behavior to also honor builder context cancellation. |
| internal/store/builder_test.go | Adds unit tests around the combined stop channel behavior for CR reflectors. |
| internal/discovery/types.go | Prevents duplicate kind entries and stop-channel replacement in repeated discovery updates. |
| internal/discovery/types_test.go | Adds deterministic unit tests for Append/Remove idempotency and channel closure. |
| internal/discovery/memleak_test.go | Adds simulation-style tests intended to demonstrate pre/post fix memory & goroutine behavior. |

Comment thread internal/store/builder.go Outdated
Comment thread internal/store/builder_test.go
Comment thread internal/store/builder_test.go Outdated
Comment thread internal/discovery/memleak_test.go Outdated
Comment thread internal/discovery/memleak_test.go Outdated
@bhope
Member Author

bhope commented Apr 13, 2026

Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

@jullianow

Any idea when this will be released?

@bhope
Member Author

bhope commented Apr 14, 2026

@jullianow This will be included in the upcoming release, which we are working towards. Please stay tuned. Thanks.

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


Comment thread internal/store/builder_test.go Outdated
@mrueg
Member

mrueg commented Apr 15, 2026

> Hi @mrueg addressed the copilot suggestions and CI is now green. Ready for a review when you get a chance.

Thanks for looking into those comments.
Unfortunately I won't have access to a way to test it until mid May due to private travel.

@rexagod can you take a look?

Member

The tests introduced here copy the behavior as it is now, and test it. However, since the behavior is not encapsulated (as a testable function), this needs to be updated every time there's a change. I'll suggest doing so and calling that here, or dropping these otherwise.

Member Author

Thanks, makes sense. Extracted the bridge goroutine into newCRReflectorStopCh() in builder.go. Both tests now call that directly instead of copying the logic inside of the tests.

Member

No need to reproduce the current faulty behavior. It'd be fine to just see if the patched behavior works as expected. Tests require maintenance too, so we'd benefit from testing traits in a way that balances manageable maintenance (PTAL at my other comment regarding reuse) with reasonable coverage, and dropping any extraneous ones that are arguably redundant in the long term.

Member Author

Thanks. Dropped:

  1. appendToMapBuggy - as you called out, no need to test the existence of the bug
  2. TestGoroutineLeakSimulation - it was duplicating the bridge goroutine logic from the store package and is already covered by the builder_test.go tests

Also, replaced TestMemoryLeakSimulation with TestAppendToMapStability which only asserts fixed behavior.

@rexagod rexagod moved this from Needs Triage to In Progress in SIG Instrumentation Apr 21, 2026
@rexagod
Member

rexagod commented Apr 21, 2026

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 21, 2026
@alexandernorth

Hi all, I took a look at this fix and I believe that the issue with the hanging channels would also be fixed by my PR #2872. However, I see that my PR is missing the parent context cancellation logic from builder.go.
Would you also be open to the approach I use regarding storing the stopChans, but then adding the logic from this PR's builder.go to ensure that parent context cancellation also propagates correctly?
Let me know your thoughts!

@rexagod rexagod mentioned this pull request Apr 22, 2026
@bhope
Member Author

bhope commented Apr 28, 2026

Hi @alexandernorth - Thanks for the heads up. Since we have already done a few rounds of review on this one and are left with only a few test comments, I'd suggest we get this one merged first. After that, I am happy to review your PR and get you the help needed to move forward on that one. Hope that works!

@bhope
Member Author

bhope commented Apr 28, 2026

Hi @rexagod - addressed both your inline comments. PTAL when you get a chance. Thanks!

…discovery

Co-authored-by: Oleg Zaytsev <1511481+colega@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


Comment thread pkg/app/server.go
@rexagod
Member

rexagod commented Apr 29, 2026

/approve
/lgtm
/hold

For other maintainers to review, feel free to unhold.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bhope, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026
@bhope
Member Author

bhope commented May 5, 2026

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2026
@k8s-ci-robot k8s-ci-robot merged commit d40135d into kubernetes:main May 5, 2026
17 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SIG Instrumentation May 5, 2026
@jullianow

@bhope When is this expected to be released?

@bhope
Member Author

bhope commented May 5, 2026

@jullianow very soon, we are actively preparing the release. Stay tuned.

@jullianow

Thank you, @bhope

Development

Successfully merging this pull request may close these issues.

Elevated Memory Utilization (v2.18.0)