Fix CosmosClient memory leak in CosmosDbFactory.GetCosmosClient #92
Open
bm77525-kr wants to merge 1 commit into
GetOrAdd was being called with the eager TValue overload of ConcurrentDictionary, constructing a new CosmosClient on every call and silently discarding it when the cache already held an entry for the key. Discarded instances hold unmanaged state and have finalizers; they accumulate until the pod is OOMKilled.

Switch to the lazy Func<TKey, TValue> overload so CreateCosmosClient only runs on cache miss.

Add a regression test that counts CreateCosmosClient invocations via a test subclass. This requires changing the class from 'internal sealed' to 'internal' and CreateCosmosClient from 'private' to 'protected internal virtual'.

Signed-off-by: Brendan Morante <brendan.morante@kroger.com>
Fix CosmosClient memory leak in CosmosDbFactory.GetCosmosClient

Summary

`CosmosDbFactory.GetCosmosClient` allocates a new `CosmosClient` on every call due to a `ConcurrentDictionary.GetOrAdd` overload misuse. The class comment explicitly says "it is recommended to maintain a single instance of CosmosClient per lifetime of the application". The implementation does the opposite. In a deployed scaler this leaks 2-4 `CosmosClient` instances per KEDA poll cycle and OOMs the pod within ~30-50 minutes against a 512Mi container limit.

This PR changes one line to use the lazy-factory overload, makes `CreateCosmosClient` a virtual method so a regression test can count its invocations, and adds an xUnit test that pins the new behavior.

Problem
`src/Scaler/Services/CosmosDbFactory.cs` (before):

The second argument here is the `TValue` overload of `GetOrAdd`. The value is constructed before the dictionary lookup runs. Behavior on every call:

1. `CreateCosmosClient(...)` runs, allocating a new `CosmosClient` (HTTP/gRPC channels, retry timers, address-resolution caches, telemetry pipelines).
2. `GetOrAdd` checks the key. If it's already cached, the freshly-constructed instance is silently discarded.

`CosmosClient` implements `IDisposable` and holds significant unmanaged state. Discarded instances are never `Dispose()`'d and are promoted into older GC generations because they have finalizers, so they accumulate.

`CosmosDbMetricProvider.GetPartitionCountAsync` calls `GetCosmosClient` twice per invocation (once for the lease container, once for the data container). KEDA polls each `ScaledObject` via `IsActive` and/or `GetMetrics` every `pollingInterval` seconds (default 30s). Effective leak rate: 2-4 `CosmosClient` instances per 30s poll ≈ 240-480 leaked instances/hour.

Regression introduction
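The eager and lazy call shapes this PR revolves around can be reproduced in isolation. The following is a minimal sketch, not the scaler's code: `FakeClient`, `CreateSimple`, `CreateWithIdentity`, and the `OverloadDemo` helpers are hypothetical stand-ins invented for illustration, and only `ConcurrentDictionary<TKey, TValue>.GetOrAdd` is the real API. It counts how many instances are constructed when the same key is requested repeatedly through each call shape.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for CosmosClient; it only counts constructions.
public sealed class FakeClient
{
    public static int Constructed;
    public FakeClient() => Interlocked.Increment(ref Constructed);
}

public static class OverloadDemo
{
    // One-parameter factory: convertible to Func<string, FakeClient> as a method group.
    private static FakeClient CreateSimple(string key) => new FakeClient();

    // Hypothetical three-parameter factory, echoing the extra arguments described in this PR.
    private static FakeClient CreateWithIdentity(string key, bool useCredentials, string clientId)
        => new FakeClient();

    // Runs getClient against a fresh cache `calls` times and reports how many clients were built.
    private static int Count(Func<ConcurrentDictionary<string, FakeClient>, FakeClient> getClient, int calls)
    {
        var cache = new ConcurrentDictionary<string, FakeClient>();
        FakeClient.Constructed = 0;
        for (var i = 0; i < calls; i++) getClient(cache);
        return FakeClient.Constructed;
    }

    // Method group: binds to GetOrAdd(TKey, Func<TKey, TValue>), so construction is lazy.
    public static int MethodGroup(int calls) =>
        Count(c => c.GetOrAdd("k", CreateSimple), calls);

    // Direct invocation: binds to GetOrAdd(TKey, TValue); the argument is evaluated on every call.
    public static int EagerInvocation(int calls) =>
        Count(c => c.GetOrAdd("k", CreateWithIdentity("k", true, "id")), calls);

    // Lambda: binds to the Func overload again; the extra arguments are captured by the closure.
    public static int LazyLambda(int calls) =>
        Count(c => c.GetOrAdd("k", key => CreateWithIdentity(key, true, "id")), calls);

    public static void Main()
    {
        Console.WriteLine(MethodGroup(3));      // 1
        Console.WriteLine(EagerInvocation(3));  // 3
        Console.WriteLine(LazyLambda(3));       // 1
    }
}
```

All three shapes compile without warnings, which is why the difference is easy to miss in review. Only the constructed-instance count distinguishes them.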
The bug was introduced on 2025-11-12 in commit `01a2505`, "Managed Identity Support for CosmosDB External Scaler (#86)". Parent commit: `bac322b`.

The pre-#86 code was correct: it passed `CreateCosmosClient` as a method group to `GetOrAdd`, which the compiler resolves to `Func<string, CosmosClient>` and dispatches to the lazy overload:

When #86 added managed-identity support, `CreateCosmosClient` grew two additional parameters (`useCredentials`, `clientId`). The author needed to thread these new arguments through, and the natural-looking edit was to invoke the method directly:

The call-site shape barely changed (same dictionary, same method name, same `GetOrAdd`), but overload resolution silently flipped from `GetOrAdd(TKey, Func<TKey, TValue>)` to `GetOrAdd(TKey, TValue)`. The compiler had no reason to warn: both overloads exist and both compile. The behavioral change is invisible at the call site and only manifests as a slow-burn memory leak in production.

The regression test in this PR is designed to catch exactly this class of mistake going forward: any future signature change to `CreateCosmosClient` that re-introduces eager invocation will fail `GetCosmosClient_OnlyConstructsOnceForSameKey`.

Fix
`src/Scaler/Services/CosmosDbFactory.cs` (after):

Switches to the `Func<TKey, TValue>` overload of `GetOrAdd`. The factory delegate only runs on cache miss. Once the keys for the lease container and data container are cached (after the first poll), no further `CosmosClient` allocations happen for the lifetime of the process.

Other changes
To enable a regression test that counts `CreateCosmosClient` invocations:

- `internal sealed class CosmosDbFactory` → `internal class CosmosDbFactory`
- `private CosmosClient CreateCosmosClient(...)` → `protected internal virtual CosmosClient CreateCosmosClient(...)`

These two visibility changes are the minimum needed to allow a test subclass to override the create method. No production behavior changes.
Tests
New file: `src/Scaler.Tests/CosmosDbFactoryTests.cs`

Two test cases, both using a `CountingCosmosDbFactory` test subclass that overrides `CreateCosmosClient` to count invocations via `Interlocked.Increment`:

- `GetCosmosClient_OnlyConstructsOnceForSameKey`: asserts that `CreateCosmosClient` ran exactly once.
- `GetCosmosClient_ConstructsOncePerDistinctKey`: asserts that `CreateCosmosClient` ran exactly twice.

The first test is the regression assertion: it fails on the buggy version with `Expected: 1, Actual: 3` and passes on the fix.

Verification
To confirm the regression test catches the bug, temporarily revert just the lambda fix on line 17 (keep the `virtual` / un-sealed changes):

Run:

`GetCosmosClient_OnlyConstructsOnceForSameKey` will fail:

Restore the fix; both tests pass.
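The counting-subclass technique from the Tests section, and the pass/fail behavior described here, can be approximated without the Cosmos SDK or xUnit. This is a hedged sketch under stated assumptions: `StubClient`, `SketchFactory`, `CountingFactory`, and `RegressionSketch` are hypothetical names, the `eager` switch exists only to reproduce the buggy shape side by side, and three same-key calls are assumed because that matches the `Expected: 1, Actual: 3` failure quoted above. It is not the contents of `CosmosDbFactoryTests.cs`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for CosmosClient so the sketch needs no SDK reference.
public class StubClient { }

// Simplified stand-in for the factory: un-sealed, with an overridable create method.
public class SketchFactory
{
    private readonly ConcurrentDictionary<string, StubClient> _cache = new();
    private readonly bool _eager;

    // The eager flag is an illustration-only switch; a real factory would not have it.
    public SketchFactory(bool eager = false) { _eager = eager; }

    public StubClient GetClient(string key) =>
        _eager
            ? _cache.GetOrAdd(key, CreateClient(key))      // buggy shape: value built before the lookup
            : _cache.GetOrAdd(key, k => CreateClient(k));  // fixed shape: built only on cache miss

    protected internal virtual StubClient CreateClient(string key) => new StubClient();
}

// Test subclass that counts how often the create method actually runs.
public sealed class CountingFactory : SketchFactory
{
    private int _createCount;

    public CountingFactory(bool eager = false) : base(eager) { }

    public int CreateCount => Volatile.Read(ref _createCount);

    protected internal override StubClient CreateClient(string key)
    {
        Interlocked.Increment(ref _createCount);
        return base.CreateClient(key);
    }
}

public static class RegressionSketch
{
    // Requests each key in order and returns the number of CreateClient invocations.
    public static int CreateCountAfter(bool eager, params string[] keys)
    {
        var factory = new CountingFactory(eager);
        foreach (var key in keys) factory.GetClient(key);
        return factory.CreateCount;
    }

    public static void Main()
    {
        Console.WriteLine(CreateCountAfter(false, "a", "a", "a")); // 1 (fixed shape, same key)
        Console.WriteLine(CreateCountAfter(true, "a", "a", "a"));  // 3 (buggy shape, same key)
        Console.WriteLine(CreateCountAfter(false, "a", "b"));      // 2 (fixed shape, distinct keys)
    }
}
```

Counting in an override rather than inspecting the cache is what makes the assertion sensitive to discarded instances: the eager shape leaves the cache contents identical, so only the invocation count reveals the extra constructions.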
Impact
- Scaler memory stays bounded (one `CosmosClient` per distinct `(endpointOrConnection, clientId)` key) instead of growing linearly to the OOM ceiling. Expected reduction in pod memory: from ~500 MB right before OOM down to ~50-100 MB steady-state.
- Users of the `cosmosdb-partitioncount` scaler will see fewer HPA scaling oscillations and therefore fewer `LeaseLostException` events on their leases container during steady state.
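For context on the figures above, the leak-rate estimate quoted under Problem (2-4 instances per 30s poll ≈ 240-480 per hour) is plain arithmetic. The helper below is a hypothetical illustration of that calculation, not part of the scaler; `LeakRate.PerHour` and its parameter names are invented for this sketch.

```csharp
using System;

public static class LeakRate
{
    // Returns the (low, high) number of leaked instances per hour for a given
    // polling interval and a per-poll leak range.
    public static (int Low, int High) PerHour(int pollSeconds, int lowPerPoll, int highPerPoll)
    {
        int pollsPerHour = 3600 / pollSeconds;
        return (pollsPerHour * lowPerPoll, pollsPerHour * highPerPoll);
    }

    public static void Main()
    {
        var (low, high) = PerHour(30, 2, 4);
        Console.WriteLine($"{low}-{high} leaked instances/hour"); // 240-480 leaked instances/hour
    }
}
```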