Skip to content

Fix CosmosClient memory leak in CosmosDbFactory.GetCosmosClient#92

Open
bm77525-kr wants to merge 1 commit into
kedacore:mainfrom
bm77525-kr:fix/cosmos-client-leak
Open

Fix CosmosClient memory leak in CosmosDbFactory.GetCosmosClient#92
bm77525-kr wants to merge 1 commit into
kedacore:mainfrom
bm77525-kr:fix/cosmos-client-leak

Conversation

@bm77525-kr
Copy link
Copy Markdown

Fix CosmosClient memory leak in CosmosDbFactory.GetCosmosClient

Summary

CosmosDbFactory.GetCosmosClient allocates a new CosmosClient on every call due to a ConcurrentDictionary.GetOrAdd overload misuse. The class comment explicitly says "it is recommended to maintain a single instance of CosmosClient per lifetime of the application". The implementation does the opposite. In a deployed scaler this leaks 2-4 CosmosClient instances per KEDA poll cycle and OOMs the pod within ~30-50 minutes against a 512Mi container limit.

This PR changes one line to use the lazy-factory overload, makes CreateCosmosClient a virtual method so a regression test can count its invocations, and adds an xUnit test that pins the new behavior.

Problem

src/Scaler/Services/CosmosDbFactory.cs (before):

public CosmosClient GetCosmosClient(string endpointOrConnection, bool useCredentials, string clientId)
{
    return _cosmosClientCache.GetOrAdd(
        (endpointOrConnection, clientId),
        CreateCosmosClient(endpointOrConnection, useCredentials, clientId));   // eager
}

The second argument here is the TValue overload of GetOrAdd. The value is constructed before the dictionary lookup runs. Behavior on every call:

  1. CreateCosmosClient(...) runs, allocating a new CosmosClient (HTTP/gRPC channels, retry timers, address-resolution caches, telemetry pipelines).
  2. GetOrAdd checks the key. If it's already cached, the freshly-constructed instance is silently discarded.
  3. CosmosClient implements IDisposable and holds significant unmanaged state. Discarded instances are never Dispose()'d and are promoted into older GC generations because they have finalizers, so they accumulate.

CosmosDbMetricProvider.GetPartitionCountAsync calls GetCosmosClient twice per invocation (once for the lease container, once for the data container). KEDA polls each ScaledObject via IsActive and/or GetMetrics every pollingInterval seconds (default 30s). Effective leak rate: 2-4 CosmosClient instances per 30s poll ≈ 240-480 leaked instances/hour.

Regression introduction

The bug was introduced on 2025-11-12 in commit 01a2505, "Managed Identity Support for CosmosDB External Scaler (#86)". Parent commit: bac322b.

The pre-#86 code was correct: it passed CreateCosmosClient as a method group to GetOrAdd, which the compiler resolves to Func<string, CosmosClient> and dispatches to the lazy overload:

// before #86 (correct)
public CosmosClient GetCosmosClient(string connection)
{
    return _cosmosClientCache.GetOrAdd(connection, CreateCosmosClient);
}

private CosmosClient CreateCosmosClient(string connection) { ... }

When #86 added managed-identity support, CreateCosmosClient grew two additional parameters (useCredentials, clientId). The author needed to thread these new arguments through, and the natural-looking edit was to invoke the method directly:

// after #86 (accidentally eager)
public CosmosClient GetCosmosClient(string endpointOrConnection, bool useCredentials, string clientId)
{
    return _cosmosClientCache.GetOrAdd(
        (endpointOrConnection, clientId),
        CreateCosmosClient(endpointOrConnection, useCredentials, clientId));   // method invocation, not method group
}

The call-site shape barely changed (same dictionary, same method name, same GetOrAdd), but overload resolution silently flipped from GetOrAdd(TKey, Func<TKey, TValue>) to GetOrAdd(TKey, TValue). The compiler had no reason to warn: both overloads exist and both compile. The behavioral change is invisible at the call site and only manifests as a slow-burn memory leak in production.

The regression test in this PR is designed to catch exactly this class of mistake going forward: any future signature change to CreateCosmosClient that re-introduces eager invocation will fail GetCosmosClient_OnlyConstructsOnceForSameKey.

Fix

src/Scaler/Services/CosmosDbFactory.cs (after):

public CosmosClient GetCosmosClient(string endpointOrConnection, bool useCredentials, string clientId)
{
    return _cosmosClientCache.GetOrAdd(
        (endpointOrConnection, clientId),
        _ => CreateCosmosClient(endpointOrConnection, useCredentials, clientId));   // lazy
}

Switches to the Func<TKey, TValue> overload of GetOrAdd. The factory delegate only runs on cache miss. Once the keys for the lease container and data container are cached (after the first poll), no further CosmosClient allocations happen for the lifetime of the process.

Other changes

To enable a regression test that counts CreateCosmosClient invocations:

  • internal sealed class CosmosDbFactoryinternal class CosmosDbFactory
  • private CosmosClient CreateCosmosClient(...)protected internal virtual CosmosClient CreateCosmosClient(...)

These two visibility changes are the minimum needed to allow a test subclass to override the create method. No production behavior changes.

Tests

New file: src/Scaler.Tests/CosmosDbFactoryTests.cs

Two test cases, both using a CountingCosmosDbFactory test subclass that overrides CreateCosmosClient to count invocations via Interlocked.Increment:

Test Asserts
GetCosmosClient_OnlyConstructsOnceForSameKey After 3 calls with the same connection string: returned references are identical AND CreateCosmosClient ran exactly once.
GetCosmosClient_ConstructsOncePerDistinctKey After 4 calls alternating between 2 connection strings: CreateCosmosClient ran exactly twice.

The first test is the regression assertion: it fails on the buggy version with Expected: 1, Actual: 3 and passes on the fix.

Verification

To confirm the regression test catches the bug, temporarily revert just the lambda fix on line 17 (keep the virtual / un-sealed changes):

return _cosmosClientCache.GetOrAdd(
    (endpointOrConnection, clientId),
    CreateCosmosClient(endpointOrConnection, useCredentials, clientId));   // eager (buggy)

Run:

dotnet test src/Scaler.Tests/Keda.CosmosDb.Scaler.Tests.csproj \
  --filter "FullyQualifiedName~CosmosDbFactoryTests"

GetCosmosClient_OnlyConstructsOnceForSameKey will fail:

Assert.Equal() Failure
Expected: 1
Actual:   3

Restore the fix; both tests pass.

Impact

  • Memory: Steady-state working set should stay flat per process (one CosmosClient per distinct (endpointOrConnection, clientId) key) instead of growing linearly to the OOM ceiling. Expected reduction in pod memory: from ~500 MB right before OOM down to ~50-100 MB steady-state.
  • OOMs: Eliminated for the leak-driven cycle. Pods should now achieve normal lifetimes bounded by node maintenance / deploys, not by 30-minute OOM cadence.
  • Downstream: Any consumer with a Cosmos DB change-feed trigger using KEDA's cosmosdb-partitioncount scaler will see fewer HPA scaling oscillations and therefore fewer LeaseLostException events on their leases container during steady state.
  • Compatibility: No public API changes. No configuration changes. No deployment changes. Drop-in replacement.

GetOrAdd was being called with the eager TValue overload of
ConcurrentDictionary, constructing a new CosmosClient on every call
and silently discarding it when the cache already held an entry for
the key. Discarded instances hold unmanaged state and have
finalizers; they accumulate until the pod is OOMKilled.

Switch to the lazy Func<TKey, TValue> overload so
CreateCosmosClient only runs on cache miss.

Add a regression test that counts CreateCosmosClient invocations via
a test subclass. This requires changing the class from
'internal sealed' to 'internal' and CreateCosmosClient from
'private' to 'protected internal virtual'.

Signed-off-by: Brendan Morante <brendan.morante@kroger.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant