Gate named-blob 404->503 translation behind a config kill switch#3264

Merged
SophieGuo410 merged 1 commit into linkedin:master from abha-mutalik:gate-named-blob-503-translation
May 22, 2026

@abha-mutalik
Contributor

Summary

PR #3234 (AMBRY-14247) changed the named-blob GET path to translate BlobDoesNotExist -> AmbryUnavailable (HTTP 503) when the named-blob metadata row exists but every storage replica returns BlobNotFound. The PR's own risk note flags potential retry-storm amplification on the inconsistency path, since the implicit 404 circuit-breaker is removed.
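The amplification concern can be illustrated with a toy model. This is only a sketch under an assumed client policy (retry on 503, never on 404, with a fixed attempt budget); `RetryAmplificationSketch` and `requestsSent` are hypothetical names, not Ambry or client code, and real clients may add backoff or jitter.

```java
// Hypothetical client model, not taken from Ambry: a 503 is treated as
// transient and retried, while a 404 is treated as final.
final class RetryAmplificationSketch {

  /** Counts the requests one logical GET generates when the server keeps returning {@code status}. */
  static int requestsSent(int status, int maxAttempts) {
    int sent = 0;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      sent++;
      boolean retryable = status == 503; // assumed retry policy
      if (!retryable) {
        break; // a 404 ends the loop after one request
      }
    }
    return sent;
  }

  public static void main(String[] args) {
    System.out.println("404 -> " + requestsSent(404, 4) + " request(s)"); // 1
    System.out.println("503 -> " + requestsSent(503, 4) + " request(s)"); // 4
  }
}
```

Under this assumed policy, a persistent metadata-vs-storage inconsistency costs one request per GET when it surfaces as 404, but the full attempt budget when it surfaces as 503.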

This adds a kill switch so the behavior can be disabled per-cluster without a code rollback.

  • New config router.named.blob.translate.not.found.to.unavailable.enabled (default true, preserves current behavior).
  • When false, NonBlockingRouter.translateNamedBlobMissingInStorage returns the original BlobDoesNotExist unchanged.
  • The observability counter namedBlobMetadataExistsButStorageNotFoundCount still increments when the config is disabled so the underlying event rate remains visible during/after a rollback.
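The three points above can be sketched in isolation as follows. This is a simplified stand-in rather than the actual `NonBlockingRouter` code: `ErrorCode`, `SketchException`, and the plain `long` counter are hypothetical substitutes for `RouterErrorCode`, `RouterException`, and the router metric.

```java
// Simplified stand-in for the gated translation; the types here are
// hypothetical and only mirror the shape described in the PR.
final class NamedBlobKillSwitchSketch {

  enum ErrorCode { BlobDoesNotExist, AmbryUnavailable, Other }

  static final class SketchException extends Exception {
    final ErrorCode code;

    SketchException(ErrorCode code) {
      super(code.name());
      this.code = code;
    }
  }

  long metadataExistsButStorageNotFoundCount = 0; // stand-in for the metric
  final boolean translateEnabled;                 // stand-in for the new config

  NamedBlobKillSwitchSketch(boolean translateEnabled) {
    this.translateEnabled = translateEnabled;
  }

  Exception translate(Exception e) {
    if (e instanceof SketchException
        && ((SketchException) e).code == ErrorCode.BlobDoesNotExist) {
      // Count the event first so it stays visible even when translation is off.
      metadataExistsButStorageNotFoundCount++;
      if (!translateEnabled) {
        return e; // kill switch: surface the original error unchanged
      }
      return new SketchException(ErrorCode.AmbryUnavailable);
    }
    return e; // unrelated errors pass through untouched
  }
}
```

Placing the counter increment ahead of the config check is what keeps the event rate observable in both modes.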

Testing Done

  • Added testNamedBlobMissingInStorageNotTranslatedWhenConfigDisabled which sets the config to false, drives storage to BlobNotFound, and asserts the surfaced error is BlobDoesNotExist while the counter still increments.
  • Overloaded setUpNamedBlobAndPut to accept custom Properties; existing call sites unchanged.
  • Existing three named-blob translation tests still pass under default config (true).
  • Ran the full parameterized suite:
    ./gradlew :ambry-router:test --tests "com.github.ambry.router.NonBlockingRouterTest.testNamedBlob*"
    
    24 tests passed (4 parameterized configs x 6 tests).
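The setup pattern described above (an overload that accepts custom `Properties` while existing call sites keep the default) might look roughly like this. It is a sketch using plain `java.util.Properties`; Ambry's own config classes and the real `setUpNamedBlobAndPut` signature are not reproduced here, and `KillSwitchConfigSketch`, `setUp`, and `translateEnabled` are hypothetical names. Only the config key comes from the PR.

```java
import java.util.Properties;

// Hypothetical illustration of "default true, overridable per test".
final class KillSwitchConfigSketch {

  static final String KEY = "router.named.blob.translate.not.found.to.unavailable.enabled";

  /** Resolves the flag, falling back to true when the key is absent. */
  static boolean translateEnabled(Properties props) {
    return Boolean.parseBoolean(props.getProperty(KEY, "true"));
  }

  /** Existing call sites: no custom properties, so the default applies. */
  static boolean setUp() {
    return setUp(new Properties());
  }

  /** New overload: a test can pass overrides, for example to disable the switch. */
  static boolean setUp(Properties custom) {
    return translateEnabled(custom);
  }

  public static void main(String[] args) {
    Properties disabled = new Properties();
    disabled.setProperty(KEY, "false");
    System.out.println(setUp());         // true
    System.out.println(setUp(disabled)); // false
  }
}
```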

Risk

The default preserves current behavior, so the rollout itself is a no-op. Setting the new flag to false restores the pre-#3234 behavior (a 404 surfaces to clients) on the inconsistency path only; the metric remains visible either way.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

PR linkedin#3234 (AMBRY-14247) translates BlobDoesNotExist to AmbryUnavailable on the
named-blob GET path when metadata exists but storage replicas return NOT_FOUND.
The PR's own risk note flags retry-storm amplification on the inconsistency
path. Add router.named.blob.translate.not.found.to.unavailable.enabled
(default true) so the behavior can be disabled per-cluster without a code
rollback. The observability counter still increments when disabled so the
underlying event rate remains visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented May 22, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.23%. Comparing base (52ba813) to head (193fd52).
⚠️ Report is 392 commits behind head on master.

Files with missing lines | Patch % | Lines
...ava/com/github/ambry/router/NonBlockingRouter.java | 0.00% | 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3264       +/-   ##
=============================================
- Coverage     64.24%   51.23%   -13.01%     
+ Complexity    10398     8678     -1720     
=============================================
  Files           840      931       +91     
  Lines         71755    79544     +7789     
  Branches       8611     9526      +915     
=============================================
- Hits          46099    40757     -5342     
- Misses        23004    35409    +12405     
- Partials       2652     3378      +726     

@abha-mutalik requested a review from SophieGuo410 May 22, 2026 19:56
if (e instanceof RouterException
    && ((RouterException) e).getErrorCode() == RouterErrorCode.BlobDoesNotExist) {
  routerMetrics.namedBlobMetadataExistsButStorageNotFoundCount.inc();
  if (!routerConfig.routerNamedBlobTranslateNotFoundToUnavailableEnabled) {
Contributor

Why don't we just use the config to gate the whole translateNamedBlobMissingInStorage method?

Contributor Author

I added the check here, rather than around the whole method, so that the metric on the line above (namedBlobMetadataExistsButStorageNotFoundCount) still increments when the config is disabled. That way we keep visibility into every occurrence of the metadata-vs-storage divergence.

@SophieGuo410 merged commit 0e63196 into linkedin:master May 22, 2026
11 checks passed