prometheus: add Registry.DisableHelpTextValidation opt-in#2006
Open
spor3006 wants to merge 1 commit into
Open
Conversation
When two collectors emit metrics with the same fully-qualified name but
slightly different help text — most commonly two libraries (or two
versions of the same library) registering compatible metrics into the
same registry, e.g. a help string gaining or losing a trailing period
across a client_golang upgrade — Registry.Gather currently returns
collected metric foo {…} has help "X." but should have "X"
turning a metadata difference into a hard /metrics outage for the whole
binary. The Prometheus server itself is lenient about help-text drift on
scrape, so this strict check makes interop between libraries harder than
it needs to be.
Add an opt-in escape hatch:
reg := prometheus.NewRegistry()
reg.DisableHelpTextValidation()
When disabled, processMetric aggregates later metrics into the family
that the first metric created, keeping the first help text. The default
behavior is unchanged — strict help-text validation remains the
recommended setting because it surfaces accidental help-text changes
early. Pedantic registries unconditionally override the toggle so that
a pedantic caller never silently gets lenient semantics.
Implementation:
* Registry gains a private helpTextValidationDisabled field.
* Gather() snapshots the effective toggle (toggle && !pedantic) under
the existing RLock so all processMetric calls in one Gather see a
consistent value.
* processMetric takes the flag as a new parameter; the rest of its
behavior (type checks, suffix collisions, pedantic desc-consistency
checks) is unchanged.
* Gatherers.Gather (cross-Gatherer merge) and checkDescConsistency
(pedantic per-Desc check) are deliberately left strict — they cover
different concerns and changing them is out of scope.
Tests:
* TestGatherRejectsHelpTextMismatchByDefault locks in
backwards-compatibility: the strict error is still produced when
the toggle has not been flipped.
* TestGatherAcceptsHelpTextMismatchWhenDisabled is the regression
test for the issue — same fixture as above, with the toggle
enabled, Gather succeeds.
* TestGatherWithDisabledHelpTextValidationKeepsFirstHelp uses a
single collector that emits metrics in deterministic source order
to pin down the documented "first help text wins" guarantee.
* TestPedanticRegistryOverridesHelpTextValidationToggle verifies
that flipping the toggle on a pedantic Registry has no effect.
All four are stable under `go test -count=10 -race`.
Fixes prometheus#1820
Signed-off-by: Sparshal Kothari <41056517+spor3006@users.noreply.github.com>
3d3b81a to
c0f198b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Registry.Gathercurrently returns an error when two collectors emit metrics with the same fully-qualified name but slightly different help text:The most common trigger is two libraries — or two versions of the same library — registering compatible metrics into the same registry. A trivial difference like a help string gaining or losing a trailing period across a
client_golangupgrade turns a metadata mismatch into a hard/metricsoutage for the whole binary (see #1820).The Prometheus server itself is lenient about help-text drift on scrape, so this strict check makes inter-library interop harder than it needs to be.
Design
This PR is strictly additive to the public API and leaves the default behavior unchanged. It adds one new method:
When the toggle is enabled,
processMetricaggregates later metrics into the family the first metric created and keeps the first help text seen. Other validation (type checks, suffix collisions, dim-hash consistency atRegistertime, pedantic per-Descchecks) is unchanged.Why opt-in rather than relaxing the default
The strict default has value: it surfaces accidental help-text changes during development and testing. Silently overwriting a refactored help string under the user is a debuggability footgun. Forcing callers to consciously opt in to the lenient semantics keeps that safety net for everyone who isn't fighting the interop problem.
Pedantic registries
Calling
DisableHelpTextValidationon a Registry returned byNewPedanticRegistryhas no effect. The pedantic caller has opted into the strictest set of checks; mixing that with lenient help validation would be internally contradictory. The override is computed once perGatherunder the existingRLock:What this PR does NOT change
Gatherers.Gather(cross-Gatherer merge) andcheckDescConsistency(pedantic per-Desccheck) are intentionally left strict. They cover different concerns fromprocessMetric's per-fqName check, and changing them is out of scope.Register()still rejects mismatched dim-hashes (which includes help text) at registration time for checked collectors. The toggle only affects whatGatherdoes with metrics that nevertheless reachprocessMetricwith mismatched help — in practice, the unchecked-collector path.Tests
Four new tests in
prometheus/registry_test.go, all stable undergo test -count=10 -race:TestGatherRejectsHelpTextMismatchByDefaultTestGatherAcceptsHelpTextMismatchWhenDisabledTestGatherWithDisabledHelpTextValidationKeepsFirstHelpTestPedanticRegistryOverridesHelpTextValidationToggleTwo unexported test helpers are added:
helpTextCollector— unchecked collector emitting one metric with configurable fqName/help/label, for the two-collectors-into-one-registry scenarios.orderedHelpTextCollector— single unchecked collector emitting multiple metrics with deterministic source order, for the "first wins" assertion (two separate collectors would race because the registry fans collection out across goroutines).Verification
go test ./prometheus/...— passes (existing + new).go test -race ./prometheus/...— passes.go vet ./prometheus/...— clean.gofmt— clean.Note:
prometheus/collectorsandTestGoCollector*continue to fail on this branch and onmainbecause Go 1.25 added/removed runtime metrics (htmlmetacontenturlescape,httpcookiemaxnum,urlmaxqueryparams) that aren't in theexpectedMetricsfixture. Pre-existing, unrelated to this fix.Trade-offs considered
Optsstruct / new constructor (NewLenientRegistry). Rejected as heavier API surface for a single boolean. A method on*Registrymatches the existingNewPedanticRegistry-style pattern: one entry point for the strict mode, one method to opt into the lenient mode.prometheuspackage does not have a logger in scope here, and threading one through purely for this case would be a much larger change. Documented as the explicit trade-off in the godoc.Gatherers.Gather/checkDescConsistency. Out of scope. The motivating use case is single-registry interop; cross-Gatherer merging is a separate user contract. Happy to file a follow-up if maintainers think the same reasoning applies there.Closes #1820.
/cc @bwplotka @kakkoyun @vesari