Skip to content

Issues: classification engine + unified subject resolver + grouped triage UI#811

Open
nadaverell wants to merge 49 commits into
mainfrom
feat/issues-ui
Open

Issues: classification engine + unified subject resolver + grouped triage UI#811
nadaverell wants to merge 49 commits into
mainfrom
feat/issues-ui

Conversation

@nadaverell
Copy link
Copy Markdown
Contributor

@nadaverell nadaverell commented May 28, 2026

End-to-end Issues overhaul in one PR: detection engine, identity layer, grouped triage UI, and the foundation hardening that came out of review.

Engine (internal/issues, internal/k8s)

  • Classify each Issue by symptom category + group; compose across the four detection sources.
  • Owner-resolved, deterministic grouping identity (stable issue_id keyed on subject + category).
  • Fold flat rows into the grouped issue model.

Unified subject resolver (pkg/subject) — new

  • One resolver for identity and app-grouping: Tier-1 Subject (owner-collapsed controller, derived from ownerRefs, zero setup) + Tier-2 AppOverlay (8-tier declared-key overlay — Flux / Argo / Helm / app.kubernetes.io/* — with provenance, confidence, and retained conflicts[]).
  • internal/issues/identity.go consumes it; StableID is byte-identical to the prior hash (no re-key).
  • The package doc is honest about scope: it's the shared identity vocabulary consumed by Issues today; pkg/topology consumption is staged for feat: Applications backend — /api/applications + PackageRow app-overlay #823, not yet wired (AppOverlay ships as tested-but-unconsumed scaffold for that work).

GA-blockers

  • Monotonic crashloop classification — key on RestartCount + LastTerminationState so oscillation no longer churns issue_id.
  • Double-row suppression — severity-gated, so a critical parent is never downgraded behind a warning child.
  • Critical-into-unknown gaps closed — PVC Lost, Job / CronJob failures (and CAPI + bad-image/container-create, below).
  • CRD-condition noise floor — transient-reason aware.

UI (packages/k8s-ui IssuesView)

  • Grouped, single-open-accordion triage queue; detector reason surfaced on the collapsed row.
  • Stable ordering keyed on first_seen (onset) on both server and UI — last_seen churns to compose-time every poll, so sorting on it reshuffles rows under auto-refresh. Stable row identity keeps the expanded card from dropping on refetch.
  • Onset age column (chronic-vs-acute signal) + "Started X ago · last seen Y ago" in the expanded body.
  • group / namespace optional on the Issue type to match the backend omitempty wire.

Hosts

  • Radar OSS mounts the shared IssuesView at a per-cluster /issues route (thin IssuesPane host over a useIssues/api/issues hook), mirroring the hub's fleet ProblemsPage. Reachable but not yet linked in the nav — deliberate, so the surface lands before it's promoted.

Detection precision/recall — validated against real clusters

An empirical audit (GKE / EKS / kind) drove a precision/recall pass, each fix re-verified live:

  • GitOps reconciler health — new DetectGitOpsProblems surfaces ArgoCD Application (health Degraded/Missing, sync OutOfSync when automated, ComparisonError) + Flux Kustomization/HelmRelease (Ready=False, non-transient) — a class the generic CRD-condition fallback structurally can't read. gitops-demo went 0 → 7 GitOps issues. Flux source CRDs (GitRepository et al.) and Argo control-plane CRDs classify as operator_condition_failed, not force-fit into gitops_sync_failed.
  • Fewer false positives — recovered-after-crash pods no longer flagged; completed-Job pods skipped by missing-ref detection; scaled-to-0-backed Services labeled as such (including mid 1→0 scale-down); cadence-aware CronJob staleness.
  • Severity calibration — inert dangling refs (singleton-StatefulSet headless Service, deprecated GKE PSP RoleBindings) dropped from blanket critical.
  • New high_restart category for genuine thrash that isn't a classic CrashLoopBackOff.

Detection/classification gaps closed (second review pass)

  • CAPI now triageable — Cluster / KubeadmControlPlane → control_plane_not_ready; Machine / MachineDeployment / MachineHealthCheck → machine_not_ready (gated on cluster.x-k8s.io). Detection already existed; only the last classification hop was missing, so a control-plane outage was un-triageable.
  • Bad-image-tag / container-create failures detectedInvalidImageName / RunContainerError / CreateContainerError / ImageInspectError (shared isFatalWaitingReason; classified in lockstep). These were silently healthy before — InvalidImageName never self-resolves.
  • StatefulSet compares ready vs desired (spec), not just created (status.replicas), so an ordinal-0-wedged rollout surfaces.
  • Count is the subject-excluded fan-out size (matches the UI "Affected resources (N)" header + TS), not len(members).

Contract hardening (review)

  • Grouped filtering on the public shape — Compose folds to grouped before applying severity/kind + CEL, so kind=Deployment matches a pod-evidenced Deployment issue and count > N sees the member total.
  • RBAC-safe CRD namespace fanout — cluster-scoped CRDs use the gated cluster-wide list; explicit namespace sets list per-namespace; namespace-scoped informers are no longer silently dropped.
  • info excluded from the live queue — severity normalized to critical|warning (info stays honest at the Problem layer but isn't "what's broken now").
  • Owner grouping consistent — Argo automated.enabled:false treated as manual; cache-aware pod owner resolution (no phantom Deployments, in Issues and /top); pod missing-ref issues fold under their workload.
  • CEL — added first_seen, grouping_scope, restart_count, last_terminated_reason bindings (first_seen is the queryable onset axis; last_seen is near-useless for "older than…").

Foundation hardening (review-driven)

  • Detector output type ProblemDetection; the detector layer (problems.go) split into detection.go / capi.go / gitops.go with explicit layer framing.
  • Identity/status ownership moved off pkg/audit / pkg/packages onto neutral leaf packages: pkg/resourceid (ResourceKey + builtin Kind→Group) and pkg/conditions (transient-reason vocabulary + one shared FindFalseCondition). Future Applications/Packages surfaces depend on the leaves, not on audit/packages.
  • issues.go god-file split by concern (compose / source_conditions / normalize / dedupe / filters); GroupCategoryGroup; single shared SeverityRank (killed 3 clones); navigateToResourceList deduped in App.tsx.

Cross-repo dependency

Merge alongside radar-hub #57 — mirrors the new CEL bindings into the hub's filter env. Additive (no removed/renamed bindings), but without it the hub's pre-validation would reject fleet-issue filters that reference the new fields.


Deferred follow-ups (agreed in review):

  • Cross-subject incident correlation / blast-radius ranking (one ConfigMap → N workloads as one incident; PVC↔pod cross-layer dedup).
  • Provider → per-source interfaces (R2) — defer until the source set stabilizes.
  • Affected fixed struct → kind-keyed map (R10) — API debt; bundle with an IssueGroup/IssueEvidence DTO split.
  • Frontend queue-primitive extraction shared with Checks (B2/B6/B8).

Note

Medium Risk
Large refactor of issue identity, grouping order, and API/CEL fields can change which rows appear, how they merge, and break fleet filters until hub #57 lands; detection/dedupe logic affects triage severity and completeness.

Overview
This PR reshapes the Issues pipeline from flat detector rows into a grouped triage model: each row gets a derived category / category_group, a stable id keyed on subject + cause, optional members/affected fan-out, and compose runs dedupe → group → filters/CEL so public queries match the workload subject (e.g. kind=Deployment, count > N on grouped totals).

Detection layer: k8s.Problem becomes Detection, with logic split across dedicated detectors (including new DetectGitOpsProblems). Emitters add owner resolution, fingerprints for multi-cause rows, and many precision fixes (info severity dropped at compose, scaled-to-zero services, WaitForFirstConsumer PVCs, RWO multi-replica conflicts, Job/CronJob/CAPI/GitOps coverage, etc.).

Classification: New Classify / GroupOf maps source/kind/reason (and crash context) into the symptom taxonomy; dedupeWorkloadDegradedOverChild suppresses redundant parent rollups when a specific child exists, with a severity gate so critical parents are not hidden behind warning children.

CRD fallback: Generic .status.conditions handling moves to source_conditions, skips curated GitOps/CAPI kinds, applies a transient/suspend/observedGeneration noise floor, improves namespace fan-out via ListDynamicAllNamespaces, and special-cases Argo Rollout failures.

CEL / API contract: Issue filters gain category, category_group, first_seen, grouping_scope, restart_count, last_terminated_reason; cluster is removed from bindings (single-cluster Radar). Sorting prefers first_seen over churning last_seen. Code is modularized (compose, grouping, identitypkg/subject, shared pkg/conditions).

Reviewed by Cursor Bugbot for commit e3a659a. Bugbot is set up for automated code reviews on this repo. Configure here.

@nadaverell nadaverell requested a review from hisco as a code owner May 28, 2026 12:49
Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated
Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated
Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx Outdated
@nadaverell nadaverell force-pushed the feat/issues-ui branch 2 times, most recently from 5333aad to 0f6b860 Compare May 28, 2026 17:25
@nadaverell
Copy link
Copy Markdown
Contributor Author

Follow-up 3512f39: the detector reason/message now rides the collapsed row (was expand-only), filling the dead band between the title and the severity badge so the highest-value triage signal reads without a click. Full text + crash context stay in the expanded body. Also makes Issue.group/namespace optional to match the omitempty wire.

Consumer: the hub fleet Issues page lands in radar-hub-web #85.

Comment thread packages/k8s-ui/src/components/issues/types.ts Outdated
@nadaverell nadaverell changed the title k8s-ui: shared IssuesView — grouped live-issue triage queue Issues: classification engine + unified subject resolver + grouped triage UI May 29, 2026
Comment thread packages/k8s-ui/src/components/issues/severity.ts
Comment thread pkg/packages/gitops.go Outdated
Comment thread pkg/packages/gitops.go
Comment thread internal/issues/grouping.go
Comment thread packages/k8s-ui/src/components/issues/types.ts
Comment thread packages/k8s-ui/src/components/issues/IssuesView.tsx
Comment thread internal/issues/grouping.go
Comment thread internal/issues/category.go
Comment thread internal/k8s/gitops.go Outdated
Comment thread internal/k8s/gitops.go Outdated
Adds a pure, deterministic classifier (Category, with a fixed Category→Group
rollup) over the signal radar already emits — Source + Kind + Reason +
crash context — and wires it into Compose so every /api/issues and MCP
`issues` row carries `category` + `category_group`. Both are server-emitted
labels (the UI renders the rollup without its own category→group map) and
both are exposed as CEL filter bindings.

`unknown` is first-class: categories whose detectors don't exist yet, plus
CronJob / Job / CAPI / PVC-Lost / Node-Cordoned, fall through to it rather
than being force-fit into a neat bucket.
Every issue now carries three additive identity fields:

- Owner: the topmost stable controller of a Pod problem (Pod→Deployment,
  not the intermediate ReplicaSet), resolved at detection time via the
  existing topOwnerForPod and carried on k8s.Problem alongside the
  RestartCount/LastTerminatedReason crash context.
- GroupingScope: workload|service|pvc|ingress|node|unknown — the subject's
  coarse bucket (drives the future UI section, part of the ID).
- ID: deterministic cluster-local hash(scope, subject key, category),
  identical for every member row that rolls up to the same subject+category.
  The hub namespaces it by cluster_id for global uniqueness.

Subject = the topmost owner when one was resolved (member pods key on their
workload), else the resource itself. resourceKey reuses
pkg/audit.ResourceKey so issue grouping and audit deep-links share one key
format rather than drifting.

Purely additive — rows are not yet collapsed; the shared ID is the handle
the collapse fold keys on (next slice). No consumer contract changes.
GroupIssues collapses the flat evidence rows into the public operational
model — one row per shared id (subject+category). A Deployment whose 3 pods
all ImagePullBackOff is one issue with affected:{pods:3} + bounded member
refs, not three rows.

- /api/issues + MCP issues return grouped rows by default; the cap now
  counts issue groups, not replica fan-out.
- /api/issues?view=flat returns the raw pre-fold evidence rows for
  debugging ("what folded into this group?"). MCP stays grouped-only —
  agents use get_resource/get_events for raw state.
- Compose() stays flat internally, so summarycontext's per-resource index
  is unchanged; Filters.Grouped gates the fold.
- Representative rules (deterministic): severity = max member, category =
  shared, subject = topmost owner, reason/message/crash-context from the
  worst member, age = oldest onset, last_seen = newest, members sorted +
  capped at 10 with members_truncated past that.

Table-tested — grouping bugs are trust bugs; every consumer inherits them.
The presentation sibling to the Checks queue (ChecksView): one row per
grouped issue (subject + category), severity rail + pill, single-open
accordion expanding to the diagnosis (reason/message + pod crash context)
plus the subject and affected-member deep-links. Consumed by BOTH OSS
single-cluster and the hub fleet view (the host wires resourceHref /
onResourceClick / clusterLabel) so the two surfaces can't diverge.

Reuses the established shared atoms (ClusterName, EmptyState) and the EXACT
Checks severity hues (critical=red, warning=amber = Checks medium) so the
two queues read as one product. Identity (IssueResourceRef + resourceKey)
matches the Checks contract + audit.ResourceKey; named IssueResourceRef to
avoid colliding with the core single-cluster ResourceRef (same reason
Checks uses CheckResourceRef). Faceting stays the host page's job
(FleetPageShell), so there are no in-component filter chips.

Types mirror radar's grouped Issue (internal/issues.GroupIssues).
…es identity

New pkg/subject is the one resolver the platform plan calls the #1 prerequisite:
- Tier-1 Subject = owner-collapsed root controller (Pod->RS->Deployment,
  Pod->Job->CronJob), with explicit bare/Node/operator-CR anchors. Deterministic,
  label-free. Walks via injected OwnerResolver/OperatorRootHook so the package
  imports neither internal/* nor pkg/topology.
- Tier-2 AppOverlay = 8-tier declared-key precedence (Flux/Argo/Helm tiers 1-5
  consolidated from managedby.go in its native order argo-instance<Helm; labels
  6-8 net-new) with provenance + confidence + retained conflicts[]; nil when raw
  wins (bare-app opt-in).
- IssueID/CheckID + ScopeForKind move here. IssueID is byte-identical -> no re-key.
- internal/issues/identity.go:enrichIdentity migrated to consume it (Scope aliased).

Topology determineGroupKey migration (consume Subject for identity + AppOverlay
for grouping) is the next step. Verified: go build/vet/test green for pkg/subject
+ internal/issues.
…aps, CRD noise

#1 detector monotonicity (internal/k8s/health.go): classify crashloop from
   RestartCount + LastTerminationState, stable across Waiting->Running->Waiting
   so the category-hashed issue_id stops churning.
#2 suppress parent workload_degraded/rollout_stalled when an equal-or-worse child
   symptom exists on the same subject — severity-gated so a critical rollup is
   never downgraded to a warning child.
#3 PVC Lost + Job/CronJob failures classified (were discarded to unknown).
#5 CRD-condition noise floor: transient-aware via shared
   packages.IsTransientConditionReason (one source of truth with the GitOps path).
+ table tests. Verified: go build/vet/test green (internal/issues, internal/k8s, pkg/packages).
Comment thread internal/issues/compose.go
Comment thread internal/k8s/detection.go
Comment thread internal/k8s/detection.go
- detectArgoAppProblems checks health==Degraded (critical) BEFORE the error-
  condition branch (high), so a Degraded app that also carries a Comparison/
  SyncError stays critical instead of being downgraded to warning. Test pins
  the degraded-and-error case.
- argoErrorCondition allowlists ComparisonError/InvalidSpecError/SyncError by
  exact type instead of matching any type containing 'Error' (matches its doc).
- Reconcile the Count contract docs (types.go, compose.go, types.ts) with the
  shipped subject-excluded fan-out semantics: 0/omitted for single-resource.
numberUnavailable is defined by the DaemonSet controller as
desiredNumberScheduled - numberAvailable, so the gap > numberUnavailable
guard is a no-op against any conformant controller and only diverges on
stale/inconsistent status — a version-dependent false-positive path with
no real coverage gain. Genuinely-Pending DaemonSet pods already surface via
the scheduling source. Back to the plain numberUnavailable > 0 signal.
internal/k8s is a grab-bag (cache, client, discovery, history, metrics...).
Prefix the detector layer so it clusters in a directory listing and announces
its role: detect.go (core + Detection type), detect_capi.go, detect_gitops.go,
detect_missing_refs.go, detect_scheduling.go (+ matching tests). Pure git mv,
no logic change. health.go is left as-is — it mixes health classification,
several detectors, and generic FormatAge/Truncate/Parse* utilities, so a deeper
split (not a rename) is the right move there.
health.go mixed three concerns. Split (same package, no caller changes):
- health.go keeps pod/node health CLASSIFICATION (ClassifyPodHealth,
  ClassifyNodeHealth, PodProblemReason, restart/crashloop helpers).
- detect_node.go: DetectNodeProblems + DetectVersionSkew (joins the detect_ family).
- detect_workload.go: DetectHPAProblems + DetectCronJobProblems.
- format.go: FormatAge/Truncate/ParseCPUToMillis/ParseMemoryToBytes — generic
  display/quantity helpers used in 8 files, which never belonged in health.go.
Test files split to match.
stats.FilterErrorSample = firstErr.Error()
}
}
out = filtered
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CEL error log denominator corrupted by in-place filtering

Low Severity

The filtered := out[:0] alias shares the backing array with out, so as append(filtered, i) writes matched elements, it overwrites elements of the original out slice in place. The range out iteration still visits the correct number of elements (the range expression is evaluated once), but at line 171 the len(out) in the log message still reflects the original length — which is correct for this log. However, the real issue is that errCount is compared against len(out), which still holds the pre-filter length, but after the loop out = filtered is assigned, meaning if a subsequent reader logged len(out) it would get the filtered count. This is benign for now but the in-place filter pattern is fragile — if any future code between the loop and the out = filtered assignment reads out, it sees partially-overwritten data.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b94d0d8. Configure here.

Comment thread internal/issues/issues_test.go
return true
}
return false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused function isCuratedCRDGroup after refactor to isCuratedCRDKind

Low Severity

isCuratedCRDGroup is only called from within isCuratedCRDKind as a helper. While this is technically fine, the old detectGenericCRDIssues called isCuratedCRDGroup directly (matching all kinds in CAPI groups). The new isCuratedCRDKind correctly extends this with per-kind gating for GitOps groups, but isCuratedCRDGroup is now a private implementation detail that could be inlined — minor dead-weight after the refactor.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b94d0d8. Configure here.

Comment thread internal/issues/dedupe.go
Comment thread internal/issues/category.go
…al replica counts, Argo Rollout cause)

From a live-cluster audit of /api/issues — the page had identity+severity+reason
inline but a few categories forced a click for the one fact that determines the fix:

- Pod detections now carry the kubelet waiting/terminated Message
  (PodProblemMessage). image_pull_failed shows the image ref + 'NotFound'/
  'unauthorized'; container_waiting shows 'secret "x" not found' — previously
  empty, forcing a click. Consistent with missing_config_ref, which already did this.
- workload_degraded reports available/DESIRED (spec.replicas), not
  available/created — a Deployment wanting 1 with 0 pods showed a meaningless
  '0/0 available'; now '0/1'.
- Argo Rollout: the generic condition reader surfaced Healthy=False/RolloutHealthy
  (reads as 'healthy', warn) and buried the cause. argoRolloutFailure now prefers
  the definitive failure — InvalidSpec, then ProgressDeadlineExceeded — as a
  critical with the precise spec error. Conservative: only overrides on
  unambiguous failures, so a mid-progress rollout is untouched.

Validated against nonprod; new test pins the Rollout condition priority.
Comment thread internal/issues/grouping.go Outdated
…eep raw

The containerd/CRI image-pull error is one long self-repeating string (it
re-quotes the image ref at every wrapped layer), which made the issue row's
headline an unscannable wall of text. issueMessageParts now derives a clean
headline for image-pull issues — 'Image not found: <ref>' / 'Not authorized
to pull image' / 'Registry unreachable' / 'Registry rate-limited', image ref
shown once — and keeps the raw CRI string as a secondary detail line under
What's wrong. Falls back to the raw message verbatim for any shape it doesn't
recognize, so nothing is ever lost or hidden.

Gated on image-pull (category/reason) so a generic 'not found' in another
category's message (e.g. missing_config_ref's 'secret "x" not found') is
never mislabeled. The wire/API/MCP message stays raw — only display changes.
Comment thread internal/k8s/detect.go
durDur = now.Sub(cond.LastTransitionTime.Time)
}
problems = append(problems, Problem{
problems = append(problems, Detection{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deployment emits both unavailable and rollout-stuck detections simultaneously

Medium Severity

When a Deployment has both UnavailableReplicas > 0 AND a ProgressDeadlineExceeded condition, two Detection rows are emitted for the same Deployment — one classified as workload_degraded and one as rollout_stalled. Since both are parentRollupCategories, dedupeWorkloadDegradedOverChild can suppress each independently if child symptoms exist, but when no child symptom exists the user sees two separate issues for one Deployment. The unavailableReplicas > 0 block doesn't continue after appending, so the rollout-stuck check always runs. This existed in the old code too but the new dedup and grouping machinery makes it more visible — the two rows get distinct IDs (different categories) and appear as separate grouped issues in the UI.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit dcc0191. Configure here.

if r, m, found := argoRolloutFailure(u); found {
issReason, issMsg, severity = r, m, SeverityCritical
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argo Rollout argoRolloutFailure overrides reason but not condition type

Medium Severity

When argoRolloutFailure returns a definitive failure like ProgressDeadlineExceeded, the issReason is set to just the raw reason string (e.g. "ProgressDeadlineExceeded"), bypassing condTypeReason which would produce "Progressing: ProgressDeadlineExceeded". This means the Rollout issue's Reason field format is inconsistent with all other condition-sourced issues. More importantly, classifyIssue feeds i.Reason into Classify, where the SourceCondition + argoproj.io + Kind=Rollout path returns CategoryRolloutStalled — but the test on line 41 of category_test.go expects Reason: "Ready: ProgressDeadlineExceeded" (with the condition-type prefix), not bare "ProgressDeadlineExceeded". The classification still works because the Rollout case doesn't switch on Reason, but the inconsistent reason format could confuse downstream consumers and CEL filters matching on reason.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit dcc0191. Configure here.

Comment thread packages/k8s-ui/src/components/issues/types.ts Fixed
…c cycles, rename heuristic resolver

pkg/subject is the canonical identity primitive #823 will build on, so its
contract must be exact before that wiring lands. Review-driven:

- OwnerResolver doc now states the contract explicitly: Kubernetes CONTROLLER
  ownership only (ownerReferences[].controller chain), NOT declarative/app
  management. Removes the trap that pointed the topology adapter at
  walkTopmostOwner — which follows every EdgeManages edge, and EdgeManages is
  overloaded to include GitOps/Helm management (Argo App→resource,
  GitRepo→Kustomization). Wrapping it would collapse a Deployment's Subject up
  to its Application, erasing the Tier-1/Tier-2 boundary. #823 must resolve from
  controllerReferences (k8s.topOwnerForPodResolved is the reference impl).
  TestResolveSubject_StopsAtController pins it: Pod→RS→Deployment yields
  Deployment even when an Application manages it.
- Ownership cycles now resolve to a deterministic, start-independent
  representative (min refKey) instead of the last-hop-before-revisit (which gave
  a→b but b→a). A canonical identity can't depend on traversal start.
  TestResolveSubject_CycleIsDeterministic pins it.
- PodOwnerResolver → HeuristicPodOwnerResolver: it fabricates a Deployment from
  any ReplicaSet by hash-stripping (wrong for Rollouts/custom controllers). It
  has zero production callers (test-only); the name now carries the caveat so it
  can't be mistaken for canonical. Dropped the doc's false claim that issues
  passes it (issues uses only ScopeForKind + StableID).
…umented

Round 2 of the canonical-primitive review — close the paths back to
'management edge as identity':

- HeuristicPodOwnerResolver no longer falls back to refs[0] when there's no
  controller ownerRef. The contract says controller-only; the fallback
  collapsed a pod under an arbitrary NON-controller owner. Now a pod with only
  non-controller refs resolves to itself (bare). New test pins both arms.
- Resolve doc corrected: obj feeds ONLY the Tier-2 overlay; ownership ALWAYS
  comes from the injected resolver. Resolve(ref, obj, nil, …) yields a bare
  Subject even if obj has ownerRefs — the previous 'obj supplies BOTH' framing
  was misleading. Spelled out that a pod owner-walk needs an injected resolver.
- Scrubbed EdgeManages / walkTopmostOwner from Tier-1 language in the Subject
  doc and OwnerLookup (operatorroots.go) — those comments are the #823
  integration guide, and 'derivable from the EdgeManages chain' reintroduced
  the exact ambiguity the contract fix removes. Now: controller ownerReferences
  / controllerRef-derived edges only.
Cross-cluster audit (gke-management) surfaced a false-positive class: PVCs bound
to a WaitForFirstConsumer StorageClass sit in Pending BY DESIGN until a consuming
pod is scheduled — dormant/scaled-to-zero/orphaned volumes stay there forever and
aren't a fault. The detector flagged every Pending PVC >5min regardless of
binding mode, lighting up benign awaiting-consumer volumes.

pvcAwaitsFirstConsumer resolves the PVC's StorageClass (explicit or cluster
default) and suppresses the row when binding mode is WaitForFirstConsumer — a
genuinely-stuck consumer still surfaces as an unschedulable pod via the
scheduling source. Immediate-binding Pending (real provisioning failure) and
missing-StorageClass (separate detector, critical) are unaffected; verified the
kind-bench missing-SC PVC still flags. Also added a message on the remaining
Pending rows (was bare 'Pending').
…tectors, chronic onset, total rep order

Review round on the identity/grouping spine:

- ID discriminator gains a stable cause Fingerprint so distinct causes on one
  subject+category no longer collapse into a single row (a workload missing both
  a ConfigMap AND a Secret was one missing_config_ref showing only one target).
  The missing-ref detector sets it from the target-bearing message (stable,
  deterministic — NOT the flapping reason, which would re-key on refresh);
  unknown keys on source+reason; every single-cause category stays category-only
  and byte-identical (no re-key). Test pins split + fold + no-re-key.
- Curated CAPI/GitOps detectors route the all-scope path through ListWatched
  instead of List(gvr,"") — the latter is cluster-wide-only and silently drops
  namespace-scoped contents in namespace-restricted installs (the generic
  fallback already did this). Verified no regression on a cluster-wide install.
- FirstSeen no longer resets to now for detectors without a duration: HPA/CronJob
  now stamp AgeSeconds (resource age), and fromProblem falls back to AgeSeconds
  when DurationSeconds is 0 — chronic issues stop sorting as fresh.
- betterRepresentative is now a true total comparator (group/kind/ns/name/source/
  reason/message), so the donated representative is deterministic regardless of
  input order.

(The 5th finding — HeuristicPodOwnerResolver refs[0] fallback — was already fixed
in 2b16a5e.)
Comment thread pkg/subject/subject.go
// check-id). The exact byte layout is load-bearing — any change re-keys every
// existing record — so it must not be altered.
func StableID(scope Scope, groupingKey, discriminator string) string {
sum := sha256.Sum256([]byte(string(scope) + "\x00" + groupingKey + "\x00" + discriminator))
…ive detail

- PodSecurityViolation gets its own category (pod_security_violation, security
  group) instead of the misleading admission_webhook_blocking — PSA is built-in
  admission, not a webhook. cert-manager: only Kind=Certificate maps to
  certificate_not_ready; Issuer/ClusterIssuer/Order/Challenge → operator_condition
  (a not-ready Issuer isn't a certificate problem). Fixed before these harden
  into IDs/filters.
- GitOps failures no longer under-ranked vs the detail view: Argo
  ComparisonError/SyncError/InvalidSpecError and auto-sync HealthMissing, and
  Flux Ready=False (genuine reason), are now critical (were warning). OutOfSync
  drift stays warning (self-heals).
- Decisive detail: Argo HealthDegraded/HealthMissing now carry the app's real
  status.health.message; CAPI Cluster/Machine Failed phase carry
  status.failureMessage/failureReason (capiFailureDetail). Tests updated.
…inks

- Extract issues.ListResponse + NewListResponse; /api/issues (HTTP) and the MCP
  issues tool both build from it instead of hand-rolling identical maps, so the
  contract can't drift (and the hub mirrors one shape). MCP keeps its narrowHint;
  both keep visibility. Wire output unchanged.
- Local Issues view surfaces truncation: useIssues now carries total/
  total_matched, and IssuesPane shows 'Showing N of M (capped)' when the queue
  was windowed — local Radar no longer presents a capped list as complete.
- Issue/Audit resource clicks encode the opened resource in the URL
  (?resource=ns/name, the resources view's own deep-link shape) so refresh/share
  keeps the drawer open instead of dropping it.
…nt contract

The issues tool exposes grouped logical subjects, but the agent's follow-up
(get_resource/list_resources/search/diagnose) contradicted it. Three fixes so
an AI SRE agent gets the same picture on drill-down:

- BuildIssueIndex now composes GROUPED issues and counts each against its
  subject AND every affected member, so get_resource Deployment surfaces the
  Deployment-grouped crashloop (was issueCount=0 — keyed only by the evidence
  Pod). issueCount is now consistent with the issues tool.
- diagnose returns relatedIssues — the grouped issues whose subject or member
  is the diagnosed object — so the agent sees 'crashloop + missing ConfigMap +
  HPA can't-scale' up front instead of re-deriving from raw logs. (No issue-id
  input, per scope.)
- The issues CEL filter schema now documents all runtime bindings (category,
  category_group, grouping_scope, restart_count, last_terminated_reason,
  first_seen) and leads with first_seen — agents were missing strong filters
  like category_group=='startup' or restart_count>10.
Comment thread internal/k8s/detect.go
// detections from per-kind detectors (HPA/CronJob) that don't track how long
// the problem has persisted — without it, the issues layer would derive
// FirstSeen=now on every compose and a chronic issue would sort as fresh.
func resourceAge[T metav1.Object](now time.Time, items []T, namespace, name string) time.Duration {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comments for two functions accidentally merged together

Low Severity

The doc comment for pvcAwaitsFirstConsumer (lines 552–556) is concatenated directly with the doc comment for resourceAge (lines 557–561) into a single comment block. Go attaches the entire merged block as godoc for resourceAge, leaving pvcAwaitsFirstConsumer at line 571 undocumented. Looks like a rebase/split artifact where the two functions were reordered but the comments weren't separated with a blank line or function boundary.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f003355. Configure here.

Comment thread internal/k8s/detect.go
desired := d.Status.Replicas
if d.Spec.Replicas != nil && *d.Spec.Replicas > desired {
desired = *d.Spec.Replicas
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Desired replicas fallback ignores nil spec.Replicas default

Low Severity

The comment says the fix prevents a misleading "0/0 available" when a Deployment wants 1 replica but has 0 pods created. However, desired falls back to d.Status.Replicas and only overrides from Spec.Replicas when it's non-nil. When Spec.Replicas is nil (Kubernetes default = 1), the fix doesn't apply — desired stays at Status.Replicas (possibly 0), producing the exact misleading message the comment describes. The same pattern applies to the StatefulSet block.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f003355. Configure here.

…ity honesty, ref optionality, golden tests

From the MCP-focused review (do-now set; pushed back on the overzealous items):

- CAPI MachineDeployment/MachineHealthCheck (cluster.x-k8s.io) and
  KubeadmControlPlane (controlplane.cluster.x-k8s.io) now use group-qualified
  discovery instead of kind-only — can't attach to a same-kind CRD from another
  group.
- HPA gets a cause fingerprint ('hpa:<problem>') so one HPA that's BOTH maxed
  and unable-to-scale surfaces as two distinct issues (distinct fixes) rather
  than collapsing. (Targeted — not blanket fingerprinting every detector.)
- IssuesPane surfaces incomplete RBAC visibility as a caveat banner, so an empty
  queue under degraded visibility reads 'limited visibility' not 'nothing
  broken' (useIssues now carries visibility).
- IssueResourceRef group/namespace are optional to match the Go omitempty wire
  (a Node/core-group member arrives without them); callers default to ''.
- Copy: 'grouped by the resource they affect' (scopes span service/PVC/node/…,
  not just workloads).
- Golden contract tests: a Pod-evidenced issue surfaces on its owning Deployment
  in the issue index (get_resource) and in RelatedIssues (diagnose) + on the Pod.

Held for a Hub-coordinated pass: cluster CEL binding (#5/#9). Pushed back:
cold-cache owner churn (transient/rare), affected-truncation (not reachable in
OSS), source-vocab (opaque pass-through; only stale test fixtures).
…#1 + #13)

Conceded from review: get_resource/diagnose built issueSummary via a SEPARATE
flat-by-exact-resource path (computeMCPIssueSummary / computeIssueSummaryForResource)
that BuildIssueIndex never touched — so get_resource Deployment/web returned an
empty summary while the issues tool showed it broken. And both BuildIssueIndex
and RelatedIssues iterated the grouped issue's inline Members (capped at
maxInlineMembers=10), dropping pods #11+ of a large fan-out.

Both now resolve a resource's grouped issues from FLAT evidence (uncapped) +
the resolved owner, keyed/deduped per grouped-issue ID:
- RelatedIssues matches grouped subjects AND every flat evidence row.
- BuildIssueIndex counts distinct grouped issues per resource (each evidence
  resource + its owner).
- computeMCPIssueSummary + computeIssueSummaryForResource route through
  RelatedIssues, so get_resource on a workload (or any affected pod, capped or
  not) surfaces the same grouped issues the issues tool shows.

Tests pin owner-rollup AND the uncapped (>10 members) case.
- #2 (Service): a Service that is BOTH no-ready-endpoints AND has an unresolved
  named targetPort now stays two issues (distinct fixes — the workload vs the
  Service port spec) via stable fingerprints, instead of collapsing under one
  service_no_endpoints row.
- #5 (cluster): removed the 'cluster' CEL binding + the always-empty Issue.Cluster
  field + its activation projection. A single Radar is one cluster, so
  Issue.Cluster was always empty and a forwarded 'cluster == x' matched nothing —
  the advertised filter returned the wrong answer. Cross-cluster scoping is the
  hub's clusters=/target mechanism (applied at fan-out), not a per-issue
  predicate. MCP/filter docs updated to say so.
// Final tiebreak: two grouped rows can share a subject (kind/ns/name)
// and differ only by category. Keeps the order total + deterministic.
return a.Category < b.Category
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort comment says category-before-kind but code disagrees

Medium Severity

The lessIssue doc comment states the tiebreak order is "category / kind / namespace / name" but the actual implementation sorts by kind → namespace → name → category. The comment also says this "Matches the shared UI comparator (k8s-ui issues/types.ts:compareIssues)." If the UI comparator was authored to match the documented order rather than the code, server and client will disagree on row ordering, causing the expanded accordion card to jump on every auto-refresh — exactly the instability the PR aims to eliminate.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit d4152cd. Configure here.

… sort

- #7: issues(namespace="prod") for a namespace the caller can't access now
  returns 403 forbidden (MCP + REST), not an empty list. 'unauthorized' must
  not read as 'nothing broken' — a real trust gap for an SRE agent. The
  no-explicit-namespace path still returns empty (the caller asked for nothing
  specific).
- #9: lessIssue tiebreak is now (namespace, name, id) — byte-identical to the
  shared UI comparator's single-cluster order (the UI's only extra key, cluster,
  is constant for one cluster). The parity claim in the comment is now true
  instead of aspirational.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 9 total unresolved issues (including 7 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ee4fffc. Configure here.

if kind == "Rollout" && strings.Contains(strings.ToLower(gvr.Group), "argoproj.io") {
if r, m, found := argoRolloutFailure(u); found {
issReason, issMsg, severity = r, m, SeverityCritical
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argo Rollout condition reason not formatted consistently

Medium Severity

When argoRolloutFailure fires for an Argo Rollout, issReason is set to the raw condition reason (e.g. "ProgressDeadlineExceeded") instead of the condType + ": " + reason format used by condTypeReason. This means the Rollout's Reason field changes format depending on whether the failure override fires, and classification in Classify sees a bare reason like "ProgressDeadlineExceeded" rather than the "Ready: ProgressDeadlineExceeded" form, which could affect downstream matching or display consistency.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ee4fffc. Configure here.

Comment thread internal/k8s/detect.go
// render a misleading "0/0 available". spec.replicas is the goal.
desired := d.Status.Replicas
if d.Spec.Replicas != nil && *d.Spec.Replicas > desired {
desired = *d.Spec.Replicas
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deployment desired replicas uses wrong fallback value

Medium Severity

The desired variable for Deployment is initialized to d.Status.Replicas (created pods) and only overridden when spec.replicas > status.replicas. But spec.replicas is the authoritative desired count — it can also be lower than status.replicas during a scale-down, and status.replicas includes terminating pods. The intent (per the comment) is to report against the goal, so desired should always prefer spec.replicas when non-nil, not only when it's greater.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ee4fffc. Configure here.

…O conflicts, order scheduler clauses by blast radius

- isStableCrashLoop trusts a probe-gated Ready as a recovery signal, so a
  container that crashed at startup and is now serving clears immediately
  instead of reading crashloop-critical for the full 5m Running window.
  Gated on a readiness probe being defined (without one, Ready just mirrors
  Running and flips during a loop's between-crash blip).
- New volume_access_mode_conflict detector: a Deployment wanting >1 replica
  that mounts a ReadWriteOnce PVC is flagged with the fix (RWX / StatefulSet
  volumeClaimTemplates / 1 replica) — the config-level root cause, named from
  spec, distinct from the observed multi-attach symptom.
- summarizeReasons orders scheduler clauses by nodes-rejected descending, so
  the widest-blast-radius constraint leads instead of the scheduler's
  arbitrary predicate order.
…ollout deadlocks

- New pdb_blocks_evictions detector: a PodDisruptionBudget that allows 0
  voluntary disruptions while all selected pods are healthy (maxUnavailable=0
  or minAvailable>=replicas) silently blocks node drains and cluster upgrades.
  Keyed on status.DisruptionsAllowed==0 with structural guards (observed
  generation current, healthy>=desired) so transient zero-budget during a
  real outage isn't flagged.
- Enrich the existing 'Rollout stuck' row: when a stuck Deployment mounts a
  ReadWriteOnce PVC and isn't strategy: Recreate, append the root cause + fix.
  The surge pod can't attach the volume the old pod holds — the classic
  rollout deadlock, named on the row that's already firing (no new noise).
…s, fix merged doc, complete category labels

- normalizeImagePullMessage: drop the redundant 'lookup .* no such' branch
  (already covered by 'no such host') — removes the unbounded .* CodeQL flagged.
- Remove unused resourceKey/resourceRefKey from the issues module (Checks owns
  its own copy; no issues consumer).
- Restore pvcAwaitsFirstConsumer's doc comment (had merged into resourceAge's).
- Add curated labels for pod_security_violation, control_plane_not_ready,
  machine_not_ready.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants