Issues: symptom classification + owner-grouping (engine) #803

Closed
nadaverell wants to merge 3 commits into main from feat/issues-category-classifier

Conversation

nadaverell (Contributor) commented May 27, 2026

Builds the Radar Issues engine: every operational issue is classified by symptom category, gets a stable identity, and is grouped under its owning workload — so /api/issues and the MCP issues tool emit a triage queue, not a per-object feed. Pure/deterministic, table-tested, MCP-first.

What changed (3 commits)

  1. Classify: a pure (Source, Kind, Reason) → category classifier (~25 categories → 11 groups, with unknown as a first-class category), wired into Compose. Every row carries category + category_group; both are server-emitted labels and CEL filter bindings. The taxonomy is grounded in radar's actual reason vocabulary.
  2. Identity — resolve each Pod problem's topmost stable controller (Pod→Deployment, not the intermediate ReplicaSet) at detection time; derive a grouping_scope and a deterministic cluster-local id = hash(scope, subject key, category). resourceKey reuses pkg/audit.ResourceKey so issues and audit deep-links share one key format.
  3. Group: GroupIssues folds the flat evidence rows into the public model, one row per id with affected counts + bounded member refs. Rows are grouped by default on /api/issues + MCP; the cap now counts issue groups, not replica fan-out.

Notes for review

  • ?view=flat on /api/issues returns the raw pre-fold rows for debugging ("what folded into this group?"). MCP stays grouped-only — agents use get_resource/get_events for raw state.
  • Compose() stays flat internally, so summarycontext's per-resource index is unchanged; a Filters.Grouped flag gates the fold.
  • Representative rules are deterministic: severity = max member, subject = topmost owner, reason/message/crash-context from the worst member, age = oldest onset, last_seen = newest, members sorted + capped at 10 with members_truncated.
  • Taxonomy gaps fall through to unknown deliberately (CronJob/Job/CAPI/PVC-Lost/Node-Cordoned, and categories whose detectors don't exist yet).
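
The view switch in the first two notes could be wired roughly as follows. This is a minimal sketch, not the PR's handler: `Filters.Grouped` and `?view=flat` come from the notes above, while `filtersFromQuery` and the choice to treat any other `view` value as grouped are assumptions made for illustration.

```go
package main

import "net/url"

// Filters carries only the Grouped flag named in the PR; the real struct
// presumably has more fields.
type Filters struct {
	Grouped bool
}

// filtersFromQuery is a hypothetical helper: grouped is the default, and
// only an explicit view=flat opts out of the fold. Unrecognised values
// fall back to grouped (an assumption, not confirmed by the PR).
func filtersFromQuery(q url.Values) Filters {
	return Filters{Grouped: q.Get("view") != "flat"}
}
```

Keeping the flag on the filter, rather than forking Compose(), matches the note that Compose() stays flat internally and the fold is applied afterwards.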

Pairs with skyhook-dev/radar-hub#52 (forwards the new fields through the fleet pivot). SPA grouped IssuesView is a follow-up.

🤖 Generated with Claude Code

Adds a pure, deterministic classifier (Category, with a fixed Category→Group
rollup) over the signal radar already emits — Source + Kind + Reason +
crash context — and wires it into Compose so every /api/issues and MCP
`issues` row carries `category` + `category_group`. Both are server-emitted
labels (the UI renders the rollup without its own category→group map) and
both are exposed as CEL filter bindings.
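
A minimal sketch of a classifier of this shape, assuming a plain lookup table. The `Category` and `Group` types and the `categoryGroup` map are named in this PR; the specific category names, group names, `Source` value, and rule entries below are hypothetical, and the crash-context input is left out for brevity.

```go
package main

// Category is a symptom label; Group is its fixed rollup.
type Category string
type Group string

// Hypothetical category and group names, for illustration only.
const (
	CategoryImagePull Category = "image_pull"
	CategoryCrashLoop Category = "crash_loop"
	CategoryUnknown   Category = "unknown"

	GroupImage   Group = "image"
	GroupRuntime Group = "runtime"
	GroupUnknown Group = "unknown"
)

// Signal is the classifier input (crash context omitted in this sketch).
type Signal struct {
	Source, Kind, Reason string
}

// rules is a tiny, assumed slice of the reason vocabulary.
var rules = map[Signal]Category{
	{Source: "problem", Kind: "Pod", Reason: "ImagePullBackOff"}: CategoryImagePull,
	{Source: "problem", Kind: "Pod", Reason: "ErrImagePull"}:     CategoryImagePull,
	{Source: "problem", Kind: "Pod", Reason: "CrashLoopBackOff"}: CategoryCrashLoop,
}

// categoryGroup is the fixed Category→Group rollup.
var categoryGroup = map[Category]Group{
	CategoryImagePull: GroupImage,
	CategoryCrashLoop: GroupRuntime,
	CategoryUnknown:   GroupUnknown,
}

// Classify is pure: the same signal always yields the same category, and
// anything without a rule falls through to unknown instead of a guess.
func Classify(s Signal) Category {
	if c, ok := rules[s]; ok {
		return c
	}
	return CategoryUnknown
}

// GroupOf returns the rollup for a category, defaulting to unknown.
func GroupOf(c Category) Group {
	if g, ok := categoryGroup[c]; ok {
		return g
	}
	return GroupUnknown
}
```

Because both functions are table lookups with no I/O or clock access, they lend themselves to the table tests the PR describes.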

`unknown` is first-class: categories whose detectors don't exist yet, plus
CronJob / Job / CAPI / PVC-Lost / Node-Cordoned, fall through to it rather
than being force-fit into a neat bucket.

Every issue now carries three additive identity fields:

- Owner: the topmost stable controller of a Pod problem (Pod→Deployment,
  not the intermediate ReplicaSet), resolved at detection time via the
  existing topOwnerForPod and carried on k8s.Problem alongside the
  RestartCount/LastTerminatedReason crash context.
- GroupingScope: workload|service|pvc|ingress|node|unknown — the subject's
  coarse bucket (drives the future UI section, part of the ID).
- ID: deterministic cluster-local hash(scope, subject key, category),
  identical for every member row that rolls up to the same subject+category.
  The hub namespaces it by cluster_id for global uniqueness.

Subject = the topmost owner when one was resolved (member pods key on their
workload), else the resource itself. resourceKey reuses
pkg/audit.ResourceKey so issue grouping and audit deep-links share one key
format rather than drifting.
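
The identity derivation above can be sketched as follows. Everything here is an illustration under stated assumptions: `resourceKey` stands in for pkg/audit.ResourceKey with an assumed `Kind/Namespace/Name` format, the Kind-to-scope mapping in `scopeFor` is a guess at the six buckets, and SHA-256 truncated to 16 hex characters is one possible hash, not necessarily the one the PR uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// Ref identifies a Kubernetes object (hypothetical type for this sketch).
type Ref struct{ Kind, Namespace, Name string }

// resourceKey is a stand-in for pkg/audit.ResourceKey; the format is assumed.
func resourceKey(r Ref) string {
	return strings.Join([]string{r.Kind, r.Namespace, r.Name}, "/")
}

// scopeFor maps a subject Kind to a grouping scope (assumed mapping).
func scopeFor(kind string) string {
	switch kind {
	case "Deployment", "StatefulSet", "DaemonSet", "Pod":
		return "workload"
	case "Service":
		return "service"
	case "PersistentVolumeClaim":
		return "pvc"
	case "Ingress":
		return "ingress"
	case "Node":
		return "node"
	default:
		return "unknown"
	}
}

// subjectOf returns the topmost owner when one was resolved, else the
// resource itself.
func subjectOf(resource Ref, owner *Ref) Ref {
	if owner != nil {
		return *owner
	}
	return resource
}

// issueID hashes (scope, subject key, category). Fields are joined with a
// NUL byte so adjacent values cannot run together and collide.
func issueID(subject Ref, category string) string {
	parts := []string{scopeFor(subject.Kind), resourceKey(subject), category}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:8])
}
```

Two pods owned by the same Deployment resolve to the same subject, so they produce the same ID for a given category; a pod with no resolved owner keys on itself.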

Purely additive — rows are not yet collapsed; the shared ID is the handle
the collapse fold keys on (next slice). No consumer contract changes.

GroupIssues collapses the flat evidence rows into the public operational
model — one row per shared id (subject+category). A Deployment whose 3 pods
all ImagePullBackOff is one issue with affected:{pods:3} + bounded member
refs, not three rows.

- /api/issues + MCP issues return grouped rows by default; the cap now
  counts issue groups, not replica fan-out.
- /api/issues?view=flat returns the raw pre-fold evidence rows for
  debugging ("what folded into this group?"). MCP stays grouped-only —
  agents use get_resource/get_events for raw state.
- Compose() stays flat internally, so summarycontext's per-resource index
  is unchanged; Filters.Grouped gates the fold.
- Representative rules (deterministic): severity = max member, category =
  shared, subject = topmost owner, reason/message/crash-context from the
  worst member, age = oldest onset, last_seen = newest, members sorted +
  capped at 10 with members_truncated past that.

Table-tested — grouping bugs are trust bugs; every consumer inherits them.
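
A minimal sketch of a fold that follows the representative rules listed above. It is not the PR's GroupIssues: the `Row` and `Issue` types, the integer severity (higher is worse), the flat `Affected` count, and the `foldRows` and `at` names are all assumptions for illustration. Rows are sorted by (ID, Member) first so the output does not depend on input order, and severity ties resolve to the lexicographically smallest member.

```go
package main

import (
	"sort"
	"time"
)

// Row is one flat evidence row (hypothetical shape).
type Row struct {
	ID       string
	Severity int // higher is worse (assumed ordering)
	Reason   string
	Member   string // member resource key
	Onset    time.Time
	LastSeen time.Time
}

// Issue is one grouped row (hypothetical shape).
type Issue struct {
	ID               string
	Severity         int
	Reason           string
	Affected         int
	Members          []string
	MembersTruncated bool
	Onset            time.Time
	LastSeen         time.Time
}

const maxMembers = 10

// at builds a timestamp from Unix seconds; a convenience for examples.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// foldRows collapses rows that share an ID into one Issue each.
func foldRows(rows []Row) []Issue {
	sorted := append([]Row(nil), rows...) // leave the caller's slice untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].ID != sorted[j].ID {
			return sorted[i].ID < sorted[j].ID
		}
		return sorted[i].Member < sorted[j].Member
	})
	var out []Issue
	for _, r := range sorted {
		if n := len(out); n == 0 || out[n-1].ID != r.ID {
			out = append(out, Issue{ID: r.ID, Severity: r.Severity,
				Reason: r.Reason, Onset: r.Onset, LastSeen: r.LastSeen})
		}
		g := &out[len(out)-1]
		if r.Severity > g.Severity { // reason follows the worst member
			g.Severity, g.Reason = r.Severity, r.Reason
		}
		if r.Onset.Before(g.Onset) { // age = oldest onset
			g.Onset = r.Onset
		}
		if r.LastSeen.After(g.LastSeen) { // last_seen = newest
			g.LastSeen = r.LastSeen
		}
		g.Affected++
		if len(g.Members) < maxMembers {
			g.Members = append(g.Members, r.Member)
		} else {
			g.MembersTruncated = true
		}
	}
	return out
}
```

With three rows sharing one ID, `foldRows` returns a single `Issue` with `Affected` of 3; a thirteenth member would still be counted in `Affected` but omitted from `Members`, with `MembersTruncated` set.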
@nadaverell nadaverell requested a review from hisco as a code owner May 27, 2026 22:40

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 4b11cd6.

GroupScaling Group = "scaling"
GroupSecurity Group = "security"
GroupControlPlane Group = "control_plane"
GroupApplication Group = "application"

Unused GroupApplication constant is dead code

Low Severity

GroupApplication is declared as a Group constant but no category in the categoryGroup map maps to it, and it's not referenced anywhere else in the codebase. Unlike the forward-declared categories (e.g. CategoryDNSFailure) which at least have entries in categoryGroup, this group constant has zero consumers — making it truly dead code rather than a planned placeholder.
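
One way to catch this class of problem in a table test is sketched below. It assumes the declared groups are listed by hand (Go cannot enumerate constants), and `unusedGroups` is a hypothetical helper, not something in the PR.

```go
package main

import "sort"

type Category string
type Group string

// unusedGroups returns every declared group that no category maps to,
// sorted so the result is stable across runs.
func unusedGroups(declared []Group, categoryGroup map[Category]Group) []Group {
	used := make(map[Group]bool, len(categoryGroup))
	for _, g := range categoryGroup {
		used[g] = true
	}
	var unused []Group
	for _, g := range declared {
		if !used[g] {
			unused = append(unused, g)
		}
	}
	sort.Slice(unused, func(i, j int) bool { return unused[i] < unused[j] })
	return unused
}
```

A test asserting that `unusedGroups` returns an empty slice would flag a group such as GroupApplication until a category is mapped to it or the constant is removed.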


@nadaverell
Contributor Author

Subsumed into #811. The classification engine (the three `issues:` commits) is now the bottom of that stack, which also adds the unified pkg/subject resolver, the GA-blockers, and the grouped triage UI. Consolidating to one PR per review request. Reopen if we want the engine reviewed separately.

@nadaverell nadaverell closed this May 29, 2026