feat: scheduling-blocker + admission-failure detection (ResourceQuota / LimitRange / PodSecurity / webhook) #826

Closed
nadaverell wants to merge 17 commits into main from feat/scheduling-diagnostics
Conversation

@nadaverell
Contributor

Surfaces admission-time and scheduling-time pod-template rejections as a first-class failure class in issues / diagnose / dashboard, alongside a typed ResourceQuotas cache so quota-blocked pods are reachable by name.

This branch was originally split out of #775 to keep that PR reviewable; the scheduling half is the work that didn't come along for the ride. Picking it up now because the SREGym bench exposed namespace_memory_limit as the one scenario both arms (kubectl + radar) failed twice — neither tool has anywhere to look for "the apiserver is rejecting this Deployment's pod template," and the upcoming bench batch has five more admission-layer scenarios in the same shape (pvc_claim_mismatch, taint_no_toleration, service_port_conflict, persistent_volume_affinity_violation, resource_request_too_large).

What's in here

Engine (internal/k8s/scheduling.go, internal/issues)

  • DetectAdmissionProblems — reads controller FailedCreate events and classifies the message into exceeded quota, LimitRange, PodSecurity, or webhook. Bounded by a 30-minute "still happening" window (the controller re-emits continuously while stuck), one row per blocked owner-collapsed subject, latest-blocker semantics so a quota-cleared-then-webhook-rejected sequence shows the current cause not the first-seen one.
  • Scheduling source for the existing issues pipeline, severity-tagged, dedup'd against runtime-pod issues on the same owner so a workload_degraded row isn't surfaced alongside the admission-failure row that explains it.
  • Deliberately reactive: a quota that's merely saturated is not surfaced proactively — namespace capacity context is not a live failure. Only quota-rejected pod creates count.
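
For readers who have not opened scheduling.go, the message classification described above can be sketched roughly as follows. This is a minimal illustration, not the code on this branch: the function name classifyAdmissionMessage, the AdmissionReason constants, and the matched substrings are all assumptions based on typical apiserver rejection text.

```go
package main

import (
	"fmt"
	"strings"
)

// AdmissionReason is a hypothetical label for the four blocker classes.
type AdmissionReason string

const (
	ReasonQuotaExceeded AdmissionReason = "QuotaExceeded"
	ReasonLimitRange    AdmissionReason = "LimitRange"
	ReasonPodSecurity   AdmissionReason = "PodSecurity"
	ReasonWebhook       AdmissionReason = "AdmissionWebhook"
	ReasonUnknown       AdmissionReason = ""
)

// classifyAdmissionMessage maps a FailedCreate event message to a blocker
// class. The substrings are assumptions about common rejection wording.
func classifyAdmissionMessage(msg string) AdmissionReason {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "exceeded quota"):
		return ReasonQuotaExceeded
	case strings.Contains(m, "violates podsecurity"):
		return ReasonPodSecurity
	case strings.Contains(m, "admission webhook"):
		return ReasonWebhook
	case strings.Contains(m, "usage per container is"),
		strings.Contains(m, "usage per pod is"):
		// LimitRange rejections usually do not contain the word "LimitRange".
		return ReasonLimitRange
	default:
		return ReasonUnknown
	}
}

func main() {
	msgs := []string{
		`pods "api-7d9f-x" is forbidden: exceeded quota: mem-quota, requested: memory=2Gi, used: memory=3Gi, limited: memory=4Gi`,
		`pods "api-7d9f-y" is forbidden: maximum memory usage per Container is 1Gi, but limit is 2Gi`,
		`pods "api-7d9f-z" is forbidden: violates PodSecurity "restricted:latest": privileged`,
		`admission webhook "policy.example.com" denied the request: image not allowed`,
	}
	for _, m := range msgs {
		fmt.Printf("%q\n", classifyAdmissionMessage(m))
	}
}
```

An unmatched message returns the empty reason, so a caller can choose to drop the row or surface it as an unclassified FailedCreate.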

Typed cache (pkg/k8score)

  • ResourceQuotas lister + capability bit + RBAC, so consumers (engine, diagnose, REST) can read quotas by name without going through the dynamic client every time.
  • internal/k8s/fetch.go quota-fetch paths so the resource is addressable by get_resource etc.

MCP / REST / UI

  • feat(mcp): surface scheduling in issues / diagnose / dashboard — scheduling blockers show up in the MCP tools agents use, with the blocker's reason and the parsed quota/LimitRange/webhook detail in the diagnose narrative.
  • feat(ui): surface scheduling root cause in pod / namespace / topology — pods stuck Pending render the blocker reason in the renderer chrome; namespace view surfaces active quota saturation as context.
  • Dashboard REST endpoint includes scheduling rows.

Tests

  • scheduling_test.go covers Job/DaemonSet admission cross-check, quota boundary, post-bind event handling, dedup semantics (latest-blocker, owner-collapsed).
  • Issues engine integration tests around scheduling source.

State and what's needed before merge

  • Branch is 28 commits behind `main`. A merge or rebase is needed; the branch's prior pattern was Merge origin/main. I did not refresh it before pushing because that touches code I haven't reviewed and we wanted it visible in origin first.
  • Last two commits are fix(review): responses to earlier feedback (Bugbot + manual) — those review threads were on the pre-split state; current reviewers may want a fresh pass.
  • Risk profile is similar to #811 (Issues: classification engine + unified subject resolver + grouped triage UI): it changes the shape of issues / diagnose output for a new failure class, but it is reactive (event-window-bounded), so it can't reintroduce noise in clusters that don't have active FailedCreate events.

Why this is the right priority right now

radar-bench-sregym/radar-improvement-ideas.md (next to this checkout, under ~/sky/ws3/) flagged this as the highest impact-to-effort gap from the N=2 bench: the running radar build is runtime-pod-centric and has no admission-layer signal, which is exactly the failure class neither tool cracked in the bench. The work is already done; it just needed to come back out of the drawer.

nadaverell added 17 commits May 26, 2026 00:07
Decompose why a Pod can't run into structured signals:
- bind-time: PodScheduled=False → parse the scheduler verdict + resolve node
  affinity/selector misses against the node cache, naming the offending label
  (e.g. "no node has kubernetes.io/arch=arm64")
- admission: controller FailedCreate (quota/LimitRange/PodSecurity/webhook) +
  proactive ResourceQuota saturation — the layer with no Pod to inspect
- post-bind: ContainerCreating decoded into CNI IP-exhaustion + volume
  attach/mount, cross-checked against still-stuck pods

Add ResourceQuota to the typed informer cache (mirroring LimitRange) so the
proactive quota read + a browsable ResourceQuota view work. The generic
problem detector now defers unschedulable pods to the scheduling source so
they aren't double-reported as a bare "Pending".
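
To make "parse the scheduler verdict" concrete, here is a rough sketch of splitting a FailedScheduling message into per-reason node counts. The message shape ("N/M nodes are available: <count> <reason>, ... preemption: ...") is an assumption about the usual kube-scheduler wording, and parseSchedulerVerdict and verdictPart are hypothetical names, not the ones used on this branch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// verdictPart is one "<count> <reason>" clause of the scheduler message.
type verdictPart struct {
	Nodes  int
	Reason string
}

// parseSchedulerVerdict extracts the per-reason clauses from a scheduler
// message. It returns nil when the message does not have the assumed shape.
func parseSchedulerVerdict(msg string) []verdictPart {
	_, rest, ok := strings.Cut(msg, "nodes are available: ")
	if !ok {
		return nil
	}
	// Drop the trailing preemption clause when present.
	if head, _, found := strings.Cut(rest, ". preemption:"); found {
		rest = head
	}
	rest = strings.TrimSuffix(rest, ".")

	var parts []verdictPart
	for _, seg := range strings.Split(rest, ", ") {
		countStr, reason, ok := strings.Cut(seg, " ")
		if !ok {
			continue
		}
		n, err := strconv.Atoi(countStr)
		if err != nil {
			continue // clause does not start with a node count
		}
		parts = append(parts, verdictPart{Nodes: n, Reason: reason})
	}
	return parts
}

func main() {
	msg := "0/3 nodes are available: 1 node(s) had untolerated taint {dedicated: gpu}, " +
		"2 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod."
	for _, p := range parseSchedulerVerdict(msg) {
		fmt.Printf("%d node(s): %s\n", p.Nodes, p.Reason)
	}
}
```

Naming the offending label (for example a missing kubernetes.io/arch value) needs a join against the node cache and is not covered by this sketch.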
New SourceScheduling composes the three scheduling detectors through the
issues pipeline (default-on, high-signal operational state). /api/issues, the
MCP issues tool, and per-resource summaryContext now surface placement/
admission/post-bind failures, filterable via source=scheduling. ParseSources
accepts the new value; the Provider gains DetectScheduling.
- issues tool: source=scheduling documented and in the default set
- diagnose: a schedulability section scoped to the workload — its unschedulable
  pods, its ReplicaSet's FailedCreate, and any namespace ResourceQuota
  saturation (the one-shot answer for an admission/quota stall)
- get_dashboard: scheduling rows roll into the problem list; admission rows
  have no Pod, so the dashboard pod loop never surfaced them before
- PodRenderer: lead the banner with the decomposed scheduler verdict instead
  of a bare "Unschedulable" (untolerated taints, insufficient resources, and
  affinity/selector misses named). New PodProblem.detail keeps message exact
  so filter-chip matching is unaffected.
- NamespaceRenderer: a ResourceQuota usage section with per-resource
  saturation bars (amber >=90%, red >=100%) — quota pressure was shown nowhere
  despite being exactly why a namespace stops admitting pods. Fetched via a new
  useNamespaceQuotas hook over /api/resources/resourcequotas.
- topology tooltips: scheduling-aware guidance for the new reason keywords
  (Unschedulable, QuotaExceeded, IPExhaustion, VolumeMount/Attach, …).
/api/dashboard (the home ProblemsPanel source) is a separate builder from the
MCP get_dashboard one wired earlier — it only gathered DetectProblems +
DetectMissingRefs, so unschedulable pods and quota saturation never reached
the home view. Append the three scheduling detectors directly (bypassing the
Missing-ref Pod filter, since an Unschedulable row is the reason, not a dup).
Verified live: the panel now shows the arch-mismatch Unschedulable row (with
the offending label named) and the 99% QuotaNearLimit row.
- server: route /api/resources/resourcequotas through the typed informer in
  handleListResources + handleGetResource (it fell through to the dynamic
  cache, so the namespace quota UI could read [] on first open before sync).
- scheduling: restrict the quota-pressure check to pod-admission-relevant
  resources (cpu/memory/pods/ephemeral-storage/requests.*/limits.*/PVC) so an
  object-count quota (configmaps/services) no longer shows as "blocks new pods".
- scheduling: cross-check the involved workload's current readiness before
  emitting an admission FailedCreate row — a since-recovered workload no longer
  surfaces as critical off a lingering event.
- dashboard: skip unschedulable pods in the REST rollup (they're owned by the
  scheduling rows) so they don't double-surface; fix the stale comment.
- frontend: thread the namespace quota fetch error through — 403 hides the
  section, but 500/503 now shows a note instead of silently rendering quota-free.
- types: drop dead NodeFacts.Taints/Unschedulable + TaintFact (written, never
  read); document the SchedulingReason union invariant.
- mcp: add scheduling to the issues tool Description defaults + example.
- comments: correct the node-fit resolver doc (no taint cache-join); strip
  external bench scenario name.
- tests: cache-level integration tests for the quota ramp + S1 filter, bind-time
  node-fit naming, and the admission recovered-workload cross-check; ParseSources
  scheduling token; frontend summarizeSchedulerMessage.
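
As an illustration of the quota-resource restriction and the 90% / 100% saturation thresholds mentioned in these commits, a minimal sketch might look like this. The helper names podAdmissionQuotaResource and saturationTone are hypothetical, and the exact resource list (in particular treating "persistentvolumeclaims" as the PVC entry) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// podAdmissionQuotaResource reports whether a ResourceQuota entry can block
// pod admission, as opposed to an object-count quota such as configmaps.
func podAdmissionQuotaResource(name string) bool {
	switch name {
	case "cpu", "memory", "pods", "ephemeral-storage", "persistentvolumeclaims":
		return true
	}
	return strings.HasPrefix(name, "requests.") || strings.HasPrefix(name, "limits.")
}

// saturationTone returns "red" at >=100% usage, "amber" at >=90%, else "ok".
// used and hard must be in the same unit (for example bytes or millicores).
// Integer comparison avoids dividing by a zero hard limit; a hard limit of 0
// is treated as fully saturated, which is an assumption of this sketch.
func saturationTone(used, hard int64) string {
	switch {
	case used >= hard:
		return "red"
	case used*100 >= hard*90:
		return "amber"
	default:
		return "ok"
	}
}

func main() {
	for _, r := range []string{"limits.memory", "pods", "configmaps", "services"} {
		fmt.Printf("%-14s blocks pods: %v\n", r, podAdmissionQuotaResource(r))
	}
	fmt.Println(saturationTone(50, 100), saturationTone(90, 100), saturationTone(100, 100))
}
```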
…ostics

# Conflicts:
#	internal/issues/issues_test.go
#	internal/mcp/tools.go
…a tones

Replace inline raw red/orange Tailwind in the namespace quota section with the
shared severity-color constants, per the repo styling rule.
- detectAdmissionFailures: dedup FailedCreate rows by involved object. A
  quota-blocked controller emits one event per attempt, each with a different
  generated pod name (distinct cached events), so one workload produced many
  near-identical rows. Now one row per workload.
- admissionTargetStillBlocked: gate on created-count (Status.Replicas /
  CurrentNumberScheduled) below desired, not readiness. A workload whose pods
  were created but stay not-ready for another reason (e.g. unschedulable after
  a quota was raised) is no longer admission-blocked, so a stale FailedCreate
  no longer surfaces a critical QuotaExceeded row.
- admissionTargetStillBlocked (Job): a terminally-failed Job (Failed>0) no
  longer counts as blocked — only a Job that has created nothing (Active,
  Succeeded, Failed all 0) does.
- diagnose schedulingFindingsForWorkload: tighten the Deployment→ReplicaSet
  match to a single hyphen-free hash suffix (isReplicaSetOf), so diagnosing
  "api" no longer claims "api-gateway-<hash>".
- tests: dedup assertion, created-but-not-ready skip, isReplicaSetOf table.
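
The Deployment→ReplicaSet name match can be pictured with a sketch like the one below. It is not the branch's isReplicaSetOf; replicaSetBelongsTo is a hypothetical stand-in that applies the stated rule: the ReplicaSet name must be the Deployment name plus one hyphen and a suffix that itself contains no hyphen.

```go
package main

import (
	"fmt"
	"strings"
)

// replicaSetBelongsTo reports whether rsName looks like "<deployName>-<hash>"
// where <hash> is non-empty and contains no further hyphen.
func replicaSetBelongsTo(rsName, deployName string) bool {
	prefix := deployName + "-"
	if !strings.HasPrefix(rsName, prefix) {
		return false
	}
	suffix := rsName[len(prefix):]
	return suffix != "" && !strings.Contains(suffix, "-")
}

func main() {
	fmt.Println(replicaSetBelongsTo("api-7d9f8c6b5", "api"))                 // owned by "api"
	fmt.Println(replicaSetBelongsTo("api-gateway-7d9f8c6b5", "api"))         // belongs to "api-gateway"
	fmt.Println(replicaSetBelongsTo("api-gateway-7d9f8c6b5", "api-gateway")) // owned by "api-gateway"
}
```

A name-based match is only a heuristic; checking ownerReferences would be stricter where the owning Deployment object is available.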
…oundary

- Add TestDetectAdmissionProblems_JobAndDaemonSetCrossCheck: a Job that created
  no pod and a partially-scheduled DaemonSet surface QuotaExceeded; a terminally-
  failed Job (Failed>0) and a fully-scheduled DaemonSet are skipped — pins the
  net-new Job (Failed==0) and DaemonSet (CurrentNumberScheduled) cross-check
  branches that the ReplicaSet test didn't exercise.
- Add a below-threshold (50%) quota case to the saturation test so the >=90%
  warn boundary is pinned, not just the >=90%/100% arms.
- Reword the Job cross-check comment to state the true invariant (any of
  Active/Succeeded/Failed > 0 means a pod was created) instead of the
  inaccurate "terminally-failed (backoffLimit)" phrasing; note the mid-retry
  trade-off explicitly.
- Replace the opaque "S1 filter" test comment with the real mechanism name
  (isPodAdmissionQuotaResource).
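
The cross-check these tests pin can be summarized in a small sketch. The workload struct and stillAdmissionBlocked are hypothetical simplifications over plain fields rather than the real Kubernetes types, and the fallback for unknown kinds is an assumption.

```go
package main

import "fmt"

// workload is a simplified view of the status fields the cross-check reads.
type workload struct {
	Kind    string
	Desired int32 // spec.replicas, or DesiredNumberScheduled for a DaemonSet
	Created int32 // Status.Replicas, or CurrentNumberScheduled for a DaemonSet
	// Job-only pod counters.
	Active, Succeeded, Failed int32
}

// stillAdmissionBlocked reports whether a FailedCreate event should still be
// surfaced for this workload.
func stillAdmissionBlocked(w workload) bool {
	switch w.Kind {
	case "Job":
		// Any non-zero counter means at least one pod was created, so the
		// Job got past admission at some point.
		return w.Active == 0 && w.Succeeded == 0 && w.Failed == 0
	case "ReplicaSet", "DaemonSet":
		// Gate on created-count, not readiness: pods that exist but are not
		// ready are blocked by something other than admission.
		return w.Created < w.Desired
	default:
		// Assumption: keep the row for kinds this sketch does not model.
		return true
	}
}

func main() {
	cases := []workload{
		{Kind: "Job"},                                  // created nothing
		{Kind: "Job", Failed: 1},                       // terminally failed
		{Kind: "DaemonSet", Desired: 3, Created: 1},    // partially scheduled
		{Kind: "DaemonSet", Desired: 3, Created: 3},    // fully scheduled
		{Kind: "ReplicaSet", Desired: 2, Created: 2},   // created but maybe not ready
	}
	for _, c := range cases {
		fmt.Printf("%+v blocked=%v\n", c, stillAdmissionBlocked(c))
	}
}
```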
… first-seen

FailedCreate events are deduped per involved object, but informer List
order is arbitrary and the active blocker can change within the 30m
window (quota cleared, webhook now rejects). Keep the latest event by
LastTimestamp so the surfaced reason reflects the current cause, not
whichever the cache iterated first. Pin with a quota→webhook test.

Also fix stale source= comments/examples to include scheduling.
Two Bugbot findings:

- DetectPostBindProblems kept the first qualifying kubelet event per pod
  by informer order, so a stale blocker could win when the cause changed
  (NetworkNotReady → FailedMount). Keep the latest by LastTimestamp,
  mirroring detectAdmissionFailures.

- A pod stuck post-bind surfaced twice in the issues composer: a generic
  problem-source Pending row AND the richer scheduling-source row. Dedup
  in the composer so the scheduling row wins for the same Pod. A plain
  DetectProblems skip can't do this — the problem threshold is 5m but the
  post-bind event window is 10m, so a pod stuck >10m would lose its only
  row.
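
A sketch of the composer-side dedup, under the assumption that rows carry a source label and a Pod identity (the issue struct and dedupPodRows are hypothetical names). The point is that a generic Pending row is dropped only when a scheduling row for the same Pod actually exists, so a pod outside the post-bind window keeps its only row.

```go
package main

import "fmt"

// issue is a simplified composed row.
type issue struct {
	Source    string // "problem" or "scheduling"
	Kind      string
	Namespace string
	Name      string
	Reason    string
}

// dedupPodRows drops problem-source Pod rows that are already explained by a
// scheduling-source row for the same Pod. All other rows pass through in order.
func dedupPodRows(rows []issue) []issue {
	owned := make(map[string]bool)
	for _, r := range rows {
		if r.Source == "scheduling" && r.Kind == "Pod" {
			owned[r.Namespace+"/"+r.Name] = true
		}
	}
	out := make([]issue, 0, len(rows))
	for _, r := range rows {
		if r.Source == "problem" && r.Kind == "Pod" && owned[r.Namespace+"/"+r.Name] {
			continue // the richer scheduling row wins
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []issue{
		{"problem", "Pod", "shop", "api-1", "Pending"},
		{"scheduling", "Pod", "shop", "api-1", "VolumeMount"},
		{"problem", "Pod", "shop", "api-2", "Pending"}, // past the event window, keeps its row
	}
	for _, r := range dedupPodRows(rows) {
		fmt.Printf("%s %s/%s %s\n", r.Source, r.Namespace, r.Name, r.Reason)
	}
}
```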
…rface

The /api/dashboard builder is separate from the issues composer: its pod
health rollup flagged long-Pending pods as warnings (skipping only
unschedulable) while also appending post-bind scheduling rows, yielding
two rows for one stuck pod (bare Pending + the richer VolumeMount/CNI
row). Compute the post-bind-owned pod set up front and skip those in the
rollup the same way unschedulable pods are skipped; reuse the slice for
the scheduling append. Gap-free — only pods that actually get a post-bind
row are skipped, so a pod past the 10m post-bind window keeps its rollup
row.
…taxonomy filters

issues is now one curated operational stream — workload/pod problems,
dangling refs, pod-startup blockers, and False CRD conditions — severity
ranked. Detection provenance is no longer a user/agent filter axis:

- Drop the source= filter from /api/issues and the MCP issues tool. source
  survives only as an output label on each row + a CEL filter binding.
- Remove event + kyverno from issue composition entirely (and the
  include_events/include_kyverno flags). Raw events live in get_events /
  the timeline; policy posture lives in get_cluster_audit. Shrinks the
  issues.Provider interface (WarningEvents/KyvernoFindings/KyvernoStatus)
  and deletes the source-parse plumbing.
- SchedulingGated pods are no longer flagged Unschedulable (gate on
  reason==Unschedulable, matching the frontend).
- Remove proactive ResourceQuota saturation from the stream — a saturated
  quota is namespace capacity context, not a live failure; the reactive
  FailedCreate path and the Namespace quota UI still cover it. This also
  fixes diagnose over-attributing namespace quota to unrelated workloads.
- Rename diagnose's scheduling response field to startupBlockers (it spans
  bind-time, admission, and post-bind — not just scheduling).
- Drop the crippled include=logs from get_resource (use get_pod_logs /
  get_workload_logs / diagnose).
- Refresh docs/mcp.md tool table (was missing 6+ central tools) and fix the
  "non-destructive" wording — write tools are destructiveHint:true.

NOTE: /api/issues no longer reads source=/include_*; extra query params are
ignored (no 400). radar-hub-web should be checked for any fleet view that
relied on source=kyverno/include_kyverno.
…resh stale docs

Addresses review findings on the issues/scheduling work:

- clusterrole.yaml: add resourcequotas to the core read-only rule. The PR
  caches + probes ResourceQuota (capabilities.go) but in-cluster installs
  could not list it, silently hiding the Namespace quota section + the
  ResourceQuota API/UI for the users who need them.
- get_resource: include=logs was dropped but silently became a no-op; return
  a logsError pointing to get_pod_logs / get_workload_logs / diagnose so a
  client on a stale schema is redirected instead of seeing empty success.
- Refresh stale docs/comments the issues refactor missed: issues/types.go
  (package doc, Severity mapping, Source doc, Issue doc still described the
  removed event/kyverno sources + source-as-filter), summarycontext.go
  (referenced removed Filters fields), docs/mcp.md "non-destructive" line,
  docs/integrations.md (Kyverno-via-/api/issues no longer exists).
- Delete now-dead policy_reports_testhooks.go (its only consumer was the
  removed issues_handler_test.go).
- Tests: pin the CEL source binding (now the only source-slicing path) and
  startupBlockersForWorkload workload-scoping (its contract changed).

The origin/main merge (prior commit) absorbs #780, resolving the apparent
client.ts cache-seeding revert (merge skew - this branch never touched it).
…nknown include values; document resourcequotas RBAC

Follow-up review findings on the issues/MCP work:

- docs/integrations.md + issues MCP tool description wrongly routed Kyverno
  PolicyReport findings to the cluster audit (/api/audit + get_cluster_audit).
  Audit consumes only typed K8s + Crossplane and has zero PolicyReport input.
  Kyverno surfaces per-resource (PolicyReport detail view + resourceContext
  policy rollup); say that, and stop pointing agents at a tool that returns
  nothing for it.
- Issue struct doc: the snapshot-timestamp note was only true for problem/
  missing_ref/scheduling (LastSeen=compose time); condition rows set both
  timestamps to the condition's lastTransitionTime. Distinguish the two.
- get_resource: include=logs was guarded, but every OTHER unknown token
  (typos, or "relationships" which moved to resourceContext) still silently
  no-op'd. Surface unknown include values via includeError so a token that
  did nothing is reported, not swallowed.
- README.md + docs/in-cluster.md: document the resourcequotas (and
  LimitRanges) read grant the chart ClusterRole now carries, so the
  supported-resources lists match the deployed RBAC.
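
The include-token handling described in the get_resource item above could be sketched as follows. parseInclude and the example known set ("events", "metrics") are hypothetical; the source only states that logs, relationships, and typos are now reported instead of swallowed.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInclude splits a comma-separated include parameter into tokens the
// handler understands and tokens it does not. Empty tokens are ignored.
func parseInclude(raw string, known map[string]bool) (accepted, unknown []string) {
	for _, tok := range strings.Split(raw, ",") {
		tok = strings.TrimSpace(tok)
		if tok == "" {
			continue
		}
		if known[tok] {
			accepted = append(accepted, tok)
		} else {
			unknown = append(unknown, tok)
		}
	}
	return accepted, unknown
}

func main() {
	known := map[string]bool{"events": true, "metrics": true} // assumed token set
	accepted, unknown := parseInclude("events, logs,relationships", known)
	fmt.Println("accepted:", accepted)
	if len(unknown) > 0 {
		// A handler could put this text into an includeError-style field.
		fmt.Printf("unknown include value(s): %s\n", strings.Join(unknown, ", "))
	}
}
```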
@nadaverell
Contributor Author

Closing — investigation showed this work is already in main.

internal/k8s/scheduling.go on this branch is byte-identical to main's, and main's version has been there since fab0799 (PR #775, merged 2026-05-26). All the admission-detection symbols I claimed were "unmerged" (DetectAdmissionProblems, detectAdmissionFailures, classifyAdmissionFailure, admissionFailureWindow) are in main and verified present in our SREGym bench daemon binary (vcs.revision=5598d94, built 2026-05-27).

My earlier check used git merge-base --is-ancestor <branch-sha> HEAD to test "is this work merged" — that returned "no" because #775 squash-merged and the original commit SHAs didn't survive into main. The content was there the whole time. I should have compared file contents, not commit SHAs.

Cherry-picking this branch onto current main would roll back ~12 merged PRs (certificates inventory, metrics, audit raw, chart RBAC coverage tests, internal/mcp/tools.go slim-down, etc.) for no benefit, since the scheduling work is already there. The one branch-only file (packages/k8s-ui/src/components/resources/resources-search-sidebar-hint.test.ts) tests a hint UI that #633 explicitly dropped.

Real follow-up is on the SREGym bench side: namespace_memory_limit failed on both arms in two runs even though the admission detector is in the binary. Investigating directly whether the scenario produces FailedCreate … exceeded quota events the detector can classify, vs whether it's a surfacing-into-MCP-tools gap.

@nadaverell nadaverell closed this May 30, 2026
@nadaverell nadaverell deleted the feat/scheduling-diagnostics branch May 30, 2026 00:16