feat: scheduling-blocker + admission-failure detection (ResourceQuota / LimitRange / PodSecurity / webhook) #826
nadaverell wants to merge 17 commits into
Decompose why a Pod can't run into structured signals:
- bind-time: PodScheduled=False → parse the scheduler verdict + resolve node affinity/selector misses against the node cache, naming the offending label (e.g. "no node has kubernetes.io/arch=arm64")
- admission: controller FailedCreate (quota/LimitRange/PodSecurity/webhook) + proactive ResourceQuota saturation; this is the layer with no Pod to inspect
- post-bind: ContainerCreating decoded into CNI IP-exhaustion + volume attach/mount, cross-checked against still-stuck pods

Add ResourceQuota to the typed informer cache (mirroring LimitRange) so the proactive quota read + a browsable ResourceQuota view work. The generic problem detector now defers unschedulable pods to the scheduling source so they aren't double-reported as a bare "Pending".
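The admission bullet above depends on turning a controller's FailedCreate message into one of four blocker classes. A minimal Go sketch of that step follows. The function name `classifyFailedCreate`, the reason constants other than `QuotaExceeded`, and the matched substrings are all assumptions for illustration (the substrings follow typical apiserver wording); this is not the code in `internal/k8s/scheduling.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// AdmissionReason is a hypothetical label for the admission blocker class.
type AdmissionReason string

const (
	ReasonQuota       AdmissionReason = "QuotaExceeded"
	ReasonLimitRange  AdmissionReason = "LimitRange"
	ReasonPodSecurity AdmissionReason = "PodSecurity"
	ReasonWebhook     AdmissionReason = "Webhook"
	ReasonUnknown     AdmissionReason = ""
)

// classifyFailedCreate maps a FailedCreate event message to a blocker class.
// The substrings below are assumptions, not a verified list of server messages.
func classifyFailedCreate(msg string) AdmissionReason {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "exceeded quota"):
		return ReasonQuota
	case strings.Contains(m, "violates podsecurity"):
		return ReasonPodSecurity
	case strings.Contains(m, "admission webhook"):
		return ReasonWebhook
	case strings.Contains(m, "limitrange"),
		strings.Contains(m, "usage per container"),
		strings.Contains(m, "usage per pod"):
		return ReasonLimitRange
	default:
		return ReasonUnknown
	}
}

func main() {
	msg := `Error creating: pods "web-5d7f9-x2k4q" is forbidden: exceeded quota: mem-quota`
	fmt.Println(classifyFailedCreate(msg))
}
```

Messages that match none of the patterns fall through to an empty reason, so a caller can decide whether to drop them or surface them unclassified.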
New SourceScheduling composes the three scheduling detectors through the issues pipeline (default-on, high-signal operational state). /api/issues, the MCP issues tool, and per-resource summaryContext now surface placement/admission/post-bind failures, filterable via source=scheduling. ParseSources accepts the new value; the Provider gains DetectScheduling.
- issues tool: source=scheduling documented and in the default set
- diagnose: a schedulability section scoped to the workload: its unschedulable pods, its ReplicaSet's FailedCreate, and any namespace ResourceQuota saturation (the one-shot answer for an admission/quota stall)
- get_dashboard: scheduling rows roll into the problem list; admission rows have no Pod, so the dashboard pod loop never surfaced them before
- PodRenderer: lead the banner with the decomposed scheduler verdict instead of a bare "Unschedulable" (untolerated taints, insufficient resources, and affinity/selector misses named). New PodProblem.detail keeps message exact so filter-chip matching is unaffected.
- NamespaceRenderer: a ResourceQuota usage section with per-resource saturation bars (amber >=90%, red >=100%); quota pressure was shown nowhere despite being exactly why a namespace stops admitting pods. Fetched via a new useNamespaceQuotas hook over /api/resources/resourcequotas.
- topology tooltips: scheduling-aware guidance for the new reason keywords (Unschedulable, QuotaExceeded, IPExhaustion, VolumeMount/Attach, …).
/api/dashboard (the home ProblemsPanel source) is a separate builder from the MCP get_dashboard one wired earlier — it only gathered DetectProblems + DetectMissingRefs, so unschedulable pods and quota saturation never reached the home view. Append the three scheduling detectors directly (bypassing the Missing-ref Pod filter, since an Unschedulable row is the reason, not a dup). Verified live: the panel now shows the arch-mismatch Unschedulable row (with the offending label named) and the 99% QuotaNearLimit row.
- server: route /api/resources/resourcequotas through the typed informer in handleListResources + handleGetResource (it fell through to the dynamic cache, so the namespace quota UI could read [] on first open before sync).
- scheduling: restrict the quota-pressure check to pod-admission-relevant resources (cpu/memory/pods/ephemeral-storage/requests.*/limits.*/PVC) so an object-count quota (configmaps/services) no longer shows as "blocks new pods".
- scheduling: cross-check the involved workload's current readiness before emitting an admission FailedCreate row, so a since-recovered workload no longer surfaces as critical off a lingering event.
- dashboard: skip unschedulable pods in the REST rollup (they're owned by the scheduling rows) so they don't double-surface; fix the stale comment.
- frontend: thread the namespace quota fetch error through: 403 hides the section, but 500/503 now shows a note instead of silently rendering quota-free.
- types: drop dead NodeFacts.Taints/Unschedulable + TaintFact (written, never read); document the SchedulingReason union invariant.
- mcp: add scheduling to the issues tool Description defaults + example.
- comments: correct the node-fit resolver doc (no taint cache-join); strip external bench scenario name.
- tests: cache-level integration tests for the quota ramp + S1 filter, bind-time node-fit naming, and the admission recovered-workload cross-check; ParseSources scheduling token; frontend summarizeSchedulerMessage.
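The quota-pressure restriction in the list above amounts to a predicate over ResourceQuota resource names. A minimal sketch is shown here, reusing the name `isPodAdmissionQuotaResource` that a later commit in this PR gives to the mechanism. The body is an assumption, including the reading of "PVC" as the `persistentvolumeclaims` quota key.

```go
package main

import (
	"fmt"
	"strings"
)

// isPodAdmissionQuotaResource reports whether exhausting the named quota
// resource can plausibly block pod admission. Sketch only: the exact set
// used by the PR is not shown in the commit message.
func isPodAdmissionQuotaResource(name string) bool {
	switch name {
	case "cpu", "memory", "pods", "ephemeral-storage", "persistentvolumeclaims":
		return true
	}
	// requests.* and limits.* cover keys such as requests.cpu or limits.memory.
	return strings.HasPrefix(name, "requests.") || strings.HasPrefix(name, "limits.")
}

func main() {
	for _, r := range []string{"requests.memory", "pods", "configmaps", "services"} {
		fmt.Printf("%s -> %v\n", r, isPodAdmissionQuotaResource(r))
	}
}
```

With this filter an object-count quota such as `configmaps` or `services` is ignored, which matches the stated goal of not reporting it as blocking new pods.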
…ostics

# Conflicts:
# internal/issues/issues_test.go
# internal/mcp/tools.go
…a tones

Replace inline raw red/orange Tailwind in the namespace quota section with the shared severity-color constants, per the repo styling rule.
- detectAdmissionFailures: dedup FailedCreate rows by involved object. A quota-blocked controller emits one event per attempt, each with a different generated pod name (distinct cached events), so one workload produced many near-identical rows. Now one row per workload.
- admissionTargetStillBlocked: gate on created-count (Status.Replicas / CurrentNumberScheduled) below desired, not readiness. A workload whose pods were created but stay not-ready for another reason (e.g. unschedulable after a quota was raised) is no longer admission-blocked, so a stale FailedCreate no longer surfaces a critical QuotaExceeded row.
- admissionTargetStillBlocked (Job): a terminally-failed Job (Failed>0) no longer counts as blocked; only a Job that has created nothing (Active, Succeeded, Failed all 0) does.
- diagnose schedulingFindingsForWorkload: tighten the Deployment→ReplicaSet match to a single hyphen-free hash suffix (isReplicaSetOf), so diagnosing "api" no longer claims "api-gateway-<hash>".
- tests: dedup assertion, created-but-not-ready skip, isReplicaSetOf table.
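The Deployment→ReplicaSet match described above can be sketched as a prefix check followed by a test that the remainder is one hyphen-free segment. The name `isReplicaSetOf` comes from the commit message; the signature and body are assumptions, and the hash values in the example are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// isReplicaSetOf reports whether replicaSet looks like "<deployment>-<hash>",
// where <hash> is a single non-empty segment containing no hyphen.
func isReplicaSetOf(deployment, replicaSet string) bool {
	prefix := deployment + "-"
	if deployment == "" || !strings.HasPrefix(replicaSet, prefix) {
		return false
	}
	suffix := replicaSet[len(prefix):]
	return suffix != "" && !strings.Contains(suffix, "-")
}

func main() {
	fmt.Println(isReplicaSetOf("api", "api-7d9f8c6b5"))         // true
	fmt.Println(isReplicaSetOf("api", "api-gateway-7d9f8c6b5")) // false
}
```

A plain `HasPrefix("api-")` check would accept both names, which is the over-claim the commit fixes.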
…oundary

- Add TestDetectAdmissionProblems_JobAndDaemonSetCrossCheck: a Job that created no pod and a partially-scheduled DaemonSet surface QuotaExceeded; a terminally-failed Job (Failed>0) and a fully-scheduled DaemonSet are skipped. This pins the net-new Job (Failed==0) and DaemonSet (CurrentNumberScheduled) cross-check branches that the ReplicaSet test didn't exercise.
- Add a below-threshold (50%) quota case to the saturation test so the >=90% warn boundary is pinned, not just the >=90%/100% arms.
- Reword the Job cross-check comment to state the true invariant (any of Active/Succeeded/Failed > 0 means a pod was created) instead of the inaccurate "terminally-failed (backoffLimit)" phrasing; note the mid-retry trade-off explicitly.
- Replace the opaque "S1 filter" test comment with the real mechanism name (isPodAdmissionQuotaResource).
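The cross-check branches pinned by these tests can be summarised in a small sketch. The `workload` struct and `stillAdmissionBlocked` are hypothetical stand-ins for the typed Kubernetes objects and for `admissionTargetStillBlocked`. The field comments reflect the status fields named in the commit messages; treating unknown kinds as not blocked is an assumption.

```go
package main

import "fmt"

// workload is a simplified, hypothetical view of a controller's status.
type workload struct {
	Kind string
	// ReplicaSet: desired replicas vs Status.Replicas.
	// DaemonSet: desired scheduled count vs CurrentNumberScheduled.
	Desired, Created int32
	// Job only: pod counters from the Job status.
	Active, Succeeded, Failed int32
}

// stillAdmissionBlocked reports whether a FailedCreate event for w should
// still be treated as a live admission blocker.
func stillAdmissionBlocked(w workload) bool {
	switch w.Kind {
	case "Job":
		// Any non-zero counter means at least one pod was created.
		return w.Active == 0 && w.Succeeded == 0 && w.Failed == 0
	case "ReplicaSet", "DaemonSet":
		// Gate on created-count below desired, not on readiness.
		return w.Created < w.Desired
	default:
		return false // assumption: unknown kinds are not reported
	}
}

func main() {
	fmt.Println(stillAdmissionBlocked(workload{Kind: "Job"}))                               // true
	fmt.Println(stillAdmissionBlocked(workload{Kind: "Job", Failed: 1}))                    // false
	fmt.Println(stillAdmissionBlocked(workload{Kind: "DaemonSet", Desired: 3, Created: 2})) // true
}
```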
… first-seen

FailedCreate events are deduped per involved object, but informer List order is arbitrary and the active blocker can change within the 30m window (quota cleared, webhook now rejects). Keep the latest event by LastTimestamp so the surfaced reason reflects the current cause, not whichever the cache iterated first. Pin with a quota→webhook test. Also fix stale source= comments/examples to include scheduling.
Two Bugbot findings:
- DetectPostBindProblems kept the first qualifying kubelet event per pod by informer order, so a stale blocker could win when the cause changed (NetworkNotReady → FailedMount). Keep the latest by LastTimestamp, mirroring detectAdmissionFailures.
- A pod stuck post-bind surfaced twice in the issues composer: a generic problem-source Pending row AND the richer scheduling-source row. Dedup in the composer so the scheduling row wins for the same Pod. A plain DetectProblems skip can't do this: the problem threshold is 5m but the post-bind event window is 10m, so a pod stuck >10m would lose its only row.
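The composer-side dedup in the second finding can be sketched as a two-pass filter: first collect the Pods that have a scheduling-source row, then drop only the problem-source rows for those Pods. The `issue` struct and `dedupPodRows` are hypothetical; the source labels `problem` and `scheduling` follow the names used elsewhere in this PR.

```go
package main

import "fmt"

// issue is a hypothetical, minimal row in the issues stream.
type issue struct {
	Source, Kind, Namespace, Name, Reason string
}

// dedupPodRows drops a problem-source Pod row when a scheduling-source row
// exists for the same Pod, so the richer row wins. Pods without a
// scheduling row keep their problem row.
func dedupPodRows(rows []issue) []issue {
	owned := make(map[string]bool)
	for _, r := range rows {
		if r.Source == "scheduling" && r.Kind == "Pod" {
			owned[r.Namespace+"/"+r.Name] = true
		}
	}
	out := make([]issue, 0, len(rows))
	for _, r := range rows {
		if r.Source == "problem" && r.Kind == "Pod" && owned[r.Namespace+"/"+r.Name] {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []issue{
		{"problem", "Pod", "shop", "web-1", "Pending"},
		{"scheduling", "Pod", "shop", "web-1", "VolumeMount"},
		{"problem", "Pod", "shop", "web-2", "Pending"},
	}
	for _, r := range dedupPodRows(rows) {
		fmt.Println(r.Source, r.Name, r.Reason)
	}
}
```

Deduplicating on the composed rows rather than skipping inside the problem detector keeps a Pod's generic row whenever no scheduling row was actually produced for it, which avoids the gap the commit message describes.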
…rface

The /api/dashboard builder is separate from the issues composer: its pod health rollup flagged long-Pending pods as warnings (skipping only unschedulable) while also appending post-bind scheduling rows, yielding two rows for one stuck pod (bare Pending + the richer VolumeMount/CNI row). Compute the post-bind-owned pod set up front and skip those in the rollup the same way unschedulable pods are skipped; reuse the slice for the scheduling append. Gap-free: only pods that actually get a post-bind row are skipped, so a pod past the 10m post-bind window keeps its rollup row.
…taxonomy filters

issues is now one curated operational stream (workload/pod problems, dangling refs, pod-startup blockers, and False CRD conditions), severity ranked. Detection provenance is no longer a user/agent filter axis:
- Drop the source= filter from /api/issues and the MCP issues tool. source survives only as an output label on each row + a CEL filter binding.
- Remove event + kyverno from issue composition entirely (and the include_events/include_kyverno flags). Raw events live in get_events / the timeline; policy posture lives in get_cluster_audit. Shrinks the issues.Provider interface (WarningEvents/KyvernoFindings/KyvernoStatus) and deletes the source-parse plumbing.
- SchedulingGated pods are no longer flagged Unschedulable (gate on reason==Unschedulable, matching the frontend).
- Remove proactive ResourceQuota saturation from the stream: a saturated quota is namespace capacity context, not a live failure; the reactive FailedCreate path and the Namespace quota UI still cover it. This also fixes diagnose over-attributing namespace quota to unrelated workloads.
- Rename diagnose's scheduling response field to startupBlockers (it spans bind-time, admission, and post-bind, not just scheduling).
- Drop the crippled include=logs from get_resource (use get_pod_logs / get_workload_logs / diagnose).
- Refresh docs/mcp.md tool table (was missing 6+ central tools) and fix the "non-destructive" wording: write tools are destructiveHint:true.

NOTE: /api/issues no longer reads source=/include_*; extra query params are ignored (no 400). radar-hub-web should be checked for any fleet view that relied on source=kyverno/include_kyverno.
…resh stale docs

Addresses review findings on the issues/scheduling work:
- clusterrole.yaml: add resourcequotas to the core read-only rule. The PR caches + probes ResourceQuota (capabilities.go) but in-cluster installs could not list it, silently hiding the Namespace quota section + the ResourceQuota API/UI for the users who need them.
- get_resource: include=logs was dropped but silently became a no-op; return a logsError pointing to get_pod_logs / get_workload_logs / diagnose so a client on a stale schema is redirected instead of seeing empty success.
- Refresh stale docs/comments the issues refactor missed: issues/types.go (package doc, Severity mapping, Source doc, Issue doc still described the removed event/kyverno sources + source-as-filter), summarycontext.go (referenced removed Filters fields), docs/mcp.md "non-destructive" line, docs/integrations.md (Kyverno-via-/api/issues no longer exists).
- Delete now-dead policy_reports_testhooks.go (its only consumer was the removed issues_handler_test.go).
- Tests: pin the CEL source binding (now the only source-slicing path) and startupBlockersForWorkload workload-scoping (its contract changed).

The origin/main merge (prior commit) absorbs #780, resolving the apparent client.ts cache-seeding revert (merge skew - this branch never touched it).
…nknown include values; document resourcequotas RBAC

Follow-up review findings on the issues/MCP work:
- docs/integrations.md + issues MCP tool description wrongly routed Kyverno PolicyReport findings to the cluster audit (/api/audit + get_cluster_audit). Audit consumes only typed K8s + Crossplane and has zero PolicyReport input. Kyverno surfaces per-resource (PolicyReport detail view + resourceContext policy rollup); say that, and stop pointing agents at a tool that returns nothing for it.
- Issue struct doc: the snapshot-timestamp note was only true for problem/missing_ref/scheduling (LastSeen=compose time); condition rows set both timestamps to the condition's lastTransitionTime. Distinguish the two.
- get_resource: include=logs was guarded, but every OTHER unknown token (typos, or "relationships" which moved to resourceContext) still silently no-op'd. Surface unknown include values via includeError so a token that did nothing is reported, not swallowed.
- README.md + docs/in-cluster.md: document the resourcequotas (and LimitRanges) read grant the chart ClusterRole now carries, so the supported-resources lists match the deployed RBAC.
Closing — investigation showed this work is already in
My earlier check used Net result of cherry-picking this branch onto current Real follow-up is on the SREGym bench side:
Surfaces admission-time and scheduling-time pod-template rejections as a first-class failure class in issues / diagnose / dashboard, alongside a typed ResourceQuotas cache so quota-blocked pods are reachable by name.

This branch was originally split out of #775 to keep that PR reviewable; the scheduling half is the work that didn't come along for the ride. Picking it up now because the SREGym bench exposed namespace_memory_limit as the one scenario both arms (kubectl + radar) failed twice: neither tool has anywhere to look for "the apiserver is rejecting this Deployment's pod template," and the upcoming bench batch has five more admission-layer scenarios in the same shape (pvc_claim_mismatch, taint_no_toleration, service_port_conflict, persistent_volume_affinity_violation, resource_request_too_large).

What's in here

Engine (internal/k8s/scheduling.go, internal/issues)
- DetectAdmissionProblems: reads controller FailedCreate events and classifies the message into exceeded quota, LimitRange, PodSecurity, or webhook. Bounded by a 30-minute "still happening" window (the controller re-emits continuously while stuck), one row per blocked owner-collapsed subject, latest-blocker semantics so a quota-cleared-then-webhook-rejected sequence shows the current cause, not the first-seen one.
- workload_degraded row isn't surfaced alongside the admission-failure row that explains it.

Typed cache (pkg/k8score)
- ResourceQuotas lister + capability bit + RBAC, so consumers (engine, diagnose, REST) can read quotas by name without going through the dynamic client every time.
- internal/k8s/fetch.go quota-fetch paths so the resource is addressable by get_resource etc.

MCP / REST / UI
- feat(mcp): surface scheduling in issues / diagnose / dashboard. Scheduling blockers show up in the MCP tools agents use, with the blocker's reason and the parsed quota/LimitRange/webhook detail in the diagnose narrative.
- feat(ui): surface scheduling root cause in pod / namespace / topology. Pods stuck Pending render the blocker reason in the renderer chrome; namespace view surfaces active quota saturation as context.

Tests
- scheduling_test.go covers Job/DaemonSet admission cross-check, quota boundary, post-bind event handling, dedup semantics (latest-blocker, owner-collapsed).

State and what's needed before merge
- Merge origin/main. I did not refresh it before pushing because that touches code I haven't reviewed and we wanted it visible in origin first.
- fix(review): responses to earlier feedback (Bugbot + manual). Those review threads were on the pre-split state; current reviewers may want a fresh pass.
- FailedCreate events.

Why this is the right priority right now

radar-bench-sregym/radar-improvement-ideas.md (next to this checkout, under ~/sky/ws3/) flagged this as the highest impact-to-effort gap from the N=2 bench: the running radar build is runtime-pod-centric and has no admission-layer signal, which is exactly the failure class neither tool cracked in the bench. The work is already done; it just needed to come back out of the drawer.