
[WIP][KubeRay][Autoscaler] Add KubeRay-side support for idle TTL termination#4815

Draft
400Ping wants to merge 1 commit into ray-project:master from 400Ping:feature/idle-termination


400Ping commented May 8, 2026

Why are these changes needed?

Ray Autoscaler can already scale idle worker Pods down, but an idle RayCluster still leaves the head Pod, Services, and RayCluster custom resource running. For users whose head Pod reserves non-trivial capacity, the cluster keeps incurring cost even after the Ray workload has gone idle.

This PR adds the KubeRay-side API and operator support for cluster-level idle termination via the Ray Autoscaler.

The intended flow is:

User sets spec.autoscalerOptions.ttlSecondsAfterIdle
        |
        v
Ray Autoscaler detects that the whole Ray cluster is idle past the TTL
        |
        v
Ray Autoscaler patches RayCluster.status.conditions:
  IdleTTLExpired=True
        |
        v
KubeRay operator observes the condition
        |
        v
KubeRay operator deletes the RayCluster
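
As a rough illustration of the two objects this flow passes around, the sketch below builds the spec fragment a user would set and the status condition the autoscaler would patch. It assumes the condition uses the standard Kubernetes condition fields (type, status, reason, message, lastTransitionTime). The helper name, the reason and message strings, and the 600-second TTL are hypothetical and not taken from this PR.

```python
from datetime import datetime, timezone

# Spec fragment the user sets. The 600-second TTL is an arbitrary example value.
raycluster_spec = {
    "enableInTreeAutoscaling": True,
    "autoscalerOptions": {"ttlSecondsAfterIdle": 600},
}


def build_idle_ttl_expired_condition(now: datetime) -> dict:
    """Build the condition the autoscaler would patch (hypothetical helper)."""
    return {
        "type": "IdleTTLExpired",
        "status": "True",
        # Reason and message are illustrative placeholders, not values from the PR.
        "reason": "ClusterIdlePastTTL",
        "message": "Ray cluster has been idle longer than ttlSecondsAfterIdle",
        # Kubernetes timestamps are RFC 3339 strings in UTC.
        "lastTransitionTime": now.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


condition = build_idle_ttl_expired_condition(
    datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
)
# Body for a patch against the rayclusters/status subresource. How the real
# autoscaler merges this with existing conditions is not specified in this PR.
status_patch = {"status": {"conditions": [condition]}}
```

Note that a plain JSON merge patch replaces the whole conditions list, so a real implementation would likely need to preserve conditions already set by the operator.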

This keeps the component ownership split clean:

  • Ray Autoscaler owns Ray-level idleness detection because it has access to Ray workload and node state.
  • KubeRay operator owns Kubernetes resource lifecycle because it already reconciles RayCluster resources.

This PR does not implement Ray-level idle detection. A corresponding Ray Autoscaler change is required to read ttlSecondsAfterIdle, decide when the cluster is idle past the TTL, and patch the IdleTTLExpired status condition.
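
For orientation only, the autoscaler-side decision could look roughly like the sketch below. Nothing here comes from this PR or from the Ray codebase: the function name, the `idle_since` input, and the choice to treat an unset TTL as "disabled" are all assumptions.

```python
from typing import Optional


def idle_ttl_expired(
    idle_since: Optional[float],
    now: float,
    ttl_seconds: Optional[int],
) -> bool:
    """Return True when the whole cluster has been idle for at least the TTL.

    idle_since:  timestamp (seconds) at which the cluster became fully idle,
                 or None while any workload is still active (hypothetical input).
    ttl_seconds: value of ttlSecondsAfterIdle, or None when the field is unset.
    """
    if ttl_seconds is None:
        # Assumption: an unset TTL means idle termination is disabled.
        return False
    if idle_since is None:
        # The cluster is not idle, so the TTL clock has not started.
        return False
    return (now - idle_since) >= ttl_seconds
```

With `idle_since=100.0` and a TTL of 600, this sketch returns True once `now` reaches 700.0, at which point the autoscaler would patch IdleTTLExpired=True.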

Changes in this PR:

  • Add spec.autoscalerOptions.ttlSecondsAfterIdle.
  • Add the IdleTTLExpired RayCluster condition type.
  • Allow the autoscaler Role to patch rayclusters/status.
  • Delete the RayCluster when IdleTTLExpired=True and idle TTL termination is still enabled in spec.
  • Validate that ttlSecondsAfterIdle is non-negative and is only set when enableInTreeAutoscaling=true.
  • Update generated CRDs, Helm chart values/tests, API reference docs, deepcopy, and applyconfiguration.
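
The validation and deletion rules listed above can be sketched as follows. The actual operator is written in Go, so this Python version only models the logic. The function names and error strings are hypothetical, and it assumes "still enabled in spec" means that ttlSecondsAfterIdle is still set.

```python
def validate_autoscaler_options(spec: dict) -> list:
    """Return validation errors for the idle TTL setting (hypothetical helper)."""
    errors = []
    ttl = (spec.get("autoscalerOptions") or {}).get("ttlSecondsAfterIdle")
    if ttl is None:
        # Field unset: nothing to validate.
        return errors
    if ttl < 0:
        errors.append("ttlSecondsAfterIdle must be non-negative")
    if not spec.get("enableInTreeAutoscaling", False):
        errors.append("ttlSecondsAfterIdle requires enableInTreeAutoscaling=true")
    return errors


def should_delete(raycluster: dict) -> bool:
    """Decide whether the operator should delete the RayCluster."""
    spec = raycluster.get("spec") or {}
    ttl = (spec.get("autoscalerOptions") or {}).get("ttlSecondsAfterIdle")
    if ttl is None:
        # Assumption: removing the field from spec disables idle termination,
        # even if a stale IdleTTLExpired condition is still present.
        return False
    conditions = (raycluster.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "IdleTTLExpired" and c.get("status") == "True"
        for c in conditions
    )
```

Checking the spec again at deletion time means a user who removes the TTL after the condition was patched does not lose the cluster.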

Related issue number

Related to #4768

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: 400Ping <jiekaichang@apache.org>