Feat/remove restart policy never if autoscaler v2 and ray 2.55 or greater #4816

Open
fscnick wants to merge 11 commits into ray-project:master from fscnick:feat/remove-restart-policy-never-if-autoscaler-v2
Collaborator

@fscnick fscnick commented May 8, 2026

Why are these changes needed?

Starting with Ray 2.55.0, restarting the autoscaler is supported, so forcing RestartPolicy to Never may no longer be necessary.

This PR uses rayVersion to determine the Ray version. If the version cannot be parsed, the feature is considered not valid.
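
For readers following along, the version gate described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `parseVersion`, `versionAtLeast`, and `minAutoscalerRestartVersion` are hypothetical names, the threshold value is taken from the description, and the sketch only accepts plain `MAJOR.MINOR` or `MAJOR.MINOR.PATCH` strings.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minAutoscalerRestartVersion is a hypothetical constant for this sketch;
// the value comes from the PR description ("Starting from ray 2.55.0").
const minAutoscalerRestartVersion = "2.55.0"

// parseVersion splits "MAJOR.MINOR" or "MAJOR.MINOR.PATCH" into three
// integers. A missing patch component is treated as 0.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimSpace(v), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// versionAtLeast reports whether version >= minimum. On a parse error it
// returns false together with the error, so callers that drop the error
// still fall back to the conservative branch.
func versionAtLeast(version, minimum string) (bool, error) {
	got, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	want, err := parseVersion(minimum)
	if err != nil {
		return false, err
	}
	for i := range got {
		if got[i] != want[i] {
			return got[i] > want[i], nil
		}
	}
	return true, nil // all components are equal
}

func main() {
	for _, v := range []string{"2.55.0", "2.54.9", "2.55", "", "nightly"} {
		ok, err := versionAtLeast(v, minAutoscalerRestartVersion)
		fmt.Printf("rayVersion=%q atLeast=%v err=%v\n", v, ok, err)
	}
}
```

Returning `false` alongside the error means an empty or unparseable rayVersion keeps the existing behavior rather than silently enabling the relaxation.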

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

fscnick added 2 commits May 7, 2026 22:27
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
@fscnick fscnick force-pushed the feat/remove-restart-policy-never-if-autoscaler-v2 branch from 51e5eb6 to 31cf894 May 8, 2026 15:30
@fscnick fscnick marked this pull request as ready for review May 9, 2026 00:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31cf894175

// rayproject/ray:2.55.0-py310 → "2.55.0"
// rayproject/ray:2.55.0@sha256:abc → "2.55.0"
// rayproject/ray:2.55 → "2.55"
var rayImageVersionRegex = regexp.MustCompile(`rayproject/ray:(\d+\.\d+(?:\.\d+)?)`)

P2: Parse Ray version from all supported Ray image names

The new version gate only matches rayproject/ray:<ver> (rayImageVersionRegex), so autoscaler-v2 clusters using other supported Ray images (for example rayproject/ray-ml:* in ray-operator/config/samples/ray-service.stable-diffusion.yaml) always fail parsing and are treated as "version unknown". In those cases the operator/webhook keeps enforcing restartPolicy: Never, which means the intended 2.55+ relaxation never activates even when the image tag is new enough.
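
One way to illustrate the point is a pattern that also accepts the `ray-ml` repository. The sketch below is an assumption for discussion, not the fix adopted in this PR: `imageVersionRegex` and `versionFromImage` are hypothetical names, and only `rayproject/ray` and `rayproject/ray-ml` are covered.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageVersionRegex is a hypothetical variant of the pattern quoted above.
// The optional "-ml" group lets rayproject/ray-ml:<ver> match as well.
var imageVersionRegex = regexp.MustCompile(`rayproject/ray(?:-ml)?:(\d+\.\d+(?:\.\d+)?)`)

// versionFromImage returns the numeric version embedded in the image tag.
// The boolean is false when the image does not match, for example a
// non-numeric tag or a repository outside the two covered here.
func versionFromImage(image string) (string, bool) {
	m := imageVersionRegex.FindStringSubmatch(image)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	images := []string{
		"rayproject/ray:2.55.0-py310",
		"rayproject/ray:2.55.0@sha256:abc",
		"rayproject/ray-ml:2.55.0",
		"rayproject/ray:nightly",
	}
	for _, img := range images {
		v, ok := versionFromImage(img)
		fmt.Printf("%s -> version=%q matched=%v\n", img, v, ok)
	}
}
```

Custom or mirrored image names would still fail to match, which is one argument for reading the version from a dedicated field instead of the image string.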

@fscnick fscnick marked this pull request as draft May 9, 2026 02:24
@fscnick fscnick marked this pull request as ready for review May 11, 2026 15:59
Comment on lines +473 to +475
// Use the headGroupSpec to determine whether the RestartPolicy should be Never or not, since the head pod is the one that runs the autoscaler.
// The error is ignored here because the function will return false if there's an error parsing the version.
// For example, if rayVersion is empty or unparseable, it considers the feature is not valid.
Contributor

The comment says "Use the headGroupSpec" but the code actually reads from the cluster-level spec (instance.Spec.RayVersion, IsAutoscalingEnabled, IsAutoscalingV2Enabled), not from HeadGroupSpec. Would it be better to update the comment to reflect that both head and worker templates use the same cluster-level gate to decide RestartPolicy?

Collaborator Author

fixed at 5978fd1

Contributor

seems to be pre-existing typo "should have the correct"

@@ -1465,10 +1526,18 @@ func TestDefaultWorkerPodTemplate_Autoscaling(t *testing.T) {
Contributor

ditto

Collaborator Author

fixed at 0937f50 along with the above.

if utils.IsAutoscalingV2Enabled(&instance.Spec) {
setAutoscalerV2EnvVars(&podTemplate)
podTemplate.Spec.RestartPolicy = corev1.RestartPolicyNever
if autoscalerRestartValid, _ := utils.IsRayVersionAtLeast(instance.Spec.RayVersion, utils.MinAutoscalerRestartValidVersion); !autoscalerRestartValid {
Contributor

@AndySung320 AndySung320 May 14, 2026

The error from IsRayVersionAtLeast is silently ignored here. The safe-default behavior (falling back to Never) makes sense, but if rayVersion is empty or unparseable, the user has no way to tell whether Never was set because the version is too old or because their config is wrong. Would it be worth adding a warning log when the error is non-nil?
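
A rough sketch of what this suggestion could look like, to make the two cases distinguishable in logs. Everything here is hypothetical: `chooseRestartPolicy`, the `warn` callback, and `errEmptyVersion` are illustration-only names, and the standard `log` package stands in for whatever logger the operator actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// restartPolicyNever mirrors the string value of corev1.RestartPolicyNever.
const restartPolicyNever = "Never"

// errEmptyVersion is a hypothetical error used only by this sketch.
var errEmptyVersion = errors.New("rayVersion is empty")

// chooseRestartPolicy takes the result of a version check (supported, err)
// and returns the restart policy to apply. An empty string means "leave the
// pod template's policy untouched". A non-nil err triggers a warning so the
// user can tell a misconfigured rayVersion from a version that is too old.
func chooseRestartPolicy(supported bool, err error, warn func(msg string)) string {
	if err != nil {
		warn(fmt.Sprintf("cannot determine Ray version, keeping restartPolicy=Never: %v", err))
		return restartPolicyNever
	}
	if !supported {
		return restartPolicyNever // version parsed fine but is too old
	}
	return ""
}

func main() {
	warn := func(msg string) { log.Println("WARN:", msg) }
	fmt.Printf("unparseable: %q\n", chooseRestartPolicy(false, errEmptyVersion, warn))
	fmt.Printf("too old:     %q\n", chooseRestartPolicy(false, nil, warn))
	fmt.Printf("new enough:  %q\n", chooseRestartPolicy(true, nil, warn))
}
```

The safe default is unchanged in all three branches; only the error branch gains a log line.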

Collaborator Author

fixed at 6be130a along with the similar one in DefaultHeadPodTemplate.

return fmt.Errorf("Currently, SidecarMode doesn't support SubmitterConfig")
}

if rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever {
Contributor

Does this restart policy also need a change?

Suggested change
// The error is ignored here because the function will return false if there's an error parsing the version.
// For example, if rayVersion is empty or unparseable, it considers the feature is not valid.
autoscalerRestartValid, _ := IsRayVersionAtLeast(rayJob.Spec.RayClusterSpec.RayVersion, MinAutoscalerRestartValidVersion)
if !autoscalerRestartValid && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever {

Collaborator Author

I'm not certain about this. This one is set to Never because of SidecarMode, not the autoscaler.

fscnick added 3 commits May 16, 2026 09:21
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6be130a.

Comment thread ray-operator/controllers/ray/common/pod.go Outdated
expectedRestartPolicy: "",
},
"Pod template with autoscaling v1 enabled should the correct autoscaler v1 fields": {
"Pod template with autoscaling v1 enabled should the the correct autoscaler v1 fields": {
Contributor

minor typo: "have"

Collaborator Author

fix at f3458fd.

mergeAutoscalerOverrides(&autoscalerContainer, instance.Spec.AutoscalerOptions)
podTemplate.Spec.Containers = append(podTemplate.Spec.Containers, autoscalerContainer)

// The error is ignored here because the function will return false if there's an error parsing the version.
Contributor

since we keep err here, i think we should also re-word the comment?

}

// Use the RayVersion and autoscaler version to determine whether the RestartPolicy should be Never or not, since the head pod is the one that runs the autoscaler.
// The error is ignored here because the function will return false if there's an error parsing the version.
Contributor

same here

Collaborator Author

fixed at 565d9a8 along with the above.

fscnick added 2 commits May 19, 2026 21:15
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
3 participants