Feat/remove restart policy never if autoscaler v2 and ray 2.55 or greater#4816
Feat/remove restart policy never if autoscaler v2 and ray 2.55 or greater#4816fscnick wants to merge 11 commits into
Conversation
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
51e5eb6 to
31cf894
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31cf894175
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // rayproject/ray:2.55.0-py310 → "2.55.0" | ||
| // rayproject/ray:2.55.0@sha256:abc → "2.55.0" | ||
| // rayproject/ray:2.55 → "2.55" | ||
| var rayImageVersionRegex = regexp.MustCompile(`rayproject/ray:(\d+\.\d+(?:\.\d+)?)`) |
There was a problem hiding this comment.
Parse Ray version from all supported Ray image names
The new version gate only matches rayproject/ray:<ver> (rayImageVersionRegex), so autoscaler-v2 clusters using other supported Ray images (for example rayproject/ray-ml:* in ray-operator/config/samples/ray-service.stable-diffusion.yaml) always fail parsing and are treated as "version unknown". In those cases the operator/webhook keeps enforcing restartPolicy: Never, which means the intended 2.55+ relaxation never activates even when the image tag is new enough.
Useful? React with 👍 / 👎.
Signed-off-by: fscnick <fscnick.dev@gmail.com>
…rt-policy-never-if-autoscaler-v2
| // Use the headGroupSpec to determine whether the RestartPolicy should be Never or not, since the head pod is the one that runs the autoscaler. | ||
| // The error is ignored here because the function will return false if there's an error parsing the version. | ||
| // For example, if rayVersion is empty or unparseable, it considers the feature is not valid. |
There was a problem hiding this comment.
The comment says "Use the headGroupSpec" but the code actually reads from the cluster-level spec (instance.Spec.RayVersion, IsAutoscalingEnabled, IsAutoscalingV2Enabled), not from HeadGroupSpec. Would it be better updating the comment to reflect that both head and worker templates use the same cluster-level gate to decide RestartPolicy ?
There was a problem hiding this comment.
seems to be pre-existing typo "should have the correct"
| @@ -1465,10 +1526,18 @@ func TestDefaultWorkerPodTemplate_Autoscaling(t *testing.T) { | |||
| if utils.IsAutoscalingV2Enabled(&instance.Spec) { | ||
| setAutoscalerV2EnvVars(&podTemplate) | ||
| podTemplate.Spec.RestartPolicy = corev1.RestartPolicyNever | ||
| if autoscalerRestartValid, _ := utils.IsRayVersionAtLeast(instance.Spec.RayVersion, utils.MinAutoscalerRestartValidVersion); !autoscalerRestartValid { |
There was a problem hiding this comment.
The error from IsRayVersionAtLeast is silently ignored here. The safe-default behavior (falling back to Never) makes sense, but if rayVersion is empty or unparseable, the user has no way to tell whether Never was set because the version is too old or because their config is wrong. Would it be worth adding a warning log when the error is non-nil?
There was a problem hiding this comment.
fixed at 6be130a along with the similar one in DefaultHeadPodTemplate.
| return fmt.Errorf("Currently, SidecarMode doesn't support SubmitterConfig") | ||
| } | ||
|
|
||
| if rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever { |
There was a problem hiding this comment.
Does this restart policy also need a change?
| // The error is ignored here because the function will return false if there's an error parsing the version. | |
| // For example, if rayVersion is empty or unparseable, it considers the feature is not valid. | |
| autoscalerRestartValid, _ := IsRayVersionAtLeast(rayJob.Spec.RayClusterSpec.RayVersion, MinAutoscalerRestartValidVersion) | |
| if !autoscalerRestartValid && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != "" && rayJob.Spec.RayClusterSpec.HeadGroupSpec.Template.Spec.RestartPolicy != corev1.RestartPolicyNever { |
There was a problem hiding this comment.
I'm not that certain about this. This is set to Never because of SidecarMode not the autoscaler.
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6be130a. Configure here.
…rt-policy-never-if-autoscaler-v2
Signed-off-by: fscnick <fscnick.dev@gmail.com>
| expectedRestartPolicy: "", | ||
| }, | ||
| "Pod template with autoscaling v1 enabled should the correct autoscaler v1 fields": { | ||
| "Pod template with autoscaling v1 enabled should the the correct autoscaler v1 fields": { |
| mergeAutoscalerOverrides(&autoscalerContainer, instance.Spec.AutoscalerOptions) | ||
| podTemplate.Spec.Containers = append(podTemplate.Spec.Containers, autoscalerContainer) | ||
|
|
||
| // The error is ignored here because the function will return false if there's an error parsing the version. |
There was a problem hiding this comment.
since we keep err here, i think we should also re-word the comment?
| } | ||
|
|
||
| // Use the RayVersion and autoscaler version to determine whether the RestartPolicy should be Never or not, since the head pod is the one that runs the autoscaler. | ||
| // The error is ignored here because the function will return false if there's an error parsing the version. |
Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: fscnick <fscnick.dev@gmail.com>

Why are these changes needed?
Starting from ray 2.55.0, the autoscaler restart is supported. The
RestartPolicyisNevermight not be necessary.In this PR, it leverages the
rayVersionto determine the version of ray. If the version parsed is failed, it would consider it not valid.Related issue number
Checks