Changes from 3 commits
7 changes: 6 additions & 1 deletion ray-operator/test/e2erayservice/support.go
@@ -80,6 +80,12 @@ func RayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfig
price: 2
ray_actor_options:
num_cpus: 0.1
+- name: PearStand
Member

Why add PearStand?

Contributor Author

@AndySung320 May 22, 2026

According to Ray’s official core documentation, actors default to num_cpus=1 for scheduling when it is not explicitly specified.
PearStand was defined in the graph but omitted from our serveConfigV2, so it received no custom ray_actor_options and Ray assigned it the default of 1 CPU.
Previously this was masked because our head node had limits.cpu: 2, which made KubeRay pass --num-cpus=2 to Ray. Now that the limit is removed, KubeRay falls back to requests.cpu: 1. With only 1 CPU in total available to Ray, PearStand's default 1-CPU demand exceeded the budget and caused the scheduling failure.
Adding PearStand here with num_cpus=0.1 explicitly overrides Ray's 1-CPU default and aligns it with the other deployments.

See the controller log showing PearStand failed to schedule with only 0.4 CPU available:
[Image: Screenshot 2026-05-22 at 1.17.31 PM]
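
The budget reasoning in this comment can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not KubeRay's or Ray's actual code: the `deployment` type, `rayBudgetMilli`, and `firstUnschedulable` are hypothetical helpers, and the "others" entry (six replicas at 0.1 CPU) is an invented stand-in chosen so that 0.4 CPU is left over, as in the log. It treats an unset `num_cpus` as the 1-CPU default and an unset CPU limit as "fall back to the request", as described above.

```go
package main

import "fmt"

// deployment is a hypothetical, simplified view of one serveConfigV2 entry.
type deployment struct {
	name     string
	replicas int
	milliCPU int // ray_actor_options.num_cpus per replica, in milli-CPU; 0 means "not set" in this sketch
}

// defaultMilliCPU models the 1-CPU scheduling default described in the comment.
const defaultMilliCPU = 1000

// rayBudgetMilli mirrors the fallback described in the comment: use limits.cpu
// when it is set, otherwise requests.cpu. A value of 0 means "not set".
func rayBudgetMilli(limitMilli, requestMilli int) int {
	if limitMilli > 0 {
		return limitMilli
	}
	return requestMilli
}

// firstUnschedulable walks the deployments in order, subtracts each demand from
// the budget, and returns the first deployment that no longer fits.
func firstUnschedulable(budgetMilli int, ds []deployment) (string, bool) {
	remaining := budgetMilli
	for _, d := range ds {
		perReplica := d.milliCPU
		if perReplica == 0 {
			perReplica = defaultMilliCPU // no explicit num_cpus: assume the default
		}
		need := perReplica * d.replicas
		if need > remaining {
			return d.name, true
		}
		remaining -= need
	}
	return "", false
}

func main() {
	// Illustrative numbers only; the real deployment list is not reproduced here.
	others := deployment{name: "others (hypothetical)", replicas: 6, milliCPU: 100}
	implicitPear := deployment{name: "PearStand", replicas: 1}
	explicitPear := deployment{name: "PearStand", replicas: 1, milliCPU: 100}

	cases := []struct {
		label          string
		limit, request int
		pear           deployment
	}{
		{"limit=2, PearStand implicit", 2000, 1000, implicitPear},
		{"no limit, PearStand implicit", 0, 1000, implicitPear},
		{"no limit, PearStand num_cpus=0.1", 0, 1000, explicitPear},
	}
	for _, c := range cases {
		budget := rayBudgetMilli(c.limit, c.request)
		name, blocked := firstUnschedulable(budget, []deployment{others, c.pear})
		fmt.Printf("%s: budget=%dm blocked=%v %s\n", c.label, budget, blocked, name)
	}
}
```

Working in integer milli-CPU avoids floating-point drift when summing values such as 0.1. Only the middle case reports PearStand as blocked, which matches the failure described in the comment.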

+num_replicas: 1
+user_config:
+price: 4
+ray_actor_options:
+num_cpus: 0.1
- name: FruitMarket
num_replicas: 1
ray_actor_options:
@@ -127,7 +133,6 @@ func RayServiceSampleYamlApplyConfiguration() *rayv1ac.RayServiceSpecApplyConfig
corev1.ResourceMemory: resource.MustParse("2Gi"),
}).
WithLimits(corev1.ResourceList{
-corev1.ResourceCPU: resource.MustParse("2"),
corev1.ResourceMemory: resource.MustParse("3Gi"),
})))))))
}
2 changes: 0 additions & 2 deletions ray-operator/test/e2erayservice/testdata/ray-service.ft.yaml
@@ -49,7 +49,6 @@ spec:
cpu: 300m
memory: 1G
limits:
-cpu: 500m
memory: 2G
workerGroupSpecs:
- replicas: 1
@@ -68,7 +67,6 @@ spec:
cpu: 300m
memory: 1G
limits:
-cpu: 500m
memory: 1G
---
kind: ConfigMap
@@ -43,7 +43,6 @@ spec:
cpu: "1"
memory: 1G
limits:
-cpu: "1"
Worker CPU limit not removed in autoscaling YAML

Medium Severity

The CPU limit (cpu: "500m") on the worker pod in rayservice.autoscaling.yaml was not removed, while CPU limits were consistently removed from both head and worker pods in all other RayService test YAML files (rayservice.static.yaml, rayservice.deletiondelay.yaml, ray-service.ft.yaml). This appears to be an oversight that leaves the autoscaling worker pod susceptible to the same CPU throttling-related flakiness this PR intends to eliminate.

Reviewed by Cursor Bugbot for commit 5a93c83.

memory: 2G
ports:
- containerPort: 6379
@@ -69,5 +68,4 @@ spec:
cpu: "300m"
memory: "1G"
limits:
-cpu: "500m"
memory: "1G"
@@ -61,5 +61,4 @@ spec:
cpu: "300m"
memory: "1G"
limits:
-cpu: "500m"
Comment thread
cursor[bot] marked this conversation as resolved.
memory: "1G"
@@ -32,7 +32,6 @@ spec:
cpu: "300m"
memory: "1G"
limits:
-cpu: "500m"
memory: "2G"
ports:
- containerPort: 6379
@@ -60,5 +59,4 @@ spec:
cpu: "300m"
memory: "1G"
limits:
-cpu: "500m"
memory: "1G"