Remove CPU limit for RayService e2e test #4859
Signed-off-by: AndySung320 <andysung0320@gmail.com>
andrewsykim
left a comment
Overall this makes sense to me. I would not be surprised if CPU limits contribute to some level of flakiness in the e2e tests.
        price: 2
      ray_actor_options:
        num_cpus: 0.1
    - name: PearStand
According to Ray’s official core spec, actors default to num_cpus=1 for scheduling when the value is not set explicitly.
Because PearStand was defined in the graph but omitted from our serveConfigV2, it received no custom ray_actor_options, so Ray assigned it the default of 1 CPU.
Previously this was masked because our head node had limits.cpu: 2, which made KubeRay pass --num-cpus=2 to Ray. With the limit removed, KubeRay falls back to requests.cpu: 1. With only 1 CPU in total available to Ray, PearStand's default 1-CPU demand exceeded the remaining budget and scheduling failed.
Adding PearStand here with num_cpus=0.1 explicitly overrides Ray's 1-CPU default and aligns it with the other deployments.
The controller log shows PearStand failing to schedule with only 0.4 CPU available.
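The scheduling arithmetic in this comment can be sketched in a few lines of Python. This is only an illustration of the reasoning, not KubeRay or Ray code: the function names are hypothetical, the "limit if set, otherwise request" rule is taken from the description above, and the 0.6 CPU of existing usage is inferred from the 0.4 CPU reported as available out of 1.

```python
from typing import Optional

# Default CPU demand for an actor without explicit num_cpus, per the comment above.
DEFAULT_ACTOR_CPUS = 1.0


def effective_num_cpus(requests_cpu: float, limits_cpu: Optional[float]) -> float:
    """CPU count advertised to Ray (hypothetical helper).

    Mirrors the behaviour described in the comment: use limits.cpu when it
    is set, otherwise fall back to requests.cpu.
    """
    return limits_cpu if limits_cpu is not None else requests_cpu


def can_schedule(demand: float, available: float) -> bool:
    """True when the requested CPU fits in what is still available."""
    return demand <= available + 1e-9  # small tolerance for float arithmetic


# Assumption: 0.6 CPU is already taken, inferred from "0.4 CPU available" out of 1.
ALREADY_USED = 0.6

# Before the change: limits.cpu = 2 and requests.cpu = 1.
available_with_limit = effective_num_cpus(1.0, 2.0) - ALREADY_USED

# After the change: no CPU limit, so requests.cpu = 1 is used.
available_without_limit = effective_num_cpus(1.0, None) - ALREADY_USED

# PearStand with the default demand fit before and no longer fits after.
fits_before = can_schedule(DEFAULT_ACTOR_CPUS, available_with_limit)
fits_after = can_schedule(DEFAULT_ACTOR_CPUS, available_without_limit)

# PearStand with num_cpus=0.1 fits even without the limit.
fits_after_override = can_schedule(0.1, available_without_limit)
```

Under these assumptions the default 1-CPU demand fits only while the limit of 2 is in place, which is why the missing ray_actor_options entry went unnoticed until the limit was removed.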

Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5a93c83.
        cpu: "1"
        memory: 1G
      limits:
        cpu: "1"
Worker CPU limit not removed in autoscaling YAML
Medium Severity
The CPU limit (cpu: "500m") on the worker pod in rayservice.autoscaling.yaml was not removed, while CPU limits were consistently removed from both head and worker pods in all other RayService test YAML files (rayservice.static.yaml, rayservice.deletiondelay.yaml, ray-service.ft.yaml). This appears to be an oversight that leaves the autoscaling worker pod susceptible to the same CPU throttling-related flakiness this PR intends to eliminate.
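An oversight like this one could be caught mechanically. The following is a minimal sketch, assuming the test YAML has already been parsed into plain Python dicts and lists; `find_cpu_limits` is a hypothetical helper, not part of KubeRay or its test suite, and the sample spec is illustrative apart from the `500m` worker limit mentioned above.

```python
from typing import Any, List, Tuple


def find_cpu_limits(node: Any, path: str = "spec") -> List[Tuple[str, str]]:
    """Return (path, value) for every resources.limits.cpu found under node.

    Walks nested dicts and lists, so it works on a whole parsed manifest
    or on a fragment of one.
    """
    hits: List[Tuple[str, str]] = []
    if isinstance(node, dict):
        resources = node.get("resources")
        if isinstance(resources, dict):
            limits = resources.get("limits")
            if isinstance(limits, dict) and "cpu" in limits:
                hits.append((path, limits["cpu"]))
        for key, value in node.items():
            hits.extend(find_cpu_limits(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_cpu_limits(item, f"{path}[{index}]"))
    return hits


# Illustrative fragment: the head has requests only, the worker still has a limit.
sample = {
    "headGroupSpec": {"template": {"spec": {"containers": [
        {"name": "ray-head", "resources": {"requests": {"cpu": "1"}}},
    ]}}},
    "workerGroupSpecs": [{"template": {"spec": {"containers": [
        {"name": "ray-worker", "resources": {
            "requests": {"cpu": "500m"},  # hypothetical request value
            "limits": {"cpu": "500m"},    # the leftover limit flagged above
        }},
    ]}}}],
}

remaining = find_cpu_limits(sample, "rayClusterConfig")
```

Here `remaining` holds a single entry pointing at the worker container, which is the kind of leftover the review flags. A check of this shape would need a YAML loader in front of it to run against the real test files.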


Why are these changes needed?
Remove CPU resource limits from RayService e2e test specs.
Previously, a 500m CPU limit on the head pod caused dashboard startup timeouts and flaky tests (fixed in #4702 by raising the limit to 1).
However, CPU limits are unnecessary in this test environment and can still cause throttling under load. Removing them entirely eliminates this class of flakiness, instead of requiring further tuning of the limit value.