2 changes: 2 additions & 0 deletions docs/reference/api.md
@@ -70,6 +70,7 @@ _Appears in:_
| `imagePullPolicy` _[PullPolicy](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#pullpolicy-v1-core)_ | ImagePullPolicy optionally overrides the autoscaler container's image pull policy. This override is provided for autoscaler testing and development. | | |
| `securityContext` _[SecurityContext](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#securitycontext-v1-core)_ | SecurityContext defines the security options the container should be run with.<br />If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext.<br />More info: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ | | |
| `idleTimeoutSeconds` _integer_ | IdleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.<br />Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler. | | |
| `ttlSecondsAfterIdle` _integer_ | TTLSecondsAfterIdle is the number of seconds to wait before deleting an idle RayCluster.<br />The Ray autoscaler observes cluster idleness and reports the IdleTTLExpired status condition. The KubeRay operator deletes the RayCluster when the condition is true. | | |
| `upscalingMode` _[UpscalingMode](#upscalingmode)_ | UpscalingMode is "Conservative", "Default", or "Aggressive."<br />Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.<br />Default: Upscaling is not rate-limited.<br />Aggressive: An alias for Default; upscaling is not rate-limited.<br />It is not read by the KubeRay operator but by the Ray autoscaler. | | Enum: [Default Aggressive Conservative] <br /> |
| `version` _[AutoscalerVersion](#autoscalerversion)_ | Version is the version of the Ray autoscaler.<br />Setting this to v1 will explicitly use autoscaler v1.<br />Setting this to v2 will explicitly use autoscaler v2.<br />If this isn't set, the Ray version determines the autoscaler version.<br />In Ray 2.47.0 and later, the default autoscaler version is v2. It's v1 before that. | | Enum: [v1 v2] <br /> |
| `env` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Optional list of environment variables to set in the autoscaler container. | | |
@@ -665,6 +666,7 @@ _Appears in:_
| `imagePullPolicy` _[PullPolicy](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#pullpolicy-v1-core)_ | ImagePullPolicy optionally overrides the autoscaler container's image pull policy. This override is provided for autoscaler testing and development. | | |
| `securityContext` _[SecurityContext](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#securitycontext-v1-core)_ | SecurityContext defines the security options the container should be run with.<br />If set, the fields of SecurityContext override the equivalent fields of PodSecurityContext.<br />More info: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ | | |
| `idleTimeoutSeconds` _integer_ | IdleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.<br />Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler. | | |
| `ttlSecondsAfterIdle` _integer_ | TTLSecondsAfterIdle is the number of seconds to wait before deleting an idle RayCluster.<br />The Ray autoscaler observes cluster idleness and reports the IdleTTLExpired status condition. The KubeRay operator deletes the RayCluster when the condition is true. | | |
| `upscalingMode` _[UpscalingMode](#upscalingmode)_ | UpscalingMode is "Conservative", "Default", or "Aggressive."<br />Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.<br />Default: Upscaling is not rate-limited.<br />Aggressive: An alias for Default; upscaling is not rate-limited.<br />It is not read by the KubeRay operator but by the Ray autoscaler. | | Enum: [Default Aggressive Conservative] <br /> |
| `env` _[EnvVar](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envvar-v1-core) array_ | Optional list of environment variables to set in the autoscaler container. | | |
| `envFrom` _[EnvFromSource](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.28/#envfromsource-v1-core) array_ | Optional list of sources to populate environment variables in the autoscaler container. | | |
6 changes: 6 additions & 0 deletions helm-chart/kuberay-operator/crds/ray.io_rayclusters.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions helm-chart/kuberay-operator/crds/ray.io_raycronjobs.yaml
@@ -387,6 +387,9 @@ spec:
type: string
type: object
type: object
ttlSecondsAfterIdle:
format: int32
type: integer
upscalingMode:
enum:
- Default
6 changes: 6 additions & 0 deletions helm-chart/kuberay-operator/crds/ray.io_rayjobs.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions helm-chart/kuberay-operator/crds/ray.io_rayservices.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions helm-chart/ray-cluster/tests/raycluster_test.yaml
@@ -40,6 +40,7 @@ tests:
autoscalerOptions:
  upscalingMode: Default
  idleTimeoutSeconds: 60
  ttlSecondsAfterIdle: 1800
  imagePullPolicy: IfNotPresent
  env:
    - name: ENV_KEY
@@ -71,6 +72,7 @@ tests:
value:
  upscalingMode: Default
  idleTimeoutSeconds: 60
  ttlSecondsAfterIdle: 1800
  imagePullPolicy: IfNotPresent
  env:
    - name: ENV_KEY
2 changes: 2 additions & 0 deletions helm-chart/ray-cluster/values.yaml
@@ -72,6 +72,8 @@ head:
# upscalingMode: Default
# idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
# idleTimeoutSeconds: 60
# ttlSecondsAfterIdle is the number of seconds to wait before deleting an idle RayCluster.
# ttlSecondsAfterIdle: 1800
# imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
# imagePullPolicy: IfNotPresent
# Optionally specify the autoscaler container's securityContext.
6 changes: 6 additions & 0 deletions ray-operator/apis/ray/v1/raycluster_types.go
@@ -227,6 +227,10 @@ type AutoscalerOptions struct {
	// Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler.
	// +optional
	IdleTimeoutSeconds *int32 `json:"idleTimeoutSeconds,omitempty"`
	// TTLSecondsAfterIdle is the number of seconds to wait before deleting an idle RayCluster.
	// The Ray autoscaler observes cluster idleness and reports the IdleTTLExpired status condition. The KubeRay operator deletes the RayCluster when the condition is true.
	// +optional
	TTLSecondsAfterIdle *int32 `json:"ttlSecondsAfterIdle,omitempty"`
	// UpscalingMode is "Conservative", "Default", or "Aggressive."
	// Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
	// Default: Upscaling is not rate-limited.
@@ -371,6 +375,8 @@ const (
	RayClusterSuspending RayClusterConditionType = "RayClusterSuspending"
	// RayClusterSuspended is set to true when all Pods belonging to a suspending RayCluster are deleted. Note that RayClusterSuspending and RayClusterSuspended cannot both be true at the same time.
	RayClusterSuspended RayClusterConditionType = "RayClusterSuspended"
	// IdleTTLExpired is set to true by the Ray autoscaler when the cluster has been idle longer than spec.autoscalerOptions.ttlSecondsAfterIdle.
	IdleTTLExpired RayClusterConditionType = "IdleTTLExpired"
)

// HeadInfo gives info about head
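As a rough illustration of how a consumer might read the new `IdleTTLExpired` condition, the sketch below looks it up in a condition list and checks that its status is `"True"`. The `Condition` struct and the `findCondition` and `idleTTLExpired` helpers are hypothetical, stdlib-only stand-ins for `metav1.Condition` and `meta.FindStatusCondition`; they are not part of this change.

```go
package main

import "fmt"

// Condition is a hypothetical stand-in for metav1.Condition, reduced to
// the fields this sketch needs.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Message string
}

// IdleTTLExpired mirrors the condition type string added in the diff.
const IdleTTLExpired = "IdleTTLExpired"

// findCondition returns a pointer to the first condition with the given
// type, or nil when no such condition is present.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// idleTTLExpired reports whether the condition is present and "True".
func idleTTLExpired(conds []Condition) bool {
	c := findCondition(conds, IdleTTLExpired)
	return c != nil && c.Status == "True"
}

func main() {
	conds := []Condition{
		{Type: "RayClusterSuspended", Status: "False"},
		{Type: IdleTTLExpired, Status: "True", Message: "cluster idle past TTL"},
	}
	fmt.Println(idleTTLExpired(conds)) // true
	fmt.Println(idleTTLExpired(nil))   // false
}
```

A missing condition and a condition with status `"False"` are treated the same way here: neither triggers deletion.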
5 changes: 5 additions & 0 deletions ray-operator/apis/ray/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions ray-operator/apis/ray/v1alpha1/raycluster_types.go
@@ -84,6 +84,9 @@ type AutoscalerOptions struct {
	// IdleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
	// Defaults to 60 (one minute). It is not read by the KubeRay operator but by the Ray autoscaler.
	IdleTimeoutSeconds *int32 `json:"idleTimeoutSeconds,omitempty"`
	// TTLSecondsAfterIdle is the number of seconds to wait before deleting an idle RayCluster.
	// The Ray autoscaler observes cluster idleness and reports the IdleTTLExpired status condition. The KubeRay operator deletes the RayCluster when the condition is true.
	TTLSecondsAfterIdle *int32 `json:"ttlSecondsAfterIdle,omitempty"`
	// UpscalingMode is "Conservative", "Default", or "Aggressive."
	// Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
	// Default: Upscaling is not rate-limited.
5 changes: 5 additions & 0 deletions ray-operator/apis/ray/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions ray-operator/config/crd/bases/ray.io_rayclusters.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions ray-operator/config/crd/bases/ray.io_raycronjobs.yaml
@@ -387,6 +387,9 @@ spec:
type: string
type: object
type: object
ttlSecondsAfterIdle:
format: int32
type: integer
upscalingMode:
enum:
- Default
6 changes: 6 additions & 0 deletions ray-operator/config/crd/bases/ray.io_rayjobs.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 6 additions & 0 deletions ray-operator/config/crd/bases/ray.io_rayservices.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 5 additions & 0 deletions ray-operator/controllers/ray/common/rbac.go
@@ -54,6 +54,11 @@ func BuildRole(cluster *rayv1.RayCluster) (*rbacv1.Role, error) {
				Resources: []string{"rayclusters"},
				Verbs:     []string{"get", "patch"},
			},
			{
				APIGroups: []string{"ray.io"},
				Resources: []string{"rayclusters/status"},
				Verbs:     []string{"patch"},
			},
		},
	}

19 changes: 19 additions & 0 deletions ray-operator/controllers/ray/common/rbac_test.go
@@ -6,6 +6,7 @@ import (
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
@@ -85,3 +86,21 @@ func TestBuildRoleBindingSubjectAndRoleRefName(t *testing.T) {
		})
	}
}

func TestBuildRoleAllowsAutoscalerToPatchRayClusterStatus(t *testing.T) {
	cluster := &rayv1.RayCluster{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "raycluster-sample",
			Namespace: "default",
		},
	}

	role, err := BuildRole(cluster)
	require.NoError(t, err)

	assert.Contains(t, role.Rules, rbacv1.PolicyRule{
		APIGroups: []string{"ray.io"},
		Resources: []string{"rayclusters/status"},
		Verbs:     []string{"patch"},
	})
}
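The new rule is needed because Kubernetes RBAC treats a subresource such as `rayclusters/status` as a separate resource name: a grant on `rayclusters` alone does not cover it. The sketch below illustrates that with a simplified matcher. `PolicyRule` is a hypothetical stand-in for `rbacv1.PolicyRule`, and `allows` is an illustrative helper that does exact string matching only (no `*` wildcards, no resource names); it is not how the API server's authorizer is implemented.

```go
package main

import "fmt"

// PolicyRule is a hypothetical stand-in for rbacv1.PolicyRule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether want appears in list.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on group/resource.
// Simplification: exact matches only, wildcards are not handled.
func allows(rules []PolicyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) &&
			contains(r.Resources, resource) &&
			contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// Rule that existed before the change.
	base := PolicyRule{
		APIGroups: []string{"ray.io"},
		Resources: []string{"rayclusters"},
		Verbs:     []string{"get", "patch"},
	}
	// Rule added by the change.
	status := PolicyRule{
		APIGroups: []string{"ray.io"},
		Resources: []string{"rayclusters/status"},
		Verbs:     []string{"patch"},
	}

	before := []PolicyRule{base}
	after := []PolicyRule{base, status}

	fmt.Println(allows(before, "ray.io", "rayclusters/status", "patch")) // false
	fmt.Println(allows(after, "ray.io", "rayclusters/status", "patch"))  // true
}
```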
16 changes: 16 additions & 0 deletions ray-operator/controllers/ray/raycluster_controller.go
@@ -293,6 +293,15 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance
		return ctrl.Result{}, nil
	}

	if cond := meta.FindStatusCondition(instance.Status.Conditions, string(rayv1.IdleTTLExpired)); cond != nil && cond.Status == metav1.ConditionTrue && isIdleTTLTerminationEnabled(instance) {
		logger.Info("Deleting RayCluster because the idle TTL has expired", "condition", cond)
		r.Recorder.Eventf(instance, corev1.EventTypeNormal, string(rayv1.IdleTTLExpired), "%s", cond.Message)
		if err := r.Delete(ctx, instance); err != nil {
			return ctrl.Result{RequeueAfter: DefaultRequeueDuration}, client.IgnoreNotFound(err)
		}
		return ctrl.Result{}, nil
	}

	reconcileFuncs := []reconcileFunc{
		r.reconcileAutoscalerServiceAccount,
		r.reconcileAutoscalerRole,
@@ -355,6 +364,13 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, instance
	return ctrl.Result{RequeueAfter: time.Duration(requeueAfterSeconds) * time.Second}, nil
}

func isIdleTTLTerminationEnabled(instance *rayv1.RayCluster) bool {
	return instance != nil &&
		utils.IsAutoscalingEnabled(&instance.Spec) &&
		instance.Spec.AutoscalerOptions != nil &&
		instance.Spec.AutoscalerOptions.TTLSecondsAfterIdle != nil
}

func (r *RayClusterReconciler) reconcileAuthSecret(ctx context.Context, instance *rayv1.RayCluster) error {
	logger := ctrl.LoggerFrom(ctx)
	logger.Info("Reconciling Auth")
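The gate in `isIdleTTLTerminationEnabled` means an `IdleTTLExpired=True` condition is acted on only when autoscaling is on and `ttlSecondsAfterIdle` is set. The stdlib-only sketch below walks through that truth table. The `RayClusterSpec` and `AutoscalerOptions` types are hypothetical, trimmed-down stand-ins for the `rayv1` types, and the sketch assumes that `utils.IsAutoscalingEnabled` amounts to checking a non-nil, true `EnableInTreeAutoscaling` pointer, which this diff does not confirm.

```go
package main

import "fmt"

// AutoscalerOptions is a hypothetical stand-in holding only the new field.
type AutoscalerOptions struct {
	TTLSecondsAfterIdle *int32
}

// RayClusterSpec is a hypothetical stand-in for rayv1.RayClusterSpec.
type RayClusterSpec struct {
	EnableInTreeAutoscaling *bool // assumed to be what IsAutoscalingEnabled checks
	AutoscalerOptions       *AutoscalerOptions
}

// idleTTLTerminationEnabled mirrors the shape of the gate in the diff:
// every pointer on the path must be non-nil and autoscaling must be on.
func idleTTLTerminationEnabled(spec *RayClusterSpec) bool {
	return spec != nil &&
		spec.EnableInTreeAutoscaling != nil && *spec.EnableInTreeAutoscaling &&
		spec.AutoscalerOptions != nil &&
		spec.AutoscalerOptions.TTLSecondsAfterIdle != nil
}

func main() {
	enabled := true
	ttl := int32(1800)

	full := &RayClusterSpec{
		EnableInTreeAutoscaling: &enabled,
		AutoscalerOptions:       &AutoscalerOptions{TTLSecondsAfterIdle: &ttl},
	}
	noTTL := &RayClusterSpec{
		EnableInTreeAutoscaling: &enabled,
		AutoscalerOptions:       &AutoscalerOptions{},
	}
	noAutoscaling := &RayClusterSpec{
		AutoscalerOptions: &AutoscalerOptions{TTLSecondsAfterIdle: &ttl},
	}

	fmt.Println(idleTTLTerminationEnabled(full))          // true
	fmt.Println(idleTTLTerminationEnabled(noTTL))         // false
	fmt.Println(idleTTLTerminationEnabled(noAutoscaling)) // false
	fmt.Println(idleTTLTerminationEnabled(nil))           // false
}
```

One consequence of this shape, under the same assumptions: if a user removes `ttlSecondsAfterIdle` from the spec, a stale `IdleTTLExpired=True` condition left in status would no longer cause deletion.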