Abstract

In our blogs "SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency" and "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases", we discussed SuperSkyRay's support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh, as well as SuperSkyRay's non-disruptive handling of Ray apps that outgrow single-cluster placement by extending them to multi-cluster placement. In this blog, we consider how SuperSkyRay handles KubeRay RayServices that outgrow the single Kubernetes (K8s) cluster hosting them during a zero-downtime Ray cluster upgrade or reconfiguration. To support zero downtime (the default), the RayService keeps the current Ray cluster running while it brings up an additional Ray cluster with the new configuration; the upgrade or reconfiguration is incomplete until the new version of the Ray cluster is available. When there are insufficient resources for a second RayCluster, SuperSkyRay can reschedule a RayService deployed on a single cluster onto a different cluster to keep the update from stalling indefinitely. While this relocation involves downtime, it is appropriate when time-to-update is critical and resources are limited.

Introduction

When any field in spec.rayClusterConfig of a running RayService is changed, KubeRay by default performs a zero-downtime upgrade of the Ray cluster as follows. It keeps the current copy of the Ray cluster running to continue processing service requests while it deploys an additional copy of the Ray cluster with the updates. Once the new version is fully ready, it switches the service to the updated Ray cluster and removes the old one. While this avoids service downtime, it requires that the K8s cluster hosting the RayService have sufficient resources to run two copies of the Ray cluster. When this is not possible, the service update remains incomplete for an indefinite period, which is undesirable. (RayService zero-downtime upgrade can be disabled by setting ENABLE_ZERO_DOWNTIME to false, so that cluster config changes do not trigger any upgrade operation, which can also be undesirable.)

When Nova detects that a schedule group running on a single cluster has pending pods, it looks to reschedule the group. If skip-capacity-relocate is not set, it first looks for an alternative single-cluster placement. When the group contains a RayService with a Ray cluster, it seeks an alternative single cluster that is sufficient for one copy of the Ray cluster, which works well for the update case since the relocated RayService is restarted with only the most recent Ray cluster configuration. While this relocation incurs RayService downtime, it may be worthwhile to complete the service update in a timely manner. Note that if the skip-capacity-relocate option is set, the RayService will not be relocated, and the service update will remain incomplete until sufficient resources are available in the cluster. SuperSkyRay could be extended to place the new Ray cluster across clusters while keeping the existing Ray cluster on the current K8s cluster, but the ROI of adding this complexity is unclear; we note that KubeRay is moving to zero-downtime incremental upgrades, which will reduce the resource requirements of updating RayService Ray clusters.
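For concreteness, the commands below sketch the trigger described above; the RayService name, namespaces, and image tag are illustrative placeholders and not part of our example setup.

# Changing any field under spec.rayClusterConfig (here, the head-group Ray image)
# causes KubeRay, by default, to bring up a second RayCluster with the new configuration.
kubectl patch rayservice rayservice-sample -n ray --type json \
  -p '[{"op":"replace","path":"/spec/rayClusterConfig/headGroupSpec/template/spec/containers/0/image","value":"rayproject/ray:2.9.0"}]'

# Alternatively, disable zero-downtime upgrades on the KubeRay operator, so that
# config changes do not trigger an upgrade operation at all.
kubectl set env deployment/kuberay-operator -n kuberay-system ENABLE_ZERO_DOWNTIME=false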
SuperSkyRay New Cluster Reschedule Operation

SuperSkyRay's group rescheduling is triggered as described in our previous blog "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases". In this case, however, since skip-capacity-relocate is unset, an alternative single-cluster placement is considered. When another placement is found, the manifests for the objects in the scheduling group are removed from the old cluster and added to the new cluster, and the workload is redeployed.

SuperSkyRay Example Use Case

Let's look at an example use case in which Nova has placed a group containing a RayService prediction service on an on-premise K8s cluster, as shown in Figure 1, using an AKS "on-prem" cluster for illustration. We then manually update the configuration of the service's Ray cluster, leading KubeRay to create a second copy of the Ray cluster with the updated configuration. This second copy does not fit on the on-premise K8s cluster, so the update is blocked. SuperSkyRay reschedules the group containing the RayService to the AKS "cloud" cluster, where the updated service is deployed, as shown in Figure 2. Note that we could optionally trigger a reschedule of the updated service back to the on-premise cluster. Appendix A contains the details for running this use case on AKS cloud K8s clusters.

Conclusion

In this blog, we explained how SuperSkyRay handles a Ray app that outgrows its original cluster after an upgrade or reconfiguration, rescheduling the app to another K8s cluster so that the update does not stall due to insufficient resources. While this Ray app relocation involves downtime, it is appropriate when resources are limited and time-to-update is critical. Have you experienced RayService RayCluster updates blocking indefinitely because there were insufficient resources to run a second copy of the RayCluster? Cilium Cluster Mesh is open source, and a free trial version of Nova is available here. Please give SuperSkyRay a try and let us know how it goes!

Appendix A: Example Details

Setup SuperSkyRay
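Below is a partial setup sketch, assuming two AKS clusters reachable via the kubectl contexts onprem and cloud (placeholder names), each with Cilium already installed using a unique cluster name and ID; installing the Nova/SuperSkyRay control plane itself follows the Nova documentation and is not shown here.

# Link the two clusters with Cilium Cluster Mesh (contexts are placeholders).
cilium clustermesh enable --context onprem
cilium clustermesh enable --context cloud
cilium clustermesh connect --context onprem --destination-context cloud

# Install the KubeRay operator on each workload cluster.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --kube-context onprem
helm install kuberay-operator kuberay/kuberay-operator --kube-context cloud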
Run Example Use Case
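The steps below are a condensed sketch of the flow, assuming the RayService is defined in a manifest named rayservice-prediction.yaml as a RayService named prediction in namespace ray, and that the Nova control plane is reachable via the kubectl context nova; all names and contexts are placeholders.

# Deploy the schedule group containing the RayService via the Nova control plane.
kubectl --context nova apply -f rayservice-prediction.yaml

# Confirm the initial placement on the on-prem cluster.
kubectl --context onprem -n ray get rayservice,raycluster,pods

# Change spec.rayClusterConfig (here, the worker count) to trigger the zero-downtime
# upgrade; the second RayCluster does not fit on-prem, so its pods stay Pending.
kubectl --context nova -n ray patch rayservice prediction --type json \
  -p '[{"op":"replace","path":"/spec/rayClusterConfig/workerGroupSpecs/0/replicas","value":4}]'

# After SuperSkyRay reschedules the group, the updated service comes up on the cloud cluster.
kubectl --context cloud -n ray get rayservice,raycluster,pods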
Cleanup
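A minimal cleanup sketch, using the same placeholder names and contexts as above:

# Remove the example workload via the Nova control plane.
kubectl --context nova delete -f rayservice-prediction.yaml

# Optionally remove the KubeRay operator from each workload cluster.
helm uninstall kuberay-operator --kube-context onprem
helm uninstall kuberay-operator --kube-context cloud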
Authors:
Anne Holler (Chief Scientist, Elotl)
Liz Rice (Chief Open Source Officer, Isovalent at Cisco)

Contributors:
Dan Wendlandt (Co-Founder, Isovalent at Cisco)
Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)