SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency

11/2/2025
Abstract

This blog presents SuperSkyRay, our name for supporting Ray app execution via KubeRay across Kubernetes (K8s) clusters joined by the Cilium Cluster Mesh multi-cluster datapath. SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation. SuperSkyRay addresses the resource and time inefficiency that occurs when the resources needed for Ray apps are fragmented across K8s clusters.

Introduction

Organizations using KubeRay to run the Ray ML platform on K8s often have multiple clusters, for reasons such as resource availability and cost, service continuity, geo-location, and quality of service. SkyRay reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven, resource-aware scheduling of Ray apps onto K8s clusters. However, SkyRay does not address the inefficiency that occurs when the desired scale of a Ray app exceeds the spare capacity of every single cluster in the fleet, even though the fleet as a whole has sufficient idle resources fragmented across clusters. In this case, the app either runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed. This inefficiency could be addressed if the Ray app could run across multiple K8s clusters.

This blog presents SuperSkyRay, which supports Ray app execution via KubeRay across K8s clusters running the Cilium Cluster Mesh multi-cluster datapath. SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation. We describe SuperSkyRay's components and placement operation, and then give an example use case that runs a RayService for prediction across on-premises and cloud clusters. The example achieves better utilization and time-to-results than is possible with single-cluster placement when the needed resources are fragmented.

SuperSkyRay Components

Ray, KubeRay

Ray is an open-source unified framework designed to simplify the development and scaling of distributed applications, particularly for AI workloads. Ray includes:

- Ray Core, which provides low-level primitives (tasks, actors, and objects) for building distributed applications
- Ray's AI libraries (e.g., Data, Train, Tune, Serve), which scale common ML workloads
- Ray clusters, consisting of a head node and worker nodes, on which Ray apps execute

KubeRay is the open-source K8s operator for Ray. It manages Ray clusters on K8s via custom resources (CRs), including RayCluster, RayJob, and RayService.
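As background, below is a minimal sketch of the kind of RayCluster CR that SkyRay and SuperSkyRay place; the name, image tag, and resource requests are illustrative assumptions rather than values from this blog's example.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: demo-raycluster        # illustrative name
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # assumed image tag
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: workers
      replicas: 8        # desired scale; may exceed any single cluster's spare capacity
      minReplicas: 8
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi

Because the worker group's CPU and memory requests are declared in the CR, a placement engine can compute the Ray app's aggregate resource needs, which is what enables the available-capacity placement described in the next section.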
Nova, SkyRay

Nova is a K8s workload fleet manager that schedules groups of K8s objects onto K8s workload clusters according to policies and available capacity. Cluster selection can use cluster names, labels, attributes, priorities, and available capacity, and placement can handle single or duplicated workload-group instances, with optional per-instance customization. A Nova workload group placed using an available-capacity policy is gang-scheduled, meaning no member is scheduled until the entire group fits. Nova interoperates with cluster autoscalers, including the K8s Cluster Autoscaler and Luna, and optionally supports just-in-time workload clusters, allowing K8s clusters to scale to zero or be removed when idle, and to be restored, recreated, or cloned when needed. The structure of Nova is shown in Figure 2.

In 2024, we introduced SkyRay, which extends KubeRay from single-K8s-cluster operation to multi-cluster, multi-cloud operation via interoperation with the Nova policy-driven, resource-aware fleet manager. Nova automatically selects each Ray app's target K8s cluster, on which KubeRay handles the app. To set up SkyRay, Nova is used with a spread/duplicate policy to deploy KubeRay and its CRDs onto all of its workload clusters, so that each cluster is KubeRay-enabled. Then, whenever a KubeRay CR is submitted to Nova for placement, Nova applies the policy relevant to that CR to select a workload cluster, on which KubeRay deploys and monitors the associated Ray pods. Note that Nova recognizes the Ray CRs and can determine their resource needs, so Nova can perform available-capacity placement of Ray objects. The structure of SkyRay is shown in Figure 3.

Cluster Mesh, SuperSkyRay

Cilium Cluster Mesh joins multiple K8s clusters into a unified network, regardless of each cluster's K8s distribution or location. Cluster Mesh can combine services running across K8s clusters, allowing a service's workers to be spread across clusters. To do this, Cluster Mesh requires that such services be marked with the annotation service.cilium.io/global: "true" (a minimal manifest sketch appears after this section).

SuperSkyRay augments SkyRay with Cilium Cluster Mesh to allow KubeRay-deployed Ray clusters to span multiple K8s clusters. This is handled via Nova with its multi-cluster-capacity option enabled, a configuration we call SuperNova. If no single cluster has sufficient free capacity to place a group under an available-capacity policy, SuperNova checks whether the group can be placed using resources from multiple clusters, and if so, it chooses that placement. SuperSkyRay includes specialized knowledge in SuperNova about how to handle cross-cluster placement of Ray clusters running with KubeRay and Cluster Mesh. The structure of SuperSkyRay is shown in Figure 4.

We designed SuperSkyRay's operation to be transparent to KubeRay, so that it interoperates with standard KubeRay installations and minimally-changed Ray app CRs. That said, SuperSkyRay imposes several KubeRay-related requirements.
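To make the Cluster Mesh piece concrete, here is a minimal sketch of a Service carrying the global annotation quoted above; the service name, selector, and port are illustrative assumptions, not taken from SuperSkyRay.

apiVersion: v1
kind: Service
metadata:
  name: ray-serve-svc          # illustrative name; must match across clusters
  annotations:
    service.cilium.io/global: "true"   # the Cluster Mesh annotation quoted above
spec:
  selector:
    app: ray-serve             # assumed pod label
  ports:
    - port: 8000
      targetPort: 8000

Cluster Mesh treats Services with the same name and namespace in connected clusters as a single global service, so pods backing the service in any member cluster receive load-balanced traffic.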
SuperSkyRay Cross-Cluster Placement Operation

SuperSkyRay cross-cluster placement operates as follows:
SuperSkyRay Example Use Case

An example SuperSkyRay use case involves running large-scale prediction across on-premises and cloud clusters for better utilization and time-to-results than single-cluster placement. This use case, called "AI Workload Cloud Bursting", was presented at Cisco Live 2025. In Appendix A, we describe how to run a simplified version of this use case using only AKS cloud K8s clusters, for ease of trial. The outcome of the simplified placement is depicted in Figure 5. A demo of the scenario is available here.

Conclusion

In this blog, we described the components and operation of SuperSkyRay. We presented an example use case it enables: running a RayService for prediction across a fleet comprising an on-premises and a cloud K8s cluster. The Ray app doesn't fit on either K8s cluster alone, but fits using the spare resources of both clusters. SuperSkyRay schedules it across the clusters, increasing utilization and reducing time-to-results relative to single-cluster placement.

In subsequent blogs, SuperSkyRay, Part 2 and SuperSkyRay, Part 3, we'll present SuperSkyRay's handling of dynamic Ray app use cases, including scaling an online on-premises prediction service to add a cloud-cluster worker without migration downtime, and bursting to another cluster to facilitate updating a running Ray service.

Do you have use cases where bursting your Ray workload across K8s clusters would save you money and/or time? Cilium Cluster Mesh is open source, and a free trial version of Nova is available here. Please give SuperSkyRay a try and let us know how it goes!

Appendix A: Example Details

Setup SuperSkyRay
Run Example Use Case
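As a rough illustration of the kind of CR this step submits, here is a minimal RayService sketch; the Serve application, import path, image, and replica counts are assumptions, not the actual example's values.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: prediction-service     # illustrative name
spec:
  serveConfigV2: |
    applications:
      - name: predictor
        import_path: predictor.app   # hypothetical Serve app module
        route_prefix: /predict
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # assumed image tag
    workerGroupSpecs:
      - groupName: predict-workers
        replicas: 12       # sized beyond any one cluster's spare capacity
        minReplicas: 12
        maxReplicas: 12
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  requests:
                    cpu: "4"
                    memory: 8Gi

Submitting a CR of this shape to Nova (e.g., via kubectl apply against the Nova control plane) lets Nova apply its placement policy; KubeRay then deploys and monitors the associated Ray pods, as described earlier.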
Cleanup
Authors:
Anne Holler (Chief Scientist, Elotl)
Liz Rice (Chief Open Source Officer, Isovalent at Cisco)

Contributors:
Dan Wendlandt (Co-Founder, Isovalent at Cisco)
Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)