
SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency

11/2/2025


Abstract

This blog presents SuperSkyRay, our name for running Ray apps via KubeRay across Kubernetes (K8s) clusters joined by the Cilium Cluster Mesh multi-cluster datapath.  SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.  SuperSkyRay addresses the resource and time inefficiency that occurs when the resources needed for Ray apps are fragmented across K8s clusters.

Introduction

Organizations using KubeRay to run the Ray ML platform on K8s often have multiple clusters for reasons such as resource availability and cost, service continuity, geo-location, and quality of service.  SkyRay reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven resource-aware scheduling of Ray apps onto K8s clusters.  However, SkyRay does not address the inefficiency that occurs if the desired scale of a Ray app exceeds the spare capacity of any single cluster in the fleet, while at the same time the fleet has sufficient idle resources fragmented across clusters. In this case, the app runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed.  This inefficiency could be addressed if the Ray app could be run across multiple K8s clusters.
This blog presents SuperSkyRay, which supports Ray app execution via KubeRay across K8s clusters running the Cilium Cluster Mesh multi-cluster datapath.  SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.  We describe SuperSkyRay’s components and placement operation and then give an example use case running a RayService for prediction across on-premise and cloud clusters.  The example achieves better utilization and time-to-results than is possible with single-cluster placement when the needed resources are fragmented across clusters.

SuperSkyRay Components

Ray, KubeRay

Ray is an open-source unified framework designed to simplify the development and scaling of distributed applications, particularly for AI workloads.  Ray includes:
  • Ray core: supplies primitives to simplify building and scaling distributed applications.
  • Ray AI libraries: support running a variety of distributed ML tasks.
  • Ray clusters: provide Ray workers connected to a Ray head for running Ray apps.

KubeRay handles the creation, deletion, and scaling of Ray clusters, jobs, and services on a K8s cluster. The structure of KubeRay is shown in Figure 1. KubeRay supports three K8s Custom Resource Definitions:

  • RayCluster
    • For creating a Ray cluster with the specified resources and attributes.
  • RayJob
    • For creating a Ray cluster and submitting a job to it when the cluster is ready.
    • Can optionally delete the Ray cluster once the job finishes.
    • Often used for ML/AI training or batch prediction.
  • RayService
    • For creating a Ray cluster and running a Ray Serve deployment graph.
    • Offers zero-downtime upgrades, high availability, and Ray Serve autoscaling.
    • Often used for ML/AI online serving.
KubeRay deployments can optionally also include the Ray Autoscaler, which automatically adds and removes worker nodes from a Ray cluster based on resource requests.
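For concreteness, a minimal RayJob manifest might look like the following sketch; the name, image tag, script path, and resource sizes are illustrative placeholders rather than values from this blog.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob                  # illustrative name
spec:
  entrypoint: python /home/ray/samples/sample_job.py   # placeholder script path
  shutdownAfterJobFinishes: true       # delete the Ray cluster when the job completes
  rayClusterSpec:
    rayVersion: "2.9.0"                # illustrative Ray version
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
    workerGroupSpecs:
      - groupName: workers
        replicas: 4                    # fixed-size worker group for this sketch
        minReplicas: 4
        maxReplicas: 4
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  requests:
                    cpu: "4"
                    memory: 8Gi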
Figure 1: KubeRay Structure

Nova, SkyRay

Nova is a K8s workload fleet manager that schedules groups of K8s objects onto K8s workload clusters, according to policies and available capacity.  Cluster selection can utilize cluster names, labels, attributes, priorities, and available capacity, and placement can handle single or duplicate workload group instances with optional customization per instance.  A Nova workload group placed using an available-capacity policy is gang-scheduled, meaning no member is scheduled until the entire group can fit.  We note that Nova interoperates with cluster autoscalers, including the K8s Cluster Autoscaler and Luna, and optionally supports just-in-time workload clusters, allowing K8s clusters to scale to 0 or be removed when idle and restored/recreated or cloned when needed.  The structure of Nova is shown in Figure 2.
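To illustrate the idea, a capacity-aware Nova placement policy might be written along the lines of the sketch below; note that the apiVersion, kind, and field names here are assumptions for illustration and may not match Nova's actual CRD schema, which is documented with the product.

# Hypothetical sketch of a Nova placement policy; the apiVersion, kind,
# and field names are assumptions for illustration only.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: capacity-based-placement
spec:
  resourceSelectors:                   # which K8s objects this policy applies to (assumed field)
    labelSelectors:
      - matchLabels:
          app: ray-workloads
  clusterSelector:                     # candidate workload clusters (assumed field)
    matchLabels:
      env: prod
  groupScheduling: true                # gang-schedule the whole workload group (assumed field)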
Figure 2: Nova Structure
In 2024, we introduced SkyRay to extend KubeRay from a single K8s cluster to multi-cluster multi-cloud operation via interoperation with the Nova policy-driven resource-aware fleet manager.  Nova automatically selects each Ray app’s target K8s cluster, on which KubeRay handles the app.  To set up SkyRay, Nova is used with a spread/duplicate policy to deploy KubeRay and its CRDs onto all of its workload clusters, so each cluster is KubeRay-enabled.  Then, whenever a KubeRay CR is submitted to Nova for placement, Nova applies the policy relevant to that CR to select a workload cluster, on which KubeRay deploys and monitors the associated Ray pods.  We note that Nova recognizes the Ray CRs and can determine their resource needs, so Nova can do available-capacity placement of Ray objects.  The structure of SkyRay is shown in Figure 3.
Figure 3: SkyRay Structure

Cluster Mesh, SuperSkyRay

Cilium Cluster Mesh joins multiple K8s clusters into a unified network, regardless of the K8s distribution or location of each cluster.  Cluster Mesh can combine services running across K8s clusters, allowing service workers to be spread across clusters.  To do this, Cluster Mesh requires that such services be marked with the annotation service.cilium.io/global: "true".
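For example, a Service to be merged across the mesh carries the annotation as in the sketch below; the service name, selector, and port are illustrative (6379 is the default Ray GCS port on the head).

apiVersion: v1
kind: Service
metadata:
  name: raycluster-sample-head-svc     # illustrative name
  annotations:
    service.cilium.io/global: "true"   # merge this service across Cluster Mesh clusters
spec:
  type: ClusterIP
  selector:
    ray.io/node-type: head             # selector shown for illustration
  ports:
    - name: gcs
      port: 6379                       # Ray GCS port on the head
      targetPort: 6379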

SuperSkyRay augments SkyRay with Cilium Cluster Mesh to allow KubeRay-deployed Ray clusters to span multiple K8s clusters.  This is handled via Nova with its multi-cluster-capacity option enabled, a configuration we call SuperNova.  If no single cluster has sufficient free capacity for placement of a group using an available-capacity policy, SuperNova checks if it is possible to place the group using resources from multiple clusters, and if so, it chooses that placement.  SuperSkyRay includes specialized knowledge in SuperNova about how to handle cross-cluster placement of Ray clusters running with KubeRay and Cluster Mesh.  The structure of SuperSkyRay is shown in Figure 4.
Figure 4: SuperSkyRay Structure

We designed SuperSkyRay’s operation to be transparent to KubeRay, to interoperate with standard KubeRay installations and minimally-changed Ray app CRs.  That said, SuperSkyRay imposes several KubeRay-related requirements:
  • To allow Cluster Mesh to join the Ray cluster head service across multiple K8s clusters, the Cluster Mesh global service annotation must be included in the Ray head service manifest as shown here.  Also, the Ray cluster head service must have an IP; for recent KubeRay releases, either the ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE option must be set or the service must be configured to use (say) the NodePort type.
  • To allow SuperSkyRay to do the Ray worker updates needed for cross-cluster operation, the Ray autoscaler must be enabled (example here), even for fixed-size Ray clusters.  Enabling the Ray autoscaler instructs KubeRay that Ray cluster worker nodes are externally managed, so KubeRay refrains from performing Ray worker node scaling operations itself.  A sketch of a RayCluster spec that meets both requirements follows this list.
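As a minimal sketch of these requirements, the RayCluster fragment below adds the Cluster Mesh global annotation to the head service and enables the in-tree autoscaler; the cluster name, image, and resource sizes are illustrative placeholders, and we assume the headServiceAnnotations, serviceType, and enableInTreeAutoscaling fields available in recent KubeRay releases.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-crosscluster        # illustrative name
spec:
  enableInTreeAutoscaling: true        # required by SuperSkyRay, even for fixed-size Ray clusters
  headServiceAnnotations:
    service.cilium.io/global: "true"   # let Cluster Mesh merge the head service across clusters
  headGroupSpec:
    serviceType: NodePort              # one way to give the head service an IP; setting the
                                       # ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE option is the alternative
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
  # workerGroupSpecs elided; see the RayJob sketch above for a worker group example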

SuperSkyRay Cross-Cluster Placement Operation

SuperSkyRay cross-cluster placement operates as follows:
  • When SuperNova chooses cross-cluster placement of a Ray app, the Nova control plane places the Ray object manifest into the Nova scheduling configmap for the K8s cluster on which the Ray cluster head is slated to run.
  • The Nova agent schedule controller on that workload cluster, which monitors the Nova scheduling configmap, then deploys the Ray object manifest onto its workload cluster.
  • The KubeRay instance on that cluster materializes the K8s deployments, services, and jobs associated with that Ray object.
  • The Nova agent status controller running on that cluster detects Ray cluster worker pods that are pending in the cluster but are intended to be scheduled on another cluster.  It replaces those pods with placeholder pods to satisfy KubeRay’s Ray cluster goal state; without placeholder pods, KubeRay will not transition the Ray cluster to the ready state.  It then places the manifests for those worker pods into the Nova scheduling configmap of the K8s cluster on which they were intended to run.
  • The Nova agent status controller also detects head services for Ray clusters with cross-cluster placement and duplicates the manifest of those services into the Nova scheduling configmap of the other clusters that will host Ray workers, as required by Cilium Cluster Mesh to combine cross-cluster services.
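To make the placeholder step concrete, a placeholder pod can be pictured as in the sketch below: a lightweight pod that carries the displaced Ray worker pod's identifying name and labels so that KubeRay sees its goal state satisfied.  This is a conceptual illustration only; Nova's actual placeholder manifests may differ, and the pause image and label values are assumptions.

# Conceptual illustration only; Nova's actual placeholder pods may differ.
apiVersion: v1
kind: Pod
metadata:
  name: raycluster-crosscluster-worker-abcde   # name of the displaced Ray worker pod (illustrative)
  labels:
    ray.io/cluster: raycluster-crosscluster    # labels KubeRay uses to track Ray cluster pods (assumed values)
    ray.io/node-type: worker
spec:
  containers:
    - name: placeholder
      image: registry.k8s.io/pause:3.9          # minimal container that consumes almost no resources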

SuperSkyRay Example Use Case

An example SuperSkyRay use case involves running large-scale prediction across on-premise and cloud clusters for better utilization and time-to-results than single-cluster placement.  This use case, called “AI Workload Cloud Bursting”, was presented at Cisco Live 2025.

In Appendix A, we describe how to run a simplified version of this use case using only AKS cloud K8s clusters, for ease of trial.  The outcome of the simplified placement is depicted in Figure 5.  A demo of the scenario is available here.
Figure 5: SuperSkyRay cross-cluster RayService placement

Conclusion

In this blog, we described the components and operation of SuperSkyRay.  We presented an example use case it enables: running a RayService for prediction across a fleet comprising an on-premise K8s cluster and a cloud K8s cluster.  The Ray app doesn’t fit on either K8s cluster alone, but can fit using the spare resources on both clusters.  SuperSkyRay schedules it across the clusters, increasing utilization and reducing time-to-results relative to single-cluster placement.

In subsequent blogs, SuperSkyRay, Part 2 and SuperSkyRay, Part 3, we’ll present SuperSkyRay’s handling of dynamic Ray app use cases, including scaling an online on-premise prediction service to add a cloud-cluster worker without migration downtime, and bursting to another cluster to facilitate updating a running Ray service.

Do you have use cases where bursting your Ray workload across K8s clusters would save you money and/or time?  Cilium Cluster Mesh is open-source and a free trial version of Nova is available here.  Please give SuperSkyRay a try and let us know how it goes!

Appendix A: Example Details

Setup SuperSkyRay
  • Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined with Cilium Cluster Mesh 1.17.4 or later, as described in the cheat-sheet here
    • Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.
  • Install Nova 1.3.11 (or later) on the clusters with --multi-cluster-capacity enabled, as described in the cheat-sheet here
  • Deploy KubeRay in the SkyRay configuration, as described in the cheat-sheet here

Run Example Use Case
  • Place a RayService that won't fit on one workload cluster, but does fit on 2, as described in the cheat-sheet here; an illustrative sketch of such a RayService follows this list
    • SuperSkyRay will spread the RayService across the 2 workload clusters
  • Interact with the RayService, as described in the cheat-sheet here
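For orientation, the RayService used in this step has roughly the shape sketched below; the concrete manifest is in the cheat-sheet, and the application name, import path, image, and sizes here are placeholders chosen so the worker group exceeds any single workload cluster's spare capacity.

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: prediction-service             # illustrative name
spec:
  serveConfigV2: |
    applications:
      - name: predictor                # placeholder Serve application
        import_path: predictor.app     # placeholder import path
        route_prefix: /predict
  rayClusterConfig:
    enableInTreeAutoscaling: true      # required by SuperSkyRay
    headServiceAnnotations:
      service.cilium.io/global: "true" # let Cluster Mesh merge the head service across clusters
    headGroupSpec:
      serviceType: NodePort
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
    workerGroupSpecs:
      - groupName: predictors
        replicas: 8                    # sized to exceed any single cluster's spare capacity
        minReplicas: 8
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  requests:
                    cpu: "4"
                    memory: 8Gi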

Cleanup
  • Please see the cheat-sheet here


Authors:
Anne Holler (Chief Scientist, Elotl)
Liz Rice (Chief Open Source Officer, Isovalent at Cisco)

Contributors:
Dan Wendlandt (Co-Founder, Isovalent at Cisco)
Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)
