SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases

11/2/2025

 

Abstract

In our previous blog, SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh.

In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single Kubernetes (K8s) cluster placement.  SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app’s resources without requiring any app relocation downtime.

Introduction

When SuperNova (Nova with the multi-cluster-capacity option set) performs capacity-based scheduling of a K8s object group, it prefers to place the group on a single cluster if possible, since that choice is simpler in terms of management and networking than cross-cluster placement. If a group placed on a single cluster contains an app whose worker count is later scaled up, the result may no longer fit on that cluster, e.g., because the cluster has reached its fixed size limit, as is the case for on-premise or cloud reserved-instance clusters. When a group no longer fits on its cluster, SuperNova seeks to reschedule the group.
Focusing on the case where the Ray app worker count is scaled up: by default, SuperSkyRay (SuperNova managing SkyRay) looks for another single cluster for the group, although relocating the group involves downtime. However, if Nova is run with skip-capacity-relocate, which specifies that a capacity-based group should not be relocated from its current cluster solely to get more resources, or if no other single cluster can run the group, SuperSkyRay considers dynamically expanding the single-cluster placement to a multi-cluster placement, leveraging its specialized knowledge of extending the Ray app’s Ray cluster to span multiple K8s clusters. By expanding the running app to a multi-cluster placement, the downtime that would be needed to relocate the app is avoided. During any subsequent Ray app scale-down, remote Ray workers, i.e., those placed on a K8s cluster not containing the Ray head, are preferentially removed.
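
The exact installation steps for these options are covered by the cheat sheets referenced in the appendices. Purely as a hedged illustration of what enabling them could look like, assuming the Nova scheduler runs as a standard K8s Deployment whose name, namespace, and container layout below are hypothetical, the two flags might be passed as container arguments:

# Hypothetical excerpt only: names and structure are illustrative;
# consult the Nova cheat sheets for the actual installation method.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nova-scheduler              # hypothetical name
  namespace: elotl                  # hypothetical namespace
spec:
  template:
    spec:
      containers:
        - name: nova-scheduler
          args:
            - --multi-cluster-capacity    # allow capacity-based groups to span K8s clusters
            - --skip-capacity-relocate    # don't relocate a group solely to get more resources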

We present an example use case in which a Ray online prediction service running on an on-premise K8s cluster is, due to increased query volume, scaled up so that it no longer fits on that K8s cluster. SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no service downtime. We also present a second, similar use case in which the Ray Serve autoscaler increases the number of Ray workers after the initial on-prem placement of the Ray cluster, again requiring the Ray cluster to span multiple K8s clusters.

SuperSkyRay Cross-Cluster Reschedule Operation

This section assumes that the SuperSkyRay components are set up as described in our blog "SuperSkyRay: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency". 

For SuperSkyRay cross-cluster reschedule, SuperNova is run with skip-capacity-relocate, which specifies that Nova should not relocate a capacity-based group from its current cluster solely to get more resources. When a workload cluster Nova agent status controller detects that a group no longer fits, it marks the group for rescheduling by the Nova control plane. When the SuperSkyRay Nova control plane evaluates rescheduling a group in this case, it considers dynamically updating the single-cluster placement to a multi-cluster placement. When the Nova control plane updates the Ray object schedule to a multi-cluster placement, it modifies the scheduling data for the Ray app manifest in the workload cluster Nova scheduling configmap.
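
The scheduling configmap’s schema is internal to Nova, so the snippet below is only a conceptual sketch (not Nova’s actual format) of the kind of per-manifest placement data that gets updated from a single-cluster to a multi-cluster assignment; all names are hypothetical:

# Conceptual sketch only -- not Nova's actual configmap schema.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nova-scheduling              # hypothetical name
  namespace: elotl                   # hypothetical namespace
data:
  rayservice-text-summarizer: |
    placement:
      mode: multi-cluster            # previously: single-cluster
      clusters:
        - on-prem                    # hosts the Ray head and local workers
        - cloud                      # hosts the remote Ray worker(s)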

The Nova agent schedule controller applies the modification to the running Ray app in the workload cluster, and the Nova agent status controller detects the change. It then performs operations similar to those for initial cross-cluster Ray worker placement: it replaces each pending pod that should run on a different cluster with a placeholder pod and puts the pod manifest into the appropriate workload cluster’s Nova scheduling configmap. It also duplicates the Ray head service onto all clusters slated to run Ray workers, so that the Ray cluster’s head service can be reached by workers on other K8s clusters via Cilium Cluster Mesh.
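
Nova performs this service duplication automatically. For readers unfamiliar with the underlying Cilium mechanism, Cluster Mesh treats Services that share a name and namespace across meshed clusters and carry the global-service annotation as one service backed by endpoints in all clusters. Below is a sketch of what a duplicated Ray head service could look like; the service name, namespace, and selector value are illustrative, the ports are Ray defaults:

# Illustrative sketch of a Ray head Service annotated as a Cilium global service.
# With the same name/namespace present in each meshed cluster, remote Ray workers
# can reach the head's GCS endpoint across K8s clusters.
apiVersion: v1
kind: Service
metadata:
  name: text-summarizer-head-svc      # illustrative name
  namespace: default
  annotations:
    service.cilium.io/global: "true"  # Cilium Cluster Mesh global-service annotation
spec:
  selector:
    ray.io/node-type: head            # KubeRay labels the Ray head pod this way
  ports:
    - name: gcs
      port: 6379                      # Ray GCS; workers connect here to join the Ray cluster
    - name: client
      port: 10001                     # Ray client server
    - name: dashboard
      port: 8265                      # Ray dashboard / Serve API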

SuperSkyRay Manually-Scaling Example Use Case

Let’s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 1, with AKS clusters standing in for “on-prem” and “cloud” clusters.  The service is later manually scaled to add a worker, which does not fit on the “on-prem” cluster.  SuperSkyRay with skip-capacity-relocate reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 2. 
Figure 1: SuperSkyRay initially scheduled RayService to run on on-premise cluster
Figure 2: SuperSkyRay revised schedule for RayService to run across on-premise and cloud clusters
Appendix A contains the details for running this use case on AKS cloud K8s clusters.

SuperSkyRay Auto-Scaling Example Use Case

Let’s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 3, with AKS clusters standing in for “on-prem” and “cloud” clusters. The Ray cluster is configured with 0 workers initially. The Ray Serve autoscaler subsequently scales the Ray cluster to 2 GPU workers, only one of which will fit on the on-premise cluster. SuperSkyRay reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 4.
Figure 3: SuperSkyRay initially scheduled RayService to run on on-premise cluster
Figure 4: SuperSkyRay revised schedule for RayService to run across on-premise and cloud clusters
Appendix B contains the details for running this use case on AKS cloud K8s clusters.

Conclusion

In this blog, we’ve discussed how SuperSkyRay can non-disruptively handle KubeRay Ray apps that outgrow their single K8s cluster placement.  SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app’s resources without app relocation downtime.  We’ve presented two example use cases in which a Ray online prediction service running on an on-premise K8s cluster is scaled to add a worker that would not fit on its workload cluster.  SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no application downtime.

In a subsequent blog, SuperSkyRay, Part 3, we’ll present SuperSkyRay’s handling of RayService cluster upgrades and reconfigurations by rescheduling Ray AI apps to another cluster.

Do you have use cases where bursting your Ray workload dynamically across K8s clusters would save you money and/or time?  Cilium Cluster Mesh is open-source and a free trial version of Nova is available here.  Please give SuperSkyRay a try and let us know how it goes!

Appendix A: Manual-Scaling Example Details

Setup SuperSkyRay
  • Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined with Cilium Cluster Mesh 1.17.4 or later, as described here
    • Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.
  • Install Nova 1.3.11 (or later) on the clusters and enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in the cheat-sheet here
  • Deploy KubeRay in the SkyRay Configuration, as described in the cheat-sheet here


Run Example Use Case
  • Place a RayService that fits on one workload cluster, as described here
    • SuperSkyRay places the RayService on one workload cluster
  • Interact with the RayService, as described in the cheat-sheet here
  • Manually increase the RayService to request an additional replica that won’t fit: increase spec.serveConfigV2.applications.text_summarizer.deployments.num_replicas to 3 and spec.rayClusterConfig.workerGroupSpecs.replicas to 3 (see the illustrative manifest excerpt after this list)
    • SuperSkyRay spreads the existing RayService across the 2 workload clusters
  • Interact with the RayService, as described in the cheat-sheet here
  • Manually decrease the RayService to restore the original replica count
    • SuperSkyRay scales the existing RayService back down to 1 workload cluster
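
As a reference for the manual scale-up step above, here is an illustrative excerpt of the edited RayService. Only the two replica fields and the text_summarizer application name come from the walkthrough; the resource name, deployment name, and worker group name are placeholders, and the apiVersion may differ by KubeRay version:

# Illustrative excerpt only -- placeholder names; trimmed fields marked with "...".
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: text-summarizer              # placeholder name
spec:
  serveConfigV2: |
    applications:
      - name: text_summarizer
        deployments:
          - name: Summarizer         # placeholder deployment name
            num_replicas: 3          # increased to 3 per the step above
  rayClusterConfig:
    workerGroupSpecs:
      - groupName: gpu-group         # placeholder group name
        replicas: 3                  # increased to 3 per the step above
        # ... remainder of the worker group spec unchanged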

Cleanup
  • Please see the cheat-sheet here

Appendix B: Auto-Scaling Example Details

Setup SuperSkyRay
  • Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined with Cilium Cluster Mesh 1.17.4 or later, as described here
    • Include 1 Standard_NV36ads_A10_v5 A10 GPU node in each cluster
    • Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.
  • Install Nova 1.3.11 (or later) on the clusters and enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in the cheat-sheet here.
  • Deploy KubeRay in the SkyRay Configuration, as described in the cheat-sheet here

Run Example Use Case
  • Place a RayService that initially fits on one workload cluster and is then scaled by Ray Serve so that it spans 2 clusters, as described here (see the illustrative Serve config excerpt after this list).
    • First SuperSkyRay places the RayService on one workload cluster
    • Then SuperSkyRay spreads the existing RayService across the 2 workload clusters
  • Interact with the RayService, as described in the cheat-sheet here
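
As a reference for the autoscaling behavior above, here is an illustrative excerpt of the RayService spec. The min_replicas/max_replicas values mirror the “starts at 0 workers, scales to 2 GPU workers” scenario from the use case; the application and deployment names and the request target are placeholders, while the autoscaling_config fields and enableInTreeAutoscaling are standard Ray Serve / KubeRay options:

# Illustrative excerpt only -- placeholder names; trimmed fields marked with "...".
spec:
  serveConfigV2: |
    applications:
      - name: text_summarizer
        deployments:
          - name: Summarizer                # placeholder deployment name
            ray_actor_options:
              num_gpus: 1                   # each Serve replica needs one GPU worker
            autoscaling_config:
              min_replicas: 0               # start with no Serve replicas
              max_replicas: 2               # scaling to 2 forces a cross-cluster span
              target_ongoing_requests: 2    # placeholder load target
  rayClusterConfig:
    enableInTreeAutoscaling: true           # let the Ray autoscaler add worker pods
    workerGroupSpecs:
      - groupName: gpu-group                # placeholder group name
        replicas: 0                         # Ray cluster starts with 0 workers
        minReplicas: 0
        maxReplicas: 2                      # Ray autoscaler adds GPU workers as Serve scales up
        # ... remainder of the worker group spec unchanged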

Cleanup
  • Please see the cheat-sheet here


Authors:
Anne Holler (Chief Scientist, Elotl)
Liz Rice (Chief Open Source Officer, Isovalent at Cisco)

Contributors:
Dan Wendlandt (Co-Founder, Isovalent at Cisco)
Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)


© 2025 Elotl, Inc.