Abstract

In our blogs “SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency” and "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases", we discussed SuperSkyRay’s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh, as well as SuperSkyRay’s non-disruptive handling of Ray apps that outgrow single-cluster placement by extending them to multi-cluster placement. In this blog, we consider SuperSkyRay’s handling of KubeRay RayServices that outgrow the single Kubernetes (K8s) cluster hosting them during a zero-downtime Ray cluster upgrade or reconfiguration. To support zero downtime (the default), the RayService keeps the current Ray cluster running while it brings up an additional Ray cluster with the new configuration; the upgrade or reconfiguration is incomplete until the new version of the Ray cluster is available. SuperSkyRay can reschedule a RayService deployed on a single cluster onto a different cluster, avoiding an update that would otherwise stall indefinitely when there are insufficient resources for a second RayCluster. While this relocation involves downtime, it is appropriate when time-to-update is critical and resources are limited.

Introduction

When any field in spec.rayClusterConfig of a running RayService is changed, KubeRay by default performs a zero-downtime upgrade of the Ray cluster as follows. It keeps the current copy of the Ray cluster running to continue processing service requests while it deploys an additional version of the Ray cluster with the updates. Once the new version is fully ready, it switches the service to the updated Ray cluster and removes the old one. While this avoids service downtime, it requires that the K8s cluster hosting the RayService have sufficient resources to run two copies of the Ray cluster. When that is not possible, the service update remains incomplete for an indefinite period of time, which is undesirable. (RayService zero-downtime upgrade can be disabled by setting ENABLE_ZERO_DOWNTIME to false, in which case cluster config changes do not trigger any upgrade operation, which can also be undesirable.)
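To make the two knobs above concrete, here is a minimal sketch (names and values are illustrative, not taken from the blog; fields unrelated to the trigger are omitted) of a rayClusterConfig edit that starts the zero-downtime upgrade flow, and of the operator-level environment variable that disables it:

```yaml
# A sketch of the upgrade trigger: any edit under spec.rayClusterConfig of a
# running RayService, such as the replicas bump below, causes KubeRay by
# default to build a second RayCluster with the new config and switch traffic
# to it only once it is ready.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice        # illustrative name
spec:
  rayClusterConfig:
    workerGroupSpecs:
      - groupName: workers
        replicas: 8               # editing any rayClusterConfig field triggers the flow
---
# The zero-downtime behavior is controlled on the KubeRay operator Deployment:
# setting ENABLE_ZERO_DOWNTIME to "false" suppresses the upgrade flow entirely.
# (Fragment of the operator container's env, not a complete manifest.)
env:
  - name: ENABLE_ZERO_DOWNTIME
    value: "false"
```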
SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases
11/2/2025
Abstract

In our previous blog, SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh. In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single Kubernetes (K8s) cluster placement. SuperSkyRay can dynamically change a Ray app’s placement from single-cluster to cross-cluster, increasing the app’s resources without requiring any relocation downtime.

Introduction

When SuperNova (Nova with multi-cluster-capacity set) performs capacity-based scheduling of a K8s object group, it prefers to place the group on a single cluster if possible, since single-cluster placement is simpler in terms of management and networking than cross-cluster placement. If a group placed on a single cluster contains an app whose worker count is later scaled up, the result may no longer fit on that cluster, e.g., because the cluster has reached its fixed size limit, as is the case for on-premises or cloud reserved-instance clusters. When a group no longer fits on its cluster, SuperNova seeks to reschedule the group.
SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency
11/2/2025
Abstract

This blog presents SuperSkyRay, our name for supporting Ray app execution via KubeRay across Kubernetes (K8s) clusters running the Cilium Cluster Mesh multi-cluster datapath. SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation. SuperSkyRay addresses the resource and time inefficiency that occurs when the resources needed for Ray apps are fragmented across K8s clusters.

Introduction

Organizations using KubeRay to run the Ray ML platform on K8s often have multiple clusters for reasons such as resource availability and cost, service continuity, geo-location, and quality of service. SkyRay reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven, resource-aware scheduling of Ray apps onto K8s clusters. However, SkyRay does not address the inefficiency that occurs when the desired scale of a Ray app exceeds the spare capacity of every single cluster in the fleet, while the fleet as a whole has sufficient idle resources fragmented across clusters. In this case, the app runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed. This inefficiency could be addressed if the Ray app could run across multiple K8s clusters.
When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The KAI Scheduler introduces a structured approach to resource allocation by organizing jobs into queues and operating under the assumption of a fixed pool of GPU resources within the cluster. For those not familiar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI’s own abstraction, not to be confused with a Kubernetes Job resource (i.e., the batch/v1 kind used for running finite, batch-style workloads). Each queue can be assigned limits and quotas, allowing administrators to control how resources are distributed across teams, projects, or workloads. This model ensures fair usage and predictability, but it also means that when demand exceeds supply, jobs sit idle waiting for resources to become available, and when supply exceeds demand, unnecessary costs are incurred.
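As a rough illustration of the queue model, a sketch based on the KAI Scheduler's Queue custom resource follows; the field names should be verified against the KAI documentation for the version you deploy, and the quota/limit numbers are invented for the example:

```yaml
# A hedged sketch of a KAI Scheduler queue capping a team's GPU consumption.
# apiVersion/kind follow the KAI Scheduler's Queue CRD; numbers are illustrative.
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  resources:
    gpu:
      quota: 8            # deserved share; jobs beyond this wait or borrow
      limit: 16           # hard cap even when the cluster has idle GPUs
      overQuotaWeight: 1  # relative priority when borrowing unused capacity
```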
This is where the real strength of the KAI Scheduler can shine: pairing it with Luna, an intelligent autoscaler. By combining the KAI Scheduler with an intelligent autoscaler like Luna, the system becomes highly elastic, able to dynamically add GPU nodes only when truly needed and scale them back down to optimize efficiency. Instead of relying on a static pool of GPUs, the cluster can grow to meet active demand, but only up to what is necessary and permitted by the configured queue limits and quotas. It’s worth noting that Luna doesn't indiscriminately add nodes; it works intelligently alongside KAI, ensuring that scaling decisions respect organizational boundaries and cost controls. Beyond scaling decisions, Luna offers settings to guide GPU instance selection, adding another layer of precision.

Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs
Offline inference is well-suited to take advantage of spot GPU capacity in public clouds. However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly. The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU SKU and count as well as maximum estimated hourly cost. In this blog, we share recent experiences with offline inference on GKE, AKS, and EKS clusters using Luna. Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, reducing estimated hourly costs by 38-50% versus an on-demand baseline and turning an often tedious task into bargain-hunting fun.
Introduction
Applications such as query/response chatbots are handled via online serving, in which each input prompt is provided in real time to the model running on one or more GPU workers. Automatic instance allocation for online serving presents efficiency challenges. Real-time response is sensitive to scaling latency during usage spikes and can be impacted by spot reclamation and replacement. Also, peak online serving usage often overlaps with peak cloud resource usage, affecting the available capacity for GPU instances. We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., scaling Helix to handle ML load and reducing deploy time for new ML workers.
Overview
26 minutes! 26 long minutes was our wait time in one example case for our chatbot to be operational. Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness! Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model. But we hoped we could make some simple changes to speed it up, and we did. In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes at deployment times that won't bum you out.
We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler. Our example LLM serving workload is a KubeRay RayService using vLLM to serve an open-source model downloaded from HuggingFace. We measured deploy-time improvements of up to 60%.

Introduction
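For context, the example workload mentioned in the abstract above, a KubeRay RayService fronting vLLM, has roughly the following shape. This is a hedged sketch: the serve_app:app import path is a hypothetical module exposing a vLLM-backed Serve deployment, and the image tags and model ID are placeholders rather than the blog's actual configuration.

```yaml
# A sketch of a RayService serving a HuggingFace model with vLLM via Ray Serve.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-llm-service                       # illustrative
spec:
  serveConfigV2: |
    applications:
      - name: llm
        route_prefix: /
        import_path: serve_app:app             # hypothetical Serve app module
        runtime_env:
          env_vars:
            MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.2"  # example HF model
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0      # illustrative image tag
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0    # illustrative image tag
                resources:
                  limits:
                    nvidia.com/gpu: 1          # image pull + model download dominate startup
```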
Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and each candidate cluster's available capacity. Nova is autoscaler-aware, detecting whether workload clusters are running the K8s cluster autoscaler or the Elotl Luna intelligent cluster autoscaler.
In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up, you can:
Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler
6/25/2024
Introduction
Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default. With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use. For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.
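For reference, time-slicing is typically enabled through the NVIDIA k8s-device-plugin's sharing configuration. The fragment below is a minimal sketch: the replica count is illustrative, and how the config is wired in (ConfigMap name, GPU Operator vs. standalone device plugin) depends on your deployment.

```yaml
# A sketch of an NVIDIA device plugin time-slicing config: each physical GPU
# is advertised as 4 schedulable nvidia.com/gpu replicas, so up to 4 pods can
# interleave on one GPU (with no memory or fault isolation between them).
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```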
Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration can significantly reduce costs. A time-slice-aware “smart” CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. Also, on GKE, where GPU time-slicing is expected to be configured at the control-plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.

How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?

Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resource utilization and increased costs. Additionally, manual allocation necessitates constant monitoring of GPU availability, requiring administrators to track GPU availability in clusters spread across different zones or regions. Once the GPU requirements are determined for a given workload, the administrator needs to manually add nodes when demand surges and remove them during periods of inactivity.

There are many GPU types, each with different capabilities, running on different node types, and different workloads may require specific GPU models. The combination of these factors makes manual GPU node management increasingly convoluted. Manually ensuring the correct GPU nodes for diverse workloads becomes a cumbersome task, especially when dealing with multiple applications with varying GPU preferences. This adds another layer of operational overhead, demanding detailed knowledge of GPU types and availability, and continuous adjustments to meet workload demands.

Luna, an intelligent node autoscaler, addresses these pain points by automating GPU node allocation based on workload demands. Luna is aware of GPU availability and, as such, can dynamically choose and allocate the needed GPU nodes, eliminating the need for manual intervention. This optimizes resource utilization and reduces wasted spend by scaling GPU resources in line with the workload. Moreover, Luna can allocate specific nodes as defined by the workload requirements, ensuring precise resource allocation tailored to the application's needs. This makes Luna well suited to the most complex compute jobs, like AI and ML workloads.

Furthermore, Luna's core functionality includes the automatic allocation of alternative GPU nodes in cases where preferred GPUs are unavailable, bolstering its flexibility and resilience. This ensures that workloads with specific GPU preferences can seamlessly transition to available alternatives, maintaining uninterrupted operation. Controlled through annotations within the workload, users can specify cloud instance types to use or avoid, either by instance family or via regular expressions, along with desired GPU SKUs. This capability enables dynamic allocation based on GPU availability and workload demands, simplifying cluster management and promoting efficient scaling and resource utilization without the need for constant manual adjustments.
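As a rough illustration of the annotation-driven instance selection described above, a GPU pod might hint at its preferences as sketched below. The node.elotl.co/* annotation keys here are hypothetical placeholders, not Luna's documented names, and the GPU SKU, instance family, and image are example values; consult the Luna documentation for the actual annotation keys.

```yaml
# A hedged sketch of annotation-driven instance selection on a GPU workload.
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  annotations:
    node.elotl.co/instance-gpu-skus: "T4"          # hypothetical key: desired GPU SKU
    node.elotl.co/instance-family-include: "g4dn"  # hypothetical key: allowed family
    node.elotl.co/instance-type-exclude: "p4d.*"   # hypothetical key: regexp to avoid
spec:
  containers:
    - name: train
      image: my-registry/trainer:latest            # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1
```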
Lastly, the advantages of Luna extend beyond resource optimization and workload adaptability within a single cloud. When organizations leverage various cloud providers, flexibility is paramount. An intelligent autoscaler designed to support GPU management across multiple cloud providers empowers users with the freedom to choose the most suitable cloud platform for their specific needs. With Luna, enterprises are not locked into a single cloud provider, giving them the agility to transition workloads seamlessly between different cloud environments based on cost-effectiveness, performance, or specific features. Currently Luna supports four cloud providers: Amazon AWS with EKS, Google Cloud with GKE, Microsoft Azure with AKS, and Oracle Cloud Infrastructure with OKE. By providing a unified and agnostic approach to GPU resource management, Luna becomes a strategic asset, enabling organizations to harness the benefits of diverse cloud ecosystems without compromising efficiency or incurring cloud vendor lock-in.

In summary, manually managing GPU compute in Kubernetes clusters poses challenges related to wasted spend and the manual addition, monitoring, and removal of nodes. Luna addresses these pain points by:
Luna simplifies cluster node management, reduces operational overhead, and ensures efficient GPU resource utilization. To delve deeper into Luna's powerful features and capabilities, explore the Luna product page for details. For step-by-step guidance, consult our Documentation. Ready to experience the seamless management of GPU workloads firsthand? Try Luna today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.

Author: Justin Willoughby (Principal Solutions Architect, Elotl)
Contributors: Henry Precheur (Senior Staff Engineer, Elotl), Anne Holler (Chief Scientist, Elotl)