Elotl
  • Home
  • Platform
    • Luna
    • Nova
  • Resources
    • Blog
    • YouTube
    • Podcast
    • Meetup
  • Use Cases
    • GenAI
  • Company
    • Team
    • Careers
    • Contact
    • News
  • Free Trial
    • Luna Free Trial
    • Nova Free Trial

Blog

Building an Elastic GPU Cluster with the KAI Scheduler and Luna Autoscaler

5/28/2025

 
When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The KAI Scheduler introduces a structured approach to resource allocation by organizing jobs into queues, operating under the assumption of a fixed pool of GPU resources in the cluster. For those unfamiliar with KAI terminology, a "job" here refers to a unit of scheduling work in KAI's own abstraction, not to a Kubernetes Job resource (the batch/v1 kind used for running finite, batch-style workloads). Each queue can be assigned limits and quotas, allowing administrators to control how resources are distributed across teams, projects, or workloads. This model ensures fair usage and predictability, but it also means that when demand exceeds supply, jobs sit idle waiting for resources to become available, and when supply exceeds demand, unnecessary costs are incurred.

This is where the real strength of the KAI Scheduler can shine: pairing it with Luna, an intelligent autoscaler. Together, they make the system highly elastic, able to dynamically add GPU nodes only when truly needed and scale them back down to optimize efficiency. Instead of relying on a static pool of GPUs, the cluster can grow to meet active demand, but only up to what is necessary and permitted by the configured queue limits and quotas. It's worth noting that Luna doesn't indiscriminately add nodes; it works intelligently alongside KAI, ensuring that scaling decisions respect organizational boundaries and cost controls. Beyond scaling decisions, Luna offers settings to guide GPU instance selection, adding another layer of precision.
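As a rough sketch of how a workload enters a KAI queue: the pod opts into the KAI Scheduler via `schedulerName` and names its target queue with a label. The queue name `team-a` and the container image are illustrative, and the label key follows the KAI Scheduler quickstart conventions; verify both against your installed version.

```yaml
# Illustrative pod submitted to a KAI queue (names are hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: train-job
  labels:
    kai.scheduler/queue: team-a   # routes the pod to team-a's queue
spec:
  schedulerName: kai-scheduler    # hand scheduling to KAI, not the default scheduler
  containers:
  - name: trainer
    image: my-training-image:latest   # hypothetical training image
    resources:
      limits:
        nvidia.com/gpu: 1         # GPU count charged against the queue's quota
```

If `team-a` has exhausted its quota, the pod waits in the queue; with Luna in the cluster, a pending pod that is within quota triggers a node scale-up instead of waiting on a static pool.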


Read More

Supercharge your Cluster Autoscaling with VPA

5/13/2025

 
Choosing accurate CPU and memory request values for Kubernetes workloads is a difficult endeavor. As a result, application developers often overprovision their workloads to ensure that application performance will not be affected, which increases cloud costs and wastes resources. Workloads can also be inadvertently underprovisioned, which can degrade application performance and potentially even lead to service disruptions.

In this blog, we describe how the Kubernetes Vertical Pod Autoscaler (VPA) can be leveraged in conjunction with Luna, a powerful cluster autoscaler, to ensure that Kubernetes workloads are right-sized by VPA and that the cluster and its nodes are right-sized by Luna, resulting in cost-effective and performant operations.
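A minimal sketch of the VPA side of this pairing, assuming a Deployment named `web-app` (the name and the min/max bounds are illustrative):

```yaml
# VPA object that observes web-app's usage and rewrites its
# CPU/memory requests; in "Auto" mode, pods are evicted and
# re-created with the updated requests, which Luna then sizes nodes for.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # hypothetical workload
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: "2"
        memory: 4Gi
```

The bounds keep VPA's recommendations within a sane range while Luna handles adding or removing the nodes the resized pods need.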



Read More

Fun with Spot

4/24/2025

 

Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs

Offline inference is well-suited to take advantage of spot GPU capacity in public clouds.  However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly.  The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU SKU and count as well as maximum estimated hourly cost.  In this blog, we share recent experiences with offline inference on GKE, AKS, and EKS clusters using Luna.  Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, reducing estimated hourly costs by 38-50% versus an on-demand baseline and turning an often tedious task into bargain-jolt fun.
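For context, an offline-inference pod of the kind Luna provisions for might look like the sketch below. The image name is hypothetical, the `nodeSelector` key shown is GKE's spot-node label (EKS and AKS use different labels, and depending on provisioning, spot nodes may also carry a taint requiring a matching toleration), and Luna's own annotations for GPU SKU and cost constraints are not shown; consult the Luna docs for those.

```yaml
# Batch inference pod requesting one GPU, steered toward spot capacity.
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference
spec:
  restartPolicy: Never                    # offline job: run to completion
  nodeSelector:
    cloud.google.com/gke-spot: "true"     # GKE-specific spot node label
  containers:
  - name: inference
    image: my-inference-image:latest      # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
```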

Introduction

Applications such as query/response chatbots are handled via online serving, in which each input and prompt is provided in real-time to the model running on one or more GPU workers.  Automatic instance allocation for online serving presents efficiency challenges.  Real-time response is sensitive to scaling latency during usage spikes and can be impacted by spot reclamation and replacement.  Also, peak online serving usage often overlaps with peak cloud resource usage, affecting the available capacity for GPU instances.  We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., scaling Helix to handle ML load and reducing deploy time for new ML workers.

Read More

Reducing Deploy Time for LLM Serving on Cloud Kubernetes with Luna Smart Autoscaler

1/28/2025

 

OVERVIEW

26 minutes!  26 long minutes was our wait time in one example case for our chatbot to be operational.  Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness!  Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model.  But we hoped we could make some simple changes to speed it up, and we did.  In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes at deployment times that won't bum you out.

We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler.  Our example LLM serving workload is a KubeRay RayService using vLLM to serve an open-source model downloaded from HuggingFace.  We measured deploy-time improvements of up to 60%.


Read More

EKS Auto Mode vs. Luna: Choosing the Right Scaling Strategy for Your Kubernetes Workloads

1/14/2025

 
Running Kubernetes on AWS using Elastic Kubernetes Service (EKS) offers a robust platform for container orchestration, but the challenge of managing the underlying compute infrastructure persists. This limitation can be addressed through various approaches, including the fully managed simplicity of EKS Auto Mode or the granular control offered by an intelligent Kubernetes cluster autoscaler like Luna. In this post, we’ll explore the advantages of each, helping you choose the best scaling strategy for your workloads.

Introduction

EKS Auto Mode is a fully managed solution aimed at reducing operational complexity for Kubernetes clusters on AWS. It automates essential tasks like node provisioning, scaling, and lifecycle management, offering an ideal entry point for teams new to EKS or operating simpler workloads.

In contrast, compute autoscalers like Luna offer greater flexibility and customization, allowing you to optimize your infrastructure for the demands of complex and/or resource-intensive workloads.


Read More

Mastering Kubernetes Autoscaling: How Luna Combines Bin-Packing and Bin-Selection for Optimal Cluster Scaling Efficiency

10/3/2024

 
In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to delve deeper. The real game-changer? Cluster Autoscalers. These tools dynamically adjust the size of your cluster, ensuring you meet workload demands without over-provisioning resources. But while many autoscalers focus solely on bin-packing, Luna takes it a step further with its innovative bin-selection feature, delivering an all-encompassing solution for workload management and cost efficiency.

In this blog, we will explore both bin-packing and bin-selection, two essential strategies for Kubernetes autoscaling. By leveraging Luna, you can maximize efficiency, minimize waste, and keep costs under control, all while handling the complexities of varying workload sizes and resource requirements. Let’s dive in!

What is Bin-Packing in Kubernetes?

Bin-packing is the default approach for optimizing pod placement in Kubernetes. The concept is simple: pack as many items (pods) into as few bins (nodes) as possible, maximizing resource utilization and minimizing the number of nodes required.
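To make the packing arithmetic concrete, here is a minimal, hypothetical deployment. On nodes with roughly 4000m of allocatable CPU, two 1500m replicas fit on one node (3000m), while the third would push the total to 4500m, so it stays pending and prompts the autoscaler to add a node.

```yaml
# Illustrative deployment: each replica requests 1500m CPU, so at most
# two replicas bin-pack onto a node with ~4000m allocatable CPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: packed-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: packed-app
  template:
    metadata:
      labels:
        app: packed-app
    spec:
      containers:
      - name: app
        image: nginx:1.27          # stand-in workload
        resources:
          requests:
            cpu: 1500m
            memory: 256Mi
```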


Read More

Luna Hot Node Mitigation: A Chill Pill to Cure Pod Performance Problems

8/21/2024

 
When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput.  In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the first and 40% in the second. We describe how the Luna smart cluster autoscaler with HNM addresses hot node performance issues by triggering the allocation and use of additional cluster resources.

INTRODUCTION

A pod's CPU and memory resource requests express its minimum resource allocations.  The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when the settings cannot be respected.  Cloud cluster autoscalers look at these values on pending pods to determine the amount of resources to add to a cluster.

A pod configured with both CPU and memory requests, and with limits equal to those requests, is in QoS class guaranteed.  A K8s cluster hosting any non-guaranteed pods runs the risk that some nodes in the cluster could become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can have performance problems.  A bursting pod’s attempts to use CPU above its CPU resource request can be throttled.  And its attempts to use memory above its memory resource request can cause the pod to be killed.  The K8s scheduler can worsen the situation, by continuing to schedule pods onto hot nodes.
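For reference, a Guaranteed-QoS pod looks like the sketch below: every container sets both CPU and memory requests and limits, with limits equal to requests. Such a pod cannot burst above its allocation, so it cannot contribute to making a node hot.

```yaml
# Pod in the Guaranteed QoS class: requests == limits for CPU and memory.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: nginx:1.27        # stand-in workload
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: 500m            # equal to the request: no CPU bursting
        memory: 512Mi        # equal to the request: no memory bursting
```

Any container that omits a request or sets a limit above its request drops the pod into the Burstable class, and it is these pods that HNM watches for on over-utilized nodes.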

Read More

Right Place, Right Size: Using an Autoscaler-Aware Multi-Cluster Kubernetes Fleet Manager for ML/AI Workloads

7/11/2024

 

Introduction

Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for each workload based on placement policies and the available capacity of candidate clusters. Nova is autoscaler-aware, detecting whether workload clusters are running the K8s cluster autoscaler or the Elotl Luna intelligent cluster autoscaler.

In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up you can:
  1. Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.
  2. Reduce your bill by directing experimental jobs to sunk-cost clusters.
  3. Reduce your costs via policies that select GPUs with the desired price/performance.


Read More

Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler

6/25/2024

 

Introduction

Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default.  With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use.  For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.

Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration can significantly reduce costs. A time-slice-aware "smart" CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.
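As a sketch, the config below follows the format documented for the NVIDIA Kubernetes device plugin: each physical GPU is advertised as four schedulable `nvidia.com/gpu` resources. The ConfigMap name and namespace are illustrative, and as noted above, GKE configures time-sharing at the control plane rather than through this plugin config; verify the format against your plugin version.

```yaml
# Time-slicing config for the NVIDIA device plugin: one physical GPU
# is exposed as 4 allocatable nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # hypothetical name
  namespace: kube-system              # assumes the plugin runs here
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

With four replicas configured, four pods each requesting one `nvidia.com/gpu` can interleave on a single physical GPU, and a time-slice-aware CA knows it needs only one GPU instance, not four.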



Read More

Unleashing the Power of ARM: Elevating Your Kubernetes Workloads with ARM Nodes

4/29/2024

 
The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile-device domain. This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify the potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and the scenarios where x86-64 architectures perform better and thus remain a better fit. Lastly, we'll describe a seamless way to add ARM nodes to your Kubernetes clusters.

In this blog, for the sake of clarity and brevity, I will be using the term 'ARM' to refer to ARM64 or ARM 64-bit processors, while 'x86' or 'x86-64' will be used interchangeably to denote Intel or AMD 64-bit processors.

What Kubernetes Workloads Tend To Be Ideal for ARM Processors?

Inference-heavy tasks:

While the computations involved in Deep Learning training typically require GPUs for acceptable performance, DL inference is less computationally intense.  Tasks that apply pre-trained models for DL regression or classification can benefit from ARM's power/performance relative to GPU or x86-64 systems. We presented data on running inference on ARM64 in our Scale20x talk.
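Steering such an inference workload onto ARM nodes uses the standard `kubernetes.io/arch` node label, as in the sketch below; the image name is hypothetical and must be built for arm64 (or be a multi-arch image).

```yaml
# Pod pinned to ARM nodes via the well-known architecture label.
apiVersion: v1
kind: Pod
metadata:
  name: arm-inference
spec:
  nodeSelector:
    kubernetes.io/arch: arm64        # schedule only onto ARM64 nodes
  containers:
  - name: inference
    image: my-multiarch-inference:latest   # hypothetical multi-arch image
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
```

With a cluster autoscaler that can provision ARM instance types, this selector alone is enough to trigger the allocation of an ARM node when none is available.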

Read More


© 2025 Elotl, Inc.