When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The KAI Scheduler introduces a structured approach to resource allocation by organizing jobs into queues and operating on the assumption that the GPU resources available within the cluster are fixed. For readers unfamiliar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI's own abstraction, not to be confused with a Kubernetes Job resource (the batch/v1 kind used for running finite, batch-style workloads). Each queue can be assigned limits and quotas, allowing administrators to control how resources are distributed across teams, projects, or workloads. This model ensures fair usage and predictability, but it also means that when demand exceeds supply, jobs sit idle waiting for resources to become available, and when supply exceeds demand, the cluster incurs unnecessary costs.
This is where the real strength of the KAI Scheduler shines: pairing it with Luna, an intelligent autoscaler. Combined with Luna, the system becomes highly elastic, able to dynamically add GPU nodes only when truly needed and scale them back down to optimize efficiency. Instead of relying on a static pool of GPUs, the cluster can grow to meet active demand, but only up to what is necessary and permitted by the configured queue limits and quotas. It's worth noting that Luna doesn't indiscriminately add nodes; it works intelligently alongside KAI, ensuring that scaling decisions respect organizational boundaries and cost controls. Beyond scaling decisions, Luna offers settings to guide GPU instance selection, adding another layer of precision.
Even more powerfully, when demand drops, the autoscaler can scale GPU nodes down to zero, eliminating idle GPU resource costs entirely when no jobs are pending. This combination of KAI’s scheduling guarantees with elastic GPU scaling through Luna improves resource utilization, enforces workload fairness, and reduces cloud costs — all while staying responsive to real-time demand.
Although KAI's queue-based model applies to both GPU and non-GPU scheduling scenarios, this blog highlights its integration with Luna in the context of GPU workloads, where elastic scaling offers the greatest impact. In this blog post, we'll dive deeper into how KAI's design philosophy around queues and quotas enables this behavior, and how coupling it with the Luna autoscaler transforms your GPU cluster into a highly responsive, cost-effective machine learning platform.

Queues, Quotas, and Priorities: The Building Blocks of KAI Scheduling
The KAI Scheduler is a purpose-built GPU scheduling system designed for modern AI/ML clusters where jobs vary widely in size, duration, and importance. At its core, KAI is designed to maximize GPU utilization while ensuring fairness, predictability, and administrative control. Unlike traditional Kubernetes scheduling, which typically operates at a pod-by-pod level, KAI introduces a queue-based model that groups jobs by context, such as by team, project, or workload class, allowing more intelligent and policy-driven resource sharing.
Each queue in KAI acts like a controlled funnel for jobs, with configurable limits (the maximum number of GPUs it can use at once) and quotas (reserved GPU allocations that a queue is guaranteed even during cluster contention). This structure ensures that important teams or high-priority projects are not starved when demand is high, while still allowing flexibility to share unused capacity when possible.

KAI also supports job priorities within queues. Higher-priority jobs are scheduled before lower-priority ones, even within the same queue, enabling teams to manage critical workloads more effectively. When GPUs are scarce, KAI can preempt lower-priority jobs (depending on configuration) to ensure that the most important work gets done first. Combined with fair sharing across queues and configurable preemption policies, this priority system helps align resource allocation with business and operational goals.

This structured approach of queues, quotas, limits, and priorities makes KAI uniquely capable of supporting large, dynamic GPU clusters where the mix of users, workloads, and urgency changes constantly. When coupled with an intelligent autoscaler like Luna, KAI ensures that the right jobs run at the right time, while the infrastructure elastically grows or shrinks to match real demand. This post highlights just a few core concepts of the KAI Scheduler, specifically its use of queues, quotas, and priority-based scheduling; KAI also includes many other advanced features for complex workload management that are worth exploring, even though we won't cover them here.
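As a concrete illustration, a workload is directed to a particular KAI queue through its pod spec. The minimal sketch below assumes the kai-scheduler scheduler name and the kai.scheduler/queue label used by recent KAI Scheduler releases (both worth verifying against your installed version); the queue name, image, and GPU request are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job                # a KAI "job" in the scheduling sense
  labels:
    kai.scheduler/queue: research   # queue this workload is submitted to (assumed label key)
spec:
  schedulerName: kai-scheduler      # hand scheduling to KAI instead of the default scheduler
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.03-py3   # illustrative image
      resources:
        limits:
          nvidia.com/gpu: 1         # request a single GPU
```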
How the Luna Autoscaler Works with the KAI Scheduler

While many Kubernetes autoscalers operate by simply watching for pending pods and then adding nodes when any pod remains unscheduled, this approach falls short in environments where more complex scheduling logic is in place, such as when using the KAI Scheduler. In KAI, it is perfectly normal (and intentional) for some pods to remain pending, not because resources are unavailable, but because a queue's GPU limit or quota has been reached. An autoscaler that simply reacts to all pending pods would wastefully add GPU nodes that the KAI Scheduler would never utilize, leading to unnecessary cloud spend and resource sprawl.
The Luna autoscaler solves this problem with a more intelligent strategy. Rather than simply reacting to the existence of pending pods, Luna can be configured to inspect a pod's status conditions and associated messages to determine why the pod is pending. This allows it to distinguish between pods that truly need more capacity and pods that are simply waiting for their turn within a queue limit. For example, a pod's status.conditions section might include a message such as the following (shown here as representative examples; the exact wording varies across Kubernetes and KAI Scheduler versions):
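```
0/5 nodes are available: 5 Insufficient nvidia.com/gpu.
```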
-or-
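```
No node in the default node-pool has GPU resources
```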
Either message indicates that the pod is unschedulable because no nodes with free GPU resources are available. In this case, Luna correctly triggers the addition of a new GPU node, allowing the KAI Scheduler to proceed with placing the job.
On the other hand, if the pending pod's message indicates that the queue's GPU quota or limit has already been reached, rather than a shortage of node resources, adding more nodes would be futile: KAI will not schedule the pod until capacity within the queue frees up, regardless of available cluster resources. The Luna autoscaler, if properly configured, recognizes this scenario and avoids unnecessary node provisioning.
The flexibility that enables Luna to behave correctly in these cases comes from its pendingPodReasonRegexp configuration option. This setting lets administrators define a regular expression that matches only those pending-pod messages that warrant scaling actions. Without any configuration, Luna would treat all pending pods as triggers for scale-out. With an expression along these lines, however (an illustrative pattern; tune it to the exact messages your scheduler versions emit):
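```
(\d+/\d+ nodes are available.*|No node in the .* node-pool has GPU resources.*)
```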
Luna can simultaneously support both default Kubernetes scheduling messages (like "0/5 nodes are available") and KAI Scheduler-specific resource shortage messages (like "No node in the default node-pool has GPU resources"). Critically, it would ignore pods pending due to quota overages, respecting the queue limits and policies enforced by KAI.
This integration makes Luna a powerful autoscaling companion to KAI, enabling truly elastic GPU infrastructure: adding nodes when needed for real workloads, avoiding waste when queues are at quota, and scaling down to zero when no pods are eligible for scheduling. Together, KAI and Luna deliver an efficient, responsive, and cost-optimized platform for running large-scale AI and ML jobs.

Real-World Dynamics: Scheduling, Queue Limits, and Intelligent Scaling
Let's walk through an example to see how the KAI Scheduler and Luna autoscaler work together in practice. We'll explore how GPU workloads are scheduled across queues, how scaling decisions are made, and how the system remains efficient even as demand changes throughout the day.
Imagine a Kubernetes cluster set up to serve multiple internal teams running AI workloads. Two KAI Scheduler queues are configured: a "Research" queue and a "Production Inference" queue. The "Research" queue is assigned a quota of 4 GPUs and a limit of 8 GPUs, while the "Production Inference" queue has a quota of 8 GPUs and a limit of 12 GPUs. These settings ensure that critical production workloads are prioritized and guaranteed sufficient resources even during periods of high demand, while still allowing research teams to scale up when capacity is available.

At the start of the day, several production inference jobs are submitted, consuming 6 GPUs. Luna detects that some production pods are pending with a valid unschedulable reason indicating a lack of GPU resources, not just a KAI queue over-limit. Based on its configured pendingPodReasonRegexp, Luna correctly interprets these pending pods as requiring new compute and promptly scales up additional GPU nodes. Once the nodes are ready, KAI schedules these inference jobs, bringing production workloads up toward their quota.

Shortly afterward, research engineers kick off a series of experimental training jobs, requesting 10 GPUs in total. KAI schedules the first 4 research jobs immediately, in line with the Research queue's quota. Another 4 pods may also be scheduled, provided the cluster has sufficient capacity, since they remain within the queue's configured limit of 8 GPUs. Meanwhile, Luna inspects the pending pods: it recognizes that some research pods are pending due to insufficient GPU capacity, while 2 others remain unscheduled and will stay pending because the queue's GPU limit has been reached. In response, Luna provisions additional GPU nodes only for the pods still eligible to run within the queue's limit and ignores those pending due to limit enforcement. This selective scaling ensures efficient cluster growth without wasting compute on artificially pending jobs.

As demand surges further, a second wave of production inference jobs arrives, consuming more GPUs and pushing the cluster toward full utilization. Because production workloads have a higher queue priority, KAI favors them over research jobs when scheduling GPUs that become available. The research pods that exceed their queue's limit remain pending, awaiting free resources.

Later in the day, several production inference jobs complete, releasing GPUs back into the cluster. The KAI Scheduler notices the freed-up GPUs and begins to schedule the pending research jobs, respecting quota, limit, and priority policies. As the workload tapers off toward evening, both queues gradually empty out. Luna detects the sustained idleness (no pending pods that would require GPUs) and begins scaling down the GPU node pools, eventually reaching zero GPU nodes once all jobs have completed or been canceled.

Throughout this cycle, KAI ensures fair, priority-aware scheduling based on queue configurations, while Luna manages dynamic, intelligent autoscaling: scaling up precisely when workloads genuinely need resources and scaling down aggressively to save costs. This close coordination keeps the platform cost-effective, responsive, and well-aligned to workload demand.
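For reference, the two queues in this scenario might be declared roughly as follows, assuming KAI's Queue custom resource with per-resource quota and limit fields (the API group and field names reflect recent KAI Scheduler releases and should be checked against your installed version; queue priority configuration is omitted here):

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: research
spec:
  resources:
    gpu:
      quota: 4             # GPUs guaranteed to the Research queue
      limit: 8             # hard cap on concurrent GPU usage
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: production-inference
spec:
  resources:
    gpu:
      quota: 8             # GPUs guaranteed to Production Inference
      limit: 12            # hard cap on concurrent GPU usage
      overQuotaWeight: 1
```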
How Luna Ensures Efficient Scaling Even Under Rapid Changes

While the Luna autoscaler is designed to scale GPU nodes (as well as non-GPU nodes) precisely according to actual demand, it's important to note that small overshoots can occasionally occur. Because of the inherently dynamic nature of Kubernetes, with pods completing, new pods arriving, and scheduling conditions changing rapidly, Luna may sometimes add slightly more nodes than strictly needed. However, this is expected behavior in highly dynamic systems, and Luna is built to detect and reconcile any over-provisioned nodes quickly. Unused GPU nodes are automatically identified and safely removed during the next autoscaling evaluation cycle. This reconciliation mechanism ensures that the cluster stays responsive to fast-changing workloads without risking long-term resource waste, striking a balance between agility and efficiency.
To further reduce the potential for over-scaling, administrators can configure Luna's clusterGPULimit option. This setting acts as a cap on the total number of GPUs Luna is allowed to provision. For example, it can be set to the sum of all KAI queue limits or slightly above the expected maximum GPU demand. This ensures that even under bursts of pending pods or fluctuating queue activity, Luna will not scale the cluster beyond a known safe threshold, providing another safeguard for cloud cost and quota control.
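As a rough sketch, and assuming these options are supplied as Helm-style values when deploying Luna (the exact option placement and syntax should be confirmed in the Luna documentation), the two settings discussed in this post could look like the following; the value 20 simply reflects the sum of the example queue limits (8 + 12):

```yaml
# Illustrative Luna configuration values (option names from this post; placement is an assumption)
pendingPodReasonRegexp: '(\d+/\d+ nodes are available.*|No node in the .* node-pool has GPU resources.*)'
clusterGPULimit: 20   # never provision more than 20 GPUs in total, e.g. the sum of KAI queue limits
```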
Closing Thoughts: Intelligent Scheduling and Autoscaling in Action

Effectively managing GPU resources in a Kubernetes environment requires more than just reactively scaling for all pending pods. It demands an understanding of why pods are pending, how workloads are prioritized, and how quotas and queue limits impact scheduling decisions. The KAI Scheduler brings powerful, queue-based control to the table, allowing administrators to enforce GPU resource guarantees, prioritize critical workloads, and avoid resource contention across teams, all while enabling dynamic, fair resource sharing when capacity allows.
However, intelligent scheduling alone isn't enough. To maximize efficiency and cost-effectiveness, the platform must also dynamically match the underlying compute supply to real-world demand. That's where the Luna intelligent autoscaler complements the KAI Scheduler perfectly. By inspecting pod status messages and acting only when GPU nodes are genuinely needed, not merely reacting to all pending pods, Luna ensures that scaling decisions are precise, deliberate, and resource-aware.

As we saw in the example scenario, this combination allows workloads to ramp up smoothly, respecting both quota guarantees and dynamic limits, while ensuring GPU nodes are provisioned only when they can actually be used. When workloads complete, Luna responds quickly, scaling GPU nodes back down, even all the way to zero, helping to avoid unnecessary cloud costs during idle periods.

In short, pairing the KAI Scheduler with an intelligent autoscaler like Luna provides a powerful foundation for managing large-scale, GPU-intensive Kubernetes workloads. Together, they deliver better workload fairness, faster responsiveness, and smarter resource utilization, all critical ingredients for running a highly efficient, cost-effective compute platform at scale.

Looking Ahead: Evolving Luna and KAI Integration
The current integration between the Luna autoscaler and the KAI Scheduler already enables powerful, efficient GPU workload scaling with intelligent handling of queues, quotas, and real-time cluster demands. While the existing functionality covers many common scenarios, we recognize that there may be opportunities for even deeper integration based on real-world needs.
We'd be very interested in hearing from you about potential improvements. If you have ideas for tighter coupling, additional features, or specific use cases where Luna could better support KAI's advanced scheduling behavior, we'd love your feedback. Your input could help guide future enhancements and ensure the system continues to meet evolving GPU workload demands.

Get Involved
If you're running GPU workloads today, or planning to, and want to make the most of the KAI Scheduler and Luna autoscaler together, now is the perfect time to get involved.
Share your feedback, test new features, and help us build even smarter, more efficient scaling for Kubernetes GPU environments and workloads. Discover how Luna's intelligent autoscaling enhances GPU workload management, especially when paired with advanced schedulers like KAI. Visit our Luna product page to explore all its capabilities, or dive into the documentation for hands-on setup guidance. Ready to optimize your cluster with smarter GPU scaling? Start your free trial today and experience the efficiency, control, and cost savings Luna can bring.

Author: Justin Willoughby (Principal Solutions Architect, Elotl)