Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing Expenses
While running AI workloads on cloud K8s clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills. And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices. In this blog, we introduce the Luna Smart Cluster Autoscaler Cost Estimation feature for estimating the node cost of pods before they run. We show how Luna's node cost estimation feature avoids AI workload sticker shock and facilitates assessing strategies for AI workload scaling.
INTRODUCTION
Kubernetes (K8s) cluster autoscalers can reduce cloud computing expenses by allocating nodes when needed and removing them when no longer needed. For expensive workloads like AI, getting an estimate of the hourly cost before the workload is scheduled can help prevent cloud sticker shock. Also, getting estimated costs helps in configuring the workload to optimize expenses when planning for future growth. Estimated costs can be used to assess the monetary impact of choices such as workload size, GPU SKU and/or instance family selection, and on-demand versus spot pricing.
The Luna Smart Autoscaler for cloud K8s recently added support for node hourly cost estimation. For Luna-managed pods whose scheduling readiness is blocked by K8s scheduling gates, if the gates include nodecostestimate, Luna reports a pod event indicating the node type it would allocate were the pod schedulable, along with the type's estimated hourly compute cost. In this blog, we present an overview of Luna's cost estimation feature. We then use the feature to preview the estimated baseline cost of an LLM serving workload running on Amazon AWS EKS, Google GCP GKE, and Microsoft Azure AKS cloud K8s clusters. We discuss how cost estimation can guide tuning the costs of scaling the workload as its usage increases, with the clouds showing significant cost differences for potential workload scaling strategies. We show estimated on-demand costs for EKS, GKE, and AKS, as well as estimated spot costs for EKS. Note that the estimated costs Luna reports are public prices, and do not reflect customer discounts or special pricing.

OVERVIEW OF LUNA COST ESTIMATION
The Luna Smart Autoscaler allocates nodes for pending pods marked for Luna management. As shown in Figure 1, Luna node allocation supports both bin-packing, in which nodes are allocated to host multiple small generic pods, and bin-selection, in which nodes are allocated to host larger pods or pods with special requirements. Luna chooses the lowest-cost node type that satisfies the pod's resource requests and node type selection constraints, if any. Luna supports a variety of selection constraints, including instance type constraints (include/exclude instance family, match a regular expression), maximum instance cost, GPU SKU, maximum GPU count, and pricing category (on-demand, spot, or either). Also, if Luna encounters a transient problem when allocating a node type in a pricing category, e.g., cloud capacity stock out, cloud account quota exhausted, node scale-up time limit exceeded, etc., it backs off from allocating that node type and pricing category combination for a configurable period, and proceeds to try to allocate the next cheapest node type.
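As a sketch of one such constraint, the maximum instance cost cap can be attached as a pod annotation; the annotation key node.elotl.co/instance-max-cost appears in Luna's documentation, while the pod name, image, and cap value shown here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cost-capped-worker        # illustrative name
  annotations:
    # Cap the hourly cost of the node Luna may allocate for this pod
    # (value format assumed to be a plain hourly USD amount).
    node.elotl.co/instance-max-cost: "2.00"
spec:
  containers:
    - name: worker
      image: example/worker:latest   # illustrative image
      resources:
        requests:
          cpu: "16"
          memory: 16Gi
```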
K8s support for Pod Scheduling Readiness, controlled by schedulingGates, became stable in v1.30. When a pod has schedulingGates, it is not considered for placement by KubeScheduler or by any K8s cluster autoscaler (including Luna) until its schedulingGates are removed. Luna was recently updated to recognize the nodecostestimate scheduling gate.
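A minimal sketch of a pod gated for cost estimation follows; the pod name, container, and image are illustrative, and the pod is assumed to be marked for Luna management per the cluster's Luna configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker                  # illustrative name
spec:
  schedulingGates:
    - name: nodecostestimate        # gate recognized by Luna for cost estimation
  containers:
    - name: worker
      image: example/worker:latest  # illustrative image
      resources:
        requests:
          cpu: "16"
          memory: 16Gi
          nvidia.com/gpu: "1"
```

While this gate is present, the pod stays unscheduled and Luna emits only the NodeCostEstimate event; removing the gate makes the pod schedulable as usual.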
When a pod marked for Luna management includes the nodecostestimate scheduling gate, Luna determines the node type it would choose if the pod were not gated, and reports that type, its estimated hourly cost, and the count of nodes of that type it would allocate for the set of matching gated pods in a NodeCostEstimate pod event. Figure 2 shows an event for a pod in a set of 3 small bin-packed pods, which Luna expects to run together on a single node. Figure 3 shows an event for a pod in a set of 3 bin-select pods, which Luna expects to run on 3 separate nodes. All NodeCostEstimate pod events can be listed with kubectl.
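A command along the following lines, assuming the events carry the reason NodeCostEstimate, lists them across all namespaces:

```shell
kubectl get events --all-namespaces --field-selector reason=NodeCostEstimate
```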
To control the event-reporting overhead for the cost estimate, Luna only generates and reports a cost estimate pod event for pods not already having such an event. A new pod cost estimate event is generated if the existing event is removed, e.g., due to retention policy (pod events are retained for 1 hour by default) or to explicit deletion.
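Because a new estimate event is generated once the existing one is gone, explicitly deleting the event is one way to request a fresh estimate without waiting for retention expiry; assuming the event reason is NodeCostEstimate, something like:

```shell
kubectl delete events --field-selector reason=NodeCostEstimate -n <namespace>
```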
Luna's node cost estimate may over- or under-shoot the actual cost once a pod's schedulingGates are removed and the pod is scheduled for execution. The estimate does not account for the possibility that the pod could share an existing running node, either under bin-packing or under bin-select with node reuse enabled (the default); in those cases, KubeScheduler handles pod placement and the pod does not need a node allocation from Luna. The estimate also does not account for node type availability differing between estimation time and scheduling time. If Luna node type back-offs in effect at estimation time are no longer in effect at scheduling time, cheaper node types may be selected; if back-offs not in effect at estimation time are triggered at scheduling time, more expensive node types may be chosen. Note that in general Luna supports capping the cost of a node allocated for bin-selection via the pod annotation node.elotl.co/instance-max-cost.

USING LUNA COST ESTIMATION TO ASSESS LLM SERVING CONFIGURATIONS
As an AI workload example, we consider the placement of a KubeRay 1.4.2 RayService serving an LLM model. We use the model microsoft/Phi-3-mini-4k-instruct, which runs successfully on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S. The baseline workload config is given here, comprising a CPU-only head requesting 2 CPUs and 16 GB memory, and 2 GPU-enabled workers, each requesting 16 CPUs, 16 GB memory, and 1 NVIDIA GPU. Given the pods' resource requirements, Luna assigns a node for each pod (bin-selection); it is also possible to configure Luna to assign multiple GPU pods per node (bin-packing) for this case.
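A sketch of the baseline resource requests, assuming a standard KubeRay RayService layout, is shown below; the service name, group name, image, and omitted fields (e.g., rayStartParams, serveConfigV2) are illustrative or elided, with the full config linked above:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: phi-3-mini-serve              # illustrative name
spec:
  rayClusterConfig:
    headGroupSpec:                    # CPU-only head
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:latest   # illustrative image
              resources:
                requests:
                  cpu: "2"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers        # illustrative group name
        replicas: 2                   # 2 GPU-enabled workers
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:latest   # illustrative image
                resources:
                  requests:
                    cpu: "16"
                    memory: 16Gi
                    nvidia.com/gpu: "1"
```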
We examine Luna's estimated costs for the baseline configuration on EKS, GKE, and AKS cloud K8s clusters to illustrate the value of getting that visibility before running the workload. We then consider the costs of several strategies for scaling up the workload's processing capacity: increasing the worker count, or keeping the worker count fixed while either increasing each worker's GPU count or allocating a more powerful GPU device. These costs can guide workload scaling performance evaluation testing. We observe that these strategies have significantly different costs across clouds. Note that all node cost estimate experiments were run without any Luna resource availability back-offs in effect, meaning the estimates assume sufficient cloud stock and user quota for the selected node types. While availability issues can occur, particularly for popular instance types, obtaining cost estimates for the preferred node types is useful, since it can guide region selection and quota requests toward acquiring those node types.

AWS EKS LUNA NODE COST ESTIMATE EXPERIMENTS
We ran the AWS node cost estimate experiments using Luna v1.3.3 on an EKS 1.33 cluster in us-west-2. The results for on-demand pricing are given in Table 1, with links given to associated yaml configurations. It is useful to see the baseline costs in advance to avoid sticker shock; this baseline workload would cost ~$715/week.
Also, it is helpful to see the potential costs of scaling up the workload. Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the relatively low price of the g6.12xlarge type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth exploring. The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, but that cost reflects that the A100 is only available in instances with 8 GPUs. The per-GPU cost of the A100 is $2.8012/hr, which is ~39% higher than the L4's $2.0144/hr ($2.8012 / $2.0144 ≈ 1.39), so if the scaled workload can use all 8 GPUs, the config is worth considering, given the A100's faster floating point and larger memory (80 GB vs 24 GB).
Table 1: Luna On-Demand Node Cost Estimate Experiments run on EKS 1.33 cluster
We repeated the node cost estimate experiments using spot pricing. The results are given in Table 2, with the "Ratio over baseline" column computed against the baseline value in Table 1, to facilitate comparing spot with on-demand prices. We ran with the Luna aws.useSpotAdvisor option set to true, so Luna used AWS Spot Instance Advisor data to estimate spot prices. The Spot Instance Advisor provides the average spot discount over the last 30 days for each region and instance type, along with the average frequency of spot reclamation interruptions, which can be used to constrain Luna's spot node type selection.
The spot prices in Table 2 are roughly half of the on-demand prices in Table 1, a substantial savings. However, the spot advisor data (viewable via the AWS tool link in the previous paragraph or in Luna verbose logs) indicates that all 3 GPU-enabled node types fall in the highest-frequency interruption bucket, meaning a 20%+ chance of node reclamation during use. When configured to use spot advisor data, Luna supports the aws.maxSpotInterruptBucket option to cap the allowed spot interrupt bucket (for managing reclamation risk) and the aws.maxSpotPriceRatio option to cap the spot-to-on-demand price ratio (for ensuring sufficient savings).
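Assuming Luna is configured via Helm-style values (the exact configuration surface may differ by install method), these options might be set as follows; the bucket and ratio values are illustrative:

```yaml
aws:
  useSpotAdvisor: true
  maxSpotInterruptBucket: 3   # illustrative: exclude the highest-interruption buckets
  maxSpotPriceRatio: 0.6      # illustrative: require spot price <= 60% of on-demand
```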
Table 2: Luna Spot Node Cost Estimate Experiments run on EKS 1.33 cluster
GCP GKE LUNA NODE COST ESTIMATE EXPERIMENTS
We ran the GCP GKE node cost estimate experiments using Luna v1.3.3 on a regional GKE 1.33.3 cluster in us-central1. The results for on-demand pricing are given in Table 3. This baseline workload would cost ~$613/week.
Again, it is helpful to see the potential costs of scaling up the workload. Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the lower price of the g2-standard-24 type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth checking. The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, with the A100 per-GPU cost of $7.3390/hr being significantly higher than the L4 per-GPU cost of $1.7343/hr (unlike on EKS), so unless the A100 provides commensurately better performance, switching to it is not economical.
Table 3: Luna On-Demand Node Cost Estimate Experiments run on GKE 1.33 cluster
AZURE AKS LUNA NODE COST ESTIMATE EXPERIMENTS
We ran the Azure AKS node cost estimate experiments using Luna v1.3.3 on an AKS 1.32.6 cluster in eastus. The results for on-demand pricing are given in Table 4. This baseline workload would cost ~$1113/week.
Again, it is helpful to see the potential costs of scaling the workload. Both the first and second "Scale" configuration rows include 4 A10 GPUs, and their pricing is comparable, unlike the case on EKS and GKE. And the third row shows that upgrading the GPU SKU to the A100 would not be very expensive, so evaluating workload scaling performance for that config is worthwhile.
Table 4: Luna On-Demand Node Cost Estimate Experiments run on AKS 1.32 cluster
SUMMARY
In this blog, we've described the cost estimation feature in the Luna Smart Cluster Autoscaler and shown how it can be used to avoid cloud sticker shock. We've discussed how it can guide cost-aware workload configuration when considering future workload scale increases, with large differences between scale strategies observed across cloud vendors. In an upcoming blog, we'll describe how the Luna cost estimation feature can be used with the Nova multi-cluster manager to choose the K8s cluster on which to run an AI workload at the lowest price.
Have you experienced cloud sticker shock? Do you have ways you'd like to use estimated node pricing for workload resource planning activities? Please try Luna and let us know how it goes! A free trial download version is available here.

Author: Anne Holler (Chief Scientist, Elotl)