Elotl
  • Home
  • Products
    • Luna
    • Nova
  • Resources
  • Podcast
  • Company
    • Team
    • Careers
    • Contact
  • Free Trial
    • Luna Free Trial
    • Nova Free Trial

Avoiding AI Workload Cloud Sticker Shock

9/25/2025

 

Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing Expenses

While running AI workloads on cloud K8s clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills.  And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices.  In this blog, we introduce the Luna Smart Cluster Autoscaler Cost Estimation feature, which estimates the node cost of pods before they run.  We show how Luna's node cost estimation helps avoid AI workload sticker shock and facilitates assessing strategies for AI workload scaling.

INTRODUCTION

Kubernetes (K8s) cluster autoscalers can reduce cloud computing expenses by allocating nodes when needed and removing them when no longer needed.  For expensive workloads like AI, getting an estimate of the hourly cost before the workload is scheduled can help prevent cloud sticker shock.  Also, getting estimated costs helps in configuring the workload to optimize expenses when planning for future growth.  Estimated costs can be used to assess the monetary impact of choices such as workload size, GPU SKU and/or instance family selection, and on-demand versus spot pricing.

The Luna Smart Autoscaler for cloud K8s recently added support for providing node hourly cost estimation.  For Luna-managed pods whose scheduling readiness is blocked by K8s scheduling gates, if the gates include nodecostestimate, Luna reports a pod event that indicates the node type it would allocate were the pod schedulable, with the type's estimated hourly compute cost.  

In this blog, we present an overview of Luna's cost estimation feature.  We next use the feature to preview the estimated baseline cost of an LLM serving workload running on Amazon AWS EKS, Google GCP GKE, and Microsoft Azure AKS cloud K8s clusters.  We discuss how cost estimation can be used to guide tuning the costs of scaling the workload as its usage increases, with the clouds showing significant cost differences for potential workload scaling strategies.  We show estimated on-demand costs for EKS, GKE, and AKS, as well as estimated spot costs for EKS.  Note that the estimated costs that Luna reports are public prices, and do not reflect customer discounts or special pricing.

OVERVIEW OF LUNA COST ESTIMATION

The Luna Smart Autoscaler allocates nodes for pending pods marked for Luna management.  As shown in Figure 1, Luna node allocation supports both bin-packing, in which nodes are allocated to host multiple small generic pods, and bin-selection, in which nodes are allocated to host larger pods or pods with special requirements.  Luna chooses the lowest-cost node type that satisfies the pod's resource requests and node type selection constraints, if any.  Luna supports a variety of selection constraints, including constraints on instance type (include/exclude instance families, match a regular expression), maximum instance cost, GPU SKU, maximum GPU count, and pricing category (on-demand, spot, or either).  Also, if Luna encounters a transient problem when allocating a node type in a pricing category, e.g., cloud capacity stock-out, cloud account quota exhaustion, or node scale-up time limit exceeded, it backs off from allocating that node type and pricing category combination for a configurable period, and proceeds to try the next-cheapest node type.
Figure 1: Luna dynamic node allocation using bin-packing and bin-selection
K8s support for Pod Scheduling Readiness controlled by schedulingGates became stable in v1.30.  When a pod has schedulingGates, it is not considered for placement by kube-scheduler or by any K8s cluster autoscaler (including Luna) until its schedulingGates are removed.  Luna was recently updated to recognize the nodecostestimate scheduling gate; for example:

apiVersion: v1
kind: Pod
metadata:
  name: busyboxbp
  labels:
    elotl-luna: "true"
spec:
  schedulingGates:
  - name: "nodecostestimate"
  containers:
  - name: busyboxbp <snip>
    
When a pod marked for Luna management includes the nodecostestimate scheduling gate, Luna determines the node type it would choose if that pod were not gated, and reports that type, its estimated hourly cost, and the count of nodes of that type it would allocate for the set of matching gated pods, in a NodeCostEstimate pod event.  Figure 2 shows an event for a pod in a set of 3 small bin-packed pods, which Luna expects to run together on a single node.  Figure 3 gives an event for a pod in a set of 3 bin-selected pods, which Luna expects to run on 3 separate nodes.  All NodeCostEstimate pod events can be listed by querying pod events filtered on that reason.
Figure 2: NodeCostEstimate Pod Event reported for deployment of 3 bin-packed pods on GKE
Figure 3: NodeCostEstimate Pod Event reported for deployment of 3 bin-selected pods on GKE
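The events shown in the figures can be pulled with a standard kubectl event query; a minimal sketch, assuming kubectl access to the cluster and the NodeCostEstimate event reason shown above:

```shell
# List all NodeCostEstimate pod events across all namespaces.
kubectl get events --all-namespaces --field-selector reason=NodeCostEstimate
```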
To control the event-reporting overhead for the cost estimate, Luna only generates and reports a cost estimate pod event for pods not already having such an event.  A new pod cost estimate event is generated if the existing event is removed, e.g., due to retention policy (pod events are retained for 1 hour by default) or to explicit deletion.

Luna's node cost estimate may overshoot or undershoot the actual cost once a pod's schedulingGates are removed and the pod is scheduled for execution.  The estimate does not take into account that the pod might be able to share an existing running node, whether via bin-packing or via bin-selection with node-reuse enabled (the default).  In those cases, kube-scheduler would handle pod placement and the pod would not need node allocation by Luna.  Also, the estimate does not take into account that node type availability at scheduling time may differ from that at estimation time.  If any Luna node type back-offs were in effect at estimation time, but are no longer in effect at scheduling time, cheaper node types may be selected.  If some node type back-offs were not in effect at estimation time, but are triggered at scheduling time, more expensive node types may be chosen.  Note that in general Luna supports capping the cost of a node allocated for bin-selection via the pod annotation node.elotl.co/instance-max-cost.
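A minimal sketch of such a cost cap on a Luna-managed pod follows; the annotation value format (a dollars-per-hour string) is an assumption for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: capped-example
  labels:
    elotl-luna: "true"                       # mark pod for Luna management
  annotations:
    node.elotl.co/instance-max-cost: "2.50"  # cap hourly node cost; value format assumed
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
```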

USING LUNA COST ESTIMATION TO ASSESS LLM SERVING CONFIGURATIONS

As an AI workload example, we consider the placement of a KubeRay 1.4.2 RayService serving an LLM model.  We use the model microsoft/Phi-3-mini-4k-instruct, which runs successfully on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S.  The baseline workload config is given here, comprising a CPU-only head requesting 2 CPUs and 16 GB memory, and 2 GPU-enabled workers, each requesting 16 CPUs, 16 GB memory, and 1 NVIDIA GPU.  Given the pods' resource requirements, Luna assigns a node for each pod (bin-selection); it is also possible to configure Luna to assign multiple GPU pods per node (bin-packing) for this case.
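A heavily abridged sketch of how those resource requests might appear in the KubeRay pod templates follows; the full configuration is in the linked yaml, and the names here (llm-serve, gpu-workers) are illustrative:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-serve
spec:
  rayClusterConfig:
    headGroupSpec:            # CPU-only head: 2 CPUs, 16 GB memory
      template:
        spec:
          containers:
          - name: ray-head
            resources:
              requests:
                cpu: "2"
                memory: 16Gi
    workerGroupSpecs:         # 2 GPU workers: 16 CPUs, 16 GB memory, 1 GPU each
    - groupName: gpu-workers
      replicas: 2
      template:
        spec:
          containers:
          - name: ray-worker
            resources:
              requests:
                cpu: "16"
                memory: 16Gi
              limits:
                nvidia.com/gpu: 1
```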

We examine Luna's estimated costs for the baseline configuration on EKS, GKE, and AKS cloud K8s clusters to illustrate the value of getting that visibility before running the workload.  We then consider the costs of several strategies for scaling up the workload's processing capacity, i.e., increasing the worker count, or maintaining the same worker count while either increasing each worker's GPU count or allocating a more powerful GPU device.  These costs can be used to guide workload scaling performance evaluation testing.  We observe that these strategies have significantly different costs across clouds.

We note that all node cost estimate experiments were run without any Luna resource availability back-offs in effect, meaning that the estimates assume sufficient cloud stock and user quota for the selected node types.  While availability issues can occur, particularly for popular instance types, obtaining cost estimates that represent the preferred node types is useful, since those estimates can steer region choice and quota requests toward acquiring those node types.

AWS EKS LUNA NODE COST ESTIMATE EXPERIMENTS

We ran the AWS node cost estimate experiments using Luna v1.3.3 on an EKS 1.33 cluster in us-west-2.  The results for on-demand pricing are given in Table 1, with links given to associated yaml configurations.  It is useful to see the baseline costs in advance to avoid sticker shock; this baseline workload would cost ~$715/week.

Also, it is helpful to see the potential costs of scaling up the workload.  Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the relatively low price of the g6.12xlarge type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth exploring.  The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, but that cost reflects that the A100 is only available in instances with 8x GPUs.  The per-GPU cost of the A100 on the p4d.24xlarge is $2.7730/hr ($22.1836/8), ~38% higher than the L4's $2.0144/hr, so if the workload scale can use all 8 GPUs, the config is worth considering, given the A100's faster floating point and larger memory (80 vs 24 GB).

Configuration | Head Node Type | Head Node $/hr | Worker Node Type | Worker Node $/hr | Worker Node GPU SKU | Worker Node Count | Total Cost $/hr | Ratio over baseline
Baseline: 2 1-GPU workers | r5a.xlarge | 0.2260 | g6.8xlarge | 2.0144 | 1x L4 | 2 | 4.2548 | 1.00
Scale: 4 1-GPU workers | r5a.xlarge | 0.2260 | g6.8xlarge | 2.0144 | 1x L4 | 4 | 8.2836 | 1.95
Scale: 2 2-GPU workers | r5a.xlarge | 0.2260 | g6.12xlarge | 4.6016 | 4x L4 | 1 | 4.8276 | 1.13
Scale: 2 1-GPU A100 workers | r5a.xlarge | 0.2260 | p4d.24xlarge | 22.1836 | 8x A100 | 1 | 22.4096 | 5.27
Table 1: Luna On-Demand Node Cost Estimate Experiments run on EKS 1.33 cluster
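The table's arithmetic is straightforward to reproduce; a quick sketch (Python used purely for illustration, with the prices taken from Table 1):

```python
def total_hourly_cost(head_per_hr, worker_per_hr, worker_count):
    """Estimated cluster hourly cost: one head node plus N worker nodes."""
    return head_per_hr + worker_per_hr * worker_count

# Baseline row of Table 1: r5a.xlarge head + 2 g6.8xlarge L4 workers.
baseline = total_hourly_cost(0.2260, 2.0144, 2)   # 4.2548 $/hr
weekly = baseline * 24 * 7                        # ~715 $/week

# Scaling to 4 workers roughly doubles the bill (the 1.95 ratio).
scale4 = total_hourly_cost(0.2260, 2.0144, 4)     # 8.2836 $/hr

# Per-GPU premium of the A100 row: p4d.24xlarge carries 8x A100.
a100_premium = 22.1836 / 8 / 2.0144 - 1           # ~38% over L4

print(round(weekly), round(scale4 / baseline, 2), round(a100_premium, 2))
```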
We repeated the node cost estimate experiments using spot pricing.  The results are given in Table 2, with the "Ratio over baseline" computed against the baseline value in Table 1, to facilitate comparing spot with on-demand prices.  We ran with the Luna aws.useSpotAdvisor option set to true, meaning that Luna used the AWS Spot Instance Advisor data to estimate spot prices.  The Spot Instance Advisor provides the average spot discount for the region and instance type over the last 30 days, and also includes the average frequency of spot reclamation interruptions, which can be used to constrain Luna spot node type selection.

The spot prices in Table 2 are roughly half of the on-demand prices in Table 1, a substantial saving.  However, the Spot Advisor data (viewable via the AWS tool link in the previous paragraph or in Luna verbose logs) indicates that all 3 GPU-enabled node types are in the highest-frequency interruption bucket, meaning a 20%+ risk of node reclamation during use.  When configured to use Spot Advisor data, Luna supports the aws.maxSpotInterruptBucket option to constrain spot selection by maximum interrupt bucket (managing reclamation risk) and the aws.maxSpotPriceRatio option to constrain spot selection by spot-to-on-demand price ratio (ensuring sufficient savings); these constraints apply to cost estimation as well as to actual node placement.
Configuration | Head Node Type | Head Node $/hr | Worker Node Type | Worker Node $/hr | Worker Node GPU SKU | Worker Node Count | Total Cost $/hr | Ratio over baseline
Spot: 2 1-GPU workers | r5a.xlarge | 0.0859 | g6.8xlarge | 0.9871 | 1x L4 | 2 | 2.0601 | 0.48
Spot Scale: 4 1-GPU workers | r5a.xlarge | 0.0859 | g6.8xlarge | 0.9871 | 1x L4 | 4 | 4.0343 | 0.95
Spot Scale: 2 2-GPU workers | r5a.xlarge | 0.0859 | g6.12xlarge | 2.2548 | 4x L4 | 1 | 2.3407 | 0.55
Spot Scale: 2 1-GPU A100 workers | r5a.xlarge | 0.0859 | p4d.24xlarge | 9.6614 | 8x A100 | 1 | 9.7473 | 2.29
Table 2: Luna Spot Node Cost Estimate Experiments run on EKS 1.33 cluster
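The headline spot saving can be read directly off the two baselines; a small sketch using the Table 1 and Table 2 prices:

```python
# Baselines: head node + 2 workers, on-demand (Table 1) vs spot (Table 2).
on_demand = 0.2260 + 2 * 2.0144    # 4.2548 $/hr
spot = 0.0859 + 2 * 0.9871         # 2.0601 $/hr

ratio = spot / on_demand           # the 0.48 "Ratio over baseline" in Table 2
print(f"spot is {ratio:.0%} of on-demand; savings ~{1 - ratio:.0%}")
```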

GCP GKE LUNA NODE COST ESTIMATE EXPERIMENTS

We ran the GCP GKE node cost estimate experiments using Luna v1.3.3 on a regional GKE 1.33.3 cluster in us-central1.  The results for on-demand pricing are given in Table 3.  This baseline workload would cost ~$613/week.

Again, it is helpful to see the potential costs of scaling up the workload.  Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the lower price of the g2-standard-24 type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth exploring.  The third scale row shows that upgrading the GPU SKU to the A100 would be expensive: each 1-GPU A100 worker lands on an a2-highgpu-2g node at $7.3390/hr, an effective per-GPU cost far above the L4's $1.7343/hr (unlike on EKS), so unless the A100 provides much better performance, switching to it is not economical.
Configuration | Head Node Type | Head Node $/hr | Worker Node Type | Worker Node $/hr | Worker Node GPU SKU | Worker Node Count | Total Cost $/hr | Ratio over baseline
Baseline: 2 1-GPU workers | e2-highmem-4 | 0.1808 | g2-standard-32 | 1.7343 | 1x L4 | 2 | 3.6494 | 1.00
Scale: 4 1-GPU workers | e2-highmem-4 | 0.1808 | g2-standard-32 | 1.7343 | 1x L4 | 4 | 7.1180 | 1.95
Scale: 2 2-GPU workers | e2-highmem-4 | 0.1808 | g2-standard-24 | 2.0080 | 2x L4 | 2 | 4.1968 | 1.15
Scale: 2 1-GPU A100 workers | e2-highmem-4 | 0.1808 | a2-highgpu-2g | 7.3390 | 2x A100 | 2 | 14.8588 | 4.07
Table 3: Luna On-Demand Node Cost Estimate Experiments run on GKE 1.33 cluster

AZURE AKS LUNA NODE COST ESTIMATE EXPERIMENTS

We ran the Azure AKS node cost estimate experiments using Luna v1.3.3 on an AKS 1.32.6 cluster in eastus. The results for on-demand pricing are given in Table 4.  This baseline workload would cost ~$1113/week.

Again, it is helpful to see the potential costs of scaling the workload.  Both the first and second "Scale" configuration rows include 4 A10 GPUs, and their pricing is comparable, unlike the case on EKS and GKE.  The third row shows that upgrading the GPU SKU to the A100 would not be very expensive, making workload scaling performance evaluation for that config worth pursuing.
Configuration | Head Node Type | Head Node $/hr | Worker Node Type | Worker Node $/hr | Worker Node GPU SKU | Worker Node Count | Total Cost $/hr | Ratio over baseline
Baseline: 2 1-GPU workers | E4as_v5 | 0.2260 | NV36ads_A10_v5 | 3.2000 | 1x A10 | 2 | 6.6260 | 1.00
Scale: 4 1-GPU workers | E4as_v5 | 0.2260 | NV36ads_A10_v5 | 3.2000 | 1x A10 | 4 | 13.0260 | 1.97
Scale: 2 2-GPU workers | E4as_v5 | 0.2260 | NV72ads_A10_v5 | 6.5200 | 2x A10 | 2 | 13.2660 | 2.00
Scale: 2 1-GPU A100 workers | E4as_v5 | 0.2260 | NC24ads_A100_v4 | 3.6730 | 1x A100 | 2 | 7.5720 | 1.14
Table 4: Luna On-Demand Node Cost Estimate Experiments run on AKS 1.32 cluster

SUMMARY

In this blog, we've described the cost estimation feature in the Luna Smart Cluster Autoscaler and shown how it can be used to avoid cloud sticker shock.  We've discussed how it can guide cost-aware workload configuration when considering future workload scale increases, with large differences between scale strategies observed across cloud vendors.  In an upcoming blog, we'll describe how the Luna cost estimation feature can be used with the Nova multi-cluster manager to choose the K8s cluster on which to run an AI workload at the lowest price.

Have you experienced cloud sticker shock?  Do you have ways you'd like to use estimated node pricing for workload resource planning activities?  Please try Luna and let us know how it goes!  A free trial download version is available here.


Author:
Anne Holler (Chief Scientist, Elotl)

