<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[Elotl - Blog]]></title><link><![CDATA[https://www.elotl.co/blog]]></link><description><![CDATA[Blog]]></description><pubDate>Thu, 09 Apr 2026 09:14:54 -0700</pubDate><generator>Weebly</generator><item><title><![CDATA[Thrifty-Nova: Cost-Ordered AI Workload Placement for Multi-Cluster K8s with Autoscaled Cloud Clusters]]></title><link><![CDATA[https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters]]></link><comments><![CDATA[https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters#comments]]></comments><pubDate>Tue, 18 Nov 2025 14:49:08 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters</guid><description><![CDATA[ABSTRACTIn a multi-cluster Kubernetes (K8s) environment, when there are insufficient statically-allocated free cluster resources to schedule a workload, an autoscaled cloud cluster can be used to obtain the resources needed to run the workload.&nbsp; Selecting, among your autoscaled cloud clusters, the one that can obtain those resources at the lowest estimated price is desirable, particularly for AI workloads requiring GPUs, since cloud GPU supply can be limited and costs can be high and can vary [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">ABSTRACT</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters.png?1763477528" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">In a multi-cluster <a href="https://kubernetes.io/"><u>Kubernetes</u></a> (K8s) environment, when there are insufficient statically-allocated free cluster resources to schedule a workload, an autoscaled cloud cluster can be used to obtain the resources needed to run the workload.&nbsp; Selecting, among your autoscaled cloud clusters, the one that can obtain those resources at the lowest estimated price is desirable, particularly for AI workloads requiring GPUs, since cloud GPU supply can be limited and costs can be high and can vary greatly across vendors.<br><br>In this blog, we present Thrifty-Nova, a tool for performing cost-ordered workload placement on autoscaled cloud clusters.&nbsp; Thrifty-Nova leverages the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> fleet manager's policy-driven multi-cluster scheduling and the <a href="https://www.elotl.co/luna.html"><u>Luna</u></a> Smart cluster autoscaler's node cost estimate feature to create a Nova placement policy that is customized to the workload with respect to relevant cloud resource availability and price.&nbsp; We give several examples of Thrifty-Nova usage that show the 
value of automating workload cluster selection in cost-priority order, given the impact of workload configuration and dynamic resource availability on successful placement.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title">INTRODUCTION<br></h2><div class="paragraph" style="text-align:left;">Nova manages a multi-cluster multi-cloud K8s fleet, scheduling K8s workloads on target clusters in accordance with scheduling policies and free capacity, as shown in Figure 1.&nbsp; Nova handles a variety of use-cases, including workload placement for resource availability or quality as presented <a href="https://youtu.be/sP3Oo8yT5xA"><u>here</u></a>, with optional cross-cluster placement as demonstrated, e.g., using <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> stretched networking, as covered in this three-part blog series (<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><u>blog1</u></a>, <a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases"><u>blog2</u></a>, <a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><u>blog3</u></a>); priority-based cluster selection allowing preferential workload placement on on-premise or reserved clusters as described <a href="https://youtu.be/nt2iq5hbssY"><u>here</u></a>; duplicate workload placement for common tooling or service continuity as discussed <a href="https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes"><u>here</u></a>; and workload migration for cluster maintenance or upgrade as illustrated <a href="https://youtu.be/SiAoPbKnooU"><u>here</u></a>.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters-intro-image.png?1763477656" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 1: Nova Multi-Cluster Fleet Manager</div></div></div><div class="paragraph" style="text-align:left;">Nova interoperates with cloud cluster autoscalers, including the K8s Cluster Autoscaler and the Luna Smart cluster autoscaler.&nbsp; If no workload cluster that meets a schedule group's policy has sufficient free capacity for the group, Nova places the group on an autoscaled cluster that meets the policy, with the expectation that the autoscaler will add the needed capacity, as discussed <a href="https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads"><u>here</u></a>.&nbsp; Luna was recently updated to provide node cost estimation for pods.&nbsp; As described <a href="https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock"><u>here</u></a>, for Luna-managed pods whose scheduling readiness is blocked by the <em>nodecostestimate</em> K8s scheduling gate, Luna reports a pod event that indicates the node type it would allocate were the pod schedulable, with the type's estimated hourly compute cost.&nbsp; Thrifty-Nova, leveraging the capabilities of Nova and Luna, dynamically creates a Nova 
cluster-priority group policy that directs Nova to select the cluster that can run a workload at the lowest estimated price.<br></div><h2 class="wsite-content-title"><font size="5">THRIFTY-NOVA OPERATION</font><br></h2><div class="paragraph" style="text-align:left;">Given a workload to be run at the lowest price, Thrifty-Nova determines the per-cluster workload cost estimates using Nova and Luna.&nbsp; Thrifty-Nova then creates a Nova policy for cost-ordered placement and deploys the workload using that policy.<br><br>To determine the per-cluster workload cost estimates using Nova and Luna, Thrifty-Nova does the following:<br><br><ul><li>Deploys a <em>nodecostestimate</em> schedule-gated version of the workload using a Nova spread/duplicate policy.</li><li>Gathers <em>NodeCostEstimate</em> events for the workload pods running on Luna-enabled clusters and sums them.</li><li>Treats statically-allocated clusters as 0 cost and autoscaled clusters not reporting estimates as max cost.</li><li>Undeploys the schedule-gated version of the workload and the associated spread/duplicate policy.</li></ul><br>Note that the Luna <em>NodeCostEstimate</em> event will indicate if Luna would not currently expect to obtain a pod's needed resources, e.g., due to stock-out or quota backoffs; Thrifty-Nova treats any such clusters as having max cost.&nbsp; Also note that when Luna estimates the cost of a node to host a pod, it does so based on the information it has at that point.&nbsp; When Luna actually allocates a node for the pod, it may allocate a more expensive node type (if the node type used for its estimate is not available) or a less expensive node type (if Luna considered the node type unavailable at the time of its estimate).&nbsp; The cost of a node Luna will allocate for a pod can be capped by annotating the pod with <em>node.elotl.co/instance-max-cost</em> set to the maximum allowed cost.<br><br>To create a Nova policy for cost-ordered placement and deploy a workload using that policy, Thrifty-Nova does the following:<br><br><ul><li>Creates a Nova cluster-priority group policy, with the clusters specified in ascending cost order.</li><li>Deploys a non-schedule-gated version of the workload using that policy.</li></ul><br>Based on the policy, the Nova control plane will gang-schedule the workload on the first cluster on which the workload appears to fit.&nbsp; If the workload doesn't fit on a statically-allocated cluster, Nova will choose the first autoscaled cluster in the list.&nbsp; If a Luna autoscaled cluster cannot obtain the resources to run a pod, it reports a <em>NodeAddRequestWarning</em> event.&nbsp; Nova detects that pod event and retries the group placement on the next cluster in the priority list.&nbsp; Note that Nova retries the clusters in the priority list in round-robin fashion, meaning that a Luna cluster that reported a warning could eventually be retried if no other cluster is able to host the workload.<br><br>The Thrifty-Nova tool script is <a href="https://github.com/elotl/try-nova/blob/main/thrifty-nova/cost-schedule.sh"><u>here</u></a>.&nbsp; Its arguments are the path to a local <a href="https://github.com/elotl/try-nova"><u>try-nova</u></a> repo clone, both the schedule-gated and non-gated workload YAMLs, the namespace to use for workload policy and deployment, and the label key and value that select workload objects for Nova group placement.&nbsp; To try this out, you'll need to install the Nova control plane on a host K8s cluster and the Nova agent on each of the workload clusters; Nova installation 
instructions are <a href="https://docs.elotl.co/nova/installation_novactl/"><u>here</u></a>.&nbsp; You'll also need to ensure that the namespace being used for the workload policy is available on all of the workload clusters; an example Nova spread/duplicate policy can be found <a href="https://github.com/elotl/skyray/blob/main/policies/nspolicy.yaml"><u>here</u></a>, which Nova could apply to the namespace deployment <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/namespace.yaml"><u>here</u></a>.<br></div><h2 class="wsite-content-title"><font size="5">THRIFTY-NOVA EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">The Thrifty-Nova experiments were run using Nova v1.3.12 for the clusters listed in Table 1.&nbsp; The Luna clusters used Luna v1.4.0.<br></div><div><div id="395085323680711672" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Nova Cluster Role</th><th style="width: 17%;">Cluster Name</th><th style="width: 17%;">Cloud K8s</th><th style="width: 17%;">K8s Version</th><th style="width: 17%;">Location</th><th style="width: 17%;">Resource Allocation</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Control Plane</td><td>control-plane-host4</td><td>GKE</td><td>1.33</td><td>us-central1</td><td>static</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>static-gke</td><td>GKE</td><td>1.33</td><td>us-central1</td><td>static</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-gke-a</td><td>GKE</td><td>1.33</td><td>us-central1-a</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-gke-f</td><td>GKE</td><td>1.33</td><td>us-central1-f</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-aks</td><td>AKS</td><td>1.32</td><td>eastus</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-eks</td><td>EKS</td><td>1.33</td><td>us-west-2</td><td>dynamic via Luna</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 1: Clusters used in Thrifty-Nova Experiments<br></div><div class="paragraph" style="text-align:left;">The workload for the experiments is LLM model serving via a <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html#kuberay-rayservice-quickstart"><u>RayService</u></a> deployment running on Nova's SkyRay platform.&nbsp; SkyRay, presented <a href="https://www.youtube.com/watch?v=JyRZApYsci4"><u>here</u></a> and documented <a href="https://docs.elotl.co/nova/Concepts/sky-ray/"><u>here</u></a>, requires Nova spread/duplicate scheduling of KubeRay to all workload clusters to which a Ray object may be placed; a simple approach is to place it on all clusters. 
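</div><div class="paragraph" style="text-align:left;">To make the cost-estimation step concrete: the schedule-gated workload configs used in these experiments differ from their non-gated counterparts essentially only in the pod template, where the <em>nodecostestimate</em> gate blocks scheduling readiness so that Luna reports a <em>NodeCostEstimate</em> event rather than allocating a node.&nbsp; The following is a minimal sketch of such a gated pod; the pod name, image, resource sizes, and cost-cap value are illustrative, while the gate name and annotation key are those described above:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: schedule-gated (cost-probe) pod; names and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: cost-probe-worker
  annotations:
    # Optional cap on the hourly cost of any node Luna allocates for this pod:
    node.elotl.co/instance-max-cost: "2.50"
spec:
  schedulingGates:
  - name: nodecostestimate   # blocks scheduling readiness; Luna emits NodeCostEstimate
  containers:
  - name: worker
    image: rayproject/ray:2.9.0   # illustrative image
    resources:
      requests:
        cpu: "16"
        memory: 16Gi
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"</code></pre></div></div></div><div class="paragraph" style="text-align:left;">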
We used KubeRay 1.4.2.<br><br>The experiments used the model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><em><u>microsoft/Phi-3-mini-4k-instruct</u></em></a>, which runs efficiently on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S.&nbsp; The Luna option to specify the desired GPU SKUs was used for RayService worker pods; on Luna-enabled clusters, Luna ensured that the associated pods were placed on the lowest-cost available node types satisfying the GPU SKU constraint.&nbsp; To ensure placement on the desired GPU models on the static cluster, node affinity to GPU model labels on those nodes was used. The GKE NVIDIA daemonset automatically adds the node label&nbsp;<em>cloud.google.com/gke-accelerator</em> set to the GPU model from <a href="https://cloud.google.com/compute/docs/gpus#gpu-models"><u>this list</u></a>; that label is used in the following nodeAffinity setting, which works for both Luna and non-Luna clusters (the matchExpressions are ORed):<br></div><div><div id="464849646727259163" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.elotl.co/created-by
                operator: In
                values:
                - luna
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - &lt;GKE-model-name1&gt;
                ...
                - &lt;GKE-model-nameN&gt;</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Note that on non-GKE K8s clusters, NVIDIA GPU Feature Discovery in the k8s-device-plugin daemonset similarly sets the node label <em>nvidia.com/gpu.product</em> to the NVIDIA GPU product name derived from <a href="https://github.com/NVIDIA/k8s-device-plugin/blob/main/vendor/github.com/NVIDIA/go-nvlib/pkg/pciids/default_pci.ids"><u>this list</u></a>, so static clusters using GFD can use that key to specify the desired GPU model(s).<br></div><h2 class="wsite-content-title"><font size="4">Experiment 1: RayService with 2 mid-tier 1-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 1, Thrifty-Nova was requested to place the RayService comprising a 2-CPU 16GB CPU-only head and 2 16-CPU 16GB 1-NVIDIA-GPU workers, as per the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.yaml"><u>here</u></a>.&nbsp; Thrifty-Nova created a placement policy with the clusters in the priority order: static-gke, autoscale-gke-a, autoscale-eks, autoscale-aks, autoscale-gke-f, as per the cost estimates shown in Table 2.&nbsp; The static-gke cluster was first with 0 cost, since no additional cost would be incurred by placing the workload on that cluster.&nbsp; The autoscale-gke-f cluster was last at max cost, because us-central1-f did not have any capacity for the specified GPU SKUs.<br><br>When Nova ran placement with the created policy, static-gke had 2 1-GPU L4 nodes allocated and available, and hence had sufficient resources for the workload, so that placement succeeded.<br></div><div><div id="377828202851354508" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Selected</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>3.649</td><td>e2-highmem-4 (0.181)</td><td>2x g2-standard-32 (1.734)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>4.254</td><td>r5a.xlarge (0.226)</td><td>2x g6.8xlarge (2.014)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>6.626</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NV36ads_A10_v5 (3.200)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>max</td><td>e2-highmem-4 (0.181)</td><td>No NVIDIA GPUs for requested SKUs</td><td></td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 2: Per Cluster Estimated Workload Cost for Experiment 1<br></div><h2 class="wsite-content-title"><font size="4">Experiment 2: RayService with 2 mid-tier 2-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 2, the workload was specified to have 2 2-GPU workers rather than 2 1-GPU workers, with the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.2gpus.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.2gpus.yaml"><u>here</u></a>. Thrifty-Nova again created a placement policy that specified the clusters in the order: static-gke, autoscale-gke-a, autoscale-eks, autoscale-aks, autoscale-gke-f, as per the cost estimates shown in Table 3.<br><br>When Nova ran placement with the created policy, static-gke did not have any available 2-GPU resources, so Nova next attempted to place the workload on autoscale-gke-a.&nbsp; If Nova placement was run during off-peak hours, Luna was able to scale up autoscale-gke-a, so Nova placement there was successful.&nbsp; However, if Nova placement was run during peak hours, Luna encountered stock-out for all of the candidate GPU instances in that cluster, and Nova then tried placement of the workload on autoscale-eks, where Luna was able to allocate the resources.<br></div><div><div id="430505692479183925" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Insufficient 2-gpu resources</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>4.182</td><td>e2-highmem-4 (0.181)</td><td>2x g2-standard-24 (2.001)</td><td>Selected during off-peak; Stock out during peak</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>4.828</td><td>r5a.xlarge (0.226)</td><td>1x g6.12xlarge (4.602)</td><td>Selected during peak</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>13.266</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NV72ads_A10_v5 (6.520)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>max</td><td>e2-highmem-4 (0.181)</td><td>No NVIDIA GPUs for requested SKUs</td><td></td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 3: Per Cluster Estimated Workload Cost for Experiment 2<br></div><h2 class="wsite-content-title"><font size="4">Experiment 3: RayService with 2 A100 1-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 3, 2 1-GPU workers were specified to use the A100 GPU SKU rather than one of the mid-tier GPU SKUs previously listed, with the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.a100.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.a100.yaml"><u>here</u></a>.&nbsp; In this case, Thrifty-Nova created a placement policy that specified the clusters in the order: static-gke, autoscale-aks, autoscale-gke-a, autoscale-gke-f, autoscale-eks, as shown in Table 4.<br><br>Nova attempted placement on static-gke, autoscale-aks, autoscale-gke-a, and autoscale-gke-f, but there were no A100 instances in static-gke and Luna could not allocate A100-enabled instances on the AKS and GKE autoscaled clusters due to our accounts on those clouds having insufficient A100 quota.&nbsp; Nova next attempted placement of the workload to autoscale-eks, where Luna was able to allocate the resources.</div><div><div id="155103829605908635" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Insufficient A100 resources</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>7.572</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NC24ads_A100_v4 (3.673)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>14.859</td><td>e2-highmem-4 (0.181)</td><td>2x a2-highgpu-2g (7.339)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>14.859</td><td>e2-highmem-4 (0.181)</td><td>2x a2-highgpu-2g (7.339)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>22.183</td><td>r5a.xlarge (0.226)</td><td>1x p4d.24xlarge (21.958)</td><td>Selected</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 4: Per Cluster Estimated Workload Cost for Experiment 3<br></div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">We've presented Thrifty-Nova, a tool for performing cost-ordered workload placement on a mix of on-premise and cloud clusters managed by the Nova fleet manager, including cloud clusters running the Luna Smart autoscaler.&nbsp; Thrifty-Nova uses a Nova spread/duplicate policy to estimate workload costs via the Luna Smart autoscaler node cost estimate feature, and then creates a Nova cluster-priority group policy to perform workload placement in cluster cost order.&nbsp; We've shown examples of how using that policy allows the lowest-cost available resources to be allocated, leveraging the power of Nova and Luna while responding dynamically to capacity constraints, including cloud stock-out and quota issues.<br><br>Are you sensitive to cost and resource availability for your workloads, especially expensive AI workloads, when choosing among your on-premise, reserved, and autoscaled cloud K8s clusters?&nbsp; Thrifty-Nova is available as a simple shell script that you can use with free trial versions of <a href="https://www.elotl.co/nova-free-trial.html"><u>Nova</u></a> and <a href="https://www.elotl.co/luna-free-trial.html"><u>Luna</u></a>.&nbsp; We invite you to try Nova, Luna, and Thrifty-Nova, and to let us know how it goes!<br><br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 3: Rescheduling Ray AI Apps Between K8s Clusters for RayService Cluster Upgrade/Reconfigure]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure#comments]]></comments><pubDate>Sun, 02 Nov 2025 22:52:09 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure</guid><description><![CDATA[Abstract   In our blogs &ldquo;SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource 
and Time Efficiency&rdquo; and "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases", we discussed SuperSkyRay&rsquo;s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh as well as SuperSkyRay&rsquo;s non-disruptive handling of Ray apps that outgrow single-cluster placement via extending them t [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font><br></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126547" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In our blogs &ldquo;<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><font size="3">SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency</font></a>&rdquo; and "<a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases"><font size="3">SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases</font></a>", we discussed SuperSkyRay&rsquo;s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh as well as SuperSkyRay&rsquo;s non-disruptive handling of Ray apps that outgrow single-cluster placement via extending them to multi-cluster placement.<br /><br />In this blog, we consider SuperSkyRay&rsquo;s handling of KubeRay RayServices that outgrow the single <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> clusters hosting them during zero-downtime Ray cluster upgrade or reconfiguration.&nbsp; To support zero downtime (the default), the RayService keeps the current Ray cluster running while it brings up an additional Ray cluster with the new configuration; the upgrade or reconfiguration is incomplete until the new version of the Ray cluster is available. 
SuperSkyRay can reschedule a RayService deployed on a single cluster onto a different cluster to avoid the update stalling indefinitely when there are insufficient resources for a second RayCluster.&nbsp; While this relocation involves downtime, it is appropriate when time-to-update is critical and resources are limited.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">When any field in <em>spec.rayClusterConfig</em> of a running RayService is changed, KubeRay by default performs a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/rayservice.html#step-8-zero-downtime-upgrade-for-ray-clusters"><u>zero downtime upgrade</u></a> of the Ray cluster as follows.&nbsp; It keeps the current copy of the Ray cluster running to continue processing service requests while it deploys an additional version of the Ray cluster with the updates.&nbsp; Once the new version is fully ready, it switches the service to using the updated Ray cluster and removes the old Ray cluster.&nbsp; While this avoids service downtime, it requires that the K8s cluster hosting the RayService have sufficient resources to run two copies of the Ray cluster.&nbsp; When this is not possible, the service update remains incomplete for an indefinite period of time, which is undesirable.&nbsp; (RayService no-downtime upgrade can be disabled by setting ENABLE_ZERO_DOWNTIME to false, in which case cluster config changes do not trigger any upgrade operation, which can also be undesirable.)<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">When Nova detects that a schedule group running on a single cluster has pending pods, it looks to reschedule the group.&nbsp; If <strong>skip-capacity-relocate </strong>is not set, it will first look for an alternative single-cluster placement.&nbsp; When the group contains a RayService with a Ray cluster, it seeks an alternative single cluster that is sufficient for one copy of the Ray cluster, which works fine for the update case since the relocated RayService is restarted with only the most recent Ray cluster configuration.&nbsp; While this relocation will engender RayService downtime, it may be worthwhile to achieve the service update in a timely manner.<br /><br />Note that if the <strong>skip-capacity-relocate</strong> option is set, the RayService will not be relocated and the service update will remain incomplete until sufficient resources are available in the cluster.&nbsp; SuperSkyRay could be extended to perform cross-cluster placement of the new Ray cluster, while maintaining the existing Ray cluster on the current K8s cluster, but the ROI of adding this complexity is unclear; we note that KubeRay is moving to <a href="https://github.com/ray-project/kuberay/pull/3166"><u>no-downtime incremental upgrades</u></a>, which will reduce the resource requirements of updating RayService Ray clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay New Cluster Reschedule Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">SuperSkyRay&rsquo;s group rescheduling is triggered as in our previous blog "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases".&nbsp; In this case, however, since <strong>skip-capacity-relocate</strong> is unset, an alternative single-cluster placement is considered.&nbsp; When another placement is found, 
the manifests for objects in the scheduling group are removed from the old cluster and added to the new, and the workload is redeployed.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a group containing a RayService prediction service on an on-premise K8s cluster, as shown in Figure 1, using an AKS &ldquo;on-prem&rdquo; cluster for illustration.&nbsp; We then manually update the configuration of the Ray cluster in the service, leading KubeRay to create a second copy of the Ray cluster with the updated configuration in the service.&nbsp; This second copy does not fit on the on-premise K8s cluster, so the update is blocked.&nbsp; SuperSkyRay reschedules the group containing the RayService to the AKS &ldquo;cloud&rdquo; cluster where the updated service is deployed, as shown in Figure 2.&nbsp; Note we could optionally trigger a reschedule of the updated service back to the on-premise cluster, if desired.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-3-figure-1-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124276" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/editor/part-3-figure-2-superskyray-revised-schedule-for-updated-rayservice-to-run-on-cloud-cluster.png?1762124324" alt="Picture" style="width:696;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: SuperSkyRay revised schedule for updated RayService to run on cloud cluster</div> </div></div>  <div class="paragraph" style="text-align:left;">Appendix A contains the details for running this use case on AKS cloud K8s clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we explained how SuperSkyRay handles a Ray app that outgrows its original cluster after an upgrade or reconfiguration, by rescheduling the app to another K8s cluster to prevent updates from stalling due to insufficient resources.&nbsp; While this Ray app relocation involves downtime, it is appropriate when resources are limited and time-to-update is critical.<br /><br />Have you experienced RayService RayCluster updates blocking indefinitely due to insufficient resources to run a second copy of the RayCluster?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html" target="_blank">here</a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a 
href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable --multi-cluster-capacity, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that fits on one workload cluster, as described <a href="https://drive.google.com/file/d/1hkX815CgtlWtlfQIlp3h5HozJ8nA-CoM/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay will place the RayService on one workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li><li><em>Manually update the RayService cluster configuration to, e.g., increase the head memory limit.</em><ul><li><em>KubeRay will deploy an additional copy of the Ray cluster, which won&rsquo;t fit</em></li><li><em>SuperSkyRay will reschedule the RayService [updated] on the 2nd workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br /><strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases#comments]]></comments><pubDate>Sun, 02 Nov 2025 21:41:28 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases</guid><description><![CDATA[Abstract   In our previous blog,&nbsp;SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh.In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single Kubernetes (K8s) cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-c 
[...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font><br></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:184px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126362" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In our previous blog,&nbsp;<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><font size="3">SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency</font></a>, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh.<br /><br />In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app&rsquo;s resources without requiring any app relocation downtime.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">When SuperNova (Nova w/<strong>multi-cluster-capacity</strong> set) performs capacity-based scheduling of a K8s object group, it prefers to place the group on a single cluster if possible, since that choice is simpler in terms of management and networking than cross-cluster placement.&nbsp; If a group placed on a single cluster contains an app for which the worker count is later scaled up, the result may no longer fit on that cluster, e.g., because the cluster has reached its fixed size limit, as is the case of on-premise or cloud reserved-instance clusters. 
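</div><div class="paragraph" style="text-align:left;">As a concrete example of such a scale-up, the manual use case in Appendix A below increases two replica fields in the RayService spec.&nbsp; The following is a minimal sketch of just the changed fields; the deployment and worker-group names are illustrative, while the field paths and the <em>text_summarizer</em> application name are those used in Appendix A:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: the RayService fields bumped in Appendix A's manual scale-up.
# Raising these beyond the cluster's spare capacity makes the placed
# group outgrow its single-cluster placement.
spec:
  serveConfigV2: |
    applications:
    - name: text_summarizer
      deployments:
      - name: Summarizer      # illustrative deployment name
        num_replicas: 3       # increased to request an additional replica
  rayClusterConfig:
    workerGroupSpecs:
    - groupName: gpu-group    # illustrative worker group name
      replicas: 3             # increased to match the Serve replica count</code></pre></div></div></div><div class="paragraph" style="text-align:left;">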
When a group no longer fits on its cluster, SuperNova seeks to reschedule the group.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Focusing on the case where the Ray app worker count is scaled up, SuperSkyRay (SuperNova managing SkyRay) by default looks for another single cluster for the group, although relocating the group will involve downtime.&nbsp; However, if Nova is run with <strong>skip-capacity-relocate</strong>, which specifies not to relocate a capacity-based group from its current cluster solely to get more resources, or if there is no other single cluster that can run the group, SuperSkyRay considers dynamically expanding the single-cluster placement to a multi-cluster placement, leveraging its specialized knowledge about extending the Ray app&rsquo;s Ray cluster to span multiple K8s clusters.&nbsp; By expanding the running app to multi-cluster placement, the downtime that would be needed to relocate the app is avoided.&nbsp; During any subsequent Ray app scale-down, remote Ray workers, i.e., those placed on a K8s cluster not containing the Ray head, are preferentially removed.<br /><br />We present an example use case where a Ray online prediction service running on an on-premise K8s cluster is, due to increased query volume, scaled up and will no longer fit on the K8s cluster.&nbsp; SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no service downtime.&nbsp; We also present a similar second use case in which the Ray Serve autoscaler increases the number of Ray workers after the initial on-prem placement of the Ray cluster, again requiring the Ray cluster to span K8s clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Cross-Cluster Reschedule Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">This section assumes that the SuperSkyRay components are set up as described in our blog "SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency".&nbsp;<br /><br />For SuperSkyRay cross-cluster reschedule, SuperNova is run with <strong>skip-capacity-relocate </strong>to specify that Nova should not relocate a capacity-based group from its current cluster solely to get more resources. When a workload cluster Nova agent status controller detects that a group does not fit, it marks the schedule group for rescheduling by the Nova control plane. 
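</div><div class="paragraph" style="text-align:left;">For reference, the two Nova options named in this post are plain scheduler flags.&nbsp; The following is a hypothetical sketch of how they might appear in the Nova scheduler's pod spec; the container name is illustrative, and the actual installation steps are in the cheat-sheets linked in the appendices:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Hypothetical fragment; only the two flag names come from this post.
spec:
  containers:
  - name: nova-scheduler         # illustrative container name
    args:
    - --multi-cluster-capacity   # prefer single-cluster placement, else span clusters
    - --skip-capacity-relocate   # don't relocate a placed group solely for capacity</code></pre></div></div></div><div class="paragraph" style="text-align:left;">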
When the SuperSkyRay Nova control plane looks at rescheduling a group in this case, it considers dynamically updating the single-cluster placement to a multi-cluster placement.&nbsp; When the Nova control plane updates the Ray object schedule to multi-cluster placement, it modifies the scheduling data for the Ray app manifest in the workload cluster Nova scheduling configmap.<br /><br />The Nova agent schedule controller applies the modification to the running Ray app in the workload cluster and the Nova agent status controller detects the change.&nbsp; It then performs similar operations to those it does for initial cross-cluster Ray worker placement: it replaces each pending pod that should run on a different cluster with a placeholder pod and puts the pod manifest into the appropriate workload cluster Nova scheduling configmap.&nbsp; It also duplicates the Ray head service onto all clusters slated to run Ray workers so that the Ray cluster head service can leverage Cilium Cluster Mesh for cross-K8s workers.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Manually-Scaling Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 1, with AKS clusters standing in for &ldquo;on-prem&rdquo; and &ldquo;cloud&rdquo; clusters.&nbsp; The service is later manually scaled to add a worker, which does not fit on the &ldquo;on-prem&rdquo; cluster.&nbsp; SuperSkyRay with&nbsp;<strong>skip-capacity-relocate&nbsp;</strong>reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 2.&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-1-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124758" alt="Picture" style="width:672;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-2-superskyray-revised-schedule-for-rayservice-to-run-across-on-premise-and-cloud-cluster.png?1762124770" alt="Picture" style="width:658;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: SuperSkyRay revised schedule for RayService to run across on-premise and cloud cluster</div> </div></div>  <div class="paragraph">Appendix A contains the details for running this use case on AKS cloud K8s clusters.</div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Auto-Scaling Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 3 with AKS clusters standing in for &ldquo;on-prem&rdquo; and &ldquo;cloud&rdquo; clusters.&nbsp; The Ray cluster is configured with 0 workers initially.&nbsp; The Ray Serve autoscaler subsequently scales the Ray cluster to 2 GPU workers, only one of which will fit on the on-premise 
cluster.&nbsp; SuperSkyRay reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 4.&nbsp;&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-3-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124944" alt="Picture" style="width:627;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 3: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-4-superskyray-revised-schedule-for-rayservice-to-run-across-on-premise-and-cloud-cluster.png?1762124959" alt="Picture" style="width:692;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 4: SuperSkyRay revised schedule for RayService to run across on-premise and cloud cluster</div> </div></div>  <div class="paragraph">Appendix B contains the details for running this use case on AKS cloud K8s clusters.</div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we&rsquo;ve discussed how SuperSkyRay can non-disruptively handle KubeRay Ray apps that outgrow their single K8s cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app&rsquo;s resources without app relocation downtime.&nbsp; We&rsquo;ve presented two example use cases in which a Ray online prediction service running on an on-premise K8s cluster is scaled to add a worker that would not fit on its workload cluster.&nbsp; SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no application downtime.<br /><br /><strong><span style="color:rgb(54, 54, 54)">In a subsequent blog,&nbsp;</span><a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><font size="3">SuperSkyRay, Part 3</font></a><span style="color:rgb(54, 54, 54)">, we&rsquo;ll present SuperSkyRay&rsquo;s handling of RayService cluster upgrade/reconfigure by rescheduling Ray AI Apps to another cluster.</span></strong><br /><br />Do you have use cases where bursting your Ray workload dynamically across K8s clusters would save you money and/or time?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html" target="_blank">here</a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph" style="text-align:left;"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install 
Nova 1.3.11 (or later) on the clusters, enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><em><br /><br />Run Example Use Case</em><ul><li><em>Place a RayService that fits on one workload cluster, as described <a href="https://drive.google.com/file/d/1hkX815CgtlWtlfQIlp3h5HozJ8nA-CoM/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay places the RayService on one workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li><li><em>Manually increase the RayService to request an additional replica that won&rsquo;t fit; increase spec.serveConfigV2.applications.text_summarizer.deployments.num_replicas to 3 and spec.rayClusterConfig.workerGroupSpecs.replicas to 3</em><ul><li><em>SuperSkyRay spreads the existing RayService across the 2 workload clusters</em></li></ul></li><li><em>Manually decrease the RayService to restore the original replica count</em><ul><li><em>SuperSkyRay scales the existing RayService back down to 1 workload cluster</em></li></ul></li></ul><em><br />Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul></div>  <h2 class="wsite-content-title"><font size="4">Appendix B: Example Details</font></h2>  <div class="paragraph" style="text-align:left;"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Include 1 Standard_NV36ads_A10_v5 A10 GPU node in each cluster</em></li><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a>.</em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that initially fits on one workload cluster and then is scaled by Ray Serve to fit on 2 clusters, as described <a href="https://drive.google.com/file/d/1hBpEo6toC3zewAtcLiwTQY1kkS-rU5LX/view?usp=sharing"><u>here</u></a>.</em><ul><li><em>First SuperSkyRay places the RayService on one workload cluster</em></li><li><em>Then SuperSkyRay spreads the existing RayService across the 2 workload clusters</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a 
href="https://drive.google.com/file/d/1I-ZajWFcml6t9dxxnMgWGAF5Cb193Pqa/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/1ilbpiPXe9CjObCeZdmpW9dErsTjRyNdK/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br />&#8203;<strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency#comments]]></comments><pubDate>Sun, 02 Nov 2025 21:09:22 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency</guid><description><![CDATA[Abstract   This blog presents SuperSkyRay, a name we gave to supporting Ray app execution via KubeRay across Kubernetes (K8s) clusters running the Cilium Cluster Mesh multi-cluster datapath.&nbsp; SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; SuperSkyRay addresses the resource and time inefficiency that occurs when resources needed for Ray apps are fragmented across K8s clusters.   Introduction  Organiz [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:185px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126262" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">This blog presents SuperSkyRay, a name we gave to supporting <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> app execution via <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> across <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> clusters running the <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> multi-cluster datapath.&nbsp; SuperSkyRay uses the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; SuperSkyRay addresses the resource and time inefficiency that occurs when resources needed for Ray apps are fragmented across K8s clusters.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">Organizations using <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> to run the <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> ML platform on <a href="https://kubernetes.io/"><u>K8s</u></a> often have multiple clusters for reasons such as resource availability and cost, service continuity, geo-location, and quality of service.&nbsp; <a href="https://static.sched.com/hosted_files/colocatedeventsna2024/d1/AIDaySkyRay.pdf?_gl=1*1ca9326*_gcl_au*MTQ2ODc3NjAyOC4xNzUwOTUxNzgz"><u>SkyRay</u></a> reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven resource-aware scheduling of Ray apps onto K8s clusters.&nbsp; However, SkyRay does not address the inefficiency that occurs if the desired scale of a Ray app exceeds the spare capacity of any single cluster in the fleet, while at the same time the fleet has sufficient idle resources fragmented across clusters. 
In this case, the app runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed.&nbsp; This inefficiency could be addressed if the Ray app could be run across multiple K8s clusters.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">This blog presents SuperSkyRay, which supports <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> app execution via <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> across <a href="https://kubernetes.io/"><u>K8s</u></a> clusters running the <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> multi-cluster datapath.&nbsp; SuperSkyRay uses the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; We describe SuperSkyRay&rsquo;s components and placement operation and then give an example use case running a RayService for prediction across on-premise and cloud clusters.&nbsp; The example achieves better utilization and time-to-results than possible with single-cluster placement in the case that needed resources are fragmented.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Components</font><br></h2>  <h2 class="wsite-content-title"><font size="4">Ray, KubeRay</font></h2>  <div class="paragraph" style="text-align:left;"><a href="https://www.anyscale.com/glossary/what-is-ray"><u>Ray</u></a> is an open-source unified framework designed to simplify the development and scaling of distributed applications, particularly for AI workloads.&nbsp; Ray includes:<ul><li>Ray core: supplies primitives to simplify building and scaling distributed applications.</li><li>Ray AI libraries: support running a variety of distributed ML tasks.</li><li>Ray clusters: provide Ray workers connected to a Ray head for running Ray apps.<br /><br /></li></ul> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay</u></a> handles the creation, deletion, and scaling of Ray clusters, jobs, and services on a K8s cluster. The structure of KubeRay is shown in Figure 1. 
KubeRay supports three K8s Custom Resource Definitions:<br /><br /><ul><li>RayCluster<ul><li>For creating a Ray cluster with the specified resources and attributes.</li></ul></li><li>RayJob<ul><li>For creating a Ray cluster and submitting a job to it when the cluster is ready.</li><li>Can optionally delete the Ray cluster once the job finishes.</li><li>Often used for ML/AI training or batch prediction.</li></ul></li><li>RayService<ul><li>For creating a Ray cluster and running a Ray Serve deployment graph.</li><li>Offers zero-downtime upgrades, high availability, and <a href="https://docs.ray.io/en/latest/serve/autoscaling-guide.html#ray-serve-autoscaling"><u>Ray Serve autoscaling</u></a>.</li><li>Often used for ML/AI online serving.</li></ul></li></ul> KubeRay deployments can optionally also include the <a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#autoscaling"><u>Ray Autoscaler</u></a>, which automatically adds and removes worker nodes from a Ray cluster based on resource requests.</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-1-kuberay-structure.png?1762125087" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: KubeRay Structure</div> </div></div>  <h2 class="wsite-content-title"><font size="4">Nova, SkyRay</font><br></h2>  <div class="paragraph" style="text-align:left;"><a href="https://www.elotl.co/nova.html"><u>Nova</u></a> is a K8s workload fleet manager that schedules groups of K8s objects onto K8s workload clusters, according to policies and available capacity.&nbsp; Cluster selection can utilize cluster names, labels, attributes, priorities, and available capacity, and placement can handle single or duplicate workload group instances with optional customization per instance.&nbsp; A Nova workload group placed using an available-capacity policy is gang-scheduled, meaning no member is scheduled until the entire group can fit.&nbsp; We note that <a href="https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads"><u>Nova interoperates with cluster autoscalers</u></a>, including the K8s Cluster Autoscaler and <a href="https://www.elotl.co/luna.html">Luna</a>, and optionally supports just-in-time workload clusters, allowing K8s clusters to scale to 0 or be removed when idle and restored/recreated or cloned when needed.&nbsp; The structure of Nova is shown in Figure 2.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-2-nova-structure.png?1762125646" alt="Picture" style="width:684;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: Nova Structure</div> </div></div>  <div class="paragraph" style="text-align:left;">In 2024, we introduced <a href="https://static.sched.com/hosted_files/colocatedeventsna2024/d1/AIDaySkyRay.pdf?_gl=1*1ca9326*_gcl_au*MTQ2ODc3NjAyOC4xNzUwOTUxNzgz"><u>SkyRay</u></a> to extend KubeRay from a single K8s cluster to multi-cluster multi-cloud operation via interoperation with the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> policy-driven resource-aware fleet manager.&nbsp; Nova automatically 
selects each Ray app&rsquo;s target K8s cluster, on which KubeRay handles the app.&nbsp; To set up SkyRay, Nova is used with a spread/duplicate policy to deploy KubeRay and its CRDs onto all of its workload clusters, so each cluster is KubeRay-enabled.&nbsp; Then, whenever a KubeRay CR is submitted to Nova for placement, Nova applies the policy relevant to that CR to select a workload cluster, on which KubeRay deploys and monitors the associated Ray pods.&nbsp; We note that Nova recognizes the Ray CRs and can determine their resource needs, so Nova can do available-capacity placement of Ray objects.&nbsp; The structure of SkyRay is shown in Figure 3.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-3-skyray-structure.png?1762125314" alt="Picture" style="width:970;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 3: SkyRay Structure</div> </div></div>  <h2 class="wsite-content-title"><font size="4">Cluster Mesh, SuperSkyRay</font><br></h2>  <div class="paragraph" style="text-align:left;"><a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> joins multiple K8s clusters into a unified network, regardless of the K8s distribution or location of each cluster.&nbsp; Cluster Mesh can combine services running across K8s clusters, allowing service workers to be spread across clusters.&nbsp; To do this, Cluster Mesh requires that such services be marked with the annotation <em>service.cilium.io/global: "true"</em>.
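&nbsp; For example, a globally-joined Ray head service manifest might look like the following minimal sketch (the name, selector, and port here are illustrative assumptions, not taken from the example manifests):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Service
metadata:
  name: raycluster-head-svc            # illustrative name
  annotations:
    service.cilium.io/global: "true"   # lets Cluster Mesh merge this service across clusters
spec:
  type: NodePort                       # gives the head service an addressable IP/port
  selector:
    ray.io/node-type: head             # illustrative selector
  ports:
  - name: dashboard
    port: 8265
    targetPort: 8265
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">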
href="https://github.com/elotl/skyray/blob/main/supernova-examples/ray-job.text-summarizer.cpu.snova.1.3.2.yaml#L12"><u>here</u></a>.&nbsp; Also, the Ray cluster head service must have an IP; for recent KubeRay releases, either the ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE option must be set or the service must be configured to use (say) the NodePort type.</li><li>To allow SuperSkyRay to do the Ray worker updates needed for cross-cluster operation, the Ray autoscaler must be enabled (example <a href="https://github.com/elotl/skyray/blob/main/supernova-examples/ray-job.text-summarizer.cpu.snova.1.3.2.yaml#L13"><u>here</u></a>), even for fixed size Ray clusters.&nbsp; Enabling Ray autoscaler instructs KubeRay that Ray cluster worker nodes are externally managed so that KubeRay refrains from doing Ray worker node scaling operations.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Cross-Cluster Placement Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">SuperSkyRay cross-cluster placement operates as follows:<ul><li>When SuperNova chooses cross-cluster placement of a Ray app, the Nova control plane places the Ray object manifest into the Nova scheduling configmap for the K8s cluster on which the Ray cluster head is slated to run.</li><li>The workload cluster Nova agent schedule controller that monitors that Nova scheduling configmap then deploys the Ray object manifest onto its workload cluster.</li><li>The KubeRay instance on that cluster materializes the K8s deployments, services, and jobs associated with that Ray object.</li><li>The Nova agent status controller running on that cluster detects Ray cluster worker pods that are pending in the cluster and that are intended to be scheduled on another cluster.&nbsp; It replaces those pods with placeholder pods to satisfy KubeRay&rsquo;s Ray cluster goal state; without placeholder pods, KubeRay will not transition the Ray cluster to the ready state. 
It then places the manifests for those worker pods into the Nova scheduling configmap of the K8s cluster on which they were intended to run.</li><li>The Nova agent status controller also detects head services for Ray clusters with cross-cluster placement and duplicates the manifest of those services into the Nova scheduling configmap of the other clusters that will host Ray workers, as required by Cilium Cluster Mesh to combine cross-cluster services.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">An example SuperSkyRay use case involves running large-scale prediction across on-premise and cloud clusters for better utilization and time-to-results than single-cluster placement.&nbsp; This use case, called &ldquo;<a href="https://www.ciscolive.com/c/dam/r/ciscolive/global-event/docs/2025/pdf/CENDCN_1399.pdf"><u>AI Workload Cloud Bursting</u></a>&rdquo;, was presented at Cisco Live 2025.<br /><br />In Appendix A, we describe how to run a simplified version of this use case using only AKS cloud K8s clusters, for ease of trial.&nbsp; The outcome of the simplified placement is depicted in Figure 5.&nbsp; A demo of the scenario is available <a href="https://drive.google.com/file/d/1jUnLmHJuwqC5T6WLdTvpzMr7wJojfDmk/view?usp=drive_link"><u>here</u></a>.</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-5-superskyray-cross-cluster-rayservice-placement.png?1762125614" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 5: SuperSkyRay cross-cluster RayService placement</div> </div></div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we described the components and operation of SuperSkyRay.&nbsp; We presented an example use case it enables, which involves running a RayService for prediction across a fleet comprising an on-premise and a cloud K8s cluster.&nbsp; The Ray app doesn&rsquo;t fit on either K8s cluster, but can fit using the spare resources on both clusters.&nbsp; SuperSkyRay schedules it across the clusters, increasing utilization and reducing time-to-results relative to single-cluster placement.<br /><br /><strong>In subsequent blogs,&nbsp;<font size="3"><a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases">SuperSkyRay, Part 2</a>&nbsp;&amp;&nbsp;</font><a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><font size="3">SuperSkyRay, Part 3</font></a>, we&rsquo;ll present SuperSkyRay&rsquo;s handling of dynamic Ray app use cases, including scaling an online on-premise prediction service to add a cloud cluster worker without migration downtime, and bursting to another cluster to facilitate update of a running Ray service.</strong><br /><br />Do you have use cases where bursting your Ray workload across K8s clusters would save you money and/or time?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html"><u>here</u></a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 
class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later, as described in cheat-sheet <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable --multi-cluster-capacity, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that won't fit on one workload cluster, but does fit on 2, as described in cheat-sheet <a href="https://drive.google.com/file/d/1x9-cHSlUx7LCRSSEE32Wo4kFDil_Zesf/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay will spread the RayService across the 2 workload clusters</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br /><strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[Avoiding AI Workload Cloud Sticker Shock]]></title><link><![CDATA[https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock]]></link><comments><![CDATA[https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock#comments]]></comments><pubDate>Thu, 25 Sep 2025 13:22:43 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock</guid><description><![CDATA[Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing ExpensesWhile running AI workloads on cloud K8s clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills.&nbsp; And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices.&nbsp; In this blog, we introduce the Luna Smart Clu [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="3">Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing Expenses</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/avoiding-ai-workload-cloud-sticker-shock.png?1758806723" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">While running AI workloads on cloud <a href="https://kubernetes.io/"><u>K8s</u></a> clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills.&nbsp; And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices.&nbsp; In this blog, we introduce the Luna Smart Cluster Autoscaler Cost Estimation feature for estimating the node cost of pods before they run.&nbsp; We show how Luna's node cost estimation feature avoids AI workload sticker shock and facilitates assessing strategies for AI workload scaling.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">INTRODUCTION</font><br></h2><div class="paragraph" style="text-align:left;">Kubernetes (K8s) cluster autoscalers can reduce cloud computing expenses by allocating nodes when needed and removing them when no longer needed.&nbsp; For expensive workloads like AI, getting an estimate of the hourly cost before the workload is scheduled can help prevent cloud sticker shock.&nbsp; Also, getting estimated costs helps in configuring the workload to optimize expenses when planning for future growth.&nbsp; Estimated costs can be used to assess the monetary impact of choices such as workload size, GPU SKU and/or instance family selection, and on-demand versus spot pricing.<br><br>The <a href="https://www.elotl.co/luna.html"><u>Luna Smart Autoscaler</u></a> for cloud K8s recently added support for providing node hourly cost estimation.&nbsp; For Luna-managed pods whose scheduling readiness is blocked by <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/"><u>K8s scheduling gates</u></a>, if the gates include <em>nodecostestimate</em>, Luna reports a pod event that indicates the node type it would allocate were the pod schedulable, with the type's estimated hourly compute cost.&nbsp;&nbsp;<br><br>In this blog, we present an overview of Luna's cost estimation feature.&nbsp; We next use the feature to preview the estimated baseline cost of an LLM serving workload running on <a href="https://aws.amazon.com/pm/eks/"><u>Amazon AWS EKS</u></a>, <a href="https://cloud.google.com/kubernetes-engine?hl=en"><u>Google GCP GKE</u></a>, and <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>Microsoft Azure AKS</u></a> cloud K8s clusters.&nbsp; We discuss how cost estimation can be used to guide tuning the costs of 
scaling the workload as its usage increases, with the clouds showing significant cost differences for potential workload scaling strategies.&nbsp; We show estimated on-demand costs for EKS, GKE, and AKS, as well as estimated spot costs for EKS.&nbsp; Note that the estimated costs that Luna reports are public prices, and do not reflect customer discounts or special pricing.<br></div><h2 class="wsite-content-title"><font size="5">OVERVIEW OF LUNA COST ESTIMATION</font><br></h2><div class="paragraph" style="text-align:left;">The Luna Smart Autoscaler allocates nodes for pending pods marked for Luna management.&nbsp; As shown in Figure 1, Luna node allocation supports both bin-packing, in which nodes are allocated to host multiple small generic pods, and bin-selection, in which nodes are allocated to host larger pods or pods with special requirements.&nbsp; Luna chooses the lowest-cost node type that satisfies the pod's resource requests and node type selection constraints, if any.&nbsp; Luna supports a variety of selection constraints, including on instance type (include/exclude instance family, match regular expression), maximum instance cost, GPU SKUs, maximum GPU count, and pricing category (on-demand, spot, or either).&nbsp; Also, if Luna encounters a transient problem when allocating a node type in a pricing category, e.g., cloud capacity stock out, cloud account quota exhausted, node scale-up time limit exceeded, etc., it backs off from allocating that node type and pricing category combination for a configurable period, and proceeds to try to allocate the next cheapest node type.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-diagram.png?1758828865" alt="Picture" style="width:560;max-width:100%"></a><div style="display:block;font-size:90%">Figure 1: Luna dynamic node allocation using bin-packing and bin-selection</div></div></div><div class="paragraph" style="text-align:left;">K8s support for Pod Scheduling Readiness controlled by <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/"><u>schedulingGates</u></a> became stable in v1.30.&nbsp; When a pod has schedulingGates, it is not considered for placement by KubeScheduler or any K8s cluster autoscalers (including Luna), until/unless its schedulingGates are removed.&nbsp; Luna was recently updated to recognize the <em>nodecostestimate</em> scheduling gate; for example:<br></div><div><div id="456161409316250927" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: busyboxbp
  labels:
    elotl-luna: "true"
spec:
  schedulingGates:
  - name: "nodecostestimate"
  containers:
  - name: busyboxbp
  &lt;snip&gt;
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">When a pod marked for Luna management includes the <em>nodecostestimate</em> scheduling gate, Luna determines the node type it would choose if that pod were not currently gated, and reports that type, its cost, and the count of nodes of that type Luna would allocate for the set of matching gated pods, in a NodeCostEstimate pod event.&nbsp; Figure 2 shows an event for a pod in a set of 3 small bin-packed pods, which Luna expects to run together on a single node.&nbsp; Figure 3 gives an event for a pod in a set of 3 bin-select pods, which Luna expects to run on 3 separate nodes.&nbsp; Run <a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/listNodeCostEstimateEvents.sh"><u>this script</u></a> to get all NodeCostEstimate pod events.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/avoiding-ai-workload-cloud-sticker-shock-figure-2_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 2: NodeCostEstimate Pod Event reported for deployment of 3 bin-packed pods on GKE</div></div></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/avoiding-ai-workload-cloud-sticker-shock-figure-3_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 3: NodeCostEstimate Pod Event reported for deployment of 3 bin-selected pods on GKE</div></div></div>
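<div class="paragraph" style="text-align:left;">Once the estimate looks acceptable, the pod can be released for scheduling by removing its gate.&nbsp; One way to do that (a sketch, using the busyboxbp pod from the example above) is a JSON patch:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Remove the schedulingGates field so that KubeScheduler and Luna consider the pod
kubectl patch pod busyboxbp --type=json \
  -p='[{"op": "remove", "path": "/spec/schedulingGates"}]'
</code></pre></div></div></div></div>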
<div class="paragraph" style="text-align:left;">To control the event-reporting overhead for the cost estimate, Luna only generates and reports a cost estimate pod event for pods not already having such an event.&nbsp; A new pod cost estimate event is generated if the existing event is removed, e.g., due to retention policy (pod events are retained for 1 hour by default) or to explicit deletion.<br><br>Luna's node cost estimate may over- or under-shoot the actual cost if a pod's schedulingGates are removed and the pod is scheduled for execution.&nbsp; The estimate does not take into account that the pod might be able to share an existing running node, with either bin-packing or bin-selection with node-reuse enabled (the default).&nbsp; For these cases, KubeScheduler would handle pod placement and the pod would not need node allocation by Luna.&nbsp; Also, the estimate does not take into account that node type availability at scheduling time may differ from that at estimation time.&nbsp; If any Luna node type back-offs were in effect at estimation time, but are no longer in effect at scheduling time, cheaper node types may be selected.&nbsp; If some node type back-offs were not in effect at estimation time, but are triggered at scheduling time, more expensive node types may be chosen.&nbsp; Note that in general Luna supports capping the cost of a node allocated for bin-selection via the pod annotation <em>node.elotl.co/instance-max-cost</em>.</div>
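<div class="paragraph" style="text-align:left;">For illustration, a minimal sketch of a pod carrying that cost-cap annotation (the pod name and the dollar value are assumptions; see the Luna docs for the exact value format):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: capped-gpu-pod                         # illustrative name
  labels:
    elotl-luna: "true"                         # mark the pod for Luna management
  annotations:
    node.elotl.co/instance-max-cost: "5.00"    # assumption: hourly cost cap in USD
</code></pre></div></div></div></div>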
<h2 class="wsite-content-title"><font size="5">USING LUNA COST ESTIMATION TO ASSESS LLM SERVING CONFIGURATIONS</font><br></h2><div class="paragraph" style="text-align:left;">As an AI workload example, we consider the placement of a <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> 1.4.2 <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html"><u>RayService</u></a> serving an LLM model.&nbsp; We use the model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><em><u>microsoft/Phi-3-mini-4k-instruct</u></em></a>, which runs successfully on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S.&nbsp; The baseline workload config is given <a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml"><u>here</u></a>, comprising a CPU-only head requesting 2 CPUs and 16 GB memory, and 2 GPU-enabled workers, each requesting 16 CPUs, 16 GB memory, and 1 NVIDIA GPU.&nbsp; Given the pods' resource requirements, Luna assigns a node for each pod (bin-selection); it is also possible to configure Luna to assign multiple GPU pods per node (bin-packing) for this case.<br><br>We examine Luna's estimated costs for the baseline configuration on EKS, GKE, and AKS cloud K8s clusters to illustrate the value of getting that visibility before running the workload.&nbsp; We then consider the costs of several strategies for scaling up the workload's processing capacity, i.e., increasing the worker count, or maintaining the same worker count while either increasing each worker's GPU count or allocating a more powerful GPU device.&nbsp; These costs can be used to guide workload scaling performance evaluation testing.&nbsp; We observe that these strategies have significantly different costs across clouds.<br><br>We note that all node cost estimate experiments were run without any Luna resource availability back-offs in effect, meaning that the estimates assume sufficient cloud stock and user quota for the selected node types.&nbsp; While availability issues can occur, particularly for popular instance types, obtaining cost estimates that represent the preferred node types is useful since it can steer region choice and quota setting in accordance with acquiring those node types.<br></div><h2 class="wsite-content-title"><font size="5">AWS EKS LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the AWS node cost estimate experiments using Luna v1.3.3 on an EKS 1.33 cluster in <em>us-west-2</em>.&nbsp; The results for on-demand pricing are given in Table 1, with links given to associated yaml configurations.&nbsp; It is useful to see the baseline costs in advance to avoid sticker shock; this baseline workload would cost ~$715/week.<br><br>Also, it is helpful to see the potential costs of scaling up the workload.&nbsp; Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the relatively low price of the <em>g6.12xlarge</em> type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth exploring.&nbsp; The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, but that cost reflects that the A100 is only available in instances with 8x GPUs.&nbsp; The per-GPU cost of the A100 is $2.8012/hr, which is ~40% ($2.8012/$2.0144) higher than the L4, so if the workload scale can use all 8 GPUs, the config is worth considering, given A100's faster floating point and larger memory (40 vs 24 GB).<br><br></div><div><div id="428541626203596083" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr 
style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.8xlarge</td><td>2.0144</td><td>1x L4</td><td>2</td><td>4.2548</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.8xlarge</td><td>2.0144</td><td>1x L4</td><td>4</td><td>8.2836</td><td>1.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.12xlarge</td><td>4.6016</td><td>4x L4</td><td>1</td><td>4.8276</td><td>1.14</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>p4d.24xlarge</td><td>22.1836</td><td>8x A100</td><td>1</td><td>22.4096</td><td>5.27</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 1: Luna On-Demand Node Cost Estimate Experiments run on EKS 1.33 cluster<br></div><div class="paragraph" style="text-align:left;">We repeated the node cost estimate experiments using spot pricing.&nbsp; The results are given in Table 2, with the "Ratio over baseline" compared to the baseline value in Table 1, to facilitate comparing spot with on-demand prices.&nbsp; We ran with the Luna <a href="https://docs.elotl.co/luna/Configuration/#use-of-spot-instance-advisor-on-aws-eks"><em><u>aws.useSpotAdvisor</u></em></a> option set true, meaning that Luna used the <a href="https://aws.amazon.com/ec2/spot/instance-advisor/"><u>AWS spot instance advisor</u></a> data to estimate spot prices. 
Spot instance advisor provides the average spot discount for the region and instance type over the last 30 days, and also includes the average frequency of spot reclamation interruptions, which can be used to constrain Luna spot node type selection.<br><br>The spot prices in Table 2 are roughly half of the on-demand prices in Table 1, which is nice.&nbsp; However, the spot advisor data (viewable via the AWS tool link in the previous paragraph or in Luna verbose logs) indicates that all 3 GPU-enabled node types are in the highest frequency interruption bucket, meaning a 20%+ risk of node reclamation during use.&nbsp; When configured to use spot advisor data, Luna supports the <em>aws.maxSpotInterruptBucket</em> option to constrain spot selection by maximum spot interrupt bucket for managing risk, and the <em>aws.maxSpotPriceRatio</em> option to constrain spot selection for ensuring sufficient savings; both constraints apply whether Luna is estimating node costs or allocating nodes.<br></div><div><div id="929439362753681425" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.spot.yaml">Spot: 2 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.8xlarge</td><td>0.9871</td><td>1x L4</td><td>2</td><td>2.0601</td><td>0.48</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.spot.yaml">Spot Scale: 4 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.8xlarge</td><td>0.9871</td><td>1x L4</td><td>4</td><td>4.0343</td><td>0.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.spot.yaml">Spot Scale: 2 2-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.12xlarge</td><td>2.2548</td><td>4x L4</td><td>1</td><td>2.3407</td><td>0.55</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.spot.yaml">Spot Scale: 2 1-GPU A100 workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>p4d.24xlarge</td><td>9.6614</td><td>8x A100</td><td>1</td><td>9.7473</td><td>2.29</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 2: Luna Spot Node Cost Estimate Experiments run on EKS 1.33 cluster<br></div>
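<div class="paragraph" style="text-align:left;">To make the spot setup concrete: the estimates above are the on-demand prices discounted by the spot advisor's reported average discount; for example, g6.8xlarge at $2.0144/hr on-demand and $0.9871/hr spot implies a ~51% discount.&nbsp; Below is a hedged sketch of the three Luna options named above, written as Helm-style values (the constraint values shown are assumptions; consult the linked Luna configuration docs for exact formats):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">aws:
  useSpotAdvisor: true          # estimate spot prices from AWS spot instance advisor data
  maxSpotInterruptBucket: 2     # assumption: cap on the acceptable interruption-frequency bucket
  maxSpotPriceRatio: 0.6        # assumption: only choose spot when &lt;= 60% of the on-demand price
</code></pre></div></div></div></div>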
<h2 class="wsite-content-title"><font size="5">GCP GKE LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the GCP GKE node cost estimate experiments using Luna v1.3.3 on a regional GKE 1.33.3 cluster in <em>us-central1</em>.&nbsp; The results for on-demand pricing are given in Table 3.&nbsp; This baseline workload would cost ~$613/week.<br><br>Again, it is also helpful to see the potential costs of scaling up the workload.&nbsp; Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the lower price for the <em>g2-standard-24</em> type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth checking.&nbsp; The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, with the A100 per-GPU cost of 7.3390 being significantly higher than the L4 per-GPU cost of 1.7343 (unlike on EKS), so unless the A100 provides much better performance, switching to it is not economical.<br></div><div><div id="870175329170061983" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-32</td><td>1.7343</td><td>1x L4</td><td>2</td><td>3.6494</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-32</td><td>1.7343</td><td>1x L4</td><td>4</td><td>7.1180</td><td>1.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-24</td><td>2.0008</td><td>2x L4</td><td>2</td><td>4.1968</td><td>1.15</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>a2-highgpu-2g</td><td>7.3390</td><td>1x A100</td><td>1</td><td>14.8588</td><td>4.07</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 3: Luna On-Demand Node Cost Estimate Experiments run on GKE 1.33 cluster<br></div><h2 class="wsite-content-title"><font size="5">AZURE AKS LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the Azure AKS node cost estimate experiments using Luna v1.3.3 on an AKS 1.32.6 cluster in <em>eastus</em>. 
The results for on-demand pricing are given in Table 4.&nbsp; This baseline workload would cost ~$1113/week.<br><br>Again, it is helpful to see the potential costs of scaling the workload.&nbsp; Both the first and second "Scale" configuration rows include 4 A10 GPUs, and the pricing is comparable, unlike the case on EKS and GKE.&nbsp; And the third row shows that upgrading the GPU SKU to the A100 would not be very expensive, so it is worth evaluating the workload's scaling performance for that config.<br></div><div><div id="508347011412675186" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV36ads_A10_v5</td><td>3.2000</td><td>1x A10</td><td>2</td><td>6.6260</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV36ads_A10_v5</td><td>3.2000</td><td>1x A10</td><td>4</td><td>13.0260</td><td>1.97</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV72ads_A10_v5</td><td>6.5200</td><td>2x A10</td><td>2</td><td>13.2660</td><td>2.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NC24ads_A100_v4</td><td>3.6730</td><td>1x A100</td><td>2</td><td>7.5720</td><td>1.14</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 4: Luna On-Demand Node Cost Estimate Experiments run on AKS 1.32 cluster<br></div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">In this blog, we've described the cost estimation feature in the Luna Smart Cluster Autoscaler and shown how it can be used to avoid cloud sticker shock.&nbsp; We've discussed how it can guide cost-aware workload configuration when considering future workload scale increases, with large differences between scale strategies observed across cloud vendors.&nbsp; In an upcoming blog, we'll describe how the Luna cost estimation feature can be used with the <a href="https://www.elotl.co/nova.html"><u>Nova multi-cluster manager</u></a> to choose the K8s cluster on which to run an AI workload at the lowest price.<br><br>Have you experienced cloud sticker shock?&nbsp; Do you have ways you'd like to use estimated node pricing for workload resource planning activities?&nbsp; Please try Luna and let us know how it goes!&nbsp; A 
free trial download version is available <a href="https://www.elotl.co/luna-free-trial.html">here</a>.<br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Elotl receives investment from Cisco Investments to accelerate AI-ready Infra for Multi-Cloud Era]]></title><link><![CDATA[https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era]]></link><comments><![CDATA[https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era#comments]]></comments><pubDate>Thu, 14 Aug 2025 14:12:03 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era</guid><description><![CDATA[We are excited to announce an investment from Cisco Investments to accelerate AI-ready Infra for Enterprise AI platform teams!AI software stacks have standardized on top of Kubernetes. Elotl&rsquo;s enterprise-grade battle-tested Luna provisions just-in-time right-sized compute for Kubernetes. Luna prevents wasted GPU spend for AI workloads along with simplifying operations.Enterprise AI must meet response time SLAs before going live. Since expensive accelerators like GPUs are in short supply, w [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span><span style="color:rgb(0, 0, 0)">We are excited to announce an investment from Cisco Investments to accelerate AI-ready Infra for Enterprise AI platform teams!</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">AI software stacks have standardized on top of Kubernetes. Elotl&rsquo;s enterprise-grade battle-tested </span><a href="https://www.elotl.co/luna.html"><span style="color:rgb(17, 85, 204)">Luna</span></a><span style="color:rgb(0, 0, 0)"> provisions just-in-time right-sized compute for Kubernetes. Luna prevents wasted GPU spend for AI workloads along with simplifying operations.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Enterprise AI must meet response time SLAs before going live. Since expensive accelerators like GPUs are in short supply, waiting to source compute from a single region/datacenter/hyperscaler/neocloud would jeopardize AI business SLAs. Kubernetes platform teams need to dynamically source compute from multiple regions and cloud providers to be AI ready. This calls for a federated compute fabric spanning across on-prem datacenters, hyperscalers, and neoclouds. </span><a href="https://www.elotl.co/nova.html"><span style="color:rgb(17, 85, 204)">Elotl Nova</span></a><span style="color:rgb(0, 0, 0)"> is a policy-driven federated compute fabric that commoditizes Kubernetes clusters across regions and cloud providers.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">As AI workloads scale, the need for robust, secure, and scalable networking becomes just as critical as compute. Through the acquisition of Isovalent in 2024, Cisco added the industry standard for Kubernetes networking and security, including technologies like Cilium and Tetragon, to its solutions for enterprise AI and cloud-native environments. 
These technologies are now foundational for enterprises running cloud-native and AI workloads on Kubernetes, providing the networking, security, and observability capabilities needed to support dynamic, distributed environments.<br /></span></span><br /><span><span style="color:rgb(0, 0, 0)">At Elotl, we&rsquo;re committed to helping enterprises focus on building AI solutions while we take care of infrastructure complexity. With Cisco&rsquo;s investment and the strength of its industry-leading technologies, organizations can accelerate innovation and confidently run AI across multi-cloud environments. Here is a demo of cloud bursting AI workloads from on-prem datacenter to Azure using Nova, Cilium Cluster Mesh, and Hubble:</span></span></div>  <div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"> <div class="wsite-youtube-container">  <iframe src="//www.youtube.com/embed/7_dM35hViCA?wmode=opaque" frameborder="0" allowfullscreen></iframe> </div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">If you are interested in using Luna and/or Nova for your self-hosted training/inference/batch initiatives, please reach out at </span><a href="mailto:info@elotl.co"><span style="color:rgb(17, 85, 204)">info@elotl.co</span></a></span><br /><br /><strong style="color:rgb(54, 54, 54)">Author:&nbsp;</strong><span style="color:rgb(54, 54, 54)">Madhuri Yechuri</span><br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Right-Sizing Your Kubernetes Pods with a Custom VPA Tracker]]></title><link><![CDATA[https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker]]></link><comments><![CDATA[https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker#comments]]></comments><pubDate>Thu, 31 Jul 2025 18:16:19 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker</guid><description><![CDATA[The Kubernetes Vertical Pod Autoscaler (vpa) provides near-instantaneous recommendations for CPU and memory requests for a pod. It can be used either as a read-only or as a fully automated recommender, where pods are mutated with the recommended requests.&nbsp;When a cluster operator is considering whether or not to use VPA for a specific workload, it is helpful to simply monitor and visualize both VPA recommendations along with actual resource usage over a test period, before using it in an aut [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:215px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker.png?1753985915" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">The <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/#scaling-workloads-vertically"><u>Kubernetes Vertical Pod Autoscaler</u></a> (vpa) provides near-instantaneous recommendations for CPU and memory requests for a pod. It can be used either as a read-only or as a fully automated recommender, where pods are mutated with the recommended requests.&nbsp;<br><br>When a cluster operator is considering whether or not to use VPA for a specific workload, it is helpful to simply monitor and visualize both VPA recommendations along with actual resource usage over a test period, before using it in an automated fashion. In this blog, we illustrate how we can track VPA operation over such a test period using a popular open-source <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack"><u>monitoring and visualization stack for Kubernetes</u></a> (which includes Prometheus and Grafana).<br></div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="5">Motivation for VPA tracking</font><br></h2><div class="paragraph" style="text-align:left;">Kubernetes VPA can be used in two primary update modes: Off (read-only mode) and Auto (aka Recreate). In the Off mode, the VPA custom resource provides near-instantaneous recommendations for suitable values of CPU and memory requests for pods in various types of Kubernetes resources - such as deployments, jobs, daemonsets, etc. Workload administrators can use these recommendations to manually update pod requests. Given below is an example of CPU and memory recommendations within a VPA custom resource object.</div><div><!--BLOG_SUMMARY_END--></div><div><div id="112101802912229391" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Recommendation:
  Container Recommendations:
    Container Name:  workload-c...
    Target:
      Cpu:     587m
      Memory:  262144k...
</code></pre></div></div></div></div>
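<div class="paragraph" style="text-align:left;">For context, these recommendations appear under the Status of a VPA object like the following minimal sketch, here in read-only mode (the object name is illustrative; the target deployment follows this post&rsquo;s workload-c example):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: workload-c-vpa        # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: workload-c          # the sample workload used in this post
  updatePolicy:
    updateMode: "Off"         # read-only: recommend, but do not mutate pods
</code></pre></div></div></div></div>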
<div class="paragraph" style="text-align:left;">As a pod&rsquo;s resource usage changes, VPA recommendations also get updated based on resource utilization data. So, if a cluster administrator wanted to observe these recommendations over time and then use them to manually choose the right value for their pods, it is not possible to do so out-of-the-box with VPA. We would need a way to run a specific workload, managed by VPA, over a sufficient period of time, then use a monitoring tool like Prometheus to collect both (a) resource usage from the pod and (b) Target recommendations from the VPA object. A visualization tool like Grafana can then be used to visually inspect these values over the test period. At periodic intervals, the maximum recommendation from VPA can then be used to manually update a pod&rsquo;s manifest - which can then be redeployed via appropriate rolling update techniques on the cluster.&nbsp;<br><br>Let&rsquo;s look into each of the components needed for this VPA tracker and the steps involved in setting up the monitoring and visualization stack for this example workload.<br></div><h2 class="wsite-content-title"><font size="5">Workload and VPA object</font><br></h2><div class="paragraph" style="text-align:left;">A <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/quickstart.md#example-vpa-configuration"><u>VPA custom resource</u></a> object is needed for every Kubernetes resource that is to be managed by VPA. We create a sample workload and a VPA custom resource for this workload. The workload used in this blog post is available in this Github repo: <a href="https://github.com/elotl/vpa-tracker"><u>elotl/vpa-tracker</u></a>.<br></div><div><div id="741171038629971680" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f workload-c.yaml
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The workload uses the CPU stressor pod from this Github repo: <a href="https://hub.docker.com/r/narmidm/k8s-pod-cpu-stressor"><u>narmidm/k8s-pod-cpu-stressor</u></a>. It allows us to control the CPU usage of a deployment&rsquo;s pods via an input parameter in the deployment manifest.<br></div><h2 class="wsite-content-title"><font size="5">VPA metrics exporter</font><br></h2><div class="paragraph" style="text-align:left;">The VPA object makes available its resource recommendations in the object&rsquo;s Status field. We created a simple Python script to export metrics from all VPA custom resources in our cluster to a <strong>/metrics</strong> endpoint. This exporter is in this Elotl public <a href="https://github.com/elotl/vpa-tracker"><u>repo</u></a>.&nbsp; The VPA exporter consists of a Kubernetes deployment and service and can be deployed as follows. 
<h2 class="wsite-content-title"><font size="5">VPA metrics exporter</font><br></h2><div class="paragraph" style="text-align:left;">The VPA object makes available its resource recommendations in the object&rsquo;s Status field. We created a simple Python script to export metrics from all VPA custom resources in our cluster to a <strong>/metrics</strong> endpoint. This exporter is in this Elotl public <a href="https://github.com/elotl/vpa-tracker"><u>repo</u></a>.&nbsp; The VPA exporter consists of a Kubernetes deployment and service and can be deployed as follows. A <em>release=kube-prometheus-stack</em> label is applied to the exporter&rsquo;s ServiceMonitor so that the Prometheus instance installed by kube-prometheus-stack selects it for scraping.<br></div><div><div id="716695456780591887" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f vpa-tracker/vpa-metrics-exporter/vpa_exporter.yaml
kubectl port-forward svc/vpa-exporter 8080:8080
kubectl label servicemonitor vpa-exporter release=kube-prometheus-stack --overwrite
    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Monitoring of VPA metrics&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;">Any Kubernetes monitoring tool can be used to monitor workload resource usage and the VPA metrics. As an example, in this blog, we use these open-source tools:&nbsp;<ul><li><a href="https://github.com/kubernetes/kube-state-metrics"><u>kube-state-metrics</u></a> for exporting all Kubernetes resource metrics, such as CPU and memory usage</li><li><a href="https://prometheus.io/"><u>Prometheus</u></a> for scraping both usage and VPA metrics from their respective endpoints</li><li><a href="https://grafana.com/"><u>Grafana</u></a> for visualizing metrics via Dashboards</li></ul>The <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack"><u>kube-prometheus-stack</u></a> project is an easy way to install these three components.&nbsp;<br>Prometheus, when installed via the kube-prometheus-stack, by default scrapes all metrics collected by the kube-state-metrics tool. However, an additional configuration step is needed to scrape the new VPA metrics that are being exported by the VPA exporter described in the prior section. This is done by creating a ServiceMonitor <a href="https://github.com/elotl/vpa-tracker/blob/main/vpa-recommender-servicemonitor.yaml"><u>custom resource object</u></a> and exposing the needed Service.</div><div><div id="242219920433332855" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f vpa-recommender-servicemonitor.yaml
kubectl apply -f vpa-metrics-expose-svc.yaml
    </code></pre></div></div></div></div>
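<div class="paragraph" style="text-align:left;">Conceptually, the ServiceMonitor is just a label selector plus a scrape endpoint. A minimal sketch is given below; the service label, port name, and scrape interval here are assumptions, and the actual manifest lives in the repo linked above:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vpa-exporter
  labels:
    release: kube-prometheus-stack   # lets Prometheus select this monitor
spec:
  selector:
    matchLabels:
      app: vpa-exporter              # assumed label on the exporter Service
  endpoints:
  - port: metrics                    # assumed named port on the Service
    interval: 30s
    </code></pre></div></div></div></div>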
<h2 class="wsite-content-title"><font size="5">Visualization of VPA metrics</font><br></h2><div class="paragraph" style="text-align:left;"><strong>Target</strong> refers to the recommended values of CPU and memory requests for the workload. It corresponds to the 90th percentile (by default) of the decaying histogram of observed peak usage values.&nbsp; This percentile value can be configured using the flags --target-cpu-percentile and --target-memory-percentile when starting up the <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>vpa-recommender</u></a>.<br><br><strong>Uncapped Target</strong> refers to the recommended values of CPU and memory requests for a pod without taking into consideration the <strong>max allowed</strong> value in the Spec section of the VPA custom resource object.&nbsp;</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/final-1-vpa-tracker-scaleup-example-legend-noborder-title-preview-jpg.jpg?1753986806" alt="Picture" style="width:750;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">Let&rsquo;s look in detail at an example of the custom panel. In the graph above, at around 12pm, we increase the CPU usage of the CPU stressor pod from 120 millicores to 230 millicores. We do this by editing the deployment&rsquo;s <strong>cpu</strong> flag from a value of 0.1 to 0.2. We see that, at ~2:45pm, the VPA target recommendations (shown in yellow and green, overlapping in this case) increase to an appropriate value of ~260 millicores.<br></div><h2 class="wsite-content-title"><font size="5">Scale-up and Scale-down Response Times</font><br></h2><div class="paragraph" style="text-align:left;">By scale-up response time, we refer to the time taken for the VPA CPU target to envelop a step increase in CPU usage. In many practical use cases, the increase in CPU usage can also be gradual. For the sample workload above and default VPA configuration parameters, we see that the scale-up response time is approximately 2hr 45min.<br><br>Similarly, by scale-down response time, we refer to the time taken for the VPA&rsquo;s CPU target to respond to a step decrease in CPU usage. The scale-down response time for the sample workload and default parameters of the VPA recommender is ~3 days and is shown in the graph below.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/final-2-expt3-scaledown-vpa-tracker-scaleup-example-legend-noborder-title-preview-jpg.jpg?1753986921" alt="Picture" style="width:769;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">The key configuration parameter to the vpa-recommender that determines this response time is the <strong>cpu-histogram-decay-half-life</strong>. This value is the time duration after which the weight of each CPU/memory observation in the calculation of the target is halved. So the smaller this value, the faster the response times. Typically, we want a long-enough response time such that any transient or periodically repeating peaks and valleys in CPU usage will not influence the recommended target. Its default value is 24 hours; users can increase or decrease it based on the usage patterns of their particular workload.<br></div>
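<div class="paragraph" style="text-align:left;">As a hedged sketch of how the half-life could be shortened to, say, 8 hours: this assumes the recommender runs as a Deployment named vpa-recommender in kube-system and already has an args array, so adapt it to your installation:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: append the flag to the vpa-recommender container's args
kubectl -n kube-system patch deployment vpa-recommender --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--cpu-histogram-decay-half-life=8h0m0s"}]'
    </code></pre></div></div></div></div>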
<h2 class="wsite-content-title"><font size="5">VPA Tracker Reports</font><br></h2><div class="paragraph" style="text-align:left;">As the final step in the VPA tracking workflow, a cluster operator can optionally set up the <a href="https://prometheus.io/docs/alerting/latest/alertmanager/"><u>Prometheus Alert Manager</u></a> to send a report of the final target recommendation at the end of each testing period. Alternatively, reviewing the Grafana panel over the testing period will allow the operator to identify and choose either the peak or the most recent target recommendation.&nbsp; &nbsp;<br><br>We provide an example of using the Alert Manager to send a message to a Slack channel at the end of each testing period with the recommended VPA CPU target value here: <a href="https://github.com/elotl/vpa-tracker/tree/main/vpa-alerts"><u>vpa-tracker-reports</u></a>. The graphic below shows a sample alert from a Slack channel for workload-c.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/slacknotification-vpa-tracker.jpg?1753987073" alt="Picture" style="width:767;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">If, after a few iterations of testing, the VPA recommendations work well, the cluster operator can choose to either: a) manually update the resource requests of pods or b) run VPA in auto update mode.<br></div><h2 class="wsite-content-title"><font size="5">Luna Autoscaler and VPA</font><br></h2><div class="paragraph" style="text-align:left;">When VPA recommends resource values that exceed the cluster&rsquo;s current capacity, using an intelligent cluster autoscaler like Luna can help ensure that workloads will continue to run without any interruptions and without any manual intervention to add cluster capacity. Similarly, when VPA recommends target values that would result in some cluster nodes being under-utilized, Luna can detect this and scale down the appropriate nodes. This helps keep cluster operation costs in check.<br><br>If you are interested in using VPA with Luna, please download our free trial version from here:&nbsp;<a href="https://www.elotl.co/luna-free-trial.html">Luna Free Trial</a>. 
Do write to us if you would like help getting started: <a href="mailto:info@elotl.co"><u>info@elotl.co</u></a>.<br><br><br><strong>Author:</strong><br><br>Selvi Kadirvel (VP Engineering, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Luna now supports RKE2 clusters on AWS EC2]]></title><link><![CDATA[https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2#comments]]></comments><pubDate>Thu, 03 Jul 2025 13:18:04 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2</guid><description><![CDATA[The Luna cluster autoscaler can now run with SUSE's&nbsp;RKE2 clusters on AWS EC2 nodes.Compared to EKS, RKE2 on EC2 offers more operational control, better customization, improved flexibility, and federation across different infrastructures: EC2, on-prem, and edge.Luna 1.2.19 can create and manage RKE2 worker nodes, allowing you to scale your RKE2 compute resources more efficiently than with the basic Kubernetes cluster autoscaler.How to configure Luna for RKE2Here are the steps to configure Lu [...] ]]></description><content:encoded><![CDATA[<span class="imgPusher" style="float:right;height:0px"></span><span style="display: table;width:196px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-rke2-aws-ec2.png?1751561035" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">The <a href="https://docs.elotl.co/luna/intro/"><u>Luna cluster autoscaler</u></a> can now run with SUSE's&nbsp;<a href="https://docs.rke2.io/"><u>RKE2</u></a> clusters on AWS EC2 nodes.<br>Compared to EKS, RKE2 on EC2 offers more operational control, better customization, improved flexibility, and federation across different infrastructures: EC2, on-prem, and edge.<br>Luna 1.2.19 can create and manage RKE2 worker nodes, allowing you to scale your RKE2 compute resources more efficiently than with the basic Kubernetes cluster autoscaler.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"><div class="wsite-youtube-container"><iframe src="//www.youtube.com/embed/kqb4BGXtlAs?wmode=opaque" frameborder="0" allowfullscreen></iframe></div></div></div><h2 class="wsite-content-title"><font size="5">How to configure Luna for RKE2</font><br></h2><div class="paragraph" style="text-align:left;">Here are the steps to configure Luna with RKE2 on Amazon EC2. 
Here we'll assume that the RKE2 cluster already exists, and that Luna will get installed in the <em>elotl</em> namespace.</div><h2 class="wsite-content-title"><font size="4">Create a Docker Hub secret</font><br></h2><div class="paragraph">If you aren't using the <a href="https://www.elotl.co/luna-free-trial.html"><u>trial</u></a> version of Luna, you'll have to configure the Docker Hub secret to fetch the images.</div><div><div id="841331595316381919" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n elotl create secret docker-registry dockerhub \
  --docker-server=docker.io \
  --docker-username= \
  --docker-password=
    </code></pre></div></div></div></div><div class="paragraph">This secret will be referenced later when Luna is deployed.</div><h2 class="wsite-content-title"><font size="4">Create EC2 credentials for Luna</font><br></h2><div class="paragraph" style="text-align:left;">Unlike EKS, RKE2 doesn't support AWS built-in credential mechanisms to authenticate a service account attached to the pod. This means Luna has to rely on an access key to use the EC2 API.<br>Create the access key in the AWS console and input its information into a generic secret like this:</div><div><div id="292707267482748928" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n elotl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID= \
  --from-literal=AWS_SECRET_ACCESS_KEY= \
  --from-literal=AWS_REGION=
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Because these credentials can be read by anyone with access to the cluster, it's important to restrict the permissions of the AWS access key. The EKS installation script has a file named role_policies.json listing all the IAM permissions required by Luna; you can use these policies to restrict the IAM permissions on the AWS access key role.</div><h2 class="wsite-content-title"><font size="4">Find the subnets, security groups, and node instance profile for the cluster</font><br></h2><div class="paragraph" style="text-align:left;">With EKS, Luna automatically queries the subnets and security groups based on the cluster tags, but with RKE2, these tags may not exist.<br>You can find the subnets under the cluster's VPC in the AWS console.<br>To get the security groups and node instance profile, take a look at an RKE2 control or worker node in the cluster using the AWS EC2 console. On the instance page, go to the "Security" tab. The security group IDs are listed in the "Security Groups" section. To get the node instance profile, click on the "IAM role" link and look for "Instance profile ARN" on the IAM role page. The node instance profile ARN format is <em>arn:aws:iam::&lt;account ID&gt;:instance-profile/&lt;node-instance-profile&gt;</em>; use only the <em>&lt;node-instance-profile&gt;</em> part when configuring Luna.</div>
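<div class="paragraph" style="text-align:left;">If you prefer the CLI, one way to list the subnet IDs for the cluster's VPC mentioned above is with the AWS CLI; the VPC ID below is a placeholder:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># List the subnet IDs in the cluster's VPC
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-1234567890" \
  --query "Subnets[].SubnetId"
    </code></pre></div></div></div></div>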
<h2 class="wsite-content-title"><font size="4">Get the node token and the cluster's IP address</font></h2><div class="paragraph" style="text-align:left;">The agent token is used to authenticate the nodes with the cluster. To get the agent token from the kube-apiserver pod, first find the apiserver pods on the RKE2 cluster:<br></div><div><div id="539862493525215020" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n kube-system get pod -l component=kube-apiserver
    </code></pre></div></div></div></div><div><div id="903026919932971103" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">NAME                                    READY   STATUS    RESTARTS   AGE
kube-apiserver-rke2-pool1-zdzdn-6g2ql   1/1     Running   0          15d
kube-apiserver-rke2-pool1-zdzdn-kcxwv   1/1     Running   0          15d
    </code></pre></div></div></div></div><div class="paragraph">Then exec into one of the pods and print the agent-token file:</div><div><div id="439176683365701539" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl exec -it kube-apiserver-rke2-pool1-zdzdn-6g2ql -n kube-system -- bash
    </code></pre></div></div></div></div><div><div id="377909448112439459" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cat /var/lib/rancher/rke2/server/agent-token
    </code></pre></div></div></div></div><div class="paragraph"><br>To get the server's API address, list the control-plane nodes and use one of the nodes' internal IPs:</div><div><div id="807532008474602950" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get node -l node-role.kubernetes.io/control-plane=true -o wide
    </code></pre></div></div></div></div><div class="paragraph">Alternatively, you can use the load balancer's IP if you are using a high-availability solution for the control plane.</div><h2 class="wsite-content-title"><font size="4">Create Helm values file</font><br></h2><div class="paragraph">Now let's put it all together and create the Helm values file for the Luna chart.<br>We'll use a base Ubuntu image and create the user data script required to set up the RKE2 worker node to work with Luna:<br><br></div>
<div><div id="737580040203725762" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cloudProvider: aws
clusterID: ""
aws:
    subnets: ["subnet-1234567890"]
    securityGroups: ["sg-1234567890"]
    nodeInstanceProfile: node-instance-profile
    amiIdGeneric: ami-09a13b25443518b29
    userDataType: Template
    userData: |
        #!/bin/bash
        mkdir -p /etc/rancher/rke2/
        cat &lt;&lt;EOF &gt; /etc/rancher/rke2/config.yaml
        server: "https://:9345"
        token: ""
        node-label:
        {{- range $k, $v := .Labels }}
        - "{{ $k }}={{ $v }}"
        {{- end }}
        {{- if .Taints }}
        node-taint:
        {{- range $t := .Taints }}
        - "{{ $t }}"
        {{- end }}
        {{- end }}
        {{- if (gt .MaxPods 0) }}
        kubelet-arg: "--max-pods={{.MaxPods}}"
        {{- end }}
        EOF
        curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
        systemctl enable rke2-agent.service
        systemctl start rke2-agent.service
imagePullSecretName: dockerhub
labels: "elotl-luna=true"
manager:
    envFrom:
    - secretRef:
        name: aws-credentials
    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Deploy Luna with Helm and test</font><br></h2><div class="paragraph">Once the Helm values file is created, you can deploy Luna from its Helm chart with the Helm values file:<br></div><div><div id="550219562954133697" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">helm install 'elotl-luna' \
  --wait \
  --create-namespace \
  --namespace="elotl" \
  --values=helm_values.yaml
    </code></pre></div></div></div></div><div class="paragraph"><br>Once the deployment is running, you can test the installation by creating a test deployment like this:</div><div><div id="809000161494997702" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cat nginx.yaml
    </code></pre></div></div></div></div><div><div id="697590705998331154" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        elotl-luna: "true"
    spec:
      containers:
      - name: nginx
        image: nginx:mainline
        resources:
          requests:
            cpu: 800m
            memory: 200Mi
    </code></pre></div></div></div></div>
class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f nginx.yaml    </code></pre></div></div></div></div><div class="paragraph"><span>The nginx pod will initially be in the Pending state and Luna nodes will come up to run them:</span></div><div><div id="626562848792290694" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get node -l node.elotl.co/created-by=luna -w    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">Supporting RKE2 clusters on AWS EC2 marks a significant milestone for Luna, delivering advanced autoscaling to more Kubernetes users in the Amazon cloud. By following the configuration best practices shared above, your team can deploy Luna confidently, unlocking new opportunities for cost efficiency and operational control in your RKE2 clusters.<br><br><br><strong>Author:</strong><br><br><span></span>Henry Precheur (Senior Staff Engineer, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Building an Elastic GPU Cluster with the KAI Scheduler and Luna Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler#comments]]></comments><pubDate>Wed, 28 May 2025 18:34:17 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler</guid><description><![CDATA[When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The KAI Scheduler introduces a structured approach to resource allocation by organizing jobs into queues and operating under the assumption of fixed GPU resources available within the cluster. For clarification for those not familiar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI’s own abstraction, not to be confused with a Kubernetes Job resource (i.e., th [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:269px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler.png?1748457454" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The <a href="https://github.com/NVIDIA/KAI-Scheduler"><strong><u>KAI Scheduler</u></strong></a> introduces a structured approach to resource allocation by organizing jobs into <em>queues</em> and operating under the assumption of <em>fixed GPU resources</em> available within the cluster. For clarification for those not familiar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI&rsquo;s own abstraction, not to be confused with a Kubernetes Job resource (i.e., the batch/v1 kind used in Kubernetes for running finite, batch-style workloads). Each queue can be assigned limits and quotas, allowing administrators to control how resources are distributed across teams, projects, or workloads. This model ensures fair usage and predictability, but it also means that when demand exceeds supply, jobs can sit idle, waiting for resources to become available, and when supply exceeds demand, unnecessary costs are incurred.<br><br>This is where the real strength of the KAI Scheduler can shine: pairing it with <strong>Luna, an intelligent autoscaler</strong>. With this combination, the system becomes highly elastic, able to dynamically add GPU nodes only when truly needed, and scale them back down to optimize efficiency. Instead of relying on a static pool of GPUs, the cluster can grow to meet active demand &mdash; <em>but only up to what is necessary and permitted by the configured queue limits and quotas</em>. It&rsquo;s worth noting that Luna doesn't indiscriminately add nodes; it works intelligently alongside KAI, ensuring that scaling decisions respect organizational boundaries and cost controls.&nbsp; Beyond scaling decisions, Luna offers settings to guide GPU instance selection, adding another layer of precision.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">Even more powerfully, when demand drops, the autoscaler can scale GPU nodes down to zero, eliminating idle GPU resource costs entirely when no jobs are pending. 
This combination of KAI&rsquo;s scheduling guarantees with elastic GPU scaling through Luna improves resource utilization, enforces workload fairness, and reduces cloud costs &mdash; all while staying responsive to real-time demand.<br><br>Although KAI's queue-based model applies to both GPU and non-GPU scheduling scenarios, this blog highlights its integration with Luna in the context of GPU workloads, where elastic scaling offers the greatest impact.<br><br>In this blog post, we'll dive deeper into how KAI's design philosophy around queues and quotas enables this behavior, and how coupling it with the Luna autoscaler transforms your GPU cluster into a highly responsive, cost-effective machine learning platform.</div><h2 class="wsite-content-title"><font size="5">Queues, Quotas, and Priorities: The Building Blocks of KAI Scheduling</font><br></h2><div class="paragraph" style="text-align:left;">The KAI Scheduler is a purpose-built GPU scheduling system designed for modern AI/ML clusters where jobs vary widely in size, duration, and importance. At its core, KAI is designed to maximize GPU utilization while ensuring fairness, predictability, and administrative control. Unlike traditional Kubernetes scheduling, which typically operates at a pod-by-pod level, KAI introduces a queue-based model that groups jobs by context, such as by team, project, or workload class, allowing more intelligent and policy-driven resource sharing.<br><br>Each <em>queue</em> in KAI acts like a controlled funnel for jobs, with configurable <em>limits</em> (the maximum number of GPUs it can use at once) and <em>quotas</em> (reserved GPU allocations that a queue is guaranteed even during cluster contention). This structure ensures that important teams or high-priority projects are not starved when demand is high, while still allowing flexibility to share unused capacity when possible.<br><br>KAI also supports <em>job priorities</em> within queues. Higher-priority jobs are scheduled before lower-priority ones, even within the same queue, enabling teams to manage critical workloads more effectively. When GPUs are scarce, KAI can preempt lower-priority jobs (depending on configuration) to ensure that the most important work gets done first. Combined with fair sharing across queues and configurable preemption policies, this priority system helps align resource allocation with business and operational goals.<br><br>This structured approach &mdash; queues, quotas, limits, and priorities &mdash; makes KAI uniquely capable of supporting large, dynamic GPU clusters where the mix of users, workloads, and urgency changes constantly. When coupled with Luna, an intelligent autoscaler, KAI ensures that the right jobs run at the right time, while infrastructure elastically grows or shrinks to match real demand.<br><br>This blog highlights just a few core concepts of the KAI Scheduler, specifically its use of queues, quotas, and priority-based scheduling. However, KAI also includes many other advanced features designed for complex workload management. 
While we won&rsquo;t cover those here, they may be worth checking out and exploring.</div><h2 class="wsite-content-title"><font size="5">How the Luna Autoscaler Works with the KAI Scheduler</font><br></h2><div class="paragraph" style="text-align:left;">While many Kubernetes autoscalers operate by simply watching for pending pods and then adding nodes when any pod remains unscheduled, this approach falls short in environments where more complex scheduling logic is in place, such as when using the KAI Scheduler. In KAI, it is perfectly normal (and intentional) for some pods to remain pending, not because resources are unavailable, but because a queue&rsquo;s GPU <em>limit</em> or <em>quota</em> has been reached. An autoscaler that simply reacts to all pending pods would wastefully add GPU nodes that the KAI Scheduler would never utilize, leading to unnecessary cloud spend and resource sprawl.<br><br>The Luna autoscaler solves this problem with a more intelligent strategy. Rather than simply responding to the existence of pending pods, Luna can be configured to inspect the pod&rsquo;s status, conditions, and associated messages to determine <em>why</em> the pod is pending. This allows it to distinguish between pods that truly need more capacity versus pods that are simply waiting for their turn within a queue limit.<br><br>For example, if a pod&rsquo;s <strong><font color="#626262">status.conditions</font></strong> section includes a message such as:<br></div><div><div id="709022772257916334" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Scheduling conditions were not met for pod default/gpu-pod-1a:
MaxNodePoolResources: The pod default/gpu-pod-1a requires GPU: 1, CPU: 0 (cores), memory: 0 (GB). No node in the default node-pool has GPU resources.
    </code></pre></div></div></div></div><div class="paragraph">-or-<br></div><div><div id="225341365251053462" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">no nodes with enough resources were found: 4 node(s) didn't have enough resources: GPUs.
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">this indicates that the pod is unschedulable because there are <strong>no nodes with GPU resources</strong> available. In this case, Luna correctly triggers the addition of a new GPU node, allowing the KAI Scheduler to proceed with placing the job.<br><br>On the other hand, if the pending pod&rsquo;s message says:<br></div><div><div id="578006931532394955" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">OverLimit: default1 quota has reached the allowable limit of GPUs. Limit is 1 GPUs, currently 1 GPUs allocated and workload requested 1 GPU
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">it signals that the queue&rsquo;s GPU limit has already been reached. In this case, adding more nodes would be futile because KAI will not schedule the pod until the quota is freed, regardless of available cluster resources. The Luna autoscaler, if properly configured, recognizes this scenario and avoids unnecessary node provisioning.</div>
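<div class="paragraph" style="text-align:left;">To see the exact message Luna would evaluate for a given pending pod, you can read the PodScheduled condition directly; the pod name and namespace below are illustrative:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Print the scheduler's message on the pod's PodScheduled condition
kubectl -n default get pod gpu-pod-1a \
  -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
    </code></pre></div></div></div></div>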
<div class="paragraph" style="text-align:left;">The flexibility that enables Luna to behave correctly in these cases comes from its <strong><font color="#818181">pendingPodReasonRegexp</font></strong> configuration option. This setting lets administrators define a regular expression to match only those pending pod messages that warrant scaling actions. Without any configuration, Luna would treat <em>all</em> pending pods as triggers for scale-out. However, with an expression like:<br></div><div><div id="104438841689460423" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">pendingPodReasonRegexp: (.*[Nn]o.*resources.*|^0/([0-9]+) nodes are available)
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Luna can simultaneously support both default Kubernetes scheduling messages (like "0/5 nodes are available") and KAI Scheduler-specific resource shortage messages (like "No node in the default node-pool has GPU resources"). Critically, it would <em>ignore</em> pods pending due to quota overages, respecting the queue limits and policies enforced by KAI.<br><br>This integration makes Luna a powerful autoscaling companion to KAI, enabling truly elastic GPU infrastructure: adding nodes when needed for real workloads, avoiding waste when queues are at quota, and scaling down to zero when no pods are eligible for scheduling. Together, KAI and Luna deliver an efficient, responsive, and cost-optimized platform for running large-scale AI and ML jobs.<br></div><h2 class="wsite-content-title"><font size="5">Real-World Dynamics: Scheduling, Queue Limits, and Intelligent Scaling</font><br></h2><div class="paragraph" style="text-align:left;"><strong>Let's walk through an example to see how the KAI Scheduler and Luna autoscaler work together in practice.</strong> We'll explore how GPU workloads are scheduled across queues, how scaling decisions are made, and how the system remains efficient even as demand changes throughout the day.<br><br>Imagine a Kubernetes cluster set up to serve multiple internal teams running AI workloads. Two KAI Scheduler queues are configured: a <strong>"Research"</strong> queue and a <strong>"Production Inference"</strong> queue. The "Research" queue is assigned a <strong>quota of 4 GPUs</strong> and a <strong>limit of 8 GPUs</strong>, while the "Production Inference" queue has a <strong>quota of 8 GPUs</strong> and a <strong>limit of 12 GPUs</strong>. These settings ensure that critical production workloads are prioritized and guaranteed sufficient resources even during periods of high demand, while still allowing research teams to scale up when capacity is available; a sketch of such a queue definition is shown below.<br></div>
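<div class="paragraph" style="text-align:left;">For illustration, the "Research" queue might be declared along these lines. This is a sketch modeled on the KAI Scheduler's Queue custom resource, so check the field names against the KAI version you run:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: research
spec:
  resources:
    gpu:
      quota: 4            # GPUs guaranteed to this queue, even under contention
      limit: 8            # hard cap on GPUs the queue may use at once
      overQuotaWeight: 1  # share of spare capacity relative to other queues
    </code></pre></div></div></div></div>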
<div class="paragraph" style="text-align:left;">At the start of the day, several production inference jobs are submitted, consuming 6 GPUs. Luna detects that some production pods are <strong>pending with a valid unschedulable reason</strong>, indicating a lack of GPU resources, not just a KAI queue overlimit. Based on its configured <strong><font color="#818181">pendingPodReasonRegexp</font></strong>, Luna correctly interprets these pending pods as requiring new compute and promptly scales up additional GPU nodes. Once the nodes are ready, KAI schedules these inference jobs, bringing production workloads up toward their quota.<br><br>Shortly afterward, research engineers kick off a series of experimental training jobs, requesting 10 GPUs in total. KAI schedules the first 4 research jobs immediately, aligned with the Research queue&rsquo;s quota. Another 4 pods may also be scheduled&mdash;provided the cluster has sufficient capacity&mdash;since they remain within the queue&rsquo;s configured limit of 8 GPUs. Meanwhile, Luna inspects the pending pods: it recognizes that some research pods are pending due to insufficient GPU capacity, while 2 others will remain pending because the queue&rsquo;s GPU limit has been reached. In response, Luna allocates additional GPU nodes to accommodate the pods still eligible to run within the queue&rsquo;s limit.<br><br>Luna only scales up nodes for the pods that actually need capacity and ignores those pending due to limit enforcement. This selective scaling ensures efficient cluster growth without wasting compute on artificially pending jobs.<br><br>As demand surges further, a second wave of production inference jobs arrives, consuming more GPUs and pushing the cluster toward full utilization. Because production workloads have a higher queue priority, KAI favors them over research jobs when scheduling GPUs that become available. The research pods exceeding their limits remain pending, awaiting free resources.<br><br>Later in the day, several production inference jobs complete, releasing GPUs back into the cluster. The KAI Scheduler notices the freed-up GPUs and begins to schedule the pending research jobs, respecting quota, limit, and priority policies. As the workload tapers off toward evening, both queues gradually empty out. Luna detects the sustained idleness (no pods are pending that would require GPUs) and begins scaling down the GPU node pools, eventually reaching <em>zero GPU nodes</em> once all jobs have completed or been canceled.<br><br>Throughout this cycle, KAI ensures fair, priority-aware scheduling based on queue configurations, while Luna manages <strong>dynamic, intelligent autoscaling</strong>, scaling up precisely when workloads genuinely need resources and scaling down aggressively to save costs. This close coordination keeps the platform <strong>cost-effective, responsive, and well-aligned to workload demand</strong>.<br></div><h2 class="wsite-content-title"><font size="5">How Luna Ensures Efficient Scaling Even Under Rapid Changes</font><br></h2><div class="paragraph" style="text-align:left;">While the Luna Autoscaler is designed to scale GPU nodes (as well as non-GPU nodes) precisely according to actual demand, it&rsquo;s important to note that small overshoots can occasionally occur. Because of the inherently dynamic nature of Kubernetes, with pods completing, new pods arriving, and scheduling conditions changing rapidly, Luna may sometimes add slightly more nodes than strictly needed. However, this is expected behavior in highly dynamic systems, and Luna is built to detect and reconcile any over-provisioned nodes quickly. Unused GPU nodes are automatically identified and safely removed during the next autoscaling evaluation cycle. This reconciliation mechanism ensures that the cluster stays responsive to fast-changing workloads without risking long-term resource waste, striking a balance between agility and efficiency.<br><br>To further reduce the potential for over-scaling, administrators can configure Luna&rsquo;s <strong><font color="#818181">clusterGPULimit</font></strong> option. This setting acts as a cap on the total number of GPUs Luna is allowed to provision. For example, it can be set to the sum of all KAI queue limits or slightly above the expected maximum GPU demand. This ensures that even under bursts of pending pods or fluctuating queue activity, Luna will not scale the cluster beyond a known safe threshold, providing another safeguard for cloud cost and quota control.<br></div>
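<div class="paragraph" style="text-align:left;">In Helm values form this is a single setting; the number below is illustrative and could, for example, be set to the sum of the two queue limits from the walkthrough above:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Cap the total number of GPUs Luna may provision across the cluster.
clusterGPULimit: 20
    </code></pre></div></div></div></div>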
<h2 class="wsite-content-title"><font size="5">Closing Thoughts: Intelligent Scheduling and Autoscaling in Action</font><br></h2><div class="paragraph" style="text-align:left;">Effectively managing GPU resources in a Kubernetes environment requires more than just reactively scaling for all pending pods. It demands an understanding of why pods are pending, how workloads are prioritized, and how quotas and queue limits impact scheduling decisions. The KAI Scheduler brings powerful, queue-based control to the table, allowing administrators to enforce GPU resource guarantees, prioritize critical workloads, and avoid resource contention across teams, all while enabling dynamic, fair resource sharing when capacity allows.<br><br>However, intelligent scheduling alone isn't enough. To maximize efficiency and cost-effectiveness, the platform must also dynamically match the underlying compute supply to real-world demand. That&rsquo;s where the Luna intelligent autoscaler complements the KAI Scheduler perfectly. By inspecting pod status messages and acting only when GPU nodes are genuinely needed, not merely reacting to all pending pods, Luna ensures that scaling decisions are precise, deliberate, and resource-aware.<br><br>As we saw in the example scenario, this combination allows workloads to ramp up smoothly, respecting both quota guarantees and dynamic limits, while ensuring GPU nodes are provisioned only when they can actually be used. When workloads complete, Luna responds quickly, scaling GPU nodes back down, even all the way to zero, helping to avoid unnecessary cloud costs during idle periods.<br><br>In short, pairing the KAI Scheduler with an intelligent autoscaler, like Luna, provides a powerful foundation for managing large-scale, GPU-intensive Kubernetes workloads. Together, they deliver <strong>better workload fairness, faster responsiveness, and smarter resource utilization</strong> &mdash; all critical ingredients for running a highly efficient, cost-effective compute platform at scale.<br></div><h2 class="wsite-content-title"><font size="5">Looking Ahead: Evolving Luna and KAI Integration</font><br></h2><div class="paragraph" style="text-align:left;">The current integration between the Luna autoscaler and the KAI Scheduler already enables powerful, efficient GPU workload scaling with intelligent handling of queues, quotas, and real-time cluster demands. While the existing functionality covers many common scenarios, we recognize that there may be opportunities for even deeper integration based on real-world needs.<br><br>We&rsquo;d be very interested in hearing from you about potential improvements. 
If you have ideas for tighter coupling, additional features, or specific use cases where Luna could better support KAI's advanced scheduling behavior, we'd love your feedback. Your input could help guide future enhancements and ensure the system continues to meet evolving GPU workload demands.</div><h2 class="wsite-content-title"><font size="5">Get Involved</font><br></h2><div class="paragraph" style="text-align:left;">If you're running GPU workloads today, or planning to, and want to make the most of the KAI Scheduler and Luna autoscaler together, now is the perfect time to get involved.<br><br>Share your feedback, test new features, and help us build even smarter, more efficient scaling for Kubernetes GPU environments and workloads.<br><br>Discover how Luna&rsquo;s intelligent autoscaling enhances GPU workload management, especially when paired with advanced schedulers like KAI. Visit our <a href="https://www.elotl.co/luna.html"><u>Luna</u></a> product page to explore all its capabilities, or dive into the <a href="https://docs.elotl.co/luna/intro/"><u>documentation</u></a> for hands-on setup guidance. Ready to optimize your cluster with smarter GPU scaling? Start your <a href="https://www.elotl.co/luna-free-trial.html"><u>free trial</u></a> today and experience the efficiency, control, and cost savings Luna can bring.<br><br><br><strong>Author:</strong><br>Justin Willoughby (Principal Solutions Architect, Elotl)<br></div>]]></content:encoded></item><item><title><![CDATA[Supercharge your Cluster Autoscaling with VPA]]></title><link><![CDATA[https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa]]></link><comments><![CDATA[https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa#comments]]></comments><pubDate>Tue, 13 May 2025 17:46:41 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[VPA]]></category><guid isPermaLink="false">https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa</guid><description><![CDATA[Choosing accurate CPU and memory request values for Kubernetes workloads is a difficult endeavor. This difficulty results in application developers overprovisioning their workloads to ensure that application performance will not be affected. This can lead to increasing cloud costs and inefficient resource usage. In addition, it is also possible that workloads can be underprovisioned inadvertently. This can negatively affect application performance and potentially even lead to service disruptions [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/vpa-and-luna-interoperability-experiments.jpg?1747158858" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Choosing accurate CPU and memory request values for Kubernetes workloads is a difficult endeavor. 
This difficulty results in application developers overprovisioning their workloads to ensure that application performance will not be affected. This can lead to increasing cloud costs and inefficient resource usage. In addition, it is also possible that workloads can be underprovisioned inadvertently. This can negatively affect application performance and potentially even lead to service disruptions.<br><br>In this blog, we describe how <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><u>Kubernetes Vertical Pod Autoscaler</u></a> (VPA) can be leveraged in conjunction with <a href="https://www.elotl.co/luna.html"><u>Luna</u></a>, a powerful cluster autoscaler - to ensure that Kubernetes workloads are <strong>right-sized by VPA</strong> and the Kubernetes cluster as well as nodes are <strong>right-sized</strong> by Luna - resulting in <strong>cost-effective</strong> and <strong>performant</strong> operations.<br><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">Overview of VPA</font><br></h2><div class="paragraph" style="text-align:left;">The Vertical Pod Autoscaler in Kubernetes leverages CPU and memory usage history of managed workloads to make recommendations of resource request values for containers and optionally update a container&rsquo;s resource requests in an automated fashion. Workloads that can be vertically scaled using VPA include Deployments, Statefulsets, Daemonsets as well as Custom Resources (that have the scale subresource defined). VPA uses the <a href="https://kubernetes-sigs.github.io/metrics-server/"><u>Kubernetes metrics server</u></a> to monitor and track CPU and memory resource usage.&nbsp;&nbsp;<br><br><span></span>VPA is implemented as a Custom Resource in Kubernetes. An instance of the custom resource will need to be created for each workload that the user would like to manage or vertically autoscale. VPA can be used in 3 different modes. These are described below:<br><br><span></span><ol><li><strong>Off</strong>: In this mode, VPA provides recommendations for pod resource request values. These recommended values can be read from the VPA custom resource object. This mode requires manual activation or human intervention to apply recommendations.&nbsp;<br><span></span></li><li><strong>Initial</strong>: In this mode, VPA provides recommendations for pod request values in the VPA custom resource object just as in the <strong>&ldquo;off&rdquo;</strong> mode. In addition, these resource recommendations are applied to pods during pod creation (alone) and do not change during the lifetime of the pod. These pod creations could have been triggered either by prior pod restarts or via horizontal pod scaling.&nbsp;<br><span></span></li><li><strong>Auto/Recreate</strong>: In this mode, VPA assigns resource requests on pod creation as well as updates these resource requests over the lifetime of the pod.&nbsp;<br><br><span></span></li></ol>Note: This blog focuses on the traditional behavior of the Vertical Pod Autoscaler (VPA), which involves evicting and restarting pods to apply new resource recommendations. It does not cover the newer in-place pod resizing feature introduced in Kubernetes v1.33+, which allows certain resource updates without pod restarts. If you're using Kubernetes 1.33 or later and are interested in in-place resizing, be aware that it introduces different behavior and considerations not discussed in detail in this blog post. 
Please review the section &ldquo;VPA and In-place Pod Resizing&rdquo; at the end of this blog post to learn more.&nbsp;<br><br><span></span></div><h2 class="wsite-content-title"><font size="5">VPA: Under the hood</font><br></h2><div class="paragraph" style="text-align:left;">The Vertical Pod Autoscaler consists of 3 components in the <em>kube-system</em> namespace:<ol><li>Recommender, vpa-recommender: The recommender utilizes past and current CPU and memory usage values to calculate recommendations for resource requests for containers within a managed pod. The recommendations are made available within the Status field of the VerticalPodAutoscaler custom resource object.<br></li><li>Updater, vpa-updater: The updater is responsible for checking current resource requests for containers in a pod and evicting pods for which the recommended resources vary significantly from the current allocation. The pod disruption budget is respected during evictions. The vpa-updater comes into effect only if VPA is operated in Auto mode. Note that for pods whose resources need to be updated, the vpa-updater is responsible only for evicting the pod; the pod controller then initiates the pod's restart.</li><li>Admission Controller, vpa-admission-controller: The admission controller sets correct resource requests on newly created pods. This includes pods created for the first time as well as pods recreated after eviction by the vpa-updater.<br><br></li></ol>VPA also includes supporting components like a Mutating webhook configuration named vpa-webhook-config and a ClusterIP service called vpa-webhook, which work together to apply recommended resource updates.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/vertical-pod-autoscaler.png?1747159145" alt="Picture" style="width:598;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">The VPA recommender calculates resource requests using a <strong>decaying histogram</strong> of monitored CPU and memory usage metrics. In a decaying histogram, the weight of each metric value decreases over time. By default, a historical CPU usage sample loses half of its weight in 24 hours. This default value can be changed using the --cpu-histogram-decay-half-life flag. The frequency at which CPU and memory metrics are fetched defaults to 1 minute and can be changed using the --recommender-interval flag. 
An extensive list of other flags to customize the vpa-recommender is documented here: <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>VPA-recommender flags</u></a>.&nbsp; A detailed description of margins and confidence intervals that are applied over the decaying histogram technique can be found in this CNCF blog post: <a href="https://www.cncf.io/blog/2023/02/24/optimizing-kubernetes-vertical-pod-autoscaler-responsiveness/"><u>Optimizing VPA responsiveness</u></a> and <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/recommender/logic/recommender.go"><u>here</u></a>.<br><br>The VPA object for each Kubernetes resource can also be configured to provide recommendations for both CPU and memory or just one of these resources using the <strong>controlledResources</strong> parameter in the VPA object (shown in the example VPA object below). It is important to note that it is <strong>not</strong> recommended to use VPA along with the Horizontal Pod Autoscaler for the same resource. More details about this limitation can be found in these references: <a href="https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md#combining-vertical-and-horizontal-scaling"><u>VPA design docs</u></a>, <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/known-limitations.md#known-limitations"><u>Known Limitations of VPA</u></a> and <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#limitations"><u>VPA on GKE Limitations</u></a>.<br><br>Let&rsquo;s look at an example of a VPA object:</div><div><div id="583817042931912367" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: workload-c-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: workload-c
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 2
        memory: 500Mi
      controlledResources: ["cpu", "memory"]
...
    </code></pre></div></div></div></div><div class="paragraph">The <strong>targetRef</strong> field refers to the Kubernetes resource that this VPA object manages, which in this case is a deployment named &ldquo;workload-c&rdquo;. The <strong>updatePolicy</strong> field can be one of the modes listed in the Overview section: Off, Initial, Auto, or Recreate.&nbsp; The <strong>minAllowed</strong> and <strong>maxAllowed</strong> fields are used to set the absolute minimum and maximum values that the VPA can recommend. This prevents excessive resource usage as well as resource starvation for pods and can help to keep performance and cost within acceptable bounds.</div>
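<div class="paragraph" style="text-align:left;">Once the recommender has produced values, they can be read back from the object&rsquo;s status, for example:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Human-readable view of the recommendations
kubectl describe vpa workload-c-vpa
# Or extract just the target values from the status
kubectl get vpa workload-c-vpa \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
    </code></pre></div></div></div></div>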
This prevents excessive resource usage as well as resource starvation for pods and can help to keep performance and cost within acceptable bounds.<br><br>Let&rsquo;s now look at an example of a recommendation within the VPA object after it begins operation:</div><div><div id="931477548549255750" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Recommendation:
  Container Recommendations:
    Container Name:  workload-c
    Lower Bound:
      Cpu:     382m
      Memory:  262144k
    Target:
      Cpu:     587m
      Memory:  262144k
    Uncapped Target:
      Cpu:     587m
      Memory:  262144k
    Upper Bound:
      Cpu:     1
      Memory:  500Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">In the above snippet, <strong>Target</strong> refers to the recommended values of CPU and memory requests for the container named &ldquo;workload-c&rdquo;. It corresponds to the 90th percentile (by default) of the decaying histogram of observed peak usage values.&nbsp; This percentile value can be configured using the flags --target-cpu-percentile and --target-memory-percentile when starting up the <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>vpa-recommender</u></a>.<br><br><strong>Uncapped Target</strong> refers to the recommended values of CPU and memory requests for the same container without taking into consideration the <strong>maxAllowed</strong> value in the Spec section of the VPA custom resource object. The <strong>lower bound</strong> and <strong>upper bound</strong> values correspond to the 50th percentile and 95th percentile of the decaying histogram; these can be configured with the flags --recommendation-lower-bound-cpu-percentile and --recommendation-upper-bound-cpu-percentile (with analogous flags for memory).&nbsp;&nbsp;<br></div><h2 class="wsite-content-title"><font size="5">VPA: Better Together with Cluster Autoscaling</font><br></h2><div class="paragraph" style="text-align:left;">In this section, let&rsquo;s look at how Vertical Pod Autoscaling and Cluster autoscaling complement each other. VPA can be utilized to right-size pods that are initially either overprovisioned or underprovisioned. We delve into each of these cases and find out how a cluster autoscaler can help with both.&nbsp;<br></div><h2 class="wsite-content-title"><font size="4">Application Under-Provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is underprovisioned, VPA recommends <strong>larger</strong> resource values than its current allocation. In this case, the current cluster nodes may not be able to accommodate the updated pod. This can result in pods remaining in the Pending state. In such a case, having an Intelligent Kubernetes Cluster Autoscaler, like Luna, becomes critical to keep the application or service running without interruptions. 
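For example, a replica that no longer fits on the current nodes surfaces as a Pending pod with a FailedScheduling event (a hypothetical, abbreviated transcript):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
workload-c-7758ccbf84-btrcn   0/1     Pending   0          41s

% kubectl describe pod workload-c-7758ccbf84-btrcn
...
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.
</code></pre></div></div><div class="paragraph" style="text-align:left;">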
Luna automates the addition of a right-sized cluster node to accommodate these pending pods (that were recreated because of the actions of the VPA-updater).<br><br>Additionally, Luna places pods on nodes via two techniques:&nbsp;<ol><li>Bin-packing: In this placement mode, pods with modest resource requirements are placed along with other pods sharing the same node.&nbsp;<br><br></li><li>Bin-selection: In this placement mode, pods with larger resource requirements are placed on their own nodes.&nbsp;</li></ol>The resource thresholds that determine whether a pod will be bin-packed or bin-selected are configurable via these <a href="https://docs.elotl.co/luna/Configuration/#bin-selection-1"><u>Luna parameters</u></a>:&nbsp; binSelectPodCpuThreshold, binSelectPodMemoryThreshold and binSelectPodGPUThreshold. Any pod whose resource request equals or exceeds these thresholds will be bin-selected.<br><br>When an underprovisioned pod&rsquo;s resources are increased by VPA, a bin-pack designated pod may become a bin-select designated pod. In this case, Luna automatically detects this change and places the pod appropriately on a bin-select node. We illustrate this via an experiment in the section: &ldquo;Experiment 4: VPA and Luna Interoperation to Handle Pod Under-provisioning&rdquo;.<br></div><h2 class="wsite-content-title"><font size="4">Application Over-Provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is overprovisioned, VPA recommends <strong>smaller</strong> resource values than its current allocation. In this case, since the pod&rsquo;s resource request is smaller, total cluster capacity will not need to change; i.e., the cluster will continue to be able to accommodate the updated pod.<br><br>However, the decrease in resource requests could result in a change in the designation of a pod from bin-select to bin-pack. In this case, the pod, after restart, will be placed on a bin-pack node by Luna. The bin-select node will automatically get scaled-in (or deleted) if no other pods were also running on that node. A detailed experiment of this scenario is described in the section: &ldquo;Experiment 3: VPA and Luna Interoperation to Handle Pod Over-provisioning&rdquo;.<br><br></div><h2 class="wsite-content-title"><font size="5">VPA & Luna Interoperability Experiments&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we detail a number of experiments to showcase how VPA and Luna interoperate under different operational conditions and modes.<br></div><h2 class="wsite-content-title"><font size="4">Experiment 1: Interoperation of VPA in &ldquo;<strong>Auto mode</strong>&rdquo; and Luna</font><br></h2><div class="paragraph">In this experiment, we illustrate an example where VPA recommends increased resources to a managed deployment. Luna promptly detects that the pod recreated by the vpa-updater cannot be accommodated as-is in the current cluster and hence adds a new node to the cluster and places the restarted pod on this new node.<br><br>When Luna and VPA (in auto mode) are used together, their admission webhooks need to be executed in the correct order: the VPA admission controller must first adjust pods&rsquo; resource values, and then Luna&rsquo;s admission webhook comes into effect. Luna then uses the updated resource values in a pod to choose an appropriate node. 
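Because Kubernetes invokes mutating webhooks sorted lexically by name, the resulting order can be checked by listing the webhook configurations (a hypothetical transcript; the Luna webhook configuration name shown is illustrative):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get mutatingwebhookconfigurations
NAME                     WEBHOOKS   AGE
vpa-webhook-config       1          12d
zz-luna-webhook-config   1          12d
</code></pre></div></div><div class="paragraph">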
Luna provides a <a href="https://docs.elotl.co/luna/Configuration/#webhookconfigprefix"><u>configuration parameter, called</u></a> webhookConfigPrefix, to enable this ordering.</div><h2 class="wsite-content-title"><font size="3">1. <strong>Initial Setup</strong></font><br></h2><div class="paragraph">Two deployments, &ldquo;workload-A&rdquo; and &ldquo;workload-B&rdquo;, are running on 2 nodes in a Luna-enabled EKS cluster.<br></div><div><div id="618942580580490076" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          14m   ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          14m   ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">2. Starting a VPA managed workload</font></strong><br></h2><div class="paragraph" style="text-align:left;">A third deployment, workload-C, managed by VPA is created on this cluster.<br></div><div><div id="593886771201632708" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl apply -f workload-C.yaml
verticalpodautoscaler.autoscaling.k8s.io/workload-c-vpa created
deployment.apps/workload-c created
</code></pre></div></div></div></div><div class="paragraph">The VPA custom resource is seen below.<br></div><div><div id="441804390500762881" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-c-vpa   Auto                          5s
</code></pre></div></div></div></div><div class="paragraph">We see that the CPU and memory request values are not immediately available.<br><br>Initially, workload-C is placed by Luna on an existing Luna-managed node, ip-192-168-20-122, because there is sufficient capacity on that node.</div><div><div id="878778035708163192" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-c-7758ccbf84-crgqm   1/1     Running   0          4s    ip-192-168-20-122.us-west-1.compute.internal
workload-c-7758ccbf84-qgv9d   1/1     Running   0          4s    ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          20m   ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          20m   ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph">Workload-C was chosen such that its CPU usage can be configured to spike up or down as needed: <a href="https://github.com/narmidm/k8s-pod-cpu-stressor"><u>cpu-stressor-pod</u></a>.<br><br>We then see that the workload-c pods&rsquo; CPU usage begins to spike up, as captured below:&nbsp;<br></div><div><div id="638862728453735093" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
workload-c-7758ccbf84-crgqm   346m         1Mi
workload-c-7758ccbf84-jvcxz   3303m        1Mi
workload-a-746c7d676c-g6fvm   2588m        1Mi
workload-b-748848d855-8xq2x   275m         1Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">We see that the VPA updater evicts one of the pods, and a newly created replacement pod enters the Pending state:<br></div><div><div id="759853797576358438" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
workload-c-7758ccbf84-btrcn   0/1     Pending   0          41s
workload-c-7758ccbf84-jvcxz   1/1     Running   0          101s
workload-a-746c7d676c-g6fvm   1/1     Running   0          22m
workload-b-748848d855-8xq2x   1/1     Running   0          22m
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">At the same time, we see that the CPU and memory recommendations are updated within the VPA custom resource object.&nbsp;</div><div><div id="139779818373401315" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM       PROVIDED   AGE
workload-c-vpa   Auto   2     262144k   True       2m58s
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Within a minute, we see that the pending pod successfully starts running on a newly created node (<span>ip-192-168-30-113</span>) whose creation was triggered by Luna. 
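One way to confirm the node&rsquo;s origin is to print its Luna-specific label (a hypothetical, abbreviated transcript; kubectl&rsquo;s -L flag adds the label value as a column):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get node ip-192-168-30-113.us-west-1.compute.internal -L node.elotl.co/created-by
NAME                                           STATUS   ...   CREATED-BY
ip-192-168-30-113.us-west-1.compute.internal   Ready    ...   luna
</code></pre></div></div><div class="paragraph" style="text-align:left;">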
The node&rsquo;s labels indeed include <span>node.elotl.co/created-by=luna</span>, verifying that the node creation was in fact initiated by Luna.</div><div><div id="902102840276024087" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-c-7758ccbf84-btrcn   1/1     Running   0          81s   ip-192-168-30-113.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">This experiment showcases that using Vertical Pod Autoscaling in an automated fashion requires an intelligent autoscaler like Luna to scale out nodes when necessary.&nbsp;&nbsp;</div><h2 class="wsite-content-title"><font size="4">Experiment 2: Interoperation of VPA in &ldquo;<strong>Initial mode</strong>&rdquo; and Luna</font><br></h2><div class="paragraph">In this experiment, we illustrate an example where VPA recommends increased resources to a managed deployment. However, since VPA is configured in &ldquo;Initial&rdquo; mode, resource requests are not automatically applied to containers. In this mode, requests are applied only during pod creation, so application administrators can restart a pod manually to apply updated requests.<br><br>Pods initially run on the existing Luna-managed node, ip-192-168-20-122.<br></div><div><div id="250670685940947106" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP   NODE
workload-d-79f5997949-cdxv4   1/1     Running   0          8m12s        ip-192-168-20-122.us-west-1.compute.internal
workload-d-79f5997949-gs7tb   1/1     Running   0          8m12s        ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          3d8h         ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          3d8h         ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">After new VPA recommendations have been calculated in the VPA object, the pods are deleted.<br></div><div><div id="239140065364508584" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE      CPU   MEM       PROVIDED   AGE
workload-d-vpa   Initial   2     262144k   True       22m
% kubectl delete pod workload-d-79f5997949-cdxv4 workload-d-79f5997949-gs7tb
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">We see that 1 replica of the workload gets started on a new Luna-triggered node 
(ip-192-168-3-62), taking into account the pod&rsquo;s newly assigned resource request.<br></div><div><div id="461362999972141551" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    IP   NODE
workload-d-79f5997949-4zpxw   1/1     Running   0          5m3s        ip-192-168-3-62.us-west-1.compute.internal
workload-d-79f5997949-kvwt5   1/1     Running   0          5m3s        ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          3d9h        ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          3d9h        ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">With the updated assignment, both the existing node (ip-192-168-20-122) and the new Luna-provisioned node (ip-192-168-3-62) are operating at full capacity.<br></div><div><div id="568050108987883762" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-192-168-20-122.us-west-1.compute.internal   3977m        101%   588Mi           3%
ip-192-168-29-254.us-west-1.compute.internal   2261m        57%    1065Mi          7%
ip-192-168-3-62.us-west-1.compute.internal     4000m        102%   531Mi           3%
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Experiment 3: VPA and Luna interoperation to handle Pod over-provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is initially over-provisioned, VPA can recommend lower resource request values by observing resource usage over a period of time. These lower resource values, recommended by VPA, can result in a pod initially categorized as a bin-select pod by Luna later being categorized as a bin-pack pod. In the experiment described below, we showcase how VPA and Luna work well together to handle this appropriately.&nbsp;</div><h2 class="wsite-content-title"><font size="3"><strong>1. Creation of an over-provisioned workload</strong></font><br></h2><div class="paragraph" style="text-align:left;">We create a workload, <span>workload-g</span>, that is overprovisioned. A VPA object is created for this deployment. 
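A sketch of the relevant fragment of such an over-provisioned container spec (abridged; as noted below, each replica requests 2 CPUs, far above its actual usage):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Abridged sketch of workload-g's container spec: the CPU request
# (2 cores) is far above the ~163m that VPA later observes.
resources:
  requests:
    cpu: "2"
</code></pre></div></div><div class="paragraph" style="text-align:left;">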
Initially, VPA does not have a resource recommendation since there is insufficient historical data.<br></div><div><div id="150792545869215683" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-g-vpa   Auto                          34s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3">2.&nbsp;<strong>Pod placement as bin-select on separate nodes&nbsp;</strong></font><br></h2><div class="paragraph" style="text-align:left;">Initially, the pods of this deployment request 2 CPUs each, as specified in the deployment manifest. Luna marks these pods as bin-select pods since the CPU request value reaches Luna&rsquo;s default bin-select threshold of 2 CPUs.<br><br>As can be seen below, Luna places the pods on two separate nodes, <span>ip-192-168-11-117</span> and <span>ip-192-168-31-139</span>.&nbsp;<br></div><div><div id="685049623171331800" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d    ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d2h   ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-4l9kr   1/1     Running   0          96s    ip-192-168-11-117.us-west-1.compute.internal
workload-g-6bd4dc4c66-pzq6m   1/1     Running   0          96s    ip-192-168-31-139.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>3. Pod right-sizing by VPA</strong></font><br></h2><div class="paragraph" style="text-align:left;">After a few minutes of operation, VPA utilizes the usage metrics and recommends the following resource requests. We see that the recommended CPU request for the pod is only 163m of CPU, while the original CPU request in the pod&rsquo;s manifest was for 2 CPUs.&nbsp;</div><div><div id="372652040633149840" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU    MEM       PROVIDED   AGE
workload-g-vpa   Auto   163m   262144k   True       101s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">4. Right-sized Pod placement via bin-packing by Luna</font></strong><br></h2><div class="paragraph" style="text-align:left;">We use VPA in <strong>auto</strong> update mode in this experiment. 
The pods therefore get restarted and updated with the recommended lower resource values automatically.&nbsp; Luna detects the new resource values on the restarted pods and places them as bin-pack pods on an existing bin-pack node, <span>ip-192-168-20-122</span>, as seen below.<br></div><div><div id="800541317848882321" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    IP   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d         ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d2h        ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-pjs6x   1/1     Running   0          4s          ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-qq9dl   1/1     Running   0          64s         ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">From this experiment, we see that using Luna with VPA can help handle overprovisioned pods by right-sizing them and placing them on appropriate nodes automatically.</div><h2 class="wsite-content-title"><font size="4">Experiment 4: VPA and Luna Interoperation to Handle Pod Underprovisioning</font><br></h2><div class="paragraph" style="text-align:left;">Just as applications can be overprovisioned, as we saw in Experiment 3, applications can also be under-provisioned. This can result in a degradation of application performance and necessitates prompt remediation. In the following example, we show how VPA and Luna operate together to handle this situation without any manual intervention.&nbsp;</div><h2 class="wsite-content-title"><font size="3"><strong>1. Creation of an under-provisioned workload</strong></font><br></h2><div class="paragraph" style="text-align:left;">An under-provisioned Kubernetes deployment, <strong>workload-f</strong>, is created. The workload&rsquo;s CPU request is set to <strong>100m</strong> in its manifest. We use the <a href="https://github.com/narmidm/k8s-pod-cpu-stressor"><u>cpu-stressor-pod</u></a> to configure its actual CPU usage to be much larger than this 100m request. A VPA object is also created for this deployment. Initially, the VPA object managing this deployment does not have any resource recommendations due to insufficient historical data:<br></div><div><div id="100783491973033615" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-f-vpa   Auto                          21s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>2. 
Pod placement as bin-pack by Luna</strong></font><br></h2><div class="paragraph" style="text-align:left;">Since the pod&rsquo;s CPU request of 100m falls below Luna&rsquo;s default bin-select threshold of 2 CPUs, Luna places both replicas of workload-f on a bin-pack node, <span>ip-192-168-20-122.</span>&nbsp;</div><div><div id="219700599893776754" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d     ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d22h   ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-kkrrf     1/1     Running   0          26s     ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-lfq2m     1/1     Running   0          26s     ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">3. Pod right-sizing by VPA</font></strong><br></h2><div class="paragraph" style="text-align:left;">Using the <strong>kubectl top</strong> command, we see that workload-f&rsquo;s CPU usage is much higher than its original request value of 100m for CPU.<br></div><div><div id="756372930818145156" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
workload-a-746c7d676c-g6fvm   2516m        1Mi
workload-b-9c878584d-fv7tq    101m         1Mi
workload-f-c9cd8df4-774bv     1922m        1Mi
workload-f-c9cd8df4-b9fmr     1910m        1Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Soon, VPA utilizes the observed CPU usage values and recommends a higher CPU value, 2406m, as seen below:&nbsp;</div><div><div id="299832717245604512" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU     MEM       PROVIDED   AGE
workload-f-vpa   Auto   2406m   262144k   True       2m2s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>4. Right-sized Pod placement via bin-select by Luna</strong></font><br></h2><div class="paragraph" style="text-align:left;">Since VPA is in auto mode for this experiment, workload-f&rsquo;s pods are recreated with the recommended higher CPU request values. The new CPU values now exceed Luna&rsquo;s bin-select threshold of 2 CPUs. 
Luna, in turn, responds by placing these pods on newly created bin-select nodes <span>ip-192-168-28-198</span> and <span>ip-192-168-11-176.&nbsp;</span><br></div><div><div id="969868164914383452" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          11d          ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          4d6h         ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-6wvc9     1/1     Running   0          5h44m        ip-192-168-28-198.us-west-1.compute.internal
workload-f-c9cd8df4-l554v     1/1     Running   0          5h42m        ip-192-168-11-176.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">From this experiment, we see that Luna and VPA work well together to manage under-provisioned resource requests of pods without any manual intervention.<br></div><h2 class="wsite-content-title"><font size="5">VPA and In-place Pod Resizing&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;"><a href="https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/"><u>In-place pod update</u></a> is a feature in Kubernetes that allows pods&rsquo; resource requests to be updated without having to evict and restart the pod. It has been available as an alpha feature from Kubernetes 1.27 (behind a feature gate). It is available as <a href="https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/#beta-in-place-resource-resize-for-vertical-scaling-of-pods"><u>beta from Kubernetes 1.33</u></a>.<br><br>Currently, in released versions of VPA (as of April 2025), the vpa-updater component does not utilize in-place pod resizing. However, VPA is being extended to leverage this feature; details of this development are tracked here: <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md"><u>AEP-4016</u></a>. It is important to note that VPA with in-place updates is not guaranteed to prevent pod disruptions, since the actuating resize operation depends on the underlying container runtime. The end-user expectation is for pod disruptions to be <strong>minimal</strong>.<br><br>
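To get a feel for in-place resizing independent of VPA, a pod&rsquo;s resize subresource can be patched directly. A hypothetical example, assuming a Kubernetes 1.33+ cluster and a kubectl version that supports the resize subresource (the pod and container names are illustrative):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl patch pod workload-c-7758ccbf84-btrcn --subresource resize \
    --patch '{"spec":{"containers":[{"name":"workload-c","resources":{"requests":{"cpu":"800m"}}}]}}'
</code></pre></div></div><div class="paragraph" style="text-align:left;">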
When VPA is able to utilize the in-place pod resizing feature, Luna&rsquo;s hot node mitigation feature may be able to help handle those cases where pods with increased resource requests cause excessive node utilization. Hot node mitigation is described in detail in this blog post: <a href="https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems"><u>Luna Hot Node Mitigation: A chill pill to cure pod performance problems</u></a>.</div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">In summary, when considering the use of Vertical Pod Autoscaling for your Kubernetes workloads, leveraging an Intelligent Kubernetes Cluster Autoscaler, like Luna, can ensure that restarted, scaled-up or scaled-down pods in your cluster are placed on just-in-time, right-sized nodes in a fully automated fashion. If you would like to try VPA with an intelligent cluster autoscaler, please <a href="https://www.elotl.co/luna-free-trial.html"><u>download Luna</u></a> and reach out to us with questions or comments at <a href="mailto:info@elotl.co"><u>info@elotl.co</u></a>.&nbsp;<br><br></div><div class="paragraph"><strong><br>Author:</strong><br>Selvi Kadirvel (VP Engineering, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Fun with Spot: Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs]]></title><link><![CDATA[https://www.elotl.co/blog/fun-with-spot]]></link><comments><![CDATA[https://www.elotl.co/blog/fun-with-spot#comments]]></comments><pubDate>Thu, 24 Apr 2025 18:07:00 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/fun-with-spot</guid><description><![CDATA[Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs. Offline inference is well-suited to take advantage of spot GPU capacity in public clouds.&nbsp; However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly.&nbsp; The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="4">Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:216px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/fun-with-spot-luna.png?1745525657" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Offline inference is well-suited to take advantage of spot GPU capacity in public clouds.&nbsp; However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly.&nbsp; The <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a> scales cloud <a href="https://kubernetes.io/"><u>Kubernetes</u></a> (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU SKU and count as well as maximum estimated hourly cost.&nbsp; In this blog, we share recent experiences with offline inference on <a href="https://cloud.google.com/kubernetes-engine?hl=en"><u>GKE</u></a>, <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>AKS</u></a>, and <a href="https://aws.amazon.com/eks/"><u>EKS</u></a> clusters using Luna.&nbsp; Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, <strong>reducing estimated hourly costs by 38-50%</strong> versus an on-demand baseline and turning an often tedious task into bargain-jolt <strong>fun</strong>.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2><div class="paragraph" style="text-align:left;">Applications such as query/response chatbots are handled via online serving, in which each input and prompt is provided in real-time to the model running on one or more GPU workers.&nbsp; Automatic instance allocation for online serving presents efficiency challenges.&nbsp; Real-time response is sensitive to scaling latency during usage spikes and can be impacted by spot reclamation and replacement.&nbsp; Also, peak online serving usage often overlaps with peak cloud resource usage, affecting the available capacity for GPU instances.&nbsp; We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., <a href="https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people"><u>scaling Helix to handle ML load</u></a> and <a href="https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler"><u>reducing deploy time for new ML workers</u></a>.</div><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">This blog focuses on offline inference, which avoids the challenges with the real-time burstiness of online serving.&nbsp; Applications such as text summarization, content generation, and financial forecasting employ offline inference, in 
which input and prompt pairs are sent to the model as a batch job, with the output being stored for subsequent use.&nbsp; Automatic instance allocation for offline inferencing can achieve greater resource efficiency than that for online serving.&nbsp; Offline prediction jobs are generally tolerant of scaling latency and spot instance reclamation and replacement, can be run off-peak, and are often configured with a fixed-size set of instances to handle the input load, which is typically known in advance.<br><br>We present experiences using Luna to allocate spot and on-demand GPU instances on GKE, AKS, and EKS cloud K8s clusters for offline inference.&nbsp; We share observations on resource efficiency in terms of GPU instance costs, and on instance availability and allocation search. The results show the cost savings from utilizing spot pricing and instance choice flexibility, and the value of using Luna to efficiently manage instance allocation in compliance with constraints and guardrails.&nbsp; While the results represent a small sample size, and your mileage may vary, we hope they demonstrate strategies you will find beneficial for your offline inference jobs.</div><h2 class="wsite-content-title"><font size="5">Example Offline Inference Workload</font><br></h2><div class="paragraph" style="text-align:left;">For offline inferencing, we chose to use the <a href="https://www.ray.io/"><u>Ray AI platform</u></a>, with the <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay operator</u></a> to deploy a RayJob on K8s.&nbsp; We adapted this simple <a href="https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-batch-inference-example.html"><u>batch inference example</u></a>, that runs an inference job for image classification on a single-node Ray cluster.&nbsp; The single-node Ray cluster comprises a GPU-enabled head that serves as a worker, which was run on an on-demand instance with 4 Nvidia T4 GPUs.&nbsp; This basic setup was adequate for the purpose of exercising GPU instance allocation and measuring instance cost on a set of cloud vendors.&nbsp; We updated <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml"><u>our version of the example</u></a> to indicate that Luna should handle allocating the instances for the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L16"><u>Ray cluster head</u></a> and for the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L62"><u>pod that submits the Ray job</u></a> to the Ray cluster.&nbsp; We added the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L7"><em><u>shutdownAfterJobFinishes</u></em></a> option to have the Ray cluster automatically deleted after the RayJob completes, to avoid consuming resources once the Ray cluster becomes idle.<br><br>We changed several aspects of the example around GPU SKU choice, GPU count, and pricing category to make obtaining the GPU cloud capacity easier and less costly, as described below.&nbsp; These aspects may be worth considering for your workloads.<br><br><a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L18"><em><u>Flexible GPU SKU choice</u></em></a>. 
By default, Luna will choose the least expensive instance that meets a pending pod's resource requirements, but since the GPU-enabled Ray head in the Ray example was run on an instance with Nvidia T4 GPUs, we wanted to specify that Luna use that SKU in our experiments.&nbsp; However, we found the T4 SKU could be in short supply.&nbsp; We added a Luna annotation to the Ray head configuration indicating that Luna could choose a node with any GPU SKU in a list specified by the env variable <em>RAY_CLUSTER_GPU_SKUS,</em> which we populated with SKUs chosen as described below. Giving Luna the option to choose between several GPU SKU options facilitated its obtaining spot GPU capacity in a timely manner.<br><br><a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L40"><em><u>Flexible GPU count</u></em></a>. In the Ray example, the GPU-enabled Ray head was run on an instance with 4 T4 GPUs.&nbsp; However, we found that 4-GPU instances had lower availability and higher cost relative to T4 instances with fewer GPUs, and that the example ran fine with fewer T4s.&nbsp; The constant 4 was replaced with the env variable <em>RAY_CLUSTER_GPU_COUNT</em> to allow us to reduce this value, with <em>RAY_CLUSTER_CPU_COUNT</em> and <em>RAY_CLUSTER_MEMORY_SIZE</em> env variables added to allow us to scale down the CPU and memory requests accordingly.<br><br><em>Flexible pricing category for the</em> <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L19"><em><u>Ray head</u></em></a> <em>and</em> <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L64"><em><u>Ray job submitter</u></em></a>.&nbsp; In the Ray example, the workloads were run on pre-allocated on-demand instances.&nbsp; We updated the job configs to allow the user to specify the price categories from which Luna should request an instance via the env variable <em>BATCH_JOB_PRICE_CATEGORIES</em>.&nbsp; This option can be set to &ldquo;on-demand&rdquo; or to &ldquo;spot&rdquo; to indicate that Luna should only use that specific pricing category or the option can be set to &ldquo;spot,on-demand&rdquo; to have Luna choose the instance having the lowest estimated price drawn from either category.<br><br>Also, we added pod annotations to place guardrails on <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L21"><u>instance cost</u></a>, to avoid very expensive instances, and on <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L20"><u>GPU count</u></a>, to reduce the instance selection search space.&nbsp; <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml"><u>Here</u></a> is the updated version of the RayJob configuration.<br><br>To deploy the RayJob with a specific configuration, we did the following:<ul><li>&nbsp;Source the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-setup.sh">ray-job.batch-setup.sh</a> script to define the environment variable settings, e.g.:<br></li></ul></div><div><div id="678061309550825306" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> . 
./ray-job.batch-setup.sh    </code></pre></div></div></div></div><div class="paragraph"><ul><li>Create an instance of the RayJob yaml with the environment variables expanded via <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml">ray-job.batch-inference.yaml</a>:<br></li></ul></div><div><div id="741991535739894695" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> envsubst &lt; ray-job.batch-inference.yaml &gt;ray-job.yaml    </code></pre></div></div></div></div><div class="paragraph"><ul><li>Deploy that instance:<br></li></ul></div><div><div id="716028859401396193" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> kubectl apply -f  ray-job.yaml    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Luna Operation on Offline Inference Workload</font><br></h2><div class="paragraph" style="text-align:left;">Each offline inference workload run was performed on a cloud K8s cluster running Luna 1.2.16.&nbsp; For the workload&rsquo;s pending pods and their constraints, Luna generates a list of candidate instance types with price categories and sorts them by estimated hourly cost.&nbsp; Luna estimates spot hourly cost as a configurable ratio <em>spotPriceRatioEstimate</em> of on-demand hourly cost; the default value is 0.5, which is a conservative estimate on GKE, AKS, and EKS.&nbsp; Luna then selects the candidate with the lowest estimated cost and sends a request to the cloud vendor to allocate it.&nbsp; When the requested instance type in the specified price category is readily available, the cloud vendor completes the allocation within Luna&rsquo;s default <a href="https://docs.elotl.co/luna/Configuration/#scaleuptimeout"><em><u>scaleUpTimeout</u></em></a> time of 10m.<br><br>When a requested instance type and category combination is not currently available, Luna generates a new request as follows.&nbsp; If the request fails with the cloud reporting insufficient capacity, Luna avoids the associated combination for a configurable back-off time and generates a new allocation request for the candidate with the next lowest estimated cost.&nbsp; If the cloud vendor keeps the request running for longer than <em>scaleUpTimeout,</em> Luna discontinues that request and, as in the failure case, avoids using the associated combination for a configurable back-off time and generates a new request for the candidate with the next lowest estimated cost.&nbsp; We&rsquo;ve found that Luna&rsquo;s strategy of discontinuing long-running allocation requests, which we&rsquo;ve seen often persist for 40m or more and then fail on GKE, is efficient since it allows Luna to retry instance allocation with an alternative candidate that is allocated successfully sooner.<br></div><h2 class="wsite-content-title"><font size="5">GKE Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The GKE runs were executed on a standard GKE regional cluster running K8s 1.32 in the <em>us-central1</em> region.&nbsp; This region offers a wide selection of GPU-enabled 
instance types and GKE regional clusters support <a href="https://cloud.google.com/blog/products/containers-kubernetes/choosing-a-regional-vs-zonal-gke-cluster"><u>more instance availability</u></a> than zonal clusters.&nbsp; We ran the workload during US daytime hours, likely a peak usage period for the region.&nbsp; Our goal was to capture data that reflects conditions when spot and on-demand GPU capacity might be limited, providing a conservative estimate of the spot benefit compared to what would be seen for off-peak runs.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, which specifies instance-offerings but no resource requests, Luna chose an <em>e2-medium</em> instance.&nbsp; This instance type has a low on-demand price ($0.0553/hr) and no issues were found with obtaining spot capacity for this instance type.&nbsp;<br><br>The main costs and capacity challenges were in allocating a node to host the GPU-enabled Ray cluster head.&nbsp; Results are given in Table 1.&nbsp; The first row represents the on-demand baseline for comparison with spot allocation.&nbsp; We initially attempted to have Luna allocate an on-demand node that matched the node used in the Ray example, i.e., an instance that could provide 4 T4 GPUs, 54 CPUs, and 54 GB memory, for which we specified no constraints on maximum GPUs or cost.&nbsp; However, Luna was not able to obtain an instance for that config after a round of trying all 5 candidate instance types with its default 10m <em>scaleUpTimeout</em> for each.&nbsp; Seeing that Luna had tried all candidates, we canceled the RayJob; while Luna would have continued to try to get a matching instance, and presumably would have eventually been successful, we considered the latency to get this instance type was too high for our use case.&nbsp; We tried a scaled-down config, with 2 T4 GPUs (as per the Ray example GPU SKU), <em>RAY_CLUSTER_CPU_COUNT</em> set to 27 CPUs, and <em>RAY_CLUSTER_MEMORY_SIZE set to</em> 27 GB memory, and Luna successfully obtained an instance which we used as our baseline.<br><br>We next had Luna try to allocate a spot node, using the baseline resource config with spot added to the price category.&nbsp; We also added more GPU SKUs to <em>RAY_CLUSTER_GPU_SKUS</em>, to give Luna more options to find spot nodes.&nbsp; And since the additional SKUs were more costly, we added a node cost max.&nbsp; After Luna tried two spot T4 instance types whose long-running scaling operations hit Luna&rsquo;s 10m <em>scaleUpTimeout</em> and were discontinued, Luna obtained a 2-GPU P4 spot instance, which was 38% cheaper than the on-demand 2-GPU T4 instance.&nbsp; Using Luna&rsquo;s strategy of retrying an alternative candidate when scale-up time exceeds <em>scaleUpTimeout</em>, an alternative spot instance was found in around 20m, rather than likely spending around 40m trying and ultimately failing to allocate the first candidate T4 spot instance.</div><div><div id="336950483674786012" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances 
Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>2</td><td>T4</td><td>on-demand</td><td>2</td><td>N/A</td><td>N/A</td><td>N1-standard-32 w/2 T4 GPUs (on-demand)</td><td>$2.22/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>2</td><td>T4,P4,L4</td><td>spot, on-demand</td><td>2</td><td>$4.75/hr</td><td>N1-standard-32 w/2 T4 GPUs (spot), N1-highmem-32 w/2 T4 GPUs (spot)</td><td>N1-standard-32 w/2 P4 GPUs (spot)</td><td>$1.36/hr</td><td>0.62</td></tr></tbody></table></div></div><div class="paragraph">Table 1: Luna GKE node allocation for RayJob GPU-enabled head with specified constraints</div><h2 class="wsite-content-title"><font size="5">AKS Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The AKS runs were executed on an AKS cluster running K8s 1.31 in the <em>east-us</em> region.&nbsp; As with GKE, the workload was run during US daytime, for a conservative estimate of the spot benefits.&nbsp; Note that to use spot, tolerations needed to be added to the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L24"><u>Ray head</u></a> and <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L68"><u>Ray job submitter</u></a>.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, Luna allocated a <em>Standard_B2als_v2</em> instance.&nbsp; This instance type has a low on-demand price ($0.0376/hr) and spot capacity was available for the type.<br><br>The results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 2.&nbsp; Luna was able to allocate an on-demand 4-GPU T4 node corresponding to the node used in the Ray example run, shown in row 1. 
&nbsp; However, there were challenges allocating a spot node for comparison.&nbsp; Luna was not able to allocate a spot node for the original config due to insufficient capacity.&nbsp; Also, Azure does not support many instance types with 2 GPUs, including having no 2-GPU T4 nodes.&nbsp; Hence, for spot allocation, the Ray head was scaled down to a config of 1 GPU with 14 CPUs and 14 GB memory.&nbsp; As with GKE spot allocation, more GPU SKU choices were added to <em>RAY_CLUSTER_GPU_SKUS</em>, along with a max node cost.&nbsp; With this config, Luna obtained a spot instance with 1 T4 GPU at a cost of $0.60/hr.&nbsp; To compare this 1-GPU cost to the baseline cost of $4.35/hr for 4 GPUs, the baseline cost was normalized via dividing it by 4 and the spot cost was compared to that quotient; the spot cost was 45% lower.</div><div><div id="852360129388830417" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline (normalized)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>on-demand</td><td>N/A</td><td>N/A</td><td>N/A</td><td>Standard_NC64as_T4_v3 (on-demand)</td><td>$4.35/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>1</td><td>T4,V100,A10</td><td>spot, on-demand</td><td>1</td><td>$4.75/hr</td><td>N/A</td><td>Standard_NC16as_T4_v3 (spot)</td><td>$0.60/hr</td><td>0.55</td></tr></tbody></table></div></div><div class="paragraph">Table 2: Luna AKS node allocation for RayJob GPU-enabled head with specified constraints</div><h2 class="wsite-content-title"><font size="5">EKS Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The EKS runs were executed on an EKS cluster running K8s 1.32 in the <em>us-west-2</em> region.&nbsp; As was the case for GKE and AKS, the workload was run during US daytime hours, with the intent of yielding a conservative estimate of the spot benefits.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, Luna allocated a <em>t3a.small</em> instance.&nbsp; This instance type has a low on-demand price ($0.0188/hr) and there were no issues obtaining spot capacity for the type.&nbsp;<br><br>Results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 3.&nbsp; Luna was able to allocate an on-demand node with 4 T4 GPUs as in the Ray documentation; the result is shown in row 1.&nbsp; Note that <em>RAY_CLUSTER_CPU_COUNT</em> was dropped to 44 and <em>RAY_CLUSTER_MEMORY_SIZE</em> to 44 GB, given that AWS does not have any 4-GPU T4 instances with enough CPUs to handle the original request of 54.&nbsp; Row 2 shows the results of adding spot to the input pricing category; Luna was able to allocate a spot version of the same instance type.<br></div><div><div id="457489514558201647" align="left" style="width: 100%; 
overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>on-demand</td><td>N/A</td><td>N/A</td><td>N/A</td><td>g4dn.12xlarge (on-demand)</td><td>$3.91/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>spot, on-demand</td><td>4</td><td>$4.75/hr</td><td>N/A</td><td>g4dn.12xlarge (spot)</td><td>$1.96/hr</td><td>0.50</td></tr></tbody></table></div></div><div class="paragraph">Table 3: Luna EKS node allocation for RayJob GPU-enabled head with specified constraints<br></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">Offline prediction jobs are typically not considered sensitive to node allocation latency and to the impact of spot reclamation and replacement and hence are ideal candidates for spot node use.&nbsp; We&rsquo;ve presented the results of using the Luna smart cluster autoscaler to allocate spot and on-demand instances on GKE, AKS, and EKS clusters for an example offline prediction job.&nbsp; We&rsquo;ve shown conservative estimated hourly cost savings of 38-50% using spot, achieved in an easy (and hence fun!) 
way with Luna&rsquo;s efficient approach to instance allocation search.<br><br>We invite you to have fun with Luna!&nbsp; Download the <a href="https://www.elotl.co/luna-free-trial.html"><u>free trial version of Luna</u></a> or reach out to us at <a href="mailto:info@elotl.co">info@elotl.co</a> if you would like to try Luna for your batch inference (or any other) workloads!<br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br></div>]]></content:encoded></item><item><title><![CDATA[Reducing Deploy Time for LLM Serving on Cloud Kubernetes with Luna Smart Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler#comments]]></comments><pubDate>Tue, 28 Jan 2025 14:30:31 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler</guid><description><![CDATA[OVERVIEW26 minutes!&nbsp; 26 long minutes was our wait time in one example case for our chatbot to be operational.&nbsp; Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness!&nbsp; Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model.&nbsp; [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">OVERVIEW</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:213px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler.png?1738074867" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">26 minutes!&nbsp; 26 long minutes was our wait time in one example case for our chatbot to be operational.&nbsp; Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness!&nbsp; Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model.&nbsp; But we hoped we could make some simple changes to speed it up, and we did.&nbsp; In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes at deployment times that won't bum you out.<br><br>We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a>.&nbsp; Our example LLM serving workload is a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay</u></a> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html"><u>RayService</u></a> using <a href="https://docs.vllm.ai/en/latest/"><u>vLLM</u></a> to serve an open-source model downloaded from <a href="https://huggingface.co/models"><u>HuggingFace</u></a>.&nbsp; <strong>We measured deploy-time improvements of up to 60%.</strong><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">APPROACH</font><br></h2><div class="paragraph" style="text-align:left;">We observed that deploying LLM serving workloads on autoscaled cloud Kubernetes clusters can take between 13 and 26 minutes.&nbsp; Key components of this time include adding a GPU node to the cluster to host the LLM serving worker pod, fetching the container image for that pod from a container registry, and downloading the LLM weights for model serving by that pod.&nbsp; There are a number of approaches to reducing LLM deploy time, which have various cost and complexity trade-offs.<br><br>One approach to reducing node scale-up time is to use node over-provisioning via low-priority pod deployment to keep extra node(s) available for scale-up, and to have a daemonset pre-pull the container image(s) of interest into the image cache on the extra node(s).&nbsp; We utilized this approach in our previous work described in <a href="https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems"><u>this Elotl
blog</u></a> and Scale describes using this kind of approach in <a href="https://scale.com/blog/reduce-cold-start-time-llm-inference"><u>this Scale blog</u></a>.&nbsp; A downside with this approach is the cost overhead of the extra idle node(s).&nbsp; Our previous work involved serving ML models that could run on CPU-only nodes, where the cost overhead was relatively low; our current work involves serving LLM models requiring more expensive GPU nodes, so the cost overhead was higher than we wanted.&nbsp; Hence, we focused on allocating GPU nodes on demand and on techniques to quickly populate new nodes with the image of interest.<br><br>To quickly populate an image on new nodes, we first explored using Dragonfly pre-seeding with peer-to-peer distribution, but we did not get the performance results we expected despite a number of tuning attempts, and we were also deterred by its usage complexity.&nbsp; We then looked at using cloud-vendor solutions to preload or cache/stream the images and found the solutions gave good results out-of-the-box, and were well-supported by the Luna smart cluster autoscaler.&nbsp; A drawback with this approach is the need for cloud-specific setup, but since each cloud's setup is fairly simple and reasonably well-documented, this was not a deal-breaker for us.&nbsp; And we&rsquo;re including setup detail links in this blog, so hopefully it will be even easier for you, blog reader!<br><br>With respect to reducing the time to download the model weights, we wanted to utilize HuggingFace's optimizations in this area before looking at the ROI of pursuing further improvement on our side.&nbsp; We found downloading with <a href="https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhubenablehftransfer"><u>HF_HUB_ENABLE_HF_TRANSFER enabled</u></a> gave a modest additional improvement in startup time relative to that given by the image load improvements.&nbsp; We have not yet looked at techniques such as pre-downloading the weights to shared fast storage with corresponding retargeting of the model loading path.&nbsp; We note that our model of interest is stored using the <a href="https://github.com/huggingface/safetensors"><u>safetensors</u></a> representation.<br></div><h2 class="wsite-content-title"><font size="5">PER-CLOUD IMPROVEMENTS</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we present our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler.&nbsp; Our example LLM serving workload is a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace.&nbsp; Our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts.<br><br>We collected baseline and improved deployment times for a KubeRay RayService using vLLM to serve the open-source model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><u>microsoft/Phi-3-mini-4k-instruct</u></a> downloaded from HuggingFace.&nbsp; Deployment time is measured from K8s submission until the <em>service/llm-model-serve-serve-svc</em> endpoint is ready.&nbsp; We ran both static and dynamic setups.&nbsp; For the static setup, we ran without the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with <em>replicas</em> set to 1.&nbsp; For the dynamic setup, we ran with the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with <em>replicas</em> and <em>minReplicas</em> set to 0; the Ray Autoscaler scaled up to 1 replica during the deployment.&nbsp; The dynamic setup requires more time to deploy than the static setup, since the scale-up from 0 to 1 GPU Ray worker replica does not start until after the Ray head is configured and the service workload is submitted to it, whereas in the static setup, the single GPU Ray worker is created in parallel with the CPU Ray head.<br></div>
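<div class="paragraph" style="text-align:left;">The difference between the two setups comes down to a few fields in the RayService worker group spec.&nbsp; A minimal sketch of both variants follows; the group name and replica bounds are illustrative, and the dynamic variant assumes the Ray Autoscaler is enabled on the underlying RayCluster (e.g., via KubeRay's <em>enableInTreeAutoscaling</em> flag):<br><pre><code># Static setup: the single GPU worker is created in parallel with the head.
workerGroupSpecs:
- groupName: gpu-group   # illustrative name
  replicas: 1
---
# Dynamic setup: the worker group starts empty; the Ray Autoscaler
# scales it from 0 to 1 after the head is up and the workload is submitted.
workerGroupSpecs:
- groupName: gpu-group
  replicas: 0
  minReplicas: 0
  maxReplicas: 1
</code></pre></div>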
<h2 class="wsite-content-title"><font size="5">Reducing EKS LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on EKS, we chose the strategy described <a href="https://aws.amazon.com/blogs/containers/reduce-container-startup-time-on-amazon-eks-with-bottlerocket-data-volume/"><u>here</u></a> of using Bottlerocket node images with a data volume pre-populated to contain a snapshot of our container image.&nbsp; The Luna smart autoscaler supports allocating Bottlerocket nodes.&nbsp; As described below, we built an ECR image for our workload container, took a snapshot of it, and configured Luna to use Bottlerocket with our snapshot.<br><br>Our LLM serving workload uses the ray-ml image <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub, which is also published to ECR as <em>public.ecr.aws/anyscale/ray-ml:2.33.0-py311</em>.&nbsp; In addition, our RayService config ran &ldquo;pip install vllm==0.5.4&rdquo;, which we discovered impacted scale-up time.&nbsp; And to use HF_HUB_ENABLE_HF_TRANSFER to speed up model download, we needed to include &ldquo;pip install hf_transfer&rdquo; as well.&nbsp; So we created a new ECR container image that combined <em>public.ecr.aws/anyscale/ray-ml:2.33.0-py311</em> with the vllm and hf_transfer pip installs.&nbsp; We took a snapshot of the resulting ECR image using the instructions <a href="https://github.com/aws-samples/bottlerocket-images-cache?tab=readme-ov-file#build-ebs-snapshot-with-cached-container-image"><u>here</u></a>.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a>, with Luna configured as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#bottlerocket-node-images"><u>here</u></a> to use Bottlerocket node images and the snapshot.<br><br>Table 1 contains the EKS measurement results, with the improved time including the impact of both the reduced image load time and reduced model download time using <em>hf_transfer</em>.&nbsp; Both static and dynamic deployment times were significantly improved, with static time reduced by 26% and dynamic time reduced by 47%, almost twice as much.&nbsp; We expected the improvement to be higher for the dynamic case, given that the time to create the worker node and pull its image is not overlapped with the time to create the head node and pull its image, so the worker image pull speedup is more impactful.&nbsp; We note that using <em>hf_transfer</em> for model download without also using the custom ECR image is slower than the baseline; the time needed to do the &ldquo;pip install hf_transfer&rdquo; at runtime is higher than the time saved by the faster model download.<br><br></div>
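<div class="paragraph" style="text-align:left;">A minimal sketch of that combined image follows; the base image and pip packages are exactly those described above, while the registry, repository, and tag you push to are your own choice:<br><pre><code># Dockerfile sketch: ray-ml base plus the two pip installs we had been
# running at pod startup, baked in so new nodes can skip that work.
FROM public.ecr.aws/anyscale/ray-ml:2.33.0-py311
RUN pip install vllm==0.5.4 hf_transfer
</code></pre></div>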
style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">811s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fastereks.yaml&quot;">598s</a></td><td>26%</td><td>t3a.xlarge: 4 CPUs, 16GB</td><td>g6.4xlarge: 16 CPUs, 64GB, 1 L4 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml&quot;">1308s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fastereks.yaml&quot;">698s</a></td><td>47%</td><td>t3a.xlarge: 4 CPUs, 16GB</td><td>g6.4xlarge: 16 CPUs, 64GB, 1 L4 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 1: EKS RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">Reducing GKE LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on GKE, we chose the strategy described <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/image-streaming"><u>here</u></a> of Image Streaming from the GCP Artifact Registry with warmed multi-level caches.&nbsp; The Luna smart autoscaler supports GKE Image Streaming.&nbsp; As described below, we built a GCR Artifact Registry image for our workload container, enabled Image Streaming on our cluster, and configured Luna to allow the nodes it allocates to pull from Artifact Registry for Image Streaming.<br><br>The workload container we built consisted of <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR image.&nbsp; We stored it in the GCP Artifact Registry.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a> and enabled Image Streaming on it, and we configured Luna as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#gke"><u>here</u></a> to allow the nodes it allocates to pull from Artifact Registry.&nbsp; Note that we did an initial fetch of the image to prewarm GCP&rsquo;s multi-level caches, which is required to see the image load benefits.<br><br>Table 2 contains the GKE measurement results, with the improved time including the impact of both the reduced image load time and reduced model download time using hf_transfer.&nbsp; Again, the static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 48%.&nbsp; Unlike on EKS, we did not see the expected much higher impact of the improvements in the dynamic case; we speculate that this is because there was some serialization of the node setup even in the static case.&nbsp; Note that an instance type with slightly larger memory (15GB -&gt; 16GB) was allocated for the Ray Head in the Dynamic setup, to accommodate the [modest] additional resources needed to run the Ray Autoscaler; this was not needed in the EKS case, since instance type chosen for the 
<div><div id="558919444650692411" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">RayService Deployment Setup</th><th style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">751s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fastergke.yaml">399s</a></td><td>47%</td><td>n1-standard-4: 4 CPUs, 15GB</td><td>g2-standard-12: 12 CPUs, 48GB, 1 L4 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml">1029s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fastergke.yaml">536s</a></td><td>48%</td><td>c2d-highcpu-8: 8 CPUs, 16GB</td><td>g2-standard-12: 12 CPUs, 48GB, 1 L4 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 2: GKE RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">Reducing AKS LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on AKS, we chose the preview-feature strategy described <a href="https://learn.microsoft.com/en-us/azure/aks/artifact-streaming"><u>here</u></a> of Artifact Streaming from the Azure Container Registry to AKS.&nbsp; The Luna smart autoscaler supports AKS Artifact Streaming.&nbsp; As described below, we built an ACR image, enabled Artifact Streaming on it, and configured Luna to enable Artifact Streaming on the nodes that it creates.<br><br>Our ACR image for the workload container consisted of <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR and GCR images.&nbsp; As per the feature link, we registered the ArtifactStreamingPreview feature in our subscription and enabled Artifact Streaming on our ACR image.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a>, and configured Luna as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#aks"><u>here</u></a> to enable Artifact Streaming on the nodes that it creates.<br><br>Table 3 contains the measurement results on AKS, with the improved time including the impact of both the reduced image load time and reduced model download time using hf_transfer.&nbsp; Both static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 60%.&nbsp; As we had expected and had also observed on EKS, the dynamic time reduction was higher than the static reduction.&nbsp; As on EKS and GKE, we note that using
<em>hf_transfer</em> for model download without also using the custom ACR image is slower than the baseline, due to the runtime cost of the &ldquo;pip install hf_transfer&rdquo;.<br><br></div><div><div id="448104927305253831" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">RayService Deployment Setup</th><th style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">885s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fasteraks.yaml">472s</a></td><td>47%</td><td>Standard_B4as_v2: 4 CPUs, 16GB</td><td>Standard_NV36ads_A10_v5: 36 CPUs, 440 GB, 1 A10 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml">1598s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fasteraks.yaml">647s</a></td><td>60%</td><td>Standard_B4as_v2: 4 CPUs, 16GB</td><td>Standard_NV36ads_A10_v5: 36 CPUs, 440 GB, 1 A10 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 3: AKS RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">In this blog, we've shared our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters.&nbsp; The Luna smart cluster autoscaler support for each cloud&rsquo;s image fetch acceleration feature made our job easier.&nbsp; For our example LLM serving workload of a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace, deploy-time was cut roughly in half in most cases.&nbsp; For EKS, deploy-time was reduced by 26% to 47%; for GKE, deploy-time was reduced by 47% to 48%; and for AKS, deploy-time was reduced by 47% to 60%.<br><br>By the way, we note that our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts.&nbsp; The methods we present do not yield the very low latencies of hosted LLM serving scale-up such as that provided by <a href="https://www.anyscale.com/blog/autoscale-large-ai-models-faster"><u>the Anyscale product</u></a>, which uses a custom container image format and client to lower image pull times, a special library for fast image loading that streams tensors directly from cloud storage onto the GPU, and a direct interface between the Ray autoscaler and the system control plane for accelerated node allocation.&nbsp; Such hosted products can be a great choice, depending on your use case and budget.<br><br>Please reach out to share your experiences with these deploy-time reduction strategies for your scale-up scenarios.&nbsp; You can get the free trial version of Luna <a href="https://www.elotl.co/luna-free-trial.html"><u>here</u></a>.&nbsp; Thanks for reading our blog and we&rsquo;ll post
more material as/when we find more improvements!<br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[EKS Auto Mode vs. Luna: Choosing the Right Scaling Strategy for Your Kubernetes Workloads]]></title><link><![CDATA[https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads]]></link><comments><![CDATA[https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads#comments]]></comments><pubDate>Tue, 14 Jan 2025 18:30:38 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads</guid><description><![CDATA[ Running Kubernetes on AWS using Elastic Kubernetes Service (EKS) offers a robust platform for container orchestration, but the challenge of managing the underlying compute infrastructure persists. This limitation can be addressed through various approaches, including the fully managed simplicity of EKS Auto Mode or the granular control offered by an intelligent Kubernetes cluster autoscaler like Luna. In this post, we&rsquo;ll explore the advantages of each, helping you choose the best scaling  [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads.png?1736880212" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">Running Kubernetes on AWS using Elastic Kubernetes Service (EKS) offers a robust platform for container orchestration, but the challenge of managing the underlying compute infrastructure persists. <span style="color:rgb(0, 0, 0); font-weight:400">This limitation can be addressed through various approaches, including the fully managed simplicity of <strong>EKS Auto Mode</strong> or the granular control offered by an intelligent Kubernetes cluster autoscaler like <strong>Luna</strong>.</span> In this post, we&rsquo;ll explore the advantages of each, helping you choose the best scaling strategy for your workloads.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">EKS Auto Mode is a fully managed solution aimed at reducing operational complexity for Kubernetes clusters on AWS.
It automates essential tasks like node provisioning, scaling, and lifecycle management, offering an ideal entry point for teams new to EKS or operating simpler workloads.<br /><br />In contrast, compute autoscalers like Luna offer greater flexibility and customization, allowing you to optimize your infrastructure for the demands of complex and/or resource-intensive workloads.<br /><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Understanding the nuances of these approaches is key to selecting the optimal scaling solution for your Kubernetes deployments.<br></div>  <h2 class="wsite-content-title"><font size="5">EKS Auto Mode: The Allure of Simplicity</font><br></h2>  <div class="paragraph" style="text-align:left;">EKS Auto Mode shines in its simplicity. AWS takes on the heavy lifting of managing your worker nodes, handling everything from provisioning and scaling to OS patching and even instance type selection. This "swift lift" approach offers several key advantages:<ul><li><strong>Reduced Operational Burden:</strong> By automating core infrastructure management tasks, Auto Mode frees up your team to focus on application development and deployment.</li><li><strong>Simplified Security Posture:</strong> Auto Mode defaults to Bottlerocket, a purpose-built, security-focused container operating system. Bottlerocket's minimal attack surface, CIS Level 1 benchmark certification, and FIPS 140-3 compliance provide a strong foundation for secure container workloads.</li><li><strong>Streamlined Upgrades:</strong> Leveraging Karpenter under the hood, Auto Mode automates node refreshes and ensures consistent patching, minimizing security risks and maintaining cluster stability.</li><li><strong>Simplified Setup with Built-in Add-ons:</strong> Essential EKS add-ons, such as the EBS CSI driver for persistent storage and the ALB Ingress Controller for load balancing, are automatically deployed during cluster creation, further simplifying the setup process.</li></ul> However, this simplicity comes at a cost. Auto Mode's opinionated approach introduces several limitations:<ul><li><strong>Irreversible Activation</strong>: Once EKS Auto Mode is enabled on a cluster, it cannot be disabled. This irreversible change requires careful consideration before activation, as it commits the cluster to the Auto Mode management permanently.</li><li><strong>Limited Node Configuration Flexibility</strong>: EKS Auto Mode offers minimal control over node shapes and configurations. You cannot include or exclude specific instance sizes, or fine-tune the infrastructure to meet specialized workload requirements. This lack of flexibility means that Auto Mode's node provisioning is based on a predefined set of instance types selected by AWS.</li><li><strong>Limited Customization</strong>: EKS Auto Mode restricts customization at the node level. You are unable to modify kernel parameters, install custom system packages, or adjust kubelet settings. These limitations make it challenging to meet the requirements of workloads that depend on specific OS configurations or custom software installations.</li><li><strong>Spot Support</strong>:<br />While EKS Auto Mode simplifies operations, it does not leverage or support <strong>spot instances</strong> for cost savings, unlike some advanced autoscalers like Luna. 
This could result in higher operational costs for workloads where spot instances could be safely utilized.</li><li><strong>Bottlerocket Dependency:</strong> The reliance on Bottlerocket, while beneficial for security, prevents the use of custom Amazon Machine Images (AMIs), which might be necessary for specific software or compliance requirements.</li><li><strong>Potential for IP Address Exhaustion:</strong> Auto Mode utilizes prefix delegation, assigning /28 CIDR blocks to each node. In VPCs with limited IP address space, this can lead to IP exhaustion issues, preventing the creation of new nodes and halting cluster scaling altogether.</li><li><strong>Default Networking Overhead</strong>:<br />EKS Auto Mode relies on AWS-managed networking configurations, which can introduce inefficiencies in specific scenarios, such as cross-AZ traffic or high-latency workloads, due to default routing setups.</li><li><strong>Reduced Visibility:</strong> The automated nature of Auto Mode reduces direct visibility into the node provisioning and configuration processes, making detailed troubleshooting more reliant on AWS's logging and monitoring tools.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">When Does EKS Auto Mode Shine?</font><br></h2>  <div class="paragraph" style="text-align:left;">Auto Mode is ideal for:<ul><li><strong>Small, Simple Clusters</strong>: Perfect for teams running standard workloads without complex resource needs.</li><li><strong>New Users</strong>: A smooth on-ramp for Kubernetes beginners, focusing on applications without delving into infrastructure.</li><li><strong>Testing and Experimentation</strong>: Auto Mode's streamlined setup makes it ideal for quickly creating and tearing down temporary clusters for testing, prototyping, or experimentation.<br /></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Luna: Embracing Flexibility and Control</font><br></h2>  <div class="paragraph" style="text-align:left;">For teams managing larger or more complex clusters, Luna&rsquo;s flexibility and control offer significant advantages.</div>  <h2 class="wsite-content-title"><font size="5">What Does Luna Offer?</font><br></h2>  <div class="paragraph" style="text-align:left;">Luna provides a dynamic, customizable approach to autoscaling that empowers you to fine-tune every aspect of node management:<ul><li><strong>Highly Flexible Instance Selection:</strong> Luna dynamically selects appropriate node shapes based on workload requirements such as CPU, memory, architecture (including ARM), and other criteria. This flexibility ensures that the infrastructure is tailored to meet the unique demands of your applications.</li><li><strong>Spot Instance Support for Cost Optimization:</strong> Luna enables the use of spot instances, provisioning cost-effective nodes when desired and when capacity is available.
By incorporating spot instances and mixed instance types, Luna significantly reduces infrastructure costs while maintaining high availability.</li><li><strong>Granular Instance Control:</strong> Inclusion and exclusion lists allow you to define allowed and disallowed instance types/families, optimizing for cost, performance, or specific hardware requirements.</li><li><strong>Cost-Driven Instance Selection:</strong> Luna dynamically selects the least expensive, available instance shape that meets workload requirements, minimizing infrastructure spending.</li><li><strong>Hardware Specialization</strong>: Supports GPU acceleration and other specialized hardware for resource-intensive applications.</li><li><strong>Support for Custom AMIs:</strong> Luna allows you to choose a specific AMI or use your own custom AMI, enabling fine-grained control over the OS and installed software.</li><li><strong>Advanced Scheduling Capabilities:</strong> Features like node taints, tolerations, and node affinity allow precise control over pod placement; Luna provisions the appropriate nodes to support this placement as required.</li><li><strong>Serverless-like Experience:</strong> Luna automates much of the underlying node management, offering a simplified operational experience similar to EKS Auto Mode but with greater flexibility.<br /></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Key Benefits of Luna</font><br></h2>  <div class="paragraph" style="text-align:left;"><ul><li><strong>Unparalleled Flexibility</strong>: Ideal for environments requiring specific configurations, hardware accelerations, or software setups.</li><li><strong>Advanced Cost Optimization with Spot</strong>: Spot instance utilization can drastically reduce infrastructure costs compared to on-demand-only nodes.</li><li><strong>Scalable for Large Clusters</strong>: As clusters grow in complexity and size, Luna ensures scalability without sacrificing control.</li><li><strong>Enhanced Workload Support</strong>: Handles diverse and complex workloads better than Auto Mode, offering tailored solutions for every use case.</li><li><strong>Fine-Grained Control:</strong> If your workloads demand specific instance types, OS configurations, or hardware acceleration (like GPUs), an Intelligent Kubernetes Cluster Autoscaler such as Luna is essential (see the sketch after this list).</li><li><strong>Ease of Deployment, Configuration, and Upgrades: </strong>Compared to other autoscalers, Luna streamlines the deployment and configuration process for autoscaling within your EKS clusters. While it requires slightly more setup than EKS Auto Mode, it offers greater flexibility and customization with relatively low effort. Additionally, Luna supports smooth upgrades, ensuring new features and improvements can be rolled out with minimal disruption to cluster operations.<br /></li></ul></div>
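  <div class="paragraph" style="text-align:left;">As a flavor of that per-workload control, a pod template can opt into Luna management and constrain the instance types Luna will allocate for it.&nbsp; A minimal sketch; the label and annotation shown are the ones used in our Helix + Luna demo, the regexp value is illustrative, and the Luna docs cover the full set of options:<br /><pre><code># Pod-template fragment: opt the workload into Luna management and
# restrict node allocation to an instance-type family.
metadata:
  labels:
    elotl-luna: "true"
  annotations:
    node.elotl.co/instance-type-regexp: "g5.*"   # illustrative pattern
</code></pre></div>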
  <h2 class="wsite-content-title"><font size="5"><strong>Choosing the Right Approach</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">The decision between EKS Auto Mode and Luna boils down to your priorities and workload characteristics:<ul><li><strong>Consider Choosing EKS Auto Mode if</strong>:<ul><li>You&rsquo;re running small, straightforward clusters with minimal customization needs.</li><li>You&rsquo;re new to Kubernetes and want a streamlined experience.</li><li>Your team prioritizes ease of use over granular control.</li></ul></li><li><strong>Consider Choosing Luna if</strong>:<ul><li>You need precise control over infrastructure, including custom AMIs and hardware configurations.</li><li>Your workloads demand advanced scheduling, cost optimization, or specialized resources like GPUs.</li><li>You&rsquo;re managing large clusters with bursty workloads and/or diverse application requirements.<br /></li></ul></li></ul></div>  <h2 class="wsite-content-title"><font size="5"><strong>Conclusion</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">Kubernetes compute scaling within EKS requires choosing a solution that aligns with your operational priorities, workload complexity, and cost management goals. <strong>EKS Auto Mode</strong> simplifies Kubernetes management with automation and preconfigured settings, making it an excellent choice for smaller clusters, standard workloads, or teams looking for a low-maintenance entry point. Its ease of use allows you to focus on deploying applications without being bogged down by infrastructure details.<br /><br />On the other hand, an Intelligent Kubernetes Cluster Autoscaler like <strong>Luna</strong> offers the flexibility, control, and cost optimization needed for growing, complex, bursty, or resource-intensive deployments. Whether you're fine-tuning node configurations, optimizing for diverse workload requirements, or leveraging advanced features like spot instances, Luna provides the autoscaling necessary to efficiently scale clusters tailored to your unique needs and workloads.<br /><br />The choice isn&rsquo;t about one being inherently better than the other&mdash;it&rsquo;s about understanding your requirements. For teams prioritizing simplicity and rapid deployment, Auto Mode is a viable option. For those needing advanced scaling capabilities and greater customization, Luna&rsquo;s robust feature set provides unmatched value. By carefully evaluating these factors, you can adopt the solution that delivers the best results for your Kubernetes journey on AWS.<br /><br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><br /><strong>Disclaimer</strong>: The features and limitations of EKS Auto Mode as described in this blog are based on the author&rsquo;s understanding at the time of publication.
AWS may update or change these features over time, and readers are encouraged to consult the official AWS documentation for the most up-to-date information.<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Helix + Luna: Efficient GenAI for Serious People]]></title><link><![CDATA[https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people]]></link><comments><![CDATA[https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people#comments]]></comments><pubDate>Fri, 15 Nov 2024 22:41:55 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people</guid><description><![CDATA[Why Helix + Luna?  Helix enables companies to leverage LLMs while retaining complete control over data and infrastructure. By utilizing Helix, organizations can connect their data&mdash;either locally or through APIs&mdash;to powerful AI models without transferring sensitive information outside of their ecosystem. Helix&rsquo;s solution empowers companies to deploy open-source LLMs on their own resources, including cloud-based Kubernetes (K8s) clusters. This approach provides the scalabil [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><strong><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Why Helix + Luna?</font></span></span></strong></div>  <div class="paragraph"><span><a href="https://tryhelix.ai/"><span style="color:rgb(17, 85, 204)">Helix</span></a><span style="color:rgb(0, 0, 0)"> enables companies to leverage LLMs while retaining complete control over data and infrastructure. By utilizing Helix, organizations can connect their data&mdash;either locally or through APIs&mdash;to powerful AI models without transferring sensitive information outside of their ecosystem. Helix&rsquo;s solution empowers companies to deploy open-source LLMs on their own resources, including cloud-based Kubernetes (K8s) clusters. This approach provides the scalability and resilience of cloud infrastructure with the privacy and control of on-premises deployment. Designed to meet the needs of modern enterprises, Helix enables robust AI integration, whether for enhancing customer interactions, streamlining internal workflows, or extracting valuable insights from vast data sets.</span></span><br /><br /><span><a href="https://www.elotl.co/luna.html"><span style="color:rgb(17, 85, 204)">Elotl Luna</span></a><span style="color:rgb(0, 0, 0)"> is a smart Kubernetes cluster autoscaler that runs on the 4 major K8s cloud platforms, i.e., AWS EKS, GCP GKE, Azure AKS, and Oracle OKE.&nbsp; It adds and removes right-sized compute instances from cloud Kubernetes clusters as needed, thereby reducing operational complexity and preventing wasted spend.
Luna is ideally suited for deploying AI/ML platforms running bursty workloads that need special expensive resources such as GPUs.</span></span><br /><span><span style="color:rgb(0, 0, 0)">&nbsp;</span></span><br /><span><span style="color:rgb(0, 0, 0)">Combining Helix with Luna in a cloud Kubernetes cluster adds dynamic resource management to Helix, allowing compute instances to be allocated on demand to handle the Helix workload, and later deallocated when no longer needed.&nbsp; This flexible scaling improves efficiency and reduces costs, particularly important when expensive cloud GPU resources are used.<br /></span></span><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Helix + Luna Demo</font></span></span></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">This video demonstrates the combination of Helix and Luna in action.<br /></span></span><br /></div>  <div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"> <div class="wsite-youtube-container">  <iframe src="//www.youtube.com/embed/pm67IV5eo8U?wmode=opaque" frameborder="0" allowfullscreen></iframe> </div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">In this demo, Helix was installed on a GKE cluster initially composed of 3 </span><span style="color:rgb(0, 0, 0)">e2-medium</span><span style="color:rgb(0, 0, 0)"> CPU instances, to run Helix and Luna, and 1 </span><span style="color:rgb(0, 0, 0)">g2-standard-16</span><span style="color:rgb(0, 0, 0)"> L4 GPU instance with 150 GB disk, for the LLM model, using </span><a href="https://docs.helix.ml/helix/private-deployment/manual-install/gke/"><span style="color:rgb(17, 85, 204)">these instructions</span></a><span style="color:rgb(0, 0, 0)">.&nbsp; The </span><a href="https://www.elotl.co/luna-free-trial.html"><span style="color:rgb(17, 85, 204)">Luna free trial version</span></a><span style="color:rgb(0, 0, 0)"> was used, with its gcp.diskSizeGb option set to 150.&nbsp; After setup, the </span><span style="color:rgb(0, 0, 0)">my-helix-runner</span><span style="color:rgb(0, 0, 0)"> deployment was edited to set its replicas to 0 and its pod template to include the Luna management label </span><span style="color:rgb(0, 0, 0)">elotl-luna=true</span><span style="color:rgb(0, 0, 0)"> and instance type selector annotation </span><span style="color:rgb(0, 0, 0)">node.elotl.co/instance-type-regexp: g2-standard-16</span><span style="color:rgb(0, 0, 0)">.&nbsp; Next, the statically-allocated </span><span style="color:rgb(0, 0, 0)">g2-standard-16</span><span style="color:rgb(0, 0, 0)"> node was removed from the cluster, since Luna would handle allocating GPU nodes in response to scaling the Helix runner replicas.</span></span>
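 A minimal sketch of that edit as it lands in the Deployment spec follows (field placement per the standard Deployment schema):<br /><pre><code># my-helix-runner Deployment edit: scale to zero and let Luna
# allocate a right-sized GPU node on demand when replicas increase.
spec:
  replicas: 0
  template:
    metadata:
      labels:
        elotl-luna: "true"                                  # Luna management label
      annotations:
        node.elotl.co/instance-type-regexp: g2-standard-16  # instance type selector
</code></pre>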
<br /><span><span style="color:rgb(0, 0, 0)">Then the command </span><span style="color:rgb(0, 0, 0)">kubectl scale --replicas=1 deployment.apps/my-helix-runner</span><span style="color:rgb(0, 0, 0)"> was used to set the number of replicas to 1.&nbsp; In response, Luna added a new node to the K8s cluster.&nbsp; Note that any further changes in the Helix replicas count would trigger corresponding Luna node add or delete operations.</span></span><br /><br /></div>  <div class="paragraph"><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Try Helix + Luna!</font></span></span></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">We want you to benefit from the power of Helix to handle your GenAI workloads in your cloud K8s cluster along with the power of Luna to right-size your cluster&rsquo;s resources.&nbsp; We plan to hold a workshop on doing this in the near future.&nbsp; Please reach out to Tamao at tamao@helix.ml if you'd like to attend or if you&rsquo;d like to get started on this in the meantime.</span></span><br /><br /></div>  <div class="paragraph"><strong>Authors</strong>:</div>  <div class="paragraph">Anne Holler (Elotl), Chris Sterry (Helix), Luke Marsden (Helix)</div>]]></content:encoded></item><item><title><![CDATA[Mastering Kubernetes Autoscaling: How Luna Combines Bin-Packing and Bin-Selection for Optimal Cluster Scaling Efficiency]]></title><link><![CDATA[https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency]]></link><comments><![CDATA[https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency#comments]]></comments><pubDate>Thu, 03 Oct 2024 18:53:34 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency</guid><description><![CDATA[ In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to delve deeper. The real game-changer? Cluster Autoscalers. These tools dynamically adjust the size of your cluster, ensuring you meet workload demands without over-provisioning resources. But while many autoscalers focus solely on bin-packing, Luna takes it a step further with its innovative bin-selection feature, delivering an all-encompassing solution for  [...]
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency.png?1727982227" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to delve deeper. The real game-changer? <strong>Cluster Autoscalers</strong>. These tools dynamically adjust the size of your cluster, ensuring you meet workload demands without over-provisioning resources. But while many autoscalers focus solely on <strong>bin-packing</strong>, <strong>Luna</strong> takes it a step further with its innovative <strong>bin-selection</strong> feature, delivering an all-encompassing solution for workload management and cost efficiency.<br /><br />In this blog, we will explore both <strong>bin-packing</strong> and <strong>bin-selection</strong>, two essential strategies for Kubernetes autoscaling. By leveraging <strong>Luna</strong>, you can maximize efficiency, minimize waste, and keep costs under control, all while handling the complexities of varying workload sizes and resource requirements. Let&rsquo;s dive in!<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">What is Bin-Packing in Kubernetes?</font><br></h2>  <div class="paragraph" style="text-align:left;"><strong>Bin-packing</strong> is the default approach for optimizing pod placement in Kubernetes, maximizing resource utilization across nodes. The concept is simple: pack as many items (pods) into as few bins (nodes) as possible, maximizing resource utilization and minimizing the number of nodes required.<br /><br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">In Kubernetes, bin-packing refers to placing pods onto nodes in such a way that CPU, memory, and other resources are used efficiently. <strong>Luna</strong> excels at this by dynamically adjusting the number of provisioned nodes based on real-time resource demands. Rather than manually selecting specific node types, Luna allows you to configure bin-packing node requirements such as:<ul><li>binPackingNodeCpu</li><li>binPackingNodeMemory</li><li>binPackingNodeGPU</li><li>binPackingNodeTypeRegexp</li><li>binPackingNodePricing</li></ul> For example, if you set binPackingNodeCpu to 4 and binPackingNodeMemory to 8Gi, Luna could allocate a cost-effective node like c2d-highcpu-4 in GKE, optimizing for price and resource needs. Note that, depending on specific cluster and workload needs, deploying multiple instances of Luna can provide the flexibility to leverage different bin-packing node attributes.<br /><br />The strength of Luna lies in its precision and ease of use.
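<br /><br />As a sketch of such a configuration, written as a YAML settings fragment (the option names are the Luna bin-packing settings listed above; the values, and the exact configuration format for your install, are illustrative):<br /><pre><code># Illustrative bin-packing settings: pack pods onto 4-CPU / 8Gi nodes,
# constrained to a GKE instance family, preferring spot pricing.
binPackingNodeCpu: "4"
binPackingNodeMemory: "8Gi"
binPackingNodeGPU: "0"               # CPU-only bin-packing nodes
binPackingNodeTypeRegexp: "^c2d-.*"  # illustrative family restriction
binPackingNodePricing: "spot"        # illustrative
</code></pre>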
Once configured, Luna selects the best-fit node size based on your predefined settings, automatically choosing the most cost-effective node shape for your cloud provider. This eliminates the need to manually look up or specify the node shape, ensuring optimal resource use without overspending on nodes that are too small or too large.<br></div>  <h2 class="wsite-content-title"><font size="5">The Limitations of Bin-Packing-Only Approaches</font><br></h2>  <div class="paragraph" style="text-align:left;">While <strong>bin-packing</strong> is an essential technique for Kubernetes autoscaling, relying solely on this strategy introduces several limitations, particularly around node sizing. Let&rsquo;s break down two distinct issues that occur when using bin-packing in isolation:<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>1. The Risk of Overprovisioning Very Large Nodes</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">One major downside of bin-packing-only approaches is the potential to <strong>overprovision large nodes</strong>, which can lead to resource underutilization and unnecessary cost overhead. Here&rsquo;s how it can happen:<ul><li>When a cluster autoscaler scales up to handle an increased workload, it often provisions <strong>large nodes</strong> that accommodate multiple pods. This works well initially, but as pods terminate or complete their tasks, the large node can become <strong>underutilized</strong>, leaving unused CPU and memory capacity that you&rsquo;re still paying for.</li><li>For example, consider an autoscaler that provisions a relatively large node type, one large enough for hundreds of pods, such as an n2-standard-16 node on GKE. When bin-packing many small pods into such a node, the risk is that once some of those pods finish, you&rsquo;re left with a half-empty node. Although Kubernetes is efficient at packing new pods into the available space, you may still end up with <strong>idle resources</strong>&mdash;which translates to wasted cloud spend.</li><li>Worse still, if the workloads require different resource ratios (e.g., high memory, low CPU), the node may be constrained by one resource (like CPU) while leaving the other (like memory) underutilized.</li></ul> Moving pods from an oversized node to a smaller one can lead to significant disruption, as it requires evicting and rescheduling a large number of pods. This process can impact application performance and in rare cases lead to downtime. Ideally, you&rsquo;d want to avoid such disruptions by right-sizing nodes upfront, preventing the need for large-scale pod migrations.</div>  <h2 class="wsite-content-title"><font size="4"><strong>2. The Overhead of Many Small Nodes</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">On the flip side, <strong>bin-packing-only</strong> can also lead to the opposite problem: provisioning <strong>too many small nodes</strong>, which introduces its own set of challenges. This issue can occur when a small number of pods consistently enter a pending state over a period of time. A bin-packing-only autoscaler may react by provisioning additional small nodes to accommodate these pods. Over time, this behavior leads to an excessive number of small nodes, which introduces several issues:<ul><li><strong>Management Complexity</strong>: Operating a large number of small nodes, each hosting a few pods, can quickly lead to a complex management scenario.
Kubernetes DaemonSets, for example, need to run across every node in the cluster. As the number of nodes increases, so does the overhead associated with these system-level pods, consuming valuable resources that could otherwise be allocated to your workloads.</li><li><strong>IP Address Exhaustion</strong>: In environments with strict limits on IP addresses (such as VPCs or private cloud setups), provisioning many small nodes can lead to <strong>IP exhaustion</strong>. Each node requires its own IP address and in some cases must reserve a fixed block of IP addresses for the pods placed on it; as the node count grows, you might hit limits on available IPs, causing networking challenges.</li><li><strong>Higher Costs</strong>: Cloud providers often price nodes based on the instance type, and while it might seem cheaper to provision many small nodes, there are hidden costs associated with operating a large fleet. This includes costs for networking, persistent volumes, and the aforementioned overhead from system services.</li><li><strong>Resource Under-utilization and Workload Consolidation Challenges:</strong> In environments with spiky workloads and numerous small nodes, resource utilization can become inefficient. The Kubernetes scheduler distributes pods across available nodes, but fluctuating demand often leaves many small nodes underutilized. This makes it difficult to consolidate workloads effectively, leading to increased operational complexity and cost inefficiencies, as the cluster continues to maintain excess nodes during periods of low demand.</li><li><strong>Third-Party Tool Costs:</strong> Many monitoring and logging tools, such as Datadog and Prometheus, charge based on the number of nodes being monitored or where agents are deployed and running. With an increasing number of small nodes, these costs can rise significantly, as each node adds to the monitoring overhead, even if its resource usage remains minimal. This can lead to unexpectedly higher operational expenses.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">Introducing Bin-Selection: The Underrated Power Feature</font><br></h2>  <div class="paragraph" style="text-align:left;">While <strong>bin-packing</strong> is widely used, <strong>bin-selection</strong> remains an underappreciated capability, and it&rsquo;s here that <strong>Luna</strong> truly shines. Unlike bin-packing, which focuses on optimizing the number of pods per node, <strong>bin-selection</strong> targets specific pod requirements, ensuring that each pod is placed on the most suitable node based on its unique needs.<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>What Exactly is Bin-Selection?</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">In simple terms, <strong>bin-selection</strong> ensures that certain pods get their own dedicated nodes. This is crucial for workloads that have high resource demands or special requirements&mdash;such as GPU-bound tasks, memory-intensive applications, or workloads that need to avoid noisy neighbors. It&rsquo;s a 1:1 placement strategy that guarantees optimal performance by avoiding resource contention.<br /><br />Luna&rsquo;s <strong>bin-selection</strong> feature provides flexibility that most other autoscalers lack. While conventional autoscalers focus exclusively on packing as many pods into nodes as possible, Luna allows for a more targeted approach.
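&nbsp; For instance, a resource-heavy pod spec like the following sketch would typically receive a right-sized node of its own (the name, image, and request sizes are illustrative):<br /><pre><code># Illustrative GPU-bound pod: with bin-selection, Luna provisions a
# dedicated node matched to these requests instead of bin-packing it.
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker        # illustrative
  labels:
    elotl-luna: "true"    # Luna management label
spec:
  containers:
  - name: inference
    image: YOUR_REGISTRY/llm-serving:latest  # illustrative
    resources:
      requests:
        cpu: "12"
        memory: 48Gi
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
</code></pre>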
When certain pods exceed predefined thresholds for CPU, memory, or GPUs, bin-selection is triggered, and a dedicated node is provisioned to meet those specific resource requirements.<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>Why Bin-Selection is Crucial for Kubernetes Workloads</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">Relying solely on bin-packing can lead to several challenges, especially when dealing with large or specialized workloads. Some of the key issues, as highlighted above, include:<ol><li><strong>Overprovisioned Large Nodes</strong>: As mentioned earlier, when multiple pending pods are placed on a large node and then some pods terminate, you&rsquo;re left with an underutilized and expensive node.</li><li><strong>Resource Contention</strong>: Larger nodes can become bottlenecks, especially if multiple pods are competing for CPU, memory, or network resources.</li><li><strong>DaemonSet Overhead</strong>: Running many small nodes can create overhead from DaemonSets, which replicate pods across all nodes, wasting resources.</li></ol> <strong>Bin-selection</strong> solves these problems by ensuring that large, specialized pods aren&rsquo;t crammed into oversized or underutilized nodes. Instead, Luna provisions a dedicated node that matches the pod's exact resource requirements, avoiding the inefficiencies and risks associated with using one-size-fits-all nodes.<br></div>  <h2 class="wsite-content-title"><font size="5">Luna&rsquo;s Dual Mode: Harnessing Both Bin-Packing and Bin-Selection</font><br></h2>  <div class="paragraph" style="text-align:left;">What makes <strong>Luna</strong> unique is that it combines the best of both <strong>bin-packing</strong> and <strong>bin-selection</strong>. By supporting both strategies, Luna ensures that your workloads are managed with maximum efficiency and flexibility. Here&rsquo;s how it works:<ol><li><strong>Bin-Packing</strong>: When workloads fit within standard resource thresholds, Luna dynamically provisions the most cost-effective nodes based on the configured CPU and memory limits. This is ideal for handling typical workloads without overspending on unused capacity.</li><li><strong>Bin-Selection</strong>: For specialized workloads&mdash;such as GPU-bound tasks, or pods that exceed a specific CPU or memory threshold&mdash;Luna automatically switches to <strong>bin-selection</strong>, provisioning a dedicated node that perfectly matches the pod&rsquo;s needs. This ensures that high-demand pods get the resources they require without overloading other nodes or causing resource contention.</li></ol></div>  <h2 class="wsite-content-title"><font size="5">Conclusion: Optimizing Kubernetes Clusters with Luna</font><br></h2>  <div class="paragraph" style="text-align:left;">Kubernetes clusters are dynamic, and managing them effectively requires more than just basic autoscaling. With <strong>Luna</strong>, you get the best of both worlds: efficient resource utilization through <strong>bin-packing</strong>, and tailored node allocation through <strong>bin-selection</strong>. Whether you're dealing with standard workloads or high-demand applications, Luna ensures that your clusters are always optimized for performance and cost-effectiveness.<br /><br />By leveraging both bin-packing and bin-selection, Luna offers a smarter way to handle Kubernetes autoscaling, allowing you to scale your infrastructure with confidence. 
Embrace the future of Kubernetes node management with Luna, and ensure your workloads always have the resources they need&mdash;without breaking the bank.<br /><br />Discover the full potential of Luna's advanced features and capabilities by visiting our&nbsp;<a href="https://www.elotl.co/luna.html">Luna</a> product page. For hands-on instructions and detailed guidance, check out our <a href="https://docs.elotl.co/luna/intro/" target="_blank">documentation</a>. Ready to streamline your autoscaling? Start your <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> today and experience the unmatched efficiency and flexibility Luna brings to your cloud infrastructure.<br /><strong><br />Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /></div>]]></content:encoded></item><item><title><![CDATA[Luna Hot Node Mitigation: A Chill Pill to Cure Pod Performance Problems]]></title><link><![CDATA[https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems#comments]]></comments><pubDate>Wed, 21 Aug 2024 14:41:53 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems</guid><description><![CDATA[When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput.&nbsp; In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the  [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems.png?1724260079" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput.&nbsp; In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna&rsquo;s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the first and 40% in the second. 
We describe how the Luna smart cluster autoscaler with HNM addresses hot node performance issues by triggering the allocation and use of additional cluster resources.</div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="6">INTRODUCTION</font><br></h2><div class="paragraph" style="text-align:left;">A pod's CPU and memory resource requests express its minimum resource allocations.&nbsp; The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when the settings cannot be respected.&nbsp; Cloud cluster autoscalers look at these values on pending pods to determine the amount of resources to add to a cluster.<br><br>A pod configured with both CPU and memory requests, and with limits equal to those requests, is in QoS class <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed"><u>guaranteed</u></a>.&nbsp; A K8s cluster hosting any non-guaranteed pods runs the risk that some nodes in the cluster could become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can have performance problems.&nbsp; A bursting pod&rsquo;s attempts to use CPU above its CPU resource request can be throttled.&nbsp; And its attempts to use memory above its memory resource request can cause the pod to be killed.&nbsp; The K8s scheduler can worsen the situation, by continuing to schedule pods onto hot nodes.<br></div><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">The <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><u>Vertical Pod Autoscaler</u></a> (VPA) can recommend and optionally set a pod's CPU and memory resource requests and limits, based on <a href="https://github.com/kubernetes-sigs/metrics-server"><u>K8s metrics server</u></a> data, and hence can be used to avoid or address hot nodes.&nbsp; However, there are various trade-offs in using VPA, and by default VPA can reduce but does not eliminate hot node risks.&nbsp; Cloud cluster autoscalers obtain resources for pending pods and typically do not address the issue of hot nodes. 
With these concerns in mind, we introduced the Hot Node Mitigation (HNM) feature to the <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a>.&nbsp; With HNM enabled, Luna monitors its allocated nodes&rsquo; CPU and memory utilization using K8s metrics server data, and takes action to avoid or reduce high CPU or memory utilization.<br><br>In this blog, we describe the K8s hot node problem and discuss handling it via VPA and via Luna's HNM feature.&nbsp; We present two experiments showing how HNM reduces the impact of high utilization.&nbsp; The experiments involve ML workloads.&nbsp; Such workloads are challenging to handle since they are sensitive to the latency impact both of high utilization and of the cluster scaling operations intended to address high utilization.&nbsp; These experiments demonstrate that Luna HNM can be an effective chill pill to cure significant pod performance problems.<br></div><h2 class="wsite-content-title"><font size="6">HANDLING HIGH K8S NODE UTILIZATION</font><br></h2><div class="paragraph" style="text-align:left;">Pods that are not in the guaranteed QoS class introduce the risk that cluster nodes can become highly utilized.&nbsp; Determining how to set a pod&rsquo;s CPU and memory request and limit values so that it is in the guaranteed QoS class is challenging.&nbsp; The pod may be running a new workload, for which the resource needs have not yet been established.&nbsp; Or the pod's resource needs may evolve over time, as its use case changes.&nbsp; Or the pod's resource needs may have rare bursts, and configuring its resource requests to handle such peaks is inefficient in the normal case.<br></div><h2 class="wsite-content-title"><font size="5">Vertical Pod Autoscaler (VPA)</font><br></h2><div class="paragraph" style="text-align:left;">VPA can be used to recommend and optionally set a pod's CPU and memory resource requests and limits, based on the pod&rsquo;s metrics server data.&nbsp; By default, VPA-generated settings maintain the ratios between limits and requests that were specified in the initial container configuration.&nbsp; And if no limits were specified, VPA does not generate limits.&nbsp; Hence, by default, VPA reduces the likelihood of hot nodes when it makes pod request settings larger, but it does not increase the number of pods with guaranteed QoS or completely eliminate the risk of hot nodes.<br><br>There are various trade-offs in using VPA.&nbsp; When VPA is run in auto (default) or recreate mode, it can be disruptive, since it restarts pods if their VPA-recommended resource requests differ non-trivially, in either direction, from their current resource requests.&nbsp; And if VPA is run in initial or recommendation-only mode, it does not respond in real time to current conditions.&nbsp; Also, VPA is not tested in large clusters, according to its GitHub README, and users have reported scaling issues when VPA is handling large numbers of pods.&nbsp; Hence, while VPA can help mitigate high node utilization, it may introduce challenges such as unnecessary pod restarts, delayed responses to hot node events, or scalability issues in large-scale environments.<br></div><h2 class="wsite-content-title"><font size="5">Hot Node Mitigation (HNM)</font><br></h2><div class="paragraph" style="text-align:left;">Luna's HNM, by focusing on node hot spots when they occur, is intended to be responsive, disruptive only when appropriate, and scalable.&nbsp; In general, Luna allocates node resources for pods based on the pods' resource request settings.
For smaller pods, Luna allocates nodes on which multiple pods may be bin-packed.&nbsp; For larger pods or those with node configuration constraints, Luna allocates a node for each pod.&nbsp; If Luna-managed bin-packed pods have no resource request settings or if their request settings are lower than pod usage, Luna-allocated bin-packed nodes may become highly utilized, causing performance problems.<br><br>When Luna HNM is enabled (via the <em>manageHighUtilization.enabled</em> configuration option set to true), Luna uses K8s metrics server data to monitor the CPU and memory utilization of Luna-allocated bin-packed nodes, and takes action to avoid or reduce high CPU or memory utilization.&nbsp; CPU utilization is computed as usage over CPU capacity.&nbsp; Usage is the CPU core usage reported by metrics server, which averages it over the metrics server configured window period (e.g., 30s or more).&nbsp; Memory utilization is computed as the instantaneous working set memory over memory capacity.<br><br>The Luna HNM loop runs every <em>manageHighUtilization.loopPeriod</em>, and uses metrics server node and pod CPU and memory utilization data and configuration options to characterize busy nodes as yellow or red.&nbsp; Yellow nodes [CPU utilization &gt;= <em>manageHighUtilization.yellowCPU</em> (default 60) or memory utilization &gt;= <em>manageHighUtilization.yellowMemory</em> (default 65)] are considered warm.&nbsp; HNM taints warm nodes to prevent the K8s Scheduler from adding more pods onto them.&nbsp; This diminishes the likelihood of warm nodes transitioning to high CPU or memory utilization.&nbsp; Red nodes [CPU utilization &gt;= <em>manageHighUtilization.redCPU</em> (default 80) or memory utilization &gt;= <em>manageHighUtilization.redMemory</em> (default 85)] are considered hot.&nbsp; In addition to tainting them, HNM evicts the Luna-scheduled pod with the highest CPU or memory demand (based on pod metrics server data), subject to the same pod eviction restrictions applied for Luna node scale-down, which consider a number of factors, including respecting the do-not-evict annotation.&nbsp; This reduces high CPU or memory utilization.<br><br>Lightly-used nodes are considered green [CPU utilization &lt; <em>manageHighUtilization.greenCPU</em> (default 10) and memory utilization &lt; <em>manageHighUtilization.greenMemory</em> (default 15)].&nbsp; If green nodes have an HNM taint, it is removed, allowing nodes that are no longer warm or hot to again host additional pods.&nbsp; The large gap between the yellow and green thresholds is intended to avoid the node taint flapping on and off, with its associated pod placement churn.<br><br>Note that bin-packed pods which have no CPU and memory request settings (or whose CPU and memory request settings are inaccurate and very low) introduce the additional risk that the nodes they are running on appear to Luna to be under-utilized with respect to requests and hence candidates for scale-down.&nbsp; For this case, <em>scaleDown.binPackNodeUtilizationThreshold</em> can be set to 0.0, if desired, so Luna only scales down nodes running no Luna-managed pods.<br></div>
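<div class="paragraph" style="text-align:left;">Pulling these options together, the sketch below shows roughly how an HNM configuration might look as a YAML fragment of Luna&rsquo;s settings.&nbsp; The option names are the ones described above; the nesting shown and the <em>loopPeriod</em> value are illustrative, so check the Luna documentation for how these are set in your deployment:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Sketch of an HNM configuration using the options described above.
# Threshold values shown are the documented defaults.
manageHighUtilization:
  enabled: true      # turn on Hot Node Mitigation
  loopPeriod: 60s    # how often the HNM loop runs (illustrative value)
  yellowCPU: 60      # CPU utilization at or above 60% marks a node warm; HNM taints it
  yellowMemory: 65   # memory utilization at or above 65% marks a node warm
  redCPU: 80         # CPU utilization at or above 80% marks a node hot; HNM taints it and evicts a pod
  redMemory: 85      # memory utilization at or above 85% marks a node hot
  greenCPU: 10       # below 10% CPU (together with greenMemory), HNM removes its taint
  greenMemory: 15    # below 15% memory, combined with greenCPU above
scaleDown:
  binPackNodeUtilizationThreshold: 0.0   # only scale down nodes running no Luna-managed pods</code></pre></div></div></div></div>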
<h2 class="wsite-content-title"><font size="6">LUNA HOT NODE MITIGATION EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we present two experiments.&nbsp; One shows how Luna HNM can reduce the impact of high utilization via hot node pod eviction; the other shows how Luna HNM can avoid the impact of high utilization via warm node tainting.<br><br>For our experiments, we use the <a href="https://github.com/rakyll/hey"><u>hey</u></a> load generator to present queries to an online Machine Learning (ML) model that does text summarization.&nbsp; The ML serving workload runs on a <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> cluster with CPU Ray worker(s), deployed by <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html"><u>KubeRay</u></a> on a Luna-enabled <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>AKS</u></a> cluster.&nbsp; The AKS cluster has 2 static nodes of type Standard_DS2_v2 (2 CPUs, 7G), on which Luna and KubeRay are deployed.&nbsp; We chose to deploy KubeRay onto statically-allocated compute rather than having Luna deploy KubeRay onto dynamically-allocated compute, since KubeRay&rsquo;s role is infrastructure-related and its resource needs are low.&nbsp; It is configured with guaranteed QoS set at CPU requests=limits=100m and memory requests=limits=512Mi.<br><br>When hot node pod eviction is used to reduce node utilization, Luna may need to allocate an additional node to handle the evicted pod.&nbsp; For the online ML model serving use case, which is latency-sensitive, adding that node needs to happen as quickly as possible, since the node scale-up time is on the critical path of addressing the serving performance problem caused by evicting a server worker.&nbsp; We first indicate how we reduced node scale-up time and then present the two experiments.<br></div><h2 class="wsite-content-title"><font size="5">Reducing Node Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">Two key components of node scale-up time are node instance allocation time and image pull time.&nbsp; For the instance types in our experiments, we observed node instance allocation times of 1-2 minutes and pull times for the large <em>rayproject/ray-ml:2.9.0</em> image of &gt;5 minutes.<br><br>To hide the latency of node instance allocation, we used over-provisioning, as discussed <a href="https://aws.amazon.com/blogs/containers/eliminate-kubernetes-node-scaling-lag-with-pod-priority-and-over-provisioning/"><u>here</u></a> and <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler"><u>here</u></a>.&nbsp; We deployed a <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/overprovclass.yaml"><u>low-priority</u></a> single-pod <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/overprovdeploy.yaml"><u>deployment</u></a> configured to consume one bin-packing node, with the idea of keeping a single extra node available for bin-pack scale-up.&nbsp; The expense of this idle node was considered worthwhile for the example ML serving use case.<br><br>To hide the latency of pulling the large ray-ml image, we used <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/prepull.yaml"><u>this</u></a> daemonset to pre-pull the image into the cache on each K8s node.&nbsp; There are a number of general-purpose tools intended to address the image pull latency problem (e.g., <a href="https://github.com/senthilrch/kube-fledged"><u>kube-fledged</u></a>, <a href="https://github.com/dragonflyoss/Dragonfly2"><u>dragonfly</u></a>).&nbsp; We chose a custom daemonset for the simple purposes of our experiment.<br></div>
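<div class="paragraph" style="text-align:left;">The linked daemonset is the one we actually ran; for orientation, the general pattern looks roughly like the following sketch, in which an init container pulls the large image on every node and a tiny pause container then keeps the pod alive (names are placeholders):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Rough sketch of an image pre-pull DaemonSet; see prepull.yaml in
# elotl/skyray for the daemonset actually used in these experiments.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-ml-prepull
spec:
  selector:
    matchLabels:
      app: ray-ml-prepull
  template:
    metadata:
      labels:
        app: ray-ml-prepull
    spec:
      initContainers:
        - name: prepull
          image: rayproject/ray-ml:2.9.0    # the large image we want cached on every node
          command: ["true"]                 # exit immediately; the pull is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # tiny placeholder that keeps the pod alive</code></pre></div></div></div></div>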
class="wsite-content-title"><font size="5">HNM Hot Node Pod Eviction</font><br></h2><div class="paragraph">To show the impact of HNM hot node pod eviction, we compare load testing performance results on the <a href="https://github.com/ray-project/serve_config_examples/tree/master/text_summarizer"><u>RayService text summarizer</u></a> with 2 CPU Ray workers for 3 configurations:<br><br>1. <strong>Baseline</strong>: the 2 CPU workers are configured for guaranteed QoS and are placed by Luna on 2 separate bin-packing nodes.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/baseline.png?1724257778" alt="Picture" style="width:488;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">2. <strong>HNM-Disabled</strong>: the 2 CPU workers are configured for Burstable QoS (requests&lt;limits) and are placed by Luna on the same bin-packing node.&nbsp; HNM is not enabled to mitigate.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-disabled.png?1724257862" alt="Picture" style="width:520;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">3. <strong>HNM-Enabled</strong>: the 2 CPU workers are configured for Burstable QoS (requests&lt;limits) and are placed by Luna on the same bin-packing node.&nbsp; HNM is enabled with <em>redCPU</em> set to 70.&nbsp; HNM mitigates the node&rsquo;s high CPU utilization by evicting one of the Ray worker pods, which is restarted on another node.3.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-enabled.png?1724253246" alt="Picture" style="width:691;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">For all 3 configurations, the Ray head was annotated for placement on a bin-select node to simplify analysis of the bin-packing scenarios.&nbsp; We note that the Ray head uses guaranteed QoS and hence is not subject to performance impact from bursting.<br><br>Luna bin-packing node size is configured as 8 CPUs and 32Gi memory.&nbsp; The Standard_A8m_v2 instance type is used, since it is the least expensive node that satisfies this bin-pack node size.&nbsp; The Luna bin-select thresholds are set to 7 CPUs and 30G memory.&nbsp; The baseline RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.guar.yaml"><u>here</u></a> and the Burstable RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.besteffort.yaml"><u>here</u></a>.&nbsp; As you can see, the Baseline Ray workers have requests=limits of 4 CPUs and 16G memory and the Burstable Ray workers have requests of 3 CPUs and 12G memory, meaning that Baseline workers requests do not fit on the same bin-packing node and Burstable workers do.<br><br>With port-forwarding set in a separate terminal:<br></div><div><div id="417365045214930280" align="left" style="width: 100%; 
overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-shell" style="white-space: pre;">kubectl port-forward svc/text-summarizer-serve-svc 8000    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">and with the ML serving model input set as:<br></div><div><div id="955013191214540599" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">TEXT="It%20was%20the%20best%20of%20times,%20it%20was%20the%20worst%20of%20times,%20it%20was%20the%20age%20of%20wisdom,%20it%20was%20the%20age%20of%20foolishness,%20it%20was%20the%20epoch%20of%20belief"    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">the load test is run for 300 seconds using 10 threads and per-query time-out of 60 seconds as:<br></div><div><div id="460547993915323520" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 60 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The results of the experiment are given in Table 1.&nbsp; The <strong>HNM-Disabled</strong> row shows the substantial impact that CPU contention has on the average response time (40% worse) and number of responses generated (30% fewer) during the 300 seconds run relative to the baseline.&nbsp; The first <strong>HNM-Enabled</strong> row reflects that pod eviction and restart has a short-term negative impact relative to <strong>HNM-Disabled</strong>, since during the eviction/restart period, the full load is being handled by a single Ray worker.&nbsp; The second <strong>HNM-Enabled</strong> row shows that after that period, performance that matches the baseline is achieved.<br><br>Note that the performance impact of pod eviction/restart by HNM for high CPU utilization is worthwhile only if the load persists for a non-trivial period after the eviction/restart.&nbsp; The ROI of pod eviction is significantly improved if the memory is the highly utilized resource, since memory contention can lead to pod OOM termination.&nbsp; Hence, for memory contention, eviction and restart can be worthwhile for shorter load spike duration.<br></div><div><div id="532181176676017830" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 28%;"></th><th style="width: 18%;">Ave Response Time</th><th style="width: 18%;">Ave Response Time Ratio (smaller is better)</th><th style="width: 18%;">Num Responses</th><th style="width: 18%;">Num Responses Ratio (larger is better)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><b>Baseline</b></td><td>21.3s</td><td>1.0</td><td>145</td><td>1.0</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b></td><td>29.9s</td><td>1.4</td><td>104</td><td>0.7</td></tr><tr 
style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b> (first 300s load, includes eviction and restart)</td><td>33.2s</td><td>1.6</td><td>90</td><td>0.6</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b> (next 300s load, after restart)</td><td>20.5s</td><td>1.0</td><td>150</td><td>1.0</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:left;">Table 1: Impact of HNM Hot Node Pod Eviction on Text Summarizer Model serving load<br><br>Let&rsquo;s next consider an example where hot node performance problems can be avoided if warm nodes are tainted to inhibit additional pod placement on them.<br></div><h2 class="wsite-content-title"><font size="5">HNM Warm Node Tainting</font><br></h2><div class="paragraph" style="text-align:left;">To show the impact of HNM warm node pod tainting, we have Luna place a <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/stresscpu.yaml"><u>CPU stress test</u></a> pod on a bin-packing node, to act as a noisy neighbor for our experiment.&nbsp; This pod has Best-Effort QoS, specifying neither requests nor limits, which means its requests values are treated as 0. We set the Luna option scaleDown.binPackNodeUtilizationThreshold to 0.0 to have Luna scale-down only consider nodes not running any Luna-managed pods, as previously discussed.<br><br>We compare load testing performance results on the RayService text summarizer with 1 CPU Ray worker (not 2 CPU Ray workers as in the previous experiment) for 2 configurations:<br><br>1. <strong>HNM-Enabled</strong>: the CPU worker is configured for Burstable QoS (requests&lt;limits) and is not placed on the same node as the CPU stress test pod, because HNM has tainted that node due to its utilization exceeding <em>yellowCPU</em>.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-enabled-2.png?1724258059" alt="Picture" style="width:485;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">2. 
<strong>HNM-Disabled</strong>: the CPU worker is configured for Burstable QoS (requests&lt;limits) and is placed on the same node as the CPU stress test pod, since that node appears to have plenty of resources from the standpoint of requests values.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-disabled-2.png?1724258171" alt="Picture" style="width:470;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">For both configurations, the Ray head is placed on a bin-select node, as in the previous experiment.<br><br>Luna bin-packing node size is configured as 8 CPUs and 32Gi memory; the Standard_A8m_v2 instance type is used.&nbsp; The Luna bin-select thresholds are set to 7 CPUs and 30G memory.&nbsp; The Burstable RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.besteffort1.yaml"><u>here</u></a>, with requests set to 2 CPUs and 12G memory and limits set to 4 CPUs and 16G memory.<br><br>The load test run uses the same TEXT input and port-forwarding as the previous experiment.&nbsp; The <strong>HNM-Enabled</strong> load test is run for 300 seconds using 10 threads and per-query time-out of 60 seconds as:<br></div><div><div id="638264133674353849" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 60 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The <strong>HNM-Disabled</strong> configuration could not complete any queries with the per-query time-out set to 60.&nbsp; It was re-run using the per-query time-out of 120 seconds as:<br></div><div><div id="703247932833406428" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 120 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The results of the experiment are given in Table 2.&nbsp; For <strong>HNM-Enabled</strong>, the single Burstable ray CPU worker pod was not placed on the same node as the CPU stress test pod, since the node was tainted by HNM due to warm utilization.&nbsp; However, for <strong>HNM-Disabled</strong>, the single Burstable ray CPU worker was placed on the same node as the CPU stress pod and this noisy neighbor greatly impacted its performance. 
No successful responses were returned within the 60s timeout, and with the 120s timeout a significantly poorer average response time (70% higher) and number of responses (40% lower) were observed.<br></div><div><div id="708759901150483179" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 28%;"></th><th style="width: 18%;">Ave Response Time</th><th style="width: 18%;">Ave Response Time Ratio (smaller is better)</th><th style="width: 18%;">Num Responses</th><th style="width: 18%;">Num Responses Ratio (larger is better)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b>, 60s response timeout</td><td>38.8s</td><td>1.0</td><td>80</td><td>1.0</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b>, 60s response timeout</td><td>N/A</td><td>N/A</td><td>N/A</td><td>N/A</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b>, 120s response timeout</td><td>67.2s</td><td>1.7</td><td>50</td><td>0.6</td></tr></tbody></table></div></div><div class="paragraph">Table 2: Impact of HNM Warm Node Tainting on Text Summarizer Model serving load</div><h2 class="wsite-content-title"><font size="6">POSSIBLE FUTURE WORK</font><br></h2><div class="paragraph" style="text-align:left;">While we&rsquo;ve presented experiments where the current HNM feature worked well, we note two limitations of the current feature.<ul><li>It is reactive, i.e., it does not take action until/unless node utilization is at or beyond the configured trigger points.&nbsp; This limitation helps with scaling and is fine if pod churn is low in the steady state and a good placement of evictable pods onto nodes is reached quickly relative to pod lifetimes.&nbsp; However, it may add non-trivial overhead if pod lifetimes are short and the pattern repeats.</li><li>It does not handle hot bin-packed nodes containing a single pod or hot bin-select nodes.&nbsp; This limitation assumes that a single relatively-large, poorly performing pod would warrant a manual resource requests update.&nbsp; It would be good to learn if this assumption holds.</li></ul>Both of these limitations can be addressed by evicting troublesome pods and increasing their CPU and memory request settings upon restart, possibly based on historical observations or on their configured limits (if any).&nbsp; We note that VPA can already be configured to set a pod's initial CPU and memory requests based on its previous metrics history, but that VPA's full operation can be disruptive or slow to react and can present scaling issues, as previously discussed.&nbsp; We can explore Luna optionally extending its high-utilization pod evictions to bin-select and single bin-packed pods and updating the requests of non-guaranteed pods it evicts upon their restart.<br></div><h2 class="wsite-content-title"><font size="6">CONCLUSION</font><br></h2><div class="paragraph" style="text-align:left;">We used Ray to run two ML online serving workloads. In both cases, Luna Hot Node Mitigation allowed us to significantly reduce the latency (by 40% and 70%) and increase the throughput (by 30% and 40%) relative to runs on hot nodes.<br><br>Take a look at your clusters; do you have non-guaranteed QoS pods and hot nodes? This could be slowing your workloads down. 
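One quick way to look, assuming the K8s metrics server is installed (the jsonpath filter below is just one way to slice it):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;"># List pods that are not in the Guaranteed QoS class:
kubectl get pods -A -o jsonpath='{range .items[?(@.status.qosClass!="Guaranteed")]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.qosClass}{"\n"}{end}'

# Eyeball node utilization for hot spots (requires metrics server):
kubectl top nodes</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">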
Please feel free to download our <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> version and/or to <a href="mailto:info@elotl.co"><u>reach out</u></a> with any questions or comments.<br><br>We&rsquo;re dedicated to continually enhancing Luna and the Hot Node Mitigation feature.&nbsp; And to do so effectively, we need to hear from you!&nbsp; We welcome your feedback on how our current HNM solution works for you and whether our proposed improvements would be helpful in your setup. Please share your experiences and insights so we can tailor our solution to your needs.<br><br>Thanks for taking the time to read the blog and have a great day!<br><br></div><div class="paragraph"><strong>Author:</strong><br><span></span>Anne Holler (Chief Scientist, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Right Place, Right Size: Using an Autoscaler-Aware Multi-Cluster Kubernetes Fleet Manager for ML/AI Workloads]]></title><link><![CDATA[https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads]]></link><comments><![CDATA[https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads#comments]]></comments><pubDate>Thu, 11 Jul 2024 18:58:11 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads</guid><description><![CDATA[IntroductionAre you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity.&nbsp; Nova is autoscaler-aware, detecting if workloa [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2><span class="imgPusher" style="float:right;height:0px"></span><span style="display: table;width:209px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads.png?1720724660" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? 
A smart K8s fleet manager like the <a href="https://www.elotl.co/nova.html"><u>Elotl Nova policy-driven multi-cluster orchestrator</u></a> simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity.&nbsp; Nova is autoscaler-aware, detecting if workload clusters are running either the <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler"><u>K8s cluster autoscaler</u></a> or the <a href="https://www.elotl.co/luna.html"><u>Elotl Luna intelligent cluster autoscaler</u></a>.<br><br>In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up you can:<ol><li>Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.</li><li>Reduce your bill by directing experimental jobs to sunk-cost clusters.</li><li>Reduce your costs via policies that select GPUs with the desired price/performance.</li></ol></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">For clusters running in the cloud with a cluster autoscaler, the available cluster capacity is dynamic.&nbsp; Nova can schedule a workload on a cluster with dynamic capacity that satisfies the workload's placement policy, even if that target cluster does not currently have sufficient resources for the workload, since the autoscaler can provision the needed resources.&nbsp; When multiple clusters satisfy the workload's placement policy, Nova preferentially selects a target cluster with existing available cluster resources and otherwise selects an alternative target cluster running a cluster autoscaler.<br><br>Nova workloads placed using an available-capacity policy are <a href="https://www.sigarch.org/the-different-facets-of-large-scale-gpu-cluster-scheduling-for-ml-jobs/"><u>gang-scheduled</u></a>. This means that no single job within a workload will start running until all jobs in that workload can be executed simultaneously. 
Gang scheduling is crucial for ML/AI training jobs, as it ensures all components of a distributed training task begin processing in sync, maximizing efficiency and preventing data inconsistencies.<br><br>Additionally, Nova automatically adds Luna's default pod placement label to the workloads it schedules, which allows the workloads to be handled seamlessly on either Luna or non-Luna clusters.<br></div><h2 class="wsite-content-title"><font size="5">Applying Nova+Luna to Some Common ML/AI GPU Resource Management Scenarios</font><br></h2><div class="paragraph" style="text-align:left;">We consider the following common GPU resource management scenarios:<br><ul><li>Training production ML/AI models on GPUs</li><li>Training experimental ML/AI models on GPUs</li><li>Serving production vs test/dev ML/AI models on GPUs</li></ul>with respect to Nova management of two kinds of workload clusters:<ul><li>Clusters with statically-allocated resources, comprising on-premise or reserved cloud resources, with no cluster autoscaler running.</li><li>Clusters with dynamically-allocated resources, comprising on-demand cloud resources, running the Luna cluster autoscaler.</li></ul></div><h2 class="wsite-content-title"><font size="5">Scenario: Training Production ML/AI Models on GPUs</font><br></h2><h2 class="wsite-content-title"><font size="4">Overview</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of training production ML/AI models on GPUs, the desired behavior is "fill and spill".&nbsp; The workloads should be gang-scheduled on a statically-allocated cluster if they fit or on a dynamically-allocated cluster if they don't.&nbsp; The workloads' high value warrants the cost of on-demand cloud resources, if needed, and the latency to obtain those resources dynamically is not an issue for the training job use case.<br><br>For the Nova example setup, we configure cluster <em>static-cluster</em> with a set of statically-allocated GPU instances and cluster <em>dynamic-cluster</em> with Luna configured to allocate similar cloud GPU instances.&nbsp; Both clusters satisfy the Nova available-capacity placement policy. 
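As a rough sketch (the field names below are illustrative rather than authoritative; the actual policy we used is linked in the example runs that follow), a Nova SchedulePolicy of this flavor has roughly the following shape:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Schematic Nova SchedulePolicy -- the spec stanzas below are illustrative;
# see rayjobcapacitypolicy.yaml in elotl/skyray for the policy actually used.
apiVersion: policy.elotl.co/v1alpha1    # assumed API version
kind: SchedulePolicy
metadata:
  name: rayjob-capacity-policy
spec:
  resourceSelectors:        # which objects this policy places (illustrative labels)
    labelSelectors:
      - matchLabels:
          app: rayjob-train
  clusterSelector:          # candidate clusters; Nova prefers one with free capacity
    matchExpressions:       # and otherwise picks an autoscaled cluster
      - key: nova.elotl.co/cluster-name   # assumed label key
        operator: In
        values:
          - static-cluster
          - dynamic-cluster</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">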
Nova places training workloads on <em>static-cluster</em> first since the resources are immediately available.&nbsp; When a training workload arrives that does not fit on <em>static-cluster</em>, Nova places it on <em>dynamic-cluster</em> and Luna adds resources to accommodate the pending workload.<br></div><h2 class="wsite-content-title"><font size="4">Example Setup</font><br></h2><div class="paragraph" style="text-align:left;">The scripts and K8s yaml input used in the example are available at <a href="https://github.com/elotl/skyray"><u>elotl/skyray</u></a> on GitHub.&nbsp; The commands that follow expect a clone of that repo at the SKYRAY_PATH environment variable.<br><br>The example is run on EKS cloud K8s clusters.&nbsp; The Nova control plane, installed on an EKS cluster comprising 2 CPU nodes, manages the <em>static-cluster</em> and <em>dynamic-cluster</em> workload EKS clusters, initially populated as shown below.&nbsp; The Luna cluster autoscaler is installed on <em>dynamic-cluster</em>, to scale the cluster to match workload resource requests.&nbsp; <a href="https://docs.elotl.co/luna/intro/"><u>Luna</u></a> is <a href="https://github.com/loftyoutcome/k8s-rag-llm/blob/main/demo/llm.gpu.service/block_device_mapping.json"><u>configured</u></a> to allocate large EBS volumes, to handle the large instance types and storage needs of the example.&nbsp; Also, Luna bin-packing is disabled, since the example does not contain sets of small pods that benefit from scheduling on the same node.<br><br></div><div><div id="379979570468224674" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                            STATUS   ROLES    AGE     VERSION              INSTANCE-TYPE
ip-192-168-100-111.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-105-241.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-149-118.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-181-48.us-west-2.compute.internal    Ready    &lt;none&gt;   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal     Ready    &lt;none&gt;   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-28.us-west-2.compute.internal     Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-78-25.us-west-2.compute.internal     Ready    &lt;none&gt;   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal      Ready    &lt;none&gt;   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge</code></pre></div></div></div></div><div class="paragraph">&nbsp;<br></div><div><div id="460800841704921038" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                          STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-94-42.us-west-2.compute.internal   Ready    &lt;none&gt;   56d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>KubeRay and its CRDs are deployed to the Nova control plane, along with a spread-duplicate policy for their placement.&nbsp; Nova places a copy of KubeRay and its CRDs on each workload cluster, meaning KubeRay is available on each cluster to handle any RayJobs, RayClusters, and RayServices placed by Nova on that cluster.<br><br></div><div><div id="511549106245922092" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f ${SKYRAY_PATH}/policies/krpolicy.yaml
kubectl apply -f ${SKYRAY_PATH}/policies/crdpolicy.yaml
${SKYRAY_PATH}/deploy-scripts/deploy-kuberay-operator.sh</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>After the KubeRay spread-duplicate placement, the Nova control plane output shown below reflects that there are 2 copies of the kuberay-operator, one on each workload cluster.<br><br></div><div><div id="554347554886795827" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces
NAMESPACE   NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
default     service/kuberay-operator   ClusterIP   10.96.241.6   &lt;none&gt;        8080/TCP   91s
default     service/kubernetes         ClusterIP   10.96.0.1     &lt;none&gt;        443/TCP    6m50s
NAMESPACE   NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
default     deployment.apps/kuberay-operator   2/1     2            2           91s</code></pre></div></div></div></div><div class="paragraph"><br>And Luna has started an additional node in <em>dynamic-cluster</em> to host KubeRay, as shown below.&nbsp; The KubeRay operator has modest resource requests (100m CPU, 512Mi memory) that can be handled by the inexpensive t3a.small instance type (2 CPUs, 2Gi memory).<br><br></div><div><div id="410059897450333503" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-182-75.us-west-2.compute.internal   Ready    &lt;none&gt;   55s   v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-94-42.us-west-2.compute.internal    Ready    &lt;none&gt;   56d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Runs</font></h2><div class="paragraph" style="text-align:left;">As a proxy for a production training workload, we use the Pytorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, adapted from the example <a href="https://docs.ray.io/en/master/cluster/kubernetes/examples/gpu-training-example.html"><u>here</u></a>.&nbsp; The RayJob's RayCluster is configured with a CPU head and 2 single-GPU workers.&nbsp; The configuration of the RayJob with its associated RayCluster is available <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.train.yaml"><u>here</u></a>.
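&nbsp; Abbreviated below is the shape of the worker group in that configuration; only the replica count and the per-worker GPU request matter for Nova&rsquo;s placement math, and the remaining details (group name, image, and so on) are elided or illustrative:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Abbreviated worker-group stanza of the RayJob; see ray-job.train.yaml
# in elotl/skyray for the full manifest (names here are illustrative).
rayClusterSpec:
  workerGroupSpecs:
    - groupName: gpu-group
      replicas: 2                     # two workers, gang-scheduled together
      template:
        spec:
          containers:
            - name: ray-worker
              resources:
                requests:
                  nvidia.com/gpu: 1   # one GPU per worker; Nova sums these requests
                limits:
                  nvidia.com/gpu: 1</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">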
A first copy of the RayJob is deployed to the Nova control plane in the rayjob1 namespace.&nbsp; Its placement uses a Nova <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobcapacitypolicy.yaml"><u>available-capacity policy</u></a>.&nbsp; Nova has native support for the RayCluster, RayJob, and RayService CRDs, and recognizes the resource requests in the podSpecs they contain.&nbsp; Hence, Nova is able to determine the computing resources needed for the pods comprising the RayJob.&nbsp; It chooses to place the RayJob and its RayCluster on <em>static-cluster</em>, since it has sufficient available capacity.<br><br></div><div><div id="906620940351870406" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE1=rayjob1
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE1} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob1 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob1 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created
export TARG_CLUSTER1=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE1} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER1}
static-cluster</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>Another copy of the RayJob is deployed to the Nova control plane in the rayjob2 namespace.&nbsp; Its placement again uses an available-capacity policy, and Nova again chooses to place the RayJob and its RayCluster on <em>static-cluster</em>, since it has sufficient available capacity for a second copy of the training job.<br><br></div><div><div id="594803022341381604" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE2=rayjob2
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE2} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob2 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob2 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created
export TARG_CLUSTER2=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE2} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER2}
static-cluster</code></pre></div></div></div></div>
style="text-align:left;"><br>A third copy of the RayJob is deployed to the Nova control plane in the rayjob3 namespace.&nbsp; Its placement again uses an available-capacity policy.&nbsp; This time Nova places the RayJob and its RayCluster on <em>dynamic-cluster</em>. Nova sees that <em>static-cluster</em> has insufficient remaining capacity for a third copy of the job and detects the Luna cluster autoscaler running on <em>dynamic-cluster</em>, which can obtain the needed resources.<br><br></div><div><div id="765443855930263967" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE3=rayjob3${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE3} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob3 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob3 createdrayjob.ray.io/rayjob-train created                            configmap/ray-job-code-train created                          export TARG_CLUSTER3=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE3} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER3}  dynamic-cluster    </code></pre></div></div></div></div><div class="paragraph"><br>All 3 copies of the RayJob can be seen from the Nova control plane:<br><br></div><div><div id="968822638325173691" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">$ kubectl get all --all-namespaces. . 
.NAMESPACE NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGErayjob1   rayjob.ray.io/rayjob-train               Running             2024-07-01T22:13:02Z              9m11srayjob2   rayjob.ray.io/rayjob-train   RUNNING     Running             2024-07-01T22:12:07Z              4m55srayjob3   rayjob.ray.io/rayjob-train               Initializing        2024-07-01T22:16:28Z              34s    </code></pre></div></div></div></div><div class="paragraph"><br>And Luna scales up dynamic cluster accordingly:<br><br></div><div><div id="459694497837799206" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-typeNAME                                            STATUS   ROLES    AGE     VERSION               INSTANCE-TYPEip-192-168-161-254.us-west-2.compute.internal   Ready       4m47s   v1.29.3-eks-ae9a62a   t3a.2xlargeip-192-168-182-75.us-west-2.compute.internal    Ready       55m     v1.29.3-eks-ae9a62a   t3a.smallip-192-168-61-229.us-west-2.compute.internal    Ready       4m24s   v1.29.3-eks-ae9a62a   g4dn.2xlargeip-192-168-63-192.us-west-2.compute.internal    Ready       4m27s   v1.29.3-eks-ae9a62a   g4dn.2xlargeip-192-168-94-42.us-west-2.compute.internal     Ready       56d     v1.29.3-eks-ae9a62a   m5.large    </code></pre></div></div></div></div><div class="paragraph"><br>With all 3 jobs eventually running to completion<br><br></div><div><div id="843386304703614180" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces. . .NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME          END TIME               AGErayjob1   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:13:02Z 2024-07-01T22:26:30Z   22mrayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:12:07Z 2024-07-01T22:19:49Z   18mrayjob3   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:16:28Z 2024-07-01T22:30:27Z   14m    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph" style="text-align:left;">This example demonstrated how Nova, working with Luna, makes handling gang-scheduling and "fill and spill" for a multi-worker ML/AI KubeRay/RayJob training job easy via a simple <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobcapacitypolicy.yaml" title=""><u>available-capacity policy-based</u></a> approach. 
Nova and Luna can reduce the latency of your ML/AI workloads by scheduling them on available compute resources in a matter of seconds.<br></div><h2 class="wsite-content-title"><font size="5">Scenario: Training Experimental ML/AI Models on GPUs</font><br></h2><h2 class="wsite-content-title"><font size="4">Overview</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of training experimental ML/AI models on GPUs, the desired behavior is "fill, no spill".&nbsp; The workloads should be scheduled on a statically-allocated on-premise or reserved cluster set up for speculative training jobs, consisting of sunk-cost GPU instances.&nbsp; These training workloads have not yet proven to be high-value enough to warrant paying for any on-demand cloud resources.<br><br>For the Nova example setup, we configure cluster <em>static-cluster</em> with a set of statically-allocated GPU instances, which are intended to represent sunk-cost resources.&nbsp; The Nova <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobstaticpolicy.yaml"><u>cluster-specific placement policy</u></a> is set to match only that cluster.&nbsp; Nova places all experimental training workloads on that cluster; any that cannot yet run remain pending there.<br></div>
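<div class="paragraph" style="text-align:left;">To give a feel for the policy's shape without opening the linked file, a minimal sketch of a cluster-specific placement policy appears below.&nbsp; It is illustrative only: the apiVersion and field names are assumptions patterned on the linked rayjobstaticpolicy.yaml, which remains the authoritative reference.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative sketch only: group/version and field names are assumptions
# patterned on the linked rayjobstaticpolicy.yaml, which is authoritative.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rayjob-static-policy
spec:
  # Pin matching workloads to the sunk-cost cluster by name.
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: static-cluster</code></pre></div></div>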
href="https://github.com/elotl/skyray/blob/main/policies/rayjobstaticpolicy.yaml"><u>specified-cluster policy</u></a>.<br><br></div><div><div id="232653626449816306" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE=rayjob2${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob2 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-static-policy unchangedrayjob.ray.io/rayjob-train createdconfigmap/ray-job-code-train createdexport TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER}static-cluster    </code></pre></div></div></div></div><div class="paragraph"><br>And a third copy of the RayJob is deployed, in the rayjob3 namespace, to the Nova control plane.&nbsp; Its placement again uses the same specified-cluster policy and is placed to <em>static-cluster</em> by Nova.<br><br></div><div><div id="176115082157906063" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE=rayjob3${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob3 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-static-policy unchangedrayjob.ray.io/rayjob-train createdconfigmap/ray-job-code-train createdexport TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER}static-cluster    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>In this case, static-cluster does not have sufficient remaining resources to run the third copy of RayJob.&nbsp; Its unschedulable pods remain pending until capacity is freed up by the removal of previous job(s).<br><br></div><div><div id="362051010895578752" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces. . 
NAMESPACE NAME                       JOB STATUS DEPL STATUS  START TIME             END TIME               AGE
rayjob1   rayjob.ray.io/rayjob-train SUCCEEDED  Complete     2024-07-02T13:49:21Z   2024-07-02T13:56:49Z   8m5s
rayjob2   rayjob.ray.io/rayjob-train RUNNING    Running      2024-07-02T13:53:16Z                          4m10s
rayjob3   rayjob.ray.io/rayjob-train            Initializing 2024-07-02T13:54:47Z                          2m39s
&hellip;
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   12m
rayjob3   rayjob.ray.io/rayjob-train RUNNING    Running     2024-07-02T13:54:47Z                          11m
&hellip;
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   14m
rayjob3   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:54:47Z   2024-07-02T14:07:53Z   13m</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph">This example shows how Nova makes handling "fill, no spill" easy via a simple policy-based approach. This simplifies the operation of the cluster and saves money by keeping the workload on the sunk-cost GPUs.</div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Scenario: Serving Production vs Test/Dev ML/AI Models on GPUs</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of serving production vs test/dev ML/AI models on GPUs, the desired behavior is "select the right cluster".&nbsp; The online production serving workloads should be placed on the statically-allocated cluster that is configured to satisfy the performance SLA for the maximum supported production load.&nbsp; Online serving workloads have low latency requirements, since they are typically on the critical path of some time-sensitive business application (e.g., predicting a ride-sharing ETA).&nbsp; Hence, dynamic allocation of these resources is not desirable.&nbsp; [And in practice, an additional statically-allocated geo-distinct production cluster would be used to increase availability.]&nbsp; The test/dev serving workloads are placed on the dynamically-allocated cluster configured for lower cost and performance.&nbsp; Providing low latency access for test/dev serving workloads is not a requirement.<br><br>For the Nova example setup, cluster <em>static-cluster</em> is configured with a statically-allocated, more powerful GPU instance, and cluster <em>dynamic-cluster</em> will allocate a less powerful (and cheaper) GPU instance as needed.&nbsp; We add the label <em>production</em> to the <em>static-cluster</em> Nova cluster and the label <em>development</em> to the <em>dynamic-cluster</em> Nova cluster.&nbsp; We note that use of these cluster labels adds a layer of indirection that facilitates adding additional clusters to a category, e.g., adding another production cluster in a different region.&nbsp; We use a Nova cluster selection policy that matches the cluster label appropriate to the workload class.<br><br></div>
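<div class="paragraph" style="text-align:left;">As with the previous scenario, a rough sketch conveys the shape of such a label-matching policy.&nbsp; The apiVersion, field names, and label key below are assumptions patterned on the linked production and development policies, which are the authoritative references.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative sketch of a label-matching placement policy for the
# production serving class.  apiVersion, field names, and the label key
# are assumptions; see the linked rayserviceproductionpolicy.yaml for
# the actual policy.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rayservice-production-policy
spec:
  # Select any Nova cluster carrying the production label.
  clusterSelector:
    matchLabels:
      class: production</code></pre></div></div>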
<h2 class="wsite-content-title"><font size="4">Example Setup</font><br></h2><div class="paragraph" style="text-align:left;">The initial setup for this example is the same as that used for the previous 2 examples, except with respect to the GPU instances in <em>static-cluster</em>.&nbsp; Previously, <em>static-cluster</em> had 4 g4dn.2xlarge instances, each with an NVIDIA T4 GPU.&nbsp; For this example, <em>static-cluster</em> has a single g5.xlarge instance, which has a higher-performing NVIDIA A10G GPU.<br></div><div><div id="444797055730232303" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-181-48.us-west-2.compute.internal   Ready    &lt;none&gt;   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal    Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-62.us-west-2.compute.internal    Ready    &lt;none&gt;   95m   v1.29.3-eks-ae9a62a   g5.xlarge
ip-192-168-78-25.us-west-2.compute.internal    Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal     Ready    &lt;none&gt;   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge</code></pre></div></div></div></div><h2 class="wsite-content-title" style="text-align:left;"><font size="4">Example Runs</font><br></h2><div class="paragraph" style="text-align:left;">As a proxy for a production serving workload, we use the text summarizer model service, run as a RayService deployed on a Kubernetes cluster using KubeRay, adapted from the example <a href="https://docs.ray.io/en/master/cluster/kubernetes/examples/text-summarizer-rayservice.html"><u>here</u></a>. 
The RayService's RayCluster is configured with a CPU head and 1 single-GPU worker.&nbsp; The configuration of the RayService with its associated RayCluster is available <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-service.text-summarizer.yaml"><u>here</u></a>.<br><br>The production namespace is spread-scheduled to all clusters.&nbsp; The RayService is deployed to the Nova control plane in the production namespace.&nbsp; Based on <a href="https://github.com/elotl/skyray/blob/main/policies/rayserviceproductionpolicy.yaml"><u>this</u></a> Nova label-matching policy, it is placed on <em>static-cluster</em>.<br><br></div><div><div id="395425363783073954" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">$ kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=production
rayservice.ray.io/text-summarizer created
kubectl --context=static-cluster get all -n production
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-ntcfh-head-tmnqr               1/1     Running   0          68m
pod/text-summarizer-raycluster-ntcfh-worker-gpu-group-wft6f   1/1     Running   0          68m
NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.6.157     &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   60m
service/text-summarizer-raycluster-ntcfh-head-svc   ClusterIP   10.100.197.135   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   68m
service/text-summarizer-serve-svc                   ClusterIP   10.100.205.162   &lt;none&gt;        8000/TCP                                        60m
NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-ntcfh   1                 1                   5      20G      1      ready    68m
NAME                                AGE
rayservice.ray.io/text-summarizer   68m</code></pre></div></div></div></div><div class="paragraph"><br>We validate its operation as follows:<br><br></div><div><div id="724830960111303087" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster port-forward svc/text-summarizer-serve-svc 8000 -n production
Forwarding from 127.0.0.1:8000 -&gt; 8000
Forwarding from [::1]:8000 -&gt; 8000
Handling connection for 8000
python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.</code></pre></div></div></div></div>
<div class="paragraph" style="text-align:left;"><br>Next, the development namespace is spread-scheduled to all clusters.&nbsp; We deploy the same RayService to the development namespace.&nbsp; Based on <a href="https://github.com/elotl/skyray/blob/main/policies/rayservicedevelopmentpolicy.yaml"><u>this</u></a> Nova label-matching policy, it is placed on <em>dynamic-cluster</em>.<br><br></div><div><div id="988762577706441243" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=development
rayservice.ray.io/text-summarizer created
kubectl --context=dynamic-cluster get all -n development
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-2xnts-head-68bvm               1/1     Running   0          47m
pod/text-summarizer-raycluster-2xnts-worker-gpu-group-s8pbn   1/1     Running   0          47m
NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.45.127   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   37m
service/text-summarizer-raycluster-2xnts-head-svc   ClusterIP   10.100.46.227   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   47m
service/text-summarizer-serve-svc                   ClusterIP   10.100.209.7    &lt;none&gt;        8000/TCP                                        37m
NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-2xnts   1                 1                   5      20G      1      ready    47m
NAME                                AGE
rayservice.ray.io/text-summarizer   47m</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>In this case, Luna allocates a g4dn.xlarge, which includes an NVIDIA T4 GPU, rather than the g5.xlarge, which includes an NVIDIA A10G GPU.&nbsp; The us-east per-hour on-demand price for the g4dn.xlarge is lower than the 1-year reserved price for the g5.xlarge, so the g4dn.xlarge is a good choice for the development workload, which does not warrant the more powerful GPU.<br><br></div><div><div id="664446715612617069" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                            STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-164-97.us-west-2.compute.internal    Ready    &lt;none&gt;   8d    v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-171-101.us-west-2.compute.internal   Ready    &lt;none&gt;   48m   v1.29.3-eks-ae9a62a   t3a.xlarge
ip-192-168-49-24.us-west-2.compute.internal     Ready    &lt;none&gt;   48m   v1.29.3-eks-ae9a62a   g4dn.xlarge
ip-192-168-94-42.us-west-2.compute.internal     Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div>
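<div class="paragraph" style="text-align:left;">As in the training examples, the target-cluster label can be queried to confirm each placement.&nbsp; Assuming Nova applies the same <em>nova.elotl.co/target-cluster</em> label to RayService objects as it does to RayJobs (an assumption; the earlier examples only show it on RayJobs), a check might look like this:<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative placement check; assumes the target-cluster label is also set on RayServices
kubectl get rayservice.ray.io/text-summarizer -n production -L nova.elotl.co/target-cluster
kubectl get rayservice.ray.io/text-summarizer -n development -L nova.elotl.co/target-cluster</code></pre></div></div>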
<div class="paragraph"><br>Again, we validate its operation as follows:<br><br></div><div><div id="447674408509903303" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster port-forward svc/text-summarizer-serve-svc 8000 -n development
Forwarding from 127.0.0.1:8000 -&gt; 8000
Forwarding from [::1]:8000 -&gt; 8000
Handling connection for 8000
python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph" style="text-align:left;">This example shows how Nova makes handling "select the right cluster" for classes of workloads easy via a simple policy-based approach. By using a Nova policy to select the performance/price ratio that matches each workload, Nova and Luna can reduce your cloud GPU bill while meeting your workloads' requirements.<br></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph">We've shown how the Nova multi-cluster fleet manager, using its cloud autoscaler-aware feature with Luna, can achieve desired "right place, right size" outcomes for three common ML/AI GPU resource management scenarios: "fill and spill" for GPU production ML/AI model training, "fill, no spill" for GPU experimental ML/AI model training, and "select the right cluster" for GPU production vs test/dev ML/AI model serving.<br><br>Nova and Luna can:<ol><li>Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.</li><li>Reduce your bill by directing experimental jobs to sunk-cost clusters.</li><li>Reduce your costs via policies that select GPUs with the desired price/performance.</li></ol><br>And we note that Nova supports a variety of scheduling policies and has been applied to diverse domains, including managing LLM+RAG deployments, multi-cloud disaster recovery, cloud-agnostic gitops, and K8s cluster upgrades.<br><br>If you'd like to try <a href="https://www.elotl.co/nova.html">Nova</a> and <a href="https://www.elotl.co/luna.html">Luna</a> for your workloads, please download our free trial versions: <a href="https://www.elotl.co/nova-free-trial.html">Nova</a>, <a href="https://www.elotl.co/luna-free-trial.html">Luna</a>.<br><br></div><div class="paragraph"><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler#comments]]></comments><pubDate>Tue, 25 Jun 2024 18:00:16 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[GPU 
Time-slicing]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><guid isPermaLink="false">https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler</guid><description><![CDATA[IntroductionKubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default.&nbsp; With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use.&nbsp; For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.Using NVIDIA GPU ti [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="6">Introduction</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:246px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/gpu-with-blue-and-orange.png?1719345106" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default.&nbsp; With <a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html">NVIDIA GPU time-slicing</a>, GPUs can be shared among K8s workloads by interleaving their GPU use.&nbsp; For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.<br><br>Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration <strong>can significantly reduce costs</strong>. 
A time-slice aware &ldquo;smart&rdquo; CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits.&nbsp; Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.<br><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">In this blog, we describe how to use cluster NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters with Luna, a smart CA that supports GPU time-slicing.&nbsp; We provide examples demonstrating the advantages of using Luna with NVIDIA GPU time-slicing.<br></div><h2 class="wsite-content-title"><font size="6">Configuring NVIDIA GPU Time-slicing on Cloud K8s</font><br></h2><div class="paragraph" style="text-align:left;"><a href="https://www.elotl.co/luna.html"><u>Luna</u></a> is a smart CA that provides the option <em>nvidiaGPUTimeSlices</em> to indicate the NVIDIA GPU slices value used by GPUs in the K8s cluster.&nbsp; When the option is set to N greater than 1, Luna treats the GPUs in cloud instances as being N copies of themselves with respect to resource allocation and scheduling.&nbsp; Luna supports AKS, EKS, OKE, and GKE cloud K8s clusters.<br><br>On AKS, EKS, and OKE, NVIDIA GPU time-slicing is configured so that it is transparent to the cluster control plane and to GPU workloads running on the cluster.&nbsp; Appendix A describes how NVIDIA GPU time-slicing can be enabled for all GPUs in the cluster via helm deployment of the <a href="https://github.com/NVIDIA/k8s-device-plugin"><u>nvidia-device-plugin</u></a>, with an associated configmap specifying the number of slices.&nbsp; GPU workloads specify their desired GPU count as usual via the <em>nvidia.com/gpu</em> resource limit and are allocated GPU slices for each GPU they request.<br><br>On <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/timesharing-gpus"><u>GKE, NVIDIA GPU time-slicing</u></a> is visible to the cluster control plane.&nbsp; Time-slicing is specified at the node pool level, with the GPU slice count set as <em>clients-per-gpu</em>.&nbsp; Luna handles the node pool setting when <em>nvidiaGPUTimeSlices</em> is greater than 1.&nbsp; On GKE, time-slicing is also visible to GPU workloads themselves: GPU workloads running on GKE time-sliced GPUs must include <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/timesharing-gpus#deploy"><u>nodeSelectors</u></a> indicating that the workload can use time-shared GPUs and specifying the max <em>clients-per-gpu</em> value allowed.&nbsp; Such workloads are limited to an <em>nvidia.com/gpu</em> resource limit value of 1.<br></div>
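<div class="paragraph" style="text-align:left;">To make the workload-side difference concrete, below is a minimal sketch of the GPU request in a pod template for each style.&nbsp; It condenses the full deployment specs in Appendix B; the container name and image are placeholders, and the sketch is illustrative rather than copy-paste ready.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># On AKS/EKS/OKE, time-slicing is transparent: request GPUs as usual.
spec:
  containers:
    - name: gpu-app                  # placeholder name
      image: my-gpu-image:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1          # allocated a GPU slice per GPU requested

# On GKE, time-shared workloads must opt in via nodeSelectors (see Appendix B.4)
# and are limited to nvidia.com/gpu: 1.
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
    - name: gpu-app                  # placeholder name
      image: my-gpu-image:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1</code></pre></div></div>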
<h2 class="wsite-content-title"><font size="6">Luna Benefits for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">We&rsquo;ve mentioned that running the Luna smart CA, configured to be aware of the GPU time-slices setting, <strong>reduces expenses as well as quota and capacity limit risks</strong>, by avoiding initial over-allocation of instances and by optimizing instance choice.&nbsp; Let&rsquo;s look at these two areas.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Luna Avoiding Instance Over-allocation for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">With respect to initial over-allocation of instances, a CA that is not aware of the GPU time-slices setting of N will initially allocate Nx more nodes than needed.&nbsp; For example, to place 4 1-GPU workloads, a CA that doesn&rsquo;t know time-slices=2 could allocate 2 2-GPU nodes, when 1 2-GPU node can provide 4 slices.&nbsp; Note that this initial over-allocation may unnecessarily hit instance quota or capacity limits.&nbsp; If the CA can subsequently consolidate the workloads and scale in the over-allocated node(s), the expense associated with this issue can be limited.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Luna Optimizing Instance Choice for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">With respect to optimizing instance choice, <strong>we observe that for many clouds, the cost of GPU instances increases non-linearly with the instance&rsquo;s GPU count</strong>.&nbsp; For example, in the AWS us-west region using Luna&rsquo;s current price list, a g4dn.xlarge with 1 T4 GPU is $0.526/hr, while a g4dn.12xlarge with 4 T4 GPUs is $3.912/hr; the latter is ~7.4x more costly for only 4x more T4 GPUs.&nbsp; Hence, allocating the instance GPU count in light of the time-slices setting can yield significant ongoing savings by choosing instances with fewer GPUs.&nbsp; And our experience is that instances with fewer GPUs tend to have higher quotas and more cloud capacity.<br><br>The benefit of optimizing instance choice can be substantial.&nbsp; In the next section, we present EKS, AKS, and OKE examples to illustrate.&nbsp; And we include a GKE example to show how a smart CA facilitates use of control-plane-aware NVIDIA GPU time-slicing.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Examples: Luna Optimizing Instance Choice for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">For our examples, we set NVIDIA GPU time-slices to 2.&nbsp; We consider small 1-GPU workloads that can run together on a single NVIDIA GPU node with time-slices=2.&nbsp; We configure Luna to create bin-packing nodes with 2 GPUs (via setting Luna option <em>binPackingNodeGPU</em>=2).&nbsp; And we configure Luna to bin-pack 2 1-GPU workloads onto the same node (via setting <em>binSelectPodGPUThreshold</em>=2).<br><br>For each of the 4 clouds supported by Luna, we consider the example of launching 2 small 1-GPU workloads.&nbsp; We examine the benefits of setting Luna&rsquo;s <em>nvidiaGPUTimeSlices</em> option to 2.<br><br></div><h2 class="wsite-content-title"><font size="5">EKS</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an EKS cluster with Luna, we use the deployment spec in Appendix B.1.&nbsp; The EKS cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the us-east region and the prices we give are from Luna&rsquo;s current price list.<br><br>When Luna is run without knowledge of the GPU time-slice setting (i.e., <em>nvidiaGPUTimeSlices</em> is set to the default of 1), it allocates a <em>g3.8xlarge</em> instance, which at $2.28/hr is the lowest price 2-GPU instance that meets the desired resource requirements for bin-packing.&nbsp; However, g3* instances have M60 GPUs, which were designed for graphics-intensive workloads and are not well-suited for ML tasks.&nbsp; Setting <em>binPackingNodeTypeRegexp</em> to ^([^g]|g($|[^3])).*$ to exclude g3 instances, Luna allocates a <em>g4dn.12xlarge</em>, which at $3.912/hr is 
the next lowest price multi-GPU instance, with 4 T4s.&nbsp; [We note that the default EBS size is insufficient for <em>g4dn.12xlarge</em> instances and the Luna option <em>aws.blockDeviceMappings</em> needs to be <a href="https://github.com/loftyoutcome/k8s-rag-llm/blob/main/demo/llm.gpu.service/block_device_mapping.json"><u>set</u></a> to allocate a larger EBS size.]<br><br>Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. When Luna is run with <em>nvidiaGPUTimeSlices=</em>2, it allocates a <em>g4dn.xlarge</em>, which is AWS&rsquo; least expensive 1-GPU instance type.&nbsp; At $0.526/hr, it is much cheaper than the previous 2 alternatives, with respect to both instance and per-slice price.&nbsp; This data is summarized in Table 1.<br></div><div><div id="530483268288104733" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">EKS</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1</td><td>g3.8xlarge</td><td>M60</td><td>2</td><td>$2.280/hr</td><td>$0.570/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1 and g3 instances excluded</td><td>g4dn.12xlarge</td><td>T4</td><td>4</td><td>$3.912/hr</td><td>$0.489/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>g4dn.xlarge</td><td>T4</td><td>1</td><td>$0.526/hr</td><td>$0.263/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 1: EKS w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 1 vs 2</div><h2 class="wsite-content-title"><font size="5">AKS</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an AKS cluster with Luna, we use the deployment spec in Appendix B.2.&nbsp; The AKS cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the east us region and the prices we give were recently fetched by Luna.<br><br>When Luna is run without knowledge of the GPU time-slice setting (i.e., <em>nvidiaGPUTimeSlices</em> is set to the default of 1), it allocates a <em>Standard_NC64as_T4_v3</em> instance, which at $4.352/hr is the lowest price multi-GPU instance that meets the desired resource requirements for bin-packing, comprising 4 T4 GPUs.<br><br>Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. 
When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates a <em>Standard_NC4as_T4_v3</em>, which at $0.526/hr is much cheaper than the <em>Standard_NC64as_T4_v3</em>, in terms of both instance and per-slice price.&nbsp; This data is summarized in Table 2.<br></div><div><div id="123753179810703844" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">AKS</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1</td><td>Standard_NC64as_T4_v3</td><td>T4</td><td>4</td><td>$4.352/hr</td><td>$0.544/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>Standard_NC4as_T4_v3</td><td>T4</td><td>1</td><td>$0.526/hr</td><td>$0.263/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 2: AKS w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 1 vs 2</div><h2 class="wsite-content-title"><font size="5">OKE</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an OKE cluster with Luna, we use the deployment spec in Appendix B.3.&nbsp; The OKE cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the US East region and the prices we give are from Luna&rsquo;s current price list.<br><br>When Luna is run without knowledge of the GPU time-slice setting, it fails to allocate any instance, because our account currently has no quota to run multi-GPU instances (and a quota increase request has been outstanding for an extended period).<br><br>When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates a <em>VM.GPU2.1</em>, which is $1.275/hr.&nbsp; In this case, the quota issue prevented the scenario from running at all without Luna configured to respect the time-slices setting.&nbsp; This data is summarized in Table 3.<br></div><div><div id="514567480305755560" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">OKE</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>VM.GPU2.1</td><td>P100</td><td>1</td><td>$1.275/hr</td><td>$0.6375/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 3: OKE w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 2<br></div><h2 class="wsite-content-title"><font size="5">GKE</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in a GKE cluster with Luna, we use the deployment spec in Appendix B.4.&nbsp; The GKE cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the us-central1 region and the prices we give are from Luna&rsquo;s current price list.<br><br>On GKE, NVIDIA time-slices cannot be enabled without setting Luna&rsquo;s <em>nvidiaGPUTimeSlices</em> option accordingly, since Luna needs to configure time-slicing in the node pool appropriately.<br><br>When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates an 
<em>n1-standard-4</em> node with 1 T4 GPU, which is $0.540/hr.&nbsp; In this case, Luna is required to enable NVIDIA GPU time-slicing on dynamically-allocated nodes.&nbsp; This data is summarized in Table 4.<br></div><div><div id="259747746951091691" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">GKE</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>n1-standard-4</td><td>T4</td><td>1</td><td>$0.540/hr</td><td>$0.270/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 4: GKE w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 2</div><h2 class="wsite-content-title"><font size="6">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">For cloud K8s clusters running non-demanding non-production GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs.&nbsp; In this blog, we&rsquo;ve explained how to set up NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters.&nbsp; We&rsquo;ve discussed the benefits of using the Luna smart CA with the time-slices setting, which include avoiding initial over-allocation of instances and optimizing instance choice.&nbsp; With respect to optimizing instance choice, we found that <strong>Luna instance choice halved the price per GPU slice on EKS and AKS</strong>. On OKE, we showed that Luna instance choice avoided hitting our current quota limits.&nbsp; And on GKE, we demonstrated how Luna facilitated CA dynamic node allocation interoperation with NVIDIA GPU time-slicing.<br><br>Want to see how effortlessly you can manage GPU time-slicing with <a href="https://www.elotl.co/luna.html">Luna</a>? Try Luna today with our <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> and experience the enhanced efficiency and flexibility it brings to your cloud environments.<br></div><h2 class="wsite-content-title"><font size="6">Future Work</font><br></h2><div class="paragraph" style="text-align:left;">GPU time-slicing is supported across NVIDIA GPU models, and provides flexible sharing levels.&nbsp; However, the technique does not enforce memory and fault isolation and targets non-production workloads.&nbsp; Recent NVIDIA GPUs support MIG (Multi-Instance GPU) sharing, which partitions each GPU into smaller, predefined instances, with memory and fault isolation enforced by the hardware.&nbsp; Luna support for NVIDIA MIG in Cloud K8s clusters is an area for future work, depending on customer interest in MIG allocation for their workloads.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Appendix A: Configuring NVIDIA GPU time-slicing in a K8s cluster</font><br></h2><div><div id="292860854573329584" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># This is for use on EKS, AKS, and OKE.  
Delete any existing NVIDIA daemonset installation
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

# Create file nvidia-device-plugin.yaml ConfigMap w/timeslice gpu replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: ${GPU_SLICE_COUNT}

# Set environment variable to desired replica count, e.g., 2
export GPU_SLICE_COUNT=2

# Deploy ConfigMap from file
envsubst &lt; nvidia-device-plugin.yaml | kubectl apply -f -

# Install/Upgrade NVIDIA driver using helm with ConfigMap specified
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Use on AKS and EKS
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version v0.15.0 --set config.name=nvidia-device-plugin --force --set gfd.enabled=true

# Use on OKE, which taints GPU nodes w/{effect: NoSchedule; key: nvidia.com/gpu; operator: Exists}
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version v0.15.0 --set config.name=nvidia-device-plugin --force --set gfd.enabled=true --set-json='nfd.worker.tolerations=[{"operator":"Exists"}]'

# Once driver is running, K8s sees each NVIDIA gpu as GPU_SLICE_COUNT replicas
kubectl describe node ip-192-168-48-69.us-west-2.compute.internal
&hellip;
Allocatable: &hellip;
  nvidia.com/gpu:     2</code></pre></div></div></div></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Appendix B: Deployment of 2 pods, each requesting 1 GPU</font><br></h2><h2 class="wsite-content-title"><font size="5">B.1 EKS</font><br></h2><div><div id="498740429818191016" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: dcgmproftester12
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.2 AKS</font><br></h2><div><div id="198519308207429155" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: 
gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: dcgmproftester12
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.3 OKE</font><br></h2><div><div id="943329793100407857" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          command: ["/bin/sh", "-c"]
          args:
            - while true; do ./vectorAdd; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.4 GKE</font><br></h2><div><div id="404207195917475897" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
# Luna options must include placeNodeSelector=true
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
        cloud.google.com/gke-max-shared-clients-per-gpu: "2"
      containers:
        - name: dcgmproftester11
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div>
<h2 class="wsite-content-title"><font size="6">References</font><br></h2><div class="paragraph"><span style="color:#000000; font-weight:400">Selected KubeCon talks</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">Unlocking the Full Potential of GPUs for AI Workloads on Kubernetes - Kevin Klues, NVIDIA;</span> <a href="https://www.youtube.com/watch?v=1QfShSQLsbs"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=1QfShSQLsbs</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2023NA</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">Using DRA for maximum flexibility in GPU scheduling, emerging K8s technology</span></li></ul></li><li style="color:#000000"><span style="color:#000000; font-weight:400">Efficient Access to Shared GPU Resources: Mechanisms and Use Cases - Diogo Guerra &amp; Diana Gaponcic;</span> <a href="https://www.youtube.com/watch?v=jkcEQE9C338"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=jkcEQE9C338</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2023EU</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">CERN experience with GPU time-sharing and MIG</span></li></ul></li><li style="color:#000000"><span style="color:#000000; font-weight:400">Improving GPU Utilization using Kubernetes - Maulin Patel &amp; Pradeep Venkatachalam, Google;</span> <a href="https://www.youtube.com/watch?v=X876kr-LkPA"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=X876kr-LkPA</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2022EU</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">GKE average GPU utilization is 25% and getting worse; discusses time-sharing and MIG</span></li></ul></li></ul><br><span style="color:#000000; font-weight:400">Selected Blogs</span><ul><li style="color:#000000"><a href="https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/"><span style="color:#1155cc; font-weight:400">https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/</span></a></li><li style="color:#000000"><a href="https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/"><span style="color:#1155cc; font-weight:400">https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/</span></a></li></ul><br><br><strong>Author:</strong><br><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[How to run the OpenTelemetry collector as a Kubernetes sidecar]]></title><link><![CDATA[https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar]]></link><comments><![CDATA[https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar#comments]]></comments><pubDate>Wed, 12 Jun 2024 17:49:47 
GMT</pubDate><category><![CDATA[Luna]]></category><category><![CDATA[Troubleshooting]]></category><guid isPermaLink="false">https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar</guid><description><![CDATA[At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna gets deployed on customers' clusters and helps scale up and down compute resources to optimize cost.Luna operates in environments where direct access isn’t always available. To overcome the problem of diagnosis and performance monitoring we have introduced the option for customers to securely send their Luna logs and metrics to our advanced log storage appliance. This empowers us to enhance our support capabilitie [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/opentelemetry-stacked-color.png?1718219527" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna gets deployed on customers' clusters and helps scale compute resources up and down to optimize cost.<br><br>Luna operates in environments where direct access isn&rsquo;t always available. To make diagnosis and performance monitoring possible in those environments, we have introduced the option for customers to securely send their Luna logs and metrics to our log storage appliance. This empowers us to enhance our support capabilities and provide even more effective assistance to our customers.<br><br><a href="https://opentelemetry.io/">OpenTelemetry</a> is fast becoming the standard for collecting metrics and logs in Kubernetes environments. We opted to run the OpenTelemetry collector as a sidecar for the <a href="https://www.elotl.co/luna.html">Luna cluster autoscaler</a>. It gathers and sends the logs from a single pod, so running it as a <a href="https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/">sidecar</a> was a perfect match.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">Sidecar for a single pod</font><br></h2><div class="paragraph">A sidecar is a container that runs alongside the main container in the same pod. In our case, the Luna autoscaler writes logs to files in the /logs directory. To read these logs, we needed to share that directory between the main container and the sidecar.<br>With Kubernetes, the <a href="https://opentelemetry.io/docs/collector/" title="https://opentelemetry.io/docs/collector/">OpenTelemetry collector</a> can be deployed as a daemonset, as a deployment, or as a sidecar. 
While the daemonset and deployment set-ups are well-documented in the <a href="https://opentelemetry.io/docs/collector/installation/#kubernetes" title="https://opentelemetry.io/docs/collector/installation/#kubernetes">official documentation</a>, the sidecar set-up is documented using the <a href="https://opentelemetry.io/docs/kubernetes/operator/" title="https://opentelemetry.io/docs/kubernetes/operator/">OpenTelemetry operator</a>. As of the writing of this blog post, the documentation does not cover deploying the collector as a sidecar for a single pod.<br></div><h2 class="wsite-content-title"><font size="5">Add the sidecar container</font><br></h2><div class="paragraph">For this post, the container we want to scrape the logs from will be named <em>my-pod</em>. First, add the volume and mount it within the main container, which is part of a deployment:<br></div><div><div id="880990958536351986" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: my-pod
          ...
          volumeMounts:
            - name: logs
              mountPath: /logs
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi</code></pre></div></div></div></div><div class="paragraph">Next, add the OpenTelemetry collector container. Note that we use the <em>otel/opentelemetry-collector-contrib</em> image because it supports reading local directories, unlike the default <em>otel/opentelemetry-collector</em> image:<br></div><div><div id="210723534929510977" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        - name: my-pod
          ...
          volumeMounts:
            - name: logs
              mountPath: /logs
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:latest
          volumeMounts:
            - name: logs
              mountPath: /logs</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Configuring the OpenTelemetry Collector</font><br></h2><div class="paragraph">With the two containers running and sharing a mounted directory, we need to configure the collector to:<ol><li>Gather logs from the /logs directory</li><li>Process these logs and add metadata</li><li>Upload the logs to our log storage service</li></ol>We'll add a ConfigMap and expose it to the collector by mounting it.<br><br></div><div><div id="790658465674832539" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: ""</code></pre></div></div></div></div>
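<div class="paragraph">Once the deployment is applied, it can be useful to confirm that the sidecar itself came up cleanly. Assuming the deployment is named <em>my-deployment</em> (a placeholder), the logs of each container can be checked separately:<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Check the collector sidecar's own output (deployment name is a placeholder)
kubectl logs deployment/my-deployment -c opentelemetry-collector
# And the main container's logs, for comparison
kubectl logs deployment/my-deployment -c my-pod</code></pre></div></div>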
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: ""
</code></pre></div></div></div></div><div class="paragraph">The collector configuration consists of four sections:<ol><li>receivers: Specifies where the logs should be read from</li><li>processors: Adds metadata to the logs</li><li>exporters: Sets the endpoint for our cloud storage service</li><li>service: Combines the parameters from the previous sections<br></li></ol></div><h2 class="wsite-content-title"><font size="4">Receivers</font><br></h2><div class="paragraph">For <em>receivers</em>, we use the <em>filelog/app</em> receiver to read data from the <em>/logs</em> directory:<br></div><div><div id="393602726315023434" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  receivers:
    filelog/app:
      include: [ /logs/* ]
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Processors</font><br></h2><div class="paragraph">For <em>processors</em>, we use the <em>batch</em> and <em>resource</em> processors. The <em>resource</em> processor allows adding keys with desired metadata to each log via the <em>attributes</em> subsection:<br></div><div><div id="564130348992469822" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  processors:
    batch:
      timeout: 10s
    resource:
      attributes:
        - key: my-metadata-key
          value: my-metadata-value
          action: insert
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Exporters</font><br></h2><div class="paragraph">For <em>exporters</em>, we use the <em>otlp</em> (OpenTelemetry Protocol) exporter to send the logs to our cloud storage service:<br></div><div><div id="930561862725507735" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  exporters:
    otlp:
      endpoint: "my.cloud.storage.hostname"
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Service</font><br></h2><div class="paragraph">Finally, for <em>service</em>, we combine all the predefined sections into a logical pipeline:<br></div><div><div id="408717056410369569" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  service:
    pipelines:
      logs:
        receivers: [filelog/app]
        processors: [batch, resource]
        exporters: [otlp]
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Full ConfigMap</font><br></h2><div class="paragraph">Here's the complete ConfigMap:</div><div><div id="166197026218210177" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: |
    receivers:
      filelog/app:
        include: [ /logs/* ]
    processors:
      batch:
        timeout: 10s
      resource:
        attributes:
          - key: my-metadata-key
            value: my-metadata-value
            action: insert
    exporters:
      otlp:
        endpoint: "my.cloud.storage.hostname"
    service:
      pipelines:
        logs:
          receivers: [filelog/app]
          processors: [batch, resource]
          exporters: [otlp]
</code></pre></div></div></div></div><div class="paragraph">With the ConfigMap ready, we pass it to the collector via the <em>--config=/conf/collector.yaml</em> argument; the /conf volume mount exposes the ConfigMap <em>opentelemetry-collector-config</em> as a YAML file:<br></div><div><div id="795969296319590157" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - --config=/conf/collector.yaml
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">The full listing</font><br></h2><div class="paragraph">The final deployment snippet will look like this:</div><div><div id="698620353372295976" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: my-pod
          volumeMounts:
            - name: logs
              mountPath: /logs
          ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - --config=/conf/collector.yaml
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: |
    receivers:
      filelog/app:
        include: [ /logs/* ]
    processors:
      batch:
        timeout: 10s
      resource:
        attributes:
          - key: my-metadata-key
            value: my-metadata-value
            action: insert
    exporters:
      otlp:
        endpoint: "my.cloud.storage.hostname"
    service:
      pipelines:
        logs:
          receivers: [filelog/app]
          processors: [batch, resource]
          exporters: [otlp]
</code></pre></div></div></div></div><div class="paragraph">With this set-up in place, we can send our logs to our log storage appliance and help our customers more effectively when they ask for support.</div><div class="paragraph"><strong><br>Author:</strong><br>Henry Precheur (Senior Staff Engineer, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Unleashing the Power of ARM: Elevating Your Kubernetes Workloads with ARM Nodes]]></title><link><![CDATA[https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes]]></link><comments><![CDATA[https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes#comments]]></comments><pubDate>Mon, 29 Apr 2024 12:55:12 GMT</pubDate><category><![CDATA[ARM]]></category><category><![CDATA[Autoscaling]]></category><guid isPermaLink="false">https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes</guid><description><![CDATA[ The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile device domain. This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and scenarios where x86-64 architectures may perform better and thus continue to be a better fit. Lastly we'll describe a seamless way to add AR [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:267px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/arm64-green-powerful.jpg?1714396460" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile device domain. 
This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and scenarios where x86-64 architectures may perform better and thus continue to be a better fit. Lastly we'll describe a seamless way to add ARM nodes to your Kubernetes clusters.<br /><br />In this blog, for the sake of clarity and brevity, I will be using the term 'ARM' to refer to ARM64 or ARM 64-bit processors, while 'x86' or 'x86-64' will be used interchangeably to denote Intel or AMD 64-bit processors.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="4">What Kubernetes Workloads Tend To Be Ideal for ARM Processors?</font><br></h2>  <h2 class="wsite-content-title"><font size="3">Inference-heavy tasks:</font></h2>  <div class="paragraph" style="text-align:left;">While the computations involved in Deep Learning training typically require GPUs for acceptable performance, DL inference is less computationally intense.&nbsp; Tasks that apply pre-trained models for DL regression or classification can benefit from ARM's power/performance relative to GPU or x86-64 systems. We presented data on running inference on ARM64 in our <a href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf" target="_blank">Scale20x talk</a>.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <h2 class="wsite-content-title"><font size="3">Web Servers and Microservices:</font><br></h2>  <div class="paragraph" style="text-align:left;">Web servers and microservices typically involve handling numerous concurrent connections and lightweight compute tasks. They can perform acceptably on ARM64-based Kubernetes deployments, serving web content, handling API requests, and running containerized microservices efficiently. With the increasing availability of ARM-based cloud instances, organizations can optimize their web hosting infrastructure for cost-effectiveness and scalability by leveraging ARM architecture.<br></div>  <h2 class="wsite-content-title"><font size="3">Development and Testing Environments:</font><br></h2>  <div class="paragraph" style="text-align:left;">Development and testing environments, where workloads are often smaller in scale and resource requirements are modest, may be excellent candidates for ARM-based Kubernetes deployments. Developers can leverage ARM-based instances to build, test, and deploy applications in an environment that closely resembles production while minimizing costs. ARM-based Kubernetes resources can give developers an inexpensive platform for continuous integration, automated testing, and DevOps workflows.<br></div>  <h2 class="wsite-content-title"><font size="4">What Kubernetes Workloads Might be Less Suited for ARM Processors?</font><br></h2>  <div class="paragraph" style="text-align:left;">While ARM processors offer advantages for some workloads, not all Kubernetes workloads are equally suited for this architecture. Below are some specific scenarios where opting for ARM processors may not align with the workload's needs or requirements.<br></div>  <h2 class="wsite-content-title"><font size="3">High-Performance Computing (HPC):</font><br></h2>  <div class="paragraph" style="text-align:left;">HPC tasks often require specialized hardware and intense computational power, making them less suited for ARM processors. 
While ARM has advanced, x86-based processors may better handle complex simulations and scientific computing.<br></div>  <h2 class="wsite-content-title"><font size="3">Legacy Enterprise Applications:</font><br></h2>  <div class="paragraph" style="text-align:left;">ARM processors may pose compatibility challenges for legacy enterprise apps optimized for x86-64 architectures. Migrating such apps to ARM-based Kubernetes setups may require non-trivial re-engineering and testing, which can be difficult or costly.<br></div>  <h2 class="wsite-content-title"><font size="3">Containerized Databases and Analytics:</font><br></h2>  <div class="paragraph" style="text-align:left;">ARM processors may struggle with high I/O demands and data-intensive tasks compared to x86-based processors. For large-scale data processing and high-volume databases, x86-64 architectures may offer better performance.<br />In summary, while ARM processors do have advantages, it's crucial to assess their suitability for specific Kubernetes workloads, especially considering performance and compatibility with existing applications.<br /><br /></div>  <h2 class="wsite-content-title"><font size="4">On the Fence About ARM Nodes Despite an Ideal Workload Fit?</font><br></h2>  <div class="paragraph" style="text-align:left;">Several factors may make the move worthwhile, primarily Cost Savings, Energy Efficiency, and Performance. Let's explore these in detail.<br></div>  <h2 class="wsite-content-title"><font size="3">Cost Savings:</font><br></h2>  <div class="paragraph" style="text-align:left;">When it comes to running Kubernetes workloads, cost is often a concern for organizations, especially those managing large-scale deployments. ARM processors present an interesting proposition in this regard. Their lower upfront hardware costs and reduced operational expenses can make them an attractive alternative to traditional x86-64 processors. In cloud environments like Amazon EKS and Google GKE, where instances are billed based on usage, the cost differential between ARM and x86-64 instances can translate into significant savings over time.<br></div>  <h2 class="wsite-content-title"><font size="3">Energy Efficiency:</font><br></h2>  <div class="paragraph" style="text-align:left;">Another compelling advantage of ARM processors for Kubernetes workloads lies in their energy efficiency. ARM architecture is known for its ability to deliver comparable performance to x86-64 processors while consuming less power. This energy efficiency not only reduces operational costs but also contributes to sustainability efforts by minimizing the environmental impact of cloud computing. In a world increasingly concerned with reducing carbon footprints and achieving energy efficiency targets, ARM-based Kubernetes deployments align well with green computing initiatives. By harnessing the power of ARM architecture, organizations may be able to achieve a more sustainable and environmentally friendly approach to Kubernetes infrastructure management.<br></div>  <h2 class="wsite-content-title"><font size="3">Performance:</font><br></h2>  <div class="paragraph" style="text-align:left;">Contrary to popular belief, ARM processors can deliver the same or better performance for Kubernetes workloads compared to traditional x86-64 processors, in certain scenarios. 
While ARM-based instances may have historically been associated with low-power devices like smartphones and IoT gadgets, recent advancements in ARM architecture have ushered in a new era of performance capabilities. With ARM-based servers becoming increasingly prevalent in cloud environments, developers and operators have access to a wider range of ARM-powered instances than in the past. For many workloads, including web applications, microservices, and other similar containerized workloads, ARM processors offer ample computational power and efficiency. By carefully selecting ARM-based instances tailored to their specific workload characteristics, organizations can achieve optimal performance and resource utilization in their Kubernetes deployments.<br />In conclusion, ARM processors can offer benefits for Kubernetes workloads in cloud environments. From cost savings and energy efficiency to impressive performance capabilities, ARM architecture presents a viable alternative to traditional x86-64 processors for some workloads. By leveraging ARM-based instances, organizations can potentially optimize their cloud infrastructure costs, reduce operational expenses, and contribute to sustainability initiatives. Despite historical associations with low-power devices, ARM processors have evolved to deliver competitive performance for a wide range of Kubernetes workloads. With careful selection and optimization, ARM-based instances may be able to provide organizations with the performance and efficiency they need while embracing the advantages of ARM architecture.<br /><br /></div>  <h2 class="wsite-content-title"><font size="4">Optimizing Kubernetes Node Allocation with Intelligent Autoscaling</font><br></h2>  <div class="paragraph" style="text-align:left;">For Kubernetes deployments seeking to incorporate ARM nodes seamlessly, leveraging an intelligent autoscaler like Luna offers a streamlined solution. With Luna, ARM nodes can be effortlessly provisioned alongside x86-64 nodes, improving both cost efficiency and resource utilization.<br />By configuring Luna to allocate ARM nodes when they offer better pricing compared to x86-64 counterparts, administrators can obtain cost savings without operational complexity. Conversely, Luna intelligently allocates x86-64 nodes when they are the more cost-effective option, maintaining a balanced infrastructure and cost savings.<br />To ensure compatibility across architectures, container images must be multi-arch, enabling them to run seamlessly on both x86-64 and ARM nodes. Moreover, Luna provides granular control over node allocation through annotations, allowing administrators to specify preferences for instance families or to exclude certain families as needed. A minimal sketch of an architecture-flexible workload is shown below.<br /></div>
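  <div class="paragraph" style="text-align:left;">As a minimal sketch (the Deployment name and image below are hypothetical), a workload can prefer ARM nodes while still tolerating x86-64 ones by using a soft nodeAffinity on the standard <em>kubernetes.io/arch</em> node label:<br /></div>  <div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app              # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      affinity:
        nodeAffinity:
          # Soft preference: schedule on ARM nodes when available,
          # but fall back to other architectures otherwise.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]
      containers:
        - name: web
          # Must be a multi-arch image so it runs on both arm64 and amd64 nodes.
          image: registry.example.com/my-web-app:latest
</code></pre></div></div></div>  <div class="paragraph" style="text-align:left;">Because the preference is soft rather than a hard nodeSelector, the workload keeps running even when no ARM capacity is available.<br /><br />In summary, leveraging Luna autoscaler streamlines ARM node allocation in Kubernetes environments, enabling organizations to harness the benefits of ARM architecture while maintaining flexibility and cost efficiency in their deployments.<br /><br />To delve deeper into Luna's intelligent autoscaling of x86-64 and ARM nodes, check out our <a href="https://www.elotl.co/luna.html">Luna product page</a> for details. For step-by-step guidance, be sure to review our <a href="https://docs.elotl.co/luna/intro/">Documentation</a>. Ready to test Luna firsthand? 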
<a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /></div>  <div class="paragraph"><strong>Author:</strong><br /><span></span>Justin Willoughby (Principal Solutions Architect, Elotl)<br /><span></span><strong>Contributors:</strong><br /><span></span>Anne Holler (Chief Scientist, Elotl)<br /><span></span>Henry Precheur (Senior Staff Engineer, Elotl)<br><br /><span></span></div>  <div><div style="margin: 10px 0 0 -10px"> <a title="Download file: scale20x.pdf" href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf"><img src="//www.weebly.com/weebly/images/file_icons/pdf.png" width="36" height="36" style="float: left; position: relative; left: 0px; top: 0px; margin: 0 15px 15px 0; border: 0;" /></a><div style="float: left; text-align: left; position: relative;"><table style="font-size: 12px; font-family: tahoma; line-height: .9;"><tr><td colspan="2"><b> scale20x.pdf</b></td></tr><tr style="display: none;"><td>File Size:  </td><td>1215 kb</td></tr><tr style="display: none;"><td>File Type:  </td><td> pdf</td></tr></table><a title="Download file: scale20x.pdf" href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf" style="font-weight: bold;">Download File</a></div> </div>  <hr style="clear: both; width: 100%; visibility: hidden"></hr></div>]]></content:encoded></item><item><title><![CDATA[The Benefits of Cycling Kubernetes Nodes: Optimizing Performance, Reliability, and Security]]></title><link><![CDATA[https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security]]></link><comments><![CDATA[https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security#comments]]></comments><pubDate>Tue, 09 Apr 2024 17:41:48 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security</guid><description><![CDATA[ Wondering whether cycling out older Kubernetes nodes periodically is a good idea? In the world of Kubernetes administration, the practice of rotating nodes often takes a backseat, even though it holds considerable advantages. While it's true that node cycling isn't universally applicable, it's worth exploring its merits for your environment. In this article, I will delve into many of the compelling reasons why considering node rotation might be beneficial for your clusters. We'll explore the ad [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:300px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/the-benefits-of-cycling-kubernetes-nodes.jpg?1712685298" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;"><span>Wondering whether cycling out older Kubernetes nodes periodically is a good idea?</span> In the world of Kubernetes administration, the practice of rotating nodes often takes a backseat, even though it holds considerable advantages. While it's true that node cycling isn't universally applicable, it's worth exploring its merits for your environment. In this article, I will delve into many of the compelling reasons why considering node rotation might be beneficial for your clusters. We'll explore the advantages of node rotation in Kubernetes and how it contributes to resource optimization, fault tolerance, security, and performance improvements.<br /><br />Why might someone think cycling of Kubernetes nodes is unnecessary? One reason for this could be a misconception about the stability of Kubernetes clusters. In environments where nodes rarely fail or resource usage remains relatively consistent, there might be a tendency to prioritize other tasks over node cycling. Additionally, the perceived complexity of implementing node rotation strategies, particularly in large-scale or production environments, could dissuade teams from actively considering it. Some teams might also be unaware of the potential performance gains and reliability improvements that can result from regular node cycling. However, despite these challenges or misconceptions, it's crucial to recognize that neglecting node rotation can lead to issues such as <span>resource exhaustion, reduced fault tolerance, security vulnerabilities, difficulties upgrading to newer versions, and degraded performance over time</span>. By acknowledging the importance of node cycling and implementing proactive strategies, administrators and DevOps teams can ensure the long-term health, resilience, and efficiency of their Kubernetes infrastructure. So, without delay, let's delve into the specifics.<br /><br /></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Node rotation in Kubernetes aids in maintaining a secure environment through timely patch management and isolation of compromised nodes. By cycling nodes at regular intervals, security patches and updates can be deployed consistently, reducing the attack surface and mitigating potential vulnerabilities. 
In the event of a compromised node, cycling it out of the cluster helps contain the threat and prevent further damage, enhancing overall security posture.<br /><br />James Cunningham, a Lead Infrastructure Engineer at PlanetScale, highlights the multifaceted benefits of node cycling within Kubernetes environments, stating, <em>"It optimizes workload distribution, ensures a seamless refresh of nodes with the newest kernel and OS updates, all while maintaining stability and virtually eliminating state drift."</em> This encapsulates the transformative impact node cycling has on infrastructure maintenance and performance optimization. By periodically refreshing nodes, organizations can ensure that workloads are efficiently distributed, leveraging the latest kernel and OS updates seamlessly.<br /><br />Moreover, the assurance of utilizing updated packages without the need for disruptive reboots enhances system stability and security. Additionally, the mitigation of state drift to near-zero levels minimizes inconsistencies across the infrastructure, fostering a more reliable and predictable operational environment. Through proactive node cycling practices, organizations can effectively uphold operational excellence while continuously adapting to evolving workload demands.<br />Cycling Kubernetes nodes leads to performance improvements by leveraging newer hardware and optimizing networking infrastructure. Refreshing the underlying hardware or virtual infrastructure enhances performance by capitalizing on advancements in technology. Additionally, redistributing workloads across the cluster reduces resource contention and bottlenecks, resulting in better performance for applications and services running on Kubernetes.<br /><br />The adoption of efficient node management practices is pivotal for maintaining a resilient and high-performing infrastructure. James further sheds light on the effectiveness of node cycling within this context: <em>&ldquo;Node cycling serves as our seamless approach to upgrading kubelets post-upgrading the apiservers. Rather than setting off on some grand rescheduling process across the whole cluster after upgrading the apiservers, we set a 30-day timer and let computers do the hard work.&rdquo;</em> This quote underscores the practical benefits of node cycling, particularly in simplifying the upgrade process while reducing operational overhead. With node cycling, administrators can seamlessly ensure that kubelets are upgraded following apiserver updates, all without the need for immediate, large-scale rescheduling efforts. This streamlined approach not only enhances operational efficiency but also bolsters system reliability by keeping critical components up-to-date without interrupting ongoing workloads. By integrating node cycling into their Kubernetes management workflows, organizations, such as PlanetScale, can effectively navigate the complexities of infrastructure maintenance and stay agile in an ever-evolving landscape.<br /><br />Regular node cycling also facilitates proactive fault detection and mitigation. By replacing nodes on a scheduled basis, potential hardware failures or issues are addressed before they impact application availability. This approach ensures redundancy within the cluster, enabling seamless workload transition in case of unexpected node failures. 
Additionally, through automated health checks and compatibility validations during node cycling, the cluster's resilience and stability are reinforced, guaranteeing a robust foundation for running mission-critical applications.<br /><br />Wondering how to automate node cycling in your Kubernetes environment? <span>There are several methods available,</span> one of which is utilizing Luna. Luna stands out as an intelligent autoscaler capable of not only provisioning and managing nodes for workloads but also orchestrating the removal of nodes beyond a specified NodeTTL (Time to Live) value. This feature ensures efficient node cycling based on your defined TTL, streamlining operations effortlessly. For instance, if you prefer a weekly node cycling routine, simply configure the NodeTTL parameter within Luna to 7d, and voila! Luna takes care of the rest, seamlessly managing the node lifecycle within your cluster.<br /><br />While node cycling offers numerous benefits for maintaining a healthy and efficient Kubernetes infrastructure, there are certain scenarios where it may not be practical or necessary. One such exception is in environments where workloads require long-running processes or persistent connections that cannot easily be migrated to other nodes. In these cases, interrupting these processes by cycling out nodes could result in service disruptions or data loss. Additionally, in environments with strict compliance or regulatory requirements, the process of cycling nodes out may introduce additional complexity and risk, especially if it involves downtime or configuration changes that could impact compliance status. So while node cycling is generally beneficial for most Kubernetes deployments, it's essential to consider these exceptions and weigh the potential trade-offs before implementing a node rotation strategy. Fortunately, Luna provides a solution for critical workloads that cannot or should not be terminated during node cycling processes. <span>With the capability to set a "do-not-evict" annotation on such workloads, Luna ensures that pods remain untouched until they have terminated naturally or the annotation is removed.</span> This functionality enables the smooth cycling of nodes within the cluster while avoiding any disruption to critical workloads (a sketch of such an annotated pod appears below).<br /><br /></div>
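<div class="paragraph" style="text-align:left;">As a minimal sketch of such a protected workload (the pod name and image are hypothetical, and the annotation key shown is illustrative; consult the Luna documentation for the exact key your Luna version uses), the annotation is set in the pod's metadata:<br /></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: long-running-job                 # hypothetical critical workload
  annotations:
    # Illustrative annotation key -- check the Luna documentation
    # for the exact "do-not-evict" key used by your Luna version.
    pod.elotl.co/do-not-evict: "true"
spec:
  containers:
    - name: worker
      image: registry.example.com/batch-worker:latest
</code></pre></div></div></div><div class="paragraph" style="text-align:left;">Once the pod terminates naturally, or once the annotation is removed, its node becomes eligible for cycling again.<br /><br />In conclusion, cycling Kubernetes nodes at regular intervals offers significant benefits across various aspects of Kubernetes management. By optimizing resource utilization, enhancing fault tolerance and reliability, strengthening security measures, and improving performance, node rotation contributes to a more efficient and resilient Kubernetes environment. Incorporating node cycling into your Kubernetes maintenance strategy can help ensure the smooth operation of your containerized workloads and enhance the overall stability of your infrastructure.<br /><br />To delve deeper into Luna's intelligent autoscaling capabilities, including node cycling, explore our <a href="https://www.elotl.co/luna.html">product page</a> for details. For step-by-step guidance, consult our <a href="https://docs.elotl.co/luna/intro/" target="_blank">Documentation</a>. Ready to test Luna firsthand? 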
<a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><strong>Contributors:</strong><br />James Cunningham (Lead Infrastructure Engineer, PlanetScale)<br />Henry Precheur (Senior Staff Engineer, Elotl)<br />Anne Holler (Chief Scientist, Elotl)<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Training with Ray and Ludwig using Elotl Luna]]></title><link><![CDATA[https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna]]></link><comments><![CDATA[https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna#comments]]></comments><pubDate>Thu, 22 Feb 2024 15:25:19 GMT</pubDate><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><guid isPermaLink="false">https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna</guid><description><![CDATA[In this brief summary blog, we delve into the intriguing realm of GPU cost savings in the cloud through the use of Luna, an Intelligent Autoscaler. If you're passionate about harnessing the power of Deep Learning (DL) while optimizing expenses, this summary is for you. Join us as we explore how innovative technologies are revolutionizing the landscape of resource management in the realm of Deep Learning. Let's embark on a journey where efficiency meets intelligence, promising both technical insi [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/cloud-dl-cost.jpg?1708617557" style="margin-top: 5px; margin-bottom: 0px; margin-left: 20px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;"><span>In this brief summary blog, we delve into the intriguing realm of GPU cost savings in the cloud through the use of Luna, an Intelligent Autoscaler.</span> <span>If you're passionate about harnessing the power of Deep Learning (DL) while optimizing expenses, this summary is for you.</span> Join us as we explore how innovative technologies are revolutionizing the landscape of resource management in the realm of Deep Learning. Let's embark on a journey where efficiency meets intelligence, promising both technical insights and a practical solution.<br><br>Deep Learning has and continues to transform many industries such as AI, Healthcare, Finance, Retail, E-commerce, and many others. Some of the challenges with DL include <span>its</span> high cost and operational overhead:<ol><li><em>Compute Costs</em>: Deep learning models require significant computational resources, which lead to high costs, especially for complex or large-scale projects. 
This is even more true when compute remains provisioned while it&rsquo;s not needed.</li><li><em>Instance Management</em>: Managing cloud instances for training, inference, and experimentation creates operational overhead. This includes provisioning and configuring virtual machines, monitoring resource usage, and optimizing instance types for performance and cost efficiency.</li><li><em>Infrastructure Scaling</em>: Scaling deep learning workloads in the cloud involves dynamically adjusting compute resources to meet demand. This requires optimizing resource allocation to minimize costs while ensuring sufficient capacity.</li></ol><br>Open-source platforms like <a href="https://www.ray.io/"><span>Ray</span></a> and <a href="https://ludwig.ai/latest/"><span>Ludwig</span></a> have broadened DL accessibility, yet DL models&rsquo; <span></span>intensive GPU resource demands present financial hurdles. Addressing this, Elotl Luna emerges as a solution, streamlining compute for Kubernetes clusters without the need for manual scaling, which often results in wasted spend.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">Running Ray and Ludwig on cloud Kubernetes clusters using Luna, an Intelligent Kubernetes Cluster Autoscaler, is a great approach to mitigating the challenges often faced with DL and public cloud GPU resource demands. Luna dynamically adjusts GPU resources based on workload needs, resulting in substantial efficiency gains.<br><br>Luna showed significant improvements over a fixed-size Ray cluster on AWS, all while preserving AutoML performance quality:<ul><li>Reduced elapsed time by 61%</li><li>Reduced compute cost by 54%</li><li>Reduced idle Ray cluster cost by 66%</li></ul><br><span>The exploration and testing encompassed ML experiments utilizing Ludwig v0.4.1, leveraging its AutoML capability. These results were obtained during the ML training workload aimed at validating the newly added AutoML feature in Ludwig v0.4.1.</span> Luna&rsquo;s resource management can be used to provide just-in-time compute for Ludwig&rsquo;s AutoML across various datasets, employing Ray Tune for hyperparameter search on GPU-enabled workers. Results prove competitive with manually-tuned models, showcasing Luna&rsquo;s adaptability and efficiency in DL workflows.<br><br>Lessons learned underscore the substantial savings achieved in workload elapsed time, execution costs, idle costs, and operational complexity. This is just a glimpse into the transformative impact of Luna on DL training workloads in the cloud. For a comprehensive understanding, dive into the full details of the <a href="https://www.cncf.io/blog/2022/02/15/managing-public-cloud-resources-for-deep-learning-training-experiments-and-lessons-learned/">Managing public cloud resources for deep learning training: experiments and lessons learned</a> blog on the <span>Cloud Native Computing Foundation site</span>.<br><br>Furthermore, we encourage you to explore our subsequent research, which validates the efficacy of Ludwig v0.5.0 AutoML for text classification datasets. 
In this study, Luna also showed significant savings:<ul><li>Reduced elapsed time by 7%</li><li>Reduced compute cost by 59%</li><li>Reduced idle Ray cluster cost by <span>66</span>%</li></ul><br>The full details of this experiment can be found by viewing the slides and/or video recording from the<a href="https://kubernetesaidayeu22.sched.com/event/zr9E/efficient-automl-with-ludwig-ray-and-nodeless-kubernetes-anne-marie-holler-elotl-travis-addair-predibase.">&nbsp;Efficient AutoML with Ludwig, Ray, and Nodeless Kubernetes</a>&nbsp;session from Kubernetes AI Day Europe.<br><br>In both cases, Luna was able to dramatically lower the cost and enhance the performance of the Deep Learning jobs.<br></div><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:65.055762081784%; padding:0 15px;"><div><div id="146835663495715482" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><style>    table {      border-collapse: collapse; /* Remove space between cells */      border: 1px solid #ddd; /* Border around whole table */    }    th, td {      padding: 14px; /* Adjust padding as needed */      border: 1px solid #ddd; /* Border between cells */    }    th {      background-color: #f5f5f5; /* Light gray header */    }</style><table><thead><tr><th>Reduced</th><th>First Experiment</th><th>Second Experiment</th></tr></thead><tbody><tr><td>Elapsed time by</td><td>61%</td><td>7%</td></tr><tr><td>Compute cost by</td><td>54%</td><td>59%</td></tr><tr><td>Idle Ray cluster cost by</td><td>66%</td><td>66%</td></tr></tbody></table></div></div></td><td class="wsite-multicol-col" style="width:34.944237918216%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div><div class="paragraph" style="text-align:left;"><br>While this summary has provided a glimpse into the fascinating world of GPU cost savings with Luna, we must acknowledge that it merely scratches the surface of the comprehensive insights offered in the original blog and subsequent presentation. We hope this summary has sparked your curiosity and motivated you to explore the full depth of knowledge available. For a more detailed understanding, we encourage you to dive into the original blog and presentations linked above.<br><br>To explore the robust features and capabilities of Luna in greater detail, visit our <a href="https://www.elotl.co/luna.html">Luna Product</a> page. For comprehensive guidance, refer to our <a href="https://docs.elotl.co/luna/intro/" target="_blank">documentation</a>. Ready to experience firsthand the seamless management of compute for GPU workloads? 
Start testing Luna today and discover the efficiency and flexibility it offers for your cloud environments.<br><br><strong>Author:</strong><br>Justin Willoughby (Principal Solutions Architect, Elotl)<br><br><strong>Authors/Contributors of the full blog on which this summary blog is based:</strong><br>Anne Holler, Chi Su, Travis Addair, Henry Pr&ecirc;cheur, Pawe&#322; Bojanowski, Madhuri Yechuri, and Richard Liaw<br></div>]]></content:encoded></item><item><title><![CDATA[A Guide to Disaster Recovery for FerretDB with Elotl Nova on Kubernetes]]></title><link><![CDATA[https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes]]></link><comments><![CDATA[https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes#comments]]></comments><pubDate>Mon, 12 Feb 2024 20:00:29 GMT</pubDate><category><![CDATA[Disaster Recovery]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes</guid><description><![CDATA[Originally published on blog.ferretdb.io Running a database without a disaster recovery process can result in loss of business continuity, resulting in revenue loss and reputation loss for a modern business. Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring contin [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">Originally published on <a href="https://blog.ferretdb.io/guide-disaster-recovery-ferretdb-elotl-nova-kubernetes/" target="_blank">blog.ferretdb.io</a><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-elotl-nova-8ae8904f848588c61bcf90b3803d2d11_orig.jpg" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">Running a database without a disaster recovery process can result in loss of business continuity, and with it loss of revenue and reputation for a modern business.<br><br>Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring continuous operation of the databases backing your production apps is critical to avoid losing your customers' trust.<br><br>Successful disaster recovery requires:<br><ul><li>Reliable components to automate backup and recovery<br></li><li>A watertight way to identify problems<br></li><li>A list of steps to revive the database<br></li><li>Regular testing of the recovery process<br></li></ul><br>This blog post shows how to automate these four aspects of disaster recovery using FerretDB, Percona PostgreSQL and Nova. 
Nova automates parts of the recovery process, reducing mistakes and getting your data back online faster.<br></div><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="6">Components overview</font><br></h2><div class="paragraph">FerretDB is an open-source proxy that translates MongoDB wire protocol queries to SQL, with PostgreSQL or SQLite as the database engine.<br><br>Percona for PostgreSQL is a tool set to manage your PostgreSQL database system: it installs PostgreSQL and adds a selection of extensions that help manage the database.<br><br>Nova is a multi-cloud, multi-cluster control plane that orchestrates workloads across multiple Kubernetes clusters via user-defined policies.</div><h2 class="wsite-content-title"><font size="6">Defining a Disaster Recovery setup for FerretDB + Percona Postgres</font><br></h2><div class="paragraph">FerretDB operates as a stateless application; therefore, during recovery, Nova only needs to make sure it is connected to a primary PostgreSQL database.<br><br>To implement PostgreSQL's Disaster Recovery (DR), a primary cluster, standby cluster, and object storage, such as an S3 bucket, are required. The storage will be used for storing periodic backups performed on the primary cluster. The standby cluster will be configured to replay from the backup location, keeping it in sync with the primary. When disaster strikes, the standby is set as a new primary to keep the database running (more details can be found here: Percona Blog).<br><br>As the entry point for our database, a proxy in front of it directs communication to the appropriate instance.<br></div><h2 class="wsite-content-title"><font size="5">Basic setup</font><br></h2><div class="paragraph">Setup involves three clusters:<ol><li>Workload Cluster 1 contains:<br>&nbsp; Percona Operator<br>&nbsp; PostgreSQL primary cluster<br>&nbsp; FerretDB</li><li>Workload Cluster 2 contains:<br>&nbsp; Percona Operator<br>&nbsp; PostgreSQL standby cluster<br>&nbsp; FerretDB</li><li>Workload Cluster 3 contains:<br>&nbsp; HAProxy, the single entry point to FerretDB.<br>&nbsp; HAProxy connected to FerretDB in cluster 1 (linked to the primary PostgreSQL).<br>&nbsp; After recovery, HAProxy will be connected to FerretDB in cluster 2 (linked to the new primary PostgreSQL).<br></li></ol><br>The proxy is a single point of failure; it is intentionally set up this way to simplify the demonstration of database recovery.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-before-recovery-without-nova-c2e192c84a5f69f989ce308053e920e3_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">With the described setup in place, Nova can execute the following recovery steps if Cluster 1 fails:<ol><li>Set Percona cluster 2 as primary<br></li><li>Set Percona cluster 1 as standby (You cannot have two primary clusters simultaneously in one setup as it would disrupt the backup process. 
If Cluster 1 is initially marked as failed due to network issues and Cluster 2 takes over, Nova must ensure that, if Cluster 1 becomes available again, it does not reconnect as the primary.)<br></li><li>Connect HAProxy to FerretDB in cluster 2<br></li></ol>The sketch following this list illustrates the promotion in step 1.</div>
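<div class="paragraph">As a minimal sketch of step 1 (the cluster name is hypothetical, and the exact field names may vary across Percona operator versions -- check the CRD shipped with your release), promoting the standby amounts to disabling standby mode on its PerconaPGCluster resource:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster2               # hypothetical name of the standby cluster
spec:
  ...
  standby:
    enabled: false             # was "true" while cluster2 trailed the primary
    repoName: repo1            # backup repository the standby replayed from
</code></pre></div></div></div><div class="paragraph">Symmetrically, setting <em>standby.enabled: true</em> on cluster 1 (step 2) demotes it so it cannot come back as a second primary.<br></div><h2 class="wsite-content-title"><font size="6">Automating the setup and recovery execution</font><br></h2><div class="paragraph">To simplify deployment across multiple clusters, use Nova to deploy FerretDB and the Percona Operator, and to configure PostgreSQL and HAProxy. By setting up policies, Nova will direct workloads, along with their configurations, to the appropriate cluster. Detailed information about configuring policies in Nova is described in the <a href="https://docs.elotl.co/nova/intro" target="_blank">Nova Documentation</a>.<br></div><h2 class="wsite-content-title"><font size="5">Enhanced setup</font><br></h2><div class="paragraph">An additional Kubernetes cluster is required to host the Nova control plane, and Nova agents are incorporated into the existing Kubernetes clusters. This setup enables exclusive communication with the Nova control plane during the deployment and configuration of all components.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-before-recovery-2ee14993ee57f0fd7256de058ae60c7f_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="5">Nova Schedule Policy for FerretDB</font><br></h2><div class="paragraph">With Nova scheduling policies, you can deploy all workloads and Nova will distribute them among clusters as needed. 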
For example, the policy below spreads the FerretDB deployment across two clusters, overriding the PostgreSQL connection string so that each cluster's FerretDB points at its local PostgreSQL service.<br></div><div class="wcustomhtml"><pre><code class="language-yaml">apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: spread-ferretdb
spec:
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: Exists
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          app: ferretdb
  groupBy:
    labelKey: app
  clusterSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - cluster-1
          - cluster-2
  spreadConstraints:
    spreadMode: Duplicate
    topologyKey: kubernetes.io/metadata.name
    overrides:
      - topologyValue: cluster-1
        resources:
          - kind: Deployment
            apiVersion: apps/v1
            name: ferretdb
            namespace: default
            override:
              - fieldPath: spec.template.spec.containers[0].env[0].value
                value:
                  staticValue: postgres://cluster1-ha.psql-operator.svc:5432/zoo
      - topologyValue: cluster-2
        resources:
          - kind: Deployment
            apiVersion: apps/v1
            name: ferretdb
            namespace: default
            override:
              - fieldPath: spec.template.spec.containers[0].env[0].value
                value:
                  staticValue: postgres://cluster2-ha.psql-operator.svc:5432/zoo
---
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: psql-cluster-1-ferretdb
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: cluster-1
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          psql-cluster: cluster-1
---
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: psql-cluster-2-ferretdb
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: cluster-2
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          psql-cluster: cluster-2</code></pre></div>
style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">alertLabels</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token key atrule" style="color:#00a4db">app</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">app</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch  </span><span class="token comment" style="color:#999988;font-style:italic"># set perconapgclusters 1 to standby</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"pg.percona.com/v2beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"perconapgclusters"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cluster1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"spec.standby.enabled"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch  </span><span class="token comment" style="color:#999988;font-style:italic"># set perconapgclusters 2 to primary</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"pg.percona.com/v2beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"perconapgclusters"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cluster2"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"spec.standby.enabled"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> readField   </span><span class="token comment" style="color:#999988;font-style:italic"># read ferretdb service hostname in cluster 2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">readField</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"v1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"services"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"default"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"ferretdb-service-2"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> "status.loadBalancer.ingress</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain">.hostname"       outputKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Cluster2IP"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch </span><span class="token comment" style="color:#999988;font-style:italic"># update HAProxy to point to ferretdb service in cluster 2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"v1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"configmaps"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"haproxy-config"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"data"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token key atrule" style="color:#00a4db">"haproxy.cfg"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"defaults\n    mode tcp\n    timeout connect 5000ms\n    timeout client 50000ms\n    timeout server 50000ms\n\nfrontend fe_main\n    bind *:5432\n    default_backend be_db_2\n\nbackend be_db_2\n    server db2 {{ .Values.Cluster2IP }}:27017 check"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"></span></button></div></div></div></div></div><h2 class="wsite-content-title"><font size="5">Triggering the recovery plan execution</font><br></h2><div class="paragraph">Nova exposes a webhook endpoint that matches recovery plans with the alert's label. You can send an alert manually using a tool like curl. Alternatively, you can use an alert system, like AlertManager + Prometheus, which will automatically notify Nova when a certain metric goes beyond a set limit.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-recovery-77461a18b9d30b5c9a261ca6d0c98fed_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="6">Summary</font><br></h2><div class="paragraph">The above steps, process, and execution has resulted in a successful setup of FerretDB to autonomously recover from disasters, such as region-wide failures. 
<div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-recovery-77461a18b9d30b5c9a261ca6d0c98fed_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="6">Summary</font><br></h2><div class="paragraph">With the steps above in place, FerretDB is set up to recover autonomously from disasters such as region-wide failures. This configuration provides seamless healing after unexpected events, greatly improving the resilience of the FerretDB deployment.<br><br>To learn more about FerretDB, see the <a href="https://docs.ferretdb.io/understanding-ferretdb/" target="_blank">documentation</a>.<br><br>To learn more about Nova, see the <a href="https://docs.elotl.co/nova/intro/" target="_blank">Nova documentation and try it for free</a>.<br><br><strong>Author:</strong><br>Maciek Urbanski (Senior Platform Engineer, Elotl)<br><br><strong>Contributors:</strong><br>Selvi Kadirvel, Henry Precheur, Janek Baranowski, Pawel Bojanowski, Justin Willoughby, Madhuri Yechuri<br></div>]]></content:encoded></item><item><title><![CDATA[Cloud GPU Allocation Got You Down? Elotl Luna to the Rescue!]]></title><link><![CDATA[https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue]]></link><comments><![CDATA[https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue#comments]]></comments><pubDate>Thu, 08 Feb 2024 19:02:30 GMT</pubDate><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><guid isPermaLink="false">https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue</guid><description><![CDATA[ How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resourc [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:5px;*margin-top:10px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/floating-gpu.jpg?1707419095" style="margin-top: 0px; margin-bottom: 10px; margin-left: 20px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;"><em>How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?</em><br /><br />Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resource utilization and increased costs. Additionally, manual allocation necessitates constant monitoring, requiring administrators to stay aware of GPU availability in clusters spread across different zones or regions.
Once the GPU requirements are determined for a given workload, the administrator needs to manually add nodes when demand surges and remove them during periods of inactivity.<br /><br />There are many GPU types, each with different capabilities, running on different node types. The combination of these factors makes manual GPU node management increasingly convoluted. Different workloads may require specific GPU models, leading to complexities in node allocation. Manually ensuring the correct GPU nodes for diverse workloads becomes a cumbersome task, especially when dealing with multiple applications with varying GPU preferences. This adds another layer of operational overhead, demanding detailed knowledge of GPU types and their availability, along with continuous adjustments to meet workload demands.<br /><br />Luna, an intelligent node autoscaler, addresses these pain points by automating GPU node allocation based on workload demands. Luna is aware of GPU availability, so it can dynamically choose and allocate the needed GPU nodes, eliminating the need for manual intervention. This optimizes resource utilization and reduces wasted spend by scaling GPU resources in line with the workload. Moreover, Luna can allocate specific nodes as defined by the workload requirements, ensuring precise resource allocation tailored to the application's needs. This makes Luna well suited to the most complex compute jobs, such as AI and ML workloads.<br /><br />Furthermore, Luna's core functionality includes the automatic allocation of alternative GPU nodes when preferred GPUs are unavailable, bolstering its flexibility and resilience. This ensures that workloads with specific GPU preferences can seamlessly transition to available alternatives, maintaining uninterrupted operation. Controlled through annotations on the workload, users can specify cloud instance types to use or avoid, either by instance family or via regular expressions, along with desired GPU SKUs. This capability enables dynamic allocation based on GPU availability and workload demands, simplifying cluster management and promoting efficient scaling and resource utilization without constant manual adjustments; a sketch of such annotations appears below.</div>
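<div class="paragraph">To make the mechanism concrete, here is a hypothetical pod spec carrying Luna-style annotations. The annotation keys and values below are illustrative placeholders, not Luna's documented names; consult the Luna documentation for the exact annotation syntax.</div><div class="wcustomhtml"><pre><code class="language-yaml"># Illustrative only: the annotation keys below are placeholders, not Luna's
# documented names. They show the kind of hints the text describes.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    example.elotl.co/gpu-sku: "nvidia-tesla-t4"         # desired GPU SKU (placeholder key)
    example.elotl.co/instance-family: "g4dn"            # preferred instance family (placeholder key)
    example.elotl.co/instance-type-exclude: ".*metal.*" # regexp of types to avoid (placeholder key)
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest                 # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # standard K8s GPU resource request
</code></pre></div>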
<div class="paragraph" style="text-align:left;display:block;">Lastly, the advantages of Luna extend beyond resource optimization and workload adaptability within a single cloud. When organizations leverage various cloud providers, flexibility is paramount. An intelligent autoscaler that supports GPU management across multiple cloud providers gives users the freedom to choose the most suitable platform for their specific needs. With Luna, enterprises are not locked into a single cloud provider; they retain the agility to transition workloads seamlessly between cloud environments based on cost-effectiveness, performance, or specific features. Currently, Luna supports four cloud providers: Amazon AWS with EKS, Google Cloud with GKE, Microsoft Azure with AKS, and Oracle Cloud Infrastructure with OKE. By providing a unified, provider-agnostic approach to GPU resource management, Luna becomes a strategic asset, enabling organizations to harness the benefits of diverse cloud ecosystems without compromising efficiency or incurring vendor lock-in.<br /><br />In summary, manually managing GPU compute in Kubernetes clusters poses challenges related to wasted spend and the manual addition, monitoring, and removal of nodes. Luna addresses these pain points by:<ul><li>Streamlining GPU node allocation according to workload demands</li><li>Optimizing resource utilization by dynamically choosing and allocating nodes</li><li>Adapting seamlessly to fluctuations in GPU availability</li><li>Unifying operations across multiple clusters and cloud providers: Amazon EKS, Google GKE, Azure AKS, and Oracle OKE</li></ul><br />Luna simplifies cluster node management, reduces operational overhead, and ensures efficient GPU resource utilization.<br /><br />To delve deeper into Luna's powerful features and capabilities, explore the <a href="https://www.elotl.co/luna.html">Luna product page</a> for details. For step-by-step guidance, consult our <a href="https://docs.elotl.co" target="_blank">Documentation</a>. Ready to experience the seamless management of GPU workloads firsthand? <a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><strong>Contributors:</strong><br />Henry Precheur (Senior Staff Engineer, Elotl)<br />Anne Holler (Chief Scientist, Elotl)<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>]]></content:encoded></item><item><title><![CDATA[Luna 1.0.0 is out]]></title><link><![CDATA[https://www.elotl.co/blog/luna-100-is-out]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-100-is-out#comments]]></comments><pubDate>Tue, 06 Feb 2024 17:20:19 GMT</pubDate><category><![CDATA[Luna]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-100-is-out</guid><description><![CDATA[ The Elotl team is thrilled to announce a major milestone in our journey &mdash; the release of Luna version 1.0.0. Luna is an Intelligent Kubernetes Cluster Autoscaler that optimizes cost, simplifies operations, and supports four public Cloud Providers: Amazon EKS, Google GKE, Microsoft AKS, and Oracle OCI.While some might associate version 1.0.0 with potential hiccups, rest assured, this release is a testament to our commitment to excellence and stability. We&rsquo;ve diligently worked to [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/editor/luna-logo-for-web.png?1707241364" style="margin-top: 10px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="display:block;">The Elotl team is thrilled to announce a major milestone in our journey &mdash; the release of Luna version 1.0.0. Luna is an Intelligent Kubernetes Cluster Autoscaler that optimizes cost, simplifies operations, and supports four public Cloud Providers: Amazon EKS, Google GKE, Microsoft AKS, and Oracle OCI.<br />While some might associate version 1.0.0 with potential hiccups, rest assured, this release is a testament to our commitment to excellence and stability.
We&rsquo;ve diligently worked to ensure that this version not only meets but exceeds expectations.<br /></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Why Luna Version 1.0.0 is a Milestone:</font><br></h2>  <div class="paragraph"><ul><li>Widened Horizon: Luna has been rigorously tested and optimized, making it suitable for a broad range of applications.</li><li>Trusted in Production: Version 1.0.0 builds upon the rock-solid foundation of its predecessor, version 0.7.4, which has been running successfully in diverse production clusters.<br></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Give it a try</font><br></h2>  <div class="paragraph">To learn more about Luna, check out the <u><a href="https://www.elotl.co/luna.html">Luna product page</a></u>; you can also <a href="https://www.elotl.co/luna-free-trial.html"><u>download</u></a> the trial version of Luna or read the <a href="https://docs.elotl.co/luna/intro/"><u>documentation</u></a>.<br />We dedicated extensive effort to making Luna a robust cluster autoscaler, ensuring that every dollar brings optimal value. Luna is designed to enhance the efficiency of your Kubernetes workloads and streamline scaling operations across multiple cloud environments. We encourage you to explore Luna, especially for clusters handling substantial, dynamic, or bursty workloads.<br></div>]]></content:encoded></item></channel></rss>