In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to go deeper. The real game-changer? Cluster autoscalers. These tools dynamically adjust the size of your cluster so that you meet workload demands without over-provisioning resources. While many autoscalers focus solely on bin-packing, Luna goes a step further with its bin-selection feature, covering both workload management and cost efficiency.
In this blog, we will explore both bin-packing and bin-selection, two essential strategies for Kubernetes autoscaling. By leveraging Luna, you can maximize efficiency, minimize waste, and keep costs under control, all while handling the complexities of varying workload sizes and resource requirements. Let’s dive in!
What is Bin-Packing in Kubernetes?
Bin-packing is the default approach for optimizing pod placement in Kubernetes. The concept is simple: pack as many items (pods) into as few bins (nodes) as possible, maximizing resource utilization and minimizing the number of nodes required.
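To make the bin-packing idea concrete, here is a minimal sketch using the classic first-fit decreasing heuristic. It is an illustration only, not Luna's or the Kubernetes scheduler's actual algorithm; the `Node` class, the `pack` function, and the pod sizes (CPU in millicores, memory in MiB) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A bin: a node with fixed CPU (millicores) and memory (MiB) capacity."""
    cpu: int
    mem: int
    pods: list = field(default_factory=list)

    def fits(self, pod):
        # A pod is a (name, cpu, mem) tuple; it fits if both resources have room.
        used_cpu = sum(p[1] for p in self.pods)
        used_mem = sum(p[2] for p in self.pods)
        return used_cpu + pod[1] <= self.cpu and used_mem + pod[2] <= self.mem


def pack(pods, node_cpu, node_mem):
    """First-fit decreasing: place the largest pods first and open a new
    node only when no existing node has room."""
    nodes = []
    for pod in sorted(pods, key=lambda p: (p[1], p[2]), reverse=True):
        if pod[1] > node_cpu or pod[2] > node_mem:
            raise ValueError(f"pod {pod[0]} does not fit on an empty node")
        target = next((n for n in nodes if n.fits(pod)), None)
        if target is None:
            target = Node(node_cpu, node_mem)
            nodes.append(target)
        target.pods.append(pod)
    return nodes


# Hypothetical workload: five pods packed onto 4-CPU / 8 GiB nodes.
pods = [("a", 1500, 2048), ("b", 500, 1024), ("c", 2000, 4096),
        ("d", 1000, 1024), ("e", 3000, 2048)]
nodes = pack(pods, node_cpu=4000, node_mem=8192)
for i, node in enumerate(nodes):
    print(i, [p[0] for p in node.pods])
```

In this example the five pods request 8000 millicores in total and end up on two fully CPU-utilized nodes. Sorting by size first is a design choice of the heuristic: large pods are the hardest to place, so handling them early tends to leave fewer half-empty nodes.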
When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput. In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the first and 40% in the second. We describe how the Luna smart cluster autoscaler with HNM addresses hot node performance issues by triggering the allocation and use of additional cluster resources.
INTRODUCTION
A pod's CPU and memory resource requests express its minimum resource allocations. The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when the settings cannot be respected. Cloud cluster autoscalers look at these values on pending pods to determine the amount of resources to add to a cluster.
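The role of requests as placement constraints can be sketched in a few lines. This is a simplified model, not the real scheduler or any autoscaler's code: the `place` function, the node free-capacity figures, and the pod names are hypothetical, and real scheduling considers many more factors (affinity, taints, and so on).

```python
def place(pods, nodes_free):
    """Assign each pod to the first node whose free CPU (millicores) and
    memory (MiB) cover the pod's requests; pods that fit nowhere stay pending."""
    free = [list(n) for n in nodes_free]  # copy so the caller's data is untouched
    placed, pending = {}, []
    for name, (cpu, mem) in pods.items():
        for i, (free_cpu, free_mem) in enumerate(free):
            if cpu <= free_cpu and mem <= free_mem:
                free[i][0] -= cpu
                free[i][1] -= mem
                placed[name] = i
                break
        else:
            # No node satisfied the requests: the pod is left pending.
            pending.append(name)
    return placed, pending


# Hypothetical cluster state: free (cpu, mem) per node, and pod requests.
nodes_free = [(500, 1024), (250, 4096)]
pods = {"web": (400, 512), "cache": (200, 2048), "batch": (1000, 2048)}

placed, pending = place(pods, nodes_free)

# An autoscaler would size new capacity from the pending pods' requests.
needed_cpu = sum(pods[name][0] for name in pending)
needed_mem = sum(pods[name][1] for name in pending)
print(placed, pending, needed_cpu, needed_mem)
```

Here `batch` stays pending because no node has 1000 millicores free, and its requests are what an autoscaler would use to decide how much capacity to add.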
A pod configured with both CPU and memory requests, and with limits equal to those requests, is in the Guaranteed QoS class. A K8s cluster hosting any non-guaranteed pods runs the risk that some nodes could become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can have performance problems: a bursting pod’s attempts to use CPU above its CPU resource request can be throttled, and its attempts to use memory above its memory resource request can cause the pod to be killed. The K8s scheduler can worsen the situation by continuing to schedule pods onto hot nodes.
Introduction
Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity. Nova is autoscaler-aware, detecting if workload clusters are running either the K8s cluster autoscaler or the Elotl Luna intelligent cluster autoscaler.
In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up you can:
Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler
6/25/2024
Introduction
Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default. With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use. For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.
The savings are larger when the cluster autoscaler (CA) is aware of the time-slicing configuration. A time-slice-aware “smart” CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.
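The over-allocation point can be illustrated with simple arithmetic. The sketch below assumes every pending pod requests exactly one GPU and that time-slicing advertises each physical GPU as a fixed number of schedulable replicas; the function name and the figures are hypothetical and do not represent Luna's implementation.

```python
import math


def gpu_nodes_needed(pending_pods, gpus_per_node, replicas=1):
    """Number of nodes needed when each physical GPU is advertised as
    `replicas` schedulable GPUs (replicas=1 means no time-slicing)."""
    schedulable_per_node = gpus_per_node * replicas
    return math.ceil(pending_pods / schedulable_per_node)


# Hypothetical scenario: 8 pending one-GPU pods, single-GPU instances,
# time-slicing configured with 4 replicas per GPU.
unaware = gpu_nodes_needed(8, gpus_per_node=1)            # CA ignores time-slicing
aware = gpu_nodes_needed(8, gpus_per_node=1, replicas=4)  # CA accounts for it
print(unaware, aware)
```

An autoscaler that ignores the replica count would request eight instances up front, while a time-slice-aware one would request two, which is also less likely to run into GPU quota or capacity limits.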
At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna gets deployed on customers' clusters and helps scale up and down compute resources to optimize cost.
Luna operates in environments where direct access isn’t always available. To make diagnosis and performance monitoring possible in those environments, we have introduced the option for customers to securely send their Luna logs and metrics to our advanced log storage appliance. This lets us provide more effective support to our customers.
OpenTelemetry is fast becoming the standard for collecting metrics and logs in Kubernetes environments. We opted to run the OpenTelemetry collector as a sidecar for the Luna cluster autoscaler. Because it gathers and sends the logs from a single pod, running it as a sidecar was a natural fit.
The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile device domain. This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and the scenarios where x86-64 architectures may perform better and thus remain the better fit. Lastly, we'll describe a seamless way to add ARM nodes to your Kubernetes clusters.
In this blog, for the sake of clarity and brevity, I will use the term 'ARM' to refer to ARM64 (ARM 64-bit) processors, while 'x86' and 'x86-64' will be used interchangeably to denote Intel or AMD 64-bit processors.
What Kubernetes Workloads Tend To Be Ideal for ARM Processors?
Inference-heavy tasks: While the computations involved in Deep Learning training typically require GPUs for acceptable performance, DL inference is less computationally intense. Tasks that apply pre-trained models for DL regression or classification can benefit from ARM's power/performance relative to GPU or x86-64 systems. We presented data on running inference on ARM64 in our SCaLE 20x talk.
The Benefits of Cycling Kubernetes Nodes: Optimizing Performance, Reliability, and Security
4/9/2024
Wondering whether cycling out older Kubernetes nodes periodically is a good idea? In the world of Kubernetes administration, the practice of rotating nodes often takes a backseat, even though it holds considerable advantages. While it's true that node cycling isn't universally applicable, it's worth exploring its merits for your environment. In this article, I will delve into many of the compelling reasons why considering node rotation might be beneficial for your clusters. We'll explore the advantages of node rotation in Kubernetes and how it contributes to resource optimization, fault tolerance, security, and performance improvements.
Why might someone think cycling of Kubernetes nodes is unnecessary?
One reason for this could be a misconception about the stability of Kubernetes clusters. In environments where nodes rarely fail or resource usage remains relatively consistent, there might be a tendency to prioritize other tasks over node cycling. Additionally, the perceived complexity of implementing node rotation strategies, particularly in large-scale or production environments, could dissuade teams from actively considering it. Some teams might also be unaware of the potential performance gains and reliability improvements that can result from regular node cycling.
However, despite these challenges or misconceptions, it's crucial to recognize that neglecting node rotation can lead to issues such as resource exhaustion, reduced fault tolerance, security vulnerabilities, difficulties upgrading to newer versions, and degraded performance over time. By acknowledging the importance of node cycling and implementing proactive strategies, administrators and DevOps teams can ensure the long-term health, resilience, and efficiency of their Kubernetes infrastructure. So, without delay, let's delve into the specifics.
In this brief summary blog, we look at GPU cost savings in the cloud through the use of Luna, an Intelligent Autoscaler. If you're passionate about harnessing the power of Deep Learning (DL) while optimizing expenses, this summary is for you. Join us as we explore how innovative technologies are changing resource management for Deep Learning, with both technical insights and a practical solution.
Deep Learning has transformed, and continues to transform, many industries, including AI, Healthcare, Finance, Retail, and E-commerce. Some of the challenges with DL include its high cost and operational overhead:
Open-source platforms like Ray and Ludwig have broadened DL accessibility, yet DL models' intensive GPU resource demands present financial hurdles. Elotl Luna addresses this by streamlining compute for Kubernetes clusters, removing the need for manual scaling, which often results in wasted spend.
Originally published on blog.ferretdb.io
Running a database without a disaster recovery process can break business continuity, costing a modern business both revenue and reputation.
Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring continuous operation of the databases backing your production apps is critical to avoid losing your customers' trust. Successful disaster recovery requires:
This blog post shows how to automate these four aspects of disaster recovery using FerretDB, Percona PostgreSQL and Nova. Nova automates parts of the recovery process, reducing mistakes and getting your data back online faster.
How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?
Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resource utilization and increased costs. Additionally, manual allocation necessitates constant monitoring, requiring administrators to be aware of GPU availability in clusters spread across different zones or regions. Once the GPU requirements are determined for a given workload, the administrator needs to manually add nodes when demand surges and remove them during periods of inactivity.
There are many GPU types, each with different capabilities, running on different node types. The combination of these three factors makes manual GPU node management increasingly convoluted. Different workloads may require specific GPU models, leading to complexities in node allocation. Manually ensuring the correct GPU nodes for diverse workloads becomes a cumbersome task, especially when dealing with multiple applications with varying GPU preferences. This adds another layer of operational overhead, demanding detailed knowledge of GPU types and their availability, and continuous adjustments to meet workload demands.
Luna, an intelligent node autoscaler, addresses these pain points by automating GPU node allocation based on workload demands.
Luna is aware of GPU availability, so it can dynamically choose and allocate the needed GPU nodes, eliminating the need for manual intervention. This optimizes resource utilization and reduces wasted spend by scaling GPU resources in line with the workload. Moreover, Luna can allocate specific nodes as defined by the workload requirements, ensuring precise resource allocation tailored to the application's needs. This makes Luna well suited for the most complex compute jobs, such as AI and ML workloads.
Furthermore, Luna's core functionality includes the automatic allocation of alternative GPU nodes in cases where preferred GPUs are unavailable, bolstering its flexibility and resilience. This ensures that workloads with specific GPU preferences can seamlessly transition to available alternatives, maintaining uninterrupted operation. Through annotations within the workload, users can specify cloud instance types to use or avoid, either by instance family or via regular expressions, along with desired GPU SKUs. This capability enables dynamic allocation based on GPU availability and workload demands, simplifying cluster management and promoting efficient scaling and resource utilization without the need for constant manual adjustments.
Lastly, the advantages of Luna extend beyond resource optimization and workload adaptability in a single cloud. When organizations leverage various cloud providers, flexibility is paramount. An intelligent autoscaler designed to support GPU management within multiple cloud providers empowers users with the freedom to choose the most suitable cloud platform for their specific needs. With Luna, enterprises are not locked into a single cloud provider, giving them the agility to transition workloads seamlessly between different cloud environments based on cost-effectiveness, performance, or specific features.
Currently, Luna supports four cloud providers: Amazon AWS with EKS, Google Cloud with GKE, Microsoft Azure with AKS, and Oracle Cloud Infrastructure with OKE. By providing a unified and agnostic approach to GPU resource management, Luna becomes a strategic asset, enabling organizations to harness the benefits of diverse cloud ecosystems without compromising efficiency or incurring cloud vendor lock-in.
In summary, manually managing GPU compute in Kubernetes clusters poses challenges related to wasted spend, manual addition, monitoring, and removal of nodes. Luna addresses these pain points by:
Luna simplifies cluster node management, reduces operational overhead, and ensures efficient GPU resource utilization. To delve deeper into Luna's powerful features and capabilities, explore the Luna product page for details. For step-by-step guidance, consult our Documentation. Ready to experience the seamless management of GPU workloads firsthand? Try Luna today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.
Author: Justin Willoughby (Principal Solutions Architect, Elotl)
Contributors: Henry Precheur (Senior Staff Engineer, Elotl), Anne Holler (Chief Scientist, Elotl)