Blog Archives

Deep Learning Training with Ray and Ludwig using Elotl Luna

2/22/2024

In this brief summary, we look at GPU cost savings in the cloud using Luna, an intelligent autoscaler. If you want to harness the power of Deep Learning (DL) while keeping expenses under control, this summary is for you. We explore how Luna changes resource management for Deep Learning workloads, offering both technical insights and a practical solution.

Deep Learning has transformed, and continues to transform, many industries such as AI, Healthcare, Finance, Retail, and E-commerce. Some of the challenges with DL include its high cost and operational overhead:

Compute Costs: Deep learning models require significant computational resources, which leads to high costs, especially for complex or large-scale projects. The cost is higher still when compute stays provisioned while it is not needed.
Instance Management: Managing cloud instances for training, inference, and experimentation creates operational overhead. This includes provisioning and configuring virtual machines, monitoring resource usage, and optimizing instance types for performance and cost efficiency.
Infrastructure Scaling: Scaling deep learning workloads in the cloud involves dynamically adjusting compute resources to meet demand. This requires optimizing resource allocation to minimize costs while ensuring sufficient capacity.

Open-source platforms like Ray and Ludwig have broadened DL accessibility, yet the intensive GPU resource demands of DL models present financial hurdles. Elotl Luna addresses this by streamlining compute for Kubernetes clusters, removing the need for manual scaling, which often results in wasted spend.

A Guide to Disaster Recovery for FerretDB with Elotl Nova on Kubernetes

2/12/2024

Originally published on blog.ferretdb.io

Running a database without a disaster recovery process can break business continuity, costing a modern business both revenue and reputation.

Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring continuous operation of the databases backing your production apps is critical to avoid losing your customers' trust.

Successful disaster recovery requires:

Reliable components to automate backup and recovery
A watertight way to identify problems
A list of steps to revive the database
Regular testing of the recovery process

This blog post shows how to automate these four aspects of disaster recovery using FerretDB, Percona PostgreSQL and Nova. Nova automates parts of the recovery process, reducing mistakes and getting your data back online faster.

Cloud GPU Allocation Got You Down? Elotl Luna to the Rescue!

2/8/2024

How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?

Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resource utilization and increased costs. Additionally, manual allocation necessitates constant monitoring, requiring administrators to be aware of GPU availability in clusters spread across different zones or regions. Once the GPU requirements are determined for a given workload, the administrator needs to manually add nodes when demand surges and remove them during periods of inactivity.
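To make the wasted-spend point concrete, here is a minimal sketch of the arithmetic behind idle GPU cost. The function name, the node count, the hours, and the hourly price are all illustrative assumptions, not figures from Luna or from any cloud provider's price list.

```python
def idle_gpu_cost(node_count, provisioned_hours, busy_hours, hourly_price):
    """Return (idle node-hours, money spent on idle time) for a fixed GPU pool.

    All inputs are hypothetical: each of `node_count` nodes is assumed to be
    provisioned for `provisioned_hours` and actually busy for `busy_hours`.
    """
    if busy_hours > provisioned_hours:
        raise ValueError("busy_hours cannot exceed provisioned_hours")
    idle_node_hours = node_count * (provisioned_hours - busy_hours)
    return idle_node_hours, idle_node_hours * hourly_price


# Illustrative numbers only: 4 GPU nodes kept up for a 720-hour month,
# each busy for 180 hours, at an assumed price of 3.0 per node-hour.
idle_hours, wasted = idle_gpu_cost(4, 720, 180, 3.0)
print(f"idle node-hours: {idle_hours}, spend on idle time: {wasted}")
```

With a statically sized pool, the idle term grows with every hour the nodes stay up without work, which is the cost an autoscaler aims to shrink by removing nodes when demand drops.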

There are many GPU types, each with different capabilities, running on different node types. The combination of these three factors makes manual GPU node management increasingly convoluted. Different workloads may require specific GPU models, leading to complexities in node allocation. Manually ensuring the correct GPU nodes for diverse workloads becomes a cumbersome task, especially when dealing with multiple applications with varying GPU preferences. This adds another layer of operational overhead, demanding detailed knowledge of GPU types and their availability, plus continuous adjustments to meet workload demands.

Luna, an intelligent node autoscaler, addresses these pain points by automating GPU node allocation based on workload demands. Because Luna is aware of GPU availability, it can dynamically choose and allocate the GPU nodes that are needed, eliminating the need for manual intervention. This optimizes resource utilization and reduces wasted spend by scaling GPU resources in line with the workload. Moreover, Luna can allocate specific nodes as defined by the workload requirements, ensuring precise resource allocation tailored to the application's needs. This makes Luna well suited for the most complex compute jobs, such as AI and ML workloads.

Furthermore, Luna's core functionality includes the automatic allocation of alternative GPU nodes in cases where preferred GPUs are unavailable, bolstering its flexibility and resilience. This ensures that workloads with specific GPU preferences can seamlessly transition to available alternatives, maintaining uninterrupted operation. This behavior is controlled through annotations within the workload: users can specify cloud instance types to use or avoid, either by instance family or via regular expressions, along with desired GPU SKUs. This capability enables dynamic allocation based on GPU availability and workload demands, simplifying cluster management and promoting efficient scaling and resource utilization without the need for constant manual adjustments.
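The fallback behavior described above can be pictured with a small sketch. This is not Luna's implementation or its annotation syntax; `InstanceType`, `pick_instance`, the pattern parameters, and the availability flags in `CATALOG` are hypothetical names and values chosen for illustration. The idea shown is: walk an ordered list of preferred GPU SKUs and return the first available instance type that passes the include and exclude regular expressions.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str        # cloud instance type name
    gpu_sku: str     # GPU model attached to the instance
    available: bool  # assumed snapshot of capacity, illustrative only


def pick_instance(
    catalog: Sequence[InstanceType],
    preferred_gpus: Sequence[str],
    include_pattern: Optional[str] = None,
    exclude_pattern: Optional[str] = None,
) -> Optional[InstanceType]:
    """Return the first available instance matching the GPU preference order."""
    include = re.compile(include_pattern) if include_pattern else None
    exclude = re.compile(exclude_pattern) if exclude_pattern else None

    def allowed(instance: InstanceType) -> bool:
        # An instance must match the include pattern (if given)
        # and must not match the exclude pattern (if given).
        if include and not include.search(instance.name):
            return False
        if exclude and exclude.search(instance.name):
            return False
        return True

    # Earlier entries in preferred_gpus win; later ones act as fallbacks.
    for sku in preferred_gpus:
        for instance in catalog:
            if instance.gpu_sku == sku and instance.available and allowed(instance):
                return instance
    return None


# Hypothetical availability snapshot: the V100 instance type is out of capacity.
CATALOG = [
    InstanceType("p3.2xlarge", "V100", available=False),
    InstanceType("g5.xlarge", "A10G", available=True),
    InstanceType("g4dn.xlarge", "T4", available=True),
]

# Prefer V100, fall back to A10G and then T4, and avoid the g4dn family.
choice = pick_instance(CATALOG, ["V100", "A10G", "T4"], exclude_pattern=r"^g4dn\.")
print(choice.name if choice else "no matching instance type")
```

Keeping the preference list ordered makes the fallback explicit: the workload states what it would like first and what it will accept, and the selection logic never has to guess.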

Lastly, the advantages of Luna extend beyond resource optimization and workload adaptability in a single cloud. When organizations use several cloud providers, flexibility is paramount. An intelligent autoscaler designed to support GPU management across multiple cloud providers gives users the freedom to choose the most suitable cloud platform for their specific needs. With Luna, enterprises are not locked into a single cloud provider, and they gain the agility to move workloads between cloud environments based on cost-effectiveness, performance, or specific features. Currently Luna supports four cloud providers: Amazon AWS with EKS, Google Cloud with GKE, Microsoft Azure with AKS, and Oracle Cloud Infrastructure with OKE. By providing a unified and agnostic approach to GPU resource management, Luna becomes a strategic asset, enabling organizations to harness the benefits of diverse cloud ecosystems without compromising efficiency or incurring cloud vendor lock-in.

In summary, manually managing GPU compute in Kubernetes clusters poses challenges related to wasted spend and to the manual addition, monitoring, and removal of nodes. Luna addresses these pain points by:

Streamlining GPU node allocation according to workload demands
Optimizing resource utilization by dynamically choosing and allocating nodes
Adapting to fluctuations in GPU availability seamlessly
Unifying operations over multiple clusters and cloud providers: Amazon EKS, Google GKE, Azure AKS, and Oracle OKE

Luna simplifies cluster node management, reduces operational overhead, and ensures efficient GPU resource utilization.

To delve deeper into Luna's powerful features and capabilities, explore the Luna product page for details. For step-by-step guidance, consult our Documentation. Ready to experience the seamless management of GPU workloads firsthand? Try Luna today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.

Author:
Justin Willoughby (Principal Solutions Architect, Elotl)

Contributors:
Henry Precheur (Senior Staff Engineer, Elotl)
Anne Holler (Chief Scientist, Elotl)

Luna 1.0.0 is out

2/6/2024

The Elotl team is thrilled to announce a major milestone in our journey: the release of Luna version 1.0.0. Luna is an Intelligent Kubernetes Cluster Autoscaler that optimizes cost, simplifies operations, and supports four public Cloud Providers: Amazon EKS, Google GKE, Microsoft AKS, and Oracle OKE.
While some might associate version 1.0.0 with potential hiccups, rest assured, this release is a testament to our commitment to excellence and stability. We’ve diligently worked to ensure that this version not only meets but exceeds expectations.

Why Luna Version 1.0.0 is a Milestone:

Widened Horizon: Luna has been rigorously tested and optimized, making it suitable for a broad range of applications.
Trusted in Production: Version 1.0.0 builds upon the rock-solid foundation of its predecessor, version 0.7.4, which has been successfully running in diverse production clusters.

Give it a try

To learn more about Luna, check out the Luna product page. You can also download the trial version of Luna or read the documentation.
We dedicated extensive effort to building Luna into a robust cluster autoscaler, ensuring that every dollar brings optimal value. Luna is designed to enhance the efficiency of your Kubernetes workloads and streamline the scaling operations across multiple cloud environments. We encourage you to explore Luna, especially for clusters handling substantial, dynamic, or bursty workloads.
