Introduction
Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters: it presents a single K8s endpoint for workload submission and chooses a target cluster for each workload based on placement policies and the candidate clusters' available capacity. Nova is autoscaler-aware, detecting whether workload clusters are running either the K8s Cluster Autoscaler or the Elotl Luna intelligent cluster autoscaler.
In this blog, we examine how Nova policies, combined with its autoscaler-awareness, can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up, you can match each workload to the right cluster at the right cost, as the scenarios below illustrate.
For clusters running in the cloud with a cluster autoscaler, the available cluster capacity is dynamic. Nova can schedule a workload on a cluster with dynamic capacity that satisfies the workload's placement policy, even if that target cluster does not currently have sufficient resources for the workload, since the autoscaler can provision the needed resources. When multiple clusters satisfy the workload's placement policy, Nova preferentially selects a target cluster with existing available cluster resources and otherwise selects an alternative target cluster running a cluster autoscaler.
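As a rough sketch of what such a policy could look like, the example below expresses available-capacity placement over all workload clusters as a Nova SchedulePolicy object. The API group, version, and field names (notably placementMethod and the workload-class selector label) are illustrative assumptions, not the verified Nova schema; consult the Nova documentation for the exact format.

```yaml
# Illustrative sketch only: the policy.elotl.co API group/version, the
# placementMethod field, and the workload-class label are assumptions,
# not the verified Nova SchedulePolicy schema.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: gpu-training-available-capacity
spec:
  # Select the workloads this policy applies to (hypothetical label).
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          workload-class: gpu-training
  # Any workload cluster is a candidate; Nova prefers one with free
  # capacity and otherwise falls back to a cluster running an autoscaler.
  clusterSelector: {}
  placementMethod: availableCapacity
```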
Nova workloads placed using an available-capacity policy are gang-scheduled. This means that no single job within a workload will start running until all jobs in that workload can be executed simultaneously. Gang scheduling is crucial for ML/AI training jobs, as it ensures all components of a distributed training task begin processing in sync, maximizing efficiency and preventing data inconsistencies. Additionally, Nova automatically adds Luna's default pod placement label to the workloads it schedules, which allows the workloads to be handled seamlessly on either Luna or non-Luna clusters.
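For context, the snippet below sketches what a scheduled pod might look like once Nova has added Luna's default pod placement label. The label key elotl-luna and the pod details are assumptions used only for illustration; clusters without Luna simply ignore the label.

```yaml
# Sketch of a pod after Nova has scheduled it: the elotl-luna label key is
# assumed to be Luna's default pod placement label; Nova adds it
# automatically, and clusters without Luna simply ignore it.
apiVersion: v1
kind: Pod
metadata:
  name: example-training-pod
  labels:
    elotl-luna: "true"
spec:
  containers:
    - name: trainer
      image: example.com/trainer:latest   # hypothetical image
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```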
Applying Nova+Luna to Some Common ML/AI GPU Resource Management Scenarios
We consider the following common GPU resource management scenarios: training production ML/AI models on GPUs ("fill and spill"), training experimental ML/AI models on GPUs ("fill, no spill"), and serving production vs test/dev ML/AI models on GPUs ("select the right cluster").
Scenario: Training Production ML/AI Models on GPUs
Overview
For the scenario of training production ML/AI models on GPUs, the desired behavior is "fill and spill". The workloads should be gang-scheduled on a statically-allocated cluster if they fit or on a dynamically-allocated cluster if they don't. The workloads' high value warrants the cost of on-demand cloud resources, if needed, and the latency to obtain those resources dynamically is not an issue for the training job use case.
For the Nova example setup, we configure cluster static-cluster with a set of statically-allocated GPU instances and cluster dynamic-cluster with Luna configured to allocate similar cloud GPU instances. Both clusters satisfy the Nova available-capacity placement policy. Nova places training workloads on static-cluster first, since the resources are immediately available. When a training workload arrives that does not fit on static-cluster, Nova places it on dynamic-cluster, and Luna adds resources to accommodate the pending workload.
Example Setup
The scripts and K8s yaml input used in the example are available at elotl/skyray on GitHub. The commands that follow expect a clone of that repo at the path given by the SKYRAY_PATH environment variable.
The example is run on EKS cloud K8s clusters. The Nova control plane, installed on an EKS cluster comprising 2 CPU nodes, manages the static-cluster and dynamic-cluster workload EKS clusters. The Luna cluster autoscaler is installed on dynamic-cluster to scale the cluster to match workload resource requests. Luna is configured to allocate large EBS volumes, to handle the large instance types and storage needs of the example. Also, Luna bin-packing is disabled, since the example does not contain sets of small pods that benefit from scheduling on the same node.
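The Luna settings mentioned above are typically supplied as Helm values at install time. The sketch below shows the intent of this example's configuration; the value names are placeholders rather than Luna's documented chart keys.

```yaml
# Illustrative Luna Helm values for this example; the key names below are
# placeholders, not the documented Luna chart schema.
# Allocate larger EBS root volumes to accommodate large images and instance types:
nodeVolumeSizeGiB: 200
# Disable bin-packing, since the example has no sets of small pods that
# benefit from sharing a node:
binPacking:
  enabled: false
```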
KubeRay and its CRDs are deployed to the Nova control plane, along with a spread-duplicate policy for their placement. Nova places a copy of KubeRay and its CRDs on each workload cluster, meaning KubeRay is available on each cluster to handle any RayJobs, RayClusters, and RayServices placed by Nova on that cluster.
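A hedged sketch of what the spread-duplicate policy might look like follows; as with the earlier policy sketch, the field names and selector label are assumptions for illustration only.

```yaml
# Illustrative sketch: place a copy of the selected objects (the KubeRay
# operator and its CRDs) on every workload cluster. Field names and the
# selector label are assumptions.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: kuberay-spread-duplicate
spec:
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          app.kubernetes.io/name: kuberay   # hypothetical selector label
  clusterSelector: {}                       # all workload clusters
  spreadConstraints:
    spreadMode: Duplicate                   # one copy per matching cluster
```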
After the KubeRay spread-duplicate placement, the Nova control plane reports 2 copies of the kuberay-operator, one on each workload cluster.
Luna also starts an additional node in dynamic-cluster to host KubeRay. The KubeRay operator has modest resource requests (100m CPU, 512Mi memory) that can be handled by the inexpensive t3a.small instance type (2 CPUs, 2Gi memory).
Example Runs
As a proxy for a production training workload, we use the PyTorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, adapted from the example here. The RayJob's RayCluster is configured with a CPU head and 2 single-GPU workers. The configuration of the RayJob with its associated RayCluster is available here.
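The overall shape of the RayJob is sketched below in abbreviated form; the container image, entrypoint, and resource sizes are placeholders, so see the skyray repo for the exact configuration used in the example.

```yaml
# Abbreviated RayJob sketch: a CPU-only Ray head plus 2 single-GPU workers.
# Image, entrypoint, and sizes are placeholders; see the skyray repo for the
# exact configuration used in the example.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-image-train
spec:
  entrypoint: python pytorch_training_e2e.py   # placeholder entrypoint
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        num-gpus: "0"                 # head schedules work but holds no GPU
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:latest   # placeholder image tag
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2                   # gang-scheduled as a unit by Nova
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:latest   # placeholder image tag
                resources:
                  requests:
                    cpu: "4"
                    memory: 16Gi
                    nvidia.com/gpu: 1
                  limits:
                    nvidia.com/gpu: 1
```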
A first copy of the RayJob is deployed to the Nova control plane in the rayjob1 namespace. Its placement uses a Nova available-capacity policy. Nova has native support for the RayCluster, RayJob, and RayService CRDs, and recognizes the resource requests in the podSpecs they contain. Hence, Nova is able to determine the computing resources needed for the pods comprising the RayJob. It chooses to place the RayJob and its RayCluster on static-cluster, since it has sufficient available capacity.
Another copy of the RayJob is deployed to the Nova control plane in the rayjob2 namespace. Its placement again uses an available-capacity policy, and Nova again chooses to place the RayJob and its RayCluster on static-cluster, since it has sufficient available capacity for a second copy of the training job.
A third copy of the RayJob is deployed to the Nova control plane in the rayjob3 namespace. Its placement again uses an available-capacity policy. This time Nova places the RayJob and its RayCluster on dynamic-cluster. Nova sees that static-cluster has insufficient remaining capacity for a third copy of the job and detects the Luna cluster autoscaler running on dynamic-cluster, which can obtain the needed resources.
All 3 copies of the RayJob can be seen from the Nova control plane, Luna scales up dynamic-cluster accordingly, and all 3 jobs eventually run to completion.
Example Summary
This example demonstrated how Nova, working with Luna, makes handling gang-scheduling and "fill and spill" for a multi-worker ML/AI KubeRay/RayJob training job easy via a simple available-capacity policy-based approach. Nova and Luna can reduce the latency of your ML/AI workloads by scheduling on available compute resources in a matter of seconds.
Scenario: Training Experimental ML/AI Models on GPUs
Overview
For the scenario of training experimental ML/AI models on GPUs, the desired behavior is "fill, no spill". The workloads should be scheduled on a statically-allocated on-premise or reserved cluster of sunk-cost GPU instances that is set up for speculative training jobs. These training workloads have not yet proven to be high-value enough to warrant paying for any on-demand cloud resources.
For the Nova example setup, we configure cluster static-cluster with a set of statically-allocated GPU instances, which are intended to represent sunk-cost resources. The Nova cluster-specific placement policy is set to match only that cluster. Nova places all experimental training workloads on that cluster; any that cannot be run remain pending in the cluster.
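A hedged sketch of such a cluster-specific policy is shown below; the Nova field names and the cluster-name label are assumptions for illustration only.

```yaml
# Illustrative sketch: pin experimental training workloads to static-cluster
# only, with no spill to other clusters. Field names and labels are assumed.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: experimental-training-static-only
spec:
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          workload-class: experimental-training   # hypothetical label
  clusterSelector:
    matchLabels:
      nova.elotl.co/cluster-name: static-cluster  # hypothetical cluster label
```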
Example Setup
The initial setup for this example is the same as that used for the previous example.
Example Runs
Again, we use the PyTorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, this time as a proxy for an experimental training job. The RayJob's RayCluster is again configured with a CPU head and 2 single-GPU workers; its configuration is available here.
In this case, a first copy of the RayJob is deployed, in the rayjob1 namespace, to the Nova control plane. Its placement uses a specified-cluster policy, with the specified cluster set to static-cluster.
A second copy of the RayJob is deployed, in the rayjob2 namespace, to the Nova control plane. Its placement uses the same specified-cluster policy.
And a third copy of the RayJob is deployed, in the rayjob3 namespace, to the Nova control plane. Its placement again uses the same specified-cluster policy, and Nova places it on static-cluster.
In this case, static-cluster does not have sufficient remaining resources to run the third copy of RayJob. Its unschedulable pods remain pending until capacity is freed up by the removal of previous job(s).
Example Summary
This example shows how Nova makes handling "fill, no spill" easy via a simple policy-based approach. This simplifies the operation of the cluster and saves money by keeping the workload on the sunk-cost GPUs.
Scenario: Serving Production vs Test/Dev ML/AI Models on GPUs
Overview
For the scenario of serving production vs test/dev ML/AI models on GPUs, the desired behavior is "select the right cluster". The online production serving workloads should be placed on the statically-allocated cluster that is configured to satisfy the performance SLA for the maximum supported production load. Online serving workloads have low-latency requirements, since they are typically on the critical path of some time-sensitive business application (e.g., predicting a ride-sharing ETA). Hence, dynamic allocation of these resources is not desirable. (In practice, an additional statically-allocated, geo-distinct production cluster would be used to increase availability.) The test/dev serving workloads are placed on the dynamically-allocated cluster configured for lower cost and performance. Providing low-latency access for test/dev serving workloads is not a requirement.
For the Nova example setup, cluster static-cluster is configured with a statically-allocated, more powerful GPU instance, and cluster dynamic-cluster will allocate a less powerful (and cheaper) GPU instance as needed. We add the label production to the static-cluster Nova cluster and the label development to the dynamic-cluster Nova cluster. Note that using these cluster labels adds a layer of indirection that makes it easy to add more clusters to a category, e.g., another production cluster in a different region. We use a Nova cluster-selection policy that matches the cluster label appropriate to the workload class.
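A hedged sketch of the production policy follows (a parallel policy selects the development label); the Nova field names and the cluster label key are assumptions for illustration only.

```yaml
# Illustrative sketch: route workloads in the production namespace to
# clusters labeled as production; a parallel policy maps the development
# namespace to development-labeled clusters. Field names are assumed.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: serve-production
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: production
  clusterSelector:
    matchLabels:
      class: production          # hypothetical cluster label key
```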
Example Setup
The initial setup for this example is the same as that used for the previous 2 examples, except with respect to the GPU instances in static-cluster. Previously, static-cluster had 4 g4dn.2xlarge instances, which have an NVIDIA T4 GPU. For this example, static-cluster has a single g5.xlarge instance, which has a higher-performing NVIDIA A10G GPU.
Example Runs
As a proxy for a production serving workload, we use the text summarizer model service, run as a RayService deployed on a Kubernetes cluster using KubeRay, adapted from the example here. The RayService's RayCluster is configured with a CPU head and 1 single-GPU worker. The configuration of the RayService with its associated RayCluster is available here.
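The overall shape of the RayService is sketched below in abbreviated form; the container image, Serve application config, and resource sizes are placeholders, so see the skyray repo for the exact configuration used in the example.

```yaml
# Abbreviated RayService sketch: a CPU-only head and one single-GPU worker
# serving the text summarizer. Image, Serve config, and sizes are placeholders.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: text-summarizer
spec:
  serveConfigV2: |
    applications:
      - name: text_summarizer
        import_path: text_summarizer.app   # placeholder import path
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        num-gpus: "0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:latest       # placeholder image tag
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-worker
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:latest     # placeholder image tag
                resources:
                  requests:
                    cpu: "2"
                    memory: 8Gi
                    nvidia.com/gpu: 1
                  limits:
                    nvidia.com/gpu: 1
```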
The production namespace is spread-scheduled to all clusters, and the RayService is deployed to the Nova control plane in the production namespace. Based on the Nova label-matching policy for production, it is placed on static-cluster.
We validate the service's operation by sending it a sample request.
Next, the development namespace is spread-scheduled to all clusters, and we deploy the same RayService to the development namespace. Based on the Nova label-matching policy for development, it is placed on dynamic-cluster.
In this case, Luna allocates a g4dn.xlarge, which includes an NVIDIA T4 GPU, rather than the g5.xlarge, which includes an NVIDIA A10G GPU. The us-east per-hour on-demand price for the g4dn.xlarge is lower than the 1-year reserved price for the g5.xlarge, so the g4dn.xlarge is a good choice for the development workload, which does not warrant the more powerful GPU.
Again, we validate the service's operation by sending it a sample request.
Example Summary
This example shows how Nova makes handling "select the right cluster" for classes of workloads easy via a simple policy-based approach. By using a Nova policy to select the performance/price ratio that matches each workload, Nova and Luna can reduce your cloud GPU bill while meeting your workloads' requirements.
Conclusion
We've shown how the Nova multi-cluster fleet manager, using its cloud autoscaler-aware feature with Luna, can achieve desired "right place, right size" outcomes for three common ML/AI GPU resource management scenarios: "fill and spill" for GPU production ML/AI model training, "fill, no spill" for GPU experimental ML/AI model training, and "select the right cluster" for GPU production vs test/dev ML/AI model serving. Together, Nova and Luna simplify multi-cluster operation, reduce ML/AI workload latency, and cut cloud GPU costs while meeting workload requirements.
We also note that Nova supports a variety of scheduling policies and has been applied to diverse domains, including managing LLM+RAG deployments, multi-cloud disaster recovery, cloud-agnostic gitops, and K8s cluster upgrades. If you'd like to try Nova and Luna for your workloads, please download our free trial versions: Nova, Luna.
Author:
Anne Holler (Chief Scientist, Elotl)