
Right Place, Right Size: Using an Autoscaler-Aware Multi-Cluster Kubernetes Fleet Manager for ML/AI Workloads

7/11/2024

 

Introduction

Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like Elotl Nova, a policy-driven multi-cluster orchestrator, simplifies operating multiple clusters: it presents a single K8s endpoint for workload submission and chooses a target cluster for each workload based on placement policies and the candidate clusters' available capacity. Nova is also autoscaler-aware, detecting whether workload clusters are running the K8s Cluster Autoscaler or the Elotl Luna intelligent cluster autoscaler.

In this blog, we examine how Nova's placement policies, combined with its autoscaler-awareness, can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up, you can:
  1. Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.
  2. Reduce your bill by directing experimental jobs to sunk-cost clusters.
  3. Reduce your costs via policies that select GPUs with the desired price/performance.

For clusters running in the cloud with a cluster autoscaler, the available cluster capacity is dynamic. Nova can schedule a workload onto such a cluster when it satisfies the workload's placement policy, even if the cluster does not currently have sufficient resources, since the autoscaler can provision what is needed. When multiple clusters satisfy the workload's placement policy, Nova prefers a target cluster with existing available resources and otherwise selects an alternative target cluster that is running a cluster autoscaler.

Nova workloads placed using an available-capacity policy are gang-scheduled. This means that no single job within a workload will start running until all jobs in that workload can be executed simultaneously. Gang scheduling is crucial for ML/AI training jobs, as it ensures all components of a distributed training task begin processing in sync, maximizing efficiency and preventing data inconsistencies.

Additionally, Nova automatically adds Luna's default pod placement label to the workloads it schedules, which allows the workloads to be handled seamlessly on either Luna or non-Luna clusters.
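
To make this concrete, the snippet below sketches the kind of pod-template label involved. The label key and value shown (elotl-luna: "true") are Luna's commonly documented default; treat them as an assumption and check your Luna configuration. The point is that Nova adds the label at placement time, so workload manifests need no Luna-specific markup.

# Sketch only: the label key/value assume Luna's default pod placement label;
# Nova adds it automatically when it places the workload.
apiVersion: v1
kind: Pod
metadata:
  name: example-worker
  labels:
    elotl-luna: "true"   # assumed Luna default; verify against your Luna config
spec:
  containers:
    - name: worker
      image: example/worker:latest   # placeholder image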

Applying Nova+Luna to Some Common ML/AI GPU Resource Management Scenarios

We consider the following common GPU resource management scenarios:
  • Training production ML/AI models on GPUs
  • Training experimental ML/AI models on GPUs
  • Serving production vs test/dev ML/AI models on GPUs
with respect to Nova management of two kinds of workload clusters:
  • Clusters with statically-allocated resources, comprising on-premises or reserved cloud resources, with no cluster autoscaler running.
  • Clusters with dynamically-allocated resources, comprising on-demand cloud resources, running the Luna cluster autoscaler.

Scenario: Training Production ML/AI Models on GPUs

Overview

For the scenario of training production ML/AI models on GPUs, the desired behavior is "fill and spill".  The workloads should be gang-scheduled on a statically-allocated cluster if they fit or on a dynamically-allocated cluster if they don't.  The workloads' high value warrants the cost of on-demand cloud resources, if needed, and the latency to obtain those resources dynamically is not an issue for the training job use case.

For the Nova example setup, we configure cluster static-cluster with a set of statically-allocated GPU instances and cluster dynamic-cluster with Luna configured to allocate similar cloud GPU instances. Both clusters satisfy the Nova available-capacity placement policy. Nova places training workloads on static-cluster first, since its resources are immediately available. When a training workload arrives that does not fit on static-cluster, Nova places it on dynamic-cluster, and Luna adds resources to accommodate the pending workload.
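
For orientation, the sketch below shows roughly what such a placement policy object could look like. The actual policy files used in this example are in the elotl/skyray repo; the apiVersion, field names, and label values below are illustrative assumptions rather than the exact Nova SchedulePolicy schema, so consult the Nova documentation and the repo policies for the real definitions.

# Hypothetical available-capacity placement policy sketch; field names and
# values are assumptions -- see the policy files in elotl/skyray for the
# manifests actually applied in this example.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rayjob-capacity-policy
spec:
  # Select the RayJob (and its generated resources) that this policy places.
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          app: rayjob-train        # illustrative label
  # No specific cluster is pinned: every Nova workload cluster is a candidate,
  # and Nova gang-schedules the group onto a cluster that either has the
  # capacity now or can obtain it through its cluster autoscaler.
  clusterSelector: {}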

Example Setup

The scripts and K8s YAML input used in the example are available at elotl/skyray on GitHub. The commands that follow expect a clone of that repo at the path given by the SKYRAY_PATH environment variable.

The example is run on EKS cloud K8s clusters. The Nova control plane, installed on an EKS cluster comprising 2 CPU nodes, manages the static-cluster and dynamic-cluster workload EKS clusters, initially populated as shown below. The Luna cluster autoscaler is installed on dynamic-cluster to scale the cluster to match workload resource requests. Luna is configured to allocate large EBS volumes, to handle the large instance types and storage needs of the example. Also, Luna bin-packing is disabled, since the example does not contain sets of small pods that benefit from scheduling on the same node.


kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                            STATUS   ROLES    AGE     VERSION              INSTANCE-TYPE
ip-192-168-100-111.us-west-2.compute.internal   Ready    <none>   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-105-241.us-west-2.compute.internal   Ready    <none>   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-149-118.us-west-2.compute.internal   Ready    <none>   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-181-48.us-west-2.compute.internal    Ready    <none>   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal     Ready    <none>   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-28.us-west-2.compute.internal     Ready    <none>   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-78-25.us-west-2.compute.internal     Ready    <none>   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal      Ready    <none>   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge
    
 

kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                          STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-94-42.us-west-2.compute.internal   Ready    <none>   56d   v1.29.3-eks-ae9a62a   m5.large
    

KubeRay and its CRDs are deployed to the Nova control plane, along with a spread-duplicate policy for their placement.  Nova places a copy of KubeRay and its CRDs on each workload cluster, meaning KubeRay is available on each cluster to handle any RayJobs, RayClusters, and RayServices placed by Nova on that cluster.


kubectl apply -f ${SKYRAY_PATH}/policies/krpolicy.yaml
kubectl apply -f ${SKYRAY_PATH}/policies/crdpolicy.yaml
${SKYRAY_PATH}/deploy-scripts/deploy-kuberay-operator.sh
    

After the KubeRay spread-duplicate placement, the Nova control plane output shown below reflects that there are 2 copies of the kuberay-operator, one on each workload cluster.


kubectl get all --all-namespaces

NAMESPACE   NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
default     service/kuberay-operator   ClusterIP   10.96.241.6   <none>        8080/TCP   91s
default     service/kubernetes         ClusterIP   10.96.0.1     <none>        443/TCP    6m50s

NAMESPACE   NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
default     deployment.apps/kuberay-operator   2/1     2            2           91s
    

And Luna has started an additional node in dynamic-cluster to host KubeRay, as shown below.  The KubeRay operator has modest resource requests (100m CPU, 512Mi memory) that can be handled by the inexpensive t3a.small instance type (2 CPUs, 2Gi memory).


kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-182-75.us-west-2.compute.internal   Ready    <none>   55s   v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-94-42.us-west-2.compute.internal    Ready    <none>   56d   v1.29.3-eks-ae9a62a   m5.large
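
For reference, the footprint Luna is sizing for corresponds to a standard Kubernetes resources block like the one below, using the figures quoted above; Luna selects an economical instance type that satisfies these requests.

# kuberay-operator container requests (values as quoted above)
resources:
  requests:
    cpu: 100m
    memory: 512Mi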
    

Example Runs

As a proxy for a production training workload, we use the PyTorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, adapted from the example here. The RayJob's RayCluster is configured with a CPU head and 2 single-GPU workers. The configuration of the RayJob with its associated RayCluster is available here.
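
For readers new to KubeRay, the abbreviated sketch below shows the shape of such a RayJob: a CPU-only head group plus a worker group of 2 replicas, each requesting 1 GPU. The entrypoint, images, and resource figures are placeholders; the manifest linked above in the elotl/skyray repo is the configuration actually used.

# Abbreviated RayJob sketch: CPU head plus 2 single-GPU workers.
# Entrypoint, images, and resource values are placeholders.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-train
spec:
  entrypoint: python /home/ray/job/pytorch_image_train.py   # placeholder path
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.9.0                 # placeholder tag
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
    workerGroupSpecs:
      - groupName: gpu-group
        replicas: 2
        minReplicas: 2
        maxReplicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.9.0-gpu           # placeholder tag
                resources:
                  requests:
                    cpu: "3"
                    memory: 12Gi
                    nvidia.com/gpu: "1"
                  limits:
                    nvidia.com/gpu: "1"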

A first copy of the RayJob is deployed to the Nova control plane in the rayjob1 namespace.  Its placement uses a Nova available-capacity policy.  Nova has native support for the RayCluster, RayJob, and RayService CRDs, and recognizes the resource requests in the podSpecs they contain.  Hence, Nova is able to determine the computing resources needed for the pods comprising the RayJob.  It chooses to place the RayJob and its RayCluster on static-cluster, since it has sufficient available capacity.


export RAYCLUSTER_NAMESPACE1=rayjob1
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE1} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob1 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob1 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created

export TARG_CLUSTER1=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE1} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER1}
static-cluster
    

Another copy of the RayJob is deployed to the Nova control plane in the rayjob2 namespace.  Its placement again uses an available-capacity policy, and Nova again chooses to place the RayJob and its RayCluster on static-cluster, since it has sufficient available capacity for a second copy of the training job.


export RAYCLUSTER_NAMESPACE2=rayjob2
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE2} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob2 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob2 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created  
                       
export TARG_CLUSTER2=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE2} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER2}  
static-cluster
    

A third copy of the RayJob is deployed to the Nova control plane in the rayjob3 namespace.  Its placement again uses an available-capacity policy.  This time Nova places the RayJob and its RayCluster on dynamic-cluster. Nova sees that static-cluster has insufficient remaining capacity for a third copy of the job and detects the Luna cluster autoscaler running on dynamic-cluster, which can obtain the needed resources.


export RAYCLUSTER_NAMESPACE3=rayjob3
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE3} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob3 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob3 created
rayjob.ray.io/rayjob-train created                            
configmap/ray-job-code-train created                          

export TARG_CLUSTER3=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE3} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER3}  
dynamic-cluster
    

All 3 copies of the RayJob can be seen from the Nova control plane:


$ kubectl get all --all-namespaces
. . .
NAMESPACE NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGE
rayjob1   rayjob.ray.io/rayjob-train               Running             2024-07-01T22:13:02Z              9m11s
rayjob2   rayjob.ray.io/rayjob-train   RUNNING     Running             2024-07-01T22:12:07Z              4m55s
rayjob3   rayjob.ray.io/rayjob-train               Initializing        2024-07-01T22:16:28Z              34s
    

And Luna scales up dynamic-cluster accordingly:


kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                            STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE
ip-192-168-161-254.us-west-2.compute.internal   Ready    <none>   4m47s   v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-182-75.us-west-2.compute.internal    Ready    <none>   55m     v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-61-229.us-west-2.compute.internal    Ready    <none>   4m24s   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-63-192.us-west-2.compute.internal    Ready    <none>   4m27s   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-94-42.us-west-2.compute.internal     Ready    <none>   56d     v1.29.3-eks-ae9a62a   m5.large
    

All 3 jobs eventually run to completion:


kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME          END TIME               AGE
rayjob1   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:13:02Z 2024-07-01T22:26:30Z   22m
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:12:07Z 2024-07-01T22:19:49Z   18m
rayjob3   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:16:28Z 2024-07-01T22:30:27Z   14m
    

Example Summary

This example demonstrated how Nova, working with Luna, makes gang-scheduling and "fill and spill" for a multi-worker ML/AI KubeRay RayJob training job easy via a simple available-capacity policy. Nova and Luna can reduce the latency of your ML/AI workloads by scheduling them on available compute resources in a matter of seconds.

Scenario: Training Experimental ML/AI Models on GPUs

Overview

For the scenario of training experimental ML/AI models on GPUs, the desired behavior is "fill, no spill". The workloads should be scheduled on a statically-allocated on-premises or reserved cluster set up for speculative training jobs, consisting of sunk-cost GPU instances. These training workloads have not yet proven valuable enough to warrant paying for on-demand cloud resources.

For the Nova example setup, we configure cluster static-cluster with a set of statically-allocated GPU instances, which represent sunk-cost resources. The Nova cluster-specific placement policy is set to match only that cluster. Nova places all experimental training workloads on that cluster; any that cannot yet run remain pending there.

Example Setup

The initial setup for this example is the same as that used for the previous example.

Example Runs

Again, we use the PyTorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, this time as a proxy for an experimental training job. The RayJob's RayCluster is again configured with a CPU head and 2 single-GPU workers, available here.

In this case, a first copy of the RayJob is deployed, in the rayjob1 namespace, to the Nova control plane.  Its placement uses a specified-cluster policy, with the specified cluster set to static-cluster.


export RAYCLUSTER_NAMESPACE=rayjob1
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy created
namespace/rayjob1 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-static-policy created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created

export TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER}
static-cluster
    

A second copy of the RayJob is deployed, in the rayjob2 namespace, to the Nova control plane.  Its placement uses the same specified-cluster policy.


export RAYCLUSTER_NAMESPACE=rayjob2

${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob2 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-static-policy unchanged
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created

export TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER}
static-cluster
    

And a third copy of the RayJob is deployed, in the rayjob3 namespace, to the Nova control plane. Its placement again uses the same specified-cluster policy, and Nova again places it on static-cluster.


export RAYCLUSTER_NAMESPACE=rayjob3
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob3 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-static-policy unchanged
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created

export TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER}
static-cluster
    

In this case, static-cluster does not have sufficient remaining resources to run the third copy of the RayJob. Its unschedulable pods remain pending until capacity is freed up by the removal of the earlier job(s).


kubectl get all --all-namespaces
. . .
NAMESPACE NAME                      JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob1  rayjob.ray.io/rayjob-train SUCCEEDED Complete     2024-07-02T13:49:21Z   2024-07-02T13:56:49Z   8m5s
rayjob2  rayjob.ray.io/rayjob-train RUNNING   Running      2024-07-02T13:53:16Z                          4m10s
rayjob3  rayjob.ray.io/rayjob-train           Initializing 2024-07-02T13:54:47Z                          2m39s
…
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   12m
rayjob3   rayjob.ray.io/rayjob-train RUNNING    Running     2024-07-02T13:54:47Z                          11m
…
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                      JOB STATUS DEPL STATUS   START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED Complete      2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   14m
rayjob3   rayjob.ray.io/rayjob-train SUCCEEDED Complete      2024-07-02T13:54:47Z   2024-07-02T14:07:53Z   13m
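
While waiting, the pending placement can also be observed directly on the workload cluster; a check like the one below (output omitted here) shows the rayjob3 head and worker pods in Pending status until the earlier jobs' RayClusters are removed and their GPUs are freed.

kubectl --context=static-cluster get pods -n rayjob3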
    

Example Summary

This example shows how Nova makes handling "fill, no spill" easy via a simple policy-based approach. This simplifies the operation of the cluster and saves money by keeping the workload on the sunk-cost GPUs.

Scenario: Serving Production vs Test/Dev ML/AI Models on GPUs

For the scenario of serving production vs test/dev ML/AI models on GPUs, the desired behavior is "select the right cluster". The online production serving workloads should be placed on the statically-allocated cluster, which is configured to satisfy the performance SLA for the maximum supported production load. Online serving workloads have low-latency requirements, since they are typically on the critical path of some time-sensitive business application (e.g., predicting a ride-sharing ETA); hence, dynamic allocation of these resources is not desirable. (In practice, an additional statically-allocated, geo-distinct production cluster would also be used to increase availability.) The test/dev serving workloads are placed on the dynamically-allocated cluster, configured for lower cost and performance. Providing low-latency access for test/dev serving workloads is not a requirement.


For the Nova example setup, cluster static-cluster is configured with a statically-allocated, more powerful GPU instance, and cluster dynamic-cluster allocates a less powerful (and cheaper) GPU instance as needed. We add the label production to the static-cluster Nova cluster and the label development to the dynamic-cluster Nova cluster; these cluster labels add a layer of indirection that makes it easy to add more clusters to a category, e.g., another production cluster in a different region. We use a Nova cluster selection policy that matches the cluster label appropriate to the workload class.
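
A sketch of how such labels might be attached is shown below. It assumes Nova represents each workload cluster as a labelable Cluster object in the control plane; the resource kind and the env label key are assumptions, so check the cluster resources and label keys your Nova version and policies actually use.

# Hypothetical sketch: label the Nova cluster objects so policies can select
# clusters by role rather than by name. Resource kind and label key are assumptions.
kubectl label cluster static-cluster env=production
kubectl label cluster dynamic-cluster env=development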

Example Setup

The initial setup for this example is the same as that used for the previous 2 examples, except with respect to the GPU instances in static-cluster.  Previously, static-cluster had 4 g4dn.2xlarge instances, which have an NVIDIA T4 GPU.  For this example, static-cluster has a single g5.xlarge instance, which has a higher-performing NVIDIA A10G GPU.

kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-181-48.us-west-2.compute.internal   Ready    <none>   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal    Ready    <none>   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-62.us-west-2.compute.internal    Ready    <none>   95m   v1.29.3-eks-ae9a62a   g5.xlarge
ip-192-168-78-25.us-west-2.compute.internal    Ready    <none>   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal     Ready    <none>   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge
    

Example Runs

As a proxy for a production serving workload, we use the text summarizer model service, run as a RayService deployed on a Kubernetes cluster using KubeRay, adapted from the example here. The RayService's RayCluster is configured with a CPU head and 1 single-GPU worker.  The configuration of the RayService with its associated RayCluster is available here.
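
For comparison with the RayJob used earlier, the abbreviated sketch below shows the shape of such a RayService: a Ray Serve application description (serveConfigV2) plus a RayCluster with a CPU head and a single GPU worker. The import path, images, and resource figures are placeholders; the manifest linked above is the configuration actually used.

# Abbreviated RayService sketch: a Ray Serve app served by a CPU head and
# 1 single-GPU worker. Values shown are placeholders.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: text-summarizer
spec:
  serveConfigV2: |
    applications:
      - name: text_summarizer
        import_path: text_summarizer.app    # placeholder import path
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.9.0               # placeholder tag
    workerGroupSpecs:
      - groupName: gpu-group
        replicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.9.0-gpu         # placeholder tag
                resources:
                  requests:
                    nvidia.com/gpu: "1"
                  limits:
                    nvidia.com/gpu: "1"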

The production namespace is spread-scheduled to all clusters, and the RayService is deployed to the Nova control plane in the production namespace. Based on the Nova label-matching policy, it is placed on static-cluster.


$ kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=production
rayservice.ray.io/text-summarizer created

kubectl --context=static-cluster get all -n production

NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-ntcfh-head-tmnqr               1/1     Running   0          68m
pod/text-summarizer-raycluster-ntcfh-worker-gpu-group-wft6f   1/1     Running   0          68m

NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.6.157     <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   60m
service/text-summarizer-raycluster-ntcfh-head-svc   ClusterIP   10.100.197.135   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   68m
service/text-summarizer-serve-svc                   ClusterIP   10.100.205.162   <none>        8000/TCP                                        60m

NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-ntcfh   1                 1                   5      20G      1      ready    68m

NAME                                AGE
rayservice.ray.io/text-summarizer   68m
    

We validate its operation as follows:


kubectl --context=static-cluster port-forward svc/text-summarizer-serve-svc 8000 -n production

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000

python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.
    

Next, the development namespace is spread-scheduled to all clusters, and we deploy the same RayService to the development namespace. Based on the Nova label-matching policy, it is placed on dynamic-cluster.


kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=development
rayservice.ray.io/text-summarizer created

kubectl --context=dynamic-cluster get all -n development

NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-2xnts-head-68bvm               1/1     Running   0          47m
pod/text-summarizer-raycluster-2xnts-worker-gpu-group-s8pbn   1/1     Running   0          47m

NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.45.127   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   37m
service/text-summarizer-raycluster-2xnts-head-svc   ClusterIP   10.100.46.227   <none>        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   47m
service/text-summarizer-serve-svc                   ClusterIP   10.100.209.7    <none>        8000/TCP                                        37m

NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-2xnts   1                 1                   5      20G      1      ready    47m

NAME                                AGE
rayservice.ray.io/text-summarizer   47m
    

In this case, Luna allocates a g4dn.xlarge, which includes an NVIDIA T4 GPU, rather than the g5.xlarge, which includes an NVIDIA A10G GPU.  The us-east per-hour on-demand price for the g4dn.xlarge is lower than the 1-year reserved price for the g5.xlarge, so the g4dn.xlarge is a good choice for the development workload, which does not warrant the more powerful GPU.


kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type

NAME                                            STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-164-97.us-west-2.compute.internal    Ready    <none>   8d    v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-171-101.us-west-2.compute.internal   Ready    <none>   48m   v1.29.3-eks-ae9a62a   t3a.xlarge
ip-192-168-49-24.us-west-2.compute.internal     Ready    <none>   48m   v1.29.3-eks-ae9a62a   g4dn.xlarge
ip-192-168-94-42.us-west-2.compute.internal     Ready    <none>   64d   v1.29.3-eks-ae9a62a   m5.large
    

Again, we validate its operation as follows:


kubectl --context=dynamic-cluster port-forward svc/text-summarizer-serve-svc 8000 -n development
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000

python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.
    

Example Summary

This example shows how Nova makes handling "select the right cluster" for classes of workloads easy via a simple policy-based approach. By using a Nova policy to select the performance/price ratio that matches each workload, Nova and Luna can reduce your cloud GPU bill while meeting your workloads' requirements.

Conclusion

We've shown how the Nova multi-cluster fleet manager, using its autoscaler-aware scheduling together with Luna, can achieve the desired "right place, right size" outcomes for three common ML/AI GPU resource management scenarios: "fill and spill" for production ML/AI model training on GPUs, "fill, no spill" for experimental ML/AI model training on GPUs, and "select the right cluster" for serving production vs test/dev ML/AI models on GPUs.

Nova and Luna can:
  1. Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.
  2. Reduce your bill by directing experimental jobs to sunk-cost clusters.
  3. Reduce your costs via policies that select GPUs with the desired price/performance.

We note that Nova supports a variety of scheduling policies and has been applied to diverse domains, including managing LLM+RAG deployments, multi-cloud disaster recovery, cloud-agnostic GitOps, and K8s cluster upgrades.

If you'd like to try Nova and Luna for your workloads, please download our free trial version: Nova, Luna.

Author:
Anne Holler (Chief Scientist, Elotl)

