<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[Elotl - Blog]]></title><link><![CDATA[https://www.elotl.co/blog]]></link><description><![CDATA[Blog]]></description><pubDate>Thu, 09 Apr 2026 09:14:54 -0700</pubDate><generator>Weebly</generator><item><title><![CDATA[Thrifty-Nova: Cost-Ordered AI Workload Placement for Multi-Cluster K8s with Autoscaled Cloud Clusters]]></title><link><![CDATA[https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters]]></link><comments><![CDATA[https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters#comments]]></comments><pubDate>Tue, 18 Nov 2025 14:49:08 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters</guid><description><![CDATA[ABSTRACTIn a multi-cluster Kubernetes (K8s) environment, when there are insufficient statically-allocated free cluster resources to schedule a workload, an autoscaled cloud cluster can be used to obtain the resources needed to run the workload.&nbsp; Selecting, among your autoscaled cloud clusters, the one that can obtain those resources at the lowest estimated price is desirable, particularly for AI workloads requiring GPUs, since cloud GPU supply can be limited and costs can be high and can vary [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">ABSTRACT</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters.png?1763477528" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">In a multi-cluster <a href="https://kubernetes.io/"><u>Kubernetes</u></a> (K8s) environment, when there are insufficient statically-allocated free cluster resources to schedule a workload, an autoscaled cloud cluster can be used to obtain the resources needed to run the workload.&nbsp; Selecting, among your autoscaled cloud clusters, the one that can obtain those resources at the lowest estimated price is desirable, particularly for AI workloads requiring GPUs, since cloud GPU supply can be limited and costs can be high and can vary greatly across vendors.<br><br>In this blog, we present Thrifty-Nova, a tool for performing cost-ordered workload placement on autoscaled cloud clusters.&nbsp; Thrifty-Nova leverages the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> fleet manager's policy-driven multi-cluster scheduling and the <a href="https://www.elotl.co/luna.html"><u>Luna</u></a> Smart cluster autoscaler's node cost estimate feature to create a Nova placement policy that is customized to the workload with respect to relevant cloud resource availability and price.&nbsp; We give several examples of Thrifty-Nova usage that show the 
value of automating workload cluster selection in cost-priority order, given the impact of workload configuration and dynamic resource availability on successful placement.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title">INTRODUCTION<br></h2><div class="paragraph" style="text-align:left;">Nova manages a multi-cluster multi-cloud K8s fleet, scheduling K8s workloads on target clusters in accordance with scheduling policies and free capacity, as shown in Figure 1.&nbsp; Nova handles a variety of use-cases, including workload placement for resource availability or quality as presented <a href="https://youtu.be/sP3Oo8yT5xA"><u>here</u></a>, with optional cross-cluster placement as demonstrated, e.g., using <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> stretched networking, as covered in this three-part blog series (<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><u>blog1</u></a>, <a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases"><u>blog2</u></a>, <a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><u>blog3</u></a>); priority-based cluster selection allowing preferential workload placement on on-premise or reserved clusters as described <a href="https://youtu.be/nt2iq5hbssY"><u>here</u></a>; duplicate workload placement for common tooling or service continuity as discussed <a href="https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes"><u>here</u></a>; and workload migration for cluster maintenance or upgrade as illustrated <a href="https://youtu.be/SiAoPbKnooU"><u>here</u></a>.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/thrifty-nova-cost-ordered-ai-workload-placement-for-multi-cluster-k8s-with-autoscaled-cloud-clusters-intro-image.png?1763477656" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 1: Nova Multi-Cluster Fleet Manager</div></div></div><div class="paragraph" style="text-align:left;">Nova interoperates with cloud cluster autoscalers, including the K8s Cluster Autoscaler and the Luna Smart cluster autoscaler.&nbsp; If no workload cluster that meets a schedule group's policy has sufficient free capacity for the group, Nova places the group on an autoscaled cluster that meets the policy, with the expectation that the autoscaler will add the needed capacity, as discussed <a href="https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads"><u>here</u></a>.&nbsp; Luna was recently updated to provide node cost estimation for pods.&nbsp; As described <a href="https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock"><u>here</u></a>, for Luna-managed pods whose scheduling readiness is blocked by the <em>nodecostestimate</em> K8s scheduling gate, Luna reports a pod event that indicates the node type it would allocate were the pod schedulable, with the type's estimated hourly compute cost.&nbsp; Thrifty-Nova, leveraging the capabilities of Nova and Luna, dynamically creates a Nova 
cluster-priority group policy that directs Nova to select the cluster that can run a workload at the lowest estimated price.<br></div><h2 class="wsite-content-title"><font size="5">THRIFTY-NOVA OPERATION</font><br></h2><div class="paragraph" style="text-align:left;">Given a workload to be run at the lowest price, Thrifty-Nova determines the per-cluster workload cost estimates using Nova and Luna.&nbsp; Thrifty-Nova then creates a Nova policy for cost-ordered placement and deploys the workload using that policy.<br><br>To determine the per-cluster workload cost estimates using Nova and Luna, Thrifty-Nova does the following:<br><br><ul><li>Deploys a <em>nodecostestimate</em> schedule-gated version of the workload using a Nova spread/duplicate policy.</li><li>Gathers <em>NodeCostEstimate</em> events for the workload pods running on Luna-enabled clusters and sums them.</li><li>Treats statically-allocated clusters as 0 cost and autoscaled clusters not reporting estimates as max cost.</li><li>Undeploys the schedule-gated version of the workload and the associated spread/duplicate policy.</li></ul><br>Note that the Luna <em>NodeCostEstimate</em> event will indicate if Luna would not currently expect to obtain a pod's needed resources, e.g., due to stock-out or quota backoffs; Thrifty-Nova treats any such clusters as having max cost.&nbsp; Also note that when Luna estimates the cost of a node to host a pod, it does so based on the information it has at that point.&nbsp; When Luna actually allocates a node for the pod, it may allocate a more expensive node type (if the node type used for its estimate is not available) or a less expensive node type (if Luna considered the node type unavailable at the time of its estimate).&nbsp; The cost of a node Luna will allocate for a pod can be capped by annotating the pod with <em>node.elotl.co/instance-max-cost</em> set to the maximum allowed cost.<br><br>To create a Nova policy for cost-ordered placement and deploy a workload using that policy, Thrifty-Nova does the following:<br><br><ul><li>Creates a Nova cluster-priority group policy, with the clusters specified in ascending cost order.</li><li>Deploys a non-schedule-gated version of the workload using that policy.</li></ul><br>Based on the policy, the Nova control plane will gang-schedule the workload on the first cluster on which the workload appears to fit.&nbsp; If the workload doesn't fit on a statically-allocated cluster, Nova will choose the first autoscaled cluster in the list.&nbsp; If a Luna autoscaled cluster cannot obtain the resources to run a pod, it reports a <em>NodeAddRequestWarning</em> event.&nbsp; Nova detects that pod event and retries the group placement on the next cluster in the priority list.&nbsp; Note that Nova retries the clusters in the priority list in round-robin fashion, meaning that a Luna cluster that reported a warning could eventually be retried if no other cluster is able to host the workload.<br><br>The Thrifty-Nova tool script is <a href="https://github.com/elotl/try-nova/blob/main/thrifty-nova/cost-schedule.sh"><u>here</u></a>.&nbsp; Its arguments are the path to a local <a href="https://github.com/elotl/try-nova"><u>try-nova</u></a> repo clone, both the schedule-gated and non-gated workload YAMLs, the namespace to use for workload policy and deployment, and the label key and value that select workload objects for Nova group placement.&nbsp; To try this out, you'll need to install the Nova control plane on a host K8s cluster and the Nova agent on each of the workload clusters; Nova installation 
instructions are <a href="https://docs.elotl.co/nova/installation_novactl/"><u>here</u></a>.&nbsp; You'll also need to ensure that the namespace being used for the workload policy is available on all of the workload clusters; an example Nova spread/duplicate policy can be found <a href="https://github.com/elotl/skyray/blob/main/policies/nspolicy.yaml"><u>here</u></a>, which Nova could apply to the namespace deployment <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/namespace.yaml"><u>here</u></a>.<br></div><h2 class="wsite-content-title"><font size="5">THRIFTY-NOVA EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">The Thrifty-Nova experiments were run using Nova v1.3.12 for the clusters listed in Table 1.&nbsp; The Luna clusters used Luna v1.4.0.<br></div><div><div id="395085323680711672" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Nova Cluster Role</th><th style="width: 17%;">Cluster Name</th><th style="width: 17%;">Cloud K8s</th><th style="width: 17%;">K8s Version</th><th style="width: 17%;">Location</th><th style="width: 17%;">Resource Allocation</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Control Plane</td><td>control-plane-host4</td><td>GKE</td><td>1.33</td><td>us-central1</td><td>static</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>static-gke</td><td>GKE</td><td>1.33</td><td>us-central1</td><td>static</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-gke-a</td><td>GKE</td><td>1.33</td><td>us-central1-a</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-gke-f</td><td>GKE</td><td>1.33</td><td>us-central1-f</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-aks</td><td>AKS</td><td>1.32</td><td>eastus</td><td>dynamic via Luna</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Workload</td><td>autoscale-eks</td><td>EKS</td><td>1.33</td><td>us-west-2</td><td>dynamic via Luna</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 1: Clusters used in Thrifty-Nova Experiments<br></div><div class="paragraph" style="text-align:left;">The workload for the experiments is LLM model serving via a <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html#kuberay-rayservice-quickstart"><u>RayService</u></a> deployment running on Nova's SkyRay platform.&nbsp; SkyRay, presented <a href="https://www.youtube.com/watch?v=JyRZApYsci4"><u>here</u></a> and documented <a href="https://docs.elotl.co/nova/Concepts/sky-ray/"><u>here</u></a>, requires Nova spread/duplicate scheduling of KubeRay to all workload clusters to which a Ray object may be placed; a simple approach is to place it on all clusters. 
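</div><div class="paragraph" style="text-align:left;">To make the cost-estimation step concrete: the schedule-gated workload configs used in these experiments differ from their non-gated counterparts essentially only in the pod template, where the <em>nodecostestimate</em> gate blocks scheduling readiness so that Luna reports a <em>NodeCostEstimate</em> event rather than allocating a node.&nbsp; The following is a minimal sketch of such a gated pod; the pod name, image, resource sizes, and cost-cap value are illustrative, while the gate name and annotation key are those described above:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: schedule-gated (cost-probe) pod; names and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: cost-probe-worker
  annotations:
    # Optional cap on the hourly cost of any node Luna allocates for this pod:
    node.elotl.co/instance-max-cost: "2.50"
spec:
  schedulingGates:
  - name: nodecostestimate   # blocks scheduling readiness; Luna emits NodeCostEstimate
  containers:
  - name: worker
    image: rayproject/ray:2.9.0   # illustrative image
    resources:
      requests:
        cpu: "16"
        memory: 16Gi
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"</code></pre></div></div></div><div class="paragraph" style="text-align:left;">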
We used KubeRay 1.4.2.<br><br>The experiments used the model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><em><u>microsoft/Phi-3-mini-4k-instruct</u></em></a>, which runs efficiently on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S.&nbsp; The Luna option to specify the desired GPU SKUs was used for RayService worker pods; on Luna-enabled clusters, Luna ensured that the associated pods were placed on the lowest-cost available node types satisfying the GPU SKU constraint.&nbsp; To ensure placement on the desired GPU models on the static cluster, node affinity to GPU model labels on those nodes was used. The GKE NVIDIA daemonset automatically adds the node label&nbsp;<em>cloud.google.com/gke-accelerator</em> set to the GPU model from <a href="https://cloud.google.com/compute/docs/gpus#gpu-models"><u>this list</u></a>; that label is used in the following nodeAffinity setting, which works for both Luna and non-Luna clusters (the matchExpressions are ORed):<br></div><div><div id="464849646727259163" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.elotl.co/created-by
                operator: In
                values:
                - luna
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: In
                values:
                - &lt;GKE-model-name1&gt;
                ...
                - &lt;GKE-model-nameN&gt;</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Note that on non-GKE K8s clusters, NVIDIA GPU Feature Discovery in the k8s-device-plugin daemonset similarly sets the node label <em>nvidia.com/gpu.product</em> to the NVIDIA GPU product name derived from <a href="https://github.com/NVIDIA/k8s-device-plugin/blob/main/vendor/github.com/NVIDIA/go-nvlib/pkg/pciids/default_pci.ids"><u>this list</u></a>, so static clusters using GFD can use that key to specify the desired GPU model(s).<br></div><h2 class="wsite-content-title"><font size="4">Experiment 1: RayService with 2 mid-tier 1-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 1, Thrifty-Nova was requested to place the RayService comprising a 2-CPU 16GB CPU-only head and 2 16-CPU 16GB 1-NVIDIA-GPU workers, as per the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.yaml"><u>here</u></a>.&nbsp; Thrifty-Nova created a placement policy with the clusters in the priority order: static-gke, autoscale-gke-a, autoscale-eks, autoscale-aks, autoscale-gke-f, as per the cost estimates shown in Table 2.&nbsp; The static-gke cluster was first with 0 cost, since no additional cost would be incurred by placing the workload on that cluster.&nbsp; The autoscale-gke-f cluster was last at max cost, because us-central1-f did not have any capacity for the specified GPU SKUs.<br><br>When Nova ran placement with the created policy, static-gke had 2 1-GPU L4 nodes allocated and available, and hence had sufficient resources for the workload, so that placement succeeded.<br></div><div><div id="377828202851354508" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Selected</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>3.649</td><td>e2-highmem-4 (0.181)</td><td>2x g2-standard-32 (1.734)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>4.254</td><td>r5a.xlarge (0.226)</td><td>2x g6.8xlarge (2.014)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>6.626</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NV36ads_A10_v5 (3.200)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>max</td><td>e2-highmem-4 (0.181)</td><td>No NVIDIA GPUs for requested SKUs</td><td></td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 2: Per Cluster Estimated Workload Cost for Experiment 1<br></div><h2 class="wsite-content-title"><font size="4">Experiment 2: RayService with 2 mid-tier 2-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 2, the workload was specified to have 2 2-GPU workers rather than 2 1-GPU workers, with the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.2gpus.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.2gpus.yaml"><u>here</u></a>. Thrifty-Nova again created a placement policy that specified the clusters in the order: static-gke, autoscale-gke-a, autoscale-eks, autoscale-aks, autoscale-gke-f, as per the cost estimates shown in Table 3.<br><br>When Nova ran placement with the created policy, static-gke did not have any available 2-GPU resources, so Nova next attempted to place the workload on autoscale-gke-a.&nbsp; If Nova placement was run during off-peak hours, Luna was able to scale up autoscale-gke-a, so Nova placement there was successful.&nbsp; However, if Nova placement was run during peak hours, Luna encountered stock-out for all of the candidate GPU instances in that cluster, and Nova then tried placement of the workload on autoscale-eks, where Luna was able to allocate the resources.<br></div><div><div id="430505692479183925" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Insufficient 2-gpu resources</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>4.182</td><td>e2-highmem-4 (0.181)</td><td>2x g2-standard-24 (2.001)</td><td>Selected during off-peak; Stock out during peak</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>4.828</td><td>r5a.xlarge (0.226)</td><td>1x g6.12xlarge (4.602)</td><td>Selected during peak</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>13.266</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NV72ads_A10_v5 (6.520)</td><td></td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>max</td><td>e2-highmem-4 (0.181)</td><td>No NVIDIA GPUs for requested SKUs</td><td></td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 3: Per Cluster Estimated Workload Cost for Experiment 2<br></div><h2 class="wsite-content-title"><font size="4">Experiment 3: RayService with 2 A100 1-GPU workers</font><br></h2><div class="paragraph" style="text-align:left;">For Experiment 3, 2 1-GPU workers were specified to use the A100 GPU SKU rather than one of the mid-tier GPU SKUs previously listed, with the schedule-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.schedgate.a100.yaml"><u>here</u></a> and non-gated config <a href="https://github.com/elotl/skyray/blob/main/thrifty-nova/ray-service.llm-serve.noschedgate.a100.yaml"><u>here</u></a>.&nbsp; In this case, Thrifty-Nova created a placement policy that specified the clusters in the order: static-gke, autoscale-aks, autoscale-gke-a, autoscale-gke-f, autoscale-eks, as shown in Table 4.<br><br>Nova attempted placement on static-gke, autoscale-aks, autoscale-gke-a, and autoscale-gke-f, but there were no A100 instances in static-gke and Luna could not allocate A100-enabled instances on the AKS and GKE autoscaled clusters due to our accounts on those clouds having insufficient A100 quota.&nbsp; Nova next attempted placement of the workload to autoscale-eks, where Luna was able to allocate the resources.</div><div><div id="155103829605908635" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">Cluster Name</th><th style="width: 15%;">Est. Workload Cost ($/hr)</th><th style="width: 25%;">Head Node Type (Est. Cost)</th><th style="width: 25%;">Worker Node(s) Type (Est. 
Cost)</th><th style="width: 20%;">Cluster Selection Status</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>static-gke</td><td>0</td><td>N/A</td><td>N/A</td><td>Insufficient A100 resources</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-aks</td><td>7.572</td><td>Standard_E4as_v5 (0.226)</td><td>2x Standard_NC24ads_A100_v4 (3.673)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-a</td><td>14.859</td><td>e2-highmem-4 (0.181)</td><td>2x a2-highgpu-2g (7.339)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-gke-f</td><td>14.859</td><td>e2-highmem-4 (0.181)</td><td>2x a2-highgpu-2g (7.339)</td><td>Insufficient A100 quota</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>autoscale-eks</td><td>22.183</td><td>r5a.xlarge (0.226)</td><td>1x p4d.24xlarge (21.958)</td><td>Selected</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 4: Per Cluster Estimated Workload Cost for Experiment 3<br></div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">We've presented Thrifty-Nova, a tool for performing cost-ordered workload placement on a mix of on-premise and cloud clusters managed by the Nova fleet manager, including cloud clusters running the Luna Smart autoscaler.&nbsp; Thrifty-Nova uses a Nova spread/duplicate policy to estimate workload costs via the Luna Smart autoscaler node cost estimate feature, and then creates a Nova cluster-priority group policy to perform workload placement in cluster cost order.&nbsp; We've shown examples of how using that policy allows the lowest-cost available resources to be allocated, leveraging the power of Nova and Luna while responding dynamically to capacity constraints, including cloud stock-out and quota issues.<br><br>Are you sensitive to cost and resource availability for your workloads, especially expensive AI workloads, when choosing among your on-premise, reserved, and autoscaled cloud K8s clusters?&nbsp; Thrifty-Nova is available as a simple shell script that you can use with free trial versions of <a href="https://www.elotl.co/nova-free-trial.html"><u>Nova</u></a> and <a href="https://www.elotl.co/luna-free-trial.html"><u>Luna</u></a>.&nbsp; We invite you to try Nova, Luna, and Thrifty-Nova, and to let us know how it goes!<br><br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 3: Rescheduling Ray AI Apps Between K8s Clusters for RayService Cluster Upgrade/Reconfigure]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure#comments]]></comments><pubDate>Sun, 02 Nov 2025 22:52:09 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure</guid><description><![CDATA[Abstract   In our blogs &ldquo;SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource 
and Time Efficiency&rdquo; and "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases", we discussed SuperSkyRay&rsquo;s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh as well as SuperSkyRay&rsquo;s non-disruptive handling of Ray apps that outgrow single-cluster placement via extending them t [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font><br></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126547" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In our blogs &ldquo;<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><font size="3">SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency</font></a>&rdquo; and "<a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases"><font size="3">SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases</font></a>", we discussed SuperSkyRay&rsquo;s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh as well as SuperSkyRay&rsquo;s non-disruptive handling of Ray apps that outgrow single-cluster placement via extending them to multi-cluster placement.<br /><br />In this blog, we consider SuperSkyRay&rsquo;s handling of KubeRay RayServices that outgrow the single <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> clusters hosting them during zero-downtime Ray cluster upgrade or reconfiguration.&nbsp; To support zero downtime (the default), the RayService keeps the current Ray cluster running while it brings up an additional Ray cluster with the new configuration; the upgrade or reconfiguration is incomplete until the new version of the Ray cluster is available. 
SuperSkyRay can reschedule a RayService deployed on a single cluster onto a different cluster to avoid the update stalling indefinitely when there are insufficient resources for a second RayCluster.&nbsp; While this relocation involves downtime, it is appropriate when time-to-update is critical and resources are limited.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">When any field in <em>spec.rayClusterConfig</em> of a running RayService is changed, KubeRay by default performs a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/rayservice.html#step-8-zero-downtime-upgrade-for-ray-clusters"><u>zero downtime upgrade</u></a> of the Ray cluster as follows.&nbsp; It keeps the current copy of the Ray cluster running to continue processing service requests while it deploys an additional version of the Ray cluster with the updates.&nbsp; Once the new version is fully ready, it switches the service to using the updated Ray cluster and removes the old Ray cluster.&nbsp; While this avoids service downtime, it requires that the K8s cluster hosting the RayService have sufficient resources to run two copies of the Ray cluster.&nbsp; When this is not possible, the service update remains incomplete for an indefinite period of time, which is undesirable.&nbsp; (RayService no-downtime upgrade can be disabled by setting ENABLE_ZERO_DOWNTIME to false, in which case cluster config changes do not trigger any upgrade operation, which can also be undesirable.)<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">When Nova detects that a schedule group running on a single cluster has pending pods, it looks to reschedule the group.&nbsp; If <strong>skip-capacity-relocate </strong>is not set, it will first look for an alternative single-cluster placement.&nbsp; When the group contains a RayService with a Ray cluster, it seeks an alternative single cluster that is sufficient for one copy of the Ray cluster, which works fine for the update case since the relocated RayService is restarted with only the most recent Ray cluster configuration.&nbsp; While this relocation will engender RayService downtime, it may be worthwhile to achieve the service update in a timely manner.<br /><br />Note that if the <strong>skip-capacity-relocate</strong> option is set, the RayService will not be relocated and the service update will remain incomplete until sufficient resources are available in the cluster.&nbsp; SuperSkyRay could be extended to perform cross-cluster placement of the new Ray cluster, while maintaining the existing Ray cluster on the current K8s cluster, but the ROI of adding this complexity is unclear; we note that KubeRay is moving to <a href="https://github.com/ray-project/kuberay/pull/3166"><u>no-downtime incremental upgrades</u></a>, which will reduce the resource requirements of updating RayService Ray clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay New Cluster Reschedule Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">SuperSkyRay&rsquo;s group rescheduling is triggered as in our previous blog "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases".&nbsp; In this case, however, since <strong>skip-capacity-relocate</strong> is unset, an alternative single-cluster placement is considered.&nbsp; When another placement is found, 
the manifests for objects in the scheduling group are removed from the old cluster and added to the new, and the workload is redeployed.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a group containing a RayService prediction service on an on-premise K8s cluster, as shown in Figure 1, using an AKS &ldquo;on-prem&rdquo; cluster for illustration.&nbsp; We then manually update the configuration of the Ray cluster in the service, leading KubeRay to create a second copy of the Ray cluster with the updated configuration in the service.&nbsp; This second copy does not fit on the on-premise K8s cluster, so the update is blocked.&nbsp; SuperSkyRay reschedules the group containing the RayService to the AKS &ldquo;cloud&rdquo; cluster where the updated service is deployed, as shown in Figure 2.&nbsp; Note we could optionally trigger a reschedule of the updated service back to the on-premise cluster, if desired.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-3-figure-1-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124276" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/editor/part-3-figure-2-superskyray-revised-schedule-for-updated-rayservice-to-run-on-cloud-cluster.png?1762124324" alt="Picture" style="width:696;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: SuperSkyRay revised schedule for updated RayService to run on cloud cluster</div> </div></div>  <div class="paragraph" style="text-align:left;">Appendix A contains the details for running this use case on AKS cloud K8s clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we explained how SuperSkyRay handles a Ray app that outgrows its original cluster after an upgrade or reconfiguration, by rescheduling the app to another K8s cluster to prevent updates from stalling due to insufficient resources.&nbsp; While this Ray app relocation involves downtime, it is appropriate when resources are limited and time-to-update is critical.<br /><br />Have you experienced RayService RayCluster updates blocking indefinitely due to insufficient resources to run a second copy of the RayCluster?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html" target="_blank">here</a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a 
href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable --multi-cluster-capacity, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that fits on one workload cluster, as described <a href="https://drive.google.com/file/d/1hkX815CgtlWtlfQIlp3h5HozJ8nA-CoM/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay will place the RayService on one workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li><li><em>Manually update the RayService cluster configuration to, e.g., increase the head memory limit.</em><ul><li><em>KubeRay will deploy an additional copy of the Ray cluster, which won&rsquo;t fit</em></li><li><em>SuperSkyRay will reschedule the RayService [updated] on the 2nd workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br /><strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases#comments]]></comments><pubDate>Sun, 02 Nov 2025 21:41:28 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases</guid><description><![CDATA[Abstract   In our previous blog,&nbsp;SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh.In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single Kubernetes (K8s) cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-c 
[...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font><br></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:184px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126362" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In our previous blog,&nbsp;<a href="https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency"><font size="3">SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency</font></a>, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh.<br /><br />In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app&rsquo;s resources without requiring any app relocation downtime.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">When SuperNova (Nova w/<strong>multi-cluster-capacity</strong> set) performs capacity-based scheduling of a K8s object group, it prefers to place the group on a single cluster if possible, since that choice is simpler in terms of management and networking than cross-cluster placement.&nbsp; If a group placed on a single cluster contains an app for which the worker count is later scaled up, the result may no longer fit on that cluster, e.g., because the cluster has reached its fixed size limit, as is the case of on-premise or cloud reserved-instance clusters. 
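</div><div class="paragraph" style="text-align:left;">As a concrete example of such a scale-up, the manual use case in Appendix A below increases two replica fields in the RayService spec.&nbsp; The following is a minimal sketch of just the changed fields; the deployment and worker-group names are illustrative, while the field paths and the <em>text_summarizer</em> application name are those used in Appendix A:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: the RayService fields bumped in Appendix A's manual scale-up.
# Raising these beyond the cluster's spare capacity makes the placed
# group outgrow its single-cluster placement.
spec:
  serveConfigV2: |
    applications:
    - name: text_summarizer
      deployments:
      - name: Summarizer      # illustrative deployment name
        num_replicas: 3       # increased to request an additional replica
  rayClusterConfig:
    workerGroupSpecs:
    - groupName: gpu-group    # illustrative worker group name
      replicas: 3             # increased to match the Serve replica count</code></pre></div></div></div><div class="paragraph" style="text-align:left;">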
When a group no longer fits on its cluster, SuperNova seeks to reschedule the group.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Focusing on the case where the Ray app worker count is scaled up, SuperSkyRay (SuperNova managing SkyRay) by default looks for another single cluster for the group, although relocating the group will involve downtime.&nbsp; However, if Nova is run with <strong>skip-capacity-relocate</strong>, which specifies not to relocate a capacity-based group from its current cluster solely to get more resources, or if there is no other single cluster that can run the group, SuperSkyRay considers dynamically expanding the single-cluster placement to a multi-cluster placement, leveraging its specialized knowledge about extending the Ray app&rsquo;s Ray cluster to span multiple K8s clusters.&nbsp; By expanding the running app to multi-cluster placement, the downtime that would be needed to relocate the app is avoided.&nbsp; During any subsequent Ray app scale-down, remote Ray workers, i.e., those placed on a K8s cluster not containing the Ray head, are preferentially removed.<br /><br />We present an example use case where a Ray online prediction service running on an on-premise K8s cluster is, due to increased query volume, scaled up and will no longer fit on the K8s cluster.&nbsp; SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no service downtime.&nbsp; We also present a similar second use case in which the Ray Serve autoscaler increases the number of Ray workers after the initial on-prem placement of the Ray cluster, again requiring the Ray cluster to span K8s clusters.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Cross-Cluster Reschedule Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">This section assumes that the SuperSkyRay components are set up as described in our blog "SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency".&nbsp;<br /><br />For SuperSkyRay cross-cluster reschedule, SuperNova is run with <strong>skip-capacity-relocate </strong>to specify that Nova should not relocate a capacity-based group from its current cluster solely to get more resources. When a workload cluster Nova agent status controller detects that a group does not fit, it marks the schedule group for rescheduling by the Nova control plane. 
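</div><div class="paragraph" style="text-align:left;">For reference, the two Nova options named in this post are plain scheduler flags.&nbsp; The following is a hypothetical sketch of how they might appear in the Nova scheduler's pod spec; the container name is illustrative, and the actual installation steps are in the cheat-sheets linked in the appendices:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Hypothetical fragment; only the two flag names come from this post.
spec:
  containers:
  - name: nova-scheduler         # illustrative container name
    args:
    - --multi-cluster-capacity   # prefer single-cluster placement, else span clusters
    - --skip-capacity-relocate   # don't relocate a placed group solely for capacity</code></pre></div></div></div><div class="paragraph" style="text-align:left;">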
When the SuperSkyRay Nova control plane looks at rescheduling a group in this case, it considers dynamically updating the single-cluster placement to a multi-cluster placement.&nbsp; When the Nova control plane updates the Ray object schedule to multi-cluster placement, it modifies the scheduling data for the Ray app manifest in the workload cluster Nova scheduling configmap.<br /><br />The Nova agent schedule controller applies the modification to the running Ray app in the workload cluster and the Nova agent status controller detects the change.&nbsp; It then performs similar operations to those it does for initial cross-cluster Ray worker placement: it replaces each pending pod that should run on a different cluster with a placeholder pod and puts the pod manifest into the appropriate workload cluster Nova scheduling configmap.&nbsp; It also duplicates the Ray head service onto all clusters slated to run Ray workers so that the Ray cluster head service can leverage Cilium Cluster Mesh for cross-K8s workers.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Manually-Scaling Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 1, with AKS clusters standing in for &ldquo;on-prem&rdquo; and &ldquo;cloud&rdquo; clusters.&nbsp; The service is later manually scaled to add a worker, which does not fit on the &ldquo;on-prem&rdquo; cluster.&nbsp; SuperSkyRay with&nbsp;<strong>skip-capacity-relocate&nbsp;</strong>reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 2.&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-1-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124758" alt="Picture" style="width:672;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-2-superskyray-revised-schedule-for-rayservice-to-run-across-on-premise-and-cloud-cluster.png?1762124770" alt="Picture" style="width:658;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: SuperSkyRay revised schedule for RayService to run across on-premise and cloud cluster</div> </div></div>  <div class="paragraph">Appendix A contains the details for running this use case on AKS cloud K8s clusters.</div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Auto-Scaling Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">Let&rsquo;s look at an example use case where Nova has placed a Ray online prediction service on an on-premise K8s cluster, as shown in Figure 3 with AKS clusters standing in for &ldquo;on-prem&rdquo; and &ldquo;cloud&rdquo; clusters.&nbsp; The Ray cluster is configured with 0 workers initially.&nbsp; The Ray Serve autoscaler subsequently scales the Ray cluster to 2 GPU workers, only one of which will fit on the on-premise 
cluster.&nbsp; SuperSkyRay reschedules the group non-disruptively by extending the single-cluster placement to a cross-cluster placement, as shown in Figure 4.&nbsp;&nbsp;</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-3-superskyray-initially-scheduled-rayservice-to-run-on-on-premise-cluster.png?1762124944" alt="Picture" style="width:627;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 3: SuperSkyRay initially scheduled RayService to run on on-premise cluster</div> </div></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-2-figure-4-superskyray-revised-schedule-for-rayservice-to-run-across-on-premise-and-cloud-cluster.png?1762124959" alt="Picture" style="width:692;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 4: SuperSkyRay revised schedule for RayService to run across on-premise and cloud cluster</div> </div></div>  <div class="paragraph">Appendix B contains the details for running this use case on AKS cloud K8s clusters.</div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we&rsquo;ve discussed how SuperSkyRay can non-disruptively handle KubeRay Ray apps that outgrow their single K8s cluster placement.&nbsp; SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app&rsquo;s resources without app relocation downtime.&nbsp; We&rsquo;ve presented two example use cases in which a Ray online prediction service running on an on-premise K8s cluster is scaled to add a worker that would not fit on its workload cluster.&nbsp; SuperSkyRay dynamically extends the service to span the on-premise and cloud clusters, supporting the increase in Ray worker count with no application downtime.<br /><br /><strong><span style="color:rgb(54, 54, 54)">In a subsequent blog,&nbsp;</span><a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><font size="3">SuperSkyRay, Part 3</font></a><span style="color:rgb(54, 54, 54)">, we&rsquo;ll present SuperSkyRay&rsquo;s handling of RayService cluster upgrade/reconfigure by rescheduling Ray AI Apps to another cluster.</span></strong><br /><br />Do you have use cases where bursting your Ray workload dynamically across K8s clusters would save you money and/or time?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html" target="_blank">here</a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph" style="text-align:left;"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install 
Nova 1.3.11 (or later) on the clusters, enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><em><br /><br />Run Example Use Case</em><ul><li><em>Place a RayService that fits on one workload cluster, as described <a href="https://drive.google.com/file/d/1hkX815CgtlWtlfQIlp3h5HozJ8nA-CoM/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay places the RayService on one workload cluster</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li><li><em>Manually increase the RayService to request an additional replica that won&rsquo;t fit; increase spec.serveConfigV2.applications.text_summarizer.deployments.num_replicas to 3 and spec.rayClusterConfig.workerGroupSpecs.replicas to 3</em><ul><li><em>SuperSkyRay spreads the existing RayService across the 2 workload clusters</em></li></ul></li><li><em>Manually decrease the RayService to restore the original replica count</em><ul><li><em>SuperSkyRay scales the existing RayService back down to 1 workload cluster</em></li></ul></li></ul><em><br />Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul></div>  <h2 class="wsite-content-title"><font size="4">Appendix B: Example Details</font></h2>  <div class="paragraph" style="text-align:left;"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later as described <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Include 1 Standard_NV36ads_A10_v5 A10 GPU node in each cluster</em></li><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable the --multi-cluster-capacity and --skip-capacity-relocate Nova options, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a>.</em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that initially fits on one workload cluster and then is scaled by Ray Serve to fit on 2 clusters, as described <a href="https://drive.google.com/file/d/1hBpEo6toC3zewAtcLiwTQY1kkS-rU5LX/view?usp=sharing"><u>here</u></a>.</em><ul><li><em>First SuperSkyRay places the RayService on one workload cluster</em></li><li><em>Then SuperSkyRay spreads the existing RayService across the 2 workload clusters</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a 
href="https://drive.google.com/file/d/1I-ZajWFcml6t9dxxnMgWGAF5Cb193Pqa/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/1ilbpiPXe9CjObCeZdmpW9dErsTjRyNdK/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br />&#8203;<strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency]]></title><link><![CDATA[https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency]]></link><comments><![CDATA[https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency#comments]]></comments><pubDate>Sun, 02 Nov 2025 21:09:22 GMT</pubDate><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/superskyray-part-1-running-ray-ai-apps-across-k8s-clusters-for-resource-and-time-efficiency</guid><description><![CDATA[Abstract   This blog presents SuperSkyRay, a name we gave to supporting Ray app execution via KubeRay across Kubernetes (K8s) clusters running the Cilium Cluster Mesh multi-cluster datapath.&nbsp; SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; SuperSkyRay addresses the resource and time inefficiency that occurs when resources needed for Ray apps are fragmented across K8s clusters.   Introduction  Organiz [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Abstract</font></h2>  <span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:185px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/superskyrayblogimage.png?1762126262" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">This blog presents SuperSkyRay, a name we gave to supporting <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> app execution via <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> across <a href="https://kubernetes.io/"><u>Kubernetes (K8s)</u></a> clusters running the <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> multi-cluster datapath.&nbsp; SuperSkyRay uses the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; SuperSkyRay addresses the resource and time inefficiency that occurs when resources needed for Ray apps are fragmented across K8s clusters.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">Organizations using <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> to run the <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> ML platform on <a href="https://kubernetes.io/"><u>K8s</u></a> often have multiple clusters for reasons such as resource availability and cost, service continuity, geo-location, and quality of service.&nbsp; <a href="https://static.sched.com/hosted_files/colocatedeventsna2024/d1/AIDaySkyRay.pdf?_gl=1*1ca9326*_gcl_au*MTQ2ODc3NjAyOC4xNzUwOTUxNzgz"><u>SkyRay</u></a> reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven resource-aware scheduling of Ray apps onto K8s clusters.&nbsp; However, SkyRay does not address the inefficiency that occurs if the desired scale of a Ray app exceeds the spare capacity of any single cluster in the fleet, while at the same time the fleet has sufficient idle resources fragmented across clusters. 
In this case, the app runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed.&nbsp; This inefficiency could be addressed if the Ray app could be run across multiple K8s clusters.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">This blog presents SuperSkyRay, which supports <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> app execution via <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> across <a href="https://kubernetes.io/"><u>K8s</u></a> clusters running the <a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> multi-cluster datapath.&nbsp; SuperSkyRay uses the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation.&nbsp; We describe SuperSkyRay&rsquo;s components and placement operation and then give an example use case running a RayService for prediction across on-premise and cloud clusters.&nbsp; The example achieves better utilization and time-to-results than possible with single-cluster placement in the case that needed resources are fragmented.<br></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Components</font><br></h2>  <h2 class="wsite-content-title"><font size="4">Ray, KubeRay</font></h2>  <div class="paragraph" style="text-align:left;"><a href="https://www.anyscale.com/glossary/what-is-ray"><u>Ray</u></a> is an open-source unified framework designed to simplify the development and scaling of distributed applications, particularly for AI workloads.&nbsp; Ray includes:<ul><li>Ray core: supplies primitives to simplify building and scaling distributed applications.</li><li>Ray AI libraries: support running a variety of distributed ML tasks.</li><li>Ray clusters: provide Ray workers connected to a Ray head for running Ray apps.<br /><br /></li></ul> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay</u></a> handles the creation, deletion, and scaling of Ray clusters, jobs, and services on a K8s cluster. The structure of KubeRay is shown in Figure 1. 
KubeRay supports three K8s Custom Resource Definitions:<br /><br /><ul><li>RayCluster<ul><li>For creating a Ray cluster with the specified resources and attributes.</li></ul></li><li>RayJob<ul><li>For creating a Ray cluster and submitting a job to it when the cluster is ready.</li><li>Can optionally delete the Ray cluster once the job finishes.</li><li>Often used for ML/AI training or batch prediction.</li></ul></li><li>RayService<ul><li>For creating a Ray cluster and running a Ray Serve deployment graph.</li><li>Offers zero-downtime upgrades, high availability, and <a href="https://docs.ray.io/en/latest/serve/autoscaling-guide.html#ray-serve-autoscaling"><u>Ray Serve autoscaling</u></a>.</li><li>Often used for ML/AI online serving.</li></ul></li></ul> KubeRay deployments can optionally also include the <a href="https://docs.ray.io/en/latest/cluster/key-concepts.html#autoscaling"><u>Ray Autoscaler</u></a>, which automatically adds and removes worker nodes from a Ray cluster based on resource requests.</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-1-kuberay-structure.png?1762125087" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 1: KubeRay Structure</div> </div></div>  <h2 class="wsite-content-title"><font size="4">Nova, SkyRay</font><br></h2>  <div class="paragraph" style="text-align:left;"><a href="https://www.elotl.co/nova.html"><u>Nova</u></a> is a K8s workload fleet manager that schedules groups of K8s objects onto K8s workload clusters, according to policies and available capacity.&nbsp; Cluster selection can utilize cluster names, labels, attributes, priorities, and available capacity, and placement can handle single or duplicate workload group instances with optional customization per instance.&nbsp; A Nova workload group placed using an available-capacity policy is gang-scheduled, meaning no member is scheduled until the entire group can fit.&nbsp; We note that <a href="https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads"><u>Nova interoperates with cluster autoscalers</u></a>, including the K8s Cluster Autoscaler and <a href="https://www.elotl.co/luna.html">Luna</a>, and optionally supports just-in-time workload clusters, allowing K8s clusters to scale to 0 or be removed when idle and restored/recreated or cloned when needed.&nbsp; The structure of Nova is shown in Figure 2.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-2-nova-structure.png?1762125646" alt="Picture" style="width:684;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 2: Nova Structure</div> </div></div>  <div class="paragraph" style="text-align:left;">In 2024, we introduced <a href="https://static.sched.com/hosted_files/colocatedeventsna2024/d1/AIDaySkyRay.pdf?_gl=1*1ca9326*_gcl_au*MTQ2ODc3NjAyOC4xNzUwOTUxNzgz"><u>SkyRay</u></a> to extend KubeRay from a single K8s cluster to multi-cluster multi-cloud operation via interoperation with the <a href="https://www.elotl.co/nova.html"><u>Nova</u></a> policy-driven resource-aware fleet manager.&nbsp; Nova automatically 
selects each Ray app&rsquo;s target K8s cluster, on which KubeRay handles the app.&nbsp; To set up SkyRay, Nova is used with a spread/duplicate policy to deploy KubeRay and its CRDs onto all of its workload clusters, so each cluster is KubeRay-enabled.&nbsp; Then, whenever a KubeRay CR is submitted to Nova for placement, Nova applies the policy relevant to that CR to select a workload cluster, on which KubeRay deploys and monitors the associated Ray pods.&nbsp; We note that Nova recognizes the Ray CRs and can determine their resource needs, so Nova can do available-capacity placement of Ray objects.&nbsp; The structure of SkyRay is shown in Figure 3.<br></div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-3-skyray-structure.png?1762125314" alt="Picture" style="width:970;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 3: SkyRay Structure</div> </div></div>  <h2 class="wsite-content-title"><font size="4">Cluster Mesh, SuperSkyRay</font><br></h2>  <div class="paragraph" style="text-align:left;"><a href="https://cilium.io/use-cases/cluster-mesh/"><u>Cilium Cluster Mesh</u></a> joins multiple K8s clusters into a unified network, regardless of the K8s distribution or location of each cluster.&nbsp; Cluster Mesh can combine services running across K8s clusters, allowing service workers to be spread across clusters.&nbsp; To do this, Cluster Mesh requires that such services be marked with the annotation <em>service.cilium.io/global: "true"</em>.
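&nbsp; For example, a globally-joined Ray head service manifest might look like the following minimal sketch (the name, selector, and port here are illustrative assumptions, not taken from the example manifests):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Service
metadata:
  name: raycluster-head-svc            # illustrative name
  annotations:
    service.cilium.io/global: "true"   # lets Cluster Mesh merge this service across clusters
spec:
  type: NodePort                       # gives the head service an addressable IP/port
  selector:
    ray.io/node-type: head             # illustrative selector
  ports:
  - name: dashboard
    port: 8265
    targetPort: 8265
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">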
href="https://github.com/elotl/skyray/blob/main/supernova-examples/ray-job.text-summarizer.cpu.snova.1.3.2.yaml#L12"><u>here</u></a>.&nbsp; Also, the Ray cluster head service must have an IP; for recent KubeRay releases, either the ENABLE_RAY_HEAD_CLUSTER_IP_SERVICE option must be set or the service must be configured to use (say) the NodePort type.</li><li>To allow SuperSkyRay to do the Ray worker updates needed for cross-cluster operation, the Ray autoscaler must be enabled (example <a href="https://github.com/elotl/skyray/blob/main/supernova-examples/ray-job.text-summarizer.cpu.snova.1.3.2.yaml#L13"><u>here</u></a>), even for fixed size Ray clusters.&nbsp; Enabling Ray autoscaler instructs KubeRay that Ray cluster worker nodes are externally managed so that KubeRay refrains from doing Ray worker node scaling operations.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Cross-Cluster Placement Operation</font><br></h2>  <div class="paragraph" style="text-align:left;">SuperSkyRay cross-cluster placement operates as follows:<ul><li>When SuperNova chooses cross-cluster placement of a Ray app, the Nova control plane places the Ray object manifest into the Nova scheduling configmap for the K8s cluster on which the Ray cluster head is slated to run.</li><li>The workload cluster Nova agent schedule controller that monitors that Nova scheduling configmap then deploys the Ray object manifest onto its workload cluster.</li><li>The KubeRay instance on that cluster materializes the K8s deployments, services, and jobs associated with that Ray object.</li><li>The Nova agent status controller running on that cluster detects Ray cluster worker pods that are pending in the cluster and that are intended to be scheduled on another cluster.&nbsp; It replaces those pods with placeholder pods to satisfy KubeRay&rsquo;s Ray cluster goal state; without placeholder pods, KubeRay will not transition the Ray cluster to the ready state. 
It then places the manifests for those worker pods into the Nova scheduling configmap of the K8s cluster on which they were intended to run.</li><li>The Nova agent status controller also detects head services for Ray clusters with cross-cluster placement and duplicates the manifest of those services into the Nova scheduling configmap of the other clusters that will host Ray workers, as required by Cilium Cluster Mesh to combine cross-cluster services.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">SuperSkyRay Example Use Case</font><br></h2>  <div class="paragraph" style="text-align:left;">An example SuperSkyRay use case involves running large-scale prediction across on-premise and cloud clusters for better utilization and time-to-results than single-cluster placement.&nbsp; This use case, called &ldquo;<a href="https://www.ciscolive.com/c/dam/r/ciscolive/global-event/docs/2025/pdf/CENDCN_1399.pdf"><u>AI Workload Cloud Bursting</u></a>&rdquo;, was presented at Cisco Live 2025.<br /><br />In Appendix A, we describe how to run a simplified version of this use case using only AKS cloud K8s clusters, for ease of trial.&nbsp; The outcome of the simplified placement is depicted in Figure 5.&nbsp; A demo of the scenario is available <a href="https://drive.google.com/file/d/1jUnLmHJuwqC5T6WLdTvpzMr7wJojfDmk/view?usp=drive_link"><u>here</u></a>.</div>  <div><div class="wsite-image wsite-image-border-none " style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"> <a> <img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/part-1-figure-5-superskyray-cross-cluster-rayservice-placement.png?1762125614" alt="Picture" style="width:auto;max-width:100%" /> </a> <div style="display:block;font-size:90%">Figure 5: SuperSkyRay cross-cluster RayService placement</div> </div></div>  <h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2>  <div class="paragraph" style="text-align:left;">In this blog, we described the components and operation of SuperSkyRay.&nbsp; We presented an example use case it enables, which involves running a RayService for prediction across a fleet comprising an on-premise and a cloud K8s cluster.&nbsp; The Ray app doesn&rsquo;t fit on either K8s cluster, but can fit using the spare resources on both clusters.&nbsp; SuperSkyRay schedules it across the clusters, increasing utilization and reducing time-to-results relative to single-cluster placement.<br /><br /><strong>In subsequent blogs,&nbsp;<font size="3"><a href="https://www.elotl.co/blog/superskyray-part-2-scaling-ray-ai-apps-across-k8s-clusters-for-no-downtime-resource-increases">SuperSkyRay, Part 2</a>&nbsp;&amp;&nbsp;</font><a href="https://www.elotl.co/blog/superskyray-part-3-rescheduling-ray-ai-apps-between-k8s-clusters-for-rayservice-cluster-upgradereconfigure"><font size="3">SuperSkyRay, Part 3</font></a>, we&rsquo;ll present SuperSkyRay&rsquo;s handling of dynamic Ray app use cases, including scaling an online on-premise prediction service to add a cloud cluster worker without migration downtime, and bursting to another cluster to facilitate update of a running Ray service.</strong><br /><br />Do you have use cases where bursting your Ray workload across K8s clusters would save you money and/or time?&nbsp; Cilium Cluster Mesh is open-source and a free trial version of Nova is available <a href="https://www.elotl.co/nova-free-trial.html"><u>here</u></a>.&nbsp; Please give SuperSkyRay a try and let us know how it goes!</div>  <h2 
class="wsite-content-title"><font size="4">Appendix A: Example Details</font></h2>  <div class="paragraph"><em>Setup SuperSkyRay</em><ul><li><em>Allocate 2 AKS cloud K8s clusters to serve as Nova workload clusters, joined w/Cilium Cluster Mesh 1.17.4 or later, as described in cheat-sheet <a href="https://drive.google.com/file/d/1MdmQq9lngIiDPJix9w1DwRKTcwl1_xbJ/view?usp=sharing"><u>here</u></a></em><ul><li><em>Have 1 more AKS cloud K8s cluster available to host the Nova Control Plane.</em></li></ul></li><li><em>Install Nova 1.3.11 (or later) on the clusters, enable --multi-cluster-capacity, as described in cheat-sheet <a href="https://drive.google.com/file/d/1nK4DcVSlImeg6CziG2tEGJz68vrmnYU_/view?usp=sharing"><u>here</u></a></em></li><li><em>Deploy KubeRay in the SkyRay Configuration, as described in cheat-sheet <a href="https://drive.google.com/file/d/1Uqp9K9WSHiEvW1d_5f5tGzL_Rmxsh7Kj/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Run Example Use Case</em><ul><li><em>Place a RayService that won't fit on one workload cluster, but does fit on 2, as described in cheat-sheet <a href="https://drive.google.com/file/d/1x9-cHSlUx7LCRSSEE32Wo4kFDil_Zesf/view?usp=sharing"><u>here</u></a></em><ul><li><em>SuperSkyRay will spread the RayService across the 2 workload clusters</em></li></ul></li><li><em>Interact with the RayService, as described in the cheat-sheet <a href="https://drive.google.com/file/d/156xQCHaj-mzuDMv-FgWfYbVRjPpVnp_I/view?usp=sharing"><u>here</u></a></em></li></ul><br /><em>Cleanup</em><ul><li><em>Please see the cheat-sheet <a href="https://drive.google.com/file/d/15UlQy462LrSqlAyaHFgiL__CvWI80iV0/view?usp=sharing"><u>here</u></a></em></li></ul><br /><br /><strong>Authors:</strong><br />Anne Holler (Chief Scientist, Elotl)<br />Liz Rice (Chief Open Source Officer, Isovalent at Cisco)<br /><br /><strong style="color:rgb(54, 54, 54)">Contributors:</strong><br /><span style="color:rgb(54, 54, 54)">Dan Wendlandt (Co-Founder, Isovalent at Cisco)</span><br /><span style="color:rgb(54, 54, 54)">Nicholas Lane (Principal Solutions Architect, Isovalent at Cisco)</span></div>]]></content:encoded></item><item><title><![CDATA[Avoiding AI Workload Cloud Sticker Shock]]></title><link><![CDATA[https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock]]></link><comments><![CDATA[https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock#comments]]></comments><pubDate>Thu, 25 Sep 2025 13:22:43 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/avoiding-ai-workload-cloud-sticker-shock</guid><description><![CDATA[Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing ExpensesWhile running AI workloads on cloud K8s clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills.&nbsp; And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices.&nbsp; In this blog, we introduce the Luna Smart Clu [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="3">Using the Cost Estimation Feature in the Luna K8s Smart Autoscaler to Preview and Tune AI Workload Cloud Computing Expenses</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/avoiding-ai-workload-cloud-sticker-shock.png?1758806723" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">While running AI workloads on cloud <a href="https://kubernetes.io/"><u>K8s</u></a> clusters can make resource scaling seamless, it can also lead to the sticker shock of unexpectedly high cloud bills.&nbsp; And tuning AI workload resource allocation for usage increases can be unintuitive and inefficient, given the idiosyncrasies of cloud vendor node types and prices.&nbsp; In this blog, we introduce the Luna Smart Cluster Autoscaler Cost Estimation feature for estimating the node cost of pods before they run.&nbsp; We show how Luna's node cost estimation feature avoids AI workload sticker shock and facilitates assessing strategies for AI workload scaling.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">INTRODUCTION</font><br></h2><div class="paragraph" style="text-align:left;">Kubernetes (K8s) cluster autoscalers can reduce cloud computing expenses by allocating nodes when needed and removing them when no longer needed.&nbsp; For expensive workloads like AI, getting an estimate of the hourly cost before the workload is scheduled can help prevent cloud sticker shock.&nbsp; Also, getting estimated costs helps in configuring the workload to optimize expenses when planning for future growth.&nbsp; Estimated costs can be used to assess the monetary impact of choices such as workload size, GPU SKU and/or instance family selection, and on-demand versus spot pricing.<br><br>The <a href="https://www.elotl.co/luna.html"><u>Luna Smart Autoscaler</u></a> for cloud K8s recently added support for providing node hourly cost estimation.&nbsp; For Luna-managed pods whose scheduling readiness is blocked by <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/"><u>K8s scheduling gates</u></a>, if the gates include <em>nodecostestimate</em>, Luna reports a pod event that indicates the node type it would allocate were the pod schedulable, with the type's estimated hourly compute cost.&nbsp;&nbsp;<br><br>In this blog, we present an overview of Luna's cost estimation feature.&nbsp; We next use the feature to preview the estimated baseline cost of an LLM serving workload running on <a href="https://aws.amazon.com/pm/eks/"><u>Amazon AWS EKS</u></a>, <a href="https://cloud.google.com/kubernetes-engine?hl=en"><u>Google GCP GKE</u></a>, and <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>Microsoft Azure AKS</u></a> cloud K8s clusters.&nbsp; We discuss how cost estimation can be used to guide tuning the costs of 
scaling the workload as its usage increases, with the clouds showing significant cost differences for potential workload scaling strategies.&nbsp; We show estimated on-demand costs for EKS, GKE, and AKS, as well as estimated spot costs for EKS.&nbsp; Note that the estimated costs that Luna reports are public prices, and do not reflect customer discounts or special pricing.<br></div><h2 class="wsite-content-title"><font size="5">OVERVIEW OF LUNA COST ESTIMATION</font><br></h2><div class="paragraph" style="text-align:left;">The Luna Smart Autoscaler allocates nodes for pending pods marked for Luna management.&nbsp; As shown in Figure 1, Luna node allocation supports both bin-packing, in which nodes are allocated to host multiple small generic pods, and bin-selection, in which nodes are allocated to host larger pods or pods with special requirements.&nbsp; Luna chooses the lowest-cost node type that satisfies the pod's resource requests and node type selection constraints, if any.&nbsp; Luna supports a variety of selection constraints, including on instance type (include/exclude instance family, match regular expression), maximum instance cost, GPU SKUs, maximum GPU count, and pricing category (on-demand, spot, or either).&nbsp; Also, if Luna encounters a transient problem when allocating a node type in a pricing category, e.g., cloud capacity stock out, cloud account quota exhausted, node scale-up time limit exceeded, etc., it backs off from allocating that node type and pricing category combination for a configurable period, and proceeds to try to allocate the next cheapest node type.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-diagram.png?1758828865" alt="Picture" style="width:560;max-width:100%"></a><div style="display:block;font-size:90%">Figure 1: Luna dynamic node allocation using bin-packing and bin-selection</div></div></div><div class="paragraph" style="text-align:left;">K8s support for Pod Scheduling Readiness controlled by <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-scheduling-readiness/"><u>schedulingGates</u></a> became stable in v1.30.&nbsp; When a pod has schedulingGates, it is not considered for placement by KubeScheduler or any K8s cluster autoscalers (including Luna), until/unless its schedulingGates are removed.&nbsp; Luna was recently updated to recognize the <em>nodecostestimate</em> scheduling gate; for example:<br></div><div><div id="456161409316250927" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: busyboxbp
  labels:
    elotl-luna: "true"
spec:
  schedulingGates:
  - name: "nodecostestimate"
  containers:
  - name: busyboxbp
  &lt;snip&gt;
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">When a pod marked for Luna management includes the <em>nodecostestimate</em> scheduling gate, Luna determines the node type it would choose if that pod were not currently gated, and reports that type, its cost, and the count of nodes of that type Luna would allocate for the set of matching gated pods, in a NodeCostEstimate pod event.&nbsp; Figure 2 shows an event for a pod in a set of 3 small bin-packed pods, which Luna expects to run together on a single node.&nbsp; Figure 3 gives an event for a pod in a set of 3 bin-select pods, which Luna expects to run on 3 separate nodes.&nbsp; Run <a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/listNodeCostEstimateEvents.sh"><u>this script</u></a> to get all NodeCostEstimate pod events.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/avoiding-ai-workload-cloud-sticker-shock-figure-2_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 2: NodeCostEstimate Pod Event reported for deployment of 3 bin-packed pods on GKE</div></div></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/avoiding-ai-workload-cloud-sticker-shock-figure-3_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Figure 3: NodeCostEstimate Pod Event reported for deployment of 3 bin-selected pods on GKE</div></div></div>
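<div class="paragraph" style="text-align:left;">Once the estimate looks acceptable, the pod can be released for scheduling by removing its gate.&nbsp; One way to do that (a sketch, using the busyboxbp pod from the example above) is a JSON patch:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Remove the schedulingGates field so that KubeScheduler and Luna consider the pod
kubectl patch pod busyboxbp --type=json \
  -p='[{"op": "remove", "path": "/spec/schedulingGates"}]'
</code></pre></div></div></div></div>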
<div class="paragraph" style="text-align:left;">To control the event-reporting overhead for the cost estimate, Luna only generates and reports a cost estimate pod event for pods not already having such an event.&nbsp; A new pod cost estimate event is generated if the existing event is removed, e.g., due to retention policy (pod events are retained for 1 hour by default) or to explicit deletion.<br><br>Luna's node cost estimate may over- or under-shoot the actual cost if a pod's schedulingGates are removed and the pod is scheduled for execution.&nbsp; The estimate does not take into account that the pod might be able to share an existing running node, with either bin-packing or bin-selection with node-reuse enabled (the default).&nbsp; For these cases, KubeScheduler would handle pod placement and the pod would not need node allocation by Luna.&nbsp; Also, the estimate does not take into account that node type availability at scheduling time may differ from that at estimation time.&nbsp; If any Luna node type back-offs were in effect at estimation time, but are no longer in effect at scheduling time, cheaper node types may be selected.&nbsp; If some node type back-offs were not in effect at estimation time, but are triggered at scheduling time, more expensive node types may be chosen.&nbsp; Note that in general Luna supports capping the cost of a node allocated for bin-selection via the pod annotation <em>node.elotl.co/instance-max-cost</em>.</div>
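<div class="paragraph" style="text-align:left;">For illustration, a minimal sketch of a pod carrying that cost-cap annotation (the pod name and the dollar value are assumptions; see the Luna docs for the exact value format):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: capped-gpu-pod                         # illustrative name
  labels:
    elotl-luna: "true"                         # mark the pod for Luna management
  annotations:
    node.elotl.co/instance-max-cost: "5.00"    # assumption: hourly cost cap in USD
</code></pre></div></div></div></div>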
<h2 class="wsite-content-title"><font size="5">USING LUNA COST ESTIMATION TO ASSESS LLM SERVING CONFIGURATIONS</font><br></h2><div class="paragraph" style="text-align:left;">As an AI workload example, we consider the placement of a <a href="https://github.com/ray-project/kuberay"><u>KubeRay</u></a> 1.4.2 <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html"><u>RayService</u></a> serving an LLM model.&nbsp; We use the model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><em><u>microsoft/Phi-3-mini-4k-instruct</u></em></a>, which runs successfully on mid-tier NVIDIA GPU SKUs such as L4, A10G, A10, and L40S.&nbsp; The baseline workload config is given <a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml"><u>here</u></a>, comprising a CPU-only head requesting 2 CPUs and 16 GB memory, and 2 GPU-enabled workers, each requesting 16 CPUs, 16 GB memory, and 1 NVIDIA GPU.&nbsp; Given the pods' resource requirements, Luna assigns a node for each pod (bin-selection); it is also possible to configure Luna to assign multiple GPU pods per node (bin-packing) for this case.<br><br>We examine Luna's estimated costs for the baseline configuration on EKS, GKE, and AKS cloud K8s clusters to illustrate the value of getting that visibility before running the workload.&nbsp; We then consider the costs of several strategies for scaling up the workload's processing capacity, i.e., increasing the worker count, or maintaining the same worker count while either increasing each worker's GPU count or allocating a more powerful GPU device.&nbsp; These costs can be used to guide workload scaling performance evaluation testing.&nbsp; We observe that these strategies have significantly different costs across clouds.<br><br>We note that all node cost estimate experiments were run without any Luna resource availability back-offs in effect, meaning that the estimates assume sufficient cloud stock and user quota for the selected node types.&nbsp; While availability issues can occur, particularly for popular instance types, obtaining cost estimates that represent the preferred node types is useful since it can steer region choice and quota setting in accordance with acquiring those node types.<br></div><h2 class="wsite-content-title"><font size="5">AWS EKS LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the AWS node cost estimate experiments using Luna v1.3.3 on an EKS 1.33 cluster in <em>us-west-2</em>.&nbsp; The results for on-demand pricing are given in Table 1, with links given to associated yaml configurations.&nbsp; It is useful to see the baseline costs in advance to avoid sticker shock; this baseline workload would cost ~$715/week.<br><br>Also, it is helpful to see the potential costs of scaling up the workload.&nbsp; Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the relatively low price of the <em>g6.12xlarge</em> type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth exploring.&nbsp; The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, but that cost reflects that the A100 is only available in instances with 8x GPUs.&nbsp; The per-GPU cost of the A100 is $2.8012/hr, which is ~40% ($2.8012/$2.0144) higher than the L4, so if the workload scale can use all 8 GPUs, the config is worth considering, given A100's faster floating point and larger memory (40 vs 24 GB).<br><br></div><div><div id="428541626203596083" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr 
style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.8xlarge</td><td>2.0144</td><td>1x L4</td><td>2</td><td>4.2548</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.8xlarge</td><td>2.0144</td><td>1x L4</td><td>4</td><td>8.2836</td><td>1.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>g6.12xlarge</td><td>4.6016</td><td>4x L4</td><td>1</td><td>4.8276</td><td>1.14</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>r5a.xlarge</td><td>0.2260</td><td>p4d.24xlarge</td><td>22.1836</td><td>8x A100</td><td>1</td><td>22.4096</td><td>5.27</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 1: Luna On-Demand Node Cost Estimate Experiments run on EKS 1.33 cluster<br></div><div class="paragraph" style="text-align:left;">We repeated the node cost estimate experiments using spot pricing.&nbsp; The results are given in Table 2, with the "Ratio over baseline" compared to the baseline value in Table 1, to facilitate comparing spot with on-demand prices.&nbsp; We ran with the Luna <a href="https://docs.elotl.co/luna/Configuration/#use-of-spot-instance-advisor-on-aws-eks"><em><u>aws.useSpotAdvisor</u></em></a> option set true, meaning that Luna used the <a href="https://aws.amazon.com/ec2/spot/instance-advisor/"><u>AWS spot instance advisor</u></a> data to estimate spot prices. 
Spot instance advisor provides the average spot discount for the region and instance type over the last 30 days, and also includes the average frequency of spot reclamation interruptions, which can be used to constrain Luna spot node type selection.<br><br>The spot prices in Table 2 are roughly half of the on-demand prices in Table 1, which is nice.&nbsp; However, the spot advisor data (viewable via the AWS tool link in the previous paragraph or in Luna verbose logs) indicates that all 3 GPU-enabled node types are in the highest frequency interruption bucket, meaning a 20%+ risk of node reclamation during use.&nbsp; When configured to use spot advisor data, Luna supports the <em>aws.maxSpotInterruptBucket</em> option to constrain spot selection by maximum spot interrupt bucket for managing risk, and the <em>aws.maxSpotPriceRatio</em> option to constrain spot selection for ensuring sufficient savings; both constraints apply whether Luna is estimating node costs or allocating nodes.<br></div><div><div id="929439362753681425" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.spot.yaml">Spot: 2 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.8xlarge</td><td>0.9871</td><td>1x L4</td><td>2</td><td>2.0601</td><td>0.48</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.spot.yaml">Spot Scale: 4 1-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.8xlarge</td><td>0.9871</td><td>1x L4</td><td>4</td><td>4.0343</td><td>0.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.spot.yaml">Spot Scale: 2 2-GPU workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>g6.12xlarge</td><td>2.2548</td><td>4x L4</td><td>1</td><td>2.3407</td><td>0.55</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.spot.yaml">Spot Scale: 2 1-GPU A100 workers</a></td><td>r5a.xlarge</td><td>0.0859</td><td>p4d.24xlarge</td><td>9.6614</td><td>8x A100</td><td>1</td><td>9.7473</td><td>2.29</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 2: Luna Spot Node Cost Estimate Experiments run on EKS 1.33 cluster<br></div>
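<div class="paragraph" style="text-align:left;">To make the spot setup concrete: the estimates above are the on-demand prices discounted by the spot advisor's reported average discount; for example, g6.8xlarge at $2.0144/hr on-demand and $0.9871/hr spot implies a ~51% discount.&nbsp; Below is a hedged sketch of the three Luna options named above, written as Helm-style values (the constraint values shown are assumptions; consult the linked Luna configuration docs for exact formats):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">aws:
  useSpotAdvisor: true          # estimate spot prices from AWS spot instance advisor data
  maxSpotInterruptBucket: 2     # assumption: cap on the acceptable interruption-frequency bucket
  maxSpotPriceRatio: 0.6        # assumption: only choose spot when &lt;= 60% of the on-demand price
</code></pre></div></div></div></div>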
<h2 class="wsite-content-title"><font size="5">GCP GKE LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the GCP GKE node cost estimate experiments using Luna v1.3.3 on a regional GKE 1.33.3 cluster in <em>us-central1</em>.&nbsp; The results for on-demand pricing are given in Table 3.&nbsp; This baseline workload would cost ~$613/week.<br><br>Again, it is also helpful to see the potential costs of scaling up the workload.&nbsp; Both the first and second "Scale" configuration rows involve 4 L4 GPUs, but the lower price for the <em>g2-standard-24</em> type makes it a less costly way to obtain those 4 GPUs; workload scaling performance evaluation with that configuration seems worth checking.&nbsp; The third scale row shows that upgrading the GPU SKU to the A100 would be expensive, with the A100 per-GPU cost of 7.3390 being significantly higher than the L4 per-GPU cost of 1.7343 (unlike on EKS), so unless the A100 provides much better performance, switching to it is not economical.<br></div><div><div id="870175329170061983" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-32</td><td>1.7343</td><td>1x L4</td><td>2</td><td>3.6494</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-32</td><td>1.7343</td><td>1x L4</td><td>4</td><td>7.1180</td><td>1.95</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>g2-standard-24</td><td>2.0008</td><td>2x L4</td><td>2</td><td>4.1968</td><td>1.15</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>e2-highmem-4</td><td>0.1808</td><td>a2-highgpu-2g</td><td>7.3390</td><td>1x A100</td><td>1</td><td>14.8588</td><td>4.07</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 3: Luna On-Demand Node Cost Estimate Experiments run on GKE 1.33 cluster<br></div><h2 class="wsite-content-title"><font size="5">AZURE AKS LUNA NODE COST ESTIMATE EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">We ran the Azure AKS node cost estimate experiments using Luna v1.3.3 on an AKS 1.32.6 cluster in <em>eastus</em>. 
The results for on-demand pricing are given in Table 4.&nbsp; This baseline workload would cost ~$1113/week.<br><br>Again, it is helpful to see the potential costs of scaling the workload.&nbsp; Both the first and second "Scale" configuration rows include 4 A10 GPUs, and the pricing is comparable, unlike the case on EKS and GKE.&nbsp; And the third row shows that upgrading the GPU SKU to the A100 would not be very expensive, so it is worth evaluating the workload's scaling performance for that config.<br></div><div><div id="508347011412675186" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 20%;">Configuration</th><th style="width: 10%;">Head Node Type</th><th style="width: 10%;">Head Node $/hr</th><th style="width: 10%;">Worker Node Type</th><th style="width: 10%;">Worker Node $/hr</th><th style="width: 10%;">Worker Node GPU SKU</th><th style="width: 10%;">Worker Node Count</th><th style="width: 10%;">Total Cost $/hr</th><th style="width: 10%;">Ratio over baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.yaml">Baseline: 2 1-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV36ads_A10_v5</td><td>3.2000</td><td>1x A10</td><td>2</td><td>6.6260</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.4workers.yaml">Scale: 4 1-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV36ads_A10_v5</td><td>3.2000</td><td>1x A10</td><td>4</td><td>13.0260</td><td>1.97</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.2gpus.yaml">Scale: 2 2-GPU workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NV72ads_A10_v5</td><td>6.5200</td><td>2x A10</td><td>2</td><td>13.2660</td><td>2.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm-serve.schedgate.a100.yaml">Scale: 2 1-GPU A100 workers</a></td><td>E4as_v5</td><td>0.2260</td><td>NC24ads_A100_v4</td><td>3.6730</td><td>1x A100</td><td>2</td><td>7.5720</td><td>1.14</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:center;">Table 4: Luna On-Demand Node Cost Estimate Experiments run on AKS 1.32 cluster<br></div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">In this blog, we've described the cost estimation feature in the Luna Smart Cluster Autoscaler and shown how it can be used to avoid cloud sticker shock.&nbsp; We've discussed how it can guide cost-aware workload configuration when considering future workload scale increases, with large differences between scale strategies observed across cloud vendors.&nbsp; In an upcoming blog, we'll describe how the Luna cost estimation feature can be used with the <a href="https://www.elotl.co/nova.html"><u>Nova multi-cluster manager</u></a> to choose the K8s cluster on which to run an AI workload at the lowest price.<br><br>Have you experienced cloud sticker shock?&nbsp; Do you have ways you'd like to use estimated node pricing for workload resource planning activities?&nbsp; Please try Luna and let us know how it goes!&nbsp; A 
free trial download version is available <a href="https://www.elotl.co/luna-free-trial.html">here</a>.<br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Elotl receives investment from Cisco Investments to accelerate AI-ready Infra for Multi-Cloud Era]]></title><link><![CDATA[https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era]]></link><comments><![CDATA[https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era#comments]]></comments><pubDate>Thu, 14 Aug 2025 14:12:03 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.elotl.co/blog/elotl-receives-investment-from-cisco-investments-to-accelerate-ai-ready-infra-for-multi-cloud-era</guid><description><![CDATA[We are excited to announce an investment from Cisco Investments to accelerate AI-ready Infra for Enterprise AI platform teams!AI software stacks have standardized on top of Kubernetes. Elotl&rsquo;s enterprise-grade battle-tested Luna provisions just-in-time right-sized compute for Kubernetes. Luna prevents wasted GPU spend for AI workloads along with simplifying operations.Enterprise AI must meet response time SLAs before going live. Since expensive accelerators like GPUs are in short supply, w [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><span><span style="color:rgb(0, 0, 0)">We are excited to announce an investment from Cisco Investments to accelerate AI-ready Infra for Enterprise AI platform teams!</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">AI software stacks have standardized on top of Kubernetes. Elotl&rsquo;s enterprise-grade battle-tested </span><a href="https://www.elotl.co/luna.html"><span style="color:rgb(17, 85, 204)">Luna</span></a><span style="color:rgb(0, 0, 0)"> provisions just-in-time right-sized compute for Kubernetes. Luna prevents wasted GPU spend for AI workloads along with simplifying operations.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">Enterprise AI must meet response time SLAs before going live. Since expensive accelerators like GPUs are in short supply, waiting to source compute from a single region/datacenter/hyperscaler/neocloud would jeopardize AI business SLAs. Kubernetes platform teams need to dynamically source compute from multiple regions and cloud providers to be AI ready. This calls for a federated compute fabric spanning across on-prem datacenters, hyperscalers, and neoclouds. </span><a href="https://www.elotl.co/nova.html"><span style="color:rgb(17, 85, 204)">Elotl Nova</span></a><span style="color:rgb(0, 0, 0)"> is a policy-driven federated compute fabric that commoditizes Kubernetes clusters across regions and cloud providers.</span></span><br /><br /><span><span style="color:rgb(0, 0, 0)">As AI workloads scale, the need for robust, secure, and scalable networking becomes just as critical as compute. Through the acquisition of Isovalent in 2024, Cisco added the industry standard for Kubernetes networking and security, including technologies like Cilium and Tetragon, to its solutions for enterprise AI and cloud-native environments. 
These technologies are now foundational for enterprises running cloud-native and AI workloads on Kubernetes, providing the networking, security, and observability capabilities needed to support dynamic, distributed environments.<br /></span></span><br /><span><span style="color:rgb(0, 0, 0)">At Elotl, we&rsquo;re committed to helping enterprises focus on building AI solutions while we take care of infrastructure complexity. With Cisco&rsquo;s investment and the strength of its industry-leading technologies, organizations can accelerate innovation and confidently run AI across multi-cloud environments. Here is a demo of cloud bursting AI workloads from on-prem datacenter to Azure using Nova, Cilium Cluster Mesh, and Hubble:</span></span></div>  <div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"> <div class="wsite-youtube-container">  <iframe src="//www.youtube.com/embed/7_dM35hViCA?wmode=opaque" frameborder="0" allowfullscreen></iframe> </div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">If you are interested in using Luna and/or Nova for your self-hosted training/inference/batch initiatives, please reach out at </span><a href="mailto:info@elotl.co"><span style="color:rgb(17, 85, 204)">info@elotl.co</span></a></span><br /><br /><strong style="color:rgb(54, 54, 54)">Author:&nbsp;</strong><span style="color:rgb(54, 54, 54)">Madhuri Yechuri</span><br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Right-Sizing Your Kubernetes Pods with a Custom VPA Tracker]]></title><link><![CDATA[https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker]]></link><comments><![CDATA[https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker#comments]]></comments><pubDate>Thu, 31 Jul 2025 18:16:19 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker</guid><description><![CDATA[The Kubernetes Vertical Pod Autoscaler (vpa) provides near-instantaneous recommendations for CPU and memory requests for a pod. It can be used either as a read-only or as a fully automated recommender, where pods are mutated with the recommended requests.&nbsp;When a cluster operator is considering whether or not to use VPA for a specific workload, it is helpful to simply monitor and visualize both VPA recommendations along with actual resource usage over a test period, before using it in an aut [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:215px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/right-sizing-your-kubernetes-pods-with-a-custom-vpa-tracker.png?1753985915" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">The <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/#scaling-workloads-vertically"><u>Kubernetes Vertical Pod Autoscaler</u></a> (vpa) provides near-instantaneous recommendations for CPU and memory requests for a pod. It can be used either as a read-only or as a fully automated recommender, where pods are mutated with the recommended requests.&nbsp;<br><br>When a cluster operator is considering whether or not to use VPA for a specific workload, it is helpful to simply monitor and visualize both VPA recommendations along with actual resource usage over a test period, before using it in an automated fashion. In this blog, we illustrate how we can track VPA operation over such a test period using a popular open-source <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack"><u>monitoring and visualization stack for Kubernetes</u></a> (which includes Prometheus and Grafana).<br></div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="5">Motivation for VPA tracking</font><br></h2><div class="paragraph" style="text-align:left;">Kubernetes VPA can be used in two primary update modes: Off (read-only mode) and Auto (aka Recreate). In the Off mode, the VPA custom resource provides near-instantaneous recommendations for suitable values of CPU and memory requests for pods in various types of Kubernetes resources - such as deployments, jobs, daemonsets, etc. Workload administrators can use these recommendations to manually update pod requests. Given below is an example of CPU and memory recommendations within a VPA custom resource object.</div><div><!--BLOG_SUMMARY_END--></div><div><div id="112101802912229391" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Recommendation:
  Container Recommendations:
    Container Name:  workload-c...
    Target:
      Cpu:     587m
      Memory:  262144k...
</code></pre></div></div></div></div>
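<div class="paragraph" style="text-align:left;">For context, these recommendations appear under the Status of a VPA object like the following minimal sketch, here in read-only mode (the object name is illustrative; the target deployment follows this post&rsquo;s workload-c example):</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: workload-c-vpa        # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: workload-c          # the sample workload used in this post
  updatePolicy:
    updateMode: "Off"         # read-only: recommend, but do not mutate pods
</code></pre></div></div></div></div>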
<div class="paragraph" style="text-align:left;">As a pod&rsquo;s resource usage changes, VPA recommendations also get updated based on resource utilization data. So, if a cluster administrator wanted to observe these recommendations over time and then use them to manually choose the right value for their pods, it is not possible to do so out-of-the-box with VPA. We would need a way to run a specific workload, managed by VPA, over a sufficient period of time, then use a monitoring tool like Prometheus to collect both (a) resource usage from the pod and (b) Target recommendations from the VPA object. A visualization tool like Grafana can then be used to visually inspect these values over the test period. At periodic intervals, the maximum recommendation from VPA can then be used to manually update a pod&rsquo;s manifest - which can then be redeployed via appropriate rolling update techniques on the cluster.&nbsp;<br><br>Let&rsquo;s look into each of the components needed for this VPA tracker and the steps involved in setting up the monitoring and visualization stack for this example workload.<br></div><h2 class="wsite-content-title"><font size="5">Workload and VPA object</font><br></h2><div class="paragraph" style="text-align:left;">A <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/quickstart.md#example-vpa-configuration"><u>VPA custom resource</u></a> object is needed for every Kubernetes resource that is to be managed by VPA. We create a sample workload and a VPA custom resource for this workload. The workload used in this blog post is available in this Github repo: <a href="https://github.com/elotl/vpa-tracker"><u>elotl/vpa-tracker</u></a>.<br></div><div><div id="741171038629971680" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f workload-c.yaml
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The workload uses the CPU stressor pod from this Github repo: <a href="https://hub.docker.com/r/narmidm/k8s-pod-cpu-stressor"><u>narmidm/k8s-pod-cpu-stressor</u></a>. It allows us to control the CPU usage of a deployment&rsquo;s pods via an input parameter in the deployment manifest.<br></div><h2 class="wsite-content-title"><font size="5">VPA metrics exporter</font><br></h2><div class="paragraph" style="text-align:left;">The VPA object makes available its resource recommendations in the object&rsquo;s Status field. We created a simple Python script to export metrics from all VPA custom resources in our cluster to a <strong>/metrics</strong> endpoint. This exporter is in this Elotl public <a href="https://github.com/elotl/vpa-tracker"><u>repo</u></a>.&nbsp; The VPA exporter consists of a Kubernetes deployment and service and can be deployed as follows. 
<h2 class="wsite-content-title"><font size="5">VPA metrics exporter</font><br></h2><div class="paragraph" style="text-align:left;">The VPA object makes available its resource recommendations in the object&rsquo;s Status field. We created a simple Python script to export metrics from all VPA custom resources in our cluster to a <strong>/metrics</strong> endpoint. This exporter is in this Elotl public <a href="https://github.com/elotl/vpa-tracker"><u>repo</u></a>.&nbsp; The VPA exporter consists of a Kubernetes deployment and service and can be deployed as follows. A <em>release=kube-prometheus-stack</em> label is applied to the exporter&rsquo;s ServiceMonitor so that the Prometheus instance installed by kube-prometheus-stack selects it for scraping.<br></div><div><div id="716695456780591887" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f vpa-tracker/vpa-metrics-exporter/vpa_exporter.yaml
kubectl port-forward svc/vpa-exporter 8080:8080
kubectl label servicemonitor vpa-exporter release=kube-prometheus-stack --overwrite
    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Monitoring of VPA metrics&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;">Any Kubernetes monitoring tool can be used to monitor workload resource usage and the VPA metrics. As an example, in this blog, we use these open-source tools:&nbsp;<ul><li><a href="https://github.com/kubernetes/kube-state-metrics"><u>kube-state-metrics</u></a> for exporting all Kubernetes resource metrics, such as CPU and memory usage</li><li><a href="https://prometheus.io/"><u>Prometheus</u></a> for scraping both usage and VPA metrics from their respective endpoints</li><li><a href="https://grafana.com/"><u>Grafana</u></a> for visualizing metrics via Dashboards</li></ul>The <a href="https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack"><u>kube-prometheus-stack</u></a> project is an easy way to install these three components.&nbsp;<br>Prometheus, when installed via the kube-prometheus-stack, by default scrapes all metrics collected by the kube-state-metrics tool. However, an additional configuration step is needed to scrape the new VPA metrics that are being exported by the VPA exporter described in the prior section. This is done by creating a ServiceMonitor <a href="https://github.com/elotl/vpa-tracker/blob/main/vpa-recommender-servicemonitor.yaml"><u>custom resource object</u></a> and exposing the needed Service.</div><div><div id="242219920433332855" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f vpa-recommender-servicemonitor.yaml
kubectl apply -f vpa-metrics-expose-svc.yaml
    </code></pre></div></div></div></div>
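<div class="paragraph" style="text-align:left;">Conceptually, the ServiceMonitor is just a label selector plus a scrape endpoint. A minimal sketch is given below; the service label, port name, and scrape interval here are assumptions, and the actual manifest lives in the repo linked above:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vpa-exporter
  labels:
    release: kube-prometheus-stack   # lets Prometheus select this monitor
spec:
  selector:
    matchLabels:
      app: vpa-exporter              # assumed label on the exporter Service
  endpoints:
  - port: metrics                    # assumed named port on the Service
    interval: 30s
    </code></pre></div></div></div></div>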
<h2 class="wsite-content-title"><font size="5">Visualization of VPA metrics</font><br></h2><div class="paragraph" style="text-align:left;"><strong>Target</strong> refers to the recommended values of CPU and memory requests for the workload. It corresponds to the 90th percentile (by default) of the decaying histogram of observed peak usage values.&nbsp; This percentile value can be configured using the flags --target-cpu-percentile and --target-memory-percentile when starting up the <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>vpa-recommender</u></a>.<br><br><strong>Uncapped Target</strong> refers to the recommended values of CPU and memory requests for a pod without taking into consideration the <strong>max allowed</strong> value in the Spec section of the VPA custom resource object.&nbsp;</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/final-1-vpa-tracker-scaleup-example-legend-noborder-title-preview-jpg.jpg?1753986806" alt="Picture" style="width:750;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">Let&rsquo;s look in detail at an example of the custom panel. In the graph above, at around 12pm, we increase the CPU usage of the CPU stressor pod from 120 millicores to 230 millicores. We do this by editing the deployment&rsquo;s <strong>cpu</strong> flag from a value of 0.1 to 0.2. We see that, at ~2:45pm, the VPA target recommendations (shown in yellow and green, overlapping in this case) increase to an appropriate value of ~260 millicores.<br></div><h2 class="wsite-content-title"><font size="5">Scale-up and Scale-down Response Times</font><br></h2><div class="paragraph" style="text-align:left;">By scale-up response time, we refer to the time taken for the VPA CPU target to envelop a step increase in CPU usage. In many practical use cases, the increase in CPU usage can also be gradual. For the sample workload above and default VPA configuration parameters, we see that the scale-up response time is approximately 2hr 45min.<br><br>Similarly, by scale-down response time, we refer to the time taken for the VPA&rsquo;s CPU target to respond to a step decrease in CPU usage. The scale-down response time for the sample workload and default parameters of the VPA recommender is ~3 days and is shown in the graph below.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/final-2-expt3-scaledown-vpa-tracker-scaleup-example-legend-noborder-title-preview-jpg.jpg?1753986921" alt="Picture" style="width:769;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">The key configuration parameter to the vpa-recommender that determines this response time is the <strong>cpu-histogram-decay-half-life</strong>. This value is the time duration after which the weight of each CPU/memory observation in the calculation of the target is halved. So the smaller this value, the faster the response times. Typically, we want a long-enough response time such that any transient or periodically repeating peaks and valleys in CPU usage will not influence the recommended target. Its default value is 24 hours; users can increase or decrease it based on the usage patterns of their particular workload.<br></div>
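<div class="paragraph" style="text-align:left;">As a hedged sketch of how the half-life could be shortened to, say, 8 hours: this assumes the recommender runs as a Deployment named vpa-recommender in kube-system and already has an args array, so adapt it to your installation:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Sketch: append the flag to the vpa-recommender container's args
kubectl -n kube-system patch deployment vpa-recommender --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--cpu-histogram-decay-half-life=8h0m0s"}]'
    </code></pre></div></div></div></div>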
<h2 class="wsite-content-title"><font size="5">VPA Tracker Reports</font><br></h2><div class="paragraph" style="text-align:left;">As the final step in the VPA tracking workflow, a cluster operator can optionally set up the <a href="https://prometheus.io/docs/alerting/latest/alertmanager/"><u>Prometheus Alert Manager</u></a> to send a report of the final target recommendation at the end of each testing period. Alternatively, reviewing the Grafana panel over the testing period will allow the operator to identify and choose either the peak or the most recent target recommendation.&nbsp; &nbsp;<br><br>We provide an example of using the Alert Manager to send a message to a Slack channel at the end of each testing period with the recommended VPA CPU target value here: <a href="https://github.com/elotl/vpa-tracker/tree/main/vpa-alerts"><u>vpa-tracker-reports</u></a>. The graphic below shows a sample alert from a Slack channel for workload-c.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/slacknotification-vpa-tracker.jpg?1753987073" alt="Picture" style="width:767;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">If, after a few iterations of testing, the VPA recommendations work well, the cluster operator can choose to either: a) manually update the resource requests of pods or b) run VPA in auto update mode.<br></div><h2 class="wsite-content-title"><font size="5">Luna Autoscaler and VPA</font><br></h2><div class="paragraph" style="text-align:left;">When VPA recommends resource values that exceed the cluster&rsquo;s current capacity, using an intelligent cluster autoscaler like Luna can help ensure that workloads will continue to run without any interruptions and without any manual intervention to add cluster capacity. Similarly, when VPA recommends target values that would result in some cluster nodes being under-utilized, Luna can detect this and scale down the appropriate nodes. This helps keep cluster operation costs in check.<br><br>If you are interested in using VPA with Luna, please download our free trial version from here:&nbsp;<a href="https://www.elotl.co/luna-free-trial.html">Luna Free Trial</a>. 
Do write to us if you would like help getting started: <a href="mailto:info@elotl.co"><u>info@elotl.co</u></a>.<br><br><br><strong>Author:</strong><br><br>Selvi Kadirvel (VP Engineering, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Luna now supports RKE2 clusters on AWS EC2]]></title><link><![CDATA[https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2#comments]]></comments><pubDate>Thu, 03 Jul 2025 13:18:04 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-now-supports-rke2-clusters-running-in-aws-ec2</guid><description><![CDATA[The Luna cluster autoscaler can now run with SUSE's&nbsp;RKE2 clusters on AWS EC2 nodes.Compared to EKS, RKE2 on EC2 offers more operational control, better customization, improved flexibility, and federation across different infrastructures: EC2, on-prem, and edge.Luna 1.2.19 can create and manage RKE2 worker nodes, allowing you to scale your RKE2 compute resources more efficiently than with the basic Kubernetes cluster autoscaler.How to configure Luna for RKE2Here are the steps to configure Lu [...] ]]></description><content:encoded><![CDATA[<span class="imgPusher" style="float:right;height:0px"></span><span style="display: table;width:196px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-rke2-aws-ec2.png?1751561035" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">The <a href="https://docs.elotl.co/luna/intro/"><u>Luna cluster autoscaler</u></a> can now run with SUSE's&nbsp;<a href="https://docs.rke2.io/"><u>RKE2</u></a> clusters on AWS EC2 nodes.<br>Compared to EKS, RKE2 on EC2 offers more operational control, better customization, improved flexibility, and federation across different infrastructures: EC2, on-prem, and edge.<br>Luna 1.2.19 can create and manage RKE2 worker nodes, allowing you to scale your RKE2 compute resources more efficiently than with the basic Kubernetes cluster autoscaler.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"><div class="wsite-youtube-container"><iframe src="//www.youtube.com/embed/kqb4BGXtlAs?wmode=opaque" frameborder="0" allowfullscreen></iframe></div></div></div><h2 class="wsite-content-title"><font size="5">How to configure Luna for RKE2</font><br></h2><div class="paragraph" style="text-align:left;">Here are the steps to configure Luna with RKE2 on Amazon EC2. 
Here we'll assume that the RKE2 cluster already exists, and that Luna will get installed in the <em>elotl</em> namespace.</div><h2 class="wsite-content-title"><font size="4">Create a Docker Hub secret</font><br></h2><div class="paragraph">If you aren't using the <a href="https://www.elotl.co/luna-free-trial.html"><u>trial</u></a> version of Luna, you'll have to configure the Docker Hub secret to fetch the images.</div><div><div id="841331595316381919" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n elotl create secret docker-registry dockerhub \
  --docker-server=docker.io \
  --docker-username= \
  --docker-password=
    </code></pre></div></div></div></div><div class="paragraph">This secret will be referenced later when Luna is deployed.</div><h2 class="wsite-content-title"><font size="4">Create EC2 credentials for Luna</font><br></h2><div class="paragraph" style="text-align:left;">Unlike EKS, RKE2 doesn't support AWS built-in credential mechanisms to authenticate a service account attached to the pod. This means Luna has to rely on an access key to use the EC2 API.<br>Create the access key in the AWS console and input its information into a generic secret like this:</div><div><div id="292707267482748928" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n elotl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID= \
  --from-literal=AWS_SECRET_ACCESS_KEY= \
  --from-literal=AWS_REGION=
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Because these credentials can be read by anyone with access to the cluster, it's important to restrict the permissions of the AWS access key. The EKS installation script has a file named role_policies.json listing all the IAM permissions required by Luna; you can use these policies to restrict the IAM permissions on the AWS access key role.</div><h2 class="wsite-content-title"><font size="4">Find the subnets, security groups, and node instance profile for the cluster</font><br></h2><div class="paragraph" style="text-align:left;">With EKS, Luna automatically queries the subnets and security groups based on the cluster tags, but with RKE2, these tags may not exist.<br>You can find the subnets under the cluster's VPC in the AWS console.<br>To get the security groups and node instance profile, take a look at an RKE2 control or worker node in the cluster using the AWS EC2 console. On the instance page, go to the "Security" tab. The security group IDs are listed in the "Security Groups" section. To get the node instance profile, click on the "IAM role" link and look for "Instance profile ARN" on the IAM role page. The node instance profile ARN format is <em>arn:aws:iam::&lt;account ID&gt;:instance-profile/&lt;node-instance-profile&gt;</em>; use only the <em>&lt;node-instance-profile&gt;</em> part when configuring Luna.</div>
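<div class="paragraph" style="text-align:left;">If you prefer the CLI, one way to list the subnet IDs for the cluster's VPC mentioned above is with the AWS CLI; the VPC ID below is a placeholder:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># List the subnet IDs in the cluster's VPC
aws ec2 describe-subnets \
  --filters "Name=vpc-id,Values=vpc-1234567890" \
  --query "Subnets[].SubnetId"
    </code></pre></div></div></div></div>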
<h2 class="wsite-content-title"><font size="4">Get the node token and the cluster's IP address</font></h2><div class="paragraph" style="text-align:left;">The agent token is used to authenticate the nodes with the cluster. To get the agent token from the kube-apiserver pod, first find the apiserver pods on the RKE2 cluster:<br></div><div><div id="539862493525215020" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl -n kube-system get pod -l component=kube-apiserver
    </code></pre></div></div></div></div><div><div id="903026919932971103" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">NAME                                    READY   STATUS    RESTARTS   AGE
kube-apiserver-rke2-pool1-zdzdn-6g2ql   1/1     Running   0          15d
kube-apiserver-rke2-pool1-zdzdn-kcxwv   1/1     Running   0          15d
    </code></pre></div></div></div></div><div class="paragraph">Then exec into one of the pods and print the agent-token file:</div><div><div id="439176683365701539" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl exec -it kube-apiserver-rke2-pool1-zdzdn-6g2ql -n kube-system -- bash
    </code></pre></div></div></div></div><div><div id="377909448112439459" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cat /var/lib/rancher/rke2/server/agent-token
    </code></pre></div></div></div></div><div class="paragraph"><br>To get the server's API address, list the control-plane nodes and use one of the nodes' internal IPs:</div><div><div id="807532008474602950" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get node -l node-role.kubernetes.io/control-plane=true -o wide
    </code></pre></div></div></div></div><div class="paragraph">Alternatively, you can use the load balancer's IP if you are using a high-availability solution for the control plane.</div><h2 class="wsite-content-title"><font size="4">Create Helm values file</font><br></h2><div class="paragraph">Now let's put it all together and create the Helm values file for the Luna chart.<br>We'll use a base Ubuntu image and create the user data script required to set up the RKE2 worker node to work with Luna:<br><br></div>
<div><div id="737580040203725762" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cloudProvider: aws
clusterID: ""
aws:
    subnets: ["subnet-1234567890"]
    securityGroups: ["sg-1234567890"]
    nodeInstanceProfile: node-instance-profile
    amiIdGeneric: ami-09a13b25443518b29
    userDataType: Template
    userData: |
        #!/bin/bash
        mkdir -p /etc/rancher/rke2/
        cat &lt;&lt;EOF &gt; /etc/rancher/rke2/config.yaml
        server: "https://:9345"
        token: ""
        node-label:
        {{- range $k, $v := .Labels }}
        - "{{ $k }}={{ $v }}"
        {{- end }}
        {{- if .Taints }}
        node-taint:
        {{- range $t := .Taints }}
        - "{{ $t }}"
        {{- end }}
        {{- end }}
        {{- if (gt .MaxPods 0) }}
        kubelet-arg: "--max-pods={{.MaxPods}}"
        {{- end }}
        EOF
        curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
        systemctl enable rke2-agent.service
        systemctl start rke2-agent.service
imagePullSecretName: dockerhub
labels: "elotl-luna=true"
manager:
    envFrom:
    - secretRef:
        name: aws-credentials
    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Deploy Luna with Helm and test</font><br></h2><div class="paragraph">Once the Helm values file is created, you can deploy Luna from its Helm chart with the Helm values file:<br></div><div><div id="550219562954133697" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">helm install 'elotl-luna' \
  --wait \
  --create-namespace \
  --namespace="elotl" \
  --values=helm_values.yaml
    </code></pre></div></div></div></div><div class="paragraph"><br>Once the deployment is running, you can test the installation by creating a test deployment like this:</div><div><div id="809000161494997702" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">cat nginx.yaml
    </code></pre></div></div></div></div><div><div id="697590705998331154" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        elotl-luna: "true"
    spec:
      containers:
      - name: nginx
        image: nginx:mainline
        resources:
          requests:
            cpu: 800m
            memory: 200Mi
    </code></pre></div></div></div></div>
class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f nginx.yaml    </code></pre></div></div></div></div><div class="paragraph"><span>The nginx pod will initially be in the Pending state and Luna nodes will come up to run them:</span></div><div><div id="626562848792290694" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get node -l node.elotl.co/created-by=luna -w    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">Supporting RKE2 clusters on AWS EC2 marks a significant milestone for Luna, delivering advanced autoscaling to more Kubernetes users in the Amazon cloud. By following the configuration best practices shared above, your team can deploy Luna confidently, unlocking new opportunities for cost efficiency and operational control in your RKE2 clusters.<br><br><br><strong>Author:</strong><br><br><span></span>Henry Precheur (Senior Staff Engineer, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Building an Elastic GPU Cluster with the KAI Scheduler and Luna Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler#comments]]></comments><pubDate>Wed, 28 May 2025 18:34:17 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler</guid><description><![CDATA[When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The KAI Scheduler introduces a structured approach to resource allocation by organizing jobs into queues and operating under the assumption of fixed GPU resources available within the cluster. For clarification for those not familiar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI’s own abstraction, not to be confused with a Kubernetes Job resource (i.e., th [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:269px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/building-an-elastic-gpu-cluster-with-the-kai-scheduler-and-luna-autoscaler.png?1748457454" style="margin-top: 0px; margin-bottom: 0px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">When managing machine learning workloads at scale, efficient GPU scheduling becomes critical. The <a href="https://github.com/NVIDIA/KAI-Scheduler"><strong><u>KAI Scheduler</u></strong></a> introduces a structured approach to resource allocation by organizing jobs into <em>queues</em> and operating under the assumption of <em>fixed GPU resources</em> available within the cluster. For clarification for those not familiar with KAI terminology, the term "job" refers to a unit of scheduling work defined within KAI&rsquo;s own abstraction, not to be confused with a Kubernetes Job resource (i.e., the batch/v1 kind used in Kubernetes for running finite, batch-style workloads). Each queue can be assigned limits and quotas, allowing administrators to control how resources are distributed across teams, projects, or workloads. This model ensures fair usage and predictability, but it also means that when demand exceeds supply, jobs can sit idle, waiting for resources to become available, and when supply exceeds demand, unnecessary costs are incurred.<br><br>This is where the real strength of the KAI Scheduler can shine: pairing it with <strong>Luna, an intelligent autoscaler</strong>. With this combination, the system becomes highly elastic, able to dynamically add GPU nodes only when truly needed, and scale them back down to optimize efficiency. Instead of relying on a static pool of GPUs, the cluster can grow to meet active demand &mdash; <em>but only up to what is necessary and permitted by the configured queue limits and quotas</em>. It&rsquo;s worth noting that Luna doesn't indiscriminately add nodes; it works intelligently alongside KAI, ensuring that scaling decisions respect organizational boundaries and cost controls.&nbsp; Beyond scaling decisions, Luna offers settings to guide GPU instance selection, adding another layer of precision.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">Even more powerfully, when demand drops, the autoscaler can scale GPU nodes down to zero, eliminating idle GPU resource costs entirely when no jobs are pending. 
This combination of KAI&rsquo;s scheduling guarantees with elastic GPU scaling through Luna improves resource utilization, enforces workload fairness, and reduces cloud costs &mdash; all while staying responsive to real-time demand.<br><br>Although KAI's queue-based model applies to both GPU and non-GPU scheduling scenarios, this blog highlights its integration with Luna in the context of GPU workloads, where elastic scaling offers the greatest impact.<br><br>In this blog post, we'll dive deeper into how KAI's design philosophy around queues and quotas enables this behavior, and how coupling it with the Luna autoscaler transforms your GPU cluster into a highly responsive, cost-effective machine learning platform.</div><h2 class="wsite-content-title"><font size="5">Queues, Quotas, and Priorities: The Building Blocks of KAI Scheduling</font><br></h2><div class="paragraph" style="text-align:left;">The KAI Scheduler is a purpose-built GPU scheduling system designed for modern AI/ML clusters where jobs vary widely in size, duration, and importance. At its core, KAI is designed to maximize GPU utilization while ensuring fairness, predictability, and administrative control. Unlike traditional Kubernetes scheduling, which typically operates at a pod-by-pod level, KAI introduces a queue-based model that groups jobs by context, such as by team, project, or workload class, allowing more intelligent and policy-driven resource sharing.<br><br>Each <em>queue</em> in KAI acts like a controlled funnel for jobs, with configurable <em>limits</em> (the maximum number of GPUs it can use at once) and <em>quotas</em> (reserved GPU allocations that a queue is guaranteed even during cluster contention). This structure ensures that important teams or high-priority projects are not starved when demand is high, while still allowing flexibility to share unused capacity when possible.<br><br>KAI also supports <em>job priorities</em> within queues. Higher-priority jobs are scheduled before lower-priority ones, even within the same queue, enabling teams to manage critical workloads more effectively. When GPUs are scarce, KAI can preempt lower-priority jobs (depending on configuration) to ensure that the most important work gets done first. Combined with fair sharing across queues and configurable preemption policies, this priority system helps align resource allocation with business and operational goals.<br><br>This structured approach &mdash; queues, quotas, limits, and priorities &mdash; makes KAI uniquely capable of supporting large, dynamic GPU clusters where the mix of users, workloads, and urgency changes constantly. When coupled with Luna, an intelligent autoscaler, KAI ensures that the right jobs run at the right time, while infrastructure elastically grows or shrinks to match real demand.<br><br>This blog highlights just a few core concepts of the KAI Scheduler, specifically its use of queues, quotas, and priority-based scheduling. However, KAI also includes many other advanced features designed for complex workload management. 
While we won&rsquo;t cover those here, they may be worth checking out and exploring.</div><h2 class="wsite-content-title"><font size="5">How the Luna Autoscaler Works with the KAI Scheduler</font><br></h2><div class="paragraph" style="text-align:left;">While many Kubernetes autoscalers operate by simply watching for pending pods and then adding nodes when any pod remains unscheduled, this approach falls short in environments where more complex scheduling logic is in place, such as when using the KAI Scheduler. In KAI, it is perfectly normal (and intentional) for some pods to remain pending, not because resources are unavailable, but because a queue&rsquo;s GPU <em>limit</em> or <em>quota</em> has been reached. An autoscaler that simply reacts to all pending pods would wastefully add GPU nodes that the KAI Scheduler would never utilize, leading to unnecessary cloud spend and resource sprawl.<br><br>The Luna autoscaler solves this problem with a more intelligent strategy. Rather than simply responding to the existence of pending pods, Luna can be configured to inspect the pod&rsquo;s status, conditions, and associated messages to determine <em>why</em> the pod is pending. This allows it to distinguish between pods that truly need more capacity versus pods that are simply waiting for their turn within a queue limit.<br><br>For example, if a pod&rsquo;s <strong><font color="#626262">status.conditions</font></strong> section includes a message such as:<br></div><div><div id="709022772257916334" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Scheduling conditions were not met for pod default/gpu-pod-1a:
MaxNodePoolResources: The pod default/gpu-pod-1a requires GPU: 1, CPU: 0 (cores), memory: 0 (GB). No node in the default node-pool has GPU resources.
    </code></pre></div></div></div></div><div class="paragraph">-or-<br></div><div><div id="225341365251053462" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">no nodes with enough resources were found: 4 node(s) didn't have enough resources: GPUs.
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">this indicates that the pod is unschedulable because there are <strong>no nodes with GPU resources</strong> available. In this case, Luna correctly triggers the addition of a new GPU node, allowing the KAI Scheduler to proceed with placing the job.<br><br>On the other hand, if the pending pod&rsquo;s message says:<br></div><div><div id="578006931532394955" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">OverLimit: default1 quota has reached the allowable limit of GPUs. Limit is 1 GPUs, currently 1 GPUs allocated and workload requested 1 GPU
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">it signals that the queue&rsquo;s GPU limit has already been reached. In this case, adding more nodes would be futile because KAI will not schedule the pod until the quota is freed, regardless of available cluster resources. The Luna autoscaler, if properly configured, recognizes this scenario and avoids unnecessary node provisioning.</div>
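<div class="paragraph" style="text-align:left;">To see the exact message Luna would evaluate for a given pending pod, you can read the PodScheduled condition directly; the pod name and namespace below are illustrative:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Print the scheduler's message on the pod's PodScheduled condition
kubectl -n default get pod gpu-pod-1a \
  -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
    </code></pre></div></div></div></div>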
<div class="paragraph" style="text-align:left;">The flexibility that enables Luna to behave correctly in these cases comes from its <strong><font color="#818181">pendingPodReasonRegexp</font></strong> configuration option. This setting lets administrators define a regular expression to match only those pending pod messages that warrant scaling actions. Without any configuration, Luna would treat <em>all</em> pending pods as triggers for scale-out. However, with an expression like:<br></div><div><div id="104438841689460423" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">pendingPodReasonRegexp: (.*[Nn]o.*resources.*|^0/([0-9]+) nodes are available)
    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Luna can simultaneously support both default Kubernetes scheduling messages (like "0/5 nodes are available") and KAI Scheduler-specific resource shortage messages (like "No node in the default node-pool has GPU resources"). Critically, it would <em>ignore</em> pods pending due to quota overages, respecting the queue limits and policies enforced by KAI.<br><br>This integration makes Luna a powerful autoscaling companion to KAI, enabling truly elastic GPU infrastructure: adding nodes when needed for real workloads, avoiding waste when queues are at quota, and scaling down to zero when no pods are eligible for scheduling. Together, KAI and Luna deliver an efficient, responsive, and cost-optimized platform for running large-scale AI and ML jobs.<br></div><h2 class="wsite-content-title"><font size="5">Real-World Dynamics: Scheduling, Queue Limits, and Intelligent Scaling</font><br></h2><div class="paragraph" style="text-align:left;"><strong>Let's walk through an example to see how the KAI Scheduler and Luna autoscaler work together in practice.</strong> We'll explore how GPU workloads are scheduled across queues, how scaling decisions are made, and how the system remains efficient even as demand changes throughout the day.<br><br>Imagine a Kubernetes cluster set up to serve multiple internal teams running AI workloads. Two KAI Scheduler queues are configured: a <strong>"Research"</strong> queue and a <strong>"Production Inference"</strong> queue. The "Research" queue is assigned a <strong>quota of 4 GPUs</strong> and a <strong>limit of 8 GPUs</strong>, while the "Production Inference" queue has a <strong>quota of 8 GPUs</strong> and a <strong>limit of 12 GPUs</strong>. These settings ensure that critical production workloads are prioritized and guaranteed sufficient resources even during periods of high demand, while still allowing research teams to scale up when capacity is available; a sketch of such a queue definition is shown below.<br></div>
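<div class="paragraph" style="text-align:left;">For illustration, the "Research" queue might be declared along these lines. This is a sketch modeled on the KAI Scheduler's Queue custom resource, so check the field names against the KAI version you run:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: research
spec:
  resources:
    gpu:
      quota: 4            # GPUs guaranteed to this queue, even under contention
      limit: 8            # hard cap on GPUs the queue may use at once
      overQuotaWeight: 1  # share of spare capacity relative to other queues
    </code></pre></div></div></div></div>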
<div class="paragraph" style="text-align:left;">At the start of the day, several production inference jobs are submitted, consuming 6 GPUs. Luna detects that some production pods are <strong>pending with a valid unschedulable reason</strong>, indicating a lack of GPU resources, not just a KAI queue overlimit. Based on its configured <strong><font color="#818181">pendingPodReasonRegexp</font></strong>, Luna correctly interprets these pending pods as requiring new compute and promptly scales up additional GPU nodes. Once the nodes are ready, KAI schedules these inference jobs, bringing production workloads up toward their quota.<br><br>Shortly afterward, research engineers kick off a series of experimental training jobs, requesting 10 GPUs in total. KAI schedules the first 4 research jobs immediately, aligned with the Research queue&rsquo;s quota. Another 4 pods may also be scheduled&mdash;provided the cluster has sufficient capacity&mdash;since they remain within the queue&rsquo;s configured limit of 8 GPUs. Meanwhile, Luna inspects the pending pods: it recognizes that some research pods are pending due to insufficient GPU capacity, while 2 others will remain pending because the queue&rsquo;s GPU limit has been reached. In response, Luna allocates additional GPU nodes to accommodate the pods still eligible to run within the queue&rsquo;s limit.<br><br>Luna only scales up nodes for the pods that actually need capacity and ignores those pending due to limit enforcement. This selective scaling ensures efficient cluster growth without wasting compute on artificially pending jobs.<br><br>As demand surges further, a second wave of production inference jobs arrives, consuming more GPUs and pushing the cluster toward full utilization. Because production workloads have a higher queue priority, KAI favors them over research jobs when scheduling GPUs that become available. The research pods exceeding their limits remain pending, awaiting free resources.<br><br>Later in the day, several production inference jobs complete, releasing GPUs back into the cluster. The KAI Scheduler notices the freed-up GPUs and begins to schedule the pending research jobs, respecting quota, limit, and priority policies. As the workload tapers off toward evening, both queues gradually empty out. Luna detects the sustained idleness (no pods are pending that would require GPUs) and begins scaling down the GPU node pools, eventually reaching <em>zero GPU nodes</em> once all jobs have completed or been canceled.<br><br>Throughout this cycle, KAI ensures fair, priority-aware scheduling based on queue configurations, while Luna manages <strong>dynamic, intelligent autoscaling</strong>, scaling up precisely when workloads genuinely need resources and scaling down aggressively to save costs. This close coordination keeps the platform <strong>cost-effective, responsive, and well-aligned to workload demand</strong>.<br></div><h2 class="wsite-content-title"><font size="5">How Luna Ensures Efficient Scaling Even Under Rapid Changes</font><br></h2><div class="paragraph" style="text-align:left;">While the Luna Autoscaler is designed to scale GPU nodes (as well as non-GPU nodes) precisely according to actual demand, it&rsquo;s important to note that small overshoots can occasionally occur. Because of the inherently dynamic nature of Kubernetes, with pods completing, new pods arriving, and scheduling conditions changing rapidly, Luna may sometimes add slightly more nodes than strictly needed. However, this is expected behavior in highly dynamic systems, and Luna is built to detect and reconcile any over-provisioned nodes quickly. Unused GPU nodes are automatically identified and safely removed during the next autoscaling evaluation cycle. This reconciliation mechanism ensures that the cluster stays responsive to fast-changing workloads without risking long-term resource waste, striking a balance between agility and efficiency.<br><br>To further reduce the potential for over-scaling, administrators can configure Luna&rsquo;s <strong><font color="#818181">clusterGPULimit</font></strong> option. This setting acts as a cap on the total number of GPUs Luna is allowed to provision. For example, it can be set to the sum of all KAI queue limits or slightly above the expected maximum GPU demand. This ensures that even under bursts of pending pods or fluctuating queue activity, Luna will not scale the cluster beyond a known safe threshold, providing another safeguard for cloud cost and quota control.<br></div>
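<div class="paragraph" style="text-align:left;">In Helm values form this is a single setting; the number below is illustrative and could, for example, be set to the sum of the two queue limits from the walkthrough above:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Cap the total number of GPUs Luna may provision across the cluster.
clusterGPULimit: 20
    </code></pre></div></div></div></div>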
<h2 class="wsite-content-title"><font size="5">Closing Thoughts: Intelligent Scheduling and Autoscaling in Action</font><br></h2><div class="paragraph" style="text-align:left;">Effectively managing GPU resources in a Kubernetes environment requires more than just reactively scaling for all pending pods. It demands an understanding of why pods are pending, how workloads are prioritized, and how quotas and queue limits impact scheduling decisions. The KAI Scheduler brings powerful, queue-based control to the table, allowing administrators to enforce GPU resource guarantees, prioritize critical workloads, and avoid resource contention across teams, all while enabling dynamic, fair resource sharing when capacity allows.<br><br>However, intelligent scheduling alone isn't enough. To maximize efficiency and cost-effectiveness, the platform must also dynamically match the underlying compute supply to real-world demand. That&rsquo;s where the Luna intelligent autoscaler complements the KAI Scheduler perfectly. By inspecting pod status messages and acting only when GPU nodes are genuinely needed, not merely reacting to all pending pods, Luna ensures that scaling decisions are precise, deliberate, and resource-aware.<br><br>As we saw in the example scenario, this combination allows workloads to ramp up smoothly, respecting both quota guarantees and dynamic limits, while ensuring GPU nodes are provisioned only when they can actually be used. When workloads complete, Luna responds quickly, scaling GPU nodes back down, even all the way to zero, helping to avoid unnecessary cloud costs during idle periods.<br><br>In short, pairing the KAI Scheduler with an intelligent autoscaler, like Luna, provides a powerful foundation for managing large-scale, GPU-intensive Kubernetes workloads. Together, they deliver <strong>better workload fairness, faster responsiveness, and smarter resource utilization</strong> &mdash; all critical ingredients for running a highly efficient, cost-effective compute platform at scale.<br></div><h2 class="wsite-content-title"><font size="5">Looking Ahead: Evolving Luna and KAI Integration</font><br></h2><div class="paragraph" style="text-align:left;">The current integration between the Luna autoscaler and the KAI Scheduler already enables powerful, efficient GPU workload scaling with intelligent handling of queues, quotas, and real-time cluster demands. While the existing functionality covers many common scenarios, we recognize that there may be opportunities for even deeper integration based on real-world needs.<br><br>We&rsquo;d be very interested in hearing from you about potential improvements. 
If you have ideas for tighter coupling, additional features, or specific use cases where Luna could better support KAI's advanced scheduling behavior, we'd love your feedback. Your input could help guide future enhancements and ensure the system continues to meet evolving GPU workload demands.</div><h2 class="wsite-content-title"><font size="5">Get Involved</font><br></h2><div class="paragraph" style="text-align:left;">If you're running GPU workloads today, or planning to, and want to make the most of the KAI Scheduler and Luna autoscaler together, now is the perfect time to get involved.<br><br>Share your feedback, test new features, and help us build even smarter, more efficient scaling for Kubernetes GPU environments and workloads.<br><br>Discover how Luna&rsquo;s intelligent autoscaling enhances GPU workload management, especially when paired with advanced schedulers like KAI. Visit our <a href="https://www.elotl.co/luna.html"><u>Luna</u></a> product page to explore all its capabilities, or dive into the <a href="https://docs.elotl.co/luna/intro/"><u>documentation</u></a> for hands-on setup guidance. Ready to optimize your cluster with smarter GPU scaling? Start your <a href="https://www.elotl.co/luna-free-trial.html"><u>free trial</u></a> today and experience the efficiency, control, and cost savings Luna can bring.<br><br><br><strong>Author:</strong><br>Justin Willoughby (Principal Solutions Architect, Elotl)<br></div>]]></content:encoded></item><item><title><![CDATA[Supercharge your Cluster Autoscaling with VPA]]></title><link><![CDATA[https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa]]></link><comments><![CDATA[https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa#comments]]></comments><pubDate>Tue, 13 May 2025 17:46:41 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[VPA]]></category><guid isPermaLink="false">https://www.elotl.co/blog/supercharge-your-cluster-autoscaling-with-vpa</guid><description><![CDATA[Choosing accurate CPU and memory request values for Kubernetes workloads is a difficult endeavor. This difficulty results in application developers overprovisioning their workloads to ensure that application performance will not be affected. This can lead to increasing cloud costs and inefficient resource usage. In addition, it is also possible that workloads can be underprovisioned inadvertently. This can negatively affect application performance and potentially even lead to service disruptions [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/vpa-and-luna-interoperability-experiments.jpg?1747158858" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Choosing accurate CPU and memory request values for Kubernetes workloads is a difficult endeavor. 
This difficulty results in application developers overprovisioning their workloads to ensure that application performance will not be affected. This can lead to increasing cloud costs and inefficient resource usage. In addition, it is also possible that workloads can be underprovisioned inadvertently. This can negatively affect application performance and potentially even lead to service disruptions.<br><br>In this blog, we describe how <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><u>Kubernetes Vertical Pod Autoscaler</u></a> (VPA) can be leveraged in conjunction with <a href="https://www.elotl.co/luna.html"><u>Luna</u></a>, a powerful cluster autoscaler - to ensure that Kubernetes workloads are <strong>right-sized by VPA</strong> and the Kubernetes cluster as well as nodes are <strong>right-sized</strong> by Luna - resulting in <strong>cost-effective</strong> and <strong>performant</strong> operations.<br><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">Overview of VPA</font><br></h2><div class="paragraph" style="text-align:left;">The Vertical Pod Autoscaler in Kubernetes leverages CPU and memory usage history of managed workloads to make recommendations of resource request values for containers and optionally update a container&rsquo;s resource requests in an automated fashion. Workloads that can be vertically scaled using VPA include Deployments, Statefulsets, Daemonsets as well as Custom Resources (that have the scale subresource defined). VPA uses the <a href="https://kubernetes-sigs.github.io/metrics-server/"><u>Kubernetes metrics server</u></a> to monitor and track CPU and memory resource usage.&nbsp;&nbsp;<br><br><span></span>VPA is implemented as a Custom Resource in Kubernetes. An instance of the custom resource will need to be created for each workload that the user would like to manage or vertically autoscale. VPA can be used in 3 different modes. These are described below:<br><br><span></span><ol><li><strong>Off</strong>: In this mode, VPA provides recommendations for pod resource request values. These recommended values can be read from the VPA custom resource object. This mode requires manual activation or human intervention to apply recommendations.&nbsp;<br><span></span></li><li><strong>Initial</strong>: In this mode, VPA provides recommendations for pod request values in the VPA custom resource object just as in the <strong>&ldquo;off&rdquo;</strong> mode. In addition, these resource recommendations are applied to pods during pod creation (alone) and do not change during the lifetime of the pod. These pod creations could have been triggered either by prior pod restarts or via horizontal pod scaling.&nbsp;<br><span></span></li><li><strong>Auto/Recreate</strong>: In this mode, VPA assigns resource requests on pod creation as well as updates these resource requests over the lifetime of the pod.&nbsp;<br><br><span></span></li></ol>Note: This blog focuses on the traditional behavior of the Vertical Pod Autoscaler (VPA), which involves evicting and restarting pods to apply new resource recommendations. It does not cover the newer in-place pod resizing feature introduced in Kubernetes v1.33+, which allows certain resource updates without pod restarts. If you're using Kubernetes 1.33 or later and are interested in in-place resizing, be aware that it introduces different behavior and considerations not discussed in detail in this blog post. 
Please review the section &ldquo;VPA and In-place Pod Resizing&rdquo; at the end of this blog post to learn more.&nbsp;<br><br><span></span></div><h2 class="wsite-content-title"><font size="5">VPA: Under the hood</font><br></h2><div class="paragraph" style="text-align:left;">The Vertical Pod Autoscaler consists of 3 components in the <em>kube-system</em> namespace:<ol><li>Recommender, vpa-recommender: The recommender utilizes past and current CPU and memory usage values to calculate recommendations for resource requests for containers within a managed pod. The recommendations are made available within the Status field of the VerticalPodAutoscaler custom resource object.<br></li><li>Updater, vpa-updater: The updater is responsible for checking current resource requests for containers in a pod and evicting pods for which the recommended resources vary significantly from the current allocation. The pod disruption budget is respected during evictions. The vpa-updater comes into effect only if VPA is operated in Auto mode. Note that for pods whose resources need to be updated, the vpa-updater is responsible only for evicting the pod; the pod controller then initiates the pod's restart.</li><li>Admission Controller, vpa-admission-controller: The admission controller sets correct resource requests on newly created pods. This includes pods created for the first time as well as pods recreated after eviction by the vpa-updater.<br><br></li></ol>VPA also includes supporting components like a Mutating webhook configuration named vpa-webhook-config and a ClusterIP service called vpa-webhook, which work together to apply recommended resource updates.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/vertical-pod-autoscaler.png?1747159145" alt="Picture" style="width:598;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">The VPA recommender calculates resource requests using a <strong>decaying histogram</strong> of monitored CPU and memory usage metrics. In a decaying histogram, the weight of each metric value decreases over time. By default, a historical CPU usage sample loses half of its weight in 24 hours. This default value can be changed using the --cpu-histogram-decay-half-life flag. The frequency at which CPU and memory metrics are fetched defaults to 1 minute and can be changed using the --recommender-interval flag. 
An extensive list of other flags to customize the vpa-recommender is documented here: <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>VPA-recommender flags</u></a>.&nbsp; A detailed description of margins and confidence intervals that are applied over the decaying histogram technique can be found in this CNCF blog post: <a href="https://www.cncf.io/blog/2023/02/24/optimizing-kubernetes-vertical-pod-autoscaler-responsiveness/"><u>Optimizing VPA responsiveness</u></a> and <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/recommender/logic/recommender.go"><u>here</u></a>.<br><br>The VPA object for each Kubernetes resource can also be configured to provide recommendations for both CPU and memory or just one of these resources using the <strong>controlledResources</strong> parameter in the VPA object (shown in the example VPA object below). It is important to note that it is <strong>not</strong> recommended to use VPA along with the Horizontal Pod Autoscaler for the same resource. More details about this limitation can be found in these references: <a href="https://github.com/kubernetes/design-proposals-archive/blob/main/autoscaling/vertical-pod-autoscaler.md#combining-vertical-and-horizontal-scaling"><u>VPA design docs</u></a>, <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/known-limitations.md#known-limitations"><u>Known Limitations of VPA</u></a> and <a href="https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#limitations"><u>VPA on GKE Limitations</u></a>.<br><br>Let&rsquo;s look at an example of a VPA object:</div><div><div id="583817042931912367" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
  name: workload-c-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: workload-c
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 2
        memory: 500Mi
      controlledResources: ["cpu", "memory"]
...
    </code></pre></div></div></div></div><div class="paragraph">The <strong>targetRef</strong> field refers to the Kubernetes resource that this VPA object manages, which in this case is a deployment named &ldquo;workload-c&rdquo;. The <strong>updatePolicy</strong> field can be one of the modes listed in the Overview section: Off, Initial, Auto, or Recreate.&nbsp; The <strong>minAllowed</strong> and <strong>maxAllowed</strong> fields are used to set the absolute minimum and maximum values that the VPA can recommend. This prevents excessive resource usage as well as resource starvation for pods and can help to keep performance and cost within acceptable bounds.</div>
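<div class="paragraph" style="text-align:left;">Once the recommender has produced values, they can be read back from the object&rsquo;s status, for example:</div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;"># Human-readable view of the recommendations
kubectl describe vpa workload-c-vpa
# Or extract just the target values from the status
kubectl get vpa workload-c-vpa \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
    </code></pre></div></div></div></div>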
This prevents excessive resource usage as well as resource starvation for pods and can help to keep performance and cost within acceptable bounds.<br><br>Let&rsquo;s now look at an example of a recommendation within the VPA object after it begins operation:</div><div><div id="931477548549255750" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">Recommendation:
  Container Recommendations:
    Container Name:  workload-c
    Lower Bound:
      Cpu:     382m
      Memory:  262144k
    Target:
      Cpu:     587m
      Memory:  262144k
    Uncapped Target:
      Cpu:     587m
      Memory:  262144k
    Upper Bound:
      Cpu:     1
      Memory:  500Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">In the above snippet, <strong>Target</strong> refers to the recommended values of CPU and memory requests for the container named &ldquo;workload-c&rdquo;. It corresponds to the 90th percentile (by default) of the decaying histogram of observed peak usage values.&nbsp; This percentile value can be configured using the flags --target-cpu-percentile and --target-memory-percentile when starting up the <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/flags.md#what-are-the-parameters-to-vpa-recommender"><u>vpa-recommender</u></a>.<br><br><strong>Uncapped Target</strong> refers to the recommended values of CPU and memory requests for the same container without taking into consideration the <strong>maxAllowed</strong> value in the Spec section of the VPA custom resource object. The <strong>lower bound</strong> and <strong>upper bound</strong> values correspond to the 50th percentile and 95th percentile of the decaying histogram; these can be configured with the flags --recommendation-lower-bound-cpu-percentile and --recommendation-upper-bound-cpu-percentile (with analogous flags for memory).&nbsp;&nbsp;<br></div><h2 class="wsite-content-title"><font size="5">VPA: Better Together with Cluster Autoscaling</font><br></h2><div class="paragraph" style="text-align:left;">In this section, let&rsquo;s look at how Vertical Pod Autoscaling and Cluster autoscaling complement each other. VPA can be utilized to right-size pods that are initially either overprovisioned or underprovisioned. We delve into each of these cases and find out how a cluster autoscaler can help with both.&nbsp;<br></div><h2 class="wsite-content-title"><font size="4">Application Under-Provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is underprovisioned, VPA recommends <strong>larger</strong> resource values than its current allocation. In this case, the current cluster nodes may not be able to accommodate the updated pod. This can result in pods remaining in the Pending state. In such a case, having an Intelligent Kubernetes Cluster Autoscaler, like Luna, becomes critical to keep the application or service running without interruptions. 
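For example, a replica that no longer fits on the current nodes surfaces as a Pending pod with a FailedScheduling event (a hypothetical, abbreviated transcript):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
workload-c-7758ccbf84-btrcn   0/1     Pending   0          41s

% kubectl describe pod workload-c-7758ccbf84-btrcn
...
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.
</code></pre></div></div><div class="paragraph" style="text-align:left;">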
Luna automates the addition of a right-sized cluster node to accommodate these pending pods (that were recreated because of the actions of the VPA-updater).<br><br>Additionally, Luna places pods on nodes via two techniques:&nbsp;<ol><li>Bin-packing: In this placement mode, pods with modest resource requirements are placed along with other pods sharing the same node.&nbsp;<br><br></li><li>Bin-selection: In this placement mode, pods with larger resource requirements are placed on their own nodes.&nbsp;</li></ol>The resource thresholds that determine whether a pod will be bin-packed or bin-selected are configurable via these <a href="https://docs.elotl.co/luna/Configuration/#bin-selection-1"><u>Luna parameters</u></a>:&nbsp; binSelectPodCpuThreshold, binSelectPodMemoryThreshold and binSelectPodGPUThreshold. Any pod whose resource request equals or exceeds these thresholds will be bin-selected.<br><br>When an underprovisioned pod&rsquo;s resources are increased by VPA, a bin-pack designated pod may become a bin-select designated pod. In this case, Luna automatically detects this change and places the pod appropriately on a bin-select node. We illustrate this via an experiment in the section: &ldquo;Experiment 4: VPA and Luna Interoperation to Handle Pod Under-provisioning&rdquo;.<br></div><h2 class="wsite-content-title"><font size="4">Application Over-Provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is overprovisioned, VPA recommends <strong>smaller</strong> resource values than its current allocation. In this case, since the pod&rsquo;s resource request is smaller, total cluster capacity will not need to change; i.e., the cluster will continue to be able to accommodate the updated pod.<br><br>However, the decrease in resource requests could result in a change in the designation of a pod from bin-select to bin-pack. In this case, the pod, after restart, will be placed on a bin-pack node by Luna. The bin-select node will automatically get scaled-in (or deleted) if no other pods were also running on that node. A detailed experiment of this scenario is described in the section: &ldquo;Experiment 3: VPA and Luna Interoperation to Handle Pod Over-provisioning&rdquo;.<br><br></div><h2 class="wsite-content-title"><font size="5">VPA & Luna Interoperability Experiments&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we detail a number of experiments to showcase how VPA and Luna interoperate under different operational conditions and modes.<br></div><h2 class="wsite-content-title"><font size="4">Experiment 1: Interoperation of VPA in &ldquo;<strong>Auto mode</strong>&rdquo; and Luna</font><br></h2><div class="paragraph">In this experiment, we illustrate an example where VPA recommends increased resources to a managed deployment. Luna promptly detects that the pod recreated by the vpa-updater cannot be accommodated as-is in the current cluster and hence adds a new node to the cluster and places the restarted pod on this new node.<br><br>When Luna and VPA (in auto mode) are used together, their admission webhooks need to be executed in the correct order: the VPA admission controller must first adjust pods&rsquo; resource values, and then Luna&rsquo;s admission webhook comes into effect. Luna then uses the updated resource values in a pod to choose an appropriate node. 
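Because Kubernetes invokes mutating webhooks sorted lexically by name, the resulting order can be checked by listing the webhook configurations (a hypothetical transcript; the Luna webhook configuration name shown is illustrative):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get mutatingwebhookconfigurations
NAME                     WEBHOOKS   AGE
vpa-webhook-config       1          12d
zz-luna-webhook-config   1          12d
</code></pre></div></div><div class="paragraph">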
Luna provides a <a href="https://docs.elotl.co/luna/Configuration/#webhookconfigprefix"><u>configuration parameter, called</u></a> webhookConfigPrefix, to enable this ordering.</div><h2 class="wsite-content-title"><font size="3">1. <strong>Initial Setup</strong></font><br></h2><div class="paragraph">Two deployments, &ldquo;workload-A&rdquo; and &ldquo;workload-B&rdquo;, are running on 2 nodes in a Luna-enabled EKS cluster.<br></div><div><div id="618942580580490076" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          14m   ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          14m   ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">2. Starting a VPA managed workload</font></strong><br></h2><div class="paragraph" style="text-align:left;">A third deployment, workload-C, managed by VPA is created on this cluster.<br></div><div><div id="593886771201632708" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl apply -f workload-C.yaml
verticalpodautoscaler.autoscaling.k8s.io/workload-c-vpa created
deployment.apps/workload-c created
</code></pre></div></div></div></div><div class="paragraph">The VPA custom resource is seen below.<br></div><div><div id="441804390500762881" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-c-vpa   Auto                          5s
</code></pre></div></div></div></div><div class="paragraph">We see that the CPU and memory request values are not immediately available.<br><br>Initially, workload-C is placed by Luna on an existing Luna-managed node, ip-192-168-20-122, because there is sufficient capacity on that node.</div><div><div id="878778035708163192" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-c-7758ccbf84-crgqm   1/1     Running   0          4s    ip-192-168-20-122.us-west-1.compute.internal
workload-c-7758ccbf84-qgv9d   1/1     Running   0          4s    ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          20m   ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          20m   ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph">Workload-C was chosen such that its CPU usage can be configured to spike up or down as needed: <a href="https://github.com/narmidm/k8s-pod-cpu-stressor"><u>cpu-stressor-pod</u></a>.<br><br>We then see that the workload-c pods&rsquo; CPU usage begins to spike up, as captured below:&nbsp;<br></div><div><div id="638862728453735093" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
workload-c-7758ccbf84-crgqm   346m         1Mi
workload-c-7758ccbf84-jvcxz   3303m        1Mi
workload-a-746c7d676c-g6fvm   2588m        1Mi
workload-b-748848d855-8xq2x   275m         1Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">We see that the VPA updater evicts one of the pods, and a newly created replacement pod enters the Pending state:<br></div><div><div id="759853797576358438" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
workload-c-7758ccbf84-btrcn   0/1     Pending   0          41s
workload-c-7758ccbf84-jvcxz   1/1     Running   0          101s
workload-a-746c7d676c-g6fvm   1/1     Running   0          22m
workload-b-748848d855-8xq2x   1/1     Running   0          22m
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">At the same time, we see that the CPU and memory recommendations are updated within the VPA custom resource object.&nbsp;</div><div><div id="139779818373401315" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM       PROVIDED   AGE
workload-c-vpa   Auto   2     262144k   True       2m58s
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Within a minute, we see that the pending pod successfully starts running on a newly created node (<span>ip-192-168-30-113</span>) whose creation was triggered by Luna. 
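One way to confirm the node&rsquo;s origin is to print its Luna-specific label (a hypothetical, abbreviated transcript; kubectl&rsquo;s -L flag adds the label value as a column):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get node ip-192-168-30-113.us-west-1.compute.internal -L node.elotl.co/created-by
NAME                                           STATUS   ...   CREATED-BY
ip-192-168-30-113.us-west-1.compute.internal   Ready    ...   luna
</code></pre></div></div><div class="paragraph" style="text-align:left;">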
The node&rsquo;s labels indeed include <span>node.elotl.co/created-by=luna</span>, verifying that the node creation was in fact initiated by Luna.</div><div><div id="902102840276024087" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE   NODE
workload-c-7758ccbf84-btrcn   1/1     Running   0          81s   ip-192-168-30-113.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">This experiment showcases that using Vertical Pod Autoscaling in an automated fashion requires an intelligent autoscaler like Luna to scale out nodes when necessary.&nbsp;&nbsp;</div><h2 class="wsite-content-title"><font size="4">Experiment 2: Interoperation of VPA in &ldquo;<strong>Initial mode</strong>&rdquo; and Luna</font><br></h2><div class="paragraph">In this experiment, we illustrate an example where VPA recommends increased resources to a managed deployment. However, since VPA is configured in &ldquo;Initial&rdquo; mode, resource requests are not automatically applied to containers. In this mode, requests are applied only during pod creation, so application administrators can restart a pod manually to apply updated requests.<br><br>Pods initially run on the existing Luna-managed node, ip-192-168-20-122.<br></div><div><div id="250670685940947106" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP   NODE
workload-d-79f5997949-cdxv4   1/1     Running   0          8m12s        ip-192-168-20-122.us-west-1.compute.internal
workload-d-79f5997949-gs7tb   1/1     Running   0          8m12s        ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          3d8h         ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          3d8h         ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">After new VPA recommendations have been calculated in the VPA object, the pods are deleted.<br></div><div><div id="239140065364508584" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE      CPU   MEM       PROVIDED   AGE
workload-d-vpa   Initial   2     262144k   True       22m
% kubectl delete pod workload-d-79f5997949-cdxv4 workload-d-79f5997949-gs7tb
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">We see that 1 replica of the workload gets started on a new Luna-triggered node 
(ip-192-168-3-62), taking into account the pod&rsquo;s newly assigned resource request.<br></div><div><div id="461362999972141551" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    IP   NODE
workload-d-79f5997949-4zpxw   1/1     Running   0          5m3s        ip-192-168-3-62.us-west-1.compute.internal
workload-d-79f5997949-kvwt5   1/1     Running   0          5m3s        ip-192-168-20-122.us-west-1.compute.internal
workload-a-746c7d676c-g6fvm   1/1     Running   0          3d9h        ip-192-168-29-254.us-west-1.compute.internal
workload-b-748848d855-8xq2x   1/1     Running   0          3d9h        ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">With the updated assignment, both the existing node (ip-192-168-20-122) and the new Luna-provisioned node (ip-192-168-3-62) are operating at full capacity.<br></div><div><div id="568050108987883762" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-192-168-20-122.us-west-1.compute.internal   3977m        101%   588Mi           3%
ip-192-168-29-254.us-west-1.compute.internal   2261m        57%    1065Mi          7%
ip-192-168-3-62.us-west-1.compute.internal     4000m        102%   531Mi           3%
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Experiment 3: VPA and Luna interoperation to handle Pod over-provisioning</font><br></h2><div class="paragraph" style="text-align:left;">When a pod is initially over-provisioned, VPA can recommend lower resource request values by observing resource usage over a period of time. These lower resource values, recommended by VPA, can result in a pod initially categorized as a bin-select pod by Luna later being categorized as a bin-pack pod. In the experiment described below, we showcase how VPA and Luna work well together to handle this appropriately.&nbsp;</div><h2 class="wsite-content-title"><font size="3"><strong>1. Creation of an over-provisioned workload</strong></font><br></h2><div class="paragraph" style="text-align:left;">We create a workload, <span>workload-g</span>, that is overprovisioned. A VPA object is created for this deployment. 
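A sketch of the relevant fragment of such an over-provisioned container spec (abridged; as noted below, each replica requests 2 CPUs, far above its actual usage):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Abridged sketch of workload-g's container spec: the CPU request
# (2 cores) is far above the ~163m that VPA later observes.
resources:
  requests:
    cpu: "2"
</code></pre></div></div><div class="paragraph" style="text-align:left;">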
Initially, VPA does not have a resource recommendation since there is insufficient historical data.<br></div><div><div id="150792545869215683" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-g-vpa   Auto                          34s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3">2.&nbsp;<strong>Pod placement as bin-select on separate nodes&nbsp;</strong></font><br></h2><div class="paragraph" style="text-align:left;">Initially, the pods of this deployment request 2 CPUs each, as specified in the deployment manifest. Luna marks these pods as bin-select pods since the CPU request value reaches Luna&rsquo;s default bin-select threshold of 2 CPUs.<br><br>As can be seen below, Luna places the pods on two separate nodes, <span>ip-192-168-11-117</span> and <span>ip-192-168-31-139</span>.&nbsp;<br></div><div><div id="685049623171331800" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d    ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d2h   ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-4l9kr   1/1     Running   0          96s    ip-192-168-11-117.us-west-1.compute.internal
workload-g-6bd4dc4c66-pzq6m   1/1     Running   0          96s    ip-192-168-31-139.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>3. Pod right-sizing by VPA</strong></font><br></h2><div class="paragraph" style="text-align:left;">After a few minutes of operation, VPA utilizes the usage metrics and recommends the following resource requests. We see that the recommended CPU request for the pod is only 163m of CPU, while the original CPU request in the pod&rsquo;s manifest was for 2 CPUs.&nbsp;</div><div><div id="372652040633149840" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU    MEM       PROVIDED   AGE
workload-g-vpa   Auto   163m   262144k   True       101s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">4. Right-sized Pod placement via bin-packing by Luna</font></strong><br></h2><div class="paragraph" style="text-align:left;">We use VPA in <strong>auto</strong> update mode in this experiment. 
The pods therefore get restarted and updated with the recommended lower resource values automatically.&nbsp; Luna detects the new resource values on the restarted pods and places them as bin-pack pods on an existing bin-pack node, <span>ip-192-168-20-122</span>, as seen below.<br></div><div><div id="800541317848882321" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE    IP   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d         ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d2h        ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-pjs6x   1/1     Running   0          4s          ip-192-168-20-122.us-west-1.compute.internal
workload-g-6bd4dc4c66-qq9dl   1/1     Running   0          64s         ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">From this experiment, we see that using Luna with VPA can help handle overprovisioned pods by right-sizing them and placing them on appropriate nodes automatically.</div><h2 class="wsite-content-title"><font size="4">Experiment 4: VPA and Luna Interoperation to Handle Pod Underprovisioning</font><br></h2><div class="paragraph" style="text-align:left;">Just as applications can be overprovisioned, as we saw in Experiment 3, applications can also be under-provisioned. This can result in a degradation of application performance and necessitates prompt remediation. In the following example, we show how VPA and Luna operate together to handle this situation without any manual intervention.&nbsp;</div><h2 class="wsite-content-title"><font size="3"><strong>1. Creation of an under-provisioned workload</strong></font><br></h2><div class="paragraph" style="text-align:left;">An under-provisioned Kubernetes deployment, <strong>workload-f</strong>, is created. The workload&rsquo;s CPU request is set to <strong>100m</strong> in its manifest. We use the <a href="https://github.com/narmidm/k8s-pod-cpu-stressor"><u>cpu-stressor-pod</u></a> to configure its actual CPU usage to be much larger than this 100m request. A VPA object is also created for this deployment. Initially, the VPA object managing this deployment does not have any resource recommendations due to insufficient historical data:<br></div><div><div id="100783491973033615" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU   MEM   PROVIDED   AGE
workload-f-vpa   Auto                          21s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>2. 
Pod placement as bin-pack by Luna</strong></font><br></h2><div class="paragraph" style="text-align:left;">Since the pod&rsquo;s CPU request of 100m falls below Luna&rsquo;s default bin-select threshold of 2 CPUs, Luna places both replicas of workload-f on a bin-pack node, <span>ip-192-168-20-122.</span>&nbsp;</div><div><div id="219700599893776754" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          10d     ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          3d22h   ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-kkrrf     1/1     Running   0          26s     ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-lfq2m     1/1     Running   0          26s     ip-192-168-20-122.us-west-1.compute.internal
</code></pre></div></div></div></div><h2 class="wsite-content-title"><strong><font size="3">3. Pod right-sizing by VPA</font></strong><br></h2><div class="paragraph" style="text-align:left;">Using the <strong>kubectl top</strong> command, we see that workload-f&rsquo;s CPU usage is much higher than its original request value of 100m for CPU.<br></div><div><div id="756372930818145156" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
workload-a-746c7d676c-g6fvm   2516m        1Mi
workload-b-9c878584d-fv7tq    101m         1Mi
workload-f-c9cd8df4-774bv     1922m        1Mi
workload-f-c9cd8df4-b9fmr     1910m        1Mi
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">Soon, VPA utilizes the observed CPU usage values and recommends a higher CPU value, 2406m, as seen below:&nbsp;</div><div><div id="299832717245604512" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get vpa
NAME             MODE   CPU     MEM       PROVIDED   AGE
workload-f-vpa   Auto   2406m   262144k   True       2m2s
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="3"><strong>4. Right-sized Pod placement via bin-select by Luna</strong></font><br></h2><div class="paragraph" style="text-align:left;">Since VPA is in auto mode for this experiment, workload-f&rsquo;s pods are recreated with the recommended higher CPU request values. The new CPU values now exceed Luna&rsquo;s bin-select threshold of 2 CPUs. 
Luna, in turn, responds by placing these pods on newly created bin-select nodes <span>ip-192-168-28-198</span> and <span>ip-192-168-11-176.&nbsp;</span><br></div><div><div id="969868164914383452" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">% kubectl get pods -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP   NODE
workload-a-746c7d676c-g6fvm   1/1     Running   0          11d          ip-192-168-29-254.us-west-1.compute.internal
workload-b-9c878584d-fv7tq    1/1     Running   0          4d6h         ip-192-168-20-122.us-west-1.compute.internal
workload-f-c9cd8df4-6wvc9     1/1     Running   0          5h44m        ip-192-168-28-198.us-west-1.compute.internal
workload-f-c9cd8df4-l554v     1/1     Running   0          5h42m        ip-192-168-11-176.us-west-1.compute.internal
</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">From this experiment, we see that Luna and VPA work well together to manage under-provisioned resource requests of pods without any manual intervention.<br></div><h2 class="wsite-content-title"><font size="5">VPA and In-place Pod Resizing&nbsp;</font><br></h2><div class="paragraph" style="text-align:left;"><a href="https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/"><u>In-place pod update</u></a> is a feature in Kubernetes that allows pods&rsquo; resource requests to be updated without having to evict and restart the pod. It has been available as an alpha feature from Kubernetes 1.27 (behind a feature gate). It is available as <a href="https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/#beta-in-place-resource-resize-for-vertical-scaling-of-pods"><u>beta from Kubernetes 1.33</u></a>.<br><br>Currently, in released versions of VPA (as of April 2025), the vpa-updater component does not utilize in-place pod resizing. However, VPA is being extended to leverage this feature; details of this development are tracked here: <a href="https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/enhancements/4016-in-place-updates-support/README.md"><u>AEP-4016</u></a>. It is important to note that VPA with in-place updates is not guaranteed to prevent pod disruptions, since the actuating resize operation depends on the underlying container runtime. The end-user expectation is for pod disruptions to be <strong>minimal</strong>.<br><br>
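To get a feel for in-place resizing independent of VPA, a pod&rsquo;s resize subresource can be patched directly. A hypothetical example, assuming a Kubernetes 1.33+ cluster and a kubectl version that supports the resize subresource (the pod and container names are illustrative):</div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">% kubectl patch pod workload-c-7758ccbf84-btrcn --subresource resize \
    --patch '{"spec":{"containers":[{"name":"workload-c","resources":{"requests":{"cpu":"800m"}}}]}}'
</code></pre></div></div><div class="paragraph" style="text-align:left;">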
When VPA is able to utilize the in-place pod resizing feature, Luna&rsquo;s hot node mitigation feature may be able to help handle those cases where pods with increased resource requests cause excessive node utilization. Hot node mitigation is described in detail in this blog post: <a href="https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems"><u>Luna Hot Node Mitigation: A chill pill to cure pod performance problems</u></a>.</div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">In summary, when considering the use of Vertical Pod Autoscaling for your Kubernetes workloads, leveraging an Intelligent Kubernetes Cluster Autoscaler, like Luna, can ensure that restarted, scaled-up or scaled-down pods in your cluster are placed on just-in-time, right-sized nodes in a fully automated fashion. If you would like to try VPA with an intelligent cluster autoscaler, please <a href="https://www.elotl.co/luna-free-trial.html"><u>download Luna</u></a> and reach out to us with questions or comments at <a href="mailto:info@elotl.co"><u>info@elotl.co</u></a>.&nbsp;<br><br></div><div class="paragraph"><strong><br>Author:</strong><br>Selvi Kadirvel (VP Engineering, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Fun with Spot: Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs]]></title><link><![CDATA[https://www.elotl.co/blog/fun-with-spot]]></link><comments><![CDATA[https://www.elotl.co/blog/fun-with-spot#comments]]></comments><pubDate>Thu, 24 Apr 2025 18:07:00 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/fun-with-spot</guid><description><![CDATA[Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs. Offline inference is well-suited to take advantage of spot GPU capacity in public clouds.&nbsp; However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly.&nbsp; The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="4">Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:216px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/fun-with-spot-luna.png?1745525657" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Offline inference is well-suited to take advantage of spot GPU capacity in public clouds.&nbsp; However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly.&nbsp; The <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a> scales cloud <a href="https://kubernetes.io/"><u>Kubernetes</u></a> (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU SKU and count as well as maximum estimated hourly cost.&nbsp; In this blog, we share recent experiences with offline inference on <a href="https://cloud.google.com/kubernetes-engine?hl=en"><u>GKE</u></a>, <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>AKS</u></a>, and <a href="https://aws.amazon.com/eks/"><u>EKS</u></a> clusters using Luna.&nbsp; Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, <strong>reducing estimated hourly costs by 38-50%</strong> versus an on-demand baseline and turning an often tedious task into bargain-jolt <strong>fun</strong>.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2><div class="paragraph" style="text-align:left;">Applications such as query/response chatbots are handled via online serving, in which each input and prompt is provided in real-time to the model running on one or more GPU workers.&nbsp; Automatic instance allocation for online serving presents efficiency challenges.&nbsp; Real-time response is sensitive to scaling latency during usage spikes and can be impacted by spot reclamation and replacement.&nbsp; Also, peak online serving usage often overlaps with peak cloud resource usage, affecting the available capacity for GPU instances.&nbsp; We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., <a href="https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people"><u>scaling Helix to handle ML load</u></a> and <a href="https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler"><u>reducing deploy time for new ML workers</u></a>.</div><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">This blog focuses on offline inference, which avoids the challenges with the real-time burstiness of online serving.&nbsp; Applications such as text summarization, content generation, and financial forecasting employ offline inference, in 
which input and prompt pairs are sent to the model as a batch job, with the output being stored for subsequent use.&nbsp; Automatic instance allocation for offline inferencing can achieve greater resource efficiency than that for online serving.&nbsp; Offline prediction jobs are generally tolerant of scaling latency and spot instance reclamation and replacement, can be run off-peak, and are often configured with a fixed-size set of instances to handle the input load, which is typically known in advance.<br><br>We present experiences using Luna to allocate spot and on-demand GPU instances on GKE, AKS, and EKS cloud K8s clusters for offline inference.&nbsp; We share observations on resource efficiency in terms of GPU instance costs, and on instance availability and allocation search. The results show the cost savings from utilizing spot pricing and instance choice flexibility, and the value of using Luna to efficiently manage instance allocation in compliance with constraints and guardrails.&nbsp; While the results represent a small sample size, and your mileage may vary, we hope they demonstrate strategies you will find beneficial for your offline inference jobs.</div><h2 class="wsite-content-title"><font size="5">Example Offline Inference Workload</font><br></h2><div class="paragraph" style="text-align:left;">For offline inferencing, we chose to use the <a href="https://www.ray.io/"><u>Ray AI platform</u></a>, with the <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay operator</u></a> to deploy a RayJob on K8s.&nbsp; We adapted this simple <a href="https://docs.ray.io/en/latest/cluster/kubernetes/examples/rayjob-batch-inference-example.html"><u>batch inference example</u></a>, that runs an inference job for image classification on a single-node Ray cluster.&nbsp; The single-node Ray cluster comprises a GPU-enabled head that serves as a worker, which was run on an on-demand instance with 4 Nvidia T4 GPUs.&nbsp; This basic setup was adequate for the purpose of exercising GPU instance allocation and measuring instance cost on a set of cloud vendors.&nbsp; We updated <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml"><u>our version of the example</u></a> to indicate that Luna should handle allocating the instances for the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L16"><u>Ray cluster head</u></a> and for the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L62"><u>pod that submits the Ray job</u></a> to the Ray cluster.&nbsp; We added the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L7"><em><u>shutdownAfterJobFinishes</u></em></a> option to have the Ray cluster automatically deleted after the RayJob completes, to avoid consuming resources once the Ray cluster becomes idle.<br><br>We changed several aspects of the example around GPU SKU choice, GPU count, and pricing category to make obtaining the GPU cloud capacity easier and less costly, as described below.&nbsp; These aspects may be worth considering for your workloads.<br><br><a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L18"><em><u>Flexible GPU SKU choice</u></em></a>. 
By default, Luna will choose the least expensive instance that meets a pending pod's resource requirements, but since the GPU-enabled Ray head in the Ray example was run on an instance with Nvidia T4 GPUs, we wanted to specify that Luna use that SKU in our experiments.&nbsp; However, we found the T4 SKU could be in short supply.&nbsp; We added a Luna annotation to the Ray head configuration indicating that Luna could choose a node with any GPU SKU in a list specified by the env variable <em>RAY_CLUSTER_GPU_SKUS,</em> which we populated with SKUs chosen as described below. Giving Luna the option to choose between several GPU SKU options facilitated its obtaining spot GPU capacity in a timely manner.<br><br><a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L40"><em><u>Flexible GPU count</u></em></a>. In the Ray example, the GPU-enabled Ray head was run on an instance with 4 T4 GPUs.&nbsp; However, we found that 4-GPU instances had lower availability and higher cost relative to T4 instances with fewer GPUs, and that the example ran fine with fewer T4s.&nbsp; The constant 4 was replaced with the env variable <em>RAY_CLUSTER_GPU_COUNT</em> to allow us to reduce this value, with <em>RAY_CLUSTER_CPU_COUNT</em> and <em>RAY_CLUSTER_MEMORY_SIZE</em> env variables added to allow us to scale down the CPU and memory requests accordingly.<br><br><em>Flexible pricing category for the</em> <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L19"><em><u>Ray head</u></em></a> <em>and</em> <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L64"><em><u>Ray job submitter</u></em></a>.&nbsp; In the Ray example, the workloads were run on pre-allocated on-demand instances.&nbsp; We updated the job configs to allow the user to specify the price categories from which Luna should request an instance via the env variable <em>BATCH_JOB_PRICE_CATEGORIES</em>.&nbsp; This option can be set to &ldquo;on-demand&rdquo; or to &ldquo;spot&rdquo; to indicate that Luna should only use that specific pricing category or the option can be set to &ldquo;spot,on-demand&rdquo; to have Luna choose the instance having the lowest estimated price drawn from either category.<br><br>Also, we added pod annotations to place guardrails on <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L21"><u>instance cost</u></a>, to avoid very expensive instances, and on <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L20"><u>GPU count</u></a>, to reduce the instance selection search space.&nbsp; <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml"><u>Here</u></a> is the updated version of the RayJob configuration.<br><br>To deploy the RayJob with a specific configuration, we did the following:<ul><li>&nbsp;Source the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-setup.sh">ray-job.batch-setup.sh</a> script to define the environment variable settings, e.g.:<br></li></ul></div><div><div id="678061309550825306" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> . 
./ray-job.batch-setup.sh    </code></pre></div></div></div></div><div class="paragraph"><ul><li>Create an instance of the RayJob yaml with the environment variables expanded via <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml">ray-job.batch-inference.yaml</a>:<br></li></ul></div><div><div id="741991535739894695" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> envsubst &lt; ray-job.batch-inference.yaml &gt;ray-job.yaml    </code></pre></div></div></div></div><div class="paragraph"><ul><li>Deploy that instance:<br></li></ul></div><div><div id="716028859401396193" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"> kubectl apply -f  ray-job.yaml    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Luna Operation on Offline Inference Workload</font><br></h2><div class="paragraph" style="text-align:left;">Each offline inference workload run was performed on a cloud K8s cluster running Luna 1.2.16.&nbsp; For the workload&rsquo;s pending pods and their constraints, Luna generates a list of candidate instance types with price categories and sorts them by estimated hourly cost.&nbsp; Luna estimates spot hourly cost as a configurable ratio <em>spotPriceRatioEstimate</em> of on-demand hourly cost; the default value is 0.5, which is a conservative estimate on GKE, AKS, and EKS.&nbsp; Luna then selects the candidate with the lowest estimated cost and sends a request to the cloud vendor to allocate it.&nbsp; When the requested instance type in the specified price category is readily available, the cloud vendor completes the allocation within Luna&rsquo;s default <a href="https://docs.elotl.co/luna/Configuration/#scaleuptimeout"><em><u>scaleUpTimeout</u></em></a> time of 10m.<br><br>When a requested instance type and category combination is not currently available, Luna generates a new request as follows.&nbsp; If the request fails with the cloud reporting insufficient capacity, Luna avoids the associated combination for a configurable back-off time and generates a new allocation request for the candidate with the next lowest estimated cost.&nbsp; If the cloud vendor keeps the request running for longer than <em>scaleUpTimeout,</em> Luna discontinues that request and, as in the failure case, avoids using the associated combination for a configurable back-off time and generates a new request for the candidate with the next lowest estimated cost.&nbsp; We&rsquo;ve found that Luna&rsquo;s strategy of discontinuing long-running allocation requests, which we&rsquo;ve seen often persist for 40m or more and then fail on GKE, is efficient since it allows Luna to retry instance allocation with an alternative candidate that is allocated successfully sooner.<br></div><h2 class="wsite-content-title"><font size="5">GKE Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The GKE runs were executed on a standard GKE regional cluster running K8s 1.32 in the <em>us-central1</em> region.&nbsp; This region offers a wide selection of GPU-enabled 
instance types and GKE regional clusters support <a href="https://cloud.google.com/blog/products/containers-kubernetes/choosing-a-regional-vs-zonal-gke-cluster"><u>more instance availability</u></a> than zonal clusters.&nbsp; We ran the workload during US daytime hours, likely a peak usage period for the region.&nbsp; Our goal was to capture data that reflects conditions when spot and on-demand GPU capacity might be limited, providing a conservative estimate of the spot benefit compared to what would be seen for off-peak runs.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, which specifies instance-offerings but no resource requests, Luna chose an <em>e2-medium</em> instance.&nbsp; This instance type has a low on-demand price ($0.0553/hr) and no issues were found with obtaining spot capacity for this instance type.&nbsp;<br><br>The main costs and capacity challenges were in allocating a node to host the GPU-enabled Ray cluster head.&nbsp; Results are given in Table 1.&nbsp; The first row represents the on-demand baseline for comparison with spot allocation.&nbsp; We initially attempted to have Luna allocate an on-demand node that matched the node used in the Ray example, i.e., an instance that could provide 4 T4 GPUs, 54 CPUs, and 54 GB memory, for which we specified no constraints on maximum GPUs or cost.&nbsp; However, Luna was not able to obtain an instance for that config after a round of trying all 5 candidate instance types with its default 10m <em>scaleUpTimeout</em> for each.&nbsp; Seeing that Luna had tried all candidates, we canceled the RayJob; while Luna would have continued to try to get a matching instance, and presumably would have eventually been successful, we considered the latency to get this instance type was too high for our use case.&nbsp; We tried a scaled-down config, with 2 T4 GPUs (as per the Ray example GPU SKU), <em>RAY_CLUSTER_CPU_COUNT</em> set to 27 CPUs, and <em>RAY_CLUSTER_MEMORY_SIZE set to</em> 27 GB memory, and Luna successfully obtained an instance which we used as our baseline.<br><br>We next had Luna try to allocate a spot node, using the baseline resource config with spot added to the price category.&nbsp; We also added more GPU SKUs to <em>RAY_CLUSTER_GPU_SKUS</em>, to give Luna more options to find spot nodes.&nbsp; And since the additional SKUs were more costly, we added a node cost max.&nbsp; After Luna tried two spot T4 instance types whose long-running scaling operations hit Luna&rsquo;s 10m <em>scaleUpTimeout</em> and were discontinued, Luna obtained a 2-GPU P4 spot instance, which was 38% cheaper than the on-demand 2-GPU T4 instance.&nbsp; Using Luna&rsquo;s strategy of retrying an alternative candidate when scale-up time exceeds <em>scaleUpTimeout</em>, an alternative spot instance was found in around 20m, rather than likely spending around 40m trying and ultimately failing to allocate the first candidate T4 spot instance.</div><div><div id="336950483674786012" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances 
Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>2</td><td>T4</td><td>on-demand</td><td>2</td><td>N/A</td><td>N/A</td><td>N1-standard-32 w/2 T4 GPUs (on-demand)</td><td>$2.22/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>2</td><td>T4,P4,L4</td><td>spot, on-demand</td><td>2</td><td>$4.75/hr</td><td>N1-standard-32 w/2 T4 GPUs (spot), N1-highmem-32 w/2 T4 GPUs (spot)</td><td>N1-standard-32 w/2 P4 GPUs (spot)</td><td>$1.36/hr</td><td>0.62</td></tr></tbody></table></div></div><div class="paragraph">Table 1: Luna GKE node allocation for RayJob GPU-enabled head with specified constraints</div><h2 class="wsite-content-title"><font size="5">AKS Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The AKS runs were executed on an AKS cluster running K8s 1.31 in the <em>east-us</em> region.&nbsp; As with GKE, the workload was run during US daytime, for a conservative estimate of the spot benefits.&nbsp; Note that to use spot, tolerations needed to be added to the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L24"><u>Ray head</u></a> and <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L68"><u>Ray job submitter</u></a>.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, Luna allocated a <em>Standard_B2als_v2</em> instance.&nbsp; This instance type has a low on-demand price ($0.0376/hr) and spot capacity was available for the type.<br><br>The results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 2.&nbsp; Luna was able to allocate an on-demand 4-GPU T4 node corresponding to the node used in the Ray example run, shown in row 1. 
&nbsp; However, there were challenges allocating a spot node for comparison.&nbsp; Luna was not able to allocate a spot node for the original config due to insufficient capacity.&nbsp; Also, Azure does not support many instance types with 2 GPUs, including having no 2-GPU T4 nodes.&nbsp; Hence, for spot allocation, the Ray head was scaled down to a config of 1 GPU with 14 CPUs and 14 GB memory.&nbsp; As with GKE spot allocation, more GPU SKU choices were added to <em>RAY_CLUSTER_GPU_SKUS</em>, along with a max node cost.&nbsp; With this config, Luna obtained a spot instance with 1 T4 GPU at a cost of $0.60/hr.&nbsp; To compare this 1-GPU cost to the baseline cost of $4.35/hr for 4 GPUs, the baseline cost was normalized via dividing it by 4 and the spot cost was compared to that quotient; the spot cost was 45% lower.</div><div><div id="852360129388830417" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline (normalized)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>on-demand</td><td>N/A</td><td>N/A</td><td>N/A</td><td>Standard_NC64as_T4_v3 (on-demand)</td><td>$4.35/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>1</td><td>T4,V100,A10</td><td>spot, on-demand</td><td>1</td><td>$4.75/hr</td><td>N/A</td><td>Standard_NC16as_T4_v3 (spot)</td><td>$0.60/hr</td><td>0.55</td></tr></tbody></table></div></div><div class="paragraph">Table 2: Luna AKS node allocation for RayJob GPU-enabled head with specified constraints</div><h2 class="wsite-content-title"><font size="5">EKS Offline Inference Allocation Results</font><br></h2><div class="paragraph" style="text-align:left;">The EKS runs were executed on an EKS cluster running K8s 1.32 in the <em>us-west-2</em> region.&nbsp; As was the case for GKE and AKS, the workload was run during US daytime hours, with the intent of yielding a conservative estimate of the spot benefits.<br><br>For the <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.batch-inference.yaml#L59"><u>RayJob submitter pod</u></a> configuration, Luna allocated a <em>t3a.small</em> instance.&nbsp; This instance type has a low on-demand price ($0.0188/hr) and there were no issues obtaining spot capacity for the type.&nbsp;<br><br>Results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 3.&nbsp; Luna was able to allocate an on-demand node with 4 T4 GPUs as in the Ray documentation; the result is shown in row 1.&nbsp; Note that <em>RAY_CLUSTER_CPU_COUNT</em> was dropped to 44 and <em>RAY_CLUSTER_MEMORY_SIZE</em> to 44 GB, given that AWS does not have any 4-GPU T4 instances with enough CPUs to handle the original request of 54.&nbsp; Row 2 shows the results of adding spot to the input pricing category; Luna was able to allocate a spot version of the same instance type.<br></div><div><div id="457489514558201647" align="left" style="width: 100%; 
overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_COUNT</th><th style="width: 10%; word-break: break-all;">RAY_CLUSTER_GPU_SKUS</th><th style="width: 10%;">Input Price Category</th><th style="width: 10%;">Input Max GPUs</th><th style="width: 10%;">Input Max Cost</th><th style="width: 15%;">Instances Luna tried that had insufficient capacity</th><th style="width: 15%;">Instance Luna Found</th><th style="width: 10%;">Instance Found Est Cost</th><th style="width: 10%;">Est Cost Ratio to Baseline</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>on-demand</td><td>N/A</td><td>N/A</td><td>N/A</td><td>g4dn.12xlarge (on-demand)</td><td>$3.91/hr</td><td>1.00</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>4</td><td>T4</td><td>spot, on-demand</td><td>4</td><td>$4.75/hr</td><td>N/A</td><td>g4dn.12xlarge (spot)</td><td>$1.96/hr</td><td>0.50</td></tr></tbody></table></div></div><div class="paragraph">Table 3: Luna EKS node allocation for RayJob GPU-enabled head with specified constraints<br></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">Offline prediction jobs are typically not considered sensitive to node allocation latency and to the impact of spot reclamation and replacement and hence are ideal candidates for spot node use.&nbsp; We&rsquo;ve presented the results of using the Luna smart cluster autoscaler to allocate spot and on-demand instances on GKE, AKS, and EKS clusters for an example offline prediction job.&nbsp; We&rsquo;ve shown conservative estimated hourly cost savings of 38-50% using spot, achieved in an easy (and hence fun!) 
way with Luna&rsquo;s efficient approach to instance allocation search.<br><br>We invite you to have fun with Luna!&nbsp; Download the <a href="https://www.elotl.co/luna-free-trial.html"><u>free trial version of Luna</u></a> or reach out to us at <a href="mailto:info@elotl.co">info@elotl.co</a> if you would like to try Luna for your batch inference (or any other) workloads!<br><br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br></div>]]></content:encoded></item><item><title><![CDATA[Reducing Deploy Time for LLM Serving on Cloud Kubernetes with Luna Smart Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler#comments]]></comments><pubDate>Tue, 28 Jan 2025 14:30:31 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler</guid><description><![CDATA[OVERVIEW26 minutes!&nbsp; 26 long minutes was our wait time in one example case for our chatbot to be operational.&nbsp; Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness!&nbsp; Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model.&nbsp; [...] 
]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">OVERVIEW</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:213px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/reducing-deploy-time-for-llm-serving-on-cloud-kubernetes-with-luna-smart-autoscaler.png?1738074867" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">26 minutes!&nbsp; 26 long minutes was our wait time in one example case for our chatbot to be operational.&nbsp; Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness!&nbsp; Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model.&nbsp; But we hoped we could make some simple changes to speed it up, and we did.&nbsp; In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes at deployment times that won't bum you out.<br><br>We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a>.&nbsp; Our example LLM serving workload is a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html"><u>KubeRay</u></a> <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/rayservice-quick-start.html"><u>RayService</u></a> using <a href="https://docs.vllm.ai/en/latest/"><u>vLLM</u></a> to serve an open-source model downloaded from <a href="https://huggingface.co/models"><u>HuggingFace</u></a>.&nbsp; <strong>We measured deploy-time improvements of up to 60%.</strong><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">APPROACH</font><br></h2><div class="paragraph" style="text-align:left;">We observed that deploying LLM serving workloads on autoscaled cloud Kubernetes clusters can take between 13 and 26 minutes.&nbsp; Key components of this time include adding a GPU node to the cluster to host the LLM serving worker pod, fetching the container image for that pod from a container registry, and downloading the LLM weights for model serving by that pod.&nbsp; There are a number of approaches to reducing LLM deploy time, which have various cost and complexity trade-offs.<br><br>One approach to reducing node scale-up time is to use node over-provisioning via low-priority pod deployment to keep extra node(s) available for scale-up, and to have a daemonset pre-pull the container image(s) of interest into the image cache on the extra node(s).&nbsp; We utilized this approach in our previous work described in <a href="https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems"><u>this Elotl
blog</u></a> and Scale describes using this kind of approach in <a href="https://scale.com/blog/reduce-cold-start-time-llm-inference"><u>this Scale blog</u></a>.&nbsp; A downside with this approach is the cost overhead of the extra idle node(s).&nbsp; Our previous work involved serving ML models that could run on CPU-only nodes, where the cost overhead was relatively low; our current work involves serving LLM models requiring more expensive GPU nodes, so the cost overhead was higher than we wanted.&nbsp; Hence, we focused on allocating GPU nodes on demand and on techniques to quickly populate new nodes with the image of interest.<br><br>To quickly populate an image on new nodes, we first explored using Dragonfly pre-seeding with peer-to-peer distribution, but we did not get the performance results we expected despite a number of tuning attempts, and we were also deterred by its usage complexity.&nbsp; We then looked at using cloud-vendor solutions to preload or cache/stream the images and found the solutions gave good results out-of-the-box, and were well-supported by the Luna smart cluster autoscaler.&nbsp; A drawback with this approach is the need for cloud-specific setup, but since each cloud's setup is fairly simple and reasonably well-documented, this was not a deal-breaker for us.&nbsp; And we&rsquo;re including setup detail links in this blog, so hopefully it will be even easier for you, blog reader!<br><br>With respect to reducing the time to download the model weights, we wanted to utilize HuggingFace's optimizations in this area before looking at the ROI of pursuing further improvement on our side.&nbsp; We found downloading with <a href="https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhubenablehftransfer"><u>HF_HUB_ENABLE_HF_TRANSFER enabled</u></a> gave a modest additional improvement in startup time relative to that given by the image load improvements.&nbsp; We have not yet looked at techniques such as pre-downloading the weights to shared fast storage with corresponding retargeting of the model loading path.&nbsp; We note that our model of interest is stored using the <a href="https://github.com/huggingface/safetensors"><u>safetensors</u></a> representation.<br></div><h2 class="wsite-content-title"><font size="5">PER-CLOUD IMPROVEMENTS</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we present our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler.&nbsp; Our example LLM serving workload is a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace.&nbsp; Our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts.<br><br>We collected baseline and improved deployment times for a KubeRay RayService using vLLM to serve the open-source model <a href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"><u>microsoft/Phi-3-mini-4k-instruct</u></a> downloaded from HuggingFace.&nbsp; Deployment time is measured from K8s submission until the <em>service/llm-model-serve-serve-svc</em> endpoint is ready.&nbsp; We ran both static and dynamic setups.&nbsp; For the static setup, we ran without the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with <em>replicas</em> set to 1.&nbsp; For the dynamic setup, we ran with the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with <em>replicas</em> and <em>minReplicas</em> set to 0; the Ray Autoscaler scaled up to 1 replica during the deployment.&nbsp; The dynamic setup requires more time to deploy than the static setup, since the scale-up from 0 to 1 GPU Ray worker replica does not start until after the Ray head is configured and the service workload is submitted to it, whereas in the static setup, the single GPU Ray worker is created in parallel with the CPU Ray head.<br></div>
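<div class="paragraph" style="text-align:left;">The difference between the two setups comes down to a few fields in the RayService worker group spec.&nbsp; A minimal sketch of both variants follows; the group name and replica bounds are illustrative, and the dynamic variant assumes the Ray Autoscaler is enabled on the underlying RayCluster (e.g., via KubeRay's <em>enableInTreeAutoscaling</em> flag):<br><pre><code># Static setup: the single GPU worker is created in parallel with the head.
workerGroupSpecs:
- groupName: gpu-group   # illustrative name
  replicas: 1
---
# Dynamic setup: the worker group starts empty; the Ray Autoscaler
# scales it from 0 to 1 after the head is up and the workload is submitted.
workerGroupSpecs:
- groupName: gpu-group
  replicas: 0
  minReplicas: 0
  maxReplicas: 1
</code></pre></div>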
<h2 class="wsite-content-title"><font size="5">Reducing EKS LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on EKS, we chose the strategy described <a href="https://aws.amazon.com/blogs/containers/reduce-container-startup-time-on-amazon-eks-with-bottlerocket-data-volume/"><u>here</u></a> of using Bottlerocket node images with a data volume pre-populated to contain a snapshot of our container image.&nbsp; The Luna smart autoscaler supports allocating Bottlerocket nodes.&nbsp; As described below, we built an ECR image for our workload container, took a snapshot of it, and configured Luna to use Bottlerocket with our snapshot.<br><br>Our LLM serving workload uses the ray-ml image <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub, which is also published to ECR as <em>public.ecr.aws/anyscale/ray-ml:2.33.0-py311</em>.&nbsp; In addition, our RayService config ran &ldquo;pip install vllm==0.5.4&rdquo;, which we discovered impacted scale-up time.&nbsp; And to use HF_HUB_ENABLE_HF_TRANSFER to speed up model download, we needed to include &ldquo;pip install hf_transfer&rdquo; as well.&nbsp; So we created a new ECR container image that combined <em>public.ecr.aws/anyscale/ray-ml:2.33.0-py311</em> with the vllm and hf_transfer pip installs.&nbsp; We took a snapshot of the resulting ECR image using the instructions <a href="https://github.com/aws-samples/bottlerocket-images-cache?tab=readme-ov-file#build-ebs-snapshot-with-cached-container-image"><u>here</u></a>.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a>, with Luna configured as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#bottlerocket-node-images"><u>here</u></a> to use Bottlerocket node images and the snapshot.<br><br>Table 1 contains the EKS measurement results, with the improved time including the impact of both the reduced image load time and reduced model download time using <em>hf_transfer</em>.&nbsp; Both static and dynamic deployment times were significantly improved, with static time reduced by 26% and dynamic time reduced by 47%, almost twice as much.&nbsp; We expected the improvement to be higher for the dynamic case, given that the time to create the worker node and pull its image is not overlapped with the time to create the head node and pull its image, so the worker image pull speedup is more impactful.&nbsp; We note that using <em>hf_transfer</em> for model download without also using the custom ECR image is slower than the baseline; the time needed to do the &ldquo;pip install hf_transfer&rdquo; at runtime is higher than the time saved by the faster model download.<br><br></div>
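<div class="paragraph" style="text-align:left;">A minimal sketch of that combined image follows; the base image and pip packages are exactly those described above, while the registry, repository, and tag you push to are your own choice:<br><pre><code># Dockerfile sketch: ray-ml base plus the two pip installs we had been
# running at pod startup, baked in so new nodes can skip that work.
FROM public.ecr.aws/anyscale/ray-ml:2.33.0-py311
RUN pip install vllm==0.5.4 hf_transfer
</code></pre></div>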
style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">811s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fastereks.yaml&quot;">598s</a></td><td>26%</td><td>t3a.xlarge: 4 CPUs, 16GB</td><td>g6.4xlarge: 16 CPUs, 64GB, 1 L4 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml&quot;">1308s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fastereks.yaml&quot;">698s</a></td><td>47%</td><td>t3a.xlarge: 4 CPUs, 16GB</td><td>g6.4xlarge: 16 CPUs, 64GB, 1 L4 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 1: EKS RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">Reducing GKE LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on GKE, we chose the strategy described <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/image-streaming"><u>here</u></a> of Image Streaming from the GCP Artifact Registry with warmed multi-level caches.&nbsp; The Luna smart autoscaler supports GKE Image Streaming.&nbsp; As described below, we built a GCR Artifact Registry image for our workload container, enabled Image Streaming on our cluster, and configured Luna to allow the nodes it allocates to pull from Artifact Registry for Image Streaming.<br><br>The workload container we built consisted of <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR image.&nbsp; We stored it in the GCP Artifact Registry.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a> and enabled Image Streaming on it, and we configured Luna as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#gke"><u>here</u></a> to allow the nodes it allocates to pull from Artifact Registry.&nbsp; Note that we did an initial fetch of the image to prewarm GCP&rsquo;s multi-level caches, which is required to see the image load benefits.<br><br>Table 2 contains the GKE measurement results, with the improved time including the impact of both the reduced image load time and reduced model download time using hf_transfer.&nbsp; Again, the static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 48%.&nbsp; Unlike on EKS, we did not see the expected much higher impact of the improvements in the dynamic case; we speculate that this is because there was some serialization of the node setup even in the static case.&nbsp; Note that an instance type with slightly larger memory (15GB -&gt; 16GB) was allocated for the Ray Head in the Dynamic setup, to accommodate the [modest] additional resources needed to run the Ray Autoscaler; this was not needed in the EKS case, since instance type chosen for the 
<div><div id="558919444650692411" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">RayService Deployment Setup</th><th style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">751s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fastergke.yaml">399s</a></td><td>47%</td><td>n1-standard-4: 4 CPUs, 15GB</td><td>g2-standard-12: 12 CPUs, 48GB, 1 L4 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml">1029s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fastergke.yaml">536s</a></td><td>48%</td><td>c2d-highcpu-8: 8 CPUs, 16GB</td><td>g2-standard-12: 12 CPUs, 48GB, 1 L4 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 2: GKE RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">Reducing AKS LLM Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">To reduce image load time on AKS, we chose the preview-feature strategy described <a href="https://learn.microsoft.com/en-us/azure/aks/artifact-streaming"><u>here</u></a> of Artifact Streaming from the Azure Container Registry to AKS.&nbsp; The Luna smart autoscaler supports AKS Artifact Streaming.&nbsp; As described below, we built an ACR image, enabled Artifact Streaming on it, and configured Luna to enable Artifact Streaming on the nodes that it creates.<br><br>Our ACR image for the workload container consisted of <em>rayproject/ray-ml:2.33.0.914af0-py311</em> from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR and GCR images.&nbsp; As per the feature link, we registered the ArtifactStreamingPreview feature in our subscription and enabled Artifact Streaming on our ACR image.&nbsp; We set up our cluster as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#cluster-setup-summary"><u>here</u></a>, and configured Luna as described <a href="https://github.com/elotl/GenAI-infra-stack/blob/main/docs/install.md#aks"><u>here</u></a> to enable Artifact Streaming on the nodes that it creates.<br><br>Table 3 contains the measurement results on AKS, with the improved time including the impact of both the reduced image load time and reduced model download time using hf_transfer.&nbsp; Both static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 60%.&nbsp; As we had expected and had also observed on EKS, the dynamic time reduction was higher than the static reduction.&nbsp; As on EKS and GKE, we note that using
<em>hf_transfer</em> for model download without also using the custom ACR image is slower than the baseline, due to the runtime cost of the &ldquo;pip install hf_transfer&rdquo;.<br><br></div><div><div id="448104927305253831" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 15%;">RayService Deployment Setup</th><th style="width: 15%;">Baseline Time</th><th style="width: 15%;">Improved Time</th><th style="width: 15%;">Percent Improved</th><th style="width: 20%;">Ray Head Instance Type</th><th style="width: 20%;">Ray Worker Instance Type</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Static</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.yaml">885s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.fasteraks.yaml">472s</a></td><td>47%</td><td>Standard_B4as_v2: 4 CPUs, 16GB</td><td>Standard_NV36ads_A10_v5: 36 CPUs, 440 GB, 1 A10 GPU</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Dynamic</td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.yaml">1598s</a></td><td><a href="https://github.com/elotl/skyray/blob/main/luna-llm-serve/ray-service.llm.Phi-3-mini-4k-instruct.autoscale.fasteraks.yaml">647s</a></td><td>60%</td><td>Standard_B4as_v2: 4 CPUs, 16GB</td><td>Standard_NV36ads_A10_v5: 36 CPUs, 440 GB, 1 A10 GPU</td></tr></tbody></table></div></div><div class="paragraph">Table 3: AKS RayService Baseline and Improved Deployment Times</div><h2 class="wsite-content-title"><font size="5">SUMMARY</font><br></h2><div class="paragraph" style="text-align:left;">In this blog, we've shared our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters.&nbsp; The Luna smart cluster autoscaler support for each cloud&rsquo;s image fetch acceleration feature made our job easier.&nbsp; For our example LLM serving workload of a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace, deploy-time was cut roughly in half in most cases.&nbsp; For EKS, deploy-time was reduced by 26% to 47%; for GKE, deploy-time was reduced by 47% to 48%; and for AKS, deploy-time was reduced by 47% to 60%.<br><br>By the way, we note that our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts.&nbsp; The methods we present do not yield the very low latencies of hosted LLM serving scale-up such as that provided by <a href="https://www.anyscale.com/blog/autoscale-large-ai-models-faster"><u>the Anyscale product</u></a>, which uses a custom container image format and client to lower image pull times, a special library for fast image loading that streams tensors directly from cloud storage onto the GPU, and a direct interface between the Ray autoscaler and the system control plane for accelerated node allocation.&nbsp; Such hosted products can be a great choice, depending on your use case and budget.<br><br>Please reach out to share your experiences with these deploy-time reduction strategies for your scale-up scenarios.&nbsp; You can get the free trial version of Luna <a href="https://www.elotl.co/luna-free-trial.html"><u>here</u></a>.&nbsp; Thanks for reading our blog and we&rsquo;ll post
more material as/when we find more improvements!<br><br><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[EKS Auto Mode vs. Luna: Choosing the Right Scaling Strategy for Your Kubernetes Workloads]]></title><link><![CDATA[https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads]]></link><comments><![CDATA[https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads#comments]]></comments><pubDate>Tue, 14 Jan 2025 18:30:38 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads</guid><description><![CDATA[ Running Kubernetes on AWS using Elastic Kubernetes Service (EKS) offers a robust platform for container orchestration, but the challenge of managing the underlying compute infrastructure persists. This limitation can be addressed through various approaches, including the fully managed simplicity of EKS Auto Mode or the granular control offered by an intelligent Kubernetes cluster autoscaler like Luna. In this post, we&rsquo;ll explore the advantages of each, helping you choose the best scaling  [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/eks-auto-mode-vs-luna-choosing-the-right-scaling-strategy-for-your-kubernetes-workloads.png?1736880212" style="margin-top: 5px; margin-bottom: 0px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">Running Kubernetes on AWS using Elastic Kubernetes Service (EKS) offers a robust platform for container orchestration, but the challenge of managing the underlying compute infrastructure persists. <span style="color:rgb(0, 0, 0); font-weight:400">This limitation can be addressed through various approaches, including the fully managed simplicity of <strong>EKS Auto Mode</strong> or the granular control offered by an intelligent Kubernetes cluster autoscaler like <strong>Luna</strong>.</span> In this post, we&rsquo;ll explore the advantages of each, helping you choose the best scaling strategy for your workloads.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2>  <div class="paragraph" style="text-align:left;">EKS Auto Mode is a fully managed solution aimed at reducing operational complexity for Kubernetes clusters on AWS.
It automates essential tasks like node provisioning, scaling, and lifecycle management, offering an ideal entry point for teams new to EKS or operating simpler workloads.<br /><br />In contrast, compute autoscalers like Luna offer greater flexibility and customization, allowing you to optimize your infrastructure for the demands of complex and/or resource-intensive workloads.<br /><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Understanding the nuances of these approaches is key to selecting the optimal scaling solution for your Kubernetes deployments.<br></div>  <h2 class="wsite-content-title"><font size="5">EKS Auto Mode: The Allure of Simplicity</font><br></h2>  <div class="paragraph" style="text-align:left;">EKS Auto Mode shines in its simplicity. AWS takes on the heavy lifting of managing your worker nodes, handling everything from provisioning and scaling to OS patching and even instance type selection. This "swift lift" approach offers several key advantages:<ul><li><strong>Reduced Operational Burden:</strong> By automating core infrastructure management tasks, Auto Mode frees up your team to focus on application development and deployment.</li><li><strong>Simplified Security Posture:</strong> Auto Mode defaults to Bottlerocket, a purpose-built, security-focused container operating system. Bottlerocket's minimal attack surface, CIS Level 1 benchmark certification, and FIPS 140-3 compliance provide a strong foundation for secure container workloads.</li><li><strong>Streamlined Upgrades:</strong> Leveraging Karpenter under the hood, Auto Mode automates node refreshes and ensures consistent patching, minimizing security risks and maintaining cluster stability.</li><li><strong>Simplified Setup with Built-in Add-ons:</strong> Essential EKS add-ons, such as the EBS CSI driver for persistent storage and the ALB Ingress Controller for load balancing, are automatically deployed during cluster creation, further simplifying the setup process.</li></ul> However, this simplicity comes at a cost. Auto Mode's opinionated approach introduces several limitations:<ul><li><strong>Irreversible Activation</strong>: Once EKS Auto Mode is enabled on a cluster, it cannot be disabled. This irreversible change requires careful consideration before activation, as it commits the cluster to the Auto Mode management permanently.</li><li><strong>Limited Node Configuration Flexibility</strong>: EKS Auto Mode offers minimal control over node shapes and configurations. You cannot include or exclude specific instance sizes, or fine-tune the infrastructure to meet specialized workload requirements. This lack of flexibility means that Auto Mode's node provisioning is based on a predefined set of instance types selected by AWS.</li><li><strong>Limited Customization</strong>: EKS Auto Mode restricts customization at the node level. You are unable to modify kernel parameters, install custom system packages, or adjust kubelet settings. These limitations make it challenging to meet the requirements of workloads that depend on specific OS configurations or custom software installations.</li><li><strong>Spot Support</strong>:<br />While EKS Auto Mode simplifies operations, it does not leverage or support <strong>spot instances</strong> for cost savings, unlike some advanced autoscalers like Luna. 
This could result in higher operational costs for workloads where spot instances could be safely utilized.</li><li><strong>Bottlerocket Dependency:</strong> The reliance on Bottlerocket, while beneficial for security, prevents the use of custom Amazon Machine Images (AMIs), which might be necessary for specific software or compliance requirements.</li><li><strong>Potential for IP Address Exhaustion:</strong> Auto Mode utilizes prefix delegation, assigning /28 CIDR blocks to each node. In VPCs with limited IP address space, this can lead to IP exhaustion issues, preventing the creation of new nodes and halting cluster scaling altogether.</li><li><strong>Default Networking Overhead</strong>:<br />EKS Auto Mode relies on AWS-managed networking configurations, which can introduce inefficiencies in specific scenarios, such as cross-AZ traffic or high-latency workloads, due to default routing setups.</li><li><strong>Reduced Visibility:</strong> The automated nature of Auto Mode reduces direct visibility into the node provisioning and configuration processes, making detailed troubleshooting more reliant on AWS's logging and monitoring tools.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">When Does EKS Auto Mode Shine?</font><br></h2>  <div class="paragraph" style="text-align:left;">Auto Mode is ideal for:<ul><li><strong>Small, Simple Clusters</strong>: Perfect for teams running standard workloads without complex resource needs.</li><li><strong>New Users</strong>: A smooth on-ramp for Kubernetes beginners, focusing on applications without delving into infrastructure.</li><li><strong>Testing and Experimentation</strong>: Auto Mode's streamlined setup makes it ideal for quickly creating and tearing down temporary clusters for testing, prototyping, or experimentation.<br /></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Luna: Embracing Flexibility and Control</font><br></h2>  <div class="paragraph" style="text-align:left;">For teams managing larger or more complex clusters, Luna&rsquo;s flexibility and control offer significant advantages.</div>  <h2 class="wsite-content-title"><font size="5">What Does Luna Offer?</font><br></h2>  <div class="paragraph" style="text-align:left;">Luna provides a dynamic, customizable approach to autoscaling that empowers you to fine-tune every aspect of node management:<ul><li><strong>Highly Flexible Instance Selection:</strong> Luna dynamically selects appropriate node shapes based on workload requirements such as CPU, memory, architecture (including ARM), and other criteria. This flexibility ensures that the infrastructure is tailored to meet the unique demands of your applications.</li><li><strong>Spot Instance Support for Cost Optimization:</strong> Luna enables the use of spot instances, provisioning cost-effective nodes when desired and when capacity is available.
By incorporating spot instances and mixed instance types, Luna significantly reduces infrastructure costs while maintaining high availability.</li><li><strong>Granular Instance Control:</strong> Inclusion and exclusion lists allow you to define allowed and disallowed instance types/families, optimizing for cost, performance, or specific hardware requirements.</li><li><strong>Cost-Driven Instance Selection:</strong> Luna dynamically selects the least expensive, available instance shape that meets workload requirements, minimizing infrastructure spending.</li><li><strong>Hardware Specialization</strong>: Supports GPU acceleration and other specialized hardware for resource-intensive applications.</li><li><strong>Support for Custom AMIs:</strong> Luna allows you to choose a specific AMI or use your own custom AMI, enabling fine-grained control over the OS and installed software.</li><li><strong>Advanced Scheduling Capabilities:</strong> Features like node taints, tolerations, and node affinity allow precise control over pod placement; Luna provisions the appropriate nodes to support this placement as required.</li><li><strong>Serverless-like Experience:</strong> Luna automates much of the underlying node management, offering a simplified operational experience similar to EKS Auto Mode but with greater flexibility.<br /></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Key Benefits of Luna</font><br></h2>  <div class="paragraph" style="text-align:left;"><ul><li><strong>Unparalleled Flexibility</strong>: Ideal for environments requiring specific configurations, hardware accelerations, or software setups.</li><li><strong>Advanced Cost Optimization with Spot</strong>: Spot instance utilization can drastically reduce infrastructure costs compared to on-demand-only nodes.</li><li><strong>Scalable for Large Clusters</strong>: As clusters grow in complexity and size, Luna ensures scalability without sacrificing control.</li><li><strong>Enhanced Workload Support</strong>: Handles diverse and complex workloads better than Auto Mode, offering tailored solutions for every use case.</li><li><strong>Fine-Grained Control:</strong> If your workloads demand specific instance types, OS configurations, or hardware acceleration (like GPUs), an Intelligent Kubernetes Cluster Autoscaler such as Luna is essential (see the sketch after this list).</li><li><strong>Ease of Deployment, Configuration, and Upgrades: </strong>Compared to other autoscalers, Luna streamlines the deployment and configuration process for autoscaling within your EKS clusters. While it requires slightly more setup than EKS Auto Mode, it offers greater flexibility and customization with relatively low effort. Additionally, Luna supports smooth upgrades, ensuring new features and improvements can be rolled out with minimal disruption to cluster operations.<br /></li></ul></div>
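  <div class="paragraph" style="text-align:left;">As a flavor of that per-workload control, a pod template can opt into Luna management and constrain the instance types Luna will allocate for it.&nbsp; A minimal sketch; the label and annotation shown are the ones used in our Helix + Luna demo, the regexp value is illustrative, and the Luna docs cover the full set of options:<br /><pre><code># Pod-template fragment: opt the workload into Luna management and
# restrict node allocation to an instance-type family.
metadata:
  labels:
    elotl-luna: "true"
  annotations:
    node.elotl.co/instance-type-regexp: "g5.*"   # illustrative pattern
</code></pre></div>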
  <h2 class="wsite-content-title"><font size="5"><strong>Choosing the Right Approach</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">The decision between EKS Auto Mode and Luna boils down to your priorities and workload characteristics:<ul><li><strong>Consider Choosing EKS Auto Mode if</strong>:<ul><li>You&rsquo;re running small, straightforward clusters with minimal customization needs.</li><li>You&rsquo;re new to Kubernetes and want a streamlined experience.</li><li>Your team prioritizes ease of use over granular control.</li></ul></li><li><strong>Consider Choosing Luna if</strong>:<ul><li>You need precise control over infrastructure, including custom AMIs and hardware configurations.</li><li>Your workloads demand advanced scheduling, cost optimization, or specialized resources like GPUs.</li><li>You&rsquo;re managing large clusters with bursty workloads and/or diverse application requirements.<br /></li></ul></li></ul></div>  <h2 class="wsite-content-title"><font size="5"><strong>Conclusion</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">Kubernetes compute scaling within EKS requires choosing a solution that aligns with your operational priorities, workload complexity, and cost management goals. <strong>EKS Auto Mode</strong> simplifies Kubernetes management with automation and preconfigured settings, making it an excellent choice for smaller clusters, standard workloads, or teams looking for a low-maintenance entry point. Its ease of use allows you to focus on deploying applications without being bogged down by infrastructure details.<br /><br />On the other hand, an Intelligent Kubernetes Cluster Autoscaler like <strong>Luna</strong> offers the flexibility, control, and cost optimization needed for growing, complex, bursty, or resource-intensive deployments. Whether you're fine-tuning node configurations, optimizing for diverse workload requirements, or leveraging advanced features like spot instances, Luna provides the autoscaling necessary to efficiently scale clusters tailored to your unique needs and workloads.<br /><br />The choice isn&rsquo;t about one being inherently better than the other&mdash;it&rsquo;s about understanding your requirements. For teams prioritizing simplicity and rapid deployment, Auto Mode is a viable option. For those needing advanced scaling capabilities and greater customization, Luna&rsquo;s robust feature set provides unmatched value. By carefully evaluating these factors, you can adopt the solution that delivers the best results for your Kubernetes journey on AWS.<br /><br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><br /><strong>Disclaimer</strong>: The features and limitations of EKS Auto Mode as described in this blog are based on the author&rsquo;s understanding at the time of publication.
AWS may update or change these features over time, and readers are encouraged to consult the official AWS documentation for the most up-to-date information.<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Helix + Luna: Efficient GenAI for Serious People]]></title><link><![CDATA[https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people]]></link><comments><![CDATA[https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people#comments]]></comments><pubDate>Fri, 15 Nov 2024 22:41:55 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">https://www.elotl.co/blog/helix-luna-efficient-genai-for-serious-people</guid><description><![CDATA[Why Helix + Luna?  Helix enables companies to leverage LLMs while retaining complete control over data and infrastructure. By utilizing Helix, organizations can connect their data&mdash;either locally or through APIs&mdash;to powerful AI models without transferring sensitive information outside of their ecosystem. Helix&rsquo;s solution empowers companies to deploy open-source LLMs on their own resources, including cloud-based Kubernetes (K8s) clusters. This approach provides the scalabil [...] ]]></description><content:encoded><![CDATA[<div class="paragraph"><strong><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Why Helix + Luna?</font></span></span></strong></div>  <div class="paragraph"><span><a href="https://tryhelix.ai/"><span style="color:rgb(17, 85, 204)">Helix</span></a><span style="color:rgb(0, 0, 0)"> enables companies to leverage LLMs while retaining complete control over data and infrastructure. By utilizing Helix, organizations can connect their data&mdash;either locally or through APIs&mdash;to powerful AI models without transferring sensitive information outside of their ecosystem. Helix&rsquo;s solution empowers companies to deploy open-source LLMs on their own resources, including cloud-based Kubernetes (K8s) clusters. This approach provides the scalability and resilience of cloud infrastructure with the privacy and control of on-premises deployment. Designed to meet the needs of modern enterprises, Helix enables robust AI integration, whether for enhancing customer interactions, streamlining internal workflows, or extracting valuable insights from vast data sets.</span></span><br /><br /><span><a href="https://www.elotl.co/luna.html"><span style="color:rgb(17, 85, 204)">Elotl Luna</span></a><span style="color:rgb(0, 0, 0)"> is a smart Kubernetes cluster autoscaler that runs on the 4 major K8s cloud platforms, i.e., AWS EKS, GCP GKE, Azure AKS, and Oracle OKE.&nbsp; It adds and removes right-sized compute instances from cloud Kubernetes clusters as needed, thereby reducing operational complexity and preventing wasted spend.
Luna is ideally suited for deploying AI/ML platforms running bursty workloads that need special expensive resources such as GPUs.</span></span><br /><span><span style="color:rgb(0, 0, 0)">&nbsp;</span></span><br /><span><span style="color:rgb(0, 0, 0)">Combining Helix with Luna in a cloud Kubernetes cluster adds dynamic resource management to Helix, allowing compute instances to be allocated on demand to handle the Helix workload, and later deallocated when no longer needed.&nbsp; This flexible scaling improves efficiency and reduces costs, particularly important when expensive cloud GPU resources are used.<br /></span></span><br /></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph"><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Helix + Luna Demo</font></span></span></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">This video demonstrates the combination of Helix and Luna in action.<br /></span></span><br /></div>  <div class="wsite-youtube" style="margin-bottom:10px;margin-top:10px;"><div class="wsite-youtube-wrapper wsite-youtube-size-auto wsite-youtube-align-center"> <div class="wsite-youtube-container">  <iframe src="//www.youtube.com/embed/pm67IV5eo8U?wmode=opaque" frameborder="0" allowfullscreen></iframe> </div> </div></div>  <div class="paragraph"><br /><span><span style="color:rgb(0, 0, 0)">In this demo, Helix was installed on a GKE cluster initially composed of 3 </span><span style="color:rgb(0, 0, 0)">e2-medium</span><span style="color:rgb(0, 0, 0)"> CPU instances, to run Helix and Luna, and 1 </span><span style="color:rgb(0, 0, 0)">g2-standard-16</span><span style="color:rgb(0, 0, 0)"> L4 GPU instance with 150 GB disk, for the LLM model, using </span><a href="https://docs.helix.ml/helix/private-deployment/manual-install/gke/"><span style="color:rgb(17, 85, 204)">these instructions</span></a><span style="color:rgb(0, 0, 0)">.&nbsp; The </span><a href="https://www.elotl.co/luna-free-trial.html"><span style="color:rgb(17, 85, 204)">Luna free trial version</span></a><span style="color:rgb(0, 0, 0)"> was used, with its gcp.diskSizeGb option set to 150.&nbsp; After setup, the </span><span style="color:rgb(0, 0, 0)">my-helix-runner</span><span style="color:rgb(0, 0, 0)"> deployment was edited to set its replicas to 0 and its pod template to include the Luna management label </span><span style="color:rgb(0, 0, 0)">elotl-luna=true</span><span style="color:rgb(0, 0, 0)"> and instance type selector annotation </span><span style="color:rgb(0, 0, 0)">node.elotl.co/instance-type-regexp: g2-standard-16</span><span style="color:rgb(0, 0, 0)">.&nbsp; Next, the statically-allocated </span><span style="color:rgb(0, 0, 0)">g2-standard-16</span><span style="color:rgb(0, 0, 0)"> node was removed from the cluster, since Luna would handle allocating GPU nodes in response to scaling the Helix runner replicas.</span></span>
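 A minimal sketch of that edit as it lands in the Deployment spec follows (field placement per the standard Deployment schema):<br /><pre><code># my-helix-runner Deployment edit: scale to zero and let Luna
# allocate a right-sized GPU node on demand when replicas increase.
spec:
  replicas: 0
  template:
    metadata:
      labels:
        elotl-luna: "true"                                  # Luna management label
      annotations:
        node.elotl.co/instance-type-regexp: g2-standard-16  # instance type selector
</code></pre>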
<br /><span><span style="color:rgb(0, 0, 0)">Then the command </span><span style="color:rgb(0, 0, 0)">kubectl scale --replicas=1 deployment.apps/my-helix-runner</span><span style="color:rgb(0, 0, 0)"> was used to set the number of replicas to 1.&nbsp; In response, Luna added a new node to the K8s cluster.&nbsp; Note that any further changes in the Helix replicas count would trigger corresponding Luna node add or delete operations.</span></span><br /><br /></div>  <div class="paragraph"><span><span style="color:rgb(67, 67, 67); font-weight:400"><font size="6">Try Helix + Luna!</font></span></span></div>  <div class="paragraph"><span><span style="color:rgb(0, 0, 0)">We want you to benefit from the power of Helix to handle your GenAI workloads in your cloud K8s cluster along with the power of Luna to right-size your cluster&rsquo;s resources.&nbsp; We plan to hold a workshop on doing this in the near future.&nbsp; Please reach out to Tamao at tamao@helix.ml if you'd like to attend or if you&rsquo;d like to get started on this in the meantime.</span></span><br /><br /></div>  <div class="paragraph"><strong>Authors</strong>:</div>  <div class="paragraph">Anne Holler (Elotl), Chris Sterry (Helix), Luke Marsden (Helix)</div>]]></content:encoded></item><item><title><![CDATA[Mastering Kubernetes Autoscaling: How Luna Combines Bin-Packing and Bin-Selection for Optimal Cluster Scaling Efficiency]]></title><link><![CDATA[https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency]]></link><comments><![CDATA[https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency#comments]]></comments><pubDate>Thu, 03 Oct 2024 18:53:34 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency</guid><description><![CDATA[ In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to delve deeper. The real game-changer? Cluster Autoscalers. These tools dynamically adjust the size of your cluster, ensuring you meet workload demands without over-provisioning resources. But while many autoscalers focus solely on bin-packing, Luna takes it a step further with its innovative bin-selection feature, delivering an all-encompassing solution for  [...]
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/mastering-kubernetes-autoscaling-how-luna-combines-bin-packing-and-bin-selection-for-optimal-cluster-scaling-efficiency.png?1727982227" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">In the world of Kubernetes, understanding the basics of pods and nodes is important, but to truly optimize your infrastructure, you need to delve deeper. The real game-changer? <strong>Cluster Autoscalers</strong>. These tools dynamically adjust the size of your cluster, ensuring you meet workload demands without over-provisioning resources. But while many autoscalers focus solely on <strong>bin-packing</strong>, <strong>Luna</strong> takes it a step further with its innovative <strong>bin-selection</strong> feature, delivering an all-encompassing solution for workload management and cost efficiency.<br /><br />In this blog, we will explore both <strong>bin-packing</strong> and <strong>bin-selection</strong>, two essential strategies for Kubernetes autoscaling. By leveraging <strong>Luna</strong>, you can maximize efficiency, minimize waste, and keep costs under control, all while handling the complexities of varying workload sizes and resource requirements. Let&rsquo;s dive in!<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">What is Bin-Packing in Kubernetes?</font><br></h2>  <div class="paragraph" style="text-align:left;"><strong>Bin-packing</strong> is the default approach for optimizing pod placement in Kubernetes, maximizing resource utilization across nodes. The concept is simple: pack as many items (pods) into as few bins (nodes) as possible, maximizing resource utilization and minimizing the number of nodes required.<br /><br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">In Kubernetes, bin-packing refers to placing pods onto nodes in such a way that CPU, memory, and other resources are used efficiently. <strong>Luna</strong> excels at this by dynamically adjusting the number of provisioned nodes based on real-time resource demands. Rather than manually selecting specific node types, Luna allows you to configure bin-packing node requirements such as:<ul><li>binPackingNodeCpu</li><li>binPackingNodeMemory</li><li>binPackingNodeGPU</li><li>binPackingNodeTypeRegexp</li><li>binPackingNodePricing</li></ul> For example, if you set binPackingNodeCpu to 4 and binPackingNodeMemory to 8Gi, Luna could allocate a cost-effective node like c2d-highcpu-4 in GKE, optimizing for price and resource needs. Note that, depending on specific cluster and workload needs, deploying multiple instances of Luna can provide the flexibility to leverage different bin-packing node attributes.<br /><br />The strength of Luna lies in its precision and ease of use.
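<br /><br />As a sketch of such a configuration, written as a YAML settings fragment (the option names are the Luna bin-packing settings listed above; the values, and the exact configuration format for your install, are illustrative):<br /><pre><code># Illustrative bin-packing settings: pack pods onto 4-CPU / 8Gi nodes,
# constrained to a GKE instance family, preferring spot pricing.
binPackingNodeCpu: "4"
binPackingNodeMemory: "8Gi"
binPackingNodeGPU: "0"               # CPU-only bin-packing nodes
binPackingNodeTypeRegexp: "^c2d-.*"  # illustrative family restriction
binPackingNodePricing: "spot"        # illustrative
</code></pre>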
Once configured, Luna selects the best-fit node size based on your predefined settings, automatically choosing the most cost-effective node shape for your cloud provider. This eliminates the need to manually look up or specify the node shape, ensuring optimal resource use without overspending on nodes that are too small or too large.<br></div>  <h2 class="wsite-content-title"><font size="5">The Limitations of Bin-Packing-Only Approaches</font><br></h2>  <div class="paragraph" style="text-align:left;">While <strong>bin-packing</strong> is an essential technique for Kubernetes autoscaling, relying solely on this strategy introduces several limitations, particularly around node sizing. Let&rsquo;s break down two distinct issues that occur when using bin-packing in isolation:<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>1. The Risk of Overprovisioning Very Large Nodes</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">One major downside of bin-packing-only approaches is the potential to <strong>overprovision large nodes</strong>, which can lead to resource underutilization and unnecessary cost overhead. Here&rsquo;s how it can happen:<ul><li>When a cluster autoscaler scales up to handle an increased workload, it often provisions <strong>large nodes</strong> that accommodate multiple pods. This works well initially, but as pods terminate or complete their tasks, the large node can become <strong>underutilized</strong>, leaving unused CPU and memory capacity that you&rsquo;re still paying for.</li><li>For example, consider an autoscaler that provisions a relatively large node type, one large enough for hundreds of pods, such as an n2-standard-16 node on GKE. When bin-packing many small pods into such a node, the risk is that once some of those pods finish, you&rsquo;re left with a half-empty node. Although Kubernetes is efficient at packing new pods into the available space, you may still end up with <strong>idle resources</strong>&mdash;which translates to wasted cloud spend.</li><li>Worse still, if the workloads require different resource ratios (e.g., high memory, low CPU), the node may be constrained by one resource (like CPU) while leaving the other (like memory) underutilized.</li></ul> Moving pods from an oversized node to a smaller one can lead to significant disruption, as it requires evicting and rescheduling a large number of pods. This process can impact application performance and in rare cases lead to downtime. Ideally, you&rsquo;d want to avoid such disruptions by right-sizing nodes upfront, preventing the need for large-scale pod migrations.</div>  <h2 class="wsite-content-title"><font size="4"><strong>2. The Overhead of Many Small Nodes</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">On the flip side, <strong>bin-packing-only</strong> can also lead to the opposite problem: provisioning <strong>too many small nodes</strong>, which introduces its own set of challenges. This issue can occur when a small number of pods consistently enter a pending state over a period of time. A bin-packing-only autoscaler may react by provisioning additional small nodes to accommodate these pods. Over time, this behavior leads to an excessive number of small nodes, which introduces several issues:<ul><li><strong>Management Complexity</strong>: Operating a large number of small nodes, each hosting a few pods, can quickly lead to a complex management scenario.
Kubernetes DaemonSets, for example, need to run across every node in the cluster. As the number of nodes increases, so does the overhead associated with these system-level pods, consuming valuable resources that could otherwise be allocated to your workloads.</li><li><strong>IP Address Exhaustion</strong>: In environments with strict limits on IP addresses (such as VPCs or private cloud setups), provisioning many small nodes can lead to <strong>IP exhaustion</strong>. Each node requires its own IP address and in some cases must reserve a fixed block of IP addresses for the pods placed on it; as the node count grows, you might hit limits on available IPs, causing networking challenges.</li><li><strong>Higher Costs</strong>: Cloud providers often price nodes based on the instance type, and while it might seem cheaper to provision many small nodes, there are hidden costs associated with operating a large fleet. This includes costs for networking, persistent volumes, and the aforementioned overhead from system services.</li><li><strong>Resource Under-utilization and Workload Consolidation Challenges:</strong> In environments with spiky workloads and numerous small nodes, resource utilization can become inefficient. The Kubernetes scheduler distributes pods across available nodes, but fluctuating demand often leaves many small nodes underutilized. This makes it difficult to consolidate workloads effectively, leading to increased operational complexity and cost inefficiencies, as the cluster continues to maintain excess nodes during periods of low demand.</li><li><strong>Third-Party Tool Costs:</strong> Many monitoring and logging tools, such as Datadog and Prometheus, charge based on the number of nodes being monitored or where agents are deployed and running. With an increasing number of small nodes, these costs can rise significantly, as each node adds to the monitoring overhead, even if its resource usage remains minimal. This can lead to unexpectedly higher operational expenses.</li></ul></div>  <h2 class="wsite-content-title"><font size="5">Introducing Bin-Selection: The Underrated Power Feature</font><br></h2>  <div class="paragraph" style="text-align:left;">While <strong>bin-packing</strong> is widely used, <strong>bin-selection</strong> remains an underappreciated capability, and it&rsquo;s here that <strong>Luna</strong> truly shines. Unlike bin-packing, which focuses on optimizing the number of pods per node, <strong>bin-selection</strong> targets specific pod requirements, ensuring that each pod is placed on the most suitable node based on its unique needs.<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>What Exactly is Bin-Selection?</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">In simple terms, <strong>bin-selection</strong> ensures that certain pods get their own dedicated nodes. This is crucial for workloads that have high resource demands or special requirements&mdash;such as GPU-bound tasks, memory-intensive applications, or workloads that need to avoid noisy neighbors. It&rsquo;s a 1:1 placement strategy that guarantees optimal performance by avoiding resource contention.<br /><br />Luna&rsquo;s <strong>bin-selection</strong> feature provides flexibility that most other autoscalers lack. While conventional autoscalers focus exclusively on packing as many pods into nodes as possible, Luna allows for a more targeted approach.
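&nbsp; For instance, a resource-heavy pod spec like the following sketch would typically receive a right-sized node of its own (the name, image, and request sizes are illustrative):<br /><pre><code># Illustrative GPU-bound pod: with bin-selection, Luna provisions a
# dedicated node matched to these requests instead of bin-packing it.
apiVersion: v1
kind: Pod
metadata:
  name: llm-worker        # illustrative
  labels:
    elotl-luna: "true"    # Luna management label
spec:
  containers:
  - name: inference
    image: YOUR_REGISTRY/llm-serving:latest  # illustrative
    resources:
      requests:
        cpu: "12"
        memory: 48Gi
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
</code></pre>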
When certain pods exceed predefined thresholds for CPU, memory, or GPUs, bin-selection is triggered, and a dedicated node is provisioned to meet those specific resource requirements.<br></div>  <h2 class="wsite-content-title"><font size="4"><strong>Why Bin-Selection is Crucial for Kubernetes Workloads</strong></font><br></h2>  <div class="paragraph" style="text-align:left;">Relying solely on bin-packing can lead to several challenges, especially when dealing with large or specialized workloads. Some of the key issues, as highlighted above, include:<ol><li><strong>Overprovisioned Large Nodes</strong>: As mentioned earlier, when multiple pending pods are placed on a large node and then some pods terminate, you&rsquo;re left with an underutilized and expensive node.</li><li><strong>Resource Contention</strong>: Larger nodes can become bottlenecks, especially if multiple pods are competing for CPU, memory, or network resources.</li><li><strong>DaemonSet Overhead</strong>: Running many small nodes can create overhead from DaemonSets, which replicate pods across all nodes, wasting resources.</li></ol> <strong>Bin-selection</strong> solves these problems by ensuring that large, specialized pods aren&rsquo;t crammed into oversized or underutilized nodes. Instead, Luna provisions a dedicated node that matches the pod's exact resource requirements, avoiding the inefficiencies and risks associated with using one-size-fits-all nodes.<br></div>  <h2 class="wsite-content-title"><font size="5">Luna&rsquo;s Dual Mode: Harnessing Both Bin-Packing and Bin-Selection</font><br></h2>  <div class="paragraph" style="text-align:left;">What makes <strong>Luna</strong> unique is that it combines the best of both <strong>bin-packing</strong> and <strong>bin-selection</strong>. By supporting both strategies, Luna ensures that your workloads are managed with maximum efficiency and flexibility. Here&rsquo;s how it works:<ol><li><strong>Bin-Packing</strong>: When workloads fit within standard resource thresholds, Luna dynamically provisions the most cost-effective nodes based on the configured CPU and memory limits. This is ideal for handling typical workloads without overspending on unused capacity.</li><li><strong>Bin-Selection</strong>: For specialized workloads&mdash;such as GPU-bound tasks, or pods that exceed a specific CPU or memory threshold&mdash;Luna automatically switches to <strong>bin-selection</strong>, provisioning a dedicated node that perfectly matches the pod&rsquo;s needs. This ensures that high-demand pods get the resources they require without overloading other nodes or causing resource contention.</li></ol></div>  <h2 class="wsite-content-title"><font size="5">Conclusion: Optimizing Kubernetes Clusters with Luna</font><br></h2>  <div class="paragraph" style="text-align:left;">Kubernetes clusters are dynamic, and managing them effectively requires more than just basic autoscaling. With <strong>Luna</strong>, you get the best of both worlds: efficient resource utilization through <strong>bin-packing</strong>, and tailored node allocation through <strong>bin-selection</strong>. Whether you're dealing with standard workloads or high-demand applications, Luna ensures that your clusters are always optimized for performance and cost-effectiveness.<br /><br />By leveraging both bin-packing and bin-selection, Luna offers a smarter way to handle Kubernetes autoscaling, allowing you to scale your infrastructure with confidence. 
Embrace the future of Kubernetes node management with Luna, and ensure your workloads always have the resources they need&mdash;without breaking the bank.<br /><br />Discover the full potential of Luna's advanced features and capabilities by visiting our&nbsp;<a href="https://www.elotl.co/luna.html">Luna</a> product page. For hands-on instructions and detailed guidance, check out our <a href="https://docs.elotl.co/luna/intro/" target="_blank">documentation</a>. Ready to streamline your autoscaling? Start your <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> today and experience the unmatched efficiency and flexibility Luna brings to your cloud infrastructure.<br /><strong><br />Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /></div>]]></content:encoded></item><item><title><![CDATA[Luna Hot Node Mitigation: A Chill Pill to Cure Pod Performance Problems]]></title><link><![CDATA[https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems#comments]]></comments><pubDate>Wed, 21 Aug 2024 14:41:53 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems</guid><description><![CDATA[When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput.&nbsp; In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the  [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/luna-hot-node-mitigation-a-chill-pill-to-cure-pod-performance-problems.png?1724260079" style="margin-top: 0px; margin-bottom: 10px; margin-left: 10px; margin-right: 0px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput.&nbsp; In this blog, we present two Ray Machine Learning serving experiments that show the performance benefit of Luna&rsquo;s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna demonstrated a reduction in latency relative to the hot node runs: 40% in the first experiment and 70% in the second. It also increased throughput: 30% in the first and 40% in the second. 
We describe how the Luna smart cluster autoscaler with HNM addresses hot node performance issues by triggering the allocation and use of additional cluster resources.</div><hr style="width:100%;clear:both;visibility:hidden;"><h2 class="wsite-content-title"><font size="6">INTRODUCTION</font><br></h2><div class="paragraph" style="text-align:left;">A pod's CPU and memory resource requests express its minimum resource allocations.&nbsp; The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when the settings cannot be respected.&nbsp; Cloud cluster autoscalers look at these values on pending pods to determine the amount of resources to add to a cluster.<br><br>A pod configured with both CPU and memory requests, and with limits equal to those requests, is in QoS class <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed"><u>guaranteed</u></a>.&nbsp; A K8s cluster hosting any non-guaranteed pods runs the risk that some nodes in the cluster could become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can have performance problems.&nbsp; A bursting pod&rsquo;s attempts to use CPU above its CPU resource request can be throttled.&nbsp; And its attempts to use memory above its memory resource request can cause the pod to be killed.&nbsp; The K8s scheduler can worsen the situation, by continuing to schedule pods onto hot nodes.<br></div><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">The <a href="https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler"><u>Vertical Pod Autoscaler</u></a> (VPA) can recommend and optionally set a pod's CPU and memory resource requests and limits, based on <a href="https://github.com/kubernetes-sigs/metrics-server"><u>K8s metrics server</u></a> data, and hence can be used to avoid or address hot nodes.&nbsp; However, there are various trade-offs in using VPA, and by default VPA can reduce but does not eliminate hot node risks.&nbsp; Cloud cluster autoscalers obtain resources for pending pods and typically do not address the issue of hot nodes. 
With these concerns in mind, we introduced the Hot Node Mitigation (HNM) feature to the <a href="https://www.elotl.co/luna.html"><u>Luna smart cluster autoscaler</u></a>.&nbsp; With HNM enabled, Luna monitors its allocated nodes&rsquo; CPU and memory utilization using K8s metrics server data, and takes action to avoid or reduce high CPU or memory utilization.<br><br>In this blog, we describe the K8s hot node problem and discuss handling it via VPA and via Luna's HNM feature.&nbsp; We present two experiments showing how HNM reduces the impact of high utilization.&nbsp; The experiments involve ML workloads.&nbsp; Such workloads are challenging to handle since they are sensitive to the latency impact both of high utilization and of the cluster scaling operations intended to address high utilization.&nbsp; These experiments demonstrate that Luna HNM can be an effective chill pill to cure significant pod performance problems.<br></div><h2 class="wsite-content-title"><font size="6">HANDLING HIGH K8S NODE UTILIZATION</font><br></h2><div class="paragraph" style="text-align:left;">Pods that are not in the guaranteed QoS class introduce the risk that cluster nodes can become highly utilized.&nbsp; Determining how to set a pod&rsquo;s CPU and memory request and limit values so that it is in the guaranteed QoS class is challenging.&nbsp; The pod may be running a new workload, for which the resource needs have not yet been established.&nbsp; Or the pod's resource needs may evolve over time, as its use case changes.&nbsp; Or the pod's resource needs may have rare bursts, and configuring its resource requests to handle such peaks is inefficient in the normal case.<br></div><h2 class="wsite-content-title"><font size="5">Vertical Pod Autoscaler (VPA)</font><br></h2><div class="paragraph" style="text-align:left;">VPA can be used to recommend and optionally set a pod's CPU and memory resource requests and limits, based on the pod&rsquo;s metrics server data.&nbsp; By default, VPA-generated settings maintain the ratios between limits and requests that were specified in the initial container configuration.&nbsp; And if no limits were specified, VPA does not generate limits.&nbsp; Hence, by default, VPA reduces the likelihood of hot nodes when it makes pod request settings larger, but it does not increase the number of pods with guaranteed QoS or completely eliminate the risk of hot nodes.<br><br>There are various trade-offs in using VPA.&nbsp; When VPA is run in auto (default) or recreate mode, it can be disruptive, since it restarts pods if their VPA-recommended resource requests differ non-trivially, in either direction, from their current resource requests.&nbsp; And if VPA is run in initial or recommendation-only mode, it does not respond in real time to current conditions.&nbsp; Also, VPA is not tested in large clusters, according to its GitHub README, and users have reported scaling issues when VPA is handling large numbers of pods.&nbsp; Hence, while VPA can help mitigate high node utilization, it may introduce challenges such as unnecessary pod restarts, delayed responses to hot node events, or scalability issues in large-scale environments.<br></div><h2 class="wsite-content-title"><font size="5">Hot Node Mitigation (HNM)</font><br></h2><div class="paragraph" style="text-align:left;">Luna's HNM, by focusing on node hot spots when they occur, is intended to be responsive, disruptive only when appropriate, and scalable.&nbsp; In general, Luna allocates node resources for pods based on the pods' resource request settings.
For smaller pods, Luna allocates nodes on which multiple pods may be bin-packed.&nbsp; For larger pods or those with node configuration constraints, Luna allocates a node for each pod.&nbsp; If Luna-managed bin-packed pods have no resource request settings or if their request settings are lower than pod usage, Luna-allocated bin-packed nodes may become highly utilized, causing performance problems.<br><br>When Luna HNM is enabled (via the <em>manageHighUtilization.enabled</em> configuration option set to true), Luna uses K8s metrics server data to monitor the CPU and memory utilization of Luna-allocated bin-packed nodes, and takes action to avoid or reduce high CPU or memory utilization.&nbsp; CPU utilization is computed as usage over CPU capacity.&nbsp; Usage is the CPU core usage reported by metrics server, which averages it over the metrics server configured window period (e.g., 30s or more).&nbsp; Memory utilization is computed as the instantaneous working set memory over memory capacity.<br><br>The Luna HNM loop runs every <em>manageHighUtilization.loopPeriod</em>, and uses metrics server node and pod CPU and memory utilization data and configuration options to characterize busy nodes as yellow or red.&nbsp; Yellow nodes [CPU utilization &gt;= <em>manageHighUtilization.yellowCPU</em> (default 60) or memory utilization &gt;= <em>manageHighUtilization.yellowMemory</em> (default 65)] are considered warm.&nbsp; HNM taints warm nodes to prevent the K8s Scheduler from adding more pods onto them.&nbsp; This diminishes the likelihood of warm nodes transitioning to high CPU or memory utilization.&nbsp; Red nodes [CPU utilization &gt;= <em>manageHighUtilization.redCPU</em> (default 80) or memory utilization &gt;= <em>manageHighUtilization.redMemory</em> (default 85)] are considered hot.&nbsp; In addition to tainting them, HNM evicts the Luna-scheduled pod with the highest CPU or memory demand (based on pod metrics server data), subject to the same pod eviction restrictions applied for Luna node scale-down, which consider a number of factors, including respecting the do-not-evict annotation.&nbsp; This reduces high CPU or memory utilization.<br><br>Lightly-used nodes are considered green [CPU utilization &lt; <em>manageHighUtilization.greenCPU</em> (default 10) and memory utilization &lt; <em>manageHighUtilization.greenMemory</em> (default 15)].&nbsp; If green nodes have an HNM taint, it is removed, allowing nodes that are no longer warm or hot to again host additional pods.&nbsp; The large gap between the yellow and green thresholds is intended to avoid the node taint flapping on and off, with its associated pod placement churn.<br><br>Note that bin-packed pods which have no CPU and memory request settings (or whose CPU and memory request settings are inaccurate and very low) introduce the additional risk that the nodes they are running on appear to Luna to be under-utilized with respect to requests and hence candidates for scale-down.&nbsp; For this case, <em>scaleDown.binPackNodeUtilizationThreshold</em> can be set to 0.0, if desired, so Luna only scales down nodes running no Luna-managed pods.<br></div>
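<div class="paragraph" style="text-align:left;">Pulling these options together, the sketch below shows roughly how an HNM configuration might look as a YAML fragment of Luna&rsquo;s settings.&nbsp; The option names are the ones described above; the nesting shown and the <em>loopPeriod</em> value are illustrative, so check the Luna documentation for how these are set in your deployment:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Sketch of an HNM configuration using the options described above.
# Threshold values shown are the documented defaults.
manageHighUtilization:
  enabled: true      # turn on Hot Node Mitigation
  loopPeriod: 60s    # how often the HNM loop runs (illustrative value)
  yellowCPU: 60      # CPU utilization at or above 60% marks a node warm; HNM taints it
  yellowMemory: 65   # memory utilization at or above 65% marks a node warm
  redCPU: 80         # CPU utilization at or above 80% marks a node hot; HNM taints it and evicts a pod
  redMemory: 85      # memory utilization at or above 85% marks a node hot
  greenCPU: 10       # below 10% CPU (together with greenMemory), HNM removes its taint
  greenMemory: 15    # below 15% memory, combined with greenCPU above
scaleDown:
  binPackNodeUtilizationThreshold: 0.0   # only scale down nodes running no Luna-managed pods</code></pre></div></div></div></div>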
<h2 class="wsite-content-title"><font size="6">LUNA HOT NODE MITIGATION EXPERIMENTS</font><br></h2><div class="paragraph" style="text-align:left;">In this section, we present two experiments.&nbsp; One shows how Luna HNM can reduce the impact of high utilization via hot node pod eviction; the other shows how Luna HNM can avoid the impact of high utilization via warm node tainting.<br><br>For our experiments, we use the <a href="https://github.com/rakyll/hey"><u>hey</u></a> load generator to present queries to an online Machine Learning (ML) model that does text summarization.&nbsp; The ML serving workload runs on a <a href="https://docs.ray.io/en/latest/index.html"><u>Ray</u></a> cluster with CPU Ray worker(s), deployed by <a href="https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html"><u>KubeRay</u></a> on a Luna-enabled <a href="https://azure.microsoft.com/en-us/products/kubernetes-service"><u>AKS</u></a> cluster.&nbsp; The AKS cluster has 2 static nodes of type Standard_DS2_v2 (2 CPUs, 7G), on which Luna and KubeRay are deployed.&nbsp; We chose to deploy KubeRay onto statically-allocated compute rather than having Luna deploy KubeRay onto dynamically-allocated compute, since KubeRay&rsquo;s role is infrastructure-related and its resource needs are low.&nbsp; It is configured with guaranteed QoS set at CPU requests=limits=100m and memory requests=limits=512Mi.<br><br>When hot node pod eviction is used to reduce node utilization, Luna may need to allocate an additional node to handle the evicted pod.&nbsp; For the online ML model serving use case, which is latency-sensitive, adding that node needs to happen as quickly as possible, since the node scale-up time is on the critical path of addressing the serving performance problem caused by evicting a server worker.&nbsp; We first indicate how we reduced node scale-up time and then present the two experiments.<br></div><h2 class="wsite-content-title"><font size="5">Reducing Node Scale-up Time</font><br></h2><div class="paragraph" style="text-align:left;">Two key components of node scale-up time are node instance allocation time and image pull time.&nbsp; For the instance types in our experiments, we observed node instance allocation times of 1-2 minutes and pull times for the large <em>rayproject/ray-ml:2.9.0</em> image of &gt;5 minutes.<br><br>To hide the latency of node instance allocation, we used over-provisioning, as discussed <a href="https://aws.amazon.com/blogs/containers/eliminate-kubernetes-node-scaling-lag-with-pod-priority-and-over-provisioning/"><u>here</u></a> and <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler"><u>here</u></a>.&nbsp; We deployed a <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/overprovclass.yaml"><u>low-priority</u></a> single-pod <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/overprovdeploy.yaml"><u>deployment</u></a> configured to consume one bin-packing node, with the idea of keeping a single extra node available for bin-pack scale-up.&nbsp; The expense of this idle node was considered worthwhile for the example ML serving use case.<br><br>To hide the latency of pulling the large ray-ml image, we used <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/prepull.yaml"><u>this</u></a> daemonset to pre-pull the image into the cache on each K8s node.&nbsp; There are a number of general-purpose tools intended to address the image pull latency problem (e.g., <a href="https://github.com/senthilrch/kube-fledged"><u>kube-fledged</u></a>, <a href="https://github.com/dragonflyoss/Dragonfly2"><u>dragonfly</u></a>).&nbsp; We chose a custom daemonset for the simple purposes of our experiment.<br></div>
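<div class="paragraph" style="text-align:left;">The linked daemonset is the one we actually ran; for orientation, the general pattern looks roughly like the following sketch, in which an init container pulls the large image on every node and a tiny pause container then keeps the pod alive (names are placeholders):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Rough sketch of an image pre-pull DaemonSet; see prepull.yaml in
# elotl/skyray for the daemonset actually used in these experiments.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ray-ml-prepull
spec:
  selector:
    matchLabels:
      app: ray-ml-prepull
  template:
    metadata:
      labels:
        app: ray-ml-prepull
    spec:
      initContainers:
        - name: prepull
          image: rayproject/ray-ml:2.9.0    # the large image we want cached on every node
          command: ["true"]                 # exit immediately; the pull is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # tiny placeholder that keeps the pod alive</code></pre></div></div></div></div>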
class="wsite-content-title"><font size="5">HNM Hot Node Pod Eviction</font><br></h2><div class="paragraph">To show the impact of HNM hot node pod eviction, we compare load testing performance results on the <a href="https://github.com/ray-project/serve_config_examples/tree/master/text_summarizer"><u>RayService text summarizer</u></a> with 2 CPU Ray workers for 3 configurations:<br><br>1. <strong>Baseline</strong>: the 2 CPU workers are configured for guaranteed QoS and are placed by Luna on 2 separate bin-packing nodes.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/baseline.png?1724257778" alt="Picture" style="width:488;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">2. <strong>HNM-Disabled</strong>: the 2 CPU workers are configured for Burstable QoS (requests&lt;limits) and are placed by Luna on the same bin-packing node.&nbsp; HNM is not enabled to mitigate.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-disabled.png?1724257862" alt="Picture" style="width:520;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">3. <strong>HNM-Enabled</strong>: the 2 CPU workers are configured for Burstable QoS (requests&lt;limits) and are placed by Luna on the same bin-packing node.&nbsp; HNM is enabled with <em>redCPU</em> set to 70.&nbsp; HNM mitigates the node&rsquo;s high CPU utilization by evicting one of the Ray worker pods, which is restarted on another node.3.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-enabled.png?1724253246" alt="Picture" style="width:691;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">For all 3 configurations, the Ray head was annotated for placement on a bin-select node to simplify analysis of the bin-packing scenarios.&nbsp; We note that the Ray head uses guaranteed QoS and hence is not subject to performance impact from bursting.<br><br>Luna bin-packing node size is configured as 8 CPUs and 32Gi memory.&nbsp; The Standard_A8m_v2 instance type is used, since it is the least expensive node that satisfies this bin-pack node size.&nbsp; The Luna bin-select thresholds are set to 7 CPUs and 30G memory.&nbsp; The baseline RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.guar.yaml"><u>here</u></a> and the Burstable RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.besteffort.yaml"><u>here</u></a>.&nbsp; As you can see, the Baseline Ray workers have requests=limits of 4 CPUs and 16G memory and the Burstable Ray workers have requests of 3 CPUs and 12G memory, meaning that Baseline workers requests do not fit on the same bin-packing node and Burstable workers do.<br><br>With port-forwarding set in a separate terminal:<br></div><div><div id="417365045214930280" align="left" style="width: 100%; 
overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-shell" style="white-space: pre;">kubectl port-forward svc/text-summarizer-serve-svc 8000    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">and with the ML serving model input set as:<br></div><div><div id="955013191214540599" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">TEXT="It%20was%20the%20best%20of%20times,%20it%20was%20the%20worst%20of%20times,%20it%20was%20the%20age%20of%20wisdom,%20it%20was%20the%20age%20of%20foolishness,%20it%20was%20the%20epoch%20of%20belief"    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">the load test is run for 300 seconds using 10 threads and per-query time-out of 60 seconds as:<br></div><div><div id="460547993915323520" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 60 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The results of the experiment are given in Table 1.&nbsp; The <strong>HNM-Disabled</strong> row shows the substantial impact that CPU contention has on the average response time (40% worse) and number of responses generated (30% fewer) during the 300 seconds run relative to the baseline.&nbsp; The first <strong>HNM-Enabled</strong> row reflects that pod eviction and restart has a short-term negative impact relative to <strong>HNM-Disabled</strong>, since during the eviction/restart period, the full load is being handled by a single Ray worker.&nbsp; The second <strong>HNM-Enabled</strong> row shows that after that period, performance that matches the baseline is achieved.<br><br>Note that the performance impact of pod eviction/restart by HNM for high CPU utilization is worthwhile only if the load persists for a non-trivial period after the eviction/restart.&nbsp; The ROI of pod eviction is significantly improved if the memory is the highly utilized resource, since memory contention can lead to pod OOM termination.&nbsp; Hence, for memory contention, eviction and restart can be worthwhile for shorter load spike duration.<br></div><div><div id="532181176676017830" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 28%;"></th><th style="width: 18%;">Ave Response Time</th><th style="width: 18%;">Ave Response Time Ratio (smaller is better)</th><th style="width: 18%;">Num Responses</th><th style="width: 18%;">Num Responses Ratio (larger is better)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><b>Baseline</b></td><td>21.3s</td><td>1.0</td><td>145</td><td>1.0</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b></td><td>29.9s</td><td>1.4</td><td>104</td><td>0.7</td></tr><tr 
style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b> (first 300s load, includes eviction and restart)</td><td>33.2s</td><td>1.6</td><td>90</td><td>0.6</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b> (next 300s load, after restart)</td><td>20.5s</td><td>1.0</td><td>150</td><td>1.0</td></tr></tbody></table></div></div><div class="paragraph" style="text-align:left;">Table 1: Impact of HNM Hot Node Pod Eviction on Text Summarizer Model serving load<br><br>Let&rsquo;s next consider an example where hot node performance problems can be avoided if warm nodes are tainted to inhibit additional pod placement on them.<br></div><h2 class="wsite-content-title"><font size="5">HNM Warm Node Tainting</font><br></h2><div class="paragraph" style="text-align:left;">To show the impact of HNM warm node pod tainting, we have Luna place a <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/stresscpu.yaml"><u>CPU stress test</u></a> pod on a bin-packing node, to act as a noisy neighbor for our experiment.&nbsp; This pod has Best-Effort QoS, specifying neither requests nor limits, which means its requests values are treated as 0. We set the Luna option scaleDown.binPackNodeUtilizationThreshold to 0.0 to have Luna scale-down only consider nodes not running any Luna-managed pods, as previously discussed.<br><br>We compare load testing performance results on the RayService text summarizer with 1 CPU Ray worker (not 2 CPU Ray workers as in the previous experiment) for 2 configurations:<br><br>1. <strong>HNM-Enabled</strong>: the CPU worker is configured for Burstable QoS (requests&lt;limits) and is not placed on the same node as the CPU stress test pod, because HNM has tainted that node due to its utilization exceeding <em>yellowCPU</em>.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-enabled-2.png?1724258059" alt="Picture" style="width:485;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">2. 
<strong>HNM-Disabled</strong>: the CPU worker is configured for Burstable QoS (requests&lt;limits) and is placed on the same node as the CPU stress test pod, since that node appears to have plenty of resources from the standpoint of requests values.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0px;margin-right:0px;text-align:left"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/hnm-disabled-2.png?1724258171" alt="Picture" style="width:470;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">For both configurations, the Ray head is placed on a bin-select node, as in the previous experiment.<br><br>Luna bin-packing node size is configured as 8 CPUs and 32Gi memory; the Standard_A8m_v2 instance type is used.&nbsp; The Luna bin-select thresholds are set to 7 CPUs and 30G memory.&nbsp; The Burstable RayService configuration is <a href="https://github.com/elotl/skyray/blob/main/luna-hot-node-mitigation/ray-service.text-summarizer.cpu.besteffort1.yaml"><u>here</u></a>, with requests set to 2 CPUs and 12G memory and limits set to 4 CPUs and 16G memory.<br><br>The load test run uses the same TEXT input and port-forwarding as the previous experiment.&nbsp; The <strong>HNM-Enabled</strong> load test is run for 300 seconds using 10 threads and per-query time-out of 60 seconds as:<br></div><div><div id="638264133674353849" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 60 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The <strong>HNM-Disabled</strong> configuration could not complete any queries with the per-query time-out set to 60.&nbsp; It was re-run using the per-query time-out of 120 seconds as:<br></div><div><div id="703247932833406428" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;">hey -c 10 -z 300s -t 120 -m GET http://localhost:8000/summarize?text=${TEXT}    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">The results of the experiment are given in Table 2.&nbsp; For <strong>HNM-Enabled</strong>, the single Burstable ray CPU worker pod was not placed on the same node as the CPU stress test pod, since the node was tainted by HNM due to warm utilization.&nbsp; However, for <strong>HNM-Disabled</strong>, the single Burstable ray CPU worker was placed on the same node as the CPU stress pod and this noisy neighbor greatly impacted its performance. 
No successful responses were returned within the 60s timeout, and with the 120s timeout a significantly poorer average response time (70% higher) and number of responses (40% lower) were observed.<br></div><div><div id="708759901150483179" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 28%;"></th><th style="width: 18%;">Ave Response Time</th><th style="width: 18%;">Ave Response Time Ratio (smaller is better)</th><th style="width: 18%;">Num Responses</th><th style="width: 18%;">Num Responses Ratio (larger is better)</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Enabled</b>, 60s response timeout</td><td>38.8s</td><td>1.0</td><td>80</td><td>1.0</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b>, 60s response timeout</td><td>N/A</td><td>N/A</td><td>N/A</td><td>N/A</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td><b>HNM-Disabled</b>, 120s response timeout</td><td>67.2s</td><td>1.7</td><td>50</td><td>0.6</td></tr></tbody></table></div></div><div class="paragraph">Table 2: Impact of HNM Warm Node Tainting on Text Summarizer Model serving load</div><h2 class="wsite-content-title"><font size="6">POSSIBLE FUTURE WORK</font><br></h2><div class="paragraph" style="text-align:left;">While we&rsquo;ve presented experiments where the current HNM feature worked well, we note two limitations of the current feature.<ul><li>It is reactive, i.e., it does not take action until/unless node utilization is at or beyond the configured trigger points.&nbsp; This limitation helps with scaling and is fine if pod churn is low in the steady state and a good placement of evictable pods onto nodes is reached quickly relative to pod lifetimes.&nbsp; However, it may add non-trivial overhead if pod lifetimes are short and the pattern repeats.</li><li>It does not handle hot bin-packed nodes containing a single pod or hot bin-select nodes.&nbsp; This limitation assumes that a single relatively-large, poorly performing pod would warrant a manual resource requests update.&nbsp; It would be good to learn if this assumption holds.</li></ul>Both of these limitations can be addressed by evicting troublesome pods and increasing their CPU and memory request settings upon restart, possibly based on historical observations or on their configured limits (if any).&nbsp; We note that VPA can already be configured to set a pod's initial CPU and memory requests based on its previous metrics history, but that VPA's full operation can be disruptive or slow to react and can present scaling issues, as previously discussed.&nbsp; We can explore Luna optionally extending its high-utilization pod evictions to bin-select and single bin-packed pods and updating the requests of non-guaranteed pods it evicts upon their restart.<br></div><h2 class="wsite-content-title"><font size="6">CONCLUSION</font><br></h2><div class="paragraph" style="text-align:left;">We used Ray to run two ML online serving workloads. In both cases, Luna Hot Node Mitigation allowed us to significantly reduce the latency (by 40% and 70%) and increase the throughput (by 30% and 40%) relative to runs on hot nodes.<br><br>Take a look at your clusters; do you have non-guaranteed QoS pods and hot nodes? This could be slowing your workloads down. 
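One quick way to look, assuming the K8s metrics server is installed (the jsonpath filter below is just one way to slice it):<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-shell" style="white-space: pre;"># List pods that are not in the Guaranteed QoS class:
kubectl get pods -A -o jsonpath='{range .items[?(@.status.qosClass!="Guaranteed")]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.qosClass}{"\n"}{end}'

# Eyeball node utilization for hot spots (requires metrics server):
kubectl top nodes</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">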
Please feel free to download our <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> version and/or to <a href="mailto:info@elotl.co"><u>reach out</u></a> with any questions or comments.<br><br>We&rsquo;re dedicated to continually enhancing Luna and the Hot Node Mitigation feature.&nbsp; And to do so effectively, we need to hear from you!&nbsp; We welcome your feedback on how our current HNM solution works for you and whether our proposed improvements would be helpful in your setup. Please share your experiences and insights so we can tailor our solution to your needs.<br><br>Thanks for taking the time to read the blog and have a great day!<br><br></div><div class="paragraph"><strong>Author:</strong><br><span></span>Anne Holler (Chief Scientist, Elotl)<br><br><span></span></div>]]></content:encoded></item><item><title><![CDATA[Right Place, Right Size: Using an Autoscaler-Aware Multi-Cluster Kubernetes Fleet Manager for ML/AI Workloads]]></title><link><![CDATA[https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads]]></link><comments><![CDATA[https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads#comments]]></comments><pubDate>Thu, 11 Jul 2024 18:58:11 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/right-place-right-size-using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads</guid><description><![CDATA[IntroductionAre you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity.&nbsp; Nova is autoscaler-aware, detecting if workloa [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="5">Introduction</font><br></h2><span class="imgPusher" style="float:right;height:0px"></span><span style="display: table;width:209px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/using-an-autoscaler-aware-multi-cluster-kubernetes-fleet-manager-for-mlai-workloads.png?1720724660" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;">Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? 
A smart K8s fleet manager like the <a href="https://www.elotl.co/nova.html"><u>Elotl Nova policy-driven multi-cluster orchestrator</u></a> simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity.&nbsp; Nova is autoscaler-aware, detecting if workload clusters are running either the <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler"><u>K8s cluster autoscaler</u></a> or the <a href="https://www.elotl.co/luna.html"><u>Elotl Luna intelligent cluster autoscaler</u></a>.<br><br>In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up you can:<ol><li>Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.</li><li>Reduce your bill by directing experimental jobs to sunk-cost clusters.</li><li>Reduce your costs via policies that select GPUs with the desired price/performance.</li></ol></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">For clusters running in the cloud with a cluster autoscaler, the available cluster capacity is dynamic.&nbsp; Nova can schedule a workload on a cluster with dynamic capacity that satisfies the workload's placement policy, even if that target cluster does not currently have sufficient resources for the workload, since the autoscaler can provision the needed resources.&nbsp; When multiple clusters satisfy the workload's placement policy, Nova preferentially selects a target cluster with existing available cluster resources and otherwise selects an alternative target cluster running a cluster autoscaler.<br><br>Nova workloads placed using an available-capacity policy are <a href="https://www.sigarch.org/the-different-facets-of-large-scale-gpu-cluster-scheduling-for-ml-jobs/"><u>gang-scheduled</u></a>. This means that no single job within a workload will start running until all jobs in that workload can be executed simultaneously. 
Gang scheduling is crucial for ML/AI training jobs, as it ensures all components of a distributed training task begin processing in sync, maximizing efficiency and preventing data inconsistencies.<br><br>Additionally, Nova automatically adds Luna's default pod placement label to the workloads it schedules, which allows the workloads to be handled seamlessly on either Luna or non-Luna clusters.<br></div><h2 class="wsite-content-title"><font size="5">Applying Nova+Luna to Some Common ML/AI GPU Resource Management Scenarios</font><br></h2><div class="paragraph" style="text-align:left;">We consider the following common GPU resource management scenarios:<br><ul><li>Training production ML/AI models on GPUs</li><li>Training experimental ML/AI models on GPUs</li><li>Serving production vs test/dev ML/AI models on GPUs</li></ul>with respect to Nova management of two kinds of workload clusters:<ul><li>Clusters with statically-allocated resources, comprising on-premise or reserved cloud resources, with no cluster autoscaler running.</li><li>Clusters with dynamically-allocated resources, comprising on-demand cloud resources, running the Luna cluster autoscaler.</li></ul></div><h2 class="wsite-content-title"><font size="5">Scenario: Training Production ML/AI Models on GPUs</font><br></h2><h2 class="wsite-content-title"><font size="4">Overview</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of training production ML/AI models on GPUs, the desired behavior is "fill and spill".&nbsp; The workloads should be gang-scheduled on a statically-allocated cluster if they fit or on a dynamically-allocated cluster if they don't.&nbsp; The workloads' high value warrants the cost of on-demand cloud resources, if needed, and the latency to obtain those resources dynamically is not an issue for the training job use case.<br><br>For the Nova example setup, we configure cluster <em>static-cluster</em> with a set of statically-allocated GPU instances and cluster <em>dynamic-cluster</em> with Luna configured to allocate similar cloud GPU instances.&nbsp; Both clusters satisfy the Nova available-capacity placement policy. 
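As a rough sketch (the field names below are illustrative rather than authoritative; the actual policy we used is linked in the example runs that follow), a Nova SchedulePolicy of this flavor has roughly the following shape:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Schematic Nova SchedulePolicy -- the spec stanzas below are illustrative;
# see rayjobcapacitypolicy.yaml in elotl/skyray for the policy actually used.
apiVersion: policy.elotl.co/v1alpha1    # assumed API version
kind: SchedulePolicy
metadata:
  name: rayjob-capacity-policy
spec:
  resourceSelectors:        # which objects this policy places (illustrative labels)
    labelSelectors:
      - matchLabels:
          app: rayjob-train
  clusterSelector:          # candidate clusters; Nova prefers one with free capacity
    matchExpressions:       # and otherwise picks an autoscaled cluster
      - key: nova.elotl.co/cluster-name   # assumed label key
        operator: In
        values:
          - static-cluster
          - dynamic-cluster</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">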
Nova places training workloads on <em>static-cluster</em> first since the resources are immediately available.&nbsp; When a training workload arrives that does not fit on <em>static-cluster</em>, Nova places it on <em>dynamic-cluster</em> and Luna adds resources to accommodate the pending workload.<br></div><h2 class="wsite-content-title"><font size="4">Example Setup</font><br></h2><div class="paragraph" style="text-align:left;">The scripts and K8s yaml input used in the example are available at <a href="https://github.com/elotl/skyray"><u>elotl/skyray</u></a> on GitHub.&nbsp; The commands that follow expect a clone of that repo at the SKYRAY_PATH environment variable.<br><br>The example is run on EKS cloud K8s clusters.&nbsp; The Nova control plane, installed on an EKS cluster comprising 2 CPU nodes, manages the <em>static-cluster</em> and <em>dynamic-cluster</em> workload EKS clusters, initially populated as shown below.&nbsp; The Luna cluster autoscaler is installed on <em>dynamic-cluster</em>, to scale the cluster to match workload resource requests.&nbsp; <a href="https://docs.elotl.co/luna/intro/"><u>Luna</u></a> is <a href="https://github.com/loftyoutcome/k8s-rag-llm/blob/main/demo/llm.gpu.service/block_device_mapping.json"><u>configured</u></a> to allocate large EBS volumes, to handle the large instance types and storage needs of the example.&nbsp; Also, Luna bin-packing is disabled, since the example does not contain sets of small pods that benefit from scheduling on the same node.<br><br></div><div><div id="379979570468224674" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                            STATUS   ROLES    AGE     VERSION              INSTANCE-TYPE
ip-192-168-100-111.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-105-241.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-149-118.us-west-2.compute.internal   Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-181-48.us-west-2.compute.internal    Ready    &lt;none&gt;   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal     Ready    &lt;none&gt;   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-28.us-west-2.compute.internal     Ready    &lt;none&gt;   4h33m   v1.29.3-eks-ae9a62a   g4dn.2xlarge
ip-192-168-78-25.us-west-2.compute.internal     Ready    &lt;none&gt;   56d     v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal      Ready    &lt;none&gt;   28h     v1.29.3-eks-ae9a62a   t3a.2xlarge</code></pre></div></div></div></div><div class="paragraph">&nbsp;<br></div><div><div id="460800841704921038" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                          STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-94-42.us-west-2.compute.internal   Ready    &lt;none&gt;   56d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>KubeRay and its CRDs are deployed to the Nova control plane, along with a spread-duplicate policy for their placement.&nbsp; Nova places a copy of KubeRay and its CRDs on each workload cluster, meaning KubeRay is available on each cluster to handle any RayJobs, RayClusters, and RayServices placed by Nova on that cluster.<br><br></div><div><div id="511549106245922092" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f ${SKYRAY_PATH}/policies/krpolicy.yaml
kubectl apply -f ${SKYRAY_PATH}/policies/crdpolicy.yaml
${SKYRAY_PATH}/deploy-scripts/deploy-kuberay-operator.sh</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>After the KubeRay spread-duplicate placement, the Nova control plane output shown below reflects that there are 2 copies of the kuberay-operator, one on each workload cluster.<br><br></div><div><div id="554347554886795827" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces
NAMESPACE   NAME                       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
default     service/kuberay-operator   ClusterIP   10.96.241.6   &lt;none&gt;        8080/TCP   91s
default     service/kubernetes         ClusterIP   10.96.0.1     &lt;none&gt;        443/TCP    6m50s
NAMESPACE   NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
default     deployment.apps/kuberay-operator   2/1     2            2           91s</code></pre></div></div></div></div><div class="paragraph"><br>And Luna has started an additional node in <em>dynamic-cluster</em> to host KubeRay, as shown below.&nbsp; The KubeRay operator has modest resource requests (100m CPU, 512Mi memory) that can be handled by the inexpensive t3a.small instance type (2 CPUs, 2Gi memory).<br><br></div><div><div id="410059897450333503" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-182-75.us-west-2.compute.internal   Ready    &lt;none&gt;   55s   v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-94-42.us-west-2.compute.internal    Ready    &lt;none&gt;   56d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Runs</font></h2><div class="paragraph" style="text-align:left;">As a proxy for a production training workload, we use the Pytorch image train benchmark, run as a RayJob deployed on a Kubernetes cluster using KubeRay, adapted from the example <a href="https://docs.ray.io/en/master/cluster/kubernetes/examples/gpu-training-example.html"><u>here</u></a>.&nbsp; The RayJob's RayCluster is configured with a CPU head and 2 single-GPU workers.&nbsp; The configuration of the RayJob with its associated RayCluster is available <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-job.train.yaml"><u>here</u></a>.
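&nbsp; Abbreviated below is the shape of the worker group in that configuration; only the replica count and the per-worker GPU request matter for Nova&rsquo;s placement math, and the remaining details (group name, image, and so on) are elided or illustrative:<br></div><div><div align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Abbreviated worker-group stanza of the RayJob; see ray-job.train.yaml
# in elotl/skyray for the full manifest (names here are illustrative).
rayClusterSpec:
  workerGroupSpecs:
    - groupName: gpu-group
      replicas: 2                     # two workers, gang-scheduled together
      template:
        spec:
          containers:
            - name: ray-worker
              resources:
                requests:
                  nvidia.com/gpu: 1   # one GPU per worker; Nova sums these requests
                limits:
                  nvidia.com/gpu: 1</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;">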
A first copy of the RayJob is deployed to the Nova control plane in the rayjob1 namespace.&nbsp; Its placement uses a Nova <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobcapacitypolicy.yaml"><u>available-capacity policy</u></a>.&nbsp; Nova has native support for the RayCluster, RayJob, and RayService CRDs, and recognizes the resource requests in the podSpecs they contain.&nbsp; Hence, Nova is able to determine the computing resources needed for the pods comprising the RayJob.&nbsp; It chooses to place the RayJob and its RayCluster on <em>static-cluster</em>, since it has sufficient available capacity.<br><br></div><div><div id="906620940351870406" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE1=rayjob1
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE1} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob1 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob1 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created
export TARG_CLUSTER1=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE1} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER1}
static-cluster</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>Another copy of the RayJob is deployed to the Nova control plane in the rayjob2 namespace.&nbsp; Its placement again uses an available-capacity policy, and Nova again chooses to place the RayJob and its RayCluster on <em>static-cluster</em>, since it has sufficient available capacity for a second copy of the training job.<br><br></div><div><div id="594803022341381604" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE2=rayjob2
${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE2} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}
Spread-schedule namespace in which to run job
schedulepolicy.policy.elotl.co/ns-policy unchanged
namespace/rayjob2 created
Place training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-out
schedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob2 created
rayjob.ray.io/rayjob-train created
configmap/ray-job-code-train created
export TARG_CLUSTER2=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE2} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)
echo ${TARG_CLUSTER2}
static-cluster</code></pre></div></div></div></div>
style="text-align:left;"><br>A third copy of the RayJob is deployed to the Nova control plane in the rayjob3 namespace.&nbsp; Its placement again uses an available-capacity policy.&nbsp; This time Nova places the RayJob and its RayCluster on <em>dynamic-cluster</em>. Nova sees that <em>static-cluster</em> has insufficient remaining capacity for a third copy of the job and detects the Luna cluster autoscaler running on <em>dynamic-cluster</em>, which can obtain the needed resources.<br><br></div><div><div id="765443855930263967" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE3=rayjob3${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE3} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob3 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-capacity-policy-rayjob3 createdrayjob.ray.io/rayjob-train created                            configmap/ray-job-code-train created                          export TARG_CLUSTER3=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE3} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER3}  dynamic-cluster    </code></pre></div></div></div></div><div class="paragraph"><br>All 3 copies of the RayJob can be seen from the Nova control plane:<br><br></div><div><div id="968822638325173691" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">$ kubectl get all --all-namespaces. . 
.NAMESPACE NAME                        JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME   AGErayjob1   rayjob.ray.io/rayjob-train               Running             2024-07-01T22:13:02Z              9m11srayjob2   rayjob.ray.io/rayjob-train   RUNNING     Running             2024-07-01T22:12:07Z              4m55srayjob3   rayjob.ray.io/rayjob-train               Initializing        2024-07-01T22:16:28Z              34s    </code></pre></div></div></div></div><div class="paragraph"><br>And Luna scales up dynamic cluster accordingly:<br><br></div><div><div id="459694497837799206" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-typeNAME                                            STATUS   ROLES    AGE     VERSION               INSTANCE-TYPEip-192-168-161-254.us-west-2.compute.internal   Ready       4m47s   v1.29.3-eks-ae9a62a   t3a.2xlargeip-192-168-182-75.us-west-2.compute.internal    Ready       55m     v1.29.3-eks-ae9a62a   t3a.smallip-192-168-61-229.us-west-2.compute.internal    Ready       4m24s   v1.29.3-eks-ae9a62a   g4dn.2xlargeip-192-168-63-192.us-west-2.compute.internal    Ready       4m27s   v1.29.3-eks-ae9a62a   g4dn.2xlargeip-192-168-94-42.us-west-2.compute.internal     Ready       56d     v1.29.3-eks-ae9a62a   m5.large    </code></pre></div></div></div></div><div class="paragraph"><br>With all 3 jobs eventually running to completion<br><br></div><div><div id="843386304703614180" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces. . .NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME          END TIME               AGErayjob1   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:13:02Z 2024-07-01T22:26:30Z   22mrayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:12:07Z 2024-07-01T22:19:49Z   18mrayjob3   rayjob.ray.io/rayjob-train SUCCEEDED  Complete   2024-07-01T22:16:28Z 2024-07-01T22:30:27Z   14m    </code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph" style="text-align:left;">This example demonstrated how Nova, working with Luna, makes handling gang-scheduling and "fill and spill" for a multi-worker ML/AI KubeRay/RayJob training job easy via a simple <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobcapacitypolicy.yaml" title=""><u>available-capacity policy-based</u></a> approach. 
Nova and Luna can reduce the latency of your ML/AI workloads by scheduling them on available compute resources in a matter of seconds.<br></div><h2 class="wsite-content-title"><font size="5">Scenario: Training Experimental ML/AI Models on GPUs</font><br></h2><h2 class="wsite-content-title"><font size="4">Overview</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of training experimental ML/AI models on GPUs, the desired behavior is "fill, no spill".&nbsp; The workloads should be scheduled on a statically-allocated on-premise or reserved cluster set up for speculative training jobs, consisting of sunk-cost GPU instances.&nbsp; These training workloads have not yet proven to be high-value enough to warrant paying for any on-demand cloud resources.<br><br>For the Nova example setup, we configure cluster <em>static-cluster</em> with a set of statically-allocated GPU instances, which are intended to represent sunk-cost resources.&nbsp; The Nova <a href="https://github.com/elotl/skyray/blob/main/policies/rayjobstaticpolicy.yaml"><u>cluster-specific placement policy</u></a> is set to match only that cluster.&nbsp; Nova places all experimental training workloads on that cluster; any that cannot yet run remain pending there.<br></div>
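<div class="paragraph" style="text-align:left;">To give a feel for the policy's shape without opening the linked file, a minimal sketch of a cluster-specific placement policy appears below.&nbsp; It is illustrative only: the apiVersion and field names are assumptions patterned on the linked rayjobstaticpolicy.yaml, which remains the authoritative reference.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative sketch only: group/version and field names are assumptions
# patterned on the linked rayjobstaticpolicy.yaml, which is authoritative.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rayjob-static-policy
spec:
  # Pin matching workloads to the sunk-cost cluster by name.
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: static-cluster</code></pre></div></div>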
href="https://github.com/elotl/skyray/blob/main/policies/rayjobstaticpolicy.yaml"><u>specified-cluster policy</u></a>.<br><br></div><div><div id="232653626449816306" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE=rayjob2${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob2 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-static-policy unchangedrayjob.ray.io/rayjob-train createdconfigmap/ray-job-code-train createdexport TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER}static-cluster    </code></pre></div></div></div></div><div class="paragraph"><br>And a third copy of the RayJob is deployed, in the rayjob3 namespace, to the Nova control plane.&nbsp; Its placement again uses the same specified-cluster policy and is placed to <em>static-cluster</em> by Nova.<br><br></div><div><div id="176115082157906063" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">export RAYCLUSTER_NAMESPACE=rayjob3${SKYRAY_PATH}/deploy-scripts/deploy-rayjob-train-static.sh ${SKYRAY_PATH} ${RAYCLUSTER_NAMESPACE} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY}Spread-schedule namespace in which to run jobschedulepolicy.policy.elotl.co/ns-policy unchangednamespace/rayjob3 createdPlace training ray job on cluster w/sufficient capacity; job runs until terminal state or 600s time-outschedulepolicy.policy.elotl.co/rayjob-static-policy unchangedrayjob.ray.io/rayjob-train createdconfigmap/ray-job-code-train createdexport TARG_CLUSTER=$(kubectl get rayjob.ray.io/rayjob-train -n ${RAYCLUSTER_NAMESPACE} -L nova.elotl.co/target-cluster | awk {'print $NF'} | tail -1)echo ${TARG_CLUSTER}static-cluster    </code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>In this case, static-cluster does not have sufficient remaining resources to run the third copy of RayJob.&nbsp; Its unschedulable pods remain pending until capacity is freed up by the removal of previous job(s).<br><br></div><div><div id="362051010895578752" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl get all --all-namespaces. . 
NAMESPACE NAME                       JOB STATUS DEPL STATUS  START TIME             END TIME               AGE
rayjob1   rayjob.ray.io/rayjob-train SUCCEEDED  Complete     2024-07-02T13:49:21Z   2024-07-02T13:56:49Z   8m5s
rayjob2   rayjob.ray.io/rayjob-train RUNNING    Running      2024-07-02T13:53:16Z                          4m10s
rayjob3   rayjob.ray.io/rayjob-train            Initializing 2024-07-02T13:54:47Z                          2m39s
&hellip;
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   12m
rayjob3   rayjob.ray.io/rayjob-train RUNNING    Running     2024-07-02T13:54:47Z                          11m
&hellip;
kubectl get all --all-namespaces
. . .
NAMESPACE NAME                       JOB STATUS DEPL STATUS START TIME             END TIME               AGE
rayjob2   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:53:16Z   2024-07-02T14:00:49Z   14m
rayjob3   rayjob.ray.io/rayjob-train SUCCEEDED  Complete    2024-07-02T13:54:47Z   2024-07-02T14:07:53Z   13m</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph">This example shows how Nova makes handling "fill, no spill" easy via a simple policy-based approach. This simplifies the operation of the cluster and saves money by keeping the workload on the sunk-cost GPUs.</div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Scenario: Serving Production vs Test/Dev ML/AI Models on GPUs</font><br></h2><div class="paragraph" style="text-align:left;">For the scenario of serving production vs test/dev ML/AI models on GPUs, the desired behavior is "select the right cluster".&nbsp; The online production serving workloads should be placed on the statically-allocated cluster that is configured to satisfy the performance SLA for the maximum supported production load.&nbsp; Online serving workloads have low latency requirements, since they are typically on the critical path of some time-sensitive business application (e.g., predicting a ride-sharing ETA).&nbsp; Hence, dynamic allocation of these resources is not desirable.&nbsp; [And in practice, an additional statically-allocated geo-distinct production cluster would be used to increase availability.]&nbsp; The test/dev serving workloads are placed on the dynamically-allocated cluster configured for lower cost and performance.&nbsp; Providing low latency access for test/dev serving workloads is not a requirement.<br><br>For the Nova example setup, cluster <em>static-cluster</em> is configured with a statically-allocated, more powerful GPU instance, and cluster <em>dynamic-cluster</em> will allocate a less powerful (and cheaper) GPU instance as needed.&nbsp; We add the label <em>production</em> to the <em>static-cluster</em> Nova cluster and the label <em>development</em> to the <em>dynamic-cluster</em> Nova cluster.&nbsp; We note that use of these cluster labels adds a layer of indirection that facilitates adding additional clusters to a category, e.g., adding another production cluster in a different region.&nbsp; We use a Nova cluster selection policy that matches the cluster label appropriate to the workload class.<br><br></div>
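<div class="paragraph" style="text-align:left;">As with the previous scenario, a rough sketch conveys the shape of such a label-matching policy.&nbsp; The apiVersion, field names, and label key below are assumptions patterned on the linked production and development policies, which are the authoritative references.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative sketch of a label-matching placement policy for the
# production serving class.  apiVersion, field names, and the label key
# are assumptions; see the linked rayserviceproductionpolicy.yaml for
# the actual policy.
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: rayservice-production-policy
spec:
  # Select any Nova cluster carrying the production label.
  clusterSelector:
    matchLabels:
      class: production</code></pre></div></div>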
<h2 class="wsite-content-title"><font size="4">Example Setup</font><br></h2><div class="paragraph" style="text-align:left;">The initial setup for this example is the same as that used for the previous 2 examples, except with respect to the GPU instances in <em>static-cluster</em>.&nbsp; Previously, <em>static-cluster</em> had 4 g4dn.2xlarge instances, each with an NVIDIA T4 GPU.&nbsp; For this example, <em>static-cluster</em> has a single g5.xlarge instance, which has a higher-performing NVIDIA A10G GPU.<br></div><div><div id="444797055730232303" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                           STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-181-48.us-west-2.compute.internal   Ready    &lt;none&gt;   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge
ip-192-168-44-83.us-west-2.compute.internal    Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-72-62.us-west-2.compute.internal    Ready    &lt;none&gt;   95m   v1.29.3-eks-ae9a62a   g5.xlarge
ip-192-168-78-25.us-west-2.compute.internal    Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large
ip-192-168-8-48.us-west-2.compute.internal     Ready    &lt;none&gt;   9d    v1.29.3-eks-ae9a62a   t3a.2xlarge</code></pre></div></div></div></div><h2 class="wsite-content-title" style="text-align:left;"><font size="4">Example Runs</font><br></h2><div class="paragraph" style="text-align:left;">As a proxy for a production serving workload, we use the text summarizer model service, run as a RayService deployed on a Kubernetes cluster using KubeRay, adapted from the example <a href="https://docs.ray.io/en/master/cluster/kubernetes/examples/text-summarizer-rayservice.html"><u>here</u></a>. 
The RayService's RayCluster is configured with a CPU head and 1 single-GPU worker.&nbsp; The configuration of the RayService with its associated RayCluster is available <a href="https://github.com/elotl/skyray/blob/main/deploy-scripts/ray-service.text-summarizer.yaml"><u>here</u></a>.<br><br>The production namespace is spread-scheduled to all clusters.&nbsp; The RayService is deployed to the Nova control plane in the production namespace.&nbsp; Based on <a href="https://github.com/elotl/skyray/blob/main/policies/rayserviceproductionpolicy.yaml"><u>this</u></a> Nova label-matching policy, it is placed on <em>static-cluster</em>.<br><br></div><div><div id="395425363783073954" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">$ kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=production
rayservice.ray.io/text-summarizer created
kubectl --context=static-cluster get all -n production
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-ntcfh-head-tmnqr               1/1     Running   0          68m
pod/text-summarizer-raycluster-ntcfh-worker-gpu-group-wft6f   1/1     Running   0          68m
NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.6.157     &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   60m
service/text-summarizer-raycluster-ntcfh-head-svc   ClusterIP   10.100.197.135   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   68m
service/text-summarizer-serve-svc                   ClusterIP   10.100.205.162   &lt;none&gt;        8000/TCP                                        60m
NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-ntcfh   1                 1                   5      20G      1      ready    68m
NAME                                AGE
rayservice.ray.io/text-summarizer   68m</code></pre></div></div></div></div><div class="paragraph"><br>We validate its operation as follows:<br><br></div><div><div id="724830960111303087" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=static-cluster port-forward svc/text-summarizer-serve-svc 8000 -n production
Forwarding from 127.0.0.1:8000 -&gt; 8000
Forwarding from [::1]:8000 -&gt; 8000
Handling connection for 8000
python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.</code></pre></div></div></div></div>
<div class="paragraph" style="text-align:left;"><br>Next, the development namespace is spread-scheduled to all clusters.&nbsp; We deploy the same RayService to the development namespace.&nbsp; Based on <a href="https://github.com/elotl/skyray/blob/main/policies/rayservicedevelopmentpolicy.yaml"><u>this</u></a> Nova label-matching policy, it is placed on <em>dynamic-cluster</em>.<br><br></div><div><div id="988762577706441243" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl apply -f ${SKYRAY_PATH}/deploy-scripts/ray-service.text-summarizer.yaml --namespace=development
rayservice.ray.io/text-summarizer created
kubectl --context=dynamic-cluster get all -n development
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/text-summarizer-raycluster-2xnts-head-68bvm               1/1     Running   0          47m
pod/text-summarizer-raycluster-2xnts-worker-gpu-group-s8pbn   1/1     Running   0          47m
NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
service/text-summarizer-head-svc                    ClusterIP   10.100.45.127   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   37m
service/text-summarizer-raycluster-2xnts-head-svc   ClusterIP   10.100.46.227   &lt;none&gt;        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   47m
service/text-summarizer-serve-svc                   ClusterIP   10.100.209.7    &lt;none&gt;        8000/TCP                                        37m
NAME                                                 DESIRED WORKERS   AVAILABLE WORKERS   CPUS   MEMORY   GPUS   STATUS   AGE
raycluster.ray.io/text-summarizer-raycluster-2xnts   1                 1                   5      20G      1      ready    47m
NAME                                AGE
rayservice.ray.io/text-summarizer   47m</code></pre></div></div></div></div><div class="paragraph" style="text-align:left;"><br>In this case, Luna allocates a g4dn.xlarge, which includes an NVIDIA T4 GPU, rather than the g5.xlarge, which includes an NVIDIA A10G GPU.&nbsp; The us-east per-hour on-demand price for the g4dn.xlarge is lower than the 1-year reserved price for the g5.xlarge, so the g4dn.xlarge is a good choice for the development workload, which does not warrant the more powerful GPU.<br><br></div><div><div id="664446715612617069" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster get nodes -Lnode.kubernetes.io/instance-type
NAME                                            STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE
ip-192-168-164-97.us-west-2.compute.internal    Ready    &lt;none&gt;   8d    v1.29.3-eks-ae9a62a   t3a.small
ip-192-168-171-101.us-west-2.compute.internal   Ready    &lt;none&gt;   48m   v1.29.3-eks-ae9a62a   t3a.xlarge
ip-192-168-49-24.us-west-2.compute.internal     Ready    &lt;none&gt;   48m   v1.29.3-eks-ae9a62a   g4dn.xlarge
ip-192-168-94-42.us-west-2.compute.internal     Ready    &lt;none&gt;   64d   v1.29.3-eks-ae9a62a   m5.large</code></pre></div></div></div></div>
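<div class="paragraph" style="text-align:left;">As in the training examples, the target-cluster label can be queried to confirm each placement.&nbsp; Assuming Nova applies the same <em>nova.elotl.co/target-cluster</em> label to RayService objects as it does to RayJobs (an assumption; the earlier examples only show it on RayJobs), a check might look like this:<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Illustrative placement check; assumes the target-cluster label is also set on RayServices
kubectl get rayservice.ray.io/text-summarizer -n production -L nova.elotl.co/target-cluster
kubectl get rayservice.ray.io/text-summarizer -n development -L nova.elotl.co/target-cluster</code></pre></div></div>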
<div class="paragraph"><br>Again, we validate its operation as follows:<br><br></div><div><div id="447674408509903303" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block" style="overflow-x: auto;"><pre><code class="language-yaml" style="white-space: pre;">kubectl --context=dynamic-cluster port-forward svc/text-summarizer-serve-svc 8000 -n development
Forwarding from 127.0.0.1:8000 -&gt; 8000
Forwarding from [::1]:8000 -&gt; 8000
Handling connection for 8000
python text_summarizer_req.py
Paris is the capital and most populous city of France. It has an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre of the French capital.</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Example Summary</font><br></h2><div class="paragraph" style="text-align:left;">This example shows how Nova makes handling "select the right cluster" for classes of workloads easy via a simple policy-based approach. By using a Nova policy to select the performance/price ratio that matches each workload, Nova and Luna can reduce your cloud GPU bill while meeting your workloads' requirements.<br></div><h2 class="wsite-content-title"><font size="5">Conclusion</font><br></h2><div class="paragraph">We've shown how the Nova multi-cluster fleet manager, using its cloud autoscaler-aware feature with Luna, can achieve desired "right place, right size" outcomes for three common ML/AI GPU resource management scenarios: "fill and spill" for GPU production ML/AI model training, "fill, no spill" for GPU experimental ML/AI model training, and "select the right cluster" for GPU production vs test/dev ML/AI model serving.<br><br>Nova and Luna can:<ol><li>Reduce the latency of critical ML/AI workloads by scheduling on available GPU compute.</li><li>Reduce your bill by directing experimental jobs to sunk-cost clusters.</li><li>Reduce your costs via policies that select GPUs with the desired price/performance.</li></ol><br>And we note that Nova supports a variety of scheduling policies and has been applied to diverse domains, including managing LLM+RAG deployments, multi-cloud disaster recovery, cloud-agnostic gitops, and K8s cluster upgrades.<br><br>If you'd like to try <a href="https://www.elotl.co/nova.html">Nova</a> and <a href="https://www.elotl.co/luna.html">Luna</a> for your workloads, please download our free trial versions: <a href="https://www.elotl.co/nova-free-trial.html">Nova</a>, <a href="https://www.elotl.co/luna-free-trial.html">Luna</a>.<br><br></div><div class="paragraph"><strong>Author:</strong><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler]]></title><link><![CDATA[https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler]]></link><comments><![CDATA[https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler#comments]]></comments><pubDate>Tue, 25 Jun 2024 18:00:16 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[GPU 
Time-slicing]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><guid isPermaLink="false">https://www.elotl.co/blog/using-nvidia-gpu-time-slicing-in-cloud-kubernetes-clusters-with-the-luna-smart-cluster-autoscaler</guid><description><![CDATA[IntroductionKubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default.&nbsp; With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use.&nbsp; For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.Using NVIDIA GPU ti [...] ]]></description><content:encoded><![CDATA[<h2 class="wsite-content-title"><font size="6">Introduction</font><br></h2><span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:246px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/gpu-with-blue-and-orange.png?1719345106" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default.&nbsp; With <a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html">NVIDIA GPU time-slicing</a>, GPUs can be shared among K8s workloads by interleaving their GPU use.&nbsp; For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.<br><br>Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration <strong>can significantly reduce costs</strong>. 
A time-slice aware &ldquo;smart&rdquo; CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits.&nbsp; Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.<br><br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">In this blog, we describe how to use cluster NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters with Luna, a smart CA that supports GPU time-slicing.&nbsp; We provide examples demonstrating the advantages of using Luna with NVIDIA GPU time-slicing.<br></div><h2 class="wsite-content-title"><font size="6">Configuring NVIDIA GPU Time-slicing on Cloud K8s</font><br></h2><div class="paragraph" style="text-align:left;"><a href="https://www.elotl.co/luna.html"><u>Luna</u></a> is a smart CA that provides the option <em>nvidiaGPUTimeSlices</em> to indicate the NVIDIA GPU slices value used by GPUs in the K8s cluster.&nbsp; When the option is set to N greater than 1, Luna treats the GPUs in cloud instances as being N copies of themselves with respect to resource allocation and scheduling.&nbsp; Luna supports AKS, EKS, OKE, and GKE cloud K8s clusters.<br><br>On AKS, EKS, and OKE, NVIDIA GPU time-slicing is configured so that it is transparent to the cluster control plane and to GPU workloads running on the cluster.&nbsp; Appendix A describes how NVIDIA GPU time-slicing can be enabled for all GPUs in the cluster via helm deployment of the <a href="https://github.com/NVIDIA/k8s-device-plugin"><u>nvidia-device-plugin</u></a>, with an associated configmap specifying the number of slices.&nbsp; GPU workloads specify their desired GPU count as usual via the <em>nvidia.com/gpu</em> resource limit and are allocated GPU slices for each GPU they request.<br><br>On <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/timesharing-gpus"><u>GKE, NVIDIA GPU time-slicing</u></a> is visible to the cluster control plane.&nbsp; Time-slicing is specified at the node pool level, with the GPU slice count set as <em>clients-per-gpu</em>.&nbsp; Luna handles the node pool setting when <em>nvidiaGPUTimeSlices</em> is greater than 1.&nbsp; On GKE, time-slicing is also visible to GPU workloads themselves: GPU workloads running on GKE time-sliced GPUs must include <a href="https://cloud.google.com/kubernetes-engine/docs/how-to/timesharing-gpus#deploy"><u>nodeSelectors</u></a> indicating that the workload can use time-shared GPUs and specifying the max <em>clients-per-gpu</em> value allowed.&nbsp; Such workloads are limited to an <em>nvidia.com/gpu</em> resource limit value of 1.<br></div>
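<div class="paragraph" style="text-align:left;">To make the workload-side difference concrete, below is a minimal sketch of the GPU request in a pod template for each style.&nbsp; It condenses the full deployment specs in Appendix B; the container name and image are placeholders, and the sketch is illustrative rather than copy-paste ready.<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># On AKS/EKS/OKE, time-slicing is transparent: request GPUs as usual.
spec:
  containers:
    - name: gpu-app                  # placeholder name
      image: my-gpu-image:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1          # allocated a GPU slice per GPU requested

# On GKE, time-shared workloads must opt in via nodeSelectors (see Appendix B.4)
# and are limited to nvidia.com/gpu: 1.
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
    - name: gpu-app                  # placeholder name
      image: my-gpu-image:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1</code></pre></div></div>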
<h2 class="wsite-content-title"><font size="6">Luna Benefits for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">We&rsquo;ve mentioned that running the Luna smart CA, configured to be aware of the GPU time-slices setting, <strong>reduces expenses as well as quota and capacity limit risks</strong>, by avoiding initial over-allocation of instances and by optimizing instance choice.&nbsp; Let&rsquo;s look at these two areas.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Luna Avoiding Instance Over-allocation for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">With respect to initial over-allocation of instances, a CA that is not aware of the GPU time-slices setting of N will initially allocate Nx more nodes than needed.&nbsp; For example, to place 4 1-GPU workloads, a CA that doesn&rsquo;t know time-slices=2 could allocate 2 2-GPU nodes, when 1 2-GPU node can provide 4 slices.&nbsp; Note that this initial over-allocation may unnecessarily hit instance quota or capacity limits.&nbsp; If the CA can subsequently consolidate the workloads and scale in the over-allocated node(s), the expense associated with this issue can be limited.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="5">Luna Optimizing Instance Choice for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">With respect to optimizing instance choice, <strong>we observe that for many clouds, the cost of GPU instances increases non-linearly with the instance&rsquo;s GPU count</strong>.&nbsp; For example, in the AWS us-west region using Luna&rsquo;s current price list, a g4dn.xlarge with 1 T4 GPU is $0.526/hr, while a g4dn.12xlarge with 4 T4 GPUs is $3.912/hr; the latter is ~7.4x more costly for only 4x more T4 GPUs.&nbsp; Hence, allocating the instance GPU count in light of the time-slices setting can yield significant ongoing savings by choosing instances with fewer GPUs.&nbsp; And our experience is that instances with fewer GPUs tend to have higher quotas and more cloud capacity.<br><br>The benefit of optimizing instance choice can be substantial.&nbsp; In the next section, we present EKS, AKS, and OKE examples to illustrate.&nbsp; And we include a GKE example to show how a smart CA facilitates use of control-plane-aware NVIDIA GPU time-slicing.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Examples: Luna Optimizing Instance Choice for GPU Time-Slicing</font><br></h2><div class="paragraph" style="text-align:left;">For our examples, we set NVIDIA GPU time-slices to 2.&nbsp; We consider small 1-GPU workloads that can run together on a single NVIDIA GPU node with time-slices=2.&nbsp; We configure Luna to create bin-packing nodes with 2 GPUs (via setting Luna option <em>binPackingNodeGPU</em>=2).&nbsp; And we configure Luna to bin-pack 2 1-GPU workloads onto the same node (via setting <em>binSelectPodGPUThreshold</em>=2).<br><br>For each of the 4 clouds supported by Luna, we consider the example of launching 2 small 1-GPU workloads.&nbsp; We examine the benefits of setting Luna&rsquo;s <em>nvidiaGPUTimeSlices</em> option to 2.<br><br></div><h2 class="wsite-content-title"><font size="5">EKS</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an EKS cluster with Luna, we use the deployment spec in Appendix B.1.&nbsp; The EKS cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the us-east region and the prices we give are from Luna&rsquo;s current price list.<br><br>When Luna is run without knowledge of the GPU time-slice setting (i.e., <em>nvidiaGPUTimeSlices</em> is set to the default of 1), it allocates a <em>g3.8xlarge</em> instance, which at $2.28/hr is the lowest price 2-GPU instance that meets the desired resource requirements for bin-packing.&nbsp; However, g3* instances have M60 GPUs, which were designed for graphics-intensive workloads and are not well-suited for ML tasks.&nbsp; Setting <em>binPackingNodeTypeRegexp</em> to ^([^g]|g($|[^3])).*$ to exclude g3 instances, Luna allocates a <em>g4dn.12xlarge</em>, which at $3.912/hr is 
the next lowest price multi-GPU instance, with 4 T4s.&nbsp; [We note that the default EBS size is insufficient for <em>g4dn.12xlarge</em> instances and the Luna option <em>aws.blockDeviceMappings</em> needs to be <a href="https://github.com/loftyoutcome/k8s-rag-llm/blob/main/demo/llm.gpu.service/block_device_mapping.json"><u>set</u></a> to allocate a larger EBS size.]<br><br>Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. When Luna is run with <em>nvidiaGPUTimeSlices=</em>2, it allocates a <em>g4dn.xlarge</em>, which is AWS&rsquo; least expensive 1-GPU instance type.&nbsp; At $0.526/hr, it is much cheaper than the previous 2 alternatives, with respect to both instance and per-slice price.&nbsp; This data is summarized in Table 1.<br></div><div><div id="530483268288104733" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">EKS</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1</td><td>g3.8xlarge</td><td>M60</td><td>2</td><td>$2.280/hr</td><td>$0.570/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1 and g3 instances excluded</td><td>g4dn.12xlarge</td><td>T4</td><td>4</td><td>$3.912/hr</td><td>$0.489/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>g4dn.xlarge</td><td>T4</td><td>1</td><td>$0.526/hr</td><td>$0.263/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 1: EKS w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 1 vs 2</div><h2 class="wsite-content-title"><font size="5">AKS</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an AKS cluster with Luna, we use the deployment spec in Appendix B.2.&nbsp; The AKS cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the east us region and the prices we give were recently fetched by Luna.<br><br>When Luna is run without knowledge of the GPU time-slice setting (i.e., <em>nvidiaGPUTimeSlices</em> is set to the default of 1), it allocates a <em>Standard_NC64as_T4_v3</em> instance, which at $4.352/hr is the lowest price multi-GPU instance that meets the desired resource requirements for bin-packing, comprising 4 T4 GPUs.<br><br>Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. 
When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates a <em>Standard_NC4as_T4_v3</em>, which at $0.526/hr is much cheaper than the <em>Standard_NC64as_T4_v3</em>, in terms of both instance and per-slice price.&nbsp; This data is summarized in Table 2.<br></div><div><div id="123753179810703844" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">AKS</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option (default) nvidiaGPUTimeSlices=1</td><td>Standard_NC64as_T4_v3</td><td>T4</td><td>4</td><td>$4.352/hr</td><td>$0.544/hr</td></tr><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>Standard_NC4as_T4_v3</td><td>T4</td><td>1</td><td>$0.526/hr</td><td>$0.263/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 2: AKS w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 1 vs 2</div><h2 class="wsite-content-title"><font size="5">OKE</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in an OKE cluster with Luna, we use the deployment spec in Appendix B.3.&nbsp; The OKE cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the US East region and the prices we give are from Luna&rsquo;s current price list.<br><br>When Luna is run without knowledge of the GPU time-slice setting, it fails to allocate any instance, because our account currently has no quota to run multi-GPU instances (and a quota increase request has been outstanding for an extended period).<br><br>When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates a <em>VM.GPU2.1</em>, which is $1.275/hr.&nbsp; In this case, the quota issue prevented the scenario from running at all without Luna configured to respect the time-slices setting.&nbsp; This data is summarized in Table 3.<br></div><div><div id="514567480305755560" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">OKE</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>VM.GPU2.1</td><td>P100</td><td>1</td><td>$1.275/hr</td><td>$0.6375/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 3: OKE w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 2<br></div><h2 class="wsite-content-title"><font size="5">GKE</font><br></h2><div class="paragraph" style="text-align:left;">For our example of deploying 2 small 1-GPU workloads in a GKE cluster with Luna, we use the deployment spec in Appendix B.4.&nbsp; The GKE cluster is configured with GPU time-slices set to 2.&nbsp; It is located in the us-central1 region and the prices we give are from Luna&rsquo;s current price list.<br><br>On GKE, NVIDIA time-slices cannot be enabled without setting Luna&rsquo;s <em>nvidiaGPUTimeSlices</em> option accordingly, since Luna needs to configure time-slicing in the node pool appropriately.<br><br>When Luna is run with <em>nvidiaGPUTimeSlices</em> set to 2, it allocates an 
<em>n1-standard-4</em> node with 1 T4 GPU, which is $0.540/hr.&nbsp; In this case, Luna is required to enable NVIDIA GPU time-slicing on dynamically-allocated nodes.&nbsp; This data is summarized in Table 4.<br></div><div><div id="259747746951091691" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><table style="width: 100%;"><thead><tr style="background-color: #e0e0e0; height: 30px;"><th style="width: 40%;">GKE</th><th>Instance Type</th><th>GPU Type</th><th>GPU Count</th><th>Instance Price</th><th>Price per Slice</th></tr></thead><tbody><tr style="background-color: #f8f8f8; height: 25px;"><td>Luna option nvidiaGPUTimeSlices=2</td><td>n1-standard-4</td><td>T4</td><td>1</td><td>$0.540/hr</td><td>$0.270/hr</td></tr></tbody></table></div></div><div class="paragraph">Table 4: GKE w/NVIDIA GPU time-slices=2, Luna option <em>nvidiaGPUTimeSlices</em> set to 2</div><h2 class="wsite-content-title"><font size="6">Conclusion</font><br></h2><div class="paragraph" style="text-align:left;">For cloud K8s clusters running non-demanding non-production GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs.&nbsp; In this blog, we&rsquo;ve explained how to set up NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters.&nbsp; We&rsquo;ve discussed the benefits of using the Luna smart CA with the time-slices setting, which include avoiding initial over-allocation of instances and optimizing instance choice.&nbsp; With respect to optimizing instance choice, we found that <strong>Luna instance choice halved the price per GPU slice on EKS and AKS</strong>. On OKE, we showed that Luna instance choice avoided hitting our current quota limits.&nbsp; And on GKE, we demonstrated how Luna facilitated CA dynamic node allocation interoperation with NVIDIA GPU time-slicing.<br><br>Want to see how effortlessly you can manage GPU time-slicing with <a href="https://www.elotl.co/luna.html">Luna</a>? Try Luna today with our <a href="https://www.elotl.co/luna-free-trial.html">free trial</a> and experience the enhanced efficiency and flexibility it brings to your cloud environments.<br></div><h2 class="wsite-content-title"><font size="6">Future Work</font><br></h2><div class="paragraph" style="text-align:left;">GPU time-slicing is supported across NVIDIA GPU models, and provides flexible sharing levels.&nbsp; However, the technique does not enforce memory and fault isolation and targets non-production workloads.&nbsp; Recent NVIDIA GPUs support MIG (Multi-Instance GPU) sharing, which partitions each GPU into smaller, predefined instances, with memory and fault isolation enforced by the hardware.&nbsp; Luna support for NVIDIA MIG in Cloud K8s clusters is an area for future work, depending on customer interest in MIG allocation for their workloads.<br></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Appendix A: Configuring NVIDIA GPU time-slicing in a K8s cluster</font><br></h2><div><div id="292860854573329584" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># This is for use on EKS, AKS, and OKE.  
Delete any existing NVIDIA daemonset installation
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

# Create file nvidia-device-plugin.yaml ConfigMap w/timeslice gpu replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: ${GPU_SLICE_COUNT}

# Set environment variable to desired replica count, e.g., 2
export GPU_SLICE_COUNT=2

# Deploy ConfigMap from file
envsubst &lt; nvidia-device-plugin.yaml | kubectl apply -f -

# Install/Upgrade NVIDIA driver using helm with ConfigMap specified
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Use on AKS and EKS
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version v0.15.0 --set config.name=nvidia-device-plugin --force --set gfd.enabled=true

# Use on OKE, which taints GPU nodes w/{effect: NoSchedule; key: nvidia.com/gpu; operator: Exists}
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version v0.15.0 --set config.name=nvidia-device-plugin --force --set gfd.enabled=true --set-json='nfd.worker.tolerations=[{"operator":"Exists"}]'

# Once driver is running, K8s sees each NVIDIA gpu as GPU_SLICE_COUNT replicas
kubectl describe node ip-192-168-48-69.us-west-2.compute.internal
&hellip;
Allocatable: &hellip;
  nvidia.com/gpu:     2</code></pre></div></div></div></div><h2 class="wsite-content-title" style="text-align:left;"><font size="6">Appendix B: Deployment of 2 pods, each requesting 1 GPU</font><br></h2><h2 class="wsite-content-title"><font size="5">B.1 EKS</font><br></h2><div><div id="498740429818191016" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: dcgmproftester12
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.2 AKS</font><br></h2><div><div id="198519308207429155" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: 
gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: dcgmproftester12
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester12 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.3 OKE</font><br></h2><div><div id="943329793100407857" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          command: ["/bin/sh", "-c"]
          args:
            - while true; do ./vectorAdd; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">B.4 GKE</font><br></h2><div><div id="404207195917475897" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;"># Define deployment comprising 2 pods, each pod requesting 1 gpu
# Luna options must include placeNodeSelector=true
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-replicas-gpu
  labels:
    app: gpu-replicas-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-replicas-gpu
  template:
    metadata:
      labels:
        app: gpu-replicas-gpu
        elotl-luna: "true"
    spec:
      nodeSelector:
        cloud.google.com/gke-gpu-sharing-strategy: "time-sharing"
        cloud.google.com/gke-max-shared-clients-per-gpu: "2"
      containers:
        - name: dcgmproftester11
          image: nvcr.io/nvidia/cloud-native/dcgm:3.3.0-1-ubuntu22.04
          command: ["/bin/sh", "-c"]
          args:
            - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 30; sleep 30; done
          resources:
            requests:
              cpu: "1"
              memory: "2G"
            limits:
              nvidia.com/gpu: 1
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]</code></pre></div></div></div></div>
<h2 class="wsite-content-title"><font size="6">References</font><br></h2><div class="paragraph"><span style="color:#000000; font-weight:400">Selected KubeCon talks</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">Unlocking the Full Potential of GPUs for AI Workloads on Kubernetes - Kevin Klues, NVIDIA;</span> <a href="https://www.youtube.com/watch?v=1QfShSQLsbs"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=1QfShSQLsbs</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2023NA</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">Using DRA for maximum flexibility in GPU scheduling, emerging K8s technology</span></li></ul></li><li style="color:#000000"><span style="color:#000000; font-weight:400">Efficient Access to Shared GPU Resources: Mechanisms and Use Cases - Diogo Guerra &amp; Diana Gaponcic;</span> <a href="https://www.youtube.com/watch?v=jkcEQE9C338"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=jkcEQE9C338</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2023EU</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">CERN experience with GPU time-sharing and MIG</span></li></ul></li><li style="color:#000000"><span style="color:#000000; font-weight:400">Improving GPU Utilization using Kubernetes - Maulin Patel &amp; Pradeep Venkatachalam, Google;</span> <a href="https://www.youtube.com/watch?v=X876kr-LkPA"><span style="color:#1155cc; font-weight:400">https://www.youtube.com/watch?v=X876kr-LkPA</span></a><span style="color:#000000; font-weight:400">;&nbsp; KubeCon2022EU</span><ul><li style="color:#000000"><span style="color:#000000; font-weight:400">GKE average GPU utilization is 25% and getting worse; discusses time-sharing and MIG</span></li></ul></li></ul><br><span style="color:#000000; font-weight:400">Selected Blogs</span><ul><li style="color:#000000"><a href="https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/"><span style="color:#1155cc; font-weight:400">https://aws.amazon.com/blogs/containers/gpu-sharing-on-amazon-eks-with-nvidia-time-slicing-and-accelerated-ec2-instances/</span></a></li><li style="color:#000000"><a href="https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/"><span style="color:#1155cc; font-weight:400">https://aws.amazon.com/blogs/containers/maximizing-gpu-utilization-with-nvidias-multi-instance-gpu-mig-on-amazon-eks-running-more-pods-per-gpu-for-enhanced-performance/</span></a></li></ul><br><br><strong>Author:</strong><br><br>Anne Holler (Chief Scientist, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[How to run the OpenTelemetry collector as a Kubernetes sidecar]]></title><link><![CDATA[https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar]]></link><comments><![CDATA[https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar#comments]]></comments><pubDate>Wed, 12 Jun 2024 17:49:47 
GMT</pubDate><category><![CDATA[Luna]]></category><category><![CDATA[Troubleshooting]]></category><guid isPermaLink="false">https://www.elotl.co/blog/how-to-run-the-opentelemetry-collector-as-a-kubernetes-sidecar</guid><description><![CDATA[At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna gets deployed on customers' clusters and helps scale up and down compute resources to optimize cost.Luna operates in environments where direct access isn’t always available. To overcome the problem of diagnosis and performance monitoring we have introduced the option for customers to securely send their Luna logs and metrics to our advanced log storage appliance. This empowers us to enhance our support capabilitie [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/opentelemetry-stacked-color.png?1718219527" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="display:block;">At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna gets deployed on customers' clusters and helps scale compute resources up and down to optimize cost.<br><br>Luna operates in environments where direct access isn&rsquo;t always available. To make diagnosis and performance monitoring possible in those environments, we have introduced the option for customers to securely send their Luna logs and metrics to our log storage appliance. This empowers us to enhance our support capabilities and provide even more effective assistance to our customers.<br><br><a href="https://opentelemetry.io/">OpenTelemetry</a> is fast becoming the standard for collecting metrics and logs in Kubernetes environments. We opted to run the OpenTelemetry collector as a sidecar for the <a href="https://www.elotl.co/luna.html">Luna cluster autoscaler</a>. It gathers and sends the logs from a single pod, so running it as a <a href="https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/">sidecar</a> was a perfect match.<br></div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="5">Sidecar for a single pod</font><br></h2><div class="paragraph">A sidecar is a container that runs alongside the main container in the same pod. In our case, the Luna autoscaler writes logs to files in the /logs directory. To read these logs, we needed to share that directory between the main container and the sidecar.<br>With Kubernetes, the <a href="https://opentelemetry.io/docs/collector/" title="https://opentelemetry.io/docs/collector/">OpenTelemetry collector</a> can be deployed as a daemonset, as a deployment, or as a sidecar. 
While the daemonset and deployment set-ups are well-documented in the <a href="https://opentelemetry.io/docs/collector/installation/#kubernetes" title="https://opentelemetry.io/docs/collector/installation/#kubernetes">official documentation</a>, the sidecar set-up is documented using the <a href="https://opentelemetry.io/docs/kubernetes/operator/" title="https://opentelemetry.io/docs/kubernetes/operator/">OpenTelemetry operator</a>. As of the writing of this blog post, the documentation does not cover deploying the collector as a sidecar for a single pod.<br></div><h2 class="wsite-content-title"><font size="5">Add the sidecar container</font><br></h2><div class="paragraph">For this post, the container we want to scrape the logs from will be named <em>my-pod</em>. First, add the volume and mount it within the main container, which is part of a deployment:<br></div><div><div id="880990958536351986" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: my-pod
          ...
          volumeMounts:
            - name: logs
              mountPath: /logs
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi</code></pre></div></div></div></div><div class="paragraph">Next, add the OpenTelemetry collector container. Note that we use the <em>otel/opentelemetry-collector-contrib</em> image because it supports reading local directories, unlike the default <em>otel/opentelemetry-collector</em> image:<br></div><div><div id="210723534929510977" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        - name: my-pod
          ...
          volumeMounts:
            - name: logs
              mountPath: /logs
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:latest
          volumeMounts:
            - name: logs
              mountPath: /logs</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">Configuring the OpenTelemetry Collector</font><br></h2><div class="paragraph">With the two containers running and sharing a mounted directory, we need to configure the collector to:<ol><li>Gather logs from the /logs directory</li><li>Process these logs and add metadata</li><li>Upload the logs to our log storage service</li></ol>We'll add a ConfigMap and expose it to the collector by mounting it.<br><br></div><div><div id="790658465674832539" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: ""</code></pre></div></div></div></div>
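<div class="paragraph">Once the deployment is applied, it can be useful to confirm that the sidecar itself came up cleanly. Assuming the deployment is named <em>my-deployment</em> (a placeholder), the logs of each container can be checked separately:<br></div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;"># Check the collector sidecar's own output (deployment name is a placeholder)
kubectl logs deployment/my-deployment -c opentelemetry-collector
# And the main container's logs, for comparison
kubectl logs deployment/my-deployment -c my-pod</code></pre></div></div>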
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: ""
</code></pre></div></div></div></div><div class="paragraph">The collector configuration consists of four sections:<ol><li>receivers: Specifies where the logs should be read from</li><li>processors: Adds metadata to the logs</li><li>exporters: Sets the endpoint for our cloud storage service</li><li>service: Combines the parameters from the previous sections<br></li></ol></div><h2 class="wsite-content-title"><font size="4">Receivers</font><br></h2><div class="paragraph">For <em>receivers</em>, we use the <em>filelog/app</em> receiver to read data from the <em>/logs</em> directory:<br></div><div><div id="393602726315023434" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  receivers:
    filelog/app:
      include: [ /logs/* ]
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Processors</font><br></h2><div class="paragraph">For <em>processors</em>, we use the <em>batch</em> and <em>resource</em> processors. The <em>resource</em> processor allows adding keys with desired metadata to each log via the <em>attributes</em> subsection:<br></div><div><div id="564130348992469822" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  processors:
    batch:
      timeout: 10s
    resource:
      attributes:
        - key: my-metadata-key
          value: my-metadata-value
          action: insert
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Exporters</font><br></h2><div class="paragraph">For <em>exporters</em>, we use the <em>otlp</em> (OpenTelemetry Protocol) exporter to send the logs to our cloud storage service:<br></div><div><div id="930561862725507735" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  exporters:
    otlp:
      endpoint: "my.cloud.storage.hostname"
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Service</font><br></h2><div class="paragraph">Finally, for <em>service</em>, we combine all the predefined sections into a logical pipeline:<br></div><div><div id="408717056410369569" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">collector.yaml: |
  ...
  service:
    pipelines:
      logs:
        receivers: [filelog/app]
        processors: [batch, resource]
        exporters: [otlp]
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="4">Full ConfigMap</font><br></h2><div class="paragraph">Here's the complete ConfigMap:</div><div><div id="166197026218210177" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: |
    receivers:
      filelog/app:
        include: [ /logs/* ]
    processors:
      batch:
        timeout: 10s
      resource:
        attributes:
          - key: my-metadata-key
            value: my-metadata-value
            action: insert
    exporters:
      otlp:
        endpoint: "my.cloud.storage.hostname"
    service:
      pipelines:
        logs:
          receivers: [filelog/app]
          processors: [batch, resource]
          exporters: [otlp]
</code></pre></div></div></div></div><div class="paragraph">With the ConfigMap ready, we pass it to the collector via the <em>--config=/conf/collector.yaml</em> argument; the /conf volume mount exposes the ConfigMap <em>opentelemetry-collector-config</em> as a YAML file:<br></div><div><div id="795969296319590157" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">...
    spec:
      containers:
        ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - --config=/conf/collector.yaml
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
</code></pre></div></div></div></div><h2 class="wsite-content-title"><font size="5">The full listing</font><br></h2><div class="paragraph">The final deployment snippet will look like this:</div><div><div id="698620353372295976" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml" data-testid="renderer-code-block"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
...
spec:
  template:
    spec:
      containers:
        - name: my-pod
          volumeMounts:
            - name: logs
              mountPath: /logs
          ...
        - name: opentelemetry-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args:
            - --config=/conf/collector.yaml
          volumeMounts:
            - name: logs
              mountPath: /logs
            - name: opentelemetry-config
              mountPath: /conf
      volumes:
        - name: logs
          emptyDir:
            sizeLimit: 500Mi
        - name: opentelemetry-config
          configMap:
            name: opentelemetry-collector-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: opentelemetry-collector-config
data:
  collector.yaml: |
    receivers:
      filelog/app:
        include: [ /logs/* ]
    processors:
      batch:
        timeout: 10s
      resource:
        attributes:
          - key: my-metadata-key
            value: my-metadata-value
            action: insert
    exporters:
      otlp:
        endpoint: "my.cloud.storage.hostname"
    service:
      pipelines:
        logs:
          receivers: [filelog/app]
          processors: [batch, resource]
          exporters: [otlp]
</code></pre></div></div></div></div><div class="paragraph">With this set-up in place, we can send our logs to our log storage appliance and help our customers more effectively when they ask for support.</div><div class="paragraph"><strong><br>Author:</strong><br>Henry Precheur (Senior Staff Engineer, Elotl)<br><br></div>]]></content:encoded></item><item><title><![CDATA[Unleashing the Power of ARM: Elevating Your Kubernetes Workloads with ARM Nodes]]></title><link><![CDATA[https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes]]></link><comments><![CDATA[https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes#comments]]></comments><pubDate>Mon, 29 Apr 2024 12:55:12 GMT</pubDate><category><![CDATA[ARM]]></category><category><![CDATA[Autoscaling]]></category><guid isPermaLink="false">https://www.elotl.co/blog/unleashing-the-power-of-arm-elevating-your-kubernetes-workloads-with-arm-nodes</guid><description><![CDATA[ The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile device domain. This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and scenarios where x86-64 architectures may perform better and thus continue to be a better fit. Lastly we'll describe a seamless way to add AR [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:267px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/arm64-green-powerful.jpg?1714396460" style="margin-top: 5px; margin-bottom: 10px; margin-left: 10px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;">The recent surge in ARM processor capabilities has sparked a wave of exploration beyond their traditional mobile device domain. 
This blog explains why you may want to consider using ARM nodes for your Kubernetes workloads. We'll identify potential benefits of leveraging ARM nodes for containerized deployments while acknowledging the inherent trade-offs and scenarios where x86-64 architectures may perform better and thus continue to be a better fit. Lastly we'll describe a seamless way to add ARM nodes to your Kubernetes clusters.<br /><br />In this blog, for the sake of clarity and brevity, I will be using the term 'ARM' to refer to ARM64 or ARM 64-bit processors, while 'x86' or 'x86-64' will be used interchangeably to denote Intel or AMD 64-bit processors.<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="4">What Kubernetes Workloads Tend To Be Ideal for ARM Processors?</font><br></h2>  <h2 class="wsite-content-title"><font size="3">Inference-heavy tasks:</font></h2>  <div class="paragraph" style="text-align:left;">While the computations involved in Deep Learning training typically require GPUs for acceptable performance, DL inference is less computationally intense.&nbsp; Tasks that apply pre-trained models for DL regression or classification can benefit from ARM's power/performance relative to GPU or x86-64 systems. We presented data on running inference on ARM64 in our <a href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf" target="_blank">Scale20x talk</a>.<br></div>  <div>  <!--BLOG_SUMMARY_END--></div>  <h2 class="wsite-content-title"><font size="3">Web Servers and Microservices:</font><br></h2>  <div class="paragraph" style="text-align:left;">Web servers and microservices typically involve handling numerous concurrent connections and lightweight compute tasks. They can perform acceptably on ARM64-based Kubernetes deployments, serving web content, handling API requests, and running containerized microservices efficiently. With the increasing availability of ARM-based cloud instances, organizations can optimize their web hosting infrastructure for cost-effectiveness and scalability by leveraging ARM architecture.<br></div>  <h2 class="wsite-content-title"><font size="3">Development and Testing Environments:</font><br></h2>  <div class="paragraph" style="text-align:left;">Development and testing environments, where workloads are often smaller in scale and resource requirements are modest, may be excellent candidates for ARM-based Kubernetes deployments. Developers can leverage ARM-based instances to build, test, and deploy applications in an environment that closely resembles production while minimizing costs. ARM-based Kubernetes resources can give developers an inexpensive platform for continuous integration, automated testing, and DevOps workflows.<br></div>  <h2 class="wsite-content-title"><font size="4">What Kubernetes Workloads Might be Less Suited for ARM Processors?</font><br></h2>  <div class="paragraph" style="text-align:left;">While ARM processors offer advantages for some workloads, not all Kubernetes workloads are equally suited for this architecture. Below are some specific scenarios where opting for ARM processors may not align with the workload's needs or requirements.<br></div>  <h2 class="wsite-content-title"><font size="3">High-Performance Computing (HPC):</font><br></h2>  <div class="paragraph" style="text-align:left;">HPC tasks often require specialized hardware and intense computational power, making them less suited for ARM processors. 
While ARM has advanced, x86-based processors may better handle complex simulations and scientific computing.<br></div>  <h2 class="wsite-content-title"><font size="3">Legacy Enterprise Applications:</font><br></h2>  <div class="paragraph" style="text-align:left;">ARM processors may pose compatibility challenges for legacy enterprise apps optimized for x86-64 architectures. Migrating such apps to ARM-based Kubernetes setups may require non-trivial re-engineering and testing, which can be difficult or costly.<br></div>  <h2 class="wsite-content-title"><font size="3">Containerized Databases and Analytics:</font><br></h2>  <div class="paragraph" style="text-align:left;">ARM processors may struggle with high I/O demands and data-intensive tasks compared to x86-based processors. For large-scale data processing and high-volume databases, x86-64 architectures may offer better performance.<br />In summary, while ARM processors do have advantages, it's crucial to assess their suitability for specific Kubernetes workloads, especially considering performance and compatibility with existing applications.<br /><br /></div>  <h2 class="wsite-content-title"><font size="4">On the Fence About ARM Nodes Despite an Ideal Workload Fit?</font><br></h2>  <div class="paragraph" style="text-align:left;">Several factors may make the move worthwhile, primarily Cost Savings, Energy Efficiency, and Performance. Let's explore these in detail.<br></div>  <h2 class="wsite-content-title"><font size="3">Cost Savings:</font><br></h2>  <div class="paragraph" style="text-align:left;">When it comes to running Kubernetes workloads, cost is often a concern for organizations, especially those managing large-scale deployments. ARM processors present an interesting proposition in this regard. Their lower upfront hardware costs and reduced operational expenses can make them an attractive alternative to traditional x86-64 processors. In cloud environments like Amazon EKS and Google GKE, where instances are billed based on usage, the cost differential between ARM and x86-64 instances can translate into significant savings over time.<br></div>  <h2 class="wsite-content-title"><font size="3">Energy Efficiency:</font><br></h2>  <div class="paragraph" style="text-align:left;">Another compelling advantage of ARM processors for Kubernetes workloads lies in their energy efficiency. ARM architecture is known for its ability to deliver comparable performance to x86-64 processors while consuming less power. This energy efficiency not only reduces operational costs but also contributes to sustainability efforts by minimizing the environmental impact of cloud computing. In a world increasingly concerned with reducing carbon footprints and achieving energy efficiency targets, ARM-based Kubernetes deployments align well with green computing initiatives. By harnessing the power of ARM architecture, organizations may be able to achieve a more sustainable and environmentally friendly approach to Kubernetes infrastructure management.<br></div>  <h2 class="wsite-content-title"><font size="3">Performance:</font><br></h2>  <div class="paragraph" style="text-align:left;">Contrary to popular belief, ARM processors can deliver the same or better performance for Kubernetes workloads compared to traditional x86-64 processors, in certain scenarios. 
While ARM-based instances may have historically been associated with low-power devices like smartphones and IoT gadgets, recent advancements in ARM architecture have ushered in a new era of performance capabilities. With ARM-based servers becoming increasingly prevalent in cloud environments, developers and operators have access to a wider range of ARM-powered instances than in the past. For many workloads, including web applications, microservices, and other similar containerized workloads, ARM processors offer ample computational power and efficiency. By carefully selecting ARM-based instances tailored to their specific workload characteristics, organizations can achieve optimal performance and resource utilization in their Kubernetes deployments.<br />In conclusion, ARM processors can offer benefits for Kubernetes workloads in cloud environments. From cost savings and energy efficiency to impressive performance capabilities, ARM architecture presents a viable alternative to traditional x86-64 processors for some workloads. By leveraging ARM-based instances, organizations can potentially optimize their cloud infrastructure costs, reduce operational expenses, and contribute to sustainability initiatives. Despite historical associations with low-power devices, ARM processors have evolved to deliver competitive performance for a wide range of Kubernetes workloads. With careful selection and optimization, ARM-based instances may be able to provide organizations with the performance and efficiency they need while embracing the advantages of ARM architecture.<br /><br /></div>  <h2 class="wsite-content-title"><font size="4">Optimizing Kubernetes Node Allocation with Intelligent Autoscaling</font><br></h2>  <div class="paragraph" style="text-align:left;">For Kubernetes deployments seeking to incorporate ARM nodes seamlessly, leveraging an intelligent autoscaler like Luna offers a streamlined solution. With Luna, ARM nodes can be effortlessly provisioned alongside x86-64 nodes, improving both cost efficiency and resource utilization.<br />By configuring Luna to allocate ARM nodes when they offer better pricing compared to x86-64 counterparts, administrators can obtain cost savings without operational complexity. Conversely, Luna intelligently allocates x86-64 nodes when they are the more cost-effective option, maintaining a balanced infrastructure and cost savings.<br />To ensure compatibility across architectures, container images must be multi-arch, enabling them to run seamlessly on both x86-64 and ARM nodes. Moreover, Luna provides granular control over node allocation through annotations, allowing administrators to specify preferences for instance families or to exclude certain families as needed. A minimal sketch of an architecture-flexible workload is shown below.<br /></div>
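  <div class="paragraph" style="text-align:left;">As a minimal sketch (the Deployment name and image below are hypothetical), a workload can prefer ARM nodes while still tolerating x86-64 ones by using a soft nodeAffinity on the standard <em>kubernetes.io/arch</em> node label:<br /></div>  <div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app              # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      affinity:
        nodeAffinity:
          # Soft preference: schedule on ARM nodes when available,
          # but fall back to other architectures otherwise.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ["arm64"]
      containers:
        - name: web
          # Must be a multi-arch image so it runs on both arm64 and amd64 nodes.
          image: registry.example.com/my-web-app:latest
</code></pre></div></div></div>  <div class="paragraph" style="text-align:left;">Because the preference is soft rather than a hard nodeSelector, the workload keeps running even when no ARM capacity is available.<br /><br />In summary, leveraging Luna autoscaler streamlines ARM node allocation in Kubernetes environments, enabling organizations to harness the benefits of ARM architecture while maintaining flexibility and cost efficiency in their deployments.<br /><br />To delve deeper into Luna's intelligent autoscaling of x86-64 and ARM nodes, check out our <a href="https://www.elotl.co/luna.html">Luna product page</a> for details. For step-by-step guidance, be sure to review our <a href="https://docs.elotl.co/luna/intro/">Documentation</a>. Ready to test Luna firsthand? 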
<a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /></div>  <div class="paragraph"><strong>Author:</strong><br /><span></span>Justin Willoughby (Principal Solutions Architect, Elotl)<br /><span></span><strong>Contributors:</strong><br /><span></span>Anne Holler (Chief Scientist, Elotl)<br /><span></span>Henry Precheur (Senior Staff Engineer, Elotl)<br><br /><span></span></div>  <div><div style="margin: 10px 0 0 -10px"> <a title="Download file: scale20x.pdf" href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf"><img src="//www.weebly.com/weebly/images/file_icons/pdf.png" width="36" height="36" style="float: left; position: relative; left: 0px; top: 0px; margin: 0 15px 15px 0; border: 0;" /></a><div style="float: left; text-align: left; position: relative;"><table style="font-size: 12px; font-family: tahoma; line-height: .9;"><tr><td colspan="2"><b> scale20x.pdf</b></td></tr><tr style="display: none;"><td>File Size:  </td><td>1215 kb</td></tr><tr style="display: none;"><td>File Type:  </td><td> pdf</td></tr></table><a title="Download file: scale20x.pdf" href="https://www.elotl.co/uploads/1/3/0/3/130365369/scale20x.pdf" style="font-weight: bold;">Download File</a></div> </div>  <hr style="clear: both; width: 100%; visibility: hidden"></hr></div>]]></content:encoded></item><item><title><![CDATA[The Benefits of Cycling Kubernetes Nodes: Optimizing Performance, Reliability, and Security]]></title><link><![CDATA[https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security]]></link><comments><![CDATA[https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security#comments]]></comments><pubDate>Tue, 09 Apr 2024 17:41:48 GMT</pubDate><category><![CDATA[Autoscaling]]></category><category><![CDATA[Luna]]></category><category><![CDATA[Node Management]]></category><guid isPermaLink="false">https://www.elotl.co/blog/the-benefits-of-cycling-kubernetes-nodes-optimizing-performance-reliability-and-security</guid><description><![CDATA[ Wondering whether cycling out older Kubernetes nodes periodically is a good idea? In the world of Kubernetes administration, the practice of rotating nodes often takes a backseat, even though it holds considerable advantages. While it's true that node cycling isn't universally applicable, it's worth exploring its merits for your environment. In this article, I will delve into many of the compelling reasons why considering node rotation might be beneficial for your clusters. We'll explore the ad [...] 
]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:300px;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/the-benefits-of-cycling-kubernetes-nodes.jpg?1712685298" style="margin-top: 5px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;"><span>Wondering whether cycling out older Kubernetes nodes periodically is a good idea?</span> In the world of Kubernetes administration, the practice of rotating nodes often takes a backseat, even though it holds considerable advantages. While it's true that node cycling isn't universally applicable, it's worth exploring its merits for your environment. In this article, I will delve into many of the compelling reasons why considering node rotation might be beneficial for your clusters. We'll explore the advantages of node rotation in Kubernetes and how it contributes to resource optimization, fault tolerance, security, and performance improvements.<br /><br />Why might someone think cycling of Kubernetes nodes is unnecessary? One reason for this could be a misconception about the stability of Kubernetes clusters. In environments where nodes rarely fail or resource usage remains relatively consistent, there might be a tendency to prioritize other tasks over node cycling. Additionally, the perceived complexity of implementing node rotation strategies, particularly in large-scale or production environments, could dissuade teams from actively considering it. Some teams might also be unaware of the potential performance gains and reliability improvements that can result from regular node cycling. However, despite these challenges or misconceptions, it's crucial to recognize that neglecting node rotation can lead to issues such as <span>resource exhaustion, reduced fault tolerance, security vulnerabilities, difficulties upgrading to newer versions, and degraded performance over time</span>. By acknowledging the importance of node cycling and implementing proactive strategies, administrators and DevOps teams can ensure the long-term health, resilience, and efficiency of their Kubernetes infrastructure. So, without delay, let's delve into the specifics.<br /><br /></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <div>  <!--BLOG_SUMMARY_END--></div>  <div class="paragraph" style="text-align:left;">Node rotation in Kubernetes aids in maintaining a secure environment through timely patch management and isolation of compromised nodes. By cycling nodes at regular intervals, security patches and updates can be deployed consistently, reducing the attack surface and mitigating potential vulnerabilities. 
In the event of a compromised node, cycling it out of the cluster helps contain the threat and prevent further damage, enhancing overall security posture.<br /><br />James Cunningham, a Lead Infrastructure Engineer at PlanetScale, highlights the multifaceted benefits of node cycling within Kubernetes environments, stating, <em>"It optimizes workload distribution, ensures a seamless refresh of nodes with the newest kernel and OS updates, all while maintaining stability and virtually eliminating state drift."</em> This encapsulates the transformative impact node cycling has on infrastructure maintenance and performance optimization. By periodically refreshing nodes, organizations can ensure that workloads are efficiently distributed, leveraging the latest kernel and OS updates seamlessly.<br /><br />Moreover, the assurance of utilizing updated packages without the need for disruptive reboots enhances system stability and security. Additionally, the mitigation of state drift to near-zero levels minimizes inconsistencies across the infrastructure, fostering a more reliable and predictable operational environment. Through proactive node cycling practices, organizations can effectively uphold operational excellence while continuously adapting to evolving workload demands.<br />Cycling Kubernetes nodes leads to performance improvements by leveraging newer hardware and optimizing networking infrastructure. Refreshing the underlying hardware or virtual infrastructure enhances performance by capitalizing on advancements in technology. Additionally, redistributing workloads across the cluster reduces resource contention and bottlenecks, resulting in better performance for applications and services running on Kubernetes.<br /><br />The adoption of efficient node management practices is pivotal for maintaining a resilient and high-performing infrastructure. James further sheds light on the effectiveness of node cycling within this context: <em>&ldquo;Node cycling serves as our seamless approach to upgrading kubelets post-upgrading the apiservers. Rather than setting off on some grand rescheduling process across the whole cluster after upgrading the apiservers, we set a 30-day timer and let computers do the hard work.&rdquo;</em> This quote underscores the practical benefits of node cycling, particularly in simplifying the upgrade process while reducing operational overhead. With node cycling, administrators can seamlessly ensure that kubelets are upgraded following apiserver updates, all without the need for immediate, large-scale rescheduling efforts. This streamlined approach not only enhances operational efficiency but also bolsters system reliability by keeping critical components up-to-date without interrupting ongoing workloads. By integrating node cycling into their Kubernetes management workflows, organizations, such as PlanetScale, can effectively navigate the complexities of infrastructure maintenance and stay agile in an ever-evolving landscape.<br /><br />Regular node cycling also facilitates proactive fault detection and mitigation. By replacing nodes on a scheduled basis, potential hardware failures or issues are addressed before they impact application availability. This approach ensures redundancy within the cluster, enabling seamless workload transition in case of unexpected node failures. 
Additionally, through automated health checks and compatibility validations during node cycling, the cluster's resilience and stability are reinforced, guaranteeing a robust foundation for running mission-critical applications.<br /><br />Wondering how to automate node cycling in your Kubernetes environment? <span>There are several methods available,</span> one of which is utilizing Luna. Luna stands out as an intelligent autoscaler capable of not only provisioning and managing nodes for workloads but also orchestrating the removal of nodes beyond a specified NodeTTL (Time to Live) value. This feature ensures efficient node cycling based on your defined TTL, streamlining operations effortlessly. For instance, if you prefer a weekly node cycling routine, simply configure the NodeTTL parameter within Luna to 7d, and voila! Luna takes care of the rest, seamlessly managing the node lifecycle within your cluster.<br /><br />While node cycling offers numerous benefits for maintaining a healthy and efficient Kubernetes infrastructure, there are certain scenarios where it may not be practical or necessary. One such exception is in environments where workloads require long-running processes or persistent connections that cannot easily be migrated to other nodes. In these cases, interrupting these processes by cycling out nodes could result in service disruptions or data loss. Additionally, in environments with strict compliance or regulatory requirements, the process of cycling nodes out may introduce additional complexity and risk, especially if it involves downtime or configuration changes that could impact compliance status. So while node cycling is generally beneficial for most Kubernetes deployments, it's essential to consider these exceptions and weigh the potential trade-offs before implementing a node rotation strategy. Fortunately, Luna provides a solution for critical workloads that cannot or should not be terminated during node cycling processes. <span>With the capability to set a "do-not-evict" annotation on such workloads, Luna ensures that pods remain untouched until they have terminated naturally or the annotation is removed.</span> This functionality enables the smooth cycling of nodes within the cluster while avoiding any disruption to critical workloads (a sketch of such an annotated pod appears below).<br /><br /></div>
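<div class="paragraph" style="text-align:left;">As a minimal sketch of such a protected workload (the pod name and image are hypothetical, and the annotation key shown is illustrative; consult the Luna documentation for the exact key your Luna version uses), the annotation is set in the pod's metadata:<br /></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: v1
kind: Pod
metadata:
  name: long-running-job                 # hypothetical critical workload
  annotations:
    # Illustrative annotation key -- check the Luna documentation
    # for the exact "do-not-evict" key used by your Luna version.
    pod.elotl.co/do-not-evict: "true"
spec:
  containers:
    - name: worker
      image: registry.example.com/batch-worker:latest
</code></pre></div></div></div><div class="paragraph" style="text-align:left;">Once the pod terminates naturally, or once the annotation is removed, its node becomes eligible for cycling again.<br /><br />In conclusion, cycling Kubernetes nodes at regular intervals offers significant benefits across various aspects of Kubernetes management. By optimizing resource utilization, enhancing fault tolerance and reliability, strengthening security measures, and improving performance, node rotation contributes to a more efficient and resilient Kubernetes environment. Incorporating node cycling into your Kubernetes maintenance strategy can help ensure the smooth operation of your containerized workloads and enhance the overall stability of your infrastructure.<br /><br />To delve deeper into Luna's intelligent autoscaling capabilities, including node cycling, explore our <a href="https://www.elotl.co/luna.html">product page</a> for details. For step-by-step guidance, consult our <a href="https://docs.elotl.co/luna/intro/" target="_blank">Documentation</a>. Ready to test Luna firsthand? 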
<a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><strong>Contributors:</strong><br />James Cunningham (Lead Infrastructure Engineer, PlanetScale)<br />Henry Precheur (Senior Staff Engineer, Elotl)<br />Anne Holler (Chief Scientist, Elotl)<br /><br /></div>]]></content:encoded></item><item><title><![CDATA[Deep Learning Training with Ray and Ludwig using Elotl Luna]]></title><link><![CDATA[https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna]]></link><comments><![CDATA[https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna#comments]]></comments><pubDate>Thu, 22 Feb 2024 15:25:19 GMT</pubDate><category><![CDATA[Deep Learning]]></category><category><![CDATA[Luna]]></category><guid isPermaLink="false">https://www.elotl.co/blog/deep-learning-training-with-ray-and-ludwig-using-elotl-luna</guid><description><![CDATA[In this brief summary blog, we delve into the intriguing realm of GPU cost savings in the cloud through the use of Luna, an Intelligent Autoscaler. If you're passionate about harnessing the power of Deep Learning (DL) while optimizing expenses, this summary is for you. Join us as we explore how innovative technologies are revolutionizing the landscape of resource management in the realm of Deep Learning. Let's embark on a journey where efficiency meets intelligence, promising both technical insi [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/cloud-dl-cost.jpg?1708617557" style="margin-top: 5px; margin-bottom: 0px; margin-left: 20px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image"></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -0px; margin-bottom: 0px; text-align: center;" class="wsite-caption"></span></span><div class="paragraph" style="text-align:left;display:block;"><span>In this brief summary blog, we delve into the intriguing realm of GPU cost savings in the cloud through the use of Luna, an Intelligent Autoscaler.</span> <span>If you're passionate about harnessing the power of Deep Learning (DL) while optimizing expenses, this summary is for you.</span> Join us as we explore how innovative technologies are revolutionizing the landscape of resource management in the realm of Deep Learning. Let's embark on a journey where efficiency meets intelligence, promising both technical insights and a practical solution.<br><br>Deep Learning has and continues to transform many industries such as AI, Healthcare, Finance, Retail, E-commerce, and many others. Some of the challenges with DL include <span>its</span> high cost and operational overhead:<ol><li><em>Compute Costs</em>: Deep learning models require significant computational resources, which lead to high costs, especially for complex or large-scale projects. 
This is even more true when compute remains provisioned while it&rsquo;s not needed.</li><li><em>Instance Management</em>: Managing cloud instances for training, inference, and experimentation creates operational overhead. This includes provisioning and configuring virtual machines, monitoring resource usage, and optimizing instance types for performance and cost efficiency.</li><li><em>Infrastructure Scaling</em>: Scaling deep learning workloads in the cloud involves dynamically adjusting compute resources to meet demand. This requires optimizing resource allocation to minimize costs while ensuring sufficient capacity.</li></ol><br>Open-source platforms like <a href="https://www.ray.io/"><span>Ray</span></a> and <a href="https://ludwig.ai/latest/"><span>Ludwig</span></a> have broadened DL accessibility, yet DL models&rsquo; <span></span>intensive GPU resource demands present financial hurdles. Addressing this, Elotl Luna emerges as a solution, streamlining compute for Kubernetes clusters without the need for manual scaling, which often results in wasted spend.</div><hr style="width:100%;clear:both;visibility:hidden;"><div><!--BLOG_SUMMARY_END--></div><div class="paragraph" style="text-align:left;">Running Ray and Ludwig on cloud Kubernetes clusters using Luna, an Intelligent Kubernetes Cluster Autoscaler, is a great approach to mitigating the challenges often faced with DL and public cloud GPU resource demands. Luna dynamically adjusts GPU resources based on workload needs, resulting in substantial efficiency gains.<br><br>Luna showed significant improvements over a fixed-size Ray cluster on AWS, all while preserving AutoML performance quality:<ul><li>Reduced elapsed time by 61%</li><li>Reduced compute cost by 54%</li><li>Reduced idle Ray cluster cost by 66%</li></ul><br><span>The exploration and testing encompassed ML experiments utilizing Ludwig v0.4.1, leveraging its AutoML capability. These results were obtained during the ML training workload aimed at validating the newly added AutoML feature in Ludwig v0.4.1.</span> Luna&rsquo;s resource management can be used to provide just-in-time compute for Ludwig&rsquo;s AutoML across various datasets, employing Ray Tune for hyperparameter search on GPU-enabled workers. Results prove competitive with manually-tuned models, showcasing Luna&rsquo;s adaptability and efficiency in DL workflows.<br><br>Lessons learned underscore the substantial savings achieved in workload elapsed time, execution costs, idle costs, and operational complexity. This is just a glimpse into the transformative impact of Luna on DL training workloads in the cloud. For a comprehensive understanding, dive into the full details of the <a href="https://www.cncf.io/blog/2022/02/15/managing-public-cloud-resources-for-deep-learning-training-experiments-and-lessons-learned/">Managing public cloud resources for deep learning training: experiments and lessons learned</a> blog on the <span>Cloud Native Computing Foundation site</span>.<br><br>Furthermore, we encourage you to explore our subsequent research, which validates the efficacy of Ludwig v0.5.0 AutoML for text classification datasets. 
In this study, Luna also showed significant savings:<ul><li>Reduced elapsed time by 7%</li><li>Reduced compute cost by 59%</li><li>Reduced idle Ray cluster cost by <span>66</span>%</li></ul><br>The full details of this experiment can be found by viewing the slides and/or video recording from the<a href="https://kubernetesaidayeu22.sched.com/event/zr9E/efficient-automl-with-ludwig-ray-and-nodeless-kubernetes-anne-marie-holler-elotl-travis-addair-predibase.">&nbsp;Efficient AutoML with Ludwig, Ray, and Nodeless Kubernetes</a>&nbsp;session from Kubernetes AI Day Europe.<br><br>In both cases, Luna was able to dramatically lower the cost and enhance the performance of the Deep Learning jobs.<br></div><div><div class="wsite-multicol"><div class="wsite-multicol-table-wrap" style="margin:0 -15px;"><table class="wsite-multicol-table"><tbody class="wsite-multicol-tbody"><tr class="wsite-multicol-tr"><td class="wsite-multicol-col" style="width:65.055762081784%; padding:0 15px;"><div><div id="146835663495715482" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><style>    table {      border-collapse: collapse; /* Remove space between cells */      border: 1px solid #ddd; /* Border around whole table */    }    th, td {      padding: 14px; /* Adjust padding as needed */      border: 1px solid #ddd; /* Border between cells */    }    th {      background-color: #f5f5f5; /* Light gray header */    }</style><table><thead><tr><th>Reduced</th><th>First Experiment</th><th>Second Experiment</th></tr></thead><tbody><tr><td>Elapsed time by</td><td>61%</td><td>7%</td></tr><tr><td>Compute cost by</td><td>54%</td><td>59%</td></tr><tr><td>Idle Ray cluster cost by</td><td>66%</td><td>66%</td></tr></tbody></table></div></div></td><td class="wsite-multicol-col" style="width:34.944237918216%; padding:0 15px;"><div class="wsite-spacer" style="height:50px;"></div></td></tr></tbody></table></div></div></div><div class="paragraph" style="text-align:left;"><br>While this summary has provided a glimpse into the fascinating world of GPU cost savings with Luna, we must acknowledge that it merely scratches the surface of the comprehensive insights offered in the original blog and subsequent presentation. We hope this summary has sparked your curiosity and motivated you to explore the full depth of knowledge available. For a more detailed understanding, we encourage you to dive into the original blog and presentations linked above.<br><br>To explore the robust features and capabilities of Luna in greater detail, visit our <a href="https://www.elotl.co/luna.html">Luna Product</a> page. For comprehensive guidance, refer to our <a href="https://docs.elotl.co/luna/intro/" target="_blank">documentation</a>. Ready to experience firsthand the seamless management of compute for GPU workloads? 
Start testing Luna today and discover the efficiency and flexibility it offers for your cloud environments.<br><br><strong>Author:</strong><br>Justin Willoughby (Principal Solutions Architect, Elotl)<br><br><strong>Authors/Contributors of the full blog on which this summary blog is based:</strong><br>Anne Holler, Chi Su, Travis Addair, Henry Pr&ecirc;cheur, Pawe&#322; Bojanowski, Madhuri Yechuri, and Richard Liaw<br></div>]]></content:encoded></item><item><title><![CDATA[A Guide to Disaster Recovery for FerretDB with Elotl Nova on Kubernetes]]></title><link><![CDATA[https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes]]></link><comments><![CDATA[https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes#comments]]></comments><pubDate>Mon, 12 Feb 2024 20:00:29 GMT</pubDate><category><![CDATA[Disaster Recovery]]></category><category><![CDATA[Nova]]></category><guid isPermaLink="false">https://www.elotl.co/blog/a-guide-to-disaster-recovery-for-ferretdb-with-elotl-nova-on-kubernetes</guid><description><![CDATA[Originally published on blog.ferretdb.io Running a database without a disaster recovery process can result in loss of business continuity, resulting in revenue loss and reputation loss for a modern business. Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring contin [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">Originally published on <a href="https://blog.ferretdb.io/guide-disaster-recovery-ferretdb-elotl-nova-kubernetes/" target="_blank">blog.ferretdb.io</a><br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-elotl-nova-8ae8904f848588c61bcf90b3803d2d11_orig.jpg" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">Running a database without a disaster recovery process can result in loss of business continuity, and with it loss of revenue and reputation for a modern business.<br><br>Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring continuous operation of the databases backing your production apps is critical to avoid losing your customers' trust.<br><br>Successful disaster recovery requires:<br><ul><li>Reliable components to automate backup and recovery<br></li><li>A watertight way to identify problems<br></li><li>A list of steps to revive the database<br></li><li>Regular testing of the recovery process<br></li></ul><br>This blog post shows how to automate these four aspects of disaster recovery using FerretDB, Percona PostgreSQL and Nova. 
Nova automates parts of the recovery process, reducing mistakes and getting your data back online faster.<br></div><div><!--BLOG_SUMMARY_END--></div><h2 class="wsite-content-title"><font size="6">Components overview</font><br></h2><div class="paragraph">FerretDB is an open-source proxy that translates MongoDB wire protocol queries to SQL, with PostgreSQL or SQLite as the database engine.<br><br>Percona for PostgreSQL is a tool set to manage your PostgreSQL database system: it installs PostgreSQL and adds a selection of extensions that help manage the database.<br><br>Nova is a multi-cloud, multi-cluster control plane that orchestrates workloads across multiple Kubernetes clusters via user-defined policies.</div><h2 class="wsite-content-title"><font size="6">Defining a Disaster Recovery setup for FerretDB + Percona Postgres</font><br></h2><div class="paragraph">FerretDB operates as a stateless application; therefore, during recovery, Nova only needs to make sure it is connected to a primary PostgreSQL database.<br><br>To implement PostgreSQL's Disaster Recovery (DR), a primary cluster, standby cluster, and object storage, such as an S3 bucket, are required. The storage will be used for storing periodic backups performed on the primary cluster. The standby cluster will be configured to replay from the backup location, keeping it in sync with the primary. When disaster strikes, the standby is set as a new primary to keep the database running (more details can be found here: Percona Blog).<br><br>As the entry point for our database, a proxy in front of it directs communication to the appropriate instance.<br></div><h2 class="wsite-content-title"><font size="5">Basic setup</font><br></h2><div class="paragraph">Setup involves three clusters:<ol><li>Workload Cluster 1 contains:<br>&nbsp; Percona Operator<br>&nbsp; PostgreSQL primary cluster<br>&nbsp; FerretDB</li><li>Workload Cluster 2 contains:<br>&nbsp; Percona Operator<br>&nbsp; PostgreSQL standby cluster<br>&nbsp; FerretDB</li><li>Workload Cluster 3 contains:<br>&nbsp; HAProxy, the single entry point to FerretDB.<br>&nbsp; HAProxy connected to FerretDB in cluster 1 (linked to the primary PostgreSQL).<br>&nbsp; After recovery, HAProxy will be connected to FerretDB in cluster 2 (linked to the new primary PostgreSQL).<br></li></ol><br>The proxy is a single point of failure; it is intentionally set up this way to simplify the demonstration of database recovery.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-before-recovery-without-nova-c2e192c84a5f69f989ce308053e920e3_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph">With the described setup in place, Nova can execute the following recovery steps if Cluster 1 fails:<ol><li>Set Percona cluster 2 as primary<br></li><li>Set Percona cluster 1 as standby (You cannot have two primary clusters simultaneously in one setup as it would disrupt the backup process. 
If Cluster 1 is initially marked as failed due to network issues and Cluster 2 takes over, Nova must ensure that, if Cluster 1 becomes available again, it does not reconnect as the primary.)<br></li><li>Connect HAProxy to FerretDB in cluster 2<br></li></ol>The sketch following this list illustrates the promotion in step 1.</div>
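<div class="paragraph">As a minimal sketch of step 1 (the cluster name is hypothetical, and the exact field names may vary across Percona operator versions -- check the CRD shipped with your release), promoting the standby amounts to disabling standby mode on its PerconaPGCluster resource:<br></div><div><div class="code-container" style="background-color: #f5f5f5;"><div class="code-block" data-code-lang="yaml"><pre><code class="language-yaml" style="white-space: pre;">apiVersion: pgv2.percona.com/v2
kind: PerconaPGCluster
metadata:
  name: cluster2               # hypothetical name of the standby cluster
spec:
  ...
  standby:
    enabled: false             # was "true" while cluster2 trailed the primary
    repoName: repo1            # backup repository the standby replayed from
</code></pre></div></div></div><div class="paragraph">Symmetrically, setting <em>standby.enabled: true</em> on cluster 1 (step 2) demotes it so it cannot come back as a second primary.<br></div><h2 class="wsite-content-title"><font size="6">Automating the setup and recovery execution</font><br></h2><div class="paragraph">To simplify deployment across multiple clusters, use Nova to deploy FerretDB and the Percona Operator, and to configure PostgreSQL and HAProxy. By setting up policies, Nova will direct workloads, along with their configurations, to the appropriate cluster. Detailed information about configuring policies in Nova is described in the <a href="https://docs.elotl.co/nova/intro" target="_blank">Nova Documentation</a>.<br></div><h2 class="wsite-content-title"><font size="5">Enhanced setup</font><br></h2><div class="paragraph">An additional Kubernetes cluster is required to host the Nova control plane, and Nova agents are incorporated into the existing Kubernetes clusters. This setup enables exclusive communication with the Nova control plane during the deployment and configuration of all components.</div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-before-recovery-2ee14993ee57f0fd7256de058ae60c7f_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="5">Nova Schedule Policy for FerretDB</font><br></h2><div class="paragraph">With Nova scheduling policies, you can deploy all workloads and Nova will distribute them among clusters as needed. 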
For example, the policy below spreads the FerretDB deployment across two clusters, overriding the PostgreSQL connection string so that each cluster's FerretDB points at its local PostgreSQL service.<br></div><div class="wcustomhtml"><pre><code class="language-yaml">apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: spread-ferretdb
spec:
  namespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: Exists
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          app: ferretdb
  groupBy:
    labelKey: app
  clusterSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - cluster-1
          - cluster-2
  spreadConstraints:
    spreadMode: Duplicate
    topologyKey: kubernetes.io/metadata.name
    overrides:
      - topologyValue: cluster-1
        resources:
          - kind: Deployment
            apiVersion: apps/v1
            name: ferretdb
            namespace: default
            override:
              - fieldPath: spec.template.spec.containers[0].env[0].value
                value:
                  staticValue: postgres://cluster1-ha.psql-operator.svc:5432/zoo
      - topologyValue: cluster-2
        resources:
          - kind: Deployment
            apiVersion: apps/v1
            name: ferretdb
            namespace: default
            override:
              - fieldPath: spec.template.spec.containers[0].env[0].value
                value:
                  staticValue: postgres://cluster2-ha.psql-operator.svc:5432/zoo
---
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: psql-cluster-1-ferretdb
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: cluster-1
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          psql-cluster: cluster-1
---
apiVersion: policy.elotl.co/v1alpha1
kind: SchedulePolicy
metadata:
  name: psql-cluster-2-ferretdb
spec:
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default
  clusterSelector:
    matchLabels:
      kubernetes.io/metadata.name: cluster-2
  resourceSelectors:
    labelSelectors:
      - matchLabels:
          psql-cluster: cluster-2</code></pre></div>
style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">alertLabels</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token key atrule" style="color:#00a4db">app</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">app</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch  </span><span class="token comment" style="color:#999988;font-style:italic"># set perconapgclusters 1 to standby</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"pg.percona.com/v2beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"perconapgclusters"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cluster1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"spec.standby.enabled"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch  </span><span class="token comment" style="color:#999988;font-style:italic"># set perconapgclusters 2 to primary</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"pg.percona.com/v2beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"perconapgclusters"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cluster2"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"spec.standby.enabled"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> readField   </span><span class="token comment" style="color:#999988;font-style:italic"># read ferretdb service hostname in cluster 2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">     </span><span class="token key atrule" style="color:#00a4db">readField</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"v1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"services"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"default"</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"ferretdb-service-2"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> "status.loadBalancer.ingress</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain">.hostname"       outputKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Cluster2IP"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> patch </span><span class="token comment" style="color:#999988;font-style:italic"># update HAProxy to point to ferretdb service in cluster 2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">patch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"v1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">resource</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"configmaps"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">namespace</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"psql-operator"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"haproxy-config"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">override</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">fieldPath</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"data"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">         </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">           </span><span class="token key atrule" style="color:#00a4db">raw</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token key atrule" style="color:#00a4db">"haproxy.cfg"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"defaults\n    mode tcp\n    timeout connect 5000ms\n    timeout client 50000ms\n    timeout server 50000ms\n\nfrontend fe_main\n    bind *:5432\n    default_backend be_db_2\n\nbackend be_db_2\n    server db2 {{ .Values.Cluster2IP }}:27017 check"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">       </span><span class="token key atrule" style="color:#00a4db">patchType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"application/merge-patch+json"</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"></span></button></div></div></div></div></div><h2 class="wsite-content-title"><font size="5">Triggering the recovery plan execution</font><br></h2><div class="paragraph">Nova exposes a webhook endpoint that matches recovery plans with the alert's label. You can send an alert manually using a tool like curl. Alternatively, you can use an alert system, like AlertManager + Prometheus, which will automatically notify Nova when a certain metric goes beyond a set limit.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-recovery-77461a18b9d30b5c9a261ca6d0c98fed_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="6">Summary</font><br></h2><div class="paragraph">The above steps, process, and execution has resulted in a successful setup of FerretDB to autonomously recover from disasters, such as region-wide failures. 
<div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/ferretdb-recovery-77461a18b9d30b5c9a261ca6d0c98fed_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><h2 class="wsite-content-title"><font size="6">Summary</font><br></h2><div class="paragraph">With the steps above in place, FerretDB is set up to recover autonomously from disasters such as region-wide failures. This configuration provides seamless healing after unexpected events, greatly improving the resilience of the FerretDB deployment.<br><br>To learn more about FerretDB, see the <a href="https://docs.ferretdb.io/understanding-ferretdb/" target="_blank">documentation</a>.<br><br>To learn more about Nova, see the <a href="https://docs.elotl.co/nova/intro/" target="_blank">Nova documentation and try it for free</a>.<br><br><strong>Author:</strong><br>Maciek Urbanski (Senior Platform Engineer, Elotl)<br><br><strong>Contributors:</strong><br>Selvi Kadirvel, Henry Precheur, Janek Baranowski, Pawel Bojanowski, Justin Willoughby, Madhuri Yechuri<br></div>]]></content:encoded></item><item><title><![CDATA[Cloud GPU Allocation Got You Down? Elotl Luna to the Rescue!]]></title><link><![CDATA[https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue]]></link><comments><![CDATA[https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue#comments]]></comments><pubDate>Thu, 08 Feb 2024 19:02:30 GMT</pubDate><category><![CDATA[Luna]]></category><category><![CDATA[Machine Learning]]></category><guid isPermaLink="false">https://www.elotl.co/blog/cloud-gpu-allocation-got-you-down-elotl-luna-to-the-rescue</guid><description><![CDATA[ How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resourc [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:5px;*margin-top:10px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/published/floating-gpu.jpg?1707419095" style="margin-top: 0px; margin-bottom: 10px; margin-left: 20px; margin-right: 0px; border-width:1px;padding:3px; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="text-align:left;display:block;"><em>How do I efficiently run my AI or Machine Learning (ML) workloads in my Kubernetes clusters?</em><br /><br />Operating Kubernetes clusters with GPU compute manually presents several challenges, particularly in the allocation and management of GPU resources. One significant pain point is the potential for wasted spend, as manually allocated GPUs may remain idle during periods of low workload. In dynamic or bursty clusters, predicting the optimal GPU requirements becomes challenging, leading to suboptimal resource utilization and increased costs. Additionally, manual allocation necessitates constant monitoring, requiring administrators to stay aware of GPU availability in clusters spread across different zones or regions.
Once the GPU requirements are determined for a given workload, the administrator needs to manually add nodes when demand surges and remove them during periods of inactivity.<br /><br />There are many GPU types, each with different capabilities, running on different node types. The combination of these factors makes manual GPU node management increasingly convoluted. Different workloads may require specific GPU models, leading to complexities in node allocation. Manually ensuring the correct GPU nodes for diverse workloads becomes a cumbersome task, especially when dealing with multiple applications with varying GPU preferences. This adds another layer of operational overhead, demanding detailed knowledge of GPU types and their availability, along with continuous adjustments to meet workload demands.<br /><br />Luna, an intelligent node autoscaler, addresses these pain points by automating GPU node allocation based on workload demands. Luna is aware of GPU availability, so it can dynamically choose and allocate the needed GPU nodes, eliminating the need for manual intervention. This optimizes resource utilization and reduces wasted spend by scaling GPU resources in line with the workload. Moreover, Luna can allocate specific nodes as defined by the workload requirements, ensuring precise resource allocation tailored to the application's needs. This makes Luna well suited to the most complex compute jobs, such as AI and ML workloads.<br /><br />Furthermore, Luna's core functionality includes the automatic allocation of alternative GPU nodes when preferred GPUs are unavailable, bolstering its flexibility and resilience. This ensures that workloads with specific GPU preferences can seamlessly transition to available alternatives, maintaining uninterrupted operation. Controlled through annotations on the workload, users can specify cloud instance types to use or avoid, either by instance family or via regular expressions, along with desired GPU SKUs. This capability enables dynamic allocation based on GPU availability and workload demands, simplifying cluster management and promoting efficient scaling and resource utilization without constant manual adjustments; a sketch of such annotations appears below.</div>
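<div class="paragraph">To make the mechanism concrete, here is a hypothetical pod spec carrying Luna-style annotations. The annotation keys and values below are illustrative placeholders, not Luna's documented names; consult the Luna documentation for the exact annotation syntax.</div><div class="wcustomhtml"><pre><code class="language-yaml"># Illustrative only: the annotation keys below are placeholders, not Luna's
# documented names. They show the kind of hints the text describes.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    example.elotl.co/gpu-sku: "nvidia-tesla-t4"         # desired GPU SKU (placeholder key)
    example.elotl.co/instance-family: "g4dn"            # preferred instance family (placeholder key)
    example.elotl.co/instance-type-exclude: ".*metal.*" # regexp of types to avoid (placeholder key)
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest                 # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # standard K8s GPU resource request
</code></pre></div>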
<div class="paragraph" style="text-align:left;display:block;">Lastly, the advantages of Luna extend beyond resource optimization and workload adaptability within a single cloud. When organizations leverage various cloud providers, flexibility is paramount. An intelligent autoscaler that supports GPU management across multiple cloud providers gives users the freedom to choose the most suitable platform for their specific needs. With Luna, enterprises are not locked into a single cloud provider; they retain the agility to transition workloads seamlessly between cloud environments based on cost-effectiveness, performance, or specific features. Currently, Luna supports four cloud providers: Amazon AWS with EKS, Google Cloud with GKE, Microsoft Azure with AKS, and Oracle Cloud Infrastructure with OKE. By providing a unified, provider-agnostic approach to GPU resource management, Luna becomes a strategic asset, enabling organizations to harness the benefits of diverse cloud ecosystems without compromising efficiency or incurring vendor lock-in.<br /><br />In summary, manually managing GPU compute in Kubernetes clusters poses challenges related to wasted spend and the manual addition, monitoring, and removal of nodes. Luna addresses these pain points by:<ul><li>Streamlining GPU node allocation according to workload demands</li><li>Optimizing resource utilization by dynamically choosing and allocating nodes</li><li>Adapting seamlessly to fluctuations in GPU availability</li><li>Unifying operations across multiple clusters and cloud providers: Amazon EKS, Google GKE, Azure AKS, and Oracle OKE</li></ul><br />Luna simplifies cluster node management, reduces operational overhead, and ensures efficient GPU resource utilization.<br /><br />To delve deeper into Luna's powerful features and capabilities, explore the <a href="https://www.elotl.co/luna.html">Luna product page</a> for details. For step-by-step guidance, consult our <a href="https://docs.elotl.co" target="_blank">Documentation</a>. Ready to experience the seamless management of GPU workloads firsthand? <a href="https://www.elotl.co/luna-free-trial.html">Try Luna</a> today with our free trial and witness the efficiency and flexibility it brings to your cloud environments.<br /><br /><strong>Author:</strong><br />Justin Willoughby (Principal Solutions Architect, Elotl)<br /><br /><strong>Contributors:</strong><br />Henry Precheur (Senior Staff Engineer, Elotl)<br />Anne Holler (Chief Scientist, Elotl)<br></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>]]></content:encoded></item><item><title><![CDATA[Luna 1.0.0 is out]]></title><link><![CDATA[https://www.elotl.co/blog/luna-100-is-out]]></link><comments><![CDATA[https://www.elotl.co/blog/luna-100-is-out#comments]]></comments><pubDate>Tue, 06 Feb 2024 17:20:19 GMT</pubDate><category><![CDATA[Luna]]></category><guid isPermaLink="false">https://www.elotl.co/blog/luna-100-is-out</guid><description><![CDATA[ The Elotl team is thrilled to announce a major milestone in our journey &mdash; the release of Luna version 1.0.0. Luna is an Intelligent Kubernetes Cluster Autoscaler that optimizes cost, simplifies operations, and supports four public Cloud Providers: Amazon EKS, Google GKE, Microsoft AKS, and Oracle OCI.While some might associate version 1.0.0 with potential hiccups, rest assured, this release is a testament to our commitment to excellence and stability. We&rsquo;ve diligently worked to [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:right;height:0px'></span><span style='display: table;width:auto;position:relative;float:right;max-width:100%;;clear:right;margin-top:0px;*margin-top:0px'><a><img src="https://www.elotl.co/uploads/1/3/0/3/130365369/editor/luna-logo-for-web.png?1707241364" style="margin-top: 10px; margin-bottom: 10px; margin-left: 20px; margin-right: 10px; border-width:0; max-width:100%" alt="Picture" class="galleryImageBorder wsite-image" /></a><span style="display: table-caption; caption-side: bottom; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;" class="wsite-caption"></span></span> <div class="paragraph" style="display:block;">The Elotl team is thrilled to announce a major milestone in our journey &mdash; the release of Luna version 1.0.0. Luna is an Intelligent Kubernetes Cluster Autoscaler that optimizes cost, simplifies operations, and supports four public Cloud Providers: Amazon EKS, Google GKE, Microsoft AKS, and Oracle OCI.<br />While some might associate version 1.0.0 with potential hiccups, rest assured, this release is a testament to our commitment to excellence and stability.
We&rsquo;ve diligently worked to ensure that this version not only meets but exceeds expectations.<br /></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <h2 class="wsite-content-title"><font size="5">Why Luna Version 1.0.0 is a Milestone:</font><br></h2>  <div class="paragraph"><ul><li>Widened Horizon: Luna has been rigorously tested and optimized, making it suitable for a broad range of applications.</li><li>Trusted in Production: Version 1.0.0 builds upon the rock-solid foundation of its predecessor, version 0.7.4, which has been running successfully in diverse production clusters.<br></li></ul></div>  <h2 class="wsite-content-title"><font size="5">Give it a try</font><br></h2>  <div class="paragraph">To learn more about Luna, check out the <u><a href="https://www.elotl.co/luna.html">Luna product page</a></u>; you can also <a href="https://www.elotl.co/luna-free-trial.html"><u>download</u></a> the trial version of Luna or read the <a href="https://docs.elotl.co/luna/intro/"><u>documentation</u></a>.<br />We dedicated extensive effort to making Luna a robust cluster autoscaler, ensuring that every dollar brings optimal value. Luna is designed to enhance the efficiency of your Kubernetes workloads and streamline scaling operations across multiple cloud environments. We encourage you to explore Luna, especially for clusters handling substantial, dynamic, or bursty workloads.<br></div>]]></content:encoded></item></channel></rss>