Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler
6/25/2024
Introduction
Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default. With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use. For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory or fault isolation between the workloads sharing a GPU.
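As a concrete illustration, the sketch below shows one common way time-slicing is enabled when the NVIDIA GPU Operator manages the device plugin: a ConfigMap declares how many time-sliced replicas each physical GPU should advertise, and the GPU Operator's ClusterPolicy is then pointed at that ConfigMap. The name time-slicing-config and the replica count of 4 are illustrative assumptions, not values taken from this post.

```yaml
# Illustrative time-slicing config for the NVIDIA GPU Operator's device plugin.
# Each physical GPU is advertised as 4 schedulable nvidia.com/gpu replicas.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # number of workloads sharing each GPU
```

The ClusterPolicy can then be patched to reference this ConfigMap, after which a node with a single GPU reports nvidia.com/gpu: 4 in its allocatable resources:

```bash
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```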
Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster together with a cluster autoscaler (CA) that is aware of the time-slicing configuration reduces costs further. A time-slice-aware "smart" CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. In addition, on GKE, where GPU time-slicing is expected to be configured at the control-plane level, a smart CA makes it practical to use time-slicing on dynamically allocated GPU resources (an example node pool configuration is sketched below).
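For instance, on GKE the time-sharing configuration is supplied when the GPU node pool is created rather than through a device plugin ConfigMap; the node pool, cluster, and zone names below are hypothetical placeholders:

```bash
# Illustrative GKE node pool with GPU time-sharing enabled:
# each T4 GPU can be shared by up to 2 pods requesting nvidia.com/gpu.
gcloud container node-pools create gpu-timeshare-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1,gpu-sharing-strategy=time-sharing,max-shared-clients-per-gpu=2
```

Because this configuration lives in the node pool definition, a CA that provisions GPU node pools on demand needs to know the desired sharing strategy up front, which is where a time-slice-aware CA helps.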
At Elotl we develop Luna, an intelligent cluster autoscaler for Kubernetes. Luna is deployed on customers' clusters and scales compute resources up and down to optimize cost.