Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler6/25/2024
Introduction
Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default. With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use. For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.
Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration can significantly reduce costs. A time-slice aware “smart” CA prevents initial over-allocation of instances and optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated. |