Using NVIDIA GPU Time-slicing in Cloud Kubernetes Clusters with the Luna Smart Cluster Autoscaler
6/25/2024
Introduction
Kubernetes (K8s) workloads are given exclusive access to their allocated GPUs by default. With NVIDIA GPU time-slicing, GPUs can be shared among K8s workloads by interleaving their GPU use. For cloud K8s clusters running non-demanding GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. Note that NVIDIA GPU time-slicing is intended for non-production test/dev workloads, as it does not enforce memory and fault isolation.
Using NVIDIA GPU time-slicing in a cloud Kubernetes cluster with a cluster autoscaler (CA) that is aware of the time-slicing configuration can significantly reduce costs. A time-slice-aware "smart" CA prevents initial over-allocation of instances, optimizes instance selection, and reduces the risk of exceeding quotas and capacity limits. Also, on GKE, where GPU time-slicing is expected to be configured at the control plane level, a smart CA facilitates using time-slicing on GPU resources that are dynamically allocated.
In this blog, we describe how to use NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters with Luna, a smart CA that supports GPU time-slicing. We provide examples demonstrating the advantages of using Luna with NVIDIA GPU time-slicing.
Configuring NVIDIA GPU Time-slicing on Cloud K8s
Luna is a smart CA that provides the option nvidiaGPUTimeSlices to indicate the number of time slices configured for the NVIDIA GPUs in the K8s cluster. When the option is set to a value N greater than 1, Luna treats each GPU in a cloud instance as N GPUs with respect to resource allocation and scheduling. Luna supports AKS, EKS, OKE, and GKE cloud K8s clusters.
On AKS, EKS, and OKE, NVIDIA GPU time-slicing is configured so that it is transparent to the cluster control plane and to GPU workloads running on the cluster. Appendix A describes how NVIDIA GPU time-slicing can be enabled for all GPUs in the cluster via helm deployment of the nvidia-device-plugin, with an associated configmap specifying the number of slices. GPU workloads specify their desired GPU count as usual via the nvidia.com/gpu resource limit and are allocated a GPU slice for each GPU they request.

On GKE, NVIDIA GPU time-slicing is visible to the cluster control plane. Time-slicing is specified at the node pool level, with the GPU slice count set as clients-per-gpu; Luna handles this node pool setting when nvidiaGPUTimeSlices is greater than 1. On GKE, time-slicing is also visible to GPU workloads themselves: GPU workloads running on GKE time-sliced GPUs must include nodeSelectors indicating that the workload can use time-shared GPUs and specifying the maximum clients-per-gpu value allowed, and such workloads are limited to an nvidia.com/gpu resource limit of 1.
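For concreteness, here is a minimal, hedged sketch of the pod-spec fragment such a GKE workload might carry, assuming GKE's documented GPU time-sharing node labels (gke-gpu-sharing-strategy and gke-max-shared-clients-per-gpu); the container name and image are placeholders, and the exact label keys should be verified against current GKE documentation.

```yaml
# Sketch: pod spec fragment for a workload that can run on GKE time-shared GPUs.
# Label keys assume GKE's GPU time-sharing node labels; confirm against GKE docs.
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
  - name: gpu-task                                  # hypothetical container name
    image: nvidia/cuda:12.2.0-base-ubuntu22.04      # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1    # GKE time-shared workloads are limited to 1
```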
Luna Benefits for GPU Time-Slicing
We’ve mentioned that running the Luna smart CA, configured to be aware of the GPU time-slices setting, reduces expenses as well as quota and capacity limit risks, by avoiding initial over-allocation of instances and by optimizing instance choice. Let’s look at these two areas.
Luna Avoiding Instance Over-allocation for GPU Time-Slicing
With respect to initial over-allocation of instances, a CA that is not aware of the GPU time-slices setting of N will initially allocate up to N times as many nodes as needed. For example, to place four 1-GPU workloads when time-slices=2, a CA that doesn't know about the setting could allocate two 2-GPU nodes, when a single 2-GPU node can provide 4 slices. Note that this initial over-allocation may unnecessarily hit instance quota or capacity limits. If the CA can subsequently consolidate the workloads and scale in the over-allocated node(s), the expense associated with this issue can be limited.
Luna Optimizing Instance Choice for GPU Time-Slicing
With respect to optimizing instance choice, we observe that for many clouds, the cost of GPU instances increases non-linearly with the instance’s GPU count. For example, in the AWS us-west region using Luna’s current price list, a g4dn.xlarge with 1 T4 GPU is $0.526/hr, while a g4dn.12xlarge with 4 T4 GPUs is $3.912/hr; the latter is ~7.4x the cost for only 4x the T4 GPUs. Hence, choosing the instance GPU count in light of the time-slices setting can yield significant ongoing savings, since instances with fewer GPUs can be selected. And in our experience, instances with fewer GPUs tend to have higher quotas and more available cloud capacity.
The benefit of optimizing instance choice can be substantial. In the next section, we present EKS, AKS, and OKE examples to illustrate. And we include a GKE example to show how a smart CA facilitates use of control-plane-aware NVIDIA GPU time-slicing.
Examples: Luna Optimizing Instance Choice for GPU Time-Slicing
For our examples, we set NVIDIA GPU time-slices to 2. We consider small 1-GPU workloads that can run together on a single NVIDIA GPU node with time-slices=2. We configure Luna to create bin-packing nodes with 2 GPUs (by setting the Luna option binPackingNodeGPU=2), and to bin-pack 2 1-GPU workloads onto the same node (by setting binSelectPodGPUThreshold=2).
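As a hedged sketch, the Luna settings used for these examples might look like the following in YAML form. The option names come from the discussion above, but where these keys live (for example, in Luna's Helm values) and their exact layout may differ by deployment, so treat this as illustrative rather than as Luna's definitive configuration format.

```yaml
# Illustrative layout only; consult the Luna documentation for the exact
# location and format of these options in your deployment.
nvidiaGPUTimeSlices: 2        # each NVIDIA GPU in the cluster provides 2 time slices
binPackingNodeGPU: 2          # create bin-packing nodes with 2 GPUs
binSelectPodGPUThreshold: 2   # bin-pack 2 1-GPU workloads onto the same node
```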
For each of the 4 clouds supported by Luna, we consider the example of launching 2 small 1-GPU workloads. We examine the benefits of setting Luna’s nvidiaGPUTimeSlices option to 2.
EKS
For our example of deploying 2 small 1-GPU workloads in an EKS cluster with Luna, we use the deployment spec in Appendix B.1. The EKS cluster is configured with GPU time-slices set to 2. It is located in the us-east region and the prices we give are from Luna’s current price list.
When Luna is run without knowledge of the GPU time-slice setting (i.e., nvidiaGPUTimeSlices is set to the default of 1), it allocates a g3.8xlarge instance, which at $2.28/hr is the lowest-price 2-GPU instance that meets the desired resource requirements for bin-packing. However, g3* instances have M60 GPUs, which were designed for graphics-intensive workloads and are not well-suited to ML tasks. Setting binPackingNodeTypeRegexp to ^([^g]|g($|[^3])).*$ to avoid g3s, Luna allocates a g4dn.12xlarge with 4 T4s, which at $3.912/hr is the next-lowest-price multi-GPU instance. [We note that the default EBS size is insufficient for g4dn.12xlarge instances, and the Luna option aws.blockDeviceMappings needs to be set to allocate a larger EBS volume.]

Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. When Luna is run with nvidiaGPUTimeSlices=2, it allocates a g4dn.xlarge, which is AWS’ least expensive 1-GPU instance type. At $0.526/hr, it is much cheaper than the previous two alternatives, with respect to both instance and per-slice price. This data is summarized in Table 1; the EKS-specific Luna options used here are sketched after the table.
Table 1: EKS w/NVIDIA GPU time-slices=2, Luna option nvidiaGPUTimeSlices set to 1 vs 2
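For reference, here is a hedged sketch of the EKS-specific Luna options mentioned above. The regular expression is the one quoted in the text, and the layout is again illustrative; the value format for aws.blockDeviceMappings is omitted, so consult the Luna documentation for how to set it.

```yaml
# Illustrative layout only; the regexp excludes g3* (M60) instance types.
binPackingNodeTypeRegexp: "^([^g]|g($|[^3])).*$"
# aws.blockDeviceMappings must also be set so that g4dn.12xlarge nodes are
# allocated a larger EBS volume; its value format is not shown here.
```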
AKS
For our example of deploying 2 small 1-GPU workloads in an AKS cluster with Luna, we use the deployment spec in Appendix B.2. The AKS cluster is configured with GPU time-slices set to 2. It is located in the East US region and the prices we give were recently fetched by Luna.
When Luna is run without knowledge of the GPU time-slice setting (i.e., nvidiaGPUTimeSlices is set to the default of 1), it allocates a Standard_NC64as_T4_v3 instance with 4 T4 GPUs, which at $4.352/hr is the lowest-price multi-GPU instance that meets the desired resource requirements for bin-packing. Given that NVIDIA GPU time-slices is 2, a 1-GPU instance can instead be used. When Luna is run with nvidiaGPUTimeSlices set to 2, it allocates a Standard_NC4as_T4_v3, which at $0.526/hr is much cheaper than the Standard_NC64as_T4_v3, in terms of both instance and per-slice price. This data is summarized in Table 2.
Table 2: AKS w/NVIDIA GPU time-slices=2, Luna option nvidiaGPUTimeSlices set to 1 vs 2
OKE
For our example of deploying 2 small 1-GPU workloads in an OKE cluster with Luna, we use the deployment spec in Appendix B.3. The OKE cluster is configured with GPU time-slices set to 2. It is located in the US East region and the prices we give are from Luna’s current price list.
When Luna is run without knowledge of the GPU time-slice setting, it fails to allocate any instance, because our account currently has no quota to run multi-GPU instances (and a quota increase request has been outstanding for an extended period). When Luna is run with nvidiaGPUTimeSlices set to 2, it allocates a VM.GPU2.1, which is $1.275/hr. In this case, the quota issue prevented the scenario from running at all without Luna being configured to respect the time-slices setting. This data is summarized in Table 3.
Table 3: OKE w/NVIDIA GPU time-slices=2, Luna option nvidiaGPUTimeSlices set to 2
GKE
For our example of deploying 2 small 1-GPU workloads in a GKE cluster with Luna, we use the deployment spec in Appendix B.4. The GKE cluster is configured with GPU time-slices set to 2. It is located in the us-central1 region and the prices we give are from Luna’s current price list.
On GKE, NVIDIA GPU time-slicing cannot be used on dynamically allocated nodes without setting Luna’s nvidiaGPUTimeSlices option accordingly, since Luna must configure time-slicing on the node pools it creates. When Luna is run with nvidiaGPUTimeSlices set to 2, it allocates an n1-standard-4 node with 1 T4 GPU, which is $0.540/hr. In this case, Luna is what enables NVIDIA GPU time-slicing on dynamically allocated nodes. This data is summarized in Table 4.
Table 4: GKE w/NVIDIA GPU time-slices=2, Luna option nvidiaGPUTimeSlices set to 2
Conclusion
For cloud K8s clusters running non-demanding, non-production GPU workloads, configuring NVIDIA GPU time-slicing can significantly reduce GPU costs. In this blog, we’ve explained how to set up NVIDIA GPU time-slicing in AKS, EKS, OKE, and GKE cloud K8s clusters. We’ve discussed the benefits of using the Luna smart CA with the time-slices setting, which include avoiding initial over-allocation of instances and optimizing instance choice. With respect to optimizing instance choice, we found that Luna’s instance choice roughly halved the price per GPU slice on EKS and AKS. On OKE, we showed that Luna’s instance choice avoided hitting our current quota limits. And on GKE, we demonstrated how Luna facilitated the interoperation of CA dynamic node allocation with NVIDIA GPU time-slicing.
Want to see how effortlessly you can manage GPU time-slicing with Luna? Try Luna today with our free trial and experience the enhanced efficiency and flexibility it brings to your cloud environments.
Future Work
GPU time-slicing is supported across NVIDIA GPU models, and provides flexible sharing levels. However, the technique does not enforce memory and fault isolation and targets non-production workloads. Recent NVIDIA GPUs support MIG (Multi-Instance GPU) sharing, which partitions each GPU into smaller, predefined instances, with memory and fault isolation enforced by the hardware. Luna support for NVIDIA MIG in Cloud K8s clusters is an area for future work, depending on customer interest in MIG allocation for their workloads.
Appendix A: Configuring NVIDIA GPU time-slicing in a K8s cluster
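As a hedged illustration of the approach Appendix A describes, the sketch below shows a ConfigMap carrying a time-slicing configuration that the nvidia-device-plugin Helm deployment can be pointed at (for example, via the chart's config values). The ConfigMap name, namespace, and config key are hypothetical; the sharing/timeSlicing schema and the replicas field, which sets the number of slices per GPU, follow the NVIDIA device plugin's documented configuration format.

```yaml
# Sketch: ConfigMap holding an nvidia-device-plugin time-slicing config.
# The name, namespace, and config key are hypothetical; adjust to your deployment.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: nvidia-device-plugin
data:
  time-slicing-2: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2          # number of time slices per physical GPU
```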
Appendix B: Deployment of 2 pods, each requesting 1 GPU
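As a hedged illustration of the kind of spec used in B.1 through B.3 (the actual specs are cluster-specific, and the names and image below are hypothetical), the sketch deploys 2 pods that each request 1 GPU via the nvidia.com/gpu limit. The GKE spec in B.4 would additionally carry the time-sharing nodeSelectors sketched earlier in this post.

```yaml
# Sketch: deployment of 2 pods, each requesting 1 GPU (names/image hypothetical).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-sleep
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-sleep
  template:
    metadata:
      labels:
        app: gpu-sleep
    spec:
      containers:
      - name: gpu-sleep
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["sleep", "infinity"]
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU, i.e., one slice when time-slicing is enabled
```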
B.1 EKS
B.2 AKS
B.3 OKE
B.4 GKE
Author: Anne Holler (Chief Scientist, Elotl)