Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs
Offline inference is well-suited to take advantage of spot GPU capacity in public clouds. However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly. The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, in accordance with constraints that can include GPU SKU and count as well as maximum estimated hourly cost. In this blog, we share recent experiences with offline inference on GKE, AKS, and EKS clusters using Luna. Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, reducing estimated hourly costs by 38-50% versus an on-demand baseline and turning an often tedious task into bargain-hunting fun.
Introduction
Applications such as query/response chatbots are handled via online serving, in which each input and prompt is provided in real time to the model running on one or more GPU workers. Automatic instance allocation for online serving presents efficiency challenges. Real-time response is sensitive to scaling latency during usage spikes and can be impacted by spot instance reclamation and replacement. Also, peak online serving usage often overlaps with peak cloud resource usage, reducing the available capacity for GPU instances. We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., scaling Helix to handle ML load and reducing deploy time for new ML workers.
This blog focuses on offline inference, which avoids the challenges with the real-time burstiness of online serving. Applications such as text summarization, content generation, and financial forecasting employ offline inference, in which input and prompt pairs are sent to the model as a batch job, with the output being stored for subsequent use. Automatic instance allocation for offline inferencing can achieve greater resource efficiency than that for online serving. Offline prediction jobs are generally tolerant of scaling latency and spot instance reclamation and replacement, can be run off-peak, and are often configured with a fixed-size set of instances to handle the input load, which is typically known in advance.
We present experiences using Luna to allocate spot and on-demand GPU instances on GKE, AKS, and EKS cloud K8s clusters for offline inference. We share observations on resource efficiency in terms of GPU instance costs, and on instance availability and allocation search. The results show the cost savings from utilizing spot pricing and instance choice flexibility, and the value of using Luna to efficiently manage instance allocation in compliance with constraints and guardrails. While the results represent a small sample size, and your mileage may vary, we hope they demonstrate strategies you will find beneficial for your offline inference jobs.
Example Offline Inference Workload
For offline inferencing, we chose to use the Ray AI platform, with the KubeRay operator to deploy a RayJob on K8s. We adapted this simple batch inference example, which runs an image classification inference job on a single-node Ray cluster. The single-node Ray cluster comprises a GPU-enabled head that also serves as the worker; in the example, it runs on an on-demand instance with 4 Nvidia T4 GPUs. This basic setup was adequate for exercising GPU instance allocation and measuring instance cost across a set of cloud vendors. We updated our version of the example to indicate that Luna should handle allocating the instances for the Ray cluster head and for the pod that submits the Ray job to the Ray cluster. We also added the shutdownAfterJobFinishes option to have the Ray cluster automatically deleted after the RayJob completes, to avoid consuming resources once the Ray cluster becomes idle.
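To make the setup concrete, here is a minimal RayJob sketch with the shutdownAfterJobFinishes option set; the name, entrypoint, image, and GPU count shown are illustrative placeholders rather than our exact configuration.
```yaml
# Minimal RayJob sketch (KubeRay API); the names and values below are placeholders.
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: image-classification-batch
spec:
  entrypoint: python /home/ray/samples/batch_inference.py  # placeholder entrypoint
  shutdownAfterJobFinishes: true  # delete the Ray cluster once the RayJob completes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:latest  # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 4  # GPU-enabled head that also serves as the worker
```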
We changed several aspects of the example around GPU SKU choice, GPU count, and pricing category to make obtaining GPU cloud capacity easier and less costly, as described below. These aspects may be worth considering for your own workloads.
Flexible GPU SKU choice. By default, Luna chooses the least expensive instance that meets a pending pod's resource requirements. Since the GPU-enabled Ray head in the Ray example was run on an instance with Nvidia T4 GPUs, we initially specified that Luna use that SKU in our experiments. However, we found that the T4 SKU could be in short supply. We therefore added a Luna annotation to the Ray head configuration indicating that Luna could choose a node with any GPU SKU in a list specified by the env variable RAY_CLUSTER_GPU_SKUS, which we populated with SKUs chosen as described below. Giving Luna the option to choose among several GPU SKUs made it easier to obtain spot GPU capacity in a timely manner.
Flexible GPU count. In the Ray example, the GPU-enabled Ray head was run on an instance with 4 T4 GPUs. However, we found that 4-GPU instances had lower availability and higher cost relative to T4 instances with fewer GPUs, and that the example ran fine with fewer T4s. We replaced the constant 4 with the env variable RAY_CLUSTER_GPU_COUNT so we could reduce this value, and added the RAY_CLUSTER_CPU_COUNT and RAY_CLUSTER_MEMORY_SIZE env variables so we could scale down the CPU and memory requests accordingly.
Flexible pricing category for the Ray head and Ray job submitter. In the Ray example, the workloads were run on pre-allocated on-demand instances. We updated the job configs to let the user specify the price categories from which Luna should request an instance via the env variable BATCH_JOB_PRICE_CATEGORIES. This option can be set to "on-demand" or "spot" to indicate that Luna should use only that pricing category, or to "spot,on-demand" to have Luna choose the instance with the lowest estimated price drawn from either category. We also added pod annotations to place guardrails on instance cost, to avoid very expensive instances, and on GPU count, to reduce the instance selection search space.
Here is the updated version of the RayJob configuration. To deploy the RayJob with a specific configuration, we set the corresponding env variables and applied the resulting manifest; a simplified sketch of the parameterized Ray head spec follows.
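Below is a sketch of the parameterized pod template for the Ray cluster head (a fragment of the RayJob's rayClusterSpec.headGroupSpec). The ${...} placeholders are filled in from the env variables described above; the annotation keys and the image are illustrative placeholders, not Luna's actual annotation names, which are documented by Elotl.
```yaml
# Fragment of rayClusterSpec.headGroupSpec, parameterized via env variables.
# The annotation keys below are placeholders; see the Luna documentation for
# the actual annotation names for GPU SKU lists, price categories, and guardrails.
template:
  metadata:
    annotations:
      luna/gpu-skus: "${RAY_CLUSTER_GPU_SKUS}"                # e.g. "nvidia-tesla-t4,nvidia-tesla-p4"
      luna/price-categories: "${BATCH_JOB_PRICE_CATEGORIES}"  # "spot", "on-demand", or "spot,on-demand"
      luna/max-node-price: "1.50"                             # guardrail on estimated hourly cost (illustrative value)
      luna/max-gpu-count: "${RAY_CLUSTER_GPU_COUNT}"          # guardrail to reduce the instance search space
  spec:
    containers:
    - name: ray-head
      image: rayproject/ray-ml:latest  # placeholder image
      resources:
        requests:
          cpu: "${RAY_CLUSTER_CPU_COUNT}"
          memory: "${RAY_CLUSTER_MEMORY_SIZE}Gi"
          nvidia.com/gpu: "${RAY_CLUSTER_GPU_COUNT}"
        limits:
          nvidia.com/gpu: "${RAY_CLUSTER_GPU_COUNT}"
```
To deploy a specific configuration, one straightforward approach is to export the env variables and pipe the manifest through envsubst into kubectl apply.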
Luna Operation on Offline Inference Workload
Each offline inference workload run was performed on a cloud K8s cluster running Luna 1.2.16. For the workload's pending pods and their constraints, Luna generates a list of candidate instance type and price category combinations and sorts them by estimated hourly cost. Luna estimates spot hourly cost as a configurable ratio, spotPriceRatioEstimate, of the on-demand hourly cost; the default value is 0.5, which is a conservative estimate on GKE, AKS, and EKS. Luna then selects the candidate with the lowest estimated cost and sends a request to the cloud vendor to allocate it. When the requested instance type in the specified price category is readily available, the cloud vendor completes the allocation within Luna's default scaleUpTimeout of 10 minutes.
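For reference, the two settings named above might appear in a Luna configuration roughly as follows; the option names come from Luna itself, but where they live in your Helm values or config file depends on your installation, so treat this as a sketch.
```yaml
# Sketch of the Luna settings discussed above; the exact nesting in your Luna
# Helm values or configuration may differ.
spotPriceRatioEstimate: 0.5  # estimate spot hourly cost as 0.5 x on-demand hourly cost
scaleUpTimeout: 10m          # give up on an allocation request that runs longer than 10 minutes
```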
When a requested instance type and category combination is not currently available, Luna generates a new request as follows. If the request fails with the cloud reporting insufficient capacity, Luna avoids the associated combination for a configurable back-off time and generates a new allocation request for the candidate with the next lowest estimated cost. If the cloud vendor keeps the request running for longer than scaleUpTimeout, Luna discontinues that request and, as in the failure case, avoids the associated combination for a configurable back-off time and requests the candidate with the next lowest estimated cost. We've found Luna's strategy of discontinuing long-running allocation requests to be efficient: on GKE we've often seen such requests persist for 40 minutes or more and then fail, whereas discontinuing them allows Luna to retry with an alternative candidate that is typically allocated sooner.
GKE Offline Inference Allocation Results
The GKE runs were executed on a standard GKE regional cluster running K8s 1.32 in the us-central1 region. This region offers a wide selection of GPU-enabled instance types, and GKE regional clusters offer greater instance availability than zonal clusters. We ran the workload during US daytime hours, likely a peak usage period for the region. Our goal was to capture data that reflects conditions when spot and on-demand GPU capacity might be limited, providing a conservative estimate of the spot benefit compared to what would be seen for off-peak runs.
For the RayJob submitter pod configuration, which specifies instance-offerings (i.e., price categories) but no resource requests, Luna chose an e2-medium instance. This instance type has a low on-demand price ($0.0553/hr), and no issues were found obtaining spot capacity for it.
The main cost and capacity challenges were in allocating a node to host the GPU-enabled Ray cluster head. Results are given in Table 1. The first row represents the on-demand baseline for comparison with spot allocation. We initially attempted to have Luna allocate an on-demand node matching the node used in the Ray example, i.e., an instance providing 4 T4 GPUs, 54 CPUs, and 54 GB memory, with no constraints on maximum GPUs or cost. However, Luna was not able to obtain an instance for that config after a round of trying all 5 candidate instance types with its default 10m scaleUpTimeout for each. Seeing that Luna had tried all candidates, we canceled the RayJob; while Luna would have continued trying to get a matching instance, and presumably would eventually have succeeded, we considered the latency to obtain this instance type too high for our use case. We then tried a scaled-down config, with 2 T4 GPUs (keeping the Ray example's GPU SKU), RAY_CLUSTER_CPU_COUNT set to 27 CPUs, and RAY_CLUSTER_MEMORY_SIZE set to 27 GB, and Luna successfully obtained an instance, which we used as our baseline.
We next had Luna try to allocate a spot node, using the baseline resource config with spot added to the price categories. We also added more GPU SKUs to RAY_CLUSTER_GPU_SKUS to give Luna more options for finding spot nodes, and, since the additional SKUs were more costly, we added a maximum node cost. After Luna tried two spot T4 instance types whose long-running scaling operations hit Luna's 10m scaleUpTimeout and were discontinued, it obtained a 2-GPU P4 spot instance, which was 38% cheaper than the on-demand 2-GPU T4 instance. Thanks to Luna's strategy of retrying with an alternative candidate when scale-up time exceeds scaleUpTimeout, the spot instance was found in around 20 minutes, rather than spending around 40 minutes trying, and likely failing, to allocate the first candidate T4 spot instance.
Table 1: Luna GKE node allocation for RayJob GPU-enabled head with specified constraints
AKS Offline Inference Allocation Results
The AKS runs were executed on an AKS cluster running K8s 1.31 in the eastus region. As with GKE, the workload was run during US daytime hours, for a conservative estimate of the spot benefits. Note that to use spot, tolerations needed to be added to the Ray head and Ray job submitter pod specs, as sketched below.
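The toleration below matches the taint that AKS applies to spot node pool nodes; this is a sketch assuming the standard AKS spot taint key and effect, added to both the Ray head and the Ray job submitter pod specs.
```yaml
# Toleration for the taint AKS places on spot node pool nodes.
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"
```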
For the RayJob submitter pod configuration, Luna allocated a Standard_B2als_v2 instance. This instance type has a low on-demand price ($0.0376/hr), and spot capacity was available for it. The results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 2. Luna was able to allocate an on-demand 4-GPU T4 node corresponding to the node used in the Ray example run, shown in row 1. However, there were challenges allocating a spot node for comparison. Luna was not able to allocate a spot node for the original config due to insufficient capacity. Also, Azure offers few instance types with 2 GPUs and has no 2-GPU T4 instances. Hence, for spot allocation, the Ray head was scaled down to a config of 1 GPU with 14 CPUs and 14 GB memory. As with GKE spot allocation, more GPU SKU choices were added to RAY_CLUSTER_GPU_SKUS, along with a maximum node cost. With this config, Luna obtained a spot instance with 1 T4 GPU at a cost of $0.60/hr. To compare this 1-GPU spot cost to the 4-GPU on-demand baseline cost of $4.35/hr, we normalized the baseline to a per-GPU cost of $4.35/4 ≈ $1.09/hr; the $0.60/hr spot cost was about 45% lower.
Table 2: Luna AKS node allocation for RayJob GPU-enabled head with specified constraints
EKS Offline Inference Allocation Results
The EKS runs were executed on an EKS cluster running K8s 1.32 in the us-west-2 region. As was the case for GKE and AKS, the workload was run during US daytime hours, with the intent of yielding a conservative estimate of the spot benefits.
For the RayJob submitter pod configuration, Luna allocated a t3a.small instance. This instance type has a low on-demand price ($0.0188/hr), and there were no issues obtaining spot capacity for it. Results for allocating a node to host the GPU-enabled Ray cluster head are given in Table 3. Luna was able to allocate an on-demand node with 4 T4 GPUs as in the Ray example; the result is shown in row 1. Note that RAY_CLUSTER_CPU_COUNT was dropped to 44 and RAY_CLUSTER_MEMORY_SIZE to 44 GB, given that AWS does not have any 4-GPU T4 instances with enough CPUs to handle the original request of 54 CPUs. Row 2 shows the results of adding spot to the input price categories; Luna was able to allocate a spot version of the same instance type.
Table 3: Luna EKS node allocation for RayJob GPU-enabled head with specified constraints
Conclusion
Offline prediction jobs are typically not sensitive to node allocation latency or to the impact of spot reclamation and replacement, and hence are ideal candidates for spot nodes. We've presented the results of using the Luna smart cluster autoscaler to allocate spot and on-demand instances on GKE, AKS, and EKS clusters for an example offline prediction job. We've shown conservative estimated hourly cost savings of 38-50% using spot, achieved in an easy (and hence fun!) way with Luna's efficient approach to instance allocation search.
We invite you to have fun with Luna! Download the free trial version of Luna or reach out to us at [email protected] if you would like to try Luna for your batch inference (or any other) workloads!
Author: Anne Holler (Chief Scientist, Elotl)