Experiences using Luna Smart Autoscaling of Public Cloud Kubernetes Clusters for Offline Inference using GPUs
Offline inference is well suited to taking advantage of spot GPU capacity in public clouds. However, obtaining spot and on-demand GPU instances can be frustrating, time-consuming, and costly. The Luna smart cluster autoscaler scales cloud Kubernetes (K8s) clusters with the least-expensive available spot and on-demand instances, subject to constraints that can include GPU SKU and count as well as maximum estimated hourly cost. In this blog, we share recent experiences with offline inference on GKE, AKS, and EKS clusters using Luna. Luna efficiently handled the toil of finding the lowest-priced available spot GPU instances, reducing estimated hourly costs by 38-50% versus an on-demand baseline and turning an often tedious task into bargain-jolt fun.
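To make the constraint mechanism concrete, here is a minimal sketch of a batch inference Job that requests a GPU and expresses per-pod placement constraints via annotations. The annotation keys, the cost cap value, and the image name are illustrative placeholders rather than the exact Luna interface; consult the Luna documentation for the actual annotation names.

```yaml
# Hypothetical batch inference Job. The elotl.co annotation keys below
# are illustrative placeholders, not confirmed Luna API names.
apiVersion: batch/v1
kind: Job
metadata:
  name: offline-inference
spec:
  template:
    metadata:
      annotations:
        # Constrain node selection to a specific GPU SKU (placeholder key).
        node.elotl.co/instance-gpu-skus: "T4"
        # Cap the estimated hourly price of the chosen node (placeholder key).
        node.elotl.co/instance-max-hourly-cost: "1.50"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-registry/offline-inference:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # one GPU per worker pod
```

Because offline inference tolerates interruption, a Job like this can run on spot-backed nodes and simply restart reclaimed pods, which is what makes the spot savings described above practical.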
Introduction
Applications such as query/response chatbots are handled via online serving, in which each input prompt is provided in real time to the model running on one or more GPU workers. Automatic instance allocation for online serving presents efficiency challenges. Real-time response is sensitive to scaling latency during usage spikes and can be disrupted by spot reclamation and replacement. Also, peak online-serving usage often overlaps with peak cloud resource usage, reducing the capacity available for GPU instances. We've previously discussed aspects of using the Luna smart cluster autoscaler to automatically allocate instances for online serving, e.g., scaling Helix to handle ML load and reducing deploy time for new ML workers.