OVERVIEW
26 minutes! 26 long minutes: that's how long we waited, in one example case, for our chatbot to become operational. Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took anywhere from 13 to 26 minutes, which hurt both our agility and our happiness! Spinning up the service does involve a lot of work: provisioning the GPU node, pulling the large container image, and downloading the LLM weight files for our model. But we hoped a few simple changes could speed it up, and they did. In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes with deployment times that won't bum you out.
We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler. Our example LLM serving workload is a KubeRay RayService using vLLM to serve an open-source model downloaded from HuggingFace. We measured deploy-time improvements of up to 60%.
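For concreteness, here is a minimal sketch of what such a workload can look like. This manifest is illustrative, not the exact one we measured: the service name, container image tag, Serve `import_path` module, and HuggingFace model id are all placeholder assumptions you would swap for your own.

```yaml
# Minimal RayService sketch: a KubeRay-managed Ray cluster running a
# Ray Serve application that loads a HuggingFace model with vLLM.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llm-service              # placeholder name
spec:
  serveConfigV2: |
    applications:
      - name: llm
        import_path: serve_app:app          # hypothetical module exposing a Serve app
        runtime_env:
          env_vars:
            MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder HF model id
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray-ml:2.9.0   # large image; pull time dominates startup
    workerGroupSpecs:
      - groupName: gpu-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray-ml:2.9.0
                resources:
                  limits:
                    nvidia.com/gpu: "1"   # GPU request triggers node provisioning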
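```

Note how this one manifest surfaces all three startup costs we set out to shrink: the GPU resource request makes the autoscaler provision a fresh GPU node, the multi-gigabyte `ray-ml` image must be pulled onto it, and the model weights named in `MODEL_ID` must be downloaded before the service can answer its first request.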