OVERVIEW
26 minutes! 26 long minutes was our wait time in one example case for our chatbot to be operational. Our LLM Kubernetes service runs in the cloud, and we found that deploying it from start to finish took between 13 and 26 minutes, which negatively impacted our agility and our happiness! Spinning up the service does involve a lot of work: creating the GPU node, pulling the large container image, and downloading the files containing the LLM weights to run our model. But we hoped we could make some simple changes to speed it up, and we did. In this post you will learn how to do just-in-time provisioning of an LLM service in cloud Kubernetes at deployment times that won't bum you out.
We share our experience with straightforward, low-cost, off-the-shelf methods to reduce container image fetch and model download times on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler. Our example LLM serving workload is a KubeRay RayService using vLLM to serve an open-source model downloaded from HuggingFace. We measured deploy-time improvements of up to 60%.
APPROACH
We observed that deploying LLM serving workloads on autoscaled cloud Kubernetes clusters can take between 13 and 26 minutes. Key components of this time are adding a GPU node to the cluster to host the LLM serving worker pod, fetching the container image for that pod from a container registry, and downloading the LLM weights for model serving by that pod. There are a number of approaches to reducing LLM deploy time, with various cost and complexity trade-offs.
One approach to reducing node scale-up time is to use node over-provisioning via low-priority pod deployment to keep extra node(s) available for scale-up, and to have a daemonset pre-pull the container image(s) of interest into the image cache on the extra node(s). We utilized this approach in our previous work described in this Elotl blog, and Scale describes using this kind of approach in this Scale blog. A downside of this approach is the cost overhead of the extra idle node(s). Our previous work involved serving ML models that could run on CPU-only nodes, where the cost overhead was relatively low; our current work involves serving LLMs that require more expensive GPU nodes, so the cost overhead was higher than we wanted. Hence, we focused on allocating GPU nodes on demand and on techniques to quickly populate new nodes with the image of interest.

To quickly populate an image on new nodes, we first explored using Dragonfly pre-seeding with peer-to-peer distribution, but we did not get the performance results we expected despite a number of tuning attempts, and we were also deterred by its usage complexity. We then looked at using cloud-vendor solutions to preload or cache/stream the images, and found that they gave good results out-of-the-box and were well-supported by the Luna smart cluster autoscaler. A drawback with this approach is the need for cloud-specific setup, but since each cloud's setup is fairly simple and reasonably well-documented, this was not a deal-breaker for us. And we’re including setup detail links in this blog, so hopefully it will be even easier for you, blog reader!

With respect to reducing the time to download the model weights, we wanted to utilize HuggingFace's optimizations in this area before looking at the ROI of pursuing further improvement on our side. We found that downloading with HF_HUB_ENABLE_HF_TRANSFER enabled gave a modest additional improvement in startup time relative to that given by the image load improvements. We have not yet looked at techniques such as pre-downloading the weights to shared fast storage with corresponding retargeting of the model loading path. We note that our model of interest is stored using the safetensors representation.
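As a concrete illustration of the model download fast path, here is a minimal sketch using the huggingface_hub library with hf_transfer installed; in our actual deployment the download happens inside vLLM at model load time, so this standalone script is illustrative only.

```python
# Minimal sketch: downloading model weights with HF_HUB_ENABLE_HF_TRANSFER.
# In our deployment, vLLM triggers this download at model load time; this
# standalone script just illustrates the mechanism. Assumes
# "pip install huggingface_hub hf_transfer" has been done.
import os
import time

# Set before importing huggingface_hub so the fast transfer path is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

start = time.time()
local_dir = snapshot_download("microsoft/Phi-3-mini-4k-instruct")
print(f"Downloaded model files to {local_dir} in {time.time() - start:.0f}s")
```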
PER-CLOUD IMPROVEMENTS
In this section, we present our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters running the Luna smart cluster autoscaler. Our example LLM serving workload is a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace. Our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts.
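To make the workload concrete, here is a rough sketch of the kind of Ray Serve application such a RayService deploys; the class name, request format, and sampling parameters are our own illustrative choices, not the exact code in our config.

```python
# Rough sketch (not our exact serve config) of a Ray Serve app wrapping vLLM.
# The GPU Ray worker hosting this replica is the node whose creation, image
# pull, and model download dominate the deploy times discussed below.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class LLMServer:
    def __init__(self):
        # vLLM downloads the HuggingFace model weights (safetensors) here.
        self.llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")

    async def __call__(self, request) -> str:
        body = await request.json()
        outputs = self.llm.generate([body["prompt"]], SamplingParams(max_tokens=128))
        return outputs[0].outputs[0].text


# KubeRay's RayService serveConfigV2 imports and deploys an app like this one.
llm_app = LLMServer.bind()
```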
We collected baseline and improved deployment times for a KubeRay RayService using vLLM to serve the open-source model microsoft/Phi-3-mini-4k-instruct downloaded from HuggingFace. Deployment time is measured from K8s submission until the service/llm-model-serve-serve-svc endpoint is ready. We ran both static and dynamic setups. For the static setup, we ran without the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with replicas set to 1. For the dynamic setup, we ran with the Ray Autoscaler, specifying a CPU Ray head and GPU Ray workers, with replicas and minReplicas set to 0; the Ray Autoscaler scaled up to 1 replica during the deployment. The dynamic setup requires more time to deploy than the static setup, since the scale-up from 0 to 1 GPU Ray worker replicas is not started until after the Ray head is configured and the service workload is submitted to it, whereas in the static setup, the single GPU Ray worker is created in parallel with the CPU Ray head.
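The timing itself is simple to capture; below is a rough sketch of the measurement, polling the generated Serve service for ready endpoints (the manifest filename and namespace are placeholders, and our actual scripts differ in detail).

```python
# Rough sketch of the deployment-time measurement: apply the RayService
# manifest, then poll until the generated Serve service has ready endpoint
# addresses. The manifest filename and namespace are placeholders.
import subprocess
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

start = time.time()
subprocess.run(["kubectl", "apply", "-f", "rayservice.yaml"], check=True)

while True:
    try:
        ep = v1.read_namespaced_endpoints("llm-model-serve-serve-svc", "default")
        if any(subset.addresses for subset in (ep.subsets or [])):
            break  # at least one pod is serving behind the service
    except ApiException:
        pass  # the service has not been created yet
    time.sleep(5)

print(f"deployment time: {time.time() - start:.0f}s")
```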
Reducing EKS LLM Scale-up Time
To reduce image load time on EKS, we chose the strategy described here of using Bottlerocket node images with a data volume pre-populated to contain a snapshot of our container image. The Luna smart autoscaler supports allocating Bottlerocket nodes. As described below, we built an ECR image for our workload container, took a snapshot of it, and configured Luna to use Bottlerocket with our snapshot. Our LLM serving workload uses the ray-ml image rayproject/ray-ml:2.33.0.914af0-py311 from dockerhub, which is also published to ECR as public.ecr.aws/anyscale/ray-ml:2.33.0-py311. In addition, our RayService config ran “pip install vllm==0.5.4”, which we discovered impacted scale-up time (a sketch of this runtime install pattern follows Table 1). And to use HF_HUB_ENABLE_HF_TRANSFER to speed up model download, we needed to include “pip install hf_transfer” as well. So we created a new ECR container image that combined public.ecr.aws/anyscale/ray-ml:2.33.0-py311 with the vllm and hf_transfer pip installs. We took a snapshot of the resulting ECR image using the instructions here. We set up our cluster as described here, with Luna configured as described here to use Bottlerocket node images and the snapshot.

Table 1 contains the EKS measurement results, with the improved time including the impact of both the reduced image load time and the reduced model download time using hf_transfer. Both static and dynamic deployment times were significantly improved, with static time reduced by 26% and dynamic time reduced by 46%, almost twice as much. We expected the improvement to be higher for the dynamic case, given that the time to create the worker node and pull its image is not overlapped with the time to create the head node and pull its image, so the worker image pull speedup is more impactful. We note that using hf_transfer for model download without also using the custom ECR image is slower than the baseline; the time needed to do the “pip install hf_transfer” at runtime is higher than the time saved by the faster model download.
Table 1: EKS RayService Baseline and Improved Deployment Times
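For readers unfamiliar with where such runtime pip installs come from: Ray lets a workload declare Python dependencies in a runtime_env, and each fresh worker installs them at startup. The sketch below (using ray.init for brevity, rather than our actual RayService serve config) shows the pattern whose startup cost we removed by baking the packages into the custom image.

```python
# Illustrative sketch: declaring Python deps and env vars via Ray's
# runtime_env. Every fresh worker pays the pip install cost at startup,
# which is why we baked vllm and hf_transfer into the container image instead.
import ray

ray.init(
    runtime_env={
        "pip": ["vllm==0.5.4", "hf_transfer"],
        "env_vars": {"HF_HUB_ENABLE_HF_TRANSFER": "1"},
    }
)
```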
Reducing GKE LLM Scale-up Time
To reduce image load time on GKE, we chose the strategy described here of Image Streaming from the GCP Artifact Registry with warmed multi-level caches. The Luna smart autoscaler supports GKE Image Streaming. As described below, we built an Artifact Registry image for our workload container, enabled Image Streaming on our cluster, and configured Luna to allow the nodes it allocates to pull from Artifact Registry for Image Streaming.
The workload container we built consisted of rayproject/ray-ml:2.33.0.914af0-py311 from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR image. We stored it in the GCP Artifact Registry. We set up our cluster as described here and enabled Image Streaming on it, and we configured Luna as described here to allow the nodes it allocates to pull from Artifact Registry. Note that we did an initial fetch of the image to prewarm GCP’s multi-level caches, which is required to see the image load benefits.

Table 2 contains the GKE measurement results, with the improved time including the impact of both the reduced image load time and the reduced model download time using hf_transfer. Again, the static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 48%. Unlike on EKS, we did not see the much larger improvement in the dynamic case that we expected; we speculate that this is because there was some serialization of the node setup even in the static case. Note that an instance type with slightly larger memory (15GB -> 16GB) was allocated for the Ray head in the dynamic setup, to accommodate the modest additional resources needed to run the Ray Autoscaler; this was not needed in the EKS case, since the instance type chosen for the static setup already had 16GB of memory. As on EKS, using hf_transfer for model download without also using the custom Artifact Registry image is slower than the baseline, due to the cost of the runtime pip install of hf_transfer.
Table 2: GKE RayService Baseline and Improved Deployment Times
Reducing AKS LLM Scale-up Time
To reduce image load time on AKS, we chose the strategy described here, a preview feature, of Artifact Streaming from the Azure Container Registry to AKS. The Luna smart autoscaler supports AKS Artifact Streaming. As described below, we built an ACR image, enabled Artifact Streaming on it, and configured Luna to enable Artifact Streaming on the nodes that it creates.
Our ACR workload container image consisted of rayproject/ray-ml:2.33.0.914af0-py311 from dockerhub plus the vllm and hf_transfer pip installs, similar to our ECR and Artifact Registry images. As per the feature link, we registered the ArtifactStreamingPreview feature in our subscription and enabled Artifact Streaming on our ACR image. We set up our cluster as described here, and configured Luna as described here to enable Artifact Streaming on the nodes that it creates.

Table 3 contains the AKS measurement results, with the improved time including the impact of both the reduced image load time and the reduced model download time using hf_transfer. Both static and dynamic deployment times were significantly improved, with static time reduced by 47% and dynamic time reduced by 60%. As we had expected, and had also observed on EKS, the dynamic time reduction was higher than the static reduction. As on EKS and GKE, we note that using hf_transfer for model download without also using the custom ACR image is slower than the baseline, due to the cost of the runtime pip install of hf_transfer.
Table 3: AKS RayService Baseline and Improved Deployment Times
SUMMARY
In this blog, we've shared our experience with simple, low-cost, off-the-shelf methods for reducing container image fetch and model download time on EKS, GKE, and AKS clusters. The Luna smart cluster autoscaler's support for each cloud’s image fetch acceleration feature made our job easier. For our example LLM serving workload, a KubeRay-deployed RayService using vLLM to serve an open-source model downloaded from HuggingFace, deploy-time was cut roughly in half in most cases. For EKS, deploy-time was reduced by 26% to 46%; for GKE, by 47% to 48%; and for AKS, by 47% to 60%.
By the way, we note that our target use case is inexpensive self-hosted LLM serving that does not require service guarantees for sudden extreme load bursts. The methods we present do not yield the very low scale-up latencies of hosted LLM serving such as that provided by, e.g., the Anyscale product, which uses a custom container image format and client to lower image pull times, a special library for fast image loading that streams tensors directly from cloud storage onto the GPU, and a direct interface between the Ray autoscaler and the system control plane for accelerated node allocation. Such hosted products can be a great choice, depending on your use case and budget.
Please reach out to share your experiences with these deploy-time reduction strategies for your scale-up scenarios. You can get the free trial version of Luna here. Thanks for reading our blog, and we’ll post more material as/when we find more improvements!
Author: Anne Holler (Chief Scientist, Elotl)