When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput. In this blog, we present two Ray machine learning (ML) serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna reduced latency relative to the hot node runs by 40% in the first experiment and 70% in the second, and increased throughput by 30% in the first and 40% in the second. We describe how the Luna smart cluster autoscaler with HNM addresses hot node performance issues by triggering the allocation and use of additional cluster resources.
INTRODUCTION
A pod's CPU and memory resource requests express its minimum resource allocations. The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when the settings cannot be respected. Cloud cluster autoscalers look at these values on pending pods to determine the amount of resources to add to a cluster.
A pod configured with both CPU and memory requests, and with limits equal to those requests, is in the guaranteed QoS class. A K8s cluster hosting any non-guaranteed pods runs the risk that some nodes in the cluster become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can have performance problems. A bursting pod’s attempts to use CPU above its CPU resource request can be throttled under contention, and its attempts to use memory above its memory resource request can get the pod killed under node memory pressure. The K8s scheduler can worsen the situation by continuing to schedule pods onto hot nodes.
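For reference, here is a minimal sketch of what a guaranteed-QoS pod spec looks like: each container sets CPU and memory requests equal to its limits. The pod name and image below are placeholders for illustration, not taken from our experiments.

```bash
# A minimal guaranteed-QoS pod sketch: requests == limits for every container,
# so the node never promises this pod less than it is allowed to use.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example      # hypothetical name, for illustration only
spec:
  containers:
  - name: app
    image: nginx:1.25           # placeholder image; the resources stanza is the point
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"             # limits equal to requests => guaranteed QoS
        memory: "512Mi"
EOF
```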
The Vertical Pod Autoscaler (VPA) can recommend and optionally set a pod's CPU and memory resource requests and limits, based on K8s metrics server data, and hence can be used to avoid or address hot nodes. However, there are various trade-offs in using VPA, and by default VPA can reduce but does not eliminate hot node risks. Cloud cluster autoscalers obtain resources for pending pods and typically do not address the issue of hot nodes. With these concerns in mind, we introduced the Hot Node Mitigation (HNM) feature to the Luna smart cluster autoscaler. With HNM enabled, Luna monitors its allocated nodes’ CPU and memory utilization using K8s metrics server data, and takes action to avoid or reduce high CPU or memory utilization.
In this blog, we describe the K8s hot node problem and discuss handling it via VPA and via Luna's HNM feature. We present two experiments showing how HNM reduces the impact of high utilization. The experiments involve ML workloads. Such workloads are challenging to handle since they are sensitive to the latency impact both of high utilization and of the cluster scaling operations intended to address high utilization. These experiments demonstrate that Luna HNM can be an effective chill pill to cure significant pod performance problems.
HANDLING HIGH K8S NODE UTILIZATION
Pods that are not in the guaranteed QoS class introduce the risk that cluster nodes can become highly utilized. Determining how to set a pod’s CPU and memory request and limit values so that it is in the guaranteed QoS class is challenging. The pod may be running a new workload, for which the resource needs have not yet been established. Or the pod's resource needs may evolve over time, as its use case changes. Or the pod's resource needs may have rare bursts, and configuring its resource requests to handle such peaks is inefficient in the normal case.
Vertical Pod Autoscaler (VPA)
VPA can be used to recommend and optionally set a pod's CPU and memory resource requests and limits, based on the pod’s metrics server data. By default, VPA-generated settings maintain the ratios between limits and requests that were specified in the initial container configuration, and if no limits were specified, VPA does not generate limits. Hence, by default, VPA reduces the likelihood of hot nodes when it increases pod request settings, but it does not increase the number of pods with guaranteed QoS or completely eliminate the risk of hot nodes.
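As a concrete illustration, a VPA object can be deployed in recommendation-only mode, so that it publishes suggested requests without restarting pods; switching updateMode to Auto or Initial changes that behavior, as discussed below. The target Deployment name here is hypothetical.

```bash
# Sketch of a VPA in recommendation-only mode ("Off"): it computes and publishes
# recommendations in its status, but never evicts or mutates pods.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app               # hypothetical workload name
  updatePolicy:
    updateMode: "Off"          # recommend only; "Initial" or "Auto" would apply the values
EOF
```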
There are various trade-offs in using VPA. When VPA is run in auto (default) or recreate mode, it can be disruptive, since it restarts pods if their VPA-recommended resource requests differ non-trivially (either below or above) from their current resource requests. And if VPA is run in initial or recommendation-only mode, it is not real-time responsive to current conditions. Also, VPA is not tested in large clusters, according to its GitHub README, and users have reported scaling issues when VPA is handling large numbers of pods. Hence, while VPA can help mitigate high node utilization, it may introduce challenges such as unnecessary pod restarts, delayed responses to hot node events, or scalability issues in large-scale environments.
Hot Node Mitigation (HNM)
Luna's HNM, by focusing on node hot spots when they occur, is intended to be responsive, disruptive only when appropriate, and scalable. In general, Luna allocates node resources for pods based on the pods' resource request settings. For smaller pods, Luna allocates nodes on which multiple pods may be bin-packed. For larger pods or those with node configuration constraints, Luna allocates a node for each pod. If Luna-managed bin-packed pods have no resource request settings or if their request settings are lower than pod usage, Luna-allocated bin-packed nodes may become highly utilized, causing performance problems.
When Luna HNM is enabled (via the manageHighUtilization.enabled configuration option set to true), Luna uses K8s metrics server data to monitor the CPU and memory utilization of Luna-allocated bin-packed nodes, and takes action to avoid or reduce high CPU or memory utilization. CPU utilization is computed as CPU usage over CPU capacity, where usage is the CPU core usage reported by the metrics server, averaged over the metrics server's configured window period (e.g., 30s or more). Memory utilization is computed as the instantaneous working set memory over memory capacity.
The Luna HNM loop runs every manageHighUtilization.loopPeriod, and uses metrics server node and pod CPU and memory utilization data, together with configuration options, to characterize busy nodes as yellow or red.
Yellow nodes [CPU utilization >= manageHighUtilization.yellowCPU (default 60) or memory utilization >= manageHighUtilization.yellowMemory (default 65)] are considered warm. HNM taints warm nodes to prevent the K8s Scheduler from adding more pods onto them. This diminishes the likelihood of warm nodes transitioning to high CPU or memory utilization.
Red nodes [CPU utilization >= manageHighUtilization.redCPU (default 80) or memory utilization >= manageHighUtilization.redMemory (default 85)] are considered hot. In addition to tainting them, HNM evicts the highest CPU- or memory-demand Luna-scheduled pod on them (based on pod metrics server data) that meets the same pod eviction restrictions applied for Luna node scale-down, which considers a number of factors including respecting the do-not-evict annotation. This reduces high CPU or memory utilization.
Lightly-used nodes are considered green [CPU utilization < manageHighUtilization.greenCPU (default 10) and memory utilization < manageHighUtilization.greenMemory (default 15)]. If a green node has an HNM taint, the taint is removed. This allows nodes that are no longer warm or hot to again host additional pods. The large gap between the yellow and green thresholds is intended to avoid the node taint flapping on and off, with the associated pod placement churn.
Note that bin-packed pods which have no CPU and memory request settings (or whose CPU and memory request settings are inaccurate and very low) introduce the additional risk that the nodes they are running on appear to Luna to be under-utilized with respect to requests and hence candidates for scale-down. For this case, scaleDown.binPackNodeUtilizationThreshold can be set to 0.0, if desired, so Luna only scales down nodes running no Luna-managed pods.
LUNA HOT NODE MITIGATION EXPERIMENTS
In this section, we present two experiments. One shows how Luna HNM can reduce the impact of high utilization via hot node pod eviction. The other shows how Luna HNM can avoid the impact of high utilization via warm node tainting.
For our experiments, we use the hey load generator to present queries to an online Machine Learning (ML) model that does text summarization. The ML serving workload runs on a Ray cluster with CPU Ray worker(s), deployed by KubeRay on a Luna-enabled AKS cluster. The AKS cluster has 2 static nodes of type Standard_DS2_v2 (2 CPUs, 7G), on which Luna and KubeRay are deployed. We chose to deploy KubeRay onto statically-allocated compute rather than having Luna deploy KubeRay onto dynamically-allocated compute, since KubeRay’s role is infrastructure-related and its resource needs are low. KubeRay is configured with guaranteed QoS, set at CPU requests=limits=100m and memory requests=limits=512Mi. When hot node pod eviction is used to reduce node utilization, Luna may need to allocate an additional node to handle the evicted pod. For the online ML model serving use case, which is latency-sensitive, adding that node needs to happen as quickly as possible, since node scale-up time is on the critical path of addressing the serving performance problem caused by evicting a server worker. We first describe how we reduced node scale-up time and then present the two experiments.
Reducing Node Scale-up Time
Two key components of node scale-up time are node instance allocation time and image pull time. For the instance types in our experiments, we observed node instance allocation times to be 1-2 minutes and pull times for the large rayproject/ray-ml:2.9.0 image to be >5 minutes.
To hide the latency of node instance allocation time, we used over-provisioning, as discussed here and here. We deployed a low-priority single-pod deployment configured to consume one bin-packing node, with the idea of keeping a single extra node available for bin-pack scale-up. The expense of this idle node was considered worthwhile for the example ML serving use case. To hide the latency of pulling the large ray-ml image, we used this daemonset to pre-pull the image into the cache on each K8s node. There are a number of general-purpose tools intended to address the image pull latency problem (e.g., kube-fledged, dragonfly). We chose a custom daemonset for the simple purposes of our experiment.
HNM Hot Node Pod Eviction
To show the impact of HNM hot node pod eviction, we compare load testing performance results on the RayService text summarizer with 2 CPU Ray workers for 3 configurations:
1. Baseline: the 2 CPU workers are configured for guaranteed QoS and are placed by Luna on 2 separate bin-packing nodes.
2. HNM-Disabled: the 2 CPU workers are configured for Burstable QoS (requests<limits) and are placed by Luna on the same bin-packing node. HNM is not enabled, so no mitigation occurs.
3. HNM-Enabled: the 2 CPU workers are configured for Burstable QoS (requests<limits) and are placed by Luna on the same bin-packing node. HNM is enabled with redCPU set to 70. HNM mitigates the node’s high CPU utilization by evicting one of the Ray worker pods, which is restarted on another node.
For all 3 configurations, the Ray head was annotated for placement on a bin-select node to simplify analysis of the bin-packing scenarios. We note that the Ray head uses guaranteed QoS and hence is not subject to performance impact from bursting.
Luna bin-packing node size is configured as 8 CPUs and 32Gi memory. The Standard_A8m_v2 instance type is used, since it is the least expensive node that satisfies this bin-pack node size. The Luna bin-select thresholds are set to 7 CPUs and 30G memory. The baseline RayService configuration is here and the Burstable RayService configuration is here. As you can see, the Baseline Ray workers have requests=limits of 4 CPUs and 16G memory, while the Burstable Ray workers have requests of 3 CPUs and 12G memory, meaning that the Baseline workers’ requests do not fit on the same bin-packing node and the Burstable workers’ requests do. With port-forwarding set up in a separate terminal:
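The port-forward command is along the following lines; the service name is an assumption based on the usual <rayservice-name>-serve-svc naming that KubeRay creates, not copied verbatim from our setup.

```bash
# Forward the Ray Serve HTTP port (8000) from the RayService's serve service
# to localhost, so hey can reach the model endpoint.
kubectl port-forward svc/text-summarizer-serve-svc 8000:8000
```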
and with the ML serving model input set as:
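(The exact passage used in our runs is not reproduced here; any reasonably long paragraph works as the summarization input. The snippet below is a placeholder.)

```bash
# Placeholder input text for the summarizer; substitute any paragraph(s) you like.
TEXT="Kubernetes is an open-source system for automating deployment, scaling, \
and management of containerized applications. It groups containers that make up \
an application into logical units for easy management and discovery."
```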
the load test is run for 300 seconds using 10 threads and a per-query time-out of 60 seconds as:
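A hedged reconstruction of that hey invocation is shown below: -z sets the test duration, -c the number of concurrent workers, and -t the per-request timeout in seconds. The endpoint path and port are assumptions about the Ray Serve route rather than values taken from our configuration.

```bash
# 300-second load test with 10 concurrent workers and a 60s per-request timeout.
hey -z 300s -c 10 -t 60 -m POST \
    -H "Content-Type: application/json" \
    -d "\"$TEXT\"" \
    http://localhost:8000/summarize
```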
The results of the experiment are given in Table 1. The HNM-Disabled row shows the substantial impact that CPU contention has on the average response time (40% worse) and number of responses generated (30% fewer) during the 300-second run, relative to the baseline. The first HNM-Enabled row reflects that pod eviction and restart has a short-term negative impact relative to HNM-Disabled, since during the eviction/restart period the full load is handled by a single Ray worker. The second HNM-Enabled row shows that after that period, performance matching the baseline is achieved.
Note that the performance impact of pod eviction/restart by HNM for high CPU utilization is worthwhile only if the load persists for a non-trivial period after the eviction/restart. The ROI of pod eviction is significantly better when memory is the highly utilized resource, since memory contention can lead to pod OOM termination; hence, for memory contention, eviction and restart can be worthwhile for shorter load spike durations.
Table 1: Impact of HNM Hot Node Pod Eviction on Text Summarizer Model serving load
Let’s next consider an example where hot node performance problems can be avoided if warm nodes are tainted to inhibit additional pod placement on them.
HNM Warm Node Tainting
To show the impact of HNM warm node pod tainting, we have Luna place a CPU stress test pod on a bin-packing node, to act as a noisy neighbor for our experiment. This pod has Best-Effort QoS, specifying neither requests nor limits, which means its requests values are treated as 0. We set the Luna option scaleDown.binPackNodeUtilizationThreshold to 0.0 to have Luna scale-down only consider nodes not running any Luna-managed pods, as previously discussed.
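A Best-Effort CPU stress pod of this kind might look like the sketch below; the image and stress parameters are illustrative rather than the exact manifest used in our run.

```bash
# A Best-Effort "noisy neighbor": no requests or limits, so the scheduler
# sees zero demand while the container spins up several CPU stress workers.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cpu-stress              # hypothetical name
spec:
  containers:
  - name: stress
    image: polinux/stress       # illustrative choice of stress image
    command: ["stress", "--cpu", "6", "--timeout", "3600s"]
    # no resources stanza => Best-Effort QoS; requests are treated as 0
EOF
```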
We compare load testing performance results on the RayService text summarizer with 1 CPU Ray worker (not 2 CPU Ray workers as in the previous experiment) for 2 configurations:
1. HNM-Enabled: the CPU worker is configured for Burstable QoS (requests<limits) and is not placed on the same node as the CPU stress test pod, because HNM has tainted that node due to its utilization exceeding yellowCPU.
2. HNM-Disabled: the CPU worker is configured for Burstable QoS (requests<limits) and is placed on the same node as the CPU stress test pod, since that node appears to have plenty of resources from the standpoint of requests values.
For both configurations, the Ray head is placed on a bin-select node, as in the previous experiment.
Luna bin-packing node size is configured as 8 CPUs and 32Gi memory; the Standard_A8m_v2 instance type is used. The Luna bin-select thresholds are set to 7 CPUs and 30G memory. The Burstable RayService configuration is here, with requests set to 2 CPUs and 12G memory and limits set to 4 CPUs and 16G memory. The load test run uses the same TEXT input and port-forwarding as the previous experiment. The HNM-Enabled load test is run for 300 seconds using 10 threads and a per-query time-out of 60 seconds as:
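As before, this corresponds to a hey invocation along these lines (endpoint assumed, as in the first experiment):

```bash
# HNM-Enabled run: 300s duration, 10 concurrent workers, 60s per-request timeout.
hey -z 300s -c 10 -t 60 -m POST \
    -H "Content-Type: application/json" \
    -d "\"$TEXT\"" \
    http://localhost:8000/summarize
```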
The HNM-Disabled configuration could not complete any queries with the per-query time-out set to 60 seconds. It was re-run using a per-query time-out of 120 seconds as:
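That re-run corresponds to the same hedged invocation with only the timeout raised:

```bash
# HNM-Disabled re-run: identical parameters except a 120s per-request timeout.
hey -z 300s -c 10 -t 120 -m POST \
    -H "Content-Type: application/json" \
    -d "\"$TEXT\"" \
    http://localhost:8000/summarize
```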
The results of the experiment are given in Table 2. For HNM-Enabled, the single Burstable Ray CPU worker pod was not placed on the same node as the CPU stress test pod, since that node was tainted by HNM due to warm utilization. For HNM-Disabled, however, the single Burstable Ray CPU worker was placed on the same node as the CPU stress pod, and this noisy neighbor greatly impacted its performance: no successful responses were returned within the 60s timeout, and a significantly poorer average response time (70% higher) and number of responses (40% lower) were observed with the 120s timeout.
Table 2: Impact of HNM Warm Node Tainting on Text Summarizer Model serving load
POSSIBLE FUTURE WORK
While we’ve presented experiments where the current HNM feature worked well, we note that it has two limitations.
CONCLUSION
We used Ray to run two ML online serving workloads. In both cases, Luna Hot Node Mitigation allowed us to significantly reduce the latency (by 40% and 70%) and increase the throughput (by 30% and 40%) relative to runs on hot nodes.
Take a look at your clusters; do you have non-guaranteed QoS pods and hot nodes? This could be slowing your workloads down. Please feel free to download our free trial version and/or to reach out with any questions or comments. We’re dedicated to continually enhancing Luna and the Hot Node Mitigation feature. And to do so effectively, we need to hear from you! We welcome your feedback on how our current HNM solution works for you and whether our proposed improvements would be helpful in your setup. Please share your experiences and insights so we can tailor our solution to your needs. Thanks for taking the time to read the blog and have a great day!
Author:
Anne Holler (Chief Scientist, Elotl)