Elotl
  • Home
  • Platform
    • Luna
    • Nova
  • Resources
    • Blog
    • Youtube
    • Podcast
    • Meetup
  • Usecases
    • GenAI
  • Company
    • Team
    • Careers
    • Contact
    • News
  • Free Trial
    • Luna Free Trial
    • Nova Free Trial
Blog

Luna Hot Node Mitigation: A Chill Pill to Cure Pod Performance Problems

8/21/2024

 
When nodes in a cluster become over-utilized, pod performance suffers. Avoiding or addressing hot nodes can reduce workload latency and increase throughput. In this blog, we present two Ray machine-learning serving experiments that show the performance benefit of Luna’s new Hot Node Mitigation (HNM) feature. With HNM enabled, Luna reduced latency relative to the hot-node runs by 40% in the first experiment and 70% in the second, and increased throughput by 30% in the first and 40% in the second. We describe how the Luna smart cluster autoscaler with HNM addresses hot-node performance issues by triggering the allocation and use of additional cluster resources.

INTRODUCTION

A pod's CPU and memory resource requests specify its minimum guaranteed resource allocation. The Kubernetes (K8s) scheduler uses these values as constraints for placing the pod on a node, leaving the pod pending when no node can satisfy them. Cloud cluster autoscalers examine the requests of pending pods to determine how much capacity to add to a cluster.
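As a minimal illustration of the mechanism described above, here is a hypothetical pod spec (the name and image are placeholders) whose requests constrain scheduling: the scheduler will only bind this pod to a node with at least 500m of CPU and 1Gi of memory not already claimed by other pods' requests, and if no such node exists the pod stays Pending, which is the signal a cluster autoscaler reacts to.

```yaml
# Illustrative pod spec: requests are the values the K8s scheduler
# and cluster autoscalers act on. Actual usage may exceed them.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app          # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27     # hypothetical image
    resources:
      requests:
        cpu: "500m"       # scheduler reserves 0.5 CPU on the chosen node
        memory: "1Gi"     # scheduler reserves 1Gi on the chosen node
```

Note that this pod sets no limits, so it may burst above its requests at runtime; that bursting behavior is exactly what the next section examines.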

A pod configured with both CPU and memory requests, and with limits equal to those requests, is in the Guaranteed QoS class. A K8s cluster hosting any non-Guaranteed pods runs the risk that some nodes become over-utilized when such pods have CPU or memory usage bursts. Bursting pods running on hot nodes can suffer performance problems: a bursting pod's attempts to use CPU above its CPU request can be throttled or starved by contention, and its attempts to use memory above its memory request can cause the pod to be evicted or killed. The K8s scheduler can worsen the situation by continuing to schedule pods onto hot nodes, since it places pods based on requests rather than actual usage.
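For contrast with the burstable pod shown earlier, here is a hypothetical spec in the Guaranteed QoS class: limits are set for both resources and equal the requests, so the pod can never use more than its reserved share and cannot itself make a node hot (names and image are again placeholders).

```yaml
# Illustrative Guaranteed-QoS pod: limits == requests for both CPU
# and memory, so the pod cannot burst above its reserved allocation.
apiVersion: v1
kind: Pod
metadata:
  name: demo-guaranteed    # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.27      # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "500m"        # hard cap: CPU above this is throttled by the kubelet/cgroup
        memory: "1Gi"      # hard cap: exceeding this OOM-kills the container
```

You can confirm the assigned class on a running pod with `kubectl get pod demo-guaranteed -o jsonpath='{.status.qosClass}'`. The trade-off is that Guaranteed pods give up burst headroom, which is why many real workloads remain Burstable and why a hot-node mitigation mechanism is useful.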



​© 2025 Elotl, Inc.