ABSTRACT
In a multi-cluster Kubernetes (K8s) environment, when there are insufficient statically-allocated free cluster resources to schedule a workload, an autoscaled cloud cluster can be used to obtain the resources needed to run the workload. Selecting among your autoscaled cloud clusters the one that can obtain those resources at the lowest estimated price is desirable, particularly for AI workloads requiring GPUs, since cloud GPU supply can be limited and costs can be high and can vary greatly across vendors.
In this blog, we present Thrifty-Nova, a tool for performing cost-ordered workload placement on autoscaled cloud clusters. Thrifty-Nova leverages the Nova fleet manager's policy-driven multi-cluster scheduling and the Luna Smart cluster autoscaler's node cost estimate feature to create a Nova placement policy that is customized to the workload with respect to relevant cloud resource availability and price. We give several examples of Thrifty-Nova usage that show the value of automating workload cluster selection in cost-order priority, given the impact of workload configuration and dynamic resource availability on successful placement. Abstract In our blogs “SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency” and "SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases", we discussed SuperSkyRay’s support for running Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh as well as SuperSkyRay’s non-disruptive handling of Ray apps that outgrow single-cluster placement via extending them to multi-cluster placement. In this blog, we consider SuperSkyRay’s handling of KubeRay RayServices that outgrow the single Kubernetes (K8s) clusters hosting them due to Ray cluster upgrade or reconfigure with zero downtime. To support zero downtime (default), the RayService keeps the current Ray cluster running while it brings up an additional Ray cluster with the new configuration; the upgrade or reconfiguration is incomplete until the new version of the Ray cluster is available. SuperSkyRay can reschedule a RayService deployed on a single cluster onto a different cluster to avoid the update stalling indefinitely when there are insufficient resources for a second RayCluster. While this relocation involves downtime, it is appropriate when time-to-update is critical and resources are limited. Introduction When any field in spec.rayClusterConfig of a running RayService is changed, KubeRay by default performs a zero downtime upgrade of the Ray cluster as follows. It keeps the current copy of the Ray cluster running to continue processing service requests while it deploys an additional version of the Ray cluster with the updates. Once the new version is fully ready, it switches the service to using the updated Ray cluster and removes the old Ray cluster. While this avoids service downtime, it requires that the K8s cluster hosting the RayService have sufficient resources to run two copies of the Ray cluster. When this is not possible, the service update remains incomplete for an indefinite period of time, which is undesirable. (RayService no-downtime upgrade can be disabled by setting ENABLE_ZERO_DOWNTIME to false, so cluster config changes do not engender any upgrade operation, which can also be undesirable.)
SuperSkyRay, Part 2: Scaling Ray AI Apps Across K8s Clusters for No-downtime Resource Increases11/2/2025
Abstract In our previous blog, SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency, we discussed how SuperSkyRay could be used to run Ray apps managed by KubeRay across multiple K8s clusters linked by Cilium Cluster Mesh. In this blog, we turn our attention to how SuperSkyRay can non-disruptively handle Ray apps that outgrow their single Kubernetes (K8s) cluster placement. SuperSkyRay can dynamically change the Ray app placement from single-cluster to cross-cluster, increasing the app’s resources without requiring any app relocation downtime. Introduction When SuperNova (Nova w/multi-cluster-capacity set) performs capacity-based scheduling of a K8s object group, it prefers to place the group on a single cluster if possible, since that choice is simpler in terms of management and networking than cross-cluster placement. If a group placed on a single cluster contains an app for which the worker count is later scaled up, the result may no longer fit on that cluster, e.g., because the cluster has reached its fixed size limit, as is the case of on-premise or cloud reserved-instance clusters. When a group no longer fits on its cluster, SuperNova seeks to reschedule the group.
SuperSkyRay, Part 1: Running Ray AI Apps Across K8s Clusters for Resource and Time Efficiency11/2/2025
Abstract This blog presents SuperSkyRay, a name we gave to supporting Ray app execution via KubeRay across Kubernetes (K8s) clusters running the Cilium Cluster Mesh multi-cluster datapath. SuperSkyRay uses the Nova K8s fleet manager to perform cross-cluster placement in accordance with KubeRay and Cluster Mesh operation. SuperSkyRay addresses the resource and time inefficiency that occurs when resources needed for Ray apps are fragmented across K8s clusters. Introduction Organizations using KubeRay to run the Ray ML platform on K8s often have multiple clusters for reasons such as resource availability and cost, service continuity, geo-location, and quality of service. SkyRay reduces the toil of managing instances of KubeRay running on a fleet of K8s clusters by providing policy-driven resource-aware scheduling of Ray apps onto K8s clusters. However, SkyRay does not address the inefficiency that occurs if the desired scale of a Ray app exceeds the spare capacity of any single cluster in the fleet, while at the same time the fleet has sufficient idle resources fragmented across clusters. In this case, the app runs with fewer resources than desired or is delayed until enough single-cluster capacity is freed. This inefficiency could be addressed if the Ray app could be run across multiple K8s clusters.
Introduction
Are you tired of juggling multiple Kubernetes clusters, desperately trying to match your ML/AI workloads to the right resources? A smart K8s fleet manager like the Elotl Nova policy-driven multi-cluster orchestrator simplifies the use of multiple clusters by presenting a single K8s endpoint for workload submission and by choosing a target cluster for the workload based on placement policies and candidate cluster available capacity. Nova is autoscaler-aware, detecting if workload clusters are running either the K8s cluster autoscaler or the Elotl Luna intelligent cluster autoscaler.
In this blog, we examine how Nova policies combined with its autoscaler-awareness can be used to achieve a variety of "right place, right size" outcomes for several common ML/AI GPU workload scenarios. When Nova and Luna team up you can:
Originally published on blog.ferretdb.io
Running a database without a disaster recovery process can result in loss of business continuity, resulting in revenue loss and reputation loss for a modern business.
Cloud environments provide a vast set of choices in storage, networking, compute, load-balancing and other resources to build out DR solutions for your applications. However, these building blocks need to be architected and orchestrated to build a resilient end-to-end solution. Ensuring continuous operation of the databases backing your production apps is critical to avoid losing your customers' trust. Successful disaster recovery requires:
This blog post shows how to automate these four aspects of disaster recovery using FerretDB, Percona PostgreSQL and Nova. Nova automates parts of the recovery process, reducing mistakes and getting your data back online faster. |
Topic
All
Archives
November 2025
|






RSS Feed