Save Millions on Your Cloud Bill: 9 Strategies for Kubernetes Cost Optimization
As infrastructure scales, cloud compute costs can quickly snowball, especially when you are operating hundreds of Kubernetes clusters. There is no single silver bullet for taming Kubernetes expenditure. Optimizing Kubernetes costs extends beyond implementing clever technical solutions; it is very much about understanding and navigating the nuances and operational realities of your applications and the underlying infrastructure.
In this blog we will lay out a set of actionable strategies designed to improve compute utilization on Kubernetes. You can selectively adopt and combine these approaches to build a comprehensive multi-dimensional cost optimization blueprint, precisely tailored to address the distinct challenges of your platform.
In Part 2 of this blog (coming soon), we will share a case study detailing real-world application of these strategies by an organization operating cloud-scale infrastructure on Kubernetes, spanning multiple cloud providers and stretched across continents.
Measuring Efficiency
Four core resource metrics are crucial for understanding consumption at a fundamental level. They are defined for both CPU and memory.
Capacity: The total physical resources (CPU cores, memory in bytes) of a node.
Allocatable: The total amount of a resource available on a node for pods, after subtracting the overhead reserved for Kubernetes system components and the operating system. This is the actual capacity available for scheduling workloads.
Requested: The amount of a resource that a pod explicitly declares it needs to run. Kubernetes uses this value for scheduling decisions, ensuring a node has enough guaranteed capacity for the pod.
Used: The actual, real-time consumption of a resource by a pod.
Understanding the difference between "cost" and "utilization" is crucial. Utilization can be measured at various levels of aggregation, and selecting the appropriate utilization metric is helpful when implementing an optimization strategy.
Used/Allocatable (U/A): This utilization metric quantifies the consumption of resources relative to the total allocatable capacity across all nodes in a cluster. It provides a measure of how busy the underlying infrastructure is. High U/A generally indicates that the provisioned hardware is being used efficiently.
Used/Requested (U/R): This utilization metric quantifies how effectively the resources requested by individual pods are actually being consumed. It highlights the discrepancy between what a workload declares it needs and what it actually uses. Low U/R indicates overprovisioning at the workload level, leading to wasted allocations.
Requested/Allocatable (R/A): This utilization metric quantifies how densely pods are packed onto nodes by comparing the total resources requested by all scheduled pods against the nodes’ allocatable capacity. A low R/A ratio suggests sparse bin-packing, implying more nodes are provisioned than necessary.
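To make these ratios concrete, consider a hypothetical node pool with 64 allocatable CPU cores, where scheduled pods request 48 cores in total but use only 16 on average:

Requested/Allocatable = 48 / 64 = 75% (bin-packing looks reasonably dense)
Used/Requested = 16 / 48 ≈ 33% (requests are roughly 3x actual usage)
Used/Allocatable = 16 / 64 = 25% (most of the hardware you pay for sits idle)

Judged by R/A alone, this cluster appears well packed; the other two ratios reveal where the waste actually lives.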
Remember that high utilization does not always equal low cost. Ultimately, cost should be the overarching measure of success for any optimization effort.
Fig: Visualizing resource utilization
Strategies for Kubernetes Cost Optimization
1. Implement Cost Visibility and Attribution
Kubernetes workloads are distributed across diverse node types and can scale up and out. This dynamic nature obscures the origin of compute costs, especially in shared multi-tenant clusters. Without granular attribution, you cannot identify cost inefficiencies or hold specific teams and applications accountable for their resource consumption.
Solution: Adopt cost visibility and attribution mechanisms for your platform. For simpler deployments, off-the-shelf tools such as KubeCost or OpenCost can provide real-time monitoring and cost allocation by pod, label, namespace, and service. These tools can capture price signals for CPU-seconds, memory bytes, storage IOPS, network egress, etc. For complex, multi-cloud, or highly customized platforms, consider developing an in-house attribution system. This typically involves aggregating resource utilization metrics and reconciling them with cloud provider billing data.
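Whatever tooling you choose, attribution only works if workloads carry consistent ownership metadata. The sketch below shows one possible labeling convention; the label keys, names, and image are illustrative, and tools like OpenCost or KubeCost can aggregate spend by whatever labels and namespaces your organization standardizes on:

```yaml
# Hypothetical labeling convention for cost attribution.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: payments
  labels:
    team: payments          # who owns (and pays for) this workload
    cost-center: cc-1234    # maps to an internal billing code
    app: checkout-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        team: payments
        cost-center: cc-1234
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2
          resources:
            requests:
              cpu: "250m"
              memory: 256Mi
```

Enforcing such labels at admission time (for example with a policy controller) keeps attribution data trustworthy as teams and services multiply.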
Pitfalls to avoid:
Delayed Adoption: Cost visibility is frequently an afterthought. This can lead to substantial, unidentifiable waste.
Umbrella Cost Models: For shared resources such as databases, metrics, blob storage, etc., attributing costs under a single “umbrella” is ineffective because ownership is unclear. Define a clear cost model that allocates shared resource costs to teams or applications based on usage or pre-defined criteria.
Lack of Governance: Without continuous audits and automated alerts, cost optimization efforts can regress. Adopt a regular review process that flags anomalies based on past costs and projected modeling.
2. Adopt Multi-Tenant Clusters
It is common to provision single-tenant Kubernetes clusters per team or application, often driven by a perceived need for strict isolation and developer insistence. This model inherently leads to overprovisioning due to fragmented resource pools and control plane overhead.
Solution: Transition to a multi-tenant cluster architecture. Kubernetes offers robust controls that enable diverse workloads to coexist securely and efficiently on shared hardware. These include:
Namespaces: Provide logical isolation for tenants within a single cluster.
Resource Requests and Limits: Crucial for managing resource consumption of application pods, preventing individual workloads from monopolizing resources.
Resource Quotas and LimitRanges: Enforce aggregate resource consumption caps per namespace, ensuring fair access to shared resources (see the example manifests after this list).
Role-Based Access Control (RBAC): Restrict tenant access to only their designated namespaces and resources, ensuring API isolation.
Network Policies: Control communication between pods across namespaces, providing network isolation.
Pod Security Standards (PSS) or Security Contexts: Enforce security hygiene and prevent privilege escalation, safeguarding the underlying nodes.
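As a minimal sketch of per-tenant guardrails, the manifests below assume a tenant namespace named team-a; the quota numbers and defaults are illustrative and should reflect your own capacity planning:

```yaml
# Caps the aggregate resources the tenant namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 320Gi
    pods: "200"
---
# Applies default requests/limits to containers that omit them,
# so no workload lands on shared nodes without declared resources.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```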
Pitfalls to avoid:
"Noisy Neighbor" Phenomenon: Workloads can still exhibit spike CPU or memory usage impacting applications co-located on shared nodes. Kubernetes schedules pods based on resource requests, which can lead to contention if applications constantly exceed their requests.
Network and Disk I/O Bottlenecks: Kubernetes lacks native mechanisms for specifying network and disk I/O requests. This can lead to performance degradation if I/O-intensive workloads on the same node contend for shared bandwidth. Implement monitoring for these resources and consider using pod anti-affinity or topology spread constraints for throughput-constrained applications, as shown in the sketch below.
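A minimal sketch of spreading replicas of an I/O-heavy workload across nodes so they do not saturate a single node's bandwidth; the pod name, label, and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ingest-agent
  labels:
    app: ingest-agent
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway   # prefer spreading; do not block scheduling
      labelSelector:
        matchLabels:
          app: ingest-agent
  containers:
    - name: agent
      image: registry.example.com/ingest-agent:2.0
      resources:
        requests:
          cpu: "500m"
          memory: 1Gi
```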
3. Custom Scheduler for Bin-Packing Optimization
The default Kubernetes scheduler aims to distribute workloads uniformly across nodes. This "spreading" strategy leaves unused capacity across multiple nodes, leading to sparse bin-packing and persistent overprovisioning.
Solution: Configure a custom Kubernetes scheduler to favor tighter bin-packing (e.g., using MostAllocated or RequestedToCapacityRatio scoring strategies). This approach fills existing nodes more completely before provisioning new ones, leading to higher resource utilization.
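A minimal sketch of such a scheduler configuration, assuming a secondary scheduler profile named bin-packing-scheduler (the profile name and weights are illustrative):

```yaml
# KubeSchedulerConfiguration that favors bin-packing over spreading.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated            # prefer nodes that are already fuller
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Workloads opt in by setting spec.schedulerName: bin-packing-scheduler, while everything else keeps using the default scheduler, which also preserves the fail-open posture discussed below.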
Pitfalls to avoid:
Scheduler Unavailability: The custom scheduler, like any self-managed workload, may experience unavailability, leaving its pods stuck in a Pending state. Design the custom scheduler for high availability and consider using it in conjunction with the default scheduler with fail-open mechanics.
Pods-per-Node Limits: Cloud providers often impose max-pods-per-node limits, which restrict the number of pods that can be placed on a single node even when it has ample CPU and memory capacity; denser bin-packing on larger instance types hits this ceiling sooner. Plan carefully when adopting larger instance types, or use a combination of large and small nodes.
4. Automate Resource Tuning for Workloads
Manually configuring CPU and memory requests and limits for pods is inefficient. Developers frequently err on the side of overprovisioning, driven by concerns about performance, stability, and the potential impact of noisy neighbors. This results in significant wasted resources. Moreover, these initial resource requests are seldom audited or adjusted over time as application needs evolve.
Solution: Automate the process of resource assignment and tuning. This involves continuously monitoring the actual resource usage of workloads and dynamically adjusting their CPU and memory requests and limits to align more closely with real consumption patterns. While Kubernetes offers the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), the two cannot safely act on the same resource metric (CPU or memory) for the same workload. You can implement custom tooling that consumes VPA recommendations to tune requests, along with cool-off logic to prevent thrashing.
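A minimal sketch of running the VPA in recommendation-only mode, so that custom tooling (or a human) applies the suggestions on its own schedule rather than letting the VPA evict pods directly; the target workload, namespace, and bounds are illustrative:

```yaml
# VPA in "Off" mode: produces recommendations without evicting pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-service-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  updatePolicy:
    updateMode: "Off"        # recommend only; a custom controller applies changes
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "4"
          memory: 8Gi
```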
Pitfalls to avoid:
Phased Rollout and Escape Hatches: Implement automated tuning with a fail-open approach, starting with a phased rollout. Provide clear escape hatches and mechanisms for developers to temporarily disable or override automation for specific workloads if critical issues arise.
Developer Control and Trust: Automation creates leverage, but platforms with thousands of microservices require careful consideration of application-specific nuances. Engage developers by providing dashboards with tuning recommendations to foster trust and gradual adoption.
Resource Use Profiles: Not all applications exhibit identical resource consumption patterns. Utilize profiles for tuning (e.g. guaranteed, burstable, etc.) to cater to diverse workload requirements and criticality levels.
Application of Scaled-Down Requests: Be judicious about when scaled-down requests are applied. If adjustments take effect only at deploy time, they can confuse developers troubleshooting new changes. Implement mechanisms for safe, programmatic restarts that clearly attribute performance changes to the tuning.
5. Enable Controlled Evictions
Workloads that prevent disruption, whether through PodDisruptionBudgets (PDBs) that allow zero disruptions (e.g., maxUnavailable: 0) or annotations like cluster-autoscaler.kubernetes.io/safe-to-evict: "false", block the Cluster Autoscaler from scaling down nodes even when they are sparsely packed. These configurations are common for singletons or critical workloads, and are problematic in multi-tenant platforms.
Solution: Institute policies that disallow PDBs permitting zero disruptions and safe-to-evict: false annotations. Proper application hygiene, such as handling SIGTERM signals and setting an appropriate terminationGracePeriodSeconds, suffices for most cases (a sketch follows below). For more complex workloads, build custom tooling to orchestrate graceful workload eviction by exposing a pre-shutdown hook. For example, in stateful applications this may involve pre-shutdown steps like draining connections, committing transactions, or backing up data.
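A minimal sketch of that shutdown hygiene: a preStop hook gives the application a window to drain before SIGTERM arrives, and terminationGracePeriodSeconds bounds the total shutdown time. The pod name, image, drain script, and timings are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: orders-worker
spec:
  terminationGracePeriodSeconds: 60      # total budget for graceful shutdown
  containers:
    - name: worker
      image: registry.example.com/orders-worker:3.1
      lifecycle:
        preStop:
          exec:
            # Hypothetical drain step: stop accepting work, flush in-flight items.
            command: ["/bin/sh", "-c", "/app/drain.sh --timeout 45"]
```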
Pitfalls to avoid:
Long-Running Jobs: Long-running streaming workloads like Apache Flink, or batch jobs with significant internal state, may not recover gracefully from disruptions. Forced restarts can lead to jobs restarting from scratch, incurring greater overall cost due to re-processing or extended execution times. Isolate such workloads onto a dedicated node pool or cluster. Newer versions of Flink may support checkpointing (or similar specialized recovery mechanisms).
Cost of Exceptions: Strictly control workloads that genuinely cannot withstand eviction by isolating them onto dedicated node pools or clusters. This simplifies cost accounting for these exceptions and prevents impacting overall cluster efficiency.
6. Spot Nodes for Ephemeral Workloads
Workloads with frequent spin-up and spin-down cycles, such as short-lived jobs, analytics pipelines, or data ingestion agents, can prevent the Cluster Autoscaler from effectively scaling down underutilized nodes. Running these ephemeral workloads on standard on-demand nodes can be expensive.
Solution: Schedule ephemeral workloads onto spot nodes. Spot nodes leverage unused cloud provider capacity and can be significantly cheaper (e.g. up to 58% cheaper on AKS) than on-demand instances. This strategy is equally effective for non-business-critical tasks where occasional interruptions are acceptable.
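On AKS, spot node pools carry the kubernetes.azure.com/scalesetpriority=spot taint and a matching label; a minimal sketch of steering an ephemeral job onto them (the job itself is hypothetical, and other providers use different taints and labels):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report
spec:
  template:
    spec:
      restartPolicy: OnFailure
      # Only spot nodes carry this taint on AKS; tolerating it lets the pod land there.
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      # Prefer spot nodes, but allow fallback to on-demand if none are available.
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values: ["spot"]
      containers:
        - name: report
          image: registry.example.com/nightly-report:1.0
```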
Pitfalls to avoid:
Interruption Notices: Cloud providers can reclaim spot nodes with short notice (e.g., 30 seconds on Microsoft Azure). Ensure workloads running on spot nodes are designed for fault tolerance and can gracefully handle abrupt termination within this limited window.
Non-Standard Shutdowns: Some cloud providers may not follow standard Kubernetes shutdown processes (e.g., sending SIGTERM) when reclaiming spot nodes, potentially leading to immediate termination. See this blog that describes hardening Spot nodes by implementing custom tooling.
Spot Capacity Shortages: Spot instance availability is dynamic and can fluctuate based on region, instance type, and other factors outside your control. De-risk this by using a mix of spot instance types combined with a baseline of on-demand instances for stability. Configure the Cluster Autoscaler to prioritize scaling up spot nodes before others, allowing workloads to spill over to on-demand nodes only when spot capacity is unavailable (see the sketch below).
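One way to express "scale spot first, on-demand as a fallback" is the Cluster Autoscaler's priority expander, configured through a ConfigMap. The node-group name patterns below are illustrative and depend on how your node pools are named:

```yaml
# Requires running the Cluster Autoscaler with --expander=priority.
# Higher numbers are tried first: spot node groups before on-demand ones.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot.*
    10:
      - .*on-demand.*
```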
7. Align Node Shapes with Workload Demands
Discrepancies between the CPU:memory ratios of provisioned nodes and the actual consumption patterns of workloads can lead to resource imbalances. For example, if workloads are memory-intensive but nodes have a high CPU:memory ratio, CPU resources may be underutilized while memory becomes a bottleneck. This necessitates more nodes than actually required.
Solution: Adjust your node instance types’ CPU:memory ratio to align with your aggregate workload consumption patterns, calculated as Σ(CPU Used):Σ(memory Used). Ensure this is done per node pool and cluster. By migrating to optimal node shapes, you can close the gap between CPU and memory efficiency, maximizing utilization and reducing overall compute costs.
Pitfalls to avoid:
In-place Node Type Changes: Node instance types cannot be changed in place. Implementing this strategy requires custom tooling to orchestrate a node pool change that replaces nodes with the desired instance types without disrupting workloads. See this blog for a highly automated way of replacing a node pool.
Premature Optimization: This strategy is best executed after workloads have been right-sized and resource requests/limits are accurately defined. Optimizing node shape before workload rightsizing is ineffective, as CPU and memory efficiency will shift once requests are corrected, necessitating yet another node instance type change.
System Overhead: Remember that Kubernetes and its components (kubelet, kube-proxy, container runtime) consume a portion of node resources. Account for this system overhead when evaluating available CPU and memory for your workloads, especially on smaller nodes.
8. Custom Metric-Based Autoscaling
Kubernetes’ Horizontal Pod Autoscaler (HPA) natively supports scaling based on CPU and memory utilization. However, many applications have unique scaling requirements that are more accurately reflected by business-specific or application-level metrics (e.g., requests per second, queue depth, or active connections). Without the ability to autoscale on these metrics, organizations often resort to overprovisioning to handle peak loads.
Solution: Enable custom metric-based autoscaling on your platform. This typically involves deploying a custom metrics adapter (e.g. Prometheus Adapter) that exposes metrics from your monitoring system to the Kubernetes Custom Metrics API (custom.metrics.k8s.io). Alternatively, solutions like KEDA (Kubernetes Event-Driven Autoscaling) provide a more flexible approach, offering built-in “scalers” for various event sources (message queues, databases, cloud services), including the ability to scale down to zero pods when no events are present.
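A minimal sketch of an HPA driven by a request-rate metric, assuming an adapter (such as Prometheus Adapter) already exposes a per-pod metric named http_requests_per_second through the custom metrics API; the metric name, workload, and targets are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 3              # sane floor if the metrics pipeline misbehaves
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served via custom.metrics.k8s.io
        target:
          type: AverageValue
          averageValue: "100"              # target roughly 100 RPS per pod
```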
Pitfalls to avoid:
Metrics Infrastructure Reliability: Tightly coupling Kubernetes scaling behavior to your metrics infrastructure requires ensuring the same level of availability and reliability for the metrics pipeline itself. Failures in metrics collection can lead to incorrect scaling decisions.
Metric Accuracy: Effectiveness of custom metric autoscaling depends on the accuracy of the chosen metrics. Inaccurate metrics can lead to undesirable scaling behavior (e.g., thrashing or flapping where replicas frequently fluctuate).
Single Custom Metrics Server Limitation: Kubernetes allows only one backend to be registered for the custom metrics API per cluster. This means that if you install a solution that provides a custom metrics server (e.g., KEDA's metrics adapter or Prometheus Adapter), other tools attempting to register their own custom metrics API will conflict. This requires careful planning of your metrics ecosystem.
Failure Modes and Defaults: Design for failure modes. What happens if the custom metrics server becomes unavailable? Implement sane default replica counts or fallback mechanisms to ensure application stability during metric outages.