<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Clean Compute]]></title><description><![CDATA[Kubernetes, the way you want it]]></description><link>https://blog.cleancompute.net</link><image><url>https://substackcdn.com/image/fetch/$s_!gG-Z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5bcb1de-4767-41a7-91f5-e780f00b6d67_390x390.png</url><title>Clean Compute</title><link>https://blog.cleancompute.net</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 11:35:51 GMT</lastBuildDate><atom:link href="https://blog.cleancompute.net/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Clean Compute]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[cleancompute@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[cleancompute@substack.com]]></itunes:email><itunes:name><![CDATA[Clean Compute]]></itunes:name></itunes:owner><itunes:author><![CDATA[Clean Compute]]></itunes:author><googleplay:owner><![CDATA[cleancompute@substack.com]]></googleplay:owner><googleplay:email><![CDATA[cleancompute@substack.com]]></googleplay:email><googleplay:author><![CDATA[Clean Compute]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[But What About Reliability? The Kubernetes Cost Optimization Paradox]]></title><description><![CDATA[This post discover how to apply cost optimization techniques in a large scale Kubernetes platforms without compromising reliability]]></description><link>https://blog.cleancompute.net/p/kubernetes-cost-optimization-paradox</link><guid isPermaLink="false">https://blog.cleancompute.net/p/kubernetes-cost-optimization-paradox</guid><dc:creator><![CDATA[Nibir Bora]]></dc:creator><pubDate>Wed, 08 Apr 2026 17:31:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6e41f11-fc4b-4461-9f76-3c32a0062709_7341x3848.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is the follow-up to <a href="https://blog.cleancompute.net/p/kubernetes-cost-optimization">Part 1: Save Millions on Your Cloud Bill</a>, where we focus on the harder question: how do you apply those ideas in a real production environment without compromising reliability?</em></p><p><em>This article is based on a talk presented at KubeCon North America 2025. You can watch the talk <a href="https://www.youtube.com/watch?v=GPo8WLCvaWw">here</a> and view the slides <a href="https://drive.google.com/file/d/1_rhYYd4Zdyc17ethPwRgZ-96gk9kBz2M/view">here</a>.</em></p><p><em>Special thanks to <a href="https://www.linkedin.com/in/thezainm/">Zain Malik</a> for his contributions to the ideas and work behind this talk.</em></p><div><hr></div><p>Business leaders believe reliability can only be achieved at a high cost. There is some truth to this. Cut costs without considering reliability, and application stability takes a hit. Spooked executives abandon cost optimization to preserve customer trust. 
And you are back to sky-high costs.</p><p>This is why every cost optimization initiative is met with the cautionary question:</p><div class="pullquote"><p><strong>But what about reliability?</strong></p></div><p>The fix is simple in principle: reliability needs to be an intentional step in reducing cost. The paradox isn&#8217;t between cost and reliability. It&#8217;s between fear and understanding. Once you understand the gaps in your reliability posture, reliability improves and cost savings follow.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!K9DA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe00103ec-d70b-4d77-bf9b-9c89cdddbe6f_7850x3848.png" width="1456" height="714" alt=""><figcaption class="image-caption"><em>The optimization faux pas (left) vs. cost optimization done right (right).</em></figcaption></figure></div><p>In Part 1 of this blog we laid out 12 strategies to reduce Kubernetes costs, starting from absolute basics like:</p><ul><li><p>Use shared multi-tenant clusters</p></li><li><p>Enable Node Pool autoscaling, and ensure scale-down to zero</p></li><li><p>Declare resource requests &amp; limits for workloads</p></li><li><p>Use Horizontal Pod Autoscaling (HPA) (see the sketch after these lists)</p></li></ul><p>Intermediate techniques as your platform scales:</p><ul><li><p>FinOps &amp; cost visibility per team, per service, per namespace, etc.</p></li><li><p>Use spot nodes for ephemeral workloads</p></li><li><p>Align node shapes with workload demands</p></li><li><p>Enable custom metric-based autoscaling</p></li></ul><p>And advanced techniques when infra cost justifies dedicated engineering:</p><ul><li><p>Automatic request tuning for workloads</p></li><li><p>Custom scheduler for optimized bin-packing</p></li><li><p>Cluster Autoscaler expander profiles to scale across node pools</p></li><li><p>Enable controlled evictions to remove any eviction blockers</p></li></ul>
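<p>To make the basics concrete, here&#8217;s a minimal sketch of explicit resource requests and limits paired with a CPU-based HPA. The service name, image, and numbers are illustrative, not from our platform:</p><pre><code class="language-yaml"># Illustrative Deployment with explicit requests and limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                          # hypothetical service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0 # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
---
# HPA scaling the Deployment between 2 and 10 replicas on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
</code></pre>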
<p>In this post, we show that cost optimizations done right actually increase platform reliability. This isn&#8217;t a contrarian theory. It&#8217;s a selection of war stories (along with technical solutions) from an organization operating cloud-scale Kubernetes infrastructure.</p><h1>Technical Roadblocks</h1><p>Each optimization technique from Part 1 comes with a reliability concern. Here&#8217;s how we addressed them by extending Kubernetes.</p><h2>Protecting the Data Layer</h2><p>Distributed databases like <a href="https://www.cockroachlabs.com/product/overview/">CockroachDB</a> rely on quorum across regions. If multiple replicas are lost simultaneously in different regions, the database loses quorum or becomes inconsistent. This can take down every transactional service, causing a serious global outage.</p><p>Evictions in different regions can happen for a myriad of reasons - Cluster Autoscaler (CA) scaling down an underutilized node, planned maintenance, node pool upgrades, etc.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_j6c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec5a6ca3-8fd7-47f7-b58f-0918be84f52b_5119x3179.png" width="1456" height="904" alt=""><figcaption class="image-caption"><em>Simultaneous evictions from Cluster Autoscaler in Cluster A and a Node Pool upgrade in Cluster B can break CockroachDB quorum.</em></figcaption></figure></div>
<p>The instinctive response is to set <a href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/">PodDisruptionBudget</a> (PDB) maxUnavailable to 0 on every CockroachDB pod. This feels safe. But nodes running CockroachDB pods can now never be drained. This isn&#8217;t just idle capacity burning money. It&#8217;s also an operational overhead because any node pool upgrade requires manual steps.</p>
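<p>For reference, that instinct looks like the sketch below (name and selector illustrative). A budget of zero blocks every voluntary eviction of matching pods, which is exactly why drains stall:</p><pre><code class="language-yaml">apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cockroachdb-pdb      # illustrative name
spec:
  maxUnavailable: 0          # no voluntary eviction is ever allowed
  selector:
    matchLabels:
      app: cockroachdb       # assumed pod label
</code></pre>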
<h6><strong>Solution</strong></h6><p>We built a multi-cluster distributed lock using the Kubernetes <a href="https://kubernetes.io/docs/concepts/architecture/leases/">Leases API</a>. The Leases API is typically used for leader election within a single cluster. We implemented a custom <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">Operator</a> extending Leases to coordinate evictions across clusters.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!m9ul!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F530c3b3f-81d8-4762-a0f5-45714cbc758b_3232x3534.png" width="1456" height="1592" alt=""><figcaption class="image-caption"><em>Multi-cluster eviction coordination using the Leases API. A Validating Webhook in each cluster intercepts eviction requests and acquires a global lease from the management cluster before allowing the eviction to proceed.</em></figcaption></figure></div>
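<p>The global lock itself is just a Lease object in the management cluster. A minimal sketch with hypothetical names; holderIdentity records which cluster currently holds the right to evict:</p><pre><code class="language-yaml">apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cockroachdb-eviction-lock    # hypothetical global lock
  namespace: eviction-coordination   # hypothetical namespace in the management cluster
spec:
  holderIdentity: cluster-a          # the one cluster currently allowed to evict
  leaseDurationSeconds: 60           # holder must renew before this window expires
  renewTime: "2026-04-08T17:00:00.000000Z"
</code></pre>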
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Multi-cluster eviction coordination using the Leases API. A Validating Webhook in each cluster intercepts eviction requests and acquires a global lease from the management cluster before allowing the eviction to proceed.</em></figcaption></figure></div><p>Before CA in any region attempts to evict a CockroachDB pod, it must first acquire a global lease. Only one cluster can hold that lease at any given moment. If a node pool upgrade in another cluster tries to evict a CockroachDB pod at the same time, it fails to acquire the lease and the eviction request is rejected.</p><p>This guarantees at most one CockroachDB pod is evicted globally at any given time. Quorum is never at risk.</p><div class="pullquote"><p><strong>By extending Kubernetes you can enable controlled evictions without a security blanket of overprovisioned resources. This gives you reliability and cost efficiency at the same time.</strong></p></div><h2>HPA vs. VPA</h2><p><a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/">Horizontal Pod Autoscaler</a> (HPA) and <a href="https://kubernetes.io/docs/concepts/workloads/autoscaling/vertical-pod-autoscale/">Vertical Pod Autoscaler</a> (VPA) cannot be used together. This is a well-known Kubernetes limitation. Platforms that rely on horizontal scaling for reliability are forced to trade off automatically adjusting resource requests based on actual usage.</p><p>The instinctive response is to opt out of VPA entirely. Platform maintainers accept manually adjusting resource requests as an operational overhead. This doesn&#8217;t scale and leads to cumulative waste.</p><h6><strong>Solution</strong></h6><p>The key was figuring out how to use both HPA and VPA together. 
We built an <strong>automatic resource tuner</strong> to address this.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!DX78!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2e52797-a17a-4934-a496-3d5c8c1af00f_4680x4648.png" width="1456" height="1446" alt=""><figcaption class="image-caption"><em>Automatic resource tuning pipeline. Resource Recommender produces recommendations based on historical usage data and applies them via a Mutating Admission Webhook at the next rollout.</em></figcaption></figure></div>
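<p>The pipeline&#8217;s input is a stock VPA object running in recommendation-only mode. A minimal sketch, assuming the upstream VPA CRDs are installed (names illustrative):</p><pre><code class="language-yaml">apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa      # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout        # illustrative workload
  updatePolicy:
    updateMode: "Off"     # recommender mode: compute recommendations, never evict or mutate pods
</code></pre>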
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Automatic resource tuning pipeline. Resource Recommender produces recommendations based on historical usage data and applies them via a Mutating Admission Webhook at the next rollout.</em></figcaption></figure></div><p>VPA in <strong>recommender mode</strong> analyzes usage and produces recommendations, but never applies them. We built a custom controller that blends these recommendations with a weighted average of usage data over the last several days. This produces a resource recommendation we can control - how frequently it gets generated, variance across regions, and overhead.</p><p>A <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook">Mutating Admission Webhook</a> applies these recommendations at the next application rollout. We also built in self-service opt-in and opt-out mechanisms along with tuning profiles so developers can customize how their workloads get tuned.</p><div class="pullquote"><p><em><strong>When an upstream tool doesn&#8217;t quite fit your problem, the answer is often to extend it, not abandon it. Using HPA and VPA together by extending Kubernetes reduces both cost and operational overhead.</strong></em></p></div><h2>High Spec Node Flop</h2><p>The instinctive path to reducing per-pod overhead is to use bigger nodes. Larger instance families like 96-core machines mean less amortized cost from DaemonSets, monitoring agents, network interfaces, and kubelet. More pods per node, lower overall cost. At least on paper.</p><p>In reality, cloud providers impose a <strong>max-pods-per-node</strong> limit. Without kernel level tuning you cannot pack too many pods on a node. These huge expensive machines end up half empty.</p><h6><strong>Solution</strong></h6><p>We reverted to smaller 32-core nodes and aligned node shape to the aggregate resource profile of our workloads. Our workloads were mostly memory-heavy. Across our fleet, memory would saturate first while CPU cores sat unused on nodes.</p><p>By switching from a 1:4 CPU to memory ratio to a 1:8, we packed pods more efficiently and reduced the gap between CPU and memory utilization. 
That&#8217;s where the savings came from.</p><div class="pullquote"><p><em><strong>Sometimes you have to dial back on a path that looks right on paper but doesn&#8217;t deliver in practice. When direct cost optimization isn&#8217;t possible, understand the real constraint before optimizing again.</strong></em></p></div><h2>The Flink Dilemma</h2><p><a href="https://flink.apache.org/">Apache Flink</a> powers critical ETL pipelines at many businesses. Some of these stateful stream processing jobs run for 45 minutes or even hours. When moving workloads to Spot instances (up to 58% cheaper on most cloud providers), Flink is an obvious target.</p><p>The problem is that older versions of Flink didn&#8217;t support partial recovery or checkpoints. If a Spot node running a Flink job is reclaimed, the entire job restarts from scratch. This cascades delays to downstream systems and risks data inconsistencies.</p><p>The instinctive response is to optimize anyway and accept the occasional restart. 58% savings is hard to ignore.</p><h6><strong>Solution</strong></h6><p>We made a deliberate decision not to run Flink on Spot instances. The potential disruption cost overshadowed the potential cost savings.</p><p>Instead, we isolated Flink onto a dedicated on-demand node pool and explicitly set the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation so these jobs won&#8217;t be evicted mid-execution. We accepted higher costs for one node pool. But this problem workload didn&#8217;t drag down the overall platform efficiency.</p>
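<p>A sketch of that isolation. The safe-to-evict annotation is the real Cluster Autoscaler one; the node pool label and taint are hypothetical stand-ins for however your provider names them:</p><pre><code class="language-yaml">apiVersion: v1
kind: Pod
metadata:
  name: flink-taskmanager      # illustrative
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"   # CA will not evict this pod on scale-down
spec:
  nodeSelector:
    pool: flink-on-demand      # hypothetical label on the dedicated on-demand pool
  tolerations:
    - key: dedicated           # hypothetical taint keeping other workloads off the pool
      operator: Equal
      value: flink
      effect: NoSchedule
  containers:
    - name: taskmanager
      image: flink:1.17        # example upstream image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
</code></pre>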
<div class="pullquote"><p><em><strong>Sustained efficiency in a mature platform comes from knowing when not to optimize to protect reliability. If workload inefficiencies cannot be eliminated, isolate them instead.</strong></em></p></div><h2>Noisy Neighbor</h2><p>Multi-tenant clusters mean workloads compete for shared resources. This is fundamental for cost efficiency, but comes with downsides. For example, a single pod throttling CPU on a node, or latency spikes from a memory-hungry neighbor.</p><p>When users cannot root-cause these failures easily, &#8220;noisy neighbor&#8221; becomes the catch-all explanation. The instinctive response is to isolate workloads onto dedicated node pools or clusters. This is one of the most widely accepted arguments against multi-tenancy.</p><p>But isolation at that level destroys the density gains that make multi-tenancy worthwhile. You&#8217;re solving a diagnosis problem with expensive infrastructure.</p><h6><strong>Solution</strong></h6><p>We introduced <a href="https://docs.kernel.org/accounting/psi.html">Pressure Stall Information</a> (PSI) metrics that tell users exactly how long a workload has been waiting for CPU cycles, blocked on I/O, or under memory pressure. A comprehensive node-level dashboard exposing CPU utilization, memory pressure, file descriptor counts, conntrack table usage, context switch rates, and per-workload PSI metrics lets engineers quickly identify what&#8217;s actually constraining their workload.</p><p>For workloads that genuinely need isolation, we introduced <strong>CPU core pinning</strong> via the Kubernetes <a href="https://kubernetes.io/blog/2024/12/16/cpumanager-strict-cpu-reservation/">CPU Manager static policy</a>. This allows latency-sensitive workloads to request dedicated CPU cores on a node, minimizing context switching.</p>
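<p>Core pinning needs two pieces, sketched below: the kubelet must run the static CPU Manager policy (a node-level setting), and the pod must land in the Guaranteed QoS class with integer CPU requests. Names and sizes are illustrative:</p><pre><code class="language-yaml"># Node-level KubeletConfiguration fragment
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static         # enables exclusive core allocation
reservedSystemCPUs: "0,1"        # example: keep two cores for system daemons
---
# Pod that receives dedicated cores: requests == limits, integer CPU
apiVersion: v1
kind: Pod
metadata:
  name: latency-sensitive        # illustrative
spec:
  containers:
    - name: app
      image: example.com/app:1.0 # placeholder
      resources:
        requests:
          cpu: "4"
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 8Gi
</code></pre>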
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:601,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:244894,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.cleancompute.net/i/164581244?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZmMK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png 424w, https://substackcdn.com/image/fetch/$s_!ZmMK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png 848w, https://substackcdn.com/image/fetch/$s_!ZmMK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png 1272w, https://substackcdn.com/image/fetch/$s_!ZmMK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6fa608-901a-470e-a2b2-ad3c2d38a306_6994x2886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Fear Loop - overprovisioning creates perceived reliability, which masks waste until instability returns. 
<h6><strong>Solution</strong></h6><p>We countered fear with radical visibility and control.</p><p>We gave developers dashboards showing exact CPU and memory usage vs. requested, alongside wasted dollars. Critically, we showed what the automatic tuner proposed to change those requests to. Automation was no longer a black box. Data replaced fear.</p><p>We also gave every team a self-service opt-out: a simple annotation to revert to their original requests without involving the platform team or raising a support ticket. No panic paging during an incident. This safety valve changed everything. Engineers trusted the automation more because they had the power to say no.</p>
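<p>The opt-out itself was deliberately boring. A hypothetical sketch of what such an escape hatch can look like; the annotation keys here are invented for illustration and are not a standard Kubernetes API:</p><pre><code class="language-yaml"># Fragment of a workload's metadata, as a team would set it
metadata:
  name: payments                               # illustrative
  annotations:
    tuning.example.com/opt-out: "true"         # hypothetical: the resource tuner skips this workload
    tuning.example.com/profile: "conservative" # hypothetical: request a gentler tuning profile instead
</code></pre>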
<div class="pullquote"><p><em><strong>Break the fear loop with radical visibility and control. Build trust with data, transparency, control, and a little bit of empathy.</strong></em></p></div><h2>Sacred Workloads</h2><p>Every organization has services that even the most prolific engineers refuse to touch. They are usually justified by high revenue generation, a big customer promise, or some historical outage story. These workloads run on dedicated hardware or have massive resource buffers.</p><p>The instinctive response is to not question it. If the service makes money and hasn&#8217;t gone down recently, leave it alone.</p><p>We had one critical transactional pipeline running with resource requests set at roughly 2000% of its average baseline usage. The team insisted it was imperative but couldn&#8217;t explain why. There was talk of &#8220;spikiness&#8221; and &#8220;rare edge cases&#8221; and historical incidents. But observability was weak, application behavior wasn&#8217;t well understood, and documentation didn&#8217;t exist. There was no technical truth to this overprovisioning. It just felt safer that way. Organizational scar tissue.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!zS8f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefd1881a-ca7d-40ba-83ca-9b1b55300650_6994x2946.png" width="1456" height="613" alt=""><figcaption class="image-caption"><em>The Sacred Workload Anxiety Loop - historical incident trauma drives overprovisioning, creating perceived safety that persists unchallenged. Breaking the loop - improved telemetry and profiling replaces anxiety with data, enabling right-sized requests built on confidence, not waste.</em></figcaption></figure></div>
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Sacred Workload Anxiety Loop - historical incident trauma drives overprovisioning, creating perceived safety that persists unchallenged. Breaking the loop - improved telemetry and profiling replaces anxiety with data, enabling right-sized requests built on confidence, not waste.</em></figcaption></figure></div><h6><strong>Solution</strong></h6><p>We invested in better metrics, traces, and profiling for these services first. We expanded developer tools so engineers could quickly take a heap dump from an application during a failure.</p><p>Once teams could actually see what the workload was doing, we could confirm whether it genuinely had unusual needs. In many cases, it didn&#8217;t. Teams were relieved to finally have data rather than anxiety. This enabled us to pull back the security blanket of overprovisioning from these sacred workloads.</p><div class="pullquote"><p><em><strong>Sacred workloads are sacred because people don&#8217;t understand them. When fear fills the knowledge gap, start with visibility, not optimization.</strong></em></p></div><h2>Eviction Blockers</h2><p>When a business grows fast, velocity wins over stability. Teams ship features and stability becomes tomorrow&#8217;s problem. At this inflection point we realized that 80% of nodes on our multi-tenant clusters had at least one pod blocking eviction. These workloads couldn&#8217;t handle shutdown signals gracefully without causing downstream business failures.</p><p>The instinctive response is to treat this as someone else&#8217;s problem. Application teams don&#8217;t want to rewrite shutdown logic. Platform teams don&#8217;t want to force changes that might cause outages. So the blockers stay.</p><p>This is a direct manifestation of organizational and technical debt. 
Even if workloads were tuned for optimal usage, the savings never materialized because these nodes could not be scaled down.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!hTLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125ec93b-ed26-4642-a575-4086d09978a0_6994x2886.png" width="1456" height="601" alt=""><figcaption class="image-caption"><em>The Tech Debt Spiral - rapid feature velocity drives overprovisioning for safety, accumulating overhead and waste under the illusion of perceived safety. Breaking the loop - extending Kubernetes to eliminate eviction blockers safely, removing waste while improving reliability.</em></figcaption></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Tech Debt Spiral - rapid feature velocity drives overprovisioning for safety, accumulating overhead and waste under the illusion of perceived safety. Breaking the loop - extending Kubernetes to eliminate eviction blockers safely, removing waste while improving reliability.</em></figcaption></figure></div><h6><strong>Solution</strong></h6><p>We introduced a Disruption Probe - a custom <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook">Validating Admission Webhook</a> that gives developers full control over when a workload can shut down safely. The webhook intercepts eviction signals and executes a configurable program within the pod&#8217;s container before shutdown. 
Teams define what &#8220;safe&#8221; means for their service.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6LPH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71cc55d1-6d03-47be-873c-ed937836696b_4844x3396.png" width="1456" height="1021" alt=""><figcaption class="image-caption"><em>The Validating Admission Webhook intercepts eviction requests. It executes a developer-defined eviction probe in the pod&#8217;s container and admits or denies the eviction based on the result.</em></figcaption></figure></div>
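<p>For flavor, a sketch of how such a webhook registers for evictions; an eviction arrives at the API server as a CREATE on the pods/eviction subresource. The service name, namespace, and path are hypothetical:</p><pre><code class="language-yaml">apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: disruption-probe                         # hypothetical
webhooks:
  - name: evictions.disruption-probe.example.com # hypothetical
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]                   # an eviction is a CREATE on the subresource
        resources: ["pods/eviction"]
    clientConfig:
      service:
        name: disruption-probe                   # hypothetical in-cluster service
        namespace: platform-system               # hypothetical
        path: /validate-eviction                 # hypothetical
    failurePolicy: Ignore                        # don't block evictions if the probe service is down
    sideEffects: None
    admissionReviewVersions: ["v1"]
</code></pre>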
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Validating Admission Webhook intercepts eviction requests. It executes a developer-defined eviction probe in the pod&#8217;s container and admits or denies the eviction based on the result.</em></figcaption></figure></div><p>By giving developers control over eviction safety, the platform team no longer had to demand teams remove eviction blockers. Teams removed them voluntarily, without hurting reliability.</p><div class="pullquote"><p><em><strong>Organizational and technical debt is the cancer to any cost optimization effort. Technical innovation that gives developers control back is the only way to reliably cure it.</strong></em></p></div><h2>Personal Incentives</h2><p>All the investment in automation, observability, and developer tooling won&#8217;t move the needle on your cloud bill if the humans responsible have no reason to care. Most engineering organizations reward technical novelty as a proxy for business impact. When efficiency isn&#8217;t rewarded, waste becomes normalized. The company&#8217;s bottom line takes a hit.</p><p>The instinctive response is to mandate cost optimization top-down. Leadership sets targets, platform teams enforce them, and application teams comply reluctantly.</p><p>This breeds resentment, not ownership. Mandates work until the next reorg or priority shift. 
Then waste creeps back.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Y57X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc75063f5-30bf-496c-bba8-b956917e4113_6994x2946.png" alt="">
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>The Doom Loop - when cloud spend is opaque and efficiency isn&#8217;t rewarded, waste is normalized. Breaking the loop - gamified accountability and aligned incentives drives proactive optimization and cost savings.</em></figcaption></figure></div><h6><strong>Solution</strong></h6><p>We changed the incentives.</p><p>The dashboards from automatic resource tuning evolved into leaderboards. Teams could see their waste ranked against others. Once waste became visible and measurable, positive peer pressure naturally kicked in.</p><p>Step two was making efficiency a formal performance dimension. Optimization results were folded into team goals, performance reviews, and bonus calculations for both engineers and managers. Once efficiency became personally consequential, the &#8220;don&#8217;t touch my workload&#8221; culture gave way to &#8220;how do I get my cost down?&#8221;</p><div class="pullquote"><p><em><strong>The most effective solution sometimes isn&#8217;t technology at all; It&#8217;s understanding human motivation to gamify accountability.</strong></em></p></div><h1>The Payoff</h1><p>If there is one takeaway - <strong>you don&#8217;t have to choose between cost and reliability</strong>. You get both. But only if reliability is an intentional step in every optimization, not an afterthought. We reduced our annual cloud spend by 40%. Double digit millions saved per year. 
And our business reliability posture improved alongside it.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_f--!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51670894-1450-4527-8e36-329da3f6a17d_3285x2114.png" alt="">
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Before and after optimization. Reduced allocatable capacity (red), tighter resource requests (yellow), and the increased actual usage (green).</em></figcaption></figure></div><p>But <strong>progress wasn&#8217;t always linear</strong>. We tried bigger nodes and had to revert. We wanted Flink on Spot and had to say no. We rolled out automation and had to build escape hatches to earn trust. Success often comes down to how you handle the unexpected. Agility is key.</p><p>The process of optimizing cost forced us to build better tooling, deeper observability, and more resilient shutdown mechanisms. Every reliability concern in this blog was met with a technical solution that made the platform more resilient than before.</p><p><strong>Kubernetes is not the end product</strong>. It is a framework for building platforms. By embracing its extensibility - writing custom controllers, configuring schedulers, leveraging APIs like Leases for novel purposes - you can transform it from a resource allocator into an intelligent, self-optimizing system. One that overcomes the cost of organizational fear.</p><div class="pullquote"><p><em><strong>Because the most expensive infrastructure isn&#8217;t compute. It&#8217;s fear.</strong></em></p></div>]]></content:encoded></item><item><title><![CDATA[In-Place PVC Re-Binding: Zero-Downtime Disk Migration on Kubernetes]]></title><description><![CDATA[In-place PVC re-binding: swap a PersistentVolume's backing disk on Kubernetes with zero downtime using only native APIs. No custom software needed.]]></description><link>https://blog.cleancompute.net/p/pvc-re-binding</link><guid isPermaLink="false">https://blog.cleancompute.net/p/pvc-re-binding</guid><dc:creator><![CDATA[Maxim Nazarenko]]></dc:creator><pubDate>Mon, 23 Mar 2026 17:06:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/900a5901-4382-49bf-8b4e-8803818e208f_4951x2627.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes has been on a decade-long journey to decouple its core from vendor-specific storage solutions by migrating from in-tree storage plugins to Container Storage Interface (CSI) drivers. 
On Microsoft Azure, the built-in <code>kubernetes.io/azure-disk</code> storage provisioner was deprecated in v1.19 and entirely removed in v1.26<a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#azure-disk"><sup>1</sup></a><sup>, </sup><a href="https://kubernetes.io/blog/2022/09/26/storage-in-tree-to-csi-migration-status-update-1.25/#timeline-and-status"><sup>2</sup></a>. Failure to migrate meant any scheduling event, including a routine deployment, could prevent a stateful application from re-attaching its underlying storage, causing application failure.</p><p>Standard migration paths require downtime. At our scale, taking hundreds of disks (backing data stores like ClickHouse, CockroachDB, Kafka, Prometheus) offline was off limits. This blog introduces an in-place PVC re-binding technique that swaps a <em>PersistentVolumeClaim</em>&#8217;s backing <em>PersistentVolume</em> while keeping the underlying disk intact. It requires only a single pod restart per volume and is performed entirely through native Kubernetes APIs, with no custom software or control plane hacks. We use Azure managed disks on <a href="https://azure.microsoft.com/en-us/products/kubernetes-service">Azure Kubernetes Service</a> (AKS) to illustrate, but this method works universally on self-managed and cloud-provider managed Kubernetes distributions.</p><p>The CSI migration is old news for most teams. But the PVC re-binding technique itself unlocks operational capabilities often considered too risky by platform teams (e.g. modifying the performance tier of an SSD). We used it to migrate several hundred production disks in under 2 months without a single incident or byte of data loss.</p><h1>What Makes This Hard</h1><p>To understand why this migration is difficult, we need to talk about how Kubernetes handles persistent storage and immutability.</p><p>Kubernetes persistent storage has three core components.</p><ul><li><p>A <em><strong>PersistentVolume</strong></em> (PV) is a cluster-level resource that represents a real piece of storage, like an Azure Disk.</p></li><li><p>A <em><strong>PersistentVolumeClaim</strong></em> (PVC) is a request for storage made by an application, living in the same namespace as its pods.</p></li><li><p>A <em><strong>StorageClass</strong></em> defines the type of storage and, critically, the provisioner responsible for creating it (e.g. <code>disk.csi.azure.com</code>).</p></li></ul><p>These come together through <a href="https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/">dynamic volume provisioning</a>. When a developer creates a PVC that specifies a <em>StorageClass</em>, the provisioner automatically creates a PV meeting the claim&#8217;s specifications. Kubernetes then binds the PVC to the PV. This binding is an exclusive one-to-one mapping enforced by the <code>claimRef</code> attribute on the PV.</p><p>The problem is that nearly every field that matters for this migration is immutable.</p><ul><li><p>The <code>provisioner</code> field in a <em>StorageClass</em> is immutable. We can&#8217;t simply update it to point to the new CSI driver.</p></li><li><p>We could create a new <em>StorageClass</em>. But once a PV is bound to a PVC, we can&#8217;t re-point them at the new <em>StorageClass</em>.</p></li><li><p>A PV&#8217;s <code>spec.persistentVolumeSource</code>, which defines the actual storage backend, is also immutable. 
Patching it returns: &#8220;<em>spec.persistentVolumeSource is immutable after creation</em>&#8221; (see the sketch after this list).</p></li><li><p>A <em>StatefulSet</em>&#8217;s <code>spec.volumeClaimTemplates</code> is immutable too. Changing the <code>storageClassName</code> in this template is rejected with a &#8220;<em>forbidden</em>&#8221; error.</p></li><li><p>Same for a pod&#8217;s <code>spec.volumes</code> section. It&#8217;s immutable and a patch will fail with a &#8220;<em>forbidden</em>&#8221; error.</p></li></ul>
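<p>To see these guardrails in action, here is a hypothetical patch attempt against the PV source field - the PV name and disk path are placeholders, and the server&#8217;s error text is abbreviated:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Try to re-point a bound in-tree PV at a different disk
kubectl patch pv pv-legacy --type merge \
  -p '{"spec":{"azureDisk":{"diskURI":"/subscriptions/&lt;sub-id&gt;/.../other-disk"}}}'

# The API server rejects it:
# Error from server (Forbidden): PersistentVolume "pv-legacy" is invalid:
# spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation</code></pre></div>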
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EUf7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a9f71a-edd1-417a-9032-1bb5bc15cc76_5266x2640.png" alt=""><figcaption class="image-caption"><em>Kubernetes persistent storage lifecycle and immutable fields.</em></figcaption></figure></div><p>These constraints are deliberate. They enforce Kubernetes' persistence principle: <strong>storage is a durable, stable resource, while pods are ephemeral and replaceable</strong>. <em>StorageClass</em>es, PVs, and PVCs, like several other Kubernetes objects, are immutable. The only way to change these resources is to destroy and recreate them.</p><h1>Understanding Kubernetes Nuances</h1><p>Given the immutability constraints, a direct migration is impossible. But lesser-known behaviors in Kubernetes provide the building blocks for our live migration strategy.</p><ol><li><p><em><strong>StorageClass</strong></em><strong> objects are passive.</strong> They are only used at the moment of provisioning; after a PV is bound to a PVC, the <em>StorageClass</em> plays no role. This means an existing PV and PVC are completely unaffected if their original <em>StorageClass</em> is deleted. We can exploit this behavior by deleting the deprecated in-tree <em>StorageClass</em> and immediately creating a new CSI-based one with the exact same name. This wouldn&#8217;t impact running applications.</p></li><li><p><strong>A PV&#8217;s </strong><code>claimRef</code><strong> controls binding behavior</strong>. When a PV is firmly bound to a PVC, its <code>spec.claimRef</code> contains the <code>kind</code>, <code>name</code>, <code>namespace</code>, and crucially, the <code>uid</code> and <code>resourceVersion</code> of the PVC. If the reference contains a <code>uid</code>, the PV controller considers the binding <strong>firm</strong>. 
If it does not, the PV is considered an available candidate for binding to a PVC with a matching <code>name</code> and <code>namespace</code>. This is the key insight. We can manually create a second PV that points to the same underlying Azure Disk but is defined as a CSI volume. By setting <code>name</code> and <code>namespace</code> in its <code>claimRef</code> but omitting the <code>uid</code>, this new PV becomes a &#8220;<strong>honeypot</strong>&#8221; volume, waiting to be claimed by a PVC of the right name.</p></li><li><p><strong>The </strong><code>pvc-protection</code><strong> finalizer prevents premature deletion</strong>. Kubernetes automatically adds the <code>kubernetes.io/pvc-protection</code> finalizer to any PVC actively used by a pod. With this finalizer present, deleting the PVC only sets a <code>deletionTimestamp</code>, putting it into a <em>Terminating</em> state. The PVC object isn&#8217;t actually removed until the pod using it is deleted, which removes the finalizer. This built-in safety mechanism prevents race conditions. It ensures that when we delete the pod, the StatefulSet controller won&#8217;t immediately create a new empty volume before our &#8220;honeypot&#8221; PV can be claimed.</p></li></ol><h1>In-Place PVC Re-Binding Algorithm</h1><p>Before starting the per-disk migration, we replace the legacy <em>StorageClass</em> with a new CSI-based one that has the exact same name. This tricks the control plane into using the new CSI driver when it automatically re-creates the PVC later.</p><ol><li><p>Create a backup of the legacy <em>StorageClass</em>.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get sc managed-premium -o yaml &gt; managed-premium-legacy.yaml</code></pre></div><p>The legacy <em>StorageClass</em> will look like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># managed-premium-legacy.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: kubernetes.io/azure-disk # The legacy in-tree provisioner
parameters:
  storageaccounttype: Premium_LRS
  kind: Managed
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer</code></pre></div><ol start="2"><li><p>Delete the legacy <em>StorageClass</em>.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl delete sc managed-premium</code></pre></div><ol start="3"><li><p>Create the new CSI <em>StorageClass</em> with the same name.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># managed-premium-csi.yaml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium # The exact same name as the old one
provisioner: disk.csi.azure.com # The new CSI provisioner
parameters:
  skuName: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer</code></pre></div><p>Apply this new <em>StorageClass</em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl apply -f managed-premium-csi.yaml</code></pre></div>
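<p>As a quick sanity check before touching any disks (a verification step we sketch here for illustration), confirm the replacement <em>StorageClass</em> now reports the CSI provisioner:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get sc managed-premium -o jsonpath='{.provisioner}'
# Expected output: disk.csi.azure.com</code></pre></div>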
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!wp04!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19769cfd-5c5a-40f9-90b1-6967ff355930_6143x2627.png" alt=""><figcaption class="image-caption">The in-place PVC re-binding algorithm sequence.</figcaption></figure></div><p>With the groundwork laid, we execute the following steps for each individual disk.</p><ol><li><p><strong>Identify the target resources</strong>. Start by identifying the specific <em>StatefulSet</em> pod to migrate, then get the name of its PVC, its bound PV, and the URI of the underlying Azure Disk. The disk URI is our critical identifier for the physical storage.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Set variables for your environment

export POD_NAME="&lt;your-pod-name&gt;"
export PVC_NAME=$(kubectl get pod $POD_NAME -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].persistentVolumeClaim.claimName}')
export PV_NAME=$(kubectl get pvc $PVC_NAME -o jsonpath='{.spec.volumeName}')

echo "Pod: $POD_NAME"
echo "PVC: $PVC_NAME"
echo "PV:  $PV_NAME"

# Get the Azure Disk URI from the legacy PV object and save it

export DISK_URI=$(kubectl get pv $PV_NAME -o jsonpath='{.spec.azureDisk.diskURI}')

echo "Disk URI: $DISK_URI"
</code></pre></div><ol start="2"><li><p><strong>Set the legacy PV&#8217;s reclaim policy to &#8220;Retain&#8221;</strong>. This is an essential safety measure that ensure the physical Azure Disk is not automatically deleted when we delete the Kubernetes PV object later.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl patch pv $PV_NAME -p &#8216;{&#8221;spec&#8221;:{&#8221;persistentVolumeReclaimPolicy&#8221;:&#8221;Retain&#8221;}}&#8217;</code></pre></div><ol start="3"><li><p><strong>Create a new CSI PV</strong>. Create a new PV object that points to the same underlying Azure Disk. Set its <code>name</code> and <code>namespace</code> to match the legacy PVC in the PV&#8217;s <code>claimRef</code> section, but <strong>omit </strong>the <code>uid</code> and <code>resourceVersion</code> fields to allow re-binding. This is our &#8220;honeypot&#8221; PV.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml"># pv-csi-yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-migrated-csi # A new, unique name for the PV object
spec:
  capacity:
    storage: 100Gi # IMPORTANT: Use the actual size of your disk
  accessModes:
    - ReadWriteOnce # Match the original access modes
  persistentVolumeReclaimPolicy: Retain # Or Delete, if you prefer post-migration
  storageClassName: managed-premium # The name of the StorageClass we replaced
  claimRef:
    # IMPORTANT: These must match the original PVC exactly
    name: my-claim # Use your PVC_NAME variable here
    namespace: default # The namespace of your PVC
    # CRITICAL: Do NOT include 'uid' or 'resourceVersion'. This is intentional.
  csi:
    driver: disk.csi.azure.com
    volumeHandle: &lt;YOUR_AZURE_DISK_URI&gt; # Paste the DISK_URI from Step 1
    volumeAttributes:
      fsType: ext4 # Or xfs, matching your original disk</code></pre></div><p>Apply the new PV.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl apply -f pv-csi.yaml</code></pre></div><p>4. <strong>Trigger the re-binding</strong>. Delete the legacy PVC and the corresponding <em>StatefulSet</em> pod.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl delete pvc $PVC_NAME
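# Deleting the PVC only marks it Terminating - the pvc-protection
# finalizer holds it until the consuming pod is deleted next: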
kubectl delete pod $POD_NAME</code></pre></div><p>The following happens automatically:</p><ul><li><p>Deleting the PVC puts it into a &#8220;Terminating&#8221; state. The <code>pvc-protection</code> finalizer keeps it alive as long as the pod is running.</p></li><li><p>The pod goes through its shutdown sequence and is deleted.</p></li><li><p>When the pod is deleted, the finalizer is removed from the PVC, and the PVC is fully deleted.</p></li><li><p>The <em>StatefulSet</em> controller creates a new pod to replace the deleted one, re-creating a PVC with the same name from its <code>volumeClaimTemplates</code>.</p></li><li><p>Kubernetes finds the honeypot CSI PV we created in Step 3 and binds it to the new PVC.</p></li><li><p>The new pod starts and the Azure Disk is mounted to it.</p></li></ul><p>5. <strong>Verify the migration</strong>. Watch the pod start successfully and verify the application is running correctly.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get pods -w</code></pre></div><p>Check that the PVC status is &#8220;Bound&#8221;.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get pvc $PVC_NAME

# NAME        STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS      AGE
# my-claim    Bound    pv-migrated-csi     100Gi      RWO            managed-premium   15m
</code></pre></div><p>Confirm the PVC is bound to the new CSI PV.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get pv pv-migrated-csi

# NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS      REASON   AGE
# pv-migrated-csi   100Gi      RWO            Retain           Bound    default/my-claim       managed-premium            5m</code></pre></div><p>Confirm the legacy PV&#8217;s status is &#8220;Released&#8221;, indicating it is no longer bound and can be safely cleaned up.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">kubectl get pv $PV_NAME

# NAME        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM     STORAGECLASS      REASON   AGE
# pv-legacy   100Gi      RWO            Retain           Released   ...       ...                          20m</code></pre></div><p>Once the application is running correctly on the new pod with all data intact, safely delete the legacy PV object.</p><h1>Alternative Approaches</h1><p>There are several other ways to solve this problem. Each comes with tradeoffs in service disruption, data loss risk, and operational complexity.</p><p><strong>Microsoft Static Volume</strong></p><p>Microsoft&#8217;s <a href="https://learn.microsoft.com/en-us/azure/aks/csi-migrate-in-tree-volumes">official documentation</a> for migrating from in-tree to CSI drivers on AKS proposes a similar but more manual method. The process involves patching the original PV&#8217;s <code>reclaimPolicy</code> to &#8220;Retain&#8221;, manually creating new PV and PVC manifests that point to the same underlying Azure Disk, and then updating the application deployment to reference the newly created PVC.</p><p>This approach preserves data on the disk. However, it requires a full application redeployment to switch to the new PVC, which means downtime and an operational maintenance window for each migration.</p><p><strong>Orphan and Adopt</strong></p><p>Another approach is to <a href="https://kubernetes.io/docs/tasks/run-application/delete-stateful-set/#deleting-a-statefulset">orphan</a> pods from their controlling <em>StatefulSet</em>. This involves deleting the <em>StatefulSet</em> with the <code>--cascade=orphan</code> flag (sketched below), which leaves the pods and their PVCs running but unmanaged. A new <em>StatefulSet</em> using the updated CSI StorageClass can then be created to &#8220;adopt&#8221; the existing pods.</p><p>The risk here is significant. Without a controller, pods won&#8217;t get restarted or rescheduled in case of a node failure or eviction. For critical stateful workloads, this exposure window can lead to permanent data loss.</p>
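<p>For illustration, with a hypothetical <code>my-app</code> <em>StatefulSet</em> and manifest name, the orphan-and-adopt sequence looks roughly like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash"># Delete only the StatefulSet object; its pods and PVCs keep running
kubectl delete statefulset my-app --cascade=orphan

# Re-create the StatefulSet referencing the CSI StorageClass; it adopts
# the still-running pods whose labels and ordinal names match
kubectl apply -f my-app-statefulset-csi.yaml</code></pre></div>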
<p><strong>Backup and Restore</strong></p><p>Platforms with mature Day-2 operations can perform a &#8220;cold&#8221; migration using tools like <a href="https://velero.io/">Velero</a>. This takes a complete snapshot of the application and its data, which can then be restored with the required <em>StorageClass</em> modifications applied.</p><p>Backup-restore is powerful for disaster recovery but requires pausing applications. For large disks this introduces significant downtime. In a microservices architecture where pausing one service can cause cascading failures, this is a non-starter.</p><p><strong>Forking the Control Plane (The Datadog Approach)</strong></p><p>At KubeCon EU 2024, Datadog <a href="https://www.youtube.com/watch?v=sVQtO55910I">presented an approach</a> that involved forking the Kubernetes source code and patching the API server to bypass immutability constraints on live objects. This gives ultimate control over the storage definitions of running pods.</p><p>This strategy isn&#8217;t suitable for managed Kubernetes services like AKS, GKE, or EKS, where access to modify control plane components is restricted. Forking the Kubernetes codebase also introduces long-term maintenance overhead and the risk of deviating from upstream. Unsustainable for most platform teams.</p><p><strong>Custom Operator for Disk Swaps (The &#8220;ATOM&#8221;-ic Approach)</strong></p><p><a href="https://atoms.co/">ATOMS</a> implemented a <a href="https://techblog.atoms.co/p/swapping-disks-in-kubernetes">custom operator</a> to handle shrinking a cloud provider managed disk. It uses a custom resource, a mutating webhook, and a <code>volume-populator</code> to provision new disks and transfer data between old and new PVCs. It handles volume resizing declaratively, without manual intervention.</p><p>For our use case this was more machinery than needed. We did not need to make any changes to the underlying Azure Disk. No data copy required. That said, a custom operator is a natural automation layer on top of the re-binding technique for teams that need ongoing storage operations.</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3Q__!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ddaa03-6f1f-4c76-9193-7b462606d524_2613x1163.png" alt=""><figcaption class="image-caption">Alternative Approaches and their tradeoffs</figcaption></figure></div><h1>Conclusion</h1><p>We used this technique to migrate several hundred PVs on a platform operating <a href="https://techblog.atoms.co/p/managing-100s-of-kubernetes-clusters">100s of Kubernetes clusters</a> across a multi-region topology. The bulk of the complexity came from coordinating across large-scale data stores (like ClickHouse, CockroachDB, ElasticSearch, Kafka, and Prometheus), all running on cloud managed disks for durability and resilience, where downtime or data loss was off limits. The migration was completed in under 2 months with a lean platform team and zero incidents.</p><p>A few learnings from this project:</p><ol><li><p><strong>Deep systems knowledge trumps brute force</strong>. The solution came from understanding the less obvious mechanics of the Kubernetes control plane (how <code>claimRef</code> binding works, when finalizers fire, and <em>StorageClass</em> behavior at runtime). Working with the system&#8217;s guardrails produced a simpler, safer result than any brute force approach.</p></li><li><p><strong>Experimentation builds operational excellence</strong>. We uncovered critical edge cases (like premature PV deletion causing Multi-Attach errors during node drains) only by pushing the system to failure on staging. Confidence in production comes from understanding how a system breaks, not just how it works.</p></li><li><p><strong>Automation is the key to reliability at scale</strong>. It lets us move fast and consistently, reducing the risk of human error. 
We automated the entire algorithm but gated the final pod restart behind human approval, so teams control when it&#8217;s safe to restart an application.</p></li></ol><div><hr></div><p><em>Shoutout to <a href="https://dk.linkedin.com/in/rasmusbachkrabbe">Rasmus Bach Krabbe</a> and the storage team at <a href="https://atoms.co/">ATOMS</a> for walking us through the inner workings of their <a href="https://techblog.atoms.co/p/swapping-disks-in-kubernetes">PvcAutoscaler</a>. We took one look at all that machinery and decided there had to be a lazier way. Their operator is a serious piece of infrastructure at scale.</em></p>]]></content:encoded></item><item><title><![CDATA[Save Millions on Your Cloud Bill: 11 Strategies for Kubernetes Cost Optimization]]></title><description><![CDATA[Are your cloud bills spiraling out of control?]]></description><link>https://blog.cleancompute.net/p/kubernetes-cost-optimization</link><guid isPermaLink="false">https://blog.cleancompute.net/p/kubernetes-cost-optimization</guid><dc:creator><![CDATA[Nibir Bora]]></dc:creator><pubDate>Tue, 27 May 2025 18:52:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GOI-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!GOI-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png" alt="">
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:363389,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.cleancompute.net/i/164579601?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GOI-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png 424w, https://substackcdn.com/image/fetch/$s_!GOI-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png 848w, https://substackcdn.com/image/fetch/$s_!GOI-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png 1272w, https://substackcdn.com/image/fetch/$s_!GOI-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e48cce5-352d-459f-9411-2a423e0e57f3_4794x2578.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure: A multi-dimensional Kubernetes cost optimization model.</figcaption></figure></div><p>Are your cloud bills spiraling out of control? You're not alone. 84% of organizations struggle to manage their cloud spend<a href="https://www.flexera.com/about-us/press-center/new-flexera-report-finds-84-percent-of-organizations-struggle-to-manage-cloud-spend"><sup>1</sup></a>. 
69% of IT organizations experience budget overruns<a href="https://www.gartner.com/peer-community/oneminuteinsights/omi-keeping-cloud-costs-check-it-leader-perspectives-rfz"><sup>2</sup></a>. A whopping 28% of cloud spending is wasted annually, and an astonishing 70% of companies are unsure about their exact spend on cloud<a href="https://www.g2.com/articles/cloud-cost-management-statistics"><sup>3</sup></a>.</p><p>As infrastructure scales, cloud compute costs can quickly snowball, especially when <a href="https://techblog.cloudkitchens.com/p/managing-100s-of-kubernetes-clusters">operating 100s of Kubernetes clusters</a>. In this blog we lay out 11 actionable strategies to improve compute utilization on Kubernetes.</p><p>Keep these principles in mind as you craft your own cost optimization blueprint:</p>
<ol><li><p><strong>There's no single magic bullet</strong>. Optimizing costs extends beyond implementing clever technical solutions; it demands understanding and navigating the nuances and operational realities of your applications and underlying infrastructure.</p></li><li><p><strong>You can't optimize what you don't see</strong>. Before you can effectively control costs, you must be able to accurately measure and attribute where your money is going. Without visibility, optimization efforts are merely guesswork.</p></li><li><p><strong>High utilization doesn&#8217;t always equal low cost</strong>. Don&#8217;t fall for the illusion of high utilization; a tightly packed cluster of expensive nodes is still expensive. Ultimately, cost should remain the overarching measure of success for any optimization effort.</p></li><li><p><strong>Optimization vs. reliability is a delicate dance</strong>. Overly aggressive cost optimization will compromise system stability and performance. Balance cost gains against reliability indicators, and eliminate only true waste.</p></li><li><p><strong>Shift Left by building cost consciousness</strong>. Cost optimization isn&#8217;t solely the responsibility of a central FinOps or platform team. Foster a culture where developers are empowered with cost insights and tools to operate efficient applications.</p></li><li><p><strong>Architect for efficiency early</strong>. While it&#8217;s wise to avoid premature optimization for unconfirmed needs, foundational architecture choices, like multi-tenant clusters or data residency planning, have a long-term impact on your cloud spend.</p></li></ol>
<h1>1. FinOps &amp; Cost Visibility</h1><p>Kubernetes workloads often autoscale and are distributed across diverse node types, especially in shared multi-tenant clusters. This obscures where compute costs originate, prevents identifying inefficiencies, and makes it hard to assign accountability for resources to specific teams or applications. Cloud providers usually have built-in cost dashboards, but they provide visibility only at the VM or node level and lack attribution at the application level.</p><p><strong>Solution</strong>: Adopt cost visibility and attribution for your platform. For simpler deployments, off-the-shelf tools like <a href="https://opencost.io/">OpenCost</a> provide real-time cost allocation by pod, label, namespace, and service, capturing price signals for CPU-seconds, memory bytes, storage IOPS, network egress, etc. For complex, multi-cloud, or customized platforms, consider developing an in-house attribution system by aggregating utilization metrics and reconciling them with cloud provider billing data.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Delayed Adoption</strong>: Cost visibility is frequently an afterthought. The painful result? Substantial, untraceable waste.</p></li><li><p><strong>Umbrella Cost Models</strong>: These rarely work for shared resources (databases, metrics, blob storage, etc.) due to unclear ownership. Define a clear cost model that allocates shared resource costs to teams or applications based on usage or pre-defined criteria.</p></li><li><p><strong>Lack of Governance</strong>: Without continuous audits and automated alerts, optimization efforts can easily regress. Implement regular review processes to flag anomalies based on projected modeling.</p></li></ol>
<h1>2. Custom Scheduler for Bin-Packing</h1><p>The default Kubernetes scheduler distributes workloads uniformly across nodes. This leads to sparse bin-packing and persistent overprovisioning, with unused capacity stranded across many nodes.</p><p><strong>Solution</strong>: Configure a custom Kubernetes scheduler to favor tighter bin-packing using the <code>MostAllocated</code> or <code>RequestedToCapacityRatio</code> scoring strategies, as shown in the sketch at the end of this section. This fills existing nodes more completely before provisioning new ones, boosting resource utilization. Combine with cost-aware scheduling (e.g. prioritizing cheaper node types) when multiple node pools are involved.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Scheduler Unavailability</strong>: A self-managed custom scheduler can become unavailable and leave pods stuck in Pending. Design for high availability, and consider running it in conjunction with the default scheduler with fail-open mechanics.</p></li><li><p><strong>Pods-per-Node Limits</strong>: Cloud providers&#8217; max-pods-per-node limits on larger instance types can restrict how tightly you can bin-pack pods on them. Carefully plan the use of larger instance types, or use a combination of large and small nodes.</p></li></ol>
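<p>To make this concrete, here is a minimal sketch of a scheduler profile that scores nodes by how heavily requested they already are. The profile name <code>bin-packing-scheduler</code> and the weights are our illustrative choices; the <code>NodeResourcesFit</code> plugin and <code>MostAllocated</code> strategy come from the upstream scheduler configuration API:</p>
<pre><code>apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: bin-packing-scheduler   # illustrative name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # prefer nodes that are already heavily requested
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
</code></pre>
<p>Workloads opt in by setting <code>schedulerName: bin-packing-scheduler</code> in their pod spec; everything else keeps using the default scheduler, which doubles as the fail-open path mentioned above.</p>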
<h1>3. Cluster Autoscaler Optimization</h1><p>Default Cluster Autoscaler (CA) settings, like a <code>--scale-down-utilization-threshold</code> of 0.5 (50% node utilization) and a <code>--scale-down-unneeded-time</code> of 10 minutes, lead to slow scale-down operations. In dynamic platforms, this conservative behavior results in nodes lingering longer than necessary, translating into unnecessary costs.</p><p><strong>Solution</strong>: Configure CA to consolidate nodes more aggressively by increasing <code>--scale-down-utilization-threshold</code> (e.g. to 0.7, so nodes are scaled down when utilization falls below 70%) and decreasing <code>--scale-down-unneeded-time</code> (e.g. to 5 minutes) to consolidate nodes faster; see the sketch after this list. Combine this with an appropriate CA expander strategy (e.g. <code>priority</code> or <code>least-waste</code>, including priority configurations that prefer spot node pools) to optimize node selection when multiple node pools are involved. This strategy works with CA alternatives like <a href="https://karpenter.sh/">Karpenter</a> as well.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Scaling Oscillation or Thrashing</strong>: Overly aggressive scale-down can lead to nodes being removed and then immediately added back. This is wasteful, as most cloud providers price nodes per hour. Balance stability by fine-tuning scale-down delay timers like <code>--scale-down-delay-after-add</code> and <code>--scale-down-delay-after-delete</code>.</p></li><li><p><strong>Anti-affinity Rules</strong>: Pods with anti-affinity or topology spread constraints can block node scale-down and prevent the removal of underutilized nodes. Use multi-tenant clusters to alleviate this.</p></li><li><p><strong>Workloads Blocking Eviction</strong>: Workloads configured with PodDisruptionBudgets (PDBs) set to 0 or annotations like <code>safe-to-evict: false</code> explicitly block node scale-down by preventing pods from being drained from a node.</p></li></ol>
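<p>For reference, a minimal sketch of what these flags might look like on the cluster-autoscaler container. The flags are real CA flags; the values are illustrative starting points, not recommendations:</p>
<pre><code>command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.7   # consolidate nodes below 70% utilization
  - --scale-down-unneeded-time=5m            # mark nodes unneeded after 5 minutes
  - --scale-down-delay-after-add=10m         # guard against scale-up/scale-down thrash
  - --scale-down-delay-after-delete=10s
  - --expander=priority                      # pick node groups by configured priority
</code></pre>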
<h1>4. Spot Nodes for Ephemeral Workloads</h1><p>Short-lived ephemeral workloads like analytics pipelines, batch jobs, and data ingestion agents frequently spin up and spin down. Their constant churn keeps nodes busy just long enough to prevent Cluster Autoscaler from scaling them down. Running these workloads on on-demand nodes doesn&#8217;t make financial sense.</p><p><strong>Solution</strong>: Schedule ephemeral workloads onto spot nodes, e.g. with the taint-and-toleration sketch at the end of this section. Spot nodes are significantly cheaper (at least 60% cheaper on AWS EKS, GCP GKE, and Azure AKS) than on-demand instances. Also use spot nodes for non-business-critical workloads where occasional interruptions are acceptable.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Sudden Eviction</strong>: Cloud providers reclaim spot nodes with short notice (e.g. 30 seconds on Microsoft Azure). Ensure workloads are fault tolerant and can withstand abrupt termination.</p></li><li><p><strong>Non-Standard Shutdowns</strong>: Some cloud providers may not follow standard Kubernetes shutdown processes (e.g. sending SIGTERM). See <a href="https://techblog.cloudkitchens.com/i/145625273/handling-abrupt-spot-node-preemptions">this blog</a> that describes hardening spot nodes with custom tooling.</p></li><li><p><strong>Spot Capacity Shortages</strong>: Spot availability fluctuates. De-risk by diversifying your spot instance types. Combine with a baseline of on-demand instances for stability, allowing workloads to spill over to on-demand nodes only when spot capacity is unavailable. Configure the Cluster Autoscaler expander to prioritize scaling up spot nodes before others.</p></li></ol>
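<p>A minimal sketch of steering a batch job onto spot capacity, assuming your spot node pool carries a <code>node-lifecycle=spot:NoSchedule</code> taint and a matching <code>node-lifecycle: spot</code> label. The taint and label names, the job, and the image are hypothetical; the exact keys vary by provider and setup:</p>
<pre><code>apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-ingest              # hypothetical workload
spec:
  backoffLimit: 10
  template:
    spec:
      nodeSelector:
        node-lifecycle: spot        # only land on spot nodes (assumed label)
      tolerations:
        - key: node-lifecycle       # tolerate the spot pool's taint (assumed key)
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: ingest
          image: registry.example.com/ingest:latest   # hypothetical image
      restartPolicy: OnFailure      # re-run if a spot reclaim kills the pod
</code></pre>
<p>Swapping the hard <code>nodeSelector</code> for a preferred <code>nodeAffinity</code> term gives the spill-over-to-on-demand behavior described above.</p>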
<h1>5. Automatic Request Tuning</h1><p>Manually configuring CPU and memory requests and limits for pods is inefficient. Developers often err on the side of overprovisioning, driven by concerns about performance, stability, and the potential impact of noisy neighbors. Furthermore, initial resource requests are rarely audited or adjusted over time as application needs evolve. This leads to wasted resources.</p><p><strong>Solution</strong>: Automate resource request setting and tuning. Continuously monitor actual workload usage and dynamically adjust requests and limits to match consumption patterns. The Kubernetes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) cannot be used in conjunction on the same resource metrics (CPU or memory). But you can leverage VPA recommendations to tune requests alongside HPA, paired with cool-off logic to prevent thrashing; see the sketch after this list. Note that newer VPA versions support an <code>Initial</code> mode, where resource requests are assigned on pod creation and never changed later.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Phased Rollout and Escape Hatches</strong>: Implement with a fail-open approach, and provide clear escape hatches for developers to temporarily disable or override automation for specific workloads during critical issues.</p></li><li><p><strong>Developer Control and Trust</strong>: For platforms with hundreds of microservices, foster trust by providing dashboards with tuning recommendations.</p></li><li><p><strong>Resource Use Profiles</strong>: Craft tuning profiles (e.g. guaranteed, burstable) to cater to diverse workload consumption patterns and criticality.</p></li><li><p><strong>Application of Scaled-Down Requests</strong>: If tuned-down requests are applied only at deploy time, it can confuse developers troubleshooting new changes. Implement safe, programmatic restarts that attribute changes clearly.</p></li></ol>
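<p>A minimal sketch of running VPA in recommendation-only mode, so its suggestions can feed dashboards or custom tuning automation without VPA evicting pods itself. The target deployment name is hypothetical:</p>
<pre><code>apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout        # hypothetical workload
  updatePolicy:
    updateMode: "Off"     # compute recommendations only; never restart pods
</code></pre>
<p>Recommendations then surface in the object&#8217;s status (e.g. via <code>kubectl describe vpa checkout-vpa</code>), a safe starting point before trusting <code>Initial</code> or fully automated modes.</p>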
<h1>6. Custom Metric-Based Autoscaling</h1><p>HPA natively supports scaling based on CPU and memory utilization. However, this is insufficient for applications whose scaling needs are best reflected in business or application-level metrics like requests per second, queue depth, or active connections. Scaling solely on CPU or memory can cause instability and failures, leading to reliability concerns. Consequently, teams often overprovision to handle peak loads.</p><p><strong>Solution</strong>: Enable custom metric-based autoscaling. Deploy a custom metrics adapter (e.g. Prometheus Adapter) that exposes metrics from your monitoring system to the Kubernetes Custom Metrics API (<code>custom.metrics.k8s.io</code>); a sketch follows this list. Alternatively, <a href="https://keda.sh/">KEDA</a> provides flexible, event-driven autoscaling with built-in &#8220;scalers&#8221; for various event sources (message queues, databases, cloud services), including the ability to scale down to zero pods.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Metrics Infrastructure Reliability</strong>: Scaling behavior is tied to the metrics pipeline&#8217;s availability; an outage can lead to incorrect scaling decisions. Ensure high reliability for your metrics infrastructure.</p></li><li><p><strong>Metric Accuracy</strong>: Inaccurate metrics can lead to undesirable scaling behavior (e.g. thrashing or flapping).</p></li><li><p><strong>Single Custom Metrics Server Limitation</strong>: Historically, Kubernetes environments only supported one custom metrics server. Carefully plan your metrics ecosystem.</p></li><li><p><strong>Failure Modes and Defaults</strong>: Design for failure modes. Implement sane default replica counts or fallbacks to ensure application stability during metric outages.</p></li></ol>
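<p>Assuming an adapter already exposes a per-pod <code>http_requests_per_second</code> metric (the metric name, workload, and targets here are illustrative), an HPA scaling on it might look like:</p>
<pre><code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                     # hypothetical workload
  minReplicas: 3                  # sane floor during metric outages
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed to be served by the adapter
        target:
          type: AverageValue
          averageValue: "100"     # target roughly 100 RPS per pod
</code></pre>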
<h1>7. Multi-Tenant Clusters</h1><p>It is common to provision single-tenant Kubernetes clusters per team or application when starting off. This approach, driven by a perceived need for strict isolation or sometimes developer insistence, is a classic recipe for overprovisioning.</p><p><strong>Solution</strong>: Transition to a multi-tenant cluster architecture. Kubernetes offers robust controls for diverse workloads to coexist securely and efficiently on shared hardware (see the quota sketch at the end of this section):</p><ul><li><p><em>Namespaces</em>: Logical tenant isolation within a cluster.</p></li><li><p><em>Resource Requests and Limits</em>: Manage pod resource consumption, preventing individual workloads from monopolizing resources.</p></li><li><p><em>Resource Quotas and LimitRanges</em>: Enforce aggregate resource consumption limits per namespace.</p></li><li><p><em>Role-Based Access Control (RBAC)</em>: Restrict tenant access to only their designated namespaces, ensuring API isolation.</p></li><li><p><em>Network Policies</em>: Provide network isolation by controlling inter-pod communication across namespaces.</p></li><li><p><em>Pod Security Standards (PSS) or Security Contexts</em>: Enforce security hygiene and prevent privilege escalation.</p></li></ul><p>Pitfalls to avoid:</p><ol><li><p><strong>"Noisy Neighbor" Phenomenon</strong>: Workloads can still spike CPU or memory usage, impacting applications co-located on the same nodes. Applications that constantly exceed their requests are problematic, as Kubernetes schedules by resource requests, not real-time use.</p></li><li><p><strong>Network and Disk I/O Bottlenecks</strong>: Kubernetes doesn&#8217;t currently allow specifying network and disk I/O requests. I/O-intensive applications can starve colocated workloads of shared bandwidth on nodes. Implement monitoring and consider pod anti-affinity or topology spread constraints.</p></li></ol>
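<p>A minimal sketch of per-tenant guardrails, assuming a namespace per team; the namespace name and the numbers are illustrative:</p>
<pre><code>apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota
  namespace: team-checkout        # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "40"            # aggregate CPU requests across the namespace
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 320Gi
    persistentvolumeclaims: "20"
</code></pre>
<p>Pairing this with a LimitRange gives every pod in the namespace sane default requests, so one tenant can&#8217;t silently monopolize shared nodes.</p>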
<h1>8. Align Node Shapes with Workload Demands</h1><p>Discrepancies between a node pool&#8217;s CPU:memory ratio and that of its workloads&#8217; consumption lead to resource imbalances. For example, memory-intensive workloads on high-CPU:memory nodes can underutilize CPU while bottlenecking on memory. This leads to more nodes than actually required.</p><p><strong>Solution</strong>: Migrate each node pool to node shapes that close the gap between CPU and memory efficiency. To do so, adjust your node instance types&#8217; CPU:memory ratio to align with aggregate workload consumption, &#931;(CPU used):&#931;(memory used). This will reduce overall compute costs. You can use an off-the-shelf tool like <a href="https://karpenter.sh/">Karpenter</a> to automate this.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>In-place Node Type Changes</strong>: Node instance types cannot be changed in place. Build custom tooling to orchestrate node pool replacement without disrupting workloads. See <a href="https://techblog.cloudkitchens.com/i/142916610/automating-node-pools">this blog</a> for a highly automated way of replacing a node pool.</p></li><li><p><strong>Premature Optimization</strong>: Best executed after workloads are right-sized. Optimizing node shape before workload rightsizing is ineffective, as the CPU and memory efficiency will change later, requiring the node instance type to be replaced again.</p></li><li><p><strong>System Overhead</strong>: Kubernetes components (kubelet, kube-proxy, container runtime) consume a portion of node resources. Account for this overhead when evaluating available CPU and memory, especially on smaller nodes.</p></li></ol>
<h1>9. Enable Controlled Evictions</h1><p>Workloads with PodDisruptionBudgets (PDBs) set to 0 or <code>safe-to-evict: false</code> annotations block Cluster Autoscaler node scale-down operations. These configurations are common for singletons or critical workloads, and are problematic in multi-tenant platforms.</p><p><strong>Solution</strong>: Instate policies that disallow PDBs set to 0 and <code>safe-to-evict: false</code> annotations; prefer budgets that keep a quorum while still permitting evictions, as in the sketch below. For most cases, proper application architecture hygiene (handling SIGTERM, setting <code>terminationGracePeriodSeconds</code>) suffices. For complex workloads, build custom tooling to orchestrate graceful workload eviction by exposing a pre-shutdown hook. For example, in stateful applications like CockroachDB this may execute steps like draining connections, committing transactions, or backing up data.</p><p>Pitfalls to avoid:</p><ol><li><p><strong>Long-Running Jobs</strong>: Long-running workloads with significant internal state, like Apache Flink pipelines, may not recover gracefully from disruptions. Restarts are costly due to re-processing or extended execution times. Isolate such workloads onto a dedicated node pool or cluster, or rely on checkpointing (or similar specialized recovery mechanisms) where supported.</p></li><li><p><strong>Cost of Exceptions</strong>: Strictly control non-evictable workloads by isolating them onto dedicated node pools or clusters. This simplifies cost accounting for these exceptions and prevents them from dragging down overall cluster efficiency.</p></li></ol>
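<p>A minimal sketch of a disruption budget that protects availability without blocking node drains outright; the workload name and replica math are illustrative:</p>
<pre><code>apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb
spec:
  minAvailable: 2        # with 3 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app: payments      # hypothetical workload
</code></pre>
<p>Contrast this with <code>maxUnavailable: 0</code>, which makes every voluntary eviction fail and pins nodes indefinitely.</p>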
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure: Cost optimization strategies applied to different parts of the platform.</figcaption></figure></div><div><hr></div><p><em>Continue reading <a href="https://blog.cleancompute.net/p/kubernetes-cost-optimization-paradox">Part 2: But What About Reliability?</a> of this blog, where we share a case study detailing real-world application of these strategies by an organization operating cloud-scale Kubernetes infrastructure across multiple cloud providers and continents. Get ready for behind-the-scenes war stories and first-hand lessons.</em></p>]]></content:encoded></item></channel></rss>