Kubernetes clusters routinely pay for replicas nobody decided to keep. Someone raised an HPA’s minReplicas during an incident or before a launch, the pressure passed, and the number never came back down. The autoscaler can’t scale below the floor, so those replicas run 24/7 - and everything about them looks normal.
This kind of over-provisioning is hard to spot because a workload sitting at its floor produces no alerts, no errors, and no scaling events. The short version of the fix: find HPAs pinned at min with utilization far below target, then halve the gap stepwise, holding each step through a full daily peak. Here’s how to do that without breaking anything.
A too-high floor is almost never a calculation. It’s an artifact of how teams respond to pressure:
The incident bump. A service falls over during a traffic spike, someone raises minReplicas from 3 to 10 as part of the mitigation, and the incident closes. The retro action item says “revisit scaling config”. Nobody does. The real fix was usually elsewhere - a slow scale-up, a missing readiness probe, an undersized CPU request - but the floor stays at 10 because removing it feels like reopening the incident.
Launch-day fear. Before a big release, floors get raised “just for the launch”. The launch goes fine. Six months later the floor is still there, and by now nobody remembers whether it’s load-bearing.
Copy-paste defaults. A new service inherits its HPA manifest from the team’s most critical service, including minReplicas: 8. The new service gets a fraction of the traffic, but it starts life with the same floor.
The underlying problem is an asymmetry: raising a floor is instant, safe, and relieves pressure right now. Lowering one is slow, feels risky, and the reward is a cost number on someone else’s dashboard. So floors only ever ratchet up.
Here’s the math for an example service. Its pods request 1 vCPU each, the HPA has minReplicas: 10 and a 70% CPU target, but demand never needs more than 4 replicas. The other 6 exist only because of the floor.
On-demand compute in us-east-1 runs $0.096/hour for an m6i.large (2 vCPU, 8 GiB; AWS list price as of June 2026), which works out to $0.048 per vCPU-hour:
6 idle replicas x 1 vCPU x 730 hours/month x $0.048/vCPU-hour
= ~$210/month for one service
That’s one workload. Ten services with a similar gap come to about $2,100/month, roughly $25,000/year, for capacity that demand never uses.
Your prices and gaps will differ. The point is the structure of the math: floor minus actual need, times request size, times every hour of the month. Floors are expensive precisely because they apply 24/7, whether or not anyone is using the service.
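To plug in your own numbers, the same arithmetic can be written as a small function. This is a minimal sketch: the function and parameter names are illustrative, and it prices only the CPU request at a flat per-vCPU rate, as the example above does.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as in the example above


def idle_floor_cost(min_replicas: int, needed_replicas: int,
                    vcpu_per_pod: float, price_per_vcpu_hour: float) -> float:
    """Monthly cost of the replicas that exist only because of the floor."""
    # A floor at or below real need wastes nothing.
    idle = max(min_replicas - needed_replicas, 0)
    return idle * vcpu_per_pod * HOURS_PER_MONTH * price_per_vcpu_hour


if __name__ == "__main__":
    monthly = idle_floor_cost(min_replicas=10, needed_replicas=4,
                              vcpu_per_pod=1.0, price_per_vcpu_hour=0.048)
    print(f"one service:  ${monthly:,.2f}/month")
    print(f"ten services: ${10 * monthly * 12:,.0f}/year")
```

With the example's inputs this reproduces the roughly $210/month figure. Swap in your own request size and blended price per vCPU-hour.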
The HPA does report that the floor is binding - but in the least actionable way possible.
The desired replica count is clamped to minReplicas before it’s written to status. If the algorithm computes 3 replicas and the floor is 10, kubectl describe hpa shows desired = 10, current = 10, and no scaling events. What it does show is a ScalingLimited condition with reason TooFewReplicas: proof the floor is binding, but not by how much. A floor that’s one replica too high looks identical to a floor that’s seven too high, and the same condition is true on every deliberately conservative floor, so nobody alerts on it.
Meanwhile, replicas == min looks like the healthy steady state. Dashboards show a flat replica count, nothing flaps, and no one investigates a flat line.
The tell is utilization headroom. If your target is 70% and the workload sits at 12% CPU for hours while parked at the floor, the autoscaler isn’t choosing the replica count - your config is. The fraction of time a workload spends pinned at minReplicas is the fraction of time you, not demand, picked the capacity.
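The clamping can be sketched in a few lines. This is a simplified model of the HPA's core ratio for one utilization metric. It ignores the tolerance band, unready pods, stabilization windows, and multiple metrics, and the maxReplicas value of 30 is made up for the example.

```python
import math


def unclamped_desired(current_replicas: int, current_util: float,
                      target_util: float) -> int:
    """Core HPA ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(raw, 1)  # simplification: never model a scale to zero


def reported_desired(unclamped: int, min_replicas: int, max_replicas: int) -> int:
    """The value left after clamping to the configured bounds."""
    return min(max(unclamped, min_replicas), max_replicas)


# 10 replicas idling at 12% CPU against a 70% target, floor of 10.
raw = unclamped_desired(current_replicas=10, current_util=12, target_util=70)
shown = reported_desired(raw, min_replicas=10, max_replicas=30)  # max is hypothetical
hidden_gap = shown - raw  # replicas the floor adds beyond what the ratio asks for

print(f"algorithm wants {raw}, status shows {shown}, hidden gap {hidden_gap}")
```

Only the clamped value reaches status, so the gap has to be recomputed from utilization, which is what the queries further down do.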
Start with a cluster-wide snapshot. This works for HPAs with a CPU resource metric (the common case):
kubectl get hpa -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MIN:.spec.minReplicas,CURRENT:.status.currentReplicas,TARGET%:.spec.metrics[0].resource.target.averageUtilization,CPU%:.status.currentMetrics[0].resource.current.averageUtilization'
You’re looking for rows where CURRENT equals MIN and CPU% is far below TARGET%. The ScalingLimited condition catches every HPA whose floor is binding right now, without eyeballing columns:
kubectl get hpa -A -o json | jq -r '.items[]
  | select(.status.conditions[]? | .type == "ScalingLimited" and .status == "True" and .reason == "TooFewReplicas")
  | .metadata.namespace + "/" + .metadata.name'
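If you would rather apply both checks in one pass (at the floor, and utilization far below target), the JSON from kubectl get hpa -A -o json can be filtered in a short script. The sketch below assumes autoscaling/v2 objects with a CPU resource metric. The function names, the 0.5 threshold, and the sample data are illustrative, not part of any tool.

```python
def cpu_utilization(hpa: dict):
    """Return (target %, current %) for the CPU resource metric, or None."""
    target = current = None
    for m in hpa.get("spec", {}).get("metrics") or []:
        if m.get("type") == "Resource" and m["resource"]["name"] == "cpu":
            target = m["resource"]["target"].get("averageUtilization")
    for m in hpa.get("status", {}).get("currentMetrics") or []:
        if m.get("type") == "Resource" and m["resource"]["name"] == "cpu":
            current = m["resource"]["current"].get("averageUtilization")
    if target is None or current is None:
        return None
    return target, current


def pinned_with_headroom(hpa_list: dict, max_fraction_of_target: float = 0.5) -> list:
    """Names of HPAs sitting at minReplicas with CPU far below target."""
    flagged = []
    for hpa in hpa_list.get("items", []):
        at_floor = (hpa.get("status", {}).get("currentReplicas")
                    == hpa["spec"].get("minReplicas", 1))
        util = cpu_utilization(hpa)
        if not at_floor or util is None:
            continue
        target, current = util
        # The threshold is an assumption: flag when usage is under half of target.
        if current <= max_fraction_of_target * target:
            meta = hpa["metadata"]
            flagged.append(f'{meta["namespace"]}/{meta["name"]}')
    return flagged


def _sample(name: str, min_r: int, cur_r: int, cpu: int) -> dict:
    """Build a made-up HPA object shaped like kubectl's JSON output."""
    return {
        "metadata": {"namespace": "prod", "name": name},
        "spec": {"minReplicas": min_r, "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu",
                         "target": {"type": "Utilization", "averageUtilization": 70}}}]},
        "status": {"currentReplicas": cur_r, "currentMetrics": [{
            "type": "Resource",
            "resource": {"name": "cpu", "current": {"averageUtilization": cpu}}}]},
    }


SAMPLE_HPAS = {"items": [_sample("checkout", 10, 10, 12), _sample("search", 3, 3, 61)]}
print(pinned_with_headroom(SAMPLE_HPAS))
```

Against a real cluster, replace SAMPLE_HPAS with the parsed kubectl output, for example by piping it in and reading it with json.load(sys.stdin).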
A point-in-time snapshot can mislead, though - you might be looking during a lull. Confirm with kubectl top pods against the pods’ requests, then check the pattern over time. If you run Prometheus with kube-state-metrics, two queries give you the full picture.
What fraction of the last two weeks was this HPA pinned at its floor?
avg_over_time(
  (
    kube_horizontalpodautoscaler_status_current_replicas
    == bool
    kube_horizontalpodautoscaler_spec_min_replicas
  )[14d:5m]
)
A result near 1.0 means the floor, not the autoscaler, has been setting capacity essentially all the time.
What did p95 utilization against requests actually look like? For a deployment named checkout in prod:
quantile_over_time(0.95,
  (
    100 *
    sum(rate(container_cpu_usage_seconds_total{namespace="prod", pod=~"checkout-.*", container!="", container!="POD"}[5m]))
    /
    sum(kube_pod_container_resource_requests{namespace="prod", pod=~"checkout-.*", resource="cpu"})
  )[14d:5m]
)
A result of 12 means p95 utilization was 12% of requested CPU. One caveat: pod=~"checkout-.*" also matches sibling workloads like checkout-worker, so tighten the regex if you have them (or join on kube_pod_owner for exactness).
A workload that spends most of the window at its floor and whose p95 utilization sits well below the HPA target is a candidate. Both conditions matter: time-at-floor alone might just mean a well-sized floor, and low average utilization alone might hide short daily peaks that genuinely need the capacity.
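The two-condition rule is easy to encode once you have the query results. In this sketch the thresholds (90% of the window at the floor, p95 at or below half the target) are assumptions to tune, not recommendations from any tool, and the helper names are made up.

```python
from typing import Sequence


def fraction_at_floor(current_replicas: Sequence[int], min_replicas: int) -> float:
    """Mean of a 0/1 'pinned' indicator, the same idea as the first query."""
    if not current_replicas:
        raise ValueError("need at least one sample")
    pinned = [1 if r == min_replicas else 0 for r in current_replicas]
    return sum(pinned) / len(pinned)


def is_candidate(time_at_floor: float, p95_util: float, target_util: float,
                 min_time_at_floor: float = 0.9,
                 max_util_ratio: float = 0.5) -> bool:
    """Both conditions must hold: mostly pinned AND p95 well below target."""
    mostly_pinned = time_at_floor >= min_time_at_floor
    well_below_target = p95_util <= max_util_ratio * target_util
    return mostly_pinned and well_below_target


samples = [10] * 19 + [12]  # made-up replica counts: pinned for 19 of 20 samples
share = fraction_at_floor(samples, min_replicas=10)

print(is_candidate(share, p95_util=12, target_util=70))  # pinned and idle
print(is_candidate(share, p95_util=60, target_util=70))  # pinned, but the floor is well sized
print(is_candidate(0.3, p95_util=12, target_util=70))    # idle, but the autoscaler is in charge
```

Requiring both conditions keeps well-sized floors and genuinely bursty workloads off the list.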
Don’t jump straight from 10 to the computed optimum. The history that put the floor at 10 might encode something your two-week query window didn’t see. Lower it stepwise instead:
Keep a rollback command ready before the first step, one that restores the original floor:
kubectl patch hpa checkout -n prod -p '{"spec":{"minReplicas":10}}'
Anyone on call should be able to run it without context.
behavior.scaleUp - we covered the behavior block in Kubernetes HPA vs KEDA.
Two constraints bound how low you can go regardless of utilization data. If a PodDisruptionBudget sets minAvailable: 2, a floor of 1 will block node drains. And if you rely on spreading replicas across three zones for availability, a floor below 3 quietly gives that up. Check both before any reduction - utilization is not the only input to a floor.
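A stepwise reduction bounded by those constraints can be planned ahead of time. The sketch below halves the remaining gap at each step and rounds the step down, which is a conservative reading of "halve the gap". The hard_floor parameter is a hypothetical input you set from your own PDB and zone-spread requirements, and the generated commands mirror the patch command above.

```python
import json


def step_down_schedule(current_floor: int, observed_need: int,
                       hard_floor: int = 1) -> list:
    """Floors to apply in order, halving the remaining gap at each step."""
    goal = max(observed_need, hard_floor)  # never plan below PDB or zone limits
    schedule = []
    floor = current_floor
    while floor > goal:
        gap = floor - goal
        floor -= max(gap // 2, 1)  # half the gap, but always at least one replica
        schedule.append(floor)
    return schedule


def patch_command(name: str, namespace: str, floor: int) -> str:
    """kubectl command that sets minReplicas to the given floor."""
    body = json.dumps({"spec": {"minReplicas": floor}}, separators=(",", ":"))
    return f"kubectl patch hpa {name} -n {namespace} -p '{body}'"


if __name__ == "__main__":
    # Example service: floor 10, demand needs 4, three zones to cover.
    for step in step_down_schedule(current_floor=10, observed_need=4, hard_floor=3):
        print(patch_command("checkout", "prod", step))
```

Apply one line at a time and hold each floor through a full daily peak before moving to the next.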
Not every pinned-at-floor workload is waste. Three patterns justify deliberate headroom:
A justified floor has a reason someone can state today. An unjustified one has a reason nobody remembers. If you can’t say what a floor protects against, the queries above will tell you what it costs to keep.
Or let the audit run continuously: stepscale watches your HPAs and flags floors with sustained utilization headroom as reviewable recommendations - recommend-only by default, running entirely in your cluster.