Kubernetes minReplicas Set Too High: How to Find It and Lower It Safely

stepscale · June 9, 2026

#kubernetes #hpa #minreplicas #autoscaling #cost-optimization

Kubernetes clusters routinely pay for replicas nobody decided to keep. Someone raised an HPA’s minReplicas during an incident or before a launch, the pressure passed, and the number never came back down. The autoscaler can’t scale below the floor, so those replicas run 24/7 - and everything about them looks normal.

This kind of over-provisioning is hard to spot because a workload sitting at its floor produces no alerts, no errors, and no scaling events. The short version of the fix: find HPAs pinned at min with utilization far below target, then halve the gap stepwise, holding each step through a full daily peak. Here’s how to do that without breaking anything.

How minReplicas Ends Up Too High

A too-high floor is almost never a calculation. It’s an artifact of how teams respond to pressure:

The incident bump. A service falls over during a traffic spike, someone raises minReplicas from 3 to 10 as part of the mitigation, and the incident closes. The retro action item says “revisit scaling config”. Nobody does. The real fix was usually elsewhere - a slow scale-up, a missing readiness probe, an undersized CPU request - but the floor stays at 10 because removing it feels like reopening the incident.

Launch-day fear. Before a big release, floors get raised “just for the launch”. The launch goes fine. Six months later the floor is still there, and by now nobody remembers whether it’s load-bearing.

Copy-paste defaults. A new service inherits its HPA manifest from the team’s most critical service, including minReplicas: 8. The new service gets a fraction of the traffic, but it starts life with the same floor.

The underlying problem is an asymmetry: raising a floor is instant, safe, and relieves pressure right now. Lowering one is slow, feels risky, and the reward is a cost number on someone else’s dashboard. So floors only ever ratchet up.

What an Oversized Floor Costs

Here’s the math for an example service. Its pods request 1 vCPU each, the HPA has minReplicas: 10 and a 70% CPU target, but demand never needs more than 4 replicas. The other 6 exist only because of the floor.

On-demand compute in us-east-1 runs $0.096/hour for an m6i.large (2 vCPU, 8 GiB; AWS list price as of June 2026), which works out to $0.048 per vCPU-hour:

6 idle replicas x 1 vCPU x 730 hours/month x $0.048/vCPU-hour
= ~$210/month for one service

That’s one workload. Ten services with a similar gap come to about $2,100/month, roughly $25,000/year, for capacity that demand doesn’t need.

Your prices and gaps will differ. The point is the structure of the math: floor minus actual need, times request size, times every hour of the month. Floors are expensive precisely because they apply 24/7, whether or not anyone is using the service.
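To plug in your own numbers, the same arithmetic fits in a few lines of Python. This is a minimal sketch: the function name and parameters are made up for illustration, and the inputs are the example values from above.

```python
HOURS_PER_MONTH = 730  # same monthly hour count the example above uses


def idle_floor_cost(min_replicas, needed_replicas, vcpu_per_pod, price_per_vcpu_hour):
    """Monthly cost of the replicas that exist only because of the floor."""
    idle = max(min_replicas - needed_replicas, 0)
    return idle * vcpu_per_pod * HOURS_PER_MONTH * price_per_vcpu_hour


# Example service: floor of 10, demand needs 4, 1 vCPU per pod, $0.048 per vCPU-hour.
monthly = idle_floor_cost(min_replicas=10, needed_replicas=4,
                          vcpu_per_pod=1, price_per_vcpu_hour=0.048)
print(f"${monthly:.2f}/month, ${monthly * 12:.2f}/year")
```

Swap in your own request size and per-vCPU price; the structure stays the same.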

Why the Waste Goes Unnoticed

The HPA does report that the floor is binding - but in the least actionable way possible.

The desired replica count is clamped to minReplicas before it’s written to status. If the algorithm computes 3 replicas and the floor is 10, kubectl describe hpa shows desired = 10, current = 10, and no scaling events. What it does show is a ScalingLimited condition with reason TooFewReplicas: proof the floor is binding, but not by how much. A floor that’s one replica too high looks identical to a floor that’s seven too high, and the same condition is true on every deliberately conservative floor, so nobody alerts on it.

Meanwhile, replicas == min looks like the healthy steady state. Dashboards show a flat replica count, nothing flaps, and no one investigates a flat line.

The tell is utilization headroom. If your target is 70% and the workload sits at 12% CPU for hours while parked at the floor, the autoscaler isn’t choosing the replica count - your config is. The fraction of time a workload spends pinned at minReplicas is the fraction of time you, not demand, picked the capacity.
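To make the clamping concrete, here is a rough sketch of the documented HPA formula, desired = ceil(currentReplicas x currentMetric / targetMetric), followed by the min/max clamp. It deliberately ignores tolerance, stabilization windows, and missing-metric handling, and the maxReplicas of 30 is an assumed value, not something from the example service.

```python
import math


def hpa_desired(current_replicas, current_util, target_util, min_replicas, max_replicas):
    """Return (raw, clamped) desired replicas for a CPU utilization target."""
    ratio = current_util / target_util
    raw = math.ceil(current_replicas * ratio)  # what the algorithm computes
    clamped = max(min_replicas, min(raw, max_replicas))  # what lands in status
    return raw, clamped


# 10 replicas at 12% CPU against a 70% target, with a floor of 10.
raw, clamped = hpa_desired(current_replicas=10, current_util=12, target_util=70,
                           min_replicas=10, max_replicas=30)
print(f"algorithm wants {raw}, status reports {clamped}, hidden gap {clamped - raw}")
```

Only the clamped value is visible in status; the raw value is the part you have to reconstruct from utilization.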

Finding the Candidates

Start with a cluster-wide snapshot. This works for HPAs with a CPU resource metric (the common case):

kubectl get hpa -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,MIN:.spec.minReplicas,CURRENT:.status.currentReplicas,TARGET%:.spec.metrics[0].resource.target.averageUtilization,CPU%:.status.currentMetrics[0].resource.current.averageUtilization'

You’re looking for rows where CURRENT equals MIN and CPU% is far below TARGET%. To skip eyeballing columns, filter on the ScalingLimited condition instead; with jq installed, this lists every HPA whose floor is binding right now:

kubectl get hpa -A -o json | jq -r '.items[]
  | select(.status.conditions[]? | .type == "ScalingLimited" and .status == "True" and .reason == "TooFewReplicas")
  | .metadata.namespace + "/" + .metadata.name'
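If you would rather apply both checks (at the floor, and far below target) in one pass, a small script over the same JSON can do it. The sketch below is an illustration under assumptions: it expects autoscaling/v2 field names, the 0.5 headroom ratio is an arbitrary stand-in for "far below target", and the sample objects are invented so the block runs without a cluster.

```python
def hpa(ns, name, min_r, cur_r, target, cpu):
    """Build a minimal HPA object with one CPU utilization metric (sample data)."""
    return {
        "metadata": {"namespace": ns, "name": name},
        "spec": {"minReplicas": min_r, "metrics": [{"type": "Resource", "resource": {
            "name": "cpu", "target": {"type": "Utilization", "averageUtilization": target}}}]},
        "status": {"currentReplicas": cur_r, "currentMetrics": [{"type": "Resource", "resource": {
            "name": "cpu", "current": {"averageUtilization": cpu}}}]},
    }


def cpu_utilization(entries, section):
    """Return averageUtilization for the CPU resource metric, or None."""
    for m in entries or []:
        res = m.get("resource", {})
        if m.get("type") == "Resource" and res.get("name") == "cpu":
            return res.get(section, {}).get("averageUtilization")
    return None


def pinned_candidates(hpa_list, headroom_ratio=0.5):
    """HPAs sitting at their floor with CPU% below headroom_ratio * target%."""
    out = []
    for item in hpa_list["items"]:
        spec, status = item["spec"], item.get("status", {})
        target = cpu_utilization(spec.get("metrics"), "target")
        current = cpu_utilization(status.get("currentMetrics"), "current")
        if target is None or current is None:
            continue  # no CPU utilization metric, skip rather than guess
        at_floor = status.get("currentReplicas") == spec.get("minReplicas", 1)
        if at_floor and current < headroom_ratio * target:
            out.append(f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}')
    return out


# Hypothetical sample: checkout is pinned and idle, search is scaling normally.
SAMPLE = {"items": [hpa("prod", "checkout", 10, 10, 70, 12),
                    hpa("prod", "search", 3, 6, 70, 65)]}
print(pinned_candidates(SAMPLE))
```

Against a real cluster, you would replace SAMPLE with json.load(sys.stdin) and pipe in kubectl get hpa -A -o json.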

A point-in-time snapshot can mislead, though - you might be looking during a lull. Confirm with kubectl top pods against the pods’ requests, then check the pattern over time. If you run Prometheus with kube-state-metrics, two queries give you the full picture.

What fraction of the last two weeks was this HPA pinned at its floor?

avg_over_time(
  (
    kube_horizontalpodautoscaler_status_current_replicas
    == bool
    kube_horizontalpodautoscaler_spec_min_replicas
  )[14d:5m]
)

A result near 1.0 means the floor, not the autoscaler, has been setting capacity essentially all the time.

What did p95 utilization against requests actually look like? For a deployment named checkout in prod:

quantile_over_time(0.95,
  (
    100 *
    sum(rate(container_cpu_usage_seconds_total{namespace="prod", pod=~"checkout-.*", container!="", container!="POD"}[5m]))
    /
    sum(kube_pod_container_resource_requests{namespace="prod", pod=~"checkout-.*", resource="cpu"})
  )[14d:5m]
)

A result of 12 means p95 utilization was 12% of requested CPU. One caveat: pod=~"checkout-.*" also matches sibling workloads like checkout-worker, so tighten the regex if you have them (or join on kube_pod_owner for exactness).

A workload that spends most of the window at its floor and whose p95 utilization sits well below the HPA target is a candidate. Both conditions matter: time-at-floor alone might just mean a well-sized floor, and low p95 utilization alone might hide short daily peaks that genuinely need the capacity.
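To see how a percentile can hide a short peak, here is a small sketch that computes both numbers from raw samples. Everything in it is illustrative: the sample window is invented, and the quantile helper is a plain linear-interpolation version rather than a guaranteed match for what Prometheus computes.

```python
def quantile(values, q):
    """Linear-interpolation quantile over a list of samples."""
    s = sorted(values)
    rank = q * (len(s) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


def summarize(samples, min_replicas):
    """samples: list of (replicas, cpu_percent_of_request) pairs."""
    at_floor = sum(1 for r, _ in samples if r == min_replicas) / len(samples)
    utils = [u for _, u in samples]
    return at_floor, quantile(utils, 0.95), max(utils)


# Hypothetical window: 95 quiet samples at the floor of 10 and 10% CPU,
# plus 5 peak samples where the HPA scaled to 14 replicas at 75% CPU.
samples = [(10, 10.0)] * 95 + [(14, 75.0)] * 5
at_floor, p95, peak = summarize(samples, min_replicas=10)
print(f"time at floor {at_floor:.2f}, p95 {p95:.2f}%, max {peak:.1f}%")
```

In this made-up window the workload looks like a strong candidate on both measures, yet the maximum is far above p95. That is one reason to treat the queries as a shortlist and confirm with the stepwise reduction.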

Lowering the Floor Safely

Don’t jump straight from 10 to the computed optimum. The history that put the floor at 10 might encode something your two-week query window didn’t see. Lower it stepwise instead:

Halve the gap. If the floor is 10 and the data says 4, go to 7 first. Each step is small enough to revert without drama.
Define the rollback before you change anything. Note the old value in the change ticket. The rollback is one command: kubectl patch hpa checkout -n prod -p '{"spec":{"minReplicas":10}}'. Anyone on call should be able to run it without context.
Pick your health signals up front. p95 latency, error rate, and CPU relative to the HPA target are the usual three. Decide what “degraded” means before the change, not while staring at a graph afterwards.
Hold each step through at least one full daily peak. A floor change that looks fine at 11pm can hurt at the 9am ramp. Treat each step as being on probation until it has survived the busiest window of the day.
Check the real scale-up path. With a lower floor, the morning ramp now requires actual scaling instead of pre-paid headroom. The HPA’s scale-up defaults are aggressive (add 100% or 4 pods every 15 seconds, whichever is more), so the usual bottlenecks are elsewhere: metric lag, pod startup time, and node provisioning. Verify nobody configured a conservative behavior.scaleUp - we covered the behavior block in Kubernetes HPA vs KEDA.
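The halving schedule and its rollback commands can be generated up front and pasted into the change ticket. A minimal sketch, assuming odd gaps round the step up (the text above only fixes the first step, 10 to 7) and reusing the checkout/prod names from the example:

```python
import json
import math


def floor_steps(current_floor, target_floor):
    """Successive floors, closing half of the remaining gap at each step."""
    steps = []
    floor = current_floor
    while floor > target_floor:
        floor -= math.ceil((floor - target_floor) / 2)  # rounding up is an assumption
        steps.append(floor)
    return steps


def patch_command(name, namespace, floor):
    """kubectl command that sets minReplicas to the given floor."""
    body = json.dumps({"spec": {"minReplicas": floor}}, separators=(",", ":"))
    return f"kubectl patch hpa {name} -n {namespace} -p '{body}'"


previous = 10
for step in floor_steps(10, 4):
    print(f"apply:    {patch_command('checkout', 'prod', step)}")
    print(f"rollback: {patch_command('checkout', 'prod', previous)}")
    previous = step
```

Each printed pair is one step: apply it, hold it through a full daily peak, and keep the rollback line next to it.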

Two constraints bound how low you can go regardless of utilization data. If a PodDisruptionBudget sets minAvailable: 2, a floor of 2 or lower leaves no pod that can be evicted while the workload sits at its floor, so node drains will block. And if you rely on spreading replicas across three zones for availability, a floor below 3 quietly gives that up. Check both before any reduction - utilization is not the only input to a floor.
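Those bounds can be folded into the target before you plan any steps. The sketch below is an assumption-laden illustration: it handles only an integer minAvailable (not percentages or maxUnavailable), and it adds one to minAvailable because a PDB permits a voluntary eviction only while healthy pods exceed that number.

```python
def lowest_safe_floor(data_floor, pdb_min_available=None, zones=None):
    """Lowest floor that respects the utilization data, the PDB, and zone spread."""
    bounds = [data_floor, 1]
    if pdb_min_available is not None:
        bounds.append(pdb_min_available + 1)  # leave one pod evictable at the floor
    if zones is not None:
        bounds.append(zones)  # at least one replica per zone
    return max(bounds)


# Data says 4, PDB minAvailable is 2, replicas spread over 3 zones.
print(lowest_safe_floor(4, pdb_min_available=2, zones=3))
# Data says 1, same constraints: the zone spread becomes the binding bound.
print(lowest_safe_floor(1, pdb_min_available=2, zones=3))
```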

When a High Floor Is Right

Not every pinned-at-floor workload is waste. Three patterns justify deliberate headroom:

Slow pod startup plus a sharp daily ramp. If pods take 3 minutes to become ready and traffic doubles in 5, a higher floor before the ramp is rational. The better fix is usually a scheduled floor - raised before the known peak, lowered after - rather than paying for the peak 24/7. KEDA’s cron scaler does exactly this.
Failover headroom. Running 4 replicas where 2 would do, spread across zones, so a zone loss doesn’t degrade service. That’s an availability decision, not an autoscaling mistake.
Warm-pool semantics. JIT-heavy runtimes and cache-warming services genuinely degrade when a cold replica joins; their floors price in measured warmup cost.
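For the scheduled-floor pattern, a quick comparison of replica-hours shows what the schedule buys. This is plain arithmetic, not KEDA configuration, and the 08:00 to 11:00 window and the floors of 4 and 10 are invented for illustration.

```python
def floor_at(hour, base_floor, peak_floor, peak_start, peak_end):
    """Floor in effect at a given hour of the day (0-23)."""
    return peak_floor if peak_start <= hour < peak_end else base_floor


def daily_replica_hours(base_floor, peak_floor, peak_start, peak_end):
    """Replica-hours per day the floor alone guarantees you pay for."""
    return sum(floor_at(h, base_floor, peak_floor, peak_start, peak_end)
               for h in range(24))


always_on = daily_replica_hours(10, 10, 0, 24)  # floor of 10 around the clock
scheduled = daily_replica_hours(4, 10, 8, 11)   # floor of 10 only from 08:00 to 11:00
print(f"always-on: {always_on} replica-hours/day, scheduled: {scheduled}")
```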

A justified floor has a reason someone can state today. An unjustified one has a reason nobody remembers. If you can’t say what a floor protects against, the queries above will tell you what it costs to keep.

What to Do Next

Run the custom-columns snapshot and list every HPA sitting at its floor with utilization far below target
Confirm each candidate over two weeks with the time-at-floor and p95 utilization queries
Check PDB and zone-spread constraints, then halve the gap and hold the new floor through at least one full daily peak
Re-run the audit quarterly - floors ratchet up on their own; nothing ratchets them down

Or let the audit run continuously: stepscale watches your HPAs and flags floors with sustained utilization headroom as reviewable recommendations - recommend-only by default, running entirely in your cluster.

Kubernetes HPA vs KEDA - Which autoscaler fits your workloads, and how to configure HPA’s behavior block properly
How to Reduce AWS ECS Costs - Cost optimization strategies that apply across ECS and Kubernetes