ECS Autoscaling Best Practices That Actually Work
AWS ECS autoscaling sounds simple on paper - set a target, let AWS handle the rest. In practice, most teams end up with services that scale too slowly, overshoot on the way up, and refuse to scale down. Here's what we've learned running ECS autoscaling in production across dozens of services.
The short version: target tracking works for request-driven services, step scaling gives you control for batch workloads, and queue-based scaling is the only sane option if your services consume from SQS, Kafka, or similar queues.
The Three Autoscaling Approaches
ECS supports three autoscaling policy types. Each one fits a different workload shape.
Target Tracking
You pick a metric (CPU, memory, ALB request count) and a target value. AWS figures out how many tasks you need. This is the default recommendation from AWS, and it works well for HTTP services behind a load balancer.
```yaml
# CloudFormation example
ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: TargetTrackingScaling
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 60.0
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
```
The 60% CPU target is a good starting point. We've seen teams set it to 80% thinking they'll save money - then wonder why their p99 latency spikes during scale-out because there's no headroom left.
When to use it: API services, web backends, anything where CPU or request count correlates directly with load.
When it fails: Batch processors, queue consumers, services where CPU doesn't reflect actual workload. A service consuming SQS messages might sit at 10% CPU while a queue of 50,000 messages builds up.
Step Scaling
You define specific thresholds and how many tasks to add or remove at each level. More work to configure, but you get precise control.
```yaml
StepScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyType: StepScaling
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: 1000
          ScalingAdjustment: 2
        - MetricIntervalLowerBound: 1000
          MetricIntervalUpperBound: 5000
          ScalingAdjustment: 5
        - MetricIntervalLowerBound: 5000
          ScalingAdjustment: 10
      Cooldown: 60
```
The config above says: if the metric crosses the alarm threshold by 0-1000, add 2 tasks. By 1000-5000, add 5. Over 5000, add 10. This lets you react proportionally to different levels of load.
When to use it: When you know your scaling curve and want fine-grained control. Works well for workloads with predictable patterns where you've measured the relationship between metric and capacity.
When it fails: When you don't know the right thresholds yet. Getting step scaling wrong means either over-provisioning or being too slow to react.
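To make the boundary logic concrete, here's a small Python sketch of how a step policy picks its adjustment. `STEPS` mirrors the StepAdjustments configuration above; `tasks_to_add` is a hypothetical helper for illustration, not an AWS API.

```python
# Mirrors the StepAdjustments in the config: (lower, upper, tasks_to_add).
# Bounds are relative to the alarm threshold; None means unbounded above.
STEPS = [(0, 1000, 2), (1000, 5000, 5), (5000, None, 10)]

def tasks_to_add(metric_value, alarm_threshold):
    """Return how many tasks the step policy would add for a given breach."""
    breach = metric_value - alarm_threshold
    if breach < 0:
        return 0  # alarm not breached; no scale-out
    for lower, upper, adjustment in STEPS:
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return 0
```

For an alarm threshold of 10,000 messages, a reading of 13,000 breaches by 3,000 and lands in the middle step, adding 5 tasks.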
Queue-Based Scaling (Custom Metrics)
Neither target tracking nor step scaling works well for queue consumers. The problem is fundamental: CloudWatch CPU metrics tell you how busy your current tasks are, not how much work is waiting.
A service with 5 tasks at 20% CPU and 100,000 messages in the queue needs to scale up aggressively. But target tracking sees 20% CPU and thinks everything is fine.
The fix is scaling based on queue depth directly:
```python
import math

# Calculate desired tasks based on queue depth
messages_in_queue = 45000
messages_per_task = 1000       # each task processes ~1000 msg/min (measured)
min_tasks, max_tasks = 2, 50   # your service's capacity bounds

desired_tasks = math.ceil(messages_in_queue / messages_per_task)
desired_tasks = max(min_tasks, min(desired_tasks, max_tasks))  # clamp to bounds
```
This is where tools like fast-autoscaler come in - it's an open-source Lambda function that reads queue metrics and updates your ECS service's desired count directly, bypassing CloudWatch-based autoscaling entirely.
When to use it: Any service consuming from SQS, Kafka, RabbitMQ, Redis queues, or Kinesis streams. Also useful for services processing S3 event notifications.
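A minimal sketch of what such a scaler might look like as a Lambda, assuming SQS and boto3. The queue URL, cluster, and service names are placeholders, and `MSGS_PER_TASK` is a throughput figure you'd measure for your own workers; the decision logic is kept in a pure function so it can be tested without AWS access.

```python
import math

# Assumed placeholders -- substitute values measured for your service.
MSGS_PER_TASK = 1000        # messages one task works through per minute
MIN_TASKS, MAX_TASKS = 2, 50

def desired_count(visible, in_flight=0):
    """Tasks needed for the current backlog, clamped to [MIN_TASKS, MAX_TASKS]."""
    backlog = visible + in_flight
    return max(MIN_TASKS, min(math.ceil(backlog / MSGS_PER_TASK), MAX_TASKS))

def handler(event, context):
    """Lambda entry point: read queue depth, set the service's desired count.
    boto3 is imported here so the pure logic above stays testable offline."""
    import boto3
    sqs, ecs = boto3.client("sqs"), boto3.client("ecs")
    attrs = sqs.get_queue_attributes(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # placeholder
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    ecs.update_service(
        cluster="my-cluster", service="my-worker",  # placeholders
        desiredCount=desired_count(
            int(attrs["ApproximateNumberOfMessages"]),
            int(attrs["ApproximateNumberOfMessagesNotVisible"])),
    )
```

Run it on a one-minute EventBridge schedule and it reacts to backlog within a minute, with no CloudWatch alarm in the loop.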
Cooldown Periods: The Most Misconfigured Setting
Cooldowns prevent your service from scaling up and down repeatedly (thrashing). AWS defaults are often too conservative.
Here's what we've found works:
| Scenario | Scale-out cooldown | Scale-in cooldown |
|---|---|---|
| API service (target tracking) | 60s | 300s |
| Queue consumer (step scaling) | 30s | 120s |
| Batch processor | 60s | 600s |
The pattern: scale out fast, scale in slow. When load hits, you want tasks up quickly. When load drops, wait longer to confirm it's actually gone before removing capacity.
A common mistake: setting the scale-in cooldown to 60 seconds. Traffic dips briefly during a lull, tasks get removed, then traffic comes back and you're scrambling to scale up again. Use 300 seconds minimum for scale-in on any production service.
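A toy simulation of that failure mode, with assumed numbers: demand is tasks needed per minute, there's a two-minute lull, and scale-in is modeled (simplified) as firing once demand has stayed below capacity for the cooldown period.

```python
# Tasks needed each minute; a brief two-minute lull in the middle (assumed data).
demand = [6, 6, 6, 6, 6, 2, 2, 6, 6, 6]

def shortfall_minutes(scale_in_delay_min):
    """Minutes the service is under-provisioned, given a scale-in delay.
    Scale-out is assumed to react within a minute; scale-in waits until
    demand has been below capacity for scale_in_delay_min consecutive minutes."""
    tasks, below, shortfall = 6, 0, 0
    for needed in demand:
        if needed > tasks:
            shortfall += 1      # caught short this minute
            tasks = needed      # scale-out catches up
            below = 0
        elif needed < tasks:
            below += 1
            if below >= scale_in_delay_min:
                tasks = needed  # cooldown elapsed; capacity removed
                below = 0
        else:
            below = 0
    return shortfall
```

With a one-minute scale-in delay the lull strips capacity and the rebound is under-provisioned; with a five-minute delay the dip is ridden out with no shortfall.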
Min and Max Task Counts
Setting minCapacity and maxCapacity wrong causes the most common autoscaling failures.
Min tasks too low: Setting min to 1 means a cold start on every traffic spike. If your service takes 45 seconds to start and register with the load balancer, you'll have a full minute of degraded service on every scale-out from 1.
For production services, calculate your minimum based on:
- Baseline traffic during your quietest hour
- How many tasks you need to handle that traffic at 50% capacity (leaving headroom)
If your quietest hour needs 3 tasks at full load, set min to 6 so each task sits around 50% utilization.
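A minimal sketch of that calculation, assuming requests per second is your load measure; `rps_per_task` is a number you've measured for your service, and the floor of 2 (for instance/AZ redundancy) is an assumption, not an AWS requirement.

```python
import math

def min_capacity(baseline_rps, rps_per_task, target_utilization=0.5):
    """Tasks needed so quietest-hour traffic runs each task near target_utilization.
    Floors at 2 so a single task failure doesn't take the service to zero (assumed)."""
    full_load_tasks = baseline_rps / rps_per_task       # tasks if each ran at 100%
    return max(2, math.ceil(full_load_tasks / target_utilization))
```

For a quietest hour of 150 rps with tasks that handle 100 rps each, full load needs 1.5 tasks, so a 50% target gives a minimum of 3.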
Max tasks too low: We've seen teams set max to 20 "to control costs" and then eat a full outage when a marketing campaign drives 10x normal traffic. Your max should be your absolute ceiling based on what your VPC, database connections, and downstream services can handle - not a cost control measure. Use billing alerts for cost control instead.
Max tasks too high: Less common, but setting max to 1000 when your RDS instance can only handle 200 connections means autoscaling could bring down your database. Know your downstream limits.
The Task Startup Time Problem
ECS autoscaling has a built-in delay that most teams underestimate:
- Scaling decision made (0s)
- New task provisioned by ECS (5-15s for Fargate, longer for EC2)
- Container image pulled (10-60s depending on image size)
- Application starts (varies - 5s to 120s)
- Health check passes (your health check interval + 1 successful check)
- ALB target registration (15-30s)
Total: 45 seconds to 4+ minutes from decision to serving traffic.
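Adding up the per-step ranges above makes the budget concrete. The health-check range here is an assumption (a 10-second interval with one passing check on the low end, up to a minute on the high end); the rest are the estimates quoted in the list.

```python
# Best/worst-case scaling latency from the steps above, in seconds.
steps = {
    "task provisioning": (5, 15),
    "image pull":        (10, 60),
    "app start":         (5, 120),
    "health checks":     (10, 60),   # assumed: 10s interval, 1-6 checks
    "ALB registration":  (15, 30),
}
best = sum(lo for lo, _ in steps.values())    # 45 seconds
worst = sum(hi for _, hi in steps.values())   # 285 seconds, ~4.75 minutes
```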
This means your scaling needs to be predictive, not reactive. By the time your new tasks are serving traffic, the spike might already be over.
Fixes:
- Keep images small. A 50MB image pulls in 5 seconds. A 2GB image takes a minute. Use multi-stage builds, alpine base images, and avoid shipping build tools in your production image.
- Reduce health check intervals. If your health check runs every 30 seconds with a 3-success threshold, that's 90 seconds just for health checks. Use 10-second intervals with 2-success threshold for faster registration.
- Pre-warm during known peak hours. If you know traffic spikes at 9am, schedule a scale-up at 8:50am. Use scheduled scaling alongside your dynamic policy.
```yaml
ScheduledAction:
  Type: AWS::ApplicationAutoScaling::ScheduledAction
  Properties:
    ScheduledActionName: morning-warmup
    Schedule: "cron(50 8 ? * MON-FRI *)"
    ScalableTargetAction:
      MinCapacity: 10
```
Scaling Based on Multiple Metrics
A single metric is rarely enough. Your API service might need to scale on both CPU and request count:
- CPU handles compute-heavy requests that process data
- Request count handles lightweight requests that don't use much CPU but need connection slots
ECS lets you attach multiple scaling policies to one service. The policies operate independently - whichever one calls for the most tasks wins. This is the correct behavior: if either metric says you need more capacity, you should get more capacity.
Don't try to build a single composite metric by averaging CPU and request count. It dilutes both signals.
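In numbers, with hypothetical recommendations from each policy:

```python
# Hypothetical per-policy recommendations for the current load.
cpu_policy_wants = 4        # CPU is low: compute-heavy traffic is light
request_policy_wants = 12   # request count is high: connection slots are scarce

# ECS behavior with multiple policies: the highest recommendation wins.
desired = max(cpu_policy_wants, request_policy_wants)       # 12 tasks

# A composite average would under-provision the connection-bound workload.
averaged = (cpu_policy_wants + request_policy_wants) // 2   # 8 tasks -- too few
```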
Monitoring Your Autoscaling
You can't improve what you don't measure. Track these:
- Scaling event frequency: If you're scaling more than 10 times per hour, your thresholds or cooldowns are wrong
- Time between scale-out decision and task serving traffic: This is your actual scaling latency
- Queue depth at scale-out trigger vs peak queue depth: If the queue hits 50k before scaling kicks in but peaks at 200k, you're scaling too late
- Cost per scaling event: Each task-hour costs money. Thrashing wastes it
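A sketch for tracking the first item, scaling event frequency, using the Application Auto Scaling `describe_scaling_activities` API via boto3. The counting helper is kept pure so it runs offline; the cluster and service names are placeholders.

```python
from datetime import datetime, timedelta, timezone

def events_per_hour(activity_start_times, now=None):
    """Count scaling activities in the trailing hour; >10 suggests thrashing."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in activity_start_times if t >= cutoff)

def fetch_activity_times(cluster, service):
    """Pull recent scaling activity timestamps for an ECS service.
    boto3 is imported here so events_per_hour stays testable offline."""
    import boto3
    aas = boto3.client("application-autoscaling")
    resp = aas.describe_scaling_activities(
        ServiceNamespace="ecs",
        ResourceId=f"service/{cluster}/{service}",  # placeholder names
    )
    return [a["StartTime"] for a in resp["ScalingActivities"]]
```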
CloudWatch dashboards work for this, but stepscale AI takes it further by analyzing your scaling patterns over time and automatically tuning your thresholds, cooldowns, and min/max values based on actual workload data. Instead of manually adjusting these numbers, the AI learns your traffic patterns and optimizes the configuration for you.
Common Mistakes We See
1. Using CPU scaling for queue consumers. Already covered this, but it's the #1 mistake. CPU tells you how busy tasks are, not how much work is waiting.
2. Not testing autoscaling before production. Run a load test that simulates your actual traffic pattern. Steady ramp-up, sudden spike, gradual decline. Watch how your scaling responds.
3. Ignoring downstream limits. Your ECS service might scale to 100 tasks, but if your RDS instance only handles 50 connections and each task opens 2, you've just created a database outage. Always check: database connections, API rate limits, NAT gateway bandwidth, and any shared resources.
4. Setting identical scale-out and scale-in thresholds. If you scale out at 60% CPU and scale in at 59% CPU, you'll thrash endlessly. Create a gap: scale out at 70%, scale in at 40%.
5. Forgetting about Fargate spot termination. If you use Fargate Spot for cost savings, your tasks can be interrupted with 30 seconds notice. Your min capacity should use regular Fargate, with spot only for additional capacity. Mix capacity providers:
```yaml
CapacityProviderStrategy:
  - CapacityProvider: FARGATE
    Base: 3
    Weight: 1
  - CapacityProvider: FARGATE_SPOT
    Weight: 3
```
This keeps 3 tasks on regular Fargate (stable base) and adds spot tasks at a 3:1 ratio for scale-out.
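Roughly how a desired count splits under that strategy. This is an approximation for intuition: ECS's actual placement rounding can differ slightly.

```python
def split_tasks(desired, base=3, weights=(1, 3)):
    """Approximate FARGATE/FARGATE_SPOT split for the strategy above:
    the first `base` tasks land on FARGATE, the remainder is distributed
    ~1:3 by weight. Returns (fargate_tasks, spot_tasks)."""
    fargate = min(desired, base)
    remaining = desired - fargate
    w_fargate, w_spot = weights
    spot = round(remaining * w_spot / (w_fargate + w_spot))
    return fargate + (remaining - spot), spot
```

Scaling to 11 tasks, for example, lands roughly 5 on regular Fargate (3 base + 2 weighted) and 6 on Spot.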
What to Do Next
- Audit your current scaling policies. Check if you're using the right policy type for your workload pattern
- Measure your actual scaling latency. Time from decision to traffic-serving. If it's over 2 minutes, optimize your startup time
- Review your cooldowns. Scale-out should be 60s or less. Scale-in should be 300s or more
- Set up scaling dashboards. You need visibility into how often you scale and whether it's working
- For queue-based workloads, look at fast-autoscaler for reactive scaling, or stepscale AI if you want the configuration tuned automatically