Queue-Based Autoscaling on AWS - SQS, Kafka, and Beyond
CPU-based autoscaling breaks down the moment your service reads from a queue. Your tasks might sit at 15% CPU while 200,000 messages pile up waiting to be processed. The CPU metric tells you how busy your workers are, not how much work is waiting.
Queue-based autoscaling fixes this by scaling on the actual signal that matters: how many messages need processing. Here's how to set it up for SQS, Kafka, RabbitMQ, and other queue systems on AWS.
Why CPU Metrics Fail for Queue Consumers
Consider a service consuming SQS messages. Each message triggers an API call to a third-party service that takes 2 seconds. The CPU usage during that 2-second wait? Nearly zero - the task is blocked on I/O.
With 10 tasks running, CPU sits at 8%. Target tracking autoscaling sees 8% and thinks "we have way too much capacity, let's scale down." Meanwhile, the queue grows because each task can only process 30 messages per minute.
This isn't an edge case. It's the default behavior for:
- Services calling external APIs
- Workers doing database-heavy operations
- Anything with I/O-bound processing (file downloads, S3 operations, HTTP requests)
- ETL pipelines that spend most time waiting on data sources
The only reliable metric for these workloads is the queue itself.
SQS-Based Autoscaling
The Metrics That Matter
SQS exposes three relevant CloudWatch metrics:
| Metric | What it tells you |
|---|---|
| ApproximateNumberOfMessagesVisible | Messages waiting to be picked up |
| ApproximateNumberOfMessagesNotVisible | Messages currently being processed (in-flight) |
| ApproximateAgeOfOldestMessage | How long the oldest message has been waiting |
MessagesVisible is your primary scaling signal. It tells you how much work is queued up.
MessagesNotVisible is useful as a secondary metric - it tells you how many messages your current workers are actively processing. If this number equals your workers * batch size, your workers are fully saturated.
AgeOfOldestMessage is great for alerting but bad for scaling. A single stuck message can make this metric spike even when the queue is otherwise healthy.
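The saturation check described above can be sketched as a one-liner (worker count and batch size come from your own deployment, not from SQS):

```python
def is_saturated(in_flight, worker_count, batch_size):
    """Workers are fully saturated when in-flight messages fill every batch slot."""
    return in_flight >= worker_count * batch_size
```

If this returns True while visible messages keep growing, more workers will help; if it returns False, adding workers won't speed anything up.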
Option 1: CloudWatch Alarm + Step Scaling
The native AWS approach. Create a CloudWatch alarm on ApproximateNumberOfMessagesVisible and attach a step scaling policy:
```yaml
QueueDepthAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: order-queue-depth-high
    MetricName: ApproximateNumberOfMessagesVisible
    Namespace: AWS/SQS
    Statistic: Average
    Period: 60
    EvaluationPeriods: 1
    Threshold: 100
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: QueueName
        Value: !GetAtt OrderQueue.QueueName
    AlarmActions:
      - !Ref ScaleUpPolicy

ScaleUpPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: order-queue-scale-up
    # References the ScalableTarget registered for the ECS service
    ScalingTargetId: !Ref ServiceScalableTarget
    PolicyType: StepScaling
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: 500
          ScalingAdjustment: 2
        - MetricIntervalLowerBound: 500
          MetricIntervalUpperBound: 5000
          ScalingAdjustment: 5
        - MetricIntervalLowerBound: 5000
          ScalingAdjustment: 15
      Cooldown: 60
```
This approach works. The problem is granularity. CloudWatch evaluates every 60 seconds (or 10 seconds with detailed monitoring), and the alarm needs EvaluationPeriods to trigger. With 1-minute periods and 1 evaluation period, you're looking at 1-2 minutes from queue spike to scaling action. Add task startup time and you're at 3-5 minutes.
For some workloads that's fine. For a payment processing queue where latency matters, it's not.
Option 2: Lambda-Based Direct Scaling
Skip CloudWatch alarms entirely. Run a Lambda function every 1-2 minutes that reads the queue depth and calls UpdateService directly:
```python
import math
import os

import boto3

QUEUE_URL = os.environ['QUEUE_URL']
CLUSTER = os.environ['CLUSTER']
SERVICE = os.environ['SERVICE']
MIN_TASKS = int(os.environ.get('MIN_TASKS', '1'))
MAX_TASKS = int(os.environ.get('MAX_TASKS', '50'))

def handler(event, context):
    sqs = boto3.client('sqs')
    ecs = boto3.client('ecs')

    # Get queue metrics. Note: GetQueueAttributes calls the visible count
    # ApproximateNumberOfMessages, not the CloudWatch metric name
    # ApproximateNumberOfMessagesVisible.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=['ApproximateNumberOfMessages',
                        'ApproximateNumberOfMessagesNotVisible']
    )
    visible = int(attrs['Attributes']['ApproximateNumberOfMessages'])
    in_flight = int(attrs['Attributes']['ApproximateNumberOfMessagesNotVisible'])

    # Calculate desired tasks
    messages_per_task = 500  # each task handles ~500 msg/min
    desired = math.ceil(visible / messages_per_task)
    desired = max(MIN_TASKS, min(desired, MAX_TASKS))

    # Get current state
    service = ecs.describe_services(
        cluster=CLUSTER, services=[SERVICE]
    )['services'][0]
    current = service['desiredCount']

    if desired != current:
        ecs.update_service(
            cluster=CLUSTER,
            service=SERVICE,
            desiredCount=desired
        )

    return {
        'visible': visible,
        'in_flight': in_flight,
        'current_tasks': current,
        'desired_tasks': desired
    }
```
This is the approach fast-autoscaler uses, with added features like cooldown management, state tracking in S3, priority overrides for scale-up, and support for multiple queue providers.
The advantage over CloudWatch alarms: you control the scaling formula directly. You can factor in both visible and in-flight messages, apply different ratios at different times of day, or implement custom cooldown logic.
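Scheduling the function is a one-rule EventBridge setup. A minimal CloudFormation sketch, where ScalerFunction is a hypothetical logical name for the Lambda above:

```yaml
ScalerSchedule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: rate(1 minute)
    Targets:
      - Arn: !GetAtt ScalerFunction.Arn
        Id: queue-scaler

ScalerInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ScalerFunction
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt ScalerSchedule.Arn
```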
Option 3: Target Tracking on a Custom Metric
You can publish a custom metric to CloudWatch that represents "messages per task" and use target tracking on it:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric; visible and current_task_count come from the
# same queue/service lookups shown in the Lambda above
cloudwatch.put_metric_data(
    Namespace='Custom/Autoscaling',
    MetricData=[{
        'MetricName': 'MessagesPerTask',
        'Value': visible / current_task_count,  # guard against zero tasks
        'Unit': 'Count'
    }]
)
```
Then target track on it:
```yaml
TargetTrackingPolicy:
  TargetValue: 100.0  # target 100 messages per task
  CustomizedMetricSpecification:
    MetricName: MessagesPerTask
    Namespace: Custom/Autoscaling
    Statistic: Average
```
This can work, but it has a bootstrap problem: when you have 0 or 1 tasks, the "messages per task" ratio is huge, causing aggressive scale-up. And when the queue is empty, you need tasks to stay at their minimum, but the metric goes to 0 which might trigger unwanted scale-in.
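A partial mitigation is to clamp the denominator before publishing the metric; a sketch (the helper name is ours):

```python
def messages_per_task(visible, running_tasks, floor=1):
    """Clamp the denominator so 0 running tasks doesn't blow up the ratio."""
    return visible / max(running_tasks, floor)
```

This avoids the division-by-zero case, but the ratio is still inflated at low task counts, which is why the bootstrap problem is only partially fixable this way.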
We prefer the Lambda-based approach for its simplicity and control.
Kafka-Based Autoscaling
Kafka autoscaling is fundamentally different from SQS because Kafka tracks consumer progress per partition.
The key metric is consumer lag - the difference between the latest offset in each partition and your consumer group's committed offset. High lag means your consumers are falling behind.
```python
from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition

admin = KafkaAdminClient(bootstrap_servers=KAFKA_BROKERS)
consumer = KafkaConsumer(bootstrap_servers=KAFKA_BROKERS)

# Get end offsets for all partitions
partitions = consumer.partitions_for_topic(TOPIC)
tp_list = [TopicPartition(TOPIC, p) for p in partitions]
end_offsets = consumer.end_offsets(tp_list)

# Get the consumer group's committed offsets
group_offsets = admin.list_consumer_group_offsets(CONSUMER_GROUP)

# Total lag = sum over partitions of (latest offset - committed offset)
total_lag = sum(
    end_offsets[tp] - group_offsets[tp].offset
    for tp in tp_list
    if tp in group_offsets
)
```
A gotcha with Kafka scaling: you can't have more consumers than partitions. If your topic has 12 partitions, scaling beyond 12 pods wastes resources - the extra pods sit idle. Make sure your maxReplicas or MAX_TASKS doesn't exceed your partition count.
If you consistently need more parallelism, increase the partition count first. But be careful - increasing partitions on an existing topic can break ordering guarantees if your application depends on key-based ordering.
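Folding the partition cap into the scaling formula is a one-line clamp; a sketch with names of our own choosing:

```python
import math

def desired_consumers(total_lag, messages_per_consumer, partition_count,
                      min_consumers=1):
    """Size the consumer group, never exceeding the topic's partition count."""
    desired = math.ceil(total_lag / messages_per_consumer)
    return max(min_consumers, min(desired, partition_count))
```

Even with 50,000 messages of lag, a 12-partition topic gets at most 12 consumers; the 13th would never be assigned a partition.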
RabbitMQ-Based Autoscaling
RabbitMQ exposes queue metrics two ways: over AMQP via a passive queue declare, and over HTTP via its management API. The AMQP route with pika:

```python
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=RABBITMQ_HOST)
)
channel = connection.channel()

# Passive declare returns queue info without creating or modifying the queue
result = channel.queue_declare(queue=QUEUE_NAME, passive=True)
message_count = result.method.message_count
consumer_count = result.method.consumer_count
```
message_count is your primary scaling signal - equivalent to SQS's MessagesVisible.
The passive declare doesn't return an in-flight count the way SQS does. You can estimate it from the consumer count and prefetch settings (estimated_in_flight = consumer_count * prefetch_count), or read it directly from the management API, whose queue object includes a messages_unacknowledged field.
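Here's a standard-library-only sketch of the management API route; it assumes the management plugin is enabled on its default port 15672, and the host, vhost, and credentials are placeholders:

```python
import base64
import json
import urllib.request

def parse_queue_stats(payload):
    """Extract the scaling-relevant fields from a management-API queue object."""
    return {
        "ready": payload["messages_ready"],              # like SQS MessagesVisible
        "in_flight": payload["messages_unacknowledged"], # delivered, not yet acked
        "consumers": payload["consumers"],
    }

def fetch_queue_stats(host, vhost, queue, user="guest", password="guest"):
    # GET /api/queues/{vhost}/{name} from the RabbitMQ management API
    url = f"http://{host}:15672/api/queues/{vhost}/{queue}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return parse_queue_stats(json.load(resp))
```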
Redis Queue Autoscaling
For Redis-based queues (using Lists or Streams):
Lists: LLEN queue_name gives you the queue depth. Simple and fast.
Streams with Consumer Groups: Use XINFO GROUPS stream_name to get per-group delivery state. The pending field reports messages that have been delivered but not yet acknowledged; Redis 7.0+ adds a lag field for entries not yet delivered to the group.

```python
import redis

r = redis.Redis(host=REDIS_HOST, decode_responses=True)

# For List-based queues
queue_depth = r.llen('job_queue')

# For Stream-based queues
groups = r.xinfo_groups('event_stream')
for group in groups:
    pending = group['pending']    # delivered but not yet acknowledged
    lag = group.get('lag', 0)     # not yet delivered (Redis 7.0+)
```
The Scaling Formula
Regardless of queue type, the core formula is the same:
```
desired_tasks = ceil(queue_depth / messages_per_task)
desired_tasks = clamp(desired_tasks, min_tasks, max_tasks)
```
The hard part is determining messages_per_task - how many messages each task can process per scaling interval. This depends on:
- Message processing time (average and p95)
- Whether processing is CPU-bound or I/O-bound
- Batch size if your consumer batches messages
- Error/retry rate
Start with a conservative estimate and adjust. If each message takes 200ms to process and your scaling interval is 60 seconds, each task handles ~300 messages per minute. Set messages_per_task to 200 to leave headroom.
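That sizing logic can be sketched as follows; the 0.7 headroom factor is an assumption you should tune:

```python
import math

def per_task_capacity(interval_seconds, avg_processing_seconds, headroom=0.7):
    """Messages one task can clear per scaling interval, scaled down for safety."""
    return round((interval_seconds / avg_processing_seconds) * headroom)

def desired_tasks(queue_depth, messages_per_task, min_tasks, max_tasks):
    """Core formula: enough tasks to clear the backlog, clamped to bounds."""
    return max(min_tasks, min(math.ceil(queue_depth / messages_per_task), max_tasks))
```

With 200 ms messages and a 60-second interval, per_task_capacity(60, 0.2) comes out near the conservative 200-messages-per-task figure above.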
Handling Scale-Down Safely
Scaling up is easy - add more workers. Scaling down is where things break.
If you remove a task that's mid-processing, you either lose that message or it gets reprocessed when the visibility timeout expires (SQS) or when the consumer group rebalances (Kafka).
Best practices for safe scale-down:
- Use graceful shutdown. Catch SIGTERM, stop pulling new messages, finish current work, then exit. ECS sends SIGTERM and waits a configurable stopTimeout (default 30s) before killing the task.
- Scale down slowly. Don't remove 10 tasks at once. Remove 1-2 at a time with a cooldown between removals. This gives in-flight messages time to finish processing.
- Don't scale down during processing spikes. If visible messages are low but in-flight messages are high, your current workers are busy. Wait until both are low before scaling down.
- Use a higher cooldown for scale-in. 120 seconds minimum; 300 seconds is better. This prevents removing capacity during brief traffic lulls.
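The graceful-shutdown pattern from the first point can be sketched like this; poll and process are placeholders for your own consumer loop:

```python
import signal

class GracefulWorker:
    def __init__(self):
        self.shutting_down = False
        # ECS/Kubernetes send SIGTERM before killing the container
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Stop pulling new work; the run loop exits after the current batch
        self.shutting_down = True

    def run(self, poll, process):
        while not self.shutting_down:
            batch = poll()
            if batch:
                process(batch)
        # Reaching here: current batch is done; ack/drain and exit cleanly
```

The worker never abandons a batch mid-flight: the flag is only checked between batches, so everything pulled before SIGTERM gets processed and acknowledged before exit.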
Tuning Over Time
The initial configuration is never perfect. Your messages_per_task ratio, thresholds, and cooldowns all need adjustment as:
- Traffic patterns shift seasonally
- Message processing time changes as your service evolves
- Downstream dependencies get faster or slower
- New message types with different processing characteristics are added
This is where manual tuning gets exhausting. You set values, monitor for a week, adjust, repeat. For teams with dozens of queue-consuming services, it becomes a part-time job.
stepscale AI automates this tuning loop. It analyzes your scaling history - queue depths, task counts, processing rates over time - and generates optimized configuration values. The tuning runs periodically (not per-invocation), so it adds minimal cost while keeping your scaling parameters current.
What to Do Next
- Identify your queue-consuming services that are currently using CPU-based autoscaling. These are your highest-priority candidates for queue-based scaling
- Measure your message processing time (average and p95) to calculate messages_per_task
- For SQS on ECS, try fast-autoscaler - it handles the Lambda setup, multi-queue support, cooldown management, and state tracking out of the box
- For Kubernetes, look at KEDA with the appropriate trigger for your queue system
- Set up monitoring on queue depth, processing latency, and task count so you can see how your scaling is performing