Queue-Based Autoscaling on AWS - SQS, Kafka, and Beyond
CPU-based autoscaling breaks down the moment your service reads from a queue. Your tasks might sit at 15% CPU while 200,000 messages pile up waiting to be processed. The CPU metric tells you how busy your workers are, not how much work is waiting.
Queue-based autoscaling fixes this by scaling on the actual signal that matters: how many messages need processing. Here's how to set it up for SQS, Kafka, RabbitMQ, and other queue systems on AWS.
Why CPU Metrics Fail for Queue Consumers
Consider a service consuming SQS messages. Each message triggers an API call to a third-party service that takes 2 seconds. The CPU usage during that 2-second wait? Nearly zero - the task is blocked on I/O.
With 10 tasks running, CPU sits at 8%. Target tracking autoscaling sees 8% and thinks "we have way too much capacity, let's scale down." Meanwhile, the queue grows because each task can only process 30 messages per minute.
This isn't an edge case. It's the default behavior for:
- Services calling external APIs
- Workers doing database-heavy operations
- Anything with I/O-bound processing (file downloads, S3 operations, HTTP requests)
- ETL pipelines that spend most time waiting on data sources
The only reliable metric for these workloads is the queue itself.
SQS-Based Autoscaling
The Metrics That Matter
SQS exposes three relevant CloudWatch metrics:
| Metric | What it tells you |
|---|---|
| ApproximateNumberOfMessagesVisible | Messages waiting to be picked up |
| ApproximateNumberOfMessagesNotVisible | Messages currently being processed (in-flight) |
| ApproximateAgeOfOldestMessage | How long the oldest message has been waiting |
MessagesVisible is your primary scaling signal. It tells you how much work is queued up.
MessagesNotVisible is useful as a secondary metric - it tells you how many messages your current workers are actively processing. If this number equals your workers * batch size, your workers are fully saturated.
AgeOfOldestMessage is great for alerting but bad for scaling. A single stuck message can make this metric spike even when the queue is otherwise healthy.
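The saturation check described above can be sketched as a one-liner (worker count and batch size come from your own deployment, not from SQS):

```python
def is_saturated(in_flight, worker_count, batch_size):
    """Workers are fully saturated when in-flight messages fill every batch slot."""
    return in_flight >= worker_count * batch_size
```

If this returns True while visible messages keep growing, more workers will help; if it returns False, adding workers won't speed anything up.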
Option 1: CloudWatch Alarm + Step Scaling
The native AWS approach. Create a CloudWatch alarm on ApproximateNumberOfMessagesVisible and attach a step scaling policy:
```yaml
QueueDepthAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: order-queue-depth-high
    MetricName: ApproximateNumberOfMessagesVisible
    Namespace: AWS/SQS
    Statistic: Average
    Period: 60
    EvaluationPeriods: 1
    Threshold: 100
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: QueueName
        Value: !GetAtt OrderQueue.QueueName
    AlarmActions:
      - !Ref ScaleUpPolicy

ScaleUpPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: order-queue-scale-up
    # References the ScalableTarget registered for the ECS service
    ScalingTargetId: !Ref ServiceScalableTarget
    PolicyType: StepScaling
    StepScalingPolicyConfiguration:
      AdjustmentType: ChangeInCapacity
      StepAdjustments:
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: 500
          ScalingAdjustment: 2
        - MetricIntervalLowerBound: 500
          MetricIntervalUpperBound: 5000
          ScalingAdjustment: 5
        - MetricIntervalLowerBound: 5000
          ScalingAdjustment: 15
      Cooldown: 60
```
This approach works. The problem is granularity. CloudWatch evaluates every 60 seconds (or 10 seconds with detailed monitoring), and the alarm needs EvaluationPeriods to trigger. With 1-minute periods and 1 evaluation period, you're looking at 1-2 minutes from queue spike to scaling action. Add task startup time and you're at 3-5 minutes.
For some workloads that's fine. For a payment processing queue where latency matters, it's not.
Option 2: Lambda-Based Direct Scaling
Skip CloudWatch alarms entirely. Run a Lambda function every 1-2 minutes that reads the queue depth and calls UpdateService directly:
```python
import math
import os

import boto3

QUEUE_URL = os.environ['QUEUE_URL']
CLUSTER = os.environ['CLUSTER']
SERVICE = os.environ['SERVICE']
MIN_TASKS = int(os.environ.get('MIN_TASKS', '1'))
MAX_TASKS = int(os.environ.get('MAX_TASKS', '50'))

def handler(event, context):
    sqs = boto3.client('sqs')
    ecs = boto3.client('ecs')

    # Get queue metrics. Note: GetQueueAttributes calls the visible count
    # ApproximateNumberOfMessages, not the CloudWatch metric name
    # ApproximateNumberOfMessagesVisible.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=['ApproximateNumberOfMessages',
                        'ApproximateNumberOfMessagesNotVisible']
    )
    visible = int(attrs['Attributes']['ApproximateNumberOfMessages'])
    in_flight = int(attrs['Attributes']['ApproximateNumberOfMessagesNotVisible'])

    # Calculate desired tasks
    messages_per_task = 500  # each task handles ~500 msg/min
    desired = math.ceil(visible / messages_per_task)
    desired = max(MIN_TASKS, min(desired, MAX_TASKS))

    # Get current state
    service = ecs.describe_services(
        cluster=CLUSTER, services=[SERVICE]
    )['services'][0]
    current = service['desiredCount']

    if desired != current:
        ecs.update_service(
            cluster=CLUSTER,
            service=SERVICE,
            desiredCount=desired
        )

    return {
        'visible': visible,
        'in_flight': in_flight,
        'current_tasks': current,
        'desired_tasks': desired
    }
```
This is the approach fast-autoscaler uses, with added features like cooldown management, state tracking in S3, priority overrides for scale-up, and support for multiple queue providers.
The advantage over CloudWatch alarms: you control the scaling formula directly. You can factor in both visible and in-flight messages, apply different ratios at different times of day, or implement custom cooldown logic.
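Scheduling the function is a one-rule EventBridge setup. A minimal CloudFormation sketch, where ScalerFunction is a hypothetical logical name for the Lambda above:

```yaml
ScalerSchedule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: rate(1 minute)
    Targets:
      - Arn: !GetAtt ScalerFunction.Arn
        Id: queue-scaler

ScalerInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ScalerFunction
    Action: lambda:InvokeFunction
    Principal: events.amazonaws.com
    SourceArn: !GetAtt ScalerSchedule.Arn
```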
Option 3: Target Tracking on a Custom Metric
You can publish a custom metric to CloudWatch that represents "messages per task" and use target tracking on it:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric; visible and current_task_count come from the
# same queue/service lookups shown in the Lambda above
cloudwatch.put_metric_data(
    Namespace='Custom/Autoscaling',
    MetricData=[{
        'MetricName': 'MessagesPerTask',
        'Value': visible / current_task_count,  # guard against zero tasks
        'Unit': 'Count'
    }]
)
```
Then target track on it:
```yaml
TargetTrackingPolicy:
  TargetValue: 100.0  # target 100 messages per task
  CustomizedMetricSpecification:
    MetricName: MessagesPerTask
    Namespace: Custom/Autoscaling
    Statistic: Average
```
This can work, but it has a bootstrap problem: when you have 0 or 1 tasks, the "messages per task" ratio is huge, causing aggressive scale-up. And when the queue is empty, you need tasks to stay at their minimum, but the metric goes to 0 which might trigger unwanted scale-in.
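A partial mitigation is to clamp the denominator before publishing the metric; a sketch (the helper name is ours):

```python
def messages_per_task(visible, running_tasks, floor=1):
    """Clamp the denominator so 0 running tasks doesn't blow up the ratio."""
    return visible / max(running_tasks, floor)
```

This avoids the division-by-zero case, but the ratio is still inflated at low task counts, which is why the bootstrap problem is only partially fixable this way.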
We prefer the Lambda-based approach for its simplicity and control.
Kafka-Based Autoscaling
Kafka autoscaling is fundamentally different from SQS because Kafka tracks consumer progress per partition.
The key metric is consumer lag - the difference between the latest offset in each partition and your consumer group's committed offset. High lag means your consumers are falling behind.
```python
from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition

admin = KafkaAdminClient(bootstrap_servers=KAFKA_BROKERS)
consumer = KafkaConsumer(bootstrap_servers=KAFKA_BROKERS)

# Get end offsets for all partitions
partitions = consumer.partitions_for_topic(TOPIC)
tp_list = [TopicPartition(TOPIC, p) for p in partitions]
end_offsets = consumer.end_offsets(tp_list)

# Get the consumer group's committed offsets
group_offsets = admin.list_consumer_group_offsets(CONSUMER_GROUP)

# Total lag = sum over partitions of (latest offset - committed offset)
total_lag = sum(
    end_offsets[tp] - group_offsets[tp].offset
    for tp in tp_list
    if tp in group_offsets
)
```
A gotcha with Kafka scaling: you can't have more consumers than partitions. If your topic has 12 partitions, scaling beyond 12 pods wastes resources - the extra pods sit idle. Make sure your maxReplicas or MAX_TASKS doesn't exceed your partition count.
If you consistently need more parallelism, increase the partition count first. But be careful - increasing partitions on an existing topic can break ordering guarantees if your application depends on key-based ordering.
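Folding the partition cap into the scaling formula is a one-line clamp; a sketch with names of our own choosing:

```python
import math

def desired_consumers(total_lag, messages_per_consumer, partition_count,
                      min_consumers=1):
    """Size the consumer group, never exceeding the topic's partition count."""
    desired = math.ceil(total_lag / messages_per_consumer)
    return max(min_consumers, min(desired, partition_count))
```

Even with 50,000 messages of lag, a 12-partition topic gets at most 12 consumers; the 13th would never be assigned a partition.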
RabbitMQ-Based Autoscaling
RabbitMQ exposes queue metrics two ways: over AMQP via a passive queue declare, and over HTTP via its management API. The AMQP route with pika:

```python
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=RABBITMQ_HOST)
)
channel = connection.channel()

# Passive declare returns queue info without creating or modifying the queue
result = channel.queue_declare(queue=QUEUE_NAME, passive=True)
message_count = result.method.message_count
consumer_count = result.method.consumer_count
```
message_count is your primary scaling signal - equivalent to SQS's MessagesVisible.
The passive declare doesn't return an in-flight count the way SQS does. You can estimate it from the consumer count and prefetch settings (estimated_in_flight = consumer_count * prefetch_count), or read it directly from the management API, whose queue object includes a messages_unacknowledged field.
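Here's a standard-library-only sketch of the management API route; it assumes the management plugin is enabled on its default port 15672, and the host, vhost, and credentials are placeholders:

```python
import base64
import json
import urllib.request

def parse_queue_stats(payload):
    """Extract the scaling-relevant fields from a management-API queue object."""
    return {
        "ready": payload["messages_ready"],              # like SQS MessagesVisible
        "in_flight": payload["messages_unacknowledged"], # delivered, not yet acked
        "consumers": payload["consumers"],
    }

def fetch_queue_stats(host, vhost, queue, user="guest", password="guest"):
    # GET /api/queues/{vhost}/{name} from the RabbitMQ management API
    url = f"http://{host}:15672/api/queues/{vhost}/{queue}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req) as resp:
        return parse_queue_stats(json.load(resp))
```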
Redis Queue Autoscaling
For Redis-based queues (using Lists or Streams):
Lists: LLEN queue_name gives you the queue depth. Simple and fast.
Streams with Consumer Groups: Use XINFO GROUPS stream_name to get per-group delivery state. The pending field reports messages that have been delivered but not yet acknowledged; Redis 7.0+ adds a lag field for entries not yet delivered to the group.

```python
import redis

r = redis.Redis(host=REDIS_HOST, decode_responses=True)

# For List-based queues
queue_depth = r.llen('job_queue')

# For Stream-based queues
groups = r.xinfo_groups('event_stream')
for group in groups:
    pending = group['pending']    # delivered but not yet acknowledged
    lag = group.get('lag', 0)     # not yet delivered (Redis 7.0+)
```
The Scaling Formula
Regardless of queue type, the core formula is the same:
```
desired_tasks = ceil(queue_depth / messages_per_task)
desired_tasks = clamp(desired_tasks, min_tasks, max_tasks)
```
The hard part is determining messages_per_task - how many messages each task can process per scaling interval. This depends on:
- Message processing time (average and p95)
- Whether processing is CPU-bound or I/O-bound
- Batch size if your consumer batches messages
- Error/retry rate
Start with a conservative estimate and adjust. If each message takes 200ms to process and your scaling interval is 60 seconds, each task handles ~300 messages per minute. Set messages_per_task to 200 to leave headroom.
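That sizing logic can be sketched as follows; the 0.7 headroom factor is an assumption you should tune:

```python
import math

def per_task_capacity(interval_seconds, avg_processing_seconds, headroom=0.7):
    """Messages one task can clear per scaling interval, scaled down for safety."""
    return round((interval_seconds / avg_processing_seconds) * headroom)

def desired_tasks(queue_depth, messages_per_task, min_tasks, max_tasks):
    """Core formula: enough tasks to clear the backlog, clamped to bounds."""
    return max(min_tasks, min(math.ceil(queue_depth / messages_per_task), max_tasks))
```

With 200 ms messages and a 60-second interval, per_task_capacity(60, 0.2) comes out near the conservative 200-messages-per-task figure above.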
Handling Scale-Down Safely
Scaling up is easy - add more workers. Scaling down is where things break.
If you remove a task that's mid-processing, you either lose that message or it gets reprocessed when the visibility timeout expires (SQS) or when the consumer group rebalances (Kafka).
Best practices for safe scale-down:
- Use graceful shutdown. Catch SIGTERM, stop pulling new messages, finish current work, then exit. ECS sends SIGTERM and waits a configurable stopTimeout (default 30s) before killing the task.
- Scale down slowly. Don't remove 10 tasks at once. Remove 1-2 at a time with a cooldown between removals. This gives in-flight messages time to finish processing.
- Don't scale down during processing spikes. If visible messages are low but in-flight messages are high, your current workers are busy. Wait until both are low before scaling down.
- Use a higher cooldown for scale-in. 120 seconds minimum; 300 seconds is better. This prevents removing capacity during brief traffic lulls.
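The graceful-shutdown pattern from the first point can be sketched like this; poll and process are placeholders for your own consumer loop:

```python
import signal

class GracefulWorker:
    def __init__(self):
        self.shutting_down = False
        # ECS/Kubernetes send SIGTERM before killing the container
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Stop pulling new work; the run loop exits after the current batch
        self.shutting_down = True

    def run(self, poll, process):
        while not self.shutting_down:
            batch = poll()
            if batch:
                process(batch)
        # Reaching here: current batch is done; ack/drain and exit cleanly
```

The worker never abandons a batch mid-flight: the flag is only checked between batches, so everything pulled before SIGTERM gets processed and acknowledged before exit.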
Tuning Over Time
The initial configuration is never perfect. Your messages_per_task ratio, thresholds, and cooldowns all need adjustment as:
- Traffic patterns shift seasonally
- Message processing time changes as your service evolves
- Downstream dependencies get faster or slower
- New message types with different processing characteristics are added
This is where manual tuning gets exhausting. You set values, monitor for a week, adjust, repeat. For teams with dozens of queue-consuming services, it becomes a part-time job.
stepscale AI automates this tuning loop. It analyzes your scaling history - queue depths, task counts, processing rates over time - and generates optimized configuration values. The tuning runs periodically (not per-invocation), so it adds minimal cost while keeping your scaling parameters current.
What to Do Next
- Identify your queue-consuming services that are currently using CPU-based autoscaling. These are your highest-priority candidates for queue-based scaling
- Measure your message processing time (average and p95) to calculate messages_per_task
- For SQS on ECS, try fast-autoscaler - it handles the Lambda setup, multi-queue support, cooldown management, and state tracking out of the box
- For Kubernetes, look at KEDA with the appropriate trigger for your queue system
- Set up monitoring on queue depth, processing latency, and task count so you can see how your scaling is performing