
Have you ever had the problem that your ECS cluster isn’t scaling out extra EC2 instances to provide capacity, even though some tasks are failing to be placed?

Does the error look similar to one of these?

service stroobantsdev-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance x has insufficient CPU units available. For more information, see the Troubleshooting section.

service stroobantsdev-service was unable to place a task because no container instance met all of its requirements. The closest matching container-instance x has insufficient memory available. For more information, see the Troubleshooting section.

But when you look at the metrics, you see only around 10% CPU/memory usage?

The problem is that you have set the reservation too high! So you have a couple of possible fixes:

  1. Lower the number of CPU/memory units your task reserves (see the sketch after this list)
  2. Remove the CPU/memory limits from the task definition (this can lead to less predictable behaviour)
  3. If those two are not possible (for example because CPU/memory usage is low right now, but the services will need the reserved capacity once a processing task comes in), you should scale another instance into your cluster.
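As an illustration of the first fix, here is a minimal sketch (the names and values are hypothetical) of a task definition that reserves fewer units, so more tasks fit on the same container instance:

    // Hypothetical task definition with a lowered reservation.
    const taskDefinition = new ecs.Ec2TaskDefinition(this, 'AppTaskDef');

    taskDefinition.addContainer('app', {
      image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
      // Soft memory reservation in MiB; the container can still burst above it.
      memoryReservationMiB: 256,
      // CPU units (1024 = one vCPU); lowering this frees up placement room.
      cpu: 128,
    });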

Now, in 99% of the cases you scale on CPU/memory usage (not possible here), on queue size, or on some other third-party event.

But you can also scale on CPU/memory reservation.

Scaling on CPU/Memory reservation

The first thing we must look at is: how do we want to scale? For this example, let’s say we want to scale out if either of the two reservations is above 75%. Now, we could create multiple alarms that scale in/out when this happens, but these could conflict with each other. For example, when CPU reservation is above 75% (scale out) while memory reservation is under 40% (scale in), one alarm wants to add an instance while the other wants to remove one.

So let’s take a formula I remembered from somewhere deep in memory: the Euclidean norm of the two values.

sqrt(CPUReservation^2 + MemoryReservation^2)

If either of them is at or above 75, this expression returns >= 75, even when the other one is zero. For the scale-in threshold you should experiment to find a good number; I use 40 at the moment. As with everything in the cloud (and FinOps), you should iterate on these values until they fit your situation. There is no magic number, and everything also depends on your instance sizes, container usage, and so on.
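To get a feel for the formula, here is a quick sketch with some sample values (plain TypeScript, nothing CDK-specific):

    // Combined reservation: sqrt(cpu^2 + mem^2)
    const combined = (cpu: number, mem: number) => Math.sqrt(cpu ** 2 + mem ** 2);

    console.log(combined(75, 0));  // 75    -> scale out, even though memory is idle
    console.log(combined(75, 75)); // ~106  -> scale out
    console.log(combined(30, 30)); // ~42.4 -> inside the "do nothing" band
    console.log(combined(25, 25)); // ~35.4 -> below 40, scale in

Note that the combined value can exceed 100 when both reservations are high; that is fine for an alarm threshold.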

How to implement this in the AWS CDK?

First we will define our metrics. AWS publishes a CPUReservation and a MemoryReservation metric under the AWS/ECS namespace, with our cluster name as dimension (which I take here from a previously created cluster).

    import * as cdk from 'aws-cdk-lib';
    import { MathExpression, Metric } from 'aws-cdk-lib/aws-cloudwatch';

    // Cluster-wide reservation metrics published by ECS itself.
    const reservationCpuMetric = new Metric({
      namespace: 'AWS/ECS',
      metricName: 'CPUReservation',
      statistic: 'Average',
      period: cdk.Duration.minutes(1),
      // `dimensionsMap` in CDK v2; older versions call this `dimensions`.
      dimensionsMap: {
        ClusterName: cluster.clusterName,
      },
    });
    const reservationMemoryMetric = new Metric({
      namespace: 'AWS/ECS',
      metricName: 'MemoryReservation',
      statistic: 'Average',
      period: cdk.Duration.minutes(1),
      dimensionsMap: {
        ClusterName: cluster.clusterName,
      },
    });

    // Combine both reservations with our formula: sqrt(m1^2 + m2^2).
    const scaleReservation = new MathExpression({
      expression: '(m1^2+m2^2)^(1/2)',
      period: cdk.Duration.minutes(1),
      usingMetrics: {
        m1: reservationCpuMetric,
        m2: reservationMemoryMetric,
      },
    });

So, as you can see, the important part here is scaleReservation: a math expression that takes the CPU and memory reservation metrics and applies our formula to them. Now, to apply the scaling itself:

    import { AdjustmentType, AutoScalingGroup, UpdatePolicy } from 'aws-cdk-lib/aws-autoscaling';
    import { InstanceType } from 'aws-cdk-lib/aws-ec2';
    import * as ecs from 'aws-cdk-lib/aws-ecs';

    // The EC2 instances that back the cluster.
    const autoScalingGroup = new AutoScalingGroup(this, "clusterAsgSpotFleet", {
      vpc,
      instanceType: new InstanceType('t3.large'),
      machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
      minCapacity: 1,
      maxCapacity: 5,
      updatePolicy: UpdatePolicy.rollingUpdate(),
      maxInstanceLifetime: cdk.Duration.days(14),
    });

    // Register the instances with our ECS cluster on boot.
    autoScalingGroup.addUserData(`
#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=${cluster.clusterName}
EOF`);

    // Step scaling on the combined reservation metric:
    // >= 75 adds an instance, <= 40 removes one.
    autoScalingGroup.scaleOnMetric('ScaleToReservation', {
      metric: scaleReservation,
      scalingSteps: [
        { upper: 40, change: -1 },
        { lower: 75, change: +1 },
      ],
      adjustmentType: AdjustmentType.CHANGE_IN_CAPACITY,
    });

    // Expose the Auto Scaling group to ECS as a capacity provider.
    const capacityProvider = new ecs.AsgCapacityProvider(this, 'clusterAsgSpotFleetProvider', {
      autoScalingGroup,
    });

    cluster.addAsgCapacityProvider(capacityProvider);

Here we created the Auto Scaling group and added it to our cluster as a capacity provider. The important part is scaleOnMetric, where we configure how the group scales: when the metric is >= 75 it adds an instance (+1), and when it is <= 40 it removes one (-1). In between, nothing happens.
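To tie it all together, here is a minimal sketch of a service that actually runs on this capacity; taskDefinition is assumed to exist (for example the hypothetical one sketched earlier), and capacityProviderStrategies binds the service to our provider:

    // Hypothetical service bound to the capacity provider defined above.
    new ecs.Ec2Service(this, 'stroobantsdevService', {
      cluster,
      taskDefinition,
      capacityProviderStrategies: [{
        capacityProvider: capacityProvider.capacityProviderName,
        weight: 1,
      }],
    });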
