In this guide, I’ll cover building a modern autoscaling solution for an Amazon ECS cluster running on Fargate. You may think there is nothing groundbreaking here, because AWS already offers an autoscaling solution built into the ECS service, but you will find that it does not fit every workload running inside the cluster.
What is challenging when it comes to autoscaling?
The most difficult part of autoscaling is understanding what workload is running inside the cluster, and whether adding or removing resources (in our case, Tasks) is possible without affecting, for example, active connections. Let’s take a case from the last project I worked on:
Imagine we have an ECS cluster with containers running a service that lets customers establish RDP connections from a web browser. When a customer requests an RDP connection, the service inside the container acts as a proxy and carries the connection from the web browser, through the cluster, to the target server. The session stays active until the customer disconnects from the target server.
ECS Fargate cluster under the hood
To understand the scaling process, we first need to understand what components our ECS cluster consists of. At the very top is the Cluster, which contains the Service. The Service defines the Tasks running in it, their number, and their network connections (e.g. a Load Balancer). The Task is where our container resides. Since it’s a Fargate cluster, you don’t need to manage the servers the cluster runs on; the resources allocated to each Task are declared in the Task definition.
So… how to autoscale this beast?
To scale our ECS cluster up, we need to add Tasks to the Service, and to scale it down, we need to remove Tasks from the Service. For example, adding one new Task to the Service increases the number of sessions that the cluster can handle.
To balance the performance and cost of such an ECS cluster, we need to ensure that it scales up (adds more Tasks) as load increases and scales down (removes Tasks) when the load drops. The solution must keep customers satisfied by running efficiently, but it also has to respond to decreasing traffic without affecting active users of the service.
Now let’s answer why we couldn’t use the autoscaling built into ECS and had to create our own. The AWS autoscaling mechanism takes the cluster load into account, but not whether there are active sessions connected to the service running inside the cluster. This means there is a risk that scaling down will terminate Tasks with active sessions, which we cannot allow. That’s why we needed another solution 🙂
Let’s collect requirements for the autoscaling solution
Moving forward, we had to consider how to implement an autoscaling solution that takes active sessions into account. First, we needed to collect information about which Tasks have active sessions. Let’s focus on solving this problem.
Collecting information (metrics) about sessions that are active in a specific Task
Our application running in a container is aware of its own state: it can report to an external API how many active sessions it has and which Task within the Cluster handles them. What we needed was an external API ready to receive these metrics.
Our solution was to create an infrastructure consisting of a Private API Gateway with an endpoint connected to an SQS queue. When the application sends a metric (information about session activity in a given Task) to the API Gateway endpoint, the metric is placed on the SQS queue, which acts as a buffer for the Lambda function that writes it to DynamoDB. When the application reports an active session, the metric is added to DynamoDB, and when the session is terminated, the metric is removed from DynamoDB. With this in place, we know which Tasks in the Cluster have active sessions. One less problem 😛
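To make the last hop of this pipeline more concrete, here is a minimal sketch of what the Lambda function consuming the SQS queue might do. The metric fields (`session_id`, `task_arn`, `status`) and the `InMemoryTable` stand-in are my assumptions for illustration, not the real payload format. In a deployed function, `table` would be a boto3 DynamoDB Table resource, which exposes the same `put_item` and `delete_item` calls.

```python
import json


class InMemoryTable:
    """Local stand-in for a DynamoDB table, keyed by session_id (illustrative only)."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[Item["session_id"]] = Item

    def delete_item(self, Key):
        self.items.pop(Key["session_id"], None)


def apply_session_metrics(event, table):
    """Apply session metrics from an SQS-triggered Lambda event to the table."""
    for record in event["Records"]:
        # Each SQS record carries the message as a string in "body".
        metric = json.loads(record["body"])
        key = {"session_id": metric["session_id"]}
        if metric["status"] == "active":
            # Session started: remember which Task is handling it.
            table.put_item(Item={**key, "task_arn": metric["task_arn"]})
        else:
            # Session ended: this session no longer keeps the Task busy.
            table.delete_item(Key=key)
```

Passing the table in as an argument keeps the function easy to test locally, because no AWS credentials or network access are needed.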
Reading cluster utilization
The next step in autoscaling our cluster is to collect information about its current load in terms of CPU and memory consumption. If the load is high, the situation is relatively simple: we scale up. If the utilization is low and we have the opportunity to scale down, we do it, but here the situation is more difficult.
Scaling up is done by adding a new Task to the Service in our ECS cluster. Scaling down is done by removing unused Tasks from the ECS Service. So, to scale down, we need to make sure that a given Task is unused, meaning that it has no active sessions (that’s why we needed the metrics described in the previous section). With all this in mind, let’s define the architecture for these assumptions.
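The "unused Task" check can be sketched in a few lines. This assumes the session metrics are stored as items with a `task_arn` field, which is a hypothetical shape chosen for illustration.

```python
def find_idle_tasks(running_task_arns, session_items):
    """Return the running Tasks that have no active session recorded."""
    # Every stored item represents one active session on some Task.
    busy = {item["task_arn"] for item in session_items}
    # Keep the original order so the result is deterministic.
    return [arn for arn in running_task_arns if arn not in busy]
```

For example, with Tasks `a`, `b` and `c` running and a single active session on `b`, only `a` and `c` are candidates for removal.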
To check the cluster utilization, I created an EventBridge Rule scheduled to fire every minute. The Rule runs a Lambda function whose job is to read the utilization of the main ECS cluster from CloudWatch and report whether the cluster needs scaling up because the CPU/memory load exceeds the defined level. If not, the function reports whether the cluster can be scaled down by removing unused Tasks from the ECS Service.
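The decision itself could look roughly like the sketch below. The thresholds, the `min_tasks` floor and the shape of the returned dictionary are assumptions made for this example, not the values used in the project.

```python
def decide_scaling(cpu_percent, memory_percent, idle_task_arns, running_count,
                   scale_out_threshold=75.0, scale_in_threshold=30.0, min_tasks=1):
    """Decide whether to scale out, scale in, or do nothing."""
    load = max(cpu_percent, memory_percent)
    if load >= scale_out_threshold:
        # High load: add capacity regardless of idle Tasks.
        return {"action": "scale_out", "tasks": []}
    if load <= scale_in_threshold and idle_task_arns and running_count > min_tasks:
        # Low load: remove only idle Tasks, and never go below min_tasks.
        removable = idle_task_arns[: running_count - min_tasks]
        return {"action": "scale_in", "tasks": removable}
    return {"action": "none", "tasks": []}
```

Note that a low load alone is not enough to scale in: without at least one idle Task the function returns `none`, which is exactly the guarantee the built-in autoscaling could not give us.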
AWS Step Functions as an autoscaling executor
Knowing whether we should scale our cluster up or down, we also need a way to perform the scaling operation. Scaling up only involves increasing the desired number of Tasks in the ECS Service definition. Unfortunately, scaling down takes more steps, because we have to deregister the unused Task from the Service. To scale down we need to:
- Deregister specific Task from Load balancer’s targets,
- Wait for it to be deregistered from the Load balancer,
- Stop this task in the ECS’s Service,
- Remove it from the ECS’s Service.
As you can see, scaling down consists of several “steps”, so, as the name suggests, AWS Step Functions is a good fit for this purpose 🙂
We wanted to add an additional step to the cluster analysis stage which, when the cluster needs scaling, triggers a Step Functions state machine that executes all the steps needed for either scaling out or scaling in. The definition of our Step Function looks as follows:
The steps needed to scale are described pretty well in the function diagram above. Below I discuss a few interesting topics.
Some steps of our state machine call external services and wait for them to complete the assigned work. An example is deregistering Tasks from the Load Balancer. With Step Functions, we pay for every state transition of our state machine. So if we know that deregistering a target from the Load Balancer takes about 5 minutes, it is not worth paying to check every second or less whether the deregistration has finished. It’s better, for example, to make the first check after 4 minutes and each subsequent check every minute. This limits the number of state transitions, and thus the cost of our state machine.
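A quick back-of-the-envelope calculation shows how much the waiting strategy matters. The helper below is only an illustration: it counts status checks, and each check costs at least one state transition in the state machine.

```python
def count_checks(expected_seconds, first_wait_seconds, interval_seconds):
    """Count status checks until one lands at or after expected_seconds."""
    checks = 1
    elapsed = first_wait_seconds
    while elapsed < expected_seconds:
        # The operation is not finished yet, so we wait and check again.
        elapsed += interval_seconds
        checks += 1
    return checks


# Deregistration takes about 5 minutes (300 seconds):
print(count_checks(300, 1, 1))     # checking every second -> 300
print(count_checks(300, 240, 60))  # first check after 4 min, then every minute -> 2
```

In a state machine definition, this strategy maps to a Wait state with a long initial delay followed by a loop with a shorter one.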
Try catch and Retry
It is also important to understand when to use Retry and when to use Try Catch blocks in a Step Functions definition.
Retry is used when we want to repeat a specific stage of the state machine that has failed, e.g. if we failed to deregister the Task from the Load Balancer, let’s try this specific step again.
A Try Catch block (the Catch field in the state definition) lets us move to a specific stage of the state machine when a specific error occurs. This is similar to what Retry offers, but Catch also allows us to “step back” in the state machine when the situation requires it. For example, if an error occurs while checking whether a given Task is healthy, there is no point in repeating the conditional check itself. Instead, we have to go back a few steps, retrieve the Task’s health information from the API again, and then check its health again.
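Here is a small sketch of how the two mechanisms look in an Amazon States Language definition, written as a Python dictionary so it can be dumped to JSON. The state names and Lambda function names are hypothetical and do not come from the real state machine. Retry and Catch are fields of Task, Parallel and Map states, so in this sketch the health evaluation is modelled as a Task state.

```python
import json


def lambda_task(function_name, **extra):
    """Build a Task state that invokes a Lambda function (names are hypothetical)."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
    }
    state.update(extra)
    return state


definition = {
    "StartAt": "GetTaskHealth",
    "States": {
        "GetTaskHealth": lambda_task("get-task-health", Next="EvaluateTaskHealth"),
        # Catch: on any error, step back and fetch the health data again.
        "EvaluateTaskHealth": lambda_task(
            "evaluate-task-health",
            Catch=[{"ErrorEquals": ["States.ALL"], "Next": "GetTaskHealth"}],
            Next="DeregisterTarget",
        ),
        # Retry: repeat this same step up to 3 times with growing delays.
        "DeregisterTarget": lambda_task(
            "deregister-target",
            Retry=[{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 10,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            End=True,
        ),
    },
}

print(json.dumps(definition, indent=2))
```

Retry keeps the execution in the same state, while Catch changes where the execution goes next. That difference is what makes "stepping back" possible.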
Invoking Step Functions
For completeness, below are the parameters passed when starting a Step Functions autoscaling execution.
Scaling out event
Scaling in event
Testing the whole solution
I’m a big fan of testing everything that can be tested. I follow the principle stated by Dr. Werner Vogels, CTO of Amazon:
Everything Fails All the Time
So we have no choice 🙂 We have to test our solution.
Types of cloud tests
The most popular types of system tests are:
- Unit tests, focusing on testing the smallest possible fragments of software (especially in terms of code) in simulated conditions,
- Integration tests, focusing on checking how individual infrastructure elements interact with each other,
- E2E tests that check the behaviour of the entire service in conditions that are the most similar to real life.
Don’t forget to check my previous post with a ready-made project built in AWS CDK, covering the topic of testing Lambda functions:
Assignment of infrastructure elements to test types
In order to test our infrastructure for autoscaling, we need to start by dividing it into different types of tests.
Unit tests of the Lambda function
The Lambda function in our case makes decisions about whether to scale the cluster up or down, or do nothing. Therefore, we need to prepare ready-made situations and check whether the function will make correct decisions for the given parameters (simulated cluster load).
The diagram above shows an example of unit testing the Lambda function. Thanks to mocking, we simulate responses from external services such as CloudWatch or DynamoDB. We want to check whether the scaling decision is made correctly when we simulate a high or low load on the ECS cluster. We also simulate a given Task being unused or in use, and check whether the Lambda function marks such a Task as removable or not.
Integration tests of the autoscaling infrastructure
One of the integration tests that should be performed in our case covers the integration between the AWS Step Functions state machine performing autoscaling and the ECS cluster. In particular, we want to check whether the cluster is scaled appropriately under given conditions. For this purpose, I created a second, testing state machine, whose task is to call the scaling state machine and then check whether the cluster state reflects the expected scaling.
The definition of the Testing Step Functions looks as follows:
Each type of test shown here must be callable from code. I describe the infrastructure for this solution using an IaC tool, AWS CDK. The framework used to execute the tests is pytest. The steps taken when starting integration tests are as follows:
- Disabling the cluster state analyzer,
- Preparing the cluster, launching the appropriate number of tasks,
- Creating a test scenario (e.g. scale the cluster down to 1 task),
- Sending the scenario to Testing Step Function, which tests the cluster’s behavior,
- Waiting for the result.
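The last step of the list above, waiting for the result, can be sketched as a small polling helper. It assumes `sfn` is a boto3 Step Functions client (or anything exposing the same `describe_execution` call). The poll interval and timeout are arbitrary example values.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn, execution_arn, poll_seconds=15,
                       timeout_seconds=900, sleep=time.sleep):
    """Poll a state machine execution until it reaches a terminal status."""
    waited = 0
    while True:
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Execution still {status} after {waited}s")
        # Injected sleep lets tests run instantly without real waiting.
        sleep(poll_seconds)
        waited += poll_seconds
```

In a pytest test, the returned status can then be asserted directly, for example `assert wait_for_execution(sfn, arn) == "SUCCEEDED"`.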
E2E tests of the autoscaling infrastructure
The last type of testing will be to simulate the load inside the cluster and check whether Lambda, which analyzes the cluster load, will make the appropriate decision regarding scaling. This type of test scenario is the closest to real life.
CI/CD and the autoscaling solution
In our case, a temporary stack is created when a developer opens a Pull Request, and all types of tests are run automatically on that temporary stack. When all of them succeed, the changes can be merged, which starts the deployment process to target environments such as production. If any test fails during the deployment after the merge, the rollback process begins.
AWS re:Invent 2023 news – AWS Step Functions – Redrive
When I was creating the described autoscaling solution, I really missed the possibility of repeating the state machine execution from the moment the error occurred. Now we have this opportunity thanks to Step Functions – Redrive 🙂
This topic was also covered in the talk I gave with Przemek Malak at 4Developers 2023 Łódź. I would like to take this occasion to thank the entire community who attended our session in such large numbers and, above all, my co-host Przemek (he has a great blog with more guides on Serverless topics 🙂 at malak.cloud).
Thank you for reading this far.