Running containers on ECS is just the way AWS imagined it: super simple with almost no barrier to entry. Spin up the cluster, boot up some nodes and you are ready to go. This works great and solves many problems you face getting started with containers in cloud environments.
At some point you get the bill, and this is where things can get tricky. Unless you are using Fargate, ECS runs on regular EC2 instances, and those come at a cost. There are several ways to keep this under control, the easiest being reserved instances.
To get the cost down even further, there are spot instances, with the promise of up to 90 % cost reduction. That 90 % is a rather magical number and a good sales pitch, and it is rarely achieved, yet a good and steady 75 % is realistic if done correctly.
GETTING INTO IT
A few months back, implementing a spot-only strategy with fallback instances required quite advanced configuration, but now that there are launch templates it has become much simpler. Let’s walk through some of the lessons we learned implementing a possible solution (and open-sourcing the entire thing).
The first issue we faced with launch templates is that it is impossible to configure a base capacity of 0 for on-demand and still have on-demand instances used for scaling when spot becomes unavailable. That would have been too simple, I assume, and would also work against the AWS business model. Since this is impossible, we need another solution, which leads to the conclusion: just take the same launch template and create two autoscaling groups. And this is where things get tricky.
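To make the "one launch template, two groups" idea a bit more concrete, here is a minimal Python sketch of what the two group definitions could look like. It only builds the parameter dictionaries one might pass to the `create_auto_scaling_group` call of a boto3 `autoscaling` client. The group names, template name, subnets and instance types are hypothetical, and using a mixed instances policy to separate spot from on-demand is an assumption, not necessarily how the open-sourced project does it.

```python
def asg_params(name, launch_template, subnets, instance_types,
               on_demand, desired, max_size):
    """Build create_auto_scaling_group parameters for one of the two groups.

    Both groups point at the same launch template; the only difference is
    whether capacity is bought 100 % on-demand or 100 % spot.
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
        # The API expects the subnet IDs as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnets),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template,
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 100 if on_demand else 0,
            },
        },
    }


# Hypothetical names and sizes, purely for illustration.
SUBNETS = ["subnet-aaaa", "subnet-bbbb"]
TYPES = ["m5.large", "m4.large"]

spot_group = asg_params("ecs-spot", "ecs-node", SUBNETS, TYPES,
                        on_demand=False, desired=3, max_size=6)
# The on-demand group idles at 0 and is only scaled up by our own logic.
on_demand_group = asg_params("ecs-on-demand", "ecs-node", SUBNETS, TYPES,
                             on_demand=True, desired=0, max_size=6)
```

With a real client, each dictionary would be passed as keyword arguments, for example `client.create_auto_scaling_group(**spot_group)`.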
Now that we have two ASGs, there needs to be some kind of logic to scale the on-demand group if the spot group is unable to provide the desired capacity. First thought: there are probably events AWS publishes to an SNS topic whenever a group scales, and we can catch those. It sounded like a very good plan, with just one catch: this is barely documented by AWS at all. And testing against the "instance could not be provided on spot" case is a huge gamble, because you cannot force it to happen. So as long as there is no way to trigger demo events (AWS, you would do many of us a huge favor here *wink*), we need another way around this.
After lots of support issues, Twitter conversations and documentation deep dives, the one thing we DID find: CloudWatch Events provides a similar notification scheme for both autoscaling and ECS, which is far better documented and also testable.
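As a rough illustration of what "testable" means here, the following Python sketch defines two CloudWatch Events patterns and checks them against a hand-built event. The `source` and `detail-type` values are the ones AWS uses for failed Auto Scaling launches and ECS container instance changes. The group name, the `matches` helper and the sample event are hypothetical and heavily simplified (the real matcher supports more than exact values).

```python
import json

# Hypothetical name of the spot autoscaling group.
SPOT_ASG_NAME = "ecs-spot"

# Fires when the spot group fails to launch an instance.
LAUNCH_FAILED_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance Launch Unsuccessful"],
    "detail": {"AutoScalingGroupName": [SPOT_ASG_NAME]},
}

# Fires whenever a container instance in an ECS cluster changes state.
ECS_INSTANCE_CHANGE_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Container Instance State Change"],
}


def as_rule_pattern(pattern):
    """Serialize a pattern to the JSON string an events rule expects."""
    return json.dumps(pattern)


def matches(pattern, event):
    """Simplified matcher: exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict):
                return False
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


def sample_launch_failed_event(asg_name):
    """Hand-built event with only the fields the pattern looks at."""
    return {
        "source": "aws.autoscaling",
        "detail-type": "EC2 Instance Launch Unsuccessful",
        "detail": {"AutoScalingGroupName": asg_name, "StatusCode": "Failed"},
    }


event = sample_launch_failed_event(SPOT_ASG_NAME)
print(matches(LAUNCH_FAILED_PATTERN, event))
print(matches(ECS_INSTANCE_CHANGE_PATTERN, event))
```

Because the handler only ever sees a plain JSON document, an event like the sample above can be fed into it directly, without waiting for spot capacity to actually run out.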
So, next step: we need a Lambda function that handles the “could not provide spot instance” event by scaling up the on-demand autoscaling group, which you can find here. The general idea: check the desired count of the spot autoscaling group, as this will always be our benchmark. That way the desired capacity only has to be configured on one autoscaling group, which keeps things simple and consistent while the whole setup still works as a completely autonomous system.
So we compare the desired and the actual count and then update our on-demand autoscaling group to fill the gap. As simple as that. By adding a few more checks before updating groups, we try to keep the number of API calls low. The worst thing that could happen in a system like this is running out of API calls before scaling up, which would break our custom scaling.
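The comparison described above could be sketched roughly like this in Python. This is not the code from the repository, just an illustration of the idea. `autoscaling` is assumed to be a boto3 `autoscaling` client (or anything with the same two methods), the group names are passed in by the caller, and counting only `InService` instances as "actual" is an assumption.

```python
def on_demand_target(spot_desired, spot_in_service):
    """Number of on-demand nodes needed to fill the gap left by spot."""
    return max(spot_desired - spot_in_service, 0)


def reconcile(autoscaling, spot_group, on_demand_group):
    """Scale the on-demand group so it covers what spot cannot provide.

    Returns the new desired capacity, or None if nothing had to change.
    """
    # One describe call for both groups keeps the API call count low.
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[spot_group, on_demand_group]
    )
    groups = {g["AutoScalingGroupName"]: g
              for g in response["AutoScalingGroups"]}
    spot = groups[spot_group]
    on_demand = groups[on_demand_group]

    in_service = sum(
        1 for instance in spot["Instances"]
        if instance["LifecycleState"] == "InService"
    )
    target = on_demand_target(spot["DesiredCapacity"], in_service)
    # Never ask for more than the on-demand group is allowed to run.
    target = min(target, on_demand["MaxSize"])

    # Skip the update call entirely if the group is already where we want it.
    if target == on_demand["DesiredCapacity"]:
        return None

    autoscaling.set_desired_capacity(
        AutoScalingGroupName=on_demand_group,
        DesiredCapacity=target,
        HonorCooldown=False,
    )
    return target
```

Passing the client in as an argument is a deliberate choice in this sketch: it lets you exercise the logic with a small fake object instead of burning real API calls.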
Next thing to consider: what happens when spot instances are terminated? As we know, AWS publishes information about upcoming terminations to the instance via instance metadata. We can easily monitor this using a little bash script that AWS kindly provided:
In this script we monitor the instance metadata, and as soon as the instance is scheduled for termination we switch its status to DRAINING in ECS. This way ECS shifts all containers away from the instance, preparing it to be terminated.
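For readers who prefer to see the loop spelled out, here is a rough Python equivalent of that idea. It is not the AWS-provided script. It assumes the instance metadata endpoint is reachable without a session token (IMDSv1), that `ecs` is a boto3 `ecs` client, and that the cluster name and container instance ARN are already known, for example from the ECS agent's local introspection endpoint. The function names and the five-second interval are hypothetical.

```python
import time
import urllib.error
import urllib.request

# Returns 404 until a termination is scheduled, then 200 with a timestamp.
TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)


def fetch_status(url=TERMINATION_URL, timeout=2):
    """Return the HTTP status of the metadata endpoint, or None on errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Connection problems and timeouts: treat as "no information yet".
        return None


def watch_and_drain(ecs, cluster, container_instance_arn,
                    get_status=fetch_status, sleep=time.sleep,
                    interval=5, max_polls=None):
    """Poll the metadata and set the instance to DRAINING once it is marked.

    Returns True after draining was requested, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if get_status() == 200:
            ecs.update_container_instances_state(
                cluster=cluster,
                containerInstances=[container_instance_arn],
                status="DRAINING",
            )
            return True
        polls += 1
        sleep(interval)
    return False
```

On a real node this would run as a small background service. `get_status` and `sleep` are injectable so the loop can be tried out without sitting on an instance that is about to be reclaimed.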
What this also does is trigger another CloudWatch Event notifying us of an instance change in the ECS cluster. And here comes lambda function number two: https://github.com/level25de/ecs-spot-nodes/blob/master/lambda-handler/cmd/ecs/main.go
Here we again count the instances and compare against the desired count, with one small twist: we don’t check the actual count of the spot autoscaling group but compare against the number of nodes reported as ACTIVE (ready to run tasks) in the ECS cluster. This way we get a real-world count of nodes that excludes the ones about to be terminated in the near future.
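The linked handler is written in Go. Purely as an illustration of the counting step, here is a minimal Python sketch. `ecs` is assumed to be a boto3 `ecs` client, and the function names are hypothetical. How the resulting shortfall is translated into a new size for the on-demand group is left out on purpose.

```python
def count_active_nodes(ecs, cluster):
    """Count container instances in ACTIVE state, following pagination.

    Instances that were switched to DRAINING are not included, which is
    exactly why this count is more honest than the autoscaling group's.
    """
    count = 0
    kwargs = {"cluster": cluster, "status": "ACTIVE"}
    while True:
        page = ecs.list_container_instances(**kwargs)
        count += len(page.get("containerInstanceArns", []))
        token = page.get("nextToken")
        if not token:
            return count
        kwargs["nextToken"] = token


def shortfall(spot_desired, active_nodes):
    """Number of usable nodes missing compared to the desired spot count."""
    return max(spot_desired - active_nodes, 0)
```

A usage example under the same assumptions: with a desired count of 5 and 3 active nodes, `shortfall(5, 3)` yields 2, which is the number of nodes the on-demand side would have to cover.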
There are other solutions for this, one example being implementations of autoscaling with spot instances by the company Spotinst:
or also AutoSpotting:
We specifically decided on the new approach because replacing instances is expensive in our case, while knowing about terminations upfront and being able to shift proactively is good enough.
Lastly, I hope to inspire some people to just experiment, dive a little deeper into the APIs we get from cloud vendors and be as cost-efficient as possible. If you use this project, I am happy to answer questions on GitHub!
Beyond that, we help with everything cloud and Kubernetes, from both a training and an implementation perspective. Feel free to check Level 25 out!