ECS Deep Dive. Ideal container orchestration for small startups

February 10, 2023

Intro

I'm working on a startup where our first priority is to minimize operating costs, and infrastructure is no exception. Our team operates 20 services: 16 microservice apps and 4 infrastructure apps (Vault, ELK stack). Among the many orchestration engines available, we decided to use AWS Elastic Container Service with EC2 Spot Instances. In this article, I share my experience, along with tips for optimizing this orchestration engine, to help you get the best performance while keeping operating costs as low as possible.

1. Introduction

Elastic Container Service (ECS) is a fully managed container orchestration service developed by Amazon. It is a good fit when your application consists of many services that need to be deployed and maintained on AWS, for example microservices, video rendering services, or machine learning workloads.

This post focuses on running an ECS cluster with the EC2 launch type and does not cover the Fargate launch type.

2. Basic components

ECS Design

Cluster

The highest-level component, representing a logical group of tasks or services. You can use clusters to isolate your applications from each other.

Task Definition

A JSON document that describes one or more containers to be deployed together. It includes information such as container images, ports, and data volumes, and can define at most 10 containers per task. It's recommended to give each microservice its own task definition so that it is easier to maintain and scale.
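To make the description above concrete, here is a minimal sketch of a task definition built as a Python dictionary and serialized to JSON. The family name, container name, image URI, and host path are hypothetical placeholders, and `validate_task_definition` is an illustrative helper of my own, not part of any AWS SDK.

```python
import json

# Hypothetical names, image URI, and host path, for illustration only.
task_definition = {
    "family": "orders-service",
    "requiresCompatibilities": ["EC2"],
    "containerDefinitions": [
        {
            "name": "orders-api",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:1.0.0",
            "cpu": 256,      # CPU units (1024 units = 1 vCPU)
            "memory": 512,   # hard memory limit in MiB
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
            "mountPoints": [
                {"sourceVolume": "scratch", "containerPath": "/tmp/scratch"}
            ],
        }
    ],
    "volumes": [
        {"name": "scratch", "host": {"sourcePath": "/var/lib/orders-scratch"}}
    ],
}

MAX_CONTAINERS_PER_TASK = 10  # limit mentioned in the text above


def validate_task_definition(task_def: dict) -> None:
    """Illustrative sanity check: between 1 and 10 containers per task."""
    containers = task_def.get("containerDefinitions", [])
    if not 1 <= len(containers) <= MAX_CONTAINERS_PER_TASK:
        raise ValueError(
            f"expected 1..{MAX_CONTAINERS_PER_TASK} containers, got {len(containers)}"
        )


validate_task_definition(task_definition)
task_definition_json = json.dumps(task_definition, indent=2)
```

The resulting JSON has the same shape you would paste into the ECS console or register through the API; keeping one service per task definition keeps each file this small.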

Tasks

A task is the smallest deployable unit in an ECS cluster (like a pod in k8s). It can run as a standalone task or as part of a service.

Services

Services are logical components that run and maintain the desired number of tasks in an ECS cluster. If any of your tasks fails or stops for any reason, the Amazon ECS service scheduler launches another instance of the task from your task definition to replace it and maintain the desired number of tasks in the service. Services also control the scaling of tasks.

Container Instances

If you run ECS with the EC2 launch type, a container instance is an EC2 instance that runs in your VPC, is registered to your cluster, and is managed by ECS (like a node in k8s).

Container Agent

Each container instance in an Amazon ECS cluster runs one container agent. The agent sends information about the currently running tasks and the resource utilization of your containers to ECS. It starts and stops tasks whenever it receives a request from ECS.

Capacity Provider

Amazon ECS capacity providers manage the scaling of infrastructure for the tasks in your clusters. Each cluster can have one or more capacity providers and an optional capacity provider strategy. The capacity provider strategy determines how tasks are spread across the cluster's capacity providers. Each EC2 capacity provider has an Auto Scaling Group (ASG) attached to it, and the capacity provider controls the scaling actions of that ASG. Those scaling actions depend on the number of tasks that ECS wants the capacity provider to run.

3. Optimization tips

Use AWSVPC network type with ENI trunking for tasks (Increase network throughput)

Vpc trunking

Tasks running on ECS can use three main network modes: awsvpc, bridge, and host. With bridge and host, all tasks on the same instance share the container instance's ENI (elastic network interface), and bridge mode also translates ports between the container and the host, which can hurt application performance. With awsvpc, each task is allocated its own ENI and a primary private IPv4 address, which increases network throughput to its containers. Because each instance type supports only a limited number of ENIs, enable ENI trunking (VPC trunking) so that more awsvpc tasks fit on each container instance.
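As a rough sketch of what this looks like in configuration, the snippet below shows the account setting that turns on ENI trunking, a task definition fragment using the awsvpc mode, and the network configuration a service needs in that mode. The subnet and security group IDs, names, and image are hypothetical, and `check_awsvpc_ports` is an illustrative helper of my own that encodes one awsvpc rule: a host port, if set, must equal the container port.

```python
# Account setting that enables ENI trunking (name/value pair used by the
# ECS PutAccountSetting API).
eni_trunking_setting = {"name": "awsvpcTrunking", "value": "enabled"}

# Hypothetical task definition fragment using the awsvpc network mode.
task_definition_fragment = {
    "family": "orders-service",
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "orders-api",
            "image": "orders-api:1.0.0",
            "memory": 512,
            "portMappings": [{"containerPort": 8080}],
        }
    ],
}

# With awsvpc, the service must say which subnets and security groups the
# task ENIs use. The IDs below are placeholders.
service_network_configuration = {
    "awsvpcConfiguration": {
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroups": ["sg-cccc3333"],
    }
}


def check_awsvpc_ports(task_def: dict) -> None:
    """In awsvpc mode, hostPort must be omitted or equal to containerPort."""
    if task_def.get("networkMode") != "awsvpc":
        return
    for container in task_def.get("containerDefinitions", []):
        for mapping in container.get("portMappings", []):
            host_port = mapping.get("hostPort")
            if host_port is not None and host_port != mapping["containerPort"]:
                raise ValueError(
                    f"{container['name']}: hostPort {host_port} must equal "
                    f"containerPort {mapping['containerPort']} in awsvpc mode"
                )


check_awsvpc_ports(task_definition_fragment)
```

Since every task has its own ENI, there is no dynamic host port mapping to manage: the load balancer targets the task IP and container port directly.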

Configure Task Roles (Security)

ECS uses two task-related IAM roles: the Task Role and the Task Execution Role. They let ECS and the containers in your task assume an IAM role to access other AWS services without hardcoding AWS credentials in the application code. This removes sensitive credentials from your codebase, and because the role credentials are temporary and rotated automatically, there are no long-lived keys that an attacker could steal and use from outside your private VPC network.

  • Task Execution Role: grants the ECS agent permissions to call specific AWS services when the task is initialized, for example to pull a container image from Amazon ECR or send container logs to CloudWatch.
  • Task Role: grants the containers in the task permissions to access specific resources while the task is running (e.g., S3, WebSocket API, CloudWatch, SNS, SQS, …).
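The two roles above could be wired up roughly as follows. This is a sketch under assumptions: the role names, account ID, and bucket name are hypothetical, and the S3 permission is just one example of what a task role might allow. Both roles trust the `ecs-tasks.amazonaws.com` service principal so that ECS can assume them on behalf of the task.

```python
import json

# Trust policy shared by both roles: lets ECS tasks assume the role.
ecs_tasks_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS managed policy commonly attached to the Task Execution Role
# (ECR image pull and CloudWatch Logs permissions).
execution_role_managed_policy_arn = (
    "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
)

# Example permissions for the Task Role: read objects from one
# hypothetical bucket, and nothing else.
task_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-orders-bucket/*",
        }
    ],
}

# The task definition references both roles by ARN (hypothetical ARNs).
task_definition_roles = {
    "executionRoleArn": "arn:aws:iam::123456789012:role/ordersTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::123456789012:role/ordersTaskRole",
}

trust_policy_json = json.dumps(ecs_tasks_trust_policy, indent=2)
task_role_policy_json = json.dumps(task_role_policy, indent=2)
```

Inside the container, the AWS SDKs pick up the task role credentials automatically, so application code never handles access keys.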

Update task placement strategy (Availability, Cost optimization)

Scale strategy

A task placement strategy is a service setting that tells ECS which container instance is most suitable for a task when it is scheduled. I recommend using two strategies in order: first the spread type on the "attribute:ecs.availability-zone" field, then the binpack type on the CPU field. ECS first spreads tasks evenly across as many Availability Zones as it can, which increases the availability of the service. Then, if the chosen zone has more than one container instance, it places the task on the instance that will be left with the least unused CPU. This minimizes the number of container instances in use, thereby saving costs.
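The sketch below shows the placement strategy as it would appear in a service definition, followed by a deliberately simplified model of how the two rules combine. The `Instance` class and the `pick_instance` and `place` functions are my own illustration of the idea, not the actual ECS scheduler, which takes more factors into account (memory, ports, constraints).

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Placement strategy as passed in a service definition: spread first, then binpack.
placement_strategy = [
    {"type": "spread", "field": "attribute:ecs.availability-zone"},
    {"type": "binpack", "field": "cpu"},
]


@dataclass
class Instance:
    name: str
    zone: str
    free_cpu: int            # unreserved CPU units
    service_tasks: int = 0   # tasks of this service already on the instance


def pick_instance(instances: List[Instance], task_cpu: int) -> Optional[Instance]:
    """Simplified model: spread across zones, then binpack on CPU."""
    candidates = [i for i in instances if i.free_cpu >= task_cpu]
    if not candidates:
        return None
    tasks_per_zone = Counter()
    for inst in instances:
        tasks_per_zone[inst.zone] += inst.service_tasks
    # spread: prefer the zone(s) currently running the fewest tasks of the service
    fewest = min(tasks_per_zone[i.zone] for i in candidates)
    in_best_zone = [i for i in candidates if tasks_per_zone[i.zone] == fewest]
    # binpack: among those, pick the instance with the least free CPU that still fits
    return min(in_best_zone, key=lambda i: i.free_cpu)


def place(instances: List[Instance], task_cpu: int) -> Optional[Instance]:
    """Pick an instance and reserve the task's CPU on it."""
    chosen = pick_instance(instances, task_cpu)
    if chosen is not None:
        chosen.free_cpu -= task_cpu
        chosen.service_tasks += 1
    return chosen


cluster = [
    Instance("a1", "zone-a", free_cpu=1024),
    Instance("a2", "zone-a", free_cpu=512),
    Instance("b1", "zone-b", free_cpu=2048),
]
placements = [place(cluster, task_cpu=256).name for _ in range(3)]
# placements == ["a2", "b1", "a2"]
```

In this toy run, the first task lands on the fullest instance (a2), the second goes to the other zone (b1), and the third returns to a2, leaving a1 empty so that it could be scaled in.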

Use with EC2 Spot Instances (Cost optimization)

EC2 Spot Instances use spare EC2 capacity that is available for less than the On-Demand price. Spot prices are usually up to 60% lower than On-Demand prices (AWS advertises discounts of up to 90%), so they can lower your Amazon EC2 costs significantly. The trade-off is that you have to put in extra effort to handle graceful shutdown in case of interruption.

One tip to avoid downtime is to run at least 2 tasks of each service at the same time on 2 different instances. AWS rarely reclaims both Spot Instances at the same moment when Amazon EC2 needs the capacity back, so at the time of an interruption the service keeps serving from the remaining task and does not have to wait for a new Spot Instance to start a replacement. If you don't want to run at least 2 tasks, you can set a Target Capacity of less than 100% in the Capacity Provider, so the Capacity Provider always keeps free space to start a new task that replaces the draining one.

In my experience, this approach is suitable for stateless applications, because they can be stopped and restarted anywhere and do not require fixed storage. If you want to deploy stateful applications, create another capacity provider backed by On-Demand instances.
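To see what a Target Capacity below 100% means in numbers, here is a simplified sketch. The capacity provider name and the ARN placeholder are hypothetical, and `instances_to_run` is my own approximation of managed scaling: ECS tries to keep the ratio of instances needed to instances running, times 100, at the target value. The real service also applies scaling step sizes and warm-up periods that are ignored here.

```python
import math

# Capacity provider settings with managed scaling at an 80% target.
# The name and the ARN placeholder are hypothetical.
spot_capacity_provider = {
    "name": "spot-capacity-provider",
    "autoScalingGroupProvider": {
        "autoScalingGroupArn": "<ARN of the Spot Auto Scaling Group>",
        "managedScaling": {"status": "ENABLED", "targetCapacity": 80},
    },
}


def instances_to_run(instances_needed: int, target_capacity: int) -> int:
    """Approximate number of instances the ASG keeps running.

    Simplified model: needed / running * 100 should equal target_capacity,
    so running = ceil(needed * 100 / target_capacity).
    """
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return math.ceil(instances_needed * 100 / target_capacity)


target = spot_capacity_provider["autoScalingGroupProvider"]["managedScaling"]["targetCapacity"]
running_at_target = instances_to_run(4, target)   # 5 instances for 4 instances' worth of tasks
running_at_full = instances_to_run(4, 100)        # 4 instances, no headroom
```

With an 80% target, a workload that needs 4 instances runs on 5, so roughly one instance of headroom is available when a Spot Instance starts draining. The price of that headroom is paying for capacity that sits idle most of the time.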

4. Drawback of ECS compared to Kubernetes

Kubernetes is widely regarded as the most robust orchestration engine, with the full set of features that container orchestration needs. Elastic Container Service is a simpler alternative that is easier to learn and use, and it has enough features to operate and scale microservices applications. But after working with it for a while, I realized that it is not "intelligent" about consolidating workloads. Sometimes a container instance has only one task running on it while other instances still have enough free resources for that task. If ECS moved that task to another instance in the cluster, it could stop 1 instance and thereby save costs for us, but it does not rebalance running tasks this way. If you are running a small number of containers, this is not a big problem. But if you are operating hundreds of containers, Kubernetes is a better solution: Karpenter, for example, minimizes the number of nodes that a Kubernetes cluster needs for its workload.

Information: ECS does not charge for the control plane (master nodes). If you deploy applications on EKS (the Kubernetes service of AWS), you are charged $0.10 per hour for each cluster's control plane at the time of writing, which is roughly $70 per month.

5. Summary

In this article, I have introduced the basic components of ECS, some tips to optimize it, and its main drawback. In my view, Elastic Container Service is the most suitable solution for small startups that need to operate a microservices system on AWS: it is simple enough to learn, easy to operate, secure, and cheap.