JAWS PANKRATION 2024 - ECS Best Practice All on board (English)

by t-kikuchi

Slide 1

Slide 1 text

ECS Best Practice All on board
Toshinori Kikuchi
Classmethod, Inc.

Slide 2

Slide 2 text

Self-Introduction
● Name: Toshinori Kikuchi
● Company Affiliation: Classmethod, Inc.
● AWS Title
  ○ 2024 Japan AWS Top Engineers
  ○ 2024 Japan AWS All Certifications Engineers
● Blog: https://dev.classmethod.jp/author/tooti/
● X: https://x.com/tttkkk215
● Favorite Technologies: Containers, Terraform, Amazon EventBridge, AWS Step Functions

Slide 3

Slide 3 text

Target audience and goals of this session
● Target audience
  ○ You have adopted ECS for containerization.
  ○ You use ECS but are unsure whether you are using it correctly.
  ○ You have migrated to ECS and are looking for next steps.
● Goals
  ○ Be able to build systems in line with ECS best practices
  ○ Be able to identify points for improvement in systems already running on ECS

Slide 4

Slide 4 text

What are Amazon ECS best practices?
● AWS publishes the following document as best practices for ECS
  ○ Amazon ECS best practices - Amazon Elastic Container Service
● The AWS Service Delivery Program (Amazon ECS Delivery Partners) has a checklist for confirming that an ECS delivery is in line with best practices.
● I made my own checklist based on these documents, which you can find here.
  ○ https://github.com/ice1203/ecs-bestpractice-checklist

Slide 5

Slide 5 text

What are Amazon ECS best practices?
● The agenda for this session picks out the items from these best practices that tend to cause problems when you actually apply them.
● For each item, I explain the key aspects to consider during design.

Slide 6

Slide 6 text

Agenda
● CI/CD
● Task Definition (task sizing and resource reservation/restriction)
● Security (container image security scanning / runtime security)
● Inter-service network configuration
● Observability

Slide 7

Slide 7 text

CI/CD
● Key design aspects to consider?
● Source control
  ○ Which version control system to use (e.g., GitHub, GitLab)
  ○ At which unit of the system to split repositories
  ○ Whether to separate repositories for infrastructure and applications
  ○ How to protect branches (e.g., using branch protection rules)
● Establish a branching strategy
  ○ Which branching strategy to use (e.g., GitFlow, GitHub Flow, trunk-based development)
  ○ How to express differences between environments
● Overall pipeline design
  ○ Which CI/CD tools to use (e.g., GitHub Actions)
  ○ What the process flow looks like
  ○ Clarify the role and purpose of each stage
● Set triggers
  ○ Which events trigger which processing
  ○ Whether to perform periodic runs (e.g., daily builds)
  ○ Whether to provide manual trigger options
● Test automation
  ○ What kind of testing to run
    ■ Unit and integration testing
● IaC selection
  ○ Which IaC tool to use
    ■ AWS CDK
    ■ Terraform
  ○ How to manage ECS task definitions and services
● Deployment method selection
  ○ Which deployment strategy to use
    ■ Blue/green deployment
    ■ Rolling updates
● Rollback strategies
  ○ Automatic rollback mechanism
  ○ Establishing a manual rollback process
● Security scanning
  ○ What kind of security scans to run
    ■ Vulnerability scanning of container images
    ■ Security scanning of IaC code
● Monitoring and notification settings
  ○ Deployment success/failure notifications
  ○ Determining which metrics to monitor during deployment
etc.

Slide 8

Slide 8 text

CI/CD (IaC)
● IaC
  ○ You can't go wrong with CDK or Terraform as a base.
  ○ Infrastructure and applications have different lifecycles. Task definitions and service definitions follow the application lifecycle.
  ○ Therefore, we want to manage task and service definitions on the application side ⇒ ecspresso

Slide 9

Slide 9 text

CI/CD (IaC)
● ecspresso
  ○ A tool for managing as code, and deploying, the resources related to ECS services and tasks
  ○ Can read tfstate files and reference resource information from them
  ○ Can also read CloudFormation Outputs and Export values
  ○ At minimum, only three files need to be managed.

Slide 10

Slide 10 text

CI/CD (IaC)
● Notes on deploying with ecspresso
  ○ ecspresso manages only ECS task definitions and service definitions
  ○ Application Auto Scaling and CodeDeploy are not covered.
  ○ The first time you build the system, proceed in this order:
    i. Create the ALB, ECS cluster, etc. with IaC (Terraform, etc.)
    ii. Create the ECS task definition and ECS service with ecspresso
    iii. Create the CodeDeploy resources with IaC
  ○ After the initial build, ecspresso and IaC can (basically) be updated independently.

Slide 11

Slide 11 text

CI/CD (branch strategy)
● What type of branching strategy to use?
  ○ There is no single right answer; the appropriate form depends on team size, release cycle, company culture, etc.
  ○ One consideration specific to containers is whether to share container images between environments.
    ■ Personally, I think it is better to share them.
  ○ I have worked out what should happen at each point in the flow, and I will go through it separately for infrastructure and applications.

Slide 12

Slide 12 text

CI/CD (branch strategy)
● Infrastructure CI/CD

Slide 13

Slide 13 text

CI/CD (branch strategy)
● Infrastructure CI/CD

Slide 14

Slide 14 text

CI/CD (branch strategy)
● Application CI/CD

Slide 15

Slide 15 text

CI/CD (Deploy)
● Rolling update
  ○ Advantages
    ■ Simple, making it relatively easy to design and build
    ■ The simple design limits the number of failure points (and consequently makes troubleshooting easier)
  ○ Disadvantages
    ■ There are moments when old and new tasks run side by side
* Image source: DevelopersIO, https://dev.classmethod.jp/articles/ecs-deploytype/

Slide 16

Slide 16 text

CI/CD (Deploy)
● Blue/green deployment
  ○ Advantages
    ■ The application is switched over all at once, so old and new tasks are never mixed
  ○ Disadvantages
    ■ A load balancer is required
    ■ The configuration is complex, so it is relatively difficult to design and build, and it takes time for new team members to understand it.
    ■ Additional configuration files must be managed (e.g., CodeDeploy settings).
    ■ The process is complex, which makes troubleshooting more difficult.
* Image source: DevelopersIO, https://dev.classmethod.jp/articles/ecs-deploytype/

Slide 17

Slide 17 text

CI/CD (Deploy)
ecspresso.yml

    cluster: default
    service: test
    service_definition: ecs-service-def.json
    task_definition: ecs-task-def.json
    appspec:
      Hooks:
        - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
        - AfterInstall: "LambdaFunctionToValidateAfterTraffic"
        - AfterAllowTestTraffic: "LambdaFunctionToValidateAfterTestTrafficStarts"
        - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeAllowingProductionTraffic"
        - AfterAllowTraffic: "LambdaFunctionToValidateAfterAllowingProductionTraffic"

● Using ecspresso also reduces what you have to manage for blue/green deployments
  ○ A separate appspec.yaml is basically unnecessary, because the AppSpec can be written in ecspresso.yml.
  ○ Commands to suspend and resume Auto Scaling are available, so you can stop and restart Auto Scaling before and after a deployment simply by including them in the CD process.

Slide 18

Slide 18 text

Task Definitions
● Key design aspects to consider?
● Task size
  ○ Allocate CPU and memory appropriately, based on application requirements
  ○ Set resource limits
● Container images
  ○ Use up-to-date, stable images from trusted repositories
  ○ Use multi-stage builds
  ○ Minimize image size
  ○ Implement graceful shutdown
  ○ Image tagging (avoid using the latest tag)
● Storage settings
  ○ Which storage to use
  ○ Consolidate persistent data on EFS
  ○ Size temporary storage appropriately
● Security hardening
  ○ Configure task roles and task execution roles properly
  ○ Apply IAM policies based on the principle of least privilege
  ○ readOnlyRootFilesystem
● Log configuration
  ○ CloudWatch Logs
  ○ Consider FireLens
● Management of environment variables
  ○ Manage sensitive information securely (using AWS Secrets Manager)
● Implementation of health checks
  ○ Define health check commands for containers
  ○ Set appropriate timeouts and intervals
● Compliance and auditing
  ○ Apply required tags
  ○ Configure logging based on audit requirements
● Networking optimization
  ○ ENI trunking configuration (if needed)
etc.

Slide 19

Slide 19 text

Task Definitions (TaskSize)
● In conclusion, it is best to check the utilization rate and adjust the size through performance tests, etc.
● To check the utilization of each container
  ○ Enable Container Insights
  ○ Run the following query in CloudWatch Logs Insights

    fields @timestamp, CpuUtilized, CpuReserved, CpuUtilized * 100 / CpuReserved as CpuUtilization, MemoryUtilized, MemoryReserved, MemoryUtilized * 100 / MemoryReserved as MemoryUtilization
    | filter ContainerName = "xxxxx" and Type = "Container" and TaskId = "yyyyy"
    | sort @timestamp asc
    | limit 10000
    | stats avg(CpuUtilization),avg(MemoryUtilization) by bin(1m)
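The arithmetic in the query can be checked offline with a small Python sketch. It assumes records that carry the same four fields the query reads (CpuUtilized, CpuReserved, MemoryUtilized, MemoryReserved). The class and function names are hypothetical and are not part of any AWS SDK.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class ContainerSample:
    """Hypothetical subset of one per-container Container Insights record."""
    cpu_utilized: float
    cpu_reserved: float
    memory_utilized: float
    memory_reserved: float


def utilization_percent(used: float, reserved: float) -> Optional[float]:
    """Same formula as the query: used * 100 / reserved.

    Returns None when nothing is reserved, because the ratio is undefined.
    """
    if reserved <= 0:
        return None
    return used * 100 / reserved


def _mean(values: List[float]) -> Optional[float]:
    return sum(values) / len(values) if values else None


def average_utilization(
    samples: Iterable[ContainerSample],
) -> Tuple[Optional[float], Optional[float]]:
    """Average CPU and memory utilization (%) over the samples, like stats avg(...)."""
    cpu: List[float] = []
    mem: List[float] = []
    for s in samples:
        c = utilization_percent(s.cpu_utilized, s.cpu_reserved)
        m = utilization_percent(s.memory_utilized, s.memory_reserved)
        if c is not None:
            cpu.append(c)
        if m is not None:
            mem.append(m)
    return _mean(cpu), _mean(mem)
```

A sustained average close to 100% suggests the reservation is too small, while a consistently low value suggests the task can be sized down.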

Slide 20

Slide 20 text

Task Definitions (TaskSize)

Slide 21

Slide 21 text

Task Definitions (TaskSize)
● Task-level and container-level settings
  ○ Task level: represents both the reservation and the upper limit on the amount of CPU and memory for the task
    ■ Containers run by the task can use only the capacity defined by the task size
    ■ On Fargate, the task size is the basis for choosing the instance the task runs on (so it must be specified).

Slide 22

Slide 22 text

Task Definitions (TaskSize)
● Container level: CPU and memory reservations or limits for each container.
● Note that Container Insights does not record per-container utilization unless container-level CPU and memory settings are present.
  ○ CPU
    ■ While the task-level CPU setting represents the maximum CPU available, container-level CPU settings act as relative weights (CPU shares).
    ■ This weighting takes effect only when there is CPU contention.
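To make the CPU share weighting concrete, here is a minimal Python sketch of the fully contended case only, where every container in the task wants as much CPU as it can get. The function name is hypothetical and the model is a deliberate simplification. Without contention, a container can use idle capacity up to the task-level limit, which this sketch does not model.

```python
from typing import Dict


def cpu_under_contention(task_cpu_units: int, container_shares: Dict[str, int]) -> Dict[str, float]:
    """Split the task-level CPU among containers in proportion to their CPU shares.

    Simplified model: valid only when all containers compete for CPU at once.
    """
    total = sum(container_shares.values())
    if total <= 0:
        raise ValueError("at least one container needs a positive CPU share")
    return {
        name: task_cpu_units * share / total
        for name, share in container_shares.items()
    }
```

For example, with a task size of 1024 CPU units and container-level values of 300 and 100, the weights are 3:1, so under contention the containers end up with 768 and 256 units respectively.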

Slide 23

Slide 23 text

Task Definitions (TaskSize)
● Container level: CPU and memory reservations or limits for each container
  ○ Memory
    ■ Soft limit only: the soft limit is the reservation. The upper limit is the value specified at the task level (so if a task has multiple containers, they compete for the task-level memory).
    ■ Soft limit and hard limit: the soft limit is the reservation and the hard limit is the upper limit.
    ■ Hard limit only: the hard limit is both the reservation and the upper limit.
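The three memory cases can be written down as a small lookup. This is an illustrative Python sketch with a hypothetical function name. It assumes a task-level memory value is set and only restates the rules from the slide; it is not how ECS itself is implemented.

```python
from typing import Optional, Tuple


def effective_memory(
    task_memory_mib: int,
    soft_limit_mib: Optional[int] = None,
    hard_limit_mib: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (reservation, upper_limit) in MiB for one container."""
    if soft_limit_mib is None and hard_limit_mib is None:
        raise ValueError("specify a soft limit, a hard limit, or both")
    if hard_limit_mib is None:
        # Soft limit only: reserved amount, capped by the task-level memory.
        return soft_limit_mib, task_memory_mib
    if soft_limit_mib is None:
        # Hard limit only: it is both the reservation and the cap.
        return hard_limit_mib, hard_limit_mib
    if soft_limit_mib > hard_limit_mib:
        raise ValueError("soft limit must not exceed hard limit")
    # Both: soft limit reserves, hard limit caps.
    return soft_limit_mib, hard_limit_mib
```

For a 2048 MiB task, `effective_memory(2048, soft_limit_mib=512)` gives a 512 MiB reservation with the full 2048 MiB as the ceiling, which is the case where containers in the same task compete for memory.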

Slide 24

Slide 24 text

Task Definitions
● Dockerfile best practices
  ○ Use multi-stage builds
  ○ One container, one process
  ○ Do not run as the root user
  ○ Keep containers stateless (do not retain state)
  ○ Keep the container image as small as possible
  ○ etc.
  ○ For more information:
    ■ https://docs.docker.jp/develop/develop-images/dockerfile_best-practices.html

Slide 25

Slide 25 text

Task Definitions

    # build stage
    FROM node:22 AS builder
    WORKDIR /usr/src/app
    # No need to copy package.json and package-lock.json as they are bind-mounted
    #COPY package*.json ./
    # Install dependencies.
    RUN --mount=type=bind,source=package.json,target=package.json \
        --mount=type=bind,source=package-lock.json,target=package-lock.json \
        --mount=type=cache,target=/root/.npm,sharing=locked \
        npm ci --omit=dev
    # Copy the source code of the application
    COPY . .

    # execution stage
    FROM gcr.io/distroless/nodejs22-debian12
    # Copy the application directory from the build stage
    COPY --chown=nonroot:nonroot --from=builder /usr/src/app /usr/src/app
    USER nonroot
    WORKDIR /usr/src/app
    CMD [ "index.js" ]

● Recently learned
  ○ RUN --mount=type=bind
    ■ Runs the command with files bind-mounted from the build context
    ■ Can reduce unnecessary COPY instructions.
  ○ RUN --mount=type=cache
    ■ Mounts a cache directory
    ■ Unlike the layer cache, this cache survives when a dependency changes, so packages are not all downloaded again from scratch.

Slide 26

Slide 26 text

Task Definitions

    # php
    FROM php:7.4-fpm-alpine as app_php
    RUN apk add --no-cache git
    COPY --from=composer:latest /usr/bin/composer /usr/bin/composer
    WORKDIR /var/www/html
    COPY composer.json composer.lock ./
    RUN composer install --no-dev --optimize-autoloader --no-interaction --no-progress
    COPY . .
    CMD ["php-fpm"]

    # NGINX
    FROM nginx:1.17-alpine AS app_nginx
    COPY docker/nginx/conf.d/default.conf /etc/nginx/conf.d/
    WORKDIR /var/www/html
    COPY --from=app_php /var/www/html/public public/

● Recently learned
  ○ docker build --target=<stage> .
    ■ An intermediate stage can be built as the output image.
    ■ For example, you can prepare a stage for testing and pass it as --target when testing.
    ■ You may not have to maintain many separate Dockerfiles.

Slide 27

Slide 27 text

Container Security
● Security measures along the container lifecycle
  ○ Talk about the arrows

Slide 28

Slide 28 text

Container Security (Registry)
● Checking images
  ○ What to check
    ■ Vulnerabilities contained in the image
    ■ Vulnerabilities in the way the Dockerfile is written
  ○ When to check
    ■ At build time
    ■ Periodic checks are also important
      ● Amazon Inspector enhanced scanning automatically rescans images as new CVEs are added

Slide 29

Slide 29 text

Container Security (Registry)

Slide 30

Slide 30 text

Container Security
● Runtime security
  ○ GuardDuty Runtime Monitoring
    ■ Runtime Monitoring for ECS was added in the November 2023 update
    ■ An AWS-native service can now detect threats at runtime
    ■ It only detects threats; if you want an automatic response (e.g., stopping the ECS task) when a threat is detected, you need to build it yourself
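One way to build such an automatic response is an EventBridge rule that forwards GuardDuty findings to a Lambda function, which then stops the affected task. The Python sketch below shows only the handler logic. The event field names (`detail.resource.ecsClusterDetails.arn` and `taskDetails.arn`) are assumptions about the finding format and should be checked against a real finding. `ecs_client` is expected to be a boto3 ECS client (`boto3.client("ecs")`), passed in so the logic can be exercised without AWS access. The function names are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def extract_task(event: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (cluster_arn, task_arn) from a finding event, or None if absent.

    Assumed event shape; verify the field names against real findings.
    """
    details = (
        event.get("detail", {})
        .get("resource", {})
        .get("ecsClusterDetails", {})
    )
    cluster_arn = details.get("arn")
    task_arn = details.get("taskDetails", {}).get("arn")
    if cluster_arn and task_arn:
        return cluster_arn, task_arn
    return None


def handle_finding(event: Dict[str, Any], ecs_client: Any) -> bool:
    """Stop the ECS task named in the finding. Returns True if a stop was requested."""
    target = extract_task(event)
    if target is None:
        return False
    cluster_arn, task_arn = target
    finding_id = event.get("detail", {}).get("id", "unknown")
    ecs_client.stop_task(
        cluster=cluster_arn,
        task=task_arn,
        reason="Stopped after GuardDuty finding " + str(finding_id),
    )
    return True
```

In practice you would probably also filter by finding type and severity before stopping anything, since stopping a task on every finding may be too aggressive for production services.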

Slide 31

Slide 31 text

Container Security
● Very easy to set up

Slide 32

Slide 32 text

Container Security
● Runtime security
  ○ GuardDuty Runtime Monitoring vs. Sysdig Secure
    ■ Number of rules supported
      ● 36 vs. at least 150
    ■ Detection speed
      ● Approx. 2 minutes (※1) vs. several seconds
    ■ Customizing the rules
      ● Not possible vs. possible
※1 Time from the actual incident to its registration as an event in GuardDuty (measured by the author).

Slide 33

Slide 33 text

Container Security

Slide 34

Slide 34 text

Observability
● Best practice is to collect three types of data to understand what is happening in the system
  ○ Metrics
  ○ Traces
  ○ Logs

Slide 35

Slide 35 text

Observability
● What is AWS Distro for OpenTelemetry (ADOT)?
  ○ An OpenTelemetry (OTel) distribution supported by AWS
  ○ OTel is a CNCF project with great future potential
  ○ Highly compatible with AWS services
● The main components are the SDK and the collector
● Logs and metrics can be collected as well as traces
● Trace information can be sent to X-Ray, CloudWatch, Prometheus, etc.
● On ECS Fargate, the collector runs as a sidecar container.
* Image source: One Observability Workshop, https://catalog.workshops.aws/observability/ja-JP/aws-managed-oss/adot/gowalkthrough

Slide 36

Slide 36 text

Observability
● How to implement
  ○ Easy to implement in any language that has an auto-instrumentation agent
  ○ Java example
    i. Include the auto-instrumentation agent jar in the application's container image
    ii. Add the IAM policy to the ECS task role and the ECS task execution role
    iii. Add the OTel collector as a sidecar container to the ECS task definition that contains the container image above
* For more information, see https://aws-otel.github.io/docs/getting-started/java-sdk/auto-instr

Slide 37

Slide 37 text

Observability

    FROM amazoncorretto:17-alpine
    WORKDIR /app
    ADD https://github.com/aws-observability/aws-otel-java-instrumentation/releases/download/v1.21.1/aws-opentelemetry-agent.jar /app/aws-opentelemetry-agent.jar
    ENV JAVA_TOOL_OPTIONS "-javaagent:/app/aws-opentelemetry-agent.jar"
    ARG JAR_FILE=build/libs/*.jar
    COPY --from=build /app/${JAR_FILE} ./app.jar
    # OpenTelemetry agent configuration
    ENV OTEL_TRACES_SAMPLER "parentbased_traceidratio"
    ENV OTEL_TRACES_SAMPLER_ARG "0.3"
    ENV OTEL_PROPAGATORS "tracecontext,baggage,xray"
    ENV OTEL_RESOURCE_ATTRIBUTES "service.name=PetSearch"
    ENV OTEL_IMR_EXPORT_INTERVAL "10000"
    ENV OTEL_EXPORTER_OTLP_ENDPOINT "http://localhost:4317"
    ENTRYPOINT ["java","-jar","/app/app.jar"]

● Environment variables can be used to change the sampling rate, the service name displayed in the observability backend, etc.
※ For more information on environment variables, see https://opentelemetry.io/docs/languages/sdk-configuration/general/
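The sampler settings above (`parentbased_traceidratio` with an argument of `0.3`) can be read as follows: a span that has a parent follows the parent's sampling decision, and a root span is kept for roughly 30% of trace IDs. The Python sketch below illustrates that idea only. It is a simplification with hypothetical names, not the actual code of the OpenTelemetry SDK or the ADOT agent.

```python
from typing import Optional

# Use the low 64 bits of the trace ID as a uniformly distributed value.
_TRACE_ID_MASK = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Decide whether a span is sampled.

    parent_sampled: None for a root span, otherwise the parent's decision.
    ratio: fraction of root traces to keep, e.g. 0.3 for about 30%.
    """
    if parent_sampled is not None:
        # Parent-based: child spans inherit the decision, so traces stay complete.
        return parent_sampled
    # Ratio-based: keep the trace if its ID falls below the threshold.
    bound = round(ratio * (1 << 64))
    return (trace_id & _TRACE_ID_MASK) < bound
```

Because the decision depends only on the trace ID, every service that sees the same root trace reaches the same conclusion, and child spans never contradict their parent.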

Slide 38

Slide 38 text

Observability

    {
      "name": "aws-otel-collector",
      "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.32.0",
      "cpu": 64,
      "memory": 256,
      "links": [],
      "portMappings": [],
      "essential": true,
      "entryPoint": [],
      "command": [
        "--config",
        "/etc/ecs/ecs-cloudwatch-xray.yaml"
      ],

● The collector Docker image ships with several configuration files by default, and you choose one by passing it as an argument when launching the container

Slide 39

Slide 39 text

Specific implementation examples
● Infrastructure repository
  ○ https://github.com/ice1203/ecs_sample_infra
● Application repository
  ○ https://github.com/ice1203/ecs_sample_app

Slide 40

Slide 40 text

Bibliography
● Amazon ECS best practices - Amazon Elastic Container Service
● Comparing ECS rolling updates and blue/green deployments | DevelopersIO
● Investigating the difference between "task size CPU" and "container CPU units" in Amazon ECS task definitions - のぴぴのメモ
● How Amazon ECS manages CPU and memory resources | Containers
● Getting per-container metrics in AWS ECS #CloudWatch - Qiita
● https://github.com/phamthanhgiang/ECS-Fargate-hands-on/
● Best practices for writing Dockerfiles - Docker-docs-ja 24.0 documentation
● Traces | Introduction to OpenTelemetry

Slide 41

Slide 41 text

Thank you for your attention