
JAWSPANKRATION2024-ECS Best Practice All on board(english)

t-kikuchi

August 25, 2024

Transcript

  1. Self-Introduction
    • 【Name】: Toshinori Kikuchi
    • 【Company Affiliation】: Classmethod, Inc.
    • 【AWS Title】
      ◦ 2024 Japan AWS Top Engineers
      ◦ 2024 Japan AWS All Certifications Engineers
    • 【Blog】 https://dev.classmethod.jp/author/tooti/
    • 【X】 https://x.com/tttkkk215
    • 【Favorite Technologies】 Containers, Terraform, Amazon EventBridge, AWS Step Functions
  2. Target audience and goals of this session
    • Target audience
      ◦ I have adopted ECS for containerization.
      ◦ I use ECS but am unsure whether I am using it correctly.
      ◦ I have migrated to ECS and am looking for next steps.
    • Goals
      ◦ Be able to build systems in line with ECS best practices
      ◦ Be able to identify points of improvement in systems currently built on ECS
  3. What are Amazon ECS best practices?
    • AWS has published the following document as best practices for ECS
      ◦ Amazon ECS best practices - Amazon Elastic Container Service
    • The AWS Service Delivery Program (Amazon ECS Delivery Partners) has a checklist to ensure that ECS delivery is in line with best practices.
    • I made my own checklist based on these documents, which you can find here.
      ◦ https://github.com/ice1203/ecs-bestpractice-checklist
  4. What are Amazon ECS best practices?
    • The agenda for this session covers the items in these best practices that tend to cause problems when you actually apply them.
    • For each item, I explain the key aspects to consider during design.
  5. Agenda
    • CI/CD
    • Task Definition (task sizing and resource reservation/restriction)
    • Security (container image security scanning / runtime security)
    • Inter-service network configuration
    • Observability
  6. CI/CD
    • Key design aspects to consider
    • Source control
      ◦ Which version control system to use (e.g., GitHub, GitLab)
      ◦ Which units of the system get separate repositories
      ◦ Whether to separate repositories for infrastructure and applications
      ◦ How to protect branches (e.g., using branch protection rules)
    • Establish a branching strategy
      ◦ What branching strategy to use (e.g., GitFlow, GitHub Flow, trunk-based development)
      ◦ How to express environmental differences
    • Overall pipeline design
      ◦ What CI/CD tools to use (e.g., GitHub Actions)
      ◦ What the process flow is
      ◦ Clarify the role and purpose of each stage
    • Set triggers
      ◦ What triggers what kind of processing
      ◦ Whether to perform periodic runs (e.g., daily builds)
      ◦ Provide manual trigger options
    • Test automation
      ◦ What kind of testing to do
        ▪ Unit and integration testing
    • IaC selection
      ◦ Which IaC tool to use
        ▪ AWS CDK
        ▪ Terraform
      ◦ How to manage ECS task definitions and services
    • Deployment method selection
      ◦ Which deployment strategy to use
        ▪ Blue/green deployment
        ▪ Rolling updates
    • Rollback strategies
      ◦ Automatic rollback mechanism
      ◦ Establishing a manual rollback process
    • Security scanning
      ◦ What kind of security scans to run
        ▪ Vulnerability scanning of container images
        ▪ Security scanning of IaC code
    • Monitoring and notification settings
      ◦ Deployment success/failure notifications
      ◦ Determining which metrics to monitor during deployment
    • etc.
  7. CI/CD (IaC)
    • IaC
      ◦ You can't go wrong with CDK or Terraform as a base.
      ◦ Infrastructure and applications have different lifecycles. Task definitions and service definitions follow the app lifecycle.
      ◦ Therefore, we want to manage task and service definitions on the app side ⇒ ecspresso
  8. CI/CD (IaC)
    • ecspresso
      ◦ A tool for managing as code, and deploying, the resources related to ECS services and tasks
      ◦ Can read tfstate files and reference resource information
      ◦ Can also read CloudFormation Outputs and Export values
      ◦ At minimum, only three files need to be managed.
  9. CI/CD (IaC)
    • Notes on deploying with ecspresso
      ◦ Only ECS task definitions and service definitions are managed by ecspresso
      ◦ Application Auto Scaling and CodeDeploy are not covered.
      ◦ The first time you build the system, proceed in this order:
        i. Create the ALB, ECS cluster, etc. with IaC (Terraform, etc.)
        ii. Create the ECS task definition and ECS service with ecspresso
        iii. Create the CodeDeploy resources with IaC
      ◦ After the initial build, ecspresso and IaC can (basically) be updated independently.
  10. CI/CD (branch strategy)
    • What type of branching strategy to use?
      ◦ There is no right answer; the appropriate form depends on the size of the team, release cycle, company culture, etc.
      ◦ One consideration specific to containers is whether to share container images between environments.
        ▪ Personally, I think it's better to share.
      ◦ I've thought about what to do at each point in time, broken down into infrastructure and apps
  11. CI/CD (Deploy)
    • Rolling update
      ◦ Advantages
        ▪ Simple, making it relatively easy to design and build
        ▪ The simple design limits the number of failure points (and consequently makes troubleshooting easier)
      ◦ Disadvantages
        ▪ There are moments when old and new tasks run side by side
    * Image source: DevelopersIO, https://dev.classmethod.jp/articles/ecs-deploytype/
  12. CI/CD (Deploy)
    • Blue/green deployment
      ◦ Advantages
        ▪ The application is switched all at once, so old and new tasks are never mixed
      ◦ Disadvantages
        ▪ A load balancer is required
        ▪ The configuration is complex, so it is relatively difficult to design and build, and it takes time for new team members to understand it.
        ▪ Additional configuration files must be managed (e.g., CodeDeploy settings).
        ▪ The process is complex, which makes troubleshooting more difficult.
    * Image source: DevelopersIO, https://dev.classmethod.jp/articles/ecs-deploytype/
  13. CI/CD (Deploy)
    • Using ecspresso also reduces what you have to manage for B/G deployments
      ◦ appspec.yaml is basically unnecessary.
      ◦ Commands to stop auto scaling are available, so you can stop and restart auto scaling before and after deployment simply by including the commands in the CD process.
    ecspresso.yml:
        cluster: default
        service: test
        service_definition: ecs-service-def.json
        task_definition: ecs-task-def.json
        appspec:
          Hooks:
            - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
            - AfterInstall: "LambdaFunctionToValidateAfterTraffic"
            - AfterAllowTestTraffic: "LambdaFunctionToValidateAfterTestTrafficStarts"
            - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeAllowingProductionTraffic"
            - AfterAllowTraffic: "LambdaFunctionToValidateAfterAllowingProductionTraffic"
  14. Task Definitions
    • Key design aspects to consider
    • Task size
      ◦ Proper allocation of CPU and memory based on application requirements
      ◦ Setting resource limits
    • Container images
      ◦ Use up-to-date, stable images from trusted repositories
      ◦ Use multi-stage builds
      ◦ Minimize image size
      ◦ Implement graceful shutdown
      ◦ Image tagging (avoid using the latest tag)
    • Storage settings
      ◦ Which storage to use
      ◦ Consolidate persistent data on EFS
      ◦ Proper sizing of temporary (ephemeral) storage
    • Increased security
      ◦ Proper configuration of task roles and task execution roles
      ◦ Apply IAM policies based on the principle of least privilege
      ◦ readonlyRootFilesystem
    • Log configuration
      ◦ CloudWatch Logs
      ◦ Consider FireLens
    • Management of environment variables
      ◦ Secure management of sensitive information (using AWS Secrets Manager)
    • Implementation of health checks
      ◦ Define health check commands for containers
      ◦ Set appropriate timeouts and intervals
    • Compliance and auditing
      ◦ Apply required tags
      ◦ Log settings based on audit requirements
    • Networking optimization
      ◦ ENI trunking configuration (if needed)
    • etc.
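Some of the checklist items above can be checked mechanically. The following is a minimal sketch, not something from the deck: `lint_task_definition` is a hypothetical helper that inspects a task definition loaded as a Python dict. The field names (`containerDefinitions`, `image`, `readonlyRootFilesystem`, `healthCheck`, `logConfiguration`) come from the ECS task definition JSON; the choice of checks and the wording of the findings are assumptions.

```python
from typing import List


def lint_task_definition(task_def: dict) -> List[str]:
    """Return human-readable findings for a task definition dict (illustrative only)."""
    findings = []
    for container in task_def.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")

        # The tag follows the first ':' in the last path segment, so a
        # registry port such as "localhost:5000/app" is not mistaken for a tag.
        last_part = image.rsplit("/", 1)[-1]
        tag = last_part.split(":", 1)[1] if ":" in last_part else ""
        # An image pinned by digest ("repo@sha256:...") is treated as pinned.
        if "@" not in image and tag in ("", "latest"):
            findings.append(f"{name}: pin the image tag instead of using latest")

        if container.get("readonlyRootFilesystem") is not True:
            findings.append(f"{name}: readonlyRootFilesystem is not enabled")
        if "healthCheck" not in container:
            findings.append(f"{name}: no container health check defined")
        if "logConfiguration" not in container:
            findings.append(f"{name}: no log configuration defined")
    return findings


if __name__ == "__main__":
    sample = {
        "containerDefinitions": [
            {"name": "web", "image": "example/web:latest"},
        ]
    }
    for finding in lint_task_definition(sample):
        print(finding)
```

A check like this could run in the CI stage before deployment, alongside the image and IaC scans listed for the pipeline.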
  15. Task Definitions (TaskSize)
    • In conclusion, it is best to check the utilization rate and make adjustments through performance tests, etc.
    • To check the utilization of each container
      ◦ Enable Container Insights
      ◦ Run the following query in CloudWatch Logs Insights
        fields @timestamp, CpuUtilized, CpuReserved, CpuUtilized * 100 / CpuReserved as CpuUtilization, MemoryUtilized, MemoryReserved, MemoryUtilized * 100 / MemoryReserved as MemoryUtilization
        | filter ContainerName = "xxxxx" and Type = "Container" and TaskId = "yyyyy"
        | sort @timestamp asc
        | limit 10000
        | stats avg(CpuUtilization), avg(MemoryUtilization) by bin(1m)
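To make the arithmetic in the query concrete, here is a small Python sketch of the same calculation: utilization is `Utilized * 100 / Reserved`, averaged per one-minute bin. The function name and the sample records are hypothetical, and the `Timestamp` field (epoch milliseconds) is an assumption about how the events are exported; only the `CpuUtilized`, `CpuReserved`, `MemoryUtilized` and `MemoryReserved` names come from the query.

```python
from collections import defaultdict


def average_utilization_per_minute(events):
    """Average CPU and memory utilization (%) per one-minute bin.

    Each event is a dict with Timestamp (epoch ms, assumed), CpuUtilized,
    CpuReserved, MemoryUtilized and MemoryReserved.
    """
    bins = defaultdict(list)
    for event in events:
        minute_start = event["Timestamp"] // 60_000 * 60_000
        cpu = event["CpuUtilized"] * 100 / event["CpuReserved"]
        memory = event["MemoryUtilized"] * 100 / event["MemoryReserved"]
        bins[minute_start].append((cpu, memory))

    result = {}
    for minute_start, samples in sorted(bins.items()):
        count = len(samples)
        result[minute_start] = (
            sum(cpu for cpu, _ in samples) / count,
            sum(memory for _, memory in samples) / count,
        )
    return result


if __name__ == "__main__":
    # Hypothetical samples for one container.
    samples = [
        {"Timestamp": 0, "CpuUtilized": 64, "CpuReserved": 256,
         "MemoryUtilized": 256, "MemoryReserved": 512},
        {"Timestamp": 30_000, "CpuUtilized": 128, "CpuReserved": 256,
         "MemoryUtilized": 128, "MemoryReserved": 512},
    ]
    print(average_utilization_per_minute(samples))
```

Averages that stay far below 100% during a performance test suggest the reservation can be lowered; averages near 100% suggest it should be raised.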
  16. Task Definitions (TaskSize)
    • Task-level and container-level settings
      ◦ Task level: represents both a reservation and an upper limit on the amount of CPU and memory for the task
        ▪ Containers run by the task can only use the capacity defined by the task size
        ▪ On Fargate, this is the basis for determining the instance used to run the task (so it must be specified).
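Because Fargate requires a task size, only certain CPU and memory pairs are accepted. The sketch below encodes the Linux combinations as I understand them from the AWS documentation at the time of writing; treat the table as an assumption and verify it against the current Fargate documentation. `is_valid_fargate_size` is a hypothetical helper name.

```python
# CPU units -> allowed memory values in MiB (assumed from AWS documentation).
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
    8192: tuple(range(16384, 61440 + 1, 4096)),
    16384: tuple(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True when the task-level CPU/memory pair is in the table above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, ())


if __name__ == "__main__":
    print(is_valid_fargate_size(256, 512))    # smallest size in the table
    print(is_valid_fargate_size(256, 4096))   # too much memory for 0.25 vCPU
```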
  17. Task Definitions (TaskSize)
    • Container level: CPU and memory reservations or limits for each container.
    • Note that Container Insights will not record per-container utilization without container-level CPU and memory settings.
      ◦ CPU:
        ▪ While the task-level CPU setting represents the maximum CPU available, container-level CPU settings act as weights for CPU shares.
        ▪ This weighting only takes effect when there is CPU contention.
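The weighting can be illustrated with a simplified model: when every container in the task is busy at once, each one receives a slice of the task-level CPU proportional to its container-level `cpu` value. This is a sketch under that assumption, not a description of the exact scheduler behavior; `cpu_under_contention` and the container names are hypothetical.

```python
def cpu_under_contention(task_cpu_units: int, container_cpu: dict) -> dict:
    """Split task-level CPU units by container weights (full-contention case only)."""
    total_weight = sum(container_cpu.values())
    if total_weight <= 0:
        raise ValueError("at least one container needs a positive cpu weight")
    return {
        name: task_cpu_units * weight / total_weight
        for name, weight in container_cpu.items()
    }


if __name__ == "__main__":
    # A 1 vCPU task with an application container and a small sidecar.
    print(cpu_under_contention(1024, {"app": 768, "sidecar": 256}))
```

Without contention the weights do not cap anything: if the sidecar is idle, the application container can use CPU up to the task-level limit.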
  18. Task Definitions (TaskSize)
    • Container level: CPU and memory reservations or limits available for each container
      ◦ Memory
        ▪ Specify soft limit only: the soft limit represents a reservation. The upper limit is the value specified at the task level (so if there are multiple containers in a task, they compete for the task-level memory).
        ▪ Specify soft limit and hard limit: the soft limit represents the reservation, the hard limit represents the upper limit.
        ▪ Specify hard limit only: the hard limit represents both the reservation and the upper limit.
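The three memory cases can be written down as a small function. This is a sketch of the rules as stated on the slide; `effective_memory` is a hypothetical helper, and `soft` and `hard` stand for the container-level `memoryReservation` and `memory` fields of a task definition.

```python
from typing import Optional, Tuple


def effective_memory(
    task_memory_mib: int,
    soft: Optional[int] = None,
    hard: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (reservation, upper limit) in MiB for one container."""
    if soft is not None and hard is not None:
        # Soft limit is the reservation, hard limit is the upper limit.
        return soft, hard
    if soft is not None:
        # Soft limit only: the task-level memory acts as the upper limit.
        return soft, task_memory_mib
    if hard is not None:
        # Hard limit only: it is both the reservation and the upper limit.
        return hard, hard
    # The slide does not cover this case, so the sketch refuses to guess.
    raise ValueError("neither a soft nor a hard limit was given")


if __name__ == "__main__":
    print(effective_memory(2048, soft=512))
    print(effective_memory(2048, soft=512, hard=1024))
    print(effective_memory(2048, hard=1024))
```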
  19. Task Definitions
    • Dockerfile best practices
      ◦ Use multi-stage builds
      ◦ One container, one process
      ◦ Do not run as the root user
      ◦ Containers are stateless (do not retain state)
      ◦ Keep the container image size as small as possible
      ◦ etc.
      ◦ For more information:
        ▪ https://docs.docker.jp/develop/develop-images/dockerfile_best-practices.html
  20. Task Definitions
        # build stage
        FROM node:22 AS builder
        WORKDIR /usr/src/app
        # No need to copy package.json and package-lock.json as they are bind-mounted
        #COPY package*.json ./
        # Install dependencies.
        RUN --mount=type=bind,source=package.json,target=package.json \
            --mount=type=bind,source=package-lock.json,target=package-lock.json \
            --mount=type=cache,target=/root/.npm,sharing=locked \
            npm ci --omit=dev
        # Copy the source code of the application
        COPY . .

        # execution stage
        FROM gcr.io/distroless/nodejs22-debian12
        # Copy the application directory from the build stage
        COPY --chown=nonroot:nonroot --from=builder /usr/src/app /usr/src/app
        USER nonroot
        WORKDIR /usr/src/app
        CMD [ "index.js" ]
    • Recently learned
      ◦ RUN --mount=type=bind
        ▪ Executes the command while bind-mounting files
        ▪ Might reduce unnecessary COPYs.
      ◦ RUN --mount=type=cache
        ▪ Mounts a cache directory
        ▪ Because this is not the layer cache, a change in one dependency does not force every package to be downloaded again from scratch.
  21. Task Definitions
        # php
        FROM php:7.4-fpm-alpine AS app_php
        RUN apk add --no-cache git
        COPY --from=composer:latest /usr/bin/composer /usr/bin/composer
        WORKDIR /var/www/html
        COPY composer.json composer.lock ./
        RUN composer install --no-dev --optimize-autoloader --no-interaction --no-progress
        COPY . .
        CMD ["php-fpm"]

        # NGINX
        FROM nginx:1.17-alpine AS app_nginx
        COPY docker/nginx/conf.d/default.conf /etc/nginx/conf.d/
        WORKDIR /var/www/html
        COPY --from=app_php /var/www/html/public public/
    • Recently learned
      ◦ docker build --target=<stage> .
        ▪ An intermediate stage can be built as the final output.
        ▪ For example, you can prepare a stage for testing and specify it with --target when testing.
        ▪ You may not have to prepare many Dockerfiles.
  22. Container Security (Registry)
    • Checking images
      ◦ What to check
        ▪ Whether the image contains any vulnerabilities
        ▪ Whether the way the Dockerfile is written introduces any vulnerabilities
      ◦ When to check
        ▪ At build time
        ▪ Periodic checks are also important
    • Amazon Inspector enhanced scanning automatically rescans images as new CVEs are added
  23. Container Security
    • Runtime security
      ◦ GuardDuty Runtime Monitoring
        ▪ Runtime Monitoring for ECS was added in the November 2023 update
        ▪ AWS-native services can now detect threats at runtime
        ▪ It only detects threats; if you want to take action automatically (e.g., stop the ECS task) when a threat is detected, you need to build that yourself
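One way to build that automatic action is a small handler that receives a GuardDuty finding (for example through an EventBridge rule that invokes a Lambda function) and stops the affected task. The sketch below is an assumption-heavy illustration: `stop_task_for_finding` is a hypothetical function, and the event paths `detail.resource.ecsClusterDetails.arn` and `detail.resource.ecsClusterDetails.taskDetails.arn` are my assumption about the finding layout, so check them against a real finding before relying on them. The client is passed in so the logic can be tested without AWS; in a real handler it would be `boto3.client("ecs")`, whose `stop_task` call accepts `cluster`, `task` and `reason`.

```python
from typing import Optional


def stop_task_for_finding(event: dict, ecs_client) -> Optional[str]:
    """Stop the ECS task named in a GuardDuty finding event, if any.

    Returns the task ARN that was stopped, or None when the event does not
    contain the (assumed) ECS cluster and task fields.
    """
    detail = event.get("detail", {})
    cluster_details = detail.get("resource", {}).get("ecsClusterDetails", {})
    cluster_arn = cluster_details.get("arn")
    task_arn = cluster_details.get("taskDetails", {}).get("arn")
    if not cluster_arn or not task_arn:
        return None

    finding_type = detail.get("type", "unknown")
    ecs_client.stop_task(
        cluster=cluster_arn,
        task=task_arn,
        reason=f"Stopped after GuardDuty finding: {finding_type}",
    )
    return task_arn
```

Stopping a task destroys evidence that might be useful for investigation, so in practice a team may prefer to notify first and only automate the stop for selected finding types.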
  24. Container Security
    • Runtime security
      ◦ GuardDuty Runtime Monitoring vs. Sysdig Secure
        ▪ Number of rules supported
          • 36 vs. at least 150
        ▪ Detection speed
          • Approx. 2 min. (※1) vs. several seconds
        ▪ Customizing the rules
          • Impossible vs. possible
    ※1 Time from actual incident occurrence to registration as an event in GuardDuty (measured by the author).
  25. Observability
    • Best practice is to collect three types of data to clarify what is happening in the system
      ◦ Metrics
      ◦ Traces
      ◦ Logs
  26. Observability
    • What is AWS Distro for OpenTelemetry (ADOT)?
      ◦ An OpenTelemetry (OTel) distribution supported by AWS
      ◦ OTel is a CNCF project with great future potential
      ◦ Highly compatible with AWS services
    • The main components are the SDK and the collector
    • Logs and metrics can be collected as well as traces
    • Trace information can be sent to X-Ray, CloudWatch, Prometheus, etc.
    • On ECS Fargate, the collector is implemented as a sidecar container.
    * Image source: One Observability Workshop, https://catalog.workshops.aws/observability/ja-JP/aws-managed-oss/adot/gowalkthrough
  27. Observability
    • How to implement
      ◦ Easy to implement in any language that supports an auto-instrumentation agent
      ◦ Java example
        i. Include the auto-instrumentation agent jar in the application's container image
        ii. Add IAM policies to the ECS task role and the ECS task execution role
        iii. Add the OTel collector as a sidecar container to the ECS task definition that contains the container image above
    * For more information, see https://aws-otel.github.io/docs/getting-started/java-sdk/auto-instr
  28. Observability
        FROM amazoncorretto:17-alpine
        WORKDIR /app
        ADD https://github.com/aws-observability/aws-otel-java-instrumentation/releases/download/v1.21.1/aws-opentelemetry-agent.jar /app/aws-opentelemetry-agent.jar
        ENV JAVA_TOOL_OPTIONS "-javaagent:/app/aws-opentelemetry-agent.jar"
        ARG JAR_FILE=build/libs/*.jar
        COPY --from=build /app/${JAR_FILE} ./app.jar
        # OpenTelemetry agent configuration
        ENV OTEL_TRACES_SAMPLER "parentbased_traceidratio"
        ENV OTEL_TRACES_SAMPLER_ARG "0.3"
        ENV OTEL_PROPAGATORS "tracecontext,baggage,xray"
        ENV OTEL_RESOURCE_ATTRIBUTES "service.name=PetSearch"
        ENV OTEL_IMR_EXPORT_INTERVAL "10000"
        ENV OTEL_EXPORTER_OTLP_ENDPOINT "http://localhost:4317"
        ENTRYPOINT ["java","-jar","/app/app.jar"]
    • Environment variables can be used to change the sampling rate, the service name displayed in the observability backend, etc.
    ※ For more information on environment variables, see https://opentelemetry.io/docs/languages/sdk-configuration/general/
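The `parentbased_traceidratio` sampler with an argument of `0.3` can be pictured as follows: a span with a parent follows the parent's sampling decision, and a root span is sampled for roughly 30% of trace IDs. The sketch below is a simplified Python illustration of that idea, modeled on comparing the low 64 bits of the trace ID against `ratio * 2^64`; it is not the agent's actual code, real SDKs differ in detail, and both function names are hypothetical.

```python
from typing import Optional

TRACE_ID_SPACE = 1 << 64  # the decision uses the low 64 bits of the trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministically sample a trace ID with the given probability."""
    bound = round(ratio * TRACE_ID_SPACE)
    return (trace_id & (TRACE_ID_SPACE - 1)) < bound


def parent_based_sampled(
    trace_id: int, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the parent's decision when there is one, else fall back to the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(parent_based_sampled(i, 0.3, None) for i in ids)
    print(f"sampled {kept} of {len(ids)} root spans")
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice, which keeps sampled traces complete across services.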
  29. Observability
        {
          "name": "aws-otel-collector",
          "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.32.0",
          "cpu": 64,
          "memory": 256,
          "links": [],
          "portMappings": [],
          "essential": true,
          "entryPoint": [],
          "command": [
            "--config",
            "/etc/ecs/ecs-cloudwatch-xray.yaml"
          ],
    • The collector Docker image contains several configuration files by default, and the configuration file to use can be specified as an argument when launching the container
  30. Specific implementation examples
    • Specific implementation examples
      ◦ Infrastructure repository
        ▪ https://github.com/ice1203/ecs_sample_infra
      ◦ Application repository
        ▪ https://github.com/ice1203/ecs_sample_app
  31. Bibliography
    • Amazon ECS best practices - Amazon Elastic Container Service
    • Comparing ECS rolling updates and blue/green deployments | DevelopersIO
    • Investigating the difference between "task size CPU" and "container CPU units" in Amazon ECS task definitions - のぴぴのメモ
    • How Amazon ECS manages CPU and memory resources | Containers
    • Getting per-container metrics on AWS ECS #CloudWatch - Qiita
    • https://github.com/phamthanhgiang/ECS-Fargate-hands-on/
    • Best practices for writing Dockerfiles - Docker-docs-ja 24.0 documentation
    • Traces | Introduction to OpenTelemetry