
From the EC2 lift and shift to ECS Fargate

How Condé Nast Italy (publisher of Wired, Vanity Fair, GQ, Vogue, La Cucina Italiana, Traveller, AD) evolved its infrastructure from the EC2 era to containerization and serverless by adopting a DevOps mindset and several automation tools.

Bruno Rossi - DevOps Engineer @ Condé Nast Italia

AWS User Group Milan

October 23, 2019

Transcript

  1. Hi guys! I’m Bruno Rossi. “Cloud” adopted me! DevOps Engineer,

    AWS Solutions Architect Associate, Software Engineer passionate about: cloud, containers, automation, DDD, OOP, Agile, DevOps, data, programming integrations, fishing, listening to heavy metal, pets 2
  2. 240+ EC2 50+ Balancers 15+ RDS Uptime: 99,000/month Some corners

    were a mess!!! Weak automation Weak monitoring Weak self-healing systems Highly heterogeneous systems Sys/Devs separation wall Our AWS account after the “lift and shift”... 4
  3. The Lowest Common Denominator (repeated from 10 to n times)

    Cache Cluster CMS Application Cluster NGINX + PHP Persistence Layer Dynamic Pages Static Files Media Images Digital Team Deploy 5
  4. The sharded stacks (EC2 based) Cache Cluster CMS Application Cluster

    Persistence Layer Media Cache Cluster CMS Application Cluster Persistence Layer Media 1 2 Digital Team ALB ALB ALB ALB 7
  5. Reserved resources (save $, reserve your capacity) Analyze your usage

    metrics... Reserve an EC2 baseline Reserve RDS Reserve Redshift Reserve ElastiCache Reserve everything that is reservable 8
  6. The Proof of Concept era, learning and applying ECS, EKS,

    EFS Cache Cluster Reverse Proxy Application Servers Digital Team EFS It works!!! limited scaling... App sources, middleware configs, etc. 9
  7. The AWS ECS Fargate sharded stacks! Cache Cluster Reverse Proxy

    PHP-FPM Grid HTTP Apps Grid Digital Team 10
  8. The challenges in DATA CONSOLIDATION! Phase 1: pump data from

    various sources into AWS RDS and AWS Redshift CSV Data Source API Data Source Internal SQL Data Source HTTP Data Source ETL PostgreSQL Python Foreign Data Wrapper AWS EC2 AWS S3 Data Lake AWS Redshift Data Warehouse AWS RDS Production 11
  9. The challenges in DATA CONSOLIDATION! Phase 2: migrate from N

    RDS Instances to a few AWS Aurora Clusters Vanity Fair Wired Glamour La Cucina Italiana AWS Database Migration Service Aurora Shard 1 Aurora Shard 2 Aurora Shard N Migrate a DB Measure performance! Re/Calibrate Cluster Reiterate 12
  10. We are planning the migration of all non-business-critical

    applications into EKS (first quarter 2020)! We are going global (K8s API is a “standard”) Can’t stop improving/evolving! The Next Step: AWS EKS/Fargate! We are working on it... 13
  11. Our CI/CD architecture Bitbucket Webhooks Stateless Jenkins Artifacts Storages Distributed

    Build/Task Smoke Test/Deploy Configurations Triggers PUSH UPLOAD Rolling Deploy 15
  12. Integrating Terraform with AWS Bitbucket Webhooks Stateless Jenkins Terraform Shared

    State Configurations Triggers PUSH NEW RECIPE UPLOAD NEW CONFIGS Envs Isolated Artifacts/versioning Tf code, Configs, repeatable!!! TEST APPLY 16
  13. A typical PHP deploy pipeline TASK RESULT TASK TASK TASK

    TASK RESULT RESULT VALIDATION RESULT SPLIT STATIC ASSETS FROM PHP FILES ASSETS.TAR.GZ PHP.TAR.GZ COMPILE STATIC ASSETS AND LOAD IT! STATIC ASSETS (CSS, JS, SVG...) BUILD THE CONTAINER WITH PHP FILES ECR DOCKER IMAGE PERFORM SMOKE TESTS UPDATE ECS SERVICE AND TASK DEFINITIONS ROLLING DEPLOY 17
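The first step of the pipeline above (splitting static assets from PHP files into `ASSETS.TAR.GZ` and `PHP.TAR.GZ`) could be sketched as follows. This is a minimal illustration, not the team's actual build task: the function name `split_sources`, the extension list, and the rule "everything that is not a static asset goes into the PHP archive" are assumptions.

```python
import tarfile
from pathlib import Path

# Assumption: the slide only names CSS, JS and SVG explicitly ("CSS, JS, SVG...").
STATIC_EXTENSIONS = {".css", ".js", ".svg"}


def split_sources(source_dir: str, output_dir: str) -> dict:
    """Pack static assets and all remaining files into two separate archives.

    output_dir must live outside source_dir, otherwise the archives
    themselves would be picked up while walking the source tree.
    Returns the number of files written to each archive.
    """
    source = Path(source_dir)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)

    counts = {"assets": 0, "php": 0}
    with tarfile.open(output / "assets.tar.gz", "w:gz") as assets_tar, \
         tarfile.open(output / "php.tar.gz", "w:gz") as php_tar:
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            is_static = path.suffix.lower() in STATIC_EXTENSIONS
            archive = assets_tar if is_static else php_tar
            # Store paths relative to the source root so both archives
            # unpack into the same directory layout.
            archive.add(path, arcname=path.relative_to(source).as_posix())
            counts["assets" if is_static else "php"] += 1
    return counts
```

Keeping the two archives separate lets the static assets be compiled and uploaded on their own, while only the PHP archive ends up in the container image.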
  14. How to remove the state from Jenkins Stateless Jenkins Post-Init

    Groovy Script Load all Items from XML Files Load Users Load Permissions Load Plugins Etc. Defer Heavy Jobs to AWS CodeBuild Defer Artifacts Storage to AWS S3 Defer Logs and History to DataDog Reload configs if the Jenkins container dies suddenly 18
  15. Currently under Terraform control The AWS ECS stack DataDog *

    Monitoring AWS Lambda Via SAM (various projects) Every AWS ECS Fargate Service deployed into the ECS stack The Aurora RDS Stack (and Aurora Serverless UAT and Dev Stack) The Build Environment (Bitbucket, CodeBuild) Central Logs Repository The CloudFront Distributions with different origins and behaviours Networking Layer (VPC, NAT Gateways, Subnet) 19
  16. Terraform best practices Modules Save versioned modules into an AWS

    S3 bucket Save TF state via AWS S3 backend and DynamoDB concurrency handling Create a different state and different lock tables for every environment Perform tests via Terratest Save configuration files (backend.tf, terraform.tfvars, etc.) into an encrypted AWS S3 bucket REPEATABLE SNAPSHOTS: save the code of the recipe and the configuration files into an AWS S3 bucket after every apply/destroy operation 20
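The "different state and different lock tables for every environment" practice can be illustrated with a small generator for `backend.tf`. This is a sketch under assumptions: the bucket and table naming scheme (`tfstate-<env>`, `tfstate-lock-<env>`), the default region, and the function name `render_backend` are hypothetical. The attribute names are those of Terraform's S3 backend at the time of the talk; check the documentation of your Terraform version before reusing them.

```python
# Template for a per-environment S3 backend with DynamoDB locking.
# Doubled braces are literal braces in str.format().
BACKEND_TEMPLATE = '''terraform {{
  backend "s3" {{
    bucket         = "{bucket}"
    key            = "{project}/terraform.tfstate"
    region         = "{region}"
    dynamodb_table = "{lock_table}"
    encrypt        = true
  }}
}}
'''


def render_backend(project: str, environment: str, region: str = "eu-west-1") -> str:
    """Return backend.tf content with a state bucket and lock table per environment.

    The naming scheme below is a hypothetical convention, not the team's.
    """
    return BACKEND_TEMPLATE.format(
        bucket=f"tfstate-{environment}",
        project=project,
        region=region,
        lock_table=f"tfstate-lock-{environment}",
    )


if __name__ == "__main__":
    for env in ("dev", "uat", "production"):
        print(render_backend("wired", env))
```

Because each environment gets its own bucket and lock table, an apply running against Dev can never block or corrupt the Production state.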
  17. Integrating SAM with Terraform, separation of concerns sam package \

    --template-file template.yaml \ --output-template-file packaged.yaml \ --s3-bucket samcodebucket aws s3 cp \ template.yaml \ artifactsbucket 1 How to propagate configs Terraform variables to CloudFormation Input Params CloudFormation Input Params to Lambda Environment variables Terraform IaC tool (RDS, VPC, etc) SAM is our AWS Lambda code “On Steroids” packager 2 22
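The first hop of the config propagation described above (Terraform values to CloudFormation input parameters) might look like the sketch below. It assumes the values are exposed as Terraform outputs and read with `terraform output -json`, whose entries carry a `value` field, and that they are passed on as `Key=Value` pairs to the `--parameter-overrides` option of `sam deploy` or `aws cloudformation deploy`. The output and parameter names (`db_endpoint`, `DbEndpoint`) are hypothetical.

```python
import json
from typing import Dict, List


def parameter_overrides(terraform_output_json: str, mapping: Dict[str, str]) -> List[str]:
    """Turn Terraform outputs into CloudFormation "Key=Value" overrides.

    terraform_output_json: text produced by `terraform output -json`.
    mapping: Terraform output name -> CloudFormation parameter name.
    Raises KeyError if a mapped Terraform output is missing.
    """
    outputs = json.loads(terraform_output_json)
    return [
        f"{cfn_param}={outputs[tf_name]['value']}"
        for tf_name, cfn_param in mapping.items()
    ]


if __name__ == "__main__":
    # Hypothetical sample of `terraform output -json`.
    sample = json.dumps({
        "db_endpoint": {"sensitive": False, "type": "string", "value": "db.example.internal"},
    })
    print(" ".join(parameter_overrides(sample, {"db_endpoint": "DbEndpoint"})))
```

The second hop happens inside the SAM template itself, where each input parameter is referenced from the function's environment variables.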
  18. Monitoring tools timeline (1st step) Cloudwatch (Metrics, Alarms, Logs &

    Dashboards) NewRelic (Synth and APM) Best efforts Analyse our envs to find out high value metrics Set a minimum bundle of alarms to speed up our incident resolution Set alarms and endpoints to improve application self-healing Use high value metrics to perform tuning at every level (Infra, app, etc) Cloudwatch (Metrics & Alarms) 2014/2015 What about applications? We were coping with PHP, NodeJS Apps, we embraced a detailed APM free SaaS tool! Best efforts Expose our metrics via human readable and easy to understand dashboards Digging deep into application monitoring with New Relic APM Change -> Tuning -> Change -> Tuning -> Change -> Tuning! Asking: “Are we really improving our performance?” Realizing via metrics: “Yes, we did it” 2015/2017 24
  19. 25 Cloudwatch (Metrics, Alarms & Logs) X-Ray DataDog (Synth, APM,

    AWS integrations, Timeseries, Monitors) Can we do it better? Scenarios are changing: AWS Lambda, predictive monitoring, full AWS integrations, logs, fine grained APMs... 2018/2019 …... Best efforts Put native monitoring in our stacks via IaC Standardize the monitoring policies for different domains Discover serverless log analysis systems Monitoring tools timeline (2nd step) Cloudwatch (Metrics, Alarms, Logs & Dashboards) NewRelic (Synth and APM) Best efforts Expose our metrics via human readable and easy to understand dashboards Digging deep into application monitoring with New Relic APM Change -> Tuning -> Change -> Tuning -> Change -> Tuning! Asking: “Are we really improving our performance?” Realizing via metrics: “Yes, we did it” 2016/2017
  20. Use dashboards as code and expose them as first-class

    citizens! CloudFormation + Cloudwatch Dashboards Terraform + Datadog Monitors Datadog Screenboards Datadog Synthetics Datadog * With monitoring systems too? YES Enclosing Monitoring in IaC Mumble Mumble Mumble !!! Lack of control How to find out why we are experiencing issues? 26
  21. From JMeter to BlazeMeter/Taurus execution: - concurrency: 100 ramp-up: 1m

    hold-for: 5m scenario: quick-test scenarios: quick-test: requests: - http://blazedemo.com 28
  22. It is working on my machine with my own PHP

    libraries To deliver: Vagrant Boxes Vagrant Runnable Code Containing: Uses: Bitbucket and AWS ECR Repositories Approved Docker Images Docker Engine Docker Compose CN Italy Digital Team Local development: a safer (or, ideally, safe) environment Bitbucket Repositories + Interface CN Italy Digital Team Policies CI/CD Pipelines Envs: Dev, UAT, Production 30
  23. Ec2 Golden Images using Packer LINUX OS NGINX APACHE NodeJS

    NGINX + PHP-FPM WIRED NGINX + PHP-FPM VANITY FAIR NGINX + PHP-FPM GLAMOUR NGINX + PHP-FPM Inherited Amazon machine golden images Roles: Centos Nginx PHP-FPM Again & again & again…. YYYYMMDDHHIISS 31
  24. How we have organized our ECR Docker Images pool LINUX

    OS NGINX APACHE NodeJS NGINX + PHP-FPM WIRED NGINX + PHP-FPM VANITY FAIR NGINX + PHP-FPM GLAMOUR NGINX + PHP-FPM AWS ECR golden Docker images Again & again & again…. YYYYMMDDHHIISS 32
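The `YYYYMMDDHHIISS` label on the golden images reads like a build timestamp (in PHP's `date()` notation, `i` is minutes and `s` is seconds). A minimal sketch of producing such a tag follows; the function name `image_tag`, the repository name, and the use of UTC are assumptions, not something the slides state.

```python
from datetime import datetime, timezone
from typing import Optional


def image_tag(repository: str, built_at: Optional[datetime] = None) -> str:
    """Return "<repository>:<YYYYMMDDHHMMSS>" for a golden image build.

    Assumption: builds are stamped in UTC when no timestamp is given.
    """
    if built_at is None:
        built_at = datetime.now(timezone.utc)
    # %M is minutes and %S seconds, matching PHP's "i" and "s".
    return f"{repository}:{built_at.strftime('%Y%m%d%H%M%S')}"


if __name__ == "__main__":
    # Hypothetical repository name for the NGINX + PHP-FPM base image.
    print(image_tag("nginx-php-fpm"))
```

Timestamp tags sort chronologically as plain strings, which makes it easy to find the latest rebuild of each inherited image.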
  25. From zero to Evolutionary Automation If you are performing the

    same commands every day Write the minimum working bash script and push it into your VCS repository of choice. It is a Unit of Work Ask the DevOps Team to find out how to integrate your script into the pipeline The Unit of Work turns on! The DevOps Team puts the Unit of Work into the pipeline We are looking forward to improving and evolving our pipelines again and again as soon as a new Unit of Work is discovered 34
  26. How we are currently launching our new applications Interconnecting the

    Units Of Work createOrUpdateTheAWSSecretsManagerKeys createAndUploadConfigurationsFilesIntoS3 createTheResourcesToBuildTheContainer createTheCodeBaseForTheDockerImageAndCommitIntoTheBitbucketRespository createsCodeBuildJobs createTheCloudfrontDistributionsAndTheS3Buckets createTheBitbucketRepositoryWithTheSourceCodeOfTheProject performsThePipelineToBuildTheServiceImage createTheDatabase createECSservice Bash, AWS CLI, AWS Secrets Manager Bash, AWS CLI, AWS S3 Terraform, AWS CodeBuild, Bitbucket, AWS ECR, Bash Terraform, Bitbucket, Bash, Git, Docker AWS Cloudfront, AWS S3, Terraform, Bash AWS CodeBuild, Terraform, Bash Bash. AWS CLI Bitbucket, Bash, Git, Terraform AWS ECS, Terraform, Bash AWS CLI, Bash 35
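The idea of interconnecting Units of Work can be sketched as a list of small callables run in sequence over a shared context. This is an illustration only: on the slides the units are Bash, Terraform and AWS CLI scripts, while the Python stand-ins, their bodies, the `run_units` helper, and the execution order below are all assumptions.

```python
from typing import Callable, Dict, List

# A Unit of Work receives the shared context and may add values to it.
UnitOfWork = Callable[[Dict[str, str]], None]


def create_the_database(context: Dict[str, str]) -> None:
    # Stand-in for the createTheDatabase unit; the value is a placeholder.
    context["database"] = f"{context['app']}-db"


def create_ecs_service(context: Dict[str, str]) -> None:
    # Stand-in for the createECSservice unit; it relies on the database unit.
    context["service"] = f"{context['app']}-service-using-{context['database']}"


def run_units(units: List[UnitOfWork], context: Dict[str, str]) -> List[str]:
    """Run each unit in order; an exception in any unit stops the chain."""
    completed = []
    for unit in units:
        unit(context)
        completed.append(unit.__name__)
    return completed


# Hypothetical ordering: the slide does not state the execution order.
LAUNCH_SEQUENCE = [create_the_database, create_ecs_service]

if __name__ == "__main__":
    ctx = {"app": "demo"}
    print(run_units(LAUNCH_SEQUENCE, ctx), ctx)
```

Since every unit has the same signature, a newly discovered Unit of Work can be added to the sequence without changing the runner.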
  27. Incident Outer problem has been found Fix it! Figure out

    the problem… Diagnose The website is down! Root Cause Analysis Root Cause has been found Final resolution Find the truth... Fix the Root Cause Document it! Incident Outer problem has been found Fix it! Figure out the problem… Diagnose The website is down! Root Cause Analysis Root Cause has been found Final resolution Find the truth... Fix the Root Cause Document it! You can’t cope with the long “Work in Progress” lifecycle 37
  28. Fix fast and be resilient Incident Diagnose Fever is rising,

    the website goes offline Check monitors, check metrics, check documentation (if you have it)! Respect your indexes, recover faster and faster... Outer problem has been found Varnish is dead! Fix it, be fast and keep things resilient! Deploy something to restart Varnish if it fails. Keep things resilient! Document it! Describe the issue and the fix (Confluence Knowledge Base) 38
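"Deploy something to restart Varnish if it fails" describes a watchdog loop. The sketch below shows the general shape under assumptions: the health check and the restart action are passed in as callables because the slides do not say how either is implemented, and the names `watchdog`, `is_healthy` and `restart` are hypothetical.

```python
import time
from typing import Callable


def watchdog(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    checks: int,
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll a health check and restart the service whenever it fails.

    Runs a fixed number of checks so the loop always terminates and
    returns how many restarts were triggered.
    """
    restarts = 0
    for _ in range(checks):
        if not is_healthy():
            restart()
            restarts += 1
        sleep(interval_seconds)
    return restarts


if __name__ == "__main__":
    # Simulated service: down at first, up after one restart.
    state = {"up": False}
    count = watchdog(
        is_healthy=lambda: state["up"],
        restart=lambda: state.update(up=True),
        checks=3,
        interval_seconds=0.0,
    )
    print(f"restarts triggered: {count}")
```

In a real setup the two callables would wrap an HTTP probe and a service restart command, and the restart count is a natural metric to feed back into monitoring.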
  29. Don’t stop at the surface, dig deep to evolve Root

    Cause Analysis Dig Deep, looking for the meanings Root Cause has been found A VCL expression prevents Varnish from running smoothly under heavy load Fix the Root Cause Code refactoring, test, deploy Document it! Write full documentation (Confluence knowledge base) Iterate! Evolve monitoring, self healing, alarms... 39
  30. $ saving, uptimes, incidents, resources... 40-45 EC2 Reserved / 250+

    15-30 ECS Fargate Production Running Tasks 5 Production Load Balancers (Application/Network) / 50+ 3 Aurora RDS Reserved Clusters / 15+ 99,900 Uptime/Month / 99,000 35% Cost Saving Trusted Advisor, WAF, AWS Config, CloudTrail, Structured S3 Lifecycle Policies, AWS Secrets Manager... Immeasurable hours of sleep and peace of mind saved 41
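To put the uptime figures in perspective, the sketch below converts a monthly uptime percentage into allowed downtime. It assumes that "99,000" and "99,900" use a decimal comma and mean 99.0% and 99.9%, and that a month has 30 days; neither is stated on the slide.

```python
def monthly_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Return the minutes of downtime per month implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    minutes_in_month = days_in_month * 24 * 60
    return minutes_in_month * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Assumed reading of the slide: 99.0% before, 99.9% after.
    for label, uptime in (("before", 99.0), ("after", 99.9)):
        print(f"{label}: {monthly_downtime_minutes(uptime):.1f} minutes/month")
```

Under those assumptions the move from 99.0% to 99.9% cuts the monthly downtime from about 432 minutes to about 43 minutes.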
  31. Credits Presentation template by SlidesCarnival Greetings to…. My Wife, My

    Family, Jack The Dog and Buddy The Degu The CN Italy Digital Team and everyone who joined us tonight :) Natalie Passmore the “English Trainer” The DevOps and Cloud Community 43