From the EC2 lift and shift to ECS Fargate

Hi guys! I’m Bruno Rossi. “Cloud” adopted me!
DevOps Engineer, AWS Solutions Architect Associate, Software Engineer.
Passionate about: cloud, containers, automation, DDD, OOP, Agile, DevOps, data, programming integrations, fishing, listening to heavy metal, pets.

1. The lift and shift: from on-premises to AWS

Our AWS account after the “lift and shift”...
- 240+ EC2 instances
- 50+ load balancers
- 15+ RDS instances
- Uptime: 99.000% per month
Some corners were a mess!
- Weak automation
- Weak monitoring
- Weak self-healing systems
- Highly heterogeneous systems
- Sys/Devs separation wall

The lowest common denominator (repeated from 10 to n times)
[Diagram labels: Cache Cluster, CMS Application Cluster, NGINX + PHP, Persistence Layer, Dynamic Pages, Static Files, Media Images, Digital Team, Deploy]

2. Don’t fear change: if you’re scared about something, it is the right thing!

The sharded stacks (EC2 based)
[Diagram: two shards (1 and 2), each with Cache Cluster, CMS Application Cluster, Persistence Layer and Media; Digital Team; four ALBs]

Reserved resources (save $, reserve your capacity)
- Analyze your usage metrics...
- Reserve an EC2 baseline
- Reserve RDS
- Reserve Redshift
- Reserve ElastiCache
- Reserve everything that is reservable

The proof-of-concept era: learning and applying ECS, EKS, EFS
[Diagram labels: Digital Team, Cache Cluster, Reverse Proxy, Application Servers, EFS, App sources, middleware configs, etc.]
It works! But with limited scaling...

The AWS ECS Fargate sharded stacks!
[Diagram labels: Digital Team, Cache Cluster, Reverse Proxy, PHP-FPM Grid, HTTP Apps Grid]

The challenges in DATA CONSOLIDATION!
Phase 1: pump data from various sources into AWS RDS and AWS Redshift
[Diagram labels: CSV Data Source, API Data Source, Internal SQL Data Source, HTTP Data Source; ETL, PostgreSQL, Python, Foreign Data Wrapper, AWS EC2; AWS S3 Data Lake, AWS Redshift Data Warehouse, AWS RDS Production]

The challenges in DATA CONSOLIDATION!
Phase 2: migrate from N RDS instances to a few AWS Aurora clusters
[Diagram labels: Vanity Fair, Wired, Glamour, La Cucina Italiana; AWS Database Migration Service; Aurora Shard 1, Aurora Shard 2, Aurora Shard N]
- Migrate a DB
- Measure performance!
- Re/calibrate the cluster
- Reiterate

The next step: AWS EKS/Fargate! We are working on it...
- We are planning the migration of all non-business-critical applications to EKS (first quarter 2020)!
- We are going global (the K8s API is a “standard”)
- Can’t stop improving/evolving!

3. CI/CD: mask the tasks, let it run!

Our CI/CD architecture
[Diagram labels: Bitbucket, Webhooks, Stateless Jenkins, Artifacts Storages, Distributed Build/Task, Smoke Test/Deploy, Configurations, Triggers, PUSH, UPLOAD, Rolling Deploy]

Integrating Terraform with AWS
[Diagram labels: Bitbucket, Webhooks, Stateless Jenkins, Terraform Shared State, Configurations, Triggers, PUSH NEW RECIPE, UPLOAD NEW CONFIGS, TEST, APPLY]
- Envs isolated
- Artifacts/versioning
- Tf code and configs: repeatable!

A typical PHP deploy pipeline
- Task: split static assets from PHP files. Result: ASSETS.TAR.GZ, PHP.TAR.GZ
- Task: compile static assets and load them! Result: static assets (CSS, JS, SVG...)
- Task: build the container with PHP files. Result: ECR Docker image
- Task: perform smoke tests. Validation
- Task: update ECS service and task definitions. Result: rolling deploy
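The first pipeline task, splitting static assets from PHP files, could be sketched roughly as below. This is an illustration only, not the pipeline's real code: the extension set and the rule that everything non-static goes into the PHP archive are assumptions, and `split_sources` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Assumption: the slide only names CSS, JS and SVG ("...") as static assets.
STATIC_EXTENSIONS = {".css", ".js", ".svg"}


def split_sources(paths):
    """Partition file paths into (static_assets, php_bundle).

    Anything that is not a known static asset is kept in the PHP bundle,
    which is an assumption about how the real pipeline behaves.
    """
    assets, php_bundle = [], []
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in STATIC_EXTENSIONS:
            assets.append(path)
        else:
            php_bundle.append(path)
    return assets, php_bundle
```

Each list would then feed its own archive (ASSETS.TAR.GZ and PHP.TAR.GZ on the slide), for example via Python's `tarfile` module or a plain `tar` call in the build job.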

How to remove the state from Jenkins (stateless Jenkins)
- Post-init Groovy script: load all items from XML files, load users, load permissions, load plugins, etc.
- Defer heavy jobs to AWS CodeBuild
- Defer artifact storage to AWS S3
- Defer logs and history to DataDog
- Reload configs in case of the Jenkins container’s sudden death

Currently under Terraform control
- The AWS ECS stack
- DataDog * monitoring
- AWS Lambda via SAM (various projects)
- Every AWS ECS Fargate service deployed into the ECS stack
- The Aurora RDS stack (and the Aurora Serverless UAT and Dev stack)
- The build environment (Bitbucket, CodeBuild)
- Central logs repository
- The CloudFront distributions with different origins and behaviours
- Networking layer (VPC, NAT Gateways, Subnets)

Terraform best practices
- Modules: save versioned modules into an AWS S3 bucket
- Save the TF state via the AWS S3 backend, with DynamoDB handling concurrency
- Create a different state and a different lock table for every environment
- Perform tests via Terratest
- Save configuration files (backend.tf, terraform.tfvars, etc.) into an encrypted AWS S3 bucket
- REPEATABLE SNAPSHOTS: save the recipe code and the configuration files into an AWS S3 bucket after every apply/destroy operation
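As a rough illustration of "a different state and a different lock table for every environment", the sketch below renders a `backend.tf` body per environment. The bucket and table naming scheme, the default region and the `render_backend` helper are all hypothetical. Only the S3 backend arguments (`bucket`, `key`, `region`, `dynamodb_table`, `encrypt`) are standard Terraform settings.

```python
def render_backend(env, region="eu-west-1"):
    """Render a backend.tf body with its own state bucket and lock table per environment."""
    # Hypothetical naming scheme; the slides do not show the real bucket or table names.
    bucket = f"tf-state-{env}"
    lock_table = f"tf-locks-{env}"
    lines = [
        "terraform {",
        '  backend "s3" {',
        f'    bucket         = "{bucket}"',
        '    key            = "terraform.tfstate"',
        f'    region         = "{region}"',
        f'    dynamodb_table = "{lock_table}"',
        "    encrypt        = true",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for environment in ("dev", "uat", "production"):
        print(render_backend(environment))
```

Generating the file rather than hand-editing it keeps the environments from ever sharing a state file or a lock by accident.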

Environments branching model: https://www.wearefine.com/mingle/env-branching-with-git/

Integrating SAM with Terraform: separation of concerns
- Terraform is our IaC tool (RDS, VPC, etc.)
- SAM is our AWS Lambda code packager “on steroids”

1.

    sam package \
      --template-file template.yaml \
      --output-template-file packaged.yaml \
      --s3-bucket samcodebucket

    aws s3 cp \
      template.yaml \
      artifactsbucket

2. How to propagate configs
- Terraform variables to CloudFormation input params
- CloudFormation input params to Lambda environment variables
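One possible way to hand Terraform values to CloudFormation input params is to turn them into `Key=Value` pairs for `aws cloudformation deploy --parameter-overrides`. The sketch below only builds the command line. The stack name, the parameter names and both helper functions are hypothetical, and the slides do not say which mechanism the team actually uses.

```python
def to_parameter_overrides(tf_values):
    """Turn a dict of Terraform values into CloudFormation Key=Value overrides."""
    return [f"{key}={value}" for key, value in sorted(tf_values.items())]


def deploy_command(stack_name, template, tf_values):
    """Build (but do not run) an `aws cloudformation deploy` command line."""
    return [
        "aws", "cloudformation", "deploy",
        "--stack-name", stack_name,
        "--template-file", template,
        "--parameter-overrides", *to_parameter_overrides(tf_values),
    ]


if __name__ == "__main__":
    # Hypothetical values, e.g. parsed from `terraform output -json`.
    values = {"DbHost": "db.internal.example", "Stage": "uat"}
    print(" ".join(deploy_command("my-lambda-stack", "packaged.yaml", values)))
```

Inside the template, each parameter can then be referenced in the function's environment variables, which covers the second hop (CloudFormation input params to Lambda environment variables).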

4. Metrics don’t lie: measure, profile, collect, analyze

Monitoring tools timeline (1st step)

2014/2015: Cloudwatch (Metrics & Alarms)
Best efforts:
- Analyse our envs to find high-value metrics
- Set a minimum bundle of alarms to speed up incident resolution
- Set alarms and endpoints to improve application self-healing
- Use high-value metrics to perform tuning at every level (infra, app, etc.)

2015/2017: Cloudwatch (Metrics, Alarms, Logs & Dashboards), NewRelic (Synth and APM)
What about applications? We were coping with PHP and NodeJS apps, so we embraced a detailed, free APM SaaS tool!
Best efforts:
- Expose our metrics via human-readable, easy-to-understand dashboards
- Dig deep into application monitoring with New Relic APM
- Change -> Tuning -> Change -> Tuning -> Change -> Tuning!
- Asking: “Are we really improving our performance?”
- Realizing via metrics: “Yes, we did it”

Monitoring tools timeline (2nd step)

2016/2017: Cloudwatch (Metrics, Alarms, Logs & Dashboards), NewRelic (Synth and APM)
Best efforts:
- Expose our metrics via human-readable, easy-to-understand dashboards
- Dig deep into application monitoring with New Relic APM
- Change -> Tuning -> Change -> Tuning -> Change -> Tuning!
- Asking: “Are we really improving our performance?”
- Realizing via metrics: “Yes, we did it”

2018/2019: Cloudwatch (Metrics, Alarms & Logs), X-Ray, DataDog (Synth, APM, AWS integrations, Timeseries, Monitors)
Can we do it better? Scenarios are changing: AWS Lambda, predictive monitoring, full AWS integrations, logs, fine-grained APMs...
Best efforts:
- Put native monitoring in our stacks via IaC
- Standardize the monitoring policies for different domains
- Discover serverless log analysis systems

Enclosing monitoring in IaC
Lack of control: how do we find out why we are experiencing issues? Mumble, mumble, mumble!
With monitoring systems too? YES. Use dashboards as code and expose them as first-class citizens!
- CloudFormation + Cloudwatch Dashboards
- Terraform + Datadog Monitors, Datadog Screenboards, Datadog Synthetics, Datadog *

5. Load test! Predict your future...

From JMeter to BlazeMeter/Taurus

    execution:
    - concurrency: 100
      ramp-up: 1m
      hold-for: 5m
      scenario: quick-test

    scenarios:
      quick-test:
        requests:
        - http://blazedemo.com

6. Subtle differences, big pain: enforce standards

“It is working on my machine with my own PHP libraries”
Local development: a safer (or utopically safe) environment
[Diagram labels: To deliver, Containing, Uses; Vagrant Boxes, Vagrant, Runnable Code; Bitbucket and AWS ECR Repositories, Approved Docker Images, Docker Engine, Docker Compose; CN Italy Digital Team; Bitbucket Repositories + Interface; CN Italy Digital Team Policies; CI/CD Pipelines; Envs: Dev, UAT, Production]

EC2 golden images using Packer
[Diagram: inherited Amazon machine golden images: LINUX OS; NGINX, APACHE, NodeJS, NGINX + PHP-FPM; WIRED NGINX + PHP-FPM, VANITY FAIR NGINX + PHP-FPM, GLAMOUR NGINX + PHP-FPM]
Roles: CentOS, Nginx, PHP-FPM
Again & again & again... YYYYMMDDHHIISS

How we have organized our ECR Docker images pool
[Diagram: AWS ECR golden Docker images: LINUX OS; NGINX, APACHE, NodeJS, NGINX + PHP-FPM; WIRED NGINX + PHP-FPM, VANITY FAIR NGINX + PHP-FPM, GLAMOUR NGINX + PHP-FPM]
Again & again & again... YYYYMMDDHHIISS
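The YYYYMMDDHHIISS label on the slide looks like a build timestamp used to version each golden image. A minimal sketch of producing such a tag follows. Reading "II" as minutes (as in PHP's date format) and using UTC are assumptions, and `image_tag` is a hypothetical helper.

```python
from datetime import datetime, timezone


def image_tag(now=None):
    """Return a sortable YYYYMMDDHHIISS-style tag (II read as minutes)."""
    # Assumption: tags use UTC so builds from different hosts sort consistently.
    if now is None:
        now = datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H%M%S")


if __name__ == "__main__":
    print(image_tag())
```

Because the tag is fixed-width and most-significant-first, a plain string sort also orders the images from oldest to newest.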

7. Don’t get your hands dirty twice: script every scriptable command!

From zero to evolutionary automation
- If you are performing the same commands every day, write the minimum working bash script and push it into your VCS repository of choice. It is a Unit of Work
- Ask the DevOps Team to find out how to integrate your script into the pipeline
- The Unit of Work turns on! The DevOps Team puts the Unit of Work into the pipeline
- We look forward to improving and evolving our pipelines again and again, as soon as a new Unit of Work is discovered

How we are currently launching our new applications: interconnecting the Units of Work
- createOrUpdateTheAWSSecretsManagerKeys
- createAndUploadConfigurationsFilesIntoS3
- createTheResourcesToBuildTheContainer
- createTheCodeBaseForTheDockerImageAndCommitIntoTheBitbucketRespository
- createsCodeBuildJobs
- createTheCloudfrontDistributionsAndTheS3Buckets
- createTheBitbucketRepositoryWithTheSourceCodeOfTheProject
- performsThePipelineToBuildTheServiceImage
- createTheDatabase
- createECSservice
Tools used by the units (one group per unit on the slide):
- Bash, AWS CLI, AWS Secrets Manager
- Bash, AWS CLI, AWS S3
- Terraform, AWS CodeBuild, Bitbucket, AWS ECR, Bash
- Terraform, Bitbucket, Bash, Git, Docker
- AWS Cloudfront, AWS S3, Terraform, Bash
- AWS CodeBuild, Terraform, Bash
- Bash, AWS CLI
- Bitbucket, Bash, Git, Terraform
- AWS ECS, Terraform, Bash
- AWS CLI, Bash
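Interconnecting Units of Work amounts to running them in a fixed order and stopping at the first failure. The sketch below shows that idea with plain Python callables. In the real setup the units are Bash/Terraform scripts, and both the `run_units` helper and the execution order used in the example are assumptions.

```python
def run_units(units):
    """Run (name, callable) pairs in order and stop at the first failure.

    Each callable returns True on success. Returns the names that completed.
    """
    completed = []
    for name, unit in units:
        if not unit():
            raise RuntimeError(f"unit of work failed: {name}; completed: {completed}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder units; real ones could wrap subprocess.run(["bash", script], check=True).
    pipeline = [
        ("createOrUpdateTheAWSSecretsManagerKeys", lambda: True),
        ("createAndUploadConfigurationsFilesIntoS3", lambda: True),
        ("createECSservice", lambda: True),
    ]
    print(run_units(pipeline))
```

Keeping each unit independent of the runner is what lets a newly discovered script be dropped into the pipeline without touching the others.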

8. Incidents: how to weather the storm

The incident lifecycle
- Incident: the website is down!
- Diagnose: figure out the problem...
- Outer problem has been found: fix it!
- Root Cause Analysis: find the truth...
- Root Cause has been found: fix the Root Cause
- Final resolution: document it!
You can’t cope with the long “Work in Progress” lifecycle

Fix fast and be resilient
- Incident: fever is rising, the website goes offline
- Diagnose: check monitors, check metrics, check documentation (if you have it)! Respect your indexes, recover faster and faster...
- Outer problem has been found: Varnish is dead!
- Fix it, be fast and keep things resilient! Deploy something to restart Varnish if it fails
- Document it! Describe the issue and the fix (Confluence knowledge base)
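"Deploy something to restart Varnish if it fails" is essentially a small watchdog loop. The sketch below shows the shape of one under stated assumptions: the health check and the restart action are injected as callables, and `watchdog` and its parameters are hypothetical. The slides do not describe the tool actually deployed.

```python
import time


def watchdog(is_healthy, restart, checks, interval_s=0.0, sleep=time.sleep):
    """Poll a health check `checks` times and restart the service whenever it fails.

    Returns the number of restarts performed.
    """
    restarts = 0
    for _ in range(checks):
        if not is_healthy():
            restart()
            restarts += 1
        sleep(interval_s)
    return restarts


if __name__ == "__main__":
    # Simulated service: down at first, healthy after one restart.
    state = {"up": False}
    count = watchdog(lambda: state["up"], lambda: state.update(up=True), checks=3)
    print(f"restarts performed: {count}")
```

In practice the health check might probe the cache's HTTP port and the restart might go through the host's service manager or the container scheduler. Counting restarts also yields a metric worth alerting on.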

Don’t stop at the surface, dig deep to evolve
- Root Cause Analysis: dig deep, looking for the meaning
- Root Cause has been found: a VCL expression prevents Varnish from running smoothly under heavy load
- Fix the Root Cause: code refactoring, test, deploy
- Document it! Write full documentation (Confluence knowledge base)
- Iterate! Evolve monitoring, self-healing, alarms...

9. Measured results: our current numbers...

$ savings, uptime, incidents, resources...
- 40-45 reserved EC2 instances (previously 250+)
- 15-30 ECS Fargate production running tasks
- 5 production load balancers, Application/Network (previously 50+)
- 3 reserved Aurora RDS clusters (previously 15+)
- 99.900% uptime per month (previously 99.000%)
- 35% cost saving
- Trusted Advisor, WAF, AWS Config, CloudTrail, structured S3 lifecycle policies, AWS Secrets Manager...
- Unmeasurable saved sleeping hours and sanity of mind
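To put the uptime figures in perspective, the arithmetic below converts a monthly uptime percentage into allowed downtime. It assumes a 30-day month and reads the slide's "99,000" and "99,900" as 99.0% and 99.9%. `allowed_downtime_minutes` is a hypothetical helper.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month, 43,200 minutes


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per month permitted by a given uptime percentage."""
    return round(MINUTES_PER_MONTH * (100 - uptime_percent) / 100, 1)


if __name__ == "__main__":
    for uptime in (99.0, 99.9):
        print(f"{uptime}% uptime allows {allowed_downtime_minutes(uptime)} minutes of downtime per month")
```

Under these assumptions, moving from 99.0% to 99.9% shrinks the monthly downtime budget from about 7.2 hours (432 minutes) to about 43 minutes.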

Thanks! Any questions?

Credits
Presentation template by SlidesCarnival
Greetings to...
- My wife, my family, Jack the Dog and Buddy the Degu
- The CN Italy Digital Team and everyone who joined us tonight :)
- Natalie Passmore, the “English Trainer”
- The DevOps and Cloud community
