Solid Foundations on the AWS Cloud

An overview of the things you should bear in mind as you approach your new AWS infrastructure.

The Scale Factory

November 07, 2019

Transcript

  1. HERE TO HELP YOU_
    - Highlight areas you haven’t thought about yet
    - Understand tradeoffs
    - Plan for future success
    - Avoid common mistakes
  2. TODAY’S AGENDA_
    - What is the cloud?
    - Thinking about availability
    - Thinking about security
    - Making good architectural choices
    - Infrastructure Automation
    - Monitoring, logging, & alerting
    - Who goes on call?
    - Managing cost
  3. ACCELERATE STATE OF DEVOPS REPORT_ 2019 Findings
    - Cloud continues to be a differentiator for elite performers and drives high performance.
    - The use of cloud, as defined by NIST Special Publication 800-145, is predictive of software delivery performance and availability.
    - The highest performing teams were 24 times more likely than low performers to execute on all five capabilities of cloud computing.
  4. DESIGN
    CUSTOMER NEEDS (things you care about)
    - Features
    - Cost
    - Performance
    - Availability
    - Security
    COMPLIANCE NEEDS (things the government cares about)
    - Security
    - Documentation
    - Reporting
    - Change Control
  5. TERMINOLOGY_
    - SLI - service level indicator
    - SLO - service level objective
    - SLA - service level agreement
  6. Availability and downtime

    | Availability % | Downtime per year | Downtime per month | Downtime per week | Downtime per day |
    |---|---|---|---|---|
    | 55.5555555% ("nine fives") | 162.33 days | 13.53 days | 74.92 hours | 10.67 hours |
    | 90% ("one nine") | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
    | 95% ("one nine five") | 18.26 days | 36.53 hours | 8.40 hours | 1.20 hours |
    | 97% | 10.96 days | 21.92 hours | 5.04 hours | 43.20 minutes |
    | 98% | 7.31 days | 14.61 hours | 3.36 hours | 28.80 minutes |
    | 99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
    | 99.5% ("two nines five") | 1.83 days | 3.65 hours | 50.40 minutes | 7.20 minutes |
    | 99.8% | 17.53 hours | 87.66 minutes | 20.16 minutes | 2.88 minutes |
    | 99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
    | 99.95% ("three nines five") | 4.38 hours | 21.92 minutes | 5.04 minutes | 43.20 seconds |
    | 99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
    | 99.995% ("four nines five") | 26.30 minutes | 2.19 minutes | 30.24 seconds | 4.32 seconds |
    | 99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
    | 99.9999% ("six nines") | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
    | 99.99999% ("seven nines") | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
    | 99.999999% ("eight nines") | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
    | 99.9999999% ("nine nines") | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |

    Source: https://en.wikipedia.org/wiki/High_availability
    Service markers on the slide: AWS RDS (Multi-AZ), AWS EC2 instance, AWS Compute
    LOW AVAILABILITY / LOW COST to HIGHLY AVAILABLE / HIGH COST: Batch processing, ETL; Internal tools; “Office hours” apps, single TZ; E-commerce, PoS; Video delivery, broadcast; ATM transactions, telecoms
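The downtime figures for each availability target follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. A minimal Python sketch, assuming a 365.25-day year (which matches the figures quoted on the slide); the function and constant names are illustrative, not from the deck:

```python
from datetime import timedelta

# Period lengths. The 365.25-day year is an assumption that reproduces
# the slide's figures (e.g. 99.9% -> about 8.77 hours per year).
PERIODS = {
    "year": timedelta(days=365.25),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
}


def allowed_downtime(availability_percent: float) -> dict[str, timedelta]:
    """Return the downtime budget per period for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable = 1 - availability_percent / 100
    return {name: period * unavailable for name, period in PERIODS.items()}


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime(target)["year"].total_seconds() / 3600
        print(f"{target}%: {hours:.2f} hours of downtime per year")
```

Working backwards from the downtime a service can tolerate is one way to pick a target before committing to it in an SLO or SLA.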
  7. AVAILABILITY DESIGN_
    - Clustering / Failover
    - Autoscaling
    - Multi-AZ (and Multi-Region) operation
    - Caching
    - Asynchronous processing
    - Backpressure / Circuit breakers
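Of the techniques on this slide, the circuit breaker is the easiest to show in a few lines. The sketch below is a simplified illustration and not a production implementation: after a run of consecutive failures it rejects calls immediately for a cool-down period, then lets one trial call through. The class name, thresholds, and injectable clock are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from queueing up behind a dependency that is already struggling, which is the same goal backpressure serves from the other direction.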
  8. "Low performers take weeks to conduct security reviews and complete the changes identified. In contrast, elite performers build security in and can conduct security reviews and complete changes in just days."
    http://services.google.com/fh/files/misc/state-of-devops-2018.pdf
  9. 5 AREAS OF SECURITY_
    - Identity and access management
    - Detective controls
    - Infrastructure protection
    - Data protection
    - Incident response
  10. INFRASTRUCTURE PROTECTION_
    - Secure network design (public/private subnets)
    - Security Groups
    - Fine grained IAM policies
    - DDoS Protection (AWS Shield)
    - Web Application Firewall (AWS WAF)
    - Vulnerability scanning (AWS Inspector, OSSEC, Snyk)
  11. DATA PROTECTION_
    - Data classification
    - Encryption at rest
    - Encryption in transit (TLS)
    - Protect secrets (Secrets Manager, Parameter Store)
    - Regular (tested) backups
    - Access control (IAM)
    - Scans for sensitive data (AWS Macie)
  12. FLEXIBILITY, SECURITY, EASE OF USE, FASHION, COST, EASE OF OPERATION, IMPACT ON HIRING, RELIABILITY, FAMILIARITY, PERFORMANCE, CERTIFICATION, AGILITY, SCALABILITY, LOCK-IN
  13. FLEXIBILITY, SECURITY, EASE OF USE, FASHION, COST, EASE OF OPERATION, IMPACT ON HIRING, RELIABILITY, FAMILIARITY, PERFORMANCE, CERTIFICATION, AGILITY, SCALABILITY, LOCK-IN
    Functionality, Simplicity, Human, Financial
  14. Version history

    | VERSION | DATE | NOTABLE FEATURES |
    |---|---|---|
    | 1.2 | December 2009 | |
    | 1.4 | March 2010 | Background index creation. Log rotation. |
    | 1.6 | August 2010 | Sharding & replica sets. |
    | 1.8 | March 2011 | Data journalling. Sparse & covering indices. |
    | 2.0 | August 2011 | Authentication |
    | 2.2 | July 2012 | DB level locking. Backup tool backs up indexes. |
    | 2.4 | March 2013 | RBAC. TLS Support. Modular authentication* |
    | 2.6 | April 2014 | HTTP interface disabled. Audit logging* SNMP* |
    | 3.0 | March 2015 | WiredTiger optional. Large replica sets (50). Query introspection. |
    | 3.2 | December 2015 | WiredTiger default. Encryption at rest* |
    | 3.4 | November 2016 | Passes Jepsen test suite. Views. Log redaction* |

    * Enterprise
  15. Monolith vs Microservices Architecture
    Monolith
    - Simple
    - Easy to deploy
    - Can’t scale individual parts
    - Tends towards “big ball of mud”
    - No fault isolation
    Microservices Architecture
    - Complex
    - More modular, better isolation
    - Individual parts can scale
    - Easier to understand / test individual services
    - Distributed systems are hard
    - Transactionality more difficult
    - More difficult to operate
    - Changes can need coordinating between teams
  16. FALLACIES OF DISTRIBUTED SYSTEMS_
    - The network is reliable
    - Latency is zero
    - Bandwidth is infinite
    - The network is secure
    - Topology doesn’t change
    - There is one administrator
    - Transport cost is zero
    - The network is homogeneous
  17. Compute
    - EC2 Instances
    - Containers on k8s or ECS
    - Containers on Fargate
    - Lambda
    Scales across these options, first to last:
    - Most security effort required to least security effort required
    - Least serverless to most serverless
    - Least opinionated to most opinionated
    - Least suitable for microservices to most suitable for microservices
  18. Data
    Categories: Relational, Key-Value, Document, Graph, Time Series, Ledger
    Products, in slide order: RDS PostgreSQL, RDS MySQL, Aurora, RedShift, DynamoDB, MongoDB, DocumentDB, AWS Neptune, Neo4J, Cassandra, Amazon Timestream, InfluxDB, Quantum Ledger, Hyperledger Fabric, Ethereum, MySQL, PostgreSQL
  19. CONTINUOUS DELIVERY PIPELINE_
    - Linting
    - Unit Tests
    - Artefact Build
    - SAST
    - Deploy to Test
    - Integration Test
    - Performance Test
    - DAST
    - Deploy to UAT
    - Deploy to Live
  20. APP LOGGING_
    - Create structured (JSON) log events
    - Send logs to central service (ELK, X-Ray)
    - Provide tools to view log data
    - Use transaction IDs for correlation
    - Set log level on feature flags
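As a rough illustration of structured (JSON) log events carrying a transaction ID, here is a sketch using only Python's standard `logging` module. The field names, the `build_logger` helper, and the idea of passing the level in as a parameter (which a feature flag could supply) are assumptions for the example; shipping the output to a central service is out of scope here.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via the `extra` argument at the call site; None if absent.
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(event)


def build_logger(name="app", level=logging.INFO, stream=None):
    """Return a logger that emits JSON lines (to stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(level)  # hypothetical: the level could come from a feature flag
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    txn = str(uuid.uuid4())  # one ID per request, reused on every event
    log.info("order received", extra={"transaction_id": txn})
    log.info("payment authorised", extra={"transaction_id": txn})
```

Because every event for a request shares the same `transaction_id`, a log viewer can filter on that one field to reconstruct the request's path across services.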
  21. “RED” APP MONITORING_
    - Rate (requests per second)
    - Errors (failures per second)
    - Duration (time taken to serve requests)
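The three RED measurements can be derived from a window of request records. The sketch below is illustrative only: the `Request` record, the choice of a nearest-rank 95th percentile to summarise duration, and the metric names are assumptions, and a real service would normally get these numbers from its metrics library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    ok: bool


def red_metrics(requests, window_seconds):
    """Summarise a window of requests as Rate, Errors, and Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    durations = sorted(r.duration_seconds for r in requests)
    failures = sum(1 for r in requests if not r.ok)
    if durations:
        # Nearest-rank 95th percentile of the observed durations.
        rank = max(1, math.ceil(0.95 * len(durations)))
        p95 = durations[rank - 1]
    else:
        p95 = 0.0  # empty window: nothing was served
    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors_per_second": failures / window_seconds,
        "p95_duration_seconds": p95,
    }


if __name__ == "__main__":
    window = [Request(0.1, True)] * 9 + [Request(2.0, False)]
    print(red_metrics(window, window_seconds=10))
```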
  22. WHEN TO ALERT_
    - Synthetic user flow shows errors
    - RUM shows errors elevated above SLA for service
    - RUM shows performance below SLA for service
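The three alert conditions on this slide translate fairly directly into a paging rule. The following is a sketch under stated assumptions: the function name, parameter names, and threshold values are hypothetical, and the real thresholds would come from the service's own SLA.

```python
def paging_reasons(synthetic_flow_ok, rum_error_rate, rum_p95_seconds,
                   sla_max_error_rate, sla_max_p95_seconds):
    """Return the reasons to page; an empty list means do not page."""
    reasons = []
    if not synthetic_flow_ok:
        reasons.append("synthetic user flow shows errors")
    if rum_error_rate > sla_max_error_rate:
        reasons.append("RUM error rate is above the SLA threshold")
    if rum_p95_seconds > sla_max_p95_seconds:
        reasons.append("RUM performance is below the SLA threshold")
    return reasons


if __name__ == "__main__":
    # Hypothetical thresholds: 1% errors, 2 second p95 latency.
    reasons = paging_reasons(
        synthetic_flow_ok=True,
        rum_error_rate=0.03,
        rum_p95_seconds=1.2,
        sla_max_error_rate=0.01,
        sla_max_p95_seconds=2.0,
    )
    print(reasons or "no page")
```

All three checks describe what users experience instead of the state of an individual host, which helps keep pages limited to critical alerts.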
  23. DEVELOPERS ON CALL_
    - Shared responsibility leads to better quality software
    - Debugging a platform is easier if you’ve worked on it
  24. HUMANE ON-CALL_
    - Share on-call around the whole team*
    - Only page on critical alerts
    - Prioritise work to solve regular problems
    - Give people the morning off if they’ve been paged overnight
  25. COST MANAGEMENT_
    - Initially? Don’t worry about it
    - Look out for hidden costs
    - Spend money instead of dev time, it will always be cheaper
  26. COST MANAGEMENT_
    - Tune resource utilisation
    - Shut down test environments overnight
    - Buy reserved capacity
    - Tag resources with cost codes
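To put a rough number on shutting down test environments overnight, the sketch below compares scheduled running hours with an always-on week. It assumes compute billed per hour of running time and a weekday-only schedule; anything billed regardless of running state, such as storage, is not covered by this estimate. The function names are illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_running_hours(hours_per_weekday, weekdays=5):
    """Hours per week an environment runs on a weekday-only schedule."""
    if not 0 <= hours_per_weekday <= 24:
        raise ValueError("hours_per_weekday must be between 0 and 24")
    if not 0 <= weekdays <= 7:
        raise ValueError("weekdays must be between 0 and 7")
    return hours_per_weekday * weekdays


def saving_fraction(hours_per_weekday, weekdays=5):
    """Fraction of always-on running hours avoided by the schedule."""
    return 1 - weekly_running_hours(hours_per_weekday, weekdays) / HOURS_PER_WEEK


if __name__ == "__main__":
    # Example schedule: 12 hours a day, Monday to Friday.
    print(f"{saving_fraction(12):.0%} fewer running hours than always-on")
```

A 12-hour weekday schedule runs for 60 of the week's 168 hours, so under these assumptions roughly two thirds of the hourly compute charge goes away.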