EuRuKo 2015: Super-Reliable Software

Simon Hørup Eskildsen

October 18, 2015

Transcript

  1. 2.
  2. 3.

    • 175,000+ shops
    • $10 billion+
    • 200+ devs
    • 500+ servers
    • 2.5 datacentres
    • Ruby on Rails
    • 17K peak RPS
    • 10,000 checkouts/min peak
    • 20+ daily deploys
    • 300M+ unique visits/month
  3. 6.

  4. 12.

    BIT FLIPS IN MEMORY

    0110 1100 0101 1101 0110 1110 0100 1110 0110 1100 0101 1100 0110 1110 0100 1110
  5. 14.

    >= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet
  6. 21.

    How is it prevented for memory?

    • Nothing: silent data corruption.
    • Parity: detect, but not correct, (some) errors.
    • Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more than it can correct.
    • Application: can perform checksumming, be pessimistic and have redundancy.
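
The application-level option in the list above can be sketched in a few lines of Ruby. This is only an illustration, not something shown in the talk: it assumes CRC32 from Ruby's standard `zlib` library as the checksum, and the names `seal`, `verify!` and `CorruptionError` are hypothetical.

```ruby
require 'zlib'

# Raised when stored data no longer matches its checksum.
class CorruptionError < StandardError; end

# Hypothetical helper: pair a payload with its CRC32 checksum.
def seal(payload)
  { data: payload.dup, crc: Zlib.crc32(payload) }
end

# Hypothetical helper: return the payload, or raise if it was modified.
def verify!(record)
  unless Zlib.crc32(record[:data]) == record[:crc]
    raise CorruptionError, 'checksum mismatch'
  end
  record[:data]
end

record = seal('order total: 1200')
verify!(record) # passes: the data is intact

# Simulate a single bit flip in the first byte of the payload.
record[:data].setbyte(0, record[:data].getbyte(0) ^ 0b0000_0001)

begin
  verify!(record)
rescue CorruptionError => e
  puts "detected: #{e.message}"
end
```

A checksum like this only detects corruption. Recovering from it still needs a redundant copy to read from, which is the "have redundancy" part of the bullet.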
  7. 22.

  8. 23.

    simon@web102.dc3:~ $ sudo dmidecode --type 16
    # dmidecode 2.12
    SMBIOS 2.8 present.

    Handle 0x0059, DMI type 16, 23 bytes
    Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 512 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8
  9. 25.

    Assume the system will fail

    • CPU: it will misbehave. It will overheat.
    • DRAM: it will corrupt your data. It will fail. It will modify your program.
    • Operating System: it will panic. It will be slow. It will corrupt your data.
    • Disk: it will corrupt data. It will be slow. It will fail.
  10. 28.

  11. 31.

    A single component failure should not be able to compromise the performance or availability of the entire system.
  12. 32.
  13. 34.
  14. 40.

    [Chart: overall availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), with curves labelled 99.95, 99.98, 99.99 and 99.999]
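
The relationship between the number of services and overall availability can be worked through with a small Ruby sketch. It rests on two assumptions that the slide does not state: every request needs all of its dependencies, and dependencies fail independently. Under those assumptions the overall availability is the per-service availability raised to the number of services.

```ruby
# Overall availability when a request needs all of `services` dependencies,
# each independently available with probability `per_service`.
# Assumption (not stated on the slide): failures are independent.
def overall_availability(per_service, services)
  per_service**services
end

# One row per per-service availability, one column per service count.
[0.9995, 0.9998, 0.9999, 0.99999].each do |a|
  row = [10, 50, 100, 500, 1000].map do |n|
    format('%6.2f%%', overall_availability(a, n) * 100)
  end
  puts format('%-8s %s', "#{(a * 100).round(3)}%", row.join(' '))
end
```

For example, 100 services at 99.99% each give roughly 99.0% overall, which is why adding dependencies erodes availability even when each one looks reliable on its own.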
  15. 41.

    Even monolithic services easily have tens of dependencies: RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …
  16. 42.

    Resiliency Matrix

                          Checkout               Admin         Storefront
    MySQL Shard           Unavailable            Unavailable   Degraded
    MySQL Master          Available (if cached)  Unavailable   Available
    Kafka                 Available              Degraded      Available
    External HTTP API     Degraded               Available     Unavailable
    redis-sessions        Unavailable            Unavailable   Degraded
    logging (disk full)   Unavailable            Unavailable   Unavailable
  17. 43.

    Simulate TCP conditions with Toxiproxy

    https://github.com/shopify/toxiproxy

    Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
      Shop.first # this takes at least 1s
    end

    Toxiproxy[/redis/].down do
      session[:user_id] # this will throw an exception
    end
  18. 44.

    Write a Toxiproxy test for each cell

    # test/integration/resiliency_matrix_test.rb
    def test_section_a_mq_a_down
      Toxiproxy[:message_queue_a].down do
        get '/section_a'
        assert_response :success
      end
    end

    def test_section_b_datastore_b
      Toxiproxy[:datastore_b].down do
        get '/section_b'
        assert_response 500
      end
    end

    # ... and every other cell
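
One possible way to keep "every other cell" manageable is to hold the matrix as data and derive one test case per cell from it. The sketch below is plain Ruby and does not call Toxiproxy or Rails. The matrix encoding, the proxy names, the paths, and the mapping from cell state to expected response are all assumptions made for illustration.

```ruby
# Hypothetical encoding of two rows of a resiliency matrix.
MATRIX = {
  mysql_shard:    { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  redis_sessions: { checkout: :unavailable, admin: :unavailable, storefront: :degraded }
}.freeze

# Assumed mapping from a cell's state to the response a test should expect.
EXPECTED_RESPONSE = {
  available:   :success,
  degraded:    :success,
  unavailable: 500
}.freeze

# Flatten the matrix into one test case per cell.
def resiliency_cases(matrix)
  matrix.flat_map do |dependency, sections|
    sections.map do |section, state|
      {
        name: "test_#{section}_with_#{dependency}_down",
        proxy: dependency,
        path: "/#{section}",
        expect: EXPECTED_RESPONSE.fetch(state)
      }
    end
  end
end

resiliency_cases(MATRIX).each do |c|
  puts "#{c[:name]}: GET #{c[:path]} => #{c[:expect]}"
end
```

In an integration test class, each case could then become a method whose body takes the named proxy down, issues the request, and asserts the expected response, in the same shape as the hand-written tests on the slide.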
  19. 45.

    With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
  20. 46.

    TIMEOUTS

    Gem         Default   Reasonable
    Unicorn     60s       ~5s
    Net::HTTP   60s       ~2s
    mysql2      N/A       ~5s
    redis-rb    N/A       ~0.5s
    AWS::S3     60s       ~2s
    memcached   0.5s      ~0.5s
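
As an illustration of tightening one of these defaults, the sketch below configures Ruby's standard `Net::HTTP` client with explicit timeouts. The helper name `http_client` and the host are placeholders, and the 2-second value simply follows the "reasonable" column above. No request is made, so nothing here touches the network.

```ruby
require 'net/http'

# Build a Net::HTTP client with explicit timeouts (in seconds).
# `http_client` is a hypothetical helper, not part of any library.
def http_client(host, port = 80, timeout: 2)
  http = Net::HTTP.new(host, port)
  http.open_timeout = timeout # max time to establish the connection
  http.read_timeout = timeout # max time to wait for each read from the socket
  http
end

client = http_client('api.example.com')
puts client.open_timeout
puts client.read_timeout
```

When a limit is exceeded, `Net::HTTP` raises `Net::OpenTimeout` or `Net::ReadTimeout`, which gives the application a place to rescue and fall back instead of tying up a worker for a full minute.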
  21. 47.
  22. 56.

  23. 57.

    Resiliency Maturity Pyramid

    [Figure: pyramid with the levels No resiliency effort, Testing with mocks, Toxiproxy tests and matrix, Resiliency Patterns, Production Practise Days (Games). Further labels on the figure: Kill nodes, Latency, Application-Specific Fallbacks, Kill DC.]
  24. 58.

    Balance reliability and failure tolerance. Your app is the sum.

    [Figure: plot of failure tolerance against reliability.
    Reliability axis: 98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%.
    Failure tolerance axis: None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile.
    Regions: EARLY, MATURING, MATURE APPS, FOUNDATION.
    Plotted labels: Docker, Shopify, Prototype, A low quality Erlang app, Rails, MySQL, Ruby, Linux, ECC memory, Your shitty bank's COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness, Aerospace, Probably you, Premature resiliency, Hackathon, Redis, DNS, Google, HFT.]
  25. 59.

    Final remarks

    • Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks
    • Everything will fail. Embrace graceful loose coupling.
    • Make resiliency fun!
    • Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
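
To make "application-specific fallbacks" concrete, here is a minimal Ruby sketch of one. It is not code from the talk: `FlakySessionStore` is a made-up stand-in for a session store client, and the fallback policy (treat the visitor as signed out) is just one possible choice for a page that can still render in a degraded form.

```ruby
# Stand-in for a session store client; raises the way a client would when
# the backing service refuses connections.
class FlakySessionStore
  def initialize(up: true)
    @up = up
    @data = { 'abc' => { user_id: 42 } }
  end

  def fetch(session_id)
    raise Errno::ECONNREFUSED, 'session store is down' unless @up
    @data[session_id]
  end
end

# Application-specific fallback: treat the visitor as signed out instead of
# failing the whole request when the session store is unavailable.
def current_user_id(store, session_id)
  session = store.fetch(session_id)
  session && session[:user_id]
rescue Errno::ECONNREFUSED
  nil
end

puts current_user_id(FlakySessionStore.new(up: true), 'abc').inspect
puts current_user_id(FlakySessionStore.new(up: false), 'abc').inspect
```

The fallback is deliberately narrow: it rescues only the connection error it knows how to handle, so unrelated bugs still surface as failures.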
  26. 60.

    Resources

    • @Sirupsen on Twitter
    • github.com/Shopify/semian: Resiliency toolkit for Ruby
    • github.com/Shopify/toxiproxy: Testing resiliency (any language)
    • github.com/eapache/go-resiliency: Go library for resiliency
    • github.com/Netflix/Hystrix: JVM library for resiliency
    • Release It: Incredible book on resiliency
    • Shopify Blog post on Resiliency
    • Netflix Engineering Blog
    • Hystrix Wiki
    • Semian README
    • Talk on Resilient Routing and Discovery @ DockerCon 15
    • Talk on Building and Testing Resilient Applications @ GoRuCo 15
    • ParisRB June 15 slides from @byroot
  27. 62.

    Credits

    • Atomic by Ema Dimitrova from the Noun Project
    • Rocket by Simon Mettler from the Noun Project
    • processor by Creative Stall from the Noun Project
    • Another Squat Discovered by Nahemoth (Flickr)
    • Starting Line by rowens27 (Flickr)
    • Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
    • Under Pressure by Feans (Flickr)