EuRuKo 2015: Super-Reliable Software

Simon Hørup Eskildsen

October 18, 2015

Transcript

  1. Super-Reliable Software, by @Sirupsen

  2. None
  3. 175,000+ shops, $10 billion+, 200+ devs, 500+ servers, 2.5 datacentres

    Ruby on Rails, 17K peak RPS, 10,000 checkouts/m peak, 20+ daily deploys, 300M+ unique visits/month
  4. Simon Eskildsen, Infrastructure Engineer

  5. WHY DO WE CARE?

  6. None

  7. Mars Climate Orbiter

  8. Radioactive Intel Chips

  9. Build reliable systems from unreliable components

  10. Failure Layers: System, Communication, Application, Process

  11. System, Communication, Application, Process. System: CPU, RAM, Disk, OS

  12. Bit flips in memory

    0110 1100 0101 1101 0110 1110 0100 1110 0110 1100 0101 1100 0110 1110 0100 1110
  13. Causes of bit flips: heat, cosmic rays, electrical problems, hardware defects, utilization, radioactivity

  14. >= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet

  15. ... and even more in space

  16. Bitsquatting

  17. <img src="http://scontent-ord1-1.xx.fbcdn.net/11351355_10207026068118444_523264070493316017_n.jpg" alt="" />

  18. irb(main):024:0> bits = "fbcdn.net".unpack("B*")

    => ["011001100110001001100011011001000110111000101110011011100110010101110100"]
    irb(main):025:0> bits[0][11] = "1"
    => "1"
    irb(main):026:0> bits.pack("B*")
    => "frcdn.net"
  19. <img src="http://scontent-ord1-1.xx.frcdn.net/11351355_10207026068118444_523264070493316017_n.jpg" alt="" />
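The single flip in the irb session generalizes to every bit of a hostname. The following is a minimal sketch, not taken from the talk: the helper name `bitsquat_variants` and the `LABEL_BYTES` constant are made up for illustration, and the filter (lowercase letters, digits and hyphen only, dots left untouched) is a simplifying assumption about which flipped names are worth checking.

```ruby
# Bytes accepted for a flipped character: lowercase letters, digits, hyphen.
LABEL_BYTES = [*("a".."z"), *("0".."9"), "-"].map(&:ord).freeze

# Returns every name that differs from `domain` by exactly one flipped bit
# and still consists of characters from LABEL_BYTES (dots are never touched).
def bitsquat_variants(domain)
  variants = []
  domain.each_byte.with_index do |byte, index|
    next if byte == ".".ord # leave the label separators alone
    8.times do |bit|
      flipped = byte ^ (1 << bit)
      next unless LABEL_BYTES.include?(flipped)
      variant = domain.dup
      variant.setbyte(index, flipped)
      variants << variant
    end
  end
  variants
end

puts bitsquat_variants("fbcdn.net").include?("frcdn.net") # prints "true"
```

Each returned name is a candidate an attacker could register to receive traffic from machines whose memory flipped that one bit.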

  20. It happens in the wild. 30 domains:

  21. How is it prevented for memory?

    • Nothing: silent data corruption.
    • Parity: detects, but does not correct, (some) errors.
    • Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more than they can correct.
    • Application: can perform checksumming, be pessimistic and have redundancy.
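The application-level option can be sketched in a few lines. This is an illustrative example under assumptions, not code from the talk: the `seal`/`unseal` helpers and the `CorruptDataError` class are hypothetical names, and SHA-256 is just one possible checksum.

```ruby
require "digest/sha2"

# Hypothetical error raised when stored data no longer matches its checksum.
class CorruptDataError < StandardError; end

# Store the payload together with a checksum taken while the data is good.
def seal(payload)
  { payload: payload, checksum: Digest::SHA256.hexdigest(payload) }
end

# Be pessimistic on read: recompute the checksum before trusting the payload.
def unseal(record)
  actual = Digest::SHA256.hexdigest(record[:payload])
  raise CorruptDataError, "checksum mismatch" unless actual == record[:checksum]
  record[:payload]
end

record = seal("order:42:paid")

# Simulate a single flipped bit in the stored payload.
corrupted = record[:payload].dup
corrupted.setbyte(0, corrupted.getbyte(0) ^ 0b0000_0001)

begin
  unseal(record.merge(payload: corrupted))
rescue CorruptDataError => e
  puts "detected: #{e.message}"
end
```

A checksum like this only detects corruption; recovering the data still needs a redundant copy.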
  22. None

  23. simon@web102.dc3:~ $ sudo dmidecode --type 16

    # dmidecode 2.12
    SMBIOS 2.8 present.
    Handle 0x0059, DMI type 16, 23 bytes
    Physical Memory Array
      Location: System Board Or Motherboard
      Use: System Memory
      Error Correction Type: Multi-bit ECC
      Maximum Capacity: 512 GB
      Error Information Handle: Not Provided
      Number Of Devices: 8
  24. Would you buy discarded DRAM?

  25. Assume the system will fail

    • CPU: it will misbehave. It will overheat.
    • DRAM: it will corrupt your data. It will fail. It will modify your program.
    • Operating System: it will panic. It will be slow. It will corrupt your data.
    • Disk: it will corrupt data. It will be slow. It will fail.
  26. System, Communication, Application, Process

  27. Graceful loose coupling

  28. None

  29. It is usually not the first error that causes failure, but the errors that follow

  30. Redundant backups with degraded functionality

  31. A single component failure should not be able to compromise the performance or availability of the entire system
  32. None
  33. search sessions carts mysql cdn

  34. None
  35. Customer signed out

  36. Another example

    class Product
      def tags
        redis.smembers("product:#{id}:tags")
      end
    end

  37. Slightly better

    class Product
      def tags
        redis.smembers("product:#{id}:tags")
      rescue Redis::BaseError => e
        ErrorReporter.log(e)
        []
      end
    end
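The rescue-and-return-empty pattern on this slide can be tried without a Redis server. The sketch below is an assumption-laden stand-in: `FlakyStore`, its `Unavailable` error and the `with_fallback` helper are hypothetical names that only mimic a datastore client that may be down.

```ruby
# Hypothetical stand-in for a datastore client whose calls can fail.
class FlakyStore
  class Unavailable < StandardError; end

  def initialize(up:)
    @up = up
  end

  def tags_for(id)
    raise Unavailable, "tag store is down" unless @up
    ["tag-for-#{id}"]
  end
end

# Run the block; on a datastore error, record it and return the fallback.
def with_fallback(fallback, errors: [])
  yield
rescue FlakyStore::Unavailable => e
  errors << e
  fallback
end

errors = []
p with_fallback([], errors: errors) { FlakyStore.new(up: true).tags_for(1) }
p with_fallback([], errors: errors) { FlakyStore.new(up: false).tags_for(1) }
p errors.size
```

The second call degrades to an empty tag list instead of failing the whole page, while the error is still recorded for later inspection.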
  38. System, Communication, Application, Process

  39. (Micro)service equation: $\text{Uptime} = A^{N}$, where $A$ is the availability per service, $N$ is the number of services, and the result is the total availability
  40. [Chart: total availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), with series labelled 99.95, 99.98, 99.99 and 99.999]
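The equation from slide 39 can be evaluated for the values on the chart axes. This is a minimal sketch that assumes every service fails independently and that a request needs all of them; the helper name `total_availability` is made up for illustration.

```ruby
# Total availability when a request depends on N services that each have
# availability A and fail independently: A ** N.
def total_availability(per_service, services)
  per_service**services
end

[0.9995, 0.9998, 0.9999, 0.99999].each do |a|
  cells = [10, 50, 100, 500, 1000].map do |n|
    format("N=%-4d %6.2f%%", n, total_availability(a, n) * 100)
  end
  puts format("A=%.3f%%  %s", a * 100, cells.join("  "))
end
```

For example, 100 services at 99.99% each give roughly 99% overall, which is why per-service availability that looks excellent in isolation is not enough as the number of dependencies grows.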
  41. Even monolithic services easily have tens of dependencies: RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …
  42. Resiliency Matrix

                          Checkout               Admin        Storefront
    MySQL Shard           Unavailable            Unavailable  Degraded
    MySQL Master          Available (if cached)  Unavailable  Available
    Kafka                 Available              Degraded     Available
    External HTTP API     Degraded               Available    Unavailable
    redis-sessions        Unavailable            Unavailable  Degraded
    logging (disk full)   Unavailable            Unavailable  Unavailable
  43. Simulate TCP conditions with Toxiproxy (https://github.com/shopify/toxiproxy)

    Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
      Shop.first # this takes at least 1s
    end

    Toxiproxy[/redis/].down do
      session[:user_id] # this will throw an exception
    end
  44. Write a Toxiproxy test for each cell

    # test/integration/resiliency_matrix_test.rb
    def test_section_a_mq_a_down
      Toxiproxy[:message_queue_a].down do
        get '/section_a'
        assert_response :success
      end
    end

    def test_section_b_datastore_b
      Toxiproxy[:datastore_b].down do
        get '/section_b'
        assert_response 500
      end
    end

    # ... and every other cell
  45. With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
  46. Timeouts

    Gem         Default  Reasonable
    Unicorn     60s      ~5s
    Net::HTTP   60s      ~2s
    mysql2      N/A      ~5s
    redis-rb    N/A      ~0.5s
    AWS::S3     60s      ~2s
    memcached   0.5s     ~0.5s
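As one concrete case from the table, Net::HTTP lets the connect and read timeouts be set on the client object. The sketch below assumes a made-up helper name, `http_with_timeouts`, and a placeholder host; the 2 second value is the "reasonable" figure from the slide, not a universal recommendation.

```ruby
require "net/http"

# Build a Net::HTTP client with explicit timeouts instead of the defaults.
# Nothing connects until a request is actually made.
def http_with_timeouts(host, port = 80, seconds: 2)
  http = Net::HTTP.new(host, port)
  http.open_timeout = seconds # time allowed to establish the connection
  http.read_timeout = seconds # time allowed to wait for each read
  http
end

http = http_with_timeouts("api.example.com")
puts "open=#{http.open_timeout}s read=#{http.read_timeout}s"
# A request such as http.get("/status") would now raise Net::OpenTimeout or
# Net::ReadTimeout once the limit is exceeded.
```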
  47. None
  48. Timeouts will waste resources. Response time might go up by at least that timeout.
  49. Circuit Breakers: after e timeouts, raise immediately for t seconds
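The rule on this slide fits in a small class. This is a minimal in-process sketch under assumptions, not Semian's implementation: the class name, keyword arguments and injectable `clock` are hypothetical, it counts consecutive `Timeout::Error`s only, and it has no half-open state.

```ruby
require "timeout"

# After `error_threshold` consecutive timeouts, raise immediately for
# `open_seconds` instead of calling the slow resource again.
class CircuitBreaker
  class OpenError < StandardError; end

  MONOTONIC = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(error_threshold:, open_seconds:, clock: MONOTONIC)
    @error_threshold = error_threshold
    @open_seconds = open_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, failing fast" if open?
    result = yield
    @failures = 0 # a success resets the failure count
    result
  rescue Timeout::Error
    @failures += 1
    @opened_at = @clock.call if @failures >= @error_threshold
    raise
  end

  private

  def open?
    return false unless @opened_at
    return true if @clock.call - @opened_at < @open_seconds
    @opened_at = nil # window elapsed: let the next call through
    @failures = 0
    false
  end
end
```

Wrapping a call as `breaker.call { client.get(key) }` means that once the threshold is reached, callers pay the cost of an exception rather than the cost of another full timeout.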
  50. Bulkheads: only allow t processes to access a resource simultaneously
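A bulkhead can be sketched as a ticket counter. Note the swapped-in scope: the slide talks about processes, while this illustrative example limits threads inside a single process, so it shows the idea rather than what any particular library does. The class and method names are hypothetical.

```ruby
# At most `tickets` callers may use the resource at once; extra callers are
# rejected immediately instead of queueing behind a slow resource.
class Bulkhead
  class FullError < StandardError; end

  def initialize(tickets:)
    @tickets = tickets
    @in_use = 0
    @lock = Mutex.new
  end

  def acquire
    @lock.synchronize do
      raise FullError, "no tickets left" if @in_use >= @tickets
      @in_use += 1
    end
    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @in_use -= 1 }
    end
  end
end

bulkhead = Bulkhead.new(tickets: 2)
puts bulkhead.acquire { "query ran with a ticket" }
```

Rejecting callers up front caps how much of the application a single slow dependency can tie up.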
  51. Shopify/semian

  52. System, Communication, Application, Process

  53. Antifragile Process: Fragile, Robust, Antifragile

  54. Root cause analysis

  55. Game day

  56. None

  57. Resiliency Maturity Pyramid: No resiliency effort; Testing with mocks; Toxiproxy tests and matrix; Resiliency Patterns; Production Practise Days (Games); Kill nodes; Latency; Application-Specific Fallbacks; Kill DC
  58. Balance reliability and failure tolerance. Your app is the sum.

    [Chart: Reliability (98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%) against Failure Tolerance (None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile). Stage labels: Foundation, Early, Maturing, Mature apps. Plotted labels: Docker, Shopify, Prototype, A low quality Erlang app, Rails, MySQL, Ruby, Linux, ECC memory, Your shitty bank's COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness, Aerospace, Probably you, Premature resiliency, Hackathon, Redis, DNS, Google, HFT]
  59. Final remarks

    • Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks
    • Everything will fail. Embrace graceful loose coupling.
    • Make resiliency fun!
    • Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
  60. Resources

    • @Sirupsen on Twitter
    • github.com/Shopify/semian: Resiliency toolkit for Ruby
    • github.com/Shopify/toxiproxy: Testing resiliency (any language)
    • github.com/eapache/go-resiliency: Go library for resiliency
    • github.com/Netflix/Hystrix: JVM library for resiliency
    • Release It: Incredible book on resiliency
    • Shopify Blog post on Resiliency
    • Netflix Engineering Blog
    • Hystrix Wiki
    • Semian README
    • Talk on Resilient Routing and Discovery @ DockerCon 15
    • Talk on Building and Testing Resilient Applications @ GoRuCo 15
    • ParisRB June 15 slides from @byroot
  61. Thank You! Follow @Sirupsen

  62. Credits

    • Atomic by Ema Dimitrova from the Noun Project
    • Rocket by Simon Mettler from the Noun Project
    • processor by Creative Stall from the Noun Project
    • Another Squat Discovered by Nahemoth (Flickr)
    • Starting Line by rowens27 (Flickr)
    • Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
    • Under Pressure by Feans (Flickr)