CUSEC 2016: Reliable Software in a Chaotic World

What can we learn from software with aggressive reliability requirements where human lives are on the line, such as spacecraft software? What can we integrate into the everyday systems we build, and how can we write tests for these requirements? Are cosmic rays really flipping bits in your machine? At Shopify we’ve taken some of those resiliency patterns to production. We’ll show how we think about resiliency and how we write automated tests to ensure continued reliability. This talk introduces humbling concepts from some of the most complex software in the world, with a perspective on real-world tools, resources and techniques that you can adopt today.

Simon Hørup Eskildsen

January 14, 2016

Transcript

  1. Reliable Software in a Chaotic World BY @SIRUPSEN

  2. None
  3. 200,000+ SHOPS, $10 BILLION+, 250+ DEVS, 500+ SERVERS, 2.5 DATACENTRES, RUBY ON RAILS, 17K PEAK RPS, 10,000 CHECKOUTS/MIN PEAK, 20+ DAILY DEPLOYS, 300M+ UNIQUE VISITS/MONTH
  4. SIMON ESKILDSEN, INFRASTRUCTURE ENGINEER

  5. WHY DO WE CARE?

  6. Radioactive Intel Chips

  7. Build reliable systems from unreliable components

  8. Failure Layers: System, Communication, Application, Process

  9. Failure Layers: System, Communication, Application, Process. System: CPU, RAM, Disk, OS

  10. BIT FLIPS IN MEMORY: 0110 1100 0101 1101 0110 1110 0100 1110 0110 1100 0101 1100 0110 1110 0100 1110
  11. CAUSES OF BIT FLIPS: Heat, Cosmic rays, Electrical problems, Hardware defects, Utilization, Radioactivity
  12. >= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet
  13. ... and even more in space

  14. BITSQUATTING

  15. <script src="http://scontent-ord1-1.xx.fbcdn.net/11351355_10207026068118444_523264070493316017_n.js" type="text/javascript" />

  16. Flipping a single bit of "fbcdn.net" in irb:

        irb(main):024:0> bits = "fbcdn.net".unpack("B*")
        => ["011001100110001001100011011001000110111000101110011011100110010101110100"]
        irb(main):025:0> bits[0][11] = "1"
        => "1"
        irb(main):026:0> bits.pack("B*")
        => "frcdn.net"
  17. <script src="http://scontent-ord1-1.xx.frcdn.net/11351355_10207026068118444_523264070493316017_n.js" type="text/javascript" />
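The irb session flips one chosen bit. As a rough sketch of the same idea applied to every bit, the following enumerates all hostnames that are one bit flip away from a given domain. The helper name `bit_flip_variants` and the constant `HOSTNAME_BYTES` are made up for this illustration, and the code assumes a plain ASCII domain.

```ruby
# Bytes that may appear in a hostname label. Lower case only: DNS names are
# case-insensitive, so a flip that only changes case is the same name.
HOSTNAME_BYTES = (("a".."z").to_a + ("0".."9").to_a + ["-"]).map(&:ord).freeze

# Returns every hostname that differs from `domain` by exactly one flipped bit.
# Assumes `domain` is ASCII, so byte offsets equal character offsets.
def bit_flip_variants(domain)
  variants = []
  domain.each_byte.with_index do |byte, index|
    next if byte == ".".ord # leave the dots alone so the label structure survives

    8.times do |bit|
      flipped = byte ^ (1 << bit)
      next unless HOSTNAME_BYTES.include?(flipped)

      variants << (domain[0...index] + flipped.chr + domain[(index + 1)..-1])
    end
  end
  variants.uniq
end

variants = bit_flip_variants("fbcdn.net")
puts variants.include?("frcdn.net") # the flip from the irb session
puts variants.first(5).inspect
```

Each returned name is a domain that a single flipped bit in memory could turn the original request into, which is what makes registering such names (bitsquatting) possible.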

  18. How is it prevented for memory?

    • Nothing: silent data corruption. Your laptop.
    • Parity: detect, but not correct, (some) errors. Uncommon.
    • Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more errors than they can correct. Your server.
    • Application: can perform checksumming, be pessimistic and have redundancy. Good software.
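As a small illustration of the application-level option in the list above, the sketch below stores a CRC32 checksum next to a payload and re-checks it on read. `Record` and `flip_bit!` are hypothetical names for this example only. CRC32 detects the corruption but cannot repair it; recovery would need a redundant copy.

```ruby
require "zlib"

# A payload stored together with the CRC32 checksum computed at write time.
Record = Struct.new(:payload, :checksum) do
  def self.store(payload)
    new(payload.dup, Zlib.crc32(payload))
  end

  # True when the payload still matches the checksum taken at write time.
  def intact?
    Zlib.crc32(payload) == checksum
  end
end

# Flips one bit of a string in place (most significant bit first, like
# unpack("B*")), simulating memory corruption.
def flip_bit!(string, bit_index)
  byte_index, bit = bit_index.divmod(8)
  string.setbyte(byte_index, string.getbyte(byte_index) ^ (0x80 >> bit))
  string
end

record = Record.store("fbcdn.net")
puts record.intact?           # true
flip_bit!(record.payload, 11) # same bit as in the irb session
puts record.payload           # frcdn.net
puts record.intact?           # false: the corruption is detected
```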
  19. Assume the system will fail. Accept that you’re powerless.

    • CPU: it will misbehave. It will overheat.
    • DRAM: it will corrupt your data. It will fail. It will modify your program.
    • Operating System: it will panic. It will be slow. It will corrupt your data.
    • Disk: it will corrupt data. It will be slow. It will fail.
  20. System, Communication, Application, Process

  21. GRACEFUL LOOSE COUPLING

  22. None

  23. It is usually not the first error that causes failure, but the errors that follow

  24. Redundant backups with degraded functionality

  25. Fallbacks: secondary and possibly degraded functionality to fall back on

  26. A single component failure should not be able to compromise the performance or availability of the entire system
  27. None
  28. search, sessions, carts, mysql, cdn

  29. None
  30. Customer signed out
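The sessions example can be sketched as a fallback: if the session store is unreachable, treat the customer as signed out instead of failing the whole page. Everything here (`FakeSessionStore`, `SessionStoreUnavailable`, `current_customer_id`) is a hypothetical stand-in, not Shopify's actual code.

```ruby
# Hypothetical error raised when the session store cannot be reached.
class SessionStoreUnavailable < StandardError; end

# Stand-in for a networked session store; illustrative only.
class FakeSessionStore
  def initialize(data, available: true)
    @data = data
    @available = available
  end

  def fetch(session_id)
    raise SessionStoreUnavailable, "session store is down" unless @available

    @data[session_id]
  end
end

# Returns the signed-in customer id, or nil (signed out) when the session
# store fails, so the page can still render in a degraded state.
def current_customer_id(store, session_id)
  store.fetch(session_id)
rescue SessionStoreUnavailable
  nil
end

healthy = FakeSessionStore.new({ "abc" => 42 })
broken  = FakeSessionStore.new({ "abc" => 42 }, available: false)

puts current_customer_id(healthy, "abc").inspect # 42
puts current_customer_id(broken, "abc").inspect  # nil: the customer appears signed out
```

The degraded result is worse than the normal one, but the failure of one component no longer takes the page down with it.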

  31. System, Communication, Application, Process

  32. (Micro)service equation: Uptime = A^N, where Uptime is the total availability, A is the availability per service and N is the number of services
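A quick way to get a feel for the equation is to evaluate it for a few service counts. This sketch assumes that every request needs all N services and that the services fail independently; `total_availability` is a name chosen for this example.

```ruby
# Total availability when a request needs all `services` to be up and each
# is independently available `per_service` of the time (Uptime = A^N).
def total_availability(per_service, services)
  per_service**services
end

[10, 50, 100, 500, 1000].each do |n|
  percent = total_availability(0.9999, n) * 100
  puts format("%4d services at 99.99%% each: %.2f%% total", n, percent)
end
```

With 99.99% per service, 100 services already bring the total down to roughly 99.0%, and 1,000 services to roughly 90.5%.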
  33. [Chart: total availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), for per-service availabilities of 99.95, 99.98, 99.99 and 99.999]
  34. Resiliency Matrix

    | Dependency          | Checkout              | Admin       | Storefront  |
    |---------------------|-----------------------|-------------|-------------|
    | MySQL Shard         | Unavailable           | Unavailable | Degraded    |
    | MySQL Master        | Available (if cached) | Unavailable | Available   |
    | Kafka               | Available             | Degraded    | Available   |
    | External HTTP API   | Degraded              | Available   | Unavailable |
    | redis-sessions      | Unavailable           | Unavailable | Degraded    |
    | logging (disk full) | Unavailable           | Unavailable | Unavailable |
  35. Simulate TCP conditions with Toxiproxy (https://github.com/shopify/toxiproxy)

        Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
          Shop.first # this takes at least 1s
        end

        Toxiproxy[/redis/].down do
          session[:user_id] # this will throw an exception
        end
  36. Write a Toxiproxy test for each cell

        # test/integration/resiliency_matrix_test.rb
        def test_section_a_mq_a_down
          Toxiproxy[:message_queue_a].down do
            get '/section_a'
            assert_response :success
          end
        end

        def test_section_b_datastore_b
          Toxiproxy[:datastore_b].down do
            get '/section_b'
            assert_response 500
          end
        end

        # ... and every other cell
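Writing one test per cell by hand gets repetitive. One possible way to keep the matrix and the tests in sync is to hold the resiliency matrix as data and derive one test case per cell from it. The sketch below only enumerates the cells and does not call Toxiproxy. The proxy names, the test naming scheme and the mapping from cell state to HTTP status are assumptions made for illustration.

```ruby
# The resiliency matrix from the talk as data: dependency => section => state.
# The symbol names are hypothetical; "Available (if cached)" is simplified
# to :available.
RESILIENCY_MATRIX = {
  mysql_shard:       { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  mysql_master:      { checkout: :available,   admin: :unavailable, storefront: :available },
  kafka:             { checkout: :available,   admin: :degraded,    storefront: :available },
  external_http_api: { checkout: :degraded,    admin: :available,   storefront: :unavailable },
  redis_sessions:    { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  logging:           { checkout: :unavailable, admin: :unavailable, storefront: :unavailable },
}.freeze

# Assumption: available and degraded cells still respond successfully,
# unavailable cells are expected to return a 500.
def expected_status(state)
  state == :unavailable ? 500 : 200
end

# Returns one [test_name, dependency, section, expected_status] entry per cell.
def matrix_cells(matrix)
  matrix.flat_map do |dependency, sections|
    sections.map do |section, state|
      ["test_#{section}_with_#{dependency}_down", dependency, section, expected_status(state)]
    end
  end
end

matrix_cells(RESILIENCY_MATRIX).each do |name, _dependency, _section, status|
  puts "#{name}: expect #{status}"
end
```

In a real suite each entry could become a generated test method that wraps the request in `Toxiproxy[dependency].down do ... end`, as in the hand-written tests above.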
  37. With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
  38. None
  39. Timeouts will waste resources. Response time might be up by at least that timeout.
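A back-of-the-envelope sketch of why waiting out timeouts is so costly: with a fixed pool of blocking workers, throughput is bounded by the number of workers divided by the time each request holds one. The pool size, response time and timeout below are made-up numbers, and the model ignores queueing.

```ruby
# Upper bound on throughput for a pool of blocking workers when every request
# holds a worker for `seconds_per_request` (workers / time held).
def max_requests_per_second(workers, seconds_per_request)
  workers / seconds_per_request.to_f
end

workers = 100  # hypothetical worker pool size
normal  = 0.05 # hypothetical 50 ms healthy response time
timeout = 2.0  # hypothetical 2 s timeout on a hung dependency

healthy  = max_requests_per_second(workers, normal)
degraded = max_requests_per_second(workers, normal + timeout)

puts format("healthy:  %.0f requests/s", healthy)
puts format("degraded: %.0f requests/s", degraded)
```

Under these assumptions, capacity drops from about 2,000 to about 49 requests per second while every request waits for the timeout, even though a fallback exists.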
  40. Circuit Breakers: after e timeouts, raise immediately for t seconds
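The rule on this slide can be sketched in a few lines of Ruby. This is a simplified, single-threaded illustration, not the implementation used by Semian or Hystrix. The class and parameter names (`CircuitBreaker`, `error_threshold` for e, `open_seconds` for t) are chosen for this example, and the clock is injectable so the behaviour can be shown without sleeping.

```ruby
require "timeout"

class CircuitOpenError < StandardError; end

# After `error_threshold` consecutive timeouts, raise immediately for
# `open_seconds`, then let calls through again.
class CircuitBreaker
  def initialize(error_threshold:, open_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @open_seconds = open_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise CircuitOpenError, "circuit open, failing fast" if open?

    result = yield
    @failures = 0 # a success resets the consecutive-timeout count
    result
  rescue Timeout::Error
    @failures += 1
    @opened_at = @clock.call if @failures >= @error_threshold
    raise
  end

  private

  def open?
    return false if @opened_at.nil?
    return true if @clock.call - @opened_at < @open_seconds

    # The open period has elapsed: close the circuit and start counting again.
    @opened_at = nil
    @failures = 0
    false
  end
end

now = 0.0 # fake clock, advanced by hand below
breaker = CircuitBreaker.new(error_threshold: 2, open_seconds: 5, clock: -> { now })

2.times do
  begin
    breaker.call { raise Timeout::Error, "dependency is slow" }
  rescue Timeout::Error
    # the first e timeouts are paid in full
  end
end

begin
  breaker.call { :never_runs }
rescue CircuitOpenError => e
  puts e.message # further calls fail fast without touching the dependency
end

now += 5
puts breaker.call { :ok } # after t seconds calls go through again
```

Production libraries typically add a half-open state that lets a single trial request through before fully closing the circuit; this sketch omits that for brevity.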
  41. System, Communication, Application, Process

  42. Antifragile Process: Fragile, Robust, Antifragile

  43. ROOT CAUSE ANALYSIS

  44. None

  45. Resiliency Maturity Pyramid: No resiliency effort; Testing with mocks; Toxiproxy tests and resiliency matrix; Resiliency Patterns; Production Practise Days (Games); Kill nodes; Latency; Application-Specific Fallbacks; Kill DC
  46. Balance reliability and failure tolerance. Your app is the sum.

    [Chart: Reliability (98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%) against Failure Tolerance (None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile). Regions: Premature resiliency, Early, Maturing, Mature apps, Foundation. Plotted: Hackathon, Prototype, Docker, Shopify, A low quality Erlang app, Rails, MySQL, Ruby, Linux, Redis, DNS, ECC memory, Google, HFT, Datacenter, Aerospace, Your app, Your shitty bank’s COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness]
  47. Go Build This! Send me an email so we can work on it together. Seriously. I love this stuff. simon.eskildsen@shopify.com

    • Simulate infrastructures and resiliency configuration
    • Create a Card Game where you compete to keep your app up
    • Draw a resiliency matrix for your application
    • Then write Toxiproxy tests for it!
    • Let the Chaos Monkeys loose for your side-project
    • Write a DSL to orchestrate complex failure scenarios
    • … your ideas? Tell me about them!

    Detailed at bit.ly/sirupsen-resiliency
  48. Final remarks: Where are you on the resiliency pyramid? Everyone should have a resiliency matrix and basic tests. Everything will fail. Embrace graceful loose coupling. Make resiliency fun! Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
  49. Resources

    • @Sirupsen on Twitter
    • github.com/Shopify/semian: Resiliency toolkit for Ruby
    • github.com/Shopify/toxiproxy: Testing resiliency (any language)
    • github.com/eapache/go-resiliency: Go library for resiliency
    • github.com/Netflix/Hystrix: JVM library for resiliency
    • Release It: Incredible book on resiliency
    • Shopify Blog post on Resiliency
    • Netflix Engineering Blog
    • Hystrix Wiki
    • Semian README
    • Talk on Resilient Routing and Discovery @ DockerCon 15
    • Talk on Building and Testing Resilient Applications @ GoRuCo 15
    • ParisRB June 15 slides from @byroot
  50. Thank You! FOLLOW @SIRUPSEN

  51. Credits

    • Atomic by Ema Dimitrova from the Noun Project
    • Rocket by Simon Mettler from the Noun Project
    • processor by Creative Stall from the Noun Project
    • Another Squat Discovered by Nahemoth (Flickr)
    • Starting Line by rowens27 (Flickr)
    • Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
    • Under Pressure by Feans (Flickr)