CUSEC 2016: Reliable Software in a Chaotic World

What can we learn from software with aggressive reliability requirements, where human lives are on the line, such as spacecraft? What can we integrate into the everyday systems we build, and how can we write tests for these requirements? Are cosmic rays really flipping bits in your machine? At Shopify we’ve taken some of those resiliency patterns to production. We’ll show how we think about resiliency and write automated tests to ensure continued reliability. This talk will introduce humbling concepts from the most complex software in the world, with a perspective on real-world tools, resources and techniques that you can adopt today.

Simon Hørup Eskildsen

January 14, 2016

Transcript

  1. 200,000+ SHOPS, $10 BILLION+, 250+ DEVS, 500+ SERVERS, 2.5 DATACENTRES

    RUBY ON RAILS, 17K PEAK RPS, 10,000 CHECKOUTS/M PEAK, 20+ DAILY DEPLOYS, 300M+ UNIQUE VISITS/MONTH
  2. BIT FLIPS IN MEMORY
  3. >= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet
  4. How is it prevented for memory?

    • Nothing: silent data corruption. Your laptop.
    • Parity: detect, but not correct, (some) errors. Uncommon.
    • Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more than they can correct. Your server.
    • Application: can perform checksumming, be pessimistic and have redundancy. Good software.
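The "Application" bullet above mentions checksumming. As a minimal sketch of that idea (not code from the talk), the snippet below stores a CRC-32 next to a payload and re-verifies it on read. The helper names `seal`, `intact?` and `flip_bit` and the sample payload are hypothetical, chosen only for illustration.

```ruby
require "zlib"

# Pair a payload with its CRC-32 so later corruption can be detected.
def seal(payload)
  [payload, Zlib.crc32(payload)]
end

# Recompute the checksum and compare it with the stored one.
def intact?(payload, checksum)
  Zlib.crc32(payload) == checksum
end

# Simulate a single bit flip in memory (hypothetical helper for the demo).
def flip_bit(payload, bit_index)
  bytes = payload.bytes
  bytes[bit_index / 8] ^= (1 << (bit_index % 8))
  bytes.pack("C*")
end

payload, checksum = seal("order=1001;total=42.00")
corrupted = flip_bit(payload, 13)

puts intact?(payload, checksum)   # true
puts intact?(corrupted, checksum) # false
```

A checksum like this only detects corruption. Recovering from it still needs the redundancy the slide mentions, such as a second copy to fall back on.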
  5. Assume the system will fail. Accept that you’re powerless.

    • CPU: it will misbehave. It will overheat.
    • DRAM: it will corrupt your data. It will fail. It will modify your program.
    • Operating System: it will panic. It will be slow. It will corrupt your data.
    • Disk: it will corrupt data. It will be slow. It will fail.
  6. 22

  7. A single component failure should not be able to compromise the performance or availability of the entire system.
  8. [Chart: availability (axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), with curves labelled 99.95, 99.98, 99.99 and 99.999]
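The chart on this slide relates overall availability to the number of services. A rough way to reproduce that kind of curve, assuming every request depends on all services and that their failures are independent (both assumptions of this sketch, not statements from the talk), is to raise the per-service availability to the power of the service count. The function name `compound_availability` is hypothetical.

```ruby
# Overall availability when a request needs all `services` dependencies,
# each independently available with probability `per_service`.
def compound_availability(per_service, services)
  per_service**services
end

[0.9995, 0.9998, 0.9999, 0.99999].each do |per_service|
  [10, 50, 100, 500, 1000].each do |services|
    printf("%.3f%% per service x %4d services -> %.2f%% overall\n",
           per_service * 100, services,
           compound_availability(per_service, services) * 100)
  end
end
```

Under these assumptions, 100 services at 99.99% each already drop the whole system to roughly 99.0%, which is why a single component failure must not be allowed to take everything down.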
  9. Resiliency Matrix

                          Checkout                Admin         Storefront
    MySQL Shard           Unavailable             Unavailable   Degraded
    MySQL Master          Available (if cached)   Unavailable   Available
    Kafka                 Available               Degraded      Available
    External HTTP API     Degraded                Available     Unavailable
    redis-sessions        Unavailable             Unavailable   Degraded
    logging (disk full)   Unavailable             Unavailable   Unavailable
  10. Simulate TCP conditions with Toxiproxy

    https://github.com/shopify/toxiproxy

        Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
          Shop.first # this takes at least 1s
        end

        Toxiproxy[/redis/].down do
          session[:user_id] # this will throw an exception
        end
  11. Write a Toxiproxy test for each cell

        # test/integration/resiliency_matrix_test.rb

        def test_section_a_mq_a_down
          Toxiproxy[:message_queue_a].down do
            get '/section_a'
            assert_response :success
          end
        end

        def test_section_b_datastore_b
          Toxiproxy[:datastore_b].down do
            get '/section_b'
            assert_response 500
          end
        end

        # ... and every other cell
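To keep track of "every other cell", one option is to hold the resiliency matrix as data and derive a checklist of tests from it. The sketch below is only an illustration: the constant `RESILIENCY_MATRIX`, the helpers `slug` and `matrix_cells`, and the generated test names are all hypothetical. The cell values are copied from the matrix slide.

```ruby
# The resiliency matrix as data: dependency => { section => expected state }.
RESILIENCY_MATRIX = {
  "MySQL Shard"         => { checkout: :unavailable,         admin: :unavailable, storefront: :degraded },
  "MySQL Master"        => { checkout: :available_if_cached, admin: :unavailable, storefront: :available },
  "Kafka"               => { checkout: :available,           admin: :degraded,    storefront: :available },
  "External HTTP API"   => { checkout: :degraded,            admin: :available,   storefront: :unavailable },
  "redis-sessions"      => { checkout: :unavailable,         admin: :unavailable, storefront: :degraded },
  "logging (disk full)" => { checkout: :unavailable,         admin: :unavailable, storefront: :unavailable }
}.freeze

# Turn a dependency label into something usable in a method name.
def slug(label)
  label.downcase.gsub(/[^a-z0-9]+/, "_").gsub(/\A_|_\z/, "")
end

# One entry per cell, so no cell is forgotten when writing tests.
def matrix_cells(matrix)
  matrix.flat_map do |dependency, sections|
    sections.map do |section, expected|
      { test_name: "test_#{section}_#{slug(dependency)}_down", expected: expected }
    end
  end
end

matrix_cells(RESILIENCY_MATRIX).each do |cell|
  puts "#{cell[:test_name]} -> #{cell[:expected]}"
end
```

Comparing this list against the test methods that actually exist is one way to notice an uncovered cell.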
  12. With fallbacks the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.
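To illustrate why a refused connection is the easy case, the sketch below probes two local ports using only Ruby's standard library. One port is closed, so the connection is refused immediately. The other is listening but never answers, so the caller learns nothing until its own deadline expires. The helper name `probe` and the 0.2 second deadline are assumptions for this example, and the behaviour described is what a typical loopback interface is expected to show.

```ruby
require "socket"

# Connect, send a request, and wait at most `deadline_s` for any reply.
def probe(host, port, deadline_s)
  sock = Socket.tcp(host, port, connect_timeout: deadline_s)
  sock.write("PING\n")
  IO.select([sock], nil, nil, deadline_s) ? :responded : :timed_out
rescue Errno::ECONNREFUSED
  :refused # fails fast: the caller can fall back right away
ensure
  sock&.close
end

# A listener that never accepts or replies stands in for a slow dependency.
slow = TCPServer.new("127.0.0.1", 0)
slow_port = slow.addr[1]

# A port that was bound and then released stands in for a dead dependency.
closed = TCPServer.new("127.0.0.1", 0)
closed_port = closed.addr[1]
closed.close

puts probe("127.0.0.1", closed_port, 0.2)
puts probe("127.0.0.1", slow_port, 0.2)
```

Without the deadline passed to `IO.select`, the second probe would block indefinitely, which is how one slow dependency can tie up every worker that touches it.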
  13. 44

  14. Resiliency Maturity Pyramid

    • No resiliency effort
    • Testing with mocks
    • Toxiproxy tests and resiliency matrix
    • Resiliency Patterns
    • Production Practise Days (Games)
    • Kill nodes
    • Latency
    • Application-Specific Fallbacks
    • Kill DC
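The pyramid names "Resiliency Patterns" without listing them in the transcript. One widely used pattern of this kind is the circuit breaker, which stops calling a dependency after repeated failures and retries only after a cool-down. The sketch below is a minimal illustration, not the API of Semian or Hystrix. The class name, the parameter names and the injectable clock are all hypothetical.

```ruby
# Minimal circuit breaker: opens after `error_threshold` consecutive failures,
# fails fast while open, and allows a trial call once `reset_after` seconds pass.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(error_threshold:, reset_after:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @reset_after = reset_after
    @clock = clock # injectable so the behaviour can be tested without sleeping
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, failing fast" if open? && !retry_window_elapsed?

    result = yield
    @failures = 0 # a success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @error_threshold
    raise
  end

  def open?
    !@opened_at.nil?
  end

  private

  def retry_window_elapsed?
    @clock.call - @opened_at >= @reset_after
  end
end

now = 0.0
breaker = CircuitBreaker.new(error_threshold: 2, reset_after: 10, clock: -> { now })
2.times do
  begin
    breaker.call { raise IOError, "dependency down" }
  rescue IOError
    # each failure is counted by the breaker
  end
end
puts breaker.open? # true
```

While the circuit is open, callers get `OpenError` immediately instead of waiting on a dependency that is known to be failing, and can serve a fallback.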
  15. Balance reliability and failure tolerance. Your app is the sum.

    [Chart: Reliability against Failure Tolerance]
    Reliability: 98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%
    Failure Tolerance: None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile
    Stages: EARLY, MATURING, MATURE APPS, FOUNDATION
    Plotted examples: Docker, Shopify, Prototype, A low quality Erlang app, Rails, MySQL, Ruby, Linux, ECC memory, Your shitty bank’s COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness, Aerospace, Your app, Premature resiliency, Hackathon, Redis, DNS, Google, HFT, Datacenter
  16. Go Build This!

    Send me an email so we can work on it together. Seriously. I love this stuff. [email protected]

    • Simulate infrastructures and resiliency configuration
    • Create a Card Game where you compete to keep your app up
    • Draw a resiliency matrix for your application
    • Then write Toxiproxy tests for it!
    • Let the Chaos Monkeys loose for your side-project
    • Write a DSL to orchestrate complex failure scenarios
    • … your ideas? Tell me about them!

    Detailed at bit.ly/sirupsen-resiliency
  17. Final remarks

    • Where are you on the resiliency pyramid?
    • Everyone should have a resiliency matrix and basic tests.
    • Everything will fail. Embrace graceful loose coupling.
    • Make resiliency fun!
    • Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
  18. Resources

    • @Sirupsen on Twitter
    • github.com/Shopify/semian: Resiliency toolkit for Ruby
    • github.com/Shopify/toxiproxy: Testing resiliency (any language)
    • github.com/eapache/go-resiliency: Go library for resiliency
    • github.com/Netflix/Hystrix: JVM library for resiliency
    • Release It: Incredible book on resiliency
    • Shopify Blog post on Resiliency
    • Netflix Engineering Blog
    • Hystrix Wiki
    • Semian README
    • Talk on Resilient Routing and Discovery @ DockerCon 15
    • Talk on Building and Testing Resilient Applications @ GoRuCo 15
    • ParisRB June 15 slides from @byroot
  19. Credits

    • Atomic by Ema Dimitrova from the Noun Project
    • Rocket by Simon Mettler from the Noun Project
    • processor by Creative Stall from the Noun Project
    • Another Squat Discovered by Nahemoth (Flickr)
    • Starting Line by rowens27 (Flickr)
    • Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
    • Under Pressure by Feans (Flickr)