How is it prevented for memory?
• Nothing – silent data corruption.
• Parity – detect, but not correct, (some) errors.
• Error Correcting Codes (ECC) – correct and detect bit errors. Can usually detect more than it can correct.
• Application – can perform checksumming, be pessimistic and have redundancy.
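To make the parity bullet concrete, here is a minimal sketch of even parity over a single byte. The helper names (`parity_bit`, `parity_ok?`) are made up for illustration, and real memory parity is done in hardware, not in Ruby. The sketch shows why parity detects only some errors: a single flipped bit is caught, but two flipped bits cancel out.

```ruby
# Even parity over one byte: the parity bit is 1 when the byte has an odd
# number of 1 bits, so data plus parity always has an even number of 1s.
def parity_bit(byte)
  byte.to_s(2).count("1") % 2
end

# True when the stored parity bit still matches the data.
def parity_ok?(byte, stored_parity)
  parity_bit(byte) == stored_parity
end

original  = 0b1011_0010
stored    = parity_bit(original)

one_flip  = original ^ 0b0000_0100  # a single bit flipped
two_flips = original ^ 0b0001_0100  # two bits flipped

puts parity_ok?(original, stored)   # true
puts parity_ok?(one_flip, stored)   # false: detected, but we cannot tell which bit
puts parity_ok?(two_flips, stored)  # true: the double flip goes unnoticed
```

Correcting the error, as ECC does, needs more redundant bits than the single one used here.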
Assume the system will fail
• CPU – it will misbehave. It will overheat.
• DRAM – it will corrupt your data. It will fail. It will modify your program.
• Operating System – it will panic. It will be slow. It will corrupt your data.
• Disk – it will corrupt data. It will be slow. It will fail.
Resiliency Matrix

                      Checkout               Admin        Storefront
MySQL Shard           Unavailable            Unavailable  Degraded
MySQL Master          Available (if cached)  Unavailable  Available
Kafka                 Available              Degraded     Available
External HTTP API     Degraded               Available    Unavailable
redis-sessions        Unavailable            Unavailable  Degraded
logging (disk full)   Unavailable            Unavailable  Unavailable
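One way to keep the matrix honest is to store it as data next to the tests. The sketch below is only an illustration: the constant `RESILIENCY_MATRIX`, the status symbols and both helper methods are assumptions, not part of any library. The cell values are copied from the matrix above.

```ruby
# Expected state of each section (column) when a dependency (row) fails.
RESILIENCY_MATRIX = {
  "MySQL Shard"         => { checkout: :unavailable,         admin: :unavailable, storefront: :degraded },
  "MySQL Master"        => { checkout: :available_if_cached, admin: :unavailable, storefront: :available },
  "Kafka"               => { checkout: :available,           admin: :degraded,    storefront: :available },
  "External HTTP API"   => { checkout: :degraded,            admin: :available,   storefront: :unavailable },
  "redis-sessions"      => { checkout: :unavailable,         admin: :unavailable, storefront: :degraded },
  "logging (disk full)" => { checkout: :unavailable,         admin: :unavailable, storefront: :unavailable }
}.freeze

# Look up one cell; raises KeyError for an unknown dependency or section.
def expected_state(dependency, section)
  RESILIENCY_MATRIX.fetch(dependency).fetch(section)
end

# Dependencies whose failure takes the given section fully down.
def single_points_of_failure(section)
  RESILIENCY_MATRIX.select { |_dependency, row| row.fetch(section) == :unavailable }.keys
end

puts expected_state("Kafka", :admin)
puts single_points_of_failure(:storefront).inspect
```

Listing the single points of failure per section gives a rough to-do list of where a fallback would help most.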
Simulate TCP conditions with Toxiproxy
https://github.com/shopify/toxiproxy

    Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
      Shop.first # this takes at least 1s
    end

    Toxiproxy[/redis/].down do
      session[:user_id] # this will throw an exception
    end
Write a Toxiproxy test for each cell

    # test/integration/resiliency_matrix_test.rb
    def test_section_a_mq_a_down
      Toxiproxy[:message_queue_a].down do
        get '/section_a'
        assert_response :success
      end
    end

    def test_section_b_datastore_b
      Toxiproxy[:datastore_b].down do
        get '/section_b'
        assert_response 500
      end
    end

    # ... and every other cell
Resiliency Maturity Pyramid
• No resiliency effort
• Testing with mocks
• Toxiproxy tests and matrix
• Resiliency Patterns
• Production Practice Days (Games)
• Kill nodes
• Latency
• Application-Specific Fallbacks
• Kill DC
Balance reliability and failure tolerance. Your app is the sum.
[Figure: reliability (98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%) plotted against failure tolerance (None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile). Region labels: Hackathon, Early, Maturing, Mature apps, Foundation. Plotted labels: Docker, Shopify Prototype, "A low quality Erlang app", Rails, MySQL, Ruby, Linux, ECC memory, Redis, DNS, Google, HFT, Aerospace, "Your shitty bank's COBOL mainframe that they pray won't break", "Some runtime genetic programming craziness", "Probably you", "Premature resiliency".]
Final remarks
• Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks.
• Everything will fail. Embrace graceful loose coupling.
• Make resiliency fun!
• Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
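As a rough illustration of an application-specific fallback, the sketch below treats a visitor as logged out when the session store is unreachable, instead of failing the whole request. Everything in it is hypothetical: `FlakySessionStore` and `SessionStoreDown` stand in for a real Redis-backed session and its connection error, and `current_user_id` is an invented helper name.

```ruby
# Hypothetical error raised when the session backend cannot be reached.
class SessionStoreDown < StandardError; end

# Hypothetical stand-in for a session store that can be switched off,
# much like a dependency taken down in a Toxiproxy test.
class FlakySessionStore
  attr_writer :up

  def initialize(data)
    @data = data
    @up = true
  end

  def [](key)
    raise SessionStoreDown, "session store unreachable" unless @up
    @data[key]
  end
end

# Fallback: if sessions are unavailable, degrade to "logged out"
# rather than letting the exception turn into an error page.
def current_user_id(session)
  session[:user_id]
rescue SessionStoreDown
  nil
end

store = FlakySessionStore.new({ user_id: 42 })
puts current_user_id(store).inspect  # 42
store.up = false
puts current_user_id(store).inspect  # nil
```

Whether "logged out" is an acceptable degraded state depends on the section: it may be fine for a storefront page and wrong for a checkout.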
Resources
• @Sirupsen on Twitter
• github.com/Shopify/semian: Resiliency toolkit for Ruby
• github.com/Shopify/toxiproxy: Testing resiliency (any language)
• github.com/eapache/go-resiliency: Go library for resiliency
• github.com/Netflix/Hystrix: JVM library for resiliency
• Release It: Incredible book on resiliency
• Shopify Blog post on Resiliency
• Netflix Engineering Blog
• Hystrix Wiki
• Semian README
• Talk on Resilient Routing and Discovery @ DockerCon 15
• Talk on Building and Testing Resilient Applications @ GoRuCo 15
• ParisRB June 15 slides from @byroot
Credits
• Atomic by Ema Dimitrova from the Noun Project
• Rocket by Simon Mettler from the Noun Project
• processor by Creative Stall from the Noun Project
• Another Squat Discovered by Nahemoth (Flickr)
• Starting Line by rowens27 (Flickr)
• Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
• Under Pressure by Feans (Flickr)