silent data corruption.
• Parity: detects, but does not correct, (some) errors.
• Error Correcting Codes (ECC): detect and correct bit errors. Can usually detect more errors than they can correct.
• Application: can perform checksumming, be pessimistic, and keep redundant copies.
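The difference between parity and an application-level checksum can be sketched in a few lines of Ruby. This is only an illustration: `parity_bit` and `flip_bit` are hypothetical helpers written for this example, and CRC32 (from Ruby's standard `zlib` library) stands in for whatever checksum an application might actually use.

```ruby
require "zlib"

# Even parity over a byte string: 1 if the number of set bits is odd, else 0.
# (Hypothetical helper for this sketch.)
def parity_bit(bytes)
  bytes.each_byte.sum { |b| b.to_s(2).count("1") } % 2
end

# Return a copy of the string with a single bit flipped, simulating corruption.
# (Hypothetical helper for this sketch.)
def flip_bit(bytes, byte_index, bit_index)
  copy = bytes.dup.force_encoding(Encoding::BINARY)
  copy.setbyte(byte_index, copy.getbyte(byte_index) ^ (1 << bit_index))
  copy
end

data          = "resiliency".b
stored_parity = parity_bit(data)
stored_crc    = Zlib.crc32(data)

one_flip  = flip_bit(data, 0, 3)      # one corrupted bit
two_flips = flip_bit(one_flip, 1, 5)  # a second corrupted bit

puts parity_bit(one_flip)  != stored_parity  # true:  parity notices one flip
puts parity_bit(two_flips) != stored_parity  # false: two flips cancel out
puts Zlib.crc32(two_flips) != stored_crc     # true:  the checksum still notices
```

Neither check can say which bits changed, so both only detect. Correcting the error needs redundancy, which is what ECC memory provides in hardware.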
SMBIOS 2.8 present.

Handle 0x0059, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 512 GB
    Error Information Handle: Not Provided
    Number Of Devices: 8
will misbehave. It will overheat.
• DRAM: it will corrupt your data. It will fail. It will modify your program.
• Operating System: it will panic. It will be slow. It will corrupt your data.
• Disk: it will corrupt data. It will be slow. It will fail.
Resiliency Matrix

Master               | Available (if cached) | Unavailable | Available
Kafka                | Available             | Degraded    | Available
External HTTP API    | Degraded              | Available   | Unavailable
redis-sessions       | Unavailable           | Unavailable | Degraded
logging (disk full)  | Unavailable           | Unavailable | Unavailable
def test_section_a_mq_a_down
  Toxiproxy[:message_queue_a].down do
    get '/section_a'
    assert_response :success
  end
end

def test_section_b_datastore_b
  Toxiproxy[:datastore_b].down do
    get '/section_b'
    assert_response 500
  end
end

# ... and every other cell
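Writing one test per cell by hand gets repetitive, so one option is to generate the tests from the matrix itself. The following is a minimal sketch, not the approach the talk prescribes: `MATRIX` and `matrix_cells` are hypothetical names, and the hash holds only the two cells shown in the tests above. The commented usage assumes a Rails integration test with the toxiproxy gem and its proxies already configured.

```ruby
# Hypothetical matrix: section => { dependency => expected response when
# that dependency is down }. Only the two cells from the tests above are
# listed; the remaining cells would be added the same way.
MATRIX = {
  section_a: { message_queue_a: :success },
  section_b: { datastore_b: 500 },
}.freeze

# Flatten the matrix into one [test_name, section, dependency, expected]
# entry per cell. (Hypothetical helper for this sketch.)
def matrix_cells(matrix)
  matrix.flat_map do |section, dependencies|
    dependencies.map do |dependency, expected|
      ["test_#{section}_#{dependency}_down", section, dependency, expected]
    end
  end
end

matrix_cells(MATRIX).each { |name, *| puts name }

# Inside a Rails integration test class, each cell could become a test:
#
#   matrix_cells(MATRIX).each do |name, section, dependency, expected|
#     define_method(name) do
#       Toxiproxy[dependency].down do
#         get "/#{section}"
#         assert_response expected
#       end
#     end
#   end
```

Keeping the matrix as data means the table on the slide and the test suite cannot silently drift apart: a new cell is a new test.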
[Figure: reliability vs. failure tolerance graph. Caption: Balance reliability and failure tolerance. Your app is the sum.]
Failure tolerance levels: None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile.
Reliability scale: up to 100%.
Region labels: EARLY, MATURING, MATURE APPS, FOUNDATION, Premature resiliency, Probably you.
Plotted examples: Hackathon, Prototype, A low quality Erlang app, Docker, Shopify, Rails, MySQL, Redis, Ruby, Linux, DNS, ECC memory, Google, HFT, Aerospace, Your shitty bank's COBOL mainframe that they pray won't break, Some runtime genetic programming craziness.
and implement application-specific fallbacks
• Everything will fail. Embrace graceful loose coupling.
• Make resiliency fun!
• Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?
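An application-specific fallback can be as small as a rescue around the dependency call. The sketch below is only an illustration: `DependencyUnavailable`, `with_fallback` and `DownSessionStore` are hypothetical names, and a real client would raise its own error classes.

```ruby
# Hypothetical error raised by a client when its backing service is unreachable.
class DependencyUnavailable < StandardError; end

# Run the block; if the dependency is unavailable, return the fallback value
# instead of letting the failure propagate to the user.
def with_fallback(fallback)
  yield
rescue DependencyUnavailable
  fallback
end

# Hypothetical session store that is currently down.
class DownSessionStore
  def fetch(_id)
    raise DependencyUnavailable, "redis-sessions unreachable"
  end
end

store   = DownSessionStore.new
session = with_fallback({}) { store.fetch("abc123") }

# The request continues with an empty session (the visitor appears signed
# out) rather than failing with a 500.
puts session.inspect
```

Only the specific unavailability error is rescued, so programming errors still surface. The fallback value is a per-feature decision, which is what makes it application-specific.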
Testing resiliency (any language)
• github.com/eapache/go-resiliency: Go library for resiliency
• github.com/Netflix/Hystrix: JVM library for resiliency
• Release It: Incredible book on resiliency
• Shopify Blog post on Resiliency
• Netflix Engineering Blog
• Hystrix Wiki
• Semian README
• Talk on Resilient Routing and Discovery @ DockerCon 15
• Talk on Building and Testing Resilient Applications @ GoRuCo 15
• ParisRB June 15 slides from @byroot
Credits
• Rocket by Simon Mettler from the Noun Project
• processor by Creative Stall from the Noun Project
• Another Squat Discovered by Nahemoth (Flickr)
• Starting Line by rowens27 (Flickr)
• Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
• Under Pressure by Feans (Flickr)