Slide 1

Slide 1 text

Super-Reliable Software BY @SIRUPSEN

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

• 175,000+ SHOPS
• $10 BILLION+
• 200+ DEVS
• 500+ SERVERS
• 2.5 DATACENTRES
• RUBY ON RAILS
• 17K PEAK RPS
• 10,000 CHECKOUTS/MIN PEAK
• 20+ DAILY DEPLOYS
• 300M+ UNIQUE VISITS/MONTH

Slide 4

Slide 4 text

SIMON ESKILDSEN, INFRASTRUCTURE ENGINEER

Slide 5

Slide 5 text

WHY DO WE CARE?

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Mars Climate Orbiter

Slide 8

Slide 8 text

Radioactive Intel Chips

Slide 9

Slide 9 text

Build reliable systems from unreliable components

Slide 10

Slide 10 text

Failure Layers: System, Communication, Application, Process

Slide 11

Slide 11 text

System (CPU, RAM, Disk, OS), Communication, Application, Process

Slide 12

Slide 12 text

BIT FLIPS IN MEMORY
0110 1100 0101 1101 0110 1110 0100 1110
0110 1100 0101 1100 0110 1110 0100 1110

Slide 13

Slide 13 text

CAUSES OF BIT FLIPS
• Heat
• Cosmic rays
• Electrical problems
• Hardware defects
• Utilization
• Radioactivity

Slide 14

Slide 14 text

>= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet

Slide 15

Slide 15 text

... and even more in space

Slide 16

Slide 16 text

BITSQUATTING

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

irb(main):024:0> bits = "fbcdn.net".unpack("B*")
=> ["011001100110001001100011011001000110111000101110011011100110010101110100"]
irb(main):025:0> bits[0][11] = "1"
=> "1"
irb(main):026:0> bits.pack("B*")
=> "frcdn.net"
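The irb session above flips a single bit by hand. As a rough sketch of how one might enumerate every single-bit neighbour of a domain, the helper below (the name `bitsquat_candidates` and the hostname filter are illustrative assumptions, not from the talk) flips each of the low seven bits of each byte and keeps the results that still look like a lowercase hostname.

```ruby
# Enumerate hostnames that differ from `domain` by exactly one flipped bit.
# Hypothetical helper for illustration only.
def bitsquat_candidates(domain)
  bytes = domain.bytes
  candidates = []
  bytes.each_index do |index|
    # Bit 7 is skipped: setting it always leaves the ASCII range.
    7.times do |bit|
      flipped = bytes.dup
      flipped[index] ^= (1 << bit)
      candidate = flipped.pack("C*").force_encoding(Encoding::UTF_8)
      # Keep only strings that still look like a lowercase hostname
      # (an illustrative filter, not a full DNS validity check).
      candidates << candidate if candidate.match?(/\A[a-z0-9-]+(\.[a-z0-9-]+)+\z/)
    end
  end
  candidates.uniq
end

candidates = bitsquat_candidates("fbcdn.net")
puts candidates.include?("frcdn.net")
```

The filter is deliberately strict (lowercase letters, digits, hyphens and at least one dot), so uppercase variants that DNS would treat as the same name are dropped.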

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

It happens in the wild. 30 domains:

Slide 21

Slide 21 text

How is it prevented for memory?
• Nothing: silent data corruption.
• Parity: detect, but not correct, (some) errors.
• Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more than it can correct.
• Application: can perform checksumming, be pessimistic and have redundancy.
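The application-level option (checksumming) can be sketched roughly as follows. This is a minimal illustration, assuming CRC32 from Ruby's standard `zlib` library is an acceptable checksum; the names `seal`, `unseal` and `ChecksumMismatch` are hypothetical.

```ruby
require "zlib"

class ChecksumMismatch < StandardError; end

# Pair a payload with its CRC32 before storing or sending it.
def seal(payload)
  { checksum: Zlib.crc32(payload), payload: payload }
end

# Recompute the CRC32 on read and refuse data that no longer matches.
def unseal(record)
  actual = Zlib.crc32(record[:payload])
  unless actual == record[:checksum]
    raise ChecksumMismatch, "expected #{record[:checksum]}, got #{actual}"
  end
  record[:payload]
end

record = seal("total=1999")
# "9" (0x39) and "8" (0x38) differ by a single bit, like a memory bit flip.
corrupted = record.merge(payload: "total=1998")

begin
  unseal(corrupted)
rescue ChecksumMismatch => e
  puts "corruption detected: #{e.message}"
end
```

A checksum like this only detects corruption; recovering the data still needs redundancy, such as a second copy to fall back on.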

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

$ sudo dmidecode --type 16
# dmidecode 2.12
SMBIOS 2.8 present.

Handle 0x0059, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Multi-bit ECC
    Maximum Capacity: 512 GB
    Error Information Handle: Not Provided
    Number Of Devices: 8

Slide 24

Slide 24 text

WOULD YOU BUY DISCARDED DRAM?

Slide 25

Slide 25 text

Assume the system will fail
• CPU: it will misbehave. It will overheat.
• DRAM: it will corrupt your data. It will fail. It will modify your program.
• Operating System: it will panic. It will be slow. It will corrupt your data.
• Disk: it will corrupt data. It will be slow. It will fail.

Slide 26

Slide 26 text

System, Communication, Application, Process

Slide 27

Slide 27 text

GRACEFUL LOOSE COUPLING

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

It is usually not the first error that causes failure, but the errors that follow

Slide 30

Slide 30 text

Redundant backups with degraded functionality

Slide 31

Slide 31 text

A single component failure should not be able to compromise the performance or availability of the entire system

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

search sessions carts mysql cdn

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Customer signed out

Slide 36

Slide 36 text

Another example

class Product
  def tags
    redis.smembers("product:#{id}:tags")
  end
end

Slide 37

Slide 37 text

Slightly better

class Product
  def tags
    redis.smembers("product:#{id}:tags")
  rescue Redis::BaseError => e
    # Report the failure, then degrade to "no tags" instead of failing the page.
    ErrorReporter.log(e)
    []
  end
end

Slide 38

Slide 38 text

System, Communication, Application, Process

Slide 39

Slide 39 text

(Micro)service equation
Uptime = A^N
• Uptime: total availability
• A: availability per service
• N: number of services
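The service equation (availability per service raised to the power of the number of services) is easy to check numerically. The sketch below uses a hypothetical helper, `total_availability`, and assumes every service is required and fails independently. With 100 services at 99.99% each, the total comes out at about 99%.

```ruby
# Total availability when every one of `services` independent dependencies
# must be up, each with the same `per_service` availability (0.0 to 1.0).
def total_availability(per_service, services)
  per_service**services
end

[10, 50, 100, 500, 1000].each do |n|
  percent = total_availability(0.9999, n) * 100
  puts format("%4d services at 99.99%% each: %.2f%% total", n, percent)
end
```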

Slide 40

Slide 40 text

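[Chart: total availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000), with one curve per per-service availability: 99.95, 99.98, 99.99 and 99.999]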

Slide 41

Slide 41 text

Even monolithic services easily have tens of dependencies: RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …

Slide 42

Slide 42 text

Resiliency Matrix

                     Checkout                Admin        Storefront
MySQL Shard          Unavailable             Unavailable  Degraded
MySQL Master         Available (if cached)   Unavailable  Available
Kafka                Available               Degraded     Available
External HTTP API    Degraded                Available    Unavailable
redis-sessions       Unavailable             Unavailable  Degraded
logging (disk full)  Unavailable             Unavailable  Unavailable

Slide 43

Slide 43 text

Simulate TCP conditions with Toxiproxy
https://github.com/shopify/toxiproxy

Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
  Shop.first # this takes at least 1s
end

Toxiproxy[/redis/].down do
  session[:user_id] # this will throw an exception
end

Slide 44

Slide 44 text

Write a Toxiproxy test for each cell

# test/integration/resiliency_matrix_test.rb
def test_section_a_mq_a_down
  Toxiproxy[:message_queue_a].down do
    get '/section_a'
    assert_response :success
  end
end

def test_section_b_datastore_b
  Toxiproxy[:datastore_b].down do
    get '/section_b'
    assert_response 500
  end
end

# ... and every other cell

Slide 45

Slide 45 text

Even with fallbacks, the system is still vulnerable to slowness. ECONNREFUSED is a luxury in distributed systems; slowness is the killer.

Slide 46

Slide 46 text

TIMEOUTS

Gem        Default  Reasonable
Unicorn    60s      ~5s
Net::HTTP  60s      ~2s
mysql2     N/A      ~5s
redis-rb   N/A      ~0.5s
AWS::S3    60s      ~2s
memcached  0.5s     ~0.5s
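As one concrete case from the table, Net::HTTP exposes its timeouts as plain accessors. The sketch below assumes the ~2s figure is appropriate for the caller; the helper name `http_client` and the host are placeholders. Nothing connects until a request is actually made.

```ruby
require "net/http"

# Build a Net::HTTP client whose connect and read timeouts are both
# lowered from the defaults to `timeout` seconds.
def http_client(host, port = 80, timeout: 2)
  http = Net::HTTP.new(host, port)
  http.open_timeout = timeout # seconds allowed for opening the connection
  http.read_timeout = timeout # seconds allowed for each read from the socket
  http
end

client = http_client("api.example.com") # placeholder host
puts client.read_timeout
```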

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Timeouts will waste resources. Response time might go up by at least that timeout.
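A back-of-the-envelope calculation shows why waiting on timeouts is expensive. All numbers here are hypothetical: 16 workers, a normal response time of 100ms, and a dead dependency that adds a 2s timeout to every request. Capacity drops from roughly 160 requests per second to under 8.

```ruby
# Upper bound on throughput when each worker handles one request at a time.
def max_requests_per_second(workers, seconds_per_request)
  workers / seconds_per_request.to_f
end

healthy  = max_requests_per_second(16, 0.1)       # dependency responds normally
degraded = max_requests_per_second(16, 0.1 + 2.0) # every request waits out a 2s timeout

puts format("healthy: %.1f rps, degraded: %.1f rps", healthy, degraded)
```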

Slide 49

Slide 49 text

Circuit Breakers: after e timeouts, raise immediately for t seconds
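The rule on this slide can be sketched in a few lines of Ruby. This is a simplified, single-threaded illustration and not Semian's implementation; the class and parameter names (`CircuitBreaker`, `error_threshold`, `open_for`, `clock`) are assumptions made for the example.

```ruby
require "timeout"

class CircuitBreaker
  class OpenError < StandardError; end

  # error_threshold: consecutive timeouts ("e") before the circuit opens.
  # open_for: seconds ("t") during which calls fail immediately.
  # clock: injectable time source, so tests do not need to sleep.
  def initialize(error_threshold:, open_for:, clock: nil)
    @error_threshold = error_threshold
    @open_for = open_for
    @clock = clock || -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
    @timeouts = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, failing fast" if open?

    begin
      result = yield
    rescue Timeout::Error
      @timeouts += 1
      @opened_at = @clock.call if @timeouts >= @error_threshold
      raise
    end

    @timeouts = 0 # a success resets the consecutive-timeout count
    result
  end

  private

  def open?
    return false if @opened_at.nil?
    return true if @clock.call - @opened_at < @open_for

    # Cool-down elapsed: close the circuit and let calls through again.
    @opened_at = nil
    @timeouts = 0
    false
  end
end
```

A caller would wrap each call to the slow resource in `breaker.call { ... }` and treat `CircuitBreaker::OpenError` like any other failure, typically by serving a fallback.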

Slide 50

Slide 50 text

Bulkheads: only allow t processes to access a resource simultaneously
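The slide describes bulkheads across processes. As a rough in-process analogy, the sketch below limits concurrent threads instead, using a counter guarded by a Mutex. The names `Bulkhead`, `tickets` and `FullError` are assumptions for illustration; this is not how Semian coordinates between processes.

```ruby
class Bulkhead
  class FullError < StandardError; end

  # tickets: how many callers ("t") may use the resource at the same time.
  def initialize(tickets:)
    @tickets = tickets
    @in_use = 0
    @lock = Mutex.new
  end

  def acquire
    @lock.synchronize do
      raise FullError, "all #{@tickets} tickets are in use" if @in_use >= @tickets
      @in_use += 1
    end

    begin
      yield
    ensure
      # Always hand the ticket back, even if the block raised.
      @lock.synchronize { @in_use -= 1 }
    end
  end
end

redis_bulkhead = Bulkhead.new(tickets: 2)
redis_bulkhead.acquire { puts "talking to the resource" }
```

Rejecting immediately when the bulkhead is full keeps a slow resource from tying up every worker; queueing briefly for a ticket is another possible design choice.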

Slide 51

Slide 51 text

SHOPIFY/SEMIAN

Slide 52

Slide 52 text

System, Communication, Application, Process

Slide 53

Slide 53 text

Antifragile Process: Fragile, Robust, Antifragile

Slide 54

Slide 54 text

ROOT CAUSE ANALYSIS

Slide 55

Slide 55 text

GAME DAY

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Resiliency Maturity Pyramid
• No resiliency effort
• Testing with mocks
• Toxiproxy tests and matrix
• Resiliency Patterns
• Production Practise Days (Games)
• Kill nodes
• Latency
• Application-Specific Fallbacks
• Kill DC

Slide 58

Slide 58 text

Balance reliability and failure tolerance. Your app is the sum.

[Chart: Failure Tolerance plotted against Reliability]
• Reliability: 98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%
• Failure Tolerance: None, Basic Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile
• Other labels: FOUNDATION, EARLY, MATURING, MATURE APPS, Premature resiliency
• Plotted examples: Docker, Shopify, Prototype, Hackathon, A low quality Erlang app, Rails, MySQL, Redis, Ruby, Linux, ECC memory, DNS, Google, HFT, Aerospace, Probably you, Your shitty bank's COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness

Slide 59

Slide 59 text

Final remarks
• Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks
• Everything will fail. Embrace graceful loose coupling.
• Make resiliency fun!
• Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?

Slide 60

Slide 60 text

Resources
• @Sirupsen on Twitter
• github.com/Shopify/semian: Resiliency toolkit for Ruby
• github.com/Shopify/toxiproxy: Testing resiliency (any language)
• github.com/eapache/go-resiliency: Go library for resiliency
• github.com/Netflix/Hystrix: JVM library for resiliency
• Release It: Incredible book on resiliency
• Shopify Blog post on Resiliency
• Netflix Engineering Blog
• Hystrix Wiki
• Semian README
• Talk on Resilient Routing and Discovery @ DockerCon 15
• Talk on Building and Testing Resilient Applications @ GoRuCo 15
• ParisRB June 15 slides from @byroot

Slide 61

Slide 61 text

Thank You! FOLLOW @SIRUPSEN

Slide 62

Slide 62 text

Credits
• Atomic by Ema Dimitrova from the Noun Project
• Rocket by Simon Mettler from the Noun Project
• processor by Creative Stall from the Noun Project
• Another Squat Discovered by Nahemoth (Flickr)
• Starting Line by rowens27 (Flickr)
• Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
• Under Pressure by Feans (Flickr)