EuRuKo 2015: Super-Reliable Software

Simon Hørup Eskildsen

October 18, 2015

Transcript

  1. Super-Reliable Software
    BY @SIRUPSEN

  3. 175,000+ SHOPS
    $10 BILLION+
    200+ DEVS
    500+ SERVERS
    2.5 DATACENTRES
    RUBY ON RAILS
    17K PEAK RPS
    10,000 CHECKOUTS/M PEAK
    20+ DAILY DEPLOYS
    300M+ UNIQUE VISITS/MONTH

  4. SIMON ESKILDSEN
    INFRASTRUCTURE ENGINEER

  5. WHY DO WE CARE?

  7. Mars Climate Orbiter

  8. Radioactive Intel Chips

  9. Build reliable systems from unreliable components

  10. Failure Layers
    System Communication
    Application
    Process

  11. System Communication
    Application
    Process
    CPU, RAM, Disk, OS

  12. BIT FLIPS IN MEMORY
    0110 1100 0101 1101 0110 1110 0100 1110
    0110 1100 0101 1100 0110 1110 0100 1110
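
A small sketch, not from the talk, of how the one flipped bit between the two words on this slide could be located: XOR the two values and inspect the bits that remain set. The variable names are illustrative.

```ruby
# Bit patterns copied from the slide (spaces removed for parsing).
before = "0110 1100 0101 1101 0110 1110 0100 1110".delete(" ")
after  = "0110 1100 0101 1100 0110 1110 0100 1110".delete(" ")

# XOR leaves a 1 only where the two words differ.
diff = before.to_i(2) ^ after.to_i(2)
flipped_count = diff.to_s(2).count("1")

# Index of each differing bit, counted from the left (0-based).
flipped_positions = (0...before.length).select { |i| before[i] != after[i] }

puts "bits flipped: #{flipped_count}"              # bits flipped: 1
puts "positions:    #{flipped_positions.inspect}"  # positions:    [15]
```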

  13. CAUSES OF BIT FLIPS
    Heat
    Cosmic rays
    Electrical problems
    Hardware defects
    Utilization
    Radioactivity

  14. >= 1.6 bit errors per 8 gigabytes of RAM per hour in Google’s fleet

  15. ... and even more in space

  16. BITSQUATTING

  18. irb(main):024:0> bits = "fbcdn.net".unpack("B*")
    => ["011001100110001001100011011001000110111000101110011011100110010101110100"]
    irb(main):025:0> bits[0][11] = "1"
    => "1"
    irb(main):026:0> bits.pack("B*")
    => "frcdn.net"
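
The irb session flips one chosen bit by hand. As a rough sketch (not from the talk), the same idea can be generalised to list every name that is one bit flip away from a domain and still consists of valid hostname characters. `bitsquat_variants` and `LABEL_CHAR` are hypothetical names, and the filter is deliberately simple: it ignores rules such as labels not starting with a hyphen.

```ruby
# Characters allowed inside a hostname label. Lowercase only: DNS is
# case-insensitive, so a flip to upper case resolves to the original name.
LABEL_CHAR = /\A[a-z0-9-]\z/

# Returns every name that differs from `domain` by exactly one flipped bit
# and is still made of valid hostname characters. Dots are left untouched.
def bitsquat_variants(domain)
  variants = []
  domain.each_byte.with_index do |byte, index|
    next if byte == ".".ord

    8.times do |bit|
      flipped = byte ^ (1 << bit)
      next if flipped > 127 || !flipped.chr.match?(LABEL_CHAR)

      variant = domain.dup
      variant.setbyte(index, flipped)
      variants << variant
    end
  end
  variants
end

variants = bitsquat_variants("fbcdn.net")
puts variants.include?("frcdn.net") # true, the flip shown in the irb session
puts variants.length
```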

  20. It happens in the wild. 30 domains:

  21. How is it prevented for memory?
    • Nothing: silent data corruption.
    • Parity: detects, but does not correct, (some) errors.
    • Error Correcting Codes (ECC): correct and detect bit errors. Can usually detect more than they can correct.
    • Application: can perform checksumming, be pessimistic and have redundancy.
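
To make the parity and application-level bullets concrete, here is a minimal sketch (not from the talk). It shows that a single parity bit notices one flipped bit but misses two, and that an application can detect corruption on its own with a checksum. The helper names and sample values are assumptions; `Zlib.crc32` is from Ruby's standard library.

```ruby
require "zlib"

# Even parity: the stored parity bit is 1 when the byte has an odd number of 1s.
def parity_bit(byte)
  byte.to_s(2).count("1") % 2
end

def parity_mismatch?(byte, stored_parity)
  parity_bit(byte) != stored_parity
end

original = 0b0101_1101
parity   = parity_bit(original)

one_flip  = original ^ 0b0000_0001 # single-bit error
two_flips = original ^ 0b0000_0011 # double-bit error

puts parity_mismatch?(original, parity)  # false
puts parity_mismatch?(one_flip, parity)  # true: detected, but not correctable
puts parity_mismatch?(two_flips, parity) # false: an even number of flips goes unnoticed

# Application-level checksumming: store a CRC next to the payload and verify on read.
payload  = "product:42:tags"
checksum = Zlib.crc32(payload)

damaged = payload.dup
damaged.setbyte(0, damaged.getbyte(0) ^ 1) # flip the lowest bit of the first byte

puts Zlib.crc32(damaged) == checksum # false: the corruption is detected
```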

  23. $ sudo dmidecode --type 16
    # dmidecode 2.12
    SMBIOS 2.8 present.
    Handle 0x0059, DMI type 16, 23 bytes
    Physical Memory Array
      Location: System Board Or Motherboard
      Use: System Memory
      Error Correction Type: Multi-bit ECC
      Maximum Capacity: 512 GB
      Error Information Handle: Not Provided
      Number Of Devices: 8

  24. WOULD YOU BUY DISCARDED DRAM?

  25. Assume the system will fail
    • CPU: it will misbehave. It will overheat.
    • DRAM: it will corrupt your data. It will fail. It will modify your program.
    • Operating System: it will panic. It will be slow. It will corrupt your data.
    • Disk: it will corrupt data. It will be slow. It will fail.

  26. System Communication
    Application
    Process

  27. GRACEFUL LOOSE COUPLING

  29. It is usually not the first error that causes failure, but the errors that follow

  30. Redundant backups with degraded functionality

  31. A single component failure should not be able to compromise the performance or availability of the entire system

  33. search
    sessions
    carts
    mysql
    cdn

  35. Customer signed out

  36. Another example
    class Product
      def tags
        redis.smembers("product:#{id}:tags")
      end
    end

  37. Slightly better
    class Product
      def tags
        redis.smembers("product:#{id}:tags")
      rescue Redis::BaseError => e
        ErrorReporter.log(e)
        []
      end
    end
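
The fallback on this slide can be exercised without a Redis server. The sketch below is an illustration, not the author's code: `FakeRedis`, its `DownClient`, and this in-memory `ErrorReporter` are hypothetical stand-ins so the example runs on its own, and `Product` takes its client as a constructor argument for the same reason.

```ruby
# Hypothetical stand-ins so the sketch runs without the redis gem.
module FakeRedis
  class BaseError < StandardError; end

  # A client whose server is unreachable: every call raises.
  class DownClient
    def smembers(_key)
      raise BaseError, "connection refused"
    end
  end
end

# Collects errors in memory instead of sending them to a reporting service.
module ErrorReporter
  def self.errors
    @errors ||= []
  end

  def self.log(error)
    errors << error
  end
end

class Product
  attr_reader :id

  def initialize(id, redis)
    @id = id
    @redis = redis
  end

  # Tags are not essential to the page, so an empty list is an acceptable fallback.
  def tags
    @redis.smembers("product:#{id}:tags")
  rescue FakeRedis::BaseError => e
    ErrorReporter.log(e)
    []
  end
end

product = Product.new(42, FakeRedis::DownClient.new)
puts product.tags.inspect        # []
puts ErrorReporter.errors.length # 1
```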

  38. System Communication
    Application
    Process

  39. (Micro)service equation
    Uptime = A^N
    Uptime: total availability
    A: availability per service
    N: number of services

  40. [Chart: total availability (%, axis from 70 to 100) against number of services (10, 50, 100, 500, 1000); curves labelled 99.95, 99.98, 99.99 and 99.999]
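
The numbers behind the (micro)service equation can be reproduced with a few lines. This is a sketch under two assumptions that are not stated on the slides: failures are independent, and every request needs all N services. `total_availability` is an illustrative name.

```ruby
# Total availability when every request depends on all N services:
# total = A ** N, where A is the availability of a single service.
def total_availability(per_service, services)
  per_service**services
end

per_service_levels = [0.9995, 0.9998, 0.9999, 0.99999]
service_counts     = [10, 50, 100, 500, 1000]

puts "per service  " + service_counts.map { |n| format("%8d", n) }.join
per_service_levels.each do |a|
  row = service_counts.map { |n| format("%7.2f%%", total_availability(a, n) * 100) }
  puts format("%10.3f%%  ", a * 100) + row.join
end
```

Even at 99.99% per service, 100 services multiply out to roughly 99% overall, which is the point the chart makes.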

  41. Even monolithic services easily have tens of dependencies:
    RDBMS, Redis, Memcached, ElasticSearch, S3, CDN, APIs, Mailing, CRM, …

  42. Resiliency Matrix
                           Checkout               Admin        Storefront
    MySQL Shard            Unavailable            Unavailable  Degraded
    MySQL Master           Available (if cached)  Unavailable  Available
    Kafka                  Available              Degraded     Available
    External HTTP API      Degraded               Available    Unavailable
    redis-sessions         Unavailable            Unavailable  Degraded
    logging (disk full)    Unavailable            Unavailable  Unavailable

  43. Simulate TCP conditions with Toxiproxy
    https://github.com/shopify/toxiproxy
    Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
      Shop.first # this takes at least 1s
    end
    Toxiproxy[/redis/].down do
      session[:user_id] # this will throw an exception
    end

  44. Write a Toxiproxy test for each cell
    # test/integration/resiliency_matrix_test.rb
    def test_section_a_mq_a_down
      Toxiproxy[:message_queue_a].down do
        get '/section_a'
        assert_response :success
      end
    end
    def test_section_b_datastore_b
      Toxiproxy[:datastore_b].down do
        get '/section_b'
        assert_response 500
      end
    end
    # ... and every other cell

  45. With fallbacks the system is still vulnerable to slowness.
    ECONNREFUSED is a luxury in distributed systems; slowness is the killer.

  46. TIMEOUTS
    Gem         Default   Reasonable
    Unicorn     60s       ~5s
    Net::HTTP   60s       ~2s
    mysql2      N/A       ~5s
    redis-rb    N/A       ~0.5s
    AWS::S3     60s       ~2s
    memcached   0.5s      ~0.5s
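
As an illustration of the Net::HTTP row, the sketch below sets explicit timeouts with the standard library's `open_timeout` and `read_timeout` accessors. The helper name `http_client`, the host, and the 2 second value are assumptions taken from the "reasonable" column, not code from the talk. No connection is opened until a request is made.

```ruby
require "net/http"

# Builds a client with explicit timeouts instead of relying on the defaults.
def http_client(host, port: 80, timeout: 2)
  http = Net::HTTP.new(host, port)
  http.open_timeout = timeout # seconds to wait for the TCP connection
  http.read_timeout = timeout # seconds to wait for each read from the socket
  http
end

client = http_client("example.com")
puts client.open_timeout # 2
puts client.read_timeout # 2
```

Note that `read_timeout` applies to each individual read, so a response that trickles in slowly can still take longer than the configured value in total.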

  48. Timeouts still waste resources.
    Response time can go up by at least the timeout.

  49. Circuit Breakers
    After e timeouts, raise immediately for t seconds
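
The rule on this slide can be sketched in plain Ruby. This is a minimal illustration, not Semian's implementation or API: `CircuitBreaker`, `OpenCircuitError`, and the injectable `clock` are hypothetical names, and the sketch is not thread-safe.

```ruby
require "timeout"

# After `error_threshold` consecutive timeouts the circuit opens and every
# call raises immediately until `open_seconds` have passed.
class CircuitBreaker
  class OpenCircuitError < StandardError; end

  # clock: callable returning the current time in seconds (injectable for tests).
  def initialize(error_threshold:, open_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @open_seconds = open_seconds
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenCircuitError, "circuit open, failing fast" if open?

    begin
      result = yield
      @failures = 0 # a success closes the circuit again
      @opened_at = nil
      result
    rescue Timeout::Error
      @failures += 1
      @opened_at = @clock.call if @failures >= @error_threshold
      raise
    end
  end

  private

  def open?
    return false unless @opened_at

    @clock.call - @opened_at < @open_seconds
  end
end

now = 0.0
breaker = CircuitBreaker.new(error_threshold: 3, open_seconds: 10, clock: -> { now })

attempts = 0
5.times do
  begin
    breaker.call do
      attempts += 1
      raise Timeout::Error, "dependency too slow"
    end
  rescue Timeout::Error, CircuitBreaker::OpenCircuitError
    # the caller would serve a fallback here
  end
end
puts attempts # 3: the last two calls failed fast without touching the dependency

now = 11.0 # the open period has passed, so the next call is attempted again
puts breaker.call { :ok } # ok
```

Failing fast is what keeps workers from being tied up waiting on a dependency that is already known to be slow.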

  50. Bulkheads
    Only allow t processes to access a resource simultaneously
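
A bulkhead can be sketched as a pool of tickets. The slide talks about processes; to stay self-contained, this illustration limits threads inside a single process instead, so it shows the idea rather than a production mechanism. `Bulkhead`, `FullError`, and the parameter names are hypothetical.

```ruby
# At most `tickets` callers may use the resource at once. Others wait up to
# `wait_timeout` seconds for a ticket and then give up.
class Bulkhead
  class FullError < StandardError; end

  def initialize(tickets:, wait_timeout:)
    @tickets = tickets
    @wait_timeout = wait_timeout
    @in_use = 0
    @mutex = Mutex.new
    @freed = ConditionVariable.new
  end

  def acquire
    take_ticket
    begin
      yield
    ensure
      release_ticket
    end
  end

  private

  def take_ticket
    deadline = now + @wait_timeout
    @mutex.synchronize do
      while @in_use >= @tickets
        remaining = deadline - now
        raise FullError, "no ticket available" if remaining <= 0

        @freed.wait(@mutex, remaining)
      end
      @in_use += 1
    end
  end

  def release_ticket
    @mutex.synchronize do
      @in_use -= 1
      @freed.signal
    end
  end

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

bulkhead = Bulkhead.new(tickets: 1, wait_timeout: 0.05)

holder = Thread.new { bulkhead.acquire { sleep 0.3 } } # occupies the only ticket
sleep 0.1 # give the thread time to take it

rejected = false
begin
  bulkhead.acquire { puts "never reached" }
rescue Bulkhead::FullError => e
  rejected = true
  puts "rejected: #{e.message}"
end
holder.join
```

Capping concurrent access means a slow resource can exhaust only its own tickets, not every worker in the system.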

  51. SHOPIFY/SEMIAN

  52. System Communication
    Application
    Process

  53. Antifragile Process
    Fragile, Robust, Antifragile

  54. ROOT CAUSE ANALYSIS

  55. GAME DAY

  57. Resiliency Maturity Pyramid
    No resiliency effort
    Testing with mocks
    Toxiproxy tests and matrix
    Resiliency Patterns
    Production Practise Days (Games)
    Kill nodes
    Latency
    Application-Specific Fallbacks
    Kill DC

  58. Balance reliability and failure tolerance. Your app is the sum.
    [Chart: Failure Tolerance against Reliability]
    Reliability axis: 98%, 99%, 99.9%, 99.98%, 99.99%, 99.999%, 99.99999%, 100%
    Failure Tolerance axis: None, Basic, Fallbacks, Covered resiliency matrix, Resiliency Primitives, Indifferent to failure, Antifragile
    Regions: EARLY, MATURING, MATURE APPS, FOUNDATION
    Plotted: Prototype, Hackathon, Probably you, Premature resiliency, Docker, Shopify, A low quality Erlang app, Rails, Ruby, MySQL, Redis, Linux, ECC memory, DNS, Google, HFT, Aerospace, Your shitty bank's COBOL mainframe that they pray won’t break, Some runtime genetic programming craziness

  59. Final remarks
    Draw your resiliency matrix, write Toxiproxy tests and implement application-specific fallbacks.
    Everything will fail. Embrace graceful loose coupling.
    Make resiliency fun!
    Be careful when introducing new dependencies. Where are they on the reliability/tolerance graph?

  60. Resources
    @Sirupsen on Twitter
    github.com/Shopify/semian: Resiliency toolkit for Ruby
    github.com/Shopify/toxiproxy: Testing resiliency (any language)
    github.com/eapache/go-resiliency: Go library for resiliency
    github.com/Netflix/Hystrix: JVM library for resiliency
    Release It: Incredible book on resiliency
    Shopify Blog post on Resiliency
    Netflix Engineering Blog
    Hystrix Wiki
    Semian README
    Talk on Resilient Routing and Discovery @ DockerCon 15
    Talk on Building and Testing Resilient Applications @ GoRuCo 15
    ParisRB June 15 slides from @byroot

  61. Thank You!
    FOLLOW @SIRUPSEN

  62. Credits
    • Atomic by Ema Dimitrova from the Noun Project
    • Rocket by Simon Mettler from the Noun Project
    • processor by Creative Stall from the Noun Project
    • Another Squat Discovered by Nahemoth (Flickr)
    • Starting Line by rowens27 (Flickr)
    • Domino Effect by Bro. Jeffrey Pioquinto, SJ (Flickr)
    • Under Pressure by Feans (Flickr)
