
Firefighting Riak at Scale (RICON East 2013)

Presented by Michajlo Matijkiw at RICON East 2013

Managing a business-critical Riak instance in an enterprise environment takes careful planning, coordination, and the willingness to accept that no matter how much you plan, Murphy's law will always win. At CIM we've been running Riak in production for nearly 3 years, and over those years we've seen our fair share of failures, both expected and unexpected. From disk meltdowns to solar flares, we've managed to recover and maintain 100% uptime with no customer impact. I'll talk about some of these failures, how we dealt with them, and how we managed to keep our clients completely unaware.

About Michajlo

Michajlo Matijkiw is a Sr. Software Engineer at Comcast Interactive Media, where he focuses on building the types of tools and infrastructure that make developers' lives easier. Prior to that, he was an undergraduate student at the University of Pennsylvania, where he split his time between soldering irons and studying task-based parallelism. In his spare time he enjoys cooking, good beer, and climbing.

Basho Technologies

May 13, 2013

Transcript

  1. Ricon East, 2013
    Michajlo (Mishu) Matijkiw
    Sr. Software Engineer, Comcast Interactive Media
    FIREFIGHTING RIAK AT SCALE


  2. ADMINISTRIVIA
    These opinions are my own, and do not necessarily reflect those of my employer, and so forth…


  3. QUALIFICATIONS
    •  Riaking for nearly 3 years
    •  In charge of CIM’s Riak for roughly 2 years
    •  Writing distributed map reduce functions in Erlang to query Riak for nearly a year


  4. RIAK @ CIM
    “Our problem wasn’t performance and scale as much as it was development efficiency. The key value store is a simple model that allows us to store data on the fly, giving our DBAs the freedom to work on other issues.”
    Riak Drives Comcast High Availability Object Store


  5. RIAK @ CIM
    •  “Generative technology”
    •  Make it simple
    •  Make it enjoyable
    •  People gravitate towards it and do cool things


  6. HOSS
    •  Highly Available Object Storage System
    •  s/Highly/Hacked up/
    •  Stupid simple HTTP interface
    •  “riak-http-client”
    •  Java webapp between clients & Riak
    •  Hide “complexity”
    •  Interface adjustments
    •  Monitoring
    •  Choke point
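
The slides don't include any HOSS code. Purely to illustrate the "stupid simple HTTP interface" idea, here is a hypothetical minimal Java wrapper in the spirit of "riak-http-client": a thin layer over Riak's classic HTTP API (GET/PUT on /riak/<bucket>/<key>). The class, method names, and defaults are invented for this sketch, not taken from HOSS.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch of a "riak-http-client"-style wrapper: a thin layer over
    // Riak's HTTP interface that a HOSS-like webapp could sit behind.
    public class SimpleRiakHttpClient {
        private final String baseUrl; // e.g. "http://riak-node:8098"

        public SimpleRiakHttpClient(String baseUrl) {
            this.baseUrl = baseUrl;
        }

        // GET /riak/<bucket>/<key>: returns the stored value, or null if the key is absent.
        public String get(String bucket, String key) throws IOException {
            HttpURLConnection conn = open("GET", bucket, key);
            if (conn.getResponseCode() == 404) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }

        // PUT /riak/<bucket>/<key>: stores the value and returns the HTTP status code,
        // so callers can see 412s, 503s, etc. instead of having them swallowed.
        public int put(String bucket, String key, String value, String contentType) throws IOException {
            HttpURLConnection conn = open("PUT", bucket, key);
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", contentType);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(value.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        }

        private HttpURLConnection open(String method, String bucket, String key) throws IOException {
            URL url = new URL(baseUrl + "/riak/" + bucket + "/" + key);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod(method);
            return conn;
        }
    }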


  7. RIAK @ CIM
    Account Information
    Account preferences
    User preferences
    Mobile auth
    User entitlements


  8. PREPARATION
    •  Thorough interface test (HOSS)
    •  Exercise every advertised feature to the fullest
    •  Thorough load testing
    •  Rolling restarts
    •  Kill Riak hard
    •  Leave/Join
    •  Capacity
    •  Need enough to support DC failover


  9. OBSERVATIONS
    •  “Bum preflists”
    •  1 node stores all replicas
    •  Membership changes & 404s
    •  Can’t leave/join in cluster taking traffic… (foreshadowing)
    •  Other things not of relevance...


  10. THE RISE OF THE HTTP 412
    Some time later…
    “We’re seeing a lot of HTTP 412 errors…”
    – new client
    “You’re doing it wrong”
    – me
    “We’re seeing a lot of HTTP 412 errors…”
    – old client
    “Uh oh…”
    – me
    GET /jawn
    → 200 OK, ETag: “abc123”
    PUT /jawn, If-Match: “abc123”
    → 204 No Content
    PUT /jawn, If-Match: “abc123”
    → 412 Precondition Failed ☹
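
The 412s come from conditional writes: a GET returns the object's ETag, and a PUT with an If-Match header succeeds only if that ETag is still current, otherwise the server answers 412 Precondition Failed. A hedged Java sketch of the loop a well-behaved client might run (the URL, names, and retry policy are illustrative assumptions, not HOSS code):

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Illustrative only: read the current ETag, write with If-Match, and retry on 412
    // (someone else updated the object between our GET and our PUT).
    public class ConditionalPutExample {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        public static void putWithIfMatch(String url, String body, int maxAttempts)
                throws IOException, InterruptedException {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                // GET the object so we know which version we are updating from.
                HttpResponse<String> current = CLIENT.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                String etag = current.headers().firstValue("ETag").orElse(null);

                HttpRequest.Builder put = HttpRequest.newBuilder(URI.create(url))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(body));
                if (etag != null) {
                    put.header("If-Match", etag); // only write if nobody changed it since our GET
                }

                int status = CLIENT.send(put.build(), HttpResponse.BodyHandlers.ofString()).statusCode();
                if (status == 200 || status == 204) {
                    return; // write accepted
                }
                if (status != 412) {
                    throw new IOException("unexpected status " + status);
                }
                // 412 Precondition Failed: our ETag was stale. In practice, merge the freshly
                // read value into `body` here rather than blindly re-sending the same bytes.
            }
            throw new IOException("gave up after " + maxAttempts + " conflicting writes");
        }
    }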


  11. HOSS IS THE POPULAR KID @ CIM
    (chart, reenactment: HOSS traffic climbing from comfortable capacity, through uncomfortable capacity, to oversubscribed)


  12. UTILIZATION VS LATENCY
    J K L N
    RPS
    Latency
    (reenactment)
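
The chart itself isn't reproducible here, but its shape is the standard queueing hockey stick. As a rough illustration (an M/M/1 approximation, which is my assumption and not something the slides state), the average time a request spends in the system is

    W = 1 / (μ − λ)

so as the request rate λ (RPS) approaches the service rate μ, latency grows without bound, which is the cliff the next slide describes the redlined nodes falling off.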


  13. THE FALL OF THE CLUSTER
    •  Redlining nodes didn’t scale
    •  Machines were literally overheating
    •  Hard drives weren’t happy
    •  Unhappy hard drives break
    •  That’s just what they do
    •  They also corrupt data in a last act of defiance


  14. ABANDON THE BAD NODE
    (diagram: the bad node, marked ☠)


  15. REINTRODUCE AS NEW NODE


  16. BRILLANT!


  17. THEN ANOTHER NODE CRASHED…
    •  Hardware was ok (enough)…
    •  Bitcask wasn’t…
    http://downloads.basho.com/papers/bitcask-intro.pdf (reenactment)


  18. NVALS… READ REPAIR… OOPS…
    (diagram: three replicas of an object: Copy 1, Copy 2 (not read repaired yet), Copy 3…)


  19. SURGERY TIME
    1.  Identify corruption point
    2.  Recover missing keys
    3.  Use missing keys to trigger read repair for missing values
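
Step 3 leans on the fact that a Riak read consults the replicas of a key and repairs any that are stale or missing. A hypothetical sketch of that pass, reusing the illustrative SimpleRiakHttpClient wrapper from the HOSS sketch above (again, invented names, not the actual tooling used at CIM):

    import java.io.IOException;
    import java.util.List;

    // Illustrative sketch: once the lost keys have been recovered (e.g. parsed out of
    // the surviving Bitcask files), simply reading each one prompts Riak to notice the
    // damaged replica and read-repair it from the healthy copies.
    public class ReadRepairSurgery {
        public static void repair(SimpleRiakHttpClient riak, String bucket, List<String> recoveredKeys)
                throws IOException {
            for (String key : recoveredKeys) {
                // The returned value doesn't matter; the GET is what triggers read repair.
                riak.get(bucket, key);
            }
        }
    }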


  20. (diagram of the Bitcask file layout for Key1, Key2, Key3; images from http://downloads.basho.com/papers/bitcask-intro.pdf)


  21. SURGERY SUCCESS
    •  Issue GETs for under-replicated keys at our leisure
    •  No downtime, no missing keys
    •  This all comes out of the box with Riak now
    •  This was back in the day…
    w00t


  22. RECOVERING…
    •  Tape backup
    •  Faster recovery in hard failure
    •  Client outreach
    •  Smarter usage patterns
    •  Close monitoring
    •  Nodes are crashing frequently, correlated with runaway memory usage
    •  MOAR CAPACITY
    •  Can’t add/join nodes without client impact


  23. CLIENT OUTREACH
    Smarter client usage


  24. CAPACITY ADDITIONS
    •  US based company
    •  US based traffic patterns
    •  Forgo sleep, deploy nodes
    •  Uneventful, but effective


  25. MEMORY USAGE STILL A PROBLEM…


  26. FIXING MEMORY ERRORS
    1.  Identify trouble nodes
    2.  Reboot
    3.  GOTO 1
    In the end a new allocator did the trick…


  27. RETROSPECTIVE
    •  Improved monitoring
    •  Track utilization
    •  Track latencies
    •  Better communication
    •  Show newcomers the ropes
    •  Appropriately utilize HOSS as choke point


  28. HOSS AS A CHOKE POINT
    •  Or how I stopped worrying and learned to love Little’s Law
    •  Use avg. latency and estimate capacity to limit traffic to Riak from HOSS
    “This should hopefully prevent the Riak crashes which have become all too common.”
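
Little's Law says L = λ · W: the average number of requests in flight equals throughput times average latency. Given an estimate of the throughput Riak can sustain and its average latency, HOSS can cap concurrency at roughly that product and shed whatever exceeds it. A hypothetical Java sketch of such a choke point (a semaphore-based limiter; none of these names come from HOSS):

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Supplier;

    // Illustrative choke point: allow at most L = lambda * W requests into Riak at once
    // (sustainable requests/second times average latency in seconds), and fail fast once
    // the cluster is saturated instead of piling more work onto it.
    public class LittlesLawThrottle {
        private final Semaphore permits;

        public LittlesLawThrottle(double sustainableRps, double avgLatencySeconds) {
            int maxInFlight = Math.max(1, (int) Math.ceil(sustainableRps * avgLatencySeconds));
            this.permits = new Semaphore(maxInFlight, /* fair = */ true);
        }

        public <T> T call(Supplier<T> riakCall) throws InterruptedException {
            // Wait briefly for a slot; if none frees up, Riak is at capacity and we shed load.
            if (!permits.tryAcquire(50, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Riak at capacity, shedding load");
            }
            try {
                return riakCall.get();
            } finally {
                permits.release();
            }
        }
    }

For example, a cluster trusted for about 2,000 requests/second at roughly 20 ms average latency would get 2,000 × 0.02 = 40 permits.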


  29. THROTTLE TRAFFIC TO PREVENT SADNESS
    (diagram: Clients → HOSS → Riak, shown with and without throttling; HOSS keeps the traffic it forwards to Riak within capacity)


  30. NO MAN IS AN ISLAND
    Many thanks to Atif @ CIM Ops, Basho, and everyone who pitched in


  31. QUESTIONS?
    •  Comments?
    •  Criticisms?
