
Firefighting Riak at Scale (RICON East 2013)

Presented by Michajlo Matijkiw at RICON East 2013

Managing a business-critical Riak instance in an enterprise environment takes careful planning, coordination, and the willingness to accept that no matter how much you plan, Murphy's law will always win. At CIM we've been running Riak in production for nearly 3 years, and over those years we've seen our fair share of failures, both expected and unexpected. From disk meltdowns to solar flares, we've managed to recover and maintain 100% uptime with no customer impact. I'll talk about some of these failures, how we dealt with them, and how we managed to keep our clients completely unaware.

About Michajlo

Michajlo Matijkiw is a Sr. Software Engineer at Comcast Interactive Media, where he focuses on building the types of tools and infrastructure that make developers' lives easier. Prior to that, he was an undergraduate student at the University of Pennsylvania, where he split his time between soldering irons and studying task-based parallelism. In his spare time he enjoys cooking, good beer, and climbing.

Basho Technologies

May 13, 2013


  1. ADMINISTRIVIA
     These opinions are my own, and do not necessarily reflect those of my employer, and so forth…
  2. QUALIFICATIONS
     •  Riaking for nearly 3 years
     •  In charge of CIM’s Riak for roughly 2 years
     •  Writing distributed map reduce functions in Erlang to query Riak for nearly a year
  3. RIAK @ CIM
     “Our problem wasn’t performance and scale as much as it was development efficiency. The key value store is a simple model that allows us to store data on the fly, giving our DBAs the freedom to work on other issues.” – Riak Drives Comcast High Availability Object Store
  4. RIAK @ CIM
     •  “Generative technology”
     •  Make it simple
     •  Make it enjoyable
     •  People gravitate towards it and do cool things
  5. HOSS
     •  Highly Available Object Storage System
     •  s/Highly/Hacked up/
     •  Stupid simple HTTP interface
     •  “riak-http-client”
     •  Java webapp between clients & Riak
     •  Hide “complexity”
     •  Interface adjustments
     •  Monitoring
     •  Choke point
  6. PREPARATION
     •  Thorough interface test (HOSS)
     •  Exercise every advertised feature to the fullest
     •  Thorough load testing
     •  Rolling restarts
     •  Kill Riak hard
     •  Leave/Join
     •  Capacity
     •  Need enough to support DC failover
  7. OBSERVATIONS
     •  “Bum preflists”
     •  1 node stores all replicas
     •  Membership changes & 404s
     •  Can’t leave/join in a cluster taking traffic… (foreshadowing)
     •  Other things not of relevance…
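A “bum preflist” can be illustrated with a toy ring: Riak takes N consecutive partitions from the consistent-hashing ring as a key’s preference list, so if partition ownership leaves adjacent partitions on the same physical node, one node ends up holding multiple (or all) replicas. The sketch below is a made-up two-node ring, not Basho’s actual claim algorithm:

```python
# Toy model of a Riak-style ring: 8 partitions, N=3 replicas.
# NOT Basho's actual claim algorithm -- just an illustration of how
# consecutive partitions owned by one node yield a "bum preflist".

def preflist(ring, start, n=3):
    """Return the n nodes owning consecutive partitions from `start`."""
    return [ring[(start + i) % len(ring)] for i in range(n)]

# Alternating ownership: a preflist still lands 2 of 3 replicas on node a.
ring = ["a", "b", "a", "b", "a", "b", "a", "b"]
print(preflist(ring, 0))   # ['a', 'b', 'a']

# After a membership change leaves node a owning a run of partitions:
ring = ["a", "a", "a", "b", "a", "b", "a", "b"]
print(preflist(ring, 0))   # ['a', 'a', 'a'] -- one node stores ALL replicas
```

Losing that single node then takes every replica of those keys with it, which is why these preflists mattered during failures.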
  8. THE RISE OF THE HTTP 412
     Some time later…
     “We’re seeing a lot of HTTP 412 errors…” – new client
     “You’re doing it wrong” – me
     “We’re seeing a lot of HTTP 412 errors…” – old client
     “Uh oh…” – me
     GET /jawn → 200 OK, Etag: “abc123”
     PUT /jawn, If-Match: “abc123” → 204 No Content
     PUT /jawn, If-Match: “abc123” → 412 Precondition Failed
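The trace above is a conditional update: a GET returns an ETag, the client echoes it in If-Match, and the server rejects a PUT whose ETag no longer matches the current version with 412 Precondition Failed. A minimal in-memory sketch of that HTTP contract (not HOSS’s actual implementation):

```python
import uuid

# Minimal in-memory sketch of conditional PUT semantics (ETag / If-Match).
# Not HOSS's actual code -- just the HTTP contract it exposed to clients.

store = {}  # key -> (etag, value)

def get(key):
    etag, value = store[key]
    return 200, etag, value

def put(key, value, if_match=None):
    current = store.get(key)
    if if_match is not None and (current is None or current[0] != if_match):
        return 412  # Precondition Failed: the object changed under you
    store[key] = (uuid.uuid4().hex, value)
    return 204  # No Content

put("jawn", "v1")                         # initial write
status, etag, _ = get("jawn")             # 200 OK, Etag: <etag>
print(put("jawn", "v2", if_match=etag))   # 204 -- ETag still current
print(put("jawn", "v3", if_match=etag))   # 412 -- ETag is now stale
```

A burst of 412s from a well-behaved old client meant the store itself was misbehaving, not the client’s conditional logic.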
  9. THE FALL OF THE CLUSTER
     •  Redlining nodes didn’t scale
     •  Machines were literally overheating
     •  Hard drives weren’t happy
     •  Unhappy hard drives break
     •  That’s just what they do
     •  They also corrupt data in a last act of defiance
  10. THEN ANOTHER NODE CRASHED…
     •  Hardware was ok (enough)…
     •  Bitcask wasn’t…
     •  http://downloads.basho.com/papers/bitcask-intro.pdf (reenactment)
  11. SURGERY TIME
     1.  Identify the corruption point
     2.  Recover the missing keys
     3.  Use the missing keys to trigger read repair for the missing values
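Step 3 leans on Riak’s read repair: a GET that consults multiple replicas notices any that are missing or stale and writes the value back to them. A toy simulation of that behavior (an illustration, not Riak’s internals):

```python
# Toy simulation of read repair: a GET consults all replicas, takes the
# surviving value, and writes it back to replicas that lost it.
# An illustration only -- real Riak resolves versions via vector clocks.

replicas = [
    {"k1": "v1", "k2": "v2"},   # healthy replica
    {"k1": "v1"},               # replica that lost k2 to corruption
    {"k1": "v1", "k2": "v2"},   # healthy replica
]

def get_with_read_repair(key):
    values = [r[key] for r in replicas if key in r]
    value = values[0]
    for r in replicas:          # repair any replica missing the value
        r[key] = value
    return value

# The key list recovered in step 2 drives the repair in step 3:
for key in ["k1", "k2"]:
    get_with_read_repair(key)

print(all("k2" in r for r in replicas))  # True -- k2 re-replicated everywhere
```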
  12. SURGERY SUCCESS
     •  Issue GETs for under-replicated keys at our leisure
     •  No downtime, no missing keys
     •  This all comes out of the box with Riak now
     •  This was back in the day… w00t
  13. RECOVERING…
     •  Tape backup
     •  Faster recovery in hard failure
     •  Client outreach
     •  Smarter usage patterns
     •  Close monitoring
     •  Nodes are crashing frequently, correlated with runaway memory usage
     •  MOAR CAPACITY
     •  Can’t add/join nodes without client impact
  14. CAPACITY ADDITIONS
     •  US-based company
     •  US-based traffic patterns
     •  Forgo sleep, deploy nodes
     •  Uneventful, but effective
  15. FIXING MEMORY ERRORS
     1.  Identify trouble nodes
     2.  Reboot
     3.  GOTO 1
     In the end a new allocator did the trick…
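The identify/reboot loop boils down to a threshold check over per-node memory stats. A hedged sketch; the node names and figures are hypothetical, and in practice the numbers would come from monitoring (e.g. `riak-admin status` or OS metrics):

```python
# Sketch of step 1 of the "identify trouble nodes, reboot, GOTO 1" loop.
# Node names and memory figures are hypothetical illustration data.

def nodes_to_reboot(mem_used_pct, threshold=85.0):
    """Return nodes whose memory usage exceeds the threshold, worst first."""
    over = {node: pct for node, pct in mem_used_pct.items() if pct > threshold}
    return sorted(over, key=over.get, reverse=True)

stats = {"riak1": 92.5, "riak2": 61.0, "riak3": 88.1, "riak4": 70.3}
print(nodes_to_reboot(stats))   # ['riak1', 'riak3']
```

Rebooting only treated the symptom; swapping the allocator removed the runaway growth itself.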
  16. RETROSPECTIVE
     •  Improved monitoring
     •  Track utilization
     •  Track latencies
     •  Better communication
     •  Show newcomers the ropes
     •  Appropriately utilize HOSS as a choke point
  17. HOSS AS A CHOKE POINT
     •  Or how I stopped worrying and learned to love Little’s Law
     •  Use avg. latency and estimated capacity to limit traffic to Riak from HOSS
     •  “This should hopefully prevent the Riak crashes which have become all too common.”
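Little’s Law says L = λ·W: average in-flight requests equal arrival rate times average latency. Rearranged, an estimated backend capacity λ and a measured average latency W give a concurrency cap that a proxy like HOSS can enforce. A sketch under made-up numbers, using a semaphore as the gate (not HOSS’s actual code):

```python
import threading

# Little's Law: L = lambda * W (in-flight = arrival rate * avg latency).
# Cap concurrency at what the backend can sustain; numbers are made up.

est_capacity_rps = 1000      # estimated Riak capacity (requests/sec)
avg_latency_s = 0.050        # measured average request latency (seconds)

max_in_flight = int(est_capacity_rps * avg_latency_s)
print(max_in_flight)         # 50 concurrent requests allowed through

gate = threading.BoundedSemaphore(max_in_flight)

def proxy_request(do_request):
    """Forward a request to Riak only if we're under the concurrency cap."""
    if not gate.acquire(blocking=False):
        return 503           # shed load at the choke point instead of redlining Riak
    try:
        return do_request()
    finally:
        gate.release()

print(proxy_request(lambda: 200))   # 200 -- under the cap, forwarded
```

Shedding excess load at the choke point trades a few 503s for keeping the cluster out of the overheating/crash regime described earlier.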
  18. NO MAN IS AN ISLAND
     Many thanks to Atif @ CIM Ops, Basho, and everyone who pitched in