Firefighting Riak at Scale (RICON East 2013)

Presented by Michajlo Matijkiw at RICON East 2013

Managing a business-critical Riak instance in an enterprise environment takes careful planning, coordination, and the willingness to accept that no matter how much you plan, Murphy's law will always win. At CIM we've been running Riak in production for nearly 3 years, and over those years we've seen our fair share of failures, both expected and unexpected. From disk meltdowns to solar flares, we've managed to recover and maintain 100% uptime with no customer impact. I'll talk about some of these failures, how we dealt with them, and how we managed to keep our clients completely unaware.

About Michajlo

Michajlo Matijkiw is a Sr. Software Engineer at Comcast Interactive Media, where he focuses on building the types of tools and infrastructure that make developers' lives easier. Prior to that, he was an undergraduate student at the University of Pennsylvania, where he split his time between soldering irons and studying task-based parallelism. In his spare time he enjoys cooking, good beer, and climbing.

Basho Technologies

May 13, 2013

Transcript

  1. RICON East, 2013 Michajlo (Mishu) Matijkiw Sr. Software Engineer, Comcast Interactive Media

    FIREFIGHTING RIAK AT SCALE
  2. ADMINISTRIVIA These opinions are my own, and do not necessarily

    reflect those of my employer, and so forth…
  3. QUALIFICATIONS •  Riaking for nearly 3 years •  In charge

    of CIM’s Riak for roughly 2 years •  Writing distributed map reduce functions in Erlang to query Riak for nearly a year
  4. RIAK @ CIM “Our problem wasn’t performance and scale as

    much as it was development efficiency. The key value store is a simple model that allows us to store data on the fly, giving our DBA’s the freedom to work on other issues.” Riak Drives Comcast High Availability Object Store
  5. RIAK @ CIM •  “Generative technology” •  Make it simple

    •  Make it enjoyable •  People gravitate towards it and do cool things
  6. HOSS •  Highly Available Object Storage System •  s/Highly/Hacked up/

    •  Stupid simple HTTP interface •  “riak-http-client” •  Java webapp between clients & Riak •  Hide “complexity” •  Interface adjustments •  Monitoring •  Choke point
  7. RIAK @ CIM Account Information Account preferences User preferences Mobile

    auth User entitlements
  8. PREPARATION •  Thorough interface test (HOSS) •  Exercise every advertised

    feature to the fullest •  Thorough load testing •  Rolling restarts •  Kill Riak hard •  Leave/Join •  Capacity •  Need enough to support DC failover
  9. OBSERVATIONS •  “Bum preflists” •  1 node stores all replicas

    •  Membership changes & 404s •  Can’t leave/join in cluster taking traffic… (foreshadowing) •  Other things not of relevance...
  10. THE RISE OF THE HTTP 412 Some time later… “We’re seeing a lot of HTTP 412 errors…” – new client “You’re doing it wrong” – me “We’re seeing a lot of HTTP 412 errors…” – old client “Uh oh…” – me

    GET /jawn → 200 OK, Etag: “abc123” • PUT /jawn If-Match: “abc123” → 204 No Content • PUT /jawn If-Match: “abc123” → 412 … ☹
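The request sequence on slide 10 can be reproduced with a toy model. This is a minimal sketch, not HOSS or Riak code: `ToyStore` and its ETag scheme (a truncated SHA-1 of the body) are assumptions made for illustration. It shows why two writers that both read the same ETag cannot both succeed with `If-Match`.

```python
import hashlib


class ToyStore:
    """Hypothetical in-memory stand-in for an HTTP object store with ETag preconditions."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _etag(body: bytes) -> str:
        # Assumption: the ETag is derived from the stored body.
        return hashlib.sha1(body).hexdigest()[:8]

    def get(self, key):
        body = self._data.get(key)
        if body is None:
            return 404, None, None
        return 200, self._etag(body), body

    def put(self, key, body, if_match=None):
        current = self._data.get(key)
        if if_match is not None:
            # Precondition fails if the object is gone or has changed since the read.
            if current is None or self._etag(current) != if_match:
                return 412
        self._data[key] = body
        return 204


store = ToyStore()
store.put("/jawn", b"v0")

# Both clients read the same version, then both try a conditional write.
status, etag, _ = store.get("/jawn")
first = store.put("/jawn", b"from client A", if_match=etag)
second = store.put("/jawn", b"from client B", if_match=etag)
print(status, first, second)  # prints: 200 204 412
```

The second writer has to re-read, merge, and retry; under heavy contention on a single key those retries themselves add load.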
  11. HOSS IS THE POPULAR KID @ CIM

    (chart levels: comfortable capacity, uncomfortable capacity, oversubscribed) (reenactment)
  12. UTILIZATION VS LATENCY (chart: Latency against RPS, annotated ☺ 😐 ☹ ☠) (reenactment)

  13. THE FALL OF THE CLUSTER •  Redlining nodes didn’t scale

    •  Machines were literally overheating •  Hard drives weren’t happy •  Unhappy hard drives break •  That’s just what they do •  They also corrupt data in a last act of defiance
  14. ABANDON THE BAD NODE ☠

  15. REINTRODUCE AS NEW NODE

  16. BRILLANT!

  17. THEN ANOTHER NODE CRASHED… •  Hardware was ok (enough)… •  Bitcask wasn’t…

    http://downloads.basho.com/papers/bitcask-intro.pdf (reenactment)
  18. NVALS… READ REPAIR… OOPS…

    (diagram: Copy 1, Copy 2 (not read repaired yet), Copy 3…)
  19. SURGERY TIME 1.  Identify corruption point 2.  Recover missing keys

    3.  Use missing keys to trigger read repair for missing values
  20. Key1 Key2 Key3 … Images from: http://downloads.basho.com/papers/bitcask-intro.pdf

  21. SURGERY SUCCESS •  Issue gets for under replicated keys at

    our leisure •  No down time, no missing keys •  This all comes out of the box with Riak now •  This was back in the day… w00t
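Slides 19 to 21 describe recovering the list of affected keys and then reading them back so that read repair restores the missing replicas. A minimal sketch of that last step follows. The helper name `reread_keys`, the `fetch` callable, and the three-dictionary "cluster" are all hypothetical stand-ins; in practice `fetch` would be an ordinary GET against the store, and the repair itself is done by the store, not by this loop.

```python
import time
from typing import Callable, Dict, Iterable, List


def reread_keys(keys: Iterable[str],
                fetch: Callable[[str], int],
                pause_s: float = 0.0) -> Dict[int, List[str]]:
    """Issue one read per key, optionally paced, and group keys by status code."""
    results: Dict[int, List[str]] = {}
    for key in keys:
        status = fetch(key)
        results.setdefault(status, []).append(key)
        if pause_s:
            time.sleep(pause_s)  # "at our leisure": keep the extra load small
    return results


# Toy cluster: three replicas, each missing some keys.
replicas = [
    {"Key1": "a", "Key2": "b"},
    {"Key1": "a"},
    {"Key2": "b", "Key3": "c"},
]


def toy_fetch(key: str) -> int:
    """Toy read: if any replica has the value, copy it to replicas that lack it."""
    values = [r[key] for r in replicas if key in r]
    if not values:
        return 404
    for r in replicas:
        r.setdefault(key, values[0])
    return 200


summary = reread_keys(["Key1", "Key2", "Key3", "Key4"], toy_fetch)
print(summary)
```

Grouping by status makes it easy to spot keys that no replica could serve (the 404 bucket here), which would need a different recovery path such as a backup.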
  22. RECOVERING… •  Tape backup •  Faster recovery in hard failure

    •  Client outreach •  Smarter usage patterns •  Close monitoring •  Nodes are crashing frequently, correlated with runaway memory usage •  MOAR CAPACITY •  Can’t add/join nodes without client impact
  23. CLIENT OUTREACH Smarter client usage

  24. CAPACITY ADDITIONS •  US based company •  US based traffic

    patterns •  Forgo sleep, deploy nodes •  Uneventful, but effective
  25. MEMORY USAGE STILL A PROBLEM…

  26. FIXING MEMORY ERRORS 1.  Identify trouble nodes 2.  Reboot 3.  GOTO 1

    In the end a new allocator did the trick…
  27. RETROSPECTIVE •  Improved monitoring •  Track utilization •  Track latencies

    •  Better communication •  Show newcomers the ropes •  Appropriately utilize HOSS as choke point
  28. HOSS AS A CHOKE POINT •  Or how I stopped

    worrying and learned to love Little’s Law •  Use avg. latency and estimate capacity to limit traffic to Riak from HOSS “This should hopefully prevent the Riak crashes which have become all too common.”
  29. THROTTLE TRAFFIC TO PREVENT SADNESS

    (two diagrams, each labeled Clients, HOSS, Riak, Capacity)
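Little's Law says the average number of requests in flight equals throughput times average latency (L = λW). One way to apply it at a choke point such as HOSS is to turn an estimated safe throughput and a measured average latency into a cap on concurrent requests, and to reject work beyond that cap. The sketch below assumes this approach; the slides do not show how HOSS implements its throttle, and the names `max_in_flight` and `Throttle` and the example numbers are hypothetical.

```python
import threading


def max_in_flight(capacity_rps: int, avg_latency_ms: int) -> int:
    """Little's Law: in-flight requests = throughput * latency (integer math, at least 1)."""
    return max(1, capacity_rps * avg_latency_ms // 1000)


class Throttle:
    """Caps concurrent requests; callers that do not get a slot should be rejected."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when all slots are taken.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


# Hypothetical numbers: 2000 requests/s of comfortable capacity at 5 ms average latency.
limit = max_in_flight(capacity_rps=2000, avg_latency_ms=5)
throttle = Throttle(limit)

# Simulate 12 requests arriving before any of them completes.
outcomes = [throttle.try_acquire() for _ in range(12)]
accepted = outcomes.count(True)
rejected = outcomes.count(False)
print(limit, accepted, rejected)  # prints: 10 10 2
```

Rejecting early at the choke point keeps the backing store in its comfortable latency range instead of letting queues build up on the nodes themselves.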
  30. NO MAN IS AN ISLAND Many thanks to Atif @

    CIM Ops, Basho, and everyone who pitched in
  31. QUESTIONS? •  Comments? •  Criticisms?