Firefighting Riak at Scale (RICON East 2013)

FIREFIGHTING RIAK AT SCALE
RICON East, 2013
Michajlo (Mishu) Matijkiw, Sr. Software Engineer, Comcast Interactive Media

ADMINISTRIVIA
These opinions are my own, and do not necessarily reflect those of my employer, and so forth…

QUALIFICATIONS
• Riaking for nearly 3 years
• In charge of CIM’s Riak for roughly 2 years
• Writing distributed MapReduce functions in Erlang to query Riak for nearly a year

RIAK @ CIM
“Our problem wasn’t performance and scale as much as it was development efficiency. The key value store is a simple model that allows us to store data on the fly, giving our DBA’s the freedom to work on other issues.”
– “Riak Drives Comcast High Availability Object Store”

RIAK @ CIM
• “Generative technology”
• Make it simple
• Make it enjoyable
• People gravitate towards it and do cool things

HOSS
• Highly Available Object Storage System
• s/Highly/Hacked up/
• Stupid simple HTTP interface
• “riak-http-client”
• Java webapp between clients & Riak
• Hide “complexity”
• Interface adjustments
• Monitoring
• Choke point

RIAK @ CIM
• Account information
• Account preferences
• User preferences
• Mobile auth
• User entitlements

PREPARATION
• Thorough interface test (HOSS)
• Exercise every advertised feature to the fullest
• Thorough load testing
• Rolling restarts
• Kill Riak hard
• Leave/Join
• Capacity
• Need enough to support DC failover

OBSERVATIONS
• “Bum preflists”
• 1 node stores all replicas
• Membership changes & 404s
• Can’t leave/join in cluster taking traffic… (foreshadowing)
• Other things not of relevance...
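A minimal sketch of how a "bum preflist" could be spotted, assuming a simplified model in which the ring is a list of partition owners in ring order and a key's replicas land on `n_val` consecutive partitions. The function name `bum_preflists` and the list representation are illustrative assumptions, not Riak's API; the check flags any run of partitions with fewer than `n_val` distinct owners, of which "1 node stores all replicas" is the worst case.

```python
def bum_preflists(owners, n_val=3):
    """Return (start_index, nodes) for every preflist with repeated owners.

    owners[i] is the (hypothetical) name of the node owning partition i,
    listed in ring order; the ring wraps around at the end.
    """
    size = len(owners)
    bad = []
    for start in range(size):
        # n_val consecutive partitions, wrapping past the end of the ring
        nodes = [owners[(start + i) % size] for i in range(n_val)]
        if len(set(nodes)) < n_val:
            bad.append((start, nodes))
    return bad


# An 8-partition ring over 3 nodes where the wrap-around clusters node "a"
ring = ["a", "b", "c", "a", "b", "c", "a", "a"]
for start, nodes in bum_preflists(ring):
    print(start, nodes)
```

In this toy ring the preflist starting at partition 6 places all three replicas on node "a", so losing that one node loses every copy of the affected keys.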

THE RISE OF THE HTTP 412
Some time later…
“We’re seeing a lot of HTTP 412 errors…” – new client
“You’re doing it wrong” – me
“We’re seeing a lot of HTTP 412 errors…” – old client
“Uh oh…” – me
GET /jawn → 200 OK, Etag: “abc123”
PUT /jawn, If-Match: “abc123” → 204 No Content
PUT /jawn, If-Match: “abc123” → 412 … ☹
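The exchange on this slide can be sketched with a toy in-memory store. This is an illustration of HTTP conditional-PUT semantics only, under the assumption that the ETag is a hash of the stored body; the `Store` class is hypothetical and is not how Riak or HOSS derive ETags.

```python
import hashlib


class Store:
    """Tiny in-memory stand-in for a resource store that honours If-Match."""

    def __init__(self):
        self._data = {}

    @staticmethod
    def _etag(body):
        # Assumption: the ETag is simply a digest of the current body
        return hashlib.sha1(body).hexdigest()

    def get(self, key):
        body = self._data.get(key)
        if body is None:
            return 404, None, None
        return 200, self._etag(body), body

    def put(self, key, body, if_match=None):
        current = self._data.get(key)
        if if_match is not None:
            # Precondition fails if the resource is gone or has changed
            if current is None or self._etag(current) != if_match:
                return 412
        self._data[key] = body
        return 204


def demo():
    store = Store()
    store.put("jawn", b"v0")
    # Two clients read the same version and get the same ETag
    _, etag_a, _ = store.get("jawn")
    _, etag_b, _ = store.get("jawn")
    first = store.put("jawn", b"from client A", if_match=etag_a)
    second = store.put("jawn", b"from client B", if_match=etag_b)
    return first, second


print(demo())
```

The second writer loses because the first write changed the ETag; a client that sees 412 is expected to re-read and retry rather than treat it as a server fault.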

HOSS IS THE POPULAR KID @ CIM
[Chart (reenactment): comfortable capacity, uncomfortable capacity, oversubscribed]

UTILIZATION VS LATENCY
[Chart (reenactment): axes RPS and Latency; markers ☺ 😐 ☹ ☠]

THE FALL OF THE CLUSTER
• Redlining nodes didn’t scale
• Machines were literally overheating
• Hard drives weren’t happy
• Unhappy hard drives break
• That’s just what they do
• They also corrupt data in a last act of defiance

ABANDON THE BAD NODE ☠

REINTRODUCE AS NEW NODE

BRILLANT!

THEN ANOTHER NODE CRASHED…
• Hardware was ok (enough)…
• Bitcask wasn’t…
http://downloads.basho.com/papers/bitcask-intro.pdf (reenactment)

NVALS… READ REPAIR… OOPS…
[Diagram: Copy 1; Copy 2 (not read repaired yet); Copy 3…]

SURGERY TIME
1. Identify corruption point
2. Recover missing keys
3. Use missing keys to trigger read repair for missing values
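A rough sketch of steps 1 and 3, assuming a simplified append-only record layout loosely modeled on the entry diagram in the Bitcask intro paper (CRC, timestamp, key size, value size, key, value). The field widths, the CRC coverage, and the helper names `encode`, `scan` and `trigger_read_repair` are all assumptions made for illustration; this is not the actual Bitcask on-disk format or the procedure used in the talk.

```python
import struct
import zlib

# Assumed header: crc32, timestamp, key size, value size (big-endian)
HEADER = struct.Struct(">IIHI")


def encode(tstamp, key, value):
    """Build one record; the CRC covers everything after the CRC field."""
    body = struct.pack(">IHI", tstamp, len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body


def scan(data):
    """Return (keys of intact records, offset of the first bad record)."""
    keys, pos = [], 0
    while pos + HEADER.size <= len(data):
        crc, _tstamp, ksz, vsz = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + ksz + vsz
        if end > len(data) or zlib.crc32(data[pos + 4:end]) != crc:
            break  # truncated or corrupt record: this is the corruption point
        keys.append(data[pos + HEADER.size:pos + HEADER.size + ksz])
        pos = end
    return keys, pos


def trigger_read_repair(keys, fetch):
    """Issue a read per key; `fetch` is a hypothetical client GET callable."""
    return [fetch(key) for key in keys]


records = [encode(1, b"Key1", b"a"), encode(2, b"Key2", b"b"), encode(3, b"Key3", b"c")]
damaged = bytearray(b"".join(records))
damaged[-1] ^= 0xFF  # flip bits in the last value to simulate corruption
print(scan(bytes(damaged)))
```

How the keys past the corruption point are recovered (step 2) depends on what else survives on disk, so the sketch leaves that step out; once a key list exists, reading each key through the normal client path is what prompts the cluster to repair the missing replica.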

[Diagram: Key1, Key2, Key3, …]
Images from: http://downloads.basho.com/papers/bitcask-intro.pdf

SURGERY SUCCESS
• Issue GETs for under-replicated keys at our leisure
• No downtime, no missing keys
• This all comes out of the box with Riak now
• This was back in the day… w00t

RECOVERING…
• Tape backup
• Faster recovery in hard failure
• Client outreach
• Smarter usage patterns
• Close monitoring
• Nodes are crashing frequently, correlated with runaway memory usage
• MOAR CAPACITY
• Can’t add/join nodes without client impact

CLIENT OUTREACH Smarter client usage

CAPACITY ADDITIONS
• US-based company
• US-based traffic patterns
• Forgo sleep, deploy nodes
• Uneventful, but effective

MEMORY USAGE STILL A PROBLEM…

FIXING MEMORY ERRORS
1. Identify trouble nodes
2. Reboot
3. GOTO 1
In the end a new allocator did the trick…

RETROSPECTIVE
• Improved monitoring
• Track utilization
• Track latencies
• Better communication
• Show newcomers the ropes
• Appropriately utilize HOSS as choke point

HOSS AS A CHOKE POINT
• Or how I stopped worrying and learned to love Little’s Law
• Use avg. latency and estimated capacity to limit traffic to Riak from HOSS
“This should hopefully prevent the Riak crashes which have become all too common.”
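Little's Law says the average number of requests in flight equals arrival rate times average latency (L = λW). A minimal sketch of how that could bound concurrency at a choke point, assuming an estimated sustainable rate and a measured average latency; the `Throttle` class and the numbers below are illustrative and are not taken from HOSS.

```python
import math
import threading


def max_in_flight(capacity_rps, avg_latency_s):
    """Little's Law: L = lambda * W, rounded down, never below 1."""
    return max(1, math.floor(capacity_rps * avg_latency_s))


class Throttle:
    """Reject work once the in-flight limit is reached instead of queueing."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            # A web tier would typically turn this into an HTTP 503
            raise RuntimeError("over capacity")
        try:
            return fn(*args)
        finally:
            self._slots.release()


# Assumed figures: 2000 requests/s of estimated capacity at 5 ms average latency
limit = max_in_flight(2000, 0.005)
throttle = Throttle(limit)
print(limit, throttle.call(lambda: "ok"))
```

Failing fast at the limit keeps excess load out of the backing store, which is the point of a choke point: clients see a quick rejection rather than a slow cluster sliding towards a crash.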

THROTTLE TRAFFIC TO PREVENT SADNESS
[Diagram: two panels, each showing Clients, HOSS, Riak, and Capacity]

NO MAN IS AN ISLAND
Many thanks to Atif @ CIM Ops, Basho, and everyone who pitched in

QUESTIONS?
• Comments?
• Criticisms?
