Scaling Riak to 25MM Ops/Day at Kiip

This talk goes over how we scaled one part of our technology stack at Kiip over the last 18 months, and how we ended up on Riak for this specific use case.

Mitchell Hashimoto

May 23, 2012

Transcript

  1. Scaling Riak to 25MM Ops/Day at Kiip

  2. Armon Dadgar @armondadgar Mitchell Hashimoto @mitchellh

  3. None
  4. None
  5. API Flow Session Start Moment Reward Session End 0..n times

  6. The Numbers x million unique devices per day About 4

    API calls per session = ~25 million API calls per day
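For a sense of scale, the daily total can be turned into an average rate. This is plain arithmetic on the ~25 million figure from the slide; real traffic is presumably burstier than the average, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60
calls_per_day = 25_000_000  # "~25 million API calls per day" from the slide

# Average rate only; says nothing about peaks.
avg_calls_per_second = calls_per_day / SECONDS_PER_DAY
print(f"~{avg_calls_per_second:.0f} API calls/second on average")
```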
  7. The Journey of Scale A Story of MongoDB * Let’s

    talk about our journey scaling, specifically with MongoDB. * We started with MySQL, but switched before we had any real traffic to MongoDB.
  8. 1. Write Limit Hit by Analytics * Analytics sent hundreds

    of atomic updates per second. * Hit limit w/ global write lock. * Solution: Aggregate over 10 seconds and send small bursts of updates, resulting in a lower average lock %.
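The aggregation idea on this slide can be sketched roughly as follows. This is a minimal illustration, not Kiip's code: `CounterBuffer` and `flush_fn` are hypothetical names, and `flush_fn` stands in for whatever applies a batch of increments to the database.

```python
import threading
from collections import Counter


class CounterBuffer:
    """Accumulates counter increments in memory and writes them out in one batch."""

    def __init__(self, flush_fn):
        # flush_fn is a hypothetical callable that receives {key: total_delta}
        # and applies it to the database in a single burst.
        self._flush_fn = flush_fn
        self._pending = Counter()
        self._lock = threading.Lock()

    def incr(self, key, delta=1):
        # Cheap in-memory update; no database write happens here.
        with self._lock:
            self._pending[key] += delta

    def flush(self):
        # Swap the pending counts out under the lock, then write outside it.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        if batch:
            self._flush_fn(dict(batch))
        return len(batch)
```

Calling `flush()` from a timer every 10 seconds would turn hundreds of individual updates per second into one small burst per distinct key per interval. The trade-off is that counts lag by up to one interval and anything still buffered is lost if the process dies.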
  9. 2. Too Many Reads (1000s/s) * We were reading too

    much, hit max throughput of MongoDB. * Solution: Heavy caching, everywhere.
  10. 3. Slow, Uncachable Queries Example: “Has device X played game

    Y today/last week/this month/in all time?” * Touches lots of data * Requires lots of index space * Not cachable * MongoDB just... slow. Solution: Bloom filters!
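A Bloom filter answers "definitely not seen" or "probably seen" using a fixed amount of memory, which suits a membership question like the one on this slide. The sketch below is illustrative only: the sizing, the SHA-1 based double hashing, and the `play_key` format are assumptions, not details from the talk.

```python
import hashlib


class BloomFilter:
    """Fixed-size set sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest via double hashing.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def play_key(device_id, game_id, period):
    # Hypothetical key format: one entry per (device, game, time bucket).
    return f"{device_id}:{game_id}:{period}"
```

With one filter per time window (day, week, month, all time), `play_key(device, game, period) in bloom` replaces an index-heavy query with a few bit lookups. A "yes" can be a false positive, so this only fits when an occasional wrong "yes" is acceptable.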
  11. 4. Write Limit Hit, Again Basic model updates were hitting

    MongoDB’s write throughput limit. Solution: Use two distinct MongoDB clusters for disjoint datasets to avoid the global write lock: one for analytics (heavy writes), one for everything else.
  12. 5. Index Size Hit Memory Limits We didn’t vertically scale

    because we’re pretty operationally frugal and the data was growing very fast. ETL = Extract/Transform/Load, archive data to S3, remove from main DB. Solution: ETL, Reap old data.
  13. 6. ETL Overwhelmed ETL of 24 hours of data took

    longer than 24 hours to extract, limited by MongoDB read throughput. We decided to let it break and continue reaping data. Solved in the future by continuous ETL solution separate from our main DB. Solution: Punted, solved by custom solution
  14. 7. Central Bottleneck by Mongo Noticed that _all_ API response

    times were directly correlated to write load of MongoDB. Our only choice left here was to look into a new DB solution. Solution: Research new DBs!
  15. Researching a new DB

  16. RDBMS In the cloud, without horizontal scalability, I/O would hit

    a limit REAL fast. Didn’t want to deal with custom sharding layer.
  17. Cassandra Our cofounders are from Digg. Enough said.

  18. HBase Saw PyCodeConf talk about system at Mozilla based on

    HBase. We talked to speaker: * Operational nightmare * Took 1 year * No JVM experience at Kiip Not reasonable, for us.
  19. CouchDB * No auto horizontal scaling, you have to do

    it at the app level. * Features weren’t compelling (master/master syncing with phones, CouchApps, etc.). * We didn’t know anyone who used it.
  20. Riak * Attracted to solid academic foundation * Visited and

    talked with Basho developers. * Confident 100% in Basho team before even using product. * Meetups showed real world usage at scale + dev & ops happiness.
  21. Data Migration

  22. Identify Fast-Growing Data •Data we needed horizontally scalable •Session/Device data

    grew at exponential rate. •Move that data first, keep the rest in MongoDB (for now).
  23. Identify Fast-Growing Data Session Growth

  24. Session Migration

  25. Sessions First • Obviously K/V •Key: UUID, Value: JSON blob.

    •Larger and faster growing than devices.
  26. Data Access Patterns • By UUID (key) for all API

    calls • Fraud: By device ID and IP of session. • 2i compatible ✓
  27. Update ORM • Added Riak backing store driver • No

    application-level changes were necessary • Riak Python client pains: * Protocol buffer interface buggy * No keep-alive (fixed) * Poor error handling (partially fixed, needs work)
  28. Migrate • Write new data to Riak • Read from

    Riak, fallback to MongoDB if missing • After one week, remove the read-only MongoDB fallback. * Didn’t migrate data because ETL sent it to S3 anyway.
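The migration steps on this slide amount to a thin wrapper around two stores. The sketch below uses plain dicts as stand-ins for the Riak and MongoDB backends; `MigratingStore` and its method names are hypothetical and are not the ORM driver described in the talk.

```python
class MigratingStore:
    """Writes go to the new store only; reads fall back to the old store."""

    def __init__(self, new_store, old_store):
        self.new = new_store  # stand-in for the Riak-backed store
        self.old = old_store  # stand-in for the read-only MongoDB store

    def put(self, key, value):
        # New data is never written to the old store.
        self.new[key] = value

    def get(self, key):
        value = self.new.get(key)
        if value is None:
            # Not yet in the new store: serve it from the old one.
            value = self.old.get(key)
        return value
```

Once reads stop reaching the old store, the fallback branch can be deleted. Nothing is copied between stores here, which matches the slide's note that the data was not migrated.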
  29. Device Migration

  30. Devices • Huge • Growing • But... not obviously K/V.

  31. Not Obviously K/V • Canonical ID (UUID), assigned by us.

    • Vendor ID (ADID, UDID, etc.), assigned by device vendor. • Uniqueness constraint on each, so 2i not possible.
  32. Uniqueness in Riak, Part 1 * Device bucket: Key: Canonical ID,

    Value: JSON blob. * Device_UUID bucket: Key: Vendor ID, Value: Canonical ID. * Simulate uniqueness using If-None-Match. * Cross fingers and hope consistency isn’t too bad.
  33. Part 1: Results FAILURE

  34. Part 1: Results • Latency: At least 200ms, at most

    2000ms • Map/Reduce JS VMs quickly overwhelmed • Hundreds of inconsistencies per hour
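One way to picture how a check-then-write uniqueness scheme can produce inconsistencies in an eventually consistent store is the toy model below. It is a deliberately simplified assumption, not a description of Riak's internals: two dicts play the role of replicas that have not yet synchronized, and `put_if_absent` is a hypothetical stand-in for a conditional write such as If-None-Match.

```python
class EventuallyConsistentStore:
    """Toy model: two replicas that have not yet exchanged their writes."""

    def __init__(self):
        self.replicas = [{}, {}]

    def put_if_absent(self, replica, key, value):
        # The existence check only sees what this one replica knows about.
        if key in self.replicas[replica]:
            return False
        self.replicas[replica][key] = value
        return True

    def conflicts(self):
        # Keys that both replicas accepted with different values.
        a, b = self.replicas
        return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


store = EventuallyConsistentStore()
# Two requests for the same vendor ID arrive at different replicas.
first = store.put_if_absent(0, "vendor-123", "canonical-A")
second = store.put_if_absent(1, "vendor-123", "canonical-B")
```

In this model both conditional writes succeed, so the same vendor ID ends up mapped to two canonical IDs until something reconciles them.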
  35. Uniqueness, Part 2 • Just don’t do it. • Canonical

    ID = SHA1(Vendor ID) • Backfill old data (30MM rows, days of backfill) • Success, use Riak as a K/V store!
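Deriving the canonical ID from the vendor ID makes the mapping deterministic, so no lookup table or uniqueness check is needed. A minimal sketch, assuming UTF-8 encoding and a hex digest (the slide only says SHA1(Vendor ID)); `canonical_id` is a hypothetical helper name.

```python
import hashlib


def canonical_id(vendor_id: str) -> str:
    # Same vendor ID always yields the same key, so two requests for one
    # device can never end up with two different canonical IDs.
    return hashlib.sha1(vendor_id.encode("utf-8")).hexdigest()
```

The device record can then be stored under `canonical_id(vendor_id)` as an ordinary key/value write.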
  36. Riak In Production Our experience over 3 months.

  37. DISCLAIMER Riak has been extremely solid. However, there are minor

    pain points that could be, and have been, addressed.
  38. Scale Early * Latencies explode under heavy I/O. Attempting to

    add a new node adds more I/O pressure for handoff. * Add new nodes early. * Hard to know when just beginning. Watch your FSM latencies carefully. Scaling at the red line is painful.
  39. 2i is slow, don’t use in real time * Normal

    EC2 get: 5ms * 2i EC2 get: 2000ms Fine for occasional background queries, not okay for queries on live requests.
  40. JS Map/Reduce is slow, easily overwhelmed. Slow, to be expected,

    so don’t use for live requests. JS VMs take a lot of RAM, limited quantity, you can run out very quickly. Riak currently doesn’t handle this well, but they’re working on it.
  41. LevelDB: More Levels, More Pain Each additional level adds a

    disk seek, which is killer in the cloud. We use it because we need 2i. On EC2 ephemeral storage, each additional disk seek adds about 10ms.
  42. Riak Control Unusable with slow internet connection due to PJAX

    bullshit. Really bad for Ops people on the road (MiFis, international, etc.). Otherwise great. Basho is aware of the problem. Requires low-latency connection
  43. Operational Issues, Part 1

  44. Operational Issues, Part 2 • Cluster state under exceptional conditions

    doesn’t converge. • Add/Remove the same node many times (usually due to automation craziness) • EC2 partial node failures + LevelDB?
  45. Killing MongoDB So much fire.

  46. Non K/V Data • Not fast growing • Rich querying

    needed • Solution: PostgreSQL • Highly recommended.
  47. Geo • We actually still use MongoDB, for now. •

    Will move to PostGIS eventually. • Not high pressure, low priority.
  48. Closing Remarks • Scaling is hard • Nothing is a

    magic bullet • Look for easy wins that matter. • Rinse and repeat, converge to a scalable system.
  49. Closing Remarks For horizontally scalable key/value data, Riak is the

    right choice.
  50. Thanks! Q/A?