Slide 1

Slide 1 text

Scaling Riak to 25MM Ops/Day at Kiip

Slide 2

Slide 2 text

Armon Dadgar @armondadgar Mitchell Hashimoto @mitchellh

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

API Flow: Session Start → Moment/Reward (0..n times) → Session End

Slide 6

Slide 6 text

The Numbers
• x million unique devices per day
• About 4 API calls per session
• = ~25 million API calls per day

Slide 7

Slide 7 text

The Journey of Scale: A Story of MongoDB
* Let’s talk about our scaling journey, specifically with MongoDB.
* We started with MySQL, but switched to MongoDB before we had any real traffic.

Slide 8

Slide 8 text

1. Write Limit Hit by Analytics
* Analytics sent hundreds of atomic updates per second.
* Hit the write limit thanks to the global write lock.
* Solution: Aggregate over 10 seconds and send small bursts of updates, resulting in a lower lock % on average.
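
A minimal sketch of that batching idea, assuming a pymongo connection; the database, collection, and metric names are illustrative placeholders, not Kiip's actual schema. Increments accumulate in process memory and a background thread flushes one $inc per metric every 10 seconds.

```python
# Hedged sketch: buffer analytics increments in memory, flush every 10 seconds.
import threading
import time
from collections import Counter

from pymongo import MongoClient

FLUSH_INTERVAL = 10  # seconds between write bursts
stats = MongoClient()["analytics"]["counters"]   # illustrative names
_pending = Counter()
_lock = threading.Lock()

def record(metric, amount=1):
    """Accumulate an increment in memory instead of writing to MongoDB immediately."""
    with _lock:
        _pending[metric] += amount

def _flush_forever():
    while True:
        time.sleep(FLUSH_INTERVAL)
        with _lock:
            batch = dict(_pending)
            _pending.clear()
        for metric, amount in batch.items():
            # One $inc per metric per interval instead of hundreds of writes per second.
            stats.update_one({"_id": metric}, {"$inc": {"value": amount}}, upsert=True)

threading.Thread(target=_flush_forever, daemon=True).start()
```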

Slide 9

Slide 9 text

2. Too Many Reads (1000s/s)
* We were reading too much and hit MongoDB’s maximum read throughput.
* Solution: Heavy caching. Cache everywhere.
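
A tiny cache-aside sketch of what "cache everywhere" looks like in practice; the in-process TTL dictionary stands in for whatever cache layer you actually run (e.g. memcached), and the collection name and TTL are illustrative.

```python
# Hedged cache-aside sketch: serve reads from a cache, fall through to MongoDB only on a miss.
import time

from pymongo import MongoClient

games = MongoClient()["kiip"]["games"]   # illustrative collection
_cache = {}                              # key -> (expires_at, value)
TTL = 60                                 # seconds, illustrative

def get_game(game_id):
    hit = _cache.get(game_id)
    if hit and hit[0] > time.time():
        return hit[1]                          # cache hit: no MongoDB read at all
    doc = games.find_one({"_id": game_id})     # cache miss: one read, then cached
    _cache[game_id] = (time.time() + TTL, doc)
    return doc
```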

Slide 10

Slide 10 text

3. Slow, Uncacheable Queries
Example: “Has device X played game Y today/last week/this month/in all time?”
* Touches lots of data
* Requires lots of index space
* Not cacheable
* MongoDB was just... slow.
* Solution: Bloom filters!
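
A rough, self-contained sketch of the bloom-filter idea: answer “has device X played game Y in this window?” from a fixed-size bit array instead of a large indexed query. The sizes, hash count, and key format below are illustrative, not what Kiip actually shipped.

```python
# Hedged bloom-filter sketch: membership tests may return false positives, never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# One filter per time window (today, this week, ...), keyed by device and game.
played_today = BloomFilter()
played_today.add("device-123:game-456")
print("device-123:game-456" in played_today)   # True
print("device-999:game-456" in played_today)   # almost certainly False
```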

Slide 11

Slide 11 text

4. Write Limit Hit, Again
* Basic model updates were hitting MongoDB’s write throughput limit.
* Solution: Two distinct MongoDB clusters for disjoint datasets, one for analytics (heavy writes) and one for everything else, so they don’t share a global write lock.
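
In application code the split is just two independent client handles; the cluster hostnames and database names below are placeholders.

```python
# Hedged sketch of the two-cluster split: analytics writes can no longer stall
# the rest of the API behind a shared write lock. Hostnames are placeholders.
from pymongo import MongoClient

analytics_db = MongoClient("mongodb://analytics-cluster:27017")["analytics"]
core_db = MongoClient("mongodb://core-cluster:27017")["kiip"]

analytics_db["events"].insert_one({"type": "impression"})   # heavy-write side
core_db["devices"].find_one({"_id": "device-123"})          # everything else
```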

Slide 12

Slide 12 text

5. Index Size Hit Memory Limits
* We didn’t scale vertically because we’re pretty operationally frugal and the data was growing very fast.
* Solution: ETL (Extract/Transform/Load) and reap old data: archive it to S3, then remove it from the main DB.

Slide 13

Slide 13 text

6. ETL Overwhelmed
* Extracting 24 hours of data took longer than 24 hours, limited by MongoDB read throughput.
* We decided to let it break and continue reaping data.
* Solution: Punted. Later solved by a continuous ETL system separate from our main DB.

Slide 14

Slide 14 text

7. MongoDB Became the Central Bottleneck
* Noticed that _all_ API response times were directly correlated with MongoDB’s write load.
* The only choice left was to look into a new database.
* Solution: Research new DBs!

Slide 15

Slide 15 text

Researching a new DB

Slide 16

Slide 16 text

RDBMS
* In the cloud, without horizontal scalability, I/O would hit a limit REAL fast.
* Didn’t want to deal with a custom sharding layer.

Slide 17

Slide 17 text

Cassandra Our cofounders are from Digg. Enough said.

Slide 18

Slide 18 text

HBase
Saw a PyCodeConf talk about a system at Mozilla based on HBase. We talked to the speaker:
* Operational nightmare
* Took 1 year
* No JVM experience at Kiip
Not reasonable for us.

Slide 19

Slide 19 text

CouchDB
* No automatic horizontal scaling; you have to do it at the app level.
* Features weren’t compelling (master/master syncing with phones, CouchApps, etc.).
* We didn’t know anyone who used it.

Slide 20

Slide 20 text

Riak
* Attracted to its solid academic foundation.
* Visited and talked with Basho developers.
* 100% confident in the Basho team before even using the product.
* Meetups showed real-world usage at scale + dev & ops happiness.

Slide 21

Slide 21 text

Data Migration

Slide 22

Slide 22 text

Identify Fast-Growing Data
• Data we needed to be horizontally scalable.
• Session/device data grew at an exponential rate.
• Move that data first, keep the rest in MongoDB (for now).

Slide 23

Slide 23 text

Identify Fast-Growing Data: Session Growth (chart)

Slide 24

Slide 24 text

Session Migration

Slide 25

Slide 25 text

Sessions First
• Obviously K/V.
• Key: UUID, Value: JSON blob.
• Larger and faster growing than devices.

Slide 26

Slide 26 text

Data Access Patterns
• By UUID (key) for all API calls.
• Fraud: by device ID and IP of the session.
• 2i compatible ✓
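
A hedged sketch of that access pattern with the Riak Python client (method names follow the later 2.x client and vary a bit by version; the bucket, index, and field names are illustrative): sessions are stored by UUID, with secondary indexes for the fraud lookups.

```python
# Hedged sketch: session key = UUID, value = JSON blob, plus 2i entries so the
# fraud pipeline can look sessions up by device ID or IP.
import uuid

import riak

client = riak.RiakClient()                  # host/port defaults are deployment-specific
sessions = client.bucket("sessions")

def store_session(device_id, ip, payload):
    key = str(uuid.uuid4())
    obj = sessions.new(key, data=payload)           # stored as JSON by the client
    obj.add_index("device_id_bin", device_id)       # secondary index for fraud checks
    obj.add_index("ip_bin", ip)
    obj.store()
    return key

def get_session(key):
    return sessions.get(key).data                   # the live-request path: key lookup only

def sessions_for_device(device_id):
    # Background/fraud use only; 2i is far too slow for live requests (see later slides).
    return list(sessions.get_index("device_id_bin", device_id))
```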

Slide 27

Slide 27 text

Update ORM
• Added a Riak backing-store driver.
• No application-level changes were necessary.
• Riak Python client pains:
* Protocol buffer interface buggy
* No keep-alive (fixed)
* Poor error handling (partially fixed, needs work)

Slide 28

Slide 28 text

Migrate
• Write new data to Riak.
• Read from Riak, fall back to MongoDB if missing.
• After one week, remove the read-only MongoDB fallback.
• Didn’t migrate old data because ETL sent it to S3 anyway.
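
The fallback read in sketch form, assuming the same illustrative bucket and collection names as before (attribute names differ slightly between Riak client versions):

```python
# Hedged sketch of the cutover: writes go to Riak, reads try Riak first and
# fall back to the legacy MongoDB collection for pre-migration data.
import riak
from pymongo import MongoClient

riak_sessions = riak.RiakClient().bucket("sessions")
mongo_sessions = MongoClient()["kiip"]["sessions"]

def read_session(key):
    obj = riak_sessions.get(key)
    if obj.exists:                                   # present in Riak: new source of truth
        return obj.data
    return mongo_sessions.find_one({"_id": key})     # read-only legacy fallback
```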

Slide 29

Slide 29 text

Device Migration

Slide 30

Slide 30 text

Devices
• Huge
• Growing
• But... not obviously K/V.

Slide 31

Slide 31 text

Not Obviously K/V
• Canonical ID (UUID), assigned by us.
• Vendor ID (ADID, UDID, etc.), assigned by the device vendor.
• Uniqueness constraint on each, so 2i is not possible.

Slide 32

Slide 32 text

Uniqueness in Riak, Part 1
• Device bucket: Key = Canonical ID, Value = JSON blob.
• Device_UUID bucket: Key = Vendor ID, Value = Canonical ID.
• Simulate uniqueness using If-None-Match.
• Cross fingers and hope consistency isn’t too bad.
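
A heavily hedged sketch of what that conditional write might look like against Riak’s HTTP interface; the endpoint layout, the exact If-None-Match semantics, and the node address are all assumptions for illustration (and, per the next slide, this whole approach failed).

```python
# Hedged sketch (assumptions throughout): reserve a Vendor ID -> Canonical ID
# mapping only if no mapping exists yet, using an If-None-Match conditional PUT.
import requests

RIAK = "http://127.0.0.1:8098"    # placeholder node address

def claim_vendor_id(vendor_id, canonical_id):
    resp = requests.put(
        f"{RIAK}/buckets/device_uuid/keys/{vendor_id}",
        data=canonical_id,
        headers={
            "Content-Type": "text/plain",
            "If-None-Match": "*",     # assumption: only store when the key is absent
        },
    )
    # A precondition failure would mean another canonical ID already claimed this vendor ID.
    return resp.status_code != 412
```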

Slide 33

Slide 33 text

Part 1: Results FAILURE

Slide 34

Slide 34 text

Part 1: Results
• Latency: at least 200ms, at most 2000ms.
• Map/Reduce JS VMs quickly overwhelmed.
• Hundreds of inconsistencies per hour.

Slide 35

Slide 35 text

Uniqueness, Part 2
• Just don’t do it.
• Canonical ID = SHA1(Vendor ID).
• Backfill old data (30MM rows, days of backfill).
• Success: use Riak as a plain K/V store!
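
The Part 2 fix fits in a couple of lines: derive the canonical ID deterministically from the vendor ID, so no uniqueness check is needed at write time. The encoding details below are an assumption.

```python
# The same vendor ID always hashes to the same canonical ID, so device writes
# become plain Riak puts with no coordination or conditional requests.
import hashlib

def canonical_id(vendor_id):
    return hashlib.sha1(vendor_id.encode("utf-8")).hexdigest()

print(canonical_id("ADID-1234-ABCD"))   # stable key for this device, every time
```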

Slide 36

Slide 36 text

Riak in Production
Our experience over 3 months.

Slide 37

Slide 37 text

DISCLAIMER
Riak has been extremely solid. However, there are minor pain points that could be, and in some cases have been, addressed.

Slide 38

Slide 38 text

Scale Early
* Latencies explode under heavy I/O, and adding a new node creates even more I/O pressure for handoff.
* Add new nodes early.
* Hard to know when to scale when you’re just beginning: watch your FSM latencies carefully.
* Scaling at the red line is painful.
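
One hedged way to watch those latencies: poll a node’s /stats endpoint and alert long before the numbers hit the red line. The stat keys follow the node_get_fsm_time_* / node_put_fsm_time_* names Riak exposes (values in microseconds), and the node address and threshold are purely illustrative.

```python
# Hedged monitoring sketch: read FSM latency percentiles from Riak's /stats
# endpoint and warn early. Node address and threshold are placeholders.
import requests

NODE = "http://127.0.0.1:8098"
THRESHOLD_US = 50_000            # e.g. 50ms at the 95th percentile

def check_fsm_latencies():
    stats = requests.get(f"{NODE}/stats").json()
    for key in ("node_get_fsm_time_95", "node_put_fsm_time_95"):
        value = stats.get(key, 0)
        if value > THRESHOLD_US:
            print(f"WARNING: {key} = {value}us, add capacity before it hurts")

check_fsm_latencies()
```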

Slide 39

Slide 39 text

2i is slow, don’t use it in real time
* Normal get on EC2: 5ms
* 2i get on EC2: 2000ms
* Fine for occasional background queries, not okay for queries on live requests.

Slide 40

Slide 40 text

JS Map/Reduce is slow and easily overwhelmed
* Slow, as expected, so don’t use it for live requests.
* JS VMs take a lot of RAM and are limited in quantity, so you can run out very quickly.
* Riak currently doesn’t handle this well, but Basho is working on it.

Slide 41

Slide 41 text

LevelDB: More Levels, More Pain
* Each additional level adds a disk seek, which is a killer in the cloud.
* On EC2 ephemeral storage, each additional disk seek adds about 10ms.
* We use LevelDB because we need 2i.

Slide 42

Slide 42 text

Riak Control
* Requires a low-latency connection; unusable over a slow internet connection due to PJAX bullshit.
* Really bad for ops people on the road (MiFis, international, etc.).
* Otherwise great, and Basho is aware of the problem.

Slide 43

Slide 43 text

Operational Issues, Part 1

Slide 44

Slide 44 text

Operational Issues, Part 2
• Cluster state under exceptional conditions doesn’t converge.
• Adding/removing the same node many times (usually due to automation craziness).
• EC2 partial node failures + LevelDB?

Slide 45

Slide 45 text

Killing MongoDB So much fire.

Slide 46

Slide 46 text

Non-K/V Data
• Not fast growing.
• Rich querying needed.
• Solution: PostgreSQL. Highly recommended.

Slide 47

Slide 47 text

Geo
• We actually still use MongoDB, for now.
• Will move to PostGIS eventually.
• Not high pressure, low priority.

Slide 48

Slide 48 text

Closing Remarks
• Scaling is hard.
• Nothing is a magic bullet.
• Look for easy wins that matter.
• Rinse and repeat; converge to a scalable system.

Slide 49

Slide 49 text

Closing Remarks
For horizontally scalable key/value data, Riak is the right choice.

Slide 50

Slide 50 text

Thanks! Q/A?