Slide 1

Firehose Storage at Paper.li
Pierre-Yves Ritschard
July 25, 2012

Slide 2

A quick introduction

paper.li
- Baseline: “Curation platform enabling millions of users to become the publishers of their own daily newspapers.”
- Canonical use case: a daily newspaper on an interest
- Feeds on a wide variety of “social media” sources

@pyr
- Lead architect at Smallrivers, I like to build big stuff
- Long-time involvement in distributed systems and scalability
- Recent FP and big data convert

Slide 3

About paper.li

Slide 4

Before Cassandra
- Standard RoR stack
- MySQL: strongly normalized datastore
- Memcached

Slide 5

Storage workload
- Two write-heavy pipelines
  - Continuous source aggregation
  - Edition publisher
- Random read workload
  - Huge mass of somewhat popular newspapers

Slide 6

Storage pain points
You know the drill:
- Crazy cache warm-up
- Impossible schema changes
- Somewhat manageable read workload
- Unmanageable write workload

Slide 7

Requirements
- Improving the status quo
  - I/O pressure release
  - Constant time writes
  - Limiting operations overhead
- Going further
  - More metadata
  - Analytics
  - Adapting behavior
  - Storing way more data

Slide 8

Considered solutions
- Sharding MySQL
- HBase / Voldemort / ElephantDB
- Riak
- Apache Cassandra

Slide 9

Cassandra winning points: operations
- Standalone stack
- P2P architecture: no SPOF
- Multi-datacenter support
- Extensive JMX support

Slide 10

Cassandra winning points: modeling
- Flexible schema
- Easy index and relations storage
- Distributed counters
- JVM compatibility - we’re a Clojure shop
- One-stop-shop answer for our storage needs (almost)

Slide 11

Our usage of Cassandra
- Entity storage: papers, sources
- Relations and indices: articles found in sources, ordered by time of appearance
- Logs and events: events happening on a source
- Analytics: view hits, contributor appearances
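
The talk itself shows no client code; as a rough illustration of these four uses, here is a sketch with the pycassa Python client (paper.li's real code is Clojure, and the keyspace, column family and key names Articles, SourceArticles and PaperHits are invented for the example):

import uuid
import pycassa

# Illustrative only: keyspace, column family and key names are made up.
pool = pycassa.ConnectionPool('paperli', ['localhost:9160'])

articles = pycassa.ColumnFamily(pool, 'Articles')               # entity storage
source_articles = pycassa.ColumnFamily(pool, 'SourceArticles')  # wide-row index
paper_hits = pycassa.ColumnFamily(pool, 'PaperHits')            # counter CF

article_id = uuid.uuid1()  # TimeUUID: sorts by time of appearance

# Entity: one row per article.
articles.insert(str(article_id), {'url': 'http://example.com/post',
                                  'title': 'Some post'})

# Index: one wide row per source; TimeUUID column names keep the
# articles ordered by time of appearance.
source_articles.insert('source:twitter/@pyr', {article_id: str(article_id)})

# Analytics: distributed counter for view hits.
paper_hits.add('paper:some-daily', 'views', 1)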

Slide 12

Data layout
- Data types
  - Row keys, column names and column values use a serializer
  - UTF-8 String, UUID (TimeUUID), Long, Composite, BYO
- Keyspaces: the equivalent of databases
- Column families: the equivalent of tables
  - No fixed number of columns in rows (wide rows)
  - Column metadata can exist
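
To make the keyspace / column family / serializer vocabulary concrete, here is a hedged sketch of how such a schema could be declared with pycassa's SystemManager; the keyspace and column family names are hypothetical, and the same can be done from cassandra-cli or CQL:

from pycassa.system_manager import (SystemManager, SIMPLE_STRATEGY,
                                    UTF8_TYPE, TIME_UUID_TYPE,
                                    COUNTER_COLUMN_TYPE)

sys_mgr = SystemManager('localhost:9160')

# Keyspace: the equivalent of a database.
sys_mgr.create_keyspace('paperli', SIMPLE_STRATEGY,
                        {'replication_factor': '3'})

# Wide-row column family: UTF-8 row keys, TimeUUID column names
# (the comparator), UTF-8 column values.
sys_mgr.create_column_family('paperli', 'SourceArticles',
                             key_validation_class=UTF8_TYPE,
                             comparator_type=TIME_UUID_TYPE,
                             default_validation_class=UTF8_TYPE)

# Counter column family for analytics.
sys_mgr.create_column_family('paperli', 'PaperHits',
                             key_validation_class=UTF8_TYPE,
                             comparator_type=UTF8_TYPE,
                             default_validation_class=COUNTER_COLUMN_TYPE)

sys_mgr.close()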

Slide 13

What storage looks like
One can think of Cassandra CFs as double-depth hash tables:

{"twitter": {
    "Users": {
        "@steeve": {"name": "Steeve Morin"}
    },
    "Followers": {
        "@steeve": {"@pyr": null, "@camping": null}
    },
    "Timelines": {
        "@steeve": {"2012-07-25-00:00:00": "meetup !",
                    "2012-07-24-00:01:34": "foo"}
    }
}}

Slide 14

Cassandra schema specifics
- Rows don’t need to look alike
- Columns are sorted by column name
- Column values can have arbitrary types
- You don’t need column values
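
A small sketch of the last two points, again with pycassa and made-up names: the column name itself carries the data, the value stays empty, and the slice comes back sorted by column name.

import pycassa

pool = pycassa.ConnectionPool('twitter', ['localhost:9160'])
followers = pycassa.ColumnFamily(pool, 'Followers')

# Valueless columns: only the column names matter.
followers.insert('@steeve', {'@camping': '', '@pyr': ''})

# Returns an OrderedDict whose keys are sorted by column name.
print(followers.get('@steeve', column_count=100))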

Slide 15

Denormalization: what and why?
- Copying the same data in different places
- Reads are more expensive than writes
- Hard disk storage is a commodity

Slide 16

Denormalization canonical example: before

SELECT * FROM UserFollowers f, Tweets t
 WHERE f.user_name = "@pyr"
   AND t.user_name = f.followee_name;

Slide 17

Denormalization canonical example: after

SELECT * FROM Timelines WHERE KEY = "@pyr";
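
Over the Thrift API the same denormalized read is a single row slice rather than a join; a pycassa sketch with hypothetical names:

import pycassa

pool = pycassa.ConnectionPool('twitter', ['localhost:9160'])
timelines = pycassa.ColumnFamily(pool, 'Timelines')

# One row, newest columns first: no join, cost proportional to the slice.
latest = timelines.get('@pyr', column_count=20, column_reversed=True)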

Slide 18

Consistency levels
A way to express your requirements.

Read consistency
- ONE: response from the closest replica
- QUORUM: (replication factor / 2) + 1 replicas must agree
- ALL: all replicas must agree

Write consistency
- ANY: the write reached one node
- ONE: the write must have been successfully performed on at least one of the replicas
- QUORUM: (replication factor / 2) + 1 replicas must have successfully performed the write
- ALL: all replicas must have performed the write

Slide 19

Dealing with the CAP theorem
Choose the strategy that matches the data you are handling.
- Storing entities is sensitive
  - Ensure a high consistency level, e.g. read and write at QUORUM
  - Regular CF snapshots
- Storing events can sustain consistency mishaps
  - Writing at ONE should be sufficient
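
As an illustration of choosing levels per data type, a client can set default read and write consistency per column family; a pycassa sketch with hypothetical CF names:

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool('paperli', ['localhost:9160'])

# Sensitive entity storage: read and write at QUORUM.
papers = pycassa.ColumnFamily(pool, 'Papers',
                              read_consistency_level=ConsistencyLevel.QUORUM,
                              write_consistency_level=ConsistencyLevel.QUORUM)

# Event and log storage tolerates consistency mishaps: write at ONE.
events = pycassa.ColumnFamily(pool, 'SourceEvents',
                              write_consistency_level=ConsistencyLevel.ONE)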

Slide 20

Let’s talk numbers
- More than 15,000 posts computed per second (peak)
  - 200M per day on average
  - Associated social counters updated for analytics
  - Associated log event storage for scheduler input
- More than 3,000 articles computed per second (peak)
- 600k paper editions per day
  - Each pulling from wide rows to filter, rank and output an edition

Slide 21

Some gotchas
- Don’t forget to cache
- When possible, use fast disks (SSD)
- Give your instances space to breathe
- Split clusters
- Node operations aren’t free

Slide 22

Questions?
@pyr, https://github.com/pyr
Slides soon on http://spootnik.org