Slide 1

Slide 1 text

Rebuilding Twitter with Cassandra
Friday 28th March @ WIDS 2014, Hong Kong
Matthew Rudy Jacobs

Slide 2

Slide 2 text

@matthewrudy

Slide 3

Slide 3 text

Let’s rebuild twitter!

Slide 4

Slide 4 text

matthewrudy/twissandra-rb

Slide 5

Slide 5 text

a Ruby re-implementation of Twissandra

Slide 6

Slide 6 text

Fork it!

Slide 7

Slide 7 text

What is Twitter?

Slide 8

Slide 8 text

People

Slide 9

Slide 9 text

Tweets

Slide 10

Slide 10 text

Friends

Slide 11

Slide 11 text

Followers

Slide 12

Slide 12 text

Timeline

Slide 13

Slide 13 text

A whole load of other complex stuff

Slide 14

Slide 14 text

Complex stuff
• @mentions
• replies
• retweets
• favourites
• sources
• the streaming API

Slide 15

Slide 15 text

What’s so hard about that?

Slide 16

Slide 16 text

Nothing

Slide 17

Slide 17 text

Let’s do it in SQL

Slide 18

Slide 18 text

Relational Approach

class User < ActiveRecord::Base
  has_many :friendships
  has_many :friends, through: :friendships
  has_many :tweets
  # my timeline is the tweets written by the people I follow
  has_many :timeline_tweets, through: :friends, source: :tweets
end

me = User.find_by(username: "me")
me.timeline_tweets.order("tweets.created_at DESC").limit(20)

Slide 19

Slide 19 text

BOOM!

Slide 20

Slide 20 text

We’re done!

Slide 21

Slide 21 text

Thanks

Slide 22

Slide 22 text

Just kidding

Slide 23

Slide 23 text

When stuff gets big?

Slide 24

Slide 24 text

We have 2 problems

Slide 25

Slide 25 text

Throughput

Slide 26

Slide 26 text

Add Read Slaves

[Diagram: Client, Tweet master, Tweet slave #1, Tweet slave #2]

Slide 27

Slide 27 text

That’s fine

Slide 28

Slide 28 text

Data size

Slide 29

Slide 29 text

10 TB of data?
• Heroku Postgres has a 1TB limit
• RDS Postgres has a 3TB limit
• Amazon I2 8xlarge has a 6TB limit
• buy a massive 12TB disk array?

Slide 30

Slide 30 text

Shard it?

[Diagram: Client, Tweet id generator, and three shards (Tweets#1, Tweets#2, Tweets#3) selected by id % 3 == 0, 1 or 2]
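A minimal sketch of what the client-side routing in that diagram could look like. The shard names, the `shard_for` helper and the `TweetIdGenerator` class are all hypothetical, and which remainder maps to which shard is an assumption, since the slide only shows `id % 3`.

```ruby
# Hypothetical shard names; remainder 0 is assumed to map to the first shard.
SHARDS = %w[tweets_1 tweets_2 tweets_3].freeze

# Pick the shard that stores a given tweet id.
def shard_for(tweet_id)
  SHARDS[tweet_id % SHARDS.size]
end

# A single id generator keeps ids unique across all shards.
# It is also a single point of failure, which is one of the caveats of sharding.
class TweetIdGenerator
  def initialize
    @next_id = 0
    @lock = Mutex.new
  end

  def next_id
    @lock.synchronize { @next_id += 1 }
  end
end
```

Note that changing the number of shards changes `id % n` for almost every id, so resharding means moving most of the data.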

Slide 31

Slide 31 text

Joins?

Slide 32

Slide 32 text

NO!

Slide 33

Slide 33 text

Joins don’t scale!

Slide 34

Slide 34 text

You can scale SQL

Slide 35

Slide 35 text

But there’s a whole load of caveats

Slide 36

Slide 36 text

“On the shoulders of giants”

Slide 37

Slide 37 text

Cassandra

Slide 38

Slide 38 text

NoSQL?

Slide 39

Slide 39 text

No, CQL!

Slide 40

Slide 40 text

CREATE TABLE

CREATE TABLE tweets (
  id timeuuid,
  user_id uuid,
  body text,
  mentions set<uuid>, // element type lost in extraction; uuid is assumed

  PRIMARY KEY (id)
);

CREATE INDEX ON tweets (user_id);

Slide 41

Slide 41 text

INSERT INTO

INSERT INTO tweets (id, user_id, body)
VALUES (1, 1337, 'This is my first tweet!');

Slide 42

Slide 42 text

SELECT FROM

SELECT * FROM tweets
WHERE user_id = 1337;

Slide 43

Slide 43 text

It’s awesome for timelines

Slide 44

Slide 44 text

Timelines

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,

  PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

SELECT * FROM timeline
WHERE user_id = 1337
AND tweet_id < minTimeuuid('2013-02-02 10:00+0000');

Slide 45

Slide 45 text

“it’s like SQL, but without all the stuff that doesn’t scale”

Slide 46

Slide 46 text

Types

Slide 47

Slide 47 text

Types
• int - 1234567
• double - 123.456
• text - 'abc123'
• uuid - 756716f7-2e54-4715-9f00-91dcbea6cf50
• timeuuid - NOW()
• timestamp - '2013-06-13 11:00:00'

Slide 48

Slide 48 text

TimeUUID is awesome!
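Part of why: a timeuuid is a version 1 UUID, so the creation time is baked into the id itself. A rough sketch of pulling it back out in plain Ruby follows. The helper name `timeuuid_to_time` is made up for illustration; the bit layout is the standard version 1 UUID layout.

```ruby
# 100-nanosecond ticks between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET = 0x01B21DD213814000

# Hypothetical helper: recover the timestamp embedded in a version 1 UUID string.
def timeuuid_to_time(uuid)
  time_low, time_mid, time_hi = uuid.split("-").first(3).map { |part| part.to_i(16) }
  raise ArgumentError, "not a version 1 UUID" unless (time_hi >> 12) == 1

  # Reassemble the 60-bit tick count: high 12 bits, middle 16 bits, low 32 bits.
  ticks = ((time_hi & 0x0FFF) << 48) | (time_mid << 32) | time_low
  Time.at(Rational(ticks - GREGORIAN_OFFSET, 10_000_000)).utc
end
```

So one column can act as both a unique id and a sort key by time, which is what the timeline tables rely on.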

Slide 49

Slide 49 text

Collections
• Sets - {'dog', 'cat', 'elephant'}
• Lists - ['AMZN', 'AAPL', 'FB']
• Maps - {'GOOG': 1200, 'AAPL': 512}

Slide 50

Slide 50 text

Sets

CREATE TABLE friends (
  user_id uuid,
  friend_ids set<uuid>, // element type lost in extraction; uuid is assumed

  PRIMARY KEY (user_id)
);

UPDATE friends
SET friend_ids = friend_ids + {8007}
WHERE user_id = 1337;

Slide 51

Slide 51 text

Counters

CREATE TABLE user_tweet_counts (
  user_id INT,
  tweet_count COUNTER,

  PRIMARY KEY (user_id)
);

UPDATE user_tweet_counts
SET tweet_count = tweet_count + 1
WHERE user_id = 1337;

INSERT INTO tweets (id, user_id, body)
VALUES (NOW(), 1337, 'some tweet');

Slide 52

Slide 52 text

Distribution Model

Slide 53

Slide 53 text

Consistent Hashing

[Diagram: a Tweet, the Murmur3 Partitioner, and segments A B C D E F G H]
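To make the idea concrete, here is a toy hash ring. It is a sketch only: MD5 from Ruby's standard library stands in for Murmur3 (which the stdlib doesn't ship), each node gets a single token, and the `HashRing` class and node names are invented for illustration.

```ruby
require "digest/md5"

# Toy consistent-hash ring. MD5 is a stand-in for Cassandra's Murmur3 partitioner.
class HashRing
  def initialize(nodes)
    # Each node owns one token here; sorting the pairs gives the ring order.
    @ring = nodes.map { |node| [token(node), node] }.sort
  end

  # Map any key (or node name) to a big integer position on the ring.
  def token(key)
    Digest::MD5.hexdigest(key.to_s).to_i(16)
  end

  # A key belongs to the first node at or after its token, wrapping around.
  def node_for(key)
    key_token = token(key)
    owner = @ring.find { |node_token, _node| node_token >= key_token } || @ring.first
    owner.last
  end
end
```

The useful property is that adding a node only takes keys away from its neighbour on the ring; every other key stays where it was, unlike the `id % 3` scheme.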

Slide 54

Slide 54 text

Distributed

[Diagram: segments A to H spread across several nodes, each node holding a subset of them]

replication factor = 3
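One way to read that picture, sketched under loud assumptions: A to H are treated as token ranges, there are four nodes (names made up), each node is the primary for two ranges, and extra replicas go to the next nodes around the ring. The helper names are hypothetical.

```ruby
NODES  = %w[node1 node2 node3 node4].freeze # hypothetical node names, in ring order
RANGES = %w[A B C D E F G H].freeze         # assumed to be token ranges

# The primary owner plus the next nodes clockwise, up to the replication factor.
def replicas_for(range, replication_factor: 3)
  primary = RANGES.index(range) / (RANGES.size / NODES.size)
  Array.new(replication_factor) { |i| NODES[(primary + i) % NODES.size] }
end

# Every range a node ends up storing, as primary or as a replica.
def ranges_on(node, replication_factor: 3)
  RANGES.select { |range| replicas_for(range, replication_factor: replication_factor).include?(node) }
end
```

With these numbers every node holds six of the eight ranges, so any single node can go away and each range still has two live copies.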

Slide 55

Slide 55 text

Throughput, Redundancy, Availability

Slide 56

Slide 56 text

Write - Quorum

[Diagram: Client, Node 1, Node 2, Node 3]

Slide 57

Slide 57 text

Read - Quorum

[Diagram: Client, Node 1, Node 2, Node 3]

Slide 58

Slide 58 text

Write - ALL

[Diagram: Client, Node 1, Node 2, Node 3]

Slide 59

Slide 59 text

Read - ONE

[Diagram: Client, Node 1, Node 2, Node 3]
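The arithmetic behind those four slides, as a small sketch. The function names are made up; the rule itself is the usual one: a read is guaranteed to see the latest write when the nodes written plus the nodes read add up to more than the replication factor.

```ruby
# How many replicas must answer for a given consistency level.
def required_acks(level, replication_factor)
  case level
  when :one    then 1
  when :quorum then replication_factor / 2 + 1 # integer division: 3 replicas -> 2
  when :all    then replication_factor
  else raise ArgumentError, "unknown consistency level: #{level}"
  end
end

# True when every read is guaranteed to touch at least one replica
# that acknowledged the latest write.
def overlapping?(write_level, read_level, replication_factor)
  writes = required_acks(write_level, replication_factor)
  reads  = required_acks(read_level, replication_factor)
  writes + reads > replication_factor
end
```

With three replicas, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), and so do ALL writes plus ONE reads (3 + 1 > 3). ONE plus ONE does not, so a read may return stale data.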

Slide 60

Slide 60 text

The Accidental Win

Slide 61

Slide 61 text

Multi-region master master

[Diagram: Singapore and Virginia, each with A B C D E F G H; an HK user and an SF user]

Slide 62

Slide 62 text

BOOM!

Slide 63

Slide 63 text

Finally, let’s code!

Slide 64

Slide 64 text

NoSQL needs planning first!

Slide 65

Slide 65 text

P.E.Q.S.

Slide 66

Slide 66 text

P.E.Q.S.
• Processes
• Entities
• Queries
• Schema

Slide 67

Slide 67 text

Processes
• Create a User
• Send a Tweet
• Read my Timeline
• Follow a User
• Unfollow a User

Slide 68

Slide 68 text

Create a User
• store User details

Slide 69

Slide 69 text

Send a Tweet
• create a Tweet for Me
• add Tweet to my Followers' Timelines
• add Tweet to *mentioned* Users' Timelines
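Those three steps can be sketched with plain hashes standing in for the Cassandra tables. This is not the code from twissandra-rb: the `store` layout and the `send_tweet` signature are assumptions, and a random UUID stands in for the timeuuid the real schema uses.

```ruby
require "securerandom"

# store is a hash of in-memory "tables":
#   :tweets    tweet_id => tweet details
#   :userline  user_id  => tweet ids written by that user
#   :timeline  user_id  => tweet ids that user should see
#   :followers user_id  => ids of the users following them
def send_tweet(store, user_id, body, mentions: [])
  tweet_id = SecureRandom.uuid # stand-in for a timeuuid

  # 1. create a Tweet for Me
  store[:tweets][tweet_id] = { user_id: user_id, body: body, mentions: mentions }
  (store[:userline][user_id] ||= []) << tweet_id

  # 2 and 3. fan out to followers and mentioned users, once each
  readers = store[:followers].fetch(user_id, []) | mentions
  readers.each { |reader| (store[:timeline][reader] ||= []) << tweet_id }

  tweet_id
end
```

The work happens at write time: one tweet becomes one timeline row per reader, which is what keeps reading a timeline cheap.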

Slide 70

Slide 70 text

Read my Timeline
• find Tweets from my Timeline
• load the Tweet details
• load the User details
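With no joins available, reading a timeline becomes three separate lookups stitched together in the application. A sketch with in-memory hashes in place of tables follows; the `store` layout and `read_timeline` are hypothetical, and the default of 20 just mirrors the `limit(20)` in the relational example.

```ruby
# store is a hash of in-memory "tables":
#   :timeline user_id  => tweet ids, oldest first
#   :tweets   tweet_id => { user_id:, body: }
#   :users    user_id  => user details
def read_timeline(store, user_id, limit: 20)
  # 1. find Tweets from my Timeline, newest first
  tweet_ids = store[:timeline].fetch(user_id, []).last(limit).reverse

  # 2. load the Tweet details
  tweets = tweet_ids.map { |id| store[:tweets].fetch(id).merge(id: id) }

  # 3. load the User details, once per distinct author
  authors = tweets.map { |tweet| tweet[:user_id] }.uniq
  users = authors.to_h { |id| [id, store[:users].fetch(id)] }

  tweets.map { |tweet| tweet.merge(user: users[tweet[:user_id]]) }
end
```

In the real table the ordering comes for free from the clustering order on `tweet_id`, so step 1 would be a single partition read rather than an array slice.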

Slide 71

Slide 71 text

Follow a User
• add User to my Friends
• add Me to the User's Followers
• add User's Tweets to my Timeline

Slide 72

Slide 72 text

Unfollow a User
• remove User from my Friends
• remove Me from User's Followers
• remove User's Tweets from my Timeline
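Follow and unfollow mirror each other, which is easier to see side by side. The sketch below uses in-memory sets and arrays instead of tables; `new_store`, `follow` and `unfollow` are illustrative names, not the actual twissandra-rb code.

```ruby
require "set"

# In-memory stand-ins for the friends, followers, userline and timeline tables.
def new_store
  {
    friends:   Hash.new { |hash, key| hash[key] = Set.new },
    followers: Hash.new { |hash, key| hash[key] = Set.new },
    userline:  Hash.new { |hash, key| hash[key] = [] },
    timeline:  Hash.new { |hash, key| hash[key] = [] }
  }
end

def follow(store, me, user)
  store[:friends][me] << user      # add User to my Friends
  store[:followers][user] << me    # add Me to the User's Followers
  # add User's Tweets to my Timeline, remembering who wrote each one
  store[:userline][user].each do |tweet_id|
    store[:timeline][me] << { tweet_id: tweet_id, tweet_user_id: user }
  end
end

def unfollow(store, me, user)
  store[:friends][me].delete(user)
  store[:followers][user].delete(me)
  # only possible because each timeline entry records its author
  store[:timeline][me].reject! { |entry| entry[:tweet_user_id] == user }
end
```

Keeping the author on every timeline entry is what makes the last step of unfollow workable; without it you would have to load every tweet to find out whose it was.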

Slide 73

Slide 73 text

Entities
• Users
• Tweets
• Userline - tweets from the user
• Timeline - tweets for the user to see
• Friends - people the user follows
• Followers - people who follow the user

Slide 74

Slide 74 text

Users
• id - used for references
• username - used for display - UNIQUE
• metadata - location, photos and stuff

Slide 75

Slide 75 text

Tweets
• id - used for references
• user_id - the user who tweeted
• body - the text of the tweet
• mentions - a list of users mentioned

Slide 76

Slide 76 text

Userline
• user_id - the user who tweeted
• tweet_id - id of the tweet - UNIQUE
• timestamp - when it was tweeted

Slide 77

Slide 77 text

Timeline
• user_id - the user who sees this
• tweet_id - id of the tweet - UNIQUE
• timestamp - when it was tweeted

Slide 78

Slide 78 text

Friends
• user_id - the user
• friend_id - the person they follow
• timestamp - when they were followed

Slide 79

Slide 79 text

Followers
• user_id - the user
• follower_id - the person who follows them
• timestamp - when they were followed

Slide 80

Slide 80 text

Queries

Slide 81

Slide 81 text

Users
• create - insert into database
• find_by_username - find @matthewrudy
• find_all_by_id - find all users for a set of tweets
• UNIQUE by username

Slide 82

Slide 82 text

Tweets
• create
• find_by_id
• UNIQUE by id

Slide 83

Slide 83 text

Userline
• add_tweet_for_user
• find_all_by_user
• ORDER BY timestamp DESC
• UNIQUE by {user, tweet}

Slide 84

Slide 84 text

Timeline
• add_tweet_for_user
• find_all_by_user
• ORDER BY timestamp DESC
• UNIQUE by {user, tweet}
• delete_all_by_tweet_user

Slide 85

Slide 85 text

Friends
• find_all_by_user
• create
• UNIQUE by {user, friend}

Slide 86

Slide 86 text

Followers
• find_all_by_user
• create
• UNIQUE by {user, follower}

Slide 87

Slide 87 text

Schema

Slide 88

Slide 88 text

Keyspace

CREATE KEYSPACE twissandra
WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};

Slide 89

Slide 89 text

Users

CREATE TABLE users (
  id uuid,
  username text,
  location text,

  PRIMARY KEY (id)
);

CREATE INDEX ON users (username);

Slide 90

Slide 90 text

Tweets

CREATE TABLE tweets (
  id timeuuid, // unique id with timestamp
  user_id uuid,
  body text,
  mentions set<uuid>, // element type lost in extraction; uuid is assumed

  PRIMARY KEY (id)
);

Slide 91

Slide 91 text

Userline

CREATE TABLE userline (
  user_id uuid,
  tweet_id timeuuid,

  PRIMARY KEY (user_id, tweet_id)
)
WITH CLUSTERING ORDER BY (tweet_id DESC);

Slide 92

Slide 92 text

Timeline

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_user_id uuid,

  PRIMARY KEY (user_id, tweet_id)
)
WITH CLUSTERING ORDER BY (tweet_id DESC);

Slide 93

Slide 93 text

Friends

CREATE TABLE friends (
  user_id uuid,
  friend_id uuid,
  timestamp timestamp,

  PRIMARY KEY (user_id, friend_id)
);

Slide 94

Slide 94 text

Followers

CREATE TABLE followers (
  user_id uuid,
  follower_id uuid,
  timestamp timestamp,

  PRIMARY KEY (user_id, follower_id)
);

Slide 95

Slide 95 text

the code writes itself, right?

Slide 96

Slide 96 text

user.rb

Slide 97

Slide 97 text

tweet.rb

Slide 98

Slide 98 text

userline.rb

Slide 99

Slide 99 text

timeline.rb

Slide 100

Slide 100 text

friend.rb

Slide 101

Slide 101 text

follower.rb

Slide 102

Slide 102 text

create_user.rb

Slide 103

Slide 103 text

send_tweet.rb

Slide 104

Slide 104 text

read_timeline.rb

Slide 105

Slide 105 text

follow_user.rb

Slide 106

Slide 106 text

unfollow_user.rb

Slide 107

Slide 107 text

Thanks

Slide 108

Slide 108 text

@matthewrudy