Slide 1

Slide 1 text

Martin Kleppmann • [email protected] • http://martinkl.com

Slide 2

Slide 2 text

http://www.flickr.com/photos/2fs/330318471/

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

http://www.flickr.com/photos/nasamarshall/4619659804
Don’t let Architecture Astronauts scare you – Spolsky

Slide 5

Slide 5 text

1. Caching
2. Joins
3. Full-text search

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

// Read path: cache-aside lookup — try the cache first, fall back to the
// database and populate the cache on a miss.
function getDocument(id) {
  var doc = cache.get(id);
  if (doc) {
    return doc;
  } else {
    doc = db.get(id);
    cache.set(id, doc, cacheTTL);
    return doc;
  }
}

Slide 12

Slide 12 text

// Write path: update the cache alongside the database (or invalidate the
// cached entry and let the next read repopulate it).
function updateDocument(id, doc) {
  cache.set(id, doc, cacheTTL); // or: cache.invalidate(id);
  db.set(id, doc);
}

Slide 13

Slide 13 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions (see the sketch below)
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern
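One of those race conditions, for example: with the getDocument()/updateDocument() pair above, a reader that misses the cache can write a stale copy back after a concurrent writer has already updated it. A minimal sketch of that interleaving (my own illustration; the plain-object cache and db here are toy stand-ins for the real stores):

// Toy in-memory stand-ins for the cache and database used by the code above.
var cache = {};
var db = { 42: "old content" };

// Step 1: a reader misses the cache and fetches the old document from the DB...
var readerDoc = cache[42] || db[42];   // readerDoc === "old content"

// Step 2: ...meanwhile a writer updates both the cache and the DB.
cache[42] = "new content";
db[42] = "new content";

// Step 3: the reader, unaware of the concurrent update, writes its stale copy back.
cache[42] = readerDoc;

console.log(cache[42]);  // "old content" — the cache now stays stale until the TTL expires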

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

CREATE TABLE users (
  id integer PRIMARY KEY,
  name varchar(100) NOT NULL
);

CREATE TABLE comments (
  id integer PRIMARY KEY,
  in_reply_to integer REFERENCES comments (id),
  author_id integer REFERENCES users (id),
  content varchar(10000)
);

Slide 16

Slide 16 text

SELECT * FROM comments WHERE in_reply_to = ?;

SELECT * FROM comments
  JOIN users ON author_id = users.id
  WHERE in_reply_to = ?;

users (id, name)
comments (id, in_reply_to, author_id, content)

Slide 17

Slide 17 text

•  blackdragonflower: May good fortune smile upon you foe your kindness.
•  Genmaken: Foe real!
   ▪  N0T_REALLY_RELEVANT: Almost Relevant
•  CanadianDave: for*
   ▪  Saucefire: Damn it, Dave, we don't have time foe yoe proper spelling.
   ▪  rosyatrandom: Ain't nobody got time foe that
•  beermethestrength: *dat

Slide 18

Slide 18 text

id | in_reply_to | author_id | content
---+-------------+-----------+------------------------------
 1 | NULL        | 42        | May good fortune smile upon…
 2 | 1           | 43        | Foe real!
 3 | 2           | 44        | Almost Relevant

Slide 19

Slide 19 text

-- Hacker News / Reddit clone, Postgres syntax
WITH RECURSIVE tree(id, in_reply_to, author_id, content) AS (
    SELECT id, in_reply_to, author_id, content
    FROM comments WHERE id = ?
  UNION ALL
    SELECT child.id, child.in_reply_to, child.author_id, child.content
    FROM tree parent, comments child
    WHERE parent.id = child.in_reply_to
)
SELECT * FROM tree;

Slide 20

Slide 20 text

id | in_reply_to | author_id | content
---+-------------+-----------+------------------------------
 1 | NULL        | 42        | May good fortune smile upon…
 2 | 1           | 43        | Foe real!

CREATE INDEX ON comments (in_reply_to);
-- Implicit in PRIMARY KEY:
CREATE UNIQUE INDEX ON comments (id);

Slide 21

Slide 21 text

id | in_reply_to | author_id | content
---+-------------+-----------+-----------
 2 | 1           | 43        | Foe real!

CREATE INDEX ON comments (in_reply_to);
SELECT * FROM comments WHERE in_reply_to = 1;

Slide 22

Slide 22 text

[Index tree diagram: root 64 with children 32 and 96; leaves 23, 42, 73, 99]

Slide 23

Slide 23 text

[Index tree diagram, traversed to find value 42: 42 < 64, so go left; 42 > 32, so go right] “To find value 42, look at rows stored in disk blocks 12345 and 23456”
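A minimal sketch of that lookup, not from the slides: a toy tree with the values from the diagram, where interior nodes route by key and leaves name the disk blocks to read (all block numbers other than 12345 and 23456 are made up for the sketch):

// Toy index tree matching the diagram: interior nodes route by key,
// leaves say which disk blocks hold the matching rows.
var index = {
  key: 64,
  left:  { key: 32,
           left:  { key: 23, blocks: [11111] },              // hypothetical block
           right: { key: 42, blocks: [12345, 23456] } },
  right: { key: 96,
           left:  { key: 73, blocks: [33333] },               // hypothetical block
           right: { key: 99, blocks: [44444] } }              // hypothetical block
};

function lookup(node, value) {
  if (node.blocks) return node.blocks;                        // reached a leaf
  return lookup(value < node.key ? node.left : node.right, value);
}

console.log(lookup(index, 42));  // [12345, 23456] — the disk blocks to fetch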

Slide 24

Slide 24 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Slide 25

Slide 25 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

RDBMS deals with all of these!

Slide 26

Slide 26 text

[Diagram: raw data (DB) → “derive & maintain” → caches; raw data (DB) → “derive & maintain” → indexes]

Slide 27

Slide 27 text

[Diagram: raw data (DB) → “derive & maintain” → caches, with (lots of) code; raw data (DB) → “derive & maintain” → indexes, with 1 line of SQL]

Slide 28

Slide 28 text

-- I already have 200 million rows.
-- This may take a few hours…
CREATE INDEX CONCURRENTLY ON comments (in_reply_to);

Slide 29

Slide 29 text

-- I already have 200 million rows.
-- This may take a few hours…
CREATE INDEX CONCURRENTLY ON comments (in_reply_to);

OMG! No table lock!

Slide 30

Slide 30 text

[Diagram: an empty timeline]

Slide 31

Slide 31 text

[Diagram: a timeline with incoming writes]

Slide 32

Slide 32 text

[Diagram: a snapshot of the table data is taken and CREATE INDEX CONCURRENTLY… builds the index from it, while writes continue on the timeline]

Slide 33

Slide 33 text

[Diagram, continued: more writes arrive while the index is being built from the snapshot]

Slide 34

Slide 34 text

[Diagram, continued: still more writes arrive during the index build]

Slide 35

Slide 35 text

[Diagram, continued: the index built from the snapshot is caught up with the writes that happened in the meantime]

Slide 36

Slide 36 text

CREATE INDEX CONCURRENTLY ON comments (in_reply_to);

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

enter_sandman → “Say your prayers, little one / don’t forget, my son, to include everyone”

Tokenization / Stemming / Indexing / …

term    | occurrences
--------+------------------
dont    | enter_sandman:5
everyon | enter_sandman:11
forget  | enter_sandman:6
includ  | enter_sandman:10
littl   | enter_sandman:3
my      | enter_sandman:7
on      | enter_sandman:4
prayer  | enter_sandman:2
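A minimal sketch of that tokenize/stem/index pipeline (my own illustration: the crude suffix-stripping stands in for a real stemmer such as Porter/Snowball, and the positions come out slightly different from the table above, which also appears to drop stop words):

// Build a tiny inverted index: term -> list of "docId:position" occurrences.
function tokenize(text) {
  return text.toLowerCase().replace(/[^a-z\s]/g, "").split(/\s+/).filter(Boolean);
}

function stem(word) {
  // toy stemmer: strip a few common English suffixes
  return word.replace(/(ing|ers|er|es|s|e)$/, "");
}

function indexDocument(index, docId, text) {
  tokenize(text).forEach(function (word, position) {
    var term = stem(word);
    if (!index[term]) index[term] = [];
    index[term].push(docId + ":" + (position + 1));
  });
  return index;
}

var index = indexDocument({}, "enter_sandman",
  "Say your prayers, little one / don't forget, my son, to include everyone");

console.log(index["prayer"]);  // [ "enter_sandman:3" ] — term -> occurrences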

Slide 39

Slide 39 text

•  Natural language (tokenization, stemming, synonyms)
•  Relevance ranking
•  Search facets
•  Spelling correction & suggestions

→ Warrants being separate from database

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Slide 43

Slide 43 text

Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “All Aboard the Databus!,” at ACM Symposium on Cloud Computing, 2012. http://www.socc2012.org/s18-das.pdf

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

[Diagram: raw data (DB) → “derive & maintain” → caches, indexes, search indexes]

Slide 46

Slide 46 text

[Diagram: the CREATE INDEX CONCURRENTLY timeline again — a snapshot of the table data plus the stream of writes made during and after it]

Slide 47

Slide 47 text

[Diagram, continued: a second snapshot of the table data is taken later on the same timeline]

Slide 48

Slide 48 text

[Diagram, continued: the same timeline with both snapshots of the table data]

Slide 49

Slide 49 text

[Diagram, continued: the snapshot of the table data plus the ongoing stream of writes]

Slide 50

Slide 50 text

[Diagram, continued: the snapshot of the table data plus the ongoing stream of writes]

Slide 51

Slide 51 text

[Diagram, continued: the snapshot of the table data plus the ongoing stream of writes]

Slide 52

Slide 52 text

•  Derived data can always be re-generated from raw data
•  Derived data optimized for reading, raw data optimized for writing (normalized)
•  Many different datasets derived from one raw dataset
•  Update latency depends on application (ACID/online < ‘nearline’ < offline)

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Just build a new one!

Slide 57

Slide 57 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Just build a new one!
Build a new one regularly!

Slide 58

Slide 58 text

•  Consistency
•  Staleness
•  Aborted transactions
•  Race conditions
•  Bootstrapping
•  Cold start after failure
•  New feature → new access pattern

Just build a new one!
Build a new one regularly!
Fast stream process

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

[Diagram: data to write is appended to a store of all data (append-only), from which a batch process builds a precomputed view (cache); the same writes also go to a message queue, from which a stream process maintains a real-time diff to the view; the reader consults both]

Slide 61

Slide 61 text

[Diagram, now with examples: data to write is appended to a store of all data (append-only, e.g. HDFS), from which a batch process (e.g. Hadoop MapReduce) builds a precomputed view / cache (e.g. Voldemort); the same writes also go to a message queue (e.g. Kafka), from which a stream process (e.g. Storm) maintains a real-time diff to the view (e.g. Memcached); the reader consults both]
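A minimal sketch of the read path in that architecture (my own illustration; the key name and the counter-style merge are assumptions, and batchView/realtimeDiff stand in for whatever stores hold the precomputed view and the stream output):

// Reader in a lambda-style architecture: the batch layer periodically rebuilds
// a complete view, and the stream layer keeps a small diff of everything that
// changed since the last batch run. Reads merge the two.
var batchView    = { "user:42:comment_count": 17 };   // rebuilt e.g. nightly from all data
var realtimeDiff = { "user:42:comment_count": 2 };    // changes since the last batch run

function read(key) {
  var base  = batchView[key] || 0;
  var delta = realtimeDiff[key] || 0;
  return base + delta;   // for a counter-style view, the merge is just addition
}

console.log(read("user:42:comment_count"));  // 19

// After every batch rebuild, the stream layer's output can simply be discarded
// and recomputed — the "human fault tolerance" mentioned on the next slide.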

Slide 62

Slide 62 text

•  No such thing as a cache miss
•  After every batch rebuild, discard stream output → “human fault tolerance”
•  Not feasible if O(n²) [e.g. friends in common] or unbounded [e.g. search]
•  How does stream process know which keys to update (when using joins)?

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

No content