Executing Queries on a Sharded Database

Slide 1

Slide 1 text

Execu&ng Queries on a Sharded Database Neha Narula September 25, 2012

Slide 2

Slide 2 text

Choosing and Scaling Your Datastore Neha Narula September 25, 2012

Slide 3

Slide 3 text

Who Am I? Froogle Blobstore Na&ve Client

Slide 4

Slide 4 text

In This Talk •  What to think about when choosing a datastore for a web applica&on •  Myths and legends •  Execu&ng distributed queries •  My research: a consistent cache

Slide 5

Slide 5 text

Every so oPen… Friends ask me for advice when they are building a new applica&on

Slide 6

Slide 6 text

Friend

Slide 7

Slide 7 text

“Hi Neha! I am making a new applica&on.

Slide 8

Slide 8 text

I have heard MySQL sucks and I should use NoSQL.

Slide 9

Slide 9 text

I am going to be itera&ng on my app a lot

Slide 10

Slide 10 text

I don't have any customers yet

Slide 11

Slide 11 text

and my current data set could ﬁt on a thumb drive from 2004, but…

Slide 12

Slide 12 text

Can you tell me which NoSQL database I should use?

Slide 13

Slide 13 text

And how to shard it?"

Slide 14

Slide 14 text

Neha

Slide 15

Slide 15 text

h[p://knowyourmeme.com/memes/facepalm

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

What to Think about When You’re Choosing a Datastore Hint: Not Scaling

Slide 19

Slide 19 text

Development Cycle •  Prototyping – Ease of use •  Developing – Flexibility, mul&ple developers •  Running a real produc&on system – Reliability •  SUCCESS! Sharding and specialized datastorage – We should only be so lucky

Slide 20

Slide 20 text

Gecng Started First ﬁve minutes – what’s it like? Idea credit: Adam Marcus, The NoSQL Ecosystem, HPTS 2011 via Jus&n Sheehy from Basho

Slide 21

Slide 21 text

First Five Minutes: Redis h[p://simonwillison.net/2009/Oct/22/redis/

Slide 22

Slide 22 text

First Five Minutes: MySQL

Slide 23

Slide 23 text

Developing •  Mul&ple people working on the same code •  Tes&ng new features => new access pa[erns •  New person comes along…

Slide 24

Slide 24 text

Redis Time to go get lunch

Slide 25

Slide 25 text

MySQL

Slide 26

Slide 26 text

Ques&ons •  What is your mix of reads and writes? •  How much data do you have? •  Do you need transac&ons? •  What are your access pa[erns? •  What will grow and change? •  What do you already know?

Slide 27

Slide 27 text

Reads and Writes •  Write op&mized vs. Read op&mized •  MongoDB has a global write lock per process •  No concurrent writes!

Slide 28

Slide 28 text

Size of Data •  Does it ﬁt in memory? •  Disk-‐based solu&ons will be slower •  Worst case, Redis needs 2X the memory of your data! – It forks with copy-‐on-‐write sudo echo 1 > /proc/sys/vm/overcommit_memory!

Slide 29

Slide 29 text

Requirements •  Performance – Latency tolerance •  Durability – Data loss tolerance •  Freshness – Staleness tolerance •  Up&me – Down&me tolerance

Slide 30

Slide 30 text

Performance •  Rela&onal databases are considered slow •  But they are doing a lot of work! •  Some&mes you need that work, some&mes you don’t.

Slide 31

Slide 31 text

Simplest Scenario •  Responding to user requests in real &me •  Frequently read data ﬁts in memory •  High read rate •  No need to go to disk or scan lots of data Datastore CPU is the bo[leneck

Slide 32

Slide 32 text

Cost of Execu&ng a Primary Key Lookup Query Cost •  Receiving message •  Instan&a&ng thread •  Parsing query •  Op&mizing •  Take out read locks •  (Some&mes) lookup in an index btree •  Responding to the client Actually retrieving data

Slide 33

Slide 33 text

Op&ons to Make This Fast •  Query cache •  Prepared statements •  Handler Socket – 750K lookups/sec on a single server Lesson: Primary key lookups can be fast no ma[er what datastore you use

Slide 34

Slide 34 text

Flexibility vs. Performance •  We might want to pay the overhead for query ﬂexibility •  In a primary key datastore, we can only ask queries on primary key •  SQL gives us ﬂexibility to change our queries

Slide 35

Slide 35 text

Durability •  Persistent datastores –  Client: write –  Server: ﬂush to disk, then send “I completed your write” –  CRASH –  Recover: See the write •  By default, MongoDB does not fsync() before returning to client on write –  Need j:true •  By default, MySQL uses MyISAM instead of InnoDB

Slide 36

Slide 36 text

Specializa&on •  You know your – query access pa[erns and traﬃc – consistency requirements •  Design specialized lookups – Transac&onal datastore for consistent data – Memcached for sta&c, mostly unchanging content – Redis for a data processing pipeline •  Know what tradeoﬀs to make

Slide 37

Slide 37 text

Ways to Scale •  Reads – Cache – Replicate – Par&&on •  Writes – Par&&on data amongst mul&ple servers

Slide 38

Slide 38 text

Lots of Folklore

Slide 39

Slide 39 text

Myths of Sharded Datastores •  NoSQL scales be[er than a rela&onal database •  You can’t do a JOIN on a sharded datastore

Slide 40

Slide 40 text

MYTH: NoSQL scales be[er than a rela&onal database •  Scaling isn’t about the datastore, it’s about the applica&on – Examples: FriendFeed, Quora, Facebook •  Complex queries that go to all shards don’t scale •  Simple queries that par&&on well and use only one shard do scale

Slide 41

Slide 41 text

Problem: Applica&ons Look Up Data in Diﬀerent Ways

Slide 42

Slide 42 text

Post Page SELECT * ! FROM comments! WHERE post_id = 100! ! zrange comments:100 0 -1! HA!

Slide 43

Slide 43 text

Example Par&&oned Database Database Database Database comments table post_id! 100-199! 0-99! 200-199! Webservers

Slide 44

Slide 44 text

MySQL MySQL MySQL Query Goes to One Par&&on MySQL MySQL MySQL Comments on post 100 100-199! 0-99! 200-299!

Slide 45

Slide 45 text

MySQL MySQL MySQL Many Concurrent Queries MySQL MySQL MySQL Comments on post 100 100-199! 0-99! 200-299! Comments on post 52 Comments on post 289

Slide 46

Slide 46 text

User Comments Page Fetch all of a user's comments: SELECT * FROM comments WHERE user = 'sam'!

Slide 47

Slide 47 text

Query Goes to All Par&&ons MySQL MySQL MySQL Query goes to all servers 100-199! 0-99! 200-299! Sam’s comments

Slide 48

Slide 48 text

Costs for Query Go Up When Adding a New Server MySQL 0-99! Sam’s comments MySQL 100-199! MySQL 200-299! MySQL 300-399!

Slide 49

Slide 49 text

CPU Cost on Server of Retrieving One Row Two costs: Query overhead and row retrieval MySQL Query Overhead 97%

Slide 50

Slide 50 text

Idea: Mul&ple Par&&onings Par&&on comments table on post_id and user. Reads can choose appropriate copy so queries go to only one server. Writes go to all copies.

Slide 51

Slide 51 text

Mul&ple Par&&onings Database Database Database comments table post_id! 100-199! 0-99! 200-199! user! K-R! A-J! S-Z!

Slide 52

Slide 52 text

All Read Queries Go To Only One Par&&on MySQL MySQL MySQL 100-199! 0-99! S-Z! Sam’s comments Comments on post 100

Slide 53

Slide 53 text

Writes Go to Both Table Copies MySQL MySQL MySQL 100-199! 0-99! S-Z! Write a new comment by Vicki on post 100

Slide 54

Slide 54 text

Scaling •  Reads scale by N – the number of servers •  Writes are slowed down by T – the number of table copies •  Big win if N >> T •  Table copies use more memory – OPen don’t have to copy large tables – usually metadata •  How to execute these queries?

Slide 55

Slide 55 text

Dixie •  Is a query planner which executes SQL queries over a par&&oned database •  Op&mizes read queries to minimize query overhead and data retrieval costs •  Uses a novel cost formula which takes advantage of data par&&oned in diﬀerent ways

Slide 56

Slide 56 text

0 5000 10000 15000 20000 25000 1 2 5 10 QPS Number of Partitions One copy of page table Dixie + page.title copy Wikipedia Workload with Dixie 3.2 X Each query going to 1 server Many queries going to all servers Wikipedia •  Database dump from 2008 •  Real HTTP traces •  Sharded across 1, 2, 5, and 10 servers •  Compare by adding a copy of the page table (< 100 MB), sharded another way

Slide 57

Slide 57 text

Myths of Sharded Datastores •  NoSQL scales be[er than a rela&onal database •  You can’t do a JOIN on a sharded datastore

Slide 58

Slide 58 text

MYTH: You can’t execute a JOIN on a sharded database •  What are the JOIN keys? What are your par&&on keys? •  Bad to move lots of data •  Not bad to lookup a few pieces of data •  Index lookup JOINs expressed as primary key lookups are fast

Slide 59

Slide 59 text

Join Query Fetch all of Alice's comments on Max's posts. 3 Alice First post! comments table post_id user text 6 7 22 Alice Alice Alice Like. You think? Nice work! 1 Max http://… posts table id author link 3 22 37 Max Max Max www. Ask Hacker http://…

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Distributed Query Planning •  R* •  Shore •  The State of the Art in Distributed Query Processing, by Donald Kossman •  Gizzard

Slide 62

Slide 62 text

Conven&onal Wisdom Par&&on on JOIN keys, Send query as is to each server. 0-99! 0-99! 100- 199! 100- 199! 200- 299! 200- 299! SELECT comments.text ! FROM posts, comments! WHERE comments.post_id = posts.id! AND posts.author = 'Max'! AND comments.user = 'Alice'!

Slide 63

Slide 63 text

… Conven&onal Plan Scaling MySQL Alice’s comments on Max’s posts MySQL MySQL

Slide 64

Slide 64 text

Index Lookup Join Par&&on on ﬁlter keys, retrieve Max's posts, then retrieve Alice's comments on Max's posts. 0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! J-R! S-Z! A-I! 0-99! A-I! SELECT posts.id ! FROM posts! WHERE posts.author = 'Max’!

Slide 65

Slide 65 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! J-R! S-Z! A-I! 0-99! A-I! ids of Max’s posts Index Lookup Join Par&&on on ﬁlter keys, retrieve Max's posts, then retrieve Alice's comments on Max's posts.

Slide 66

Slide 66 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! A-I! J-R! S-Z! 0-99! A-I! SELECT comments.text ! FROM comments! WHERE comments.post_id IN […]! AND comments.user = 'Alice'! Index Lookup Join Par&&on on ﬁlter keys, retrieve Max's posts, then retrieve Alice's comments on Max's posts.

Slide 67

Slide 67 text

0-99! 100- 199! 200- 299! J-R! S-Z! 100- 199! 200- 299! A-I! J-R! S-Z! 0-99! A-I! Alice’s comments on Max’s posts Index Lookup Join Par&&on on ﬁlter keys, retrieve Max's posts, then retrieve Alice's comments on Max's posts.

Slide 68

Slide 68 text

… MySQL Alice’s comments on Max’s posts MySQL MySQL Index Lookup Join Plan Scaling

Slide 69

Slide 69 text

Comparison Conven&onal Plan Index Lookup Plan Intermediate data, Max’s posts Alice DIDN’T comment on

Slide 70

Slide 70 text

Comparison Conven&onal Plan Index Lookup Plan Intermediate data, Max’s posts Alice DIDN’T comment on

Slide 71

Slide 71 text

Dixie •  Cost model and query op&mizer which predicts the costs of the two plans •  Query executor which executes the original query, designed for a single database, on a par&&oned database with table copies

Slide 72

Slide 72 text

Dixie's Predic&ons for the Two Plans 0 2000 4000 6000 8000 10000 0 50 100 150 Queries/Sec Posts/Author 2-Step Join Estimate Pushdown Join Estimate •  Varying the size of the intermediate data •  Max’s posts, Alice didn’t comment on •  Fixing the number of results returned •  Par&&oned over 10 servers Conven&onal Plan, each query going to all servers Index Lookup Plan, each query going to two servers

Slide 73

Slide 73 text

Performance of the Two Plans 0 2000 4000 6000 8000 10000 0 50 100 150 Queries/Sec Posts/Author 2-Step Join Pushdown Join 2-Step Join Estimate Pushdown Join Estimate Blue beats yellow aPer this. Dixie predicts it!

Slide 74

Slide 74 text

JOINs Are Not Evil •  When only transferring small amounts data •  And with carefully par&&oned data

Slide 75

Slide 75 text

Lessons Learned •  Don’t worry at the beginning: Use what you know •  Not about SQL vs. NoSQL systems – you can scale any system •  More about complex vs. simple queries •  When you do scale, try to make every query use one (or a few) shards

Slide 76

Slide 76 text

Research: Caching •  Applica&ons use a cache to store results computed from a webserver or database rows •  Expira&on or invalida&on? •  Annoying and diﬃcult for app developers to invalidate cache items correctly Work in progress with Bryan Kate, Eddie Kohler, Yandong Mao, and Robert Morris Harvard and MIT

Slide 77

Slide 77 text

Solu&on: Dependency Tracking Cache •  Applica&on indicates what data went into a cached result •  The system will take care of invalida&ons •  Don’t need to recalculate expired data if it has not changed •  Always get the freshest data when data has changed •  Track ranges, even if empty!

Slide 78

Slide 78 text

Example Applica&on: Twi[er •  Cache a user’s view of their twi[er homepage (&meline) •  Only invalidate when a followed user tweets •  Even if many followed users tweet, only recalculate the &meline once when read •  No need to recalculate on read if no one has tweeted

Slide 79

Slide 79 text

Challenges •  Invalida&on + expira&on &mes – Reading stale data •  Distributed caching •  Cache coherence on the original database rows

Slide 80

Slide 80 text

Benefits •  Applica&on gets all the benefits of caching – Saves on computa&on – Less traffic to the backend database •  Doesn’t have to worry about freshness

Slide 81

Slide 81 text

Summary •  Choosing a datastore •  Dixie •  Dependency tracking cache

Slide 82

Slide 82 text

Thanks! [email protected] h[p://nehanaru.la @neha