Slide 1

Slide 1 text

Physical Design for Non-relational Data Systems Michael Mior • University of Waterloo

Slide 2

Slide 2 text

Proper design and configuration of data systems is critical for achieving good performance 2

Slide 3

Slide 3 text

3 Many tools exist for relational database design optimization Source: https://www.databasejournal.com/features/mssql/article.php/10894_3523616_2/Index-Tuning-Wizard.htm https://dev.mysql.com/doc/mysql-monitor/4.0/en/mem-qanal-using-ui.html Microsoft AutoAdmin (1998) DB2 Design Advisor (2004) Oracle SQL Tuning (2004)

Slide 4

Slide 4 text

We want applications to be up 24/7 We're frequently dealing with changing data or with unstructured data We require sub-second responses to queries 4 Source: Mike Loukides, VP Content Strategy, O’Reilly Media Relational databases are not always sufficient for these uses

Slide 5

Slide 5 text

“Over 30 years, we've learned how to write business intelligence applications on top of relational databases -- there are patterns. With NoSQL today, we have no cookie cutters. We don't have any blueprints.” --Ravi Krishnappa, NetApp solutions architect 5 Source: TechTarget, 2015

Slide 6

Slide 6 text

• NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications

Slide 7

Slide 7 text

• NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications

Slide 8

Slide 8 text

Model column families around query patterns But start your design with entities and relationships, if you can De-normalize and duplicate for read performance But don’t de-normalize if you don’t need to Leverage wide rows for ordering, grouping, and filtering But don’t go too wide Schema Design Best Practices Source: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/ But But But ? ? ? 8

Slide 9

Slide 9 text

NoSQL Application Development Requirements Implementation Data Model App Logic DB Access NoSE [MSAL, ICDE ‘16] [MSAL, TKDE ‘17] 9

Slide 10

Slide 10 text

Database Design Example Comment com_id com_date text User user_id nickname Post user_id post_date title 10

Slide 11

Slide 11 text

Database Design Example 11 SELECT post_id, post_title FROM users u JOIN comments c ON u.user_id = c.user_id JOIN posts p ON p.post_id = c.post_id ORDER BY p.post_date Query Find information on all posts a user has commented on in order by post date

Slide 12

Slide 12 text

Database Design Example user_id ↓ nickname comment_id post_id ↓ title post_date comment_id ↓ post_id post_date nickname nickname ↓ title post_date Execution A Execution B 12

Slide 13

Slide 13 text

NoSE Workload Query Plans Data Model 1. Candidate Enumeration 13 2. Query Planning 3. Design Optimization 4. Plan Recommendation Database Design

Slide 14

Slide 14 text

14 3 4 5 Database Design Optimization NoSE considers all possible query plans and picks the one with minimum expected cost

Slide 15

Slide 15 text

Evaluation 15 Overall workload performance improves by 5x

Slide 16

Slide 16 text

• NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications

Slide 17

Slide 17 text

Physical Logical 17 {user_id: 1, post_date: "2017-04-05", com_id: 3, …} {user_id: 2, post_date: "2017-04-05", com_id: 7, …} {post_id: 6, com_date: "2017-04-03", com_id: 3, user_id: 1, …} {post_id: 6, com_date: "2017-04-01", com_id: 7, user_id: 2, …} ? Existing NoSQL designs are a black box ?!? JSON!

Slide 18

Slide 18 text

Removes redundancy implied by both functional and inclusion dependencies Recovering Logical Schemas Extract the structure of existing data Discover dependencies Produce a logical model of the database 18

Slide 19

Slide 19 text

user_comments {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, …} comments_by_date {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} {░░░░░░░: ░, ░░░░░░░░: "░░░░░░░░░░", ░░░░░░: ░, ░░░░░░░: ░, …} We want to go from raw data to a logical model Comment User Post 19 [MS, ER ‘18] (to appear)

Slide 20

Slide 20 text

20 user_comments user_id post_date com_id post_id title 1 2017-04-05 3 6 Stargate 2 2017-04-05 7 6 Stargate Data on the same logical entity appears multiple times

Slide 21

Slide 21 text

user_comments user_id com_id post_id 1 3 6 2 7 6 posts post_date post_id title 2017-04-05 6 Stargate 21 Post data can be (logically) extracted to normalize

Slide 22

Slide 22 text

22 user_comments_user user_id user_comments_post post_id post_date, title comments_by_date_post post_id comments_by_date_com com_id com_date, text comments_by_date_user user_id, nickname 22

Slide 23

Slide 23 text

23 23 posts post_id post_date, title comments com_id com_date, text users user_id, nickname This is the original logical model! Comment User Post

Slide 24

Slide 24 text

• NoSQL Database Design Optimization • Understanding Existing NoSQL Designs • Optimizing Big Data Applications

Slide 25

Slide 25 text

Apache Spark Model ▸ Series of lazy transformations which are followed by actions that force evaluation of all transformations ▸ Each step produces a resilient distributed dataset (RDD) ▸ Intermediate results can be cached on memory or disk, optionally serialized 25

Slide 26

Slide 26 text

Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… …deciding which ones to cache may be challenging. Spark Caching Best Practices Source: https://unraveldata.com/to-cache-or-not-to-cache/ 26

Slide 27

Slide 27 text

PageRank Example var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() 27

Slide 28

Slide 28 text

Transformations var rankGraph = graph var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph prevRankGraph = rankGraph rankGraph = rankGraph .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() } rankGraph.vertices.values.sum() .outerJoinVertices(...).map(...) .aggregateMessages(...) .outerJoinVertices(rankUpdates) 28

Slide 29

Slide 29 text

Actions var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() prevRankGraph.unpersist() } rankGraph.edges.foreachPartition(...) rankGraph.vertices.values.sum() 29

Slide 30

Slide 30 text

30 PageRank RDDs Some RDDs are used more than once

Slide 31

Slide 31 text

Spark Model Caching var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 while (iteration < numIter) { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) rankGraph.edges.foreachPartition(...) } rankGraph.vertices.values.sum() rankGraph.persist() .persist() prevRankGraph.unpersist() 31

Slide 32

Slide 32 text

ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { rankGraph.persist() val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) .persist() rankGraph.edges.foreachPartition(...) prevRankGraph.unpersist() }) rankGraph.vertices.values.sum() 32

Slide 33

Slide 33 text

ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 33 rankGraph: 0

Slide 34

Slide 34 text

ReSpark var rankGraph = graph.outerJoinVertices(...).map(...) var iteration = 0 whileLoop (sc, iteration < numIter { val rankUpdates = rankGraph.aggregateMessages(...) prevRankGraph = rankGraph rankGraph = rankGraph.outerJoinVertices(rankUpdates) }) rankGraph.vertices.values.sum() 34 rankGraph: 2 Persist!

Slide 35

Slide 35 text

PageRank on ReSpark 35 Without any caching, many jobs take hours!

Slide 36

Slide 36 text

36

Slide 37

Slide 37 text

Questions? xkcd.com