
DICE: Distributed Interactive Cube Exploration

Arnab Nandi

March 31, 2014

Transcript

  1. DICE: Distributed and Interactive Cube Exploration
     Niranjan Kamat, Prasanth Jayachandran*, Karthik Tunga*, Arnab Nandi
     *Work done while at Ohio State
  2. Overview
     •  Large-scale ad-hoc analytics: Think of the User!
     •  Query Sessions
     •  Interactive response times
     •  DICE: Distributed Interactive Cube Exploration
     •  Speculation
     •  Faceted Model
     •  Online Sampling
     •  Interactive Performance
     •  Synthetic & Real-World Workloads
     •  User Studies
  3. Motivation
     •  Data cube materialization is expensive
     •  Large datasets (scale: billion tuples)
     •  Ad-hoc cube exploration
     •  Distributed aggregation (scale: 10-100 nodes)
     •  Interactive response times (scale: seconds)
     •  Fluid data exploration experience
  4. Outline
     •  Motivation
     •  Preliminaries
     •  DICE Concepts
     •  Speculative Execution
     •  Faceted Traversals
     •  Sampling
     •  Implementation
     •  Experiments
  5. Preliminaries: Data Cube
     SELECT COUNT(uid)
     FROM user.table as (uid, color, gender)
     CUBE ON color, gender
     •  Aggregate based on all possible combinations of dimensions
     •  Measure = COUNT
     •  Dimensions = color, gender
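A minimal sketch (not from the deck) of what the CUBE ON clause computes: the COUNT measure grouped by every subset of the dimensions. The rows below are hypothetical, chosen so the result reproduces the counts shown on the next slide.

    from itertools import combinations
    from collections import Counter

    # Hypothetical rows: (uid, color, gender); 7 users, matching the next slide.
    rows = [(1, "red", "M"), (2, "red", "M"), (3, "red", "M"), (4, "red", "F"),
            (5, "green", "M"), (6, "green", "F"), (7, "green", "F")]
    dims = {"color": 1, "gender": 2}   # column index of each dimension

    def cube_counts(rows, dims):
        """COUNT(uid) for every combination of dimensions, one cube lattice node each."""
        cube = {}
        names = list(dims)
        for k in range(len(names) + 1):
            for group_by in combinations(names, k):
                key = lambda r: tuple(r[dims[d]] for d in group_by)
                cube[group_by or ("*",)] = Counter(key(r) for r in rows)
        return cube

    for node, counts in cube_counts(rows, dims).items():
        print(node, dict(counts))
    # ('*',) {(): 7} ... ('color', 'gender') {('red', 'M'): 3, ('red', 'F'): 1, ...}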
  6. Preliminaries: Data Cube
     Cube lattice: <*, *>, <*, gender>, <color, *>, <color, gender>
               Men   Women   Total
     Red        3      1       4
     Green      1      2       3
     Total      4      3       7
  7. Preliminaries: Data Cube
     New dimension = fruit
                Men   Women   Total
     Red         3      1       4
     Green       1      2       3
     Total       4      3       7
     (apple)    Men   Women   Total
     Red         1      1       2
     Green       1      0       1
     Total       2      1       3
     (mango)    Men   Women   Total
     Red         2      0       2
     Green       0      2       2
     Total       2      2       4
  8. Preliminaries: Data Cube
     Slices of the cube by fruit, with * marginals over gender (columns) and color (rows):
     (*)        Men   Women    *
     Red         3      1      4
     Green       1      2      3
     Total       4      3
     (apple)    Men   Women    *
     Red         1      1      2
     Green       1      0      1
     *           2      1      3
     (mango)    Men   Women    *
     Red         2      0      2
     Green       0      2      2
     Total       2      2
  9. Cube Materialization is Expensive
     [Figure 2 from the paper: the cube lattice over the hierarchical attributes, i.e. every
      group-by combination of time (hour/week/month/*), iops/*, and location
      (rack/zone/datacenter/*), from <*, *, *> at the top down to <hour, iops, rack>.]
  10. Cube Exploration
     [The same cube lattice figure (Figure 2), shown again in the context of cube exploration.]
  11. Outline
     •  Motivation
     •  Preliminaries
     •  DICE Concepts
     •  Speculative Execution
     •  Faceted Traversals
     •  Sampling
     •  Implementation
     •  Experiments
  12. The DICE System
     •  Distributed aggregation (scale: 10-100 nodes)
     •  Large datasets (scale: billion tuples)
     •  Interactive response times (scale: seconds)
  13. Concept: Faceted Exploration
     •  Challenge: a single data cube has an exponential number of queries
     •  Which one to pick?
     •  The faceted model restricts cube exploration
     •  Fewer options mean we can better predict what the user will query next
  14. Faceted Example: Parent
     •  SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = "EU"
          AND hour = 6
        GROUP BY rack;
     •  Parent:
        SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = *
          AND hour = 6
        GROUP BY rack;
     [Mini lattice with regions *, a, b, and (a,b); the Parent traversal is highlighted.]
  15. Faceted Example: Child
     •  SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = "EU"
          AND hour = 6
        GROUP BY rack;
     •  Child:
        SELECT rack, AVG(iops)
        FROM events
        WHERE rack = 25
          AND hour = 6
        GROUP BY rack;
     [Mini lattice with regions *, a, b, and (a,b); the Child traversal is highlighted.]
  16. Faceted Example: Sibling
     •  SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = "EU"
          AND hour = 6
        GROUP BY rack;
     •  Sibling:
        SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = "US"
          AND hour = 6
        GROUP BY rack;
     [Mini lattice with regions *, a, b, and (a,b); the Sibling traversal is highlighted.]
  17. Faceted Example: Pivot
     •  SELECT rack, AVG(iops)
        FROM events
        WHERE datacenter = "EU"
          AND hour = 6
        GROUP BY rack;
     •  Pivot:
        SELECT datacenter, AVG(iops)
        FROM events
        WHERE rack = "12"
          AND hour = 6
        GROUP BY datacenter;
     [Mini lattice with regions *, a, b, and (a,b); the Pivot traversal is highlighted.]
  18. Faceted Traversal
     •  Intuitive exploration
     •  Can cover the entire cube lattice
     •  Limits the number of speculated queries
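A rough sketch of how the four traversals could be enumerated from a facet such as (rack)[hour:6, datacenter:EU]. The facet representation, the dimension domains, the location hierarchy, and the exact child rule here are assumptions for illustration, not DICE's actual definitions or code.

    # Hypothetical domains and attribute hierarchy; a facet is written as
    # (grouped dimension)[bound dimension: value, ...], e.g. (rack)[hour:6, datacenter:EU].
    DOMAINS = {"hour": list(range(24)), "datacenter": ["EU", "US"],
               "zone": ["eu-1", "eu-2", "us-1"], "rack": list(range(64))}
    DRILL_DOWN = {"datacenter": "zone", "zone": "rack"}   # location hierarchy

    def traversals(grouped, bound):
        out = []
        for dim, val in bound.items():
            rest = {d: v for d, v in bound.items() if d != dim}
            # Parent: relax a bound predicate, e.g. datacenter = "EU"  ->  datacenter = *.
            out.append(("parent", grouped, rest))
            # Sibling: rebind the same dimension to a different value, e.g. "EU" -> "US".
            out.extend(("sibling", grouped, {**rest, dim: v})
                       for v in DOMAINS[dim] if v != val)
            # Child: drill a bound predicate down its hierarchy, e.g. datacenter -> zone.
            if dim in DRILL_DOWN:
                child_dim = DRILL_DOWN[dim]
                out.extend(("child", grouped, {**rest, child_dim: v})
                           for v in DOMAINS[child_dim])
            # Pivot: swap the grouped dimension with a bound one, binding the old one.
            out.extend(("pivot", dim, {**rest, grouped: v}) for v in DOMAINS[grouped])
        return out

    for kind, g, b in traversals("rack", {"hour": 6, "datacenter": "EU"})[:6]:
        print(f"{kind}: ({g}) {b}")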
  19. Outline
     •  Motivation
     •  Preliminaries
     •  DICE Concepts
     •  Speculative Execution
     •  Faceted Traversals
     •  Sampling
     •  Implementation
     •  Experiments
  20. Concept: Sampling
     •  Tradeoff: accuracy vs. response time
     •  Run queries over a sample of the data if needed
     •  How do we accurately provide estimates from sampled data?
     •  What sampling rate should we use?
  21. Concept: Sampling
     •  Provide confidence intervals for all estimates using variance across samples
     •  A facet query runs on multiple table shards on multiple nodes, so the same cube
        group may be returned by several queries; the per-group statistics are combined
        as (Appendix A of the paper, Eq. 1):
        s_h^2 = \frac{1}{n_h - 1}\left(\sum_{i=1}^{numQ} n_{hi}(m_{hi} - m_h)^2 + \sum_{i}(n_{hi} - 1)\,v_{hi}\right)
        where n_{hi}, m_{hi}, v_{hi} are the count, mean, and variance of group h from the
        i-th query, and numQ is the number of queries the facet query is replicated into
        to reach the user-specified sampling rate.
     •  Example from the paper: combining the [rack:1, hour:6, datacenter:EU] statistics
        (AVG:10, COUNT:5, VAR:4) and (AVG:5, COUNT:8, VAR:1) gives s_1^2 = 8.32; combining
        the [rack:2, hour:6, datacenter:EU] statistics gives s_2^2 = 11.52.
     •  Variance of the estimators for SUM and AVG (Eq. 2, 3):
        \hat{V}[\hat{t}] = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right)\frac{\hat{s}_h^2}{n_h}
        \hat{V}[\hat{y}] = \sum_{h=1}^{H} \frac{N_h^2}{N^2}\left(1 - \frac{n_h}{N_h}\right)\frac{\hat{s}_h^2}{n_h}
     [The slide shows this material as an excerpt from the paper (Sec. 2.2-2.3, table shards
      and querying over sampled data).]
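A quick numeric check of Eq. 1 (a sketch written for illustration, not DICE code), pooling the mean, count, and variance reported for one cube group by each shard query; the example values are the ones from the slide above.

    def combine_group(parts):
        """parts: (mean, count, variance) of the same cube group from each shard query."""
        n_h = sum(n for _, n, _ in parts)
        m_h = sum(m * n for m, n, _ in parts) / n_h                # combined mean
        between = sum(n * (m - m_h) ** 2 for m, n, _ in parts)     # spread of shard means
        within = sum((n - 1) * v for _, n, v in parts)             # shard-local variances
        return m_h, n_h, (between + within) / (n_h - 1)            # Eq. 1

    # Group [rack:1, hour:6, datacenter:EU] as reported by the two shard queries:
    print(combine_group([(10, 5, 4), (5, 8, 1)]))   # variance ~8.33 (paper rounds to 8.32)
    # Group [rack:2, hour:6, datacenter:EU]:
    print(combine_group([(12, 6, 2), (6, 7, 2)]))   # variance ~11.53 (paper rounds to 11.52)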
  22. Concept: Sampling
     •  Challenge:
        •  What sampling rate should we use?
     •  Solution:
        •  Run the user query at the requested sampling rate
        •  Run different speculative queries at different sampling rates
     •  Goal: increase the likelihood that the next user query is cached at the highest
        sampling rate
  23. Concept: Sampling
     •  Prioritize likely queries that yield the highest accuracy gain
     •  Estimated accuracy gain at the sampling rate R_current:
        AccuracyGain(R_{current}) = c\left(\frac{1}{\sqrt{R_{current}}} - \frac{1}{\sqrt{R_{current}+1}}\right)
     •  Goodness score for query q at R_current:
        Score(q, R_{current}) = Prob(q) \cdot AccuracyGain(R_{current})
     [The remaining text on the slide is a pasted excerpt of the paper on stratified
      sampling of joins and on result reuse up the cube.]
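A toy sketch of how the goodness score could be used to rank speculative queries; the constant c, the probability model, and the candidate facets are made-up assumptions, not values from the deck.

    from math import sqrt

    def accuracy_gain(r_current, c=1.0):
        # Gain from promoting a cached result from rate R_current to the next sampling rate.
        return c * (1 / sqrt(r_current) - 1 / sqrt(r_current + 1))

    def score(prob, r_current, c=1.0):
        # Goodness score: likelihood of the query times its estimated accuracy gain.
        return prob * accuracy_gain(r_current, c)

    # Hypothetical speculative facets: {facet: (Prob(q), current sampling-rate level)}.
    candidates = {"sibling datacenter=US": (0.40, 1),
                  "parent datacenter=*":   (0.25, 2),
                  "pivot rack=12":         (0.10, 1)}
    ranked = sorted(candidates, key=lambda q: score(*candidates[q]), reverse=True)
    print(ranked)   # speculate on the highest-scoring queries first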
  24. Outline
     •  Motivation
     •  Preliminaries
     •  DICE Concepts
     •  Speculative Execution
     •  Faceted Traversals
     •  Sampling
     •  Implementation
     •  Experiments
  25. DICE System
     [Architecture diagram: a BI tool / UI sends the User Query to the Master, which hosts
      the Query Federator and a Result Cache; the federator fans the user query and
      Speculative Queries out over the network to worker nodes slave1..slaveN, each with a
      local DB shard; results flow back into the cache, which serves the user's Next Query.]
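A very rough sketch of the control flow implied by the diagram: answer the user query from the result cache when a speculative run has already produced it, otherwise execute it on the workers, then pre-execute likely next queries. The names, cache policy, and stand-in functions are all assumptions, not DICE's actual implementation.

    class ResultCache(dict):
        pass

    def handle_user_query(query, cache, execute, speculate, top_k=3):
        result = cache.get(query)
        if result is None:                      # miss: fan the query out to the workers
            result = execute(query)
            cache[query] = result
        for nxt in speculate(query)[:top_k]:    # warm the cache while the user reads
            cache.setdefault(nxt, execute(nxt))
        return result

    # Toy stand-ins for the workers and the speculation policy.
    cache = ResultCache()
    execute = lambda q: f"aggregates for {q}"
    speculate = lambda q: [q + " | sibling", q + " | parent", q + " | pivot"]

    print(handle_user_query("(rack)[hour:6, datacenter:EU]", cache, execute, speculate))
    print(sorted(cache))   # the user query plus the speculated facets are now cached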
  26. DICE: Frontend
     •  One possible frontend for DICE
     •  Demonstrates interactive cubing
     •  Guides the user through cube exploration
  27. Outline
     •  Motivation
     •  Preliminaries
     •  DICE Concepts
     •  Speculative Execution
     •  Faceted Traversals
     •  Sampling
     •  Implementation
     •  Experiments
  28. Experimental Setup
     •  Small Cluster
        •  Master: 1x Quad Core i5, 16GB RAM
        •  15 slave nodes: 1x Quad Core Xeon, 4GB RAM
        •  Default sample size: 200M tuples
     •  Cloud Cluster
        •  Amazon EC2
        •  1 master, 50 slave nodes: c1.xlarge type
        •  7GB memory, 8 virtual cores
  29. Performance: Sample Workload
     [Chart: response times for Queries 1-10 of the sample workload; DICE shows less
      variance and lower response times.]
  30. Impact on User Experience
     •  Real user sessions
     •  10 users, 10 queries each
     •  Blind "taste test"
     •  Unanimous preference for DICE
     •  NoSpec (naive system) vs DICE
     •  7 seconds faster per session (54s vs 47s)
  31. Faceted Model Efficacy
     •  Are facets a good way to model cube exploration?
     •  Internet advertising company
     •  Hadoop / Hive infrastructure
     •  Large-scale analytics
     •  46 query sessions, 116 queries
     •  Modeled 100% of the queries
  32. Conclusion
     •  Large-scale analytics: Think of the User!
     •  Query Sessions
     •  Sub-second response times
     •  DICE: Distributed Interactive Cube Exploration
     •  Speculation
     •  Faceted Model
     •  Online Sampling
     •  Interactive Performance
     •  Synthetic & Real-World Workloads
     •  User Studies