Slide 1

Deep Unsupervised Cardinality Estimation
Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, Ion Stoica
VLDB 2020
github.com/naru-project | bit.ly/naru-vldb20

Slide 2

Cardinality Estimation in Databases
[Diagram: SQL Query → Query Optimizer]

Slide 3

Cardinality Estimation in Databases
[Diagram: SQL Query → Query Optimizer: Plan Enumeration (Plans 1, 2, …)]

Slide 4

Cardinality Estimation in Databases
[Diagram: SQL Query → Query Optimizer: Plan Enumeration (Plans 1, 2, …) → Cost Model ("Plan 2 is best"); candidate plan drawn as the join tree (E ⋈ S) ⋈ T]

Slide 5

Cardinality Estimation in Databases
[Diagram: SQL Query → Query Optimizer: Plan Enumeration (Plans 1, 2, …) → Cardinality Estimator ("# rows?" → 50) → Cost Model ("Plan 2 is best"); candidate plan drawn as the join tree (E ⋈ S) ⋈ T]

Slide 6

Cardinality Estimation in Databases
[Diagram: SQL Query → Query Optimizer: Plan Enumeration (Plans 1, 2, …) → Cardinality Estimator ("# rows?" → 50) → Cost Model ("Plan 2 is best"); candidate plan drawn as the join tree (E ⋈ S) ⋈ T]
Real systems produce huge errors, up to 10⁷x [1]!
[1] Leis et al., VLDB'15

Slide 7

The Bane: Heuristic Assumptions!

Slide 8

The Bane: Heuristic Assumptions!
Independence: P(Age, Salary) ≈ P(Age) × P(Salary)

Slide 9

The Bane: Heuristic Assumptions!
Independence: P(Age, Salary) ≈ P(Age) × P(Salary)
Uniformity: "Values in each bin are uniform"

Slide 10

The Bane: Heuristic Assumptions!
Independence: P(Age, Salary) ≈ P(Age) × P(Salary)
Uniformity: "Values in each bin are uniform"
These heuristics discard correlations, making cardinality estimates inaccurate.
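
To make the independence failure concrete, here is a small Python illustration (not from the talk; the toy table and all its numbers are invented) of how multiplying marginals goes wrong on perfectly correlated columns:

# Toy table in which Salary is fully determined by Age.
table = [(25, 50_000), (25, 50_000), (40, 90_000), (40, 90_000)]
n = len(table)

# True selectivity of (Age = 25 AND Salary = 90_000) is 0: no such row exists.
true_sel = sum(1 for age, sal in table if age == 25 and sal == 90_000) / n

# The independence assumption multiplies the marginals instead.
sel_age = sum(1 for age, _ in table if age == 25) / n           # 0.5
sel_salary = sum(1 for _, sal in table if sal == 90_000) / n    # 0.5
print(true_sel, sel_age * sel_salary)   # 0.0 vs 0.25: a quarter of the table, wrongly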

Slide 11

Naru Overview
Key Idea: learn all correlations without heuristics

Slide 12

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table with columns X1, X2, X3]

Slide 13

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3)]

Slide 14

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode]

Slide 15

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2)]

Slide 16

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2) → unsupervised loss (max. likelihood)]

Slide 17

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2) → unsupervised loss (max. likelihood)]
Example: input is tuples, e.g. (Age=26, Salary=2K, …); output is N conditional distributions, e.g. P(Age) and P(Salary | Age=26).
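
As a concrete sketch of this training step, here is a minimal PyTorch version for a two-column table (an illustration, not Naru's masked-MLP code: the domain sizes, layer widths, and the names logits_x1 and net_x2 are all invented). The model emits one conditional distribution per column and is trained by maximum likelihood on raw tuples; no queries or labels are needed:

import torch
import torch.nn as nn
import torch.nn.functional as F

D1, D2 = 50, 100   # domain sizes of two dictionary-encoded columns (invented)

logits_x1 = nn.Parameter(torch.zeros(D1))                                 # emits P(X1)
net_x2 = nn.Sequential(nn.Linear(D1, 64), nn.ReLU(), nn.Linear(64, D2))  # emits P(X2 | X1)
opt = torch.optim.Adam([logits_x1, *net_x2.parameters()], lr=1e-3)

def train_step(x1, x2):
    """One maximum-likelihood step on a batch of tuples (x1[i], x2[i])."""
    # Negative log-likelihood of the exact factorization P(x1) * P(x2 | x1).
    nll = F.cross_entropy(logits_x1.expand(len(x1), D1), x1) \
        + F.cross_entropy(net_x2(F.one_hot(x1, D1).float()), x2)
    opt.zero_grad()
    nll.backward()
    opt.step()
    return nll.item()

# Random stand-in for a batch of rows; real training streams the table's tuples.
batch_x1 = torch.randint(0, D1, (256,))
batch_x2 = torch.randint(0, D2, (256,))
print(train_step(batch_x1, batch_x2))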

Slide 18

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2) → unsupervised loss (max. likelihood)]
Approximate Inference: estimate selectivity from the model, given any query → selectivity estimates

Slide 19

Naru Overview
Key Idea: learn all correlations without heuristics
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2) → unsupervised loss (max. likelihood)]
Approximate Inference: estimate selectivity from the model, given any query → selectivity estimates
~90x higher accuracy than SOTA on real-world datasets

Slide 20

Naru Overview
Unsupervised Training: learn the data distribution with a deep autoregressive model
[Diagram: Table → tuples (X1, X2, X3) → Encode → Model → Output: P(X1), P(X2|X1), P(X3|X1,X2) → unsupervised loss (max. likelihood)]
Approximate Inference: estimate selectivity from the model, given any query → selectivity estimates
~90x higher accuracy than SOTA on real-world datasets

Slide 21

Autoregressive Modeling
Naru uses autoregressive models because:

Slide 22

Autoregressive Modeling
Naru uses autoregressive models because:
Factorization is exact: no independence assumptions
P(X1, X2, …, Xn) = P(X1) P(X2 | X1) … P(Xn | X1, …, Xn-1)

Slide 23

Autoregressive Modeling
Naru uses autoregressive models because:
Factorization is exact: no independence assumptions
P(X1, X2, …, Xn) = P(X1) P(X2 | X1) … P(Xn | X1, …, Xn-1)
(Not materialized; emitted on demand by the model)

Slide 24

Autoregressive Modeling
Naru uses autoregressive models because:
Factorization is exact: no independence assumptions
P(X1, X2, …, Xn) = P(X1) P(X2 | X1) … P(Xn | X1, …, Xn-1)
Many successful architectures that scale to complex data: Masked MLPs, Transformers, WaveNet, …
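
A quick numeric sanity check (toy data, invented for illustration) that the chain-rule factorization reproduces a joint distribution exactly, with no independence assumption anywhere:

from collections import Counter

# Toy empirical joint over two correlated columns.
rows = [(0, 0), (0, 1), (0, 1), (1, 1)]
n = len(rows)
joint = Counter(rows)                    # counts for P(X1, X2)
marg1 = Counter(x1 for x1, _ in rows)    # counts for P(X1)

for (x1, x2), c in joint.items():
    p_joint = c / n                      # P(X1=x1, X2=x2)
    p_x1 = marg1[x1] / n                 # P(X1=x1)
    p_x2_given_x1 = c / marg1[x1]        # P(X2=x2 | X1=x1)
    assert abs(p_joint - p_x1 * p_x2_given_x1) < 1e-12
print("chain rule recovers the joint exactly, with no approximation")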

Slide 25

Estimating selectivity

Slide 26

Estimating selectivity
Point Query: sel(Age=a, Salary=s)

Slide 27

Estimating selectivity
Point Query: sel(Age=a, Salary=s)
Easy: do 1 forward pass, then multiply P(Age=a) × P(Salary=s | Age=a)
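
A minimal sketch of that computation (the heads age_logits and sal_head, their shapes, and the untrained weights are invented stand-ins for a trained model's output heads):

import torch
import torch.nn.functional as F

D_AGE, D_SAL = 100, 1000
age_logits = torch.zeros(D_AGE)              # head emitting P(Age)
sal_head = torch.nn.Linear(D_AGE, D_SAL)     # head emitting P(Salary | Age)

def estimate_point(a, s):
    """sel(Age=a, Salary=s): one forward pass, multiply the two conditionals."""
    with torch.no_grad():
        p_age = F.softmax(age_logits, dim=0)[a]   # P(Age = a)
        cond = F.softmax(sal_head(F.one_hot(torch.tensor(a), D_AGE).float()), dim=0)
        p_sal = cond[s]                           # P(Salary = s | Age = a)
    return (p_age * p_sal).item()

print(estimate_point(26, 500))               # a selectivity in [0, 1]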

Slide 28

Estimating selectivity
Point Query: sel(Age=a, Salary=s)
Easy: do 1 forward pass, then multiply P(Age=a) × P(Salary=s | Age=a)
Range Query: sel(Age in R_age, Salary in R_sal)
Naive enumeration takes exponentially many forward passes!

Slide 29

Estimating selectivity
Point Query: sel(Age=a, Salary=s)
Easy: do 1 forward pass, then multiply P(Age=a) × P(Salary=s | Age=a)
Range Query: sel(Age in R_age, Salary in R_sal)
Naive enumeration takes exponentially many forward passes!
Insight: not all points in the queried region are meaningful → sample tuples from the model to approximate the range density

Slide 30

Progressive Sampling

Slide 31

Progressive Sampling

Progressive Sampling (Naru)
# Query: sel(X1 in R1, ..., Xn in Rn)
ProgressiveSample():
    s1 ~ Model(X1 | X1 in R1)    # Sample dim 1

Slide 32

Progressive Sampling

Progressive Sampling (Naru)
# Query: sel(X1 in R1, ..., Xn in Rn)
ProgressiveSample():
    s1 ~ Model(X1 | X1 in R1)           # Sample dim 1
    s2 ~ Model(X2 | X2 in R2, X1=s1)    # Sample dim 2

Slide 33

Progressive Sampling

Progressive Sampling (Naru)
# Query: sel(X1 in R1, ..., Xn in Rn)
ProgressiveSample():
    s1 ~ Model(X1 | X1 in R1)                          # Sample dim 1
    s2 ~ Model(X2 | X2 in R2, X1=s1)                   # Sample dim 2
    ...
    return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)     # Monte Carlo estimate

Slide 34

Progressive Sampling

Progressive Sampling (Naru)
# Query: sel(X1 in R1, ..., Xn in Rn)
ProgressiveSample():
    s1 ~ Model(X1 | X1 in R1)                          # Sample dim 1
    s2 ~ Model(X2 | X2 in R2, X1=s1)                   # Sample dim 2
    ...
    return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)     # Monte Carlo estimate

Use the learned conditionals to progressively "zoom in" on the high-mass region
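
Below is a small runnable sketch of progressive sampling (illustration only: the "model" is an explicit 10×10 joint array, and the names joint, p_x1, and progressive_sample are invented; the real system obtains each conditional from a neural forward pass). Each sample draws a value inside the queried range of dimension 1, conditions dimension 2 on it, and multiplies the range masses met along the way; averaging gives an unbiased Monte Carlo estimate:

import numpy as np

rng = np.random.default_rng(0)
D1, D2 = 10, 10

# Explicit toy joint standing in for the learned distribution.
joint = rng.random((D1, D2))
joint /= joint.sum()
p_x1 = joint.sum(axis=1)                          # model's P(X1)

def progressive_sample(R1, R2, n_samples=1000):
    """Monte Carlo estimate of sel(X1 in R1, X2 in R2)."""
    total = 0.0
    for _ in range(n_samples):
        mass1 = p_x1[R1].sum()                    # P(X1 in R1)
        s1 = rng.choice(R1, p=p_x1[R1] / mass1)   # sample dim 1, restricted to R1
        p_x2 = joint[s1] / joint[s1].sum()        # P(X2 | X1 = s1)
        mass2 = p_x2[R2].sum()                    # P(X2 in R2 | X1 = s1)
        total += mass1 * mass2                    # unbiased per-sample estimate
    return total / n_samples

R1, R2 = np.arange(2, 6), np.arange(0, 5)
print(progressive_sample(R1, R2))                 # Monte Carlo estimate
print(joint[2:6, 0:5].sum())                      # exact answer, for reference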

Slide 35

Naru Optimizations
Wildcard Skipping: avoid sampling wildcards

ProgressiveSampling(
    City in *,             # domain(City)
    Age in [20…28],
    BirthDay in *,         # domain(BirthDay)
    Salary in [5k…15k],
    ……
)

Slide 36

Naru Optimizations
Wildcard Skipping: avoid sampling wildcards
Wildcards → equal to MASK tokens

ProgressiveSampling(
    City in *     = MASK_City,        # skipped!
    Age in [20…28],
    BirthDay in * = MASK_BirthDay,    # skipped!
    Salary in [5k…15k],
    ……
)

10x reduction in max error
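
A minimal sketch of the skipping logic (hypothetical throughout: the MASK sentinel, the sample_column helper, and the uniform range mass are invented for illustration; in Naru the masses come from the model's conditionals):

import random

MASK = -1    # hypothetical reserved token id meaning "column not filtered"

def sample_column(domain_size, rng=None):
    """rng=None encodes a wildcard ('col in *'): skip sampling entirely."""
    if rng is None:
        return MASK, 1.0                 # skipped! a wildcard's range mass is exactly 1
    value = random.choice(list(rng))     # toy stand-in for model-based sampling
    mass = len(rng) / domain_size        # toy uniform range mass
    return value, mass

print(sample_column(365))                 # BirthDay in *   -> (MASK, 1.0)
print(sample_column(100, range(20, 29)))  # Age in [20…28]  -> (value, 0.09)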

Slide 37

Naru Optimizations
Wildcard Skipping: avoid sampling wildcards
Wildcards → equal to MASK tokens

ProgressiveSampling(
    City in *     = MASK_City,        # skipped!
    Age in [20…28],
    BirthDay in * = MASK_BirthDay,    # skipped!
    Salary in [5k…15k],
    ……
)

10x reduction in max error

More results & details in the paper: robustness to any issuable query; latency & scalability; updating for new data.

Slide 38

Estimation Accuracy
Dataset: DMV (11M tuples, 11 columns)
Workload: 2000 queries, 5-11 range/equality filters each
Model: Masked MLP, 13MB

Slide 39

Estimation Accuracy
Dataset: DMV (11M tuples, 11 columns)
Workload: 2000 queries, 5-11 range/equality filters each
Model: Masked MLP, 13MB
[Chart: estimation errors of Naru vs. baselines, including estimators from CIDR'19 and SIGMOD'15]

Slide 40

Estimation Accuracy
Dataset: DMV (11M tuples, 11 columns)
Workload: 2000 queries, 5-11 range/equality filters each
Model: Masked MLP, 13MB
[Chart: estimation errors of Naru vs. baselines, including estimators from CIDR'19 and SIGMOD'15; annotated baseline max errors: 230x and 10000x]

Slide 41

Estimation Accuracy
Dataset: DMV (11M tuples, 11 columns)
Workload: 2000 queries, 5-11 range/equality filters each
Model: Masked MLP, 13MB
[Chart: estimation errors of Naru vs. baselines, including estimators from CIDR'19 and SIGMOD'15; annotated error: 94x]

Slide 42

Estimation Accuracy
Dataset: DMV (11M tuples, 11 columns)
Workload: 2000 queries, 5-11 range/equality filters each
Model: Masked MLP, 13MB
[Chart: estimation errors of Naru vs. baselines, including estimators from CIDR'19 and SIGMOD'15; annotated error: 40-70x]
Supervised: 5.5 hr to collect 10K training queries; unsupervised: ~5 min to read the data and train.

Slide 43

Naru enables new capabilities
NeuroCard: Heuristics-free Join Cardinality Estimation
[Diagram: Naru (Autoregressive Models + Probabilistic Inference), extended to support joins]
To appear in VLDB'21! bit.ly/neurocard

Slide 44

Naru Takeaways
✓ Removes heuristics
✓ 90x more accurate on multi-dim range queries
✓ Efficiently handles ranges & wildcards
[Diagram: Naru = Autoregressive Models + Probabilistic Inference]
github.com/naru-project | bit.ly/naru-vldb20
[email protected] | @zongheng_yang