Deep Unsupervised Cardinality Estimation

Zongheng Yang
September 07, 2020


VLDB 2020
Talk video: https://youtu.be/u2glA-S1AEs
Paper: https://zongheng.me/pubs/naru-vldb20.pdf

Naru is a new cardinality estimation approach that leverages deep unsupervised learning to fully remove the heuristic assumptions behind this decades-old database problem.

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

Joint work with Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, and Ion Stoica.


Transcript

  1. Deep Unsupervised Cardinality Estimation Zongheng Yang, Eric Liang, Amog Kamsetty,

     Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, Ion Stoica VLDB 2020 github.com/naru-project bit.ly/naru-vldb20
  2. Cardinality Estimation in Databases SQL Query Query Optimizer

  3. Cardinality Estimation in Databases Plan Enumeration SQL Query Plans 1,

    2 … Query Optimizer
  4. Cardinality Estimation in Databases Plan Enumeration Cost Model SQL Query

    Plans 1, 2 … Plan 2 is best Join T Join E S Query Optimizer
  5. Cardinality Estimation in Databases Plan Enumeration Cardinality Estimator Cost Model

    SQL Query Plans 1, 2 … # rows? 50 Plan 2 is best Query Optimizer Join T Join E S
  6. Cardinality Estimation in Databases Plan Enumeration Cardinality Estimator Cost Model

    SQL Query Plans 1, 2 … # rows? 50 Plan 2 is best Query Optimizer Join T Join E S Real systems produce huge errors of up to 10^7x [1]! [1] Leis et al., VLDB’15
  7. The Bane: Heuristic Assumptions!

  8. The Bane: Heuristic Assumptions! Independence: P(Age, Salary) ≈ P(Age) × P(Salary)
  9. The Bane: Heuristic Assumptions! Independence Uniformity: P(Age, Salary) ≈ P(Age) × P(Salary) “Values in each bin are uniform”
  10. The Bane: Heuristic Assumptions! Independence Uniformity: P(Age, Salary) ≈ P(Age) × P(Salary) “Values in each bin are uniform” Heuristics discard correlations, making cardinality estimates inaccurate.
  11. Key Idea: learn all correlations without heuristics Naru Overview

  12. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 X2 X3 Table Key Idea: learn all correlations without heuristics
  13. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  14. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 Encode X2 X3 X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  15. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  16. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Unsupervised loss (max. likelihood) Table Key Idea: learn all correlations without heuristics
  17. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Unsupervised loss (max. likelihood) Table … … 26 2K Age Salary Model P(Age) P(Salary|Age=26) Example Input: Tuples Output: N conditional distributions Key Idea: learn all correlations without heuristics
  18. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model Key Idea: learn all correlations without heuristics
  19. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model Key Idea: learn all correlations without heuristics ~90x higher accuracy than SOTA on real-world datasets
  20. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model ~90x higher accuracy than SOTA on real-world datasets
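The unsupervised training step above can be sketched in a few lines. This is a toy stand-in, not Naru's actual code: the count tables below play the role of the neural model's output distributions, but the objective is the same one named on the slides, maximize likelihood by minimizing the mean negative log-likelihood of the tuples.

```python
import math
from collections import Counter

# Toy "autoregressive model" over (Age, Salary): one table for P(Age)
# and one per-Age table for P(Salary | Age). A real Naru model replaces
# these count tables with a neural network trained by gradient descent,
# but the unsupervised loss is identical.
tuples = [(26, 2000), (26, 2000), (26, 3000), (40, 5000)]  # (Age, Salary)
n = len(tuples)

age_counts = Counter(a for a, _ in tuples)
pair_counts = Counter(tuples)
p_age = {a: c / n for a, c in age_counts.items()}
p_sal_given_age = {(a, s): c / age_counts[a] for (a, s), c in pair_counts.items()}

def nll(data):
    """Unsupervised loss: average of -log [ P(Age) * P(Salary | Age) ]."""
    total = 0.0
    for a, s in data:
        total -= math.log(p_age[a]) + math.log(p_sal_given_age[(a, s)])
    return total / len(data)

loss = nll(tuples)  # count tables are the maximum-likelihood fit to the data
```

Minimizing this quantity over the table's tuples is exactly "learn the data distribution": no query workload or labels are needed.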
  21. Autoregressive Modeling Naru uses autoregressive models because:

  22. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because:
  23. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because: Not materialized; emitted on-demand by model
  24. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because: Many successful architectures; scale to complex data Masked MLPs, Transformers, WaveNet, …
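The "factorization is exact" claim is easy to check empirically. A minimal sketch on a toy two-column table (deliberately correlated, so independence would fail): the product of the empirical marginal and conditional reproduces the empirical joint exactly.

```python
import random
from collections import Counter

# Chain rule: P(X1, X2) = P(X1) * P(X2 | X1) -- exact for any table,
# with no independence assumption. Toy table with correlated columns.
random.seed(0)
table = [
    (age, age * 1000 + random.choice([0, 5000]))  # Salary depends on Age
    for age in random.choices([25, 35, 45], k=10_000)
]
n = len(table)

joint = Counter(table)                # empirical P(Age, Salary)
marg = Counter(a for a, _ in table)   # empirical P(Age)

def p_joint(a, s):
    return joint[(a, s)] / n

def p_age(a):
    return marg[a] / n

def p_sal_given_age(s, a):
    return joint[(a, s)] / marg[a]

# Product of conditionals reproduces the joint exactly, pair by pair.
exact = all(
    abs(p_joint(a, s) - p_age(a) * p_sal_given_age(s, a)) < 1e-12
    for (a, s) in joint
)
```

Naru's model outputs exactly these conditionals (one per column), so the only approximation error comes from how well the network fits them, never from a factorization heuristic.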
  25. Estimating selectivity

  26. Estimating selectivity Point Query: sel(Age=a, Salary=s)

  27. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a)
  28. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a) Range Query: sel(Age in R_Age, Salary in R_Sal) Naive enumeration = exponential forward passes!
  29. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a) Range Query: sel(Age in R_Age, Salary in R_Sal) Naive enumeration = exponential forward passes! Insight: not all points in the queried region are meaningful → sample tuples from model to approximate range density
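The point-query case can be sketched directly. The distributions below are hypothetical numbers standing in for one forward pass of the model (this is not Naru's real API): the selectivity estimate is just the product of two lookups.

```python
# One forward pass yields P(Age) and P(Salary | Age=a); a point query's
# selectivity is the product of the corresponding entries.
p_age = {25: 0.3, 35: 0.5, 45: 0.2}
p_salary_given_age = {
    25: {30_000: 0.8, 35_000: 0.2},
    35: {40_000: 0.6, 50_000: 0.4},
    45: {60_000: 1.0},
}

def point_selectivity(age, salary):
    """sel(Age=age, Salary=salary) = P(Age=age) * P(Salary=salary | Age=age)."""
    return p_age.get(age, 0.0) * p_salary_given_age.get(age, {}).get(salary, 0.0)

sel = point_selectivity(25, 30_000)  # 0.3 * 0.8 = 0.24
```

A range query would have to sum this product over every point in R_Age × R_Sal, which is why naive enumeration blows up exponentially in the number of filtered columns.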
  30. Progressive Sampling

  31. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1
  32. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2
  33. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2 ... return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1) # Monte Carlo estimate
  34. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2 ... return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1) # Monte Carlo estimate Use learned conditionals to progressively “zoom in” on the high-mass region
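The pseudocode above can be made concrete in a short sketch. Everything below is a hypothetical toy model (hard-coded distributions, not Naru's real interface), showing the two-column case: sample dimension 1 from its range-restricted distribution, read off the mass of the second range conditioned on the sample, and average the product.

```python
import random

# Progressive sampling for sel(X1 in R1, X2 in R2) over a toy model.
random.seed(1)

p_x1 = {0: 0.5, 1: 0.3, 2: 0.2}
p_x2_given_x1 = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.2, 1: 0.8},
    2: {0: 0.5, 1: 0.5},
}

def progressive_sample(r1, r2, n_samples=20_000):
    total = 0.0
    for _ in range(n_samples):
        # Mass of R1 under P(X1); restrict and renormalize, then sample dim 1.
        mass1 = sum(p_x1[v] for v in r1)
        s1 = random.choices(list(r1), weights=[p_x1[v] / mass1 for v in r1])[0]
        # With X1=s1 fixed, the mass of R2 is a cheap sum over model output.
        mass2 = sum(p_x2_given_x1[s1].get(v, 0.0) for v in r2)
        total += mass1 * mass2  # one Monte Carlo term
    return total / n_samples

est = progressive_sample(r1=[0, 1], r2=[1])
# Exact answer: 0.5 * 0.1 + 0.3 * 0.8 = 0.29; the estimate converges to it.
```

Each sample costs only n forward passes instead of enumerating the whole region, and sampling from the restricted conditionals is what steers the estimator toward the high-mass part of the query box.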
  35. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City in *, # domain(City) Age in [20…28], BirthDay in *, # domain(BirthDay) Salary in [5k…15k], … )
  36. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City = MASK_City, # skipped! Age in [20…28], BirthDay = MASK_BirthDay, # skipped! Salary in [5k…15k], … ) Wildcards → equal to MASK tokens 10x reduction in max error
  37. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City = MASK_City, # skipped! Age in [20…28], BirthDay = MASK_BirthDay, # skipped! Salary in [5k…15k], … ) Wildcards → equal to MASK tokens 10x reduction in max error More results & details in paper: robust to any issuable queries; latency, scalability; updating for new data
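The wildcard-skipping idea is simple to illustrate: a filter that spans a column's entire domain ("*") always has probability 1, so that column never needs sampling and Naru feeds a special MASK token for it instead. A minimal sketch, with hypothetical names (`DOMAIN`, `factors_to_sample`) invented for illustration:

```python
# Wildcard skipping: drop full-domain filters from the sampling loop,
# since their range probability is trivially 1.
DOMAIN = {
    "City": {"SF", "LA", "NY"},
    "Age": set(range(0, 100)),
    "Salary": set(range(0, 1_000_000)),
}

def factors_to_sample(query):
    """Keep only columns whose filter is a strict subset of the domain."""
    return [col for col, rng in query.items() if set(rng) != DOMAIN[col]]

query = {
    "City": DOMAIN["City"],               # wildcard -> MASK token, skipped
    "Age": set(range(20, 29)),            # real filter, must be sampled
    "Salary": set(range(5_000, 15_001)),  # real filter, must be sampled
}
cols = factors_to_sample(query)  # only Age and Salary get sampled
```

Skipping wildcards both saves forward-pass work and, per the slides, cuts max estimation error by about 10x, since no sampling noise is introduced for unconstrained columns.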
  38. Estimation Accuracy Dataset: DMV (11M tuples, 11 columns) Workload: 2000

    queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  39. Estimation Accuracy CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  40. Estimation Accuracy 230x 10000x CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples,

    11 columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  41. Estimation Accuracy 94x CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11

    columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  42. Estimation Accuracy 40-70x supervised: 5.5hr to collect 10K train queries

    unsupervised: ~5min read+train CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11 columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  43. Naru enables a new capability: NeuroCard, Heuristics-free Join Cardinality Estimation Naru Autoregressive Models Probabilistic Inference To appear in VLDB’21! Supports joins bit.ly/neurocard
  44. github.com/naru-project bit.ly/naru-vldb20 Naru Takeaways Removes heuristics ✓ 90x more accurate on multi-dim range queries ✓ Efficiently handles ranges & wildcards ✓ zongheng@cs.berkeley.edu @zongheng_yang Naru Autoregressive Models Probabilistic Inference