Deep Unsupervised Cardinality Estimation

Zongheng Yang
September 07, 2020


VLDB 2020
Talk video: https://youtu.be/u2glA-S1AEs
Paper: https://zongheng.me/pubs/naru-vldb20.pdf

Naru is a new cardinality estimation approach that leverages deep unsupervised learning to fully remove the heuristic assumptions behind this decades-old database problem.

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

Joint work with Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, and Ion Stoica.


Transcript

  1. Deep Unsupervised Cardinality Estimation Zongheng Yang, Eric Liang, Amog Kamsetty,

     Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, Ion Stoica VLDB 2020 github.com/naru-project bit.ly/naru-vldb20
  2. Cardinality Estimation in Databases SQL Query Query Optimizer

  3. Cardinality Estimation in Databases Plan Enumeration SQL Query Plans 1,

    2 … Query Optimizer
  4. Cardinality Estimation in Databases Plan Enumeration Cost Model SQL Query

    Plans 1, 2 … Plan 2 is best Join T Join E S Query Optimizer
  5. Cardinality Estimation in Databases Plan Enumeration Cardinality Estimator Cost Model

    SQL Query Plans 1, 2 … # rows? 50 Plan 2 is best Query Optimizer Join T Join E S
  6. Cardinality Estimation in Databases Plan Enumeration Cardinality Estimator Cost Model

    SQL Query Plans 1, 2 … # rows? 50 Plan 2 is best Query Optimizer Join T Join E S Real systems produce huge errors of up to 10^7x [1]! [1] Leis et al., VLDB’15
  7. The Bane: Heuristic Assumptions!

  8. The Bane: Heuristic Assumptions! Independence: P(Age, Salary) ≈ P(Age) × P(Salary)
  9. The Bane: Heuristic Assumptions! Independence Uniformity: P(Age, Salary) ≈ P(Age) × P(Salary) “Values in each bin are uniform”
  10. The Bane: Heuristic Assumptions! Independence Uniformity: P(Age, Salary) ≈ P(Age) × P(Salary) “Values in each bin are uniform” Heuristics discard correlations, making cardinality estimates inaccurate.
  11. Key Idea: learn all correlations without heuristics Naru Overview

  12. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 X2 X3 Table Key Idea: learn all correlations without heuristics
  13. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  14. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive

    model X1 Encode X2 X3 X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  15. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Table Key Idea: learn all correlations without heuristics
  16. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Unsupervised loss (max. likelihood) Table Key Idea: learn all correlations without heuristics
  17. Naru Overview Unsupervised Training Learn data distribution with deep autoregressive model Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 tuples Unsupervised loss (max. likelihood) Table … … 26 2K Age Salary Model P(Age) P(Salary|Age=26) Example Input: Tuples Output: N conditional distributions Key Idea: learn all correlations without heuristics
  18. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model Key Idea: learn all correlations without heuristics
  19. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model Key Idea: learn all correlations without heuristics ~90x higher accuracy than SOTA on real-world datasets
  20. Naru Overview Model X1 Encode X2 X3 P(X1) P(X2|X1) P(X3|X1,X2) Output X1 X2 X3 Estimate selectivity from model, given any query tuples Unsupervised loss (max. likelihood) Table Selectivity estimates Approximate Inference Unsupervised Training Learn data distribution with deep autoregressive model ~90x higher accuracy than SOTA on real-world datasets
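The unsupervised training step above can be sketched in a few lines. This is a toy stand-in, not Naru's actual code: the count tables below play the role of the neural model's output distributions, but the objective is the same one named on the slides, maximize likelihood by minimizing the mean negative log-likelihood of the tuples.

```python
import math
from collections import Counter

# Toy "autoregressive model" over (Age, Salary): one table for P(Age)
# and one per-Age table for P(Salary | Age). A real Naru model replaces
# these count tables with a neural network trained by gradient descent,
# but the unsupervised loss is identical.
tuples = [(26, 2000), (26, 2000), (26, 3000), (40, 5000)]  # (Age, Salary)
n = len(tuples)

age_counts = Counter(a for a, _ in tuples)
pair_counts = Counter(tuples)
p_age = {a: c / n for a, c in age_counts.items()}
p_sal_given_age = {(a, s): c / age_counts[a] for (a, s), c in pair_counts.items()}

def nll(data):
    """Unsupervised loss: average of -log [ P(Age) * P(Salary | Age) ]."""
    total = 0.0
    for a, s in data:
        total -= math.log(p_age[a]) + math.log(p_sal_given_age[(a, s)])
    return total / len(data)

loss = nll(tuples)  # count tables are the maximum-likelihood fit to the data
```

Minimizing this quantity over the table's tuples is exactly "learn the data distribution": no query workload or labels are needed.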
  21. Autoregressive Modeling Naru uses autoregressive models because:

  22. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because:
  23. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because: Not materialized; emitted on-demand by model
  24. Factorization is exact: no independence assumptions Autoregressive Modeling P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1) Naru uses autoregressive models because: Many successful architectures; scale to complex data Masked MLPs, Transformers, WaveNet, …
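The "factorization is exact" claim is easy to check empirically. A minimal sketch on a toy two-column table (deliberately correlated, so independence would fail): the product of the empirical marginal and conditional reproduces the empirical joint exactly.

```python
import random
from collections import Counter

# Chain rule: P(X1, X2) = P(X1) * P(X2 | X1) -- exact for any table,
# with no independence assumption. Toy table with correlated columns.
random.seed(0)
table = [
    (age, age * 1000 + random.choice([0, 5000]))  # Salary depends on Age
    for age in random.choices([25, 35, 45], k=10_000)
]
n = len(table)

joint = Counter(table)                # empirical P(Age, Salary)
marg = Counter(a for a, _ in table)   # empirical P(Age)

def p_joint(a, s):
    return joint[(a, s)] / n

def p_age(a):
    return marg[a] / n

def p_sal_given_age(s, a):
    return joint[(a, s)] / marg[a]

# Product of conditionals reproduces the joint exactly, pair by pair.
exact = all(
    abs(p_joint(a, s) - p_age(a) * p_sal_given_age(s, a)) < 1e-12
    for (a, s) in joint
)
```

Naru's model outputs exactly these conditionals (one per column), so the only approximation error comes from how well the network fits them, never from a factorization heuristic.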
  25. Estimating selectivity

  26. Estimating selectivity Point Query: sel(Age=a, Salary=s)

  27. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a)
  28. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a) Range Query: sel(Age in R_Age, Salary in R_Sal) Naive enumeration = exponential forward passes!
  29. Estimating selectivity Point Query: sel(Age=a, Salary=s) Easy: do 1 forward pass, multiply P(Age=a) P(Salary=s|Age=a) Range Query: sel(Age in R_Age, Salary in R_Sal) Naive enumeration = exponential forward passes! Insight: not all points in the queried region are meaningful → sample tuples from model to approximate range density
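The point-query case can be sketched directly. The distributions below are hypothetical numbers standing in for one forward pass of the model (this is not Naru's real API): the selectivity estimate is just the product of two lookups.

```python
# One forward pass yields P(Age) and P(Salary | Age=a); a point query's
# selectivity is the product of the corresponding entries.
p_age = {25: 0.3, 35: 0.5, 45: 0.2}
p_salary_given_age = {
    25: {30_000: 0.8, 35_000: 0.2},
    35: {40_000: 0.6, 50_000: 0.4},
    45: {60_000: 1.0},
}

def point_selectivity(age, salary):
    """sel(Age=age, Salary=salary) = P(Age=age) * P(Salary=salary | Age=age)."""
    return p_age.get(age, 0.0) * p_salary_given_age.get(age, {}).get(salary, 0.0)

sel = point_selectivity(25, 30_000)  # 0.3 * 0.8 = 0.24
```

A range query would have to sum this product over every point in R_Age × R_Sal, which is why naive enumeration blows up exponentially in the number of filtered columns.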
  30. Progressive Sampling

  31. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1
  32. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2
  33. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2 ... return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1) # Monte Carlo estimate
  34. Progressive Sampling (Naru) # Query: sel(X1 in R1, ..., Xn in Rn) ProgressiveSample(): s1 ~ Model(X1 | X1 in R1) # Sample dim 1 s2 ~ Model(X2 | X2 in R2, X1=s1) # Sample dim 2 ... return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1) # Monte Carlo estimate Use learned conditionals to progressively “zoom in” on the high-mass region
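The pseudocode above can be made concrete in a short sketch. Everything below is a hypothetical toy model (hard-coded distributions, not Naru's real interface), showing the two-column case: sample dimension 1 from its range-restricted distribution, read off the mass of the second range conditioned on the sample, and average the product.

```python
import random

# Progressive sampling for sel(X1 in R1, X2 in R2) over a toy model.
random.seed(1)

p_x1 = {0: 0.5, 1: 0.3, 2: 0.2}
p_x2_given_x1 = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.2, 1: 0.8},
    2: {0: 0.5, 1: 0.5},
}

def progressive_sample(r1, r2, n_samples=20_000):
    total = 0.0
    for _ in range(n_samples):
        # Mass of R1 under P(X1); restrict and renormalize, then sample dim 1.
        mass1 = sum(p_x1[v] for v in r1)
        s1 = random.choices(list(r1), weights=[p_x1[v] / mass1 for v in r1])[0]
        # With X1=s1 fixed, the mass of R2 is a cheap sum over model output.
        mass2 = sum(p_x2_given_x1[s1].get(v, 0.0) for v in r2)
        total += mass1 * mass2  # one Monte Carlo term
    return total / n_samples

est = progressive_sample(r1=[0, 1], r2=[1])
# Exact answer: 0.5 * 0.1 + 0.3 * 0.8 = 0.29; the estimate converges to it.
```

Each sample costs only n forward passes instead of enumerating the whole region, and sampling from the restricted conditionals is what steers the estimator toward the high-mass part of the query box.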
  35. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City in *, # domain(City) Age in [20…28], BirthDay in *, # domain(BirthDay) Salary in [5k…15k], … )
  36. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City = MASK_City, # skipped! Age in [20…28], BirthDay = MASK_BirthDay, # skipped! Salary in [5k…15k], … ) Wildcards → equal to MASK tokens 10x reduction in max error
  37. Naru Optimizations Wildcard Skipping: avoid sampling wildcards ProgressiveSampling( City = MASK_City, # skipped! Age in [20…28], BirthDay = MASK_BirthDay, # skipped! Salary in [5k…15k], … ) Wildcards → equal to MASK tokens 10x reduction in max error More results & details in paper: robust to any issuable queries; latency, scalability; updating for new data
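The wildcard-skipping idea is simple to illustrate: a filter that spans a column's entire domain ("*") always has probability 1, so that column never needs sampling and Naru feeds a special MASK token for it instead. A minimal sketch, with hypothetical names (`DOMAIN`, `factors_to_sample`) invented for illustration:

```python
# Wildcard skipping: drop full-domain filters from the sampling loop,
# since their range probability is trivially 1.
DOMAIN = {
    "City": {"SF", "LA", "NY"},
    "Age": set(range(0, 100)),
    "Salary": set(range(0, 1_000_000)),
}

def factors_to_sample(query):
    """Keep only columns whose filter is a strict subset of the domain."""
    return [col for col, rng in query.items() if set(rng) != DOMAIN[col]]

query = {
    "City": DOMAIN["City"],               # wildcard -> MASK token, skipped
    "Age": set(range(20, 29)),            # real filter, must be sampled
    "Salary": set(range(5_000, 15_001)),  # real filter, must be sampled
}
cols = factors_to_sample(query)  # only Age and Salary get sampled
```

Skipping wildcards both saves forward-pass work and, per the slides, cuts max estimation error by about 10x, since no sampling noise is introduced for unconstrained columns.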
  38. Estimation Accuracy Dataset: DMV (11M tuples, 11 columns) Workload: 2000

    queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  39. Estimation Accuracy CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  40. Estimation Accuracy 230x 10000x CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples,

    11 columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  41. Estimation Accuracy 94x CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11

    columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  42. Estimation Accuracy 40-70x supervised: 5.5hr to collect 10K train queries

    unsupervised: ~5min read+train CIDR’19 SIGMOD’15 Dataset: DMV (11M tuples, 11 columns) Workload: 2000 queries, 5-11 range/eq filters each Model: Masked MLP, 13MB
  43. Naru enables a new capability: NeuroCard, Heuristics-free Join Cardinality Estimation Naru Autoregressive Models Probabilistic Inference To appear in VLDB’21! Supports joins bit.ly/neurocard
  44. github.com/naru-project bit.ly/naru-vldb20 Naru Takeaways Removes heuristics ✓ 90x more accurate on multi-dim range queries ✓ Efficiently handles ranges & wildcards ✓ zongheng@cs.berkeley.edu @zongheng_yang Naru Autoregressive Models Probabilistic Inference