
Deep Unsupervised Cardinality Estimation

Zongheng Yang
September 07, 2020

VLDB 2020
Talk video: https://youtu.be/u2glA-S1AEs
Paper: https://zongheng.me/pubs/naru-vldb20.pdf

Leveraging deep unsupervised learning, Naru is a new approach to cardinality estimation that fully removes heuristic assumptions from this decades-old database problem.

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

Joint work with Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, and Ion Stoica.

Transcript

  1. Deep Unsupervised Cardinality Estimation. Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, Ion Stoica. VLDB 2020. github.com/naru-project bit.ly/naru-vldb20
  2. Cardinality Estimation in Databases. [Diagram: a Query Optimizer takes a SQL Query; Plan Enumeration produces Plans 1, 2, … (join trees over tables E, S, T); a Cost Model decides "Plan 2 is best".]
  3. Cardinality Estimation in Databases. [Diagram: a Cardinality Estimator joins the pipeline, answering the Cost Model's "# rows?" questions (e.g., 50).]
  4. Cardinality Estimation in Databases. [Same pipeline diagram.] Real systems produce huge errors, up to 10⁷x [1]! [1] Leis et al., VLDB'15.
  5. The Bane: Heuristic Assumptions! Independence: P(Age, Salary) ≈ P(Age) × P(Salary). Uniformity: "values in each bin are uniform." These heuristics lose correlations, making cardinality estimates inaccurate.
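A quick illustration of how badly the independence heuristic can go wrong on correlated columns. A minimal sketch (not from the talk), assuming NumPy and a toy table where Salary is fully determined by Age:

    import numpy as np

    rng = np.random.default_rng(0)
    age = rng.integers(20, 60, size=100_000)   # Age uniform in [20, 60)
    salary = age * 1000                        # Salary fully determined by Age

    # True selectivity of (Age = 30 AND Salary = 30000)
    true_sel = np.mean((age == 30) & (salary == 30_000))

    # Independence heuristic: P(Age = 30) * P(Salary = 30000)
    indep_sel = np.mean(age == 30) * np.mean(salary == 30_000)

    print(true_sel, indep_sel)   # the heuristic underestimates by roughly 40x here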
  6. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. [Diagram: a Table with columns X1, X2, X3.] Key Idea: learn all correlations without heuristics.
  7. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. [Diagram: tuples are read from the Table.] Key Idea: learn all correlations without heuristics.
  8. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. [Diagram: tuples (X1, X2, X3) from the Table are encoded.] Key Idea: learn all correlations without heuristics.
  9. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. [Diagram: encoded tuples feed the Model, which outputs P(X1), P(X2|X1), P(X3|X1,X2).] Key Idea: learn all correlations without heuristics.
  10. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. [Diagram: as in slide 9, trained with an unsupervised loss (max. likelihood).] Key Idea: learn all correlations without heuristics.
  11. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. Input: tuples. Output: N conditional distributions. Example: the tuple (Age=26, Salary=2K) is encoded and fed to the Model, which outputs P(Age) and P(Salary|Age=26). Unsupervised loss: max. likelihood. Key Idea: learn all correlations without heuristics.
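The training objective on a tuple is just its negative log-likelihood under the autoregressive factorization. A minimal PyTorch-style sketch, assuming columns are dictionary-encoded into integer IDs and a hypothetical model(batch) that returns one logits tensor per column (conditioned on earlier columns via masking); this illustrates the idea, not the project's exact code:

    import torch
    import torch.nn.functional as F

    def nll_loss(logits_per_col, batch):
        # logits_per_col[i]: [B, |domain(Xi)|] logits for P(Xi | X<i)
        # batch: [B, num_cols] integer-encoded tuples
        loss = 0.0
        for i, logits in enumerate(logits_per_col):
            loss = loss + F.cross_entropy(logits, batch[:, i])  # -log P(Xi=xi | x<i)
        return loss

    # Training loop skeleton:
    # for batch in loader:
    #     optimizer.zero_grad()
    #     nll_loss(model(batch), batch).backward()
    #     optimizer.step()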
  12. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model (encode tuples; the Model outputs P(X1), P(X2|X1), P(X3|X1,X2); unsupervised loss: max. likelihood). Approximate Inference: estimate selectivity from the model, given any query. Key Idea: learn all correlations without heuristics.
  13. Naru Overview. Unsupervised Training: learn the data distribution with a deep autoregressive model. Approximate Inference: estimate selectivity from the model, given any query. Key Idea: learn all correlations without heuristics. ~90x higher accuracy than SOTA on real-world datasets.
  14. Naru Overview (recap): Unsupervised Training + Approximate Inference. ~90x higher accuracy than SOTA on real-world datasets.
  15. Autoregressive Modeling. P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn-1). Naru uses autoregressive models because: the factorization is exact, with no independence assumptions.
  16. Autoregressive Modeling. P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn-1). Naru uses autoregressive models because: the factorization is exact, with no independence assumptions; the conditionals are not materialized but emitted on demand by the model.
  17. Autoregressive Modeling. P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn|X1, …, Xn-1). Naru uses autoregressive models because: the factorization is exact, with no independence assumptions; many successful architectures (masked MLPs, Transformers, WaveNet, …) scale to complex data.
  18. Estimating selectivity. Point query: sel(Age=a, Salary=s). Easy: do 1 forward pass, multiply P(Age=a) × P(Salary=s|Age=a). Range query: sel(Age in R_age, Salary in R_sal). Naive enumeration = exponential forward passes!
  19. Estimating selectivity. Point query: sel(Age=a, Salary=s). Easy: do 1 forward pass, multiply P(Age=a) × P(Salary=s|Age=a). Range query: sel(Age in R_age, Salary in R_sal). Naive enumeration = exponential forward passes! Insight: not all points in the queried region are meaningful → sample tuples from the model to approximate the range density.
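For a point query, the estimate really is one forward pass followed by a product of the returned conditionals. A sketch under the same assumed model interface as above (hypothetical, for illustration):

    import torch

    @torch.no_grad()
    def point_selectivity(model, point):
        # sel(X1 = point[0], ..., Xn = point[n-1]) in one forward pass.
        # With autoregressive masking, column i's logits depend only on
        # columns < i, so feeding the full tuple yields every P(Xi | x<i).
        x = torch.tensor([point])
        sel = 1.0
        for i, logits in enumerate(model(x)):
            probs = torch.softmax(logits[0], dim=-1)
            sel *= probs[point[i]].item()        # P(Xi = point[i] | x<i)
        return sel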
  20. Progressive Sampling (Naru).
      # Query: sel(X1 in R1, ..., Xn in Rn)
      ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)              # Sample dim 1
  21. Progressive Sampling (Naru).
      # Query: sel(X1 in R1, ..., Xn in Rn)
      ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)              # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)       # Sample dim 2
  22. Progressive Sampling (Naru).
      # Query: sel(X1 in R1, ..., Xn in Rn)
      ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)              # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)       # Sample dim 2
        ...
        return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)   # Monte Carlo estimate
  23. Progressive Sampling (Naru).
      # Query: sel(X1 in R1, ..., Xn in Rn)
      ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)              # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)       # Sample dim 2
        ...
        return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)   # Monte Carlo estimate
      Use the learned conditionals to progressively "zoom in" on the high-mass region.
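A runnable sketch of one progressive-sampling path under the same assumed model interface (an illustration of the algorithm above, not the project's actual implementation); ranges[i] holds the integer value IDs inside Ri:

    import torch

    @torch.no_grad()
    def progressive_sample(model, ranges, num_cols):
        # One Monte Carlo sample of sel(X1 in R1, ..., Xn in Rn).
        x = torch.zeros(1, num_cols, dtype=torch.long)   # sampled prefix so far
        estimate = 1.0
        for i in range(num_cols):
            probs = torch.softmax(model(x)[i][0], dim=-1)
            mass = probs[ranges[i]].sum()        # P(Xi in Ri | s1..si-1)
            if mass.item() == 0.0:
                return 0.0                       # no in-range mass under this prefix
            estimate *= mass.item()
            # Sample si from the in-range distribution, renormalized
            idx = torch.multinomial(probs[ranges[i]] / mass, 1)
            x[0, i] = ranges[i][idx.item()]
        return estimate                          # average many samples for the estimate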
  24. Naru Optimizations. Wildcard Skipping: avoid sampling wildcards.
      ProgressiveSampling(
        City in *,            # domain(City)
        Age in [20…28],
        BirthDay in *,        # domain(BirthDay)
        Salary in [5k…15k],
        … )
  25. Naru Optimizations. Wildcard Skipping: avoid sampling wildcards. Wildcards are set equal to MASK tokens:
      ProgressiveSampling(
        City = MASK_City,         # skipped!
        Age in [20…28],
        BirthDay = MASK_BirthDay, # skipped!
        Salary in [5k…15k],
        … )
      10x reduction in max error.
  26. Naru Optimizations. Wildcard Skipping: avoid sampling wildcards; wildcards are set equal to MASK tokens (City = MASK_City, BirthDay = MASK_BirthDay: skipped!), giving a 10x reduction in max error. More results and details in the paper: robust to all issuable queries; latency; scalability; updating for new data.
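Conceptually, wildcard skipping feeds a special MASK token for an unconstrained column instead of spending a sampling step on it, since P(Xi in domain(Xi)) = 1. A hypothetical variant of the sketch above (the MASK mechanism here is an assumption about how it works, not the project's code); a wildcard is marked by ranges[i] = None:

    import torch

    MASK = 0   # assumed: a reserved token ID the model embeds as "MASK"

    @torch.no_grad()
    def progressive_sample_skip(model, ranges, num_cols):
        x = torch.zeros(1, num_cols, dtype=torch.long)
        estimate = 1.0
        for i in range(num_cols):
            if ranges[i] is None:    # wildcard "Xi in *"
                x[0, i] = MASK       # skip sampling; this factor is exactly 1
                continue
            probs = torch.softmax(model(x)[i][0], dim=-1)
            mass = probs[ranges[i]].sum()
            if mass.item() == 0.0:
                return 0.0
            estimate *= mass.item()
            idx = torch.multinomial(probs[ranges[i]] / mass, 1)
            x[0, i] = ranges[i][idx.item()]
        return estimate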
  27. Estimation Accuracy. Dataset: DMV (11M tuples, 11 columns). Workload: 2000 queries, 5-11 range/equality filters each. Model: Masked MLP, 13MB.
  28. Estimation Accuracy. [Chart: baselines from CIDR'19 and SIGMOD'15 added.] Same dataset, workload, and model as slide 27.
  29. Estimation Accuracy. [Chart annotations: 230x and 10000x.] Same setup as slide 27.
  30. Estimation Accuracy. [Chart annotation: 94x.] Same setup as slide 27.
  31. Estimation Accuracy. [Chart annotation: 40-70x.] Supervised approaches take 5.5hr to collect 10K training queries; unsupervised Naru takes ~5min to read the data and train. Same setup as slide 27.
  32. Naru enables a new capability: NeuroCard, heuristics-free join cardinality estimation. NeuroCard extends Naru's autoregressive models and probabilistic inference to support joins. To appear in VLDB'21! bit.ly/neurocard
  33. Naru Takeaways. ✓ Removes heuristics. ✓ 90x more accurate on multi-dim range queries. ✓ Efficiently handles ranges & wildcards. Naru = autoregressive models + probabilistic inference. github.com/naru-project bit.ly/naru-vldb20 [email protected] @zongheng_yang