
Deep Unsupervised Cardinality Estimation

Zongheng Yang
September 07, 2020


VLDB 2020
Talk video: https://youtu.be/u2glA-S1AEs
Paper: https://zongheng.me/pubs/naru-vldb20.pdf

Leveraging deep unsupervised learning, Naru is a new approach to cardinality estimation that fully removes the heuristic assumptions behind this decades-old database problem.

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

Joint work with Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, and Ion Stoica.

Transcript

  1. Deep Unsupervised
    Cardinality Estimation
    Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan,
    Xi Chen, Pieter Abbeel, Joe Hellerstein, Sanjay Krishnan, Ion Stoica
    VLDB 2020
    github.com/naru-project
    bit.ly/naru-vldb20



  2. Cardinality Estimation in Databases
    SQL Query
    Query Optimizer


  3. Cardinality Estimation in Databases
    Plan Enumeration
    SQL Query
    Plans 1, 2 …
    Query Optimizer


  4. Cardinality Estimation in Databases
    Plan Enumeration
    Cost Model
    SQL Query
    Plans 1, 2 …
    Plan 2 is best
    [Plan tree: Join(Join(E, S), T)]
    Query Optimizer


  5. Cardinality Estimation in Databases
    Plan Enumeration
    Cardinality Estimator
    Cost Model
    SQL Query
    Plans 1, 2 …
    # rows?
    50
    Plan 2 is best
    Query Optimizer
    [Plan tree: Join(Join(E, S), T)]


  6. Cardinality Estimation in Databases
    Plan Enumeration
    Cardinality Estimator
    Cost Model
    SQL Query
    Plans 1, 2 …
    # rows?
    50
    Plan 2 is best
    Query Optimizer
    [Plan tree: Join(Join(E, S), T)]
    Real systems produce huge errors, up to 10⁷x [1]!
    [1] Leis et al., VLDB’15


  7. The Bane: Heuristic Assumptions!


  8. The Bane: Heuristic Assumptions!
    Independence
    P(Age, Salary) ≈ P(Age) × P(Salary)


  9. The Bane: Heuristic Assumptions!
    Independence    Uniformity
    P(Age, Salary) ≈ P(Age) × P(Salary)
    “Values in each bin are uniform”


  10. The Bane: Heuristic Assumptions!
    Independence    Uniformity
    P(Age, Salary) ≈ P(Age) × P(Salary)
    “Values in each bin are uniform”
    Heuristics throw away correlations, making cardinality estimates inaccurate.

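    To see why these assumptions hurt, here is a small illustrative example (invented data, not from the talk): when two columns are perfectly correlated, the independence estimate is badly off even with exact per-column statistics.

        # Toy data (invented): Age and Salary are perfectly correlated.
        rows = [(25, 50_000), (25, 50_000), (40, 90_000), (40, 90_000)]

        def sel(pred):
            """Fraction of rows satisfying a predicate."""
            return sum(pred(r) for r in rows) / len(rows)

        true_sel = sel(lambda r: r[0] == 25 and r[1] == 50_000)               # 0.5
        indep_est = sel(lambda r: r[0] == 25) * sel(lambda r: r[1] == 50_000) # 0.25

        print(true_sel, indep_est)  # a 2x error on just two columns

    Errors like this multiply across predicates, which is how real optimizers end up orders of magnitude off.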

  11. Naru Overview
    Key Idea: learn all correlations without heuristics


  12. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    X1 X2 X3
    Table
    Key Idea: learn all correlations without heuristics


  13. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    X1 X2 X3
    tuples
    Table
    Key Idea: learn all correlations without heuristics


  14. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    X1
    Encode
    X2
    X3
    X1 X2 X3
    tuples
    Table
    Key Idea: learn all correlations without heuristics


  15. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    Model
    X1
    Encode
    X2
    X3
    P(X1)
    P(X2|X1)
    P(X3|X1,X2)
    Output
    X1 X2 X3
    tuples
    Table
    Key Idea: learn all correlations without heuristics


  16. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    Model
    X1
    Encode
    X2
    X3
    P(X1)
    P(X2|X1)
    P(X3|X1,X2)
    Output
    X1 X2 X3
    tuples
    Unsupervised loss
    (max. likelihood)
    Table
    Key Idea: learn all correlations without heuristics


  17. Naru Overview
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    Model
    X1
    Encode
    X2
    X3
    P(X1)
    P(X2|X1)
    P(X3|X1,X2)
    Output
    X1 X2 X3
    tuples
    Unsupervised loss
    (max. likelihood)
    Table


    Example
    Input: tuple (Age=26, Salary=2K)
    Model outputs: P(Age), P(Salary|Age=26)
    In general: input tuples → output N conditional distributions
    Key Idea: learn all correlations without heuristics

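    As a rough sketch of the training step above (PyTorch-style; the model interface, batch shapes, and dictionary encoding of columns are assumptions for illustration, not Naru's actual code): the loss is each tuple's negative log-likelihood, which the chain rule splits into one cross-entropy term per column.

        import torch.nn as nn

        # Assumed interface: model(batch) returns, for each column i, logits over
        # that column's domain conditioned only on columns < i (autoregressive).
        def train_epoch(model, loader, opt):
            ce = nn.CrossEntropyLoss()
            for batch in loader:                   # batch: LongTensor [B, n_cols]
                per_col_logits = model(batch)      # list of [B, |domain_i|] tensors
                # Maximum likelihood: -log P(x) = -sum_i log P(x_i | x_<i)
                loss = sum(ce(logits, batch[:, i])
                           for i, logits in enumerate(per_col_logits))
                opt.zero_grad()
                loss.backward()
                opt.step()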

  18. Naru Overview
    Model
    X1
    Encode
    X2
    X3
    P(X1)
    P(X2|X1)
    P(X3|X1,X2)
    X1 X2 X3
    Estimate selectivity
    from model, given any query
    tuples
    Unsupervised loss
    (max. likelihood)
    Table
    Selectivity
    estimates
    Approximate
    Inference
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    Key Idea: learn all correlations without heuristics
    Output


  19. Naru Overview
    Model
    X1
    Encode
    X2
    X3
    P(X1)
    P(X2|X1)
    P(X3|X1,X2)
    X1 X2 X3
    Estimate selectivity
    from model, given any query
    tuples
    Unsupervised loss
    (max. likelihood)
    Table
    Selectivity
    estimates
    Approximate
    Inference
    Unsupervised Training
    Learn data distribution with
    deep autoregressive model
    Key Idea: learn all correlations without heuristics
    Output
    ~90x higher accuracy than SOTA on real-world datasets



  21. Autoregressive Modeling
    Naru uses autoregressive models because:


  22. Autoregressive Modeling
    Naru uses autoregressive models because:
    Factorization is exact: no independence assumptions
    P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1)


  23. Autoregressive Modeling
    Naru uses autoregressive models because:
    Factorization is exact: no independence assumptions
    P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1)
    Not materialized; emitted on demand by the model


  24. Autoregressive Modeling
    Naru uses autoregressive models because:
    Factorization is exact: no independence assumptions
    P(X1, X2, …, Xn) = P(X1) P(X2|X1) … P(Xn | X1, …, Xn-1)
    Many successful architectures that scale to complex data:
    Masked MLPs, Transformers, WaveNet, …

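    To make the masked-MLP option concrete, here is a minimal MADE-style sketch (layer sizes and the raw-value input encoding are illustrative only; Naru uses richer per-column encodings): fixed 0/1 masks zero out weights so that output i depends only on inputs before i, which is exactly what lets one forward pass emit all n conditionals.

        import torch
        import torch.nn as nn

        class MaskedLinear(nn.Linear):
            """Linear layer whose weight is multiplied elementwise by a fixed mask."""
            def set_mask(self, mask):
                self.register_buffer("mask", torch.as_tensor(mask, dtype=torch.float32))
            def forward(self, x):
                return nn.functional.linear(x, self.weight * self.mask, self.bias)

        n, h = 3, 16                             # columns, hidden width (illustrative)
        deg_in = torch.arange(1, n + 1)          # input "degrees" 1..n
        deg_hid = torch.arange(h) % (n - 1) + 1  # hidden degrees cycle over 1..n-1

        fc1 = MaskedLinear(n, h)
        fc1.set_mask(deg_hid[:, None] >= deg_in[None, :])  # hidden sees inputs with degree <= its own
        fc2 = MaskedLinear(h, n)
        fc2.set_mask(deg_in[:, None] > deg_hid[None, :])   # output i sees only hidden degrees < i

        x = torch.rand(4, n)
        out = fc2(torch.relu(fc1(x)))  # out[:, i] depends only on x[:, :i]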

  25. Estimating selectivity


  26. Estimating selectivity
    Point Query: sel(Age=a, Salary=s)


  27. Estimating selectivity
    Point Query: sel(Age=a, Salary=s)
    Easy: 1 forward pass, then multiply:
    P(Age=a) × P(Salary=s | Age=a)

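    A sketch of this point-query path, under the same assumed model interface as the training sketch above (a single forward pass returns per-column logits):

        import torch

        def point_selectivity(model, a, s):
            """sel(Age=a, Salary=s) = P(Age=a) * P(Salary=s | Age=a).
            a, s are dictionary-encoded column values (assumed encoding)."""
            x = torch.tensor([[a, s]])                       # one encoded tuple
            logits_age, logits_sal = model(x)                # a single forward pass
            p_age = torch.softmax(logits_age, dim=-1)[0, a]  # P(Age = a)
            p_sal = torch.softmax(logits_sal, dim=-1)[0, s]  # P(Salary = s | Age = a)
            return (p_age * p_sal).item()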

  28. Estimating selectivity
    Point Query: sel(Age=a, Salary=s)
    Easy: 1 forward pass, then multiply:
    P(Age=a) × P(Salary=s | Age=a)
    Range Query: sel(Age in R_age, Salary in R_sal)
    Naive enumeration = exponentially many forward passes!


  29. Estimating selectivity
    Point Query: sel(Age=a, Salary=s)
    Easy: 1 forward pass, then multiply:
    P(Age=a) × P(Salary=s | Age=a)
    Range Query: sel(Age in R_age, Salary in R_sal)
    Naive enumeration = exponentially many forward passes!
    Insight: not all points in the queried region are meaningful
    → sample tuples from the model to approximate the range density


  30. Progressive Sampling


  31. Progressive Sampling
    Progressive Sampling (Naru)
    # Query: sel(X1 in R1, ..., Xn in Rn)
    ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)             # Sample dim 1


  32. Progressive Sampling
    Progressive Sampling (Naru)
    # Query: sel(X1 in R1, ..., Xn in Rn)
    ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)             # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)      # Sample dim 2


  33. Progressive Sampling
    Progressive Sampling (Naru)
    # Query: sel(X1 in R1, ..., Xn in Rn)
    ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)             # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)      # Sample dim 2
        ...
        return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)   # Monte Carlo estimate


  34. Progressive Sampling
    Progressive Sampling (Naru)
    # Query: sel(X1 in R1, ..., Xn in Rn)
    ProgressiveSample():
        s1 ~ Model(X1 | X1 in R1)             # Sample dim 1
        s2 ~ Model(X2 | X2 in R2, X1=s1)      # Sample dim 2
        ...
        return p(X1 in R1) … p(Xn in Rn | s1, …, sn-1)   # Monte Carlo estimate
    Use learned conditionals to progressively “zoom in” on the high-mass region

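    Below is a self-contained sketch of this estimator, with explicit conditional tables standing in for the neural model so that it runs as-is (the two-column distributions are invented): each step renormalizes the conditional over the queried region, samples inside it, and multiplies in the region's mass.

        import numpy as np

        rng = np.random.default_rng(0)
        p_x1 = np.array([0.5, 0.3, 0.2])            # P(X1)
        p_x2_given_x1 = np.array([[0.7, 0.2, 0.1],  # P(X2 | X1=0)
                                  [0.1, 0.8, 0.1],  # P(X2 | X1=1)
                                  [0.2, 0.2, 0.6]]) # P(X2 | X1=2)

        def progressive_sample(r1, r2, n_samples=1000):
            """Estimate sel(X1 in r1, X2 in r2) by zooming into the queried region."""
            est = 0.0
            for _ in range(n_samples):
                mass1 = p_x1[r1].sum()                    # mass of region in dim 1
                s1 = rng.choice(r1, p=p_x1[r1] / mass1)   # sample dim 1 inside r1
                mass2 = p_x2_given_x1[s1, r2].sum()       # conditional mass in dim 2
                est += mass1 * mass2                      # Monte Carlo term
            return est / n_samples

        r1, r2 = [0, 1], [1, 2]
        exact = sum(p_x1[a] * p_x2_given_x1[a, b] for a in r1 for b in r2)
        print(progressive_sample(r1, r2), exact)          # estimate vs. ground truth

    Averaging mass1 * mass2 over samples is an unbiased estimate of the true selectivity; the exact enumeration on the last lines confirms it on this toy distribution.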

  35. Naru Optimizations
    Wildcard Skipping: avoid sampling wildcards
    ProgressiveSampling(
        City in *,            # domain(City)
        Age in [20…28],
        BirthDay in *,        # domain(BirthDay)
        Salary in [5k…15k],
        ……
    )


  36. Naru Optimizations
    Wildcard Skipping: avoid sampling wildcards
    ProgressiveSampling(
        City in *      = MASK_City,      # skipped!
        Age in [20…28],
        BirthDay in *  = MASK_BirthDay,  # skipped!
        Salary in [5k…15k],
        ……
    )
    Wildcards are set equal to MASK tokens
    10x reduction in max error


  37. Naru Optimizations
    Wildcard Skipping: avoid sampling wildcards
    ProgressiveSampling(
        City in *      = MASK_City,      # skipped!
        Age in [20…28],
        BirthDay in *  = MASK_BirthDay,  # skipped!
        Salary in [5k…15k],
        ……
    )
    Wildcards are set equal to MASK tokens
    10x reduction in max error
    More results (details in paper):
    Robust to any issuable queries
    Latency, scalability
    Updating for new data

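    A sketch of the wildcard-skipping control flow (the MASK sentinel and the conditional(col, prefix) callback are assumed mechanisms for illustration, not Naru's code): wildcard columns contribute probability mass 1 and are fed back as MASK instead of being sampled.

        import numpy as np

        MASK = -1                    # assumed sentinel id for "column unconstrained"
        rng = np.random.default_rng(0)

        def estimate(query, conditional, n_samples=1000):
            """Progressive sampling with wildcard skipping (sketch).

            query: per-column list of candidate-value indices, or '*' for a wildcard.
            conditional(col, prefix): array P(X_col | X_<col = prefix), where a MASK
            in the prefix means that earlier column was unconstrained (marginalized).
            """
            est = 0.0
            for _ in range(n_samples):
                prob, prefix = 1.0, []
                for col, region in enumerate(query):
                    if region == '*':
                        prefix.append(MASK)   # skipped: the full domain has mass 1
                        continue
                    p = conditional(col, prefix)
                    mass = p[region].sum()
                    prob *= mass
                    prefix.append(rng.choice(region, p=p[region] / mass))
                est += prob
            return est / n_samples

    With the toy tables from the previous sketch, conditional could return p_x1 for column 0, and either p_x2_given_x1[prefix[0]] or the marginal p_x1 @ p_x2_given_x1 for column 1 when prefix[0] is MASK.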

  38. Estimation Accuracy
    Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each
    Model: Masked MLP, 13MB


  39. Estimation Accuracy
    [Chart: Naru vs. baselines from CIDR’19 and SIGMOD’15]
    Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each
    Model: Masked MLP, 13MB


  40. Estimation Accuracy
    [Chart: Naru vs. baselines from CIDR’19 and SIGMOD’15; annotations: 230x, 10000x]
    Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each
    Model: Masked MLP, 13MB


  41. Estimation Accuracy
    [Chart: Naru vs. baselines from CIDR’19 and SIGMOD’15; annotation: 94x]
    Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each
    Model: Masked MLP, 13MB


  42. Estimation Accuracy
    [Chart: Naru vs. baselines from CIDR’19 and SIGMOD’15; annotation: 40-70x]
    supervised: 5.5hr to collect 10K training queries
    unsupervised: ~5min to read data + train
    Dataset: DMV (11M tuples, 11 columns)

    Workload: 2000 queries, 5-11 range/eq filters each
    Model: Masked MLP, 13MB


  43. Naru enables a new capability
    NeuroCard: heuristics-free join cardinality estimation
    Naru (autoregressive models + probabilistic inference), extended to support joins
    To appear in VLDB’21!
    bit.ly/neurocard


  44. Naru Takeaways
    Removes heuristics
    90x more accurate on multi-dim range queries
    Efficiently handles ranges & wildcards
    Naru = autoregressive models + probabilistic inference
    github.com/naru-project
    bit.ly/naru-vldb20
    [email protected]
    @zongheng_yang