$30 off During Our Annual Pro Sale. View Details »

NeuroCard: One Cardinality Estimator for All Tables

NeuroCard: One Cardinality Estimator for All Tables

VLDB 2021
Paper: https://bit.ly/neurocard-pdf
Talk video: https://youtu.be/O2VoBHSyEpw
Code: https://github.com/neurocard/neurocard/

Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries, due to making lossy modeling assumptions and not capturing inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5x maximum error on JOB-light), scales to dozens of tables, while being compact in space (several MBs) and efficient to construct or update (seconds to minutes).

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

Zongheng Yang

August 11, 2021
Tweet

More Decks by Zongheng Yang

Other Decks in Technology

Transcript

  1. NeuroCard: One Cardinality
    Estimator for All Tables
    Zongheng Yang, Amog Kamsetty*, Frank Sifei Luan*,
    Eric Liang, Yan DuanΔ, Xi ChenΔ, Ion Stoica
    github.com/neurocard
    VLDB 2021
    Δ
    bit.ly/neurocard-pdf

    View Slide

  2. Why is cardinality estimation hard
    Must learn data correlations
    across columns
    across tables (joins)

    View Slide

  3. Why is cardinality estimation hard
    5k<=salary<=15k
    20<=age<=28
    Must learn data correlations
    across columns
    across tables (joins)
    Probability(salary in [5k,15k],
    age in [20,28])?

    View Slide

  4. Why is cardinality estimation hard
    Must learn data correlations
    across columns
    across tables (joins)
    In T1⋈T2⋈T3, Probability(
    T1.salary in [5k,15k], T2.age in [20,28])?
    20<=age<=28
    5k<=salary<=15k

    View Slide

  5. Heuristics lose information
    Independence Uniformity

    View Slide

  6. Heuristics lose information
    Independence Uniformity
    𝘱(T1.salary)
    𝘱(T1.salary,
    T2.age)
    𝘱(T2.age)
    ≈ ×
    Heuristics learn fewer correlations,
    making cardinality estimates inaccurate.
    “Values in each bin

    are uniform”

    View Slide

  7. NeuroCard
    Learn all correlations over all tables, without heuristics

    View Slide

  8. NeuroCard
    Learn all correlations over all tables, without heuristics
    NeuroCard
    Cardinality?
    50
    Cardinality?
    1000
    Joined
    Not joined
    P(all tables)
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    learned in a single deep
    autoregressive model
    Table
    Distribution of all
    columns from all tables
    (full outer join)

    View Slide

  9. NeuroCard
    Learn all correlations over all tables, without heuristics
    NeuroCard
    Cardinality?
    50
    Cardinality?
    1000
    Joined
    Not joined
    P(all tables)
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    learned in a single deep
    autoregressive model
    Table
    Distribution of all
    columns from all tables
    (full outer join)
    State-of-the-art: learns complex
    distributions w/o heuristics

    View Slide

  10. NeuroCard
    Learn all correlations over all tables, without heuristics
    NeuroCard
    Cardinality?
    50
    Cardinality?
    1000
    Joined
    Not joined
    P(all tables)
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==
    learned in a single deep
    autoregressive model
    Table
    Supports any join queries
    within the schema

    View Slide

  11. Training NeuroCard
    Tables &
    Join Schema

    View Slide

  12. Training NeuroCard
    Tables &
    Join Schema
    Probabilistic Model

    (deep autoreg.)
    tuples
    pθ(HH +QHmKMb 7`QK HH i#H2b)

    View Slide

  13. Training NeuroCard
    Tables &
    Join Schema
    Compute
    full join?
    Probabilistic Model

    (deep autoreg.)
    tuples
    pθ(HH +QHmKMb 7`QK HH i#H2b)

    View Slide

  14. Training NeuroCard
    Tables &
    Join Schema
    Compute
    full join?
    Probabilistic Model

    (deep autoreg.)
    tuples
    pθ(HH +QHmKMb 7`QK HH i#H2b)
    Impractical

    (e.g., 6 IMDB tables join to
    2,000,000,000,000 rows)

    View Slide

  15. Training NeuroCard
    Tables &
    Join Schema
    Unbiased

    Join Sampler
    Sample ~
    full join of all tables
    Probabilistic Model

    (deep autoreg.)
    sampled
    tuples
    pθ(HH +QHmKMb 7`QK HH i#H2b)

    View Slide

  16. Unbiased join sampling
    To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
    First method to point this out—prior use of IBJS learns biased distributions

    View Slide

  17. Unbiased join sampling
    To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
    First method to point this out—prior use of IBJS learns biased distributions
    Solution: a linear-time DP algo
    precompute correct sampling weights for all keys

    View Slide

  18. Unbiased join sampling
    To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
    First method to point this out—prior use of IBJS learns biased distributions
    Sampler
    Sampler
    Sampler
    Join schema
    Embarrassingly
    parallel
    Solution: a linear-time DP algo
    precompute correct sampling weights for all keys
    〈T1.salary, T1.dept, …, Tn.age〉
    〈T1.salary, T1.dept, …, Tn.age〉
    Batch of tuples from J,
    sampled on-the-fly
    sample()

    View Slide

  19. Autoregressive model
    x1
    x2
    xn
    T1.salary
    T1.dept
    Tn.age
    Sampled tuple
    (input)

    View Slide

  20. Autoregressive model
    x1
    x2
    xn
    T1.salary
    T1.dept
    Tn.age
    Sampled tuple
    (input)
    Model output
    Deep
    Autoregressive
    Model
    P𝜃(Xi | xfor all i

    View Slide

  21. Autoregressive model
    x1
    x2
    xn
    T1.salary
    T1.dept
    Tn.age
    Sampled tuple
    (input)
    Model output
    Deep
    Autoregressive
    Model
    No independence assumptions: 𝘱(X1,…,Xn) = ∏i 𝘱(Xi | X< i)
    Architectures: ResMADE, Transformer, WaveNet, PixelCNN, …
    P𝜃(Xi | xfor all i

    View Slide

  22. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns

    View Slide

  23. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns
    If using 32-dim embeddings,
    needs ~128 MB of parameters.

    View Slide

  24. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns
    Idea: factorize into sub-columns

    View Slide

  25. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns
    Idea: factorize into sub-columns

    View Slide

  26. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns
    Idea: factorize into sub-columns
    Embed much smaller
    “vocabularies”

    View Slide

  27. Column
    Domain: 106
    1,000,000
    1
    Binary representation
    1111010000 1001000000
    0000000000 0000000001
    Subcol 1
    976
    0
    Subcol 2
    576
    1
    Domain: ≤ 2N
    Chunk every N=10 bits
    Column factorization
    Model size blown up by high-cardinality columns
    Idea: factorize into sub-columns
    Massive space savings with minimal loss of
    statistical efficiency (e.g., 128 MB → 250 KB, or 500x)
    Embed much smaller
    “vocabularies”

    View Slide

  28. Querying cardinalities
    Key: “schema subsetting”
    Full join on N tables, query on MDetails in paper
    Use model outputs to compute cardinalities

    View Slide

  29. Eval: Accuracy
    JOB-light
    6 tables; |Join| = 2 trillion
    simple filters, single-key joins
    Postgres Supervised
    (MSCN)
    Unsupervised
    (DeepDB)
    NeuroCard
    100
    101
    102
    103
    Q-Error

    View Slide

  30. Eval: Accuracy
    JOB-light
    6 tables; |Join| = 2 trillion
    simple filters, single-key joins
    Postgres Supervised
    (MSCN)
    Unsupervised
    (DeepDB)
    NeuroCard
    100
    101
    102
    103
    Q-Error

    View Slide

  31. Eval: Accuracy
    JOB-light
    6 tables; |Join| = 2 trillion
    simple filters, single-key joins
    JOB-M
    16 tables; |Join| = 10 trillion
    complex filters, multi-key joins
    Postgres Supervised
    (MSCN)
    Unsupervised
    (DeepDB)
    NeuroCard
    100
    101
    102
    103
    Q-Error
    Join Sampling
    (IBJS)
    Postgres NeuroCard
    100
    101
    102
    103
    104
    105
    106
    Q-Error

    View Slide

  32. Eval: Accuracy
    JOB-light
    6 tables; |Join| = 2 trillion
    simple filters, single-key joins
    JOB-M
    16 tables; |Join| = 10 trillion
    complex filters, multi-key joins
    Postgres Supervised
    (MSCN)
    Unsupervised
    (DeepDB)
    NeuroCard
    100
    101
    102
    103
    Q-Error
    Join Sampling
    (IBJS)
    Postgres NeuroCard
    100
    101
    102
    103
    104
    105
    106
    Q-Error
    Outperforms best prior methods by 4x–30x
    (see paper for more results, ablations)

    View Slide

  33. Eval: Efficiency
    1 2 3 4 5 6 7
    Per 1M tuples
    100
    101
    102
    103
    104
    Q-error, p99
    JOB-light-ranges JOB-light
    0 1000 2000
    Runtime (sec)
    MSCN
    DeepDB
    NeuroCard
    3min
    24min
    5min
    38min
    7min
    3min
    JOB-light-ranges JOB-light
    Training on a few million tuples
    is sufficient for high accuracy
    (0.001% of the join)
    Training completes
    in a few minutes (with a GPU)

    View Slide

  34. Eval: Data updates
    1 2 3 4 5
    101
    102
    103
    104
    Q-error, p95
    1 2 3 4 5
    101
    102
    Q-error, p50
    Fast update (⇠3 sec)
    Retrain (⇠3 min)
    Stale
    Postgres, re-ANALYZE
    0.0 0.2 0.4 0.6 0.8 1.0
    Number of Ingested Partitions
    0.0
    0.5
    1.0
    New data partitions arrive by time
    Fast update: take 1% of total SGD steps

    View Slide

  35. Eval: Data updates
    1 2 3 4 5
    101
    102
    103
    104
    Q-error, p95
    1 2 3 4 5
    101
    102
    Q-error, p50
    Fast update (⇠3 sec)
    Retrain (⇠3 min)
    Stale
    Postgres, re-ANALYZE
    0.0 0.2 0.4 0.6 0.8 1.0
    Number of Ingested Partitions
    0.0
    0.5
    1.0
    New data partitions arrive by time
    Fast update: take 1% of total SGD steps
    Fast update recovers accuracy in seconds;
    approaches retraining

    View Slide

  36. Eval: Real execution
    Compare plans from a simple cost model [1], with
    Postgres cardinalities
    NeuroCard
    Time (mins)
    0
    50
    100
    150
    JOB-light-ranges (~1K queries)
    1.9x speedup
    (68 mins)
    [1] How Good Are Query Optimizers, Really? Leis et al., VLDB’15.
    Postgres
    card.
    NeuroCard

    View Slide

  37. github.com/neurocard
    bit.ly/neurocard-pdf
    Neurocard Takeaways
    Learns all correlations
    in a database

    Can’t make it live due to time zone - let’s chat offline:
    [email protected]
    @zongheng_yang
    NeuroCard
    Join
    Sampler
    Column
    Factorization
    Autoreg.
    Model
    Removes independence
    assumptions on joins

    Strategies to make it more
    practical

    View Slide