Upgrade to Pro — share decks privately, control downloads, hide ads and more …

NeuroCard: One Cardinality Estimator for All Tables

NeuroCard: One Cardinality Estimator for All Tables

VLDB 2021
Paper: https://bit.ly/neurocard-pdf
Talk video: https://youtu.be/O2VoBHSyEpw
Code: https://github.com/neurocard/neurocard/

Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries, due to making lossy modeling assumptions and not capturing inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5x maximum error on JOB-light), scales to dozens of tables, while being compact in space (several MBs) and efficient to construct or update (seconds to minutes).

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.

2198891db5e2442b031b82d519a4e132?s=128

Zongheng Yang

August 11, 2021
Tweet

Transcript

  1. NeuroCard: One Cardinality Estimator for All Tables Zongheng Yang, Amog

    Kamsetty*, Frank Sifei Luan*, Eric Liang, Yan DuanΔ, Xi ChenΔ, Ion Stoica github.com/neurocard VLDB 2021 Δ bit.ly/neurocard-pdf
  2. Why is cardinality estimation hard Must learn data correlations across

    columns across tables (joins)
  3. Why is cardinality estimation hard 5k<=salary<=15k 20<=age<=28 Must learn data

    correlations across columns across tables (joins) Probability(salary in [5k,15k], age in [20,28])?
  4. Why is cardinality estimation hard Must learn data correlations across

    columns across tables (joins) In T1⋈T2⋈T3, Probability( T1.salary in [5k,15k], T2.age in [20,28])? 20<=age<=28 5k<=salary<=15k
  5. Heuristics lose information Independence Uniformity

  6. Heuristics lose information Independence Uniformity 𝘱(T1.salary) 𝘱(T1.salary, T2.age) 𝘱(T2.age) ≈

    × Heuristics learn fewer correlations, making cardinality estimates inaccurate. “Values in each bin are uniform”
  7. NeuroCard Learn all correlations over all tables, without heuristics

  8. NeuroCard Learn all correlations over all tables, without heuristics NeuroCard

    Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Distribution of all columns from all tables (full outer join)
  9. NeuroCard Learn all correlations over all tables, without heuristics NeuroCard

    Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Distribution of all columns from all tables (full outer join) State-of-the-art: learns complex distributions w/o heuristics
  10. NeuroCard Learn all correlations over all tables, without heuristics NeuroCard

    Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Supports any join queries within the schema
  11. Training NeuroCard Tables & Join Schema

  12. Training NeuroCard Tables & Join Schema Probabilistic Model (deep autoreg.)

    tuples pθ(HH +QHmKMb 7`QK HH i#H2b)
  13. Training NeuroCard Tables & Join Schema Compute full join? Probabilistic

    Model (deep autoreg.) tuples pθ(HH +QHmKMb 7`QK HH i#H2b)
  14. Training NeuroCard Tables & Join Schema Compute full join? Probabilistic

    Model (deep autoreg.) tuples pθ(HH +QHmKMb 7`QK HH i#H2b) Impractical (e.g., 6 IMDB tables join to 2,000,000,000,000 rows)
  15. Training NeuroCard Tables & Join Schema Unbiased Join Sampler Sample

    ~ full join of all tables Probabilistic Model (deep autoreg.) sampled tuples pθ(HH +QHmKMb 7`QK HH i#H2b)
  16. Unbiased join sampling To learn 𝘱(J=full join) correctly, sample ~

    J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions
  17. Unbiased join sampling To learn 𝘱(J=full join) correctly, sample ~

    J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions Solution: a linear-time DP algo precompute correct sampling weights for all keys
  18. Unbiased join sampling To learn 𝘱(J=full join) correctly, sample ~

    J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions Sampler Sampler Sampler Join schema Embarrassingly parallel Solution: a linear-time DP algo precompute correct sampling weights for all keys 〈T1.salary, T1.dept, …, Tn.age〉 〈T1.salary, T1.dept, …, Tn.age〉 Batch of tuples from J, sampled on-the-fly sample()
  19. Autoregressive model x1 x2 xn T1.salary T1.dept Tn.age Sampled tuple

    (input)
  20. Autoregressive model x1 x2 xn T1.salary T1.dept Tn.age Sampled tuple

    (input) Model output Deep Autoregressive Model P𝜃(Xi | x<i) for all i
  21. Autoregressive model x1 x2 xn T1.salary T1.dept Tn.age Sampled tuple

    (input) Model output Deep Autoregressive Model No independence assumptions: 𝘱(X1,…,Xn) = ∏i 𝘱(Xi | X< i) Architectures: ResMADE, Transformer, WaveNet, PixelCNN, … P𝜃(Xi | x<i) for all i
  22. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns
  23. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns If using 32-dim embeddings, needs ~128 MB of parameters.
  24. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns
  25. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns
  26. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns Embed much smaller “vocabularies”
  27. Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000

    0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns Massive space savings with minimal loss of statistical efficiency (e.g., 128 MB → 250 KB, or 500x) Embed much smaller “vocabularies”
  28. Querying cardinalities Key: “schema subsetting” Full join on N tables,

    query on M<N tables Details in paper Use model outputs to compute cardinalities
  29. Eval: Accuracy JOB-light 6 tables; |Join| = 2 trillion simple

    filters, single-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error
  30. Eval: Accuracy JOB-light 6 tables; |Join| = 2 trillion simple

    filters, single-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error
  31. Eval: Accuracy JOB-light 6 tables; |Join| = 2 trillion simple

    filters, single-key joins JOB-M 16 tables; |Join| = 10 trillion complex filters, multi-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error Join Sampling (IBJS) Postgres NeuroCard 100 101 102 103 104 105 106 Q-Error
  32. Eval: Accuracy JOB-light 6 tables; |Join| = 2 trillion simple

    filters, single-key joins JOB-M 16 tables; |Join| = 10 trillion complex filters, multi-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error Join Sampling (IBJS) Postgres NeuroCard 100 101 102 103 104 105 106 Q-Error Outperforms best prior methods by 4x–30x (see paper for more results, ablations)
  33. Eval: Efficiency 1 2 3 4 5 6 7 Per

    1M tuples 100 101 102 103 104 Q-error, p99 JOB-light-ranges JOB-light 0 1000 2000 Runtime (sec) MSCN DeepDB NeuroCard 3min 24min 5min 38min 7min 3min JOB-light-ranges JOB-light Training on a few million tuples is sufficient for high accuracy (0.001% of the join) Training completes in a few minutes (with a GPU)
  34. Eval: Data updates 1 2 3 4 5 101 102

    103 104 Q-error, p95 1 2 3 4 5 101 102 Q-error, p50 Fast update (⇠3 sec) Retrain (⇠3 min) Stale Postgres, re-ANALYZE 0.0 0.2 0.4 0.6 0.8 1.0 Number of Ingested Partitions 0.0 0.5 1.0 New data partitions arrive by time Fast update: take 1% of total SGD steps
  35. Eval: Data updates 1 2 3 4 5 101 102

    103 104 Q-error, p95 1 2 3 4 5 101 102 Q-error, p50 Fast update (⇠3 sec) Retrain (⇠3 min) Stale Postgres, re-ANALYZE 0.0 0.2 0.4 0.6 0.8 1.0 Number of Ingested Partitions 0.0 0.5 1.0 New data partitions arrive by time Fast update: take 1% of total SGD steps Fast update recovers accuracy in seconds; approaches retraining
  36. Eval: Real execution Compare plans from a simple cost model

    [1], with Postgres cardinalities NeuroCard Time (mins) 0 50 100 150 JOB-light-ranges (~1K queries) 1.9x speedup (68 mins) [1] How Good Are Query Optimizers, Really? Leis et al., VLDB’15. Postgres card. NeuroCard
  37. github.com/neurocard bit.ly/neurocard-pdf Neurocard Takeaways Learns all correlations in a database

    ✓ Can’t make it live due to time zone - let’s chat offline: zongheng@cs.berkeley.edu @zongheng_yang NeuroCard Join Sampler Column Factorization Autoreg. Model Removes independence assumptions on joins ✓ Strategies to make it more practical ✓