NeuroCard: One Cardinality Estimator for All Tables

NeuroCard: One Cardinality Estimator for All Tables Zongheng Yang, Amog
Kamsetty*, Frank Sifei Luan*, Eric Liang, Yan DuanΔ, Xi ChenΔ, Ion Stoica github.com/neurocard VLDB 2021 Δ bit.ly/neurocard-pdf

Why is cardinality estimation hard Must learn data correlations across
columns across tables (joins)

Why is cardinality estimation hard 5k<=salary<=15k 20<=age<=28 Must learn data
correlations across columns across tables (joins) Probability(salary in [5k,15k], age in [20,28])?

Why is cardinality estimation hard Must learn data correlations across
columns across tables (joins) In T1⋈T2⋈T3, Probability( T1.salary in [5k,15k], T2.age in [20,28])? 20<=age<=28 5k<=salary<=15k

Heuristics lose information Independence Uniformity

Heuristics lose information Independence Uniformity 𝘱(T1.salary) 𝘱(T1.salary, T2.age) 𝘱(T2.age) ≈
× Heuristics learn fewer correlations, making cardinality estimates inaccurate. “Values in each bin are uniform”

NeuroCard Learn all correlations over all tables, without heuristics

NeuroCard Learn all correlations over all tables, without heuristics NeuroCard
Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Distribution of all columns from all tables (full outer join)

Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Distribution of all columns from all tables (full outer join) State-of-the-art: learns complex distributions w/o heuristics

Cardinality? 50 Cardinality? 1000 Joined Not joined P(all tables) <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> <latexit sha1_base64="quhJtApYKx4j0ILn8og/98pKl/w=">AAAChXicbVFdSxtBFJ1sbdXth7H65svQUIiQht0ibd8qWNAHHyI2akmWcHdyNw6ZnV1m7pbEJf/F1/qP/DfOxggm9sLA4Zz7MffcOFfSUhDc17xXa6/frG9s+m/fvf+wVd/+eGGzwgjsikxl5ioGi0pq7JIkhVe5QUhjhZfx+KjSL/+isTLTv2maY5TCSMtECiBHDeq7nWafcEIlKMUJXJmd7Q/qjaAdzIO/BOECNNgiOoPt2p/+MBNFipqEAmt7YZBTVIIhKRTO/H5hMQcxhhH2HNSQoo3K+fdn/LNjhjzJjHua+Jx9XlFCau00jV1mCnRtV7WK/J/WKyj5EZVS5wWhFo+DksLtmfHKCz6UBgWpqQMgjHR/5eIaDAhyjvn+0px59xzF0i7lpNBSZENcYRVNyIAjLVIKUld7lcfSWXwO2j7xrmElNH/JkSTbOnUn0a1jgzjef5bsjhGuWv8SXHxthw6fHTQO+eIsG2yPfWJNFrLv7JCdsA7rMsFu2C37x+68de+Ld+B9e0z1aouaHbYU3s8HECXGQA==</latexit> learned in a single deep autoregressive model Table Supports any join queries within the schema

Training NeuroCard Tables & Join Schema

Training NeuroCard Tables & Join Schema Probabilistic Model (deep autoreg.)
tuples pθ(HH +QHmKMb 7`QK HH i#H2b)

Training NeuroCard Tables & Join Schema Compute full join? Probabilistic
Model (deep autoreg.) tuples pθ(HH +QHmKMb 7`QK HH i#H2b)

Training NeuroCard Tables & Join Schema Compute full join? Probabilistic
Model (deep autoreg.) tuples pθ(HH +QHmKMb 7`QK HH i#H2b) Impractical (e.g., 6 IMDB tables join to 2,000,000,000,000 rows)

Training NeuroCard Tables & Join Schema Unbiased Join Sampler Sample
~ full join of all tables Probabilistic Model (deep autoreg.) sampled tuples pθ(HH +QHmKMb 7`QK HH i#H2b)

Unbiased join sampling To learn 𝘱(J=full join) correctly, sample ~
J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions

J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions Solution: a linear-time DP algo precompute correct sampling weights for all keys

J uniformly i.i.d. First method to point this out—prior use of IBJS learns biased distributions Sampler Sampler Sampler Join schema Embarrassingly parallel Solution: a linear-time DP algo precompute correct sampling weights for all keys 〈T1.salary, T1.dept, …, Tn.age〉〈T1.salary, T1.dept, …, Tn.age〉 Batch of tuples from J, sampled on-the-fly sample()

Autoregressive model x1 x2 xn T1.salary T1.dept Tn.age Sampled tuple
(input)

(input) Model output Deep Autoregressive Model P𝜃(Xi | x<i) for all i

(input) Model output Deep Autoregressive Model No independence assumptions: 𝘱(X1,…,Xn) = ∏i 𝘱(Xi | X< i) Architectures: ResMADE, Transformer, WaveNet, PixelCNN, … P𝜃(Xi | x<i) for all i

Column Domain: 106 1,000,000 1 Binary representation 1111010000 1001000000 0000000000
0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns

0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns If using 32-dim embeddings, needs ~128 MB of parameters.

0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns

0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns Embed much smaller “vocabularies”

0000000001 Subcol 1 976 0 Subcol 2 576 1 Domain: ≤ 2N Chunk every N=10 bits Column factorization Model size blown up by high-cardinality columns Idea: factorize into sub-columns Massive space savings with minimal loss of statistical efficiency (e.g., 128 MB → 250 KB, or 500x) Embed much smaller “vocabularies”

Querying cardinalities Key: “schema subsetting” Full join on N tables,
query on M<N tables Details in paper Use model outputs to compute cardinalities

Eval: Accuracy JOB-light 6 tables; |Join| = 2 trillion simple
filters, single-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error

filters, single-key joins JOB-M 16 tables; |Join| = 10 trillion complex filters, multi-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error Join Sampling (IBJS) Postgres NeuroCard 100 101 102 103 104 105 106 Q-Error

filters, single-key joins JOB-M 16 tables; |Join| = 10 trillion complex filters, multi-key joins Postgres Supervised (MSCN) Unsupervised (DeepDB) NeuroCard 100 101 102 103 Q-Error Join Sampling (IBJS) Postgres NeuroCard 100 101 102 103 104 105 106 Q-Error Outperforms best prior methods by 4x–30x (see paper for more results, ablations)

Eval: Efficiency 1 2 3 4 5 6 7 Per
1M tuples 100 101 102 103 104 Q-error, p99 JOB-light-ranges JOB-light 0 1000 2000 Runtime (sec) MSCN DeepDB NeuroCard 3min 24min 5min 38min 7min 3min JOB-light-ranges JOB-light Training on a few million tuples is sufficient for high accuracy (0.001% of the join) Training completes in a few minutes (with a GPU)

Eval: Data updates 1 2 3 4 5 101 102
103 104 Q-error, p95 1 2 3 4 5 101 102 Q-error, p50 Fast update (⇠3 sec) Retrain (⇠3 min) Stale Postgres, re-ANALYZE 0.0 0.2 0.4 0.6 0.8 1.0 Number of Ingested Partitions 0.0 0.5 1.0 New data partitions arrive by time Fast update: take 1% of total SGD steps

Eval: Data updates 1 2 3 4 5 101 102
103 104 Q-error, p95 1 2 3 4 5 101 102 Q-error, p50 Fast update (⇠3 sec) Retrain (⇠3 min) Stale Postgres, re-ANALYZE 0.0 0.2 0.4 0.6 0.8 1.0 Number of Ingested Partitions 0.0 0.5 1.0 New data partitions arrive by time Fast update: take 1% of total SGD steps Fast update recovers accuracy in seconds; approaches retraining

Eval: Real execution Compare plans from a simple cost model
[1], with Postgres cardinalities NeuroCard Time (mins) 0 50 100 150 JOB-light-ranges (~1K queries) 1.9x speedup (68 mins) [1] How Good Are Query Optimizers, Really? Leis et al., VLDB’15. Postgres card. NeuroCard

github.com/neurocard bit.ly/neurocard-pdf Neurocard Takeaways Learns all correlations in a database
✓ Can’t make it live due to time zone - let’s chat offline: [email protected] @zongheng_yang NeuroCard Join Sampler Column Factorization Autoreg. Model Removes independence assumptions on joins ✓ Strategies to make it more practical ✓

NeuroCard: One Cardinality Estimator for All Ta...

NeuroCard: One Cardinality Estimator for All Tables

More Decks by Zongheng Yang

Other Decks in Technology

Featured

Transcript