Zongheng Yang
August 11, 2021

# NeuroCard: One Cardinality Estimator for All Tables

VLDB 2021
Paper: https://bit.ly/neurocard-pdf
Talk video: https://youtu.be/O2VoBHSyEpw
Code: https://github.com/neurocard/neurocard/

Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries because they make lossy modeling assumptions and fail to capture inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders-of-magnitude higher accuracy than the best prior methods (a new state-of-the-art maximum error of 8.5x on JOB-light) and scales to dozens of tables, while remaining compact in space (several MBs) and efficient to construct or update (seconds to minutes).

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.


## Transcript

1. ### NeuroCard: One Cardinality Estimator for All Tables

Zongheng Yang, Amog Kamsetty*, Frank Sifei Luan*, Eric Liang, Yan DuanΔ, Xi ChenΔ, Ion Stoica. VLDB 2021. github.com/neurocard · bit.ly/neurocard-pdf
2. ### Why is cardinality estimation hard

Must learn data correlations across columns and across tables (joins).
3. ### Why is cardinality estimation hard

Must learn data correlations across columns and across tables (joins). Example: Probability(salary in [5k, 15k], age in [20, 28])?
4. ### Why is cardinality estimation hard

Must learn data correlations across columns and across tables (joins). In T1⋈T2⋈T3: Probability(T1.salary in [5k, 15k], T2.age in [20, 28])?

6. ### Heuristics lose information

Independence assumption: 𝘱(T1.salary, T2.age) ≈ 𝘱(T1.salary) × 𝘱(T2.age). Uniformity assumption: "values in each bin are uniform." These heuristics learn fewer correlations, making cardinality estimates inaccurate.
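
The cost of the independence heuristic is easy to see on synthetic data. Below is a minimal sketch (using a hypothetical toy table with correlated age and salary columns, not the paper's datasets): the product of per-column selectivities underestimates the true joint selectivity of the two correlated range predicates from the slide.

```python
import random

random.seed(0)

# Toy table where salary and age are strongly correlated:
# younger employees tend to have lower salaries.
rows = [(age, age * 500 + random.randint(-2000, 2000)) for age in
        [random.randint(20, 60) for _ in range(10_000)]]

def sel(pred):
    """Fraction of rows satisfying a predicate (true selectivity)."""
    return sum(1 for r in rows if pred(r)) / len(rows)

true_joint = sel(lambda r: 20 <= r[0] <= 28 and 5_000 <= r[1] <= 15_000)
independent = (sel(lambda r: 20 <= r[0] <= 28)
               * sel(lambda r: 5_000 <= r[1] <= 15_000))

# The product of marginals misses the correlation, so it
# underestimates the joint selectivity of the overlapping predicates.
print(true_joint, independent)
```

With the correlation present, `true_joint` comes out severalfold larger than the independence estimate, which is exactly the kind of error that compounds across many predicates and joins.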

12. ### Training NeuroCard

Tables & join schema → tuples → probabilistic model (deep autoregressive): pθ(all columns from all tables).

13. ### Training NeuroCard

Compute the full join, then feed its tuples to the probabilistic model (deep autoregressive) pθ(all columns from all tables)?

14. ### Training NeuroCard

Computing the full join is impractical (e.g., 6 IMDB tables join to 2,000,000,000,000 rows).

15. ### Training NeuroCard

Instead, an unbiased join sampler draws samples ~ full join of all tables; the sampled tuples train the probabilistic model (deep autoregressive) pθ(all columns from all tables).
16. ### Unbiased join sampling

To learn 𝘱(J = full join) correctly, sample ~ J uniformly i.i.d. NeuroCard is the first method to point this out; prior use of IBJS learns biased distributions.
17. ### Unbiased join sampling

To learn 𝘱(J = full join) correctly, sample ~ J uniformly i.i.d. NeuroCard is the first method to point this out; prior use of IBJS learns biased distributions. Solution: a linear-time dynamic-programming algorithm precomputes correct sampling weights for all keys.
18. ### Unbiased join sampling

To learn 𝘱(J = full join) correctly, sample ~ J uniformly i.i.d. NeuroCard is the first method to point this out; prior use of IBJS learns biased distributions. Solution: a linear-time dynamic-programming algorithm precomputes correct sampling weights for all keys. The samplers are embarrassingly parallel: given the join schema, each sample() call produces a batch of tuples 〈T1.salary, T1.dept, …, Tn.age〉 from J, sampled on the fly.
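
A minimal two-table sketch of this idea (a simplification under assumed toy tables, not the paper's general DP over a join tree): weight each tuple of one table by the number of full-join rows it participates in, then sample proportionally to those weights, which yields uniform i.i.d. samples of the join.

```python
import random
from collections import Counter

random.seed(0)

# Two toy tables joined on key "k"; t1 ⋈ t2 has one output row per
# matching (t1 row, t2 row) pair -- 7 rows in total here.
t1 = [("a", 1), ("b", 1), ("c", 2)]            # (payload, k)
t2 = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]  # (k, payload)

# DP step, specialized to two tables: each t1 row's weight is the
# number of full-join rows it appears in, i.e. its fanout into t2.
fanout = Counter(k for k, _ in t2)
weights = [fanout[k] for _, k in t1]  # [3, 3, 1]

def sample_join_tuple():
    """Draw one tuple uniformly from the full join t1 ⋈ t2."""
    # Pick a t1 row proportional to its join count, ...
    (p1, k), = random.choices(t1, weights=weights)
    # ...then a matching t2 row uniformly among its matches.
    _, p2 = random.choice([r for r in t2 if r[0] == k])
    return (p1, p2)

# Every one of the 7 join rows appears with probability 1/7; a plain
# uniform pick from t1 would over-represent key 2 (IBJS-style bias).
counts = Counter(sample_join_tuple() for _ in range(70_000))
```

Sampling t1 uniformly instead would give each of key 2's single join row probability 1/3 · 1 = 1/3 rather than 1/7, which is the bias the slide attributes to prior IBJS-based training.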

20. ### Autoregressive model

Input: a sampled tuple 〈x1 = T1.salary, x2 = T1.dept, …, xn = Tn.age〉. Output of the deep autoregressive model: P𝜃(Xi | x<i) for all i.
21. ### Autoregressive model

Input: a sampled tuple 〈x1 = T1.salary, x2 = T1.dept, …, xn = Tn.age〉. Output of the deep autoregressive model: P𝜃(Xi | x<i) for all i. No independence assumptions: 𝘱(X1, …, Xn) = ∏i 𝘱(Xi | X<i). Architectures: ResMADE, Transformer, WaveNet, PixelCNN, …
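
The chain-rule factorization above can be illustrated with a toy tabulated "model" standing in for the neural network: the product of conditionals recovers the exact joint distribution, with no independence assumption.

```python
# Chain-rule factorization used by autoregressive density models:
# p(x1, ..., xn) = prod_i p(xi | x_<i) -- exact, no independence
# assumption.  Toy stand-in: conditionals tabulated from a tiny
# two-column dataset instead of produced by a neural net.
from collections import Counter
from itertools import product

data = [(0, 0), (0, 0), (1, 1), (1, 1), (1, 0)]
joint = Counter(data)
n = len(data)

def p_x1(x1):
    return sum(c for (a, _), c in joint.items() if a == x1) / n

def p_x2_given_x1(x2, x1):
    den = sum(c for (a, _), c in joint.items() if a == x1)
    return joint[(x1, x2)] / den

def p(x1, x2):
    # Autoregressive product of conditionals.
    return p_x1(x1) * p_x2_given_x1(x2, x1)

# The product of conditionals equals the empirical joint exactly.
assert all(abs(p(a, b) - joint[(a, b)] / n) < 1e-9
           for a, b in product([0, 1], repeat=2))
```

A trained network approximates each conditional P𝜃(Xi | x<i) instead of tabulating it, but the factorization, and hence the absence of independence assumptions, is the same.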
22. ### Column factorization

Model size is blown up by high-cardinality columns. Example: a column with domain 10^6 (values 1 to 1,000,000).

23. ### Column factorization

If using 32-dim embeddings, such a column needs ~128 MB of parameters.

24. ### Column factorization

Idea: factorize into sub-columns.

25. ### Column factorization

Chunk the binary representation every N=10 bits, so each sub-column has domain ≤ 2^N. E.g., 1,000,000 → 1111010000 1001000000 → subcol 1 = 976, subcol 2 = 576; 1 → 0000000000 0000000001 → subcol 1 = 0, subcol 2 = 1.

26. ### Column factorization

Embed much smaller "vocabularies".

27. ### Column factorization

Massive space savings with minimal loss of statistical efficiency (e.g., 128 MB → 250 KB, or ~500x).
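
The bit-chunking from the slide can be sketched directly. A minimal version, assuming a 20-bit domain and N=10 as in the example:

```python
def factorize(value: int, domain_bits: int = 20, n: int = 10):
    """Split a value's binary representation into sub-columns of n
    bits each (most-significant chunk first), so each sub-column has
    a domain of at most 2**n."""
    n_subcols = -(-domain_bits // n)  # ceil(domain_bits / n)
    return [(value >> (i * n)) & ((1 << n) - 1)
            for i in reversed(range(n_subcols))]

def defactorize(subs, n: int = 10) -> int:
    """Inverse: reassemble sub-column codes into the original value."""
    value = 0
    for s in subs:
        value = (value << n) | s
    return value

# The slide's example: 1,000,000 -> 1111010000 | 1001000000.
assert factorize(1_000_000) == [976, 576]
assert factorize(1) == [0, 1]
assert defactorize([976, 576]) == 1_000_000
```

Each sub-column then gets its own (much smaller) embedding table: two vocabularies of 1,024 entries instead of one of 1,000,000, which is where the space savings come from.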
28. ### Querying cardinalities

Key: "schema subsetting". The model is trained on the full join of N tables, but a query may touch only M < N tables; use model outputs to compute cardinalities. Details in paper.
29. ### Eval: Accuracy

JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins. [Chart: Q-error (10^0 to 10^3) for Postgres, Supervised (MSCN), Unsupervised (DeepDB), and NeuroCard.]
31. ### Eval: Accuracy

JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins. JOB-M: 16 tables; |Join| = 10 trillion; complex filters, multi-key joins. [Charts: Q-error (10^0 to 10^3) for Postgres, Supervised (MSCN), Unsupervised (DeepDB), and NeuroCard on JOB-light; Q-error (10^0 to 10^6) for Join Sampling (IBJS), Postgres, and NeuroCard on JOB-M.]
32. ### Eval: Accuracy

JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins. JOB-M: 16 tables; |Join| = 10 trillion; complex filters, multi-key joins. NeuroCard outperforms the best prior methods by 4x–30x (see paper for more results and ablations).
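
The Q-error on the y-axes of these charts is the standard cardinality-estimation metric: the multiplicative gap between estimate and truth, symmetric for over- and under-estimation. A minimal implementation:

```python
def q_error(est: float, actual: float) -> float:
    """Q-error of a cardinality estimate: max(est/actual, actual/est).
    Always >= 1; 1.0 means a perfect estimate, and over- and
    under-estimation by the same factor score identically."""
    assert est > 0 and actual > 0
    return max(est / actual, actual / est)

# An estimator off by 8.5x in either direction has q-error 8.5,
# matching the "8.5x maximum error on JOB-light" headline result.
assert q_error(8.5, 1.0) == q_error(1.0, 8.5) == 8.5
assert q_error(100, 100) == 1.0
```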
33. ### Eval: Efficiency

[Charts: p99 Q-error (10^0 to 10^4) vs. millions of training tuples (1–7M) on JOB-light and JOB-light-ranges; training runtime (sec) for MSCN, DeepDB, and NeuroCard on JOB-light and JOB-light-ranges, ranging from 3 min to 38 min.] Training on a few million tuples (0.001% of the join) is sufficient for high accuracy. Training completes in a few minutes (with a GPU).
34. ### Eval: Data updates

New data partitions arrive over time. Fast update: take 1% of the total SGD steps (~3 sec), compared against a full retrain (~3 min), a stale model, and Postgres with re-ANALYZE. [Charts: p95 Q-error (10^1 to 10^4) and p50 Q-error (10^1 to 10^2) vs. number of ingested partitions (1–5).]
35. ### Eval: Data updates

New data partitions arrive over time. Fast update (~3 sec) vs. retrain (~3 min) vs. stale model vs. Postgres re-ANALYZE: the fast update recovers accuracy in seconds and approaches retraining.
36. ### Eval: Real execution

Compare plans from a simple cost model [1], using Postgres cardinalities vs. NeuroCard cardinalities. On JOB-light-ranges (~1K queries), NeuroCard's plans give a 1.9x speedup (68 mins). [Chart: total runtime (mins), 0–150.]

[1] How Good Are Query Optimizers, Really? Leis et al., VLDB'15.
37. ### NeuroCard Takeaways

NeuroCard = Join Sampler + Column Factorization + Autoregressive Model.

✓ Learns all correlations in a database
✓ Removes independence assumptions on joins
✓ Strategies to make it more practical

Can't make it live due to time zone; let's chat offline: [email protected], @zongheng_yang. github.com/neurocard · bit.ly/neurocard-pdf