Zongheng Yang
August 11, 2021

# NeuroCard: One Cardinality Estimator for All Tables

VLDB 2021
Paper: https://bit.ly/neurocard-pdf
Talk video: https://youtu.be/O2VoBHSyEpw
Code: https://github.com/neurocard/neurocard/

Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries, due to making lossy modeling assumptions and not capturing inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions. We present NeuroCard, a join cardinality estimator that builds a single neural density estimator over an entire database. Leveraging join sampling and modern deep autoregressive models, NeuroCard makes no inter-table or inter-column independence assumptions in its probabilistic modeling. NeuroCard achieves orders of magnitude higher accuracy than the best prior methods (a new state-of-the-art result of 8.5x maximum error on JOB-light), scales to dozens of tables, while being compact in space (several MBs) and efficient to construct or update (seconds to minutes).

Presenter: Zongheng Yang (https://zongheng.me), @zongheng_yang.


## Transcript

1. NeuroCard: One Cardinality Estimator for All Tables
Zongheng Yang, Amog Kamsetty*, Frank Sifei Luan*, Eric Liang, Yan DuanΔ, Xi ChenΔ, Ion Stoica
github.com/neurocard
bit.ly/neurocard-pdf
VLDB 2021

2. Why is cardinality estimation hard
Must learn data correlations
across columns
across tables (joins)

3. Why is cardinality estimation hard
Must learn data correlations
across columns
across tables (joins)
Probability(salary in [5k,15k], age in [20,28])?

4. Why is cardinality estimation hard
Must learn data correlations
across columns
across tables (joins)
In T1⋈T2⋈T3, Probability(T1.salary in [5k,15k], T2.age in [20,28])?

5. Heuristics lose information
Independence, Uniformity

6. Heuristics lose information
Independence: 𝘱(T1.salary, T2.age) ≈ 𝘱(T1.salary) × 𝘱(T2.age)
Uniformity: “Values in each bin are uniform”
Heuristics learn fewer correlations, making cardinality estimates inaccurate.
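The cost of the independence heuristic is easy to see on toy data. The sketch below (illustrative only; the table and values are made up, not from the paper) computes the true joint selectivity of the slide's example predicate and compares it with the independence estimate 𝘱(salary) × 𝘱(age):

```python
# Toy data: salary and age are strongly correlated.
rows = [(5_000, 22), (8_000, 24), (12_000, 27), (90_000, 55), (95_000, 60)]
n = len(rows)

def sel_salary(lo, hi):
    # Fraction of rows with salary in [lo, hi].
    return sum(lo <= s <= hi for s, _ in rows) / n

def sel_age(lo, hi):
    # Fraction of rows with age in [lo, hi].
    return sum(lo <= a <= hi for _, a in rows) / n

def sel_joint(s_lo, s_hi, a_lo, a_hi):
    # Fraction of rows satisfying both predicates (the true selectivity).
    return sum(s_lo <= s <= s_hi and a_lo <= a <= a_hi for s, a in rows) / n

true_sel = sel_joint(5_000, 15_000, 20, 28)                # 3/5 = 0.6
indep_sel = sel_salary(5_000, 15_000) * sel_age(20, 28)    # 0.6 * 0.6 = 0.36
```

Because young, low-salary rows co-occur, the independence estimate is too low; on real schemas such errors compound multiplicatively across joins.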

7. NeuroCard
Learn all correlations over all tables, without heuristics

8. NeuroCard
Learn all correlations over all tables, without heuristics
P(all tables), the distribution of all columns from all tables (full outer join), learned in a single deep autoregressive model
[Diagram: tables feed NeuroCard, which answers cardinality queries on joined and non-joined tables (e.g., 50 and 1000)]

9. NeuroCard
Learn all correlations over all tables, without heuristics
P(all tables), the distribution of all columns from all tables (full outer join), learned in a single deep autoregressive model
State-of-the-art: learns complex distributions w/o heuristics

10. NeuroCard
Learn all correlations over all tables, without heuristics
P(all tables), the distribution of all columns from all tables (full outer join), learned in a single deep autoregressive model
Supports any join query within the schema

11. Training NeuroCard
Tables & Join Schema

12. Training NeuroCard
Tables & Join Schema → tuples → Probabilistic Model (deep autoreg.): pθ(all columns from all tables)

13. Training NeuroCard
Tables & Join Schema → Compute full join? → tuples → Probabilistic Model (deep autoreg.): pθ(all columns from all tables)

14. Training NeuroCard
Tables & Join Schema → Compute full join? → tuples → Probabilistic Model (deep autoreg.): pθ(all columns from all tables)
Impractical (e.g., 6 IMDB tables join to 2,000,000,000,000 rows)

15. Training NeuroCard
Tables & Join Schema → Unbiased Join Sampler (sample ~ full join of all tables) → sampled tuples → Probabilistic Model (deep autoreg.): pθ(all columns from all tables)

16. Unbiased join sampling
To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
First method to point this out—prior use of IBJS learns biased distributions

17. Unbiased join sampling
To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
First method to point this out—prior use of IBJS learns biased distributions
Solution: a linear-time DP algorithm that precomputes correct sampling weights for all keys

18. Unbiased join sampling
To learn 𝘱(J=full join) correctly, sample ~ J uniformly i.i.d.
First method to point this out—prior use of IBJS learns biased distributions
Solution: a linear-time DP algorithm that precomputes correct sampling weights for all keys
[Diagram: per-table samplers over the join schema, embarrassingly parallel; sample() emits batches of tuples 〈T1.salary, T1.dept, …, Tn.age〉 from J on-the-fly]
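To make the weighting idea concrete, here is a minimal sketch for the special case of a single two-table equi-join (hypothetical tables and names; the paper's algorithm generalizes this to arbitrary join trees with a linear-time DP and handles full-outer-join semantics). Weighting each T1 tuple by its fan-out into T2, then picking a matching T2 row uniformly, draws every joined tuple with equal probability:

```python
import random
from collections import defaultdict

T1 = [("alice", "eng"), ("bob", "eng"), ("carol", "hr")]   # (name, dept)
T2 = [("eng", 5_000), ("eng", 12_000), ("hr", 8_000)]      # (dept, salary)

# Precompute join partners per key (the "sampling weights" step).
by_dept = defaultdict(list)
for dept, salary in T2:
    by_dept[dept].append(salary)

# Fan-out of each T1 tuple = number of matching T2 rows.
weights = [len(by_dept[dept]) for _, dept in T1]  # [2, 2, 1]

def sample_join_tuple(rng):
    # T1 tuple ~ fan-out, then a uniform match: each of the
    # sum(weights) joined tuples has probability 1 / sum(weights).
    (name, dept), = rng.choices(T1, weights=weights, k=1)
    return (name, dept, rng.choice(by_dept[dept]))
```

A sampler that ignored the weights would pick each T1 tuple with probability 1/3, over-representing carol's single joined row; that is the kind of bias the slide attributes to prior IBJS-based training.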

19. Autoregressive model
Sampled tuple (input): 〈x1=T1.salary, x2=T1.dept, …, xn=Tn.age〉

20. Autoregressive model
Sampled tuple (input): 〈x1=T1.salary, x2=T1.dept, …, xn=Tn.age〉
Model output: P𝜃(Xi | x<i) for all i

21. Autoregressive model
Sampled tuple (input): 〈x1=T1.salary, x2=T1.dept, …, xn=Tn.age〉
Model output: P𝜃(Xi | x<i) for all i
No independence assumptions: 𝘱(X1,…,Xn) = ∏i 𝘱(Xi | X<i)
Architectures: ResMADE, Transformer, WaveNet, PixelCNN, …
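The chain-rule factorization on this slide can be checked exactly on a tiny table. In the sketch below (made-up data; NeuroCard replaces the counting with a learned network P𝜃, but the factorization is identical), the joint probability of a tuple is the product of per-column conditionals:

```python
# Tiny two-column "table": (dept, salary).
rows = [("eng", 5_000), ("eng", 12_000), ("eng", 12_000), ("hr", 8_000)]
n = len(rows)

def p_x1(dept):
    # p(X1 = dept): first factor of the autoregressive product.
    return sum(d == dept for d, _ in rows) / n

def p_x2_given_x1(salary, dept):
    # p(X2 = salary | X1 = dept): second factor, conditioned on the prefix.
    match = [s for d, s in rows if d == dept]
    return sum(s == salary for s in match) / len(match)

# p(eng, 12000) = p(eng) * p(12000 | eng) = 3/4 * 2/3 = 1/2,
# which equals the tuple's empirical frequency 2/4.
joint = p_x1("eng") * p_x2_given_x1(12_000, "eng")
```

No independence assumption appears anywhere: the second factor is allowed to depend arbitrarily on the first column's value.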

22. Column factorization
Model size blown up by high-cardinality columns
[Figure: a column with domain 10^6; in binary, 1,000,000 is 1111010000 1001000000 and 1 is 0000000000 0000000001; chunking every N=10 bits gives sub-columns with domain ≤ 2^N: Subcol 1 = 976 / 0, Subcol 2 = 576 / 1]

23. Column factorization
Model size blown up by high-cardinality columns
If using 32-dim embeddings, needs ~128 MB of parameters.
[Figure: the same 10-bit chunking example]

24. Column factorization
Model size blown up by high-cardinality columns
Idea: factorize into sub-columns
[Figure: the same 10-bit chunking example]

26. Column factorization
Model size blown up by high-cardinality columns
Idea: factorize into sub-columns → embed much smaller “vocabularies”
[Figure: the same 10-bit chunking example]

27. Column factorization
Model size blown up by high-cardinality columns
Idea: factorize into sub-columns → embed much smaller “vocabularies”
Massive space savings with minimal loss of statistical efficiency (e.g., 128 MB → 250 KB, or 500x)
[Figure: the same 10-bit chunking example]
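The slide's 10-bit chunking is simple to reproduce; a minimal sketch (function names are mine, not the codebase's):

```python
N = 10  # bits per sub-column, as on the slide

def factorize(value, num_subcols=2):
    """Split a value's binary encoding into N-bit chunks, MSB chunk first."""
    mask = (1 << N) - 1
    return [(value >> (i * N)) & mask for i in reversed(range(num_subcols))]

def defactorize(subs):
    """Inverse: reassemble the original value from its sub-columns."""
    value = 0
    for s in subs:
        value = (value << N) | s
    return value

factorize(1_000_000)  # [976, 576], matching the slide
factorize(1)          # [0, 1]
```

On the space saving: with 32-dim float32 embeddings, a 10^6-entry vocabulary costs about 10^6 × 32 × 4 B ≈ 128 MB, while two 2^10-entry vocabularies cost about 2 × 1024 × 32 × 4 B ≈ 256 KB, in line with the slide's 128 MB → 250 KB figure.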

28. Querying cardinalities
Key: “schema subsetting”
Full join on N tables, query on M < N tables. Details in paper.
Use model outputs to compute cardinalities
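Once the model yields a probability for the query's predicates, the estimate itself is a single multiplication. A schematic sketch (names and numbers are illustrative; how NeuroCard integrates P𝜃 over range predicates and performs schema subsetting is described in the paper):

```python
FULL_JOIN_SIZE = 2_000_000_000_000  # e.g., the 6-table IMDB full outer join

def estimate_cardinality(predicate_prob):
    # Cardinality = (probability mass the model assigns to the
    # query's predicates) * (row count of the full join).
    return predicate_prob * FULL_JOIN_SIZE

estimate_cardinality(2.5e-11)  # even a tiny selectivity maps to 50 rows
```

This is why an accurate density model over the full join directly yields join cardinalities without per-join statistics.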

29. Eval: Accuracy
JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins
[Chart: Q-Error, log scale 10^0 to 10^3, for Postgres, Supervised (MSCN), Unsupervised (DeepDB), NeuroCard]

31. Eval: Accuracy
JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins
JOB-M: 16 tables; |Join| = 10 trillion; complex filters, multi-key joins
[Charts: Q-Error on JOB-light, log scale 10^0 to 10^3, for Postgres, Supervised (MSCN), Unsupervised (DeepDB), NeuroCard; Q-Error on JOB-M, log scale 10^0 to 10^6, for Join Sampling (IBJS), Postgres, NeuroCard]

32. Eval: Accuracy
JOB-light: 6 tables; |Join| = 2 trillion; simple filters, single-key joins
JOB-M: 16 tables; |Join| = 10 trillion; complex filters, multi-key joins
Outperforms best prior methods by 4x–30x (see paper for more results, ablations)
[Charts: Q-Error on JOB-light, log scale 10^0 to 10^3, for Postgres, Supervised (MSCN), Unsupervised (DeepDB), NeuroCard; Q-Error on JOB-M, log scale 10^0 to 10^6, for Join Sampling (IBJS), Postgres, NeuroCard]

33. Eval: Efficiency
Training on a few million tuples is sufficient for high accuracy (0.001% of the join)
Training completes in a few minutes (with a GPU)
[Charts: p99 Q-error vs. tuples trained (1M to 7M) on JOB-light and JOB-light-ranges; training runtime (0 to 2000 sec) for MSCN, DeepDB, NeuroCard on JOB-light-ranges and JOB-light, annotated 3min, 24min, 5min, 38min, 7min, 3min]

34. Eval: Data updates
New data partitions arrive by time
Fast update: take 1% of total SGD steps
[Charts: p95 and p50 Q-error vs. number of ingested partitions (1 to 5) for Fast update (~3 sec), Retrain (~3 min), Stale, and Postgres re-ANALYZE]

35. Eval: Data updates
New data partitions arrive by time
Fast update: take 1% of total SGD steps
Fast update recovers accuracy in seconds; approaches retraining
[Charts: p95 and p50 Q-error vs. number of ingested partitions (1 to 5) for Fast update (~3 sec), Retrain (~3 min), Stale, and Postgres re-ANALYZE]

36. Eval: Real execution
Compare plans from a simple cost model [1], with Postgres cardinalities vs. NeuroCard cardinalities
JOB-light-ranges (~1K queries): 1.9x speedup (68 mins)
[Chart: total runtime in minutes, 0 to 150, for Postgres card. vs. NeuroCard]
[1] How Good Are Query Optimizers, Really? Leis et al., VLDB’15.

37. NeuroCard Takeaways
github.com/neurocard
bit.ly/neurocard-pdf
Learns all correlations in a database
Removes independence assumptions on joins
Strategies to make it more practical: Join Sampler, Column Factorization, Autoreg. Model
Can’t make it live due to time zone - let’s chat offline:
[email protected]
@zongheng_yang