PostgreSQL：行数推定を読み解く/row-estimation

行数推定を読み解く @Kyabatalian

はじめまして @Kyabatalian フロントエンドっぽい貧弱エンジニア PostgreSQL歴 2ヶ月【重要】間違ったことを話す可能性が高いです。

先月あれ、ここヒント句使っとるやんけ。何とかして！！ヒントク……？（お任せください！！！！！１）開発のえらいひと

解決すべき問題

こんなふうに考えていた時期がぼくにもありましたよくわからんけど、サンプル数？を上げればいいらしい！いけるで！

サンプル数を上げてみる

サンプル数を上げてみる実行時間[ms]

サンプル数を上げてみる実行時間[ms] っょぃ。

実行計画をみてみる

実行計画をみてみる Sort (cost=582.43..582.44 rows=1 width=312) (actual time=3112.054..3113.061 rows=5579 loops=1) Sort
Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms

Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms ？

Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1)

Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms

Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text))

行数推定外れすぎ問題なんで？

行数推定を読み解く

実行計画とは何か SQLって書いたとおりに実行されるんじゃないんですか？

実行計画とは何か(1/4) ユーザから与えられたクエリを効率よく実行するためにが内部的に最適化する。

実行計画とは何か(1/4) ユーザから与えられたクエリを効率よく実行するためにが内部的に最適化する。？ !?

実行計画とは何か(2/4) クエリとはユーザからに対する「依頼」を記述したもの。

実行計画とは何か(3/4) は、ユーザが記述したクエリをリレーショナルモデルにおける「演算」に変換する。

実行計画とは何か(4/4) 実行計画とは、統計情報をもとに最もコストの低い演算子と順序を選択したもの。

行数推定の精度が悪いとプラン演算子の選択を誤る行数少なそう。ここはSeqScanやな。行数多かった！！！！ IndexScanのほうが速かった！

行数推定の精度が悪いと実行順序の選択を誤る行数少なそう。ネストさせとこ。行数多かった！！！！コストめっちゃ高い！！！

PostgreSQLが収集する統計情報※1 • DB単位のコミット数やロールバック数 • DBやテーブル、インデックス単位のI/O発生状況 • DBやテーブルに対して挿入/更新/削除された行数 • テーブル単位で実施された表スキャン回数 •
DBやテーブル、インデックス単位でスキャンされた行数 • テーブル単位の行数、ガベージ量 • 現在実施中のSQLやメンテナンス処理 • カラム単位の度数分布と最頻値・最頻度 … ※1 稼動統計情報を活用しよう (1) : http://lets.postgresql.jp/documents/technical/statistics/1

PostgreSQLが収集する統計情報 • DB単位のコミット数やロールバック数 • DBやテーブル、インデックス単位のI/O発生状況 • DBやテーブルに対して挿入/更新/削除された行数 • テーブル単位で実施された表スキャン回数 •
DBやテーブル、インデックス単位でスキャンされた行数 • テーブル単位の行数、ガベージ量 • 現在実施中のSQLやメンテナンス処理 • カラム単位の度数分布と最頻値・最頻度 … 行数推定で使う統計情報はこれだけ

テーブル単位の行数 • pg_class の reltuples カラムに保持される ※1 。 • VACUUM,
ANALYZE, CREATE INDEX 等で更新される。 relname reltuples user_table 1424335 SELECT relname, reltuples FROM pg_class WHERE relname = ‘user_table’; ※1 VACUUM, ANALYZE等ではテーブル全体を操作しないため、概算値が保持される。

カラム単位の最頻値・最頻度と度数分布 • pg_statistic に Array型で格納される。※1 • ANALYZE, VACUUM ANALYZE で更新される。
※1 Arrayの最大長（＝保持する最大数）をカラムごとまたはグローバルに設定できる（ default:100）。 ※2 pg_statistic はシステムカタログであり読みづらいため、参照する際は pg_stats ビューを利用する。 most_common_vals most_common_freqs histogram_bounds {yontendo, gaagle, yanoo,... {0.1349666714668274,0.130199998 61717224,0.08696666359901428,... {twitto, bacefook, linkedout, …. SELECT most_common_vals, most_common_freqs, histogram_bounds FROM pg_stats ※2 WHERE tablename = 'user_table' and attname = 'company_id';

行数推定の考え方「テーブル全体の行数の推定値」に「全体に占める対象データの割合の推定値」を　掛け合わせたものを推定行数とする。

行数推定の方法推定行数 = 濃度 * 選択度

行数推定の方法推定行数 = 濃度 * 選択度 pg_class.reltuples ？

選択度決定の方法は条件句によって異なる最頻値に含まれない最頻値に含まれる一致条件範囲条件複合条件条件なし

行数推定のアルゴリズム：条件なし SELECT * FROM user_table; Seq Scan on user_table (cost=0.00..84330.35
rows=1424335 width=355) (actual time=0.041..749.988 rows=1424450 loops=1) 1. 選択度は1となる。 2. user_tableの濃度を推定行数とする。

行数推定のアルゴリズム：範囲条件 ※1 SELECT * FROM user_table WHERE crttimestamp < ‘2015-10-10
10:10:10’; ※1 本当はもう少し複雑。 https://www.postgresql.jp/document/9.5/html/row-estimation-examples.html ※2 pg_operator の oprrest カラムに保持されている。今回は scalarltsel関数。 1. 選択度関数を決定する※2 。 2. pg_statistic から user_id の度数分布を取り出す。 3. 「<」で占められる度数分布率を選択度とする。 4. user_tableの濃度と選択度の積を推定行数とする。

column < c で占められる度数分布率 frequency = (1 + (c -
bucket[ i c ].min) / (bucket[ i c ].max - bucket[ i c ].min)) / bucket.length 0 86 16 32 48 64 70 bucket[1] bucket[2] bucket[2].min bucket[2].min 例：対象の値 c = 20 、度数分布が右図の場合、度数分布率は以下で計算される。 frequency = (1 + (20 - 16) / (32 -16)) / 6 ≒ 0.21 ※各バケット内での線形分布を仮定。

行数推定のアルゴリズム：範囲条件 SELECT * FROM user_table WHERE crttimestamp < ‘2015-10-10 10:10:10’;
Seq Scan on user_table (cost=0.00..87891.19 rows=508065 width=355) (actual time=0.015..266.237 rows=503049 loops=1) Filter: (crttimestamp < '2015-10-10 10:10:10+09'::timestamp with time zone) Rows Removed by Filter: 921401

行数推定のアルゴリズム：一致条件 SELECT * FROM user_table WHERE company_id = ‘sansan’; 1.
選択度関数を決定する※1 。 2. “sansan” という値が最頻値に含まれるか調べる。 3. 以下により選択度を決定する。 a. 最頻値に含まれる場合　：最頻度を選択度とする。 b. 最頻値に含まれない場合：がんばってしぼりだす。 4. user_tableの濃度と選択度の積を推定行数とする。 ※1 今回は eqsel関数。

最頻値に含まれない値の選択度 selectivity = (1 - sum(mvf)) / (num_distinct - num_mcv)
例：対象のカラムのとりうる値が 100通りで、MCV・MCFが　　右表の場合、”sansan” の選択度は以下で計算される。 selectivity = (1 - (0.32 + 0.18 + 0.09 + 0.04 + 0.01)) / (100 - 5) ≒ 0.0038 ※MCV以外の列の一様分布を仮定。 MCV MCF yontendo 0.32 yanoo 0.18 gaagle 0.09 hitaz 0.04 basefook 0.01

行数推定のアルゴリズム：一致条件 SELECT * FROM user_table WHERE company_id = ‘sansan’; Bitmap
Heap Scan on user_table (cost=2836.15..77415.72 rows=108867 width=355) (actual time=19.782..68.394 rows=109436 loops=1) Index Cond: ((company_id)::text = 'sansan'::text)

行数推定のアルゴリズム：複合条件 SELECT * FROM user_table WHERE company_id = ‘sansan’ AND
user_id = ‘kabata’ 1. 各条件の選択度を計算する。 2. user_tableの濃度、各条件の選択度の積を推定行数とする。 ※各条件は独立していると仮定。

行数推定のアルゴリズム：複合条件 SELECT * FROM user_table WHERE company_id = ‘sansan’ AND
user_id = ‘kabata’ Index Scan using user_table_pkey on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text))

行数推定のアルゴリズム：まとめ選択度（selectivity） • 条件なし： 1 • 範囲条件： (1
+ (c - b[ i c ].min) / (b[ i c ].max - b[ i c ].min) / b.length • 一致条件 ◦ 最頻値に含まれる　： mcf c ◦ 最頻値に含まれない： (1 - sum(mcf)) / (num_distinct - num_mcv) • 複合条件： selectivity(a) * selectivity(b) 推定行数 rows = rel_cardinarity * selectivity

問題の原因

実行計画をみてみる（再掲） Sort (cost=582.43..582.44 rows=1 width=312) (actual time=3112.054..3113.061 rows=5579 loops=1) Sort
Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text))

実行計画をみてみる（再掲） Sort (cost=582.43..582.44 rows=1 width=312) (actual time=3112.054..3113.061 rows=5579 loops=1) Sort
Key: [略] Sort Method: quicksort Memory: 1674kB -> Nested Loop (cost=2.59..582.42 rows=1 width=312) (actual time=548.120..3088.081 rows=5579 loops=1) -> Nested Loop (cost=2.16..581.58 rows=1 width=297) (actual time=548.032..3008.849 rows=5579 loops=1) -> Nested Loop (cost=1.73..573.12 rows=1 width=308) (actual time=548.020..2980.460 rows=5579 loops=1) Join Filter: [略] -> Nested Loop (cost=1.29..570.17 rows=1 width=256) (actual time=544.342..2929.895 rows=6286 loops=1) -> Nested Loop (cost=0.86..456.48 rows=85 width=245) (actual time=0.054..1044.752 rows=473464 loops=1) -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) -> Index Scan using idx_[略] (cost=0.43..446.91 rows=111 width=191) (actual time=0.015..4.347 rows=4117 loops=115) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..1.33 rows=1 width=22) (actual time=0.003..0.003 rows=0 loops=473464) Index Cond: [略] Filter: [略] Rows Removed by Filter: 1 -> Index Scan using [略] (cost=0.43..2.92 rows=1 width=84) (actual time=0.006..0.006 rows=1 loops=6286) Index Cond: [略] Filter: [略] Rows Removed by Filter: 0 -> Index Scan using [略] (cost=0.43..8.45 rows=1 width=11) (actual time=0.004..0.004 rows=1 loops=5579) Index Cond: [略] Filter: [略] -> Index Scan using [略] (cost=0.43..0.61 rows=1 width=42) (actual time=0.004..0.005 rows=1 loops=5579) Index Cond: [略] Filter: [略] Total runtime: 3114.997 ms -> Index Scan using idx_[略] on user_table (cost=0.43..8.46 rows=1 width=70) (actual time=0.031..0.139 rows=115 loops=1) Index Cond: (((company_id)::text = 'sansan'::text) AND ((user_id)::text = 'kabata'::text)) 一致条件 + 一致条件の複合条件

複合条件の行数推定

事象の独立性 P(A∩B) = P(A)P(B) であるとき事象Aと事象Bは独立である

条件の独立性 S(A∩B) = S(A)S(B) であるとき条件Aと条件Bは独立である ※S(X)は条件Sの選択度

独立性を検証してみる S((company_id = ‘sansan’)∩((user_id = ‘kabata’))) = S((company_id = ‘sansan’))S((user_id
= ‘kabata’)) 成り立つか？

各条件による実際の行数を取得する SELECT count(*) FROM user_table WHERE company_id = ‘sansan’ and
user_id = ‘kabata’; SELECT count(*) FROM user_table WHERE company_id = ‘sansan’; SELECT count(*) FROM user_table WHERE user_id = ‘kabata’;

各条件による実際の行数を取得する SELECT count(*) FROM user_table WHERE company_id = ‘sansan’ and
user_id = ‘kabata’; SELECT count(*) FROM user_table WHERE company_id = ‘sansan’; SELECT count(*) FROM user_table WHERE user_id = ‘kabata’; 322 109436 322

実際の選択度を計算する S(A∩B) = 322 / 1424450 ≒ 0.00023 S(A) =
109436 / 1424450 ≒ 0.071 S(B) = 322 / 1424450 ≒ 0.00023

独立性を検証する S(A∩B) ≒ 0.00023 S(A)S(B) ≒ 0.071 * 0.00023 ≒0.000017
S(A∩B) ≠ S(A)S(B)

問題の原因行数推定においてが想定する状態と実際の状態が乖離。

解決方法

3つの解決方法に従うを騙すに背く

複合条件をやめる部分関数従属に”限りなく近い”状態をやめたい。 user_id を一意にする。

ヒント句を使う※1 が作成する実行計画を無視し、手動で指定する。本質的な問題解決ではない。 ※1 pg_hint_plan : http://pgdbmsstats.osdn.jp/pg_dbms_stats-ja.html

統計情報の改竄両条件の関数従属性を仮定する。 MCF(user_id) = MCF(user_id) / MCF(company_id) 統計情報は VACUUM や
ANALYZE によって元に戻ってしまう。

を騙し続ける統計情報を改竄した状態で固定化する。 ※1 魔改造。 ※1 pg_dbms_stats : http://pgdbmsstats.osdn.jp/pg_dbms_stats-ja.html

結論に従うを騙すに背く

まとめ

まとめ(1/3) 行数推定のアルゴリズムは意外と単純。

まとめ(2/3) DB設計やデータ構造的な問題に帰結。

まとめ(3/3) 仕組みがわかると楽しい。

まとめ行数推定のアルゴリズムは意外と単純。 DB設計やデータの構造的な問題に帰結。仕組みがわかると楽しい。

補足

補足：pg_statistic からMCV/MCFを取得するSQL SELECT * FROM pg_statistic s JOIN pg_class c
ON (c.oid = s.starelid) JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum) LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) WHERE relname = ‘user_table’ AND attname = ‘user_id’ AND stakind1 = 1;

PostgreSQL：行数推定を読み解く/row-estimation

PostgreSQL：行数推定を読み解く/row-estimation

More Decks by kabaome

Other Decks in Technology

Featured

Transcript