Upgrade to Pro — share decks privately, control downloads, hide ads and more …

テストデータが偏るということについて

 テストデータが偏るということについて

2022/05/18
サイボウズ株式会社 開運研修の1コマ

yoku0825

May 31, 2022
Tweet

More Decks by yoku0825

Other Decks in Technology

Transcript

  1. \こんにちわ/ yoku0825@とある企業のDBA オラクらない ‐ ポスグらない ‐ マイエスキューエる ‐ 生息域 Twitter:

    @yoku0825 ‐ Blog: 日々の覚書 ‐ 日本MySQLユーザ会 ‐ MySQL Casual ‐ Siriusチームのスレッド ‐ 2/35
  2. オプティマイザの弱点 データの偏りを完全には把握していない InnoDBの統計情報は1インデックスあたりデフォルト20ページのサンプリングで作る そりゃまあ毎回インデックスを完全にスキャンして偏りを把握させるわけにはいかないだろうけれども ‐ 基本的にテーブル内に一定割合 / 一定行数の更新があった時にバックグラウンドスレッドで統計情報を更新す る 「ギリギリ一定割合

    / 一定行数に届かない大量更新」があった時が一番ズレる ‐ 演算は基本的に構ってくれない INT型で桁あふれが発生しない範囲なら、 id - 1 = 0 ⇔ id = 1 … ‐ DATETIME型で桁あふれが発生しない範囲なら、 ORDER BY created_at と ORDER BY created_at + INTERVAL 1 DAY は同じ順番になるはず… ‐ 6/35
  3. データが偏る #とは あるインデックスに紐づくレコードは均一に分散するとは限らない ex. SELECT COUNT(*) FROM 住民 WHERE 都道府県

    = ? ‐ ある偏りについてオプティマイザが同じ実行計画を選ぶとは限らない ex. SELECT * FROM 住民 WHERE 都道府県 = '東京' ORDER BY birthday DESC LIMIT 10 万 vs. SELECT * FROM 住民 WHERE 都道府県 = '鳥取' ORDER BY birthday DESC LIMIT 10万 たぶん東京はORDER BYを狙った方が速くて、鳥取はWHEREを狙った方が速い(2020年の鳥取の人口は全国の0.4%らしい) 都道府県の人口一覧 - Wikipedia ‐ ex. SELECT * FROM 住民 WHERE 都道府県 = '東京' ORDER BY birthday DESC LIMIT 10 vs. SELECT * FROM 住民 WHERE 都道府県 = '東京' ORDER BY birthday DESC LIMIT 1億 LIMIT 1億だと日本総人口にかなり近いので全件フェッチの実行計画を選ぶが、LIMITの早抜けが効かないのでORDER BY狙いの キーは高速化に寄与できない ‐ だが、偏りに気が付かずにオプティマイザが実行計画を選んでしまうことはある 9/35
  4. 偏ったデータ まあシンプルなやつ mysql> SHOW CREATE TABLE post\G *************************** 1. row

    *************************** Table: post Create Table: CREATE TABLE `post` ( `post_id` int NOT NULL, `user_id` int NOT NULL, `post_text` text COLLATE utf8mb4_ja_0900_as_cs NOT NULL, `registered` datetime NOT NULL, PRIMARY KEY (`post_id`), KEY `user_id` (`user_id`), KEY `registered` (`registered`), CONSTRAINT `post_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `user` (`user_id`) ON DELETE CASCADE ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_ja_0900_as_cs 1 row in set (0.01 sec) mysql> SELECT user_id, COUNT(*) FROM post GROUP BY user_id; +---------+----------+ | user_id | COUNT(*) | +---------+----------+ | 1 | 100001 | | 2 | 1 | +---------+----------+ 2 rows in set (0.04 sec) 12/35
  5. 偏ったデータ +---------+----------+ | user_id | COUNT(*) | +---------+----------+ | 1

    | 100001 | | 2 | 1 | +---------+----------+ mysql> EXPLAIN DELETE FROM post WHERE user_id = 2; +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+- ---------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+- ---------+-------------+ | 1 | DELETE | post | NULL | range | user_id | user_id | 4 | const | 1 | 100.00 | Using where | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+- ---------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> DELETE FROM post WHERE user_id = 2; 13/35
  6. 偏ったデータ +---------+----------+ | user_id | COUNT(*) | +---------+----------+ | 1

    | 100001 | | 2 | 1 | +---------+----------+ mysql> EXPLAIN DELETE FROM post WHERE user_id = 1; +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----- -----+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----- -----+-------------+ | 1 | DELETE | post | NULL | ALL | user_id | NULL | NULL | NULL | 99875 | 100.00 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----- -----+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> DELETE FROM post WHERE user_id = 1; 14/35
  7. 偏ったデータ対策 「あり得る最大の行」「あり得る最小の行」「あり得ない行」の3パターンでテストするとまあまあ検出 できるはず SELECT COUNT(*) FROM post WHERE user_id =

    ? AND registered BETWEEN ? AND ? + INTERVAL 1 MONTH; SELECT * FROM post WHERE user_id = ? AND registered BETWEEN ? AND ? + INTERVAL 1 MONTH O RDER BY registered LIMIT 100 FOR UPDATE; 16/35
  8. データの偏りを調べる user_idだけで見ると mysql> SELECT user_id, COUNT(*) FROM post GROUP BY

    user_id; +---------+----------+ | user_id | COUNT(*) | +---------+----------+ | 1 | 100001 | | 2 | 1 | +---------+----------+ 2 rows in set (0.03 sec) 17/35
  9. データの偏りを調べる registeredは元のクエリが1か月単位で切りそうだったから年月に丸めてみた もとのクエリが INTERVAL 1 MONTH だから、 2021/1/30 ~ 2021/2/28

    みたいなまたいだものも本来 はあるけれども ‐ mysql> SELECT DATE_FORMAT(registered, '%Y%m') AS yyyymm, COUNT(*) FROM post GROUP BY yyyymm ORDER BY COUNT(*) ASC LI MIT 3; +--------+----------+ | yyyymm | COUNT(*) | +--------+----------+ | 200302 | 28 | | 200102 | 28 | | 200202 | 28 | +--------+----------+ 3 rows in set (0.72 sec) mysql> SELECT DATE_FORMAT(registered, '%Y%m') AS yyyymm, COUNT(*) FROM post GROUP BY yyyymm ORDER BY COUNT(*) DESC LIMIT 3; +--------+----------+ | yyyymm | COUNT(*) | +--------+----------+ | 202205 | 91847 | | 200005 | 31 | | 200003 | 31 | +--------+----------+ 3 rows in set (0.70 sec) 18/35
  10. データの偏りによる違い 最大の結果セットになりそうな user_id = 1 && regstered = 2022/05/x のパターン

    mysql> EXPLAIN SELECT COUNT(*) FROM post WHERE user_id = 1 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 M ONTH; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+--------- ----+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+--------- ----+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 49937 | 50.00 | Using where | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+--------- ----+ mysql> EXPLAIN SELECT * FROM post WHERE user_id = 1 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 MONTH OR DER BY registered LIMIT 100 FOR UPDATE; +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+------ ------------------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+------ ------------------------------+ | 1 | SIMPLE | post | NULL | range | user_id,registered | registered | 5 | NULL | 49937 | 50.00 | Using index condition; Using where | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+------ ------------------------------+ 20/35
  11. データの偏りによる違い 最小の結果セットになりそうな user_id = 2 && regstered = 2003/02/x のパターン

    mysql> EXPLAIN SELECT COUNT(*) FROM post WHERE user_id = 2 AND registered BETWEEN '2003-02-01' AND '2003-02-01' + INTERVAL 1 MONTH; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 1 | 5.00 | Using where | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> EXPLAIN SELECT * FROM post WHERE user_id = 2 AND registered BETWEEN '2003-02-01' AND '2003-02-01' + INTERVAL 1 MONTH ORDER BY re gistered LIMIT 100 FOR UPDATE; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 1 | 5.00 | Using where; Using filesort | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ 1 row in set, 1 warning (0.00 sec) 21/35
  12. データの偏りによる違い user_id = 3 による空振り mysql> EXPLAIN SELECT COUNT(*) FROM

    post WHERE user_id = 3 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 MONTH; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 1 | 50.00 | Using where | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> EXPLAIN SELECT * FROM post WHERE user_id = 3 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 MONTH ORDER BY re gistered LIMIT 100 FOR UPDATE; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 1 | 50.00 | Using where; Using filesort | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+------+----------+-------------------- ---------+ 1 row in set, 1 warning (0.00 sec) 23/35
  13. データの偏りによる違い registered = 1999/01/x による空振り mysql> EXPLAIN SELECT COUNT(*) FROM

    post WHERE user_id = 1 AND registered BETWEEN '1999-01-01' AND '1999-01-01' + INTERVAL 1 MONTH; +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ | 1 | SIMPLE | post | NULL | range | user_id,registered | registered | 5 | NULL | 1 | 50.00 | Using index condition; Using where | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ 1 row in set, 1 warning (0.00 sec) mysql> EXPLAIN SELECT * FROM post WHERE user_id = 1 AND registered BETWEEN '1999-01-01' AND '1999-01-01' + INTERVAL 1 MONTH ORDER BY re gistered LIMIT 100 FOR UPDATE; +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ | 1 | SIMPLE | post | NULL | range | user_id,registered | registered | 5 | NULL | 1 | 50.00 | Using index condition; Using where | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+------+----------+----------------- -------------------+ 1 row in set, 1 warning (0.00 sec) 24/35
  14. 偏ったように見えないデータも mysql> SHOW CREATE TABLE user\G *************************** 1. row ***************************

    Table: user Create Table: CREATE TABLE `user` ( `user_id` int NOT NULL, `unique_identifier` varchar(32) COLLATE utf8mb4_ja_0900_as_cs NOT NULL, `registered` datetime NOT NULL, PRIMARY KEY (`user_id`), UNIQUE KEY `unique_identifier` (`unique_identifier`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_ja_0900_as_cs 1 row in set (0.01 sec) mysql> SELECT * FROM user; +---------+-------------------+---------------------+ | user_id | unique_identifier | registered | +---------+-------------------+---------------------+ | 1 | yoku0825 | 2022-05-06 23:51:13 | | 2 | yoku0826 | 2022-05-06 23:51:19 | +---------+-------------------+---------------------+ 2 rows in set (0.00 sec) 28/35
  15. 偏ったように見えないデータも mysql> EXPLAIN DELETE FROM user WHERE user_id = 1;

    +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ | 1 | DELETE | user | NULL | range | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using where | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> DELETE FROM user WHERE user_id = 1; Query OK, 1 row affected (0.80 sec) mysql> EXPLAIN DELETE FROM user WHERE user_id = 2; +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ | 1 | DELETE | user | NULL | range | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using where | +----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> DELETE FROM user WHERE user_id = 2; Query OK, 1 row affected (0.00 sec) 29/35
  16. 偏ったように見えないデータ CASCADE や SET NULL な外部キー制約 FKは基本的に1対多で動くので、「見えない子側の偏り」が発生しがち ‐ UPDATE Trigger,

    DELETE Trigger 1対1なら問題ないけど、1対多(トリガーがあるのが1側)だと外部キー制約と同じようなパターンがある ‐ 多対1(トリガーがあるのが多側)だとこの問題には当たらないけれど、1側にロックが集中するのでまた別の問題は ある ‐ 30/35
  17. ちょっと別口の偏り SELECT COUNT(*) FROM post WHERE user_id = ? AND

    registered BETWEEN ? AND ? + INTERVAL 1 MONTH; SELECT * FROM post WHERE user_id = ? AND registered BETWEEN ? AND ? + INTERVAL 1 MONTH O RDER BY registered LIMIT ? FOR UPDATE; DELETE FROM post WHERE post_id IN (?, ?, .., ?); 31/35
  18. ちょっと別口の偏り(1) LIMITの値が 100 → 10000 で実行計画は変わることがある mysql> EXPLAIN SELECT *

    FROM post WHERE user_id = 1 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 MONTH ORDER BY registere d LIMIT 100 FOR UPDATE; +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+--------------------- ---------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+--------------------- ---------------+ | 1 | SIMPLE | post | NULL | range | user_id,registered | registered | 5 | NULL | 49938 | 50.00 | Using index condition; Using where | +----+-------------+-------+------------+-------+--------------------+------------+---------+------+-------+----------+--------------------- ---------------+ 1 row in set, 1 warning (0.00 sec) mysql> EXPLAIN SELECT * FROM post WHERE user_id = 1 AND registered BETWEEN '2022-05-01' AND '2022-05-01' + INTERVAL 1 MONTH ORDER BY registere d LIMIT 10000 FOR UPDATE; +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+------------------------ -----+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+------------------------ -----+ | 1 | SIMPLE | post | NULL | ref | user_id,registered | user_id | 4 | const | 49938 | 50.00 | Using where; Using filesort | +----+-------------+-------+------------+------+--------------------+---------+---------+-------+-------+----------+------------------------ -----+ 1 row in set, 1 warning (0.00 sec) 32/35
  19. ちょっと別口の偏り(2) WHERE .. IN .. の要素の数で実行計画は変わることがある mysql> DELETE FROM post

    WHERE post_id IN (?, ?, ?, .., 15477); +----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | post | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 15747 | 100.00 | Using where | +----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.06 sec) mysql> DELETE FROM post WHERE post_id IN (?, ?, ?, .., 15478); +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | DELETE | post | NULL | ALL | NULL | NULL | NULL | NULL | 99877 | 100.00 | Using where | +----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 2 warnings (0.06 sec) | Warning | 3170 | Memory capacity of 8388608 bytes for 'range_optimizer_max_mem_size' exceeded. Range optimization was not done for this query. | 33/35