Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Low Latency Join Method for Distributed DBMS

Low Latency Join Method for Distributed DBMS

Technical Deep Dive Session by Mineharu "Hal" Takahara at Database Engineering Meetup #4 - Japan
https://scalar.connpass.com/event/330410/

YugabyteDB Japan

October 03, 2024
Tweet

More Decks by YugabyteDB Japan

Other Decks in Technology

Transcript

  1. 1 Yugabyte Confidential © 2024 All rights reserved. Low Latency

    Join Method for Distributed DBMS October 2, 2024 “Halˮ Takahara 1
  2. 2 Yugabyte Confidential © 2024 All rights reserved. Overview •

    Low Latency Workload • Basic Join Algorithms • Distributed System Challenges • Speed-up Techniques • Batched Nested Loops • Quick Experiment • Conclusion
  3. 3 Yugabyte Confidential © 2024 All rights reserved. Low Latency

    Workload • Use cases ◦ Web/Mobile App backend ◦ OLTP ◦ (Semi-)Streaming data processing • Characteristics ◦ Selective search conditions ◦ High concurrency • Requirements ◦ Quick response time ◦ Low system utilization (per query)
  4. 4 Yugabyte Confidential © 2024 All rights reserved. Basic Join

    Algorithms Nested Loops w/Index on inner Hash Sort-Merge w/Index on inner Lead time 0 O(S) (hash build) 0 or O(R log R) (sorting outer if not) Memory footprint 0 Hash table Sort buffer if not sorted Disk spills No Possible + partitioning overhead Possible while sorting # of table accesses 1 + R 1 + 1 1 + 1 (can stop on mismatch) Index access via join condition Yes No No R, S: the number of rows in the outer and the inner relation respectively
  5. 5 Yugabyte Confidential © 2024 All rights reserved. Basic Join

    Algorithms Nested Loops w/Index on inner Hash Sort-Merge w/Index on inner Lead time 0 😃 O(S) (hash build) 0 😃 or O(R log R) (sorting outer if not) Memory footprint 0 😃 Hash table 😩 Sort buffer if not sorted Disk spills No 😃 Possible + partitioning overhead 🥹 Possible while sorting # of table accesses 1 + R 🧐 1 + 1 😃 1 + 1 (can stop on mismatch) 😃 Index access via join condition Yes 😃 No No R, S: the number of rows in the outer and the inner relation respectively
  6. 6 Yugabyte Confidential © 2024 All rights reserved. Distributed System

    Challenges • Data move overhead ◦ Remote storage accesses ▪ Data transfer ▪ Per round trip ◦ Ensuring correct row paring ▪ Redistribute both relations on join key ▪ Broadcast either relation (Not preserved side(s) of outer-join) ▪ Send both relations to a single node ◦ Combine the results • Even more expensive with geo-distributed data • Nested Loops ◦ (Typically) Indexes lost after data move ◦ Join condition value varies in each loop → materialization may not be effective
  7. 7 Yugabyte Confidential © 2024 All rights reserved. Speed-up Techniques

    • Reduce data transfer volume ◦ Push down search conditions ▪ Join conditions can be also pushed down with Nested Loops ◦ Redistribute rows via (value (or hash) → node#) mapping table (“Subsetting Nested Loops with index”)[1] ◦ Build & send semi-join filter using join key values[2] • Reduce round trip overhead (Nested Loops) ◦ Batch inner relation accesses (“Batched Nested Loops”)[7] • Parallel processing[3],[4],[5] ◦ Scatter-gather overhead and system resource trade-off*1 *1 Unless hw assisted pipelining (e.g. SIMD, GPU, FPGA, etc.)
  8. 8 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 node-2 * Relation fragments that may reside in each node omitted (1) Use hash table to figure out destination(s) of each row
  9. 9 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 node-2 * Relation fragments that may reside in each node omitted (1) Use hash table to figure out destination(s) of each row
  10. 10 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 node-2 * Relation fragments that may reside in each node omitted (1) Use hash table to figure out destination(s) of each row
  11. 11 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 node-2 * Relation fragments that may reside in each node omitted (1) Use hash table to figure out destination(s) of each row
  12. 12 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 node-2 * Relation fragments that may reside in each node omitted (1) Use hash table to figure out destination(s) of each row
  13. 13 Yugabyte Confidential © 2024 All rights reserved. Subsetting Nested

    Loops[1] k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) h nodes 0 [ 0, 2 ] 1 [ 2 ] 2 [ 1 ] 3 [ 1, 2 ] hash → node# node-1 R ⨝ S on R.k = S.k node-0 * Relation fragments that may reside in each node omitted k v1, … 7 6 10 10 11 7 R′ (2) Send R′ to node-1 and join to S (1) Use hash table to figure out destination(s) of each row
  14. 14 Yugabyte Confidential © 2024 All rights reserved. Semi-join Filtering[2]

    k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) node-1 R ⨝ S on R.k = S.k node-0 * Relation fragments that may reside in each node omitted * Index on S.k optional
  15. 15 Yugabyte Confidential © 2024 All rights reserved. Semi-join Filtering[2]

    k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) node-1 R ⨝ S on R.k = S.k node-0 * Relation fragments that may reside in each node omitted * Index on S.k optional R′ (1) Remove the duplicates in join key and send them to node-1 k 1 5 6 7 10 11 12
  16. 16 Yugabyte Confidential © 2024 All rights reserved. Semi-join Filtering[2]

    k v1, … 7 5 6 10 12 10 5 11 7 1 R k v1, … 2 2 6 6 7 7 11 11 14 14 S (index on k) node-1 R ⨝ S on R.k = S.k node-0 * Relation fragments that may reside in each node omitted * Index on S.k optional S′ (2) Join R′ to S, send the results (S′) to node-0 and join to R R′ (1) Remove the duplicates in join key and send them to node-1 k 1 5 6 7 10 11 12 k v1, … 6 6 7 7 11 11
  17. 17 Yugabyte Confidential © 2024 All rights reserved. Batched Nested

    Loops Algorithm 1. Read a batch of rows (e.g. 1024 rows) from the outer relation and build a hash table. 2. Access the inner relation using IN-list index condition with values from the outer. 3. For each inner row: 1) Probe the hash table 2) Apply other join conditions if necessary 3) Put the row in sort buffer if sorted results needed. Else return it immediately 4. Sort the batch and return. 5. Repeat until outer rows are exhausted (or the consuming node stops pulling the results, e.g.: LIMIT)
  18. 18 Yugabyte Confidential © 2024 All rights reserved. Batched Nested

    Loops Example SELECT * FROM keyword AS k, movie_keyword AS mk WHERE k.keyword LIKE '%sequel%' AND k.id = mk.keyword_id; Nested Loop (actual time=168.118..192.641 rows=12951 loops=1) -> Index Scan using keyword_pkey on keyword k (actual time=131.931..131.959 rows=30 loops=1) Storage Filter: (keyword ~~ '%sequel%'::text) -> Index Scan using keyword_id_movie_keyword on movie_keyword mk (actual time=1.861..1.953 rows=432 loops=30) Index Cond: (keyword_id = k.id) YB Batched Nested Loop Join (actual time=177.346..185.968 rows=12951 loops=1) Join Filter: (k.id = mk.keyword_id) -> Index Scan using keyword_pkey on keyword k (actual time=131.768..131.779 rows=30 loops=1) Storage Filter: (keyword ~~ '%sequel%'::text) -> Index Scan using keyword_id_movie_keyword on movie_keyword mk (actual time=45.480..48.586 rows=12951 loops=1) Index Cond: (keyword_id = ANY (ARRAY[k.id, $1, $2, ..., $1023]))
  19. 19 Yugabyte Confidential © 2024 All rights reserved. Quick Experiment

    • System Configuration ◦ 3-node RF=3 cluster on Yugabyte Anywhere self-managed DBaaS*1 ◦ aws c5.2xlarge (8 cores, 16GB RAM) Intel x86 ◦ AlmaLinux 8.9 ◦ YugabyteDB version 2024.1.1.0-b137 (PostgreSQL 11-based)*2 ◦ Enhanced Postgres compatibility features enabled (“early access”) • Join Order Benchmark ◦ Introduced by “How good are query optimizers, really?”[6] ◦ Synthetic workload against IMDb (imdb.com) data set • Use query q3c and modified version with ORDER BY LIMIT • Enable one of the join methods one at a time via Postgres GUC flags *1 Fully-managed DBaaS also available (YugabyteDB Aeon) *2 PostgreSQL 15-based tech-preview version coming soon
  20. 22 Yugabyte Confidential © 2024 All rights reserved. Queries q3c:

    SELECT MIN(t.title) AS movie_title FROM keyword AS k, movie_info AS mi, movie_keyword AS mk, title AS t WHERE k.keyword LIKE '%sequel%' AND mi.info IN ('Sweden', 'Norway', 'Germany', 'Denmark', 'Swedish', 'Denish', 'Norwegian', 'German', 'USA', 'American') AND t.production_year > 1990 AND t.id = mi.movie_id AND t.id = mk.movie_id AND mk.movie_id = mi.movie_id AND k.id = mk.keyword_id; q3c-ol - modified with ORDER BY t.id (the join key) and LIMIT 1: SELECT t.title AS movie_title FROM …(the same as q3c)… ORDER BY t.id LIMIT 1;
  21. 23 Yugabyte Confidential © 2024 All rights reserved. Results *

    BNL XXXX: Batched Nested Loops with the batch size = 1024, 2048, 4096 * Hash joins are fully parallelized. Only the outer relation is split and parallelized in all other join methods. Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 q3c serial 29,146 30,638 14,957 2,271 2,164 1,971 q3c px workers=2 13,788 19,139 11,420 2,075 2,097 1,938 q3c-ol 28,645 6,774 3,814 1,049 1,199 1,199 q3c-ol px workers=2 13,317 3,254 544 11,852 11,314 18,262 Total execution time [ms]
  22. 24 Yugabyte Confidential © 2024 All rights reserved. Results *

    BNL XXXX: Batched Nested Loops with the batch size = 1024, 2048, 4096 * Hash joins are fully parallelized. Only the outer relation is split and parallelized in all other join methods. Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 q3c serial 29,146 30,638 14,957 2,271 2,164 1,971 q3c px workers=2 13,788 19,139 11,420 2,075 2,097 1,938 q3c-ol 28,645 6,774 3,814 1,049 1,199 1,199 q3c-ol px workers=2 13,317 3,254 544 11,852 11,314 18,262 Total execution time [ms]
  23. 25 Yugabyte Confidential © 2024 All rights reserved. Results *

    BNL XXXX: Batched Nested Loops with the batch size = 1024, 2048, 4096 * Hash joins are fully parallelized. Only the outer relation is split and parallelized in all other join methods. Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 q3c serial 29,146 30,638 14,957 2,271 2,164 1,971 q3c px workers=2 13,788 19,139 11,420 2,075 2,097 1,938 q3c-ol 28,645 6,774 3,814 1,049 1,199 1,199 q3c-ol px workers=2 13,317 3,254 544 11,852 11,314 18,262 Total execution time [ms] ???
  24. 26 Yugabyte Confidential © 2024 All rights reserved. Row Counts

    (q3c Serial Plans) Multipliers (*) denotes the number times the relation was scanned Execution time [ms] Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 29,146 30,638 14,957 2,271 2,164 1,971 Rows scanned Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 keyword 134,170 134,170 134,170 134,170 134,170 134,170 movie_info 14,835,720 14,835,720 36 * 7874 = 283464 34723 * 8 = 277784 66523 * 4 = 266092 119812 * 2 = 239624 movie_keyword 4,523,930 4,523,930 432 * 30 = 12960 12951 * 1 = 12951 12951 * 1 = 12951 12951 * 1 = 12951 title 2,528,312 2,528,312 1 * 12951 = 12951 988 * 13 = 12844 1821 * 7 = 12747 3116 * 4 = 12,464 Total 22,022,132 22,022,132 443,545 437,749 425,960 386,757 Rows returned from storage layer Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 keyword 30 30 30 30 30 30 movie_info 706,674 706,674 1 * 7874 = 7874 892 * 8 = 7136 1731 * 4 = 6924 3204 * 2 = 6408 movie_keyword 4,523,930 4,523,930 432 * 30 = 12960 12951 * 1 = 12951 12951 * 1 = 12951 12951 * 1 = 12951 title 1,749,032 1,748,431 1 * 12951= 12951 600 * 13 = 7800 1103 * 7 = 7721 1866 * 4 = 7464 Total 6,979,666 6,979,065 33,815 27,917 27,626 26,853
  25. 27 Yugabyte Confidential © 2024 All rights reserved. Row Counts

    (q3c-ol Serial Plans) Multipliers (*) denotes the number times the relation was scanned Execution time [ms] Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 28,645 6,774 3,814 1,049 1,199 1,199 Rows scanned Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 keyword 134,170 134,170 134170 * 1 = 134170 505 * 7 = 3535 958 * 4 = 3832 1774 * 2 = 3548 movie_info 14,835,720 41,089 10 * 3 = 30 46 * 1 = 46 46 * 1 = 46 46 * 1 = 46 movie_keyword 4,523,930 4,523,930 6716 * 1 = 6716 7168 * 1 = 7168 8192 * 1 = 8192 8192 * 1 = 8192 title 2,528,312 39,990 1 * 3 = 3 4 * 1 = 4 4 * 1 = 4 4 * 1 = 4 Total 22,022,132 4,739,179 140,919 10,753 12,074 11,790 Rows returned from storage layer Hash Sort-Merge Nested Loops BNL1024 BNL2048 BNL4096 keyword 30 30 30 * 1 = 30 0.57 * 7 = 4 0.57 * 7 = 4 1 * 2 = 2 movie_info 706,674 2,265 1 * 7874 = 7874 2 * 1 = 2 2 * 1 = 2 2 * 1 = 2 movie_keyword 4,523,930 4,523,930 6716 * 1 = 6716 7168 * 1 = 7168 8192 * 1 = 8192 8192 * 1 = 8192 title 1,749,032 12,940 0.3 * 3 = 1 4 * 1 = 4 4 * 1 = 4 4 * 1 = 4 Total 6,979,666 4,539,165 14,621 7,178 8,202 8,200
  26. 28 Yugabyte Confidential © 2024 All rights reserved. Observations •

    Both Nested Loops and Batched Nested Loops were faster than Hash and Sort-Merge overall ◦ Number of rows scanned and returned from the storage layer were also smaller ◦ Batched Nested Loops even faster than Nested Loops with reduced number of inner loops ◦ Benefit of Increased batch size diminishes quickly as the inner loop count decreases • ORDER BY LIMIT query: ◦ Parallel Nested Loops was the fastest ▪ Helped by materialization of 30 row inner relation ◦ Sort-Merge significantly faster than Hash ◦ Batched Nested Loops parallel plan was much slower than Nested Loops - possibly a bug?? → yes, batching not correctly reducing the loops*1 *1 Parallel execution is a part of “early access” enhanced Postgres compatibility features
  27. 29 Yugabyte Confidential © 2024 All rights reserved. Conclusion •

    While Index Nested Loops join is very efficient in reducing the rows to be scanned, the storage layer access overhead can add up with the increased inner loop count. • Batching Nested Loops join is an effective way to reduce such overhead and without adding much system resources or processing overhead. • Parallel Nested Loops a strong choice although system resource consumption trade-off remains.
  28. 30 Yugabyte Confidential © 2024 All rights reserved. References [1]

    DeWitt, D. J., Naughton, J. F., & Burger, J. (1993, January). Nested loops revisited. In [1993] Proceedings of the Second International Conference on Parallel and Distributed Information Systems (pp. 230-242). IEEE. [2] Najjar, F., & Slimani, Y. (1999). Extension of the one-shot semijoin strategy to minimize data transmission cost in distributed query processing. Information Sciences, 114(1-4), 1-21. [3] DeWitt, D. J., & Gerber, R. (1985). Multiprocessor hash-based join algorithms (pp. 151-164). University of Wisconsin-Madison, Computer Sciences Department. [4] Schneider, D. A., & DeWitt, D. J. (1989). A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. ACM SIGMOD Record, 18(2), 110-121. [5] Yamada, H., Goda, K., & Kitsuregawa, M. (2023, April). Nested Loops Revisited Again. In 2023 IEEE 39th International Conference on Data Engineering (ICDE) (pp. 3708-3717). IEEE. [6] (Join Order Benchmark) Leis, V., Gubichev, A., Mirchev, A., Boncz, P., Kemper, A., & Neumann, T. (2015). How good are query optimizers, really?. Proceedings of the VLDB Endowment, 9(3), 204-215. [7] Franck Pachot (2023). Batched Nested Loop to reduce read requests to the distributed storage. DEV Community (dev.to) YugabyteDB Distributed PostgreSQL Database
  29. 32 Yugabyte Confidential © 2024 All rights reserved. Thank You!

    Join us on Slack: www.yugabyte.com/slack Star us on GitHub: github.com/yugabyte/yugabyte-db Other Resources: linktr.ee/yugabyte