Slide 1

Slide 1 text

Apache Arrow C++ Datasets
Kenta Murata
Speee, Inc.
2019.12.11 Apache Arrow Tokyo Meetup 2019

Slide 2

Slide 2 text

Kenta Murata
• Full-time OSS developer at Speee, Inc.
• CRuby committer (since 2010.02)
• Apache Arrow committer (since 2019.10)
• 24th place (44 commits)
• SparseTensor in Arrow C++
• GLib and Ruby bindings, etc.

Slide 3

Slide 3 text

Components of Apache Arrow C++
• Base
• Datasets
• Query Engine
• Data Frame

Slide 4

Slide 4 text

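Apache Arrow C++ Datasets
• Provides an API for treating one or more data sources together as a single dataset
• Absorbs the differences among various kinds of data formats
• Can unify data sources with different schemas into one
• Can also take data input from multiple kinds of storage
• Writing out to files is planned for the future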

Slide 5

Slide 5 text

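One table can be built from multiple data sources
[Diagram: the data sources a.parquet, b.parquet, Query 1, Query 2, c.csv, and d.json, held on Amazon S3, Amazon Redshift, and the local file system, are read as Record Batch 1 and Record Batch 2 and combined into an in-memory Arrow Table]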

Slide 6

Slide 6 text

Reading from files
Discover → Scan → Filter & Project → Collect

Slide 7

Slide 7 text

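Reading from files
• Scans files to create Record Batches
• Multiple files can be scanned in parallel
• Discovers files in a directory on a file system according to specified rules
• Reassembles data that was split across multiple files
• Processes the data according to the schema partitioning rule that was used when it was split into multiple files
• Rows can be filtered with a conditional expression
• Reads only the columns needed to produce the result
• Creates a cache on local storage
• Does not read a file until it is needed (lazy scan)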

Slide 8

Slide 8 text

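File discovery
• Given the location of a base directory and a file format, it lists all target files under that directory
• Subdirectories can also be searched recursively
• File name prefixes to ignore can be specified
• Builds the merged schema needed to read all target files (planned)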

Slide 9

Slide 9 text

File discovery example
/data/.metadata
/data/2018/12/JP/Tokyo/001.parquet
/data/2018/12/JP/Tokyo/002.parquet
/data/2018/12/JP/Osaka/001.parquet
/data/2018/12/US/CA/001.parquet
/data/2019/01/JP/Tokyo/001.parquet
/data/2019/01/JP/Osaka/001.parquet
/data/2019/01/US/CA/001.parquet
/data/2019/01/US/NY/001.parquet
/tmp/Tokyo.parquet
↓ We want to pick up only these files
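The selection rule behind this example can be sketched without Arrow. The block below is a minimal standard-library illustration, not the Arrow API: `DiscoverFiles`, `StartsWith`, and `EndsWith` are hypothetical helpers, and the rule "keep files under the base directory with the given extension, unless the file name starts with an ignored prefix" is an assumption drawn from the description of file discovery in this talk.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): true when `s` begins with `prefix`.
bool StartsWith(const std::string& s, const std::string& prefix) {
  return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper (not part of Arrow): true when `s` ends with `suffix`.
bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Keeps the paths that live under `base_dir`, carry `extension`, and whose
// file name does not start with any of `ignore_prefixes`.
std::vector<std::string> DiscoverFiles(
    const std::vector<std::string>& paths, const std::string& base_dir,
    const std::string& extension,
    const std::vector<std::string>& ignore_prefixes) {
  std::vector<std::string> found;
  const std::string root = base_dir + "/";
  for (const auto& path : paths) {
    if (!StartsWith(path, root) || !EndsWith(path, extension)) continue;
    const std::string name = path.substr(path.find_last_of('/') + 1);
    bool ignored = false;
    for (const auto& prefix : ignore_prefixes) {
      if (StartsWith(name, prefix)) {
        ignored = true;
        break;
      }
    }
    if (!ignored) found.push_back(path);
  }
  return found;
}
```

Called with the ten paths above, base directory `/data`, extension `.parquet`, and ignored prefix `.`, this keeps the eight Parquet files under `/data` and drops `/data/.metadata` and `/tmp/Tokyo.parquet`.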

Slide 10

Slide 10 text

File discovery example
using namespace arrow;
using namespace arrow::dataset;

fs::Selector selector;
selector.base_dir = "/data";
selector.recursive = true;

// Assumed type arguments: DataSourceDiscovery and ParquetFileFormat.
std::shared_ptr<DataSourceDiscovery> discovery;
ARROW_OK_AND_ASSIGN(
    discovery,
    FileSystemDataSourceDiscovery::Make(
        fs, selector, std::make_shared<ParquetFileFormat>(),
        FileSystemDiscoveryOptions()));
ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());

Slide 11

Slide 11 text

Specifying the data partitioning rule
Path segments map to the partition fields in order: year / month / country / city
/data/2018 => {"year": 2018}
/data/2018/12 => {"year": 2018, "month": 12}
/data/2018/12/JP => {"year": 2018, "month": 12, "country": "JP"}
/data/2018/12/JP/Tokyo/001.parquet => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

auto partition_scheme =
    schema({field("year", int32()), field("month", int32()),
            field("country", utf8()), field("city", utf8())});
ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
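How a path turns into partition values can be illustrated with a small standard-library sketch. `ParsePartition` is a hypothetical helper, not the Arrow API; it assumes that the directory segments below the base directory correspond to the partition fields in order, and it keeps every value as a string rather than converting to the field types (int32, utf8) declared in the scheme.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): assigns the path segments that
// follow `base_dir` to `field_names`, first segment to first field and so on.
// Assumes every consumed segment is a directory name; values stay strings.
std::map<std::string, std::string> ParsePartition(
    const std::string& path, const std::string& base_dir,
    const std::vector<std::string>& field_names) {
  std::map<std::string, std::string> values;
  const std::string root = base_dir + "/";
  if (path.compare(0, root.size(), root) != 0) return values;  // outside base_dir

  std::istringstream rest(path.substr(root.size()));
  std::string segment;
  std::size_t i = 0;
  while (i < field_names.size() && std::getline(rest, segment, '/')) {
    if (segment.empty()) continue;  // tolerate doubled slashes
    values[field_names[i++]] = segment;
  }
  return values;
}
```

For `/data/2018/12/JP/Tokyo/001.parquet` with fields year, month, country, and city, the four directory segments are consumed and the file name is left over; for the shorter path `/data/2018/12` only year and month are filled in.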

Slide 12

Slide 12 text

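Filtering
• Rows can be filtered with a conditional expression
• To extract only the rows where year is 2019 and sales is greater than 100.0, pass the following expression to the scanner:
  "year"_ == 2019 && "sales"_ > 100.0
• Following the schema partitioning rule, files that cannot match the condition are not read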
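The two levels of filtering described here, skipping whole files by partition value and then testing individual rows, can be sketched as follows. This is a standard-library illustration under stated assumptions, not the Arrow implementation: `PartitionedFile`, `Row`, `PruneByYear`, and `MatchesFilter` are hypothetical names, and the year of each file is assumed to have been derived from its path already.

```cpp
#include <string>
#include <vector>

// Hypothetical record: a file plus the partition value taken from its path.
struct PartitionedFile {
  std::string path;
  int year;
};

// Hypothetical row with the two columns used by the example condition.
struct Row {
  int year;
  double sales;
};

// File-level pruning: only files whose partition year equals `wanted_year`
// are returned, so the others never need to be opened.
std::vector<std::string> PruneByYear(const std::vector<PartitionedFile>& files,
                                     int wanted_year) {
  std::vector<std::string> to_read;
  for (const auto& file : files) {
    if (file.year == wanted_year) to_read.push_back(file.path);
  }
  return to_read;
}

// Row-level filter for the example condition: year == 2019 && sales > 100.0.
bool MatchesFilter(const Row& row) {
  return row.year == 2019 && row.sales > 100.0;
}
```

The `year` part of the condition can be decided from the directory name alone, while the `sales` part still has to be evaluated against the rows of the files that remain.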

Slide 13

Slide 13 text

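Specifying the columns to extract
• When not every column needs to be read, the projection feature can be used to limit the columns that are extracted
• Limiting the columns this way skips deserialization and type conversion of the unneeded columns, which makes reading faster for some file formats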
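Why projection saves work can be shown with a toy columnar file. The sketch below is an assumption-laden illustration, not the Arrow or Parquet reader: `EncodedFile` is a hypothetical format that stores each column as a comma-separated string, and `ReadProjected` decodes only the columns that were requested.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical on-disk form: column name -> comma-separated numeric values.
using EncodedFile = std::map<std::string, std::string>;
// Hypothetical in-memory form: column name -> decoded values.
using Table = std::map<std::string, std::vector<double>>;

// Decodes one column; this is the per-column cost that projection avoids.
std::vector<double> DecodeColumn(const std::string& encoded) {
  std::vector<double> values;
  std::istringstream in(encoded);
  std::string token;
  while (std::getline(in, token, ',')) values.push_back(std::stod(token));
  return values;
}

// Decodes only the requested columns; all other columns are left untouched.
Table ReadProjected(const EncodedFile& file,
                    const std::vector<std::string>& columns) {
  Table table;
  for (const auto& name : columns) {
    auto it = file.find(name);
    if (it != file.end()) table[name] = DecodeColumn(it->second);
  }
  return table;
}
```

With a file holding `item_id`, `sales`, and `cost`, requesting only `item_id` and `sales` means `DecodeColumn` is never called for `cost`.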

Slide 14

Slide 14 text

Example: creating a dataset, reading it, and building an Arrow Table
// Create the dataset
ASSERT_OK_AND_ASSIGN(auto dataset,
                     Dataset::Make({data_source}, discovery->Inspect()));
// Scanner builder
ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
// Set the filter
auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
ASSERT_OK(scanner_builder->Filter(filter));
// Set the projection
std::vector<std::string> columns{"item_id", "item_name", "sales"};
ASSERT_OK(scanner_builder->Project(columns));
// Create the scanner
ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
// Read the data and build an Arrow Table (the files are actually read here)
ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());

Slide 15

Slide 15 text

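Reading multiple files in parallel
• A read task is created per file, and the tasks run in parallel on a thread pool
• For the Parquet format, a single file is read sequentially, one row group at a time
• One or more Arrow Record Batches are produced from each file, and they are finally combined into an Arrow Table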
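The task-per-file structure can be sketched with the standard library. This is a simplified illustration, not Arrow's scanner: `Batch`, `ReadFile`, and `ReadAll` are hypothetical, the reader fabricates two row groups of 100 rows per file instead of touching disk, and `std::async` starts a thread per task rather than using a thread pool.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a record batch: its source file and row count.
struct Batch {
  std::string source;
  std::size_t num_rows;
};

// Stand-in reader: pretends the file has two row groups and walks them
// sequentially, yielding one batch per row group.
std::vector<Batch> ReadFile(const std::string& path) {
  std::vector<Batch> batches;
  for (int row_group = 0; row_group < 2; ++row_group) {
    batches.push_back({path, 100});
  }
  return batches;
}

// Starts one asynchronous task per file, then gathers all batches into a
// single combined result, in the order the files were given.
std::vector<Batch> ReadAll(const std::vector<std::string>& paths) {
  std::vector<std::future<std::vector<Batch>>> tasks;
  for (const auto& path : paths) {
    tasks.push_back(std::async(std::launch::async, ReadFile, path));
  }
  std::vector<Batch> table;
  for (auto& task : tasks) {
    for (const auto& batch : task.get()) table.push_back(batch);
  }
  return table;
}
```

Files are processed concurrently, but within one file the row groups are still visited in order, which matches the behavior described for Parquet above.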

Slide 16

Slide 16 text

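Supporting various file formats
• Support for datasets split across multiple Parquet files is currently being built out
• Common data storage formats such as Avro, ORC, JSON, and CSV will be supported in the future
• I think pull requests adding support for formats other than Parquet are always welcome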

Slide 17

Slide 17 text

Support for various file systems
• Already supported
  • Local file system
  • HDFS
  • Amazon S3
  • Mock file system for testing
• To be supported in the future
  • Google Cloud Storage
  • Microsoft Azure BLOB Storage

Slide 18

Slide 18 text

Reading from RDBs
• There are also plans to make RDB tables and query results usable as data sources
• The following systems have been named explicitly
  • SQLite3
  • PostgreSQL protocol (pgsql, Vertica, Redshift)
  • MySQL (and MemSQL)
  • Microsoft SQL Server (TDS)
  • HiveServer2 (Hive and Impala)
  • ClickHouse

Slide 19

Slide 19 text

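Apache Arrow C++ Datasets
• With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
• What comes after building the Arrow Table?
  • We want to run further analytical queries
  • We want to do aggregation and statistical processing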

Slide 20

Slide 20 text

After building the Arrow Table
• We want to run analytical queries => Apache Arrow C++ Query Engine
• We want to do aggregation and statistical processing => Apache Arrow C++ Data Frame

Slide 21

Slide 21 text

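Apache Arrow C++ Query Engine
• Provides the ability to run SQL-like queries, as well as time-series and pivot operations commonly used in data analysis, against in-memory Arrow Record Batches
• It is not intended to replace databases; it is expected to be embedded in ordinary applications as a C++ shared library
• Development has not started yet, but it is under discussion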

Slide 22

Slide 22 text

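Apache Arrow C++ Data Frame
• Provides the data manipulation, analysis, and aggregation features that a typical data frame offers, operating on in-memory Arrow Record Batches
• Development has not started yet, but it is under discussion
• Will pandas2 perhaps be built with Arrow C++ Data Frame as its backend?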

Slide 23

Slide 23 text

• Datasets: makes it easy to access data stored in files and databases
• Query Engine: makes it easy to run analytical queries on in-memory table data
• Data Frame: lets in-memory table data be used as a data frame