Apache Arrow C++ Datasets

Kenta Murata
December 11, 2019

An introduction to Apache Arrow C++ Datasets.

Presented at Apache Arrow Tokyo Meetup 2019.

Transcript

  1. Apache Arrow C++ Datasets
     Kenta Murata, Speee, Inc.
     2019.12.11, Apache Arrow Tokyo Meetup 2019
  2. Kenta Murata
     • Full-time OSS developer at Speee, Inc.
     • CRuby committer (as of 2010.02)
     • Apache Arrow committer (as of 2019.10)
       • The 24th place (44 commits)
       • SparseTensor in Arrow C++
       • GLib and Ruby binding, etc.
  3. Structure of Apache Arrow C++
     • Base
     • Datasets
     • Query Engine
     • Data Frame
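  4. Apache Arrow C++ Datasets
     • Provides an API for handling one or more data sources together as a single dataset
     • Absorbs the differences between various kinds of data formats
     • Can unify data sources with different schemas into one
     • Can also take data input from multiple kinds of storage
     • Writing out to files is planned for the future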
  5. One table can be built from multiple data sources
     [Diagram labels: a.parquet, b.parquet, Query 1, Query 2, c.csv, d.json; Amazon S3, Amazon Redshift, Local File System; Record Batch 1, Record Batch 2; In-Memory Arrow Table]
  6. Reading from files
     Discover → Scan → Filter & Project → Collect
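  7. Reading from files
     • Scans files to produce Record Batches
     • Multiple files can be scanned in parallel
     • Discovers files under a directory on a file system according to specified rules
     • Reassembles data that was split across multiple files
     • Processes files according to the schema partitioning rule used when the data was split into multiple files
     • Rows can be filtered with a conditional expression
     • Reads only the columns needed to produce the result
     • Creates a cache on local storage
     • Does not read a file until it is needed (lazy scan)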
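  8. File discovery
     • Given the location of a base directory and a file format, it lists all target files under that directory
     • Subdirectories can also be searched recursively
     • File name prefixes to ignore can be specified
     • It builds the merged schema needed to read all target files (planned)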
  9. File discovery example
       /data/.metadata
       /data/2018/12/JP/Tokyo/001.parquet
       /data/2018/12/JP/Tokyo/002.parquet
       /data/2018/12/JP/Osaka/001.parquet
       /data/2018/12/US/CA/001.parquet
       /data/2019/01/JP/Tokyo/001.parquet
       /data/2019/01/JP/Osaka/001.parquet
       /data/2019/01/US/CA/001.parquet
       /data/2019/01/US/NY/001.parquet
       /tmp/Tokyo.parquet
     Slide annotation: "↓ We want to pick up only these files"
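
The selection rule described on slides 8 and 9 (a base directory, a file format, and file name prefixes to ignore) can be sketched in plain C++ without Arrow. This is only an illustration under stated assumptions: `SelectFiles`, its parameters, and the use of a file extension to stand in for the file format are all hypothetical and are not part of the Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `s` starts with `prefix`.
static bool StartsWith(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Returns true if `s` ends with `suffix`.
static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper (not an Arrow API): keeps the paths that are located
// under `base_dir`, end with `extension`, and contain no path segment below
// `base_dir` that starts with one of `ignore_prefixes`.
std::vector<std::string> SelectFiles(
    const std::vector<std::string>& paths, const std::string& base_dir,
    const std::string& extension,
    const std::vector<std::string>& ignore_prefixes) {
  assert(!base_dir.empty());
  const std::string root = base_dir + "/";
  std::vector<std::string> selected;
  for (const std::string& path : paths) {
    if (!StartsWith(path, root) || !EndsWith(path, extension)) continue;
    bool ignored = false;
    std::size_t begin = root.size();
    // Walk every segment (directory or file name) below the base directory.
    while (begin < path.size() && !ignored) {
      std::size_t end = path.find('/', begin);
      if (end == std::string::npos) end = path.size();
      const std::string segment = path.substr(begin, end - begin);
      for (const std::string& prefix : ignore_prefixes) {
        if (StartsWith(segment, prefix)) ignored = true;
      }
      begin = end + 1;
    }
    if (!ignored) selected.push_back(path);
  }
  return selected;
}
```

Applied to the listing on slide 9 with base directory `/data`, extension `.parquet`, and ignore prefixes such as `.` and `_`, this sketch keeps the eight partitioned Parquet files and drops `/data/.metadata` and `/tmp/Tokyo.parquet`.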
  10. File discovery example
       using namespace arrow;
       using namespace arrow::dataset;

       fs::Selector selector;
       selector.base_dir = "/data";
       selector.recursive = true;

       std::shared_ptr<FileSystemDataSourceDiscovery> discovery;
       ARROW_OK_AND_ASSIGN(
           discovery,
           FileSystemDataSourceDiscovery::Make(
               fs, selector,
               std::make_shared<dataset::ParquetFileFormat>(),
               FileSystemDiscoveryOptions()));
       ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
  11. Specifying the data partitioning rule
     Directory levels map to the fields year, month, country, city:
       /data/2018                         => {"year": 2018}
       /data/2018/12                      => {"year": 2018, "month": 12}
       /data/2018/12/JP                   => {"year": 2018, "month": 12, "country": "JP"}
       /data/2018/12/JP/Tokyo/001.parquet => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

       auto partition_scheme =
           schema({field("year", int32()),
                   field("month", int32()),
                   field("country", utf8()),
                   field("city", utf8())});
       ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
       ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
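
The mapping from directory levels to partition fields shown on slide 11 can be illustrated with a small standalone function. This is a minimal sketch, not how Arrow implements partition schemes: `ParsePartition` is a hypothetical helper, and it keeps every value as a string rather than converting to the `int32` and `utf8` types declared in the slide's schema.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): assigns the path segments that
// follow `base_dir` to `field_names`, in order. Parsing stops when either
// the segments or the field names run out.
std::map<std::string, std::string> ParsePartition(
    const std::string& path, const std::string& base_dir,
    const std::vector<std::string>& field_names) {
  // The path is expected to start with the base directory.
  assert(path.compare(0, base_dir.size(), base_dir) == 0);
  std::map<std::string, std::string> values;
  std::size_t begin = base_dir.size();
  for (const std::string& name : field_names) {
    // Each segment must be preceded by a '/' separator.
    if (begin >= path.size() || path[begin] != '/') break;
    ++begin;
    std::size_t end = path.find('/', begin);
    if (end == std::string::npos) end = path.size();
    if (end == begin) break;  // empty segment, nothing left to assign
    values[name] = path.substr(begin, end - begin);
    begin = end;
  }
  return values;
}
```

With the four field names from the slide, `/data/2018/12` yields only `year` and `month`, while `/data/2018/12/JP/Tokyo/001.parquet` yields all four fields; the file name itself is not consumed because only four fields are declared.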
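  12. Filtering
     • Rows can be filtered with a conditional expression
     • To extract only the rows where year is 2019 and sales is greater than 100.0, pass the following expression to the scanner:
         "year"_ == 2019 && "sales"_ > 100.0
     • Following the schema partitioning rule, files that cannot match the condition are not read at all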
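
The last bullet of slide 12 (skipping files whose partition values rule them out) can be sketched as follows. This is an illustration only, under the assumption that partition values have already been extracted from each path; `PartitionedFile` and `PruneByEquality` are hypothetical names, not Arrow types. Only the `year == 2019` half of the example filter can be decided from the path. The `sales > 100.0` half still has to be evaluated on the rows of the files that remain.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record: a file path plus the partition values derived from it.
struct PartitionedFile {
  std::string path;
  std::map<std::string, std::string> partition;
};

// Hypothetical helper (not an Arrow API): keeps the files whose partition
// value for `field` equals `wanted`. Files that carry no value for `field`
// are kept as well, because the path alone cannot rule them out.
std::vector<PartitionedFile> PruneByEquality(
    const std::vector<PartitionedFile>& files, const std::string& field,
    const std::string& wanted) {
  assert(!field.empty());
  std::vector<PartitionedFile> kept;
  for (const PartitionedFile& file : files) {
    const auto it = file.partition.find(field);
    if (it == file.partition.end() || it->second == wanted) {
      kept.push_back(file);
    }
  }
  return kept;
}
```

For the directory layout on slide 9, pruning on `year` equal to `2019` would leave only the four files under `/data/2019`, so the 2018 files are never opened.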
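  13. Specifying the columns to extract
     • When not every column needs to be read, the projection feature can restrict which columns are extracted
     • Restricting the columns this way skips deserialization and type conversion for the unneeded columns and, depending on the file format, makes reading the data faster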
  14. Example: creating a dataset, reading it, and building an Arrow Table
       // Create the dataset
       ASSERT_OK_AND_ASSIGN(auto dataset,
           Dataset::Make({data_source}, discovery->Inspect()));

       // Scanner builder
       ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());

       // Set the filter
       auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
       ASSERT_OK(scanner_builder->Filter(filter));

       // Set the projection
       std::vector<std::string> columns{"item_id", "item_name", "sales"};
       ASSERT_OK(scanner_builder->Project(columns));

       // Create the scanner
       ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

       // Read the data and build an Arrow Table (the files are actually read here)
       ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
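  15. Reading multiple files in parallel
     • A read task is created per file, and the tasks are executed in parallel on a thread pool
     • For the Parquet format, a single file is read sequentially, one row group at a time
     • One or more Arrow Record Batches are produced from each file, and at the end they are combined into an Arrow Table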
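
The task structure described on slide 15 (one task per file, sequential work inside each file, results gathered at the end) can be sketched with the C++ standard library. This is a simplified illustration, not Arrow's scanner: it uses `std::async` instead of a real thread pool, `Batch` is a stand-in for an Arrow Record Batch, and `ReadFile` is a hypothetical reader that fabricates two placeholder batches per file instead of touching the disk.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an Arrow Record Batch (assumption for illustration).
using Batch = std::vector<int>;

// Hypothetical reader: pretends every file has two row groups and reads
// them one after another, producing one batch per row group. Each batch
// holds {length of the path, row group index} as placeholder content.
std::vector<Batch> ReadFile(const std::string& path) {
  assert(!path.empty());
  std::vector<Batch> batches;
  const int marker = static_cast<int>(path.size());
  for (int row_group = 0; row_group < 2; ++row_group) {
    batches.push_back(Batch{marker, row_group});
  }
  return batches;
}

// Launches one asynchronous task per file, then gathers the batches in the
// order the files were given. The gathered vector plays the role of the
// final table.
std::vector<Batch> ReadAll(const std::vector<std::string>& paths) {
  std::vector<std::future<std::vector<Batch>>> tasks;
  for (const std::string& path : paths) {
    tasks.push_back(std::async(std::launch::async, ReadFile, path));
  }
  std::vector<Batch> table;
  for (auto& task : tasks) {
    for (Batch& batch : task.get()) {
      table.push_back(std::move(batch));
    }
  }
  return table;
}
```

Collecting the futures in submission order keeps the output deterministic even though the files are read concurrently.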
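  16. Supporting various file formats
     • Currently, support for datasets split across multiple Parquet files is being put in place
     • Common data storage formats such as Avro, ORC, JSON, and CSV will be supported in the future
     • I think pull requests that add support for formats other than Parquet are always welcome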
  17. Support for various file systems
     • Already supported
       • Local file system
       • HDFS
       • Amazon S3
       • Mock file system for testing
     • To be supported in the future
       • Google Cloud Storage
       • Microsoft Azure BLOB Storage
  18. Reading from an RDB
     • There are also plans to make RDB tables and query results usable as data sources
     • The following systems have been named explicitly
       • SQLite3
       • PostgreSQL protocol (pgsql, Vertica, Redshift)
       • MySQL (and MemSQL)
       • Microsoft SQL Server (TDS)
       • HiveServer2 (Hive and Impala)
       • ClickHouse
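  19. Apache Arrow C++ Datasets
     • With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
     • What comes after building the Arrow Table?
       • We want to run further analytical queries
       • We want to do aggregation and statistical processing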
  20. After building the Arrow Table
     • We want to run analytical queries => Apache Arrow C++ Query Engine
     • We want to do aggregation and statistical processing => Apache Arrow C++ Data Frame
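  21. Apache Arrow C++ Query Engine
     • Provides functionality for running SQL-like queries, as well as time-series and pivot operations commonly used in data analysis, against in-memory Arrow Record Batches
     • It is not intended to replace a database; it is expected to be embedded in ordinary applications as a C++ shared library
     • Development has not started yet, but it is under discussion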
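  22. Apache Arrow C++ Data Frame
     • Provides the kind of data manipulation, analysis, and aggregation features that a so-called data frame offers, operating on in-memory Arrow Record Batches
     • Development has not started yet, but it is under discussion
     • Will pandas2 perhaps be built with Arrow C++ Data Frame as its backend?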
  23. Summary
     • Datasets: access to data stored in files and databases becomes easy
     • Query Engine: analytical queries against in-memory table data can be run easily
     • Data Frame: in-memory table data can be used as a data frame