Apache Arrow C++ Datasets

Kenta Murata
December 11, 2019

An introduction to Apache Arrow C++ Datasets.

Presented at Apache Arrow Tokyo Meetup 2019.

Transcript

  1. Apache Arrow C++
    Datasets
    Kenta Murata
    Speee, Inc.
    2019.12.11
    Apache Arrow Tokyo Meetup 2019

  2. Kenta Murata
    • Full-time OSS developer at Speee, Inc.
    • CRuby committer (since 2010.02)
    • Apache Arrow committer (since 2019.10)
    • Ranked 24th by commit count (44 commits)
    • SparseTensor in Arrow C++
    • GLib and Ruby bindings, etc.

  3. Structure of Apache Arrow C++
    [Diagram of components: Base, Datasets, Query Engine, Data Frame]

  4. Apache Arrow C++ Datasets
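    • Provides an API for treating one or more data sources as a single dataset
    • Absorbs the differences between various data formats
    • Can unify data sources that have different schemas
    • Can take input from multiple kinds of storage
    • Writing out to files is planned for the future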

  5. One table can be built from multiple data sources
    [Diagram. Inputs: a.parquet, b.parquet, Query 1, Query 2, c.csv, d.json, Record Batch 1, Record Batch 2. Storage: Amazon S3, Amazon Redshift, Local File System, In-Memory. Output: Arrow Table]

  6. Reading from files
    Discover → Scan → Filter & Project → Collect

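  7. Reading from files
    • Scans files to produce Record Batches
    • Can scan multiple files in parallel
    • Discovers files under a directory on a file system according to specified rules
    • Reassembles data that has been split across multiple files
    • Follows the partition scheme rules that were used when the data was split into multiple files
    • Can filter rows with a condition expression
    • Reads only the columns needed to build the result
    • Creates a cache on local storage
    • Does not read a file until it is needed (lazy scan)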

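  8. File discovery
    • Given the location of a base directory and a file format, it lists all target files under that directory
    • Subdirectories can also be searched recursively
    • File name prefixes to ignore can be specified
    • Builds the merged schema needed to read all of the target files (planned)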

  9. Example of file discovery
    /data/.metadata
    /data/2018/12/JP/Tokyo/001.parquet
    /data/2018/12/JP/Tokyo/002.parquet
    /data/2018/12/JP/Osaka/001.parquet
    /data/2018/12/US/CA/001.parquet
    /data/2019/01/JP/Tokyo/001.parquet
    /data/2019/01/JP/Osaka/001.parquet
    /data/2019/01/US/CA/001.parquet
    /data/2019/01/US/NY/001.parquet
    /tmp/Tokyo.parquet
    ↓ We want to pick up only these files
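
    The selection rules can be illustrated without Arrow itself. The following is a minimal sketch in plain standard C++, assuming the rules are "under the base directory, matching the format's extension, and not starting with an ignored prefix". `DiscoveryRules` and `DiscoverFiles` are hypothetical names for this illustration, not part of the Arrow API, and the sketch filters a given list of paths rather than walking a real file system.

```cpp
#include <string>
#include <vector>

// Hypothetical options, loosely mirroring the discovery rules on the slides.
struct DiscoveryRules {
  std::string base_dir;                      // e.g. "/data"
  std::string extension;                     // e.g. ".parquet"
  std::vector<std::string> ignore_prefixes;  // file-name prefixes to skip
};

static bool StartsWith(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the subset of `paths` selected by `rules`, in input order.
std::vector<std::string> DiscoverFiles(const std::vector<std::string>& paths,
                                       const DiscoveryRules& rules) {
  std::vector<std::string> selected;
  const std::string root = rules.base_dir + "/";
  for (const auto& path : paths) {
    if (!StartsWith(path, root)) continue;           // outside the base directory
    if (!EndsWith(path, rules.extension)) continue;  // not the requested format
    const std::string name = path.substr(path.find_last_of('/') + 1);
    bool ignored = false;
    for (const auto& prefix : rules.ignore_prefixes) {
      if (StartsWith(name, prefix)) {
        ignored = true;
        break;
      }
    }
    if (!ignored) selected.push_back(path);
  }
  return selected;
}
```

    With base directory "/data" and extension ".parquet", this keeps the Parquet files under /data and drops both /data/.metadata and /tmp/Tokyo.parquet.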

  10. Example of file discovery
    using namespace arrow;
    using namespace arrow::dataset;
    fs::Selector selector;
    selector.base_dir = "/data";
    selector.recursive = true;
    std::shared_ptr<DataSourceDiscovery> discovery;
    ARROW_OK_AND_ASSIGN(
        discovery,
        FileSystemDataSourceDiscovery::Make(
            fs, selector,
            std::make_shared<ParquetFileFormat>(),
            FileSystemDiscoveryOptions()));
    ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());

  11. Specifying the data partitioning rules
    Path components: year / month / country / city
    /data/2018                          => {"year": 2018}
    /data/2018/12                       => {"year": 2018, "month": 12}
    /data/2018/12/JP                    => {"year": 2018, "month": 12, "country": "JP"}
    /data/2018/12/JP/Tokyo/001.parquet  => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

    auto partition_scheme =
        schema({field("year", int32()), field("month", int32()),
                field("country", utf8()), field("city", utf8())});
    ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
    ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
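
    The path-to-partition mapping above can be sketched in plain standard C++. This is only an illustration under simplifying assumptions: `ParsePartition` is a hypothetical helper, values are kept as strings rather than typed as in the real scheme, and a file is assumed to sit below all partition directories.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Maps each directory level below `base_dir` to the field name at the same
// position in `fields`. Returns an empty map if `path` is not under `base_dir`.
std::map<std::string, std::string> ParsePartition(
    const std::string& path, const std::string& base_dir,
    const std::vector<std::string>& fields) {
  std::map<std::string, std::string> values;
  if (path.compare(0, base_dir.size(), base_dir) != 0) return values;
  const std::string rest = path.substr(base_dir.size());
  if (!rest.empty() && rest[0] != '/') return values;  // e.g. "/database"

  std::istringstream segments(rest);
  std::string segment;
  std::size_t level = 0;
  while (level < fields.size() && std::getline(segments, segment, '/')) {
    if (segment.empty()) continue;  // skip the empty piece before the first '/'
    values[fields[level++]] = segment;
  }
  return values;
}
```

    Called with the fields year, month, country, and city, the path /data/2018/12 yields two entries and the full file path yields four; the trailing file name is ignored because all fields are already filled.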

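  12. Filtering
    • Rows can be filtered with a condition expression
    • To extract only the rows where year is 2019 and sales is greater than 100.0, give the scanner the following expression:
    "year"_ == 2019 && "sales"_ > 100.0
    • Based on the partition scheme, files that cannot match the condition are not read at all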
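
    The file-skipping step can be sketched as follows. This is a simplified illustration in plain standard C++, not the Arrow implementation: `PartitionedFile` and `PruneByYear` are hypothetical names, and only the `year == 2019` half of the filter is handled, since `sales` is an ordinary column that would still have to be checked row by row after a file is read.

```cpp
#include <string>
#include <vector>

// Hypothetical record: a file plus the partition value parsed from its path.
struct PartitionedFile {
  std::string path;
  int year;
};

// Keeps only the files whose partition value can satisfy `year == wanted_year`.
// Files that are dropped here never need to be opened.
std::vector<std::string> PruneByYear(const std::vector<PartitionedFile>& files,
                                     int wanted_year) {
  std::vector<std::string> to_read;
  for (const auto& file : files) {
    if (file.year == wanted_year) to_read.push_back(file.path);
  }
  return to_read;
}
```

    For the directory layout shown earlier, filtering on year 2019 leaves only the files under /data/2019.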

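  13. Specifying the columns to extract
    • When not every column is needed, the projection feature can restrict which columns are extracted
    • Restricting the columns this way skips deserialization and type conversion for the unneeded columns, which makes reads faster for some file formats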
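
    Why projection saves work can be shown with a toy model. In this sketch, which assumes a made-up "file" that stores each column as comma-separated text, only the requested columns are ever decoded. `EncodedFile`, `Decode`, and `ReadProjected` are hypothetical names for this illustration and have nothing to do with the real Parquet reader.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using EncodedFile = std::map<std::string, std::string>;  // column -> "1.5,2.0"
using Table = std::map<std::string, std::vector<double>>;

// Stand-in for deserialization and type conversion of one column.
static std::vector<double> Decode(const std::string& text) {
  std::vector<double> values;
  std::istringstream in(text);
  std::string cell;
  while (std::getline(in, cell, ',')) values.push_back(std::stod(cell));
  return values;
}

// Decodes only the columns named in `columns`; all others are left untouched.
Table ReadProjected(const EncodedFile& file,
                    const std::vector<std::string>& columns) {
  Table table;
  for (const auto& name : columns) {
    auto it = file.find(name);
    if (it != file.end()) table[name] = Decode(it->second);
  }
  return table;
}
```

    A column that is not requested is never passed to `Decode`, so its cost (and, in this toy, any decoding error it would raise) is avoided entirely.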

  14. Example: from creating a dataset to reading it into an Arrow Table
    // Create the dataset
    ASSERT_OK_AND_ASSIGN(auto dataset,
        Dataset::Make({data_source}, discovery->Inspect()));
    // Scanner builder
    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
    // Set the filter
    auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
    ASSERT_OK(scanner_builder->Filter(filter));
    // Set the projection
    std::vector<std::string> columns{"item_id", "item_name", "sales"};
    ASSERT_OK(scanner_builder->Project(columns));
    // Create the scanner
    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
    // Read the data and build an Arrow Table (the files are actually read here)
    ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());

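  15. Reading multiple files in parallel
    • One read task is created per file, and the tasks run in parallel on a thread pool
    • For the Parquet format, a single file is read sequentially, one row group at a time
    • Each file produces one or more Arrow Record Batches, which are finally combined into an Arrow Table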
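
    The one-task-per-file pattern can be sketched with the standard library. Note the substitutions: this uses `std::async` instead of a thread pool, a `std::vector<int>` stands in for an Arrow Record Batch, and `ReadFile` is a hypothetical reader that fabricates a value from the path so the example needs no real files.

```cpp
#include <future>
#include <string>
#include <vector>

using Batch = std::vector<int>;  // stand-in for an Arrow Record Batch

// Hypothetical reader: returns one batch holding the length of the path.
Batch ReadFile(const std::string& path) {
  return Batch{static_cast<int>(path.size())};
}

// Launches one asynchronous task per file, then gathers the resulting
// batches in file order (the "combine at the end" step).
std::vector<Batch> ReadAll(const std::vector<std::string>& paths) {
  std::vector<std::future<Batch>> tasks;
  for (const auto& path : paths) {
    tasks.push_back(std::async(std::launch::async, ReadFile, path));
  }
  std::vector<Batch> batches;
  for (auto& task : tasks) batches.push_back(task.get());
  return batches;
}
```

    Collecting the futures in submission order keeps the output deterministic even though the tasks may finish in any order. On some toolchains this needs to be linked with thread support (for example `-pthread`).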

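  16. Supporting various file formats
    • Currently, support for datasets split across multiple Parquet files is being built out
    • Common data storage formats such as AVRO, ORC, JSON, and CSV will be supported in the future
    • I think pull requests that add support for formats other than Parquet are always welcome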

  17. Support for various file systems
    • Already supported
        • Local file system
        • HDFS
        • Amazon S3
        • Mock file system for testing
    • To be supported in the future
        • Google Cloud Storage
        • Microsoft Azure BLOB Storage

  18. Reading from an RDB
    • There are also plans to allow RDB tables and query results to be used as data sources
    • The following systems have been named explicitly
        • SQLite3
        • PostgreSQL protocol (pgsql, Vertica, Redshift)
        • MySQL (and MemSQL)
        • Microsoft SQL Server (TDS)
        • HiveServer2 (Hive and Impala)
        • ClickHouse

  19. Apache Arrow C++ Datasets
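    • With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
    • What comes after building the Arrow Table?
        • We want to run analytical queries on it
        • We want to do aggregation and statistical processing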

  20. After building the Arrow Table
    • We want to run analytical queries
    => Apache Arrow C++ Query Engine
    • We want to do aggregation and statistical processing
    => Apache Arrow C++ Data Frame

  21. Apache Arrow C++ Query Engine
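    • Provides the ability to run SQL-like queries on in-memory Arrow Record Batches, as well as operations common in data analysis such as time-series and pivot operations
    • It is not intended to replace a database; it is expected to be embedded in ordinary applications as a C++ shared library
    • Development has not started yet, but it is being discussed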

  22. Apache Arrow C++ Data Frame
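    • Provides data manipulation, analysis, and aggregation features of the kind a typical data frame offers, operating on in-memory Arrow Record Batches
    • Development has not started yet, but it is being discussed
    • Will pandas2 be built with Arrow C++ Data Frame as its backend?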

  23. Datasets: makes it easy to access data stored in files and databases
    Query Engine: makes it easy to run analytical queries on in-memory table data
    Data Frame: lets in-memory table data be used as a data frame
