Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Apache Arrow C++ Datasets
Search
Kenta Murata
December 11, 2019
Technology
4
1.6k
Apache Arrow C++ Datasets
Introduce Apache Arrow C++ Datasets.
Presented Apache Arrow Tokyo Meetup 2019.
Kenta Murata
December 11, 2019
Tweet
Share
More Decks by Kenta Murata
See All by Kenta Murata
waitany と waitall を作った話
mrkn
0
200
HolidayJp.jl を作りました
mrkn
0
200
Calling Julia functions from Streamlit applications
mrkn
1
440
Red Data Tools で切り開く Ruby の未来
mrkn
3
1.2k
Method-based JIT compilation by transpiling to Julia
mrkn
0
7.2k
Reducing ActiveRecord memory consumption using Apache Arrow
mrkn
0
1.7k
RubyData and Rails
mrkn
0
3.1k
Tensor and Arrow
mrkn
0
940
RubyData Current and Future
mrkn
1
3.5k
Other Decks in Technology
See All in Technology
eBPF-based Process Lifecycle Monitoring
yukinakanaka
1
140
S3成長記録 in 2024 - オレたちのS3はどこに向かうのか?- @Storage-JAWS#7
p0n
1
100
社内限定だった「ChatGPTオペレーター勉強会」の極秘資料を無料で特別公開
tenho7_kodama
1
130
RubyKaigi で得た課題解決法・美意識・モチベーション
morihirok
0
130
なぜ「Event Sourcing」を選択したのか〜事実に基づくことの重要性〜/Why did we choose "Event Sourcing"?
bitkey
1
240
みんなで育てるNewsPicksのSLO
troter
3
790
Roomの監視可能なクエリのカスタマイズとレガシーコードへの適用
shiita0903
2
170
生成AIで生産性向上
tomuro
0
180
コンテナ上シェル悪用の話とPure Bashでcurlが作れた話
ryotosaito
2
350
EC-CUBEはサーバレスで動かせるのか?
yukishimada
1
140
単一の深層学習モデルによる不確実性の定量化の紹介 ~その予測結果正しいですか?~
ftakahashi
PRO
3
410
LangGraphを使ったAIエージェント実装
iwakiyusaku
1
160
Featured
See All Featured
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2.1k
Designing on Purpose - Digital PM Summit 2013
jponch
117
7.2k
Mobile First: as difficult as doing things right
swwweet
223
9.5k
Raft: Consensus for Rubyists
vanstee
137
6.8k
jQuery: Nuts, Bolts and Bling
dougneiner
63
7.7k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
12k
The Art of Programming - Codeland 2020
erikaheidi
53
13k
Thoughts on Productivity
jonyablonski
69
4.5k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
28
9.3k
Making the Leap to Tech Lead
cromwellryan
133
9.1k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
11
580
Transcript
Apache Arrow C++ Datasets Kenta Murata Speee, Inc. 2019.12.11 Apache
Arrow Tokyo Meetup 2019
Kenta Murata • Fulltime OSS developer at Speee, Inc. •
CRuby committer (as of 2010.02) • Apache Arrow committer (as of 2019.10) • The 24th place (44 commits) • SparseTensor in Arrow C++ • GLib and Ruby binding, etc.
Apache Arrow C++ ͷߏ Base Datasets Query Engine Data Frame
Apache Arrow C++ Datasets • 1ͭҎ্ͷσʔλιʔεΛ·ͱΊͯ1ͭͷσʔληοτͱ ͯ͠ѻ͏ͨΊͷ API Λఏڙ͢Δ •
༷ʑͳछྨͷσʔλϑΥʔϚοτͷҧ͍Λٵऩ͢Δ • ҟͳΔεΩʔϚͷσʔλιʔεΛ1ͭʹ౷߹Ͱ͖Δ • ෳछྨͷετϨʔδ͔ΒͷσʔλೖྗʹରԠͰ͖Δ • কདྷతʹϑΝΠϧͷॻ͖ग़͠ʹରԠ͢Δ༧ఆ
ෳͷσʔλιʔε͔Β1ͭͷςʔϒϧΛ࡞ΕΔ a.parquet b.parquet Query 1 Query 2 c.csv d.json Record
Batch 1 Record Batch 2 Amazon S3 Amazon Redshift Local File System In-Memory Arrow Table
ϑΝΠϧ͔ΒͷಡΈࠐΈ Discover Scan Filter & Project Collect
ϑΝΠϧ͔ΒͷಡΈࠐΈ • ϑΝΠϧΛεΩϟϯͯ͠ Record Batch Λ࡞Δ • ෳϑΝΠϧΛฒྻεΩϟϯͰ͖Δ • ϑΝΠϧγεςϜ্ͷσΟϨΫτϦ͔Βࢦఆͨ͠ϧʔϧʹج͍ͮͯϑΝΠϧΛൃݟ͢Δ
• ෳͷϑΝΠϧʹׂ͞ΕͨσʔλΛ࠶ߏ͢Δ • σʔλΛෳϑΝΠϧʹׂ͢Δͱ͖ͷεΩʔϚׂͷنଇʹैͬͯॲཧ͢Δ • ݅ࣜͰߦΛϑΟϧλϦϯάͰ͖Δ • ݁ՌΛ࡞ΔͨΊʹඞཁͳΧϥϜͷΈΛಡΈࠐΉ • ϩʔΧϧετϨʔδʹΩϟογϡΛ࡞Δ • ඞཁʹͳΔ·ͰϑΝΠϧΛಡΈࠐ·ͳ͍ (lazy scan)
ϑΝΠϧͷൃݟ • ϕʔεσΟϨΫτϦͷҐஔͱϑΝΠϧϑΥʔϚοτΛࢦఆ ͢ΔͱɺͦͷσΟϨΫτϦҎԼʹ͋ΔରϑΝΠϧΛ͢ ͯϦετΞοϓͯ͘͠ΕΔ • αϒσΟϨΫτϦΛ࠶ؼతʹ୳͢͜ͱՄೳ • ແࢹ͢ΔϑΝΠϧ໊ͷϓϨϑΟοΫεΛࢦఆͰ͖Δ •
ରϑΝΠϧΛͯ͢ಡΈࠐΉͨΊʹඞཁͳϚʔδࡁΈͷ εΩʔϚΛ࡞ͬͯ͘ΕΔ (༧ఆ)
ϑΝΠϧͷൃݟͷྫ /data/.metadata /data/2018/12/JP/Tokyo/001.parquet /data/2018/12/JP/Tokyo/002.parquet /data/2018/12/JP/Osaka/001.parquet /data/2018/12/US/CA/001.parquet /data/2019/01/JP/Tokyo/001.parquet /data/2019/01/JP/Osaka/001.parquet /data/2019/01/US/CA/001.parquet /data/2019/01/US/NY/001.parquet
/tmp/Tokyo.parquet ↓͜ΕΒͷϑΝΠϧ͚ͩϐοΫΞοϓ͍ͨ͠
ϑΝΠϧͷൃݟͷྫ using namespace arrow; using namespace arrow::dataset; fs::Selector selector; selector.base_dir
= “/data”; selector.recursive = true; std::shared_ptr<FileSystemDataSourceDiscovery> discovery; ARROW_OK_AND_ASSIGN( discovery, FileSystemDataSourceDiscovery::Make( fs, selector, std::make_shared<dataset::ParquetFileFormat>(), FileSystemDiscoveryOptions())); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
σʔλׂͷنଇΛࢦఆ /data/2018 /data/2018/12 /data/2018/12/JP /data/2018/12/JP/Tokyo/001.parquet auto partition_scheme = schema({field(“year”, int32()),
field(“month”, int32()), field(“country”, utf8()), field(“city”, utf8())}); ASSERT_OK(discovery->SetPartitionScheme(partition_scheme)); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish()); year month country city => {“year": 2018} => {“year”: 2018, “month”: 12} => {“year”: 2018, “month”: 12, “country”: “JP”} => {“year”: 2018, “month”: 12, “country”: “JP”, “city”: “Tokyo”}
ϑΟϧλϦϯά • ݅ࣜΛͬͯߦΛϑΟϧλϦϯάͰ͖Δ • year ͕ 2019 Ͱ sales ͕
100.0 ΑΓେ͖͍ߦ͚ͩΛऔΓ ग़͢߹࣍ͷࣜΛεΩϟφʹࢦఆ͢Δ “year”_ == 2019 && “sales”_ > 100.0 • εΩʔϚׂͷنଇʹैͬͯɺ݅ʹ߹க͠ͳ͍ϑΝΠϧ ͷಡΈࠐΈΛলུ͢Δ
औΓग़͢ΧϥϜͷࢦఆ • ͯ͢ͷΧϥϜΛಡΈࠐ·ͳͯ͘ྑ͍߹ɺϓϩδΣΫ γϣϯ (ࣹӨ) ػೳΛͬͯऔΓग़͢ΧϥϜΛ੍ݶͰ͖Δ • ͜ͷػೳͰಡΈࠐΉΧϥϜΛ੍ݶ͢ΔͱɺෆཁͳΧϥϜͷ σγϦΞϥΠζͱܕม͕লུ͞ΕͯɺϑΝΠϧϑΥʔ ϚοτʹΑͬͯσʔλͷಡΈग़͕͘͠ͳΔ
σʔληοτΛ࡞ͬͯಡΈࠐΜͰ Arrow Table Λ࡞Δ·Ͱͷྫ // σʔληοτͷ࡞ ASSERT_OK_AND_ASSIGN(auto dataset, Dataset::Make({data_source}, discovery->Inspect()));
// εΩϟφϏϧμ ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan()); // ϑΟϧλͷઃఆ auto filter = (“year”_ == 2019 && “sales”_ > 100.0); ASSERT_OK(scanner_builder->Filter(filter)); // ϓϩδΣΫγϣϯͷઃఆ std::vector<std::string> columns{“item_id”, “item_name”, “sales”}; ASSERT_OK(scanner_builder->Project(columns)); // εΩϟφੜ ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish(); // σʔλΛಡΈࠐΜͰ Arrow Table Λ࡞Δ (͜͜Ͱ࣮ࡍʹϑΝΠϧ͕ಡΈࠐ·ΕΔ) ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
ෳϑΝΠϧͷฒྻಡΈࠐΈ • ϑΝΠϧ୯ҐͰಡΈࠐΈλεΫ͕࡞ΒΕɺεϨουϓʔϧ ͰλεΫ͕ฒྻ࣮ߦ͞ΕΔ • Parquet ϑΥʔϚοτͰɺ1ͭͷϑΝΠϧߦάϧʔϓ ͝ͱʹγʔέϯγϟϧʹಡΈࠐ·ΕΔ • 1ͭͷϑΝΠϧ͔Β1ͭҎ্ͷ
Arrow Record Batch ͕ੜ ͞Εͯɺ࠷ޙʹ·ͱΊͯ Arrow Table ͕ੜ͞ΕΔ
༷ʑͳϑΝΠϧϑΥʔϚοτʹରԠ͢Δ • ݱࡏෳͷ Parquet ϑΝΠϧʹׂ͞Εͨσʔληο τͷରԠΛඋத • AVRO, ORC, JSON,
CSV ͳͲͷҰൠతͳσʔλอଘ༻ͷ ϑΥʔϚοτকདྷతʹରԠ͞ΕΔ • Parquet Ҏ֎ͷϑΥʔϚοτʹରԠ͢Δ Pull Request ৗʹ welcome ͩͱࢥ͏
༷ʑͳϑΝΠϧγεςϜͷରԠ • ରԠࡁΈͷͷ • ϩʔΧϧϑΝΠϧγεςϜ • HDFS • Amazon S3
• ςετ༻ͷϞοΫϑΝΠϧγεςϜ • কདྷతʹରԠ͍ͨ͠ͷ • Google Cloud Storage • Microsoft Azure BLOB Storage
RDB ͔ΒͷಡΈࠐΈ • RDB ͷςʔϒϧΫΤϦͷ݁ՌΛσʔλιʔεͱͯ͑͠ΔΑ͏ʹ͢Δ ܭը͋Δ • ࣍ͷγεςϜ໊ࢦ͠͞Ε͍ͯΔ • SQLite3
• PostgreSQL protocol (pgsql, Vertica, Redshift) • MySQL (and MemSQL) • Microsoft SQL Server (TDS) • HiveServer2 (Hive and Impala) • ClickHouse
Apache Arrow C++ Datasets • Apache Arrow C++ Datasets ͕͋Εɺ͍Ζ͍Ζͳॴ
ʹอଘ͞Ε͍ͯΔ͍Ζ͍ΖͳϑΥʔϚοτͷσʔλΛޮ Α͘ಡΈࠐΜͰ1ͭͷ Arrow Table ʹͰ͖Δ • Arrow Table Λ࡞ͬͨ͋ͱʁ • ͞Βʹੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ • ूܭ౷ܭॲཧΛ͍ͨ͠
Arrow Table Λ࡞ͬͨ͋ͱ • ੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ => Apache Arrow C++ Query
Engine • ूܭ౷ܭॲཧΛ͍ͨ͠ => Apache Arrow C++ Data Frame
Apache Arrow C++ Query Engine • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠SQL෩ͷΫΤ ϦɺσʔλੳͰΑ͘ར༻͞ΕΔ࣌ܥྻૢ࡞ pivot ૢ࡞ͳͲΛ࣮ߦ͢ΔػೳΛఏڙ͢Δ • σʔλϕʔεΛஔ͖͑Δ͜ͱҙਤͤͣɺC++ ͷڞ༗ϥ ΠϒϥϦͱͯ͠ҰൠͷΞϓϦέʔγϣϯʹຒΊࠐΜͰΘ ΕΔ͜ͱΛఆ͍ͯ͠Δ • ·ͩ։ൃ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ
Apache Arrow C++ Data Frame • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠ɺ͍ΘΏΔ σʔλϑϨʔϜ͕උ͍͑ͯΔΑ͏ͳσʔλૢ࡞ɺੳɺू ܭͳͲͷػೳΛఏڙ͢Δ • ։ൃ·ͩ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ • pandas2 Arrow C++ Data Frame ΛόοΫΤϯυͱ ͯ͠࡞ΕΒΕΔͷ͔ͳʁ
Datasets Query Engine Data Frame ϑΝΠϧDBʹอଘ͞Εͨσʔλ ͷΞΫηε͕؆୯ʹͳΔ ϝϞϦ্ͷςʔϒϧσʔλʹର͢Δ ੳΫΤϦ͕؆୯ʹ࣮ߦͰ͖Δ ϝϞϦ্ͷςʔϒϧσʔλΛσʔλ
ϑϨʔϜͱͯ͠ར༻Ͱ͖Δ