Apache Arrow C++ Datasets
Kenta Murata
December 11, 2019
An introduction to Apache Arrow C++ Datasets.
Presented at Apache Arrow Tokyo Meetup 2019.
Transcript
Apache Arrow C++ Datasets
Kenta Murata, Speee, Inc.
2019.12.11, Apache Arrow Tokyo Meetup 2019
Kenta Murata
• Full-time OSS developer at Speee, Inc.
• CRuby committer (since 2010.02)
• Apache Arrow committer (since 2019.10)
  • 24th place (44 commits)
  • SparseTensor in Arrow C++
  • GLib and Ruby bindings, etc.
Structure of Apache Arrow C++
Base / Datasets / Query Engine / Data Frame
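Apache Arrow C++ Datasets
• Provides an API for treating one or more data sources together as a single dataset
• Absorbs the differences between various kinds of data formats
• Can unify data sources that have different schemas into one
• Can take input from multiple kinds of storage
• Support for writing files is planned for the future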
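A single table can be built from multiple data sources
[Diagram: a.parquet, b.parquet, c.csv, and d.json, together with Query 1 and Query 2, are read from Amazon S3, Amazon Redshift, and the local file system into Record Batch 1 and Record Batch 2, which form an in-memory Arrow Table]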
Reading from files
Discover → Scan → Filter & Project → Collect
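Reading from files
• Scans files to build Record Batches
• Can scan multiple files in parallel
• Discovers files under a directory on the file system according to specified rules
• Reassembles data that has been split across multiple files
• Follows the schema partitioning rule that was used when the data was split into multiple files
• Can filter rows with a condition expression
• Reads only the columns needed to build the result
• Creates a cache on local storage
• Does not read a file until it is needed (lazy scan)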
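File discovery
• Given the location of a base directory and a file format, it lists all the target files under that directory
• It can also search subdirectories recursively
• File name prefixes to ignore can be specified
• It builds the merged schema needed to read all the target files (planned)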
File discovery example
We want to pick up only the .parquet files under /data:
/data/.metadata
/data/2018/12/JP/Tokyo/001.parquet
/data/2018/12/JP/Tokyo/002.parquet
/data/2018/12/JP/Osaka/001.parquet
/data/2018/12/US/CA/001.parquet
/data/2019/01/JP/Tokyo/001.parquet
/data/2019/01/JP/Osaka/001.parquet
/data/2019/01/US/CA/001.parquet
/data/2019/01/US/NY/001.parquet
/tmp/Tokyo.parquet
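The selection above can be sketched without Arrow at all. The following is a minimal illustration of the rules described on the discovery slide (base directory, recursion, ignored file name prefixes, file format), not the Arrow API: `DiscoveryOptions` and `DiscoverFiles` are hypothetical names, and the file format is approximated by a file extension, which is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical options mirroring the slide: a base directory, a recursion
// flag, a file extension standing in for the format, and ignored prefixes.
struct DiscoveryOptions {
  std::string base_dir;
  bool recursive = true;
  std::string extension;                     // e.g. ".parquet"
  std::vector<std::string> ignore_prefixes;  // e.g. {"."}
};

static bool StartsWith(const std::string& s, const std::string& prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the subset of all_paths that the discovery rules would keep.
std::vector<std::string> DiscoverFiles(const std::vector<std::string>& all_paths,
                                       const DiscoveryOptions& options) {
  assert(!options.base_dir.empty());
  const std::string root = options.base_dir + "/";
  std::vector<std::string> selected;
  for (const std::string& path : all_paths) {
    if (!StartsWith(path, root)) continue;  // outside the base directory
    const std::string relative = path.substr(root.size());
    const std::size_t slash = relative.rfind('/');
    // Without recursion, only files directly inside base_dir qualify.
    if (!options.recursive && slash != std::string::npos) continue;
    const std::string name =
        slash == std::string::npos ? relative : relative.substr(slash + 1);
    bool ignored = false;
    for (const std::string& prefix : options.ignore_prefixes) {
      if (StartsWith(name, prefix)) ignored = true;
    }
    if (ignored || !EndsWith(name, options.extension)) continue;
    selected.push_back(path);
  }
  return selected;
}
```

With `base_dir = "/data"`, `extension = ".parquet"`, and `ignore_prefixes = {"."}`, this keeps the eight Parquet files under /data and drops /data/.metadata and /tmp/Tokyo.parquet.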
File discovery example
using namespace arrow;
using namespace arrow::dataset;

// Select everything under /data, descending into subdirectories.
fs::Selector selector;
selector.base_dir = "/data";
selector.recursive = true;

// Discover the Parquet files and finish building the data source.
std::shared_ptr<FileSystemDataSourceDiscovery> discovery;
ARROW_OK_AND_ASSIGN(
    discovery,
    FileSystemDataSourceDiscovery::Make(
        fs, selector,
        std::make_shared<dataset::ParquetFileFormat>(),
        FileSystemDiscoveryOptions()));
ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
Specifying the partitioning rule
Path components, in order: year / month / country / city
/data/2018 => {"year": 2018}
/data/2018/12 => {"year": 2018, "month": 12}
/data/2018/12/JP => {"year": 2018, "month": 12, "country": "JP"}
/data/2018/12/JP/Tokyo/001.parquet => {"year": 2018, "month": 12, "country": "JP", "city": "Tokyo"}

auto partition_scheme = schema({field("year", int32()),
                                field("month", int32()),
                                field("country", utf8()),
                                field("city", utf8())});
ASSERT_OK(discovery->SetPartitionScheme(partition_scheme));
ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
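The path-to-fields mapping shown on this slide can be illustrated with a small standalone function. This is only a sketch of the idea, not how Arrow implements it: `ParsePartition` is a hypothetical name, it expects a directory path (strip the file name first), and it keeps every value as a string instead of converting to the field types declared in the schema.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Maps the directory segments below base_dir onto the partition field names,
// in order. dir_path must be a directory path that starts with base_dir.
// Segments beyond the last field name are ignored.
std::map<std::string, std::string> ParsePartition(
    const std::string& base_dir, const std::vector<std::string>& field_names,
    const std::string& dir_path) {
  assert(dir_path.compare(0, base_dir.size(), base_dir) == 0);
  std::map<std::string, std::string> values;
  std::istringstream segments(dir_path.substr(base_dir.size()));
  std::string segment;
  std::size_t index = 0;
  while (index < field_names.size() && std::getline(segments, segment, '/')) {
    if (segment.empty()) continue;  // skip the empty piece before the first '/'
    values[field_names[index++]] = segment;
  }
  return values;
}
```

Called with base directory `/data`, the field names `year`, `month`, `country`, `city`, and the path `/data/2018/12/JP`, it returns the three entries year=2018, month=12, country=JP, matching the third line of the mapping on the slide.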
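Filtering
• Rows can be filtered with a condition expression
• To extract only the rows where year is 2019 and sales is greater than 100.0, pass the following expression to the scanner:
  "year"_ == 2019 && "sales"_ > 100.0
• Following the schema partitioning rule, files that cannot match the condition are not read at all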
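The last point, skipping files based on the partitioning rule, can be sketched as follows. This is an illustration under stated assumptions rather than Arrow's implementation: `PartitionedFile` and `PruneFiles` are hypothetical names, and the partition values are assumed to have already been extracted from each path. Only the part of the filter that refers to partition fields (here `year`) can be used for pruning; a condition on a data column such as `sales` still has to be evaluated on the rows of the files that remain.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A file together with the partition values taken from its directory names.
struct PartitionedFile {
  std::string path;
  int year;
  int month;
};

// Returns the paths of the files whose partition values satisfy the
// predicate. Files that fail it are never opened.
std::vector<std::string> PruneFiles(
    const std::vector<PartitionedFile>& files,
    const std::function<bool(const PartitionedFile&)>& partition_predicate) {
  assert(partition_predicate != nullptr);
  std::vector<std::string> to_read;
  for (const PartitionedFile& file : files) {
    if (partition_predicate(file)) to_read.push_back(file.path);
  }
  return to_read;
}
```

For the example layout, a predicate `file.year == 2019` leaves only the files under /data/2019, so the 2018 files are skipped before any I/O happens.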
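Specifying the columns to extract
• When not all columns need to be read, the projection feature can restrict which columns are extracted
• Restricting the columns this way skips deserialization and type conversion for the unneeded columns, so reading becomes faster for some file formats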
Example: from creating a dataset to reading it into an Arrow Table
// Create the dataset
ASSERT_OK_AND_ASSIGN(auto dataset,
                     Dataset::Make({data_source}, discovery->Inspect()));
// Scanner builder
ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
// Set the filter
auto filter = ("year"_ == 2019 && "sales"_ > 100.0);
ASSERT_OK(scanner_builder->Filter(filter));
// Set the projection
std::vector<std::string> columns{"item_id", "item_name", "sales"};
ASSERT_OK(scanner_builder->Project(columns));
// Create the scanner
ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
// Read the data and build an Arrow Table (the files are actually read here)
ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
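Reading multiple files in parallel
• A read task is created for each file, and the tasks run in parallel on a thread pool
• With the Parquet format, a single file is read sequentially, one row group at a time
• One or more Arrow Record Batches are produced from each file, and at the end they are gathered into an Arrow Table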
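The per-file task model described above can be sketched with the standard library. This is a simplified illustration, not Arrow's scanner: `Batch` and `ScanAll` are hypothetical names, a `Batch` is just a vector of integers standing in for an Arrow Record Batch, and `std::async` starts one thread per file instead of using a thread pool.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <utility>
#include <vector>

using Batch = std::vector<int>;  // stand-in for an Arrow Record Batch

// Starts one read task per file, then collects the batches in file order.
// read_file returns one or more batches for the given path.
std::vector<Batch> ScanAll(
    const std::vector<std::string>& paths,
    const std::function<std::vector<Batch>(const std::string&)>& read_file) {
  assert(read_file != nullptr);
  std::vector<std::future<std::vector<Batch>>> tasks;
  tasks.reserve(paths.size());
  for (const std::string& path : paths) {
    // Each file is read independently of the others.
    tasks.push_back(std::async(std::launch::async, read_file, path));
  }
  std::vector<Batch> collected;
  for (std::future<std::vector<Batch>>& task : tasks) {
    std::vector<Batch> batches = task.get();  // wait for this file
    for (Batch& batch : batches) collected.push_back(std::move(batch));
  }
  return collected;
}
```

Collecting the futures in submission order keeps the resulting batches in file order even though the reads may finish in any order.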
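Supporting various file formats
• Support for datasets split across multiple Parquet files is currently in progress
• Common data storage formats such as AVRO, ORC, JSON, and CSV will be supported in the future
• I think pull requests that add support for formats other than Parquet are always welcome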
Supporting various file systems
• Already supported
  • Local file system
  • HDFS
  • Amazon S3
  • Mock file system for testing
• To be supported in the future
  • Google Cloud Storage
  • Microsoft Azure BLOB Storage
Reading from an RDB
• There is also a plan to make RDB tables and query results usable as data sources
• The following systems have been named
  • SQLite3
  • PostgreSQL protocol (pgsql, Vertica, Redshift)
  • MySQL (and MemSQL)
  • Microsoft SQL Server (TDS)
  • HiveServer2 (Hive and Impala)
  • ClickHouse
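Apache Arrow C++ Datasets
• With Apache Arrow C++ Datasets, data in various formats stored in various places can be read efficiently into a single Arrow Table
• What comes after building the Arrow Table?
  • We want to run analytical queries on it
  • We want to do aggregation and statistical processing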
After building the Arrow Table
• We want to run analytical queries => Apache Arrow C++ Query Engine
• We want to do aggregation and statistical processing => Apache Arrow C++ Data Frame
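Apache Arrow C++ Query Engine
• Provides the ability to run SQL-like queries, as well as the time-series and pivot operations commonly used in data analysis, on in-memory Arrow Record Batches
• It is not intended to replace a database; it is expected to be embedded in ordinary applications as a C++ shared library
• Development has not started yet, but it is being discussed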
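Apache Arrow C++ Data Frame
• Provides data manipulation, analysis, aggregation, and similar features of the kind found in typical data frames, on in-memory Arrow Record Batches
• Development has not started yet, but it is being discussed
• Perhaps pandas2 will be built with Arrow C++ Data Frame as its backend?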
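Datasets / Query Engine / Data Frame
• Datasets: makes it easy to access data stored in files and databases
• Query Engine: makes it easy to run analytical queries on in-memory table data
• Data Frame: lets in-memory table data be used as a data frame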