Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Apache Arrow C++ Datasets
Search
Kenta Murata
December 11, 2019
Technology
4
1.8k
Apache Arrow C++ Datasets
Introduce Apache Arrow C++ Datasets.
Presented Apache Arrow Tokyo Meetup 2019.
Kenta Murata
December 11, 2019
Tweet
Share
More Decks by Kenta Murata
See All by Kenta Murata
waitany と waitall を作った話
mrkn
0
300
HolidayJp.jl を作りました
mrkn
0
340
Calling Julia functions from Streamlit applications
mrkn
1
560
Red Data Tools で切り開く Ruby の未来
mrkn
3
1.3k
Method-based JIT compilation by transpiling to Julia
mrkn
0
8.6k
Reducing ActiveRecord memory consumption using Apache Arrow
mrkn
0
1.9k
RubyData and Rails
mrkn
0
3.4k
Tensor and Arrow
mrkn
0
1.1k
RubyData Current and Future
mrkn
1
3.7k
Other Decks in Technology
See All in Technology
Lambda Web AdapterでLambdaをWEBフレームワーク利用する
sahou909
0
120
猫でもわかるKiro CLI(AI 駆動開発への道編)
kentapapa
0
200
非情報系研究者へ送る Transformer入門
rishiyama
11
7.5k
[JAWSDAYS2026][D8]その起票、愛が足りてますか?AWSサポートを味方につける、技術的「ラブレター」の書き方
hirosys_
3
180
情シスのための生成AI実践ガイド2026 / Generative AI Practical Guide for Business Technology 2026
glidenote
0
240
楽しく学ぼう!ネットワーク入門
shotashiratori
1
330
進化するBits AI SREと私と組織
nulabinc
PRO
0
160
ナレッジワークのご紹介(第88回情報処理学会 )
kworkdev
PRO
0
210
Sansanでの認証基盤内製化と移行
sansantech
PRO
0
400
内製AIチャットボットで学んだDatadog LLM Observability活用術
mkdev10
0
100
AWSの資格って役に立つの?
tk3fftk
2
330
VPCエンドポイント意外とお金かかるなぁ。せや、共有したろ!
tommy0124
1
560
Featured
See All Featured
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
84
A Tale of Four Properties
chriscoyier
163
24k
Optimising Largest Contentful Paint
csswizardry
37
3.6k
Building an army of robots
kneath
306
46k
Why Our Code Smells
bkeepers
PRO
340
58k
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
770
We Are The Robots
honzajavorek
0
200
WENDY [Excerpt]
tessaabrams
9
36k
Getting science done with accelerated Python computing platforms
jacobtomlinson
2
140
Testing 201, or: Great Expectations
jmmastey
46
8.1k
Max Prin - Stacking Signals: How International SEO Comes Together (And Falls Apart)
techseoconnect
PRO
0
120
Leveraging Curiosity to Care for An Aging Population
cassininazir
1
190
Transcript
Apache Arrow C++ Datasets Kenta Murata Speee, Inc. 2019.12.11 Apache
Arrow Tokyo Meetup 2019
Kenta Murata • Fulltime OSS developer at Speee, Inc. •
CRuby committer (as of 2010.02) • Apache Arrow committer (as of 2019.10) • The 24th place (44 commits) • SparseTensor in Arrow C++ • GLib and Ruby binding, etc.
Apache Arrow C++ ͷߏ Base Datasets Query Engine Data Frame
Apache Arrow C++ Datasets • 1ͭҎ্ͷσʔλιʔεΛ·ͱΊͯ1ͭͷσʔληοτͱ ͯ͠ѻ͏ͨΊͷ API Λఏڙ͢Δ •
༷ʑͳछྨͷσʔλϑΥʔϚοτͷҧ͍Λٵऩ͢Δ • ҟͳΔεΩʔϚͷσʔλιʔεΛ1ͭʹ౷߹Ͱ͖Δ • ෳछྨͷετϨʔδ͔ΒͷσʔλೖྗʹରԠͰ͖Δ • কདྷతʹϑΝΠϧͷॻ͖ग़͠ʹରԠ͢Δ༧ఆ
ෳͷσʔλιʔε͔Β1ͭͷςʔϒϧΛ࡞ΕΔ a.parquet b.parquet Query 1 Query 2 c.csv d.json Record
Batch 1 Record Batch 2 Amazon S3 Amazon Redshift Local File System In-Memory Arrow Table
ϑΝΠϧ͔ΒͷಡΈࠐΈ Discover Scan Filter & Project Collect
ϑΝΠϧ͔ΒͷಡΈࠐΈ • ϑΝΠϧΛεΩϟϯͯ͠ Record Batch Λ࡞Δ • ෳϑΝΠϧΛฒྻεΩϟϯͰ͖Δ • ϑΝΠϧγεςϜ্ͷσΟϨΫτϦ͔Βࢦఆͨ͠ϧʔϧʹج͍ͮͯϑΝΠϧΛൃݟ͢Δ
• ෳͷϑΝΠϧʹׂ͞ΕͨσʔλΛ࠶ߏ͢Δ • σʔλΛෳϑΝΠϧʹׂ͢Δͱ͖ͷεΩʔϚׂͷنଇʹैͬͯॲཧ͢Δ • ݅ࣜͰߦΛϑΟϧλϦϯάͰ͖Δ • ݁ՌΛ࡞ΔͨΊʹඞཁͳΧϥϜͷΈΛಡΈࠐΉ • ϩʔΧϧετϨʔδʹΩϟογϡΛ࡞Δ • ඞཁʹͳΔ·ͰϑΝΠϧΛಡΈࠐ·ͳ͍ (lazy scan)
ϑΝΠϧͷൃݟ • ϕʔεσΟϨΫτϦͷҐஔͱϑΝΠϧϑΥʔϚοτΛࢦఆ ͢ΔͱɺͦͷσΟϨΫτϦҎԼʹ͋ΔରϑΝΠϧΛ͢ ͯϦετΞοϓͯ͘͠ΕΔ • αϒσΟϨΫτϦΛ࠶ؼతʹ୳͢͜ͱՄೳ • ແࢹ͢ΔϑΝΠϧ໊ͷϓϨϑΟοΫεΛࢦఆͰ͖Δ •
ରϑΝΠϧΛͯ͢ಡΈࠐΉͨΊʹඞཁͳϚʔδࡁΈͷ εΩʔϚΛ࡞ͬͯ͘ΕΔ (༧ఆ)
ϑΝΠϧͷൃݟͷྫ /data/.metadata /data/2018/12/JP/Tokyo/001.parquet /data/2018/12/JP/Tokyo/002.parquet /data/2018/12/JP/Osaka/001.parquet /data/2018/12/US/CA/001.parquet /data/2019/01/JP/Tokyo/001.parquet /data/2019/01/JP/Osaka/001.parquet /data/2019/01/US/CA/001.parquet /data/2019/01/US/NY/001.parquet
/tmp/Tokyo.parquet ↓͜ΕΒͷϑΝΠϧ͚ͩϐοΫΞοϓ͍ͨ͠
ϑΝΠϧͷൃݟͷྫ using namespace arrow; using namespace arrow::dataset; fs::Selector selector; selector.base_dir
= “/data”; selector.recursive = true; std::shared_ptr<FileSystemDataSourceDiscovery> discovery; ARROW_OK_AND_ASSIGN( discovery, FileSystemDataSourceDiscovery::Make( fs, selector, std::make_shared<dataset::ParquetFileFormat>(), FileSystemDiscoveryOptions())); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish());
σʔλׂͷنଇΛࢦఆ /data/2018 /data/2018/12 /data/2018/12/JP /data/2018/12/JP/Tokyo/001.parquet auto partition_scheme = schema({field(“year”, int32()),
field(“month”, int32()), field(“country”, utf8()), field(“city”, utf8())}); ASSERT_OK(discovery->SetPartitionScheme(partition_scheme)); ARROW_OK_AND_ASSIGN(auto datasource, discovery->Finish()); year month country city => {“year": 2018} => {“year”: 2018, “month”: 12} => {“year”: 2018, “month”: 12, “country”: “JP”} => {“year”: 2018, “month”: 12, “country”: “JP”, “city”: “Tokyo”}
ϑΟϧλϦϯά • ݅ࣜΛͬͯߦΛϑΟϧλϦϯάͰ͖Δ • year ͕ 2019 Ͱ sales ͕
100.0 ΑΓେ͖͍ߦ͚ͩΛऔΓ ग़͢߹࣍ͷࣜΛεΩϟφʹࢦఆ͢Δ “year”_ == 2019 && “sales”_ > 100.0 • εΩʔϚׂͷنଇʹैͬͯɺ݅ʹ߹க͠ͳ͍ϑΝΠϧ ͷಡΈࠐΈΛলུ͢Δ
औΓग़͢ΧϥϜͷࢦఆ • ͯ͢ͷΧϥϜΛಡΈࠐ·ͳͯ͘ྑ͍߹ɺϓϩδΣΫ γϣϯ (ࣹӨ) ػೳΛͬͯऔΓग़͢ΧϥϜΛ੍ݶͰ͖Δ • ͜ͷػೳͰಡΈࠐΉΧϥϜΛ੍ݶ͢ΔͱɺෆཁͳΧϥϜͷ σγϦΞϥΠζͱܕม͕লུ͞ΕͯɺϑΝΠϧϑΥʔ ϚοτʹΑͬͯσʔλͷಡΈग़͕͘͠ͳΔ
σʔληοτΛ࡞ͬͯಡΈࠐΜͰ Arrow Table Λ࡞Δ·Ͱͷྫ // σʔληοτͷ࡞ ASSERT_OK_AND_ASSIGN(auto dataset, Dataset::Make({data_source}, discovery->Inspect()));
// εΩϟφϏϧμ ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan()); // ϑΟϧλͷઃఆ auto filter = (“year”_ == 2019 && “sales”_ > 100.0); ASSERT_OK(scanner_builder->Filter(filter)); // ϓϩδΣΫγϣϯͷઃఆ std::vector<std::string> columns{“item_id”, “item_name”, “sales”}; ASSERT_OK(scanner_builder->Project(columns)); // εΩϟφੜ ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish(); // σʔλΛಡΈࠐΜͰ Arrow Table Λ࡞Δ (͜͜Ͱ࣮ࡍʹϑΝΠϧ͕ಡΈࠐ·ΕΔ) ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable());
ෳϑΝΠϧͷฒྻಡΈࠐΈ • ϑΝΠϧ୯ҐͰಡΈࠐΈλεΫ͕࡞ΒΕɺεϨουϓʔϧ ͰλεΫ͕ฒྻ࣮ߦ͞ΕΔ • Parquet ϑΥʔϚοτͰɺ1ͭͷϑΝΠϧߦάϧʔϓ ͝ͱʹγʔέϯγϟϧʹಡΈࠐ·ΕΔ • 1ͭͷϑΝΠϧ͔Β1ͭҎ্ͷ
Arrow Record Batch ͕ੜ ͞Εͯɺ࠷ޙʹ·ͱΊͯ Arrow Table ͕ੜ͞ΕΔ
༷ʑͳϑΝΠϧϑΥʔϚοτʹରԠ͢Δ • ݱࡏෳͷ Parquet ϑΝΠϧʹׂ͞Εͨσʔληο τͷରԠΛඋத • AVRO, ORC, JSON,
CSV ͳͲͷҰൠతͳσʔλอଘ༻ͷ ϑΥʔϚοτকདྷతʹରԠ͞ΕΔ • Parquet Ҏ֎ͷϑΥʔϚοτʹରԠ͢Δ Pull Request ৗʹ welcome ͩͱࢥ͏
༷ʑͳϑΝΠϧγεςϜͷରԠ • ରԠࡁΈͷͷ • ϩʔΧϧϑΝΠϧγεςϜ • HDFS • Amazon S3
• ςετ༻ͷϞοΫϑΝΠϧγεςϜ • কདྷతʹରԠ͍ͨ͠ͷ • Google Cloud Storage • Microsoft Azure BLOB Storage
RDB ͔ΒͷಡΈࠐΈ • RDB ͷςʔϒϧΫΤϦͷ݁ՌΛσʔλιʔεͱͯ͑͠ΔΑ͏ʹ͢Δ ܭը͋Δ • ࣍ͷγεςϜ໊ࢦ͠͞Ε͍ͯΔ • SQLite3
• PostgreSQL protocol (pgsql, Vertica, Redshift) • MySQL (and MemSQL) • Microsoft SQL Server (TDS) • HiveServer2 (Hive and Impala) • ClickHouse
Apache Arrow C++ Datasets • Apache Arrow C++ Datasets ͕͋Εɺ͍Ζ͍Ζͳॴ
ʹอଘ͞Ε͍ͯΔ͍Ζ͍ΖͳϑΥʔϚοτͷσʔλΛޮ Α͘ಡΈࠐΜͰ1ͭͷ Arrow Table ʹͰ͖Δ • Arrow Table Λ࡞ͬͨ͋ͱʁ • ͞Βʹੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ • ूܭ౷ܭॲཧΛ͍ͨ͠
Arrow Table Λ࡞ͬͨ͋ͱ • ੳ༻ͷΫΤϦΛ࣮ߦ͍ͨ͠ => Apache Arrow C++ Query
Engine • ूܭ౷ܭॲཧΛ͍ͨ͠ => Apache Arrow C++ Data Frame
Apache Arrow C++ Query Engine • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠SQL෩ͷΫΤ ϦɺσʔλੳͰΑ͘ར༻͞ΕΔ࣌ܥྻૢ࡞ pivot ૢ࡞ͳͲΛ࣮ߦ͢ΔػೳΛఏڙ͢Δ • σʔλϕʔεΛஔ͖͑Δ͜ͱҙਤͤͣɺC++ ͷڞ༗ϥ ΠϒϥϦͱͯ͠ҰൠͷΞϓϦέʔγϣϯʹຒΊࠐΜͰΘ ΕΔ͜ͱΛఆ͍ͯ͠Δ • ·ͩ։ൃ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ
Apache Arrow C++ Data Frame • ϝϞϦ্ͷ Arrow Record Batch
ʹରͯ͠ɺ͍ΘΏΔ σʔλϑϨʔϜ͕උ͍͑ͯΔΑ͏ͳσʔλૢ࡞ɺੳɺू ܭͳͲͷػೳΛఏڙ͢Δ • ։ൃ·ͩ࢝·͍ͬͯͳ͍͕ٞ͞Ε͍ͯΔ • pandas2 Arrow C++ Data Frame ΛόοΫΤϯυͱ ͯ͠࡞ΕΒΕΔͷ͔ͳʁ
Datasets Query Engine Data Frame ϑΝΠϧDBʹอଘ͞Εͨσʔλ ͷΞΫηε͕؆୯ʹͳΔ ϝϞϦ্ͷςʔϒϧσʔλʹର͢Δ ੳΫΤϦ͕؆୯ʹ࣮ߦͰ͖Δ ϝϞϦ্ͷςʔϒϧσʔλΛσʔλ
ϑϨʔϜͱͯ͠ར༻Ͱ͖Δ