Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Embulkに足りない5つのこと
Search
Civitaspo
December 15, 2015
Programming
7
5.1k
Embulkに足りない5つのこと
embulk meetup tokyoで話しました!
ユースケースが書かれているので是非参考にして下さい。
Civitaspo
December 15, 2015
Tweet
Share
More Decks by Civitaspo
See All by Civitaspo
生データを最速で取り込むチャレンジ ~LayerXデータ基盤成長物語 part1~ / Building a data infrastructure that captures raw data at the fastest
civitaspo
4
660
データ基盤における管理の考え方 〜dbtの極意〜:LayerXにdbtを導入するときに意識したこと
civitaspo
3
1.5k
Vertex Pipelines触ってみた / Try Vertex Pipelines
civitaspo
0
1.3k
Digdag と Embulk と Athena で作る Gunosy の ELT基盤
civitaspo
8
9.5k
Other Decks in Programming
See All in Programming
What We Can Learn From OSS
inouehi
0
420
Hanami and htmx
bkuhlmann
0
210
Compose-View Interop in Practice (mDevCamp 2024)
stewemetal
0
130
try!Swift Tokyo 2024 参加報告 LT
akidon0000
1
220
PHPはいつから死んでいるかの調査
chiroruxx
1
400
Apache Hive 4 on Treasure Data
ryukobayashi
0
310
1BRC--Nerd Sniping the Java Community
gunnarmorling
0
340
ADRを一年運用してみた/adr_after_a_year
hanhan1978
7
2.4k
MetricKitで予期せぬ終了を検知する話 / Detect unexpected termination with MetricKit
nekowen
1
190
Java 22 Overview
kishida
1
180
educure_カリキュラム生操作マニュアル.pdf
linew_official
0
790
VS Code をプロダクトにどう取り込むか
onomax
1
360
Featured
See All Featured
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
19
1.7k
Building a Modern Day E-commerce SEO Strategy
aleyda
17
6.4k
Optimising Largest Contentful Paint
csswizardry
8
2.4k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
25
2.3k
Designing for humans not robots
tammielis
248
25k
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
322
20k
The Pragmatic Product Professional
lauravandoore
25
5.8k
StorybookのUI Testing Handbookを読んだ
zakiyama
13
4.6k
Large-scale JavaScript Application Architecture
addyosmani
504
110k
Building a Scalable Design System with Sketch
lauravandoore
456
32k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
30
6k
Documentation Writing (for coders)
carmenintech
60
3.9k
Transcript
EmbulkʹΓͳ͍ 5ͭͷ͜ͱ 2015-12-15 Embulk Meetup Tokyo #2 @Civitaspo
͓͜ͱΘΓ
࡞ͬͨπʔϧͷ͕ ग़͖ͯ·͕͢
ϦϦʔε͠·ͤΜʂ
͝ΊΜͳ͍͞(><)
ཧ༝ɿ͋Δਓ͔ΒͷҰ
ࠓ͜ͷπʔϧΛ ϦϦʔε͢Δͱ ੈͷதΛࠞཚͷӔʹ ר͖ࠐΉ͜ͱ ʹͳͬͯ͠·͏ͩΖ͏ɻ
ࠞཚͷछʹͳΔͩΖ͏ػೳ ʮEmbulk Projectʹظ͢Δ͜ͱʯ ͱ͍͏߲Ͱઆ໌͠·͢
Ͱ࢝Ί·͢ʂ
EmbulkʹΓͳ͍ 5ͭͷ͜ͱ 2015-12-15 Embulk Meetup Tokyo #2 @Civitaspo
ࣗݾհ தࢁوത (@Civitaspo) • DeNA: ೖࣾ3 • ੳܥΠϯϑϥΤϯδχΞ • σʔλճऩڥߏங
• Hadoopӡ༻…etc. • Perl / Ruby / Java • ৽ଔͰձࣾʹೖͬͯॳΊͯί ϯιʔϧ։͍ͨস
࡞ͬͨEmbulk Pluginୡ embulk-input-hdfs embulk-output-hdfs embulk-output-sftp embulk-filter-join-file embulk-filter-json-key embulk-filter-expand-json embulk-filter-flatten-json embulk-filter-distinct
࣭ཁɺΞυόΠε͋ΕtwitterͰʂ
ͪͳΈʹ શ෦Java PluginͰ͕͢
Intellijͬͯॻ͍ͯ·͢ vim!! vim!!
ΞδΣϯμ • ฐࣾͷBulkdataࣄ • ߏஙͨ͠Embulkج൫ • Γͳ͔͔ͬͨΒิͬͨػೳ̑ͭ • Embulk Projectʹظ͢Δ͜ͱ
ΞδΣϯμ • ฐࣾͷBulkdataࣄ • ߏஙͨ͠Embulkج൫ • Γͳ͔͔ͬͨΒิͬͨػೳ̑ͭ • Embulk Projectʹظ͢Δ͜ͱ
ฐࣾͷBulkdataࣄ
HDFS -> Vertica
ฐࣾͷBulkdataࣄ • ݎ࿚ͳϩάճऩج൫ • શͯͷαʔϏεͷϩάHDFSʹ֨ೲ͞ΕΔ • ϑΥʔϚοτ౷Ұ͞Ε͍ͯΔ • TSV +
JSONͷࠞ߹ϑΥʔϚοτ
ฐࣾͷBulkdataࣄ • ੳͰ༻͢ΔओཁετϨʔδVertica • HDFSͷϩάΛͦͷ··Verticaʹ֨ೲ͍ͨ͠ • ͨͩ͠streaming insertͰ͖ͳ͍… • Verticaߴසͳσʔλೖʹඇৗʹ
ऑ͍
Vertica? •ྻࢦܕߴूܭσʔλϕʔε •༻ιϑτΣΞϥΠηϯε(1TB·Ͱແྉ) •ඇৗʹ๛ͳੳؔ •twitterfacebookͰΘΕͯΔ •ฐࣾͷੳڥͷओ࣠
ߴ·Δχʔζ • GCS, S3, BigQueryͳͲcloud storageͷར༻֦େ • HDFS -> Vertica
͚ͩͰͳ͍Bulkloadχʔζ
None
Embulk
͜Ε͕΄͔ͬͨ͠Μʂ
ͬͦ͘͞ಋೖ
ΞδΣϯμ • ฐࣾͷBulkdataࣄ • ߏஙͨ͠Embulkج൫ • Γͳ͔͔ͬͨΒิͬͨػೳ̑ͭ • Embulk Projectʹظ͢Δ͜ͱ
Wrapper ߏஙͨ͠Embulkج൫
σʔλ֨ೲ·ͰͷྲྀΕ
Wrapper 1. GoogleSpreadSheet͔Β εΩʔϚใͱBulkloadλΠϓΛநग़
Wrapper 2. Ϋϥελใεέδϡʔϧใͱ ඥ͚ͯMySQLʹ֨ೲ
Wrapper 3. εέδϡʔϧʹैͬͯRedis Enqueue
Wrapper 4. ඥͮ͘BulkloadλΠϓɺΫϥελ ใ͔Βconfig.ymlΛࣗಈੜ
Wrapper 5. Embulk WrapperΛkick
Wrapper 6. ֨ೲઌͷσʔλΛআ
Wrapper 7. Embulk run
Wrapper 8. ॲཧ݁ՌΛMySQLʹอଘ
Wrapper 9. ݁Ռදࣔ
ৄࡉͳ࣮ʹؔͯ͠ ؾʹͳΔํ࠙ձͰʂ
ࠓ͢෦
Wrapper ࠓ͢෦
Wrapper ࠓ͢෦ ͳΜͰspreadsheetͬͯΔͷʁ
Wrapper ࠓ͢෦ Job QueueཧͲ͏ͬͯΔͷʁ
Wrapper ࠓ͢෦ ͳΜͰwrapper͔·ͯ͠Δͷʁ
Wrapper ࠓ͢෦ embulkʹͳ͔ͬͨͷʁ
ͳͲʹ͍ͭͯ͠·͢
ΞδΣϯμ • ฐࣾͷBulkdataࣄ • ߏஙͨ͠Embulkج൫ • Γͳ͔͔ͬͨΒิͬͨػೳ̑ͭ • Embulk Projectʹظ͢Δ͜ͱ
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱ̍
YAMLཧ
YAMLཧ • Embulkͷ༷͚ͩͲɻɻɻ • 1ͭͷBulkloadʹରͯ͠1ͭͷYAMLϑΝΠϧ
ฐࣾͷࣄ • εΩʔϚใΞφϦετ͕ఆٛ • εΩʔϚใҎ֎ͷઃఆੳج൫͕ཧ => ͻͱͭͷconfig.ymlੜʹෳਓ͕ؔΘΔ͜ͱʹ…
ฐࣾͷࣄ • ରͷBulkload1αʔϏε͋ͨΓ20~50 • ৗʹ20Ҏ্ͷαʔϏε͕ฒྻͰՔಇ͠ɺҠΓม ΘΓܹ͍͠ => େྔͷyaml͕ੜ͞Εͯཧ͕ͭΒ͍(ଓใͱ͔)
None
ج൫ཧ ΞφϦετཧ
None
͘Β͍ʹͳͬͨ
Α͘มΘΔ ΄ͱΜͲมΘΒͳ͍
None
݁ہΑ͘มΘΔͷ εΩʔϚใ͘Β͍
ͭͬͨ͘ • Embulk Config Generator & UI • ଓใΛDBͰҰݩཧ •
BulkloadύλʔϯΛநԽͯ͠DBͰཧ • εΩʔϚใspreadsheetͰཧ
ͭͬͨ͘ • config.yml࣮ߦલʹੜ͞ΕΔͷͰ༷ม ߋͳͲʹॊೈʹରԠͰ͖Δ • ೦ͳ͕Βguess͑ͳ͘ͳΓ·ͨ͠ɻ • ಛʹࠔͬͯͳ͍ • ϩάΛఆٛͨ͠ਓ͕εΩʔϚΛఆٛͯ͠Δ
Embulk Projectʹظ͢Δ͜ͱ • େنར༻ʹͱͬͯͷyaml • fileͰશͯΛཧ͢ΔͷͭΒ͍ • ಉ͡yamlͰཧऀ͕ҟͳΔ • ಉ͡yamlͰมߋස͕ҟͳΔ
Embulk Projectʹظ͢Δ͜ͱ • templateػೳ͚ͩͰͳ͘ɺΑΓଟ࣍ݩʹཧͰ ͖ΔΈ͕ඞཁͳͷͰʁ • ҰํͰɺ݁ہɺۀϑϩʔʹΑΔͷͰΈΜͳࣗ ࡞͢Δͷ͔͠Εͳ͍ͱࢥͬͯΔ
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱ2
୯ମॲཧ ϑϨʔϜϫʔΫ
EmbulkBulkload͢Δ͚ͩ • ઃఆ௨Γʹɺ͋Δσʔλιʔε͔Βɺ͋Δσʔ λιʔεσʔλΛϩʔυ͢Δ • Input / Outputͷঢ়ଶʹڵຯ͕ແ͍
͔ͩΒEmbulkͷ୲อ͢Δႈੑ ͜͏ͳΔ • 1ճ࣮ߦ͠Α͏ͱͨ͠Bulkloadʹର͢ΔႈੑΛ ୲อ͠Α͏ͱ͢ΔʢPlugin࣍ୈʣ • વ͚ͩͲɺ֨ೲઌͷσʔλͷ߹ੑΛอͭ ͷͰͳ͍
ฐࣾͷࣄ • HDFS͔ΒVerticaͷ֨ೲৗʹΧϥϜΛߜΔʢ͓ ۚΣ…ʣ • ෳࡶͳKPI͕ݟͨ͘ͳͬͨ࣌ʹΧϥϜՃΛߦ͏ • HDFS্ͷσʔλ͕ඞཁʹͳͬͨλΠϛϯάͰ σʔλͷ࠶ϩʔυ͕ൃੜ =>
ࣄલʹ֨ೲൣғͷσʔλআ͕ඞཁ
ฐࣾͷࣄ • File֨ೲ͕͍ྃͯ͠Δ͔Ͳ͏͔Λ `_SUCCESS` ͱ͍͏ϑΝΠϧΛಉ֊ʹஔ͘͜ͱͰཧ • `_SUCCESS` ͕ͳ͚Εॲཧதͷσʔλͱ ೝࣝ͞ΕΔ =>
ࣄલʹϑΝΠϧͷଘࡏ֬ೝͱࣄޙʹϑΝΠϧͷ put͕ඞཁ
ͱ͍͏͔ • ձࣾʹΑͬͯϧʔϧҧ͑Ͳɺࣄલॲཧࣄޙॲ ཧઈର͍Δͣ
ͭͬͨ͘ • Embulk Wrapper • EmbulkʹΓͳ͍ࣄલɾࣄޙͷ୯ମॲཧΛ αϙʔτ͢ΔWrapper • YAMLͷઃఆϑΝΠϧʹج͍࣮ͮͯߦ͞ΕΔ •
action × storage ͱ͍͏୯ҐͰpluginΛॻ͘
ͪΐͬͱ͚ͩ Embulk Wrapperհ આ໌༻ͷYaml͕ ؒԆͼ͢ΔͷͰ ΞϯΧʔ͍·͢ɻ
actionͱ͍ actionͱ͍͏୯ҐͰ ॲཧΛఆٛ
actionΛarrayͰఆٛ͠ γʔέϯγϟϧʹ࣮ߦ
embulkͷ࣮ߦ
࡞ͬͨActions • vertica#delete • vertica#check_null • vertica#glance (νϥݟ) • s3#poll
• s3#remove • ….
Embulk Projectʹظ͢Δ͜ͱ • Embulkʹ୯ମॲཧϑϨʔϜϫʔΫඞཁෆՄܽ • ґଘؔͷͳ͍୯ମॲཧΛߦ͏ϓϥάΠϯػߏ ͕Embulkʹଂ͞ΕΕɺEmbulkͷΈͰߦ͑Δ ͜ͱͷ෯͕͔ͳΓ͕ΔͷͰͳ͍ͩΖ͏͔ʁ
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱ3
δϣϒΩϡʔ
Embullk ͷ࣮ߦΛ Ͳ͏ཧ͠Α͏͔ • େنར༻͢ΔͳΒδϣϒཧͷΈ͕ඞਢ • વ͚ͩͲ Embulk ʹͦΜͳػೳͳ͍ •
ࣗͰ࡞Δ͔طଘͷผͷԿ͔Λ͏͔͠ແ͍
݁ہɺͭͬͨ͘ • Redis/Sidekiq ΛͬͨfifoͳδϣϒΩϡʔ • εέδϡʔϧ࣮ߦɺεέδϡʔϥ͕࣌ؒʹͳΔ ͱ job Λ enqueue
• Adhoc࣮ߦɺϢʔβʔ͕δϣϒ࣮ߦϘλϯΛԡ ͢ͱ job Λ enqueue => ͍͢͝ී௨ͷδϣϒΩϡʔ͆
Workerαʔόʔ(Embulk࣮ߦڥ) • 1αʔόʔ͋ͨΓɺEmbulk 1 process • EmbulkCPUΛ͑Δ΄ͲߴʹͳΔͨΊ • 24 CPU,
memory 60 GB αʔόʔ 6Ͱฒྻ࣮ߦ
ࠓͷͱ͜Ζࠔͬͯͳ͍ͷ͕ͩ • ഉଞॲཧ͕ͪΌΜͱͰ͖͍ͯͳ͍ • ༏ઌॲཧ͕ग़͖ͯͦ͏ • graceful restartͰ͖ͯͳ͍ • …
=> Ұॠߟ͑Δ͚ͩͰग़͖ͯͦ͏ͳ͕͍ͬͺ͍
͜ΕྲྀੴʹEmbulkʹ ظͯ͠ͳ͍ • ॻ͍͚ͨͲEmbulkʹΓͳ͍͜ͱͱ͍͏ΑΓ BatchڥʹΓͳ͍͜ͱ͔ͬͯΜ͡Ͱͨ͠ • ͱ͍͑ɺEmbulkΛར༻͢Δ্Ͱඞཁͳ͜ͱͳ ͷͰϕετϓϥΫςΟε୳ͯ͠·͢
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱ4
ฒྻ੍ޚ
Embulkͷฒྻ੍ޚ • ݱঢ়LocalExecutorInputͷϑΝΠϧͰશମͷ ฒྻΛ੍ޚ͍ͯ͠Δ • FileInputPluginܧঝͰͳ͚Εฒྻجຊతʹ 1ͭ
ฐࣾͷࣄ • HDFSʹ֨ೲ͞Ε͍ͯΔϑΝΠϧ1ϑΝΠϧ͕ ڊେ • namenodeͷϝλใΛۃྗগͳ͘͢ΔͨΊ • ϑΝΠϧ͕͔Ε͍ͯͳ͍ͨΊɺϚϧνεϨου Ͱಈ͔ͣɺCPUΛશવ͑ͳ͍
ͭͬͨ͘ • embulk-input-hdfs • ಉ͡ϑΝΠϧͷinput streamΛฒྻੜ ͠ɺඞཁൣғͷΈread͢ΔΈ • configʹॻ͔ΕͨʹϑΝΠϧΛׂ •
ҙͷฒྻͰembulk࣮ߦ͕Մೳʹʂ
͜ΕͰCPU ϑϧʹ͑Δͧʂ
ͱࢥͬͨΒ͕ • OutputͰ͋ΔVertica • Session͕૿͑͗͢ΔͱVertica͕ෆ҆ఆʹ • ࠷େ20͘Β͍ʹߜͬͯ͘ΕͱVerticaνʔϜ ͔Βґཔ • InputͰฒྻنఆ͞ΕΔͷʹͲ͏͢Ε͑͑Μ
Ͱ
ͭͬͨ͘ • embulk-output-vertica • ॲཧલʹσʔλ֨ೲ༻ͷthreadΛผ్Δ • ֤εϨουσʔλ֨ೲ༻ͷthreadʹpageΛ enqueue • σʔλ֨ೲ༻threadpageΛdequeue͠ίϐʔͯ͠
͍͘ • ͜ΕͰ1ճͷίϐʔͰ1session͔͠ΘͣʹࡁΉͧʂ
Embulk Projectʹظ͢Δ͜ͱ • ฒྻ੍ޚͬͺΓExecutorʹͬͯ΄͍͠ • input / filters / outputόϥόϥʹઃఆ͍ͨ͠
• https://github.com/embulk/embulk/issues/ 232 • ϑΝΠϧׂͷAPIͱ͔FileInputPluginͷAPIͱ͠ ͍͍ͯ͋ͬͯΜ͡Όͳ͍͔͠Β
Γͳ͍͜ͱϥΠϯφοϓ 1. YAMLཧ 2. ୯ମॲཧϑϨʔϜϫʔΫ 3. δϣϒΩϡʔ 4. ฒྻ੍ޚ 5.
ଓ͖͔Β࣮ߦ
Γͳ͍͜ͱ5
ଓ͖͔Β࣮ߦ
Embulkͷػೳͱͯ͠ͷ ʮଓ͖͔Β࣮ߦʯ • Embulk࣮ߦ࣌ʹɺ࣍ʹEmbulkΛ࣮ߦ͢Δ࣌ʹ ༻͢ΔconfigϑΝΠϧΛੜ͓ͯ͘͠ • Input Plugin୯ମͷػೳͱͯ͠ఏڙ͞Ε͍ͯΔ embulk run
/path/to/next-config.yml \ -o /path/to/next-config.yml
ฐࣾͷࣄ • HDFS্ͷϑΝΠϧʹσʔλ͕ه͞Εͯߦ͘͜ ͱ͕ଟ͍ • ϑΝΠϧύεมΘΒͳ͍ɺͰɺඞཁͳσʔ λ͕૿͑ͯΔ • σʔλͷ༰͔ΒFilter͢Δ͔Ͳ͏͔அ͢ Δඞཁ͕͋Δ
Γ͔ͨͬͨ͜ͱ • Outputઌ(Vertica) ֨ೲ͞Ε͍ͯΔσʔλΛ֬ ೝ͠ɺInputσʔλΛFiltering͢Δ • SELECT max(id) FROM table;
ͷ݁Ռ͔Β • embulk-filter-row ͷconditionੜ
ͭͬͯ͘ͳ͍ • ୯ମॲཧϑϨʔϜϫʔΫͱҧ͍ɺॲཧؒͷґ ଘ͕ؔଘࡏ͢ΔͨΊɺͬ͘͞ͱͰ͖·ͤΜ Ͱͨ͠ٽ • Embulk WrapperͰରൣғΛࣄલʹফͯ͠શͯ ϩʔυ͢Δͱ͍͏ྗٕͰରԠͨ͠
Embulk Projectʹظ͢Δ͜ͱ • ΄Μͱ͏ͷҙຯͰ߹ੑΛอͭʹOutputଆͷ σʔλΛݟʹߦ͘ඞཁ͕͋Δ • ࣮ࡍʹॻ͜͏ͱ͢ΔͱEmbulkͷྖҬΛ͑Δ • ࡢฉ͍͚ͨͲʮdigdagʯͱ͍͏πʔϧ͕ग़ΔΒ ͍͠ΐ
• https://github.com/treasure-data/digdag-docs
·ͱΊ
·ͱΊ • Embulkಋೖͯ͠ϝονϟḿͬͯΔ • ৽͍͠σʔλιʔε͕ग़͖ͯͯίετ͘ ಋೖͰ͖Δʂ
·ͱΊ • ͨͩɺ࣮ࡍproductionӡ༻͠Α͏ͱ͢Δͱࡉ͔͍ ͜ͱ͕ؾʹͳͬͯ͘Δ • YAMLͷͱ͔ • δϣϒΩϡʔͷͱ͔ • લॲཧɺޙॲཧͲ͏͢Δͷͱ͔
·ͱΊ • ࣮ࡍɺEmbulk͚ͩͰղܾ͢ΔΑ͏ͳͰͳ͍ • ΈΜͳͰݟͨΊͯϕετϓϥΫςΟε୳͍ͯ͠ ͖·͠ΐ͏ • AdventCalendarۭ͍ͯΔͷͰॻ͖·͠ΐ͏
͝ਗ਼ௌ͋Γ͕ͱ͏ ͍͟͝·ͨ͠ʂ