Spark shuffle study session
huydx
March 18, 2016
Programming
Transcript
A look around Spark's shuffle @huydx
About the shuffle
• The Map-Reduce model
• The intermediate layer between the map stage and the reduce stage is called the "shuffle"
In the shuffle
• Spark uses a pull model (map results are first written to disk, then the reduce job comes to fetch them)
• In Spark, the data a reduce job needs must fit in memory
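The pull model described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `PullShuffleSketch`, its methods, the file naming scheme, and the use of `hashCode` modulo the reducer count are all hypothetical and are not Spark APIs. Map tasks first write one bucket per reducer to disk, and each reducer then pulls its bucket from every map output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pull-style shuffle; none of these names are Spark APIs.
class PullShuffleSketch {
    // Map side: split one map task's records by reducer and write each bucket to disk.
    static void writeMapOutput(Path dir, int mapId, List<String> keys, int numReducers)
            throws IOException {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            // The same key always lands in the same reducer's bucket.
            buckets.get(Math.floorMod(key.hashCode(), numReducers)).add(key);
        }
        for (int r = 0; r < numReducers; r++) {
            Files.write(dir.resolve("map" + mapId + "_reduce" + r + ".txt"), buckets.get(r));
        }
    }

    // Reduce side: pull this reducer's bucket from every map task's output.
    static List<String> fetchForReducer(Path dir, int numMaps, int reduceId)
            throws IOException {
        List<String> pulled = new ArrayList<>();
        for (int m = 0; m < numMaps; m++) {
            Path file = dir.resolve("map" + m + "_reduce" + reduceId + ".txt");
            pulled.addAll(Files.readAllLines(file));
        }
        return pulled;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("shuffle-sketch");
        writeMapOutput(dir, 0, List.of("a", "b", "a"), 2);
        writeMapOutput(dir, 1, List.of("b", "c"), 2);
        for (int r = 0; r < 2; r++) {
            System.out.println("reducer " + r + " pulled " + fetchForReducer(dir, 2, r));
        }
    }
}
```

Note that the reducer collects everything it pulled into an in-memory list, which mirrors the constraint that a reduce job's data has to fit in memory.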
When does a shuffle occur?
• Join
• Cogroup
• *ByKey operations
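Problems with the shuffle
• Number of shuffle files
• With M map tasks and R reduce tasks, M * R files are written to disk (M = 5000, R = 1024 gives about 5 million files!)
• Reducing requires a sort algorithm
• The sort has to be done in parallel
• Communication is heavy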
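The file-count figure can be checked with a line of arithmetic. The class and method names below are hypothetical and exist only for this worked example.

```java
// Hypothetical helper for the worked example; not part of Spark.
class ShuffleFileCount {
    // One file per (map task, reduce task) pair.
    static long filesPerShuffle(long numMaps, long numReducers) {
        return numMaps * numReducers;
    }

    public static void main(String[] args) {
        // M = 5000, R = 1024, as on the slide.
        System.out.println(filesPerShuffle(5000, 1024)); // prints 5120000
    }
}
```

So the "about 5 million" on the slide is 5,120,000 files for a single shuffle.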
Solving the shuffle problems
• Number of shuffle files:
• Can be held down to O(R) instead of O(M * R)
• Sort-based shuffle instead of hash-based shuffle (one file per R)
• Reference: https://issues.apache.org/jira/secure/attachment/12637642/Consolidating%20Shuffle%20Files%20in%20Spark.pdf
• https://issues.apache.org/jira/browse/SPARK-2045
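To illustrate the general idea behind a sort-based layout, here is a minimal in-memory sketch. It assumes a layout in which a map task sorts its records by target reducer, writes them into a single blob, and keeps an index of offsets so that each reducer can read only its own segment. The class `SortShuffleSketch`, the `MapOutput` type, and the newline-delimited format are all hypothetical and do not reproduce Spark's actual file format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the layout and names are illustrative, not Spark's file format.
class SortShuffleSketch {
    // One map task's output: a single data blob plus R + 1 offsets into it.
    static final class MapOutput {
        final byte[] data;
        final int[] offsets;

        MapOutput(byte[] data, int[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }

        // The segment for reducer r is data[offsets[r], offsets[r + 1]).
        String segmentFor(int reduceId) {
            int start = offsets[reduceId];
            int length = offsets[reduceId + 1] - start;
            return new String(data, start, length, StandardCharsets.UTF_8);
        }
    }

    static int reducerOf(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    static MapOutput write(List<String> keys, int numReducers) {
        List<String> sorted = new ArrayList<>(keys);
        // Sort by target reducer so that each reducer's records are contiguous.
        sorted.sort(Comparator.comparingInt(k -> reducerOf(k, numReducers)));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] offsets = new int[numReducers + 1];
        int next = 0;
        for (int r = 0; r < numReducers; r++) {
            offsets[r] = out.size();
            while (next < sorted.size() && reducerOf(sorted.get(next), numReducers) == r) {
                byte[] line = (sorted.get(next) + "\n").getBytes(StandardCharsets.UTF_8);
                out.write(line, 0, line.length);
                next++;
            }
        }
        offsets[numReducers] = out.size();
        return new MapOutput(out.toByteArray(), offsets);
    }

    public static void main(String[] args) {
        MapOutput output = write(List.of("a", "b", "a", "c"), 2);
        for (int r = 0; r < 2; r++) {
            System.out.print("reducer " + r + " segment:\n" + output.segmentFor(r));
        }
    }
}
```

In this sketch the number of outputs a map task produces no longer depends on the number of reducers; only the index grows with R.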
Solving the shuffle problems
• Choice of sort algorithm
• https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Implement TimSort
• Combining several sort algorithms reduces both the average and the worst-case cost
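One way to see why TimSort suits this workload is to count comparisons on input that already contains sorted runs. The sketch below does not use Spark's own TimSort implementation; it relies on the JDK's `Arrays.sort` for object arrays, which is itself documented as a TimSort adaptation. The class name, helper names, array size, and random seed are arbitrary choices for the illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Random;

// Illustration using the JDK's TimSort-based Arrays.sort, not Spark's sorter.
class TimSortRunsSketch {
    // Sorts a copy of the input and returns how many comparisons the sort made.
    static long countComparisons(Integer[] input) {
        long[] count = {0};
        Comparator<Integer> counting = (a, b) -> {
            count[0]++;
            return Integer.compare(a, b);
        };
        Integer[] copy = Arrays.copyOf(input, input.length);
        Arrays.sort(copy, counting);
        return count[0];
    }

    static Integer[] ascending(int n) {
        Integer[] values = new Integer[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        return values;
    }

    static Integer[] shuffled(int n, long seed) {
        Integer[] values = ascending(n);
        // Arrays.asList is a view over the array, so shuffling it shuffles the array.
        Collections.shuffle(Arrays.asList(values), new Random(seed));
        return values;
    }

    public static void main(String[] args) {
        int n = 10_000;
        System.out.println("already sorted: " + countComparisons(ascending(n)) + " comparisons");
        System.out.println("shuffled:       " + countComparisons(shuffled(n, 42L)) + " comparisons");
    }
}
```

The already-sorted input needs far fewer comparisons than the shuffled one, because TimSort detects existing runs and merges them instead of sorting from scratch.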
Solving the shuffle problems
• Improve the network module
• https://issues.apache.org/jira/browse/SPARK-2468
• Netty-based data transfer implementation (zero copy with FileChannel.transferTo)
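The `FileChannel.transferTo` call mentioned above can be tried in isolation. This is a minimal sketch, not Spark's or Netty's code: the class `ZeroCopySketch`, the method `sendFile`, and the temp-file names are hypothetical, and the target here is another file channel as a stand-in for a blocking socket channel. Whether the transfer actually avoids copying through user space depends on the operating system and the target channel type.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; only FileChannel.transferTo itself is a real JDK API here.
class ZeroCopySketch {
    // Sends a whole file to the target channel and returns the number of bytes sent.
    // Assumes a blocking target; a non-blocking one could return 0 and need extra handling.
    static long sendFile(Path source, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (sent < size) {
                sent += in.transferTo(sent, size - sent, target);
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("shuffle-block", ".bin");
        Path received = Files.createTempFile("received", ".bin");
        Files.write(source, new byte[1 << 20]); // 1 MiB of zeros as a stand-in block
        try (FileChannel out = FileChannel.open(received, StandardOpenOption.WRITE)) {
            System.out.println("sent " + sendFile(source, out) + " bytes");
        }
    }
}
```

The point of the design is that the application never reads the block into its own buffer; it only tells the kernel which byte range to move to the target.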