Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
spark shuffle 勉強会
Search
huydx
March 18, 2016
Programming
0
1k
spark shuffle 勉強会
spark shuffle 勉強会
huydx
March 18, 2016
Tweet
Share
More Decks by huydx
See All by huydx
Finagle もろもろ
huydx
0
2.1k
web audio api (htmlday osaka)
huydx
1
1.1k
Other Decks in Programming
See All in Programming
Hypervel - A Coroutine Framework for Laravel Artisans
albertcht
1
100
童醫院敏捷轉型的實踐經驗
cclai999
0
190
明示と暗黙 ー PHPとGoの インターフェイスの違いを知る
shimabox
2
320
What Spring Developers Should Know About Jakarta EE
ivargrimstad
0
250
コードの90%をAIが書く世界で何が待っているのか / What awaits us in a world where 90% of the code is written by AI
rkaga
46
31k
関数型まつり2025登壇資料「関数プログラミングと再帰」
taisontsukada
2
850
アンドパッドの Go 勉強会「 gopher 会」とその内容の紹介
andpad
0
260
Cline指示通りに動かない? AI小説エージェントで学ぶ指示書の書き方と自動アップデートの仕組み
kamomeashizawa
1
580
Composerが「依存解決」のためにどんな工夫をしているか #phpcon
o0h
PRO
1
230
AWS CDKの推しポイント 〜CloudFormationと比較してみた〜
akihisaikeda
3
310
Haskell でアルゴリズムを抽象化する / 関数型言語で競技プログラミング
naoya
17
4.9k
Cursor AI Agentと伴走する アプリケーションの高速リプレイス
daisuketakeda
1
130
Featured
See All Featured
How to Think Like a Performance Engineer
csswizardry
24
1.7k
Balancing Empowerment & Direction
lara
1
370
[RailsConf 2023] Rails as a piece of cake
palkan
55
5.6k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
48
2.8k
Six Lessons from altMBA
skipperchong
28
3.8k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
The World Runs on Bad Software
bkeepers
PRO
69
11k
Statistics for Hackers
jakevdp
799
220k
The Pragmatic Product Professional
lauravandoore
35
6.7k
Embracing the Ebb and Flow
colly
86
4.7k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.5k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.3k
Transcript
SparkͷShuffleपΓ @huydx
Shuffleʹ͍ͭͯ • Map - Reduce Ϟσϧ • Mapஈ֊͔ΒReduceஈ֊ͷதؒϨΠϠʔ ʮShuffleʯͱݺͿ •
ShuffleͰ • Spark͕PullϞσϧʢ·ͣσΟεΫʹ݁Ռॻ͍ ͯɺReduceδϣϒ͕औΓʹߦ͘ʣ • SparkReduceδϣϒʹඞཁͳσʔλϝϞϦ ϑΟοτ͠ͳ͍ͱ͍͚ͳ͍
͍ͭShuffle͕ൃੜ͢Δ • Join • Cogroup • *ByKeyΦϖϨʔγϣϯ
Shuffleͷ • ShuffleϑΝΠϧ • Mapͷ͕MɺReduceͷ͕Rͱͨ͠ΒσΟεΫʹॻ͘ ϑΝΠϧ͕ M * R (M
= 5000, R = 1024 ͩͱ 500ສϑΝ Πϧʂʣ • Reduce͢Δͱ͖ʹιʔτΞϧΰϦζϜ͕ඞཁ • ฒྻʹιʔτ͢Δඞཁ͕Ͱ͖Δͷ • ௨৴͕ॏ͍
Shuffleͷղܾ • ShuffleϑΝΠϧɿ • O(M * R) ͡Όͳͯ͘ O(R)·Ͱ͑ΒΕΔ •
Hashed base shuffle(ҰͭͷRͻͱͭͷϑΝΠϧʣ͡Όͳͯ͘ Sort base shuffle • ࢀߟɿhttps://issues.apache.org/jira/secure/attachment/ 12637642/Consolidating%20Shuffle%20Files%20in %20Spark.pdf • https://issues.apache.org/jira/browse/SPARK-2045
Shuffleͷղܾ • SortͷΞϧΰϦζϜબ • https://databricks.com/blog/2014/10/10/spark- petabyte-sort.html • TimsortΛ࣮͢Δ • ৭ʑͳιʔτΞϧΰϦζϜͷΉ߹ΘͤͰฏ
ۉWorst CaseύʔϑΥϚϯεΛݮΒ͢
Shuffleͷղܾ • ωοτϫʔΫϞδϡʔϧΛվળ • https://issues.apache.org/jira/browse/ SPARK-2468 • Netty ϕʔεσʔλసૹͷ࣮ (FileChannel.transferToͰzero
copyʣ