Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
spark shuffle 勉強会
Search
huydx
March 18, 2016
Programming
0
1k
spark shuffle 勉強会
spark shuffle 勉強会
huydx
March 18, 2016
Tweet
Share
More Decks by huydx
See All by huydx
Finagle もろもろ
huydx
0
2.1k
web audio api (htmlday osaka)
huydx
1
1.2k
Other Decks in Programming
See All in Programming
『実践MLOps』から学ぶ DevOps for ML
nsakki55
2
480
Feature Flags Suck! - KubeCon Atlanta 2025
phodgson
0
170
Stay Hacker 〜九州で生まれ、Perlに出会い、コミュニティで育つ〜
pyama86
2
2.6k
モダンJSフレームワークのビルドプロセス 〜なぜReactは503行、Svelteは12行なのか〜
fuuki12
0
120
Claude Code on the Web を超える!? Codex Cloud の実践テク5選
sunagaku
0
610
[SF Ruby Conf 2025] Rails X
palkan
0
370
開発生産性が組織文化になるまでの軌跡
tonegawa07
0
190
All(?) About Point Sets
hole
0
220
全員アーキテクトで挑む、 巨大で高密度なドメインの紐解き方
agatan
8
10k
分散DBって何者なんだ... Spannerから学ぶRDBとの違い
iwashi623
0
110
生成AIを活用したリファクタリング実践 ~コードスメルをなくすためのアプローチ
raedion
0
140
DartASTとその活用
sotaatos
2
150
Featured
See All Featured
The Art of Programming - Codeland 2020
erikaheidi
56
14k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.1k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.2k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Designing Experiences People Love
moore
142
24k
Fantastic passwords and where to find them - at NoRuKo
philnash
52
3.5k
Being A Developer After 40
akosma
91
590k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
118
20k
Building a Modern Day E-commerce SEO Strategy
aleyda
45
8.1k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.8k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Transcript
SparkͷShuffleपΓ @huydx
Shuffleʹ͍ͭͯ • Map - Reduce Ϟσϧ • Mapஈ֊͔ΒReduceஈ֊ͷதؒϨΠϠʔ ʮShuffleʯͱݺͿ •
ShuffleͰ • Spark͕PullϞσϧʢ·ͣσΟεΫʹ݁Ռॻ͍ ͯɺReduceδϣϒ͕औΓʹߦ͘ʣ • SparkReduceδϣϒʹඞཁͳσʔλϝϞϦ ϑΟοτ͠ͳ͍ͱ͍͚ͳ͍
͍ͭShuffle͕ൃੜ͢Δ • Join • Cogroup • *ByKeyΦϖϨʔγϣϯ
Shuffleͷ • ShuffleϑΝΠϧ • Mapͷ͕MɺReduceͷ͕Rͱͨ͠ΒσΟεΫʹॻ͘ ϑΝΠϧ͕ M * R (M
= 5000, R = 1024 ͩͱ 500ສϑΝ Πϧʂʣ • Reduce͢Δͱ͖ʹιʔτΞϧΰϦζϜ͕ඞཁ • ฒྻʹιʔτ͢Δඞཁ͕Ͱ͖Δͷ • ௨৴͕ॏ͍
Shuffleͷղܾ • ShuffleϑΝΠϧɿ • O(M * R) ͡Όͳͯ͘ O(R)·Ͱ͑ΒΕΔ •
Hashed base shuffle(ҰͭͷRͻͱͭͷϑΝΠϧʣ͡Όͳͯ͘ Sort base shuffle • ࢀߟɿhttps://issues.apache.org/jira/secure/attachment/ 12637642/Consolidating%20Shuffle%20Files%20in %20Spark.pdf • https://issues.apache.org/jira/browse/SPARK-2045
Shuffleͷղܾ • SortͷΞϧΰϦζϜબ • https://databricks.com/blog/2014/10/10/spark- petabyte-sort.html • TimsortΛ࣮͢Δ • ৭ʑͳιʔτΞϧΰϦζϜͷΉ߹ΘͤͰฏ
ۉWorst CaseύʔϑΥϚϯεΛݮΒ͢
Shuffleͷղܾ • ωοτϫʔΫϞδϡʔϧΛվળ • https://issues.apache.org/jira/browse/ SPARK-2468 • Netty ϕʔεσʔλసૹͷ࣮ (FileChannel.transferToͰzero
copyʣ