Spark Shuffle Study Session
huydx
March 18, 2016
Transcript
Around Spark's Shuffle @huydx
About Shuffle
• The Map-Reduce model
• The intermediate layer between the Map stage and the Reduce stage is called "Shuffle"
In Shuffle
• Spark uses a pull model (map output is first written to disk, then the Reduce job comes to fetch it)
• In Spark, the data a Reduce job needs must fit in memory
When Shuffle occurs
• Join
• Cogroup
• *ByKey operations
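To make the pull model and the *ByKey case concrete, here is a minimal pure-Python sketch of a reduceByKey-style aggregation. It does not use Spark; the names `map_side_partition`, `reduce_side_pull`, and `NUM_REDUCERS` are hypothetical, and in-memory dictionaries stand in for the map output files on disk.

```python
from collections import defaultdict

def map_side_partition(records, num_reducers):
    """Split one map task's (key, value) output into per-reducer buckets."""
    buckets = defaultdict(list)
    for key, value in records:
        # Every record with the same key lands in the same reducer bucket.
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

def reduce_side_pull(all_map_outputs, reducer_id):
    """A reducer pulls its own bucket from every map output and sums per key."""
    totals = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets.get(reducer_id, []):
            totals[key] += value
    return dict(totals)

# Two map-side partitions of (word, 1) pairs (made-up sample data).
map_inputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("c", 1)],
]
NUM_REDUCERS = 2

# Map stage: each task writes its partitioned output first.
map_outputs = [map_side_partition(part, NUM_REDUCERS) for part in map_inputs]

# Reduce stage: each reducer fetches ("pulls") what it needs afterwards.
result = {}
for reducer_id in range(NUM_REDUCERS):
    result.update(reduce_side_pull(map_outputs, reducer_id))

print(result)
```

Because each key is routed to exactly one reducer, the per-reducer results never overlap and can simply be merged.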
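Shuffle problems
• Shuffle files
• With M map tasks and R reduce tasks, the number of files written to disk is M * R (M = 5000, R = 1024 gives about 5 million files!)
• A sort algorithm is needed at Reduce time
• The sort has to run in parallel
• Communication is heavy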
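The file count from the slide can be checked with a few lines of arithmetic. This is only a sketch; `hash_shuffle_file_count` is a hypothetical helper name.

```python
def hash_shuffle_file_count(num_map_tasks, num_reduce_tasks):
    """One file per (map task, reduce task) pair."""
    return num_map_tasks * num_reduce_tasks

# The values used on the slide: M = 5000, R = 1024.
files = hash_shuffle_file_count(5000, 1024)
print(f"{files:,} files")
```

The exact product is 5,120,000, which the slide rounds to 5 million.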
Solving Shuffle problems
• Number of shuffle files:
• Can be reduced from O(M * R) to O(R)
• Sort-based shuffle instead of hash-based shuffle (one file per R)
• Reference: https://issues.apache.org/jira/secure/attachment/12637642/Consolidating%20Shuffle%20Files%20in%20Spark.pdf
• https://issues.apache.org/jira/browse/SPARK-2045
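The idea behind sort-based shuffle can be sketched as follows: instead of one file per reducer, a map task orders its records by partition id, writes them as a single output, and keeps an offsets index so each reducer can read only its own slice. This is a simplified illustration, not Spark's implementation; the function names are hypothetical and a Python list stands in for the data file.

```python
def hash_partition(key, num_reducers):
    """Partition id for a key (integer keys keep the example deterministic)."""
    return hash(key) % num_reducers

def write_sorted_map_output(records, num_reducers):
    """Return (data, offsets): records ordered by partition id, plus an index.

    offsets[p] and offsets[p + 1] delimit the slice that belongs to reducer p.
    """
    ordered = sorted(records, key=lambda kv: hash_partition(kv[0], num_reducers))
    offsets = [0] * (num_reducers + 1)
    for key, _ in ordered:
        offsets[hash_partition(key, num_reducers) + 1] += 1
    # Turn per-partition counts into cumulative start positions.
    for p in range(num_reducers):
        offsets[p + 1] += offsets[p]
    return ordered, offsets

def read_partition(data, offsets, reducer_id):
    """What a reducer fetches: only its own contiguous slice."""
    return data[offsets[reducer_id]:offsets[reducer_id + 1]]

records = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
data, offsets = write_sorted_map_output(records, num_reducers=2)
print(offsets)
print(read_partition(data, offsets, 0))
print(read_partition(data, offsets, 1))
```

With this layout a single map task produces one data output and one small index, regardless of how many reducers there are.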
Solving Shuffle problems
• Choice of sort algorithm
• https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Implement Timsort
• Combining several sort algorithms reduces average and worst-case running time
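As a rough illustration of "combining several sort algorithms", the sketch below sorts small fixed-size runs with insertion sort and then merges them pairwise. It is a deliberately simplified hybrid in the spirit of Timsort, not Timsort itself: it has no natural-run detection and no galloping mode. `MIN_RUN` and the function names are assumptions made for this example.

```python
MIN_RUN = 32  # assumed run size; real Timsort computes it from the input length

def insertion_sort(items, lo, hi):
    """Sort items[lo:hi] in place; fast on short or nearly sorted ranges."""
    for i in range(lo + 1, hi):
        current = items[i]
        j = i - 1
        while j >= lo and items[j] > current:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current

def merge(left, right):
    """Stable merge of two sorted lists (ties are taken from the left)."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def hybrid_sort(values):
    """Insertion-sort fixed-size runs, then merge the runs pairwise."""
    items = list(values)
    runs = []
    for lo in range(0, len(items), MIN_RUN):
        hi = min(lo + MIN_RUN, len(items))
        insertion_sort(items, lo, hi)
        runs.append(items[lo:hi])
    while len(runs) > 1:
        paired = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                paired.append(merge(runs[k], runs[k + 1]))
            else:
                paired.append(runs[k])
        runs = paired
    return runs[0] if runs else []

sample = [(i * 7919) % 1000 for i in range(500)]
print(hybrid_sort(sample) == sorted(sample))
```

The insertion sort handles short ranges cheaply, while the merge phase keeps the overall cost at O(n log n) in the worst case.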
Solving Shuffle problems
• Improve the network module
• https://issues.apache.org/jira/browse/SPARK-2468
• Netty-based data transfer implementation (zero copy with FileChannel.transferTo)
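The slide refers to Java's `FileChannel.transferTo`. As a rough Python analogue, `socket.sendfile` hands the file-to-socket copy to the operating system via `os.sendfile` where that is available and falls back to ordinary `send` otherwise. The sketch below is not Spark's or Netty's code; `send_file_zero_copy`, the payload, and the local socket pair are assumptions for illustration.

```python
import os
import socket
import tempfile

def send_file_zero_copy(path, sock):
    """Send a whole file over a stream socket; returns the bytes sent.

    socket.sendfile uses os.sendfile when the platform supports it, so the
    data does not have to pass through a user-space buffer.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)

# A small stand-in for a shuffle block on disk (made-up content).
payload = b"shuffle block " * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

# A connected local socket pair stands in for the map-to-reduce connection.
sender, receiver = socket.socketpair()
try:
    sent = send_file_zero_copy(tmp.name, sender)
    sender.close()  # signals end-of-stream to the receiver
    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    receiver.close()
    os.remove(tmp.name)

print(sent, received == payload)
```

The payload is kept small so that it fits in the socket buffer and the single-threaded example does not block on the send.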