Spark Shuffle Study Session
huydx
March 18, 2016
Transcript
Around Spark's Shuffle @huydx
About shuffle
• The MapReduce model
• The intermediate layer between the map stage and the reduce stage is called the "shuffle"
In the shuffle
• Spark uses a pull model (map results are written to disk first, then the reduce jobs come and fetch them)
• In Spark, the data a reduce job needs has to fit in memory
When a shuffle happens
• Join
• Cogroup
• *ByKey operations
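The operations listed above all need every record with a given key to end up in the same reduce partition, which is why they trigger a shuffle. The following is a minimal sketch in plain Java with no Spark dependency, assuming records are routed by hash code modulo the partition count. The class `KeyPartitioning` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class KeyPartitioning {
    // Non-negative modulo, so negative hash codes still map to a valid partition.
    static int partitionFor(Object key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Route every (key, value) record to the bucket of its reduce partition.
    static List<List<Map.Entry<String, Integer>>> route(
            List<Map.Entry<String, Integer>> records, int numPartitions) {
        List<List<Map.Entry<String, Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> rec : records) {
            buckets.get(partitionFor(rec.getKey(), numPartitions)).add(rec);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        // Both "a" records land in the same bucket, whichever map task produced them.
        System.out.println(route(records, 4));
    }
}
```

Because the bucket depends only on the key, records for one key that were produced by different map tasks still have to be moved to a single reduce task.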
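Problems with shuffle
• Shuffle files
• With M map tasks and R reduce tasks, M * R files are written to disk (M = 5000, R = 1024 gives roughly 5 million files!)
• Reducing requires a sort algorithm
• The sort has to run in parallel
• Network communication is heavy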
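The file count on this slide is plain multiplication, and it helps to see the exact figure. A small sketch, assuming one file per (map task, reduce task) pair as the slide describes; `ShuffleFileCount` is a hypothetical name.

```java
final class ShuffleFileCount {
    // Hash-based shuffle: every map task writes one file per reduce task.
    static long hashShuffleFiles(long mapTasks, long reduceTasks) {
        return Math.multiplyExact(mapTasks, reduceTasks);
    }

    public static void main(String[] args) {
        long files = hashShuffleFiles(5000, 1024);
        System.out.println(files); // 5120000
    }
}
```

So M = 5000 and R = 1024 give 5,120,000 files, which the slide rounds to 5 million.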
Solving the shuffle problems
• Shuffle files:
• The file count can be held to O(R) instead of O(M * R)
• Sort-based shuffle instead of hash-based shuffle (one file per reducer)
• Reference: https://issues.apache.org/jira/secure/attachment/12637642/Consolidating%20Shuffle%20Files%20in%20Spark.pdf
• https://issues.apache.org/jira/browse/SPARK-2045
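One way to picture a sort-based layout: a map task sorts its output by reduce partition and writes a single blob plus a table of offsets, so each reducer reads only its own byte range. The sketch below illustrates that idea in memory. It is not Spark's actual implementation, and `SortShuffleSketch`, `write`, and `read` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class SortShuffleSketch {
    // Appends all records to `data`, grouped by partition id, and returns
    // numPartitions + 1 offsets. Assumes 0 <= partition id < numPartitions.
    static long[] write(List<Map.Entry<Integer, String>> records, int numPartitions,
                        ByteArrayOutputStream data) {
        List<Map.Entry<Integer, String>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey()); // stable sort by partition id
        long[] offsets = new long[numPartitions + 1];
        int next = 0;
        for (int p = 0; p < numPartitions; p++) {
            offsets[p] = data.size();
            while (next < sorted.size() && sorted.get(next).getKey() == p) {
                byte[] bytes = sorted.get(next).getValue().getBytes(StandardCharsets.UTF_8);
                data.write(bytes, 0, bytes.length);
                next++;
            }
        }
        offsets[numPartitions] = data.size();
        return offsets;
    }

    // What reduce task p would fetch: only the byte range that belongs to it.
    static String read(byte[] data, long[] offsets, int p) {
        byte[] slice = Arrays.copyOfRange(data, (int) offsets[p], (int) offsets[p + 1]);
        return new String(slice, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        long[] offsets = write(
                List.of(Map.entry(2, "c"), Map.entry(0, "a"), Map.entry(2, "d")), 3, data);
        System.out.println(read(data.toByteArray(), offsets, 2)); // cd
    }
}
```

With this layout the number of files written by one map task no longer depends on how many reduce tasks there are.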
Solving the shuffle problems
• Choice of sort algorithm
• https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Implement TimSort
• Combining several sort algorithms lowers both the average-case and worst-case cost
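TimSort's behaviour can be observed without Spark, because the JDK's object sort (`Arrays.sort` with a comparator) is documented as a stable merge sort adapted from TimSort. The sketch below uses that JDK sort as a stand-in rather than Spark's own sorter. The `TimSortDemo` class and the `key:value` record format are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

final class TimSortDemo {
    // Sort "key:value" strings by key only; ties keep their input order (stable sort).
    static String[] sortByKey(String[] records) {
        String[] copy = records.clone();
        Arrays.sort(copy, Comparator.comparing((String r) -> r.substring(0, r.indexOf(':'))));
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortByKey(new String[] {"b:1", "a:1", "b:2", "a:2"});
        System.out.println(Arrays.toString(sorted)); // [a:1, a:2, b:1, b:2]
    }
}
```

Records with equal keys keep their original relative order, and input that is already partly sorted is handled with fewer comparisons than a plain merge sort would need.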
Solving the shuffle problems
• Improve the network module
• https://issues.apache.org/jira/browse/SPARK-2468
• Netty-based implementation of data transfer (zero copy via FileChannel.transferTo)
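The zero-copy call named on this slide can be tried with the JDK alone. The sketch below uses `FileChannel.transferTo` to send a file to another channel without reading it into a Java buffer first. It assumes a blocking target channel and copies file to file for simplicity, where a real server would write to a socket channel. `ZeroCopyTransfer` and `sendFile` are hypothetical names, and none of this is Spark's or Netty's code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopyTransfer {
    // Send a whole file to the target channel; returns the number of bytes sent.
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("shuffle-block", ".data");
        Path dst = Files.createTempFile("shuffle-copy", ".data");
        Files.write(src, new byte[] {1, 2, 3, 4});
        try (FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            System.out.println(sendFile(src, out)); // 4
        }
        Files.delete(src);
        Files.delete(dst);
    }
}
```

Where the operating system supports it, the bytes move from the file cache to the target without passing through user space, which is the saving the slide refers to as zero copy.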