Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
spark shuffle 勉強会
Search
huydx
March 18, 2016
Programming
0
960
spark shuffle 勉強会
spark shuffle 勉強会
huydx
March 18, 2016
Tweet
Share
More Decks by huydx
See All by huydx
Finagle もろもろ
huydx
0
2k
web audio api (htmlday osaka)
huydx
1
1k
Other Decks in Programming
See All in Programming
SRE チーム立ち上げ前に考えたこと・取り組んだこと / Considerations and Preparations Before Establishing an SRE Team
mackey0225
3
320
20240706_CDKConf
takuyay0ne
0
1.2k
AHC035解説
terryu16
0
710
Temporalを取り巻く仕様を整理する
sajikix
0
110
The rollercoaster of releasing an Android, iOS, and macOS app with Kotlin Multiplatform | droidcon Berlin
prof18
0
110
大規模マルチテナントを解決するYugabyteDBという選択肢
nnaka2992
1
250
入社1ヶ月でここまでやった!Findy Toolsインフラ支援の最適化
rvirus0817
6
1.4k
Mastering Developer Experience: A Roadmap for Success 【開発生産性Conference 2024】
findyinc
1
380
CSC307 Lecture 13
javiergs
PRO
0
150
OpenAI/Gemini APIを使って EPUBを翻訳するCLIツールをつくってみた
tomiyan
0
790
Exploring the Gradually Lost Technical Skills in the Cloud Native Era
hwchiu
2
3.9k
DMMプラットフォームにおけるTiDBの導入から運用まで
pospome
7
3k
Featured
See All Featured
The Invisible Side of Design
smashingmag
294
50k
Bootstrapping a Software Product
garrettdimon
PRO
304
110k
Building Applications with DynamoDB
mza
89
5.8k
Art, The Web, and Tiny UX
lynnandtonic
291
20k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
353
29k
RailsConf 2023
tenderlove
16
720
Making Projects Easy
brettharned
111
5.7k
Code Review Best Practice
trishagee
58
16k
YesSQL, Process and Tooling at Scale
rocio
166
14k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
16
1.6k
Testing 201, or: Great Expectations
jmmastey
33
6.9k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
36
9.1k
Transcript
SparkͷShuffleपΓ @huydx
Shuffleʹ͍ͭͯ • Map - Reduce Ϟσϧ • Mapஈ֊͔ΒReduceஈ֊ͷதؒϨΠϠʔ ʮShuffleʯͱݺͿ •
ShuffleͰ • Spark͕PullϞσϧʢ·ͣσΟεΫʹ݁Ռॻ͍ ͯɺReduceδϣϒ͕औΓʹߦ͘ʣ • SparkReduceδϣϒʹඞཁͳσʔλϝϞϦ ϑΟοτ͠ͳ͍ͱ͍͚ͳ͍
͍ͭShuffle͕ൃੜ͢Δ • Join • Cogroup • *ByKeyΦϖϨʔγϣϯ
Shuffleͷ • ShuffleϑΝΠϧ • Mapͷ͕MɺReduceͷ͕Rͱͨ͠ΒσΟεΫʹॻ͘ ϑΝΠϧ͕ M * R (M
= 5000, R = 1024 ͩͱ 500ສϑΝ Πϧʂʣ • Reduce͢Δͱ͖ʹιʔτΞϧΰϦζϜ͕ඞཁ • ฒྻʹιʔτ͢Δඞཁ͕Ͱ͖Δͷ • ௨৴͕ॏ͍
Shuffleͷղܾ • ShuffleϑΝΠϧɿ • O(M * R) ͡Όͳͯ͘ O(R)·Ͱ͑ΒΕΔ •
Hashed base shuffle(ҰͭͷRͻͱͭͷϑΝΠϧʣ͡Όͳͯ͘ Sort base shuffle • ࢀߟɿhttps://issues.apache.org/jira/secure/attachment/ 12637642/Consolidating%20Shuffle%20Files%20in %20Spark.pdf • https://issues.apache.org/jira/browse/SPARK-2045
Shuffleͷղܾ • SortͷΞϧΰϦζϜબ • https://databricks.com/blog/2014/10/10/spark- petabyte-sort.html • TimsortΛ࣮͢Δ • ৭ʑͳιʔτΞϧΰϦζϜͷΉ߹ΘͤͰฏ
ۉWorst CaseύʔϑΥϚϯεΛݮΒ͢
Shuffleͷղܾ • ωοτϫʔΫϞδϡʔϧΛվળ • https://issues.apache.org/jira/browse/ SPARK-2468 • Netty ϕʔεσʔλసૹͷ࣮ (FileChannel.transferToͰzero
copyʣ