Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
spark shuffle 勉強会
Search
huydx
March 18, 2016
Programming
0
980
spark shuffle 勉強会
spark shuffle 勉強会
huydx
March 18, 2016
Tweet
Share
More Decks by huydx
See All by huydx
Finagle もろもろ
huydx
0
2k
web audio api (htmlday osaka)
huydx
1
1.1k
Other Decks in Programming
See All in Programming
Less waste, more joy, and a lot more green: How Quarkus makes Java better
hollycummins
0
100
as(型アサーション)を書く前にできること
marokanatani
10
2.7k
イベント駆動で成長して委員会
happymana
1
330
Jakarta EE meets AI
ivargrimstad
0
220
Jakarta EE meets AI
ivargrimstad
0
130
Jakarta EE meets AI
ivargrimstad
0
600
シェーダーで魅せるMapLibreの動的ラスタータイル
satoshi7190
1
480
Realtime API 入門
riofujimon
0
150
3 Effective Rules for Using Signals in Angular
manfredsteyer
PRO
0
120
AI時代におけるSRE、 あるいはエンジニアの生存戦略
pyama86
6
1.2k
距離関数を極める! / SESSIONS 2024
gam0022
0
290
WebフロントエンドにおけるGraphQL(あるいはバックエンドのAPI)との向き合い方 / #241106_plk_frontend
izumin5210
4
1.4k
Featured
See All Featured
The Art of Programming - Codeland 2020
erikaheidi
52
13k
KATA
mclloyd
29
14k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
47
5k
We Have a Design System, Now What?
morganepeng
50
7.2k
Code Review Best Practice
trishagee
64
17k
Building Your Own Lightsaber
phodgson
103
6.1k
Building Adaptive Systems
keathley
38
2.3k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
28
9.1k
It's Worth the Effort
3n
183
27k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
38
1.8k
GraphQLの誤解/rethinking-graphql
sonatard
67
10k
Making the Leap to Tech Lead
cromwellryan
133
8.9k
Transcript
SparkͷShuffleपΓ @huydx
Shuffleʹ͍ͭͯ • Map - Reduce Ϟσϧ • Mapஈ֊͔ΒReduceஈ֊ͷதؒϨΠϠʔ ʮShuffleʯͱݺͿ •
ShuffleͰ • Spark͕PullϞσϧʢ·ͣσΟεΫʹ݁Ռॻ͍ ͯɺReduceδϣϒ͕औΓʹߦ͘ʣ • SparkReduceδϣϒʹඞཁͳσʔλϝϞϦ ϑΟοτ͠ͳ͍ͱ͍͚ͳ͍
͍ͭShuffle͕ൃੜ͢Δ • Join • Cogroup • *ByKeyΦϖϨʔγϣϯ
Shuffleͷ • ShuffleϑΝΠϧ • Mapͷ͕MɺReduceͷ͕Rͱͨ͠ΒσΟεΫʹॻ͘ ϑΝΠϧ͕ M * R (M
= 5000, R = 1024 ͩͱ 500ສϑΝ Πϧʂʣ • Reduce͢Δͱ͖ʹιʔτΞϧΰϦζϜ͕ඞཁ • ฒྻʹιʔτ͢Δඞཁ͕Ͱ͖Δͷ • ௨৴͕ॏ͍
Shuffleͷղܾ • ShuffleϑΝΠϧɿ • O(M * R) ͡Όͳͯ͘ O(R)·Ͱ͑ΒΕΔ •
Hashed base shuffle(ҰͭͷRͻͱͭͷϑΝΠϧʣ͡Όͳͯ͘ Sort base shuffle • ࢀߟɿhttps://issues.apache.org/jira/secure/attachment/ 12637642/Consolidating%20Shuffle%20Files%20in %20Spark.pdf • https://issues.apache.org/jira/browse/SPARK-2045
Shuffleͷղܾ • SortͷΞϧΰϦζϜબ • https://databricks.com/blog/2014/10/10/spark- petabyte-sort.html • TimsortΛ࣮͢Δ • ৭ʑͳιʔτΞϧΰϦζϜͷΉ߹ΘͤͰฏ
ۉWorst CaseύʔϑΥϚϯεΛݮΒ͢
Shuffleͷղܾ • ωοτϫʔΫϞδϡʔϧΛվળ • https://issues.apache.org/jira/browse/ SPARK-2468 • Netty ϕʔεσʔλసૹͷ࣮ (FileChannel.transferToͰzero
copyʣ