Spark Shuffle Study Session
huydx
March 18, 2016
Transcript
Around Spark's Shuffle @huydx
About Shuffle
• Map-Reduce model
• The intermediate layer between the Map stage and the Reduce stage is called the "Shuffle"
In the Shuffle
• Spark uses a pull model (results are first written to disk, then the Reduce jobs come to fetch them)
• In Spark, the data a Reduce job needs must fit in memory
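The pull model on this slide can be sketched in plain Python. This is an illustration only, not Spark's implementation: the names `map_task`, `reduce_task`, and `run_word_count`, the pickle file layout, and the CRC32-based partitioner are all assumptions made for the example.

```python
import os
import pickle
import tempfile
import zlib
from collections import defaultdict


def partition_of(key, num_reducers):
    # Deterministic hash partitioner (hypothetical stand-in for Spark's).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(map_id, records, num_reducers, shuffle_dir):
    # Map side: bucket records by reducer and write every bucket to disk first.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_reducers)].append((key, value))
    for reduce_id in range(num_reducers):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "wb") as f:
            pickle.dump(buckets[reduce_id], f)


def reduce_task(reduce_id, num_maps, shuffle_dir):
    # Reduce side: pull this reducer's bucket from every map output, then
    # aggregate in memory (which is why the data has to fit there).
    totals = defaultdict(int)
    for map_id in range(num_maps):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "rb") as f:
            for key, value in pickle.load(f):
                totals[key] += value
    return dict(totals)


def run_word_count(splits, num_reducers):
    # Run all map tasks, count the shuffle files, then run all reduce tasks.
    with tempfile.TemporaryDirectory() as shuffle_dir:
        for map_id, words in enumerate(splits):
            map_task(map_id, [(w, 1) for w in words], num_reducers, shuffle_dir)
        num_files = len(os.listdir(shuffle_dir))
        result = {}
        for reduce_id in range(num_reducers):
            result.update(reduce_task(reduce_id, len(splits), shuffle_dir))
    return result, num_files
```

With two input splits and three reducers, `run_word_count([["a", "b", "a"], ["b", "c"]], 3)` writes six intermediate files before any reducer starts reading.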
When a Shuffle occurs
• Join
• Cogroup
• *ByKey operations
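Why these operations force a shuffle can be seen in a small sketch: records with the same key have to end up in the same partition before they can be combined. The code below is plain Python, not Spark code; `hash_partition` and `partitionwise_join` are hypothetical helpers, and the CRC32 partitioner is an assumption.

```python
import zlib
from collections import defaultdict


def hash_partition(records, num_partitions):
    # Route each (key, value) pair to a partition chosen from the key alone.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def partitionwise_join(left, right, num_partitions):
    # Once both sides are shuffled with the same partitioner, matching keys
    # sit at the same partition index, so each pair of partitions can be
    # joined independently of the others.
    joined = []
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    for lpart, rpart in zip(left_parts, right_parts):
        by_key = defaultdict(list)
        for key, value in lpart:
            by_key[key].append(value)
        for key, rvalue in rpart:
            for lvalue in by_key[key]:
                joined.append((key, (lvalue, rvalue)))
    return sorted(joined)
```

The repartitioning step inside `hash_partition` is the part that corresponds to the shuffle; the per-partition join afterwards needs no further data movement.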
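Shuffle problems
• Shuffle files
• With M map tasks and R reduce tasks, the number of files written to disk is M * R (M = 5000, R = 1024 gives about 5 million files!)
• A sort algorithm is needed at Reduce time
• The sort has to be done in parallel
• Communication is heavy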
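The file count quoted on this slide is plain multiplication, assuming one file per (map task, reduce task) pair. The function name below is made up for the example.

```python
def hash_shuffle_file_count(num_map_tasks, num_reduce_tasks):
    # One shuffle file for every (map task, reduce task) pair.
    return num_map_tasks * num_reduce_tasks


print(hash_shuffle_file_count(5000, 1024))  # 5120000
```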
Solving Shuffle problems
• Shuffle files:
• The file count can be held down to O(R) instead of O(M * R)
• Sort-based shuffle instead of hash-based shuffle (one file per reducer)
• References: https://issues.apache.org/jira/secure/attachment/12637642/Consolidating%20Shuffle%20Files%20in%20Spark.pdf
• https://issues.apache.org/jira/browse/SPARK-2045
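A rough sketch of the sort-based idea, assuming each map task writes a single data file ordered by partition ID together with an index of byte offsets. The tab-separated record format, the `partition_of` helper, and the in-memory `BytesIO` standing in for the data file are illustrative choices, not Spark's actual on-disk format.

```python
import io
import zlib


def partition_of(key, num_reducers):
    # Hypothetical deterministic partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def write_sorted_map_output(records, num_reducers):
    # Sort by target partition so each reducer's records are contiguous.
    ordered = sorted(records, key=lambda kv: partition_of(kv[0], num_reducers))
    data = io.BytesIO()  # stands in for the single data file
    offsets = [0]        # stands in for the index: offsets[r] .. offsets[r + 1]
    position = 0
    for reduce_id in range(num_reducers):
        while (position < len(ordered)
               and partition_of(ordered[position][0], num_reducers) == reduce_id):
            key, value = ordered[position]
            data.write(f"{key}\t{value}\n".encode("utf-8"))
            position += 1
        offsets.append(data.tell())
    return data.getvalue(), offsets


def read_partition(data, offsets, reduce_id):
    # A reducer reads only its own byte range from the map output.
    segment = data[offsets[reduce_id]:offsets[reduce_id + 1]]
    return [tuple(line.split("\t")) for line in segment.decode("utf-8").splitlines()]
```

In this sketch a map task produces one output regardless of the number of reducers, and each reducer fetches a slice of it through the offset index instead of opening its own file.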
Solving Shuffle problems
• Choice of sort algorithm
• https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
• Implement Timsort
• Combining several sort algorithms lowers both the average and the worst-case cost
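The idea of combining sort algorithms can be illustrated with a simplified hybrid: insertion sort on short runs, followed by merging. This is not real Timsort, which also detects naturally ordered runs and uses a galloping merge; the function names and the fixed `run_size` of 32 are assumptions for the sketch.

```python
def insertion_sort(items, lo, hi):
    # Sort items[lo:hi] in place; cheap for short or nearly sorted runs.
    for i in range(lo + 1, hi):
        current = items[i]
        j = i - 1
        while j >= lo and items[j] > current:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = current


def merge(left, right):
    # Stable merge of two sorted lists.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def hybrid_sort(values, run_size=32):
    items = list(values)
    n = len(items)
    # Phase 1: insertion-sort fixed-size runs.
    for lo in range(0, n, run_size):
        insertion_sort(items, lo, min(lo + run_size, n))
    # Phase 2: merge neighbouring runs, doubling the run width each pass.
    width = run_size
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            items[lo:hi] = merge(items[lo:mid], items[mid:hi])
        width *= 2
    return items
```

Insertion sort keeps the constant factors low on small runs, while the merge phase bounds the overall work at O(n log n) even for adversarial input.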
Solving Shuffle problems
• Improve the network module
• https://issues.apache.org/jira/browse/SPARK-2468
• Netty-based data transfer implementation (zero copy via FileChannel.transferTo)
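To show what zero copy means in practice, here is a small Python sketch. It uses Python's `socket.sendfile` as a stand-in for Java's `FileChannel.transferTo`, which is a different API; both typically rely on the operating system's `sendfile` facility where it is available. The helper names, the temporary file, and the local socket pair are assumptions made so the example runs on one machine.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    # socket.sendfile() uses os.sendfile() where the platform supports it,
    # letting the kernel move bytes from the file to the socket without
    # staging them in a user-space buffer; otherwise it falls back to send().
    with open(path, "rb") as f:
        return sock.sendfile(f)


def zero_copy_roundtrip(payload):
    # Write the payload to a temporary file, send it over a local socket
    # pair, and read it back on the other end.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(tmp.name, sender)
        sender.close()  # signals end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        receiver.close()
        os.unlink(tmp.name)
    return sent, b"".join(chunks)
```

The payload is kept small in this demo so that the single-threaded send completes before the receive loop starts; a real transfer would read concurrently on the receiving side.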