Slide 1

Slide 1 text

Around Spark's Shuffle @huydx

Slide 2

Slide 2 text

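About Shuffle
• Map-Reduce model
  • The intermediate layer between the Map stage and the Reduce stage is called the "Shuffle"
• In the shuffle:
  • Spark uses a pull model (map output is first written to disk, then the Reduce job goes and fetches it)
  • In Spark, the data a Reduce job needs must fit in memory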

Slide 3

Slide 3 text

When a Shuffle Occurs
• Join
• Cogroup
• *ByKey operations
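What these operations share is that every record with the same key has to end up in the same reduce partition. The following is a minimal sketch of hash partitioning by key in plain Java, not Spark's own code; the class `KeyPartitioning` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class KeyPartitioning {
    // Maps a key to one of numReducers partitions. floorMod keeps the
    // result non-negative even when hashCode() is negative.
    static int partitionFor(Object key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Groups (key, value) records by target partition, the way a map task
    // would have to before the reduce side can fetch its share.
    static List<List<Map.Entry<String, Integer>>> partition(
            List<Map.Entry<String, Integer>> records, int numReducers) {
        List<List<Map.Entry<String, Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> record : records) {
            buckets.get(partitionFor(record.getKey(), numReducers)).add(record);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        // Both "a" records land in the same bucket.
        System.out.println(partition(records, 4));
    }
}
```

Because the bucket depends only on the key, records for one key that were produced by different map tasks all have to travel to the same reducer, which is why key-based operations involve a shuffle.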

Slide 4

Slide 4 text

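Problems with Shuffle
• Number of shuffle files
  • With M map tasks and R reduce tasks, M * R files are written to disk (M = 5000 and R = 1024 gives about 5 million files!)
• Reducing requires a sort algorithm
  • It has to be one that can sort in parallel
• Communication is heavy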
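The file count on this slide can be checked with a few lines of Java. This is only the arithmetic; `ShuffleFileCount` and `filesPerShuffle` are hypothetical names, and the model assumes every map task writes exactly one file per reduce task.

```java
final class ShuffleFileCount {
    // Files written when each of the M map tasks writes one file
    // for each of the R reduce tasks. multiplyExact guards against overflow.
    static long filesPerShuffle(long mapTasks, long reduceTasks) {
        return Math.multiplyExact(mapTasks, reduceTasks);
    }

    public static void main(String[] args) {
        // M = 5000, R = 1024 from the slide.
        System.out.println(filesPerShuffle(5000, 1024)); // prints 5120000
    }
}
```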

Slide 5

Slide 5 text

Solving the Shuffle Problems
• Number of shuffle files:
  • Can be held to O(R) instead of O(M * R)
  • Use sort-based shuffle instead of hash-based shuffle (one file per R)
  • References: https://issues.apache.org/jira/secure/attachment/12637642/Consolidating%20Shuffle%20Files%20in%20Spark.pdf
  • https://issues.apache.org/jira/browse/SPARK-2045
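The general idea behind a sort-based writer can be sketched as follows: a map task sorts its records by target partition and writes them into a single blob, with an offset table recording where each reducer's slice begins. This is an in-memory illustration under simplifying assumptions, not Spark's implementation; `SortShuffleSketch`, `Rec`, and `MapOutput` are hypothetical names, and all partition ids are assumed to lie in `[0, numReducers)`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortShuffleSketch {
    // One map output record: target reduce partition plus payload.
    static final class Rec {
        final int partition;
        final String payload;
        Rec(int partition, String payload) {
            this.partition = partition;
            this.payload = payload;
        }
    }

    // Output of one map task: a single data blob, where
    // offsets[p]..offsets[p + 1] delimits the bytes for reduce partition p.
    static final class MapOutput {
        final byte[] data;
        final long[] offsets;
        MapOutput(byte[] data, long[] offsets) {
            this.data = data;
            this.offsets = offsets;
        }
    }

    static MapOutput write(List<Rec> records, int numReducers) {
        List<Rec> sorted = new ArrayList<>(records);
        // Stable sort by partition id so each partition is contiguous.
        sorted.sort(Comparator.comparingInt((Rec r) -> r.partition));
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        long[] offsets = new long[numReducers + 1];
        int next = 0;
        for (int p = 0; p < numReducers; p++) {
            offsets[p] = data.size();
            while (next < sorted.size() && sorted.get(next).partition == p) {
                byte[] bytes = (sorted.get(next).payload + "\n")
                        .getBytes(StandardCharsets.UTF_8);
                data.write(bytes, 0, bytes.length);
                next++;
            }
        }
        offsets[numReducers] = data.size();
        return new MapOutput(data.toByteArray(), offsets);
    }

    // What reducer `partition` would fetch: only its own byte range.
    static String read(MapOutput out, int partition) {
        int from = (int) out.offsets[partition];
        int to = (int) out.offsets[partition + 1];
        return new String(out.data, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MapOutput out = write(Arrays.asList(
                new Rec(1, "x"), new Rec(0, "y"), new Rec(1, "z")), 3);
        System.out.println(Arrays.toString(out.offsets));
        System.out.print(read(out, 1));
    }
}
```

With this layout the number of files a single map task produces no longer grows with the number of reducers; a reducer only needs the offset table to locate its slice.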

Slide 6

Slide 6 text

Solving the Shuffle Problems
• Choice of sort algorithm
  • https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  • Implement Timsort
  • Combining several sorting algorithms improves both average and worst-case performance
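To make the "combination of sorting algorithms" concrete, here is a deliberately simplified hybrid sort: insertion sort on short runs, then merge passes over those runs. It is not Timsort itself, which additionally detects runs already present in the data and uses further merge optimizations. `HybridSortSketch` and the tiny `RUN` length are assumptions made for the example.

```java
import java.util.Arrays;

final class HybridSortSketch {
    // Tiny run length so that small inputs exercise both phases.
    static final int RUN = 4;

    static void sort(int[] a) {
        int n = a.length;
        // Phase 1: insertion-sort each fixed-size run (cheap on short ranges).
        for (int lo = 0; lo < n; lo += RUN) {
            insertionSort(a, lo, Math.min(lo + RUN, n));
        }
        // Phase 2: merge neighboring runs, doubling the run width each pass.
        int[] tmp = new int[n];
        for (int width = RUN; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = Math.min(lo + width, n);
                int hi = Math.min(lo + 2 * width, n);
                merge(a, tmp, lo, mid, hi);
            }
        }
    }

    // Sorts the half-open range a[lo, hi) in place.
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int v = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }

    // Merges sorted ranges a[lo, mid) and a[mid, hi) through tmp.
    private static void merge(int[] a, int[] tmp, int lo, int mid, int hi) {
        int i = lo;
        int j = mid;
        int k = lo;
        while (i < mid && j < hi) {
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        }
        while (i < mid) {
            tmp[k++] = a[i++];
        }
        while (j < hi) {
            tmp[k++] = a[j++];
        }
        System.arraycopy(tmp, lo, a, lo, hi - lo);
    }

    public static void main(String[] args) {
        int[] data = {9, 3, 7, 1, 8, 2, 6, 5, 4, 0, 3};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Insertion sort keeps the constant factors low on short ranges, while the merge phase bounds the overall cost at O(n log n).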

Slide 7

Slide 7 text

Solving the Shuffle Problems
• Improve the network module
  • https://issues.apache.org/jira/browse/SPARK-2468
  • Netty-based data transfer implementation (zero copy via FileChannel.transferTo)
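The `FileChannel.transferTo` call mentioned on this slide can be tried in isolation. The sketch below sends a file to another channel without reading it into a user-space buffer; it does not use Netty or Spark, the target is assumed to be a blocking channel, and `ZeroCopySketch`, `send`, and the temp-file names are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopySketch {
    // Sends the whole file to `target` and returns the number of bytes sent.
    // transferTo may move fewer bytes than requested, hence the loop.
    static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = source.size();
            long position = 0;
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("shuffle-block", ".bin");
        Path out = Files.createTempFile("shuffle-copy", ".bin");
        try {
            Files.write(in, "shuffle block bytes".getBytes(StandardCharsets.UTF_8));
            // A file channel stands in for the network socket here.
            try (FileChannel target = FileChannel.open(out, StandardOpenOption.WRITE)) {
                System.out.println(send(in, target));
            }
        } finally {
            Files.deleteIfExists(in);
            Files.deleteIfExists(out);
        }
    }
}
```

In a real server the target would be a socket channel; whether the transfer actually avoids copies depends on the operating system and the channel types involved.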