Introduction to Parallel Computing 2.2
Kazuhiro Serizawa
October 19, 2018
Transcript
Introduction to Parallel Computing 2.2 Limitations of Memory System Performance (p.16-p.24)
Today's presentation
• 2.2 Limitations of Memory System Performance
• 2.2.1 Improving Effective Memory Latency Using Caches (p.17-p.18)
• 2.2.2 Impact of Memory Bandwidth
• 2.2.3 Alternate Approaches for Hiding Memory Latency
• 2.2.4 Tradeoffs of Multithreading and Prefetching
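2.2 Limitations of Memory System Performance
• The effective performance of a program depends not only on the speed of the processor but also on the performance of the memory system that feeds data to it.
• Two metrics describe memory system performance: latency and bandwidth.
• Techniques for using the memory system well differ from one another and often conflict, so it is important to understand the difference between latency and bandwidth.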
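2.2 Limitations of Memory System Performance
• The difference between latency and bandwidth can be explained with a fire hydrant and a fire hose.
• Water comes out of the end of the hose 2 seconds after the hydrant is opened ➡ the latency is 2 seconds.
• The hose delivers 1 gallon of water per second (1 gallon = 3.78541 liters) ➡ the bandwidth is 1 gallon/second.
• To put out a fire immediately ➡ higher water pressure from the hydrant is needed (low latency is needed).
• To fight a bigger fire ➡ a wider hose and hydrant are needed (high bandwidth is needed).
• The analogy carries over to a memory system when "water" is replaced with "data blocks".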
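Example 2.2 Effect of memory latency on performance
• To study the effect of memory latency, consider a system with the following assumptions.
• CPU
  • Runs at 1 GHz and has two multiply-add units.
  • Peak performance is 4 GFLOPS: 2 multiply-add units x 2 flop/cycle x 1 GHz = 4 GFLOPS.
• Memory
  • One memory block consists of 1 word (so every data reference causes a memory access).
  • The latency of a memory access from the CPU is 100 ns.
  • 100 ns equals 100 CPU cycles, so every memory access incurs a 100-cycle wait (1 GHz = 10^9 Hz, 1 ns = 10^-9 s).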
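Example 2.2 Effect of memory latency on performance
• Consider computing the dot product of two vectors on this platform.
• For example, with A = (a1, a2) and B = (b1, b2), A · B = a1 x b1 + a2 x b2.
• The dot product performs one multiply-add per pair of vector elements and accesses memory for every operation, so each floating-point operation (1 flop) takes the 100 ns of a memory access ➡ the computation rate is 10 MFLOPS.
• This is a tiny fraction of the 4 GFLOPS peak and shows how much the memory system matters for achieving a high computation rate.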
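The arithmetic in Example 2.2 can be checked with a short C sketch. The function names below are hypothetical helpers introduced for illustration; they only encode the assumptions stated on the slides (two multiply-add units, 2 flop/cycle each, 1 GHz clock, 100 ns DRAM latency, one memory access per floating-point operation).

```c
/* Peak rate in MFLOPS: number of units x flops per unit per cycle x clock in MHz. */
double peak_mflops(int units, int flops_per_unit_per_cycle, double clock_mhz) {
    return units * flops_per_unit_per_cycle * clock_mhz;
}

/* Rate in MFLOPS when every floating-point operation waits for one memory
 * access of latency_ns nanoseconds: one flop per latency_ns ns is
 * 1e9 / latency_ns flops per second, i.e. 1000 / latency_ns MFLOPS. */
double latency_bound_mflops(double latency_ns) {
    return 1000.0 / latency_ns;
}
```

With the slide's numbers, peak_mflops(2, 2, 1000.0) evaluates to 4000 (4 GFLOPS) and latency_bound_mflops(100.0) to 10, which is 0.25% of the peak.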
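2.2.1 Improving Effective Memory Latency Using Caches
• The speed mismatch between processors and DRAM has driven a number of architectural innovations.
• One of them is to place a "cache", a low-latency, high-bandwidth memory, between the processor and DRAM.
• Data read from memory is fetched into the cache. In principle, if the processor refers to the same data repeatedly, the cache reduces the latency of the memory system.
• The computation rate of many applications is bounded not by the processing speed of the processor but by the rate at which data can be transferred to it. Such computations are called "memory bound".
• The performance of memory bound applications depends heavily on the cache hit ratio, as the next example shows.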
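Example 2.3 Impact of caches on memory system performance
• Changes from Example 2.2
• In addition to the assumptions of Example 2.2, introduce a 32 KB cache with a latency of 1 ns (one cycle).
• Consider fetching two 32 x 32 matrices A and B into this cache and computing the matrix product A x B.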
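Example 2.3 Impact of caches on memory system performance
• Fetching the two matrices into the cache corresponds to fetching 2K words (32 x 32 x 2 = 2,048 words).
• The memory access latency is 100 ns, so this takes approximately 200 μs (100 ns x 2,000).
• Multiplying two n x n matrices takes 2n^3 operations, so the product of A and B corresponds to 64K operations (2 x 32^3 = 65,536).
• The processor executes 4 instructions per cycle, so the computation takes 16K cycles (or 16 μs).
• The total time is approximately the sum of the load/store time and the computation time: 200 + 16 = 216 μs.
• This corresponds to 64K/216 operations per μs, or 303 MFLOPS (64 x 1024 / (216 x 10^-6) = 303,407,407).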
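The estimate above can be reproduced without rounding in a small C sketch. The struct and function names are hypothetical, and the model is only the one the slides describe: every word is fetched once from DRAM at full latency, and the arithmetic then runs at the peak issue rate.

```c
/* Result of the Example 2.3 style estimate; all times in microseconds. */
typedef struct {
    double fetch_us;   /* time to load both matrices from DRAM      */
    double compute_us; /* time for the 2n^3 operations at peak rate */
    double total_us;   /* fetch_us + compute_us                     */
    double mflops;     /* operations per microsecond                */
} matmul_estimate_t;

/* Estimate for multiplying two n x n matrices, assuming each of the 2n^2
 * input words costs one DRAM access of dram_latency_ns and the processor
 * retires ops_per_cycle operations per cycle at clock_ghz. */
matmul_estimate_t matmul_estimate(int n, double dram_latency_ns,
                                  double clock_ghz, int ops_per_cycle) {
    matmul_estimate_t e;
    double words = 2.0 * n * n;     /* two input matrices   */
    double ops = 2.0 * n * n * n;   /* multiply-add counted as 2 ops */
    e.fetch_us = words * dram_latency_ns / 1000.0;
    e.compute_us = ops / ops_per_cycle / clock_ghz / 1000.0; /* cycles / GHz = ns */
    e.total_us = e.fetch_us + e.compute_us;
    e.mflops = ops / e.total_us;
    return e;
}
```

Calling matmul_estimate(32, 100.0, 1.0, 4) gives about 221 μs and about 296 MFLOPS. The slide's 303 MFLOPS comes from rounding 2,048 words to 2,000 and 16,384 cycles to 16K before dividing.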
Example 2.3 Impact of caches on memory system performance
• Summary of Example 2.3
• Performance compared with Example 2.2
  • Example 2.2 -> 10 MFLOPS (no cache)
  • Example 2.3 -> 303 MFLOPS (32 KB cache)
• The cache gives a 30-fold improvement, but the result is still below 10% of the peak processor performance (4 GFLOPS).
• The example shows that placing a small cache memory improves processor utilization.
• The improvement assumes repeated references to the same data items (data reuse).
• Matrix multiplication refers to the same matrix elements many times, so the data in the cache can be expected to be reused.
• Repeated reference to a data item within a short time window is called "temporal locality".
• If each data item were used only once, every operation would still incur the DRAM latency, so data reuse is critical for cache performance.
Today's presentation (continued): 2.2.2 Impact of Memory Bandwidth (p.18-p.21)
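2.2.2 Impact of Memory Bandwidth
• Memory bandwidth is the rate at which data can be moved between the processor and memory.
• Memory bandwidth is determined by the bandwidth of the memory bus.
• One technique for improving memory bandwidth is to increase the size of the memory blocks.
• A single memory request returns a contiguous block of words.
• The contiguous block is typically 2 to 8 words long, and this unit is called a "cache line".
• Example 2.4 examines how this helps an application with limited data reuse.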
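Example 2.4 Effect of block size: dot-product of two vectors
• Changes from Example 2.3
  • One memory block consists of 4 words, so the cache line is 4 words.
• Effect of the larger block size on performance
  • The processor can fetch one 4-word cache line every 100 cycles.
  • Assuming the vectors are laid out contiguously in memory (and are not yet in the cache), 8 FLOPs are executed every 200 cycles, for a peak speed of 40 MFLOPS (25 ns/FLOP).
• The result shows that increasing the block size (and with it the bandwidth) speeds up the dot-product algorithm even though it has no data reuse.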
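Example 2.4 Effect of block size: dot-product of two vectors
• Estimating the cache hit ratio is a quick way to estimate performance bounds.
• In this example, 2 of every 8 data accesses go to DRAM (cache misses), which corresponds to a cache hit ratio of 75%.
• If the cache-miss overhead dominates, the average access time is 25 ns/word (200 ns / 8 words = 25 ns/word).
• The dot product performs 1 operation per word, so the computation rate is 40 MFLOPS.
• A more precise estimate uses the average memory access time 0.75 x 1 + 0.25 x 100 = 25.75 ns/word, which gives 38.8 MFLOPS.
• Compared with the 303 MFLOPS of Example 2.3, where almost all accesses after the initial fetch hit the cache, cache misses degrade performance considerably.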
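The two estimates on this slide follow from the usual weighted-average formula for memory access time. A minimal C sketch, with hypothetical function names and the slide's values as inputs:

```c
/* Average access time per word: hits cost cache_ns, misses cost dram_ns. */
double avg_access_ns(double hit_ratio, double cache_ns, double dram_ns) {
    return hit_ratio * cache_ns + (1.0 - hit_ratio) * dram_ns;
}

/* Rate in MFLOPS for a computation that performs one operation per word
 * fetched, when each word takes ns_per_word nanoseconds on average. */
double mflops_one_op_per_word(double ns_per_word) {
    return 1000.0 / ns_per_word;
}
```

avg_access_ns(0.75, 1.0, 100.0) gives 25.75 ns/word, and mflops_one_op_per_word(25.75) gives about 38.8 MFLOPS. Ignoring the hit time (25 ns/word) gives the rounder 40 MFLOPS figure.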
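Example 2.4 Effect of block size: dot-product of two vectors
• Summary of Example 2.4
• The scenario in Example 2.4 corresponds to a wide data bus (4 words, or 128 bits) connected to multiple memory banks, but such a system is expensive.
• In a practical system, the first of the consecutive words is retrieved and the remaining words are then sent over the memory bus on subsequent bus cycles.
• For example, with a 32-bit data bus the first word is put on the bus after 100 ns, and one more word follows on each subsequent bus cycle.
• The whole cache line (4 words) therefore becomes available after 100 + 3 x (memory bus cycle) ns.
• With a data bus operating at 200 MHz, each word takes 5 ns, so the remaining 3 words arrive in an additional 3 x 5 = 15 ns and fetching 4 words takes 100 + 15 = 115 ns.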
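The cache line fill time above is a one-line formula. The sketch below is an illustrative helper (the name and parameters are assumptions, not part of the slides) that encodes it: the first word pays the full DRAM latency and each remaining word costs one bus cycle.

```c
/* Time in ns until a whole cache line is available: the first word arrives
 * after first_word_ns, then one word per bus cycle at bus_mhz. */
double cache_line_fetch_ns(double first_word_ns, int words_per_line,
                           double bus_mhz) {
    double bus_cycle_ns = 1000.0 / bus_mhz;  /* 200 MHz -> 5 ns */
    return first_word_ns + (words_per_line - 1) * bus_cycle_ns;
}
```

cache_line_fetch_ns(100.0, 4, 200.0) reproduces the slide's 115 ns.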
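Example 2.4 Effect of block size: dot-product of two vectors
• Summary of Example 2.4
• Example 2.4 clearly shows that increased bandwidth raises the computation rate.
• It also assumes a data layout in which consecutive data words are used by consecutive instructions.
• From a computation-centric point of view, the memory accesses have "spatial locality".
• If a computation (or its data access pattern) has no spatial locality, the effective bandwidth is much smaller than the peak bandwidth.
• An example of such an inefficient access pattern is reading a matrix column by column when it is stored in memory in row-major order.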
Example 2.5 Impact of strided access
• Store the sum of each column of matrix b in the vector column_sum.
• column_sum is small enough to fit in the cache.
• Matrix b is accessed column by column.
    for (i = 0; i < 1000; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < 1000; j++)
            column_sum[i] += b[j][i];
    }
Example 2.5 Impact of strided access
• Matrix b is stored row by row, so each access in the inner loop of this code jumps over 1,000 elements (one row) and the cache is not used effectively.
• As a result, performance is poor.
• This sample illustrates strided access.
• "Stride" means to step over something. Here every access steps across data that lies on different cache lines ➡ this situation is described as a loss of spatial locality.
Example 2.6 Eliminating strided access
    for (i = 0; i < 1000; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 1000; i++)
            column_sum[i] += b[j][i];
• Rewriting Example 2.5 this way accesses the matrix row by row and exploits spatial locality.
• However, if the matrix is so large that the vector column_sum no longer fits in the cache, cache misses on column_sum become the bottleneck.
• In that case the loop has to be split into blocks and the sums computed one block at a time.
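The three access patterns discussed in Examples 2.5 and 2.6 can be put side by side in C. This is a sketch under stated assumptions: the matrix is passed as a flat row-major array of n x n doubles (so element (j, i) is b[j * n + i]), and the function names and the block parameter are introduced here for illustration. All three functions compute the same column sums; they differ only in the order in which memory is touched.

```c
#include <stddef.h>

/* Example 2.5 pattern: for each column i, walk down the rows.
 * Consecutive accesses to b are n elements apart (strided). */
void column_sum_strided(const double *b, double *column_sum, size_t n) {
    for (size_t i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            column_sum[i] += b[j * n + i];
    }
}

/* Example 2.6 pattern: walk each row left to right, so consecutive
 * accesses to b are adjacent in memory (unit stride). */
void column_sum_rowwise(const double *b, double *column_sum, size_t n) {
    for (size_t i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            column_sum[i] += b[j * n + i];
}

/* Blocked variant: process `block` columns at a time so that the active
 * slice of column_sum stays small, while accesses to b within a row
 * segment remain unit stride. */
void column_sum_tiled(const double *b, double *column_sum, size_t n,
                      size_t block) {
    if (block == 0)
        block = 1;  /* guard against an empty block size */
    for (size_t i0 = 0; i0 < n; i0 += block) {
        size_t i1 = (block < n - i0) ? i0 + block : n;
        for (size_t i = i0; i < i1; i++)
            column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            for (size_t i = i0; i < i1; i++)
                column_sum[i] += b[j * n + i];
    }
}
```

A suitable block size depends on the cache size of the target machine; the sketch leaves it as a parameter rather than guessing a value.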
Summary of 2.2.2
• The examples in this section illustrate the following concepts.
• Exploiting the spatial and temporal locality of an application is essential for amortizing memory latency and increasing effective memory bandwidth.
• Some applications have inherently high temporal locality and therefore tolerate low memory bandwidth.
• Organizing the memory layout and the computation appropriately can have a large effect on spatial and temporal locality.
Today's presentation (continued): 2.2.3 Alternate Approaches for Hiding Memory Latency (p.22-p.23)
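2.2.3 Alternate Approaches for Hiding Memory Latency
• When a web site responds slowly, any of the following approaches can improve the situation.
a) Predict which pages will be viewed next and request them in advance ➡ prefetching
b) Open several browsers and access a different page in each ➡ multithreading
c) Fetch several pages at once ➡ spatial locality in memory access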
Multithreading for Latency Hiding
• A thread is a single stream of control in the flow of a program.
• Example 2.7 illustrates threads with a simple example.
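Example 2.7 Thread execution of matrix multiplication
• Multiply an n x n matrix a by a vector b and store the result in a vector c.
• The dot product for each row is independent of the others, so the following loop can be rewritten to use threads.
    for (i = 0; i < n; i++)
        c[i] = dot_product(get_row(a, i), b);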
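Example 2.7 Thread execution of matrix multiplication
• While one thread loads a vector from memory and computes on it, another thread can process a different vector.
• So while one thread is waiting on a memory load, a thread whose load has completed can perform its computation.
• As a result, a computation can be executed in every cycle.
    for (i = 0; i < n; i++)
        c[i] = create_thread(dot_product, get_row(a, i), b);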
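Example 2.7 Thread execution of matrix multiplication
• A multithreaded processor can keep the contexts of many computation threads with outstanding requests (memory accesses, communication and so on) and execute them as the requests are satisfied.
• Machines such as the HEP and Tera rely on multithreaded processors that can switch the execution context every cycle.
• As a result, threads hide latency effectively, provided there is enough concurrency to keep the processor from idling.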
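The create_thread call on the slides is pseudocode. As a rough illustration only, the same structure can be written with POSIX threads, one thread per row. This swaps in operating-system threads, which is not the hardware multithreading of the HEP or Tera; it shows the independence of the row computations rather than cycle-level latency hiding. The type dot_task_t and the function names are hypothetical, and the matrix is assumed to be a flat row-major array.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Work item for one row: c[i] = row . b */
typedef struct {
    const double *row;  /* pointer to row i of the matrix */
    const double *b;    /* the vector                     */
    size_t n;           /* length of row and vector       */
    double *out;        /* where to store the dot product */
} dot_task_t;

static void *dot_product_task(void *arg) {
    dot_task_t *t = arg;
    double sum = 0.0;
    for (size_t k = 0; k < t->n; k++)
        sum += t->row[k] * t->b[k];
    *t->out = sum;
    return NULL;
}

/* Computes c = a * b for an n x n row-major matrix a, one thread per row.
 * Returns 0 on success, -1 on allocation or thread creation failure. */
int matvec_threaded(const double *a, const double *b, double *c, size_t n) {
    if (n == 0)
        return 0;
    pthread_t *threads = malloc(n * sizeof *threads);
    dot_task_t *tasks = malloc(n * sizeof *tasks);
    if (threads == NULL || tasks == NULL) {
        free(threads);
        free(tasks);
        return -1;
    }
    size_t started = 0;
    int status = 0;
    for (size_t i = 0; i < n; i++) {
        tasks[i] = (dot_task_t){ a + i * n, b, n, &c[i] };
        if (pthread_create(&threads[i], NULL, dot_product_task, &tasks[i]) != 0) {
            status = -1;  /* stop creating threads, but join the started ones */
            break;
        }
        started++;
    }
    for (size_t i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    free(tasks);
    return status;
}
```

One thread per row keeps the sketch close to the slide; for n = 1000 a practical program would more likely give each of a few worker threads a range of rows, because creating an OS thread costs far more than one dot product.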
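Prefetching for Latency hiding
• In a typical program, a data item is loaded and then used by the processor within a short time window, so if the load misses the cache, the computation that uses the data stalls.
• A simple solution is to advance the loads so that the data has already arrived by the time it is used, even if there is a cache miss.
• However, if the data is overwritten by another instruction between the advanced load and its use, it has to be loaded again.
• Advancing loads therefore amounts to identifying threads of execution that have no resource dependency on one another (for example, that do not use the same registers).
• Many compilers aggressively try to advance loads in this way.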
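Example 2.8 Hiding latency by prefetching
• Consider adding the elements of two vectors a and b in a for loop.
• The first iteration requests the pair (a[0], b[0]), which is not in the cache, so it incurs the memory latency.
• While these requests are being serviced, later iterations issue the requests for the pairs (a[1], b[1]), (a[2], b[2]) and so on.
• If each request is issued in 1 ns and a memory request completes in 100 ns, the pair a[0], b[0] comes back after 100 such requests have been issued.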
Example 2.8 Hiding latency by prefetching
[Figure: timeline of overlapping memory requests, each followed by its a+b addition; horizontal axis: elapsed time since the start of processing (0, 1, 2, 3, 4)]
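The idea of issuing loads ahead of their use can be sketched in C with the GCC/Clang builtin __builtin_prefetch. This is an illustration under assumptions, not what the slides or any particular compiler do: the function name and the PREFETCH_DISTANCE value are made up for the example, and a good distance depends on the machine. On compilers without the builtin the loop degrades to a plain vector addition.

```c
#include <stddef.h>

/* Assumed look-ahead in elements; a real value would be tuned per machine. */
#define PREFETCH_DISTANCE 16

/* c[i] = a[i] + b[i], requesting elements PREFETCH_DISTANCE iterations
 * ahead so their memory latency overlaps with the current additions. */
void vector_add_prefetch(const double *a, const double *b, double *c,
                         size_t n) {
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for read; third = locality hint. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 0);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE], 0, 0);
        }
#endif
        c[i] = a[i] + b[i];
    }
}
```

The prefetch is only a hint: the result of the addition is the same with or without it, and hardware prefetchers on current CPUs often handle a simple sequential loop like this on their own.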
Summary of 2.2.3
• Multithreading and prefetching hide load time by issuing data accesses ahead of their use.
• Advancing data loads requires identifying threads that have no resource dependencies.
Today's presentation (continued): 2.2.4 Tradeoffs of Multithreading and Prefetching (p.23-p.24)
2.2.4 Tradeoffs of Multithreading and Prefetching
• As seen in 2.2.3, multithreading and prefetching appear to solve every problem of the memory system.
• However, they are severely affected by memory bandwidth.
Example 2.9 Impact of bandwidth on multithreaded programs
• Consider a computation on a machine with the following assumptions.
• 1 GHz CPU clock
• 4-word (4 bytes x 4) cache line, accessible in 1 cycle
• Cache hit ratio of 25% with a 1 KB cache and 90% with a 32 KB cache
• DRAM with 100 ns latency
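Example 2.9 Impact of bandwidth on multithreaded programs
• Consider two ways of executing the computation.
a. Single-threaded execution in which the entire cache is available to the serial context
b. Multithreaded execution with 32 threads, each with 1 KB of cache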
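Example 2.9 Impact of bandwidth on multithreaded programs
a. Single-threaded execution in which the entire cache is available to the serial context
• If one data request is issued every cycle, the first DRAM access takes 100 ns.
• The cache hit ratio is 90%, so to serve the remaining 10% of accesses at the same pace as cache hits, DRAM must deliver 1 word every 10 ns ➡ 0.1 words/ns = 100M words/s = 400 MB/s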
Example 2.9 Impact of bandwidth on multithreaded programs
a. Single-threaded execution in which the entire cache is available to the serial context
[Figure: timeline of accesses at 1 ns each, mostly cache hits with occasional misses; each miss issues a 100 ns memory request; horizontal axis: elapsed time since the start of processing]
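Example 2.9 Impact of bandwidth on multithreaded programs
b. Multithreaded execution with 32 threads, each with 1 KB of cache
[Figure: timeline of accesses at 1 ns each, one hit followed by three misses; each miss issues a memory request; horizontal axis: elapsed time since the start of processing]
• The bandwidth required from DRAM is 3 words every 4 cycles (0.75 words/ns), which is about 3 GB/s.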
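Both bandwidth figures in Example 2.9 come from the same calculation: the fraction of requests that miss the cache has to be served by DRAM. A minimal C sketch with hypothetical helper names, using the slide's assumptions (one request per ns, 4-byte words):

```c
/* Words per ns that DRAM must supply when requests_per_ns data requests
 * are issued and a fraction hit_ratio of them is served by the cache. */
double dram_words_per_ns(double requests_per_ns, double hit_ratio) {
    return requests_per_ns * (1.0 - hit_ratio);
}

/* Converts words/ns to MB/s: words/ns x bytes/word = bytes/ns = GB/s. */
double dram_mb_per_s(double words_per_ns, int bytes_per_word) {
    return words_per_ns * bytes_per_word * 1000.0;
}
```

With a 90% hit ratio (case a) this gives 0.1 words/ns, about 400 MB/s; with a 25% hit ratio (case b) it gives 0.75 words/ns, about 3,000 MB/s.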
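Example 2.9 Impact of bandwidth on multithreaded programs
• Example 2.9 illustrates a very important issue.
• The bandwidth requirement of a multithreaded system can increase sharply because each thread has only a small cache residency.
• For example, a sustained DRAM bandwidth of 400 MB/s is reasonable, but 3.0 GB/s is more than most systems currently offer.
• At this point the multithreaded system is bandwidth bound rather than latency bound.
• It is important to recognize that multithreading and prefetching address only the latency problem and can make the bandwidth problem worse.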
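Example 2.9 Impact of bandwidth on multithreaded programs
• As another example, consider advancing the loads of data into 10 registers.
• If the loaded data is overwritten by another instruction, the load has to be done again.
• The latency does not increase in this case, but the same data is fetched twice, so the bandwidth required from memory doubles.
• This situation can be alleviated by supporting prefetching and multithreading with larger register files and caches.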