Introduction to Parallel Computing 2.2

Kazuhiro Serizawa
October 19, 2018

Transcript

  1. Today's agenda
    • 2.2 Limitations of Memory System Performance
    • 2.2.1 Improving Effective Memory Latency Using Caches
    • 2.2.2 Impact of Memory Bandwidth
    • 2.2.3 Alternate Approaches for Hiding Memory Latency
    • 2.2.4 Tradeoffs of Multithreading and Prefetching
  2. Today's agenda
    • 2.2 Limitations of Memory System Performance
    • 2.2.1 Improving Effective Memory Latency Using Caches (p.17-p.18)
    • 2.2.2 Impact of Memory Bandwidth
    • 2.2.3 Alternate Approaches for Hiding Memory Latency
    • 2.2.4 Tradeoffs of Multithreading and Prefetching
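  3. 2.2 Limitations of Memory System Performance
    • The effective performance of a program depends not only on the speed of the processor but also on the performance of the memory system that feeds data to it
    • The two metrics of memory system performance are "latency" and "bandwidth"
    • The techniques for making good use of the memory system differ from one another and compete with each other, so it is important to understand the difference between latency and bandwidth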
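  4. 2.2 Limitations of Memory System Performance
    • The difference between latency and bandwidth is explained with the example of a fire hydrant and a fire hose
    • It takes 2 seconds from opening the hydrant until water comes out of the end of the hose ➡ the latency is "2 seconds"
    • Water flows through the hose at 1 gallon per second (1 gallon == 3.78541 liters) ➡ the bandwidth is "1 gallon/second"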
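  5. 2.2 Limitations of Memory System Performance
    • The difference between latency and bandwidth is explained with the example of a fire hydrant and a fire hose
    • When a fire must be put out immediately ➡ higher water pressure from the hydrant is needed (low latency is needed)
    • When fighting a larger fire ➡ a wider hose and hydrant are needed (high bandwidth is needed)
    • The example applies to memory systems in the same way if "water" is replaced with "data blocks"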
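The hydrant analogy can be put into numbers with a simple cost model. The sketch below assumes, which the slides do not state, that the time to deliver an amount of water (or data) is the latency plus the amount divided by the bandwidth. The function name `delivery_time` is made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: time to deliver `amount` units through a channel,
 * assuming time = latency + amount / bandwidth. This is a common
 * first-order model, not something taken from the slides. */
double delivery_time(double latency_s, double bandwidth_per_s, double amount)
{
    assert(bandwidth_per_s > 0.0);
    return latency_s + amount / bandwidth_per_s;
}
```

With the hose above (2 s latency, 1 gallon/s), `delivery_time(2.0, 1.0, 10.0)` gives 2 + 10/1 = 12 seconds. For small amounts the latency term dominates; for large amounts the bandwidth term does.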
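  6. Example 2.2 Effect of memory latency on performance
    • To study the effect of latency in a memory system, consider a system with the following assumptions
    • CPU
      • Runs at a clock rate of 1 GHz and has two multiply-add units
      • Peak performance is 4 GFLOPS
      • 2 multiply-add units, each computing 2 flop/cycle, x 1 GHz = 4 GFLOPS
    • Memory
      • One memory block consists of 1 word (that is, every access goes to memory)
      • The latency of a memory access from the CPU is 100 ns
      • 100 ns equals 100 CPU cycles, so every memory access stalls the CPU for 100 cycles (1 GHz = 10^9 Hz, 1 ns = 10^-9 s)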
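  7. Example 2.2 Effect of memory latency on performance
    • Consider computing the dot product of two vectors on such a platform
    • For example, when A = (a1, a2) and B = (b1, b2), A · B = a1 x b1 + a2 x b2
    • The dot product performs one multiply-add per pair of vector elements and accesses memory for every operation, so each operation (1 flop) takes the 100 ns of a memory access ➡ the computation rate is 10 MFLOPS
    • This is a tiny fraction of the 4 GFLOPS peak and shows how important memory system performance is for achieving high computation rates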
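The 10 MFLOPS figure follows directly from the latency. As a minimal sketch, assuming one floating-point operation per memory access and no overlap between accesses (the function name is illustrative):

```c
#include <assert.h>

/* MFLOPS achieved when every floating-point operation waits for one
 * memory access of `latency_ns` nanoseconds (no cache, no overlap). */
double latency_bound_mflops(double latency_ns)
{
    assert(latency_ns > 0.0);
    /* 1 flop per latency_ns = (1e9 / latency_ns) flop/s = (1e3 / latency_ns) MFLOPS */
    return 1.0e3 / latency_ns;
}
```

For the 100 ns DRAM of this example the function returns 10 MFLOPS, which is 0.25% of the 4 GFLOPS peak.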
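  8. 2.2.1 Improving Effective Memory Latency Using Caches
    • Dealing with the speed mismatch between processors and DRAM has driven a number of design innovations
    • One such innovation is to place a low-latency, high-bandwidth memory called a "cache" between the processor and DRAM
    • Data read from memory is fetched into the cache, so in principle, if the processor references the same data repeatedly, the effective latency of the memory system is reduced
    • The computation rate of many applications is bounded not by the processing speed of the processor but by the rate at which data can be delivered to it; such computations are called "memory bound"
    • The performance of memory-bound applications is strongly affected by the cache hit ratio, as the next example shows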
  9. Example 2.3 Impact of caches on memory system performance
    • Changes from Example 2.2
    • In addition to the assumptions of Example 2.2, introduce a 32 KB cache with a latency of 1 ns (or 1 cycle)
    • Using this cache, fetch two 32 x 32 matrices A and B and consider computing the matrix product A x B
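  10. Example 2.3 Impact of caches on memory system performance
    • Fetching the two matrices into the cache corresponds to fetching 2K words (32 x 32 x 2 = 2,048 words)
    • The memory access latency is 100 ns, so this takes approximately 200 μs (100 ns x 2,000)
    • Multiplying two n x n matrices takes 2n^3 operations, so the product of A and B corresponds to 64K operations (2 x 32^3 = 65,536)
    • Four instructions can be executed per cycle, so the processing time is 16K cycles (or 16 μs)
    • The total time is approximately the sum of the load/store time and the computation time: 200 + 16 = 216 μs
    • This corresponds to 64K/216 operations per μs, or 303 MFLOPS (64 x 1024 / (216 x 10^-6) ≈ 303,407,407 FLOPS)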
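The arithmetic of Example 2.3 can be checked with a short sketch. It assumes, as the slides do, one DRAM access per word for the initial fetch and no overlap between fetching and computing; the function and parameter names are illustrative.

```c
#include <assert.h>

/* Estimated MFLOPS for multiplying two n x n matrices when both are first
 * fetched from DRAM one word at a time and then reused from the cache. */
double cached_matmul_mflops(int n, double dram_latency_ns,
                            double flops_per_cycle, double cycle_ns)
{
    assert(n > 0 && dram_latency_ns > 0.0);
    assert(flops_per_cycle > 0.0 && cycle_ns > 0.0);
    double words      = 2.0 * n * n;               /* elements of A and B */
    double fetch_ns   = words * dram_latency_ns;   /* one miss per word */
    double flops      = 2.0 * n * n * n;           /* multiply and add per term */
    double compute_ns = flops / flops_per_cycle * cycle_ns;
    return flops / (fetch_ns + compute_ns) * 1.0e3; /* flop/ns to MFLOPS */
}
```

Calling `cached_matmul_mflops(32, 100.0, 4.0, 1.0)` uses the unrounded values (204.8 μs of fetching and 16.384 μs of computing) and gives about 296 MFLOPS. The 303 MFLOPS on the slide comes from rounding those times to 200 μs and 16 μs.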
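  11. Example 2.3 Impact of caches on memory system performance
    • Summary of Example 2.3
    • Performance compared with Example 2.2
      • Example 2.2 -> 10 MFLOPS (no cache)
      • Example 2.3 -> 303 MFLOPS (with a 32 KB cache)
    • The cache yields a 30-fold improvement, but the result is still below 10% of peak processor performance (4 GFLOPS)
    • The example shows that adding a small cache memory can improve processor utilization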
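  12. Example 2.3 Impact of caches on memory system performance
    • Summary of Example 2.3
    • The performance gain from the cache assumes that the same data items are referenced repeatedly (reused)
    • In matrix multiplication the same matrix elements are referenced many times, so data in the cache can be expected to be reused
    • Repeated reference to a data item within a short time window is called "temporal locality"
    • If each data item were used only once, every operation would incur the DRAM latency, so data reuse is critical to cache performance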
  13. Today's agenda
    • 2.2 Limitations of Memory System Performance
    • 2.2.1 Improving Effective Memory Latency Using Caches
    • 2.2.2 Impact of Memory Bandwidth (p.18-p.21)
    • 2.2.3 Alternate Approaches for Hiding Memory Latency
    • 2.2.4 Tradeoffs of Multithreading and Prefetching
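  14. 2.2.2 Impact of Memory Bandwidth
    • Memory bandwidth is the rate at which data can be moved between the processor and memory
    • Memory bandwidth is determined by the bandwidth of the memory bus
    • One technique for improving memory bandwidth is to increase the size of the memory blocks
    • A single memory request returns a contiguous block
    • The size of this contiguous block is typically 2 to 8 words, and the unit is called a "cache line"
    • Example 2.4 examines how this helps applications with limited data reuse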
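  15. Example 2.4 Effect of block size: dot-product of two vectors
    • Changes from Example 2.3
      • One memory block consists of 4 words, so the cache line is 4 words
    • Effect of the changed block size on performance
      • The processor can fetch a 4-word cache line every 100 cycles
      • Assuming the vectors are laid out in memory (not yet in the cache), 8 FLOPs are executed every 200 cycles, which gives a peak speed of 40 MFLOPS (25 ns/FLOP)
    • The result shows that a larger block size (more bandwidth) speeds up the dot-product algorithm even though it has no data reuse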
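  16. Example 2.4 Effect of block size: dot-product of two vectors
    • Estimating the cache hit ratio gives a quick estimate of the performance bound
    • In this example, 2 of every 8 data accesses go to DRAM (cache misses), which corresponds to a cache hit ratio of 75%
    • If the overhead of cache misses dominates, the average access time is 25 ns/word (200 ns / 8 words = 25 ns/word)
    • The dot product performs 1 operation per word, so the computation rate is 40 MFLOPS
    • A more precise estimate takes the average memory access time as 0.75 x 1 + 0.25 x 100 = 25.75 ns/word, which gives 38.8 MFLOPS
    • Example 2.3, where the data was reused from the cache, reached 303 MFLOPS, so the cache misses here cause a considerable loss of performance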
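The hit-ratio estimate above is a weighted average and is easy to reproduce. The following is a small sketch with illustrative function names; it assumes a fixed cost per hit and per miss and a computation that is entirely limited by memory access time.

```c
#include <assert.h>

/* Average memory access time for a given cache hit ratio. */
double avg_access_ns(double hit_ratio, double hit_ns, double miss_ns)
{
    assert(hit_ratio >= 0.0 && hit_ratio <= 1.0);
    return hit_ratio * hit_ns + (1.0 - hit_ratio) * miss_ns;
}

/* MFLOPS when the computation performs `flops_per_word` operations per
 * word accessed and is limited by the memory access time. */
double memory_bound_mflops(double access_ns_per_word, double flops_per_word)
{
    assert(access_ns_per_word > 0.0);
    return flops_per_word / access_ns_per_word * 1.0e3;
}
```

With a 75% hit ratio, a 1 ns hit and a 100 ns miss, `avg_access_ns` returns 25.75 ns/word, and `memory_bound_mflops(25.75, 1.0)` returns roughly 38.8 MFLOPS, matching the more precise estimate on the slide.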
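  17. Example 2.4 Effect of block size: dot-product of two vectors
    • Summary of Example 2.4
    • The scenario in Example 2.4 corresponds to a wide data bus (4 words, or 128 bits) connected to multiple memory banks, but such a system is expensive
    • In a more practical system, after the first of the consecutive words has been retrieved, the remaining words are sent over the memory bus in subsequent bus cycles
    • For example, with a 32-bit data bus the first word is put on the bus after 100 ns, and one more word follows in each subsequent bus cycle
    • The whole cache line (4 words) therefore becomes available after 100 + 3 x (memory bus cycle) ns
    • If the data bus operates at 200 MHz, it transfers a word every 5 ns, so the remaining 3 words arrive after an additional 3 x 5 = 15 ns and fetching 4 words takes 100 + 15 = 115 ns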
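The 115 ns figure can be reproduced with a small helper. This is a sketch under the slide's assumptions (the first word arrives after the DRAM latency, and each remaining word takes one bus cycle); the function name is illustrative.

```c
#include <assert.h>

/* Time until a whole cache line is available when the first word arrives
 * after the DRAM latency and each remaining word takes one bus cycle. */
double cache_line_fetch_ns(double dram_latency_ns, int words_per_line,
                           double bus_mhz)
{
    assert(words_per_line >= 1 && bus_mhz > 0.0);
    double bus_cycle_ns = 1.0e3 / bus_mhz;   /* 200 MHz gives 5 ns */
    return dram_latency_ns + (words_per_line - 1) * bus_cycle_ns;
}
```

`cache_line_fetch_ns(100.0, 4, 200.0)` returns 115 ns, only 15% more than fetching a single word.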
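  18. Example 2.4 Effect of block size: dot-product of two vectors
    • Summary of Example 2.4
    • Example 2.4 clearly shows that increased bandwidth raises the computation rate
    • It also assumes a data layout in which consecutive data words are used by consecutive instructions
    • From a computation-centric point of view, the memory accesses have "spatial locality"
    • If a computation (or its data access pattern) lacks spatial locality, the effective bandwidth can be much smaller than the peak bandwidth
    • An example of such an (inefficient) access pattern is reading a matrix column by column even though it is stored in memory in row-major order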
  19. Example 2.5 Impact of strided access
    • The column sums of matrix b are stored in the vector column_sum
    • column_sum is small enough to fit in the cache
    • Matrix b is accessed column by column

      for (i = 0; i < 1000; i++) {
          column_sum[i] = 0.0;
          for (j = 0; j < 1000; j++)
              column_sum[i] += b[j][i];
      }
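  20. Example 2.5 Impact of strided access

      for (i = 0; i < 1000; i++) {
          column_sum[i] = 0.0;
          for (j = 0; j < 1000; j++)
              column_sum[i] += b[j][i];
      }

    • Matrix b is stored in row-major order, so each access in this code lands 1,000 elements away from the previous one (in a different row vector) and the cache is not used effectively
    • As a result, performance is poor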
  21. Example 2.5 Impact of strided access

      for (i = 0; i < 1000; i++) {
          column_sum[i] = 0.0;
          for (j = 0; j < 1000; j++)
              column_sum[i] += b[j][i];
      }

    • This example illustrates the strided access problem
    • "Stride" means "to step across"; here the accesses step across data that sits on many different cache lines ➡ this situation is described as "spatial locality is lost"
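How much a stride hurts can be estimated by counting cache-line fetches. The sketch below is a deliberately crude model and not a description of any real cache: it assumes that only the most recently fetched line is still resident, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the cache-line fetches triggered by reading `count` elements
 * that are `stride_words` apart, under the simplifying assumption that
 * only the most recently fetched line is still in the cache. */
size_t line_fetches(size_t count, size_t stride_words, size_t line_words)
{
    assert(line_words > 0);
    size_t fetches = 0;
    size_t last_line = (size_t)-1;          /* no line fetched yet */
    for (size_t k = 0; k < count; k++) {
        size_t line = (k * stride_words) / line_words;
        if (line != last_line) {            /* different line: go to DRAM */
            fetches++;
            last_line = line;
        }
    }
    return fetches;
}
```

Walking one column of a 1000 x 1000 row-major matrix with 4-word lines, `line_fetches(1000, 1000, 4)` reports 1,000 fetches, one per element. Walking one row, `line_fetches(1000, 1, 4)` reports 250, because each fetched line supplies four useful words.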
  22. Example 2.6 Eliminating strided access

      for (i = 0; i < 1000; i++)
          column_sum[i] = 0.0;
      for (j = 0; j < 1000; j++)
          for (i = 0; i < 1000; i++)
              column_sum[i] += b[j][i];

    • Rewriting Example 2.5 in this way makes the code traverse the matrix row by row, so it can exploit spatial locality
    • However, if the matrix is so large that the column_sum vector no longer fits in the cache, cache misses on column_sum become the bottleneck
    • In that case the loop has to be split into several blocks and the sums computed block by block
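The blocking mentioned in the last bullet could look roughly like the sketch below. It assumes a row-major matrix stored in a flat array; the function name and the `block` tuning parameter are illustrative, and a suitable block size depends on the cache of the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Column sums of a row-major rows x cols matrix, processed in column
 * blocks of `block` entries so that the active slice of column_sum
 * stays small enough to remain in the cache. */
void column_sums_blocked(const double *b, size_t rows, size_t cols,
                         size_t block, double *column_sum)
{
    assert(block > 0);
    for (size_t i0 = 0; i0 < cols; i0 += block) {
        size_t i1 = (i0 + block < cols) ? i0 + block : cols;
        for (size_t i = i0; i < i1; i++)
            column_sum[i] = 0.0;
        for (size_t j = 0; j < rows; j++)        /* one row at a time */
            for (size_t i = i0; i < i1; i++)     /* contiguous within the row */
                column_sum[i] += b[j * cols + i];
    }
}
```

Within each block the inner loop still reads consecutive words of a row, so spatial locality is preserved while only `block` entries of `column_sum` are live at a time.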
  23. Today's agenda
    • 2.2 Limitations of Memory System Performance
    • 2.2.1 Improving Effective Memory Latency Using Caches
    • 2.2.2 Impact of Memory Bandwidth
    • 2.2.3 Alternate Approaches for Hiding Memory Latency (p.22-p.23)
    • 2.2.4 Tradeoffs of Multithreading and Prefetching
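  24. 2.2.3 Alternate Approaches for Hiding Memory Latency
    • When a web site responds slowly, the situation can be improved with one of the following approaches
    a) Predict which pages will be visited next and request them in advance ➡ prefetching
    b) Open several browsers and access a different page in each ➡ multithreading
    c) Fetch several pages at once ➡ spatial locality in memory accesses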
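  25. Example 2.7 Threaded execution of matrix multiplication
    • The product of an n x n matrix a and a vector b is stored in a vector c
    • The computation for each row is independent of the others, so the following loop can be rewritten to use threads

      for (i = 0; i < 1000; i++)
          c[i] = dot_product(get_row(a, i), b);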
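  26. Example 2.7 Threaded execution of matrix multiplication
    • While one thread loads a vector from memory and computes on it, another thread can process a different vector
    • As a result, while one thread is waiting on a load from memory, another thread whose load has completed can compute
    • In effect, a computation can be performed in every cycle

      for (i = 0; i < 1000; i++)
          c[i] = create_thread(dot_product, get_row(a, i), b);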
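`create_thread` on the slide is pseudocode. As a hedged sketch of the same idea with POSIX threads, the version below gives each thread a contiguous chunk of rows rather than a single row. All names (`row_task`, `row_worker`, `matvec_threads`, `MAX_THREADS`) are hypothetical. Software threads on a conventional CPU are also only an approximation of the hardware multithreading discussed here, and building the code may require the `-pthread` compiler flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Work description for one thread: rows [row_begin, row_end) of c = a * b. */
typedef struct {
    const double *a;      /* n x n matrix, row-major */
    const double *b;      /* vector of length n */
    double *c;            /* result vector of length n */
    size_t n, row_begin, row_end;
} row_task;

static void *row_worker(void *arg)
{
    row_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++) {
        double sum = 0.0;
        for (size_t k = 0; k < t->n; k++)
            sum += t->a[i * t->n + k] * t->b[k];  /* dot product of row i and b */
        t->c[i] = sum;
    }
    return NULL;
}

#define MAX_THREADS 8

/* Computes c = a * b using `nthreads` POSIX threads (1..MAX_THREADS).
 * Returns 0 on success or the error code from pthread_create. */
int matvec_threads(const double *a, const double *b, double *c,
                   size_t n, size_t nthreads)
{
    pthread_t tid[MAX_THREADS];
    row_task task[MAX_THREADS];
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    size_t chunk = (n + nthreads - 1) / nthreads;  /* rows per thread, rounded up */
    size_t started = 0;
    int err = 0;
    for (size_t t = 0; t < nthreads; t++) {
        size_t lo = t * chunk, hi = lo + chunk;
        if (lo > n) lo = n;
        if (hi > n) hi = n;
        task[t] = (row_task){a, b, c, n, lo, hi};
        err = pthread_create(&tid[t], NULL, row_worker, &task[t]);
        if (err != 0)
            break;
        started++;
    }
    for (size_t t = 0; t < started; t++)           /* wait for all started threads */
        pthread_join(tid[t], NULL);
    return err;
}
```

Each thread writes a disjoint range of `c`, so no locking is needed. Whether the threads actually hide memory latency depends on the hardware and the scheduler.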
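  27. Example 2.7 Threaded execution of matrix multiplication
    • A multithreaded processor can maintain the contexts of a large number of computation threads with outstanding requests (memory accesses, communication and so on) and execute them
    • Machines such as HEP and Tera rely on multithreaded processors that can switch the execution context in every cycle
    • As a result, threads can hide latency effectively and provide the parallelism needed to keep the processor from idling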
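  28. Prefetching for Latency hiding
    • Prefetching advances data loads ahead of their use, but if the data is overwritten by another instruction before it is used, it has to be loaded again
    • To make this work, threads of execution with no resource dependencies (for example, ones that do not use the same registers) have to be identified
    • Many compilers aggressively try to advance data loads in this way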
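  29. Example 2.8 Hiding latency by prefetching
    • Consider adding the elements of two vectors a and b with a for loop
    • The first iteration requests the pair (a[0], b[0]); it is not in the cache, so the memory latency is incurred
    • While these requests are being serviced, later iterations issue requests for the pairs (a[1], b[1]), (a[2], b[2]) and so on
    • If each request is issued in 1 ns and a memory request completes in 100 ns, the pair (a[0], b[0]) is returned after 100 requests have been issued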
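In software, the same idea can be hinted at with an explicit prefetch. The sketch below uses `__builtin_prefetch`, a builtin offered by GCC and Clang that is only a hint to the hardware. The function name and the prefetch distance `dist` are illustrative assumptions, and on many processors the hardware prefetcher already handles a simple sequential loop like this one.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], requesting the operands `dist` iterations ahead so
 * that they may already be in the cache when they are needed. */
void vector_add_prefetch(const double *a, const double *b, double *c,
                         size_t n, size_t dist)
{
    assert(n == 0 || (a && b && c));
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + dist < n) {                  /* stay inside the arrays */
            __builtin_prefetch(&a[i + dist]);
            __builtin_prefetch(&b[i + dist]);
        }
#endif
        c[i] = a[i] + b[i];
    }
}
```

The result is the same as a plain loop; only the timing of the memory requests changes. With the numbers of Example 2.8 (1 ns per request, 100 ns latency), a distance on the order of 100 iterations would be needed to cover the latency.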
  30. Example 2.8 Hiding latency by prefetching
    [Figure: timeline of memory requests and a+b operations against elapsed time since the start of processing (0 to 4)]
  31. Today's agenda
    • 2.2 Limitations of Memory System Performance
    • 2.2.1 Improving Effective Memory Latency Using Caches
    • 2.2.2 Impact of Memory Bandwidth
    • 2.2.3 Alternate Approaches for Hiding Memory Latency
    • 2.2.4 Tradeoffs of Multithreading and Prefetching (p.23-p.24)
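  32. Example 2.9 Impact of bandwidth on multithreaded programs
    • Consider a computation on a machine with the following assumptions
    • CPU with a 1 GHz clock
    • 4-word (4 bytes x 4) cache lines that can be accessed in 1 cycle
    • The cache hit ratio is 25% with a 1 KB cache and 90% with a 32 KB cache
    • DRAM with a latency of 100 ns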
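  33. Example 2.9 Impact of bandwidth on multithreaded programs
    • Consider two modes of execution
    a. Single-threaded execution, in which the entire cache is available to the sequential context
    b. Multithreaded execution with 32 threads, in which each thread has 1 KB of cache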
  34. Example 2.9 Impact of bandwidth on multithreaded programs
    a. Single-threaded execution, in which the entire cache is available to the sequential context
    • If a data request is issued every cycle, the first access to DRAM takes 100 ns
    • The cache hit ratio is 90%, so for the remaining 10% of accesses to keep pace with the cache hits, DRAM must deliver 1 word every 10 ns ➡ 0.1 words/ns = 100M words/s = 400 MB/s
  35. Example 2.9 Impact of bandwidth on multithreaded programs
    a. Single-threaded execution, in which the entire cache is available to the sequential context
    [Figure: timeline of elapsed time since the start of processing; most accesses are hits taking 1 ns each, and each miss issues a memory request taking 100 ns]
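  36. Example 2.9 Impact of bandwidth on multithreaded programs
    b. Multithreaded execution with 32 threads, in which each thread has 1 KB of cache
    [Figure: timeline of elapsed time since the start of processing in 1 ns steps; one hit is followed by misses, each of which issues a memory request]
    • The bandwidth required from DRAM corresponds to 3 words per 4 cycles (0.75 words/ns x 4 bytes) ≒ 3 GB/s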
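Both bandwidth figures of Example 2.9 follow from the miss ratio. This is a minimal sketch, assuming one word is requested every `request_ns` nanoseconds and only cache misses reach DRAM; the function name is illustrative.

```c
#include <assert.h>

/* DRAM bandwidth in MB/s needed to serve the cache misses when one word
 * of `word_bytes` bytes is requested every `request_ns` nanoseconds.
 * 1 byte/ns equals 1000 MB/s. */
double required_dram_mb_per_s(double hit_ratio, double word_bytes,
                              double request_ns)
{
    assert(hit_ratio >= 0.0 && hit_ratio <= 1.0 && request_ns > 0.0);
    double miss_words_per_ns = (1.0 - hit_ratio) / request_ns;
    return miss_words_per_ns * word_bytes * 1.0e3;
}
```

With 4-byte words and one request per nanosecond, a 90% hit ratio requires about 400 MB/s, while the 25% hit ratio of the 32-thread case requires about 3,000 MB/s (3 GB/s).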
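  37. Example 2.9 Impact of bandwidth on multithreaded programs
    • Example 2.9 illustrates a very important issue
    • The bandwidth requirements of a multithreaded system can increase sharply because each thread has a smaller share of the cache
    • For example, a sustained DRAM bandwidth of 400 MB/s is reasonable, but 3.0 GB/s is more than most systems currently offer
    • At this point the multithreaded system becomes bandwidth bound rather than latency bound
    • It is important to recognize that multithreading and prefetching address only the latency problem and can make the bandwidth problem worse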
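  38. Example 2.9 Impact of bandwidth on multithreaded programs
    • As another example, consider a situation in which data loads into 10 registers have been advanced
    • If the loaded data is overwritten by another instruction, the load has to be issued again
    • In this case the latency does not increase, but the same data is fetched twice, so the bandwidth required from memory doubles
    • This situation can be alleviated by supporting prefetching and multithreading with larger register files and caches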