
Polca Dancing from High Performance Computing Meetup

Munich Lambda
April 16, 2014


Transcript

  1. A 30-Minute Dance Lesson…
    Lutz Schubert, University of Ulm
    Jan Kuper, University of Twente
    Daniel Rubio Bonilla, HLRS
    Salvador Tamarit, IMDEA
    http://www.polca-
  2. Processor Development: from faster to more
    [Figure: frequency (MHz, 10^1 to 10^5) over the years 1985 to 2020, comparing actual development, projected development and typical application performance]
  3. More = Faster?
    • In theory:
    • 1 processor at 100 Hz executes 100 operations per second
    • 1 processor at 200 Hz executes 200 operations per second
    • HENCE: 2 processors at 100 Hz execute 200 operations per second
    ➢ Everything is fine?
  4. Not a New Problem
    1958 • S. Gill (Ferranti) discussed parallel programming • J. Cocke, D. Slotnick (IBM) discussed parallelism for numerical calculations
    1962 • 4-processor machine that accessed 16 memory modules
    1967 • Amdahl and Slotnick debated the feasibility of parallel processing
    1969 • Honeywell introduced the Multics system (symmetric 8-processor machine)
  5. Modern HPC Systems
    Current largest machine:
    • Tianhe-2 (meaning MilkyWay-2)
    • 3,120,000 cores
    • Theoretical peak 54,902 TFlops ≈ 54*10^15 Flops
    • Sustained 33,863 TFlops
    • Power consumption 18 MW
    Will there be Exascale computers?
  6. Usage Problems #1: Reliability
    • Exascale ~200,000 CPUs
    • MTBF for one CPU = ~10 years
    ➢ MTBF for 200,000 CPUs: 10 years = 3,652 days = 87,648 hours = 5,258,880 minutes; 5,258,880 / 200,000 ≈ 26 minutes(!)
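The arithmetic on this slide can be reproduced with a small helper. This is a minimal sketch, assuming independent CPU failures at a constant rate, so that the system MTBF is the per-CPU MTBF divided by the CPU count. The function name `system_mtbf_minutes` is hypothetical, and the factor of 365.2 days per year is chosen only to match the slide's 3652 days for 10 years.

```c
#include <assert.h>

/* System MTBF in minutes for n_cpus identical CPUs.
   Assumption: failures are independent with a constant rate,
   so the system MTBF is the per-CPU MTBF divided by n_cpus. */
double system_mtbf_minutes(double cpu_mtbf_years, long n_cpus)
{
    assert(cpu_mtbf_years > 0.0 && n_cpus > 0);
    /* 365.2 days per year matches the slide's "10 years = 3652 days". */
    double cpu_mtbf_minutes = cpu_mtbf_years * 365.2 * 24.0 * 60.0;
    return cpu_mtbf_minutes / (double)n_cpus;
}
```

Calling `system_mtbf_minutes(10.0, 200000)` gives roughly 26 minutes, the figure quoted on the slide.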
  7. Usage Problems #2: Energy
    • ~20 MW for ~50*10^15 Flops
    ➢ Exaflop = 400 MW (a medium-scale power reactor generates around 300 MW)
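The 400 MW estimate follows from scaling power linearly with performance. The sketch below makes that assumption explicit; `scaled_power_mw` is a hypothetical helper, and real machines need not scale linearly.

```c
#include <assert.h>

/* Power needed at target_flops, assuming power grows linearly with
   performance from a known reference point (an assumption, not a law). */
double scaled_power_mw(double ref_power_mw, double ref_flops, double target_flops)
{
    assert(ref_flops > 0.0);
    return ref_power_mw * (target_flops / ref_flops);
}
```

With the slide's reference point, `scaled_power_mw(20.0, 50e15, 1e18)` scales 20 MW by a factor of 20.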
  8. More = Faster? Simple applications are not faster on modern / more machines, e.g. how to use multiple processors for:
    int n; int a[]; …
    do { n = a[n]; } while (n < (sizeof(a)/sizeof(int)));
  9. More ≠ Faster!
    [Figure: the array 1 5 2 7 34 8 777 56 3 654 66 27 13 314 211 212 … indexed by n, with four processors at 100 Hz each. More = Faster? Total = 100 Hz]
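Why the total stays at 100 Hz can be seen by tracing the loop by hand: every load needs the index produced by the previous load, so extra processors have nothing to do. The sketch below is an illustration only. The function `chase`, the starting index 0, and the use of just the 16 values visible on the slide are assumptions; the slide leaves `n` uninitialised and the array continues past what is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Follow n = a[n] from `start` until n leaves the array.
   Stores the number of loads performed in *steps and returns the final n.
   The loop also stops after len loads, which can only happen if the
   chain runs into a cycle. */
int chase(const int *a, size_t len, int start, int *steps)
{
    assert(a != NULL && steps != NULL);
    int n = start;
    *steps = 0;
    while (n >= 0 && (size_t)n < len && (size_t)*steps < len) {
        n = a[n];      /* the next index is only known once this load completes */
        (*steps)++;
    }
    return n;
}
```

Starting at index 0 on the slide's values, the chain visits 1, 5, 8, 3, 7 and then leaves the array with n = 56, six strictly sequential loads in total.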
  10. Simple Example
    double n = 5;
    for (int i=0; i<3; i++) { a[i] = pow(a[i],n); }
    a[0] = pow(a[0],5); a[1] = pow(a[1],5); a[2] = pow(a[2],5);
    ?
  11. Amdahl’s Law
    [Figure: run time split into T serial and T parallel]
    T serial + T parallel
    T serial + ½ T parallel
    T serial + ⅓ T parallel
    T serial + ¼ T parallel
    T serial + ⅕ T parallel
    T serial + ⅙ T parallel
    …
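The sequence on this slide is the run time T serial + T parallel / p for p = 1, 2, 3, … processors. A minimal sketch of that formula and the resulting speedup follows; the function names are hypothetical and communication cost is ignored here.

```c
#include <assert.h>

/* Run time on p processors: the serial part is unaffected,
   the parallel part is divided evenly among the processors. */
double amdahl_time(double t_serial, double t_parallel, int p)
{
    assert(p >= 1);
    return t_serial + t_parallel / p;
}

/* Speedup relative to a single processor. */
double amdahl_speedup(double t_serial, double t_parallel, int p)
{
    return amdahl_time(t_serial, t_parallel, 1) / amdahl_time(t_serial, t_parallel, p);
}
```

For example, with T serial = 10 and T parallel = 90, the speedup can never reach 10 however large p becomes, because the serial part always remains.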
  12. Visualised
    Each “heat point” is influenced by its direct neighbours and, vice versa, implicitly influences all of its direct neighbours
  13. Algorithm
    // per iteration
    for (int y=0; y<size_y; y++)
      for (int x=0; x<size_x; x++)
        htmp[x,y] = h[x-1,y] + h[x,y-1] + ... - 4*h[x,y];
    for (int y=0; y<size_y; y++)
      for (int x=0; x<size_x; x++)
        h[x,y] = htmp[x,y];
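The slide's pseudocode elides two neighbour terms and says nothing about the grid edges. The following is a runnable sketch under stated assumptions: a standard 5-point stencil, boundary cells held fixed, and an explicit update h + alpha * (neighbour sum - 4h) with a hypothetical coefficient `alpha`. The grid size and the name `heat_step` are likewise illustrative.

```c
#include <string.h>

#define SIZE_X 8
#define SIZE_Y 8

/* One iteration of an explicit heat-spread step on a fixed-size grid.
   Every interior cell reads only the old grid h and writes only htmp,
   so all cell updates within one iteration are independent. */
void heat_step(double h[SIZE_Y][SIZE_X], double alpha)
{
    double htmp[SIZE_Y][SIZE_X];
    memcpy(htmp, h, sizeof htmp);   /* boundary cells keep their values */

    for (int y = 1; y < SIZE_Y - 1; y++)
        for (int x = 1; x < SIZE_X - 1; x++)
            htmp[y][x] = h[y][x] + alpha * (h[y][x - 1] + h[y][x + 1]
                                          + h[y - 1][x] + h[y + 1][x]
                                          - 4.0 * h[y][x]);

    memcpy(h, htmp, sizeof htmp);   /* second loop nest of the slide */
}
```

Writing into a temporary grid is what keeps the cells independent; updating `h` in place would make each cell depend on neighbours that were already overwritten in the same iteration.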
  14. Further Problems
    • Memory hierarchy
    • Distance & number of hops
    • Bandwidth
    • Architecture layout
    • Operating System
    • Jitter
    • …
  15. What to do? Example: Halo
    Halos can be used to further decrease the communication overhead during calculation, but this leads to
    • Higher cache load
    • Higher setup communication
  16. The main problem
    The compiler cannot see how data is actually accessed. This includes
    • read / write order
    • indirection
    • etc.
    int n; int a[]; …
    do { n = a[n]; } while (n < (sizeof(a)/sizeof(int)));
  17. Part 3
    • As opposed to C: mathematical approach
    • All dependencies known
    • Breakdown is easiest: each core gets only one task
    • Communication & memory allocation not controllable
    • Need load-to-compute ratio
  18. Explicit dependencies
    [Figure: cells z0 … z9 at t=0 pass through layers of g at t=1, t=2 and t=3 to give z'''0 … z'''9]
  19. Ideal for parallel computing? Problems:
    • The compiler does not know how to interpret the communication
    • Workload vs. communication load is unknown
    • Same segmentation and intelligence missing as in C
    • A “legal” segmentation would have every function on an individual core
    ➢ Can we combine the performance information with the communication knowledge?
  20. Annotation Model: POLCA extends the C programming language
    [Equation not recoverable from the extracted text]
    double heatmap[10]; double heatmap_tmp[10];
    #pragma func main() = heatspread( initmap( heatmap ) )
    void main() { …
      for (int iter=0; iter<100; iter++) // 100 iterations
      { heatspread(heatmap); memcpy(heatmap, heatmap_tmp, sizeof(heatmap)); } }
    #pragma func heatspread(heatmap) = zipWith (+) (heatcell( neighbours(heatmap) )) (heatmap)
    void heatspread(double* heatmap) {
      for (int x=0; x<10; x++) { double dphi = heatcell( neighbours(heatmap, x) ); heatmap_tmp[x] = heatmap[x] + dphi; } }
    ...
    [Figure labels: workflow, data in, data out]
  21. Dataflow for function ①, serialised
    Three operations
    [Figure: dataflow graph over y0 … y4 (t=0) and y'1 … y'4 (t=1), built from + - * operation chains]
  22. Dataflow for function ②, serialised
    Two operations; reuse a result
    [Figure: dataflow graph over y0 … y4 (t=0) and y'0 … y'4 (t=1), built from * and - operations]
  23. Dataflow Analysis: dependencies define parallelism
    ① [Figure: y0 … y3 with + - * chains] No read / write dependency
    ② [Figure: y0 … y3 with * and - operations] write dependency, read dependency
  24. Dataflow Analysis: dependencies define parallelism
    ② [Figure: y0 … y3 with * and - operations] sequential
    ① [Figure: y0 … y3 with + - * chains] parallel; no read / write dependency
  25. Transformations affecting the program behaviour
    • Changes the load
    • Maintains correctness
    • But affects the algorithm
  26. Nature of Simulation: simulated time does(?) imply iteration order
    [Figure: cells z0 … z9 at t=0 pass through layers of g over the iterations t=1, t=2 and t=3 to give z'''0 … z'''9]
  27. Iteration Function is NOT bound to t
    [Figure: cells z4, z5, z6 at t=0; function applications g(3,1), g(4,1), g(5,1), g(2,2), g(3,2), g(4,2), g(3,3) across t=1, t=2, t=3, labelled with the effective calculation iteration]
  28. Impact on Code: unchanged algorithm for function ②
    for (int t=0; t<100; t++)
      for (int x=0; x<5; x++) {
        ysqr[x,t] = y[x,t-1]*y[x,t-1];
        y[x,t] = ysqr[x-1,t]-ysqr[x,t];
      }
  29. Impact on Code: changing the iterator for algorithm ②
    for (int t=0; t<100+5; t++)
      parallelfor (int x=0; x<5; x++) {
        ysqr[x,t-x] = y[x,t-x-1]*y[x,t-x-1];
        y[0+x,t-x] = ysqr[x-1,t-x] - ysqr[x,t-x];
      }
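The effect of changing the iterator can be checked on a small instance. The sketch below is a plain C illustration, not POLCA output. It assumes a left boundary of ysqr = 0 for x = 0, stores the initial values in row 0, uses 8 time steps instead of 100, and adds the range guard that the slide omits. All function names are hypothetical. In the skewed version, cell x works on time t = s - x during outer step s, so everything it reads was written in outer step s - 1 and the inner loop may run in any order, which the `reverse` flag demonstrates.

```c
#include <string.h>

#define NX 5   /* cells, as on the slide */
#define NT 8   /* time steps (the slide uses 100) */

/* Update cell x at time t from row t-1; ysqr left of x = 0 is taken as 0. */
static void update_cell(double y[NT + 1][NX], double ysqr[NT + 1][NX], int x, int t)
{
    ysqr[t][x] = y[t - 1][x] * y[t - 1][x];
    double left = (x > 0) ? ysqr[t][x - 1] : 0.0;
    y[t][x] = left - ysqr[t][x];
}

/* Original order: the inner loop is sequential, since cell x reads
   ysqr[t][x-1], which cell x-1 writes in the same outer iteration. */
void run_original(double y[NT + 1][NX])
{
    double ysqr[NT + 1][NX];
    memset(ysqr, 0, sizeof ysqr);
    for (int t = 1; t <= NT; t++)
        for (int x = 0; x < NX; x++)
            update_cell(y, ysqr, x, t);
}

/* Skewed order: in outer step s, cell x computes time t = s - x. */
void run_skewed(double y[NT + 1][NX], int reverse)
{
    double ysqr[NT + 1][NX];
    memset(ysqr, 0, sizeof ysqr);
    for (int s = 1; s <= NT + NX - 1; s++)
        for (int i = 0; i < NX; i++) {            /* order-independent loop */
            int x = reverse ? NX - 1 - i : i;
            int t = s - x;
            if (t < 1 || t > NT) continue;        /* outside the time range */
            update_cell(y, ysqr, x, t);
        }
}

/* Largest absolute difference between two result grids. */
double max_abs_diff(double a[NT + 1][NX], double b[NT + 1][NX])
{
    double m = 0.0;
    for (int t = 0; t <= NT; t++)
        for (int x = 0; x < NX; x++) {
            double d = a[t][x] - b[t][x];
            if (d < 0.0) d = -d;
            if (d > m) m = d;
        }
    return m;
}
```

Running `run_original` and `run_skewed` on copies of the same initial row and comparing them with `max_abs_diff` shows the same values, at the price of NX - 1 extra outer steps, which corresponds to the `100+5` bound on the slide.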
  30. “Real” Amdahl
    [Figure: run time split into T serial, T parallel and two T comm segments]
    T serial + ½ T parallel + 2 T comm
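The slide gives only the two-processor case. One way to explore the trend is to assume that each participating processor adds one T comm, which reproduces the slide's 2 T comm for p = 2. That generalisation, and the name `real_amdahl_time`, are assumptions made for illustration, not something the slide states.

```c
#include <assert.h>

/* Run time on p processors including communication.
   Assumption: every participating processor adds one t_comm when p > 1,
   which matches the slide's two-processor case (2 * t_comm). */
double real_amdahl_time(double t_serial, double t_parallel, double t_comm, int p)
{
    assert(p >= 1);
    double comm = (p > 1) ? p * t_comm : 0.0;
    return t_serial + t_parallel / p + comm;
}
```

Under this model, adding processors eventually makes the run slower: with T serial = 10, T parallel = 90 and T comm = 1, the time for 30 processors is higher than for 9.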