Polca Dancing from High Performance Computing Meetup

A 30-Minute Dance Lesson…
Lutz Schubert, University of Ulm
Jan Kuper, University of Twente
Daniel Rubio Bonilla, HLRS
Salvador Tamarit, IMDEA
http://www.polca-

Part 1 Large Scale Computing

Processor Development: from faster to more
[Plot: frequency (MHz, 10^1 to 10^5) versus year (1985 to 2020), showing actual development, projected development and typical application performance]

More = Faster?
• In theory:
• 1 processor at 100 Hz executes 100 operations per second
• 1 processor at 200 Hz executes 200 operations per second
• HENCE: 2 processors at 100 Hz execute 200 operations per second
➢ Everything is fine?

Not a New Problem
1958
• S. Gill (Ferranti) discussed parallel programming
• J. Cocke, D. Slotnick (IBM) discussed parallelism for numerical calculations
1962
• 4-processor machine that accessed 16 memory modules
1967
• Amdahl and Slotnick debated the feasibility of parallel processing
1969
• Honeywell introduced the Multics system (symmetric 8-processor machine)

Modern HPC Systems
Current largest machine:
• Tianhe-2 (meaning MilkyWay-2)
• 3.120.000 cores
• Theoretically 54.902 TFlops = 54*10^15 Flops
• Sustained 33.863 TFlops
• Power consumption 18 MWatt
Will there be Exascale computers?

Usage Problems #1: Reliability
• Exascale ~200.000 CPUs
• MTBF for one CPU = ~10 years
➢ MTBF for 200.000 CPUs:
10 years = 3.652 days = 87.648 hours = 5.258.880 minutes
5.258.880 / 200.000 = 26 minutes(!)

Usage Problems #2: Energy
• ~20 MWatt for ~50*10^15 Flops
➢ Exaflop = 400 MWatt (a medium scale power reactor creates around 300 MW)
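The 400 MWatt figure follows from linear scaling, under the assumption (implicit on the slide) that power consumption grows in proportion to the flop rate:

```latex
P_{\text{exa}} \approx 20\,\text{MW} \times \frac{10^{18}\,\text{Flops}}{50 \times 10^{15}\,\text{Flops}}
               = 20\,\text{MW} \times 20 = 400\,\text{MW}
```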

Usage Problems #3: Performance
… but still: 300 processors = 300 times the speed?

More = Faster?
Simple applications are not faster on modern / more machines, e.g. how to use multiple processors for:
int n;
int a[];
…
do {
    n = a[n];
} while (n < (sizeof(a)/sizeof(int)));

More ≠ Faster!
[Figure: the array 1 5 2 7 34 8 777 56 3 654 66 27 13 314 211 212 … traversed via n=, with four processors at 100 Hz each]
More = Faster? Total = 100 Hz

More ≠ Faster
➢ In order to use multiple processors, the program must be parallelisable

Part 2 Gene Amdahl…

Amdahl’s Law

Simple Example
double n = 5;
for (int i=0; i<3; i++) {
    a[i] = pow(a[i],n);
}

Simple Example
double n = 5;
for (int i=0; i<3; i++) {
    a[i] = pow(a[i],n);
}
a[0] = pow(a[0],5);
a[1] = pow(a[1],5);
a[2] = pow(a[2],5);
?

Amdahl’s Law
[Diagram: T serial followed by T parallel]
T serial + T parallel
T serial + ½ T parallel
T serial + ⅓ T parallel
T serial + ¼ T parallel
T serial + ⅕ T parallel
T serial + ⅙ T parallel
…

Amdahl’s Law
[Plot: speedup versus number of processing units, for B=0, B=0.02, B=0.1, B=0.5]

Heat Dissipation Example

Heat Dissipation Example: 1-d heat dissipation function
[Figure: axes x and t]

Discretization … ② ① ③

Visualised

Visualised
Each “heat point” is influenced by its direct neighbours and, vice versa, implicitly influences all its direct neighbours

Algorithm
// per iteration: first compute all new values into htmp ...
for (int y=0; y<size_y; y++)
    for (int x=0; x<size_x; x++)
        htmp[x,y] = h[x-1,y] + h[x,y-1] + ... - 4*h[x,y];

// ... then copy them back into h
for (int y=0; y<size_y; y++)
    for (int x=0; x<size_x; x++)
        h[x,y] = htmp[x,y];

Straightforward?

Let’s think about this
[Figure: h[x,y] and htmp[x,y]]
That’s fine, isn’t it?

Problem 1 There is only one bus

Amdahl’s Law
[Plot: speedup versus number of processing units, for B=0, B=0.02, B=0.1, B=0.5]

Real Speedup 1
[Plot: speedup versus number of processing units, for B=0.02]

Problem 2: Non-shared memory
Needs communication => needs time

Real Speedup 2
[Plot: speedup versus number of processing units, for B=0.02]

Further Problems
• Memory hierarchy
• Distance & number of hops
• Bandwidth
• Architecture layout
• Operating System
• Jitter
• …

What to do? Example: Halo
Halos can be used to further decrease the communication overhead during calculation, but this leads to
• Higher cache load
• Higher setup communication

The main problem
The compiler cannot see how data is actually accessed. This includes
• read / write order
• indirection
• etc.
int n;
int a[];
…
do {
    n = a[n];
} while (n < (sizeof(a)/sizeof(int)));

Part 3 Declarative Programming?

Part 3
• As opposed to C: mathematical approach
• All dependencies known
• Easiest breakdown: each core gets only one task
• Communication & memory allocation not controllable
• Need load-to-compute ratio

Recap: Heatmap … ② ① ③

Explicit dependencies
[Figure: dependency graph from the cells z0 … z9 at t=0, through applications of g at t=1, t=2 and t=3, to z'''0 … z'''9]

Ideal for parallel computing?
Problems:
• The compiler does not know how to interpret the communication
• Workload vs. communication load is unknown
• Same segmentation and intelligence missing as in C
• A “legal” segmentation would have every function on an individual core
➢ Can we combine the performance information with the communication knowledge?

Part 4 POLCA Dancing

Annotation Model: POLCA extends the C programming language
y_{x,t} = (y_{x-1,t-1} + y_{x,t-1}) * (y_{x-1,t-1} - y_{x,t-1})
double heatmap[10];
double heatmap_tmp[10];

#pragma func main() = heatspread( initmap( heatmap ) )
int main(void)
{
    …
    for (int iter=0; iter<100; iter++)  // 100 iterations
    {
        heatspread(heatmap);
        memcpy(heatmap, heatmap_tmp, sizeof heatmap);  // copy all 10 doubles
    }
}

#pragma func heatspread(heatmap) = zipWith (+) (heatcell( neighbours(heatmap) )) (heatmap)
void heatspread(double* heatmap)
{
    for (int x=0; x<10; x++)
    {
        double dphi = heatcell( neighbours(heatmap, x) );
        heatmap_tmp[x] = heatmap[x] + dphi;
    }
}
...
[Diagram labels: workflow, data in, data out]

Equivalency of Functions Mathematical Transformations

Transformation: mathematical functions can be altered
y_{x,t} = (y_{x-1,t-1} + y_{x,t-1}) * (y_{x-1,t-1} - y_{x,t-1})
⟺
a^2 - b^2 = a^2 - ab + ba - b^2 = (a+b)*(a-b)

Three operations
Dataflow for function ①, serialised
[Figure: dataflow graph over y0 … y4 (t=0) and y'1 … y'4 (t=1); each result is produced by +, - and *]

Reuse a result
Two operations
Dataflow for function ②, serialised
[Figure: dataflow graph over y0 … y4 (t=0) and y'0 … y'4 (t=1); each result is produced by * and -]

Dataflow Analysis: dependencies define parallelism
[Figure ①: y0 … y3, each result produced by +, - and *; no read / write dependency]
[Figure ②: y0 … y3, each result produced by * and -; a write dependency and a read dependency are marked]

Dataflow Analysis: dependencies define parallelism
[Figure ①: parallel; no read / write dependency]
[Figure ②: sequential]

Transformations affecting the program behaviour
• Changes the load
• Maintains correctness
• But affects the algorithm

Operational Load and Degree of Concurrency Can Be Manipulated
Tempo-Spatial Reasoning

Nature of Simulation: simulated time does(?) imply iteration order
[Figure: dependency graph from z0 … z9 at t=0, through applications of g at t=1, t=2 and t=3, to z'''0 … z'''9; axis: iteration]

Iteration Function is NOT bound to t
[Figure: grid over t=0, t=1, t=2, t=3]

Iteration Function is NOT bound to t
[Figure: cells z4, z5, z6 at t=0 and the applications g(3,1), g(4,1), g(5,1), g(2,2), g(3,2), g(4,2), g(3,3) across t=1 … t=3; axis: effective calculation iteration]

Iteration Function is NOT bound to t
[Figure: grid over t=0, t=1, t=2, t=3; axis: effective calculation iteration]

Affecting Parallelism: applying the principle to function ②
[Figure: grid over t=0, t=1, t=2, t=3; axis: calculation iteration]

Affecting Parallelism: applying the principle to function ②
[Figure: grid over t=0, t=1, t=2, t=3; axis: effective calculation iteration]

Affecting Parallelism: applying the principle to function ②
[Figure: grid over t=0, t=1, t=2, t=3 with the skewed steps τ=0 and τ=1; axis: effective calculation iteration]

Impact on Code: unchanged algorithm for function ②
for (int t=0; t<100; t++)
    for (int x=0; x<5; x++) {
        ysqr[x,t] = y[x,t-1] * y[x,t-1];
        y[x,t] = ysqr[x-1,t] - ysqr[x,t];
    }

Impact on Code: changing the iterator for algorithm ②
for (int t=0; t<100+5; t++)
    parallelfor (int x=0; x<5; x++) {
        ysqr[x,t-x] = y[x,t-x-1] * y[x,t-x-1];
        y[0+x,t-x] = ysqr[x-1,t-x] - ysqr[x,t-x];
    }

y_{x,t} = (y_{x-1,t-1} + y_{x,t-1}) * (y_{x-1,t-1} - y_{x,t-1})
Lutz Schubert
University of Ulm
[email protected]
Please find more information on http://

“Real” Amdahl
[Diagram: T serial, T parallel]
T serial + T parallel

“Real” Amdahl
[Diagram: T serial, T parallel, T comm]
T serial + ½ T parallel + T comm

“Real” Amdahl
[Diagram: T serial, T parallel, T comm]
T serial + ⅓ T parallel + T comm

“Real” Amdahl
[Diagram: T serial, T parallel, T comm]

“Real” Amdahl
[Diagram: T serial, T parallel, T comm, T comm]
T serial + ½ T parallel + 2 T comm

“Real” Amdahl
[Plot: speedup versus number of processing units, for B=0, B=0.02, B=0.1, B=0.5]

