This talk will introduce students to parallel programming. Topics include a survey of parallel programming paradigms and some sample parallel programs in OpenMP and MPI.
Big Problems?... Time to Get Parallel
A brief foray into the world of parallel computing
Scott Michael, [email protected]
Presented at the XSEDE 12 Conference, 16 Jul 2012
Available from: https://speakerdeck.com/u/scamicha/
Released under an unported license; license terms on the last slide.
• What is parallel computing and why is it important?
• How do I design a parallel program?
  – Challenges
• What are the tools that enable parallel programming?
  – Shared vs. distributed memory
• Demonstrations
• Wikipedia says: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").”
• Supercomputers = massive numbers of cores
• Everyday computers = many/multi-core
Sequoia: LLNL’s Blue Gene/Q is the world’s fastest supercomputer, with 20 PFLOPS and 1.6 million cores
Two main strategies are used for problem decomposition:
– Task parallel (MISD, MIMD, MPMD): processing elements do different tasks
– Data parallel (SIMD, SPMD): processing elements do the same task on different data
McCormick et al. 2007
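As an illustrative sketch (not code from the talk), OpenMP can express both strategies: the sections construct assigns different tasks to different threads (task parallel), while the for construct splits the same loop over different data (data parallel).

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int a[8], b[8];

    /* Task parallel: each section is a distinct task run by a different thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < 8; i++) a[i] = i * i; }   /* task 1: fill a */
        #pragma omp section
        { for (int i = 0; i < 8; i++) b[i] = 2 * i; }   /* task 2: fill b */
    }

    /* Data parallel: every thread runs the same operation on different elements */
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        a[i] += b[i];

    printf("a[7] = %d\n", a[7]);
    return 0;
}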
There are a variety of challenges in designing and implementing parallel algorithms:
– Amdahl’s law
– Load balance
– Interprocess communication/data locality
– Fault tolerance
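For reference, Amdahl’s law puts a hard ceiling on the achievable speedup: if a fraction p of the work can be parallelized across N processing elements, the best possible speedup is

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

so a code that is 90% parallel (p = 0.9) can never run more than 10 times faster than the serial version, no matter how many cores are used.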
• Determine the best type of parallel implementation for your work
  – Embarrassingly parallel: many unrelated tasks that don’t require communication
  – Shared memory parallel: processing elements require access to shared global memory
  – Distributed memory parallel: processing elements only require access to local memory
• Embarrassingly parallel problems consist of many tasks that can be executed completely independently, but need some organizing infrastructure
• There are many high-throughput frameworks that address embarrassingly parallel problems
  – Condor, BigJob, etc.
• Shared memory parallelism allows multiple tasks (threads) to access the same global memory
• Useful within a single node or on large shared-memory machines
• Can be implemented with
  – POSIX threads (Pthreads): low-level system access
  – OpenMP: compiler pragmas
  – PGAS languages: parallel semantics built into the language itself
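For comparison with the OpenMP examples later in the talk, a minimal Pthreads “Hello World” might look like the following sketch (an illustration, not code from the talk):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 6

void *hello(void *arg)
{
    long tid = (long) arg;   /* thread index passed in by the creating thread */
    printf("Hello World from thread = %ld\n", tid);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void *) t);   /* launch each thread */
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);                         /* wait for all threads to finish */
    return 0;
}

Note the explicit thread management (creation, argument passing, joining) that OpenMP hides behind a single pragma.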
• Distributed memory parallelism uses explicit message passing to coordinate tasks
• MPI (Message Passing Interface) is a standardized set of library routines and tools that lets you write code coordinating the efforts of many nodes with distributed memory
• Two common ways to structure an MPI program are:
  – Simultaneous execution: all ranks perform the exact same tasks
  – Master/worker: one rank acts as the controlling node and orchestrates and distributes work to the other nodes
• You can combine the two types of parallelism
• A combined OpenMP/MPI program is commonly referred to as a hybrid program
• There are several reasons you might want to do this
  – Performance
  – Memory
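A minimal hybrid skeleton, offered here as an illustrative sketch rather than the talk’s hybrid_v1 code, launches a team of OpenMP threads inside each MPI rank:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank, size;

    /* Ask for an MPI library that tolerates OpenMP threads inside each rank */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI rank forks its own team of OpenMP threads */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello from thread %d of rank %d (of %d ranks)\n", tid, rank, size);
    }

    MPI_Finalize();
    return 0;
}

Compile with an MPI compiler wrapper plus the OpenMP flag, e.g. mpicc -fopenmp.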
#include <omp.h>   /* this goes in every OpenMP program */
#include <stdio.h>

int main (int argc, char *argv[])
{
  /* This function (or the environment variable OMP_NUM_THREADS) sets the number of threads */
  omp_set_num_threads(6);

  // Fork a team of threads -- the magic sauce
  #pragma omp parallel
  {
    printf("Hello World! \n");
  }
}
OpenMP uses compiler directives, or pragmas, to enable parallelism in compiled code
• Some of the most important pragmas are:
  #pragma omp parallel
  #pragma omp for
  #pragma omp sections
  #pragma omp single
  #pragma omp task
  #pragma omp critical
  #pragma omp barrier
  #pragma omp master
• Check OpenMP.org for a full list of pragmas and functions
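As a quick illustration (not from the slides), the for pragma combined with a reduction clause splits a loop’s iterations across the thread team and safely combines each thread’s partial result:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Iterations are divided among the threads; each keeps a private partial
       sum, and the reduction clause adds them together at the end */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("sum = %f\n", sum);
    return 0;
}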
#PBS -N hello_omp                                     # Specify resources…
#PBS -q batch
#PBS -j oe
cd $PBS_O_WORKDIR
if [ ! -x hello_omp_v1 ]; then
  gcc -O3 -o hello_omp_v1 -fopenmp hello_omp_v1.c     # -fopenmp is the compiler flag that makes it all happen
fi
./hello_omp_v1 > hello_v1.out                         # Execute and redirect the output
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
  int nthreads, tid, maxthreads;
  omp_set_num_threads(6);

  // Fork a team of threads; private(tid) gives each thread its own 'thread private' copy of tid
  #pragma omp parallel private(tid)
  {
    // Only the 'master' thread executes this block
    #pragma omp master
    {
      // More OpenMP functions
      maxthreads = omp_get_max_threads();
      nthreads = omp_get_num_threads();
      printf("There are %d threads available. %d of these threads are currently active. \n",
             maxthreads, nthreads);
    }

    // Obtain thread number -- remember, every thread has its own 'private' copy of this variable
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
  }
}
int main (int argc, char *argv[])
{
  ...
  #pragma omp parallel private(tid)
  {
    #pragma omp master
    {
      maxthreads = omp_get_max_threads();
      nthreads = omp_get_num_threads();
      printf("There are %d threads available. %d of these threads are currently active. \n",
             maxthreads, nthreads);
    }

    // Threads wait here until they've all reached this point
    #pragma omp barrier

    // Obtain thread number
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
  }
}
#include <mpi.h>   /* this goes in every MPI program */
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);   /* starts MPI */
    printf("Hello world!\n");
    MPI_Finalize();           /* ends MPI */
    return 0;
}
MPI provides a library of functions to enable inter-process communication and parallelism in compiled code
• Some of the most important functions are:
  int MPI_Init(int *argc, char ***argv)
  int MPI_Comm_size(MPI_Comm comm, int *size)
  int MPI_Comm_rank(MPI_Comm comm, int *rank)
  int MPI_Finalize()
  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
• Check mpi-forum.org for the official MPI specification
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
#include <mpi.h>
...
int main(int argc, char *argv[])
{
  ...
  if (rank == 0) {   /* Rank 0 is the master process */
    ...
    for (x = 1; x < size; x++) {
      /* Receive messages in for-loop order */
      MPI_Recv(msg, 50, MPI_CHAR, x, tag, MPI_COMM_WORLD, &status);
      printf("From worker %d: %s", status.MPI_SOURCE, msg);
    }
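Only the master’s receive loop appears on the slide; a self-contained sketch of what the complete master/worker program might look like (variable names and the message text are assumptions) is:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, x, tag = 0;
    char msg[50];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                   /* Rank 0 is the master process */
        for (x = 1; x < size; x++) {   /* Receive messages in for-loop order */
            MPI_Recv(msg, 50, MPI_CHAR, x, tag, MPI_COMM_WORLD, &status);
            printf("From worker %d: %s", status.MPI_SOURCE, msg);
        }
    } else {                           /* All other ranks are workers */
        sprintf(msg, "Hello from rank %d of %d\n", rank, size);
        MPI_Send(msg, 50, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two ranks, e.g. mpirun -np 4.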
Exercises for the student session
• See if you can extend hybrid_v1 to versions 2 and 3
• For version 2, have each process report its thread and rank number
• For version 3, set up MPI master/worker and serialize by rank
• Experiment with different mixes of ranks, threads, nodes, etc.
• Feel free to fork on GitHub and submit pull requests
The presenter is affiliated with the Pervasive Technology Institute.
Acknowledgements & disclaimer
• Many thanks to Robert Henschel and Jennet Tillotson
• Patrick McCormick, Jeff Inman, James Ahrens, Jamaludin Mohd-Yusof, Greg Roth, Sharen Cummins, “Scout: a data-parallel programming language for graphics processors,” Parallel Computing, Volume 33, Issues 10–11, November 2007, Pages 648–662
• Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies