
Big Problems?...Time to Get Parallel

This talk will introduce students to parallel programming. Topics include a survey of parallel programming paradigms and some sample parallel programs in OpenMP and MPI.

Scott Michael

July 16, 2012

Transcript

  1. © Trustees of Indiana University Released under Creative Commons 3.0

    unported license; license terms on last slide. Big Problems?...Time to Get Parallel A brief foray into the world of parallel computing Scott Michael [email protected] Presented at XSEDE 12 Conference, 16 Jul 2012 Available from: https://speakerdeck.com/u/scamicha/
  2. Outline •  What is parallel programming? – And why is it

    important? •  How do I design a parallel program? – Challenges •  What are the tools that enable parallel programming? – Shared vs. Distributed memory •  Demonstrations
  3. What is parallel computing and why do we need it?

    •  Wikipedia says: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").” •  Supercomputers = massive numbers of cores •  Everyday computers = many/multi-core Sequoia: LLNL’s BlueGene Q is the world’s fastest supercomputer with 20 PFLOPs and 1.6 million cores
  4. What is parallel computing and why do we need it?

    •  Wikipedia says: “Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel").” •  Supercomputers = massive numbers of cores •  Everyday computers = many/multi-core Sequoia: LLNL’s BlueGene Q is the world’s fastest supercomputer with 20 PFLOPs and 1.6 million cores
  5. I’m convinced! How do I design a parallel program? • 

    Two main strategies are used for problem decomposition: –  Task Parallel (MISD, MIMD, MPMD): processing elements do different tasks –  Data Parallel (SIMD, SPMD): processing elements do the same task on different data (McCormick et al. 2007)
  6. Wow, sounds easy! What’s the big deal? •  There are

    a variety of challenges in designing and implementing parallel algorithms – Amdahl’s law – Load balance – Interprocess communication/Data locality – Fault tolerance
  7. The law according to Amdahl •  A key determiner of

    the scalability of a code is how much of it is parallelized •  So, if $P$ is the parallel fraction of the code and $N$ is the number of processors, $S(N) = \frac{1}{(1-P) + \frac{P}{N}}$ and $\lim_{N\to\infty} S(N) = \frac{1}{1-P}$
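
As a quick worked example (the numbers here are illustrative, not from the deck): take a code that is 95% parallel ($P = 0.95$) running on $N = 16$ processors. Then

    $S(16) = \frac{1}{0.05 + 0.95/16} \approx 9.1$, while $\lim_{N\to\infty} S(N) = \frac{1}{0.05} = 20$,

so even a 95%-parallel code can never run more than 20 times faster, no matter how many cores it is given.
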
  8. Load balancing •  Even if 100% of your algorithm is

    able to be parallelized, one processor with too much work can be very problematic
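
One common way to attack load imbalance in OpenMP is dynamic loop scheduling. The sketch below is not from the deck; the work() function and the chunk size of 4 are illustrative assumptions, and it compiles with the same -fopenmp flag used in the demos later.

    #include <omp.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical task whose cost grows with i; a static split would
       leave whichever thread gets the last iterations badly overloaded. */
    static void work(int i) {
        usleep(1000 * i);  /* pretend to do i milliseconds of work */
    }

    int main(void) {
        /* schedule(dynamic, 4): threads grab 4 iterations at a time as
           they finish, so fast threads absorb work from slow ones. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < 256; i++) {
            work(i);
        }
        printf("All iterations complete.\n");
        return 0;
    }
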
  9. Okay, challenges addressed. Now, how do I write parallel code?

    •  Determine the best type of parallel implementation for your work –  Embarrassingly parallel •  Many unrelated tasks that don’t require communication –  Shared memory parallel •  Processing elements require access to shared global memory –  Distributed memory parallel •  Processing elements only require access to local memory
  10. So parallel, it’s embarrassing! •  Embarrassingly parallel problems involve many

    tasks that can be executed completely independently, but need some organizing infrastructure •  There are many high throughput frameworks that address embarrassingly parallel problems –  Condor, BigJob, etc.
  11. Tools for sharing memory •  Shared memory tools allow all

    tasks (threads) to access the same global memory •  Useful within a single node or on large shared-memory machines •  Can be implemented with –  POSIX threads (Pthreads): low level system access –  OpenMP: compiler pragmas –  PGAS languages: semantics in the language itself
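
Pthreads is listed above but not demonstrated later, so here is a minimal Pthreads "Hello World" sketch for comparison (my own code, not part of the deck); the choice of 6 threads simply mirrors the OpenMP demos.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 6

    /* Each thread runs this function; the argument carries its id. */
    static void *hello(void *arg) {
        long tid = (long) arg;
        printf("Hello World from pthread %ld\n", tid);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, hello, (void *) t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);  /* wait for every thread */
        return 0;
    }

Unlike OpenMP, thread creation, argument passing, and joining are all managed by hand here, which is why the deck describes Pthreads as "low level system access."
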
  12. Scaling up to many nodes •  Large distributed-memory systems require

    explicit message passing to coordinate tasks •  MPI (Message Passing Interface) is a series of tools that allows you to write code that coordinates the efforts of many nodes with distributed memory
  13. Master/worker •  The two most often used paradigms for MPI

    are: –  Simultaneous execution: all ranks perform the exact same tasks –  Master/worker: one rank acts as the controlling node and orchestrates and distributes work to the other nodes
  14. Hybrid: The best of both? •  You can also combine

    the two types of parallelism •  A combined OpenMP/MPI program is commonly referred to as a hybrid program •  There are several reasons you might want to do this –  Performance –  Memory
  15. Demonstration Time! •  Let’s try out what we’ve learned! • 

    Some simple parallel “Hello World” examples •  Shared memory with OpenMP •  Distributed memory with OpenMPI •  A hybrid approach
  16. Demonstration Time! •  Demo code is in PARALLELTRN directory – 

    Log in to ida.hps.iu.edu –  cd PARALLELTRN •  Or available on GitHub! •  git clone git://github.com/scamicha/XSEDE12-Parallel-Tutorial.git PARALLELTRN
  17. A simple “Hello World” program #include <stdio.h> #include <stdlib.h> int

    main (int argc, char *argv[]) { printf("Hello World! \n"); }
  18. Let’s make it parallel! #include <omp.h> #include <stdio.h> #include <stdlib.h>

    int main (int argc, char *argv[]) { omp_set_num_threads(6); // Fork a team of threads #pragma omp parallel { printf("Hello World! \n"); } }
  19. Let’s make it parallel! #include <omp.h> #include <stdio.h> #include <stdlib.h>

    int main (int argc, char *argv[]) { omp_set_num_threads(6); // Fork a team of threads #pragma omp parallel { printf("Hello World! \n"); } } This goes in every OpenMP program
  20. Let’s make it parallel! #include <omp.h> #include <stdio.h> #include <stdlib.h>

    int main (int argc, char *argv[]) { omp_set_num_threads(6); // Fork a team of threads #pragma omp parallel { printf("Hello World! \n"); } } This goes in every OpenMP program. This function or the environment variable OMP_NUM_THREADS sets the number of threads
  21. Let’s make it parallel! #include <omp.h> #include <stdio.h> #include <stdlib.h>

    int main (int argc, char *argv[]) { omp_set_num_threads(6); // Fork a team of threads #pragma omp parallel { printf("Hello World! \n"); } } This goes in every OpenMP program. The magic sauce…
  22. It’s all about the pragmas •  OpenMP uses compiler directives

    or pragmas to enable parallelism in compiled code •  Some of the most important pragmas are: #pragma omp parallel #pragma omp for #pragma omp sections #pragma omp single #pragma omp task #pragma omp critical #pragma omp barrier #pragma omp master •  Check OpenMP.org for a full list of pragmas and functions
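
The demos that follow only use #pragma omp parallel, so here is a small sketch (not from the deck) of two more pragmas from the list above: a work-sharing for loop and a critical section. The array size and variable names are arbitrary.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int sum = 0;
        int data[100];
        for (int i = 0; i < 100; i++)
            data[i] = i;

        #pragma omp parallel
        {
            /* omp for splits the loop iterations among the threads */
            #pragma omp for
            for (int i = 0; i < 100; i++) {
                /* omp critical: only one thread updates sum at a time */
                #pragma omp critical
                sum += data[i];
            }
        }
        printf("sum = %d\n", sum);  /* expect 4950 */
        return 0;
    }

In real code a reduction(+:sum) clause would be faster than a critical section; the critical form is used here only to illustrate the pragma.
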
  23. Time to submit our parallel job #!/bin/bash #PBS -l nodes=1:ppn=6,walltime=1:00

    #PBS -N hello_omp #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_omp_v1 ]; then gcc -O3 -o hello_omp_v1 -fopenmp hello_omp_v1.c fi ./hello_omp_v1 > hello_v1.out
  24. Time to submit our parallel job #!/bin/bash #PBS -l nodes=1:ppn=6,walltime=1:00

    #PBS -N hello_omp #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_omp_v1 ]; then gcc -O3 -o hello_omp_v1 -fopenmp hello_omp_v1.c fi ./hello_omp_v1 > hello_v1.out Specify resources… The compiler flag that makes it all happen (-fopenmp)… Execute and redirect the output
  25. Talk to me threads! #include <omp.h> #include <stdio.h> #include <stdlib.h>

    int main (int argc, char *argv[]) { int nthreads, tid, maxthreads; omp_set_num_threads(6); // Fork a team of threads and give a private copy of their thread number #pragma omp parallel private(tid) { #pragma omp master { maxthreads = omp_get_max_threads(); nthreads = omp_get_num_threads(); printf("There are %d threads available. %d of these threads are currently active. \n", maxthreads, nthreads); } // Obtain thread number tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); } } Create a ‘thread private’ variable
  26. The master thread #include <omp.h> #include <stdio.h> #include <stdlib.h> int

    main (int argc, char *argv[]) { int nthreads, tid, maxthreads; omp_set_num_threads(6); // Fork a team of threads and give a private copy of their thread number #pragma omp parallel private(tid) { #pragma omp master { maxthreads = omp_get_max_threads(); nthreads = omp_get_num_threads(); printf("There are %d threads available. %d of these threads are currently active. \n", maxthreads, nthreads); } // Obtain thread number tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); } } Only the ‘master’ thread executes this block
  27. Thread functions #include <omp.h> #include <stdio.h> #include <stdlib.h> int main

    (int argc, char *argv[]) { int nthreads, tid, maxthreads; omp_set_num_threads(6); // Fork a team of threads and give a private copy of their thread number #pragma omp parallel private(tid) { #pragma omp master { maxthreads = omp_get_max_threads(); nthreads = omp_get_num_threads(); printf("There are %d threads available. %d of these threads are currently active. \n", maxthreads, nthreads); } // Obtain thread number tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); } } More OpenMP functions
  28. Private variables #include <omp.h> #include <stdio.h> #include <stdlib.h> int main

    (int argc, char *argv[]) { int nthreads, tid, maxthreads; omp_set_num_threads(6); // Fork a team of threads and give a private copy of their thread number #pragma omp parallel private(tid) { #pragma omp master { maxthreads = omp_get_max_threads(); nthreads = omp_get_num_threads(); printf("There are %d threads available. %d of these threads are currently active. \n", maxthreads, nthreads); } // Obtain thread number tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); } } Remember, every thread has its own ‘private’ copy of this variable
  29. Synchronization #include <omp.h> #include <stdio.h> #include <stdlib.h> int main (int

    argc, char *argv[]) { ... #pragma omp parallel private(tid) { #pragma omp master { maxthreads = omp_get_max_threads(); nthreads = omp_get_num_threads(); printf("There are %d threads available. %d of these threads are currently active. \n", maxthreads, nthreads); } #pragma omp barrier // Obtain thread number tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); } } Threads wait here until they’ve all reached this point
  30. That was easy! Let’s use more nodes #include <stdio.h> #include

    <mpi.h> int main (argc, argv) int argc; char *argv[]; { MPI_Init (&argc, &argv); /* starts MPI */ printf( "Hello world!"); MPI_Finalize(); /* ends MPI */ return 0; }
  31. That was easy! Let’s use more nodes #include <stdio.h> #include

    <mpi.h> int main (argc, argv) int argc; char *argv[]; { MPI_Init (&argc, &argv); /* starts MPI */ printf( "Hello world!"); MPI_Finalize(); /* ends MPI */ return 0; } This goes in every MPI program
  32. That was easy! Let’s use more nodes #include <stdio.h> #include

    <mpi.h> int main (argc, argv) int argc; char *argv[]; { MPI_Init (&argc, &argv); /* starts MPI */ printf( "Hello world!"); MPI_Finalize(); /* ends MPI */ return 0; } Every MPI program needs these two functions
  33. There are many MPI functions •  MPI uses a library

    of functions to enable inter-process communication and parallelism in compiled code •  Some of the most important functions are: int MPI_Init(int *argc, char ***argv) int MPI_Comm_size(MPI_Comm comm, int *size) int MPI_Comm_rank(MPI_Comm comm, int *rank) int MPI_Finalize() int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) •  Check mpi-forum.org for the official MPI specification
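
As a minimal, self-contained illustration of MPI_Send and MPI_Recv (a sketch of my own, not one of the deck's demo programs): rank 0 sends a single integer to rank 1. The value 42 and the tag are arbitrary.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, value, tag = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* send one int to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* block until the message from rank 0 arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
            printf("Rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run it with at least two ranks, for example: mpirun -np 2 ./a.out
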
  34. If you build it… #!/bin/bash #PBS -l nodes=2:ppn=6,walltime=1:00 #PBS -N

    hello_mpi #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_mpi_v1 ]; then mpicc -O3 -o hello_mpi_v1 hello_mpi_v1.c fi mpirun -np 12 --hostfile $PBS_NODEFILE hello_mpi_v1 > hello_v1.out
  35. If you build it… #!/bin/bash #PBS -l nodes=2:ppn=6,walltime=1:00 #PBS -N

    hello_mpi #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_mpi_v1 ]; then mpicc -O3 -o hello_mpi_v1 hello_mpi_v1.c fi mpirun -np 12 --hostfile $PBS_NODEFILE hello_mpi_v1 > hello_v1.out
  36. If you build it… #!/bin/bash #PBS -l nodes=2:ppn=6,walltime=1:00 #PBS -N

    hello_mpi #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_mpi_v1 ]; then mpicc -O3 -o hello_mpi_v1 hello_mpi_v1.c fi mpirun -np 12 --hostfile $PBS_NODEFILE hello_mpi_v1 > hello_v1.out
  37. Tell me your names! #include <stdio.h> #include <mpi.h> int main

    (argc, argv) int argc; char *argv[]; { int rank, size; MPI_Init (&argc, &argv); /* starts MPI */ MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */ MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */ printf( "Hello world from process %d of %d\n", rank, size ); MPI_Finalize(); return 0; }
  38. Tell me your names! #include <stdio.h> #include <mpi.h> int main

    (argc, argv) int argc; char *argv[]; { int rank, size; MPI_Init (&argc, &argv); /* starts MPI */ MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */ MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */ printf( "Hello world from process %d of %d\n", rank, size ); MPI_Finalize(); return 0; }
  39. Let’s use Master/Worker to serialize the output #include <stdio.h> #include

    <mpi.h> int main (argc, argv) ... if (rank == 0) { /* Rank 0 is the master process */ ... for (x = 1; x < size; x++) { /* Receive messages in for loop order */ MPI_Recv(msg, 50, MPI_CHARACTER, x, tag, MPI_COMM_WORLD, &status); printf("From worker %d: %s", status.MPI_SOURCE, msg); }
  40. Let’s use Master/Worker to serialize the output #include <stdio.h> #include

    <mpi.h> int main (argc, argv) ... if (rank == 0) { /* Rank 0 is the master process */ ... for (x = 1; x < size; x++) { /* Receive messages in for loop order */ MPI_Recv(msg, 50, MPI_CHARACTER, x, tag, MPI_COMM_WORLD, &status); printf("From worker %d: %s", status.MPI_SOURCE, msg); }
  41. Let’s use Master/Worker to serialize the output ... } else

    { char msg[50]; snprintf(msg, 50, "Hello world from process %d of %d\n", rank, size); MPI_Send(msg, 50, MPI_CHARACTER, 0, tag, MPI_COMM_WORLD); } MPI_Finalize(); return 0; }
  42. Let’s use Master/Worker to serialize the output ... } else

    { char msg[50]; snprintf(msg, 50, "Hello world from process %d of %d\n", rank, size); MPI_Send(msg, 50, MPI_CHARACTER, 0, tag, MPI_COMM_WORLD); } MPI_Finalize(); return 0; }
  43. MPI and OpenMP together like Peanut Butter and Jelly #include

    <stdio.h> #include <mpi.h> #include <omp.h> int main (argc, argv) int argc; char *argv[]; { MPI_Init (&argc, &argv); /* starts MPI */ omp_set_num_threads(6); // Fork a team of threads #pragma omp parallel { printf( "Hello world! \n"); } MPI_Finalize(); /* ends MPI */ return 0; }
  44. MPI and OpenMP together like Peanut Butter and Jelly #!/bin/bash

    #PBS -l nodes=2:ppn=1,walltime=1:00 #PBS -N hello_hybrid #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_hybrid_v1 ]; then mpicc -O3 -fopenmp -o hello_hybrid_v1 hello_hybrid_v1.c fi mpirun -np 2 --hostfile $PBS_NODEFILE hello_hybrid_v1 > hello_v1.out
  45. MPI and OpenMP together like Peanut Butter and Jelly #!/bin/bash

    #PBS -l nodes=2:ppn=1,walltime=1:00 #PBS -N hello_hybrid #PBS -q batch #PBS -j oe cd $PBS_O_WORKDIR if [ ! -x hello_hybrid_v1 ]; then mpicc -O3 -fopenmp -o hello_hybrid_v1 hello_hybrid_v1.c fi mpirun -np 2 --hostfile $PBS_NODEFILE hello_hybrid_v1 > hello_v1.out
  46. If you didn’t want homework, you shouldn’t come to the

    student session •  See if you can extend hybrid_v1 to versions 2 and 3 •  For version 2, have each process report its thread and rank number •  For version 3, set up MPI master/worker and serialize by rank •  Experiment with different mixes of ranks, threads, nodes, etc. •  Feel free to fork on GitHub and submit pull requests
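
As a hint for version 2, the hybrid program might grow into something like the sketch below (one possible approach, not the official solution); the variables tid and rank are additions for the exercise.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int rank, tid;
        MPI_Init(&argc, &argv);                /* starts MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* MPI rank of this process */
        omp_set_num_threads(6);
        // Fork a team of threads, each with its own copy of tid
        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            printf("Hello world from thread %d of rank %d\n", tid, rank);
        }
        MPI_Finalize();                        /* ends MPI */
        return 0;
    }
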
  47. Research Technologies is a division of University Information Technology Services

    and is affiliated with the Pervasive Technology Institute Acknowledgements & disclaimer •  Many thanks to Robert Henschel and Jennet Tillotson •  Patrick McCormick, Jeff Inman, James Ahrens, Jamaludin Mohd-Yusof, Greg Roth, Sharen Cummins, Scout: a data-parallel programming language for graphics processors, Parallel Computing, Volume 33, Issues 10–11, November 2007, Pages 648-662 •  Any opinions presented here are those of the presenter(s) and do not necessarily represent the opinions of the National Science Foundation or any other funding agencies
  48. License terms •  Please cite as: Michael, S. Big Problems?...Time

    to Get Parallel. (Student Tutorial) XSEDE12 (Chicago, IL, 16 July 2012). Available from: https://speakerdeck.com/u/scamicha/ •  Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse. •  Except where otherwise noted, contents of this presentation are copyright 2012 by the Trustees of Indiana University. •  This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.