Slide 1

Slide 1 text

Accelerating Scientific Computing with GPUs. Rodrigo Nemmen, Universidade de São Paulo, w/ Roberta D. Pereira, João Paulo Navarro (NVIDIA), Matheus Bernardino, Alfredo Goldmann, Ivan Almeida. M. Weiss, CfA. blackholegroup.org @nemmen

Slide 2

Slide 2 text

Index: 1. Challenge of simulating the universe 2. What is a GPU? 3. HOWTO: accelerate your science 4. My work: black holes, GPUs and deep learning

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Challenge:

Slide 5

Slide 5 text

Challenge: Do physical laws reproduce the observed universe?

Slide 6

Slide 6 text

Challenge: Do physical laws reproduce the observed universe? Computational approach: Simulate a universe following those laws and compare with the real one

Slide 7

Slide 7 text

The challenge of simulating the universe

Slide 8

Slide 8 text

The challenge of simulating the universe

Slide 9

Slide 9 text

Range of length scales for simulating (part of) the universe

Slide 10

Slide 10 text

Range of length scales for simulating (part of) the universe: gas cooling, electrodynamics, turbulence, … galactic dynamics

Slide 11

Slide 11 text

Range of length scales for simulating (part of) the universe: gas cooling, electrodynamics, turbulence, … galactic dynamics. Dynamical range = largest scale / smallest scale > 10^13

Slide 12

Slide 12 text

Range of length scales for fluid dynamics. Smallest scale = Kolmogorov scale (turbulent eddy size); largest scale = system size; dynamical range > 10^5

Slide 13

Slide 13 text

Incredible computational costs in astrophysics: many problems have remained numerically intractable for a long time

Slide 14

Slide 14 text

Computational feasibility of problem ∝ (computer calculations/sec) / (# of calculations to solve problem)

Slide 15

Slide 15 text

Computational feasibility of problem ∝ (computer calculations/sec) / (# of calculations to solve problem) ∝ e^t

Slide 16

Slide 16 text

Exponential evolution of technology

Slide 17

Slide 17 text

Kurzweil 2005. Calculations per second per $1000 vs. year

Slide 18

Slide 18 text

Kurzweil 2005. CPUs. Calculations per second per $1000 vs. year

Slide 19

Slide 19 text

Kurzweil 2005. CPUs and GPUs. Calculations per second per $1000 vs. year

Slide 20

Slide 20 text

Rise of GPU computing (Huang, GTC 2018): relative performance of GPUs vs. single-threaded performance

Slide 21

Slide 21 text

Rise of GPU computing: gamers

Slide 22

Slide 22 text

Rise of GPU computing: gamers, bitcoin mining

Slide 23

Slide 23 text

Rise of GPU computing: gamers, bitcoin mining, deep learning

Slide 24

Slide 24 text

The difference between CPU and GPU: CPU vs. GPU (massively parallel)

Slide 25

Slide 25 text

The difference between CPU and GPU: CPU vs. GPU (massively parallel). Intel i7 7700: 4.2 GHz, 8 cores, 0.04 TFLOPS, $355

Slide 26

Slide 26 text

The difference between CPU and GPU: CPU vs. GPU (massively parallel). Intel i7 7700: 4.2 GHz, 8 cores, 0.04 TFLOPS, $355. NVIDIA GTX 1080Ti: 3584 CUDA cores, 11.3 TFLOPS (FP32), $1279 (3× more expensive than the CPU)

Slide 27

Slide 27 text

The difference between CPU and GPU: CPU vs. GPU (massively parallel). Intel i7 7700: 4.2 GHz, 8 cores, 0.04 TFLOPS, $355. NVIDIA GTX 1080Ti: 3584 CUDA cores, 11.3 TFLOPS (FP32), $1279 (3× more expensive than the CPU). Potentially a 280× speedup.
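The CUDA fragment scattered across these slides boils down to one thread per array element; a minimal, self-contained sketch is below (the kernel indexing and bounds check follow the slide text; the host driver around the kernel is my assumption, not from the talk):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One GPU thread handles one array element (the massively parallel model)
__global__ void vecaddgpu(float *c, const float *a, const float *b, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (id < n)                                      // make sure we do not go out of bounds
        c[id] = a[id] + b[id];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory: accessible from both CPU and GPU
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid  = (n + block - 1) / block;   // enough blocks to cover all n elements
    vecaddgpu<<<grid, block>>>(c, a, b, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);           // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```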

Slide 28

Slide 28 text

Lesson: we need to learn and train our students on how to exploit GPUs for science

Slide 29

Slide 29 text

GPUs in Astrophysics

Slide 30

Slide 30 text

Exponential evolution of astrophysical fluid simulations: number of fluid elements vs. year, fit by y ∝ e^(t/t0) with t0 = 2.8 yrs. Data from: Ruffert+1994; Hawley+1995; Hawley+1996; Stone+1999; Igumenshchev 2000; Stone+2001; Hawley+2002; De Villiers+2003; McKinney+2004; Kuwabara+2005; Hawley+2006; Beckwith+2009; Tchekhovskoy+2011; Narayan+2012; McKinney+2012; Sorathia+2012; McKinney+2013; Hawley+2013; Kestener+2014; Gheller+2015; Benitez-Llambay+2016; Schneider+2017; Liska+2018; Schneider+2018

Slide 31

Slide 31 text

Exponential evolution of astrophysical fluid simulations: rise of GPU codes. Kestener+2014 RAMSES-GPU, N = 800×1600×800. Number of fluid elements vs. year, fit by y ∝ e^(t/t0) with t0 = 2.8 yrs. Data from: Ruffert+1994; Hawley+1995; Hawley+1996; Stone+1999; Igumenshchev 2000; Stone+2001; Hawley+2002; De Villiers+2003; McKinney+2004; Kuwabara+2005; Hawley+2006; Beckwith+2009; Tchekhovskoy+2011; Narayan+2012; McKinney+2012; Sorathia+2012; McKinney+2013; Hawley+2013; Kestener+2014; Gheller+2015; Benitez-Llambay+2016; Schneider+2017; Liska+2018; Schneider+2018

Slide 32

Slide 32 text

Astrophysical computations supercharged with GPUs

Slide 33

Slide 33 text

Cholla: hydrodynamics of galaxy evolution and galactic winds (Schneider & Robertson 2017). Resolution = 2048 × 512 × 512. MPI + CUDA, 16,000 GPUs (Titan). https://vimeo.com/228857009

Slide 34

Slide 34 text

Cholla: hydrodynamics of galaxy evolution and galactic winds (Schneider & Robertson 2017). Resolution = 2048 × 512 × 512. MPI + CUDA, 16,000 GPUs (Titan). https://vimeo.com/228857009

Slide 35

Slide 35 text

H-AMR ("hammer"): multi-GPU general relativistic MHD code to simulate black hole disks. Matthew Liska (U of Amsterdam), Alexander Tchekhovskoy (Northwestern); Liska+2018 (slide adapted from Tchekhovskoy, SPSAS-HighAstro '17). Based on the Godunov-type code HARM2D (Gammie et al. 2003). MPI + CUDA + OpenMP, with 85% parallel scaling to 4096 GPUs. 100-450× speedup compared to a CPU core. Features on par with state-of-the-art CPU codes (e.g., Athena++): adaptive mesh refinement (AMR) with local adaptive time-stepping, giving an additional speedup of ≳10× (or even 100×) for the hardest problems and making it ideal for studying typical, tilted systems. Ideal for getting computational time: a 9M GPU-hour allocation on the Blue Waters supercomputer ≳ 9B CPU core-hours. Science is no longer limited by computational resources.

Slide 36

Slide 36 text

H-AMR: MHD simulation of a black hole and tilted accretion disk (Liska+2019). Resolution = 2880 × 864 × 1200. MPI + CUDA, 450 GPUs (Blue Waters). https://www.youtube.com/watch?v=mbnG5_UTTdk

Slide 37

Slide 37 text

H-AMR: MHD simulation of a black hole and tilted accretion disk (Liska+2019). Resolution = 2880 × 864 × 1200. MPI + CUDA, 450 GPUs (Blue Waters). https://www.youtube.com/watch?v=mbnG5_UTTdk

Slide 38

Slide 38 text

GRay: ray tracing in curved spacetimes (Chan+15a,b, ApJ). Panels: radio 10 GHz, 1.3 mm, IR 2.1 μm, X-rays

Slide 39

Slide 39 text

GRay: ray tracing in curved spacetimes (Chan+15a,b, ApJ). Panels: radio 10 GHz, 1.3 mm, IR 2.1 μm, X-rays

Slide 40

Slide 40 text

Chan+2013; 2015a,b

Slide 41

Slide 41 text

Chan+2013; 2015a,b

Slide 42

Slide 42 text

Previously intractable problems are now possible with GPUs. A change of paradigm in computational astrophysics: science is no longer limited by hardware resources. GPUs are ideal for getting computational time: 1 GPU-hour = 10-20 CPU-hours

Slide 43

Slide 43 text

Is your problem appropriate for a GPU?

Slide 44

Slide 44 text

Is your problem appropriate for a GPU? Problem is too small: do not bother.

Slide 45

Slide 45 text

Is your problem appropriate for a GPU? Problem is too small: do not bother. Can divide into parallel chunks ✔ (e.g. loops can be split up).

Slide 46

Slide 46 text

Is your problem appropriate for a GPU? Problem is too small: do not bother. Can divide into parallel chunks ✔ (e.g. loops can be split up). Lots of communication ✘ (need to constantly read from memory).

Slide 47

Slide 47 text

Is your problem appropriate for a GPU? Problem is too small: do not bother. Can divide into parallel chunks ✔ (e.g. loops can be split up). Lots of communication ✘ (need to constantly read from memory). Too unpredictable ✘ (branch divergence breaks the SIMD paradigm).
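To make the branch-divergence point concrete, here is a hypothetical kernel fragment (not from the talk): when threads in the same warp disagree on a data-dependent condition, the warp executes both branches one after the other, losing much of its SIMD throughput.

```cuda
// Hypothetical kernel fragment illustrating branch divergence: neighbouring
// threads in a warp that take different branches are serialized.
__global__ void divergent(float *out, const float *in, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    if (in[id] > 0.0f)            // data-dependent, unpredictable branch
        out[id] = sqrtf(in[id]);  // some threads do this...
    else
        out[id] = 0.0f;           // ...while their warp-mates wait, then do this
}
```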

Slide 48

Slide 48 text

Not just “compile it for the GPU”: you need to change how your code is engineered

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Your problem

Slide 51

Slide 51 text

Your problem

Slide 52

Slide 52 text

Your problem

Slide 53

Slide 53 text

Your problem

Slide 54

Slide 54 text

Your problem

Slide 55

Slide 55 text

Your problem

Slide 56

Slide 56 text

Solution ✔

Slide 57

Slide 57 text

Solution ✔

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

OpenACC Directives Overview. Your original Fortran or C code plus simple compiler hints; the compiler parallelises the code; works on many-core GPUs and multicore CPUs. The region inside the directives runs on the GPU, the rest on the CPU:
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience

Slide 60

Slide 60 text

OpenACC Directives Overview. Your original Fortran or C code plus simple compiler hints; the compiler parallelises the code; works on many-core GPUs and multicore CPUs. The !$acc kernels region becomes the GPU kernel, the rest runs on the CPU:
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience

Slide 61

Slide 61 text

OpenACC Directives Overview. Your original Fortran or C code plus simple compiler hints; the compiler parallelises the code; works on many-core GPUs and multicore CPUs. The !$acc kernels region becomes the GPU kernel, the rest runs on the CPU:
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience
CPU-GPU communication is a bottleneck.
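As a hedged illustration of that bottleneck (a fragment of my own, not code from the talk): the explicit CUDA data path that the OpenACC directives hide looks like the sketch below, and each cudaMemcpy crosses the PCIe bus, so shuttling arrays between CPU and GPU at every step can easily cost more than the kernels themselves.

```cuda
// Fragment (illustrative only): explicit host<->device transfers in CUDA.
// Keeping data resident on the GPU between kernel launches avoids paying
// this transfer cost repeatedly.
float *h_data = (float *)malloc(n * sizeof(float));     // CPU memory
float *d_data;
cudaMalloc(&d_data, n * sizeof(float));                 // GPU memory

cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice); // CPU -> GPU
// ... launch kernels operating on d_data, keeping it resident on the GPU ...
cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // GPU -> CPU
```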

Slide 62

Slide 62 text

How to adapt your problem to GPUs: difficulty increases from approaches with less development time to those with bigger speedup and flexibility

Slide 63

Slide 63 text

How to adapt your problem to GPUs. Example of serial code to be parallelised (SAXPY on array y):
for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
...
// Somewhere in main: call SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Slide 64

Slide 64 text

How to adapt your problem to GPUs. Example of serial code to be parallelised (a simple example: SAXPY in C):
void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
GPU version with CUDA: vector-addition device code, launched as a CUDA kernel:
__global__ void vecaddgpu(float *c, float *a, float *b, int n)
{
    // Get global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n) c[id] = a[id] + b[id];
}

Slide 65

Slide 65 text

How to adapt your problem to GPUs. Serial code vs. GPU version (CUDA).

Slide 66

Slide 66 text

How to adapt your problem to GPUs. Serial code vs. GPU version (CUDA):
// Kernel function (runs on GPU)
__global__ void multiply(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i];
}

Slide 67

Slide 67 text

How to adapt your problem to GPUs. Serial code vs. GPU version (CUDA):
// Kernel function (runs on GPU)
__global__ void multiply(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i];
}
Call from main:
// Allocate Unified Memory, accessible from CPU or GPU
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
// Run kernel on the GPU (the scalar a must also be passed)
multiply<<<1, 1>>>(N, a, x, y);
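The launch above runs the whole loop in a single GPU thread (<<<1, 1>>>). A more idiomatic completion is sketched below, with one element per thread; the grid/block sizes and the explicit scalar argument are my additions, not from the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element: y[i] = a * x[i]
__global__ void multiply(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i];
}

int main()
{
    const int N = 1 << 20;
    float *x, *y;
    // Unified Memory: accessible from CPU or GPU, as on the slide
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) x[i] = 1.0f;

    int block = 256;
    multiply<<<(N + block - 1) / block, block>>>(N, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 2.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```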

Slide 68

Slide 68 text

blackholegroup.org. Gustavo Soares (PhD), Artur Vemado (MSc), Fabio Cafardo (PhD), Raniere Menezes (PhD), Ivan Almeida (MSc), Rodrigo Nemmen, Roberta Pereira (MSc), Edson Ponciano (undergrad), Matheus Bernardino (undergrad, CS). Apply to join my group

Slide 69

Slide 69 text

General relativistic MHD simulations: black hole weather forecast https://www.youtube.com/watch?v=bWg6vaf5WXw

Slide 70

Slide 70 text

General relativistic MHD simulations: black hole weather forecast https://www.youtube.com/watch?v=bWg6vaf5WXw

Slide 71

Slide 71 text

Let there be light from the hot plasma. Spectra: log(luminosity [erg/s]) vs. log(frequency / Hz). Courtesy: @mmosc_m

Slide 72

Slide 72 text

Let there be light from the hot plasma. Time series and spectra: log(luminosity [erg/s]) vs. log(frequency / Hz). Courtesy: @mmosc_m

Slide 73

Slide 73 text

Let there be light from the hot plasma. Time series and spectra: log(luminosity [erg/s]) vs. log(frequency / Hz). Courtesy: @mmosc_m

Slide 74

Slide 74 text

How radiative transfer works in astrophysics: essentially ray tracing in general relativity

Slide 75

Slide 75 text

1. Spacetime and plasma properties

Slide 76

Slide 76 text

1. Spacetime and plasma properties: metric g_μν

Slide 77

Slide 77 text

1. Spacetime and plasma properties: metric g_μν; ρ_0 rest-mass density; u internal energy; u^μ 4-velocity; b^μ magnetic field

Slide 78

Slide 78 text

2. Photon generation (Monte Carlo method): k^α = dx^α/dλ

Slide 79

Slide 79 text

2. Photon generation (Monte Carlo method): k^α = dx^α/dλ, with initial wave vectors k^α(λ_0)

Slide 80

Slide 80 text

3. Photon propagation: solve the geodesic equation in curved spacetime, dx^α/dλ = k^α (which defines the affine parameter λ) and dk^α/dλ = −Γ^α_μν k^μ k^ν. (The slide shows the geodesic-integration and absorption sections of the grmonty paper as background.)

Slide 81

Slide 81 text

3. Photon propagation: solve the geodesic equation in curved spacetime, dx^α/dλ = k^α (which defines the affine parameter λ) and dk^α/dλ = −Γ^α_μν k^μ k^ν. (The slide shows the geodesic-integration and absorption sections of the grmonty paper as background.)
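A minimal sketch of how the geodesic equation above can be integrated numerically with a classic fourth-order Runge-Kutta step (the paper excerpt on the slide mentions fourth-order Runge-Kutta). This is illustrative C, not grmonty's actual code, and the christoffel() routine is a flat-spacetime placeholder; a real code would evaluate the connection coefficients of its metric there.

```c
#include <string.h>

/* Placeholder: flat spacetime, all connection coefficients zero.
 * A real code would fill Gamma[a][mu][nu] for its metric at position x. */
static void christoffel(const double x[4], double Gamma[4][4][4])
{
    (void)x;
    memset(Gamma, 0, 64 * sizeof(double));
}

/* Right-hand side of the ODE system: dx^a/dl = k^a, dk^a/dl = -Gamma^a_{mu nu} k^mu k^nu */
static void geodesic_rhs(const double x[4], const double k[4],
                         double dx[4], double dk[4])
{
    double Gamma[4][4][4];
    christoffel(x, Gamma);
    for (int a = 0; a < 4; a++) {
        dx[a] = k[a];
        dk[a] = 0.0;
        for (int mu = 0; mu < 4; mu++)
            for (int nu = 0; nu < 4; nu++)
                dk[a] -= Gamma[a][mu][nu] * k[mu] * k[nu];
    }
}

/* Advance one photon (x, k) by an affine-parameter step dl using classic RK4 */
void geodesic_rk4_step(double x[4], double k[4], double dl)
{
    double dx1[4], dk1[4], dx2[4], dk2[4], dx3[4], dk3[4], dx4[4], dk4[4];
    double xt[4], kt[4];

    geodesic_rhs(x, k, dx1, dk1);
    for (int a = 0; a < 4; a++) { xt[a] = x[a] + 0.5*dl*dx1[a]; kt[a] = k[a] + 0.5*dl*dk1[a]; }
    geodesic_rhs(xt, kt, dx2, dk2);
    for (int a = 0; a < 4; a++) { xt[a] = x[a] + 0.5*dl*dx2[a]; kt[a] = k[a] + 0.5*dl*dk2[a]; }
    geodesic_rhs(xt, kt, dx3, dk3);
    for (int a = 0; a < 4; a++) { xt[a] = x[a] + dl*dx3[a]; kt[a] = k[a] + dl*dk3[a]; }
    geodesic_rhs(xt, kt, dx4, dk4);

    for (int a = 0; a < 4; a++) {
        x[a] += dl/6.0 * (dx1[a] + 2.0*dx2[a] + 2.0*dx3[a] + dx4[a]);
        k[a] += dl/6.0 * (dk1[a] + 2.0*dk2[a] + 2.0*dk3[a] + dk4[a]);
    }
}
```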

Slide 82

Slide 82 text

4. Photon absorption and scattering: solve the radiative transfer equation along each path, written in covariant form as (1/C) d/dλ (I_ν/ν^3) = j_ν/ν^2 − (ν α_ν,a) I_ν/ν^3. (The slide shows the geodesic and absorption sections of the grmonty paper as background.)

Slide 83

Slide 83 text

4. Photon absorption and scattering: solve the radiative transfer equation along each path, written in covariant form as (1/C) d/dλ (I_ν/ν^3) = j_ν/ν^2 − (ν α_ν,a) I_ν/ν^3, with the absorption term highlighted. (The slide shows the geodesic and absorption sections of the grmonty paper as background.)

Slide 84

Slide 84 text

4. Photon absorption and scattering: solve the radiative transfer equation along each path, written in covariant form as (1/C) d/dλ (I_ν/ν^3) = j_ν/ν^2 − (ν α_ν,a) I_ν/ν^3, with the absorption and scattering terms highlighted. (The slide shows the geodesic and absorption sections of the grmonty paper as background.)
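For readability, the two governing equations quoted in the paper excerpts on these slides (the geodesic equation and the covariant radiative transfer equation), retypeset as LaTeX:

```latex
% Geodesic equation: defines the affine parameter lambda and the photon trajectory
\frac{dx^{\alpha}}{d\lambda} = k^{\alpha}, \qquad
\frac{dk^{\alpha}}{d\lambda} = -\Gamma^{\alpha}_{\ \mu\nu}\, k^{\mu} k^{\nu}

% Covariant radiative transfer equation (emission and absorption along each ray)
\frac{1}{\mathcal{C}}\,\frac{d}{d\lambda}\!\left(\frac{I_{\nu}}{\nu^{3}}\right)
  = \frac{j_{\nu}}{\nu^{2}}
  - \left(\nu\,\alpha_{\nu,a}\right)\frac{I_{\nu}}{\nu^{3}}
```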

Slide 85

Slide 85 text

5. Count photons that leave the “box” to produce spectra or images: photons E1, E2, E3, E4 collected in the observer's orthonormal basis {ê_α}

Slide 86

Slide 86 text

5. Count photons that leave the “box” to produce spectra or images: photons E1, E2, E3, E4 collected in the observer's orthonormal basis {ê_α}. Observed spectrum: number of photons vs. energy.

Slide 87

Slide 87 text

5. Count photons that leave the “box” to produce spectra or images: photons E1, E2, E3, E4 collected in the observer's orthonormal basis {ê_α}. Observed spectrum: number of photons vs. energy. Observed image.

Slide 88

Slide 88 text

Radiative transfer scheme: Monte Carlo ray tracing in full general relativity (grmonty, Dolence+09 → gpu-monty; N, Bernardino, Goldmann, in prep.). Current features: only thermal synchrotron emission, inverse Compton scattering, 2D flows, OpenMP parallelization, arbitrary spacetimes. In progress: support for 3D flows, GPU acceleration, non-thermal distributions, bremsstrahlung ✅ ✅

Slide 89

Slide 89 text

GPU acceleration: collaboration with computer scientists. Work mostly by undergrad Matheus Bernardino. Each GPU thread (thread 1, thread 2, thread 3, …) does its own share of the work, indexed by the global thread ID (id = blockIdx.x*blockDim.x + threadIdx.x, with a bounds check).
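A hypothetical sketch of the parallelization strategy pictured on this slide, one GPU thread per Monte Carlo superphoton; the Photon struct and the track_photon() placeholder are my assumptions, not the actual gpu-monty code.

```cuda
// Hypothetical sketch: one GPU thread per Monte Carlo superphoton.
// track_photon() stands in for the real geodesic + radiative-transfer integration.
struct Photon { double x[4]; double k[4]; double weight; };

__device__ void track_photon(Photon *ph)
{
    // Placeholder body: a real code would integrate the photon trajectory and
    // attenuate its weight until it escapes the domain or is captured.
    ph->weight *= 1.0;
}

__global__ void trace_photons(Photon *photons, int n_photons)
{
    // Get global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n_photons)
        track_photon(&photons[id]);
}
```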

Slide 90

Slide 90 text

Phase I: Tests. Robust end-to-end test; testing code correctness (a race condition was found). Spectrum appropriate for Sagittarius A*. Matheus Bernardino, undergrad (CS), Git codebase contributor; Alfredo Goldmann, Professor (CS)

Slide 91

Slide 91 text

Phase II: Optimization Matheus Bernardino undergrad (CS)

Slide 92

Slide 92 text

Results: code runs 4× faster on a gamer GPU than on the CPU (32× faster than 1 CPU core). CPU: Intel i7, 2.8 GHz, 8 cores. GPU: NVIDIA GTX 1080

Slide 93

Slide 93 text

Results: code runs 4× faster on a gamer GPU than on the CPU (32× faster than 1 CPU core). CPU: Intel i7, 2.8 GHz, 8 cores. GPU: NVIDIA GTX 1080. Plenty of room for optimizations (should get to 100× speedup)

Slide 94

Slide 94 text

Lesson: collaborate w/ your CS friends, they can help you!

Slide 95

Slide 95 text

Using deep learning and GPUs for black hole weather forecasting

Slide 96

Slide 96 text

https://youtu.be/TmPfTpjtdgg Learning to play Breakout (Atari)

Slide 97

Slide 97 text

https://youtu.be/TmPfTpjtdgg Learning to play Breakout (Atari)

Slide 98

Slide 98 text

https://youtu.be/TmPfTpjtdgg After 600 matches…

Slide 99

Slide 99 text

https://youtu.be/TmPfTpjtdgg After 600 matches…

Slide 100

Slide 100 text

AI developed by Google beats the best Go player: AlphaGo Zero, AlphaZero (Silver+2016, 2017 Nature; Silver+2018 Science)

Slide 101

Slide 101 text

AI developed by Google beats the best Go player: AlphaGo Zero, AlphaZero (Silver+2016, 2017 Nature; Silver+2018 Science)

Slide 102

Slide 102 text

AI developed by Google beats the best Go player. AlphaZero: the algorithm learned by playing against itself. After 24 hours: beats the best human player. After 3 days: becomes invincible against the previous version of the code. After 21 days: invincible against all existing Go programs. (AlphaGo Zero, AlphaZero; Silver+2016, 2017 Nature; Silver+2018 Science)

Slide 103

Slide 103 text

Garry Kasparov, “Chess, a Drosophila of reasoning”, Science 2018: editorial on AlphaZero by the former world chess champion and author of Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins. (The slide shows a screenshot of the article.)

Slide 104

Slide 104 text

Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction for the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training: no access to the equation, only data. Panels: (a) actual data from the model, (b) ML prediction, (c) error in prediction; time in units of the Lyapunov time.

Slide 105

Slide 105 text

Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction for the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training: no access to the equation, only data. Panels: (a) actual data from the model, (b) ML prediction, (c) error in prediction; time in units of the Lyapunov time.

Slide 106

Slide 106 text

Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction for the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training: no access to the equation, only data. Panels: (a) actual data from the model, (b) ML prediction, (c) error in prediction; time in units of the Lyapunov time.

Slide 107

Slide 107 text

Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction for the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training: no access to the equation, only data. Panels: (a) actual data from the model, (b) ML prediction, (c) error in prediction; time in units of the Lyapunov time.

Slide 108

Slide 108 text

Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction for the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training: no access to the equation, only data. Panels: (a) actual data from the model, (b) ML prediction, (c) error in prediction; time in units of the Lyapunov time. Does this work for astrophysical chaotic systems?
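For readability, the model equation quoted on these slides (a Kuramoto-Sivashinsky-type PDE with periodic forcing, as used by Pathak+2018), retypeset as LaTeX:

```latex
% PDE quoted on the slide (Pathak+2018, PRL), retypeset for readability
y_t = -\, y\, y_x - y_{xx} - y_{xxxx} + \mu \cos\!\left(\frac{2\pi x}{\lambda}\right)
```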

Slide 109

Slide 109 text

Black hole blowback simulations (Almeida & N, MNRAS submitted, arXiv:1905.13708; Ivan Almeida). Wind powers similar to values required in AGN feedback: P_wind ∼ (10^-3 − 10^-2) Ṁc^2. Terminal velocities = 300-3000 km/s. Results consistent with the Akira LLAGN (Cheung+2016)

Slide 110

Slide 110 text

Black hole blowback simulations (Almeida & N, MNRAS submitted, arXiv:1905.13708; Ivan Almeida). Wind powers similar to values required in AGN feedback: P_wind ∼ (10^-3 − 10^-2) Ṁc^2. Terminal velocities = 300-3000 km/s. Results consistent with the Akira LLAGN (Cheung+2016)

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

No content

Slide 113

Slide 113 text

Using deep learning for black hole weather forecasting (Pereira, N, Navarro, in prep.; Roberta D. Pereira, João Paulo Navarro). Deep learning: convolutional neural network, U-Net + ReLU. Plot: R^2 vs. time (GM/c^3)

Slide 114

Slide 114 text

Using deep learning for black hole weather forecasting (Pereira, N, Navarro, in prep.; Roberta D. Pereira, João Paulo Navarro). Deep learning: convolutional neural network, U-Net + ReLU. Plot: R^2 vs. time (GM/c^3)

Slide 115

Slide 115 text

Using deep learning for black hole weather forecasting (Pereira, N, Navarro, in prep.; Roberta D. Pereira, João Paulo Navarro). Deep learning: convolutional neural network, U-Net + ReLU. Plot: R^2 vs. time (GM/c^3). The neural network imagines the future well for Δt ≈ 3000 GM/c^3 ≈ 2 t_dyn(100 r_S). Limits of DL for multidimensional chaotic systems? “Artificial Alzheimer's”

Slide 116

Slide 116 text

Summary: GPUs in astrophysics. GPUs: not only for bitcoin mining, games and deep learning. We need to train our researchers and students in GPU computing techniques. Collaborate with computer scientists: they will love to teach you how to optimize things. Use GPUs if possible: more bang for your buck. Ideal for getting computational time: 1 GPU-hour > 10 CPU-hours. rodrigonemmen.com blackholegroup.org

Slide 117

Slide 117 text

If also interested in: High-energy astrophysics Simulations: accretion / jets AGN feedback (winds, jet) Come talk to me! 🙂 [email protected] @nemmen

Slide 118

Slide 118 text

E-mail: [email protected] · Web: rodrigonemmen.com · Twitter: @nemmen · Github: rsnemmen · Facebook: facebook.com/rodrigonemmen · Bitbucket: nemmen · Group: blackholegroup.org · figshare: bit.ly/2fax2cT