
Accelerating scientific computing with GPUs

Rodrigo Nemmen
September 22, 2019


Graphics processing units (GPUs) aren't just of interest to gamers and cryptocurrency miners. Increasingly, they're being used to turbocharge academic research, too. GPUs provide massive parallelism, enabling computations on a basic computer that until very recently would have required a supercomputer. I will describe how GPUs can be leveraged to dramatically accelerate scientific calculations. I will also give an overview of recent applications of GPUs to astrophysical simulations, with a particular focus on radiative transfer in curved spacetimes and (magneto)hydrodynamic simulations of black hole accretion.

Slides from a colloquium presented in 2019.


Transcript

  1. Rodrigo Nemmen, Universidade de São Paulo: Accelerating Scientific Computing with GPUs. With Roberta D. Pereira, João Paulo Navarro (NVIDIA), Matheus Bernardino, Alfredo Goldmann, Ivan Almeida, M. Weiss (CfA). blackholegroup.org / @nemmen
  2. Index: 1. Challenge of simulating the universe. 2. What is a GPU? 3. HOWTO: accelerate your science. 4. My work: black holes, GPUs and deep learning.
  3. Challenge: do physical laws reproduce the observed universe? Computational approach: simulate a universe following those laws and compare it with the real one.
  4. Range of length scales for simulating (part of) the universe: gas cooling, electrodynamics, turbulence, …, galactic dynamics. Dynamical range = largest scale / smallest scale > 10^13.
  5. Range of length scales for fluid dynamics: smallest scale = Kolmogorov scale (size of the smallest turbulent eddies); largest scale = system size. Dynamical range > 10^5.
  6. The difference between CPU and GPU: the GPU is massively parallel. [Slide reproduces a CUDA thread-indexing snippet and the OpenACC directives overview figure used later in the talk.]
  7. The difference between CPU and GPU (continued). CPU: Intel i7 7700, 4.2 GHz, 8 cores, 0.04 TFLOPS, $355.
  8. The difference between CPU and GPU (continued). CPU: Intel i7 7700, 4.2 GHz, 8 cores, 0.04 TFLOPS, $355. GPU: NVIDIA GTX 1080 Ti, 3584 CUDA cores, 11.3 TFLOPS (FP32), $1279 (3× more expensive than the CPU).
  9. The difference between CPU and GPU (continued). Intel i7 7700 (0.04 TFLOPS, $355) vs. NVIDIA GTX 1080 Ti (11.3 TFLOPS FP32, $1279): 3× more expensive than the CPU, potentially a 280× speedup.
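The 280× figure is consistent with the ratio of the quoted peak throughputs (an assumption about how it was derived; the slide does not show the arithmetic): 11.3 TFLOPS / 0.04 TFLOPS ≈ 283 ≈ 280. In practice, achieved speedups depend on how well the algorithm maps onto the GPU; the talk later reports 4×-32× for gpu-monty.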
  10. Lesson: we need to learn, and to train our students, how to exploit GPUs for science.
  11. Exponential evolution of astrophysical fluid simulations: log10 of the number of fluid elements vs. year, growing as y ∝ e^(t/t0) with t0 = 2.8 yr. Data from: Ruffert+1994; Hawley+1995; Hawley+1996; Stone+1999; Igumenshchev 2000; Stone+2001; Hawley+2002; De Villiers+2003; McKinney+2004; Kuwabara+2005; Hawley+2006; Beckwith+2009; Tchekhovskoy+2011; Narayan+2012; McKinney+2012; Sorathia+2012; McKinney+2013; Hawley+2013; Kestener+2014; Gheller+2015; Benitez-Llambay+2016; Schneider+2017; Liska+2018; Schneider+2018.
  12. Exponential evolution of astrophysical fluid simulations: the rise of GPU codes. Highlighted: Kestener+2014 RAMSES-GPU, N = 800×1600×800. Same y ∝ e^(t/t0) trend (t0 = 2.8 yr) and data references as the previous slide.
  13. Cholla: hydrodynamics of galaxy evolution and galactic winds (Schneider & Robertson 2017). Resolution = 2048 × 512 × 512; MPI + CUDA; 16,000 GPUs (Titan). https://vimeo.com/228857009
  14. Cholla (continued): same simulation and video as the previous slide.
  15. H-AMR ("hammer"): multi-GPU general relativistic MHD code to simulate black hole disks, by Matthew Liska, Alexander Tchekhovskoy et al. (Liska+2018; slide adapted from Tchekhovskoy, SPSAS-HighAstro '17). Based on the Godunov-type code HARM2D (Gammie et al. 2003). MPI + CUDA + OpenMP; 85% parallel scaling to 4096 GPUs; 100-450× speedup compared to a CPU core. Features on par with state-of-the-art CPU codes (e.g. Athena++): adaptive mesh refinement (AMR) with local adaptive time-stepping, giving an additional ≳10× (up to ~100×) speedup for the hardest problems. GPUs are ideal for getting computational time: a 9M GPU-hour allocation on the Blue Waters supercomputer, equivalent to ≳9B CPU core-hours. Science is no longer limited by computational resources.
  16. H-AMR: MHD simulation of a black hole and tilted accretion disk (Liska+2019). Resolution = 2880×864×1200; MPI + CUDA; 450 GPUs (Blue Waters). https://www.youtube.com/watch?v=mbnG5_UTTdk
  17. H-AMR tilted-disk simulation (continued): same video as the previous slide.
  18. Previously intractable problems are now possible with GPUs. Change of paradigm in computational astrophysics: science is no longer limited by hardware resources. GPUs are ideal for getting computational time: 1 GPU-hour = 10-20 CPU-hours.
  19. Is your problem appropriate for a GPU? Problem is too small → do not bother. Can be divided into parallel chunks → ✔ (e.g. loops can be split up).
  20. Is your problem appropriate for a GPU? Problem is too small → do not bother. Can be divided into parallel chunks → ✔ (e.g. loops can be split up). Lots of communication → ✘ (need to constantly read from memory).
  21. Is your problem appropriate for a GPU? Problem is too small → do not bother. Can be divided into parallel chunks → ✔ (e.g. loops can be split up). Lots of communication → ✘ (need to constantly read from memory). Too unpredictable → ✘ (branch divergence breaks the SIMD paradigm).
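To make the branch-divergence point concrete, here is a minimal CUDA sketch (hypothetical, not from the deck): threads of the same 32-thread warp that take different sides of an if/else are serialized, so the warp pays for both paths.

    // Hypothetical kernel illustrating branch divergence.
    // Even and odd lanes of each warp take different branches,
    // so the hardware executes the two paths one after the other.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id < n) {
            if (id % 2 == 0)
                out[id] = in[id] * 2.0f;
            else
                out[id] = in[id] + 1.0f;
        }
    }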
  22. OpenACC directives overview: keep your original Fortran or C code and add simple compiler hints; the compiler parallelizes the code, which works on many-core GPUs and multicore CPUs. The serial code runs on the CPU and the hinted region runs on the GPU:

    Program myscience
      ... serial code ...
      !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
      !$acc end kernels
      ...
    End Program myscience
  23. OpenACC directives overview (continued): the region between !$acc kernels and !$acc end kernels becomes a GPU kernel.
  24. OpenACC directives overview (continued): the generated kernel runs on the GPU (the slide shows the equivalent CUDA bounds-checked vector addition). CPU-GPU communication is a bottleneck.
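A sketch of why that bottleneck matters, assuming explicit host and device buffers (the function and variable names below are hypothetical, not from the talk): every cudaMemcpy crosses the PCIe bus, which is much slower than GPU memory, so data should stay resident on the GPU across kernel launches instead of being copied back and forth every step.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Hypothetical helper: copy in, compute, copy out.
    // Doing this every timestep makes the transfers dominate the runtime;
    // the fix is to keep d_data on the GPU between kernel launches.
    void step_on_gpu(float *h_data, float *d_data, size_t n)
    {
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... launch kernel(s) that operate on d_data ...
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    }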
  25. How to adapt your problem to GPUs: the available approaches trade development time against performance; as difficulty increases, you move from less development time toward bigger speedups and more flexibility.
  26. How to adapt your problem to GPUs. Example of serial code to be parallelized: SAXPY over array y, y[i] = a*x[i] + y[i], called on 1M elements from main: saxpy(1<<20, 2.0, x, y). [Slide shows the C and Fortran versions of the SAXPY loop with OpenACC kernels directives.]
  27. How to adapt your problem to GPUs. GPU version: launch a CUDA kernel for the vector addition (device code):

    __global__ void vecaddgpu(float *r, float *a, float *b, int n) {
        // Get global thread ID
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        // Make sure we do not go out of bounds
        if (id < n) r[id] = a[id] + b[id];
    }
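For reference, a complete host-side driver that a kernel like this could be launched from (a sketch, not code from the talk; the 256-threads-per-block launch configuration and the use of unified memory are choices made here for brevity):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecaddgpu(float *r, const float *a, const float *b, int n)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
        if (id < n) r[id] = a[id] + b[id];                // stay in bounds
    }

    int main()
    {
        const int n = 1 << 20;                      // 1M elements
        float *a, *b, *r;
        cudaMallocManaged(&a, n * sizeof(float));   // unified memory, visible to CPU and GPU
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&r, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
        vecaddgpu<<<blocks, threads>>>(r, a, b, n);
        cudaDeviceSynchronize();                    // wait for the GPU before reading r

        printf("r[0] = %f\n", r[0]);                // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(r);
        return 0;
    }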
  28. How to adapt your problem to GPUs. Serial code: the SAXPY loop from the previous slides (y[i] = a*x[i] + y[i]); the CUDA GPU version is built up on the next two slides.
  29. How to adapt your problem to GPUs. GPU version with CUDA, step 1, the kernel:

    // Kernel function (runs on GPU)
    __global__ void multiply(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i];
    }
  30. How to adapt your problem to GPUs. GPU version with CUDA, step 2, the call from main:

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // Run kernel on the GPU
    multiply<<<1, 1>>>(N, a, x, y);
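Putting slides 29-30 together into a compilable whole (again a sketch; the <<<1, 1>>> launch of one block and one thread mirrors the slide and is deliberately unoptimized, whereas a production kernel would use one thread per element as in the vector-addition example above):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Kernel function (runs on GPU). A single thread loops over all elements,
    // mirroring the slide; a real kernel would distribute the loop over threads.
    __global__ void multiply(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }

    int main()
    {
        const int N = 1 << 20;
        float *x, *y;
        // Allocate Unified Memory, accessible from CPU or GPU.
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; i++) x[i] = 1.0f;

        multiply<<<1, 1>>>(N, 2.0f, x, y);   // run kernel on the GPU
        cudaDeviceSynchronize();             // wait before touching y on the CPU

        printf("y[0] = %f\n", y[0]);         // expect 2.0
        cudaFree(x); cudaFree(y);
        return 0;
    }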
  31. blackholegroup.org, group members: Rodrigo Nemmen; Gustavo Soares (PhD); Fabio Cafardo (PhD); Raniere Menezes (PhD); Artur Vemado (MSc); Ivan Almeida (MSc); Roberta Pereira (MSc); Edson Ponciano (undergrad); Matheus Bernardino (undergrad, CS). Apply to join my group.
  32. Let there be light from the hot plasma: spectra. [Plot: log(luminosity [erg/s]) vs. log(frequency / Hz). Courtesy: @mmosc_m]
  33. Let there be light from the hot plasma: spectra and time series. [Same spectrum plot plus a light curve. Courtesy: @mmosc_m]
  34. (Same slide as 33, repeated in the build.)
  35. 1. Spacetime and plasma properties: metric g_μν, rest-mass density ρ0, internal energy u, 4-velocity u^μ, magnetic field b^μ.
  36. 3. Photon propagation: solve the geodesic equation in curved spacetime. From the excerpt of the grmonty paper (Dolence+09) shown on the slide: the governing equations for a photon trajectory are dx^α/dλ = k^α, which defines the affine parameter λ, and the geodesic equation dk^α/dλ = −Γ^α_μν k^μ k^ν. The excerpt also quotes geodesic-integration benchmarks (grmonty vs. the integral-based geokerr code of Dexter & Agol 2009) and the covariant radiative transfer equation used on the next slides.
  37. (Same slide as 36, repeated in the build.)
  38. 4. Photon absorption and scattering: solve the radiative transfer equation along each path. In covariant form (from the grmonty paper excerpt): (1/C) d/dλ (I_ν/ν³) = j_ν/ν² − (ν α_ν,a)(I_ν/ν³), where I_ν is the specific intensity, j_ν the emissivity and α_ν,a the absorption coefficient.
  39. (Same slide as 38, with the absorption term highlighted.)
  40. (Same slide as 38, now also highlighting scattering.)
  41. 5. Count photons that leave the "box" and produce spectra or images: photons reaching the observer are tallied by energy (E1, E2, E3, E4) measured in the observer's orthonormal basis {e_α̂}.
  42. (Build: histogram of number of photons vs. energy, i.e. the observed spectrum.)
  43. (Build: the observed spectrum and the observed image.)
  44. Radiative transfer scheme: Monte Carlo ray tracing in full general relativity; grmonty (Dolence+09), being ported to gpu-monty (Nemmen, Bernardino & Goldmann, in prep.). Current features: only thermal synchrotron emission; inverse Compton scattering; 2D flows; OpenMP parallelization; arbitrary spacetimes. In progress: support for 3D flows ✅; GPU acceleration ✅; non-thermal distributions; bremsstrahlung.
  45. GPU acceleration: collaboration with computer scientists; work mostly by undergrad Matheus Bernardino. The slide shows the work split across GPU threads (thread 1 … thread 5), with the usual CUDA indexing: each thread gets a global ID, id = blockIdx.x*blockDim.x + threadIdx.x, and checks id < n so it does not go out of bounds.
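A minimal sketch of what a one-thread-per-superphoton kernel could look like; the Photon struct, push_photon() and track_photons() below are hypothetical placeholders, not gpu-monty code, and the transport step is reduced to flat-spacetime free streaming (Γ = 0) so the example stays self-contained.

    // Hypothetical one-thread-per-photon Monte Carlo kernel (not gpu-monty code).
    struct Photon { double x[4]; double k[4]; double weight; };

    // Placeholder transport step: one Euler step of dx^a/dlambda = k^a in flat
    // spacetime. The real code would also update k^a through the connection
    // coefficients and apply absorption and scattering along the way.
    __device__ void push_photon(Photon &ph, double dlambda)
    {
        for (int a = 0; a < 4; a++)
            ph.x[a] += ph.k[a] * dlambda;
    }

    __global__ void track_photons(Photon *photons, int n, double dlambda)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        if (id < n)                                      // stay in bounds
            push_photon(photons[id], dlambda);           // one photon per thread
    }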
  46. Phase I: tests. Robust end-to-end test; testing code correctness (a race condition was found); spectrum appropriate for Sagittarius A*. Matheus Bernardino (undergrad, CS; Git codebase contributor) and Alfredo Goldmann (Professor, CS).
  47. Results: the code runs 4× faster on a gamer GPU than on the CPU (32× faster than a single CPU core). Hardware: Intel i7, 2.8 GHz, 8 cores vs. NVIDIA GTX 1080.
  48. Results (continued): 4× faster than the CPU, 32× faster than one CPU core, with plenty of room for optimizations (should reach a ~100× speedup).
  49. AI developed by Google beats the best Go player: AlphaGo Zero and AlphaZero (Silver+2016, 2017 Nature; Silver+2018 Science).
  50. (Same slide as 49, repeated in the build.)
  51. AI developed by Google beats the best Go player. AlphaZero: the algorithm learned by playing against itself. After 24 hours: beats the best human player. After 3 days: invincible against the previous version of the code. After 21 days: invincible against all existing Go programs. (Silver+2016, 2017 Nature; Silver+2018 Science)
  52. "Chess, a Drosophila of reasoning": editorial by Garry Kasparov (former world chess champion and author of Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins), Science, 2018. In the excerpt shown, Kasparov notes that, unlike engines built on libraries of opening and endgame moves, AlphaZero starts out knowing only the rules of chess, with no embedded human strategies; in a few hours it plays more games against itself than have been recorded in all of human chess history, teaches itself the best way to play, and defeats the strongest chess engines, beating Stockfish with 28 wins, 72 draws and no losses.
  53. Machine learning predicts the future of a simple chaotic system for a long time (Pathak+2018, PRL). Reservoir computing technique: model-free prediction of the evolution of a large spatiotemporally chaotic system, y_t = −y y_x − y_xx − y_xxxx + μ cos(2πx/λ). During training there is no access to the equation, only to data. [Figure panels: (a) actual data from the model, (b) ML prediction, (c) error in the prediction, as functions of time in units of the Lyapunov time.]
  54. (Same slide as 53, repeated in the build.)
  55. (Same slide as 53, repeated in the build.)
  56. (Same slide as 53, repeated in the build.)
  57. (Same slide as 53.) Does this work for astrophysical chaotic systems?
  58. Black hole blowback simulations (Almeida & Nemmen, MNRAS submitted, arXiv:1905.13708; Ivan Almeida). Wind powers similar to the values required in AGN feedback: P_wind ∼ (10^−3 − 10^−2) Ṁc². Terminal velocities = 300-3000 km/s. Results consistent with the Akira LLAGN (Cheung+2016).
  59. (Same slide as 58, repeated in the build.)
  60. Using deep learning for black hole weather forecasting (Pereira, Nemmen & Navarro, in prep.; Roberta D. Pereira, João Paulo Navarro). Deep learning: convolutional neural network, U-Net + ReLU. [Plot: R² of the prediction vs. time in units of GM/c³.]
  61. (Same slide as 60, repeated in the build.)
  62. Using deep learning for black hole weather forecasting (continued): the neural network "imagines" the future well for Δt ≈ 3000 GM/c³ ≈ 2 t_dyn(100 r_S), then degrades (an "artificial Alzheimer's"). What are the limits of deep learning for multidimensional chaotic systems?
  63. Summary: GPUs in astrophysics. GPUs are not only for bitcoin mining, games and deep learning. We need to train our researchers and students in GPU computing techniques. Collaborate with computer scientists: they will love to teach you how to optimize things. Use GPUs if possible: more bang for your buck, and ideal for getting computational time (1 GPU-hour > 10 CPU-hours). rodrigonemmen.com / blackholegroup.org
  64. Contact: E-mail [email protected]; Web rodrigonemmen.com; Twitter @nemmen; GitHub rsnemmen; Facebook facebook.com/rodrigonemmen; Bitbucket nemmen; Group blackholegroup.org; figshare bit.ly/2fax2cT