
CSC364 Lecture 18


GPU Programming
(202603)


Javier Gonzalez-Sanchez

March 12, 2026

Transcript

  1. Dr. Javier Gonzalez-Sanchez [email protected] www.javiergs.info office: 14-227

    CSC 364 Introduction to Networked, Distributed, and Parallel Computing Lecture 18. GPU Programming with CUDA 1
  2. 2 Motivation • Massive parallelism (machine

    learning, image processing, scientific computing, …) • CPU :: 4-32 cores • GPU :: Example: the NVIDIA Tesla T4 has 2,560 cores (small CPUs) and 320 Tensor Cores (units for operations on tensors, i.e., N-dimensional packaged data).
  3. 5 Data Parallelism 1 program → 1 core

    1 program → 10 cores 1 program → thousands of threads (cores) C[i] = A[i] + B[i]
  4. 6 Architecture A GPU is not just a large

    collection of individual cores. Instead, cores are grouped into units called Streaming Multiprocessors (SMs). • SM = small processor that manages many cores • CUDA cores = simple arithmetic units inside the SM For a Tesla T4 GPU: • 40 SMs • 64 CUDA cores inside each SM So the total number of CUDA cores is: 40 x 64 = 2,560
  5. 7 Threads

    Grid
    ├─ Block
    │  ├─ Thread
    │  ├─ Thread
    │  └─ Thread
    └─ Block
       ├─ Thread
       ├─ Thread
       └─ Thread
  6. 8 Threads Example: • blocks = 100 •

    threads per block = 256 • total: 256 x 100 = 25,600 threads This means we are asking the GPU to execute 25,600 parallel tasks.
  7. 9 Execution in Warps The GPU scheduler executes threads

    in groups called warps. Warp = 32 threads This is the basic execution unit of the GPU. So the GPU runs: • 32 threads together • all executing the same instruction at the same time This is called SIMT (Single Instruction, Multiple Threads). 25,600 / 32 = 800 warps (distributed across 40 SMs)
  8. 10 CUDA Programming Model • kernel • grid • block

    • thread i = blockIdx.x * blockDim.x + threadIdx.x blocks = 2 threads per block = 4 Block 0 → threads 0–3 Block 1 → threads 4–7
  9. 12 Platform Running CUDA locally requires: • NVIDIA

    GPU • CUDA drivers • CUDA toolkit • compiler setup Colab avoids these issues: • runs in browser • free GPU access • no installation required
  10. 23 Problem and CPU solution • Compute distance

    from origin for many points, d = sqrt(x² + y²), where each point is independent // Sequential C function: void distance_cpu(float *x, float *y, float *d, int n) { for (int i = 0; i < n; i++) { d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]); } }
  11. 24 CUDA Kernel // CUDA kernel: __global__ void distance_gpu(float

    *x, float *y, float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]); } } threads = 256 blocks = (N + threads - 1) / threads distance_gpu <<<blocks, 256>>> (...)
  12. 25 Execution Pipeline 1. allocate GPU memory

    2. copy data to GPU 3. launch kernel 4. threads execute in parallel 5. copy results back
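The five steps above can be sketched as one CUDA host program, reusing the distance_gpu kernel from the previous slide. This is a minimal sketch, not the course's reference code: it assumes an NVIDIA GPU, compiles with nvcc (for example in a Colab runtime), and omits error checking of the CUDA calls for brevity.

```cuda
#include <math.h>

__global__ void distance_gpu(float *x, float *y, float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]);
}

int main(void) {
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_x[N], h_y[N], h_d[N];
    for (int i = 0; i < N; i++) { h_x[i] = 3.0f; h_y[i] = 4.0f; }

    // 1. allocate GPU memory
    float *d_x, *d_y, *d_d;
    cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes); cudaMalloc(&d_d, bytes);

    // 2. copy data to GPU
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 3. launch kernel; 4. threads execute in parallel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    distance_gpu<<<blocks, threads>>>(d_x, d_y, d_d, N);

    // 5. copy results back (cudaMemcpy synchronizes with the kernel)
    cudaMemcpy(h_d, d_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_d);
    return 0;
}
```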
  13. 30 N = 10,000,000 Threads per block = 256 Blocks

    = 39,063 10,000,128 threads are created!
  14. CSC 364 Introduction to Networked, Distributed, and Parallel

    Computing Javier Gonzalez-Sanchez, Ph.D. [email protected] Winter 2026 Copyright. These slides can only be used as study material for the class CSC 364 at Cal Poly. They cannot be distributed or used for another purpose. 37