
CSC364 Lecture 18


GPU Programming
(202603)


Javier Gonzalez-Sanchez

March 12, 2026

Transcript

  1. Dr. Javier Gonzalez-Sanchez [email protected] www.javiergs.info office: 14-227

    CSC 364 Introduction to Networked, Distributed, and Parallel Computing Lecture 18. GPU Programming with CUDA 1
  2. 2 Motivation • Massive parallelism (machine

    learning, image processing, scientific computing, …) • CPU :: 4-32 cores • GPU :: Example: the NVIDIA Tesla T4 has 2,560 cores (small CPUs) and 320 Tensor Cores (units for operations on tensors, i.e., N-dimensional packaged data).
  3. 5 Data Parallelism 1 program → 1 core

    1 program → 10 cores 1 program → thousands of threads (cores) C[i] = A[i] + B[i]
  4. 6 Architecture A GPU is not just a large

    collection of individual cores. Instead, cores are grouped into units called Streaming Multiprocessors (SMs). • SM = small processor that manages many cores • CUDA cores = simple arithmetic units inside the SM For a Tesla T4 GPU: • 40 SMs • 64 CUDA cores inside each SM So the total number of CUDA cores is: 40 x 64 = 2,560
  5. 7 Threads

    Grid
    ├─ Block
    │  ├─ Thread
    │  ├─ Thread
    │  └─ Thread
    └─ Block
       ├─ Thread
       ├─ Thread
       └─ Thread
  6. 8 Threads Example: • blocks = 100 •

    threads per block = 256 • total: 256 x 100 = 25,600 threads This means we are asking the GPU to execute 25,600 parallel tasks.
  7. 9 Execution in Warps The GPU scheduler executes threads

    in groups called warps. Warp = 32 threads This is the basic execution unit of the GPU. So the GPU runs: • 32 threads together • all executing the same instruction at the same time This is called SIMT (Single Instruction, Multiple Threads). 25,600 / 32 = 800 warps (distributed across 40 SMs)
  8. 10 CUDA Programming Model • kernel • grid • block

    • thread i = blockIdx.x * blockDim.x + threadIdx.x blocks = 2 threads per block = 4 Block 0 → threads 0–3 Block 1 → threads 4–7
  9. 12 Platform Running CUDA locally requires: • NVIDIA

    GPU • CUDA drivers • CUDA toolkit • compiler setup Colab avoids these issues: • runs in browser • free GPU access • no installation required
  10. 23 Problem and CPU solution • Compute distance

    from origin for many points, d = sqrt(x² + y²), where each point is independent // Sequential C function: void distance_cpu(float *x, float *y, float *d, int n) { for (int i = 0; i < n; i++) { d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]); } }
  11. 24 CUDA Kernel // CUDA kernel: __global__ void distance_gpu(float

    *x, float *y, float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) { d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]); } } threads = 256 blocks = (N + threads - 1) / threads distance_gpu <<<blocks, 256>>> (...)
  12. 25 Execution Pipeline 1. allocate GPU memory

    2. copy data to GPU 3. launch kernel 4. threads execute in parallel 5. copy results back
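The five steps above can be sketched as one CUDA host program, reusing the distance_gpu kernel from the previous slide. This is a minimal sketch, not the course's reference code: it assumes an NVIDIA GPU, compiles with nvcc (for example in a Colab runtime), and omits error checking of the CUDA calls for brevity.

```cuda
#include <math.h>

__global__ void distance_gpu(float *x, float *y, float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]);
}

int main(void) {
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float h_x[N], h_y[N], h_d[N];
    for (int i = 0; i < N; i++) { h_x[i] = 3.0f; h_y[i] = 4.0f; }

    // 1. allocate GPU memory
    float *d_x, *d_y, *d_d;
    cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes); cudaMalloc(&d_d, bytes);

    // 2. copy data to GPU
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // 3. launch kernel; 4. threads execute in parallel
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    distance_gpu<<<blocks, threads>>>(d_x, d_y, d_d, N);

    // 5. copy results back (cudaMemcpy synchronizes with the kernel)
    cudaMemcpy(h_d, d_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x); cudaFree(d_y); cudaFree(d_d);
    return 0;
}
```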
  13. 30 N = 10,000,000 Threads per block = 256 Blocks

    = 39,063 10,000,128 threads are created!
  14. CSC 364 Introduction to Networked, Distributed, and Parallel

    Computing Javier Gonzalez-Sanchez, Ph.D. [email protected] Winter 2026 Copyright. These slides can only be used as study material for the class CSC 364 at Cal Poly. They cannot be distributed or used for another purpose. 37