llelism (machine learning, image processing, scientific computing, …)
• CPU :: 4-32 cores
• GPU :: Example: the NVIDIA Tesla T4 has 2,560 CUDA cores (small CPUs) and 320 Tensor Cores (tensor: N-dimensional packaged data and the operations on it).
rge collection of individual cores. Instead, cores are grouped into units called Streaming Multiprocessors (SMs).
• SM = small processor that manages many cores
• CUDA cores = simple arithmetic units inside the SM
For a Tesla T4 GPU:
• 40 SMs
• 64 CUDA cores inside each SM
So the total number of CUDA cores is: 40 × 64 = 2,560
ds in groups called warps.
Warp = 32 threads
This is the basic execution unit of the GPU. So the GPU runs:
• 32 threads together
• all executing the same instruction at the same time
This is called SIMT (Single Instruction, Multiple Threads).
25,600 threads / 32 = 800 warps (distributed across the 40 SMs)
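The warp arithmetic above can be sketched in C. This is only an illustration: the names `WARP_SIZE`, `warps_needed`, and `warps_per_sm` are hypothetical, and the even spread of warps across SMs is an assumption made for the example, not a statement about how the hardware scheduler actually assigns them.

```c
#define WARP_SIZE 32  /* threads per warp, as stated on the slide */

/* Number of warps needed to cover `threads` threads.
   Rounds up: a partially filled warp still occupies a whole warp. */
int warps_needed(int threads) {
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warps per SM, ASSUMING the warps are spread evenly across the SMs.
   Integer division: any remainder is discarded. */
int warps_per_sm(int warps, int sm_count) {
    return warps / sm_count;
}
```

With the slide's numbers, `warps_needed(25600)` gives 800, and under the even-spread assumption `warps_per_sm(800, 40)` gives 20 warps per SM.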
from origin for many points, d = sqrt(x² + y²), where each point is independent

// Sequential C function:
#include <math.h>   // sqrtf

void distance_cpu(const float *x, const float *y, float *d, int n) {
    // One iteration per point; no iteration depends on another.
    for (int i = 0; i < n; i++) {
        d[i] = sqrtf(x[i]*x[i] + y[i]*y[i]);
    }
}
Computing Javier Gonzalez-Sanchez, Ph.D. [email protected] Winter 2026 Copyright. These slides can only be used as study material for the class CSC 364 at Cal Poly. They cannot be distributed or used for another purpose. 37