Seo @sonots
・AI System Dept. ・Analytics Infra Group ・Tools Team (Supervisor) ・Cloud Infra (Supervisor)
・Fluentd & Ruby Committer
・Recently seconded to PFN
・Chainer/CuPy core dev
Basic CUDA programming
CUDA programs are written as:
1. Allocate GPU memory
2. Copy CPU memory to GPU memory
3. Launch a CUDA kernel (a function executed by the GPU)
4. Copy GPU memory to CPU memory
5. Free GPU memory
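For reference, a minimal sketch of these five steps from Python, assuming CuPy's thin wrappers around the CUDA runtime API (cupy.cuda.runtime.malloc / memcpy / free); the kernel launch in step 3 is left as a placeholder:

import numpy as np
import cupy

h_src = np.arange(10, dtype=np.float32)   # input on the CPU
h_dst = np.empty_like(h_src)              # output buffer on the CPU
nbytes = h_src.nbytes

d_ptr = cupy.cuda.runtime.malloc(nbytes)                        # 1. allocate GPU memory
cupy.cuda.runtime.memcpy(d_ptr, h_src.ctypes.data, nbytes,
                         cupy.cuda.runtime.memcpyHostToDevice)  # 2. copy CPU -> GPU
# 3. launch a CUDA kernel here that reads/writes the buffer at d_ptr
cupy.cuda.runtime.memcpy(h_dst.ctypes.data, d_ptr, nbytes,
                         cupy.cuda.runtime.memcpyDeviceToHost)  # 4. copy GPU -> CPU
cupy.cuda.runtime.free(d_ptr)                                   # 5. free GPU memory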
Memory pool improvement
[Figure: arena of free bins keyed by size class (512, 1024, 1536, 2048, 2560, …); each size class has its own free_list (bin) of cached chunks]
1. Round up the requested memory size to a multiple of 512
2. cudaMalloc, and use it
3. Push the block to the arena instead of cudaFree
4. Pop from the arena instead of cudaMalloc if a block of exactly the same size is available in the arena
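A toy sketch of this exact-fit pool (not CuPy's actual implementation; cuda_malloc below is a hypothetical stand-in for the real cudaMalloc call):

import collections

ROUND = 512
arena = collections.defaultdict(list)              # size class -> free_list of cached blocks

def round_up(size):
    return ((size + ROUND - 1) // ROUND) * ROUND   # 1. round up to a multiple of 512

def pool_malloc(size):
    size = round_up(size)
    free_list = arena[size]
    if free_list:                                  # 4. reuse only a block of exactly the same size
        return free_list.pop()
    return cuda_malloc(size)                       # 2. otherwise fall back to cudaMalloc (hypothetical wrapper)

def pool_free(block, size):
    arena[round_up(size)].append(block)            # 3. cache the block instead of cudaFree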
Memory pool improvement
• A cache miss occurs even if memory blocks larger than the required size exist in the pool, because only blocks of exactly the same size are reused.
• Such cache misses are typical in Natural Language Processing applications, whose input data sizes vary.
[Figure: size classes 512, 1024, 1536, 2048, 2560, …; a 512-byte block is wanted but only a larger block is cached]
• How about using a cached block of a larger size (2560)? (best-fit)
Memory pool improvement: split and merge
Split:
1. Pop a chunk if one of a larger size than the required size is available
2. Split it, use only the necessary size, and push back a chunk of the remaining size
[Figure: a larger chunk (2048) is popped, the required 512 is split off, and the remainder is pushed back; split chunks keep next/prev links to each other]
(a toy sketch of split and merge follows the Merge step below)
Memory pool improvement: split and merge
Merge:
1. When freeing a chunk, merge it with its next or prev chunks if they are in the free lists
2. Push back the merged chunk
[Figure: a freed 512 chunk is merged with its free neighbour and the merged 2048 chunk is pushed back to the free list]
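A toy sketch of both steps (not CuPy's actual implementation): chunks keep prev/next links to their physical neighbours, malloc splits a larger cached chunk, and free merges with free neighbours before pushing back; cuda_malloc is a hypothetical cudaMalloc wrapper.

import collections

ROUND = 512
arena = collections.defaultdict(list)        # size class -> free_list of Chunks

class Chunk:
    def __init__(self, ptr, size):
        self.ptr, self.size = ptr, size
        self.prev = self.next = None         # physical neighbours created by splits
        self.in_use = False

def split(chunk, size):
    # Cut `size` bytes off the front of `chunk`; return (head, remainder).
    remainder = Chunk(chunk.ptr + size, chunk.size - size)
    chunk.size = size
    remainder.prev, remainder.next = chunk, chunk.next
    if chunk.next:
        chunk.next.prev = remainder
    chunk.next = remainder
    return chunk, remainder

def merge(a, b):
    # Merge physically adjacent chunks a (lower address) and b (higher address).
    a.size += b.size
    a.next = b.next
    if b.next:
        b.next.prev = a
    return a

def pool_malloc(size):
    size = ((size + ROUND - 1) // ROUND) * ROUND
    for cls in sorted(s for s in arena if s >= size and arena[s]):
        chunk = arena[cls].pop()                         # 1. pop a chunk of equal or larger size
        if chunk.size > size:
            chunk, remainder = split(chunk, size)        # 2. split, push back the remainder
            arena[remainder.size].append(remainder)
        chunk.in_use = True
        return chunk
    chunk = Chunk(cuda_malloc(size), size)               # fall back to cudaMalloc (hypothetical wrapper)
    chunk.in_use = True
    return chunk

def pool_free(chunk):
    chunk.in_use = False
    if chunk.next and not chunk.next.in_use:             # 1. merge with free next neighbour
        arena[chunk.next.size].remove(chunk.next)
        chunk = merge(chunk, chunk.next)
    if chunk.prev and not chunk.prev.in_use:             #    and with free prev neighbour
        arena[chunk.prev.size].remove(chunk.prev)
        chunk = merge(chunk.prev, chunk)
    arena[chunk.size].append(chunk)                      # 2. push back the merged chunk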
GPU memory profiler
from chainer.function_hooks import CupyMemoryProfileHook

hook = CupyMemoryProfileHook()
with hook:
    trainer.run()
hook.print_report()
The report shows bytes used from the CuPy memory pool and bytes acquired from the GPU device.
GPU memory profiler
$ nvprof -o prof.nvvp python examples/cusolver.py
$ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp
I installed nvvp on Mac OS X (https://developer.nvidia.com/cuda-downloads) and scp'd prof.nvvp from the GPU machine.
Support CUDA stream
• The ability to perform multiple CUDA operations simultaneously:
  • CUDA kernels
  • cudaMemcpyAsync (HostToDevice)
  • cudaMemcpyAsync (DeviceToHost)
Support CUDA stream: memory pool with one stream
[Figure: CPU timeline (malloc, launch kernel1, free, malloc, launch kernel2, free); Kernel1 and Kernel2 run one after another on stream1]
• Memory is returned to the memory pool before kernel execution finishes
• This was fine because kernel2 is guaranteed to run after kernel1 on the same stream
Support CUDA stream: broken memory pool with two streams
[Figure: CPU timeline (malloc, launch kernel1, free, malloc, launch kernel2, free); Kernel1 runs on stream1 while Kernel2 runs concurrently on stream2]
• kernel2 may use memory blocks which kernel1 is still using
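A hedged illustration of the hazardous allocation pattern, assuming a single shared memory pool as described above (whether it actually races depends on the pool implementation):

import cupy

stream1 = cupy.cuda.stream.Stream()
stream2 = cupy.cuda.stream.Stream()

x = cupy.ones((4096, 4096), dtype=cupy.float32)

with stream1:
    tmp = x * 2    # kernel1 is launched asynchronously on stream1
del tmp            # the CPU returns tmp's block to the pool right away

with stream2:
    y = cupy.empty((4096, 4096), dtype=cupy.float32)  # a naive shared pool could hand out tmp's block here
    y.fill(0)      # kernel2 on stream2 could then race with kernel1 still running on stream1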
Support CUDA stream: (1) a separate memory pool for each stream
[Figure: stream1 allocates from Mem pool1 and stream2 from Mem pool2; Kernel1 and Kernel2 run concurrently, each on its own stream]
• Drawback: stream2 cannot reuse cached memory of stream1
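A toy sketch of approach (1) (not CuPy's actual implementation): keep one free-list arena per stream, keyed by the active stream's pointer, so blocks cached for one stream are never handed to kernels on another; cuda_malloc is again a hypothetical cudaMalloc wrapper.

import collections
import cupy

ROUND = 512
arenas = collections.defaultdict(lambda: collections.defaultdict(list))  # stream_ptr -> size -> free_list

def pool_malloc(size):
    size = ((size + ROUND - 1) // ROUND) * ROUND
    stream_ptr = cupy.cuda.get_current_stream().ptr   # select the pool of the active stream
    free_list = arenas[stream_ptr][size]
    if free_list:
        return free_list.pop()
    return cuda_malloc(size)                          # hypothetical cudaMalloc wrapper

def pool_free(block, size, stream_ptr):
    size = ((size + ROUND - 1) // ROUND) * ROUND
    arenas[stream_ptr][size].append(block)            # cached only for the stream that used it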
Support CUDA stream: (2) use a CUDA stream callback to free memory
[Figure: CPU launches kernel1 on stream1 and kernel2 on stream2; each free is deferred to a callback that runs after its kernel finishes]
• kernel2 does not touch memory which kernel1 is still using
• Drawback: registering a callback for every kernel would degrade performance
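A minimal sketch of approach (2), assuming cupy.cuda.Stream.add_callback (which wraps cudaStreamAddCallback); pool_malloc and return_to_pool are hypothetical pool helpers:

import cupy

stream1 = cupy.cuda.stream.Stream()
nbytes = 4 * 1024 * 1024

def deferred_free(stream, status, block):
    # Called only after all work queued on `stream` so far has finished.
    return_to_pool(block)                        # hypothetical: push the block back to the pool

with stream1:
    block = pool_malloc(nbytes)                  # hypothetical pool allocation used by kernel1
    # ... launch kernel1 on stream1 using `block` ...
    stream1.add_callback(deferred_free, block)   # defer the free until kernel1 completes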
Support CUDA stream
• (1) Create a separate memory pool for each stream
• (2) Use a CUDA stream callback? → cudaStreamAddCallback synchronizes the CPU and GPU (https://gist.github.com/sonots/e98a95aaceae65a15d2b59a81bem023)
So, we chose (1)
Support CUDA stream (example.py)
import cupy

x = cupy.array([1])
y = cupy.array([1])
stream1 = cupy.cuda.stream.Stream()
stream2 = cupy.cuda.stream.Stream()
with stream1:
    z1 = x + y
with stream2:
    z2 = x * y
# By default, the default stream waits until all streams' operations finish
z = z1 + z2
Support CUDA stream (example.py)
import chainer
import chainer.functions as F
import chainer.links as L

class MyAwesomeNet(chainer.Chain):
    def __init__(self):
        super(MyAwesomeNet, self).__init__()
        with self.init_scope():
            self.stream1 = chainer.cuda.stream.Stream()
            self.stream2 = chainer.cuda.stream.Stream()
            self.conv1 = L.Convolution2D(None, 384, 3, pad=1)
            self.conv2 = L.Convolution2D(None, 384, 3, pad=1)  # added for completeness; not shown on the original slide

    def __call__(self, x, t):
        with self.stream1:
            h1 = self.conv1(x)
        with self.stream2:
            h2 = self.conv2(x)