Slide 1

Neural Nets on GPUs
Melanie Warrick
@nyghtowl

Slide 2

Overview
● Neural Nets (NNs)
● Graphics Processing Units (GPUs)
● Code Stuff

Slide 3

Artificial Neural Nets
- Input, hidden, and output layers feed a loss function
- Run until error stops improving = converge
- Each output unit computes y_k = Σ_{j=1..M} W_kj · x_j
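A minimal numpy sketch of that forward pass and loss check (the layer sizes, tanh activation, and squared-error loss are illustrative assumptions, not from the slide):

    import numpy as np

    def forward(X, W, f=np.tanh):
        # y_k = f(sum over j of W_kj * x_j) for every output unit k
        return f(W @ X)

    # toy sizes: M = 3 inputs, 2 output units (illustrative only)
    X = np.array([0.5, -1.0, 0.25])
    W = np.random.randn(2, 3)
    y = forward(X, W)

    # training repeats forward passes and weight updates until this
    # error stops improving -- the "converge" on the slide
    target = np.array([1.0, 0.0])
    loss = np.sum((y - target) ** 2)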

Slide 4

Real World...
- Siri (NLP)
- Google Car (Computer Vision)
- Netflix (Recommender Engines)

Slide 5


Slide 6

Key Reasons for Success
- Computational Power
- Labeled & Accessible Data

Slide 7

Example: Computer Vision Layers (Einstein image)
Input: pixels → Layer 2: edges → Layer 3: object parts → Layer 4: object models

Slide 8

NN Challenge => Training Time: still computationally expensive.

Slide 9

What are GPUs?
- Graphics Card
- Gaming & Research
- Optimized for FLOPs

Slide 10

CPUs vs GPUs

              CPUs                               GPUs
focus         decision maker                     laborer
processing    sequential / serial                parallel
cores         4 - 48                             100s - 1000s
RAM           16.8M TB                           2 - 12 GB
ALU           4 - 12 32-bit instructs / clock    32K 32-bit instructs / clock
FLOPs         faster clock speed (~1000s MHz)    ~100s MHz

Slide 11

Dist-Belief: YouTube Image Recognition
- Google (2012): 1K CPUs = 16K cores, $5M, one week
- Stanford (2013): 3 GPUs = 18K cores, $33K, one week

Slide 12

GPU Hardware & Software

Nvidia: lower power & quieter; on AWS & Google; CUDA & OpenCL
● GeForce (Consumer)
● Quadro (Prof) & Tesla (HPC)
● Titan X (3K cores & 12GB RAM)

AMD: lower price; in Macs; OpenCL
● Radeon (Consumer)
● FirePro (Prof & HPC)

Slide 13

GPU Challenges
- Moving data on and off GPU
- Memory limits
- Branching

Slide 14

Moving Memory:

Slide 15

Memory Limits: Options
- Resize Data
- Minibatch &/or Stream (sketch below)
- Distributed Approach
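A minimal Python sketch of the minibatch option, assuming the dataset is too big for GPU memory and is moved over one slice at a time (names and batch size are illustrative):

    import numpy as np

    def minibatches(X, batch_size=128):
        # yield GPU-sized slices instead of copying the whole
        # dataset onto the device at once
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size]

    X = np.random.randn(10000, 784)
    for batch in minibatches(X):
        pass  # copy `batch` to the GPU, run the kernel, copy results back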

Slide 16

Distributed Approach
- HW: 1 chip → multiple chips → multiple boxes
- SW: split data, split NN model, or a mix of both (data-split sketch below)
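A sketch of the "split data" option in plain numpy, assuming each worker computes a gradient on its own shard and the results are averaged (single process here, purely illustrative):

    import numpy as np

    def worker_gradient(W, X_shard, y_shard):
        # stand-in for one worker's backprop on its own GPU:
        # least-squares gradient on the shard
        error = X_shard @ W - y_shard
        return X_shard.T @ error / len(X_shard)

    X, y = np.random.randn(1000, 8), np.random.randn(1000, 1)
    W = np.zeros((8, 1))
    shards = np.array_split(np.arange(len(X)), 4)  # 4 "workers"

    grads = [worker_gradient(W, X[i], y[i]) for i in shards]
    W -= 0.01 * np.mean(grads, axis=0)  # average gradients, take one step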

Slide 17

Branching: GPU Chip Processing

Slide 18

GPU Block is Single-Minded

Slide 19


Slide 20

Focus on Matrix Math
(Diagram: weight matrices W1-W4 connect the input, hidden, and output layers, feeding the loss function; each output is y_k = Σ_{j=1..M} W_kj · x_j)

Slide 21

Independent Math
(Diagram: one example X multiplied against weight matrices W1-W4 of the 1st hidden layer; each resulting entry a-h is an independent computation)
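In numpy terms, the slide's point is that a single matrix multiply holds many independent dot products, which is exactly what a GPU runs in parallel (sizes are illustrative):

    import numpy as np

    X = np.random.randn(64, 784)     # 64 examples
    W1 = np.random.randn(784, 1000)  # 1st hidden layer weights

    # each of the 64 * 1000 entries of H is an independent dot
    # product, so all of them can be computed at the same time
    H = X @ W1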

Slide 22

Ex: CUDA GPU Code

    // allocate memory
    cudaMalloc((void**)&dA, sizeof(double) * size * size);

    // data in
    cublasSetMatrix(size, size, sizeof(double), B, size, dB, size);

    // run kernel
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size,
                &one, dB, size, dB, size, &zero, dA, size);

    // data out
    cublasGetMatrix(size, size, sizeof(double), dA, size, A, size);

    // sync
    cudaDeviceSynchronize();
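The same allocate / copy in / run kernel / copy out / sync cycle sketched in PyCUDA (the PyCuda tutorial is linked in the references); the kernel here is a trivial stand-in, not an equivalent of the cuBLAS matrix multiply above:

    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
        __global__ void double_it(double *a) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            a[i] *= 2.0;
        }
    """)
    double_it = mod.get_function("double_it")

    a = np.random.randn(256)                        # float64 host data
    d_a = cuda.mem_alloc(a.nbytes)                  # allocate memory
    cuda.memcpy_htod(d_a, a)                        # data in
    double_it(d_a, block=(256, 1, 1), grid=(1, 1))  # run kernel
    cuda.memcpy_dtoh(a, d_a)                        # data out
    cuda.Context.synchronize()                      # sync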

Slide 23

Neural Net Packages
- Java / Scala: DL4J
- Python: Theano, Neon, Lasagne, GraphLab, OpenDeep, Chainer, Keras, NuPIC, Kayak, PyBrain, Blocks, PyLearn2
- C / C++: Caffe, CXXNet, Minerva
- Lua: Torch
- Matlab: ConvNet, DeepLearning

Slide 24

Simple Commands
● Caffe: solver_mode: GPU
● DL4J: include the nd4j-jcublas backend (nd4j-jcublas-*.0)
● Theano: $ THEANO_FLAGS=device=gpu python example.py (check snippet below)
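A quick way to confirm Theano actually picked up the GPU, adapted from the Theano tutorial linked in the references (the exact device string varies by install):

    import theano
    import theano.tensor as T

    x = T.vector('x')
    f = theano.function([x], T.exp(x))

    # prints 'gpu' (or 'gpu0', etc.) when launched with
    # THEANO_FLAGS=device=gpu, and 'cpu' otherwise
    print(theano.config.device)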

Slide 25

Neural net code…

Slide 26

Example: MNIST ~ “Hello World”
● Classify handwritten digits 0-9
● Each pixel is an input
● Input values range 0-255 (white to black)

Slide 27

Example: Input

0 0 1 0 0 0 0 1 1 0
0 1 1 0 0 0 0 1 1 0
0 1 1 0 0 0 0 1 1 0
0 1 1 0 0 0 0 1 1 0
0 1 1 1 1 1 1 1 1 0
0 0 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 1 0 0

Slide 28

Example: MNIST Structure
- Input: 784 nodes (one per pixel)
- Hidden: 1000 nodes
- Output: 10 outputs, digits 0-9 (here the net predicts 4)
- y_k = Σ_{j=1..M} W_kj · x_j
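The 784-1000-10 structure as a numpy forward pass; the weights are random and untrained, so this only demonstrates the shapes, not a real classifier:

    import numpy as np

    pixels = np.random.randint(0, 256, 784) / 255.0  # flattened 28x28 image
    W1 = np.random.randn(784, 1000) * 0.01           # input -> hidden
    W2 = np.random.randn(1000, 10) * 0.01            # hidden -> output

    hidden = np.tanh(pixels @ W1)
    scores = hidden @ W2
    probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax over digits 0-9
    print(probs.argmax())                            # predicted digit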

Slide 29

Techniques to Troubleshoot
- "no CUDA-capable device is detected" => check the GPU is running: sudo kextload /System/Library/Extensions/CUDA.kext
- kill => reduce the mini-batch size and check flushing frequency
- runs slow => check the syncing interval across the chip
- "IOError: No such file or directory" => fix the data path

Slide 30

References: Neural Nets
● 10 Misconceptions about Neural Networks: http://www.turingfinance.com/misconceptions-about-neural-networks/#blackbox
● Nature of Code: Neural Networks: http://natureofcode.com/book/chapter-10-neural-networks/
● Theano Tutorial: http://deeplearning.net/software/theano/tutorial/index.html#tutorial
● Machine Learning (Coursera - Ng): https://class.coursera.org/ml-005/lecture
● Hacker’s Guide to Neural Nets (Stanford - Karpathy): https://karpathy.github.io/neuralnets/
● Neural Networks for Machine Learning (Coursera - Hinton): https://class.coursera.org/neuralnets-2012-001/lecture
● Neural Nets and Deep Learning: http://neuralnetworksanddeeplearning.com/
● Deep Learning Stanford CS: http://deeplearning.stanford.edu/
● Deep Learning Tutorial (NYU - LeCun): http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf
● Deep Learning Tutorial (U Montreal - Bengio): http://deeplearning.net/tutorial/deeplearning.pdf
● Tutorial on Deep Learning for Vision: https://sites.google.com/site/deeplearningcvpr2014/

Slide 31

References: GPUs
● An Introduction to Using GPUs for Computation: http://www.stat.berkeley.edu/scf/paciorek-gpuWorkshop.html
● Comparison of GPU- and CPU-implementations of mean-firing rate neural networks on parallel hardware: http://www.researchgate.net/publication/233392650_Comparison_of_GPU-_and_CPU-implementations_of_mean-firing_rate_neural_networks_on_parallel_hardware
● My first CUDA program!: https://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/
● PyCuda Tutorial: http://documen.tician.de/pycuda/tutorial.html#transferring-data
● Accelerated Machine Learning with the cuDNN Deep Neural Network Library: http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/
● Neural Networks with Parallel and GPU Computing: https://www.mathworks.com/help/nnet/ug/neural-networks-with-parallel-and-gpu-computing.html
● One weird trick for parallelizing convolutional neural networks: http://arxiv.org/pdf/1404.5997v2.pdf
● Which GPU for deep learning: https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/
● GPU-accelerated libraries: https://developer.nvidia.com/gpu-accelerated-libraries
● Why a GPU mines faster than a CPU: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU

Slide 32

References: Setup
● Deeplearning4J: http://nd4j.org/getstarted.html
● Caffe: http://caffe.berkeleyvision.org/install_osx.html
● Theano: http://deeplearning.net/software/theano/install.html
● PyCuda: http://wiki.tiker.net/PyCuda/Installation/Mac#Pre-install_Tips
● Installing CUDA, OpenCL, & PyOpenCL on AWS EC2: http://vasir.net/blog/opencl/installing-cuda-opencl-pyopencl-on-aws-ec2

Slide 33

References: Images
● http://www.texample.net/tikz/examples/neural-network/
● http://jaoying-google.blogspot.com/2012_12_01_archive.html
● https://www.kaggle.com/forums/f/15/kaggle-forum/t/10878/feature-representation-in-deep-learning
● http://users.clas.ufl.edu/glue/longman/1/einstein.html
● http://www.nvidia.com/object/what-is-gpu-computing.html
● https://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/pa/pa1/
● http://www.hitechreview.com/it-products/pc/nvidia-strikes-back-presents-tesla-k20x-graphics-card/40392/
● http://disney.wikia.com/wiki/Magic_Brooms
● http://www.playinterference.com/view/7713/
● http://www.drmichellemazur.com/wp-content/uploads/2013/08/fowl_storm.jpg
● https://eda360insider.wordpress.com/2011/09/14/what-would-you-do-with-a-23000-simultaneous-thread-school-of-piranha-asks-nvidia/
● http://www.nvidiadefect.com/what-exactly-is-the-nvidia-defect-t3.html
● http://www.maximumpc.com/everything-you-need-to-know-about-nvidias-gf100-fermi-gpu/
● http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/
● http://adailypinch.com/bill-cat-spirit-animal
● https://stackoverflow.com/questions/20146098/can-cpu-process-write-to-memoryuva-in-gpu-ram-allocated-by-other-cpu-process

Slide 34

Special Thanks
● Tim Elser
● Tarin Ziyaee
● Phillip Culliton
● Megan Speir
● Lindsay Cade
● Isabel Markl
● Jeremy Dunck
● Erin O’Connell
● Cyprien Noel
● Christian Fernandez
● Charles Ruhland
● Bryan Catanzaro
● Adam Gibson

Slide 35

Last Points
- NNs ~ personalization
- Training NNs is hard
- GPUs make training faster
- Same thing multiple times at the same time
Go play with GPUs!

Slide 36

How to Run Neural Nets on GPUs
Melanie Warrick
github.com/nyghtowl/Neural_Nets_GPUs (code)
skymind.io (company)
gitter.im/deeplearning4j/deeplearning4j