How to Run Neural Nets on GPUs - Strata

Neural Nets on GPUs Melanie Warrick @nyghtowl

@nyghtowl • Neural Nets (NNs) • Graphical Processing Units (GPUs)
• Code Stuff Overview

@nyghtowl Artificial Neural Nets Input Output Hidden Run until error
stops improving = converge Loss Function Output k j X M kj W y

@nyghtowl Real World... - Real-time Language Translation (NLP) - Auto
Image Tagging (Computer Vision) - Movie Recs (Recommender Engines)

@nyghtowl

@nyghtowl Key Reasons for Success Computational Power Labeled & Accessible
Data

@nyghtowl Einstein? Example: Computer Vision Layers Pixels Edges Object Parts
Object Models Layer 2 Layer 3 Input Layer 4

@nyghtowl NN Challenge => Training Time still computationally expensive .

@nyghtowl - Graphics Card - Gaming & Research - Optimized
for FLOPs What are GPUs

@nyghtowl CPUs GPUs focus decision maker laborer processing sequential serial
parallel cores 4 - 48 100s - 1000s RAM 16.8M TB 2-12GB ALU 4-12 32-bit instructs / clock 32K 32-bit instructs / clock FLOPs faster clock speed ~ 1000s MHz ~ 100s MHz vs

@nyghtowl Dist-Belief: YouTube Image Rec Google - 2012 Stanford -
2013 - 1K CPUs = 16K cores - $5B - week - 3 GPUs = 18K cores - $33K - week

@nyghtowl GPU Hardware & Software Nvidia AMD Lower power &
quieter Lower price AWS & Google Macs CUDA & OpenCL OpenCL • GeForce (Consumer) • Quadro (Prof) & Tesla ( HPC) • Radeon (Consumer) • FirePro (Prof & HPC) Titan X (3K cores & 12GB RAM)

@nyghtowl GPU Challenges - Moving data on and off GPU
- Memory limits - Branching

@nyghtowl Moving Memory:

@nyghtowl Memory Limits: Options - Resize Data - Minibatch &/or
Stream - Distributed Approach

@nyghtowl Distributed Approach HW - 1 chip - mult chips
- mult boxes SW - split data - split NN model - mix of both

@nyghtowl Branching: GPU Chip Processing

@nyghtowl GPU Block is Single Minded

@nyghtowl

@nyghtowl Focus on Matrix Math W 3 W 4 W
2 W 1 Input Output Hidden Loss Function Output k j X M kj W y

@nyghtowl Independent Math W1 X a b c d e
f g 1 Example h W2 a b c d e f g h W3 a b c d e f g h 1st Hidden Layer W4 a b c d e f g h

@nyghtowl Ex: Cuda GPU Code allocate memory cudaMalloc((void**)&dev_a, N *
sizeof(int)); data in cudaMemcpy( dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice); run kernel cublasDgemm(); OR add<<N, 1>>( dev_a, dev_b, dev_c ); data out cudaMemcpy( c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost ); sync cudaDeviceSynchronize(); free memory cudaFree(dev_a);

@nyghtowl Neural Net Packages Java / Scala Python C /
C++ Lua Matlab DL4J Theano TensorFlow Caffe Torch ConvNet Neon Chainer MXNet matrbm Lasagne Graphlab SINGA DBN Keras NuPIC Eblearn Kayak PyBrain Cuda-Convnet Blocks PyLearn2

@nyghtowl Simple Commands Caffe • "solver_mode: GPU" DL4J • <artifactId>nd4j-jcublas-*.0</artifactId>
Theano • $ THEANO_FLAGS=device=gpu, python example.py

@nyghtowl Neural net code….

@nyghtowl Example: MNIST ~ “Hello World” • Classify handwritten digits
0-9 • Each pixel is an input • Input value ranges 0-255 (white to black)

0 0 1 0 0 0 0 1 1 0
0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 Example: Input @nyghtowl

Example: MNIST Structure Output Hidden ...784 Nodes 1000 10 outputs
4 9 8 7 6 5 4 3 2 1 0 @nyghtowl Output k j X M kj W y Input

@nyghtowl Tools for Troubleshooting Debugging - Cuda-Memcheck - Cuda-GDB Profiling
- Nvidia Visual Profiler - CUDA Profiling Interface Nvidia Nsight

@nyghtowl no CUDA-capable device is detected => check GPU is
running sudo kextload /System/Library/Extensions/CUDA.kext kill => reduce size of mini-batch and check flushing frequency runs slow => check syncing interval across the chip Troubleshoot Pointers

@nyghtowl • 10 Misconceptions about Neural Networks http://www.turingfinance.com/misconceptions-about- neural-networks/#blackbox •
Nature of Code: Neural Networks http://natureofcode.com/book/chapter-10-neural-networks/ • Theano Tutorial http://deeplearning.net/software/theano/tutorial/index.html#tutorial • Machine Learning (Coursera-Ng) https://class.coursera.org/ml-005/lecture • Hacker’s Guide to Neural Nets (Stanford - Karpathy) https://karpathy.github.io/neuralnets/ • Neural Networks for Machine Learning (Coursera - Hinton) https://class.coursera.org/neuralnets-2012- 001/lecture • Neural Nets and Deep Learning http://neuralnetworksanddeeplearning.com/ • Deep Learning Stanford CS http://deeplearning.stanford.edu/ • Deep Learning Tutorial (NYU - LeCun) http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf • Deep Learning Tutorial (U Montreal - Bengio) http://deeplearning.net/tutorial/deeplearning.pdf • Tutorial on Deep Learning for Vision https://sites.google.com/site/deeplearningcvpr2014/ • Deep Learning http://deeplearning.net/ References: Neural Nets

@nyghtowl • An Introduction to Using GPUs for Computation: http://www.stat.berkeley.edu/scf/paciorek-
gpuWorkshop.html • Comparison of GPU and CPU implementations of mean-firing rate neural networks on parallel hardware http://www.researchgate.net/publication/233392650_Comparison_of_GPU-_and_CPU- implementations_of_mean-firing_rate_neural_networks_on_parallel_hardware • My first CUDA program! https://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/ • PyCuda Tutorial: http://documen.tician.de/pycuda/tutorial.html#transferring-data • Accelerated Machine Learning with the cuDNN Deep Neural Network Library http://devblogs.nvidia. com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ • Neural Networks with Parallel and GPU Computing: https://www.mathworks.com/help/nnet/ug/neural- networks-with-parallel-and-gpu-computing.html • One weird trick for parallelizing convolutional neural networks: http://arxiv.org/pdf/1404.5997v2.pdf • Which GPU for deep-learning: https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deeplearning/ • Why a GPU mines faster than a CPU: https://developer.nvidia.com/gpu-accelerated-libraries https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU References: GPUs

@nyghtowl • Deeplearning4J: http://nd4j.org/getstarted.html • Caffe: http://caffe.berkeleyvision.org/install_osx.html • Theano: http://deeplearning.net/software/theano/install.html
• PyCuda: http://wiki.tiker.net/PyCuda/Installation/Mac#Pre-install_Tips • Installing CUDA, OpenCL, & PyOpenCL on AWS EC2: http://vasir.net/blog/opencl/installing-cuda-opencl- pyopencl-on-aws-ec2 References: Setup

@nyghtowl References: Images • http://www.texample.net/tikz/examples/neural-network/ • http://jaoying-google.blogspot.com/2012_12_01_archive.html • https://www.kaggle.com/forums/f/15/kaggle-forum/t/10878/feature-representation-in-deep-learning •
http://users.clas.ufl.edu/glue/longman/1/einstein.html • http://www.nvidia.com/object/what-is-gpu-computing.html • https://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/pa/pa1/ • http://www.hitechreview.com/it-products/pc/nvidia-strikes-back-presents-tesla-k20x-graphics- card/40392/ • http://disney.wikia.com/wiki/Magic_Brooms • http://www.playinterference.com/view/7713/ • http://www.drmichellemazur.com/wp-content/uploads/2013/08/fowl_storm.jpg • https://eda360insider.wordpress.com/2011/09/14/what-would-you-do-with-a-23000-simultaneous- thread-school-of-piranha-asks-nvidia/ • http://www.nvidiadefect.com/what-exactly-is-the-nvidia-defect-t3.html • http://www.maximumpc.com/everything-you-need-to-know-about-nvidias-gf100-fermi-gpu/ • http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/ • http://adailypinch.com/bill-cat-spirit-animal • https://stackoverflow.com/questions/20146098/can-cpu-process-write-to-memoryuva-in-gpu-ram- allocated-by-other-cpu-process

@nyghtowl • Tim Elser • Tarin Ziyaee • Phillip Culliton
• Megan Speir • Lindsay Cade • Isabel Markl • Jeremy Dunck • Erin O’Connell • Cyprien Noel • Christian Fernandez • Charles Ruhland • Bryan Catanzaro • Adam Gibson Special Thanks

@nyghtowl Last Points - NNs ~ personalization - Training NNs
is hard - GPUs makes training faster - Same thing multiple times at the same time Go play with GPUs!

@nyghtowl How to Run Neural Nets on GPUs Melanie Warrick
github.com/nyghtowl/Neural_Nets_GPUs(code) skymind.io (company) gitter.im/deeplearning4j/deeplearning4j (chat)

How to Run Neural Nets on GPUs - Strata

How to Run Neural Nets on GPUs - Strata

More Decks by Melanie Warrick

Other Decks in Technology

Featured

Transcript