How to Run Neural Nets on GPUs

This talk is just what the title says. I will demonstrate how to run a neural net on a GPU because neural nets are solving some interesting problems and GPUs are a good tool to use.

Neural networks have regained popularity over the last decade or so because we can finally apply them to real-world problems (e.g. Siri, self-driving cars, facial recognition). This is due to significant improvements in computational power and in the amount of data available for building the models. However, neural nets still face a barrier to entry as a practical tool in companies because implementing them and getting value out of them can be computationally expensive.

GPUs are popular processors in gaming and research due to their computational speed. The parallel structure of deep neural nets (millions of identical nodes that perform the same operation on different data) is ideal for GPUs. Depending on the neural net, you can use a single server with GPUs instead of a CPU cluster, which improves communication latency and reduces size and power consumption. Running an optimization method (training algorithm) like Stochastic Gradient Descent on a GPU can be up to 40 times faster than on a CPU.

This talk will briefly explain what neural nets are and why they're important, as well as give context about GPUs. Then I will walk through the code and actually launch a neural net on a GPU. I will cover key pitfalls you may hit and techniques to diagnose and troubleshoot. You will walk away understanding how to approach using GPUs on your own and have some resources to dive into for further understanding.

Melanie Warrick

September 26, 2015

Transcript

  1. Neural Nets on GPUs Melanie Warrick @nyghtowl

  2. @nyghtowl Overview • Neural Nets (NNs) • Graphics Processing Units (GPUs) • Code Stuff
  3. @nyghtowl Artificial Neural Nets. Diagram: Input, Hidden, and Output layers feeding a Loss Function. Output: $y_k = \sum_{j=1}^{M} W_{kj} X_j$. Run until error stops improving = converge
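The "run until error stops improving = converge" idea from slide 3 can be sketched in plain Python. This is a minimal illustration and not the talk's code: the one-weight linear model, the toy data, the learning rate `lr`, and the stopping threshold `tol` are all assumptions chosen to keep the example small.

```python
def train_until_converged(xs, ys, lr=0.01, tol=1e-9, max_epochs=10_000):
    """Fit y ~ w * x by gradient descent; stop when the loss stops improving."""
    w = 0.0
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        # Mean squared error loss and its gradient with respect to w.
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        if prev_loss - loss < tol:  # error stopped improving => converged
            return w, loss, epoch
        prev_loss = loss
        w -= lr * grad
    return w, prev_loss, max_epochs


# Hypothetical toy data generated from y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w, loss, epochs = train_until_converged(xs, ys)
```

A real network repeats the same loop with many weights per layer, which is where the training time discussed later in the talk comes from.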
  4. @nyghtowl Real World... - Siri (NLP) - Google Car (Computer Vision) - Netflix (Recommender Engines)
  5. @nyghtowl

  6. @nyghtowl Key Reasons for Success - Computational Power - Labeled & Accessible Data
  7. @nyghtowl Example: Computer Vision Layers. Input: Pixels; Layer 2: Edges; Layer 3: Object Parts; Layer 4: Object Models. Einstein?
  8. @nyghtowl NN Challenge => Training Time: still computationally expensive.

  9. @nyghtowl What are GPUs - Graphics Card - Gaming & Research - Optimized for FLOPs
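  10. @nyghtowl CPUs vs GPUs
    - focus: decision maker (CPUs) vs laborer (GPUs)
    - processing: sequential / serial (CPUs) vs parallel (GPUs)
    - cores: 4 - 48 (CPUs) vs 100s - 1000s (GPUs)
    - RAM: 16.8M TB (CPUs) vs 2-12GB (GPUs)
    - ALU: 4-12 32-bit instructs / clock (CPUs) vs 32K 32-bit instructs / clock (GPUs)
    - FLOPs: faster (GPUs)
    - clock speed: ~ 1000s MHz (CPUs) vs ~ 100s MHz (GPUs)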
  11. @nyghtowl Dist-Belief: YouTube Image Rec
    - Google - 2012: 1K CPUs = 16K cores - $5B - week
    - Stanford - 2013: 3 GPUs = 18K cores - $33K - week
  12. @nyghtowl GPU Hardware & Software
    - Nvidia: Lower power & quieter; AWS & Google; CUDA & OpenCL; GeForce (Consumer), Quadro (Prof) & Tesla (HPC); Titan X (3K cores & 12GB RAM)
    - AMD: Lower price; Macs; OpenCL; Radeon (Consumer), FirePro (Prof & HPC)
  13. @nyghtowl GPU Challenges - Moving data on and off GPU - Memory limits - Branching
  14. @nyghtowl Moving Memory:

  15. @nyghtowl Memory Limits: Options - Resize Data - Minibatch &/or Stream - Distributed Approach
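The "Minibatch &/or Stream" option from slide 15 can be sketched as a generator that hands over one slice of the data at a time, so that only one mini-batch needs to be resident in GPU memory. This is a plain-Python illustration under assumptions: the helper names `minibatches` and `batch_bytes`, the float32 size of 4 bytes per value, and the toy dataset are not from the talk.

```python
def minibatches(examples, batch_size):
    """Yield successive slices so only one mini-batch is held at a time."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


def batch_bytes(batch_size, n_features, bytes_per_value=4):
    """Rough size of one mini-batch of inputs (4 bytes assumes float32)."""
    return batch_size * n_features * bytes_per_value


# Hypothetical dataset: 10 examples with 784 values each (like MNIST pixels).
data = [[0.0] * 784 for _ in range(10)]
sizes = [len(batch) for batch in minibatches(data, batch_size=4)]
per_batch = batch_bytes(batch_size=128, n_features=784)
```

Shrinking `batch_size` is the knob to reach for when a run is killed for exceeding GPU memory; the estimate covers inputs only, not weights or activations.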
  16. @nyghtowl Distributed Approach
    - HW: 1 chip; mult chips; mult boxes
    - SW: split data; split NN model; mix of both
  17. @nyghtowl Branching: GPU Chip Processing

  18. @nyghtowl GPU Block is Single Minded

  19. @nyghtowl

  20. @nyghtowl Focus on Matrix Math. Diagram: Input, Hidden, and Output layers connected by weights W1, W2, W3, W4 and feeding a Loss Function. Output: $y_k = \sum_{j=1}^{M} W_{kj} X_j$
  21. @nyghtowl Independent Math. Diagram: 1 Example X (a b c d e f g h) combined with each of W1, W2, W3, W4 (each a b c d e f g h) to give the 1st Hidden Layer
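The point of the "Independent Math" slide is that each hidden node's weighted sum depends only on its own weights and the shared example, so the sums can be computed in any order or all at once. The sketch below illustrates that with standard-library threads standing in for GPU threads. It is an assumption-laden toy: the function names and values are hypothetical, and Python threads will not show a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(weights, example):
    """Weighted sum for one hidden node."""
    return sum(w * x for w, x in zip(weights, example))


def hidden_layer_serial(weight_rows, example):
    """Compute the nodes one after another, as a single CPU core would."""
    return [dot(row, example) for row in weight_rows]


def hidden_layer_parallel(weight_rows, example, workers=4):
    """Compute each node as its own task; no task needs another's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: dot(row, example), weight_rows))


# Hypothetical example X and four weight rows (W1..W4).
example = [1.0, 2.0, 3.0]
weight_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.5, 0.5],
]
serial = hidden_layer_serial(weight_rows, example)
parallel = hidden_layer_parallel(weight_rows, example)
```

Both versions give the same answer, which is what lets a GPU hand each row to a different thread.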
  22. @nyghtowl Ex: CUDA GPU Code
    - allocate memory: cudaMalloc((void**)&dA, sizeof(double) * size * size);
    - data in: cublasSetMatrix(size, size, sizeof(double), B, size, dB, size);
    - run kernel: cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, size, size, size, &one, dB, size, dB, size, &zero, dA, size);
    - data out: cublasGetMatrix(size, size, sizeof(double), dA, size, A, size);
    - sync: cudaDeviceSynchronize();
  23. @nyghtowl Neural Net Packages
    - Languages: Java / Scala, Python, C / C++, Lua, Matlab
    - Packages: DL4J, Theano, Neon, Caffe, Torch, ConvNet, Lasagne, Graphlab, CXXNet, OpenDeep, Chainer, Minerva, Keras, NuPIC, DeepLearning, Kayak, PyBrain, Blocks, PyLearn2
  24. @nyghtowl Simple Commands
    • Caffe: "solver_mode: GPU"
    • DL4J: <artifactId>nd4j-jcublas-*.0</artifactId>
    • Theano: $ THEANO_FLAGS=device=gpu python example.py
  25. @nyghtowl Neural net code….

  26. @nyghtowl Example: MNIST ~ “Hello World” • Classify handwritten digits 0-9 • Each pixel is an input • Input value ranges 0-255 (white to black)
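"Each pixel is an input" means a 2-D image is flattened into one long input vector before it reaches the network. The sketch below shows that step. The 28x28 image size matches the 784 input nodes mentioned later in the deck, while the scaling to 0.0-1.0 and the helper name `to_input_vector` are assumptions added for illustration.

```python
def to_input_vector(image_rows):
    """Flatten a 2-D grid of 0-255 pixel values into one list scaled to 0.0-1.0."""
    return [pixel / 255.0 for row in image_rows for pixel in row]


# Hypothetical 28x28 image: all white (0) except one black (255) pixel.
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255
inputs = to_input_vector(image)
```

The pixel at row 3, column 5 lands at index 3 * 28 + 5 = 89 of the flattened vector.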
  27. @nyghtowl Example: Input
    0 0 1 0 0 0 0 1 1 0
    0 1 1 0 0 0 0 1 1 0
    0 1 1 0 0 0 0 1 1 0
    0 1 1 0 0 0 0 1 1 0
    0 1 1 1 1 1 1 1 1 0
    0 0 1 1 1 1 1 1 1 0
    0 0 0 0 0 0 0 1 1 0
    0 0 0 0 0 0 0 1 1 0
    0 0 0 0 0 0 0 1 1 0
    0 0 0 0 0 0 0 1 0 0
  28. @nyghtowl Example: MNIST Structure. Input: 784 Nodes; Hidden: 1000; Output: 10 outputs (9 8 7 6 5 4 3 2 1 0). Example digit: 4. Output: $y_k = \sum_{j=1}^{M} W_{kj} X_j$
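To make the 784 / 1000 / 10 structure concrete, here is a minimal pure-Python forward pass through a network of that shape. It is a sketch under assumptions and not the talk's code: the random weights are untrained, and the bias terms, tanh hidden activation, and softmax output are choices made for illustration.

```python
import math
import random

LAYER_SIZES = [784, 1000, 10]  # input, hidden, output node counts from the slide


def count_parameters(sizes):
    """Weights plus one bias per non-input node (biases are an assumption)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layer(n_in, n_out, rng):
    """Small random weights (one row per output node) and zero biases."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(weights, biases, inputs):
    """Weighted sum for every node in a layer: sum_j W[k][j] * x[j] + b[k]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, inputs):
    (hidden_w, hidden_b), (out_w, out_b) = layers
    hidden = [math.tanh(v) for v in dense(hidden_w, hidden_b, inputs)]
    return softmax(dense(out_w, out_b, hidden))


rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
pixels = [rng.random() for _ in range(784)]  # stand-in for one scaled image
probs = forward(layers, pixels)
predicted_digit = probs.index(max(probs))
n_params = count_parameters(LAYER_SIZES)
```

The hidden layer alone costs 784,000 multiply-adds per example, which is the kind of repetitive matrix math the talk proposes moving to a GPU.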
  29. @nyghtowl Techniques to Troubleshoot
    - no CUDA-capable device is detected => check GPU is running: sudo kextload /System/Library/Extensions/CUDA.kext
    - kill => reduce size of mini-batch and check flushing frequency
    - runs slow => check syncing interval across the chip
    - IOError: No such file or directory: => fix data path
  30. @nyghtowl References: Neural Nets
    • 10 Misconceptions about Neural Networks http://www.turingfinance.com/misconceptions-about-neural-networks/#blackbox
    • Nature of Code: Neural Networks http://natureofcode.com/book/chapter-10-neural-networks/
    • Theano Tutorial http://deeplearning.net/software/theano/tutorial/index.html#tutorial
    • Machine Learning (Coursera - Ng) https://class.coursera.org/ml-005/lecture
    • Hacker’s Guide to Neural Nets (Stanford - Karpathy) https://karpathy.github.io/neuralnets/
    • Neural Networks for Machine Learning (Coursera - Hinton) https://class.coursera.org/neuralnets-2012-001/lecture
    • Neural Nets and Deep Learning http://neuralnetworksanddeeplearning.com/
    • Deep Learning Stanford CS http://deeplearning.stanford.edu/
    • Deep Learning Tutorial (NYU - LeCun) http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf
    • Deep Learning Tutorial (U Montreal - Bengio) http://deeplearning.net/tutorial/deeplearning.pdf
    • Tutorial on Deep Learning for Vision https://sites.google.com/site/deeplearningcvpr2014/
  31. @nyghtowl References: GPUs
    • An Introduction to Using GPUs for Computation: http://www.stat.berkeley.edu/scf/paciorek-gpuWorkshop.html
    • Comparison of GPU and CPU implementations of mean-firing rate neural networks on parallel hardware http://www.researchgate.net/publication/233392650_Comparison_of_GPU-_and_CPU-implementations_of_mean-firing_rate_neural_networks_on_parallel_hardware
    • My first CUDA program! https://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/
    • PyCuda Tutorial: http://documen.tician.de/pycuda/tutorial.html#transferring-data
    • Accelerated Machine Learning with the cuDNN Deep Neural Network Library http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/
    • Neural Networks with Parallel and GPU Computing: https://www.mathworks.com/help/nnet/ug/neural-networks-with-parallel-and-gpu-computing.html
    • One weird trick for parallelizing convolutional neural networks: http://arxiv.org/pdf/1404.5997v2.pdf
    • Which GPU for deep-learning: https://timdettmers.wordpress.com/2014/08/14/which-gpu-for-deep-learning/
    • Why a GPU mines faster than a CPU: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU
    • https://developer.nvidia.com/gpu-accelerated-libraries
  32. @nyghtowl References: Setup
    • Deeplearning4J: http://nd4j.org/getstarted.html
    • Caffe: http://caffe.berkeleyvision.org/install_osx.html
    • Theano: http://deeplearning.net/software/theano/install.html
    • PyCuda: http://wiki.tiker.net/PyCuda/Installation/Mac#Pre-install_Tips
    • Installing CUDA, OpenCL, & PyOpenCL on AWS EC2: http://vasir.net/blog/opencl/installing-cuda-opencl-pyopencl-on-aws-ec2
  33. @nyghtowl References: Images
    • http://www.texample.net/tikz/examples/neural-network/
    • http://jaoying-google.blogspot.com/2012_12_01_archive.html
    • https://www.kaggle.com/forums/f/15/kaggle-forum/t/10878/feature-representation-in-deep-learning
    • http://users.clas.ufl.edu/glue/longman/1/einstein.html
    • http://www.nvidia.com/object/what-is-gpu-computing.html
    • https://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/pa/pa1/
    • http://www.hitechreview.com/it-products/pc/nvidia-strikes-back-presents-tesla-k20x-graphics-card/40392/
    • http://disney.wikia.com/wiki/Magic_Brooms
    • http://www.playinterference.com/view/7713/
    • http://www.drmichellemazur.com/wp-content/uploads/2013/08/fowl_storm.jpg
    • https://eda360insider.wordpress.com/2011/09/14/what-would-you-do-with-a-23000-simultaneous-thread-school-of-piranha-asks-nvidia/
    • http://www.nvidiadefect.com/what-exactly-is-the-nvidia-defect-t3.html
    • http://www.maximumpc.com/everything-you-need-to-know-about-nvidias-gf100-fermi-gpu/
    • http://www.theregister.co.uk/2013/11/16/nvidia_reveals_cuda_6_joins_cpugpu_shared_memory_party/
    • http://adailypinch.com/bill-cat-spirit-animal
    • https://stackoverflow.com/questions/20146098/can-cpu-process-write-to-memoryuva-in-gpu-ram-allocated-by-other-cpu-process
  34. @nyghtowl Special Thanks • Tim Elser • Tarin Ziyaee • Phillip Culliton • Megan Speir • Lindsay Cade • Isabel Markl • Jeremy Dunck • Erin O’Connell • Cyprien Noel • Christian Fernandez • Charles Ruhland • Bryan Catanzaro • Adam Gibson
  35. @nyghtowl Last Points - NNs ~ personalization - Training NNs is hard - GPUs make training faster - Same thing multiple times at the same time. Go play with GPUs!
  36. @nyghtowl How to Run Neural Nets on GPUs. Melanie Warrick. github.com/nyghtowl/Neural_Nets_GPUs (code), skymind.io (company), gitter.im/deeplearning4j/deeplearning4j