
iDeep meets Chainer - Chainer LT Meetup #1 2018

0dgr
March 22, 2018


Transcript

  1. What is iDeep
     • An open-source library for DL frameworks, optimized for IA (Intel Architecture)
     • Uses MKL and MKL-DNN internally
     • ideep4py provides the Python API
     • Introduces MDArray, which cuts the overhead of converting memory layouts between MKL-DNN and NDArray
     • Compatible with NDArray
     • Chainer supports iDeep since the v4.0.0b4 release
     [Architecture diagram — stack labels: Chainer; NumPy / iDeep (stubs) / CuPy; BLAS / MKL-DNN+MKL / cuDNN+cuBLAS; Intel CPUs / NVIDIA GPU]
     * https://github.com/intel/ideep
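A rough sketch (not from the slides) of the MDArray/NDArray interoperability described above: it wraps a NumPy array in an MDArray and converts it back. The ideep4py.mdarray constructor call is an assumption; its exact name and signature may differ between iDeep versions.

    import numpy as np
    import ideep4py  # iDeep's Python API

    # Plain NumPy array in NCHW float32, the layout MKL-DNN kernels expect.
    x = np.random.rand(8, 3, 224, 224).astype(np.float32)

    # Wrap it as an MDArray so MKL-DNN can manage its internal memory layout.
    # NOTE: the constructor name is an assumption and may differ across iDeep versions.
    mx = ideep4py.mdarray(x)

    # MDArray stays NumPy-compatible, so it can be converted back for NDArray code paths.
    y = np.asarray(mx)
    print(type(mx), y.shape, np.allclose(x, y))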
  2. How to Use
     Installation
     • pip:
       $ pip install ideep4py
     • Source code:
       $ git clone -b v1.0.3 https://github.com/intel/ideep.git
       $ cd ideep/python
       $ python setup.py build && python setup.py install
     • Rename or delete the libmkldnn.so bundled with Intel Python (works around an iDeep library-dependency issue)
     Usage
     • Additional environment variable: CHAINER_USE_IDEEP="auto"
     • SGD, MomentumSGD: model.to_intel64()
     * https://github.com/chainer/chainer/releases/tag/v4.0.0b4
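A minimal end-to-end sketch of the usage above, assuming Chainer v4.0.0b4 or later with ideep4py installed; the two-layer MLP and the single training step are illustrative, not from the slides.

    import os
    os.environ.setdefault("CHAINER_USE_IDEEP", "auto")  # set before the model runs

    import numpy as np
    import chainer
    import chainer.functions as F
    import chainer.links as L

    # Illustrative model; any chainer.Chain can be converted the same way.
    class MLP(chainer.Chain):
        def __init__(self):
            super(MLP, self).__init__()
            with self.init_scope():
                self.l1 = L.Linear(None, 100)
                self.l2 = L.Linear(100, 10)

        def __call__(self, x):
            return self.l2(F.relu(self.l1(x)))

    model = MLP()
    optimizer = chainer.optimizers.MomentumSGD()
    optimizer.setup(model)
    model.to_intel64()  # move parameters to the iDeep (intel64) backend

    # One schematic training step; MKL-DNN kernels are used where supported.
    x = np.random.rand(32, 784).astype(np.float32)
    t = np.zeros(32, dtype=np.int32)
    model.cleargrads()
    loss = F.softmax_cross_entropy(model(x), t)
    loss.backward()
    optimizer.update()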
  3. Performance (Conditions)
     • SKX (self-built workstation, roughly 2.5M JPY): Intel Xeon Platinum 8180, 2.50 GHz (TB: 3.80 GHz) x2, 192GB
     • SKL (Lenovo P50, mobile workstation, roughly 500K JPY): Intel Xeon E3-1505M v5, 2.80 GHz (TB: 3.70 GHz), 64GB
     • Chainer: v4.0.0b4
     • iDeep: v1.0.3
     • Python: Intel Distribution for Python 3
     • Convnet-benchmarks: https://github.com/mitmul/convnet-benchmarks
  4. Performance on Xeon Platinum 8180
     [Chart: iDeep vs NumPy+MKL vs NumPy+MKL (SKL) — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to iDeep (Chainer) execution time]
     [Chart: Caffe (MKL-DNN) vs Chainer (iDeep) — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to Caffe execution time]
     Slower than SKL?
  5. Performance on Different Platforms
     [Chart: P100 vs SKX — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to P100 execution time]
     * P100 numbers were captured by running the benchmark app as-is, not by CUDA/GPGPU experts, so they may not reflect the best configuration and optimization.