
iDeep meets Chainer - Chainer LT Meetup #1 2018

0dgr
March 22, 2018


Transcript

  1. What is iDeep
     • An open-source library for DL frameworks, optimized for IA (Intel Architecture)
     • Uses MKL and MKL-DNN internally
     • ideep4py provides the Python API
     • Introduces MDArray, which cuts the overhead of converting memory layouts between MKL-DNN and NDArray
     • Compatible with NDArray
     • Chainer supports iDeep since the v4.0.0b4 release
     [Architecture diagram — stack labels: Chainer; NumPy / iDeep (stubs) / CuPy; BLAS / MKL-DNN+MKL / cuDNN+cuBLAS; Intel CPUs / NVIDIA GPU]
     * https://github.com/intel/ideep
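A rough sketch (not from the slides) of the MDArray/NDArray interoperability described above: it wraps a NumPy array in an MDArray and converts it back. The ideep4py.mdarray constructor call is an assumption; its exact name and signature may differ between iDeep versions.

    import numpy as np
    import ideep4py  # iDeep's Python API

    # Plain NumPy array in NCHW float32, the layout MKL-DNN kernels expect.
    x = np.random.rand(8, 3, 224, 224).astype(np.float32)

    # Wrap it as an MDArray so MKL-DNN can manage its internal memory layout.
    # NOTE: the constructor name is an assumption and may differ across iDeep versions.
    mx = ideep4py.mdarray(x)

    # MDArray stays NumPy-compatible, so it can be converted back for NDArray code paths.
    y = np.asarray(mx)
    print(type(mx), y.shape, np.allclose(x, y))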
  2. How to Use
     Installation
     • pip:
       $ pip install ideep4py
     • Source code:
       $ git clone -b v1.0.3 https://github.com/intel/ideep.git
       $ cd ideep/python
       $ python setup.py build && python setup.py install
     • Rename or delete the libmkldnn.so bundled with Intel Python (works around an iDeep library-dependency issue)
     Usage
     • Additional environment variable: CHAINER_USE_IDEEP="auto"
     • SGD, MomentumSGD: model.to_intel64()
     * https://github.com/chainer/chainer/releases/tag/v4.0.0b4
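A minimal end-to-end sketch of the usage above, assuming Chainer v4.0.0b4 or later with ideep4py installed; the two-layer MLP and the single training step are illustrative, not from the slides.

    import os
    os.environ.setdefault("CHAINER_USE_IDEEP", "auto")  # set before the model runs

    import numpy as np
    import chainer
    import chainer.functions as F
    import chainer.links as L

    # Illustrative model; any chainer.Chain can be converted the same way.
    class MLP(chainer.Chain):
        def __init__(self):
            super(MLP, self).__init__()
            with self.init_scope():
                self.l1 = L.Linear(None, 100)
                self.l2 = L.Linear(100, 10)

        def __call__(self, x):
            return self.l2(F.relu(self.l1(x)))

    model = MLP()
    optimizer = chainer.optimizers.MomentumSGD()
    optimizer.setup(model)
    model.to_intel64()  # move parameters to the iDeep (intel64) backend

    # One schematic training step; MKL-DNN kernels are used where supported.
    x = np.random.rand(32, 784).astype(np.float32)
    t = np.zeros(32, dtype=np.int32)
    model.cleargrads()
    loss = F.softmax_cross_entropy(model(x), t)
    loss.backward()
    optimizer.update()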
  3. Performance (Conditions)
     • SKX (self-built workstation, roughly 2.5M JPY): Intel Xeon Platinum 8180, 2.50 GHz (TB: 3.80 GHz) x2, 192GB
     • SKL (Lenovo P50, mobile workstation, roughly 500K JPY): Intel Xeon E3-1505M v5, 2.80 GHz (TB: 3.70 GHz), 64GB
     • Chainer: v4.0.0b4
     • iDeep: v1.0.3
     • Python: Intel Distribution for Python 3
     • Convnet-benchmarks: https://github.com/mitmul/convnet-benchmarks
  4. Performance on Xeon Platinum 8180
     [Chart: iDeep vs NumPy+MKL vs NumPy+MKL (SKL) — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to iDeep (Chainer) execution time]
     [Chart: Caffe (MKL-DNN) vs Chainer (iDeep) — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to Caffe execution time]
     Slower than SKL?
  5. Performance on Different Platforms
     [Chart: P100 vs SKX — Forward + Backward exec time for AlexNet, GoogleNet, VGGA, Overfeat; smaller is better; normalized to P100 execution time]
     * P100 numbers were captured by running the benchmark app as-is, not by CUDA/GPGPU experts, so they may not reflect the best configuration and optimization.