Useful Fundamentals for Building AI Applications That Run on x86 CPUs

Useful Fundamentals for Building AI Applications That Run on x86 CPUs
IBM Developer Dojo++ (2020/9/16)
Intel K.K., APJ Data Center Group Sales, AI Technical Solution Specialist
Hiroshi Ouchiyama (大内山浩)

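Slide 2: About the speaker
Hiroshi Ouchiyama, Intel K.K., AI Technical Solution Specialist (an AI "jack-of-all-trades"?)
▪ Career • 2006-: IBM • 2016-: Microsoft • 2019-: Intel
▪ Hobbies • Travel, soccer, music, YouTube
[Photos: 10 years ago / a few years ago / manga version]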

Slide 3: Intel's logo has been renewed. [Image: old logo / new logo]

Slide 4: What is a CPU?
CPU stands for Central Processing Unit. It plays the central role inside a PC or server and is sometimes compared to the human brain: picture the brain issuing commands that move the arms and legs.

Slide 5: Looking back at CPU history (microarchitecture in parentheses)
[Timeline figure. Product families: Xeon, Core, Atom. Products shown: Intel 8085, Intel 8086, Intel 80386, Pentium II, Pentium II Xeon, Pentium 4※ (NetBurst), Xeon (NetBurst), Core, Atom (Bonnell), Atom (Silvermont), Atom (Airmont), 10th gen Core (Comet Lake), Xeon-SP (Cascade Lake). Year markers on the axis: ~1975, 1978, 1985, 1991, 1997, 1998, 2000, 2001, 2008, 2012, ~2020.]

Slide 6: Looking back at CPU history (microarchitecture in parentheses), with callouts
[Same timeline figure as slide 5, with the following callouts added:]
• Intel 8086: the origin of the "x86" name.
• Atom: introduced for low-power use. To keep power consumption down it used in-order execution (Bonnell); later generations support out-of-order execution (Silvermont).
• Pentium II Xeon: based on Pentium II, extended with multi-socket support and other features.
• Xeon (NetBurst): based on Pentium 4, extended with multi-socket support and other features.

Slide 7: How many Intel® CPU generations do you know?
Xeon (old to new):
• Intel® Xeon® processor E3/E5/E7: AVX
• Intel® Xeon® processor E3/E5/E7 v2: AVX
• Intel® Xeon® processor E3/E5/E7 v3 (code name: Haswell): AVX2
• Intel® Xeon® processor E3/E5/E7 v4 (code name: Broadwell): AVX2
• Intel® Xeon® Scalable processor (code name: Skylake): AVX-512
• 2nd Gen Intel® Xeon® Scalable processor (code name: Cascade Lake): AVX-512, DL Boost
• 3rd Gen Intel® Xeon® Scalable processor (code name: Cooper Lake): AVX-512, DL Boost, BF16
Core (old to new):
• 5th, 6th, 7th, 8th, and 9th Gen Intel® Core™ processors: AVX2
• 10th Gen Intel® Core™ processor (code name: Ice Lake / Comet Lake): AVX-512, DL Boost ※
• 11th Gen Intel® Core™ processor (code name: Tiger Lake): AVX-512, DL Boost
※ Comet Lake does not include AVX-512/VNNI; it stays at AVX2.
Generations with AVX-512 are recommended.
AVX-512 → https://www.intel.co.jp/content/www/jp/ja/architecture-and-technology/avx-512-overview.html
[Callout on the slide: available on IBM Cloud]
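To find out which of these instruction sets a given machine actually exposes, one option on Linux is to read the CPU flags from /proc/cpuinfo. The following is a minimal sketch, not part of the original deck; it assumes a Linux host, and the mapping from slide terms to kernel flag names (avx, avx2, avx512f, avx512_vnni, avx512_bf16) is the illustrative part.

```python
from pathlib import Path

# Linux /proc/cpuinfo flag names for the instruction sets named on the slide.
FEATURE_FLAGS = {
    "AVX": "avx",
    "AVX2": "avx2",
    "AVX-512 (Foundation)": "avx512f",
    "DL Boost (VNNI)": "avx512_vnni",
    "DL Boost (BF16)": "avx512_bf16",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags found on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def detect_features(cpuinfo_text):
    """Map each feature name to True/False depending on the reported flags."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in FEATURE_FLAGS.items()}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        for name, present in detect_features(path.read_text()).items():
            print(f"{name:22s} {'yes' if present else 'no'}")
    else:
        print("/proc/cpuinfo not found (this sketch assumes Linux)")
```

On a cloud instance this is a quick way to confirm that the VM type really passes AVX-512 or VNNI through to the guest.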

Slide 8: CPUs run AI "too"
What characterizes a CPU is the generality and flexibility to handle any kind of workload.

Slide 9: Intel® Xeon® processor platform performance
Deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Family. Now ready for deep learning.
• INFERENCE THROUGHPUT: up to 277x [1] higher Intel optimized Caffe GoogleNet v1 with Intel® MKL inference throughput on Intel® Xeon® Platinum 8180 Processor, compared to Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe.
• TRAINING THROUGHPUT: up to 241x [1] higher Intel Optimized Caffe AlexNet with Intel® MKL training throughput on Intel® Xeon® Platinum 8180 Processor, compared to Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe.
• Optimized Frameworks; Optimized Intel® oneDNN Libraries.
• Inference and training throughput uses FP32 instructions.
[1] The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance
Source: Intel measured as of June 2018. Configurations: See slide 4.

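Slide 10: The AI lifecycle: building models / operating models
Intel works with customers across the entire AI lifecycle.
[Figure: time-to-solution across the dev cycle. Stages and shares, in slide order: Opportunity 15%, Hypotheses 15%, Data 23%, Modeling 15%, Deployment 15%, Iteration 8%, Evaluation 8%. Followed by: Source Data, Inference, Scale & Deploy inference within broader application (Build, Deploy & Scale).]
Building the model (mostly training, some inference): the phase of building models over and over within a limited period in pursuit of the target accuracy.
• Short duration • High performance
Operating the model (mostly inference, some training): the phase of deploying the model to production and running inference on new business data continuously.
• Long duration • Scalable • Greater emphasis on ROI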

Slide 11: ML & Analytics / DL training / DL inference

Slide 12: ML & Analytics / DL training / DL inference

Slide 13: Instruction sets suited to AI workloads
▪ Starting from AVX-512 (SIMD), VNNI was added for INT8 inference, followed by instructions for BFLOAT16 training and inference.
• Inside Intel® Xeon® Scalable processors (Skylake): Intel® AVX-512 (Intel® Advanced Vector Extensions 512)
• Inside 2nd Gen Intel® Xeon® Scalable processors (Cascade Lake): Intel® Deep Learning Boost (Vector Neural Network Instructions (VNNI) for INT8)
• Inside 3rd Gen Intel® Xeon® Scalable processors (Cooper Lake): Intel® Deep Learning Boost (for BFLOAT16)
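As background on the BFLOAT16 format mentioned above: bfloat16 keeps the sign bit and the 8 exponent bits of FP32 and shortens the mantissa to 7 bits, so a bfloat16 value is simply the upper 16 bits of an FP32 value. The following is a minimal sketch, not from the deck, that emulates the conversion in pure Python; the function names are hypothetical and NaN/infinity handling is deliberately left out.

```python
import struct


def float_to_bits(x):
    """Reinterpret a Python float as the 32-bit pattern of an IEEE 754 FP32 value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_bfloat16(x):
    """Round an FP32 value to the nearest bfloat16 (ties to even).

    The result is returned as a float whose low 16 mantissa bits are zero.
    Assumption: x is a finite number; NaN and infinity are not handled.
    """
    bits = float_to_bits(x)
    # Add 0x7FFF, plus 1 when the lowest kept bit is set, then drop the low 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_to_float((bits + rounding_bias) & 0xFFFF0000)


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e-3, 65504.0):
        print(value, "->", round_to_bfloat16(value))
```

Because the exponent width is unchanged, bfloat16 covers the same dynamic range as FP32 and only gives up precision, which is one reason it is considered usable for training as well as inference.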

Slide 14: Intel® AI software: machine learning and deep learning
(On the slide, products in red font are the most broadly applicable SW products for AI users.)
[Layered diagram. Labels and audiences named on the slide: Developer Tools (App Developers); SW Platform Developer; Architect & DevOps; Topologies & Models (Data Scientist); Frameworks (Data Scientist); Graph (ML Performance Engineer); Kernel (ML Performance Engineer). Columns: Machine Learning, Deep Learning.]
▪ Intel Data Analytics Acceleration Library (Intel DAAL)
▪ Intel Math Kernel Library (Intel MKL)
▪ Intel Machine Learning Scaling Library (Intel MLSL)
▪ Intel® Deep Neural Network Library (DNNL)
▪ Intel Distribution for Python (SKlearn, Pandas)
Deep Learning Reference Stack; Data Analytics Reference Stack; Management Tools; Containers
Hardware: CPU ▪ GPU ▪ FPGA ▪ dedicated accelerators

Slide 15: Intel's optimization of deep learning frameworks
Installation guides: ai.intel.com/framework-optimizations/
Scaling
▪ Better load balancing
▪ Fewer synchronization events and less all-to-all communication
Utilizing all cores
▪ OpenMP, MPI
▪ Fewer synchronization events and less serial code
▪ Better load balancing
Vectorization / SIMD
▪ Unit-stride access per SIMD lane
▪ High vector efficiency
▪ Data alignment
Efficient use of memory and cache
▪ Blocking
▪ Data reuse
▪ Prefetching
▪ Memory allocation
Optimization of further frameworks is in progress (e.g., PaddlePaddle*, CNTK*).
SEE ALSO: Machine Learning Libraries for Python (Scikit-learn, Pandas, NumPy), R (Cart, randomForest, e1071), Distributed (MlLib on Spark, Mahout)
*Limited availability today. Optimization Notice.
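From the user's side, the "utilizing all cores" item is usually steered through OpenMP environment variables that must be set before the framework is imported. The following is a minimal sketch, not from the deck: it assumes a framework build that uses the Intel OpenMP runtime (which reads the KMP_* variables), the function name is hypothetical, and the values are commonly cited starting points rather than tuned settings.

```python
import os


def openmp_env(physical_cores):
    """Return a starting set of OpenMP environment variables for CPU-bound DL work.

    Assumptions (not from the slide): one OpenMP thread per physical core,
    a short spin time after parallel regions, and threads pinned close together.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be >= 1")
    return {
        "OMP_NUM_THREADS": str(physical_cores),
        # Milliseconds a thread waits after a parallel region before sleeping.
        "KMP_BLOCKTIME": "1",
        # Pin threads to cores to avoid migration between cores.
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }


def apply_env(env):
    """Export the variables into the current process environment."""
    os.environ.update(env)


if __name__ == "__main__":
    # 18 matches the core count of the Xeon Gold 6254 used later in the deck.
    settings = openmp_env(physical_cores=18)
    apply_env(settings)  # do this before importing TensorFlow or PyTorch
    for key, value in settings.items():
        print(f"{key}={value}")
```

The right values depend on the model and the batch size, so treat these as a baseline to measure against rather than a recommendation.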

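Slide 16: Optimizing and quantizing deep learning models for further inference performance
▪ Optimization: makes the model smarter by removing unnecessary Ops, fusing multiple Ops, and so on.
▪ Quantization*: slims the model down by converting its internal numeric representation from FP32 to INT8.
[Figure: original model (built with TensorFlow*, PyTorch*, etc.) → optimization / quantization / optimization & quantization]
* As of May 2020, this is most effective on 2nd Gen Intel® Xeon® Scalable processors and later, and on 10th Gen Intel® Core™ processor family (Ice Lake† only) and later, which include Intel® Deep Learning Boost (VNNI).
※ Each framework provides its own quantization tool.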
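To make the FP32→INT8 conversion concrete, the following is a minimal sketch of symmetric per-tensor quantization in pure Python. It is an illustration only and not the algorithm of any specific tool named in this deck; real quantization tools additionally calibrate activation ranges on sample data and decide layer by layer what to quantize.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: FP32 values -> INT8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return [c * scale for c in codes]


def int8_dot(codes_a, scale_a, codes_b, scale_b):
    """Dot product computed on integer codes, rescaled once at the end.

    The integer accumulation mirrors what INT8 hardware paths do:
    multiply 8-bit values, accumulate in a wider integer, then apply the scales.
    """
    accumulator = sum(a * b for a, b in zip(codes_a, codes_b))
    return accumulator * scale_a * scale_b


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.02, 1.0]
    inputs = [1.0, 0.25, -0.5, 2.0]
    w_codes, w_scale = quantize_int8(weights)
    x_codes, x_scale = quantize_int8(inputs)
    exact = sum(w * x for w, x in zip(weights, inputs))
    approx = int8_dot(w_codes, w_scale, x_codes, x_scale)
    print("INT8 codes:", w_codes, "scale:", w_scale)
    print("FP32 dot:", exact, "INT8 dot:", approx)
```

Each value is stored in one byte instead of four, and the rounding error per value is at most half of the scale, which is why accuracy usually needs to be re-checked after quantization.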

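Slide 17: OpenVINO™ toolkit
https://software.intel.com/en-us/openvino-toolkit
▪ A library suite for image processing and deep learning inference. Please keep its three features in mind.
A software library suite for computer vision applications (Python and C++ supported): image processing and deep learning inference.
[Feature 1] AI parts
[Feature 2] Model compiler
[Feature 3] Heterogeneous orchestrator
Supported OS: Ubuntu, CentOS, Yocto, Win10, macOS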

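Slide 18: Feature 1: OpenVINO as AI parts
https://software.intel.com/en-us/openvino-toolkit/documentation/pretrained-models
In addition to models published by the open developer community, more than 50 pretrained models developed by Intel are provided free of charge.
Examples: gender and age, text recognition, super-resolution, face recognition, facial landmark detection, emotion recognition, person detection, person and vehicle detection, action detection, human pose estimation, BERT-based QA model, natural language recognition.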

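Slide 19: Feature 2: OpenVINO as a model compiler
[Diagram: trained model (Caffe, TensorFlow*, MxNet*, Kaldi, ONNX* (PyTorch*, Caffe2, etc.)) → Model Optimizer (conversion and optimization) → IR (.data) → Inference Engine (load and infer) through the Inference Engine common API (C++/Python*) → CPU plugin (C++ extensions), GPU plugin (OpenCL* extensions), FPGA plugin, NCS plugin, GNA plugin, VAD plugin. Optimized cross-platform inference.]
IR = intermediate representation format
Model Optimizer
▪ What it is: a Python*-based tool that imports a trained model and converts it to the intermediate representation.
▪ Why it matters: it maximizes performance through conservative topology transformations and conversion to data types that suit the hardware.
Inference Engine
▪ What it is: a high-level inference API.
▪ Why it matters: the interface is implemented as dynamically loaded plugins for each hardware type, which delivers the best performance for each type without having to implement and maintain multiple code paths.
GPU = graphics processing unit / Intel® CPU with integrated Intel® Processor Graphics
VAD = Vision Accelerator Design products, including an FPGA version and a version with 8 Myriad™ X.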
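The "read the IR, then infer through a device plugin" flow can be sketched as follows. This is a hedged illustration rather than code from the deck: it assumes the 2020/2021-era Inference Engine Python API (`openvino.inference_engine.IECore`); older releases expose `net.inputs` instead of `net.input_info`, and newer OpenVINO releases replaced this API altogether. The helper names and the file name `resnet50.xml` are hypothetical.

```python
from pathlib import Path


def ir_weights_path(model_xml):
    """Return the .bin weights path that sits next to an IR .xml file.

    The Model Optimizer writes the topology (.xml) and weights (.bin) as a pair
    sharing the same base name.
    """
    return Path(model_xml).with_suffix(".bin")


def run_openvino_inference(model_xml, input_array, device="CPU"):
    """Load an IR model and run one synchronous inference.

    Assumption: an OpenVINO release that still ships openvino.inference_engine.
    The import is done lazily so this module loads without OpenVINO installed.
    """
    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model=str(model_xml), weights=str(ir_weights_path(model_xml)))
    input_name = next(iter(net.input_info))  # first (often the only) input
    # The device name selects the plugin: "CPU", "GPU", and so on.
    exec_net = ie.load_network(network=net, device_name=device)
    return exec_net.infer(inputs={input_name: input_array})


if __name__ == "__main__":
    print("weights file expected at:", ir_weights_path("resnet50.xml"))
```

Switching hardware is then a matter of changing the `device` string, which is the point the slide makes about plugins.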

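Slide 20: Deep learning inference benchmark (reference values)
Intel® Xeon® Gold 6254 processor @ 2.10GHz (18 cores × 1 socket)
[Bar chart: ResNet50 inference throughput (FPS), Input=224x224, BS=1, 1 stream. Vertical axis: relative performance (x), 0.00 to 8.00. Bars: FP32 (before quantization) and INT8 (after quantization), each for TensorFlow* 1.15.0 (before optimization) and OpenVINO™ toolkit 2020R1 (after optimization).]
Measured on March 20, 2020.
Note: this is a personal benchmark run by an Intel employee to check performance, not an official Intel result.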

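Slide 21: Case study: RIKEN, improving CheXNet inference performance
Batch inference over roughly 22,000 test images. The original model is implemented in PyTorch 1.2.0.
[Chart, Before Optimization → After Optimization, on Xeon 6252 x2:]
• Baseline: 11,177 sec
• Optimization: 1,116 sec (x10.0)
• Quantization: 359 sec (x3.1)
• Parallelization: 251 sec (x1.4), i.e. x44.5 against Baseline
• For comparison, on another vendor's accelerator: 744 sec
What was done at each step:
Optimization
• Convert the model to ONNX
• Convert ONNX→IR with the OpenVINO Model Optimizer
• Run synchronously on the OpenVINO Inference Engine
Quantization
• Convert the numeric representation of some layers in the IR to INT8 with the OpenVINO quantization tool (including customization of the tool)
• Run synchronously on the OpenVINO Inference Engine (using VNNI)
Parallelization
• Run asynchronously on the OpenVINO Inference Engine (8 inference requests in parallel)
See the following GitHub repository for the details of the work above: https://github.com/taneishi/CheXNet (repository of Mr. Taneishi, Advanced Institute for Computational Science)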
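The multipliers on this slide follow directly from the measured times: each step's speedup is the previous time divided by the new time, and the overall figure is the baseline divided by the final time. A small sketch that recomputes them from the slide's numbers (the stage labels are paraphrased):

```python
# Elapsed times from the slide, in seconds, in the order the steps were applied.
STAGES = [
    ("Baseline", 11177),
    ("Optimization", 1116),
    ("Quantization", 359),
    ("Parallelization", 251),
]


def stage_speedups(stages):
    """Return (name, step_speedup, cumulative_speedup) for every stage after the first."""
    baseline_seconds = stages[0][1]
    rows = []
    for (_, previous), (name, current) in zip(stages, stages[1:]):
        rows.append((name, previous / current, baseline_seconds / current))
    return rows


if __name__ == "__main__":
    for name, step, total in stage_speedups(STAGES):
        print(f"{name:16s} step x{step:.1f}  cumulative x{total:.1f}")
```

The step factors multiply to the cumulative factor, which is why three modest-looking steps add up to x44.5.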

Slide 22: How do you do it?
▪ TensorFlow* quantization guide
• https://github.com/IntelAI/tools/releases/tag/v1.0.0
▪ PyTorch* quantization guides
• https://pytorch.org/docs/stable/quantization.html
• https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
• https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html
• https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
▪ OpenVINO™ toolkit model optimization guide (how to use the Model Optimizer)
• https://docs.openvinotoolkit.org/latest/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html
▪ OpenVINO™ toolkit quantization guide (how to use the Post-Training Optimization Toolkit)
• https://docs.openvinotoolkit.org/latest/_README.html
A session on this is planned for a future Dojo.

Slide 24: Training with Huge Memory: U-Net Training by NUS
GPU-based env
• V100 GPU (32GB memory)
• 10 CPU cores
• 126GB RAM
• Batch size of 1
CPU-based env
• 2 x Intel Platinum CPUs
• 2 x 24 CPU cores
• 384GB RAM
• Batch size of 6
Result: the model trained on Intel® CPUs scored on average about 5% higher on DICE (a measure of model accuracy).

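Slide 25: Why use a CPU when a large amount of memory is needed?
[Diagram: physical node #1. GPGPU with GPU Mem (~32GB), connected over PCIe* to the CPU with Main Mem.]
• PCIe* becomes the bottleneck, so access from an accelerator (GPGPU, etc.) to main memory is slow.
• Accelerators therefore keep data in their on-board memory (GPU memory) while processing, but that memory is by no means large.
• A CPU can access the vast main memory directly.
↓
Large data can be handled without special implementation effort.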

Slide 26: What if you want more performance?
↓
Bundle multiple CPUs together. In other words: distributed training.

Slide 27: Scaling deep learning efficiently on existing infrastructure: the GENCI and CERN cases
GENCI: French research institute focused on numerical simulation and HPC across all scientific and industrial fields. Succeeded in training a plant classification model for 300K species, 1.5TByte dataset of 12 million images, on 1024 2S Intel® Xeon® nodes with Resnet50.
CERN: the European Organization for Nuclear Research, which operates the Large Hadron Collider (LHC), the world's largest particle accelerator. 94% scaling efficiency up to 128 nodes, with a significant reduction in training time per epoch for 3D-GANs.
[Chart: High Energy Physics: 3D GANs Training Speedup Performance. Intel 2S Xeon(R) on Stampede2/TACC, OPA Fabric; TensorFlow 1.9 + MKL-DNN + horovod, Intel MPI, Core Aff. BKMs, 4 Workers/Node; 2S Xeon 8160. Series: Secs/Epoch, Speedup, Ideal, Scaling Efficiency. 128-node perf: 148 Secs/Epoch.]
Intel(R) 2S Xeon(R) nodes: 1, 2, 4, 8, 16, 32, 64, 128
Speedup: 1.0, 2.0, 3.9, 7.8, 15.5, 31, 61, 120
Efficiency: 100%, 100%, 98%, 97%, 97%, 96%, 95%, 94%
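Scaling efficiency in a chart like this is the measured speedup divided by the ideal speedup, which equals the node count. A short sketch that recomputes it from the speedups read off the chart (the published speedups are rounded, so the recomputed percentages can differ from the chart labels by up to about one point):

```python
NODES = [1, 2, 4, 8, 16, 32, 64, 128]
SPEEDUP = [1.0, 2.0, 3.9, 7.8, 15.5, 31, 61, 120]
REPORTED_EFFICIENCY = [100, 100, 98, 97, 97, 96, 95, 94]  # percent, from the chart


def scaling_efficiency(nodes, speedup):
    """Efficiency in percent: measured speedup relative to ideal linear scaling."""
    return [100.0 * s / n for n, s in zip(nodes, speedup)]


if __name__ == "__main__":
    computed = scaling_efficiency(NODES, SPEEDUP)
    for n, s, e, r in zip(NODES, SPEEDUP, computed, REPORTED_EFFICIENCY):
        print(f"{n:4d} nodes  speedup {s:6.1f}  efficiency {e:5.1f}%  (chart: {r}%)")
```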

Slide 28: Technical considerations for distributed training on Intel® Xeon® processors
[Diagram: physical node #1 and physical node #2, each with two CPUs with their own Main Mem linked by UPI; the nodes are connected by Ethernet (25Gb~) / IB / OPA. Workers #1 to #8 are spread across the CPUs.]
Collective communication strategy
• Parameter Server • Ring All-Reduce • Butterfly All-Reduce • Tree All-Reduce, etc.
Collective communication library
• oneCCL • MPI • Gloo, etc.
Distributed training framework
• Horovod • Distributed TensorFlow* • Ray, etc.
Virtualization / containers
• XenServer* / KVM • Docker* / Singularity • Kubernetes* / Swarm, etc.
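Of the collective communication strategies listed, Ring All-Reduce is easy to illustrate in a few lines. The following is a single-process simulation, a sketch only and not how oneCCL, MPI, or Horovod implement it: each "worker" is a Python list of gradients, workers sit on a ring, and every step each worker passes one chunk to its right-hand neighbor. After N-1 scatter-reduce steps and N-1 all-gather steps, every worker holds the element-wise sum.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce (sum) over equally sized per-worker vectors."""
    n = len(vectors)
    length = len(vectors[0])
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    data = [list(v) for v in vectors]

    # Phase 1, scatter-reduce: after n-1 steps worker i owns the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Snapshot outgoing chunks first so all sends in a step happen "at once".
        messages = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(messages):
            receiver = (i + 1) % n
            lo, _ = bounds[chunk]
            for offset, value in enumerate(payload):
                data[receiver][lo + offset] += value

    # Phase 2, all-gather: circulate the finished chunks until everyone has all of them.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(messages):
            receiver = (i + 1) % n
            lo, hi = bounds[chunk]
            data[receiver][lo:hi] = payload
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(f"worker {worker}: {result}")
```

Each worker sends roughly 2 × (N-1)/N of the vector in total regardless of N, which is why the ring variant is popular when per-link bandwidth is the limit.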

Slide 29: How do you do it?
▪ TensorFlow*: practical guides to distributed training on Intel® Xeon® processors
• https://www.intel.ai/multi-node-convergence-and-scaling-of-inception-resnet-v2-model-using-intel-xeon-processors
• https://software.intel.com/en-us/articles/using-intel-xeon-processors-for-multi-node-scaling-of-tensorflow-with-horovod
• https://software.intel.com/en-us/articles/deploy-distributed-tensorflow-using-horovod-and-kubernetes-on-intel-xeon-platforms
• https://software.intel.com/en-us/articles/intel-processors-for-deep-learning-training
• https://software.intel.com/en-us/articles/ai-practitioners-guide-for-beginners
• https://www.isus.jp/machine-learning/ai-practitioners-guide-for-beginners/ (Japanese)
• https://github.com/hiouchiy/IntelAI/tree/master/distributed_training_on_cpu (Japanese)
▪ PyTorch*: practical guide to distributed training on Intel® Xeon® processors
• https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
On-demand video: Deep Learning Deep Dive @ Data Centric Innovation Day

Slide 31: Machine learning still matters
[Diagram: AI ⊃ machine learning ⊃ deep learning. Deep learning is good at recognizing images, speech, and natural language.]
Top Data Science, Machine Learning Methods used in 2018/2019 (share of respondents):
• Regression: 56%
• Decision Trees / Rules: 48%
• Clustering: 47%
• Visualization: 46%
• Random Forests: 45%
• Statistics - Descriptive: 39%
• K-Nearest Neighbors: 33%
• Time Series: 32%
• Ensemble Methods: 30%
• PCA: 28%
• Text Mining: 28%
• Boosting: 27%
• Neural Networks - Deep Learning: 25%
• Anomaly / Deviation Detection: 23%
• Gradient Boosted Machines: 23%
• Neural Networks - CNN: 22%
• Support Vector Machine: 22%
Source: https://www.kdnuggets.com/2019/04/top-data-science-machine-learning-methods-2018-2019.html

Slide 32: Intel® Distribution for Python*
Python and its surrounding libraries, implemented and optimized by Intel:
• Numpy • Pandas • Scipy • Scikit-learn • XGBoost • TensorFlow • etc.
https://software.intel.com/en-us/distribution-for-python/benchmarks
• Public cloud, regression training on AVX512 & 72 cores: 423x (compared with the OSS implementation)
• Public cloud, Cholesky decomposition of a matrix on AVX512 & 72 cores: 9x (compared with the OSS implementation)

Slide 33: Intel® Distribution for Python*: installation options
https://software.intel.com/en-us/distribution-for-python/choose-download
• Build from Source: https://software.intel.com/en-us/distribution-for-python/choose-download/linux
• Anaconda: https://software.intel.com/en-us/articles/using-intel-distribution-for-python-with-anaconda
• Pip: https://software.intel.com/en-us/articles/installing-the-intel-distribution-for-python-and-intel-performance-libraries-with-pip-and
• Docker* Image: https://software.intel.com/en-us/articles/docker-images-for-intel-python
• Linux* Repositories, YUM: https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-yum-repo
• Linux* Repositories, APT: https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo

Slide 34: Intel® AI libraries and how to use oneDAL
https://www.oneapi.com/
• Math: Intel® oneAPI Math Kernel Library (oneMKL)
• Machine learning / data analytics: Intel® oneAPI Data Analytics Library (oneDAL)
• Deep learning: Intel® oneAPI Deep Neural Network Library (oneDNN)
• Collective communication: Intel® oneAPI Collective Communication Library (oneCCL)
Ways to use oneDAL:
• pip install daal4py
• pip install intel-scikit-learn
• daal4py (details on the next page)
• Partner solutions: http://www.intel.com/analytics
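One way daal4py has been used from Python is to patch scikit-learn so that supported estimators are backed by oneDAL. The following is a hedged sketch, not from the deck: it assumes a daal4py release that provides `daal4py.sklearn.patch_sklearn`, the wrapper function name is hypothetical, and it falls back to stock scikit-learn when daal4py is not installed.

```python
def try_patch_sklearn():
    """Enable oneDAL-backed scikit-learn estimators if daal4py is available.

    Returns True when the patch was applied and False when daal4py (or the
    patching entry point) could not be imported, in which case stock
    scikit-learn keeps working unchanged.
    """
    try:
        from daal4py.sklearn import patch_sklearn
    except ImportError:
        return False
    patch_sklearn()
    return True


if __name__ == "__main__":
    if try_patch_sklearn():
        print("scikit-learn patched: supported estimators now use oneDAL")
    else:
        print("daal4py not available: using stock scikit-learn")
    # Import scikit-learn estimators only after this point, e.g.:
    # from sklearn.cluster import KMeans
```

The patch has to run before the estimators are imported; the modelling code itself stays ordinary scikit-learn code.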

Slide 35: Intel® oneDAL vs Apache Spark* MLlib performance (higher is better)
[Bar chart. Speedup of Intel DAAL relative to Apache Spark MLlib (= 1): Implicit ALS 3.6, Kmeans 7.4, Linear Regression 13.8, Correlation 18.1, PCA 18.2.]
Guide to installing oneDAL on Spark / Databricks: https://github.com/hiouchiy/Data_Analytics/tree/master/spark
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. Performance results are based on testing as of 11/11/2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Configuration: Testing by Intel as of 11/11/2019. 7 x m5.2xlarge AWS instances, Intel® Data Analytics Acceleration Library 2020 (Intel® DAAL):
• Correlation (# samples = 10M, # features = 1000): Intel® DAAL = 35.2s, MLlib = 638.2s
• PCA (# samples = 10M, # features = 1000): Intel® DAAL = 35.2s, MLlib = 639.8s
• Implicit ALS (# users = 1M, # items = 1M, # factors = 100, # iterations = 1): Intel® DAAL = 37.6s, MLlib = 134.9s
• Linear Regression (# samples = 100M, # features = 50): Intel® DAAL = 16.3s, MLlib = 224.5s
• k-means (# samples = 100M, # features = 50, # clusters = 10, # iterations = 100): Intel® DAAL = 211s, MLlib = 1567.3s
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
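The speedups plotted in the chart can be recomputed from the run times given in the configuration text: speedup = MLlib time / Intel DAAL time. A short sketch using only the numbers stated there:

```python
# (Intel DAAL seconds, Apache Spark MLlib seconds) from the configuration disclosure.
RUN_TIMES = {
    "Implicit ALS": (37.6, 134.9),
    "Kmeans": (211.0, 1567.3),
    "Linear Regression": (16.3, 224.5),
    "Correlation": (35.2, 638.2),
    "PCA": (35.2, 639.8),
}


def speedups(run_times):
    """Return {workload: MLlib time / DAAL time}, rounded to one decimal place."""
    return {name: round(mllib / daal, 1) for name, (daal, mllib) in run_times.items()}


if __name__ == "__main__":
    for name, factor in speedups(RUN_TIMES).items():
        print(f"{name:18s} x{factor}")
```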

Slide 36: Intel's technical blog on graph analytics
https://medium.com/intel-analytics-software/you-dont-have-to-spend-800-000-to-compute-pagerank-fa6799133402
https://medium.com/intel-analytics-software

Slide 37: Intel® AI software: partner solutions
[Diagram: Solutions; ISV partners; Platforms; CPU. Data pipeline: Create, Transmit, Ingest, Integrate, Stage, Clean, Normalize, Act.]
Visit: www.intel.com/analytics
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice. Optimization Notice.

Slide 38: Intel partner solutions: the effect of software optimization on the latest Xeon®
[Partner logos with performance multipliers; the bracketed numbers refer to the system configuration notes on the next slide.]
• In-memory database throughput [1]
• SQL data warehousing, 8280 vs 4 year old system [2]
• Business analytics, 8268 vs E5-2699 v4 [3]
• TimesTen IMDB, 8260 + Optane PM vs DRAM [4]
• Driverless AI platform with optimized XGBoost + 8260 [5]
• AI inferencing solution with OpenVINO or TensorFlow using Intel® DL Boost [6]
• BigDL on Apache Spark with Intel Optimization of Caffe ResNet-50 + 8180 [7]
• Hazelcast restart time with Optane PM vs SSDs [8]
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. See configurations in backup for details. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Slide 39: System configurations: "Software optimization is the deciding factor"
Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available security updates. See the published configuration disclosure for details. No product or component can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks/.
1. IBM* Db2* v11.1.4.4. IBM Big Data Insights Internal Heavy Multiuser Workload (BDInsights) is a multi-user data warehousing workload based on a retail environment. The workload consists of a mix of complex and intermediate queries. Workload scale factor: 300GB with 12 users. Measured in Intel internal testing on February 1, 2019. Security mitigations applied for variants 1, 2, 3, 3a, 4. Baseline: 2-way Intel® Xeon® processor E5-2697 v2 (2.70GHz / 12 cores), Turbo enabled, HT enabled, BIOS 02.06.0007, 192GB total memory (12 slots / 16GB / 1600MT/s DDR3 DIMM), 1x 400GB Intel® SSD DC S3700, Red Hat* Enterprise Linux* 7.5, kernel 3.10.0-862.el7.x86_64. New configuration: 2-way Intel® Xeon® Platinum 8280 processor (2.70GHz / 28 cores), Turbo enabled, HT enabled, BIOS 0D010299, 192GB total memory (12 slots / 16GB / 2666MT/s DDR4 LRDIMM), 1x 375GB Intel® Optane™ SSD DC P4800X, Red Hat* Enterprise Linux* 7.5, kernel 3.10.0-862.el7.x86_64.
2. Up to 24.8x performance improvement on DW queries: 1 node, Wildcat Pass with 2x Intel® Xeon® processor E5-2699 v3, 768GB total memory (24 slots / 32GB / 2666MHz), Windows Server* 2008 R2, ucode 0x3D, 1x 200GB Intel® SSD DC S3710, 1x 1.6TB Intel® SSD DC S3500, 2x 6.4TB Intel® SSD DC P4608, SQL Server* 2008 R2 SP1 (Enterprise Edition), HT enabled, Turbo enabled, result: queries per hour = 33681. Compared with: 1 node, Wolf Pass with 2x Intel® Xeon® Platinum 8280 processor, 1536GB total memory (24 slots / 64GB / 2666MHz (1866MT/s)), Windows Server* 2016 (RS1 14393), ucode 0xA, 1x 200GB Intel® SSD DC S3710, 4x 7.6TB Intel® SSD DC P4610, 1x 1.6TB Intel® SSD DC 3500, SQL Server* 2017 RTM CU13 (Enterprise Edition), HT enabled, Turbo enabled, result: queries per hour = 836,261. Workload details: 1TB data warehouse, 7 concurrent users issuing a set of 22 DSS queries. Measured in Intel internal testing on March 13, 2019.
3. 2.38x performance improvement with SAS* 9.4: 1 node, S2600WTT with 2x Intel® Xeon® processor E5 2699 v4, 128GB total memory (16 slots / 8GB / 1866MT/s), CentOS* 7.6, 4.19.8, ucode 0xb00002e, 7x 800GB Intel® SSD DC S3710 (for the sasdata file system), 2x 750GB NVMe* Intel® SSD Data Center Family P3700 (for the saswork file system), 1x Intel® XC710, SAS* 9.4 m5 workload, HT enabled, Turbo enabled, measured in Intel internal testing on March 24, 2019. 1 node, S2600WFD with 2x Intel® Xeon® Platinum 8268 processor, 192GB total memory (12 slots / 16GB / 2666MHz), CentOS* 7.6, 4.19.8, ucode 0x4000013, 3x 1.6TB NVMe* Intel® SSD Data Center Family P4610 (for the sasdata file system), 4x 375GB Intel® Optane™ DC SSD P4800x (for the saswork file system), 1x Intel® XC710, SAS 9.4 m5 workload, HT enabled, Turbo enabled, measured in testing on March 25, 2019.
4. The 6.49x performance improvement is the result of running the TPTPM benchmark on Oracle TimesTen IMDB. Hardware configuration of the baseline system and the persistent memory system: 2x Intel® Xeon® processor (28 cores, 2 threads per core), BIOS version: 1.0134, storage: 2x 6TB SSD DC P4608, OS: Red Hat* Enterprise Linux* 7.5, kernel 4.18. Baseline system: 1,536GB DDR4 quad-rank (2666MHz). Persistent memory system: 192GB DDR4 dual-rank (2666 MHz), 6TB Intel® Optane™ DC persistent memory. Public documents: https://itpeernetwork.intel.com/oracle-open-world-2018/, https://blogs.oracle.com/timesten/the-future-of-databases-%3d-persistent-memory/, https://www.oracle.com/openworld/on-demand.html?bcid=5972360479001
5. 4.5x performance improvement with H2O: https://builders.intel.com/docs/aibuilders/accelerate-ai-development-with-h2o-ai-on-intel-architecture-brief.pdf
6. 3.75x performance improvement with Intel® Select Solutions for AI inferencing: solution measured in testing on February 26, 2019. KPI target: OpenVINO™ toolkit / ResNet50, INT8, using the following hardware / software configuration: base configuration: 1 node, 2x Intel® Xeon® Gold 6248 processor, 1x Intel® Server Board S2600WFT, 192GB total memory (12 slots / 16GB / 2666MT/s DDR4 RDIMM), Hyper-Threading: enabled, Turbo: enabled, storage (boot): Intel® SSD DC P4101, storage (capacity): Intel® SSD DC P4610 PCIe* NVMe* of 2TB or more, OS / software: CentOS* Linux* release 7.6.1810 (Core), kernel 3.10.0-957.el7.x86_64, framework version: OpenVINO™ toolkit 2018 R5 445, dataset: sample images from the benchmark tool, model topology: ResNet 50 v1, batch size: 4, nireq: 20. Solution measured in testing on March 7, 2019. KPI target: TensorFlow* / ResNet50, INT8, using the following hardware / software configuration: base configuration: 1 node, 2x Intel® Xeon® Gold 6248 processor, 1x Intel® Server Board S2600WFT, 192GB total memory (12 slots / 16GB / 2666MT/s DDR4 RDIMM), Hyper-Threading: enabled, Turbo: enabled, storage (boot): Intel® SSD DC P4101, storage (capacity): Intel® SSD DC P4610 PCIe* NVMe* of 2TB or more, OS / software: CentOS* Linux* release 7.6.1810 (Core), kernel 3.10.0-957.el7.x86_64, framework version: intelaipg/intel-optimizedtensorflow:PR25765-devel-mkl, dataset: synthesized by the benchmark tool, model topology: ResNet 50 v1, batch size: 80.
7. 5.4x faster AI inference with BigDL + Apache Spark*: https://builders.intel.com/docs/intel-select-solutions-for-bigdl-on-apache-spark.pdf
8. Restart time cut to 1/2.5 with Hazelcast: https://builders.intel.com/datacenter/blog/hazelcast-fast-restart-optane-dc-persistent-memory

Slide 40: Xeon Roadmap
• Cascade Lake (2019): Intel DL Boost
• Cooper Lake (now): BFLOAT16
• Ice Lake (2H'20): TME, PCIe Gen 4, 8 memory channels, Crypto accel., ICX-D
• Sapphire Rapids (2021): DDR5, PCIe Gen 5, CXL 1.1, AMX, DSA

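Slide 41: AI & Analytics everywhere
Manufacturing, media, retail and distribution, smart home, telecom, transportation, agriculture, energy, education, public sector, finance, healthcare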
