[Startup.fm] Machine Learning Platform on AWS

Slide 1

Slide 1 text

Slide 2

Slide 2 text

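Yoshitaka Haribara
Startup Solutions Architect, Tokyo, Japan

Joined AWS Japan in 2018. As a solutions architect, he helps Japanese startups adopt AWS, and in particular handles consultations on designing and building machine learning platforms and on setting up development organizations. His hobby is drumming. A drummer he has been following lately is Shun Ishiwaka, who also plays on Gen Hoshino's "Fushigi".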

Slide 3

Slide 3 text

Agenda
• Development environments for deep learning models: TensorFlow, PyTorch, Hugging Face, etc. on AWS
• GPUs and dedicated deep learning chips: NVIDIA A100/V100/T4, AWS Trainium/Inferentia
• MLOps pipelines and deploying inference endpoints: SageMaker Pipelines

Slide 4

Slide 4 text

Deep Learning on AWS

Slide 5

Slide 5 text

You can also use the following AMIs and containers, but ...
• AWS Deep Learning AMIs
  − Virtual machine images available for EC2 instances.
    § Base AMI: CUDA, cuBLAS, cuDNN, NVIDIA driver, NCCL, Python, etc. come preinstalled.
    § Conda AMI (Anaconda virtual environments): also includes Apache MXNet, Chainer, PyTorch, TensorFlow, TensorFlow 2, and others.
    § Supported operating systems include Ubuntu and Amazon Linux.
    § Choose from CUDA 10, 10.1, 10.2, and 11.0.
• AWS Deep Learning Containers
  − Docker images for deep learning.
    § Provided separately for CPU/GPU, training/inference, and TensorFlow/PyTorch/MXNet.
    § HuggingFace training on GPU
    § Elastic Inference, Neuron Inference

Slide 6

Slide 6 text

AWS AI/ML services overview

AI services: usable without deep machine learning knowledge
• VISION: Amazon Rekognition
• SPEECH: Amazon Polly, Amazon Transcribe (+Medical)
• TEXT: Amazon Comprehend (+Medical), Amazon Translate, Amazon Textract
• SEARCH: Amazon Kendra
• CHATBOTS: Amazon Lex
• PERSONALIZATION: Amazon Personalize
• FORECASTING: Amazon Forecast
• FRAUD: Amazon Fraud Detector
• CONTACT CENTERS: Contact Lens, Voice ID for Amazon Connect
• CODE AND DEVOPS: Amazon CodeGuru, Amazon DevOps Guru (NEW)
• INDUSTRIAL AI: Amazon Monitron (NEW), AWS Panorama + Appliance (NEW), Amazon Lookout for Vision (NEW), Amazon Lookout for Equipment (NEW)
• ANOMALY DETECTION: Amazon Lookout for Metrics (NEW)
• HEALTH AI: Amazon HealthLake (NEW), Amazon Transcribe Medical, Amazon Comprehend Medical

ML services: managed services that streamline the entire machine learning process
• Amazon SageMaker (SageMaker Studio IDE): Label data, Aggregate & prepare data (NEW), Store & share features (NEW), Auto ML, Spark/R, Detect bias (NEW), Visualize in notebooks, Pick algorithm, Train models, Tune parameters, Debug & profile (NEW), Deploy in production, Manage & monitor, CI/CD (NEW), Human review
• NEW: Model management for edge devices
• NEW: SageMaker JumpStart

ML frameworks and infrastructure: build and use machine learning environments freely
• Deep Learning AMIs & Containers, GPUs & CPUs, Elastic Inference, Trainium, Inferentia, FPGA, DeepGraphLibrary

Slide 7

Slide 7 text

Amazon SageMaker overview

SageMaker Studio: Integrated development environment (IDE) for ML

PREPARE
• SageMaker Ground Truth: Label training data for machine learning
• SageMaker Data Wrangler (NEW): Aggregate and prepare data for machine learning
• SageMaker Processing: Built-in Python, BYO R/Spark
• SageMaker Feature Store (NEW): Store, update, retrieve, and share features
• SageMaker Clarify (NEW): Detect bias and understand model predictions

BUILD
• SageMaker Studio Notebooks: Jupyter notebooks with elastic compute and sharing
• Built-in and bring-your-own algorithms: Dozens of optimized algorithms or bring your own
• Local Mode: Test and prototype on your local machine
• SageMaker Autopilot: Automatically create machine learning models with full visibility
• SageMaker JumpStart (NEW): Pre-built solutions for common use cases

TRAIN & TUNE
• Managed Training: Distributed infrastructure management
• SageMaker Experiments: Capture, organize, and compare every step
• Automatic Model Tuning: Hyperparameter optimization
• Distributed Training (NEW): Training for large datasets and models
• SageMaker Debugger (NEW): Debug and profile training runs
• Managed Spot Training: Reduce training cost by up to 90%

DEPLOY & MANAGE
• Managed Deployment: Fully managed, ultra low latency, high throughput
• Kubernetes & Kubeflow Integration: Simplify Kubernetes-based machine learning
• Multi-Model Endpoints: Reduce cost by hosting multiple models per instance
• SageMaker Model Monitor: Maintain accuracy of deployed models
• SageMaker Edge Manager (NEW): Manage and monitor models on edge devices
• SageMaker Pipelines (NEW): Workflow orchestration and automation

Slide 8

Slide 8 text

© 2021, Amazon Web Services, Inc. or its Affiliates.

SageMaker Python SDK (v2)

    import sagemaker
    from sagemaker.pytorch import PyTorch  # an Estimator class exists for each framework

    estimator = PyTorch("train.py",  # initialize with the training script and other settings
                        role=sagemaker.get_execution_role(),
                        instance_count=1,
                        instance_type="ml.p3.2xlarge",
                        framework_version="1.6.0",
                        py_version="py3")

    estimator.fit("s3://mybucket/data/train")  # fit runs the training job

    predictor = estimator.deploy(initial_instance_count=2,  # 2 or more gives Multi-AZ
                                 instance_type="ml.m5.xlarge")  # deploy creates the endpoint

Slide 9

Slide 9 text

Rewriting the code (train.py)

    import argparse
    import os

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()

        # hyperparameters
        parser.add_argument('--epochs', type=int, default=10)

        # input data and model directories (taken from environment variables)
        parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
        parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])
        parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

        args, _ = parser.parse_known_args()
        …

Paths inside the container (the values of the environment variables):
• /opt/ml/input/data/train
• /opt/ml/input/data/test
• /opt/ml/model

In Script Mode, train.py runs as an ordinary Python script. Write it so that it first gets the data and model input/output paths from the environment variables and then reads from and writes to those paths. For inference, the model is loaded.
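SageMaker derives each channel's environment variable from the channel name (a channel named train becomes SM_CHANNEL_TRAIN), so a script that indexes os.environ directly raises KeyError when run outside the container. The following is a minimal sketch, not taken from the slides, that falls back to the container paths listed above so the same argument parsing can also be tried locally. The helper names channel_env_var and parse_args are hypothetical.

```python
import argparse
import os


def channel_env_var(channel_name):
    """Name of the variable SageMaker sets for an input channel,
    e.g. 'train' -> 'SM_CHANNEL_TRAIN'."""
    return "SM_CHANNEL_" + channel_name.upper()


def parse_args(argv=None):
    """Parse hyperparameters and data/model paths.

    Environment variables win when present (inside the training container);
    otherwise the default container paths are used as fallbacks.
    """
    parser = argparse.ArgumentParser()

    # hyperparameters
    parser.add_argument("--epochs", type=int, default=10)

    # input data and model directories
    parser.add_argument(
        "--train", type=str,
        default=os.environ.get(channel_env_var("train"), "/opt/ml/input/data/train"))
    parser.add_argument(
        "--test", type=str,
        default=os.environ.get(channel_env_var("test"), "/opt/ml/input/data/test"))
    parser.add_argument(
        "--model-dir", type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))

    # parse_known_args ignores any extra arguments the platform may pass
    args, _ = parser.parse_known_args(argv)
    return args
```

Calling parse_args(["--epochs", "3"]) on a laptop returns the fallback paths; inside a training job the SM_* variables take precedence.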

Slide 10

Slide 10 text

Reducing training cost with Managed Spot Training
• Up to 90% cost savings compared with On-Demand
• Interruptions can occur, so write intermediate progress to checkpoints
• Specify the maximum time you are willing to wait

How to call it:

    estimator = PyTorch("train.py",
                        role=sagemaker.get_execution_role(),
                        instance_count=1,
                        instance_type="ml.p3.2xlarge",
                        framework_version="1.6.0",
                        py_version="py3",
                        use_spot_instances=True,
                        max_run=1*24*60*60,
                        max_wait=2*24*60*60,  # specify a time longer than max_run
                        checkpoint_s3_uri="s3://mybucket/checkpoints",
                        checkpoint_local_path="/opt/ml/checkpoints/")

    estimator.fit("s3://mybucket/data/train")  # training with fit works the same way
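Because a Spot interruption restarts the job, the training script itself has to resume from whatever was last written to the local checkpoint directory. Below is a minimal sketch of that resume logic, assuming a small JSON file stands in for a real model checkpoint. The file name state.json and the functions load_checkpoint, save_checkpoint, and train are hypothetical; a real script would use the framework's own save and load calls.

```python
import json
import os

CHECKPOINT_FILE = "state.json"  # hypothetical file name


def load_checkpoint(checkpoint_dir):
    """Return the last saved state, or a fresh state if none exists."""
    path = os.path.join(checkpoint_dir, CHECKPOINT_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_checkpoint(checkpoint_dir, state):
    """Write the state to a temp file, then rename, so a crash never leaves a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, CHECKPOINT_FILE)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)


def train(checkpoint_dir, epochs, stop_after=None):
    """Run (or resume) training; stop_after simulates a Spot interruption."""
    state = load_checkpoint(checkpoint_dir)
    for epoch in range(state["epoch"], epochs):
        if stop_after is not None and epoch >= stop_after:
            break  # simulated interruption
        # one epoch of real training would run here
        state = {"epoch": epoch + 1}
        save_checkpoint(checkpoint_dir, state)
    return state
```

In a training job, checkpoint_dir would be the checkpoint_local_path passed to the Estimator (/opt/ml/checkpoints/ in the call above).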

Slide 11

Slide 11 text

Processor options

Slide 12

Slide 12 text

Deep learning accelerators on AWS
Notation: Accelerator (Instance Family)
• NVIDIA GPU
  − Training: A100 (P4d), V100 32 GB (P3dn) / 16 GB (P3)
  − Inference: T4 (G4dn)
• Intel
  − Training: Habana Gaudi
  − Inference: (CPU instances)
• AWS
  − Training: AWS Trainium
  − Inference: AWS Inferentia (Inf1), AWS Graviton2 (C6g, etc.)

Slide 13

Slide 13 text

AWS to offer NVIDIA A100 Tensor Core GPU-based Amazon EC2 instances
https://aws.amazon.com/blogs/machine-learning/aws-to-offer-nvidia-a100-tensor-core-gpu-based-amazon-ec2-instances/

Slide 14

Slide 14 text

Amazon EC2 P4d instances
P4d instances equipped with NVIDIA A100 Tensor Core GPUs
• Offered in a single size only, p4d.24xlarge (8 x A100) (see table)
• GPUs are interconnected by NVSwitch/NVLink at 600 GB/s
• EFA-capable high-speed network interface with 400 Gbps per instance
• Eight 1 TB NVMe SSDs, with up to 16 GB/s throughput in a RAID 0 configuration
• Also supports Multi-Instance GPU (MIG)
https://aws.amazon.com/jp/ec2/instance-types/p4/
* p3dn.24xlarge: 31.212 USD/h

Slide 15

Slide 15 text

P4d performance
For training a variety of deep learning models, P4d is more than twice as fast as P3dn. The last two columns show the throughput improvement.

DNN         P3dn FP32    P3dn FP16    P4d TF32     P4d FP16     P4d over P3dn   P4d over P3dn
            (imgs/sec)   (imgs/sec)   (imgs/sec)   (imgs/sec)   TF32/FP32       FP16
Resnet50    3057         7413         6841         15621        2.2x            2.1x
Resnet152   1145         2644         2823         5700         2.5x            2.2x
Inception3  2010         4969         4808         10433        2.4x            2.1x
Inception4  847          1778         2025         3811         2.4x            2.1x
VGG16       1202         2092         4532         7240         3.8x            3.5x
Alexnet     32198        50708        82192        133068       2.6x            2.6x
SSD300      1554         2918         3467         6016         2.2x            2.1x

https://aws.amazon.com/jp/blogs/compute/amazon-ec2-p4d-instances-deep-dive/
https://github.com/aws-samples/deep-learning-models
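The two improvement figures per model can be re-derived from the four throughput figures. A small sketch that recomputes them, with the numbers copied from the slide (the names THROUGHPUT and speedups are illustrative only):

```python
# Throughput in images/sec, copied from the slide:
# (P3dn FP32, P3dn FP16, P4d TF32, P4d FP16)
THROUGHPUT = {
    "Resnet50": (3057, 7413, 6841, 15621),
    "Resnet152": (1145, 2644, 2823, 5700),
    "Inception3": (2010, 4969, 4808, 10433),
    "Inception4": (847, 1778, 2025, 3811),
    "VGG16": (1202, 2092, 4532, 7240),
    "Alexnet": (32198, 50708, 82192, 133068),
    "SSD300": (1554, 2918, 3467, 6016),
}


def speedups(p3dn_fp32, p3dn_fp16, p4d_tf32, p4d_fp16):
    """Return (TF32 over FP32, FP16 over FP16) speedups, rounded to one decimal."""
    return round(p4d_tf32 / p3dn_fp32, 1), round(p4d_fp16 / p3dn_fp16, 1)


for model, row in THROUGHPUT.items():
    tf32_gain, fp16_gain = speedups(*row)
    print(f"{model:<10} TF32/FP32: {tf32_gain}x  FP16: {fp16_gain}x")
```

Every model comes out at 2.1x or better, which is where the "more than twice as fast" summary comes from.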

Slide 16

Slide 16 text

P4d instance case study: TRI-AD
When training machine learning models, training time was cut by 40% with no changes to existing code, and cost performance also improved.
https://aws.amazon.com/jp/ec2/instance-types/p4/

Slide 17

Slide 17 text

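Habana Gaudi-based Amazon EC2
EC2 instances equipped with Habana Labs Gaudi accelerators, designed specifically for training deep learning models
• Deep learning training on 8-card Gaudi accelerators delivers up to 40% better cost performance than current GPU-based EC2 instances
• Supports TensorFlow, PyTorch, and others. Well suited to deep learning training workloads such as natural language processing, object detection and classification, recommendation, and personalization
• In addition to Amazon EC2, support is planned for Amazon EKS/ECS and Amazon SageMaker
Coming in 2021!
https://habana.ai/wp-content/uploads/pdf/2020/Habana%20Gaudi%20customer%20enablement%20on%20AWS%20December%202020.pdf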

Slide 18

Slide 18 text

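AWS Trainium
A high-performance machine learning training chip designed by AWS
• Provides the best cost performance for training ML models in the cloud
• Like AWS Inferentia, uses the Neuron SDK and supports frameworks such as TensorFlow, MXNet, and PyTorch
• Trainium chips are specifically optimized for deep learning training workloads in applications such as image classification, semantic search, translation, speech recognition, natural language processing, and recommendation engines
• Available through Amazon EC2 instances as well as AWS Deep Learning AMIs and managed services such as Amazon SageMaker, Amazon ECS, EKS, and AWS Batch
Coming in 2021!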

Slide 19

Slide 19 text

AWS Inferentia, a processor for machine learning inference, in EC2 Inf1 instances
• Inference processor custom-designed by AWS
• 4 Neuron cores per chip
• Up to 128 TOPS per chip
  − (2,000 TOPS @inf1.24xlarge)
• Two-stage memory hierarchy
  − Large on-chip cache and DRAM memory
• Supports FP16, BF16, and INT8 data types
  − Models trained in FP32 can be run in BF16
• High-speed chip-to-chip communication

[Diagram: an Inferentia chip with four Neuron cores, each with its own cache, and attached memory]
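The instance-level figure follows from the per-chip figure. As a quick check, assuming inf1.24xlarge carries 16 Inferentia chips (a number taken from the published Inf1 instance specifications, not from this slide):

```python
TOPS_PER_CHIP = 128         # from the slide: up to 128 TOPS per chip
NEURON_CORES_PER_CHIP = 4   # from the slide: 4 Neuron cores per chip
CHIPS_INF1_24XLARGE = 16    # assumption: not stated on the slide

total_tops = TOPS_PER_CHIP * CHIPS_INF1_24XLARGE
total_cores = NEURON_CORES_PER_CHIP * CHIPS_INF1_24XLARGE

print(f"inf1.24xlarge: up to {total_tops} TOPS across {total_cores} Neuron cores")
```

Under that assumption the product is 2,048 TOPS, which the slide rounds to 2,000.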

Slide 20

Slide 20 text

Neuron core pipeline: low-latency inference for large models
Connecting Neuron cores and chips in pipeline mode spreads a large model across the on-chip cache memories, achieving high throughput and low latency.

[Diagram: Neuron core pipeline, four stages each labeled CACHE and Memory]

Slide 21

Slide 21 text

AWS Neuron SDK
https://github.com/aws/aws-neuron-sdk
• Compile: Neuron compiler (NCC), which outputs a Neuron binary (NEFF)
• Deploy: Neuron runtime (NRT)
• Profile: Neuron tools

Slide 22

Slide 22 text

MLOps and inference

Slide 23

Slide 23 text

Machine learning model lifecycle and project stakeholders

[Diagram]
• Lifecycle stages: Data Quality Assurance, Feature Engineering, Model Monitoring, Data Sourcing, Model Development, Model Training & Evaluation, Model Deployment & Inference, Production Integration
• Roles: Data Engineers, Data Scientists, ML Engineers, SysOps
• Foundation: AWS Accounts, Controls, Dev environments, and MLOps stacks (DevOps tools, artefacts repos, ML logs insights)
• ML Workflow Automation - Model Management - Continuous Delivery

Slide 24

Slide 24 text

Amazon SageMaker Pipelines overview
• Amazon SageMaker Pipelines: build fully managed machine learning workflows
• Model registry: catalog model versions, metrics, approvals, and model deployments
• Automate MLOps with CI/CD and model lineage tracking

[Diagram: Input data, Prepare or transform, Train, Explain, Validate, Real-time inference, Batch scoring, Model drift]

Slide 25

Slide 25 text

How Amazon SageMaker Pipelines works
Ways to start a pipeline execution:
• Manually
• CloudWatch event on data upload
• Code check-in (git push)

[Diagram: Get input data, Process data, Train model, Validation; then Deploy model on acceptable accuracy, or Alert and stop on non-acceptable accuracy]
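The branching in this flow can be sketched in plain Python. This is not the SageMaker Pipelines API; it only mirrors the control flow in the diagram, and the function name run_pipeline, its arguments, and the toy stand-in steps are all hypothetical.

```python
def run_pipeline(get_input_data, process_data, train_model, validate, threshold):
    """Run the steps in order, then deploy only if accuracy meets the threshold."""
    raw = get_input_data()
    processed = process_data(raw)
    model = train_model(processed)
    accuracy = validate(model, processed)

    if accuracy >= threshold:
        # "Acceptable accuracy" branch: the model goes on to deployment
        return {"status": "deployed", "accuracy": accuracy, "model": model}
    # "Non-acceptable accuracy" branch: raise an alert and stop
    return {"status": "alert_and_stop", "accuracy": accuracy, "model": None}


# Toy stand-ins for the real steps, just to exercise both branches.
def toy_data():
    return [1, 2, 3, 4]


def toy_process(rows):
    return [r * 2 for r in rows]


def toy_train(rows):
    return sum(rows) / len(rows)  # the "model" is simply the mean


good = run_pipeline(toy_data, toy_process, toy_train, lambda m, rows: 0.9, threshold=0.8)
bad = run_pipeline(toy_data, toy_process, toy_train, lambda m, rows: 0.5, threshold=0.8)
print(good["status"], bad["status"])
```

In SageMaker Pipelines the same decision is expressed as a condition step on the validation metric rather than an if statement in user code.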

Slide 26

Slide 26 text

Pipeline execution details and real-time metrics
• Follow completed steps and monitor steps in progress
• Understand the output from each step with the output logs
• Monitor, change, and manage the parameters for each step

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Another example of building a machine learning pipeline

[Diagram]
• Data Scientists/Developers: git push to AWS CodeCommit or a 3rd party Git repo
• Git webhook triggers AWS CodePipeline; AWS CodeBuild runs docker push to Amazon Elastic Container Registry (ECR)
• AWS Step Functions workflow: Amazon SageMaker Processing, Amazon SageMaker Training Job / HPO, Amazon SageMaker Batch Transform / Endpoint deploy
• Storage: Amazon S3 (raw data), Amazon S3 (data) holding train data and test data, Amazon S3 (trained model)
• Output: Endpoint

Slide 29

Slide 29 text

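AWS workflow management tools

AWS Step Functions w/ Data Science SDK (Python)
• Serverless orchestration service
• Orchestrates entire distributed applications and microservices through a mechanism called a "state machine"
• A defined state machine is visualized as a "workflow" in the AWS console
• The execution history of each step of the state machine can be traced from logs

Amazon Managed Workflows for Apache Airflow (MWAA)
• Managed service for building workflows with Apache Airflow
• Runs workflows for ETL jobs and data pipelines in a managed fashion, so developers can focus on solving business problems
• Airflow metrics can be handled as CloudWatch metrics, and logs can be forwarded to CloudWatch Logs

Amazon SageMaker Pipelines
• A feature of Amazon SageMaker that enables CI/CD for machine learning
• Runs the sequence of steps in a machine learning workflow, such as data loading and training, on demand or at scheduled times
• The results of each step are recorded in SageMaker Experiments, so model quality and training parameters can be visualized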
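To make the "state machine" idea concrete, here is a hedged sketch of a Step Functions definition in Amazon States Language, built as a Python dict. It is illustrative only: the task parameters are deliberately abbreviated (a real createTrainingJob call needs more fields), and the state names, the evaluate-model Lambda function, and the 0.8 threshold are hypothetical. The helper undefined_targets is also hypothetical and just checks that every transition points at a defined state.

```python
import json

state_machine = {
    "Comment": "Sketch: train, evaluate, then deploy or stop",
    "StartAt": "Train",
    "States": {
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            # Abbreviated: AlgorithmSpecification, RoleArn, etc. are omitted here
            "Parameters": {"TrainingJobName.$": "$.job_name"},
            "ResultPath": "$.training",
            "Next": "Evaluate",
        },
        "Evaluate": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "evaluate-model", "Payload.$": "$"},
            "ResultSelector": {"accuracy.$": "$.Payload.accuracy"},
            "ResultPath": "$.evaluation",
            "Next": "CheckAccuracy",
        },
        "CheckAccuracy": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.evaluation.accuracy",
                "NumericGreaterThanEquals": 0.8,
                "Next": "Deploy",
            }],
            "Default": "AlertAndStop",
        },
        "Deploy": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpoint",
            # Abbreviated: EndpointConfigName is assumed to exist already
            "Parameters": {"EndpointName.$": "$.job_name",
                           "EndpointConfigName.$": "$.job_name"},
            "End": True,
        },
        "AlertAndStop": {
            "Type": "Fail",
            "Error": "LowAccuracy",
            "Cause": "Validation accuracy below threshold",
        },
    },
}


def undefined_targets(definition):
    """Return names referenced by StartAt/Next/Default/Choices that are not defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return sorted(targets - set(states))


definition_json = json.dumps(state_machine, indent=2)
print(undefined_targets(state_machine))
```

The JSON string in definition_json is the form that Step Functions accepts as a state machine definition; the Data Science SDK mentioned above generates comparable definitions from Python objects.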

Slide 30

Slide 30 text

Event announcements
• AWS Summit Online 2021 is now available on demand
  − Startup Zone > re:Cap for startups - AI/ML
  − http://bit.ly/summit-2021-aiml
• Taking machine learning one step further on AWS: Amazon SageMaker hands-on seminar
  − Tuesday, June 15, 2021, 12:00 to 15:00
  − A hands-on session for resolving issues and concerns with using Amazon SageMaker
  − http://bit.ly/sagemaker-workshop-next

Slide 31

Slide 31 text

Announcements
• AWS Startup Blog
  − I am often asked how other startups are doing machine learning on AWS, so I wrote a blog post summarizing SageMaker and Personalize case studies
    § https://aws.amazon.com/jp/blogs/startup/tech-case-study-jp-startup-ai-ml/
• JAWS-UG AI/ML chapter
  − The user group has been revived. Startup customers are among its core members
    § https://jawsug-ai.connpass.com/