Slide 1

Slide 1 text

Tensor Processing Unit: designed for fast and affordable AI

Slide 2

Slide 2 text

Kaz Sato (+Kazunori Sato, @kazunori_279), Staff Developer Advocate, Data & Analytics, Google Cloud

Slide 3

Slide 3 text

Agenda
What is TPU? — a domain-specific architecture for deep learning
TPU Pod — HPC-powered, scalable "All Reduce" distributed training
Cloud TPU programming — with the TensorFlow Estimator API and TensorBoard

Slide 4

Slide 4 text

TPU public resources
Cloud TPU Documentation
Effective machine learning using Cloud TPUs, by Zak Stone (Google I/O '18 video)
Training Performance: A user's guide to converge faster, by Brennan Saeta (TensorFlow Dev Summit 2018 video)
In-Datacenter Performance Analysis of a Tensor Processing Unit, by Norm Jouppi et al. (paper)
An in-depth look at Google's first Tensor Processing Unit, by Kaz Sato, Cliff Young and David Patterson (blog post)
The future of computing, by John Hennessy (Google I/O '18 video)

Slide 5

Slide 5 text

The end of Moore's Law

Slide 6

Slide 6 text

Moore's Law is ending.

Slide 7

Slide 7 text

Growth of single-program performance (based on SPECintCPU; source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018):
CISC: 2x / 3.5 yrs (22%/yr)
RISC: 2x / 1.5 yrs (52%/yr)
End of Dennard scaling ⇒ multicore: 2x / 3.5 yrs (23%/yr)
Amdahl's Law ⇒ 2x / 6 yrs (12%/yr)
End of the line? 2x / 20 yrs (3%/yr) — the end of growth of single-program speed?

Slide 8

Slide 8 text

The solution: domain-specific hardware

Slide 9

Slide 9 text

Example: hardware acceleration in BigQuery. Why is BigQuery so fast? Because of hardware acceleration plus MPP (massively parallel processing).

Slide 10

Slide 10 text

TPU history: designing domain-specific hardware for ML. Early discussions started in 2006; the production project started in 2013; (...after 15 months...) the first deployment came in 2015.

Slide 11

Slide 11 text

What is TPU?

Slide 12

Slide 12 text

TPU v1 and v2
TPU v1: launched in 2015, inference only
TPU v2: launched in 2017, inference and training

Slide 13

Slide 13 text

TPU 3.0: launched in 2018, inference and training

Slide 14

Slide 14 text

TPU v1 overview
ASIC (28nm process)
Clock: 700MHz
Power consumption: 40W
Size: fits a SATA disk drive slot
Bus: PCIe Gen3 x16 (12.5GB/s sustained)

Slide 15

Slide 15 text

TPU v1: in production since 2015
Search: search ranking, speech recognition
Translate: text, graphic and speech translation
Photos: photo search

Slide 16

Slide 16 text

TPU v1 performance comparison with Intel Haswell CPU and NVIDIA K80 GPU
Performance: 15–30x
Performance per watt: 30–80x

Slide 17

Slide 17 text

Why was TPU v1 so successful? It is a domain-specific architecture for deep learning:
Reduced precision
Matrix processor
Minimal and deterministic design

Slide 18

Slide 18 text

Reduced precision in TPU v1

Slide 19

Slide 19 text

Neural Network: a bunch of Multiply and Add

Slide 20

Slide 20 text

TPU v1 workloads at Google (as of June 2016)

Type of network   # of layers   # of weights   % of deployed
MLP0              5             20M            61%
MLP1              4             5M
LSTM0             58            52M            29%
LSTM1             56            34M
CNN0              16            8M             5%
CNN1              89            100M

Slide 21

Slide 21 text

Reducing precision: a 32-bit float covers roughly -3.4E+38 to +3.4E+38; an 8-bit int covers a far narrower range. Common practice: inference uses 8-bit int quantization; training uses 16-bit floating-point truncation.

Slide 22

Slide 22 text

Quantization in TensorFlow
DTypes: DT_QINT8, DT_QINT32, DT_QUINT8
Quantize/Dequantize: tf.quantize_v2, tf.dequantize
Operations: matmul, Conv/Pool, Activation
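As a rough illustration (this snippet is mine, not from the slides; the value range and example numbers are arbitrary), a float tensor can be quantized to 8-bit and dequantized back with the TF 1.x ops listed above:

import tensorflow as tf

# Toy example: quantize fp32 values into 8-bit unsigned ints over an assumed
# [-4, 4] range, then dequantize them back to see the quantization error.
x = tf.constant([-1.2, 0.0, 0.5, 3.4], dtype=tf.float32)

quantized = tf.quantize_v2(x, min_range=-4.0, max_range=4.0, T=tf.quint8)
restored = tf.dequantize(quantized.output, quantized.output_min, quantized.output_max)

with tf.Session() as sess:
    q, r = sess.run([quantized.output, restored])
    print(q)  # uint8 codes in [0, 255]
    print(r)  # approximately the original floats, within quantization error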

Slide 23

Slide 23 text

Quantized = 25x more multipliers
Tesla K80: 2,496 32-bit FP multipliers
TPU v1: 65,536 8-bit integer multipliers

Slide 24

Slide 24 text

Matrix Processing in TPU v1

Slide 25

Slide 25 text

How CPUs and GPUs work (GPU example: NVIDIA P100, 3,584 CUDA cores)

Slide 26

Slide 26 text

How a CPU works: code, operator, memory (register) — a general-purpose processor requires a memory access at every calculation.

Slide 27

Slide 27 text

How a GPU works

Slide 28

Slide 28 text

How a TPU works: specialized for matrix operations with significantly less memory access — more operators in a smaller footprint, using less power.

Slide 29

Slide 29 text

The core of the TPU: the systolic array — a large, hard-wired matrix calculation engine that needs no memory access between operations.
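To make the idea concrete, here is a small sketch of mine (not from the talk) that simulates an output-stationary systolic array in NumPy: each cell keeps a running partial sum and only ever talks to its right and down neighbours, so no memory round-trips happen between multiply-adds. The real MXU is weight-stationary and much larger (256x256 in v1, 128x128 in v2); this toy only illustrates the data flow.

import numpy as np

def systolic_matmul(A, B):
    # Output-stationary toy systolic array computing A @ B.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    acc = np.zeros((m, n))      # each cell (i, j) accumulates C[i, j]
    a_reg = np.zeros((m, n))    # the A value currently sitting in each cell
    b_reg = np.zeros((m, n))    # the B value currently sitting in each cell

    for t in range(m + n + k - 2):           # cycles until the last cell finishes
        a_reg[:, 1:] = a_reg[:, :-1].copy()  # values move one cell to the right...
        b_reg[1:, :] = b_reg[:-1, :].copy()  # ...and one cell down, neighbour to neighbour
        for i in range(m):                   # feed the left edge, one skewed diagonal per cycle
            a_reg[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(n):                   # feed the top edge
            b_reg[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
        acc += a_reg * b_reg                 # every cell does one multiply-accumulate per cycle
    return acc

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)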

Slide 30

Slide 30 text

TPU v1: a design specific to large matrix operations (compared with CPU and GPU/SIMD).

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

TPU v1: a matrix processor for neural network prediction
Matrix Multiplier Unit (MXU): 65,536 8-bit multiply-and-add units
Unified Buffer: 24MB SRAM
Activation Unit: hardwired activation functions

Slide 33

Slide 33 text

Matrix Multiply Unit (MXU): a BIG systolic array
Up to 256K ops / cycle; up to 256M ops / instruction

Operations per cycle:
CPU: a few
CPU (vector extension): tens
GPU: tens of thousands
TPU: hundreds of thousands, up to 256K

Slide 34

Slide 34 text

TPU v1 instruction set and software stack

Instruction                 Function
Read_Host_Memory            Read data from memory
Read_Weights                Read weights from memory
MatrixMultiply / Convolve   Multiply or convolve, accumulate the results
Activate                    Apply activation functions
Write_Host_Memory           Write results to memory
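As an illustration of how little this instruction set needs to express, here is a toy NumPy emulation of that flow (my sketch, not from the talk; the function and buffer names are illustrative, not a real TPU driver API): stage inputs and weights once, run a chain of multiply-and-activate steps, then write the results back.

import numpy as np

HOST_MEMORY = {}

def read_host_memory(key):             # Read_Host_Memory
    return np.array(HOST_MEMORY[key])

def matrix_multiply(acts, w):          # MatrixMultiply / Convolve, accumulate
    return acts @ w

def activate(x):                       # Activate: a hardwired ReLU here
    return np.maximum(x, 0.0)

def write_host_memory(key, value):     # Write_Host_Memory
    HOST_MEMORY[key] = value

# Example: a 2-layer MLP inference pass.
HOST_MEMORY["inputs"] = np.random.rand(8, 16)
weights = [np.random.randn(16, 32), np.random.randn(32, 10)]  # Read_Weights (staged once)

acts = read_host_memory("inputs")
for w in weights:
    acts = activate(matrix_multiply(acts, w))
write_host_memory("outputs", acts)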

Slide 35

Slide 35 text

TPU v1 Performance / watt: 83x better than CPU

Slide 36

Slide 36 text

Minimal and deterministic design

Slide 37

Slide 37 text

TPU v1: a minimal design for neural networks. Control logic is only 2% of the die. All the usual complexities are removed: caching, branch prediction, out-of-order execution, multi-processing/threading, context switching, etc. The result is guaranteed latency: 7ms with high throughput.

Slide 38

Slide 38 text

TPU v1 throughput at 7 ms latency limit

Slide 39

Slide 39 text

TPU v2

Slide 40

Slide 40 text

2nd generation Tensor Processing Unit: an ASIC for NN calculation, for both training and inference. 180 TFLOPS per Cloud TPU (NVIDIA V100: 128 TFLOPS).

Slide 41

Slide 41 text

TPU v2 processor layout: two cores per chip, each with a Scalar Unit, a Vector Unit, a Matrix Unit (MXU) and 8GB of HBM.

Slide 42

Slide 42 text

TPU v2 "Tensor Core": 2 cores per processor, each containing a Matrix Unit, a Scalar Unit and a Vector Unit.

Slide 43

Slide 43 text

TPU v2 MXU: a 128 x 128 systolic array with bfloat16 multiplies and float32 accumulation, backed by 8GB of HBM per core.

Slide 44

Slide 44 text

Floating point formats in TPU
fp32 (single-precision IEEE): 1 sign bit, 8 exponent bits, 23 mantissa (significand) bits; range ~1e-38 to ~3e38.
fp16 (half-precision IEEE): 1 sign bit, 5 exponent bits, 10 mantissa (significand) bits; range ~5.96e-8 to 65504.
fp16 needs less bandwidth and allows larger models, but has a much shorter range.

Slide 45

Slide 45 text

Floating point formats in TPU
fp32 (single-precision IEEE): 1 sign bit, 8 exponent bits, 23 mantissa bits; range ~1e-38 to ~3e38.
fp16 (half-precision IEEE): 1 sign bit, 5 exponent bits, 10 mantissa bits; range ~5.96e-8 to 65504.
bfloat16 (Brain Floating Point format, supported by TPU): 1 sign bit, 8 exponent bits, 7 mantissa bits; range ~1e-38 to ~3e38 — the same range as fp32.
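A quick way to see the range difference (this snippet is mine, not from the slides; the value is arbitrary): fp16 overflows to inf well below magnitudes that fp32 handles easily, while bfloat16 keeps fp32's 8-bit exponent and therefore the same range, just with less precision.

import numpy as np
import tensorflow as tf

big = 1e5  # comfortably inside fp32/bfloat16 range, beyond fp16's 65504 maximum

print(np.float16(big))   # inf  -- fp16 overflows
print(np.float32(big))   # 100000.0

# bfloat16 keeps the same exponent width as fp32, so the value survives,
# rounded to bfloat16's ~3 significant decimal digits rather than overflowing.
with tf.Session() as sess:
    x = tf.cast(tf.constant(big, dtype=tf.float32), tf.bfloat16)
    print(sess.run(tf.cast(x, tf.float32)))   # roughly 1e5, not inf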

Slide 46

Slide 46 text

Cloud TPU cost and performance

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Cloud TPU pricing (preemptible TPUs require recovery from a checkpoint file).

Slide 49

Slide 49 text

[ https://dawn.cs.stanford.edu/benchmark/ ]

Slide 50

Slide 50 text

Cloud TPU performance: AmoebaNet-D
Final accuracy: 93%
Training time: 7.5 hrs
Training cost: $49
#1 training cost on DAWNBench as of Apr 2018

Slide 51

Slide 51 text

Cloud TPU performance: tuned ResNet-50 on a preemptible TPU
Final accuracy: 93%
Training cost: $7.50
1/10 of the #1 GPU training cost on DAWNBench as of June 2018

Slide 52

Slide 52 text

"Using Cloud TPUs instead of clusters of other accelerators has allowed us to focus on building our models without being distracted by the need to manage the complexity of cluster communication patterns." — Alfred Spector, CTO, Two Sigma

"Since working with Google Cloud TPUs, we've been extremely impressed with their speed — what could normally take days can now take hours." — Anantha Kancherla, Head of Software, Self-Driving Level 5, Lyft

Slide 53

Slide 53 text

TPU Pod: Large scale TPU cluster

Slide 54

Slide 54 text

HPC technology is the key to scalable ML, e.g. NVIDIA DGX-2: 16 x V100 at $400K, 2 PFLOPS.

Slide 55

Slide 55 text

TPU v2 Pod: Google's HPC cluster for ML 11.6 PFLOPS with 64 Cloud TPUs

Slide 56

Slide 56 text

Parameter Server vs. All Reduce
Parameter Server: model replicas compute gradients Δw on data shards and send them to a parameter server, which applies w' = w − ηΔw and sends the weights back. The PS runs in software (gRPC on TCP/IP) on CPUs → the PS becomes the bottleneck, and distributed cluster management is tedious.
All Reduce: gradients are combined directly between replicas over a high-speed interconnect (a 2-D toroidal mesh network) built from Google's HPC hardware → as easy as using a single node, as scalable as a supercomputer.
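To show what "All Reduce" means in code, here is a toy ring all-reduce in NumPy (my sketch, not from the talk): every worker ends up with the sum of all workers' gradients while only ever exchanging one chunk per step with a ring neighbour, with no central parameter server. A TPU Pod does this over a 2-D toroidal mesh in hardware; the ring version only illustrates the principle.

import numpy as np

def ring_all_reduce(grads):
    # Every worker ends up holding sum(grads), exchanging only one chunk per
    # step with its ring neighbour (no central parameter server).
    n = len(grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=np.float64), n)) for g in grads]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):                       # worker i "sends" one chunk to worker i+1
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # Phase 2: all-gather. After n-1 more steps, every worker holds every
    # fully summed chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

# 4 workers, each with its own gradient vector for the same 10 parameters.
grads = [np.random.rand(10) for _ in range(4)]
reduced = ring_all_reduce(grads)
assert all(np.allclose(r, np.sum(grads, axis=0)) for r in reduced)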

Slide 57

Slide 57 text

TPU v2 Pod for ResNet-50: linearly scalable (an NVIDIA DGX-2 scales up to 16 x V100s).

Slide 58

Slide 58 text

TPU v2 Pod performance: ResNet-50 on a TPU v2 half-pod
Real data: 77,392 images/sec
Final accuracy: 93%
Training time: 30 min
#1 training time on DAWNBench

Slide 59

Slide 59 text

TPU v2 Pod performance
RankBrain: 132 h on 275 CPUs → 9 h on 16 TPUs
Image model: 216 h → 22 h on 16 TPUs
WaveNet (speech): generation at 20x real time

Slide 60

Slide 60 text

Large-batch training: 8K images / batch on 32 TPUs. ResNet-50 training on ImageNet, time to 90 epochs at 76% top-1 validation accuracy (chart): 10 hrs → 52 min → 25 min.

Slide 61

Slide 61 text

TPU 3.0 Pod: >100 PFLOPS (8X faster than v2)

Slide 62

Slide 62 text

Programming Cloud TPU

Slide 63

Slide 63 text

The Cloud TPU documentation is your best guide!

Slide 64

Slide 64 text

Scalable training with the TensorFlow Estimator API: the same Estimator-based code scales from multi-GPU training (gradients averaged across GPUs), to distributed TensorFlow with parameter servers (w' = w − ηΔw), to TPU.

Slide 65

Slide 65 text

Programming Cloud TPU: TPUEstimator

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    use_tpu=FLAGS.use_tpu,
    train_batch_size=FLAGS.batch_size,
    eval_batch_size=FLAGS.batch_size,
    params={"data_dir": FLAGS.data_dir},
    config=run_config)
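The run_config passed above is not shown on the slide; a minimal sketch of how it is typically constructed (the FLAGS names here are placeholders, not part of the slide) looks like this:

import tensorflow as tf

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=FLAGS.model_dir,
    session_config=tf.ConfigProto(allow_soft_placement=True),
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations,  # steps run on the TPU per infeed loop
        num_shards=FLAGS.num_shards))          # 8 cores on a single Cloud TPU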

Slide 66

Slide 66 text

Sample model with Layers & Estimators

def model_fn(features, labels, mode, params):
    input_layer = tf.reshape(features, [-1, 28, 28, 1])
    conv1 = tf.layers.conv2d(inputs=input_layer, ...)
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
    # ...
    loss = tf.losses.softmax_cross_entropy(
        onehot_labels=onehot_labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(loss)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

Slide 67

Slide 67 text

Model with TPU modifications (no further change required for a TPU Pod)

def model_fn(features, labels, mode, params):
    input_layer = tf.reshape(features, [-1, 28, 28, 1])
    conv1 = tf.layers.conv2d(inputs=input_layer, ...)
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)
    # ...
    loss = tf.losses.softmax_cross_entropy(
        onehot_labels=onehot_labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    optimizer = tpu_optimizer.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss)
    return tpu_estimator.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)
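Once the TPUEstimator from the earlier slide is built with this model_fn, training and evaluation are invoked the same way as for any Estimator. A hedged usage sketch (train_input_fn, eval_input_fn and the step-count flags are placeholders, not from the slides):

estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
estimator.evaluate(input_fn=eval_input_fn, steps=FLAGS.eval_steps)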

Slide 68

Slide 68 text

TensorFlow, XLA and Cloud TPU

Slide 69

Slide 69 text

When to use Cloud TPU
Use Cloud TPU when: the model needs tons of matrix operations; it is a large model with a large batch size (e.g. a large CNN such as ResNet); and it can run with TPU-supported ops.
Don't use Cloud TPU when: the workload is sparse, small, high-precision, or has many branches; or it can't run with TPU-supported ops.

Slide 70

Slide 70 text

Available reference models for Cloud TPUs
Image recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4, DenseNet
Object detection: RetinaNet
Low-resource models: MobileNet, SqueezeNet
Machine translation and language modeling: machine translation, language modeling, sentiment analysis, question-answering (all Transformer-based)
Speech recognition: ASR Transformer (LibriSpeech)
Image generation: Image Transformer, DCGAN

Slide 71

Slide 71 text

Cloud TPU FAQs (as of June 5, 2018)
How do you count Cloud TPUs? 1 Cloud TPU has 4 TPU processors and 8 cores, for a total of 64GB HBM and 180 TFLOPS.
Can you use Cloud TPU for inference? Batch inference works on Cloud TPU; online inference does not. TensorFlow Serving and ML Engine prediction do not work on Cloud TPU.
Is Cloud TPU faster than GPU? Google hasn't published any comparison, but RiseML has a blog post comparing it with NVIDIA V100.
Any other way of using TPU than TPUEstimator? No. We strongly recommend starting with the reference models and then customizing TPUEstimator.
Does Colaboratory or Cloud Datalab support TPU? Stay tuned.
Does Cloud ML Engine support TPU? Yes, training with Cloud TPU is supported as beta.

Slide 72

Slide 72 text

TensorBoard TPU tools

Slide 73

Slide 73 text

"Romit and I discovered some new TensorBoard profiling features that analyze your ENTIRE TensorFlow pipeline including data ingestion and ETL to CPU, GPU, and TPU utilization and graph/operator optimization...These profiling tools are exactly what we've always from Spark-based ETL pipelines, but we've never seen them on the market - not at this level of system detail and optimization." https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/23 3979387/

Slide 74

Slide 74 text

Overview

Slide 75

Slide 75 text

XLA graphs

Slide 76

Slide 76 text

TPU Compatibility Checker

Slide 77

Slide 77 text

Trace Viewer

Slide 78

Slide 78 text

Input Pipeline Analyzer

Slide 79

Slide 79 text

Find the bottleneck between Storage, Host and TPU

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

Host-side analysis details: stats and tips for each stage — reading data from storage, preprocessing the data, and sending the data to the device — with remedies such as caching, prefetching and parallel processing.
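As an illustration of the remedies the analyzer suggests (this sketch is mine, not from the slides; the file pattern, parse_fn and batch size are placeholders), a TF 1.x tf.data input pipeline with parallel reads, parallel preprocessing, caching and prefetching:

import tensorflow as tf

def make_input_fn(file_pattern, parse_fn, batch_size):
    def input_fn(params=None):
        files = tf.data.Dataset.list_files(file_pattern)
        dataset = files.apply(tf.contrib.data.parallel_interleave(
            tf.data.TFRecordDataset, cycle_length=8))        # read many files in parallel
        dataset = dataset.map(parse_fn, num_parallel_calls=64)  # parallel preprocessing
        dataset = dataset.cache()                             # cache decoded examples in memory
        dataset = dataset.repeat()
        dataset = dataset.apply(
            tf.contrib.data.batch_and_drop_remainder(batch_size))  # TPU needs fixed batch shapes
        dataset = dataset.prefetch(2)                         # overlap host work with device work
        return dataset
    return input_fn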

Slide 82

Slide 82 text

Ops bottleneck with op_profile

Slide 83

Slide 83 text

Summary
What is TPU? — a domain-specific architecture for deep learning
TPU Pod — HPC-powered, scalable "All Reduce" distributed training
Cloud TPU programming — with the TensorFlow Estimator API and TensorBoard

Slide 84

Slide 84 text

Get started at cloud.google.com/tpu

Slide 85

Slide 85 text

Thank You!