
Tensor Processing Unit (TPU) Overview (July 6, 2018)



Kazunori Sato

July 06, 2018



Transcript

  1. Tensor Processing Unit
    Designed for fast and affordable AI


  2. +Kazunori Sato
    @kazunori_279
    Kaz Sato
    Staff Developer Advocate
    Data & Analytics
    Google Cloud


  3. Agenda
    What is TPU?
    domain specific architecture for deep learning
    TPU Pod
    HPC-powered scalable "All Reduce" distributed training
    Cloud TPU
    Programming with TensorFlow Estimator API and TensorBoard


  4. TPU public resources
    Cloud TPU Documentation
    Effective machine learning using Cloud TPUs by Zak Stone (Google I/O '18 video)
    Training Performance: A user’s guide to converge faster by Brennan Saeta (TensorFlow Dev Summit
    2018 video)
    In-Datacenter Performance Analysis of a Tensor Processing Unit by Norm Jouppi et al. (paper)
    An in-depth look at Google’s first Tensor Processing Unit by Kaz Sato, Cliff Young and David
    Patterson (blog post)
    The future of computing by John Hennessy (Google I/O '18 video)


  5. The end of
    Moore's Law


  6. Moore's Law is ending.


7. Growth of single-program performance (based on SPECintCPU; source: John
    Hennessy and David Patterson, Computer Architecture: A Quantitative
    Approach, 6/e, 2018):
    CISC era: 2X / 3.5 yrs (22%/yr)
    RISC era: 2X / 1.5 yrs (52%/yr)
    End of Dennard Scaling → multicore: 2X / 3.5 yrs (23%/yr)
    Amdahl's Law era: 2X / 6 yrs (12%/yr)
    End of the line? 2X / 20 yrs (3%/yr)
    End of growth of single-program speed?


  8. The solution: domain specific hardware


  9. Example: hardware acceleration in BigQuery
    Why is BigQuery so fast? Because of
    hardware acceleration + MPP
    (massively parallel processing)


  10. TPU history:
    Designing domain specific hardware for ML
    Early discussion started in 2006
    Production project started in 2013
    (...after 15 months…)
    The first deployment in 2015


  11. What is TPU?


  12. TPU v1 and v2
    TPU v1
    Launched in 2015
    Inference only
    TPU v2
    Launched in 2017
    Inference and training


  13. TPU 3.0
    Launched in 2018
    Inference and training


14. TPU v1 Overview
    ASIC (28nm process)
    Clock: 700MHz
    Power consumption: 40W
    Size: SATA disk drive slot
    Bus: PCIe Gen3 x16
    (12.5GB/s sustained)


  15. TPU v1: in production since 2015
    Search
    Search ranking
    Speech recognition
    Translate
    Text, graphic and speech
    translation
    Photos
    Photos search


  16. TPU v1 Performance comparison with
    Intel Haswell CPU and NVIDIA K80 GPU
    Performance: 15 - 30x
    Performance per watt: 30 - 80x


17. Why was TPU v1 so successful?
    Domain specific architecture for Deep Learning:
    Reduced precision
    Matrix Processor
    Minimal and deterministic design


  18. Reduced precision
    in TPU v1


  19. Neural Network: a bunch of Multiply and Add
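    (For intuition, a dense layer really is just multiplies and adds; a minimal NumPy sketch with made-up shapes:)

    import numpy as np

    x = np.random.rand(1, 784).astype(np.float32)    # one flattened 28x28 input
    W = np.random.rand(784, 100).astype(np.float32)  # layer weights
    b = np.zeros(100, dtype=np.float32)              # biases

    y = np.maximum(x @ W + b, 0.0)  # 784*100 multiply-adds, then a ReLU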


20. TPU v1 workloads in Google (as of June 2016)
    Type of network | # of network layers | # of weights | % of deployed
    MLP0            | 5                   | 20M          | 61%
    MLP1            | 4                   | 5M           |
    LSTM0           | 58                  | 52M          | 29%
    LSTM1           | 56                  | 34M          |
    CNN0            | 16                  | 8M           | 5%
    CNN1            | 89                  | 100M         |


21. Reducing precision
    32 bit float range: -3.4E+38 (min) to +3.4E+38 (max) → 8 bit int
    Common practice:
    Inference: 8 bit int quantization
    Training: 16 bit fp truncation


22. Quantization in TensorFlow
    DType:
    ● DT_QINT8
    ● DT_QINT32
    ● DT_QUINT8
    Quantize/Dequantize:
    ● tf.quantize_v2
    ● tf.dequantize
    Operations:
    ● matmul
    ● Conv/Pool
    ● Activation
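    (A minimal sketch of the quantize/dequantize round trip with the TF 1.x ops listed above; the value range here is purely illustrative:)

    import tensorflow as tf

    x = tf.constant([-3.4, 0.0, 1.7, 3.4], dtype=tf.float32)

    # Map the float range [-3.4, 3.4] onto 8-bit unsigned integers...
    q, out_min, out_max = tf.quantize_v2(x, min_range=-3.4, max_range=3.4,
                                         T=tf.quint8)
    # ...and back to float32 (with the expected loss of precision).
    d = tf.dequantize(q, min_range=out_min, max_range=out_max)

    with tf.Session() as sess:
        print(sess.run([q, d]))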


  23. Quantized = 25x more multipliers
    Tesla K80:
    2,496 x 32 bit
    FP multipliers
    TPU v1:
    65,536 x 8 bit
    Integer multipliers


  24. Matrix Processing
    in TPU v1


  25. How CPU and GPU work
    CPU vs. GPU (NVIDIA P100: 3,584 CUDA Cores)


26. How CPU works
    code → operator → memory (register)
    A general purpose processor requires memory access
    at every calculation


27. How GPU works


  28. How TPU works
    Specialized for matrix operations with significantly less memory access
    More operators with smaller footprint and less power


  29. The core of TPU: Systolic Array
    Large hard-wired matrix calculation without memory access
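    (For intuition, a toy software model of the accumulate-and-forward pattern a systolic array implements in hardware; this is only an illustration, not how the silicon is organised:)

    import numpy as np

    def systolic_matmul(x, w):
        # Toy weight-stationary systolic array: cell (i, j) holds w[i, j],
        # multiplies it by the activation streaming past, and adds the product
        # to the partial sum flowing through its column.
        n, k = x.shape
        k2, m = w.shape
        assert k == k2
        out = np.zeros((n, m), dtype=x.dtype)
        for row in range(n):        # one activation row per wavefront
            for j in range(m):      # each column accumulates one output
                acc = 0.0
                for i in range(k):  # partial sum passes cell to cell, not to memory
                    acc += x[row, i] * w[i, j]
                out[row, j] = acc
        return out

    x = np.random.rand(2, 4).astype(np.float32)
    w = np.random.rand(4, 3).astype(np.float32)
    assert np.allclose(systolic_matmul(x, w), x @ w, atol=1e-5)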


  30. TPU v1: specific design for large matrix operations
    CPU vs. GPU/SIMD vs. TPU



  32. TPU v1: a matrix processor for neural network prediction
    Matrix Multiplier Unit (MXU)
    ● 65,536 x 8-bit multiply-and-add
    Unified Buffer
    ● 24MB SRAM
    Activation Unit
    ● Hardwired activation functions


33. Matrix Multiply Unit (MXU): a BIG systolic array
    Up to 256K ops / cycle
    Up to 256M ops / instruction
    Operations per cycle:
    CPU                     | a few
    CPU (vector extension)  | tens
    GPU                     | tens of thousands
    TPU                     | hundreds of thousands, up to 256K


34. TPU v1 Instruction Set and software stack
    Instruction              | Function
    Read_Host_Memory         | Read data from memory
    Read_Weights             | Read weights from memory
    MatrixMultiply/Convolve  | Multiply or Convolve, accumulate the results
    Activate                 | Apply activation functions
    Write_Host_Memory        | Write result to memory


  35. TPU v1 Performance / watt: 83x better than CPU


  36. Minimal and
    deterministic design


37. TPU v1: minimal design for neural network
    Control logic: only 2%
    Removing all complexities:
    caching, branch prediction, OOO execution,
    multi-processing/threading,
    context switching, etc.
    Guaranteed latency: 7ms
    with high throughput


  38. TPU v1 throughput at 7 ms latency limit


  39. TPU v2


  40. 2nd generation
    Tensor Processing Unit
    ASIC for NN calculation
    Training & Inference
    180 TFLOPS per Cloud TPU
    (NVIDIA V100:
    128 TFLOPS)


41. TPU v2 processor layout
    2 x (Matrix Unit (MXU) + Vector Unit + Scalar Unit + 8GB HBM)


42. TPU v2 "Tensor Core"
    2 cores per processor
    Each core: Matrix Unit (MXU), Vector Unit, Scalar Unit


43. TPU v2 MXU
    Matrix Unit (MXU):
    128 x 128 systolic array
    bfloat16 multiplies
    float32 accumulate
    8GB HBM per core


44. Floating Point Formats in TPU
    fp32 (Single-precision IEEE): 1 sign bit, 8 exponent bits, 23 mantissa (significand) bits; range ~1e−38 to ~3e38
    fp16 (Half-precision IEEE): 1 sign bit, 5 exponent bits, 10 mantissa (significand) bits; range ~5.96e−8 to 65504
    fp16: less bandwidth and a larger model fits, but a much shorter range


45. Floating Point Formats in TPU
    fp32 (Single-precision IEEE): 1 sign bit, 8 exponent bits, 23 mantissa bits; range ~1e−38 to ~3e38
    fp16 (Half-precision IEEE): 1 sign bit, 5 exponent bits, 10 mantissa bits; range ~5.96e−8 to 65504
    bfloat16 (Brain Floating Point Format): 1 sign bit, 8 exponent bits, 7 mantissa bits; range ~1e−38 to ~3e38
    bfloat16 is supported by TPU and has the same range as fp32
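    (bfloat16 is essentially float32 with the low 16 mantissa bits dropped, which is why it keeps the float32 exponent range. A NumPy sketch of that truncation follows; real hardware rounds rather than truncates, so treat it as an illustration only. On the MXU the multiplies run in bfloat16 while accumulation stays in float32, so what is given up is mostly per-product precision, not dynamic range.)

    import numpy as np

    def to_bfloat16(x):
        # Illustrative bfloat16: keep only the top 16 bits of each float32.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    x = np.array([3.14159265, 1e38], dtype=np.float32)
    print(to_bfloat16(x))   # [3.140625, ~1e38]: only ~3 significant digits,
                            # but the fp32 range survives (fp16 would overflow)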


  46. Cloud TPU
    cost and performance



  48. Cloud TPU pricing
    (Preemptible TPUs require recovery from a checkpoint file)


  49. [ https://dawn.cs.stanford.edu/benchmark/ ]


  50. Cloud TPU Performance
    AmoebaNet-D
    Final accuracy: 93%
    Training time: 7.5 Hrs
    Training cost: $49
    #1 training cost on DAWNBench
    as of Apr 2018


  51. Cloud TPU Performance
    Tuned ResNet-50 on Preemptible TPU
    Final accuracy: 93%
    Training cost: $7.5
    1/10 of the #1 training cost by
    GPU on DAWNBench
    as of June 2018


52. "Using Cloud TPUs instead of clusters of other accelerators has allowed
    us to focus on building our models without being distracted by the need to
    manage the complexity of cluster communication patterns."
    Alfred Spector, CTO, Two Sigma

    "Since working with Google Cloud TPUs, we’ve been extremely impressed with
    their speed—what could normally take days can now take hours."
    Anantha Kancherla, Head of Software, Self-Driving Level 5, Lyft


  53. TPU Pod:
    Large scale TPU cluster


  54. HPC technology is the key for scalable ML
    e.g. NVIDIA DGX-2: 16 x V100 at $400K, 2 PFLOPS


  55. TPU v2 Pod: Google's HPC cluster for ML
    11.6 PFLOPS with 64 Cloud TPUs


56. Parameter Server vs. All Reduce
    Parameter Server: model replicas and data shards; replicas pull weights w and
    push gradients Δw to the parameter server, which applies w' = w − ηΔw.
    PS runs with gRPC on TCP/IP, in software on CPUs
    → the PS becomes the bottleneck, and distributed cluster management is tedious
    All Reduce: gradients are exchanged directly over a high speed 2-D toroidal
    mesh interconnect, by Google's HPC hardware
    → as easy as using a single node,
    as scalable as supercomputers
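    (For intuition only, a tiny software sketch of an all-reduce in its simplest ring form; the pod's 2-D toroidal mesh accelerates this kind of collective in hardware. Gradients are summed peer to peer with no central parameter server; this simulates the result and is not Google's implementation:)

    import numpy as np

    def ring_allreduce(grads):
        # Each worker owns one chunk; after the reduce-scatter phase chunk j is
        # fully summed on worker j, and the all-gather phase copies every summed
        # chunk back to every worker.
        n = len(grads)
        chunks = [np.array_split(g, n) for g in grads]
        summed = [sum(chunks[w][j] for w in range(n)) for j in range(n)]
        return [np.concatenate(summed) for _ in range(n)]

    grads = [np.random.rand(8).astype(np.float32) for _ in range(4)]  # 4 workers
    reduced = ring_allreduce(grads)
    assert np.allclose(reduced[0], np.sum(grads, axis=0), atol=1e-6)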


57. TPU v2 Pod for ResNet-50: linearly scalable
    NVIDIA DGX-2:
    up to 16 x V100s


58. TPU v2 Pod performance
    ResNet-50 on TPU v2 half-pod
    Real data: 77,392 images/sec
    Final accuracy: 93%
    Training time: 30 min
    #1 training time on DAWNBench


  59. TPU v2 Pod performance
    RankBrain: 132 h on 275 CPUs → 9 h on 16 TPUs
    Image model: 216 h → 22 h on 16 TPUs
    WaveNet (Speech): generation at 20X real time


60. Large-batch training: 8K images / batch on 32 TPUs
    ResNet-50 training on ImageNet to 76% top-1 validation accuracy (90 epochs):
    training time reduced from 10 hrs 52 min to 25 min


  61. TPU 3.0 Pod: >100 PFLOPS (8X faster than v2)


  62. Programming
    Cloud TPU


63. The Cloud TPU documentation is a great guide!


64. Scalable training with TensorFlow Estimator
    Multi GPUs: each GPU computes the model, loss and gradients; gradients are
    averaged (mean) and the variables updated
    Distributed TensorFlow: parameter server with model replicas and data shards,
    w' = w − ηΔw
    TPU


65. Programming Cloud TPU: TPU Estimator

    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=model_fn,
        use_tpu=FLAGS.use_tpu,
        train_batch_size=FLAGS.batch_size,
        eval_batch_size=FLAGS.batch_size,
        params={"data_dir": FLAGS.data_dir},
        config=run_config)
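    (run_config above is not shown on the slide; a minimal sketch of what it could look like with the TF 1.x contrib API of the time. The TPU name and GCS paths are placeholders:)

    import tensorflow as tf

    tpu_cluster = tf.contrib.cluster_resolver.TPUClusterResolver(
        tpu="my-tpu")                          # placeholder Cloud TPU name

    run_config = tf.contrib.tpu.RunConfig(
        cluster=tpu_cluster,
        model_dir="gs://my-bucket/model",      # checkpoints go to GCS
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

    Training then runs as with any Estimator, e.g. estimator.train(input_fn=train_input_fn, max_steps=10000).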


66. Sample Model with Layers & Estimators

    def model_fn(features, labels, mode, params):
        input_layer = tf.reshape(features, [-1, 28, 28, 1])
        conv1 = tf.layers.conv2d(inputs=input_layer, ...)
        pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2],
                                        strides=2)
        # ...
        loss = tf.losses.softmax_cross_entropy(
            onehot_labels=onehot_labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(loss)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss,
                                          train_op=train_op)


67. Model with TPU Modifications

    def model_fn(features, labels, mode, params):
        input_layer = tf.reshape(features, [-1, 28, 28, 1])
        conv1 = tf.layers.conv2d(inputs=input_layer, ...)
        pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2],
                                        strides=2)
        # ...
        loss = tf.losses.softmax_cross_entropy(
            onehot_labels=onehot_labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
        # TPU-specific: aggregate gradients across TPU shards
        optimizer = tpu_optimizer.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(loss)
        # TPU-specific: return a TPU estimator spec
        return tpu_estimator.EstimatorSpec(mode=mode, loss=loss,
                                           train_op=train_op)

    No further change required for TPU Pod
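    (One TPU-specific detail not visible on the slide: TPUEstimator passes the per-shard batch size to the input function through params, so an input_fn typically looks roughly like this sketch; the data here is an in-memory stand-in for a real GCS-backed dataset:)

    import tensorflow as tf

    def train_input_fn(params):
        batch_size = params["batch_size"]     # injected by TPUEstimator
        # Toy in-memory data; a real job would read TFRecords from GCS.
        images = tf.random_uniform([1000, 28, 28, 1])
        labels = tf.random_uniform([1000], maxval=10, dtype=tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((images, labels))
        dataset = dataset.shuffle(1000).repeat()
        # TPUs need fixed batch shapes, so drop the final partial batch
        # (drop_remainder assumes a recent enough TF 1.x release).
        dataset = dataset.batch(batch_size, drop_remainder=True)
        return dataset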


  68. TensorFlow, XLA and Cloud TPU


69. When to use Cloud TPU
    Use Cloud TPU when:
    the workload needs tons of matrix operations
    the model is large, with a large batch size
    (e.g. a large CNN such as ResNet)
    it can run with TPU supported ops
    Don't use Cloud TPU when:
    the workload is sparse, small, high-precision, or has many branches
    it can't run with TPU supported ops


70. Available reference models for Cloud TPUs
    Image Recognition & Object Detection:
    Image Recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4, DenseNet
    Object Detection: RetinaNet
    Low-Resource Models: MobileNet, SqueezeNet
    Machine Translation and Language Modeling:
    Machine translation, language modeling, sentiment analysis, question-answering
    (all Transformer-based)
    Speech Recognition:
    ASR Transformer (LibriSpeech)
    Image Generation:
    Image Transformer, DCGAN


71. Cloud TPU FAQs
    (as of June 5, 2018)
    How do you count Cloud TPUs?
    1 Cloud TPU has 4 TPU processors and 8 cores: 64GB HBM and 180 TFLOPS in total.
    Can you use Cloud TPU for inference?
    Batch inference works on Cloud TPU. Online inference does not.
    TensorFlow Serving and ML Engine prediction do not work on Cloud TPU.
    Is Cloud TPU faster than GPU?
    Google hasn't published any comparison, but RiseML has a blog post comparing it with NVIDIA V100.
    Any other way of using TPU than TPUEstimator?
    No. We strongly recommend starting with the reference models and then customising the TPUEstimator code.
    Does Colaboratory or Cloud Datalab support TPU?
    Stay tuned.
    Does Cloud ML Engine support TPU?
    Yes. Training with Cloud TPU is supported as beta.


  72. TensorBoard
    TPU tools


  73. "Romit and I discovered some new TensorBoard profiling features that analyze
    your ENTIRE TensorFlow pipeline including data ingestion and ETL to CPU, GPU,
    and TPU utilization and graph/operator optimization...These profiling tools are
    exactly what we've always from Spark-based ETL pipelines, but we've never
    seen them on the market - not at this level of system detail and optimization."
    https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/23
    3979387/


  74. Overview


  75. XLA graphs


  76. TPU Compatibility Checker


  77. Trace Viewer


  78. Input Pipeline Analyzer


  79. Find the bottleneck between Storage, Host and TPU



81. Host-side analysis details
    Stats: time spent reading data from storage, preprocessing the data, and
    sending the data to the device
    Tips: caching, prefetching, parallel processing
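    (The tips map directly onto tf.data calls; a minimal sketch of an input pipeline that applies all three. The file pattern and the feature spec are placeholders:)

    import tensorflow as tf

    def parse_fn(record):
        # Placeholder parser: one float feature per record.
        feats = tf.parse_single_example(
            record, {"x": tf.FixedLenFeature([], tf.float32)})
        return feats["x"]

    def make_dataset(file_pattern, batch_size):
        files = tf.data.Dataset.list_files(file_pattern)
        # Interleave records from several files to keep the host busy.
        dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)
        dataset = dataset.map(parse_fn, num_parallel_calls=8)  # parallel processing
        dataset = dataset.cache()                               # caching
        dataset = dataset.shuffle(10000).repeat().batch(batch_size)
        return dataset.prefetch(1)                              # prefetching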


  82. Ops bottleneck with op_profile


  83. Summary
    What is TPU?
    domain specific architecture for deep learning
    TPU Pod
    HPC-powered scalable "All Reduce" distributed training
    Cloud TPU
    Programming with TensorFlow Estimator API and TensorBoard


  84. cloud.google.com/tpu
    to get started


  85. Thank You!
