And Then There Are Algorithms – Part 2

Machine Learning for the Enterprise Conference, Rome, October 28th, 2019

Machine Learning = Algorithms + Data + Tools

Part 2

Danilo Poccia

October 28, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates.
    Danilo Poccia
    Principal Evangelist
    AWS
    @danilop
    danilop.net
    And Then There Are Algorithms

  2. Neural Networks

  3. 1943 Warren McCulloch, Walter Pitts
    Threshold Logic Units

  4. 1962 Frank Rosenblatt
    Perceptron

  5. Perceptron
    [Diagram: inputs x1, x2, x3, …, xn and weights (parameters) w1, w2, w3, …, wn, plus w0, feed an activation function that produces the output]

  6. Perceptron
    [Diagram: inputs x1, x2, x3, …, xn, weighted by w1, w2, w3, …, wn (plus w0), are summed and passed through the activation function f(∑) to produce the output; the weights are the parameters]

  7. Perceptron
    [Diagram: input → f(∑) → output]
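
The perceptron slides above can be sketched in a few lines of Python. This is only an illustration: the step activation and the AND-like weights are assumptions chosen for the example, not values taken from the slides.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, w0, activation=step):
    """Compute f(w0 + sum_i w_i * x_i) for a single perceptron."""
    weighted_sum = w0 + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical weights that make the unit behave like a logical AND.
and_weights, and_w0 = [1.0, 1.0], -1.5
outputs = [perceptron([a, b], and_weights, and_w0) for a in (0, 1) for b in (0, 1)]
print(outputs)
```

Swapping `step` for another function (for example a sigmoid) changes the activation without touching the weighted sum.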

  8. 1969 Marvin Minsky, Seymour Papert
    Perceptrons: An Introduction to Computational Geometry
    A perceptron can only solve linearly separable functions (e.g. no XOR)
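
A small experiment makes the linear-separability limit concrete. The sketch below assumes the classic perceptron learning rule (not shown on the slide) with a hypothetical learning rate and epoch budget; it trains one perceptron on AND and one on XOR and reports how many samples were still misclassified in the last epoch.

```python
def predict(weights, w0, x):
    """Step-activated perceptron output for input tuple x."""
    return 1 if w0 + sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0


def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron learning rule; returns (weights, w0, errors in the last epoch)."""
    weights, w0 = [0.0, 0.0], 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, w0, x)
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                w0 += lr * delta
        if errors == 0:  # every sample classified correctly: stop early
            break
    return weights, w0, errors


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [(x, x[0] & x[1]) for x in INPUTS]
XOR = [(x, x[0] ^ x[1]) for x in INPUTS]

_, _, and_errors = train_perceptron(AND)
_, _, xor_errors = train_perceptron(XOR)
print("AND errors in last epoch:", and_errors)
print("XOR errors in last epoch:", xor_errors)
```

AND is linearly separable, so training reaches an error-free epoch; for XOR no single line separates the classes, so at least one sample stays misclassified however long the loop runs.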

  9. Neural Network
    [Diagram: an input layer, a hidden layer, and an output layer of f(∑) units connecting input to output]
    Multiple Layers
    Lots of Parameters
    Backpropagation
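
To illustrate why multiple layers help, here is a minimal sketch of a two-layer network that computes XOR, which a single perceptron cannot. The weights are hand-picked for the example (an assumption); in practice they would be learned with backpropagation rather than set manually.

```python
def step(z):
    """Threshold activation."""
    return 1 if z >= 0 else 0


def unit(inputs, weights, w0):
    """One f(sum) unit: weighted sum plus w0, then the step activation."""
    return step(w0 + sum(w * x for w, x in zip(weights, inputs)))


def xor_network(x1, x2):
    # Hidden layer with hand-picked (not learned) weights.
    h_or = unit([x1, x2], [1.0, 1.0], -0.5)      # fires if at least one input is 1
    h_nand = unit([x1, x2], [-1.0, -1.0], 1.5)   # fires unless both inputs are 1
    # Output layer: AND of the two hidden units.
    return unit([h_or, h_nand], [1.0, 1.0], -1.5)


xor_outputs = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
print(xor_outputs)
```

Each hidden unit draws one line in the input plane; the output unit combines the two half-planes into a region that no single line could describe.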

  10. Moore’s Law
    Microprocessor Transistor Counts 1971-2018
    Intel Xeon CPU: 28 cores
    NVIDIA V100 GPU: 5,120 CUDA Cores, 640 Tensor Cores

  11. Deep Learning
    Advances in Research 1998-2009
    LeCun, Gradient-Based Learning Applied to Document Recognition, 1998
    Hinton, A Fast Learning Algorithm for Deep Belief Nets, 2006
    Bengio, Learning Deep Architectures for AI, 2009

  12. Image Processing
    Deep Learning

  13. Image Processing
    [Diagram: a network of f(∑) units producing an output]
    How to give images as input to a Neural Network?
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  14. Image Processing
    Convolution Matrix
    0 0 0
    0 1 0
    0 0 0
    Identity
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  15. Image Processing
    Convolution Matrix
    1 0 -1
    2 0 -2
    1 0 -1
    Left Edges
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  16. Image Processing
    Convolution Matrix
    -1 0 1
    -2 0 2
    -1 0 1
    Right Edges
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  17. Image Processing
    Convolution Matrix
    1 2 1
    0 0 0
    -1 -2 -1
    Top Edges
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  18. Image Processing
    Convolution Matrix
    -1 -2 -1
    0 0 0
    1 2 1
    Bottom Edges
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg

  19. Image Processing
    Convolution Matrix
    0.6 -0.6 1.2
    -1.4 1.2 -1.6
    0.8 -1.4 1.6
    Random Values
    Photo by David Iliff. License: CC-BY-SA 3.0
    https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg
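
The convolution matrices on the preceding slides can be tried on a tiny image. The sketch below is an illustration under stated assumptions: the 4x5 test image is made up, only the "valid" region is computed (no padding), and the kernel is applied without flipping (a cross-correlation, as deep learning libraries commonly do). With a flipped kernel the signs of the two edge responses would swap.

```python
def apply_kernel(image, kernel):
    """Slide a square kernel over a grayscale image (list of rows), no padding."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Kernels copied from the slides.
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
LEFT_EDGES = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
RIGHT_EDGES = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Hypothetical test image: dark (0) on the left, bright (9) on the right.
image = [[0, 0, 9, 9, 9] for _ in range(4)]

identity_out = apply_kernel(image, IDENTITY)
left_out = apply_kernel(image, LEFT_EDGES)
right_out = apply_kernel(image, RIGHT_EDGES)
print(identity_out)
print(left_out)
print(right_out)
```

The identity kernel returns the interior pixels unchanged, while the two edge kernels respond only where the brightness changes horizontally and stay at zero in the flat region. A CNN starts from random values, like the last matrix above, and learns kernels that are useful for its task.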

  20. CNNs
    Convolutional Neural Networks (CNNs)
    https://en.wikipedia.org/wiki/Convolutional_neural_network

  21. CNNs
    ImageNet Classification Error Over Time
    [Chart: classification error (y-axis, 0 to 30) by year (x-axis, 2010 to 2017)]

  22. 2012 ImageNet Classification with Deep Convolutional Neural Networks

  23. CNNs
    SuperVision: 8 layers, 60M parameters

  24. 2013 Visualizing and Understanding Convolutional Networks

  25. CNNs

  26. CNNs

  27. CNNs
    How Do Neural Networks Learn?
    More generic and can be reused as a feature extractor for other visual tasks
    Specific to task
    Cat / Dog

  28. Image Classification
    Deep Residual Learning for Image Recognition
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
    Microsoft Research
    {kahe, v-xiangz, v-shren, jiansun}@microsoft.com
    Abstract
    Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
    The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions (footnote 1), where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
    1. Introduction
    Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.
    Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].
    When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
    The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that
    Footnote 1: http://image-net.org/challenges/LSVRC/2015/ and http://mscoco.org/dataset/#detections-challenge2015.
    Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4. [Plots: training error (%) and test error (%) against iter. (1e4), each with a 20-layer and a 56-layer curve]
    arXiv:1512.03385v1 [cs.CV] 10 Dec 2015
    Densely Connected Convolutional Networks
    Gao Huang*, Cornell University, [email protected]
    Zhuang Liu*, Tsinghua University, [email protected]
    Laurens van der Maaten, Facebook AI Research, [email protected]
    Kilian Q. Weinberger, Cornell University, [email protected]
    Abstract
    Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.
    1. Introduction
    Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [29], and only last year Highway Networks [34] and Residual Networks (ResNets) [11] have surpassed the 100-layer barrier.
    As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and “wash out” by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different number of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.
    *Authors contributed equally
    Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input. [Diagram: feature-maps x0 to x4 connected through layers H1 to H4]
    arXiv:1608.06993v5 [cs.CV] 28 Jan 2018
    Inception Recurrent Convolutional Neural Network for Object Recognition
    Md Zahangir Alom, University of Dayton, Dayton, OH, USA, [email protected]
    Mahmudul Hasan, Comcast Labs, Washington, DC, USA, [email protected]
    Chris Yakopcic, University of Dayton, Dayton, OH, USA, [email protected]
    Tarek M. Taha, University of Dayton, Dayton, OH, USA, [email protected]
    Abstract
    Deep convolutional neural networks (DCNNs) are an influential tool for solving various problems in the machine learning and computer vision fields. In this paper, we introduce a new deep learning model called an Inception-Recurrent Convolutional Neural Network (IRCNN), which utilizes the power of an inception network combined with recurrent layers in DCNN architecture. We have empirically evaluated the recognition performance of the proposed IRCNN model using different benchmark datasets such as MNIST, CIFAR-10, CIFAR-100, and SVHN. Experimental results show similar or higher recognition accuracy when compared to most of the popular DCNNs including the RCNN. Furthermore, we have investigated IRCNN performance against equivalent Inception Networks and Inception-Residual Networks using the CIFAR-100 dataset. We report about 3.5%, 3.47% and 2.54% improvement in classification accuracy when compared to the RCNN, equivalent Inception Networks, and Inception-Residual Networks on the augmented CIFAR-100 dataset respectively.
    1. Introduction
    In recent years, deep learning using Convolutional Neural Networks (CNNs) has shown enormous success in the field of machine learning and computer vision. CNNs provide state-of-the-art accuracy in various image recognition tasks including object recognition (Schmidhuber, 2015; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015), object detection (Girshick et al., 2014), tracking (Wang et al., 2015), and image captioning (Xu et al., 2014). In addition, this technique has been applied massively in computer vision tasks such as video representation and classification of human activity (Ballas et al., 2015). Machine translation and natural language processing are applied deep learning techniques that show great success in this domain (Collobert & Weston, 2008; Manning et al., 2014). Furthermore, this technique has been used extensively in the field of speech recognition (Hinton et al., 2012). Moreover, deep learning is not limited to signal, natural language, image, and video processing tasks, it has been applying successfully for game development (Mnih et al., 2013; Lillicrap et al., 2015). There is a lot of ongoing research for developing even better performance and improving the training process of DCNNs (Lin et al., 2013; Springenberg et al., 2014; Goodfellow et al., 2013; Ioffe & Szegedy, 2015; Zeiler & Fergus, 2013).
    In some cases, machine intelligence shows better performance compared to human intelligence including calculation, chess, memory, and pattern matching. On the other hand, human intelligence still provides better performance in other fields such as object recognition, scene understanding, and more. Deep learning techniques (DCNNs in particular) perform very well in the domains of detection, classification, and scene understanding. There is a still a gap that must be closed before human level intelligence is reached when performing visual recognition tasks. Machine intelligence may open an opportunity to build a system that can process visual information the way that a human brain does. According to the study on the visual processing system within a human brain by James DiCarlo et al. (Zoccolan & Rust, 2012) the brain consists of several visual processing units starting with the visual cortex
    arXiv:1704.07709v1 [cs.CV] 25 Apr 2017
    2015-2017
    Supervised Image Classification

  29. Image Classification (ResNet)
    2015
    Supervised Image Classification
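
The residual idea from the ResNet abstract, a block that outputs F(x) + x so the layers only learn a residual F, can be sketched without any framework. The two residual functions below are hypothetical stand-ins for stacked layers, not the paper's architecture.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the learned F."""
    return [fx + xi for fx, xi in zip(residual_fn(x), x)]


def zero_residual(x):
    # Hypothetical stand-in for stacked layers whose weights are all zero.
    return [0.0 for _ in x]


def small_residual(x):
    # Hypothetical stand-in for layers that learned a small correction.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]

# With F = 0 the block is an identity mapping, so stacking many blocks
# leaves the signal untouched instead of degrading it.
identity_out = residual_block(x, zero_residual)
stacked_out = x
for _ in range(50):
    stacked_out = residual_block(stacked_out, zero_residual)

corrected = residual_block(x, small_residual)
print(identity_out, stacked_out, corrected)
```

This mirrors the construction argument in the paper's introduction: extra layers that act as identity mappings should not make a deeper model worse, and the shortcut makes that identity trivially reachable.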

  30. Image Classification (DenseNet)
    2016
    Supervised Image Classification
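
The L(L+1)/2 connection count quoted in the DenseNet abstract can be checked with a short sketch. The function names and the initial channel count `k0` are assumptions made for this illustration; the growth rate k = 4 comes from the figure caption of the paper.

```python
def plain_connections(num_layers):
    """A traditional network: one connection between each layer and the next."""
    return num_layers


def dense_connections(num_layers):
    """Dense block: layer i receives the block input plus all i-1 earlier outputs."""
    return sum(range(1, num_layers + 1))  # 1 + 2 + ... + L


def dense_block_input_channels(k0, growth_rate, num_layers):
    """Input channels seen by each layer when feature-maps are concatenated."""
    return [k0 + growth_rate * i for i in range(num_layers)]


L = 5
print(plain_connections(L), dense_connections(L), L * (L + 1) // 2)
print(dense_block_input_channels(k0=16, growth_rate=4, num_layers=L))
```

Because every layer adds only `growth_rate` new feature-maps and reuses the earlier ones, the layers can stay narrow, which is how the architecture keeps its parameter count low.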

  31. Image Classification (Inception)
    2017
    Supervised Image Classification

  32. Object Detection
    2016
    Supervised Object Detection
    SSD: Single Shot MultiBox Detector
    Wei Liu¹, Dragomir Anguelov², Dumitru Erhan³, Christian Szegedy³, Scott Reed⁴, Cheng-Yang Fu¹, Alexander C. Berg¹
    ¹UNC Chapel Hill, ²Zoox Inc., ³Google Inc., ⁴University of Michigan, Ann-Arbor
    ¹[email protected], [email protected], ³{dumitru,szegedy}@google.com, [email protected], ¹{cyfu,aberg}@cs.unc.edu
    Abstract. We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP (footnote 1) on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd .
    Keywords: Real-time Object Detection; Convolutional Neural Network
    1 Introduction
    Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN [2] albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.
    Footnote 1: We achieved even better results using an improved data augmentation scheme in follow-on experiments: 77.2% mAP for 300 × 300 input and 79.8% mAP for 512 × 512 input on VOC2007. Please see Sec. 3.6 for details.
    arXiv:1512.02325v5 [cs.CV] 29 Dec 2016
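
The "default boxes over different aspect ratios and scales per feature map location" from the SSD abstract can be sketched as follows. This is a simplified illustration, not the released implementation: the feature map size, scale, and aspect ratios are made-up values, and the real network additionally predicts class scores and box adjustments for every default box.

```python
import math


def default_boxes(feature_map_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes in relative [0, 1] image coordinates."""
    boxes = []
    for row in range(feature_map_size):
        for col in range(feature_map_size):
            # Center of the feature map cell.
            cx = (col + 0.5) / feature_map_size
            cy = (row + 0.5) / feature_map_size
            for ratio in aspect_ratios:
                # Same area for every ratio; only the shape changes.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical 3x3 feature map with three aspect ratios per location.
boxes = default_boxes(feature_map_size=3, scale=0.4, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))
print(boxes[:3])
```

Running the same function on several feature maps of different resolution, each with its own scale, gives the multi-scale set of boxes that lets a single network handle objects of various sizes.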

  33. Semantic Segmentation (Image)
    2016-2017
    Supervised Semantic Segmentation
    Fully Convolutional Networks for Semantic Segmentation
    Evan Shelhamer*, Jonathan Long*, and Trevor Darrell, Member, IEEE
    Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.
    Index Terms: Semantic Segmentation, Convolutional Networks, Deep Learning, Transfer Learning
    1 Introduction
    Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [1], [2], [3], but also making progress on local tasks with structured output. These include advances in bounding box object detection [4], [5], [6], part and keypoint prediction [7], [8], and local correspondence [8], [9].
    The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [10], [11], [12], [13], [14], [15], [16], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.
    We show that fully convolutional networks (FCNs) trained end-to-end, pixels-to-pixels on semantic segmentation exceed the previous best results without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampling.
    This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [10], [11], [12], [13], [16], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [12], [14], proposals [14], [15], or post-hoc refinement by random fields or local classifiers [12], [14]. Our model transfers recent success in classification [1], [2], [3] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [10], [12], [13].
    Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. What can be done to navigate this spectrum from location to semantics? How can local decisions respect global structure? It is not immediately clear that deep networks for image classification yield representations sufficient for accurate, pixelwise recognition.
    In the conference version of this paper [17], we cast pre-trained networks into fully convolutional form, and augment them with a skip architecture that takes advantage of the full feature spectrum. The skip architecture fuses the feature hierarchy to combine deep, coarse, semantic information and shallow, fine, appearance information (see Section 4.3 and Figure 3). In this light, deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid.
    This journal paper extends our earlier work [17] through further tuning, analysis, and more results. Alternative choices, ablations, and implementation details better cover the space of FCNs. Tuning optimization leads to more accurate networks and a means to learn skip architectures all-at-once instead of in stages. Experiments that mask foreground and background investigate the role of context and shape. Results on the object and scene labeling of PASCAL-Context reinforce merging object segmentation and scene parsing as unified pixelwise prediction.
    In the next section, we review related work on deep classification nets, FCNs, recent approaches to semantic segmentation using convnets, and extensions to FCNs. The fol-
    *Authors contributed equally
    E. Shelhamer, J. Long, and T. Darrell are with the Department of Electrical Engineering and Computer Science (CS Division), UC Berkeley. E-mail: {shelhamer,jonlong,trevor}@cs.berkeley.edu.
    arXiv:1605.06211v1 [cs.CV] 20 May 2016
    Pyramid Scene Parsing Network
    Hengshuang Zhao¹, Jianping Shi², Xiaojuan Qi¹, Xiaogang Wang¹, Jiaya Jia¹
    ¹The Chinese University of Hong Kong, ²SenseTime Group Limited
    {hszhao, xjqi, leojia}@cse.cuhk.edu.hk, [email protected], [email protected]
    Abstract
    Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
    1. Introduction
    Scene parsing, based on semantic segmentation, is a fun-
    damental topic in computer vision. The goal is to assign
    each pixel in the image a category label. Scene parsing pro-
    vides complete understanding of the scene. It predicts the
    label, location, as well as shape for each element. This topic
    is of broad interest for potential applications of automatic
    driving, robot sensing, to name a few.
    Difficulty of scene parsing is closely related to scene and
    label variety. The pioneer scene parsing task [23] is to clas-
    sify 33 scenes for 2,688 images on LMO dataset [22]. More
    recent PASCAL VOC semantic segmentation and PASCAL
    context datasets [8, 29] include more labels with similar
    context, such as chair and sofa, horse and cow, etc. The
    new ADE20K dataset [43] is the most challenging one with
    a large and unrestricted open vocabulary and more scene
    classes. A few representative images are shown in Fig. 1.
    To develop an effective algorithm for these datasets needs
    to conquer a few difficulties.
    State-of-the-art scene parsing frameworks are mostly
    based on the fully convolutional network (FCN) [26]. The
    deep convolutional neural network (CNN) based methods
    boost dynamic object understanding, and yet still face chal-
    Figure 1. Illustration of complex scenes in ADE20K dataset.
    lenges considering diverse scenes and unrestricted vocabu-
    lary. One example is shown in the first row of Fig. 2, where
    a boat is mistaken as a car. These errors are due to similar
    appearance of objects. But when viewing the image regard-
    ing the context prior that the scene is described as boathouse
    near a river, correct prediction should be yielded.
    Towards accurate scene perception, the knowledge graph
    relies on prior information of scene context. We found
    that the major issue for current FCN based models is lack
    of suitable strategy to utilize global scene category clues.
    For typical complex scene understanding, previously to get
    a global image-level feature, spatial pyramid pooling [18]
    was widely employed where spatial statistics provide a good
    descriptor for overall scene interpretation. Spatial pyramid
    pooling network [12] further enhances the ability.
    Different from these methods, to incorporate suitable
    global features, we propose pyramid scene parsing network
    (PSPNet). In addition to traditional dilated FCN [3, 40] for
    pixel prediction, we extend the pixel-level feature to the
    specially designed global pyramid pooling one. The local
    and global clues together make the final prediction more
    reliable. We also propose an optimization strategy with
    arXiv:1612.01105v2 [cs.CV] 27 Apr 2017
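    The global pyramid pooling idea in the PSPNet excerpt above can be
    sketched in a few lines. This is only an illustration under stated
    assumptions: the bin sizes (1, 2, 3, 6), the plain average pooling, and
    the names adaptive_avg_pool and pyramid_pool are choices made here, not
    details given in the excerpt, and the upsampling and concatenation with
    the pixel-level features are left out.

```python
def adaptive_avg_pool(feature_map, bins):
    """Average-pool a 2D list of numbers into a bins x bins grid."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for by in range(bins):
        # Region start uses floor, region end uses ceil, so regions cover the map.
        y0 = (by * height) // bins
        y1 = -((-(by + 1) * height) // bins)
        row = []
        for bx in range(bins):
            x0 = (bx * width) // bins
            x1 = -((-(bx + 1) * width) // bins)
            cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Pool the same feature map at several scales (bin sizes are an assumption)."""
    return {bins: adaptive_avg_pool(feature_map, bins) for bins in bin_sizes}


# Toy 6 x 6 feature map with values 0..35.
feature_map = [[float(y * 6 + x) for x in range(6)] for y in range(6)]
pyramid = pyramid_pool(feature_map)
for bins, grid in pyramid.items():
    print(f"{bins}x{bins} bins, top-left value: {grid[0][0]}")
```

    The 1x1 level summarizes the whole map in a single value (a global
    descriptor), while the finer levels keep progressively more spatial detail.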
    Rethinking Atrous Convolution for Semantic Image Segmentation
    Liang-Chieh Chen George Papandreou Florian Schroff Hartwig Adam
    Google Inc.
    {lcchen, gpapan, fschroff, hadam}@google.com
    Abstract
    In this work, we revisit atrous convolution, a powerful tool
    to explicitly adjust filter’s field-of-view as well as control the
    resolution of feature responses computed by Deep Convolu-
    tional Neural Networks, in the application of semantic image
    segmentation. To handle the problem of segmenting objects
    at multiple scales, we design modules which employ atrous
    convolution in cascade or in parallel to capture multi-scale
    context by adopting multiple atrous rates. Furthermore, we
    propose to augment our previously proposed Atrous Spatial
    Pyramid Pooling module, which probes convolutional fea-
    tures at multiple scales, with image-level features encoding
    global context and further boost performance. We also elab-
    orate on implementation details and share our experience
    on training our system. The proposed ‘DeepLabv3’ system
    significantly improves over our previous DeepLab versions
    without DenseCRF post-processing and attains comparable
    performance with other state-of-art models on the PASCAL
    VOC 2012 semantic image segmentation benchmark.
    1. Introduction
    For the task of semantic segmentation [20, 63, 14, 97, 7],
    we consider two challenges in applying Deep Convolutional
    Neural Networks (DCNNs) [50]. The first one is the reduced
    feature resolution caused by consecutive pooling operations
    or convolution striding, which allows DCNNs to learn in-
    creasingly abstract feature representations. However, this
    invariance to local image transformation may impede dense
    prediction tasks, where detailed spatial information is de-
    sired. To overcome this problem, we advocate the use of
    atrous convolution [36, 26, 74, 66], which has been shown
    to be effective for semantic image segmentation [10, 90, 11].
    Atrous convolution, also known as dilated convolution, al-
    lows us to repurpose ImageNet [72] pretrained networks
    to extract denser feature maps by removing the downsam-
    pling operations from the last few layers and upsampling
    the corresponding filter kernels, equivalent to inserting holes
    (‘trous’ in French) between filter weights. With atrous convolution,
    one is able to control the resolution at which feature
    responses are computed within DCNNs without requiring
    learning extra parameters.
    [Figure 1 panels: 3x3 convolution kernels applied to a feature map with rate = 1, rate = 6, and rate = 24.]
    Figure 1. Atrous convolution with kernel size 3 × 3 and different
    rates. Standard convolution corresponds to atrous convolution
    with rate = 1. Employing large value of atrous rate enlarges the
    model’s field-of-view, enabling object encoding at multiple scales.
    Another difficulty comes from the existence of objects
    at multiple scales. Several methods have been proposed to
    handle the problem and we mainly consider four categories
    in this work, as illustrated in Fig. 2. First, the DCNN is
    applied to an image pyramid to extract features for each
    scale input [22, 19, 69, 55, 12, 11] where objects at different
    scales become prominent at different feature maps. Sec-
    ond, the encoder-decoder structure [3, 71, 25, 54, 70, 68, 39]
    exploits multi-scale features from the encoder part and re-
    covers the spatial resolution from the decoder part. Third,
    extra modules are cascaded on top of the original network for
    capturing long range information. In particular, DenseCRF
    [45] is employed to encode pixel-level pairwise similarities
    [10, 96, 55, 73], while [59, 90] develop several extra convo-
    lutional layers in cascade to gradually capture long range
    context. Fourth, spatial pyramid pooling [11, 95] probes
    an incoming feature map with filters or pooling operations
    at multiple rates and multiple effective field-of-views, thus
    capturing objects at multiple scales.
    In this work, we revisit applying atrous convolution,
    which allows us to effectively enlarge the field of view of
    filters to incorporate multi-scale context, in the framework of
    both cascaded modules and spatial pyramid pooling. In par-
    ticular, our proposed module consists of atrous convolution
    with various rates and batch normalization layers which we
    arXiv:1706.05587v3 [cs.CV] 5 Dec 2017
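    A rough sketch of the atrous (dilated) convolution described in the
    DeepLabv3 excerpt above. To keep it short it works on a 1D signal rather
    than the 3 × 3 image kernels of the paper; the function names and the toy
    kernel are assumptions for illustration only. The point it shows is that
    a larger rate widens the field-of-view without adding weights.

```python
def field_of_view(kernel_size, rate):
    """Number of input positions spanned by a kernel with holes between its taps."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1D atrous convolution (cross-correlation form, as used in CNNs)."""
    span = field_of_view(len(kernel), rate)
    return [
        # Each kernel tap k reads the input `rate` positions after the previous tap.
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = list(range(60))   # toy input: a linear ramp
kernel = [1, 0, -1]        # toy 3-tap kernel (same 3 weights for every rate)

for rate in (1, 6, 24):
    out = atrous_conv1d(signal, kernel, rate)
    print(f"rate={rate:2d}  field-of-view={field_of_view(3, rate):2d}  outputs={len(out)}")
```

    With rate = 1 this is a standard convolution; rates 6 and 24 reuse the
    same three weights over windows of 13 and 49 input positions.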

  34. Semantic Segmentation (Image)
    2016-2017
    Supervised
    Semantic Segmentation

  35. Autonomous Driving Systems

  36. Real Time, Per Pixel Object Segmentation

  37. Centimeter-accurate Positioning

  39. What About Memory?
    Feedforward Neural Networks: input → output
    Recurrent Neural Networks: input → output, with state(t) kept in memory

  40. RNNs
    https://en.wikipedia.org/wiki/Long_short-term_memory
    Long Short-Term Memory (LSTM)
    How much
    goes into
    memory
    How much
    is used
    in computing
    the output
    How much
    remains in
    memory
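    The three questions on this slide map onto the three gates of an LSTM
    cell. Below is a minimal scalar sketch, assuming the standard LSTM
    equations; the function lstm_step, the params layout, and all weight
    values are made up for illustration and do not come from the deck.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))         # how much goes into memory
    f = sigmoid(pre("forget"))        # how much remains in memory
    o = sigmoid(pre("output"))        # how much is used in computing the output
    g = math.tanh(pre("candidate"))   # candidate value that could be written
    c = f * c_prev + i * g            # updated memory (cell state)
    h = o * math.tanh(c)              # output (hidden state)
    return h, c


# Hypothetical weights, chosen only to make the example run.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.0, 0.0, 2.0),
    "output": (0.3, 0.2, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f}  memory={c:+.4f}  output={h:+.4f}")
```

    In a trained network the weights are learned, and x, h, and c are vectors
    rather than single numbers.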

  41. SOCKEYE:
    A Toolkit for Neural Machine Translation
    Felix Hieber, Tobias Domhan, Michael Denkowski,
    David Vilar, Artem Sokolov, Ann Clifton, Matt Post
    {fhieber,domhant,mdenkows,dvilar,artemsok,acclift,mattpost}@amazon.com
    Abstract
    We describe SOCKEYE,1 an open-source sequence-to-sequence toolkit for Neural
    Machine Translation (NMT). SOCKEYE is a production-ready framework for
    training and applying models as well as an experimental platform for researchers.
    Written in Python and built on MXNET, the toolkit offers scalable training and
    inference for the three most prominent encoder-decoder architectures: attentional
    recurrent neural networks, self-attentional transformers, and fully convolutional
    networks. SOCKEYE also supports a wide range of optimizers, normalization and
    regularization techniques, and inference improvements from current NMT literature.
    Users can easily run standard training recipes, explore different model settings, and
    incorporate new ideas. In this paper, we highlight SOCKEYE’s features and bench-
    mark it against other NMT toolkits on two language arcs from the 2017 Conference
    on Machine Translation (WMT): English–German and Latvian–English. We report
    competitive BLEU scores across all three architectures, including an overall best
    score for SOCKEYE’s transformer implementation. To facilitate further comparison,
    we release all system outputs and training scripts used in our experiments. The
    SOCKEYE toolkit is free software released under the Apache 2.0 license.
    1 Introduction
    The past two years have seen a deep learning revolution bring rapid and dramatic change to the field
    of machine translation. For users, new neural network-based models consistently deliver better quality
    translations than the previous generation of phrase-based systems. For researchers, Neural Machine
    Translation (NMT) provides an exciting new landscape where training pipelines are simplified and
    unified models can be trained directly from data. The promise of moving beyond the limitations of
    Statistical Machine Translation (SMT) has energized the community, leading recent work to focus
    almost exclusively on NMT and seemingly advance the state of the art every few months.
    For all its success, NMT also presents a range of new challenges. While popular encoder-decoder
    models are attractively simple, recent literature and the results of shared evaluation tasks show that
    a significant amount of engineering is required to achieve “production-ready” performance in both
    translation quality and computational efficiency. In a trend that carries over from SMT, the strongest
    NMT systems benefit from subtle architecture modifications, hyper-parameter tuning, and empirically
    effective heuristics. Unlike SMT, there is no “de-facto” toolkit that attracts most of the community’s
    attention and thus contains all the best ideas from recent literature.2 Instead, the presence of many
    independent toolkits3 brings diversity to the field, but also makes it difficult to compare architectural
    and algorithmic improvements that are each implemented in different toolkits.
    1https://github.com/awslabs/sockeye (version 1.12)
    2For SMT, this role was largely filled by MOSES [Koehn et al., 2007].
    3https://github.com/jonsafari/nmt-list
    arXiv:1712.05690v1 [cs.CL] 15 Dec 2017
    Sequence to Sequence (seq2seq)
    • seq2seq is a supervised learning algorithm where the
    input is a sequence of tokens (for example, text,
    audio) and the output generated is another
    sequence of tokens.
    • Example applications include:
    • machine translation (input a sentence from
    one language and predict what that sentence
    would be in another language)
    • text summarization (input a longer string of
    words and predict a shorter string of words
    that is a summary)
    • speech-to-text (audio clips converted into
    output sentences in tokens).
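    The "sequence of tokens in, sequence of tokens out" idea can be sketched
    as a greedy decoding loop. Everything model-related here is a
    placeholder: toy_next_token and TOY_LEXICON are hypothetical stand-ins
    for a trained encoder-decoder, and the special tokens are assumptions.
    Only the loop structure (emit one token at a time until an end-of-sequence
    token) is the point.

```python
BOS, EOS = "<s>", "</s>"  # assumed begin/end-of-sequence markers


def greedy_decode(source_tokens, next_token, max_len=20):
    """Emit target tokens one by one until EOS or until max_len entries (incl. BOS)."""
    output = [BOS]
    while len(output) < max_len:
        token = next_token(source_tokens, output)
        if token == EOS:
            break
        output.append(token)
    return output[1:]  # drop the BOS marker


# Hypothetical stand-in for a trained model: a word-by-word lookup table.
TOY_LEXICON = {"das": "the", "grüne": "green", "haus": "house"}


def toy_next_token(source_tokens, output_so_far):
    position = len(output_so_far) - 1  # number of target tokens emitted so far
    if position >= len(source_tokens):
        return EOS
    return TOY_LEXICON.get(source_tokens[position].lower(), "<unk>")


print(greedy_decode("Das grüne Haus".split(), toy_next_token))
```

    A real model replaces the lookup with a learned distribution over the
    target vocabulary, conditioned on the whole source sequence and on the
    tokens produced so far.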
    2014-2017
    Supervised
    Text, Audio

  42. SOCKEYE:
    A Toolkit for Neural Machine Translation
    Sequence to Sequence (seq2seq)
    • Recently, problems in this domain have been
    successfully modeled with deep neural networks
    that show a significant performance boost over
    previous methodologies.
    • Amazon released the Sockeye package as open
    source. It uses Recurrent Neural Network (RNN)
    and Convolutional Neural Network (CNN) models
    with attention as encoder-decoder architectures.
    • https://github.com/awslabs/sockeye
    • provides an experimental image-to-
    description module
    2014-2017
    Supervised
    Text, Audio

  43. Sequence to Sequence (seq2seq)
    https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
    2014-2017
    Supervised
    Text, Audio

  44. Sequence to Sequence (seq2seq)
    https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye
    “Das grüne Haus”
    “the Green House”
    2014-2017
    Supervised
    Text, Audio

  45. “Sentence to synthesize”
    ˈsɛntəns tə ˈsɪnθəˌsaɪz.
    Concatenative TTS Neural TTS
    Text
    Phonetic transcription
    ˈsɛnt sɛntəns tə ˈsɪnθ əˌsaɪz.
    Improving text-to-speech

  46. US English Matthew voice
    “Sources tell CNN he believes the
    media and the northeast elite are
    needlessly hyperventilating and
    overreacting to his comments.”
    US English Joanna voice
    “President Donald Trump said on
    March 13 his administration was
    ordering the grounding of all Max 8
    and 9 models, hours after Canada
    said it was grounding the planes after
    analyzing new satellite tracking data.”
    Amazon Polly NTTS and newscaster style
    https://aws.amazon.com/blogs/aws/amazon-polly-introduces-neural-text-to-speech-and-newscaster-style/

  47. Latent Dirichlet Allocation (LDA)
    Copyright © 2000 by the Genetics Society of America
    Inference of Population Structure Using Multilocus Genotype Data
    Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly
    Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
    Manuscript received September 23, 1999
    Accepted for publication February 18, 2000
    ABSTRACT
    We describe a model-based clustering method for using multilocus genotype data to infer population
    structure and assign individuals to populations. We assume a model in which there are K populations
    (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus.
    Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more popula-
    tions if their genotypes indicate that they are admixed. Our model does not assume a particular mutation
    process, and it can be applied to most of the commonly used genetic markers, provided that they are not
    closely linked. Applications of our method include demonstrating the presence of population structure,
    assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals.
    We show that the method can produce highly accurate assignments using modest numbers of loci, e.g.,
    seven microsatellite loci in an example using genotype data from an endangered bird species. The software
    used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
    IN applications of population genetics, it is often useful
    to classify individuals in a sample into populations.
    In one scenario, the investigator begins with a
    sample of individuals and wants to say something about
    the properties of populations. For example, in studies
    of human evolution, the population is often considered
    to be the unit of interest, and a great deal of work has
    focused on learning about the evolutionary relationships
    of modern populations (e.g., Cavalli et al. 1994).
    In a second scenario, the investigator begins with a set
    of predefined populations and wishes to classify individuals
    of unknown origin. This type of problem arises
    in many contexts (reviewed by Davies et al. 1999). A
    standard approach involves sampling DNA from members
    of a number of potential source populations and
    using these samples to estimate allele frequencies in
    each population at a series of unlinked loci. Using the
    estimated allele frequencies, it is then possible to compute
    the likelihood that a given genotype originated in
    each population. Individuals of unknown origin can be
    assigned to populations according to these likelihoods
    (Paetkau et al. 1995; Rannala and Mountain 1997).
    In both situations described above, a crucial first step
    is to define a set of populations. The definition of populations
    is typically subjective, based, for example, on
    linguistic, cultural, or physical characters, as well as the
    geographic location of sampled individuals. This subjective
    approach is usually a sensible way of incorporating
    diverse types of information. However, it may be difficult
    to know whether a given assignment of individuals to
    populations based on these subjective criteria represents
    a natural assignment in genetic terms, and it would be
    useful to be able to confirm that subjective classifications
    are consistent with genetic information and hence appropriate
    for studying the questions of interest. Further,
    there are situations where one is interested in “cryptic”
    population structure, i.e., population structure that is
    difficult to detect using visible characters, but may be
    significant in genetic terms. For example, when association
    mapping is used to find disease genes, the presence
    of undetected population structure can lead to spurious
    associations and thus invalidate standard tests (Ewens
    and Spielman 1995). The problem of cryptic population
    structure also arises in the context of DNA fingerprinting
    for forensics, where it is important to assess the
    degree of population structure to estimate the probability
    of false matches (Balding and Nichols 1994, 1995;
    Foreman et al. 1997; Roeder et al. 1998).
    Pritchard and Rosenberg (1999) considered how
    genetic information might be used to detect the presence
    of cryptic population structure in the association
    mapping context. More generally, one would like to be
    able to identify the actual subpopulations and assign
    individuals (probabilistically) to these populations. In
    this article we use a Bayesian clustering approach to
    tackle this problem. We assume a model in which there
    are K populations (where K may be unknown), each of
    which is characterized by a set of allele frequencies at
    each locus. Our method attempts to assign individuals
    to populations on the basis of their genotypes, while
    simultaneously estimating population allele frequencies.
    The method can be applied to various types of
    markers [e.g., microsatellites, restriction fragment
    length polymorphisms (RFLPs), or single nucleotide
    polymorphisms (SNPs)], but it assumes that the marker
    Corresponding author: Jonathan Pritchard, Department of Statistics,
    University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom.
    E-mail: [email protected]
    Genetics 155: 945–959 (June 2000)
    Journal of Machine Learning Research 3 (2003) 993-1022 Submitted 2/02; Published 1/03
    Latent Dirichlet Allocation
    David M. Blei [email protected]
    Computer Science Division
    University of California
    Berkeley, CA 94720, USA
    Andrew Y. Ng [email protected]
    Computer Science Department
    Stanford University
    Stanford, CA 94305, USA
    Michael I. Jordan [email protected]
    Computer Science Division and Department of Statistics
    University of California
    Berkeley, CA 94720, USA
    Editor: John Lafferty
    Abstract
    We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
    discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
    item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
    turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
    text modeling, the topic probabilities provide an explicit representation of a document. We present
    efficient approximate inference techniques based on variational methods and an EM algorithm for
    empirical Bayes parameter estimation. We report results in document modeling, text classification,
    and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI
    model.
    1. Introduction
    In this paper we consider the problem of modeling text corpora and other collections of discrete
    data. The goal is to find short descriptions of the members of a collection that enable efficient
    processing of large collections while preserving the essential statistical relationships that are useful
    for basic tasks such as classification, novelty detection, summarization, and similarity and relevance
    judgments.
    Significant progress has been made on this problem by researchers in the field of informa-
    tion retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by
    IR researchers for text corpora—a methodology successfully deployed in modern Internet search
    engines—reduces each document in the corpus to a vector of real numbers, each of which repre-
    sents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary
    of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the
    number of occurrences of each word. After suitable normalization, this term frequency count is
    compared to an inverse document frequency count, which measures the number of occurrences of a
    © 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan.
    2000-2003
    Unsupervised
    Topic Modeling

  48. Latent Dirichlet Allocation (LDA)
    • As an extremely simple example, given a set of documents where the
    only words that occur within them are eat, sleep, play, meow, and
    bark, LDA might produce topics like the following:
            Topic     eat   sleep   play   meow   bark
    Cats?   Topic 1   0.1   0.3     0.2    0.4    0.0
    Dogs?   Topic 2   0.2   0.1     0.4    0.0    0.3
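    One way to read the table: each topic is a probability distribution over
    words, and a document mixes topics. The sketch below hard-codes the two
    topics from the table and computes the word distribution of a document
    with a given topic mixture. It illustrates the generative side of LDA
    only, not the inference that learns the topics; the function name and
    the example mixture (75% Topic 1, 25% Topic 2) are assumptions.

```python
# Per-topic word probabilities, copied from the table above.
TOPICS = {
    "Topic 1": {"eat": 0.1, "sleep": 0.3, "play": 0.2, "meow": 0.4, "bark": 0.0},
    "Topic 2": {"eat": 0.2, "sleep": 0.1, "play": 0.4, "meow": 0.0, "bark": 0.3},
}


def word_distribution(topic_mixture):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    vocabulary = list(next(iter(TOPICS.values())))
    return {
        word: sum(weight * TOPICS[topic][word] for topic, weight in topic_mixture.items())
        for word in vocabulary
    }


# Hypothetical document that is mostly about Topic 1.
mixture = {"Topic 1": 0.75, "Topic 2": 0.25}
for word, prob in word_distribution(mixture).items():
    print(f"{word:5s} {prob:.3f}")
```

    A document that draws mostly on Topic 1 is far more likely to contain
    "meow" than "bark", which is what suggests the "Cats?" label.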
    2000-2003
    Unsupervised
    Topic Modeling

  49. Neural Topic Model (NTM)
    Encoder: feedforward net
    Input term counts vector
    µ
    z
    Document
    Posterior
    Sampled Document
    Representation
    Decoder:
    Softmax
    Neural Variational Inference for Text Processing
    Yishu Miao1 [email protected]
    Lei Yu1 [email protected]
    Phil Blunsom12 [email protected]
    1University of Oxford, 2Google Deepmind
    Abstract
    Recent advances in neural variational inference
    have spawned a renaissance in deep latent vari-
    able models. In this paper we introduce a generic
    variational inference framework for generative
    and conditional models of text. While traditional
    variational methods derive an analytic approxi-
    mation for the intractable distributions over latent
    variables, here we construct an inference network
    conditioned on the discrete text input to pro-
    vide the variational distribution. We validate this
    framework on two very different text modelling
    applications, generative document modelling and
    supervised question answering. Our neural vari-
    ational document model combines a continuous
    stochastic document representation with a bag-
    of-words generative model and achieves the low-
    est reported perplexities on two standard test cor-
    pora. The neural answer selection model em-
    ploys a stochastic representation layer within an
    attention mechanism to extract the semantics be-
    tween a question and answer pair. On two ques-
    tion answering benchmarks this model exceeds
    all previous published benchmarks.
    1. Introduction
    Probabilistic generative models underpin many successful
    applications within the field of natural language process-
    ing (NLP). Their popularity stems from their ability to use
    unlabelled data effectively, to incorporate abundant linguis-
    tic features, and to learn interpretable dependencies among
    data. However these successes are tempered by the fact that
    as the structure of such generative models becomes deeper
    and more complex, true Bayesian inference becomes in-
    tractable due to the high dimensional integrals required.
    Proceedings of the 33rd International Conference on Machine
    Learning, New York, NY, USA, 2016. JMLR: W&CP volume
    48. Copyright 2016 by the author(s).
    Markov chain Monte Carlo (MCMC) (Neal, 1993; Andrieu
    et al., 2003) and variational inference (Jordan et al., 1999;
    Attias, 2000; Beal, 2003) are the standard approaches for
    approximating these integrals. However the computational
    cost of the former results in impractical training for the
    large and deep neural networks which are now fashion-
    able, and the latter is conventionally confined due to the
    underestimation of posterior variance. The lack of effec-
    tive and efficient inference methods hinders our ability to
    create highly expressive models of text, especially in the
    situation where the model is non-conjugate.
    This paper introduces a neural variational framework for
    generative models of text, inspired by the variational auto-
    encoder (Rezende et al., 2014; Kingma & Welling, 2014).
    The principle idea is to build an inference network, imple-
    mented by a deep neural network conditioned on text, to ap-
    proximate the intractable distributions over the latent vari-
    ables. Instead of providing an analytic approximation, as in
    traditional variational Bayes, neural variational inference
    learns to model the posterior probability, thus endowing
    the model with strong generalisation abilities. Due to the
    flexibility of deep neural networks, the inference network
    is capable of learning complicated non-linear distributions
    and processing structured inputs such as word sequences.
    Inference networks can be designed as, but not restricted
    to, multilayer perceptrons (MLP), convolutional neural net-
    works (CNN), and recurrent neural networks (RNN), ap-
    proaches which are rarely used in conventional generative
    models. By using the reparameterisation method (Rezende
    et al., 2014; Kingma & Welling, 2014), the inference net-
    work is trained through back-propagating unbiased and low
    variance gradients w.r.t. the latent variables. Within this
    framework, we propose a Neural Variational Document
    Model (NVDM) for document modelling and a Neural An-
    swer Selection Model (NASM) for question answering, a
    task that selects the sentences that correctly answer a fac-
    toid question from a set of candidate sentences.
    The NVDM (Figure 1) is an unsupervised generative model
    of text which aims to extract a continuous semantic latent
    variable for each document. This model can be interpreted
    as a variational auto-encoder: an MLP encoder (inference
    arXiv:1511.06038v4 [cs.CL] 4 Jun 2016
    Output term counts vector
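    The middle of the diagram (document posterior, sampled representation z,
    softmax decoder) can be sketched as follows. This is an illustration
    under assumptions: the encoder is skipped, so mu and log_sigma are
    hard-coded as if an encoder had produced them, and the decoder weights,
    the latent size (2), and the vocabulary size (4) are invented for the
    example.

```python
import math
import random


def sample_latent(mu, log_sigma, rng):
    """Reparameterisation: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(z, weights, bias):
    """Softmax decoder: one logit per vocabulary term, linear in z."""
    logits = [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)

# Hypothetical encoder outputs for one document (2 latent dimensions).
mu, log_sigma = [0.5, -1.0], [-1.0, -1.0]
# Hypothetical decoder parameters for a 4-term vocabulary.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]

z = sample_latent(mu, log_sigma, rng)
term_probs = decode(z, weights, bias)
print("z =", [round(v, 3) for v in z])
print("term probabilities =", [round(p, 3) for p in term_probs])
```

    Because the noise eps is sampled separately from mu and sigma, gradients
    can flow through z back to the encoder during training.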
    2015
    Unsupervised
    Topic Modeling

  50. Random Cut Forest (RCF)
    2004-2016
    Unsupervised
    Anomaly Detection
    Robust Random Cut Forest Based Anomaly Detection On Streams
    Sudipto Guha [email protected]
    University of Pennsylvania, Philadelphia, PA 19104.
    Nina Mishra [email protected]
    Amazon, Palo Alto, CA 94303.
    Gourav Roy [email protected]
    Amazon, Bangalore, India 560055.
    Okke Schrijvers [email protected]
    Stanford University, Palo Alto, CA 94305.
    Abstract
    In this paper we focus on the anomaly detection
    problem for dynamic data streams through the
    lens of random cut forests. We investigate a ro-
    bust random cut data structure that can be used
    as a sketch or synopsis of the input stream. We
    provide a plausible definition of non-parametric
    anomalies based on the influence of an unseen
    point on the remainder of the data, i.e., the exter-
    nality imposed by that point. We show how the
    sketch can be efficiently updated in a dynamic
    data stream. We demonstrate the viability of the
    algorithm on publicly available real data.
    1. Introduction
    Anomaly detection is one of the cornerstone problems in
    data mining. Even though the problem has been well stud-
    ied over the last few decades, the emerging explosion of
    data from the internet of things and sensors leads us to re-
    consider the problem. In most of these contexts the data
    is streaming and well-understood prior models do not ex-
    ist. Furthermore the input streams need not be append only,
    there may be corrections, updates and a variety of other dy-
    namic changes. Two central questions in this regard are
    (1) how do we define anomalies? and (2) what data struc-
    ture do we use to efficiently detect anomalies over dynamic
    data streams? In this paper we initiate the formal study of
    both of these questions. For (1), we view the problem from
    the perspective of model complexity and say that a point is
    an anomaly if the complexity of the model increases substantially
    with the inclusion of the point.
    Proceedings of the 33rd International Conference on Machine
    Learning, New York, NY, USA, 2016. JMLR: W&CP volume
    48. Copyright 2016 by the author(s).
    The labeling of a point is data dependent and corresponds to the externality
    imposed by the point in explaining the remainder of the
    data. We extend this notion of externality to handle “outlier
    masking” that often arises from duplicates and near dupli-
    cate records. Note that the notion of model complexity has
    to be amenable to efficient computation in dynamic data
    streams. This relates question (1) to question (2) which we
    discuss in greater detail next. However it is worth noting
    that anomaly detection is not well understood even in the
    simpler context of static batch processing and (2) remains
    relevant in the batch setting as well.
    For question (2), we explore a randomized approach, akin
    to (Liu et al., 2012), due in part to the practical success re-
    ported in (Emmott et al., 2013). Randomization is a pow-
    erful tool and known to be valuable in supervised learn-
    ing (Breiman, 2001). But its technical exploration in the
    context of anomaly detection is not well-understood and
    the same comment applies to the algorithm put forth in (Liu
    et al., 2012). Moreover that algorithm has several lim-
    itations as described in Section 4.1. In particular, we
    show that in the presence of irrelevant dimensions, cru-
    cial anomalies are missed. In addition, it is unclear how
    to extend this work to a stream. Prior work attempted so-
    lutions (Tan et al., 2011) that extend to streaming, however
    those were not found to be effective (Emmott et al., 2013).
    To address these limitations, we put forward a sketch or
    synopsis termed robust random cut forest (RRCF) formally
    defined as follows.
    Definition 1 A robust random cut tree (RRCT) on point
    set S is generated as follows:
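    1. Choose a random dimension proportional to $\ell_i / \sum_j \ell_j$,
       where $\ell_i = \max_{x \in S} x_i - \min_{x \in S} x_i$.
    2. Choose $X_i \sim \mathrm{Uniform}[\min_{x \in S} x_i, \max_{x \in S} x_i]$.
    3. Let $S_1 = \{x \mid x \in S, x_i \le X_i\}$ and $S_2 = S \setminus S_1$,
       and recurse on $S_1$ and $S_2$.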
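    The three steps of Definition 1 can be sketched in a few lines of Python. This is
    only an illustration under stated assumptions: the names `build_rrct` and
    `leaf_points`, the dict-based node layout, and the stopping rule for duplicate
    points are hypothetical choices, not taken from the paper or from any library.

```python
import random


def build_rrct(points, rng):
    """Build a robust random cut tree over a list of equal-length tuples."""
    # Stop when nothing is left to separate (one point, or only duplicates).
    if len(points) <= 1 or all(p == points[0] for p in points):
        return {"points": list(points)}

    dims = range(len(points[0]))
    lows = [min(p[i] for p in points) for i in dims]
    highs = [max(p[i] for p in points) for i in dims]
    spans = [hi - lo for lo, hi in zip(lows, highs)]

    # Step 1: choose dimension i with probability l_i / sum_j l_j.
    i = rng.choices(list(dims), weights=spans, k=1)[0]

    while True:
        # Step 2: choose the cut X_i uniformly within the range of dimension i.
        cut = rng.uniform(lows[i], highs[i])
        # Step 3: S1 holds the points with x_i <= X_i, S2 holds the rest.
        s1 = [p for p in points if p[i] <= cut]
        s2 = [p for p in points if p[i] > cut]
        if s1 and s2:  # redraw in the rare case the cut lands on the maximum
            break

    return {"dim": i, "cut": cut,
            "left": build_rrct(s1, rng), "right": build_rrct(s2, rng)}


def leaf_points(tree):
    """Collect the points stored in the leaves, left to right."""
    if "points" in tree:
        return list(tree["points"])
    return leaf_points(tree["left"]) + leaf_points(tree["right"])
```

    Because wide dimensions are chosen more often, a far-away point tends to be cut
    off early. A forest would repeat this over several random samples; anomaly
    scoring and streaming updates are not shown here.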


  51. Random Cut Forest (RCF)
    2004-2016
    Unsupervised
    Anomaly Detection


  54. Shingling
    • The idea is to treat a period of P datapoints as a single datapoint of feature
    length P and then run the algorithm on these feature vectors
    • This is especially useful when working with periodic data with known
    period
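    A minimal sketch of shingling, assuming overlapping (sliding) windows of length P.
    The function name `shingle` is hypothetical, and whether windows overlap or are
    taken back to back is a choice this sketch makes, not something the slide states.

```python
def shingle(series, period):
    """Turn a 1-D series into overlapping feature vectors of length `period`."""
    if period < 1 or period > len(series):
        raise ValueError("period must be between 1 and len(series)")
    # Window i covers series[i], ..., series[i + period - 1].
    return [tuple(series[i:i + period])
            for i in range(len(series) - period + 1)]


readings = [1, 2, 3, 4, 5]
vectors = shingle(readings, 3)
```

    Each resulting tuple can then be passed to the anomaly detector as one point with
    P features, so a value that breaks the usual shape of a period stands out.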


  55. Random Cut Forest (RCF)
    2004-2016
    Unsupervised
    Anomaly Detection
    Using “shingling”


  56. Anomaly Detection to Improve
    Infrastructure and Application Monitoring
    Amazon CloudWatch


  57. Time Series Forecasting (DeepAR)
    DeepAR: Probabilistic Forecasting with
    Autoregressive Recurrent Networks
    Valentin Flunkert, David Salinas, Jan Gasthaus
    Amazon Development Center Germany

    Abstract
    Probabilistic forecasting, i.e. estimating the probability distribution of a time
    series’ future given its past, is a key enabler for optimizing business processes. In
    retail businesses, for example, forecasting demand is crucial for having the right
    inventory available at the right time at the right place. In this paper we propose
    DeepAR, a methodology for producing accurate probabilistic forecasts, based on
    training an auto-regressive recurrent network model on a large number of related
    time series. We demonstrate how by applying deep learning techniques to forecasting,
    one can overcome many of the challenges faced by widely-used classical
    approaches to the problem. We show through extensive empirical evaluation on
    several real-world forecasting data sets that our methodology produces more accurate
    forecasts than other state-of-the-art methods, while requiring minimal manual
    work.
    1 Introduction
    Forecasting plays a key role in automating and optimizing operational processes in most businesses
    and enables data driven decision making. In retail for example, probabilistic forecasts of product
    supply and demand can be used for optimal inventory management, staff scheduling and topology
    planning [17], and are more generally a crucial technology for most aspects of supply chain
    optimization.
    The prevalent forecasting methods in use today have been developed in the setting of forecasting
    individual or small groups of time series. In this approach, model parameters for each given time
    series are independently estimated from past observations. The model is typically manually selected
    to account for different factors, such as autocorrelation structure, trend, seasonality, and other
    explanatory variables. The fitted model is then used to forecast the time series into the future according
    to the model dynamics, possibly admitting probabilistic forecasts through simulation or closed-form
    expressions for the predictive distributions. Many methods in this class are based on the classical
    Box-Jenkins methodology [3], exponential smoothing techniques, or state space models [11, 18].
    In recent years, a new type of forecasting problem has become increasingly important in many
    applications. Instead of needing to predict individual or a small number of time series, one is faced with
    forecasting thousands or millions of related time series. Examples include forecasting the energy
    consumption of individual households, forecasting the load for servers in a data center, or
    forecasting the demand for all products that a large retailer offers. In all these scenarios, a substantial amount
    of data on past behavior of similar, related time series can be leveraged for making a forecast for an
    individual time series. Using data from related time series not only allows fitting more complex (and
    hence potentially more accurate) models without overfitting, it can also alleviate the time and labor
    intensive manual feature engineering and model selection steps required by classical techniques.
    *equal contribution
    arXiv:1704.04110v2 [cs.AI] 5 Jul 2017
    2017
    Supervised
    Time Series Forecasting
    • DeepAR is a supervised learning algorithm for
    forecasting scalar time series using recurrent neural
    networks (RNN)
    • Classical forecasting methods fit one model to each
    individual time series, and then use that model to
    extrapolate the time series into the future
    • In many applications you might have many similar time
    series across a set of cross-sectional units
    • For example, demand for different products, load of servers,
    requests for web pages, and so on
    • In this case, it can be beneficial to train a single model
    jointly over all of these time series
    • DeepAR takes this approach, training a model for predicting a
    time series over a large set of (related) time series
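    To make "a single model trained jointly over all of these time series" concrete,
    here is a minimal sketch of packing several related series into one training file
    with one JSON object per line. The `start` and `target` field names are an
    assumption based on the JSON Lines layout described for the SageMaker DeepAR
    algorithm and should be checked against the current documentation; the function
    name, product names, and values are made up for illustration.

```python
import json


def to_jsonl(series_by_name, start):
    """Serialize several related series as one JSON object per line."""
    lines = []
    for name in sorted(series_by_name):
        # Assumed layout: a start timestamp plus the list of observed values.
        record = {"start": start, "target": list(series_by_name[name])}
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Hypothetical daily demand for three related products.
demand = {
    "product_a": [12, 15, 11, 18],
    "product_b": [3, 4, 2, 5],
    "product_c": [40, 38, 45, 50],
}
training_file = to_jsonl(demand, "2019-10-01 00:00:00")
```

    The point of the sketch is the shape of the data: every line is a separate series,
    and all lines feed the same model, instead of fitting one model per series.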


  58. Time Series Forecasting (DeepAR)
    2017
    Supervised
    Time Series Forecasting


  59. Time Series Forecasting (DeepAR)


  60. BlazingText (Word2vec)
    BlazingText: Scaling and Accelerating Word2Vec using Multiple
    GPUs
    Saurabh Gupta
    Amazon Web Services
    [email protected]
    Vineet Khare
    Amazon Web Services
    [email protected]
    ABSTRACT
    Word2Vec is a popular algorithm used for generating dense vector
    representations of words in large corpora using unsupervised learning.
    The resulting vectors have been shown to capture semantic
    relationships between the corresponding words and are used extensively
    for many downstream natural language processing (NLP)
    tasks like sentiment analysis, named entity recognition and machine
    translation. Most open-source implementations of the algorithm
    have been parallelized for multi-core CPU architectures including
    the original C implementation by Mikolov et al. [1] and FastText
    [2] by Facebook. A few other implementations have attempted to
    leverage GPU parallelization but at the cost of accuracy and scalability.
    In this work, we present BlazingText, a highly optimized
    implementation of word2vec in CUDA, that can leverage multiple
    GPUs for training. BlazingText can achieve a training speed of up to
    43M words/sec on 8 GPUs, which is a 9x speedup over 8-threaded
    CPU implementations, with minimal effect on the quality of the
    embeddings.
    CCS CONCEPTS
    • Computing methodologies → Neural networks; Natural
    language processing;
    KEYWORDS
    Word embeddings, Word2Vec, Natural Language Processing, Machine Learning, CUDA, GPU
    ACM Reference format:
    Saurabh Gupta and Vineet Khare. 2017. BlazingText: Scaling and Accelerating
    Word2Vec using Multiple GPUs. In Proceedings of MLHPC’17: Machine
    Learning in HPC Environments, Denver, CO, USA, November 12–17, 2017,
    5 pages.
    https://doi.org/10.1145/3146347.3146354
    MLHPC’17: Machine Learning in HPC Environments, November 12–17, 2017, Denver, CO, USA
    © 2017 Copyright held by the owner/author(s).
    ACM ISBN 978-1-4503-5137-9/17/11.
    https://doi.org/10.1145/3146347.3146354
    1 INTRODUCTION
    Word2Vec aims to represent each word as a vector in a low-dimensional
    embedding space such that the geometry of resulting vectors captures
    word semantic similarity through the cosine similarity of corresponding
    vectors as well as more complex relationships through
    vector subtractions, such as vec(“King”) - vec(“Queen”) + vec(“Woman”)
    ≈ vec(“Man”). This idea has enabled many Natural Language Processing
    (NLP) algorithms to achieve better performance [3, 4].
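    The vector arithmetic in the paragraph above can be tried on a toy example. This is
    a sketch only: the two-dimensional vectors below are hand-made so that the analogy
    works, they are not trained embeddings, and the helper names `cosine` and `analogy`
    are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hand-made toy vectors (NOT trained embeddings), chosen for illustration.
toy = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.5, 0.1),
}
closest = analogy("king", "queen", "woman", toy)
```

    With real embeddings the match is approximate, which is why the nearest neighbor by
    cosine similarity is used instead of an exact comparison.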
    The optimization in word2vec is done using Stochastic Gradient
    Descent (SGD), which solves the problem iteratively; at each step,
    it picks a pair of words: an input word and a target word either
    from its window or a random negative sample. It then computes the
    gradients of the objective function with respect to the two chosen
    words, and updates the word representations of the two words
    based on the gradient values. The algorithm then proceeds to the
    next iteration with a different word pair being chosen.
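    One such SGD step can be sketched as follows, assuming the usual negative-sampling
    logistic loss (label 1 for a word from the window, label 0 for a random negative
    sample). The function names and the learning rate are hypothetical, and this is a
    plain Python illustration, not the BlazingText or original C implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_step(in_vec, out_vec, label, lr):
    """Update both word vectors for one (input word, target word) pair.

    label is 1 for a target taken from the window and 0 for a negative sample.
    """
    # Gradient of the logistic loss with respect to the score in_vec . out_vec.
    g = sigmoid(dot(in_vec, out_vec)) - label
    # Both updates use the old values of the other vector.
    new_in = [a - lr * g * b for a, b in zip(in_vec, out_vec)]
    new_out = [b - lr * g * a for a, b in zip(in_vec, out_vec)]
    return new_in, new_out


w_in, w_out = [0.1, -0.2], [0.3, 0.4]
pos_in, pos_out = sgd_step(w_in, w_out, label=1, lr=0.5)
neg_in, neg_out = sgd_step(w_in, w_out, label=0, lr=0.5)
```

    A positive pair is pulled together (its score goes up) and a negative pair is
    pushed apart (its score goes down); only the two vectors involved are touched.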
    One of the main issues with SGD is that it is inherently sequential;
    since there is a dependency between the update from one iteration
    and the computation in the next iteration (they may happen to touch
    the same word representations), each iteration must potentially wait
    for the update from the previous iteration to complete. This does
    not allow us to use the parallel resources of the hardware.
    However, to solve the above issue, word2vec uses Hogwild [5],
    a scheme where different threads process different word pairs in
    parallel and ignore any conflicts that may arise in the model update
    phases. In theory, this can reduce the rate of convergence of
    algorithm as compared to a sequential run. However, the Hogwild
    approach has been shown to work well in the case updates across
    threads are unlikely to be to the same word; and indeed for large
    vocabulary sizes, conflicts are relatively rare and convergence is
    not typically affected.
    The success of Hogwild approach for Word2Vec in case of multi-core
    architectures makes this algorithm a good candidate for exploiting
    GPU, which provides orders of magnitude more parallelism
    than a CPU. In this paper, we propose an efficient parallelization
    technique for accelerating word2vec using GPUs.
    GPU acceleration using deep learning frameworks is not a good
    choice for accelerating word2vec [6]. These frameworks are often
    suitable for “deep networks” where the computation is dominated
    by heavy operations like convolutions and large matrix multiplications.
    On the other hand, word2vec is a relatively shallow network,
    as each training step consists of an embedding lookup, gradient
    computation and finally weight updates for the word pair under
    consideration. The gradient computation and updates involve small
    dot products and thus don’t benefit from the use of cuDNN [7] or
    cuBLAS [8] libraries.
    The limitations of deep learning frameworks led us to explore
    the CUDA C++ API. We design the training algorithm from scratch,
    to utilize CUDA multi-threading capabilities optimally, without
    hurting the output accuracy by over-exploiting GPU parallelism.
    Finally, to scale out BlazingText to process text corpus at several
    million words/sec, we demonstrate the possibility of using multiple
    GPUs to perform data parallelism based training, which is one of the
    main contributions of our work. We benchmark BlazingText against
    2013-2017
    Supervised
    Word Embedding
    Efficient Estimation of Word Representations in
    Vector Space
    Tomas Mikolov
    Google Inc., Mountain View, CA
    [email protected]
    Kai Chen
    Google Inc., Mountain View, CA
    [email protected]
    Greg Corrado
    Google Inc., Mountain View, CA
    [email protected]
    Jeffrey Dean
    Google Inc., Mountain View, CA
    [email protected]
    Abstract
    We propose two novel model architectures for computing continuous vector
    representations of words from very large data sets. The quality of these representations
    is measured in a word similarity task, and the results are compared to the
    previously best performing techniques based on different types of neural networks. We
    observe large improvements in accuracy at much lower computational cost, i.e. it
    takes less than a day to learn high quality word vectors from a 1.6 billion words
    data set. Furthermore, we show that these vectors provide state-of-the-art
    performance on our test set for measuring syntactic and semantic word similarities.
    1 Introduction
    Many current NLP systems and techniques treat words as atomic units - there is no notion of
    similarity between words, as these are represented as indices in a vocabulary. This choice has several good
    reasons - simplicity, robustness and the observation that simple models trained on huge amounts of
    data outperform complex systems trained on less data. An example is the popular N-gram model
    used for statistical language modeling - today, it is possible to train N-grams on virtually all available
    data (trillions of words [3]).
    However, the simple techniques are at their limits in many tasks. For example, the amount of
    relevant in-domain data for automatic speech recognition is limited - the performance is usually
    dominated by the size of high quality transcribed speech data (often just millions of words). In
    machine translation, the existing corpora for many languages contain only a few billions of words
    or less. Thus, there are situations where simple scaling up of the basic techniques will not result in
    any significant progress, and we have to focus on more advanced techniques.
    With progress of machine learning techniques in recent years, it has become possible to train more
    complex models on much larger data set, and they typically outperform the simple models. Probably
    the most successful concept is to use distributed representations of words [10]. For example, neural
    network based language models significantly outperform N-gram models [1, 27, 17].
    1.1 Goals of the Paper
    The main goal of this paper is to introduce techniques that can be used for learning high-quality word
    vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As
    far as we know, none of the previously proposed architectures has been successfully trained on more
    arXiv:1301.3781v3 [cs.CL] 7 Sep 2013


  61. @data_monsters
    https://twitter.com/data_monsters/status/844256398393462784


  62. Word2vec ⇾ Word Embedding
    2013
    Supervised
    Word Embedding
    Continuous
    Bag-Of-Words
    (CBOW)
    to predict a word
    given its context
    Skip-Gram with
    Negative Sampling
    (SGNS)
    to predict the context
    given a word
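    The difference between the two architectures shows up in how training examples are
    built from a sentence. A minimal sketch, assuming a symmetric context window; the
    function names are hypothetical, and negative sampling (which adds random
    non-context words as negative examples) is left out.

```python
def skipgram_pairs(tokens, window):
    """(word, context word) pairs: predict the context given a word."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context words, word) examples: predict a word given its context."""
    examples = []
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, word))
    return examples


sentence = ["the", "cat", "sat"]
sg = skipgram_pairs(sentence, 1)
cbow = cbow_examples(sentence, 1)
```

    Skip-gram yields one example per (word, neighbor) pair, while CBOW yields one
    example per position with all neighbors grouped as the input.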


  63. BlazingText (Word2vec) Scaling
    2017
    Supervised
    Word Embedding


  64. AWS Summit Milan 2018


  65. https://bit.ly/2SSI2Qo


  66. And Then There Are (Built-in) Algorithms
    Algorithm                            Scope
    Linear Learner                       classification, regression
    Factorization Machines               classification, regression, sparse datasets
    K-Nearest Neighbors (k-NN)           classification, regression
    K-Means Clustering                   clustering, unsupervised
    Principal Component Analysis (PCA)   dimensionality reduction, unsupervised
    XGBoost                              regression, classification (binary and multiclass), and ranking
    Image Classification                 CNNs (ResNet)
    Object Detection                     Object classification (and bounding box) inside an image
    Semantic Segmentation                Pixel by pixel classification of an image
    Sequence to Sequence (seq2seq)       translation, text summarization, speech-to-text (RNNs, CNN)
    Latent Dirichlet Allocation (LDA)    topic modeling, unsupervised
    Neural Topic Model (NTM)             topic modeling, unsupervised
    Random Cut Forest (RCF)              anomaly detection
    Time Series Forecasting (DeepAR)     time series forecasting (RNN)
    BlazingText (Word2vec)               word embeddings


  67. Machine Learning = Algorithms + Data + Tools


  68. Customers want more value from their data
    Growing
    exponentially
    From new
    sources
    Increasingly
    diverse
    Used by
    many people
    Analyzed by
    many applications


  69. Cloud data lakes are the future
    Customers want:
    A single data store that is scalable & cost effective
    To store data securely in standard formats
    To analyze their data in a variety of ways
    Cloud Data Lake
    Infrastructure
    Decoupled Storage
    & Compute Resources
    Security & Governance
    Data
    Migration
    Streaming
    Services
    Data
    Warehouse
    Big Data
    Processing
    Serverless Data
    Processing
    Real-time
    Analytics
    Operational
    Analytics
    Predictive
    Analytics
    ETL & Catalog
    Data Management


  70. 125+ million players
    Data provides a constant feedback loop
    for game designers
    Up-to-the-minute analysis of gamer
    satisfaction to drive gamer engagement
    Resulting in the most popular
    game played in the world
    Fortnite


  71. Data lake infrastructure
    & management
    “With an enterprise-ready
    option like Lake Formation,
    we will be able to spend more
    time deriving value from our
    data rather than doing the
    heavy lifting involved
    in manually setting up and
    managing our data lake.”
    —Joshua Couch, VP Engineering
    at Fender Digital


  72. Analytics
    FINRA’s legacy system did not
    scale to handle 75 billion events
    per day. They needed to run
    complex surveillance queries
    over 20+ PB of data
    FINRA migrated their big data
    appliance to an S3 Data Lake
    and uses EMR for ingestion
    and processing


  73. CHALLENGE
    Needed to analyze data to find
    insights, identify opportunities, and
    evaluate business performance.
    The Oracle DW did not scale, was
    difficult to maintain, and costly.
    SOLUTION
    Deployed a data lake with S3, and run
    analytics with Redshift, Redshift
    Spectrum, and EMR.
    Result: they doubled the data stored
    (100PB), lowered costs, and were able
    to gain insights faster.
    50PB of data
    600,000 analytics jobs/day
    [Architecture diagram]
    Source systems: S3, DynamoDB, Relational Stores, Non Relational Stores, S3, Kinesis
    Big data marketplace (100PB): Data Lake Web Interface, Data Lake APIs, Workflows service,
    Discovery service, Data Ingestion, Subscription Service, Data Quality / Curation,
    Data security and governance
    Analytics: EMR, Redshift, Redshift Spectrum, Other Compute


  74. Amazon.com, 1995


  79. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark
    Our mission at AWS
    Put machine learning in the hands
    of every developer

  80. THE AWS ML STACK
    © 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved
    Broadest and deepest set of capabilities
    AI Services
    VISION, SPEECH, LANGUAGE, CHATBOTS, FORECASTING, RECOMMENDATIONS
    REKOGNITION IMAGE, REKOGNITION VIDEO, TEXTRACT, POLLY, TRANSCRIBE, TRANSLATE,
    COMPREHEND & COMPREHEND MEDICAL, LEX, FORECAST, PERSONALIZE
    ML Services
    Amazon SageMaker
    Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training,
    Optimization, Deployment, Hosting
    ML Frameworks + Infrastructure
    FRAMEWORKS, INTERFACES, INFRASTRUCTURE
    EC2 P3 & P3DN, EC2 G4, EC2 C5, FPGAS, GREENGRASS, ELASTIC INFERENCE, INFERENTIA,
    DL CONTAINERS & AMIs, ELASTIC KUBERNETES SERVICE, ELASTIC CONTAINER SERVICE

  81. Accelerating
    investigation timelines
    FINRA uses Amazon Comprehend to process and
    review millions of documents with unstructured data,
    helping flag records of interest that should be
    reviewed by human investigators.

  82. Predicting
    global markets
    Moody’s uses Amazon SageMaker to better
    predict market conditions and credit actions.

  83. Accelerating
    financial analysis
    Using TensorFlow on Amazon SageMaker, Siemens
    Financial Services developed an NLP model to extract
    critical information to accelerate investment due
    diligence, reducing time to summarize diligence
    documents from 12 hours down to 30 seconds.

  84. Optimizing
    interactive games
    Rovio uses deep reinforcement learning on AWS to
    help predict the difficulty of levels in Angry Birds
    Dream Blast. This lets their developers focus on
    creating better player experiences, instead of testing
    levels.

  85. Driving better
    healthcare outcomes
    Using Amazon SageMaker, GE Healthcare developed
    an ML model that can learn from thousands of
    medical scans to detect anomalies more accurately
    and efficiently, allowing radiologists to prioritize
    patients needing immediate attention.

  86. Enhancing the
    fan experience
    Formula 1 uses Amazon SageMaker to create real time
    insights on how a driver is performing, improving the fan
    experience on television broadcasts and digital platforms.

  87. Improving
    customer service
    T-Mobile uses Amazon SageMaker Ground Truth
    to label unstructured data from customer service
    interactions. These data sets are used to train
    machine learning models that provide their human
    agents with recommended actions for a given
    customer.

  88. Culture
    Setting your organization
    up for success

  89. 1. Advance your data strategy
    Assess your structured and unstructured data sources
    2. Organize for success
    Put machine learning in the hands of your developers
    3. Create the loop
    Connect technology initiatives with business outcomes
    ?

  90. The world’s first deep learning-enabled video camera for developers
    • Purpose-built for ML-skills development
    • Fully programmable & customizable
    • Build custom Amazon SageMaker models
    • 10 minutes to your first deep learning project

  91. A fully autonomous 1/18th-scale race car designed to help you learn about
    reinforcement learning through autonomous driving
    • Build machine learning models in Amazon SageMaker
    • Train, test, and iterate on the track using the
    AWS DeepRacer 3D racing simulator
    • Compete in the world’s first global autonomous
    racing league, to race for prizes and a chance to
    advance to win the coveted AWS DeepRacer Cup


  93. Danilo Poccia
    Principal Evangelist
    AWS
    @danilop
    danilop.net
    And Then There Are Algorithms
