Machine Learning for the Enterprise Conference, Rome, October 28th, 2019
Machine Learning = Algorithms + Data + Tools
Part 2
© 2019, Amazon Web Services, Inc. or its Affiliates.
Danilo Poccia
Principal Evangelist
AWS
@danilop
danilop.net
And Then There Are Algorithms
Neural Networks
1943 Warren McCulloch, Walter Pitts
Threshold Logic Units
1962 Frank Rosenblatt
Perceptron
Perceptron

[Diagram: inputs x1, x2, x3, …, xn are multiplied by weights w1, w2, w3, …, wn (the parameters) and summed (∑) together with a bias term w0; the result is passed through an activation function to produce the output: output = f(w0 + ∑ wi·xi).]
Perceptron

[Diagram: the same unit with the weighted sum and the activation function combined into a single block f(∑); the weighted inputs plus the bias w0 flow through f(∑) to the output.]
Perceptron

[Diagram: the perceptron abstracted as a single f(∑) block mapping input to output.]
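A perceptron is only a few lines of code. Here is a minimal sketch in Python/NumPy (the step activation and the AND example are illustrative assumptions, not from the slides):

    import numpy as np

    def perceptron(x, w, w0):
        # Weighted sum of the inputs plus the bias w0
        s = np.dot(w, x) + w0
        # Step activation function: fire (1) only if the sum is positive
        return 1 if s > 0 else 0

    # Example: weights that make the perceptron compute logical AND
    w, w0 = np.array([1.0, 1.0]), -1.5
    print(perceptron(np.array([1, 1]), w, w0))  # 1
    print(perceptron(np.array([0, 1]), w, w0))  # 0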
1969 Marvin Minsky, Seymour Papert
Perceptrons: An Introduction to Computational Geometry

A single perceptron can only solve linearly separable problems (e.g., it cannot learn XOR).
Neural Network

[Diagram: nine f(∑) units arranged in an input layer, a hidden layer, and an output layer, mapping input to output.]
Multiple Layers
Lots of Parameters
Backpropagation
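Multiple layers trained with backpropagation remove the single perceptron's limitation. A hedged sketch (the architecture, learning rate, and iteration count are my assumptions) that learns XOR with one hidden layer:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

    # One hidden layer of 4 sigmoid units, one sigmoid output unit
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

    for _ in range(5000):
        # Forward pass through hidden and output layers
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # Backpropagation of the squared error (learning rate 1.0)
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)
        W1 -= X.T @ d_h;  b1 -= d_h.sum(axis=0)

    print(out.round(2))  # should approach [[0], [1], [1], [0]]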
Microprocessor Transistor Counts 1971-2018 (Moore's Law)

Intel Xeon CPU: 28 cores
NVIDIA V100 GPU: 5,120 CUDA cores, 640 Tensor cores
Deep Learning: Advances in Research 1998-2009

LeCun, Gradient-Based Learning Applied to Document Recognition, 1998
Hinton, A Fast Learning Algorithm for Deep Belief Nets, 2006
Bengio, Learning Deep Architectures for AI, 2009
Image Processing
Deep Learning
Image Processing

How do we feed images as input to a neural network?

[Diagram: the feedforward network of f(∑) units from before, shown with a photograph as its input.]
Photo by David Iliff. License: CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg
Image Processing
Convolution Matrix
0 0 0
0 1 0
0 0 0
Identity
Image Processing
Convolution Matrix
1 0 -1
2 0 -2
1 0 -1
Left Edges
Image Processing
Convolution Matrix
-1 0 1
-2 0 2
-1 0 1
Right Edges
Image Processing
Convolution Matrix
1 2 1
0 0 0
-1 -2 -1
Top Edges
Image Processing
Convolution Matrix
-1 -2 -1
0 0 0
1 2 1
Bottom Edges
Image Processing
Convolution Matrix
0.6 -0.6 1.2
-1.4 1.2 -1.6
0.8 -1.4 1.6
Random Values
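Applying any of these kernels takes only a few lines. A minimal sketch with NumPy and SciPy (the random array stands in for the photo on the slides):

    import numpy as np
    from scipy.ndimage import convolve

    # The 3x3 "left edges" kernel shown a few slides earlier
    kernel = np.array([[1, 0, -1],
                       [2, 0, -2],
                       [1, 0, -1]], dtype=float)

    image = np.random.rand(256, 256)  # stand-in for a grayscale photo
    edges = convolve(image, kernel)   # slides the kernel over every pixel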
Convolutional Neural Networks (CNNs)
https://en.wikipedia.org/wiki/Convolutional_neural_network
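In a CNN the convolution kernels are not hand-designed; they are the parameters the network learns. As a hedged sketch (layer sizes and the 10-class head are arbitrary assumptions, not a reference architecture), a small image classifier in Keras:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        layers.MaxPooling2D(),                   # downsample the feature maps
        layers.Conv2D(64, 3, activation="relu"), # 64 learned 3x3 kernels
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),  # e.g. 10 image classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")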
ImageNet Classification Error Over Time

[Chart: ImageNet classification error from 2010 to 2017; the error drops sharply after CNNs are introduced in 2012 and keeps falling as CNN architectures improve.]

2012: ImageNet Classification with Deep Convolutional Neural Networks (SuperVision: 8 layers, 60M parameters)
2013: Visualizing and Understanding Convolutional Networks
How Do Neural Networks Learn?

[Diagram: features learned by the early layers are more generic and can be reused as a feature extractor for other visual tasks; the later layers are specific to the task, e.g. classifying Cat vs. Dog.]

Image Classification
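This is the idea behind transfer learning: keep the generic early layers as a frozen feature extractor and retrain only a task-specific head. A hedged Keras sketch (the choice of ResNet50 and the cat-vs-dog head are my assumptions):

    from tensorflow import keras
    from tensorflow.keras import layers

    # A network pre-trained on ImageNet, without its classification head
    base = keras.applications.ResNet50(include_top=False,
                                       weights="imagenet", pooling="avg")
    base.trainable = False  # freeze the generic feature extractor

    # Train only a new head, e.g. for a binary cat-vs-dog task
    model = keras.Sequential([base, layers.Dense(1, activation="sigmoid")])
    model.compile(optimizer="adam", loss="binary_crossentropy")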
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (Microsoft Research)
arXiv:1512.03385 [cs.CV], 10 Dec 2015

Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8× deeper than VGG nets, but still having lower complexity). An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

[Figure 1: training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks; the deeper network has higher training error, and thus test error.]
Densely Connected Convolutional Networks
Gao Huang (Cornell University), Zhuang Liu (Tsinghua University), Laurens van der Maaten (Facebook AI Research), Kilian Q. Weinberger (Cornell University)
arXiv:1608.06993 [cs.CV], 28 Jan 2018

Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

[Figure 1: a 5-layer dense block with a growth rate of k = 4; each layer takes all preceding feature-maps as input.]
Inception Recurrent Convolutional Neural Network for Object Recognition
Md Zahangir Alom, Chris Yakopcic, Tarek M. Taha (University of Dayton), Mahmudul Hasan (Comcast Labs)
arXiv:1704.07709 [cs.CV], 25 Apr 2017

Abstract: Deep convolutional neural networks (DCNNs) are an influential tool for solving various problems in the machine learning and computer vision fields. In this paper, we introduce a new deep learning model called an Inception-Recurrent Convolutional Neural Network (IRCNN), which utilizes the power of an inception network combined with recurrent layers in a DCNN architecture. We have empirically evaluated the recognition performance of the proposed IRCNN model using different benchmark datasets such as MNIST, CIFAR-10, CIFAR-100, and SVHN. Experimental results show similar or higher recognition accuracy when compared to most of the popular DCNNs, including the RCNN. Furthermore, we have investigated IRCNN performance against equivalent Inception Networks and Inception-Residual Networks using the CIFAR-100 dataset. We report about 3.5%, 3.47%, and 2.54% improvement in classification accuracy when compared to the RCNN, equivalent Inception Networks, and Inception-Residual Networks on the augmented CIFAR-100 dataset, respectively.
Image Classification: 2015-2017, Supervised

Image Classification (ResNet): 2015, Supervised
Image Classification (DenseNet): 2016, Supervised
Image Classification (Inception): 2017, Supervised

Object Detection: 2016, Supervised
SSD: Single Shot MultiBox Detector
Wei Liu (UNC Chapel Hill), Dragomir Anguelov (Zoox Inc.), Dumitru Erhan, Christian Szegedy (Google Inc.), Scott Reed (University of Michigan, Ann-Arbor), Cheng-Yang Fu, Alexander C. Berg (UNC Chapel Hill)
arXiv:1512.02325 [cs.CV], 29 Dec 2016

Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd

Keywords: Real-time Object Detection; Convolutional Neural Network
Semantic Segmentation (Image): 2016-2017, Supervised
Fully Convolutional Networks for Semantic Segmentation
Evan Shelhamer, Jonathan Long, Trevor Darrell (UC Berkeley)
arXiv:1605.06211 [cs.CV], 20 May 2016

Abstract: Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.

Index Terms: Semantic Segmentation, Convolutional Networks, Deep Learning, Transfer Learning
Pyramid Scene Parsing Network
Hengshuang Zhao, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia (The Chinese University of Hong Kong), Jianping Shi (SenseTime Group Limited)
arXiv:1612.01105 [cs.CV], 27 Apr 2017

Abstract: Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam (Google Inc.)
arXiv:1706.05587 [cs.CV], 5 Dec 2017

Abstract: In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

[Figure 1: atrous convolution with kernel size 3×3 and different rates (1, 6, 24); standard convolution corresponds to atrous convolution with rate = 1. Employing a large atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales.]
Semantic Segmentation (Image): 2016-2017, Supervised
Autonomous Driving Systems
Real Time, Per Pixel Object Segmentation
Centimeter-accurate Positioning
What About Memory?

Feedforward Neural Networks: input → output, no internal state.
Recurrent Neural Networks: input → output, plus a state(t) (memory) that is fed back into the network.
RNNs

Long Short-Term Memory (LSTM)
https://en.wikipedia.org/wiki/Long_short-term_memory

[Diagram: the LSTM gates control how much of the input goes into memory, how much remains in memory, and how much is used in computing the output.]
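The three gates can be written down directly. A minimal single-step LSTM cell in NumPy (the weight layout and sizes are illustrative assumptions):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def lstm_step(x, h, c, W, b):
        # x: input, h: previous output, c: memory cell
        z = W @ np.concatenate([x, h]) + b  # all four projections at once
        i, f, o, g = np.split(z, 4)
        i = sigmoid(i)              # input gate: how much goes into memory
        f = sigmoid(f)              # forget gate: how much remains in memory
        o = sigmoid(o)              # output gate: how much is used in the output
        c = f * c + i * np.tanh(g)  # update the memory cell
        h = o * np.tanh(c)          # compute the output
        return h, c

    nx, nh = 3, 5  # input size 3, hidden size 5
    W, b = np.random.randn(4 * nh, nx + nh), np.zeros(4 * nh)
    h, c = lstm_step(np.random.randn(nx), np.zeros(nh), np.zeros(nh), W, b)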
SOCKEYE: A Toolkit for Neural Machine Translation
Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, Matt Post (Amazon)
arXiv:1712.05690 [cs.CL], 15 Dec 2017

Abstract: We describe SOCKEYE, an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). SOCKEYE is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNET, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks. SOCKEYE also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight SOCKEYE's features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English–German and Latvian–English. We report competitive BLEU scores across all three architectures, including an overall best score for SOCKEYE's transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The SOCKEYE toolkit is free software released under the Apache 2.0 license.

https://github.com/awslabs/sockeye (version 1.12)
Sequence to Sequence (seq2seq)
• seq2seq is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the output is another sequence of tokens (see the sketch below).
• Example applications include:
  • machine translation (input a sentence in one language and predict that sentence in another language)
  • text summarization (input a longer string of words and predict a shorter string of words as a summary)
  • speech-to-text (audio clips converted into output sentences in tokens)
2014-2017
Supervised
Text, Audio
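As a hedged sketch of the encoder-decoder idea behind seq2seq (vocabulary sizes, dimensions, and the LSTM choice are my assumptions, not from the slides):

    from tensorflow import keras
    from tensorflow.keras import layers

    src_vocab, tgt_vocab, dim = 8000, 8000, 256

    # Encoder: read the source token sequence into a fixed-size state
    src = keras.Input(shape=(None,))
    enc = layers.Embedding(src_vocab, dim)(src)
    _, h, c = layers.LSTM(dim, return_state=True)(enc)

    # Decoder: generate the target sequence conditioned on that state
    tgt = keras.Input(shape=(None,))
    dec = layers.Embedding(tgt_vocab, dim)(tgt)
    dec = layers.LSTM(dim, return_sequences=True)(dec, initial_state=[h, c])
    out = layers.Dense(tgt_vocab, activation="softmax")(dec)

    model = keras.Model([src, tgt], out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")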
Sequence to Sequence (seq2seq)
• Recently, problems in this domain have been successfully modeled with deep neural networks, which show a significant performance boost over previous methodologies.
• Amazon released the open-source Sockeye package, which implements Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) encoder-decoder models with attention.
• https://github.com/awslabs/sockeye
  • also provides an experimental image-to-description module
2014-2017
Supervised
Text, Audio
Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
2014-2017
Supervised
Text, Audio
Sequence to Sequence (seq2seq)
https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye
“Das grüne Haus”
“the Green House”
2014-2017
Supervised
Text, Audio
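Training and translating with Sockeye happen from the command line, as described in the blog post above. A hedged sketch (file names are placeholders, and the flags are Sockeye 1.x options as I recall them, so verify against the current --help):

    # Train a German-to-English model on parallel text files
    python -m sockeye.train --source train.de --target train.en \
        --validation-source dev.de --validation-target dev.en \
        --output de2en_model

    # Translate: source sentences are read from stdin
    echo "Das grüne Haus" | python -m sockeye.translate --models de2en_model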
Improving text-to-speech

[Diagram: the text "Sentence to synthesize" is first converted to a phonetic transcription (ˈsɛntəns tə ˈsɪnθəˌsaɪz); Concatenative TTS then stitches together recorded fragments (ˈsɛnt | sɛntəns | tə | ˈsɪnθ | əˌsaɪz), while Neural TTS generates the speech directly from the transcription.]
US English Matthew voice
“Sources tell CNN he believes the
media and the northeast elite are
needlessly hyperventilating and
overreacting to his comments.”
US English Joanna voice
“President Donald Trump said on
March 13 his administration was
ordering the grounding of all Max 8
and 9 models, hours after Canada
said it was grounding the planes after
analyzing new satellite tracking data.”
Amazon Polly NTTS and newscaster style
https://aws.amazon.com/blogs/aws/amazon-polly-introduces-neural-text-to-speech-and-newscaster-style/
Latent Dirichlet Allocation (LDA)
Inference of Population Structure Using Multilocus Genotype Data
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly (Department of Statistics, University of Oxford)
Genetics 155: 945–959 (June 2000)

Abstract: We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci (e.g., seven microsatellite loci in an example using genotype data from an endangered bird species).
Latent Dirichlet Allocation
David M. Blei (UC Berkeley), Andrew Y. Ng (Stanford University), Michael I. Jordan (UC Berkeley)
Journal of Machine Learning Research 3 (2003) 993-1022

Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
2000-2003
Unsupervised
Topic Modeling
Latent Dirichlet Allocation (LDA)
• As an extremely simple example, given a set of documents where the only words that occur within them are eat, sleep, play, meow, and bark, LDA might produce topics like the following (as sketched below):

             eat   sleep  play  meow  bark
  Topic 1    0.1   0.3    0.2   0.4   0.0   (cats?)
  Topic 2    0.2   0.1    0.4   0.0   0.3   (dogs?)

2000-2003
Unsupervised
Topic Modeling
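As a hedged sketch of what producing such a table looks like in practice (the library choice and toy corpus are mine, not from the slides), scikit-learn's LDA implementation recovers per-topic word weights:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["meow sleep meow play", "bark play bark eat",
            "sleep meow eat", "play bark eat sleep"]

    vec = CountVectorizer()
    counts = vec.fit_transform(docs)  # unsupervised: term counts only
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # One row per topic; columns follow vec.get_feature_names_out()
    print(lda.components_)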
Neural Topic Model (NTM)

[Diagram: the input term-counts vector feeds an encoder (feedforward net) that produces the document posterior (µ); a document representation z is sampled from it and passed to a softmax decoder that reconstructs the output term-counts vector.]
Neural Variational Inference for Text Processing
Yishu Miao, Lei Yu, Phil Blunsom (University of Oxford; Google DeepMind)
arXiv:1511.06038 [cs.CL], 4 Jun 2016

Abstract: Recent advances in neural variational inference have spawned a renaissance in deep latent variable models. In this paper we introduce a generic variational inference framework for generative and conditional models of text. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. We validate this framework on two very different text modelling applications, generative document modelling and supervised question answering. Our neural variational document model combines a continuous stochastic document representation with a bag-of-words generative model and achieves the lowest reported perplexities on two standard test corpora. The neural answer selection model employs a stochastic representation layer within an attention mechanism to extract the semantics between a question and answer pair. On two question answering benchmarks this model exceeds all previous published benchmarks.
2015
Unsupervised
Topic Modeling
Random Cut Forest (RCF)
2004-2016
Unsupervised
Anomaly Detection
Robust Random Cut Forest Based Anomaly Detection On Streams
Sudipto Guha [email protected]
University of Pennsylvania, Philadelphia, PA 19104.
Nina Mishra [email protected]
Amazon, Palo Alto, CA 94303.
Gourav Roy [email protected]
Amazon, Bangalore, India 560055.
Okke Schrijvers [email protected]
Stanford University, Palo Alto, CA 94305.
Abstract
In this paper we focus on the anomaly detection
problem for dynamic data streams through the
lens of random cut forests. We investigate a ro-
bust random cut data structure that can be used
as a sketch or synopsis of the input stream. We
provide a plausible definition of non-parametric
anomalies based on the influence of an unseen
point on the remainder of the data, i.e., the exter-
nality imposed by that point. We show how the
sketch can be efficiently updated in a dynamic
data stream. We demonstrate the viability of the
algorithm on publicly available real data.
1. Introduction
Anomaly detection is one of the cornerstone problems in data mining. Even though the problem has been well studied over the last few decades, the emerging explosion of data from the internet of things and sensors leads us to reconsider the problem. In most of these contexts the data is streaming and well-understood prior models do not exist. Furthermore, the input streams need not be append-only; there may be corrections, updates, and a variety of other dynamic changes. Two central questions in this regard are (1) how do we define anomalies? and (2) what data structure do we use to efficiently detect anomalies over dynamic data streams? In this paper we initiate the formal study of both of these questions. For (1), we view the problem from the perspective of model complexity and say that a point is an anomaly if the complexity of the model increases substantially with the inclusion of the point. The labeling of a point is data dependent and corresponds to the externality imposed by the point in explaining the remainder of the data. We extend this notion of externality to handle “outlier masking” that often arises from duplicates and near-duplicate records. Note that the notion of model complexity has to be amenable to efficient computation in dynamic data streams. This relates question (1) to question (2), which we discuss in greater detail next. However, it is worth noting that anomaly detection is not well understood even in the simpler context of static batch processing, and (2) remains relevant in the batch setting as well.
For question (2), we explore a randomized approach, akin to (Liu et al., 2012), due in part to the practical success reported in (Emmott et al., 2013). Randomization is a powerful tool and known to be valuable in supervised learning (Breiman, 2001). But its technical exploration in the context of anomaly detection is not well understood, and the same comment applies to the algorithm put forth in (Liu et al., 2012). Moreover, that algorithm has several limitations, as described in Section 4.1. In particular, we show that in the presence of irrelevant dimensions, crucial anomalies are missed. In addition, it is unclear how to extend this work to a stream. Prior work attempted solutions (Tan et al., 2011) that extend to streaming; however, those were not found to be effective (Emmott et al., 2013). To address these limitations, we put forward a sketch or synopsis termed robust random cut forest (RRCF), formally defined as follows.
Definition 1. A robust random cut tree (RRCT) on point set $S$ is generated as follows:
1. Choose a random dimension $i$ with probability proportional to $\frac{\ell_i}{\sum_j \ell_j}$, where $\ell_i = \max_{x \in S} x_i - \min_{x \in S} x_i$.
2. Choose $X_i \sim \mathrm{Uniform}[\min_{x \in S} x_i, \max_{x \in S} x_i]$.
3. Let $S_1 = \{x \mid x \in S,\ x_i \le X_i\}$ and $S_2 = S \setminus S_1$, and recurse on $S_1$ and $S_2$.
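A minimal Python sketch of one cut from Definition 1 (illustrative only, not the paper's implementation; numpy and the function name are assumptions):

import numpy as np

def random_cut(S):
    """One recursive step of Definition 1. S is an (n, d) array of points."""
    lengths = S.max(axis=0) - S.min(axis=0)        # l_i = max x_i - min x_i
    i = np.random.choice(S.shape[1], p=lengths / lengths.sum())
    X = np.random.uniform(S[:, i].min(), S[:, i].max())
    S1, S2 = S[S[:, i] <= X], S[S[:, i] > X]       # partition, then recurse
    return i, X, S1, S2

Choosing the dimension in proportion to its bounding-box side length (rather than uniformly, as in isolation forests) is what makes the cuts robust to irrelevant dimensions.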
• The idea is to treat a period of P datapoints as a single datapoint of feature length P, and then run the algorithm on these feature vectors
• This is especially useful when working with periodic data with a known period (see the sketch below)
Shingling
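A minimal sketch of shingling (illustrative Python, not SageMaker's internal code; the function name and example values are assumptions):

import numpy as np

def shingle(series, period):
    """Turn a 1-D series into overlapping feature vectors of length `period`."""
    return np.stack([series[i:i + period]
                     for i in range(len(series) - period + 1)])

# e.g. a daily series with weekly seasonality: period = 7
features = shingle(np.arange(30), period=7)   # shape (24, 7)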
Random Cut Forest (RCF)
2004-2016
Unsupervised
Anomaly Detection
Using “shingling”
Anomaly Detection to Improve
Infrastructure and Application Monitoring
Amazon CloudWatch
Time Series Forecasting (DeepAR)
DeepAR: Probabilistic Forecasting with
Autoregressive Recurrent Networks
Valentin Flunkert*, David Salinas*, Jan Gasthaus
Amazon Development Center, Germany
Abstract
Probabilistic forecasting, i.e. estimating the probability distribution of a time series' future given its past, is a key enabler for optimizing business processes. In retail businesses, for example, forecasting demand is crucial for having the right inventory available at the right time at the right place. In this paper we propose DeepAR, a methodology for producing accurate probabilistic forecasts, based on training an auto-regressive recurrent network model on a large number of related time series. We demonstrate how by applying deep learning techniques to forecasting, one can overcome many of the challenges faced by widely-used classical approaches to the problem. We show through extensive empirical evaluation on several real-world forecasting data sets that our methodology produces more accurate forecasts than other state-of-the-art methods, while requiring minimal manual work.
1 Introduction
Forecasting plays a key role in automating and optimizing operational processes in most businesses and enables data-driven decision making. In retail, for example, probabilistic forecasts of product supply and demand can be used for optimal inventory management, staff scheduling and topology planning [17], and are more generally a crucial technology for most aspects of supply chain optimization.

The prevalent forecasting methods in use today have been developed in the setting of forecasting individual or small groups of time series. In this approach, model parameters for each given time series are independently estimated from past observations. The model is typically manually selected to account for different factors, such as autocorrelation structure, trend, seasonality, and other explanatory variables. The fitted model is then used to forecast the time series into the future according to the model dynamics, possibly admitting probabilistic forecasts through simulation or closed-form expressions for the predictive distributions. Many methods in this class are based on the classical Box-Jenkins methodology [3], exponential smoothing techniques, or state space models [11, 18].

In recent years, a new type of forecasting problem has become increasingly important in many applications. Instead of needing to predict individual or a small number of time series, one is faced with forecasting thousands or millions of related time series. Examples include forecasting the energy consumption of individual households, forecasting the load for servers in a data center, or forecasting the demand for all products that a large retailer offers. In all these scenarios, a substantial amount of data on past behavior of similar, related time series can be leveraged for making a forecast for an individual time series. Using data from related time series not only allows fitting more complex (and hence potentially more accurate) models without overfitting, it can also alleviate the time- and labor-intensive manual feature engineering and model selection steps required by classical techniques.
*Equal contribution.
2017
Supervised
Time Series Forecasting
• DeepAR is a supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNN)
• Classical forecasting methods fit one model to each individual time series, and then use that model to extrapolate the time series into the future
• In many applications you might have many similar time series across a set of cross-sectional units
• For example, demand for different products, load of servers, requests for web pages, and so on
• In this case, it can be beneficial to train a single model jointly over all of these time series
• DeepAR takes this approach, training a model for predicting a time series over a large set of (related) time series; a minimal training sketch follows below
Time Series Forecasting (DeepAR)
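As a rough illustration, training the built-in DeepAR algorithm with the SageMaker Python SDK could look like the sketch below; the role ARN, S3 path, instance type, and hyperparameter values are placeholders, not recommendations:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    time_freq="H",           # hourly series
    context_length=72,       # how much history the RNN conditions on
    prediction_length=24,    # forecast horizon
    epochs=20,
)
# One JSON Lines record per time series; all series are trained jointly.
estimator.fit({"train": "s3://my-bucket/deepar/train/"})  # placeholder path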
BlazingText (Word2vec)
BlazingText: Scaling and Accelerating Word2Vec using Multiple
GPUs
Saurabh Gupta
Amazon Web Services
[email protected]
Vineet Khare
Amazon Web Services
[email protected]
ABSTRACT
Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation. Most open-source implementations of the algorithm have been parallelized for multi-core CPU architectures, including the original C implementation by Mikolov et al. [1] and FastText [2] by Facebook. A few other implementations have attempted to leverage GPU parallelization, but at the cost of accuracy and scalability. In this work, we present BlazingText, a highly optimized implementation of word2vec in CUDA, that can leverage multiple GPUs for training. BlazingText can achieve a training speed of up to 43M words/sec on 8 GPUs, which is a 9x speedup over 8-threaded CPU implementations, with minimal effect on the quality of the embeddings.
CCS CONCEPTS
• Computing methodologies → Neural networks; Natural language processing
KEYWORDS
Word embeddings, Word2Vec, Natural Language Processing, Machine Learning, CUDA, GPU
ACM Reference format:
Saurabh Gupta and Vineet Khare. 2017. BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs. In Proceedings of MLHPC'17: Machine Learning in HPC Environments, Denver, CO, USA, November 12–17, 2017, 5 pages.
https://doi.org/10.1145/3146347.3146354
1 INTRODUCTION
Word2Vec aims to represent each word as a vector in a low-dimensional embedding space such that the geometry of the resulting vectors captures word semantic similarity through the cosine similarity of the corresponding vectors, as well as more complex relationships through vector subtractions, such as vec("King") - vec("Queen") + vec("Woman") ≈ vec("Man"). This idea has enabled many Natural Language Processing (NLP) algorithms to achieve better performance [3, 4].
The optimization in word2vec is done using Stochastic Gradient Descent (SGD), which solves the problem iteratively; at each step, it picks a pair of words: an input word and a target word, either from its window or a random negative sample. It then computes the gradients of the objective function with respect to the two chosen words, and updates the word representations of the two words based on the gradient values. The algorithm then proceeds to the next iteration with a different word pair being chosen.
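For illustration only (this is not the paper's CUDA kernel; the function and variable names are assumptions), one such update step for skip-gram with negative sampling can be written as:

import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One stochastic update for a (center, context) pair plus negatives.
    W_in, W_out: (vocab, dim) input/output embedding matrices."""
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        score = 1.0 / (1.0 + np.exp(-v @ u))   # sigmoid(v . u)
        g = lr * (label - score)
        grad_v += g * u                        # gradient w.r.t. the input vector
        W_out[word] += g * v                   # update the output vector in place
    W_in[center] += grad_v

Because each step touches only a handful of rows of the embedding matrices, concurrent steps rarely collide, which is what makes the lock-free Hogwild scheme discussed next viable.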
One of the main issues with SGD is that it is inherently sequential; since there is a dependency between the update from one iteration and the computation in the next iteration (they may happen to touch the same word representations), each iteration must potentially wait for the update from the previous iteration to complete. This does not allow us to use the parallel resources of the hardware. However, to solve this issue, word2vec uses Hogwild [5], a scheme where different threads process different word pairs in parallel and ignore any conflicts that may arise in the model update phases. In theory, this can reduce the rate of convergence of the algorithm as compared to a sequential run. However, the Hogwild approach has been shown to work well in cases where updates across threads are unlikely to touch the same word; and indeed for large vocabulary sizes, conflicts are relatively rare and convergence is not typically affected.
The success of the Hogwild approach for Word2Vec on multi-core architectures makes this algorithm a good candidate for exploiting GPUs, which provide orders of magnitude more parallelism than a CPU. In this paper, we propose an efficient parallelization technique for accelerating word2vec using GPUs.
GPU acceleration using deep learning frameworks is not a good choice for accelerating word2vec [6]. These frameworks are often suitable for “deep networks” where the computation is dominated by heavy operations like convolutions and large matrix multiplications. On the other hand, word2vec is a relatively shallow network, as each training step consists of an embedding lookup, gradient computation and finally weight updates for the word pair under consideration. The gradient computation and updates involve small dot products and thus don't benefit from the use of the cuDNN [7] or cuBLAS [8] libraries.
The limitations of deep learning frameworks led us to explore the CUDA C++ API. We design the training algorithm from scratch, to utilize CUDA multi-threading capabilities optimally, without hurting the output accuracy by over-exploiting GPU parallelism. Finally, to scale out BlazingText to process a text corpus at several million words/sec, we demonstrate the possibility of using multiple GPUs to perform data-parallelism-based training, which is one of the main contributions of our work. We benchmark BlazingText against
2013-2017
Supervised
Word Embedding
Efficient Estimation of Word Representations in
Vector Space
Tomas Mikolov
Google Inc., Mountain View, CA
[email protected]
Kai Chen
Google Inc., Mountain View, CA
[email protected]
Greg Corrado
Google Inc., Mountain View, CA
[email protected]
Jeffrey Dean
Google Inc., Mountain View, CA
[email protected]
Abstract
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
1 Introduction
Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]).

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.

With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data sets, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].
1.1 Goals of the Paper
The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more
@data_monsters
https://twitter.com/data_monsters/status/844256398393462784
Word2vec ⇾ Word Embedding
2013
Supervised
Word Embedding
Continuous Bag-of-Words (CBOW)
to predict a word given its context
Skip-Gram with Negative Sampling (SGNS)
to predict the context given a word
(see the sketch below contrasting the two modes)
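To make the two modes concrete, here is a sketch using gensim 4.x rather than BlazingText itself; the toy corpus and parameter values are illustrative:

from gensim.models import Word2Vec

corpus = [["machine", "learning", "on", "aws"],
          ["deep", "learning", "with", "sagemaker"]]

# sg=0 selects CBOW (predict a word from its context);
# sg=1 selects skip-gram, here with 5 negative samples per pair.
cbow = Word2Vec(corpus, sg=0, vector_size=50, window=2, min_count=1)
sgns = Word2Vec(corpus, sg=1, negative=5, vector_size=50, window=2, min_count=1)

print(sgns.wv.most_similar("learning", topn=2))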
BlazingText (Word2vec) Scaling
2017
Supervised
Word Embedding
AWS Summit Milan 2018
https://bit.ly/2SSI2Qo
And Then There Are (Built-in) Algorithms
Algorithm: Scope
Linear Learner: classification, regression
Factorization Machines: classification, regression, sparse datasets
K-Nearest Neighbors (k-NN): classification, regression
K-Means Clustering: clustering, unsupervised
Principal Component Analysis (PCA): dimensionality reduction, unsupervised
XGBoost: regression, classification (binary and multiclass), and ranking
Image Classification: image classification with CNNs (ResNet)
Object Detection: object classification (and bounding box) inside an image
Semantic Segmentation: pixel-by-pixel classification of an image
Sequence to Sequence (seq2seq): translation, text summarization, speech-to-text (RNNs, CNN)
Latent Dirichlet Allocation (LDA): topic modeling, unsupervised
Neural Topic Model (NTM): topic modeling, unsupervised
Random Cut Forest (RCF): anomaly detection
Time Series Forecasting (DeepAR): time series forecasting (RNN)
BlazingText (Word2vec): word embeddings
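All of these built-in algorithms are consumed the same way, as containers retrieved by name and trained through the same Estimator interface. A sketch with the SageMaker Python SDK (the region and version strings here are assumptions):

from sagemaker import image_uris

region = "eu-west-1"
for name, version in [("xgboost", "1.0-1"),
                      ("blazingtext", "1"),
                      ("forecasting-deepar", "1")]:
    # Resolves the ECR image URI for the built-in algorithm in this region.
    print(name, image_uris.retrieve(name, region, version=version))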
Machine Learning = Algorithms + Data + Tools
Customers want more value from their data
Growing
exponentially
From new
sources
Increasingly
diverse
Used by
many people
Analyzed by
many applications
Cloud data lakes are the future
Customers want:
A single data store that is scalable & cost-effective
To store data securely in standard formats
To analyze their data in a variety of ways
Cloud Data Lake
Infrastructure
Decoupled Storage
& Compute Resources
Security & Governance
Data
Migration
Streaming
Services
Data
Warehouse
Big Data
Processing
Serverless Data
Processing
Real-time
Analytics
Operational
Analytics
Predictive
Analytics
ETL & Catalog
Data Management
125+ million players
Data provides a constant feedback loop
for game designers
Up-to-the-minute analysis of gamer
satisfaction to drive gamer engagement
Resulting in the most popular
game played in the world
Fortnite
Data lake infrastructure
& management
“With an enterprise-ready
option like Lake Formation,
we will be able to spend more
time deriving value from our
data rather than doing the
heavy lifting involved
in manually setting up and
managing our data lake.”
—Joshua Couch, VP Engineering
at Fender Digital
Analytics
FINRA’s legacy system did not
scale to handle 75 billion events
per day. They needed to run
complex surveillance queries
over 20+ PB of data
FINRA migrated their big data appliance to an S3 data lake and uses EMR for ingestion and processing
CHALLENGE
Needed to analyze data to find insights, identify opportunities, and evaluate business performance. The Oracle DW did not scale, was difficult to maintain, and was costly.
SOLUTION
Deployed a data lake with S3, and ran analytics with Redshift, Redshift Spectrum, and EMR.
Result: they doubled the data stored (100PB), lowered costs, and gained insights faster.
50PB of data
600,000 analytics jobs/day
[Architecture diagram: source systems (S3, DynamoDB, relational and non-relational stores, Kinesis) feed a 100PB big data marketplace through data ingestion, with a data lake web interface, data lake APIs, workflows, discovery and subscription services, data quality/curation, and data security and governance; analytics run on EMR, Redshift, Redshift Spectrum, and other compute.]
Amazon.com, 1995
Our mission at AWS
Put machine learning in the hands
of every developer
THE AWS ML STACK
Broadest and deepest set of capabilities

AI Services
Vision: Rekognition Image, Rekognition Video, Textract
Speech: Polly, Transcribe
Language: Translate, Comprehend & Comprehend Medical
Chatbots: Lex
Forecasting: Forecast
Recommendations: Personalize

ML Services
Amazon SageMaker: Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting

ML Frameworks + Infrastructure
Frameworks, interfaces, and infrastructure: EC2 P3 & P3dn, EC2 G4, EC2 C5, FPGAs, Inferentia, Greengrass, Elastic Inference, DL Containers & AMIs, Elastic Kubernetes Service, Elastic Container Service
Accelerating
investigation timelines
FINRA uses Amazon Comprehend to process and
review millions of documents with unstructured data,
helping flag records of interest that should be
reviewed by human investigators.
Predicting
global markets
Moody’s uses Amazon SageMaker to better
predict market conditions and credit actions.
Accelerating
financial analysis
Using TensorFlow on Amazon SageMaker, Siemens
Financial Services developed an NLP model to extract
critical information to accelerate investment due
diligence, reducing time to summarize diligence
documents from 12 hours down to 30 seconds.
Optimizing
interactive games
Rovio uses deep reinforcement learning on AWS to
help predict the difficulty of levels in Angry Birds
Dream Blast. This lets their developers focus on
creating better player experiences, instead of testing
levels.
Driving better
healthcare outcomes
Using Amazon SageMaker, GE Healthcare developed
an ML model that can learn from thousands of
medical scans to detect anomalies more accurately
and efficiently, allowing radiologists to prioritize
patients needing immediate attention.
Enhancing the
fan experience
Formula 1 uses Amazon SageMaker to create real-time insights into how a driver is performing, improving the fan experience on television broadcasts and digital platforms.
Improving
customer service
T-Mobile uses Amazon SageMaker Ground Truth
to label unstructured data from customer service
interactions. These data sets are used to train
machine learning models that provide their human
agents with recommended actions for a given
customer.
Culture
Setting your organization
up for success
Create the loop
1. Connect technology initiatives with business outcomes
2. Advance your data strategy: assess your structured and unstructured data sources
3. Organize for success: put machine learning in the hands of your developers
• Purpose-built for ML-skills development
• Fully programmable & customizable
• Build custom Amazon SageMaker models
• 10 minutes to your first deep learning project
AWS DeepLens: the world's first deep learning-enabled video camera for developers
• Build machine learning models in Amazon
SageMaker
• Train, test, and iterate on the track using the
AWS DeepRacer 3D racing simulator
• Compete in the world's first global autonomous racing league to race for prizes and a chance to win the coveted AWS DeepRacer Cup
A fully autonomous 1/18th-scale race car designed to help you learn about
reinforcement learning through autonomous driving
© 2019, Amazon Web Services, Inc. or its Affiliates.
Danilo Poccia
Principal Evangelist
AWS
@danilop
danilop.net
And Then There Are Algorithms