And Then There Are Algorithms – Part 2

Machine Learning for the Enterprise Conference, Rome, October 28th, 2019

Machine Learning = Algorithms + Data + Tools

Part 2

Danilo Poccia

October 28, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. Danilo Poccia, Principal Evangelist, AWS, @danilop, danilop.net. And Then There Are Algorithms
  2. Neural Networks

  3. 1943 Warren McCulloch, Walter Pitts: Threshold Logic Units
  4. 1962 Frank Rosenblatt: Perceptron
  5. Perceptron (diagram): inputs x1, x2, x3, ..., xn; weights (parameters) w0, w1, w2, w3, ..., wn; weighted sum ∑; activation function; output
  6. Perceptron (diagram): inputs x1, x2, x3, ..., xn; weights (parameters) w0, w1, w2, w3, ..., wn; activation function applied to the weighted sum, f(∑); output
  7. Perceptron: input → f(∑) → output
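A minimal sketch of the perceptron drawn on the slides above: a weighted sum of the inputs plus w0, passed through an activation function f. The step activation and the hand-picked weights are assumptions chosen for illustration; the slides do not specify either.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias, activation=step):
    """Compute f(w0 + sum(w_i * x_i)) for a single input vector."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hand-picked (not learned) parameters that make the unit act like logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [
    perceptron([a, b], and_weights, and_bias)
    for a in (0, 1)
    for b in (0, 1)
]
```

Here `and_outputs` holds the unit's answer for the four input pairs (0,0), (0,1), (1,0), (1,1); only the last pair pushes the weighted sum above the threshold.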
  8. 1969 Marvin Minsky, Seymour Papert: Perceptrons: An Introduction to Computational Geometry. A single perceptron can only represent linearly separable functions (e.g. it cannot represent XOR)
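The XOR limitation can be illustrated with the classic perceptron learning rule. This is a sketch under assumptions: the learning rate of 1, the zero initialisation, and the 100-epoch budget are arbitrary choices, and `train_perceptron` is a hypothetical helper name. Training stops early only when a full pass over the data produces no mistakes, which can happen for AND (linearly separable) but never for XOR.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Perceptron learning rule on 2-input samples.

    Returns (weights, bias, converged), where converged is True only if
    some full pass over the samples produced zero classification errors.
    """
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            prediction = 1 if b + w[0] * x1 + w[1] * x2 >= 0 else 0
            delta = target - prediction
            if delta != 0:
                errors += 1
                # Move the decision boundary toward the misclassified point.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
        if errors == 0:
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
```

No straight line separates the XOR classes, so no setting of the two weights and the bias classifies all four points, however long the loop runs.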
  9. Neural Network (diagram of f(∑) units): input → input layer → hidden layer → output layer → output. Multiple Layers. Lots of Parameters. Backpropagation
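To show why adding a hidden layer helps, here is a sketch of a two-layer network of threshold units that computes XOR. The weights are set by hand as an assumption for illustration; in practice they would be learned with backpropagation, which this sketch does not implement.

```python
def step(z):
    """Threshold activation used by every unit in this sketch."""
    return 1 if z >= 0 else 0


def neuron(inputs, weights, bias):
    """One f(sum) unit: weighted sum of inputs plus bias, then threshold."""
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))


def xor_network(x1, x2):
    """Hidden layer computes OR and AND; the output unit combines them."""
    h_or = neuron([x1, x2], [1.0, 1.0], -0.5)    # fires if at least one input is 1
    h_and = neuron([x1, x2], [1.0, 1.0], -1.5)   # fires only if both inputs are 1
    return neuron([h_or, h_and], [1.0, -1.0], -0.5)  # OR and not AND


xor_outputs = [xor_network(a, b) for a in (0, 1) for b in (0, 1)]
```

Each hidden unit draws one line through the input plane; the output unit then separates the region between the two lines, which a single perceptron cannot do.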
  10. Microprocessor Transistor Counts 1971-2018. Moore’s Law. Intel Xeon CPU: 28 cores. NVIDIA V100 GPU: 5,120 CUDA Cores, 640 Tensor Cores
  11. Deep Learning Advances in Research 1998-2009. LeCun, Gradient-Based Learning Applied to Document Recognition, 1998. Hinton, A Fast Learning Algorithm for Deep Belief Nets, 2006. Bengio, Learning Deep Architectures for AI, 2009
  12. Image Processing Deep Learning
  13. Image Processing. How do we feed images as input to a Neural Network? (Diagram: image → network of f(∑) units → output.) Photo by David Iliff. License: CC-BY-SA 3.0 https://commons.wikimedia.org/wiki/File:Colosseum_in_Rome,_Italy_-_April_2007.jpg
  14. Image Processing. Convolution Matrix (Identity):
         0    0    0
         0    1    0
         0    0    0
  15. Image Processing. Convolution Matrix (Left Edges):
         1    0   -1
         2    0   -2
         1    0   -1
  16. Image Processing. Convolution Matrix (Right Edges):
        -1    0    1
        -2    0    2
        -1    0    1
  17. Image Processing. Convolution Matrix (Top Edges):
         1    2    1
         0    0    0
        -1   -2   -1
  18. Image Processing. Convolution Matrix (Bottom Edges):
        -1   -2   -1
         0    0    0
         1    2    1
  19. Image Processing. Convolution Matrix (Random Values):
         0.6 -0.6  1.2
        -1.4  1.2 -1.6
         0.8 -1.4  1.6
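A sketch of how a 3×3 convolution matrix from the slides above is applied to a grayscale image. Assumptions: the image is a plain list of rows, only positions where the kernel fits entirely are computed, and the kernel is applied without flipping (as deep learning frameworks usually do). The tiny test image is made up for illustration.

```python
def apply_kernel(image, kernel):
    """Slide a 3x3 kernel over a 2D image and return the weighted sums.

    Border pixels are skipped, so the output is 2 rows and 2 columns smaller.
    """
    height, width = len(image), len(image[0])
    output = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            row.append(acc)
        output.append(row)
    return output


IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
LEFT_EDGES = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
RIGHT_EDGES = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Dark region on the left, bright region on the right: one vertical edge.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

unchanged = apply_kernel(image, IDENTITY)
left_response = apply_kernel(image, LEFT_EDGES)
right_response = apply_kernel(image, RIGHT_EDGES)
```

The identity matrix returns the interior pixels as they are, while the two edge matrices respond to the same vertical edge with opposite signs. A convolutional network starts from random values, as on slide 19, and learns the matrix entries during training.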
  20. Convolutional Neural Networks (CNNs) https://en.wikipedia.org/wiki/Convolutional_neural_network
  21. ImageNet Classification Error Over Time (chart: error axis from 0 to 30, years 2010 to 2017, annotated "CNNs")
  22. 2012 ImageNet Classification with Deep Convolutional Neural Networks

  23. CNNs. SuperVision: 8 layers, 60M parameters
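To give a feeling for where "60M parameters" can come from, here is a sketch of parameter counting for convolutional and fully connected layers. The two layer shapes below are recalled from the 2012 paper (96 kernels of 11×11×3 in the first convolutional layer, and a first fully connected layer mapping 6×6×256 values to 4096 units); they are assumptions not stated on the slide, and the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights of all kernels plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """One weight per input/output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Assumed shapes, recalled from the 2012 paper rather than taken from the slide.
first_conv = conv_params(11, 11, 3, 96)
first_dense = dense_params(6 * 6 * 256, 4096)
```

Under these assumptions the first convolutional layer has about 35 thousand parameters while the first fully connected layer alone has almost 38 million, which suggests that most of the parameter budget sits in the fully connected layers rather than in the convolutions.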

  24. 2013 Visualizing and Understanding Convolutional Networks

  25. CNNs

  26. CNNs

  27. CNNs. How Do Neural Networks Learn? Some layers are more generic and can be reused as a feature extractor for other visual tasks; others are specific to the task (e.g. Cat, Dog)
  28. Image Classification. 2015-2017. Supervised Image Classification

    Deep Residual Learning for Image Recognition. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Microsoft Research. {kahe, v-xiangz, v-shren, jiansun}@microsoft.com. arXiv:1512.03385v1 [cs.CV] 10 Dec 2015

    Abstract. Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8× deeper than VGG nets [41]) but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions (footnote 1), where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

    Footnote 1: http://image-net.org/challenges/LSVRC/2015/ and http://mscoco.org/dataset/#detections-challenge2015.

    1. Introduction. Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

    Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4. (Plot axes: iter. (1e4) against training error (%) and test error (%).)

    Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

    When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

    The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that

    Densely Connected Convolutional Networks. Gao Huang* (Cornell University, gh349@cornell.edu), Zhuang Liu* (Tsinghua University, liuzhuang13@mails.tsinghua.edu.cn), Laurens van der Maaten (Facebook AI Research, lvdmaaten@fb.com), Kilian Q. Weinberger (Cornell University, kqw4@cornell.edu). *Authors contributed equally. arXiv:1608.06993v5 [cs.CV] 28 Jan 2018

    Abstract. Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections (one between each layer and its subsequent layer), our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.

    1. Introduction. Convolutional neural networks (CNNs) have become the dominant machine learning approach for visual object recognition. Although they were originally introduced over 20 years ago [18], improvements in computer hardware and network structure have enabled the training of truly deep CNNs only recently. The original LeNet5 [19] consisted of 5 layers, VGG featured 19 [29], and only last year Highway Networks [34] and Residual Networks (ResNets) [11] have surpassed the 100-layer barrier.

    Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input. (Diagram labels: x0, H1, x1, H2, x2, H3, x3, H4, x4.)

    As CNNs become increasingly deep, a new research problem emerges: as information about the input or gradient passes through many layers, it can vanish and “wash out” by the time it reaches the end (or beginning) of the network. Many recent publications address this or related problems. ResNets [11] and Highway Networks [34] bypass signal from one layer to the next via identity connections. Stochastic depth [13] shortens ResNets by randomly dropping layers during training to allow better information and gradient flow. FractalNets [17] repeatedly combine several parallel layer sequences with different number of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network. Although these different approaches vary in network topology and training procedure, they all share a key characteristic: they create short paths from early layers to later layers.

    Inception Recurrent Convolutional Neural Network for Object Recognition. Md Zahangir Alom (ALOMM1@UDAYTON.EDU, University of Dayton, Dayton, OH, USA), Mahmudul Hasan (MAHMUD.UCR@GMAIL.COM, Comcast Labs, Washington, DC, USA), Chris Yakopcic (CHRIS@UDAYTON.EDU, University of Dayton, Dayton, OH, USA), Tarek M. Taha (TTAHA1@UDAYTON.EDU, University of Dayton, Dayton, OH, USA). arXiv:1704.07709v1 [cs.CV] 25 Apr 2017

    Abstract. Deep convolutional neural networks (DCNNs) are an influential tool for solving various problems in the machine learning and computer vision fields. In this paper, we introduce a new deep learning model called an Inception-Recurrent Convolutional Neural Network (IRCNN), which utilizes the power of an inception network combined with recurrent layers in DCNN architecture. We have empirically evaluated the recognition performance of the proposed IRCNN model using different benchmark datasets such as MNIST, CIFAR-10, CIFAR-100, and SVHN. Experimental results show similar or higher recognition accuracy when compared to most of the popular DCNNs including the RCNN. Furthermore, we have investigated IRCNN performance against equivalent Inception Networks and Inception-Residual Networks using the CIFAR-100 dataset. We report about 3.5%, 3.47% and 2.54% improvement in classification accuracy when compared to the RCNN, equivalent Inception Networks, and Inception-Residual Networks on the augmented CIFAR-100 dataset respectively.

    1. Introduction. In recent years, deep learning using Convolutional Neural Networks (CNNs) has shown enormous success in the field of machine learning and computer vision. CNNs provide state-of-the-art accuracy in various image recognition tasks including object recognition (Schmidhuber, 2015; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015), object detection (Girshick et al., 2014), tracking (Wang et al., 2015), and image captioning (Xu et al., 2014). In addition, this technique has been applied massively in computer vision tasks such as video representation and classification of human activity (Ballas et al., 2015). Machine translation and natural language processing are applied deep learning techniques that show great success in this domain (Collobert & Weston, 2008; Manning et al., 2014). Furthermore, this technique has been used extensively in the field of speech recognition (Hinton et al., 2012). Moreover, deep learning is not limited to signal, natural language, image, and video processing tasks, it has been applying successfully for game development (Mnih et al., 2013; Lillicrap et al., 2015). There is a lot of ongoing research for developing even better performance and improving the training process of DCNNs (Lin et al., 2013; Springenberg et al., 2014; Goodfellow et al., 2013; Ioffe & Szegedy, 2015; Zeiler & Fergus, 2013).

    In some cases, machine intelligence shows better performance compared to human intelligence including calculation, chess, memory, and pattern matching. On the other hand, human intelligence still provides better performance in other fields such as object recognition, scene understanding, and more. Deep learning techniques (DCNNs in particular) perform very well in the domains of detection, classification, and scene understanding. There is a still a gap that must be closed before human level intelligence is reached when performing visual recognition tasks. Machine intelligence may open an opportunity to build a system that can process visual information the way that a human brain does. According to the study on the visual processing system within a human brain by James DiCarlo et al. (Zoccolan & Rust, 2012) the brain consists of several visual processing units starting with the visual cortex
  29. Image Classification (ResNet). 2015. Supervised Image Classification
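A sketch of the residual idea described in the ResNet abstract: a block learns a residual function F(x) and adds the block input back, so y = x + F(x). Assumptions for illustration: F is two small fully connected layers with a ReLU in between rather than convolutions, and the activation the paper applies after the addition is left out.

```python
def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vector]


def linear(vector, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [
        b + sum(w * x for w, x in zip(row, vector))
        for row, b in zip(weights, bias)
    ]


def residual_block(x, w1, b1, w2, b2):
    """Return x + F(x), where F is linear -> ReLU -> linear."""
    f_x = linear(relu(linear(x, w1, b1)), w2, b2)
    return [xi + fi for xi, fi in zip(x, f_x)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

# With all-zero weights F(x) is 0, so the block is an identity mapping.
passthrough = residual_block([1.0, -2.0], zeros, no_bias, zeros, no_bias)
# With non-zero weights the block adds a learned correction to its input.
adjusted = residual_block([1.0, -2.0], identity, no_bias, identity, no_bias)
```

The `passthrough` case mirrors the paper's argument: a deeper model can always fall back to identity mappings in its added layers, and with the shortcut connection that fallback only requires driving the residual weights toward zero.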

  30. Image Classification (DenseNet). 2016. Supervised Image Classification
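The L(L+1)/2 figure from the DenseNet abstract can be checked with a short sketch. The assumption here is a simple bookkeeping model in which index 0 stands for the block input and layer number l receives the feature maps with indices 0 to l-1; the function names are hypothetical.

```python
def dense_block_inputs(num_layers):
    """Map each layer to the indices of the feature maps it receives.

    Index 0 is the block input; layer l receives indices 0 .. l-1,
    i.e. the block input plus the outputs of all preceding layers.
    """
    return {layer: list(range(layer)) for layer in range(1, num_layers + 1)}


def count_direct_connections(num_layers):
    """Total number of layer-to-layer inputs inside one dense block."""
    return sum(len(sources) for sources in dense_block_inputs(num_layers).values())


connections_in_5_layers = count_direct_connections(5)
```

For five layers the count is 1 + 2 + 3 + 4 + 5 = 15, which equals 5·6/2, whereas a traditional chain of five layers would have only five connections.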

  31. Image Classification (Inception). 2017. Supervised Image Classification

  32. Object Detection. 2016. Supervised Object Detection

    SSD: Single Shot MultiBox Detector. Wei Liu (UNC Chapel Hill), Dragomir Anguelov (Zoox Inc.), Dumitru Erhan (Google Inc.), Christian Szegedy (Google Inc.), Scott Reed (University of Michigan, Ann-Arbor), Cheng-Yang Fu (UNC Chapel Hill), Alexander C. Berg (UNC Chapel Hill). wliu@cs.unc.edu, drago@zoox.com, {dumitru,szegedy}@google.com, reedscot@umich.edu, {cyfu,aberg}@cs.unc.edu. arXiv:1512.02325v5 [cs.CV] 29 Dec 2016

    Abstract. We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP (footnote 1) on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd .

    Keywords: Real-time Object Detection; Convolutional Neural Network

    1 Introduction. Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN [2] albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.

    Footnote 1: We achieved even better results using an improved data augmentation scheme in follow-on experiments: 77.2% mAP for 300×300 input and 79.8% mAP for 512×512 input on VOC2007. Please see Sec. 3.6 for details.
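A sketch of what "a set of default boxes over different aspect ratios and scales per feature map location" can look like. Assumptions: a square feature map, a single scale per map, boxes expressed as (center x, center y, width, height) in relative image coordinates, and width and height derived from the scale and aspect ratio so that the area stays constant. The abstract does not give these formulas, so treat this as one plausible reading rather than the paper's exact scheme.

```python
import math


def default_boxes(feature_map_size, scale, aspect_ratios):
    """Generate one default box per aspect ratio at every feature map cell.

    Returns a list of (cx, cy, width, height) tuples in [0, 1] coordinates.
    """
    boxes = []
    for row in range(feature_map_size):
        for col in range(feature_map_size):
            # Center of the cell, relative to the whole image.
            cx = (col + 0.5) / feature_map_size
            cy = (row + 0.5) / feature_map_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((cx, cy, width, height))
    return boxes


boxes = default_boxes(feature_map_size=2, scale=0.5, aspect_ratios=[1.0, 2.0, 0.5])
```

A coarse 2×2 map with three aspect ratios already yields twelve boxes. The network then predicts, for each default box, class scores and small adjustments to the box, instead of proposing boxes in a separate stage.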
  33. Semantic Segmentation (Image) 2016-2017 Supervised Sem antic Segm entation 1

    Fully Convolutional Networks for Semantic Segmentation Evan Shelhamer⇤, Jonathan Long⇤, and Trevor Darrell, Member, IEEE Abstract—Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image. Index Terms—Semantic Segmentation, Convolutional Networks, Deep Learning, Transfer Learning F 1 INTRODUCTION CONVOLUTIONAL networks are driving advances in recognition. Convnets are not only improving for whole-image classification [1], [2], [3], but also making progress on local tasks with structured output. These in- clude advances in bounding box object detection [4], [5], [6], part and keypoint prediction [7], [8], and local correspon- dence [8], [9]. The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. 
Prior approaches have used convnets for semantic segmentation [10], [11], [12], [13], [14], [15], [16], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses. We show that fully convolutional networks (FCNs) trained end-to-end, pixels-to-pixels on semantic segmen- tation exceed the previous best results without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary- sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampling. This method is efficient, both asymptotically and ab- solutely, and precludes the need for the complications in other works. Patchwise training is common [10], [11], [12], [13], [16], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post- processing complications, including superpixels [12], [14], proposals [14], [15], or post-hoc refinement by random fields ⇤Authors contributed equally • E. Shelhamer, J. Long, and T. Darrell are with the Department of Electrical Engineering and Computer Science (CS Division), UC Berkeley. E-mail: {shelhamer,jonlong,trevor}@cs.berkeley.edu. or local classifiers [12], [14]. Our model transfers recent success in classification [1], [2], [3] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without super- vised pre-training [10], [12], [13]. Semantic segmentation faces an inherent tension be- tween semantics and location: global information resolves what while local information resolves where. 
What can be done to navigate this spectrum from location to semantics? How can local decisions respect global structure? It is not immediately clear that deep networks for image classifica- tion yield representations sufficient for accurate, pixelwise recognition. In the conference version of this paper [17], we cast pre-trained networks into fully convolutional form, and augment them with a skip architecture that takes advantage of the full feature spectrum. The skip architecture fuses the feature hierarchy to combine deep, coarse, semantic information and shallow, fine, appearance information (see Section 4.3 and Figure 3). In this light, deep feature hierar- chies encode location and semantics in a nonlinear local-to- global pyramid. This journal paper extends our earlier work [17] through further tuning, analysis, and more results. Alternative choices, ablations, and implementation details better cover the space of FCNs. Tuning optimization leads to more accu- rate networks and a means to learn skip architectures all-at- once instead of in stages. Experiments that mask foreground and background investigate the role of context and shape. Results on the object and scene labeling of PASCAL-Context reinforce merging object segmentation and scene parsing as unified pixelwise prediction. In the next section, we review related work on deep classification nets, FCNs, recent approaches to semantic seg- mentation using convnets, and extensions to FCNs. The fol- arXiv:1605.06211v1 [cs.CV] 20 May 2016 Pyramid Scene Parsing Network Hengshuang Zhao1 Jianping Shi2 Xiaojuan Qi1 Xiaogang Wang1 Jiaya Jia1 1The Chinese University of Hong Kong 2SenseTime Group Limited {hszhao, xjqi, leojia}@cse.cuhk.edu.hk, xgwang@ee.cuhk.edu.hk, shijianping@sensetime.com Abstract Scene parsing is challenging for unrestricted open vo- cabulary and diverse scenes. 
In this paper, we exploit the capability of global context information by different-region- based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is ef- fective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel- level prediction. The proposed approach achieves state-of- the-art performance on various datasets. It came first in Im- ageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields the new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes. 1. Introduction Scene parsing, based on semantic segmentation, is a fun- damental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing pro- vides complete understanding of the scene. It predicts the label, location, as well as shape for each element. This topic is of broad interest for potential applications of automatic driving, robot sensing, to name a few. Difficulty of scene parsing is closely related to scene and label variety. The pioneer scene parsing task [23] is to clas- sify 33 scenes for 2,688 images on LMO dataset [22]. More recent PASCAL VOC semantic segmentation and PASCAL context datasets [8, 29] include more labels with similar context, such as chair and sofa, horse and cow, etc. The new ADE20K dataset [43] is the most challenging one with a large and unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. To develop an effective algorithm for these datasets needs to conquer a few difficulties. State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. The deep convolutional neural network (CNN) based methods boost dynamic object understanding, and yet still face chal- Figure 1. 
Illustration of complex scenes in ADE20K dataset. lenges considering diverse scenes and unrestricted vocabu- lary. One example is shown in the first row of Fig. 2, where a boat is mistaken as a car. These errors are due to similar appearance of objects. But when viewing the image regard- ing the context prior that the scene is described as boathouse near a river, correct prediction should be yielded. Towards accurate scene perception, the knowledge graph relies on prior information of scene context. We found that the major issue for current FCN based models is lack of suitable strategy to utilize global scene category clues. For typical complex scene understanding, previously to get a global image-level feature, spatial pyramid pooling [18] was widely employed where spatial statistics provide a good descriptor for overall scene interpretation. Spatial pyramid pooling network [12] further enhances the ability. Different from these methods, to incorporate suitable global features, we propose pyramid scene parsing network (PSPNet). In addition to traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature to the specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable. We also propose an optimization strategy with 1 arXiv:1612.01105v2 [cs.CV] 27 Apr 2017 Rethinking Atrous Convolution for Semantic Image Segmentation Liang-Chieh Chen George Papandreou Florian Schroff Hartwig Adam Google Inc. {lcchen, gpapan, fschroff, hadam}@google.com Abstract In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter’s field-of-view as well as control the resolution of feature responses computed by Deep Convolu- tional Neural Networks, in the application of semantic image segmentation. 
To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed ‘DeepLabv3’ system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
1. Introduction For the task of semantic segmentation [20, 63, 14, 97, 7], we consider two challenges in applying Deep Convolutional Neural Networks (DCNNs) [50]. The first one is the reduced feature resolution caused by consecutive pooling operations or convolution striding, which allows DCNNs to learn increasingly abstract feature representations. However, this invariance to local image transformation may impede dense prediction tasks, where detailed spatial information is desired. To overcome this problem, we advocate the use of atrous convolution [36, 26, 74, 66], which has been shown to be effective for semantic image segmentation [10, 90, 11]. Atrous convolution, also known as dilated convolution, allows us to repurpose ImageNet [72] pretrained networks to extract denser feature maps by removing the downsampling operations from the last few layers and upsampling the corresponding filter kernels, equivalent to inserting holes (‘trous’ in French) between filter weights. With atrous convolution, one is able to control the resolution at which feature responses are computed within DCNNs without requiring learning extra parameters.
[Figure 1. Atrous convolution with kernel size 3 × 3 and different rates (panels: rate = 1, rate = 6, rate = 24, each applied to a feature map). Standard convolution corresponds to atrous convolution with rate = 1. Employing large value of atrous rate enlarges the model’s field-of-view, enabling object encoding at multiple scales.]
Another difficulty comes from the existence of objects at multiple scales. Several methods have been proposed to handle the problem and we mainly consider four categories in this work, as illustrated in Fig. 2. First, the DCNN is applied to an image pyramid to extract features for each scale input [22, 19, 69, 55, 12, 11] where objects at different scales become prominent at different feature maps. Second, the encoder-decoder structure [3, 71, 25, 54, 70, 68, 39] exploits multi-scale features from the encoder part and recovers the spatial resolution from the decoder part. Third, extra modules are cascaded on top of the original network for capturing long range information. In particular, DenseCRF [45] is employed to encode pixel-level pairwise similarities [10, 96, 55, 73], while [59, 90] develop several extra convolutional layers in cascade to gradually capture long range context. Fourth, spatial pyramid pooling [11, 95] probes an incoming feature map with filters or pooling operations at multiple rates and multiple effective field-of-views, thus capturing objects at multiple scales. In this work, we revisit applying atrous convolution, which allows us to effectively enlarge the field of view of filters to incorporate multi-scale context, in the framework of both cascaded modules and spatial pyramid pooling. In particular, our proposed module consists of atrous convolution with various rates and batch normalization layers which we
arXiv:1706.05587v3 [cs.CV] 5 Dec 2017
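The effect of the atrous rate described in the excerpt above can be illustrated in one dimension. This is a minimal sketch, not the DeepLabv3 implementation: the function name `atrous_conv1d` and the toy signal are assumptions made for illustration, and the operation is the cross-correlation form commonly used in deep learning.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous (dilated) convolution.

    Kernel taps are placed `rate` samples apart, which is the same as
    inserting rate - 1 holes between neighbouring filter weights.
    Returns the outputs and the effective field of view of the kernel.
    """
    span = (len(kernel) - 1) * rate + 1  # effective field of view
    outputs = []
    for start in range(len(signal) - span + 1):
        outputs.append(
            sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        )
    return outputs, span


signal = list(range(10))  # toy input: 0, 1, ..., 9
kernel = [1, 1, 1]        # the same three weights are reused for every rate

for rate in (1, 2, 3):
    outputs, span = atrous_conv1d(signal, kernel, rate)
    print(f"rate={rate}: field of view={span}, outputs={outputs}")
```

The kernel keeps three weights in every case, so a larger rate widens the field of view without adding parameters.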
  34. Semantic Segmentation (Image) 2016-2017 Supervised Semantic Segmentation

  35. Autonomous Driving Systems

  36. Real Time, Per Pixel Object Segmentation

  37. Centimeter-accurate Positioning

  38. None
  39. What About Memory?

    Feedforward Neural Networks: input, output. Recurrent Neural Networks: input, output, state(t), memory
  40. RNNs: Long Short-Term Memory (LSTM) https://en.wikipedia.org/wiki/Long_short-term_memory

    • How much goes into memory • How much is used in computing the output • How much remains in memory
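The three questions on the LSTM slide correspond to the input, output, and forget gates of a standard LSTM cell. The following is a minimal sketch of a single step for a scalar cell, written with the textbook equations; the parameter layout (`params` mapping a gate name to `(w_x, w_h, bias)`) and the numeric values are assumptions chosen for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    `params` maps a gate name to (w_x, w_h, bias); this layout is a
    hypothetical convenience, not a library API.
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre_activation("input"))     # how much goes into memory
    f = sigmoid(pre_activation("forget"))    # how much remains in memory
    o = sigmoid(pre_activation("output"))    # how much is used in computing the output
    g = math.tanh(pre_activation("candidate"))  # new content proposed for memory

    c = f * c_prev + i * g   # updated memory (cell state)
    h = o * math.tanh(c)     # output (hidden state)
    return h, c


# Arbitrary illustrative parameters, then a short input sequence.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.9, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  output={h:+.4f}  memory={c:+.4f}")
```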
  41. SOCKEYE: A Toolkit for Neural Machine Translation

    Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, Matt Post {fhieber,domhant,mdenkows,dvilar,artemsok,acclift,mattpost}@amazon.com
    Abstract We describe SOCKEYE,¹ an open-source sequence-to-sequence toolkit for Neural Machine Translation (NMT). SOCKEYE is a production-ready framework for training and applying models as well as an experimental platform for researchers. Written in Python and built on MXNET, the toolkit offers scalable training and inference for the three most prominent encoder-decoder architectures: attentional recurrent neural networks, self-attentional transformers, and fully convolutional networks. SOCKEYE also supports a wide range of optimizers, normalization and regularization techniques, and inference improvements from current NMT literature. Users can easily run standard training recipes, explore different model settings, and incorporate new ideas. In this paper, we highlight SOCKEYE’s features and benchmark it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT): English–German and Latvian–English. We report competitive BLEU scores across all three architectures, including an overall best score for SOCKEYE’s transformer implementation. To facilitate further comparison, we release all system outputs and training scripts used in our experiments. The SOCKEYE toolkit is free software released under the Apache 2.0 license.
    1 Introduction The past two years have seen a deep learning revolution bring rapid and dramatic change to the field of machine translation. For users, new neural network-based models consistently deliver better quality translations than the previous generation of phrase-based systems. For researchers, Neural Machine Translation (NMT) provides an exciting new landscape where training pipelines are simplified and unified models can be trained directly from data. The promise of moving beyond the limitations of Statistical Machine Translation (SMT) has energized the community, leading recent work to focus almost exclusively on NMT and seemingly advance the state of the art every few months. For all its success, NMT also presents a range of new challenges. While popular encoder-decoder models are attractively simple, recent literature and the results of shared evaluation tasks show that a significant amount of engineering is required to achieve “production-ready” performance in both translation quality and computational efficiency. In a trend that carries over from SMT, the strongest NMT systems benefit from subtle architecture modifications, hyper-parameter tuning, and empirically effective heuristics. Unlike SMT, there is no “de-facto” toolkit that attracts most of the community’s attention and thus contains all the best ideas from recent literature.² Instead, the presence of many independent toolkits³ brings diversity to the field, but also makes it difficult to compare architectural and algorithmic improvements that are each implemented in different toolkits.
    ¹ https://github.com/awslabs/sockeye (version 1.12) ² For SMT, this role was largely filled by MOSES [Koehn et al., 2007]. ³ https://github.com/jonsafari/nmt-list
    arXiv:1712.05690v1 [cs.CL] 15 Dec 2017
    Sequence to Sequence (seq2seq)
    • seq2seq is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens.
    • Example applications include:
      • machine translation (input a sentence from one language and predict what that sentence would be in another language)
      • text summarization (input a longer string of words and predict a shorter string of words that is a summary)
      • speech-to-text (audio clips converted into output sentences in tokens).
    2014-2017 Supervised Text, Audio
  42. SOCKEYE: A Toolkit for Neural Machine Translation (arXiv:1712.05690v1 [cs.CL] 15 Dec 2017)

    Sequence to Sequence (seq2seq)
    • Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies.
    • Amazon released in open source the Sockeye package, which uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
    • https://github.com/awslabs/sockeye
    • provides an experimental image-to-description module
    2014-2017 Supervised Text, Audio
  43. Sequence to Sequence (seq2seq) https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/ 2014-2017 Supervised Text, Audio

  44. Sequence to Sequence (seq2seq) https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye

    “Das grüne Haus” “the Green House” 2014-2017 Supervised Text, Audio
  45. “Sentence to synthesize” ˈsɛntəns tə ˈsɪnθəˌsaɪz. Concatenative TTS Neural TTS

    Text Phonetic transcription ˈsɛnt sɛntəns tə ˈsɪnθ əˌsaɪz. Improving text-to-speech
  46. US English Matthew voice “Sources tell CNN he believes the

    media and the northeast elite are needlessly hyperventilating and overreacting to his comments.” US English Joanna voice “President Donald Trump said on March 13 his administration was ordering the grounding of all Max 8 and 9 models, hours after Canada said it was grounding the planes after analyzing new satellite tracking data.” Amazon Polly NTTS and newscaster style https://aws.amazon.com/blogs/aws/amazon-polly-introduces-neural-text-to-speech-and-newscaster-style/
  47. Latent Dirichlet Allocation (LDA)

    Copyright © 2000 by the Genetics Society of America. Inference of Population Structure Using Multilocus Genotype Data. Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly, Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom. Manuscript received September 23, 1999. Accepted for publication February 18, 2000. Genetics 155: 945–959 (June 2000)
    ABSTRACT We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
    IN applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (e.g., Cavalli et al. 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A standard approach involves sampling DNA from members of a number of potential source populations and using these samples to estimate allele frequencies in each population at a series of unlinked loci. Using the estimated allele frequencies, it is then possible to compute the likelihood that a given genotype originated in each population. Individuals of unknown origin can be assigned to populations according to these likelihoods (Paetkau et al. 1995; Rannala and Mountain 1997).
    In both situations described above, a crucial first step is to define a set of populations. The definition of populations is typically subjective, based, for example, on linguistic, cultural, or physical characters, as well as the geographic location of sampled individuals. This subjective approach is usually a sensible way of incorporating diverse types of information. However, it may be difficult to know whether a given assignment of individuals to populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic” population structure—i.e., population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the degree of population structure to estimate the probability of false matches (Balding and Nichols 1994, 1995; Foreman et al. 1997; Roeder et al. 1998).
    Pritchard and Rosenberg (1999) considered how genetic information might be used to detect the presence of cryptic population structure in the association mapping context. More generally, one would like to be able to identify the actual subpopulations and assign individuals (probabilistically) to these populations. In this article we use a Bayesian clustering approach to tackle this problem. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Our method attempts to assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies. The method can be applied to various types of markers [e.g., microsatellites, restriction fragment length polymorphisms (RFLPs), or single nucleotide polymorphisms (SNPs)], but it assumes that the marker
    Corresponding author: Jonathan Pritchard, Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom. E-mail: pritch@stats.ox.ac.uk
    Journal of Machine Learning Research 3 (2003) 993-1022. Submitted 2/02; Published 1/03. Latent Dirichlet Allocation. David M. Blei BLEI@CS.BERKELEY.EDU Computer Science Division, University of California, Berkeley, CA 94720, USA. Andrew Y. Ng ANG@CS.STANFORD.EDU Computer Science Department, Stanford University, Stanford, CA 94305, USA. Michael I. Jordan JORDAN@CS.BERKELEY.EDU Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA. Editor: John Lafferty. © 2003 David M. Blei, Andrew Y. Ng and Michael I. Jordan.
    Abstract We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
    1. Introduction In this paper we consider the problem of modeling text corpora and other collections of discrete data. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. Significant progress has been made on this problem by researchers in the field of information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). The basic methodology proposed by IR researchers for text corpora—a methodology successfully deployed in modern Internet search engines—reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. In the popular tf-idf scheme (Salton and McGill, 1983), a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. After suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a
    2000-2003 Unsupervised Topic Modeling
  48. Latent Dirichlet Allocation (LDA)

    • As an extremely simple example, given a set of documents where the only words that occur within them are eat, sleep, play, meow, and bark, LDA might produce topics like the following:

    Topic            eat  sleep  play  meow  bark
    Topic 1 (Cats?)  0.1  0.3    0.2   0.4   0.0
    Topic 2 (Dogs?)  0.2  0.1    0.4   0.0   0.3

    2000-2003 Unsupervised Topic Modeling
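To see how such topics describe a document, the following sketch mixes the two example topics from the slide into the word distribution of one document. It shows only the generative (mixture) side of LDA, not the inference that learns the topics; the helper `document_word_probs` and the 75%/25% topic weights are assumptions made up for illustration.

```python
WORDS = ["eat", "sleep", "play", "meow", "bark"]

# Per-topic word probabilities, taken from the example on the slide.
TOPICS = {
    "Topic 1": [0.1, 0.3, 0.2, 0.4, 0.0],  # "Cats?"
    "Topic 2": [0.2, 0.1, 0.4, 0.0, 0.3],  # "Dogs?"
}


def document_word_probs(topic_weights):
    """Word distribution of a document modeled as a mixture of topics.

    `topic_weights` maps a topic name to its share in the document and is
    expected to sum to 1.
    """
    probs = [0.0] * len(WORDS)
    for topic, weight in topic_weights.items():
        for k, p in enumerate(TOPICS[topic]):
            probs[k] += weight * p
    return dict(zip(WORDS, probs))


# Hypothetical document that is mostly about cats.
doc = document_word_probs({"Topic 1": 0.75, "Topic 2": 0.25})
for word, p in doc.items():
    print(f"{word:5s} {p:.3f}")
```

In this sketch the word meow can only come from Topic 1 and bark only from Topic 2, which is what makes the two topics read as "cats" and "dogs".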
  49. Neural Topic Model (NTM)

    Input term counts vector → Encoder: feedforward net → µ (Document Posterior) → z (Sampled Document Representation) → Decoder: Softmax → Output term counts vector
    Neural Variational Inference for Text Processing. Yishu Miao¹ YISHU.MIAO@CS.OX.AC.UK Lei Yu¹ LEI.YU@CS.OX.AC.UK Phil Blunsom¹² PHIL.BLUNSOM@CS.OX.AC.UK ¹University of Oxford, ²Google Deepmind. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). arXiv:1511.06038v4 [cs.CL] 4 Jun 2016
    Abstract Recent advances in neural variational inference have spawned a renaissance in deep latent variable models. In this paper we introduce a generic variational inference framework for generative and conditional models of text. While traditional variational methods derive an analytic approximation for the intractable distributions over latent variables, here we construct an inference network conditioned on the discrete text input to provide the variational distribution. We validate this framework on two very different text modelling applications, generative document modelling and supervised question answering. Our neural variational document model combines a continuous stochastic document representation with a bag-of-words generative model and achieves the lowest reported perplexities on two standard test corpora. The neural answer selection model employs a stochastic representation layer within an attention mechanism to extract the semantics between a question and answer pair. On two question answering benchmarks this model exceeds all previous published benchmarks.
    1. Introduction Probabilistic generative models underpin many successful applications within the field of natural language processing (NLP). Their popularity stems from their ability to use unlabelled data effectively, to incorporate abundant linguistic features, and to learn interpretable dependencies among data. However these successes are tempered by the fact that as the structure of such generative models becomes deeper and more complex, true Bayesian inference becomes intractable due to the high dimensional integrals required. Markov chain Monte Carlo (MCMC) (Neal, 1993; Andrieu et al., 2003) and variational inference (Jordan et al., 1999; Attias, 2000; Beal, 2003) are the standard approaches for approximating these integrals. However the computational cost of the former results in impractical training for the large and deep neural networks which are now fashionable, and the latter is conventionally confined due to the underestimation of posterior variance. The lack of effective and efficient inference methods hinders our ability to create highly expressive models of text, especially in the situation where the model is non-conjugate.
    This paper introduces a neural variational framework for generative models of text, inspired by the variational auto-encoder (Rezende et al., 2014; Kingma & Welling, 2014). The principle idea is to build an inference network, implemented by a deep neural network conditioned on text, to approximate the intractable distributions over the latent variables. Instead of providing an analytic approximation, as in traditional variational Bayes, neural variational inference learns to model the posterior probability, thus endowing the model with strong generalisation abilities. Due to the flexibility of deep neural networks, the inference network is capable of learning complicated non-linear distributions and processing structured inputs such as word sequences. Inference networks can be designed as, but not restricted to, multilayer perceptrons (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN), approaches which are rarely used in conventional generative models. By using the reparameterisation method (Rezende et al., 2014; Kingma & Welling, 2014), the inference network is trained through back-propagating unbiased and low variance gradients w.r.t. the latent variables. Within this framework, we propose a Neural Variational Document Model (NVDM) for document modelling and a Neural Answer Selection Model (NASM) for question answering, a task that selects the sentences that correctly answer a factoid question from a set of candidate sentences. The NVDM (Figure 1) is an unsupervised generative model of text which aims to extract a continuous semantic latent variable for each document. This model can be interpreted as a variational auto-encoder: an MLP encoder (inference
    2015 Unsupervised Topic Modeling
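The encoder, sampled representation, and softmax decoder named on this slide can be sketched as a single forward pass. This is only an illustrative sketch under strong assumptions: each network is reduced to one linear layer, the weights are random rather than trained, and the names `ntm_forward`, `w_mu`, `w_logvar`, and `w_dec` are hypothetical. It shows the reparameterisation step (z = µ + σ·ε) mentioned in the excerpt, not the authors' model.

```python
import math
import random


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ntm_forward(counts, w_mu, w_logvar, w_dec, rng):
    """Encode term counts, sample a document representation, decode to word probabilities."""
    mu = linear(w_mu, counts)          # mean of the document posterior
    logvar = linear(w_logvar, counts)  # log-variance of the document posterior
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    # Reparameterisation: the randomness lives in eps, so z stays a
    # differentiable function of mu and logvar.
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
    word_probs = softmax(linear(w_dec, z))  # softmax decoder over the vocabulary
    return mu, z, word_probs


rng = random.Random(0)
vocab_size, latent_size = 5, 2
w_mu = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(latent_size)]
w_logvar = [[rng.uniform(-0.1, 0.1) for _ in range(vocab_size)] for _ in range(latent_size)]
w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_size)] for _ in range(vocab_size)]

counts = [2, 0, 1, 0, 3]  # toy term-count vector for one document
mu, z, word_probs = ntm_forward(counts, w_mu, w_logvar, w_dec, rng)
print("mu:", mu)
print("z:", z)
print("word probabilities:", word_probs)
```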
  50. Random Cut Forest (RCF) 2004-2016 Unsupervised Anomaly Detection

    Robust Random Cut Forest Based Anomaly Detection On Streams. Sudipto Guha SUDIPTO@CIS.UPENN.EDU University of Pennsylvania, Philadelphia, PA 19104. Nina Mishra NMISHRA@AMAZON.COM Amazon, Palo Alto, CA 94303. Gourav Roy GOURAVR@AMAZON.COM Amazon, Bangalore, India 560055. Okke Schrijvers OKKES@CS.STANFORD.EDU Stanford University, Palo Alto, CA 94305. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
    Abstract In this paper we focus on the anomaly detection problem for dynamic data streams through the lens of random cut forests. We investigate a robust random cut data structure that can be used as a sketch or synopsis of the input stream. We provide a plausible definition of non-parametric anomalies based on the influence of an unseen point on the remainder of the data, i.e., the externality imposed by that point. We show how the sketch can be efficiently updated in a dynamic data stream. We demonstrate the viability of the algorithm on publicly available real data.
    1. Introduction Anomaly detection is one of the cornerstone problems in data mining. Even though the problem has been well studied over the last few decades, the emerging explosion of data from the internet of things and sensors leads us to reconsider the problem. In most of these contexts the data is streaming and well-understood prior models do not exist. Furthermore the input streams need not be append only, there may be corrections, updates and a variety of other dynamic changes. Two central questions in this regard are (1) how do we define anomalies? and (2) what data structure do we use to efficiently detect anomalies over dynamic data streams? In this paper we initiate the formal study of both of these questions. For (1), we view the problem from the perspective of model complexity and say that a point is an anomaly if the complexity of the model increases substantially with the inclusion of the point. The labeling of a point is data dependent and corresponds to the externality imposed by the point in explaining the remainder of the data. We extend this notion of externality to handle “outlier masking” that often arises from duplicates and near duplicate records. Note that the notion of model complexity has to be amenable to efficient computation in dynamic data streams. This relates question (1) to question (2) which we discuss in greater detail next. However it is worth noting that anomaly detection is not well understood even in the simpler context of static batch processing and (2) remains relevant in the batch setting as well.
    For question (2), we explore a randomized approach, akin to (Liu et al., 2012), due in part to the practical success reported in (Emmott et al., 2013). Randomization is a powerful tool and known to be valuable in supervised learning (Breiman, 2001). But its technical exploration in the context of anomaly detection is not well-understood and the same comment applies to the algorithm put forth in (Liu et al., 2012). Moreover that algorithm has several limitations as described in Section 4.1. In particular, we show that in the presence of irrelevant dimensions, crucial anomalies are missed. In addition, it is unclear how to extend this work to a stream. Prior work attempted solutions (Tan et al., 2011) that extend to streaming, however those were not found to be effective (Emmott et al., 2013). To address these limitations, we put forward a sketch or synopsis termed robust random cut forest (RRCF) formally defined as follows.
    Definition 1 A robust random cut tree (RRCT) on point set S is generated as follows:
    1. Choose a random dimension proportional to $\ell_i / \sum_j \ell_j$, where $\ell_i = \max_{x \in S} x_i - \min_{x \in S} x_i$.
    2. Choose $X_i \sim \mathrm{Uniform}[\min_{x \in S} x_i, \max_{x \in S} x_i]$.
    3. Let $S_1 = \{x \mid x \in S, x_i \le X_i\}$ and $S_2 = S \setminus S_1$ and recurse on $S_1$ and $S_2$.
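The three steps of Definition 1 can be sketched directly in code. This is an illustrative sketch, not the authors' implementation or the SageMaker algorithm: the dictionary-based tree, the helper names `build_rrct`, `depth`, and `average_depth`, and the toy data are assumptions. It uses the depth at which a point gets isolated as a simple stand-in for an anomaly score, which is a simplification of the externality-based score defined in the paper.

```python
import random


def build_rrct(points, rng):
    """Build a robust random cut tree over a list of equal-length tuples."""
    if len(points) <= 1:
        return {"points": points}
    dims = list(range(len(points[0])))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    lengths = [h - l for l, h in zip(lo, hi)]
    if sum(lengths) == 0:  # all points identical, nothing left to cut
        return {"points": points}
    # Step 1: pick a dimension with probability proportional to its range.
    dim = rng.choices(dims, weights=lengths)[0]
    # Step 2: pick a cut uniformly within that range (redraw if it hits the
    # upper bound, so that both sides of the cut are non-empty).
    cut = rng.uniform(lo[dim], hi[dim])
    while cut >= hi[dim]:
        cut = rng.uniform(lo[dim], hi[dim])
    # Step 3: split the points and recurse on both sides.
    left = [p for p in points if p[dim] <= cut]
    right = [p for p in points if p[dim] > cut]
    return {
        "dim": dim,
        "cut": cut,
        "left": build_rrct(left, rng),
        "right": build_rrct(right, rng),
    }


def depth(tree, point):
    """Number of cuts needed before `point` reaches a leaf."""
    d = 0
    while "dim" in tree:
        side = "left" if point[tree["dim"]] <= tree["cut"] else "right"
        tree = tree[side]
        d += 1
    return d


rng = random.Random(0)
cluster = [(0.1 * i, 0.1 * j) for i in range(4) for j in range(4)]  # dense group
outlier = (10.0, 10.0)                                              # far from the group
forest = [build_rrct(cluster + [outlier], rng) for _ in range(50)]


def average_depth(point):
    return sum(depth(tree, point) for tree in forest) / len(forest)


print("average depth of the outlier:", average_depth(outlier))
print("average depth of a cluster point:", average_depth(cluster[5]))
```

Because cuts are more likely to fall in the large empty region between the cluster and the distant point, that point tends to be isolated after very few cuts, while cluster points need several.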
  51. Random Cut Forest (RCF) 2004-2016 Unsupervised Anomaly Detection

  52. Random Cut Forest (RCF) 2004-2016 Unsupervised Anomaly Detection

  53. Random Cut Forest (RCF) 2004-2016 Unsupervised Anomaly Detection

  54. Shingling

    • The idea is to treat a period of P datapoints as a single datapoint of feature length P and then run the algorithm on these feature vectors • This is especially useful when working with periodic data with known period
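Shingling as described on this slide can be sketched in a few lines. The function name `shingle` and the choice of a sliding window that advances one datapoint at a time are assumptions for illustration; the slide only specifies that P consecutive datapoints become one feature vector of length P.

```python
def shingle(series, size):
    """Turn a 1-D series into overlapping feature vectors of length `size`."""
    return [tuple(series[i:i + size]) for i in range(len(series) - size + 1)]


# Toy periodic series with period 4 and one unusual value (9).
series = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 9, 4]
for vector in shingle(series, 4):
    print(vector)
```

Each resulting tuple can then be passed to an anomaly detector as a single point; with a window equal to the known period, a value that breaks the pattern changes every shingle that contains it.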
  55. Random Cut Forest (RCF) 2004-2016 Unsupervised Anomaly Detection

    Using “shingling”
  56. Anomaly Detection to Improve Infrastructure and Application Monitoring

    Amazon CloudWatch
  57. Time Series Forecasting (DeepAR)

    DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. Valentin Flunkert*, David Salinas*, Jan Gasthaus. Amazon Development Center Germany <dsalina,flunkert,gasthaus@amazon.com> *equal contribution. arXiv:1704.04110v2 [cs.AI] 5 Jul 2017
    Abstract Probabilistic forecasting, i.e. estimating the probability distribution of a time series’ future given its past, is a key enabler for optimizing business processes. In retail businesses, for example, forecasting demand is crucial for having the right inventory available at the right time at the right place. In this paper we propose DeepAR, a methodology for producing accurate probabilistic forecasts, based on training an auto-regressive recurrent network model on a large number of related time series. We demonstrate how by applying deep learning techniques to forecasting, one can overcome many of the challenges faced by widely-used classical approaches to the problem. We show through extensive empirical evaluation on several real-world forecasting data sets that our methodology produces more accurate forecasts than other state-of-the-art methods, while requiring minimal manual work.
    1 Introduction Forecasting plays a key role in automating and optimizing operational processes in most businesses and enables data driven decision making. In retail for example, probabilistic forecasts of product supply and demand can be used for optimal inventory management, staff scheduling and topology planning [17], and are more generally a crucial technology for most aspects of supply chain optimization. The prevalent forecasting methods in use today have been developed in the setting of forecasting individual or small groups of time series. In this approach, model parameters for each given time series are independently estimated from past observations. The model is typically manually selected to account for different factors, such as autocorrelation structure, trend, seasonality, and other explanatory variables. The fitted model is then used to forecast the time series into the future according to the model dynamics, possibly admitting probabilistic forecasts through simulation or closed-form expressions for the predictive distributions. Many methods in this class are based on the classical Box-Jenkins methodology [3], exponential smoothing techniques, or state space models [11, 18].
    In recent years, a new type of forecasting problem has become increasingly important in many applications. Instead of needing to predict individual or a small number of time series, one is faced with forecasting thousands or millions of related time series. Examples include forecasting the energy consumption of individual households, forecasting the load for servers in a data center, or forecasting the demand for all products that a large retailer offers. In all these scenarios, a substantial amount of data on past behavior of similar, related time series can be leveraged for making a forecast for an individual time series. Using data from related time series not only allows fitting more complex (and hence potentially more accurate) models without overfitting, it can also alleviate the time and labor intensive manual feature engineering and model selection steps required by classical techniques.
    2017 Supervised Time Series Forecasting
    • DeepAR is a supervised learning algorithm for forecasting scalar time series using recurrent neural networks (RNN)
    • Classical forecasting methods fit one model to each individual time series, and then use that model to extrapolate the time series into the future
    • In many applications you might have many similar time series across a set of cross-sectional units
    • For example, demand for different products, load of servers, requests for web pages, and so on
    • In this case, it can be beneficial to train a single model jointly over all of these time series
    • DeepAR takes this approach, training a model for predicting a time series over a large set of (related) time series
  58. Time Series Forecasting (DeepAR) 2017 Supervised Time Series Forecasting

  59. Time Series Forecasting (DeepAR)

  60. BlazingText (Word2vec) BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs

    Saurabh Gupta Amazon Web Services gsaur@amazon.com Vineet Khare Amazon Web Services vkhare@amazon.com ABSTRACT Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learn- ing. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used ex- tensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation. Most open-source implementations of the algorithm have been parallelized for multi-core CPU architectures including the original C implementation by Mikolov et al. [1] and FastText [2] by Facebook. A few other implementations have attempted to leverage GPU parallelization but at the cost of accuracy and scal- ability. In this work, we present BlazingText, a highly optimized implementation of word2vec in CUDA, that can leverage multiple GPUs for training. BlazingText can achieve a training speed of up to 43M words/sec on 8 GPUs, which is a 9x speedup over 8-threaded CPU implementations, with minimal e￿ect on the quality of the embeddings. CCS CONCEPTS • Computing methodologies → Neural networks; Natural language processing; KEYWORDS Word embeddings, Word2Vec, Natural Language Processing, Ma- chine Learning, CUDA, GPU ACM Reference format: Saurabh Gupta and Vineet Khare. 2017. BlazingText: Scaling and Accelerat- ing Word2Vec using Multiple GPUs. In Proceedings of MLHPC’17: Machine Learning in HPC Environments, Denver, CO, USA, November 12–17, 2017, 5 pages. 
https://doi.org/10.1145/3146347.3146354
MLHPC’17: Machine Learning in HPC Environments, November 12–17, 2017, Denver, CO, USA © 2017 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-5137-9/17/11.
1 INTRODUCTION Word2Vec aims to represent each word as a vector in a low-dimensional embedding space such that the geometry of resulting vectors captures word semantic similarity through the cosine similarity of corresponding vectors as well as more complex relationships through vector subtractions, such as vec(“King”) - vec(“Queen”) + vec(“Woman”) ≈ vec(“Man”). This idea has enabled many Natural Language Processing (NLP) algorithms to achieve better performance [3, 4].
The optimization in word2vec is done using Stochastic Gradient Descent (SGD), which solves the problem iteratively; at each step, it picks a pair of words: an input word and a target word either from its window or a random negative sample. It then computes the gradients of the objective function with respect to the two chosen words, and updates the word representations of the two words based on the gradient values. The algorithm then proceeds to the next iteration with a different word pair being chosen.
One of the main issues with SGD is that it is inherently sequential; since there is a dependency between the update from one iteration and the computation in the next iteration (they may happen to touch the same word representations), each iteration must potentially wait for the update from the previous iteration to complete. This does not allow us to use the parallel resources of the hardware. However, to solve the above issue, word2vec uses Hogwild [5], a scheme where different threads process different word pairs in parallel and ignore any conflicts that may arise in the model update phases. In theory, this can reduce the rate of convergence of algorithm as compared to a sequential run.
However, the Hogwild approach has been shown to work well in the case updates across threads are unlikely to be to the same word; and indeed for large vocabulary sizes, conflicts are relatively rare and convergence is not typically affected.
The success of Hogwild approach for Word2Vec in case of multi-core architectures makes this algorithm a good candidate for exploiting GPU, which provides orders of magnitude more parallelism than a CPU. In this paper, we propose an efficient parallelization technique for accelerating word2vec using GPUs.
GPU acceleration using deep learning frameworks is not a good choice for accelerating word2vec [6]. These frameworks are often suitable for “deep networks” where the computation is dominated by heavy operations like convolutions and large matrix multiplications. On the other hand, word2vec is a relatively shallow network, as each training step consists of an embedding lookup, gradient computation and finally weight updates for the word pair under consideration. The gradient computation and updates involve small dot products and thus don’t benefit from the use of cuDNN [7] or cuBLAS [8] libraries.
The limitations of deep learning frameworks led us to explore the CUDA C++ API. We design the training algorithm from scratch, to utilize CUDA multi-threading capabilities optimally, without hurting the output accuracy by over-exploiting GPU parallelism. Finally, to scale out BlazingText to process text corpus at several million words/sec, we demonstrate the possibility of using multiple GPUs to perform data parallelism based training, which is one of the main contributions of our work.
We benchmark BlazingText against
2013-2017 Supervised Word Embedding
Efficient Estimation of Word Representations in Vector Space Tomas Mikolov Google Inc., Mountain View, CA tmikolov@google.com Kai Chen Google Inc., Mountain View, CA kaichen@google.com Greg Corrado Google Inc., Mountain View, CA gcorrado@google.com Jeffrey Dean Google Inc., Mountain View, CA jeff@google.com
Abstract We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
1 Introduction Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data. An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]). However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less.
Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques. With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].
1.1 Goals of the Paper The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary. As far as we know, none of the previously proposed architectures has been successfully trained on more
arXiv:1301.3781v3 [cs.CL] 7 Sep 2013
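The SGD step described in the BlazingText introduction (pick an input word and a target word, compute the gradient, update both vectors) can be sketched roughly as follows. This is a single-threaded NumPy illustration under assumed toy sizes, not the CUDA implementation from the paper; `W_in`, `W_out`, `sgd_step`, the learning rate, and the word ids are all hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 8                                  # toy sizes for illustration
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input (word) vectors
W_out = np.zeros((vocab_size, dim))                      # output (target) vectors


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgd_step(w, c, label, lr=0.05):
    """Update the two vectors of one (input word, target word) pair.

    label is 1 when the target comes from the window, 0 for a negative sample.
    Returns the logistic loss measured before the update.
    """
    v, u = W_in[w].copy(), W_out[c].copy()
    score = sigmoid(v @ u)
    grad = score - label              # derivative of the loss w.r.t. the dot product
    W_in[w] -= lr * grad * u          # only these two rows are touched
    W_out[c] -= lr * grad * v
    return -np.log(score if label == 1 else 1.0 - score)


# Repeating the step on one positive and one negative pair lowers both losses.
pos_losses = [sgd_step(w=1, c=2, label=1) for _ in range(50)]
neg_losses = [sgd_step(w=1, c=7, label=0) for _ in range(50)]
print(pos_losses[0], pos_losses[-1])
print(neg_losses[0], neg_losses[-1])
```

Because each step touches only two rows of the embedding matrices, two threads working on different word pairs rarely collide when the vocabulary is large, which is the intuition behind the Hogwild scheme mentioned in the text.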
  61. @data_monsters https://twitter.com/data_monsters/status/844256398393462784

  62. Word2vec ⇾ Word Embedding 2013 Supervised Word Embedding

    Continuous Bag-of-Words (CBOW) to predict a word given its context
    Skip-Gram with Negative Sampling (SGNS) to predict the context given a word
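To make the difference between the two architectures concrete, here is a small sketch of how training examples could be drawn from a sentence for each of them. The function names, the window size, and the sample sentence are assumptions for illustration; negative sampling itself is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each word is the input, each neighbor in the window is a target."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding words are the input, the center word is the target."""
    examples = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, word))
    return examples


sentence = "and then there are algorithms".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per position with the whole context as input.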
  63. BlazingText (Word2vec) Scaling 2017 Supervised Word Embedding

  64. AWS Summit Milan 2018

  65. https://bit.ly/2SSI2Qo

  66. And Then There Are (Built-in) Algorithms

    Algorithm: Scope
    Linear Learner: classification, regression
    Factorization Machines: classification, regression, sparse datasets
    K-Nearest Neighbors (k-NN): classification, regression
    K-Means Clustering: clustering, unsupervised
    Principal Component Analysis (PCA): dimensionality reduction, unsupervised
    XGBoost: regression, classification (binary and multiclass), and ranking
    Image Classification: CNNs (ResNet)
    Object Classification: Object classification (and bounding box) inside an image
    Semantic Segmentation: Pixel by pixel classification of an image
    Sequence to Sequence (seq2seq): translation, text summarization, speech-to-text (RNNs, CNN)
    Latent Dirichlet Allocation (LDA): topic modeling, unsupervised
    Neural Topic Model (NTM): topic modeling, unsupervised
    Random Cut Forest (RCF): anomaly detection
    Time Series Forecasting (DeepAR): time series forecasting (RNN)
    BlazingText (Word2vec): word embeddings
  67. Machine Learning = Algorithms + Data + Tools

  68. Customers want more value from their data Growing exponentially From

    new sources Increasingly diverse Used by many people Analyzed by many applications
  69. Cloud data lakes are the future Customers want: A single

    data store that is scalable & cost effective To store data securely in standard formats To analyze their data in a variety of ways Cloud Data Lake Infrastructure Decoupled Storage & Compute Resources Security & Governance Data Migration Streaming Services Data Warehouse Big Data Processing Serverless Data Processing Real-time Analytics Operational Analytics Predictive Analytics ETL & Catalog Data Management
  70. 125+ million players Data provides a constant feedback loop for

    game designers Up-to-the-minute analysis of gamer satisfaction to drive gamer engagement Resulting in the most popular game played in the world Fortnite
  71. Data lake infrastructure & management “With an enterprise-ready option like

    Lake Formation, we will be able to spend more time deriving value from our data rather than doing the heavy lifting involved in manually setting up and managing our data lake.” —Joshua Couch, VP Engineering at Fender Digital
  72. Analytics FINRA’s legacy system did not scale to handle 75

    billion events per day. They needed to run complex surveillance queries over 20+ PB of data. FINRA migrated their big data appliance to an S3 data lake and uses EMR for ingestion and processing.
  73. CHALLENGE Needed to analyze data to find insights, identify opportunities,

    and evaluate business performance. The Oracle DW did not scale, was difficult to maintain, and was costly. SOLUTION Deployed a data lake with S3 and ran analytics with Redshift, Redshift Spectrum, and EMR. Result: they doubled the data stored (100PB), lowered costs, and were able to gain insights faster. 50PB of data, 600,000 analytics jobs/day.
    [Architecture diagram labels: Source systems (S3, DynamoDB, Relational Stores, Non Relational Stores, S3, Kinesis); Data Lake (Web Interface, Data Lake APIs, Workflows service, Discovery service, Data Ingestion, Subscription Service, Data security and governance, Data Quality / Curation); Big data marketplace; Analytics (EMR, Redshift, Redshift Spectrum, Other Compute); 100PB]
  74. Amazon.com,1995

  75. None
  76. None
  77. None
  78. None
  79. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark

    Our mission at AWS: Put machine learning in the hands of every developer
  80. © 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved

    The AWS ML Stack: Broadest and deepest set of capabilities
    AI Services (Vision, Speech, Language, Chatbots, Forecasting, Recommendations): Polly, Transcribe, Translate, Comprehend & Comprehend Medical, Lex, Forecast, Rekognition Image, Rekognition Video, Textract, Personalize
    ML Services (Amazon SageMaker): Ground Truth, Notebooks, Algorithms + Marketplace, Reinforcement Learning, Training, Optimization, Deployment, Hosting
    ML Frameworks + Infrastructure (Frameworks, Interfaces, Infrastructure): FPGAs, EC2 P3 & P3dn, EC2 G4, EC2 C5, Inferentia, Greengrass, Elastic Inference, DL Containers & AMIs, Elastic Kubernetes Service, Elastic Container Service
  81. Accelerating investigation timelines

    FINRA uses Amazon Comprehend to process and review millions of documents with unstructured data, helping flag records of interest that should be reviewed by human investigators.
  82. Predicting global markets

    Moody’s uses Amazon SageMaker to better predict market conditions and credit actions.
  83. Accelerating financial analysis

    Using TensorFlow on Amazon SageMaker, Siemens Financial Services developed an NLP model to extract critical information to accelerate investment due diligence, reducing time to summarize diligence documents from 12 hours down to 30 seconds.
  84. Optimizing interactive games

    Rovio uses deep reinforcement learning on AWS to help predict the difficulty of levels in Angry Birds Dream Blast. This lets their developers focus on creating better player experiences, instead of testing levels.
  85. Driving better healthcare outcomes

    Using Amazon SageMaker, GE Healthcare developed an ML model that can learn from thousands of medical scans to detect anomalies more accurately and efficiently, allowing radiologists to prioritize patients needing immediate attention.
  86. Enhancing the fan experience

    Formula 1 uses Amazon SageMaker to create real time insights on how a driver is performing, improving the fan experience on television broadcasts and digital platforms.
  87. Improving customer service

    T-Mobile uses Amazon SageMaker Ground Truth to label unstructured data from customer service interactions. These data sets are used to train machine learning models that provide their human agents with recommended actions for a given customer.
  88. Culture: Setting your organization up for success
  89. Assess your structured and unstructured data sources Create the loop 1. Connect technology initiatives with business outcomes

    Advance your data strategy Put machine learning in the hands of your developers Organize for success 2. 3. ?
  90. The world’s first deep learning-enabled video camera for developers

    • Purpose-built for ML-skills development
    • Fully programmable & customizable
    • Build custom Amazon SageMaker models
    • 10 minutes to your first deep learning project
  91. A fully autonomous 1/18th-scale race car designed to help you learn about reinforcement learning through autonomous driving

    • Build machine learning models in Amazon SageMaker
    • Train, test, and iterate on the track using the AWS DeepRacer 3D racing simulator
    • Compete in the world’s first global autonomous racing league, to race for prizes and a chance to advance to win the coveted AWS DeepRacer Cup
  92. None
  93. © 2019, Amazon Web Services, Inc. or its Affiliates. Danilo

    Poccia Principal Evangelist AWS @danilop danilop.net And Then There Are Algorithms