Fully Convolutional Networks for Semantic Segmentation

Evan Shelhamer*, Jonathan Long*, and Trevor Darrell, Member, IEEE

*Authors contributed equally. E. Shelhamer, J. Long, and T. Darrell are with the Department of Electrical Engineering and Computer Science (CS Division), UC Berkeley. E-mail: {shelhamer,jonlong,trevor}@cs.berkeley.edu.

Abstract—Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC (30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of a second for a typical image.

Index Terms—Semantic Segmentation, Convolutional Networks, Deep Learning, Transfer Learning

1 INTRODUCTION

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [1], [2], [3], but also making progress on local tasks with structured output. These include advances in bounding box object detection [4], [5], [6], part and keypoint prediction [7], [8], and local correspondence [8], [9].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [10], [11], [12], [13], [14], [15], [16], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

We show that fully convolutional networks (FCNs) trained end-to-end, pixels-to-pixels on semantic segmentation exceed the previous best results without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [10], [11], [12], [13], [16], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [12], [14], proposals [14], [15], or post-hoc refinement by random fields or local classifiers [12], [14]. Our model transfers recent success in classification [1], [2], [3] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [10], [12], [13].
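To make the reinterpretation of a classification net as a fully convolutional one concrete, the following is a minimal sketch, written in PyTorch as an assumption (the paper's own implementation is Caffe-based), with toy layer sizes that are illustrative rather than taken from AlexNet, VGG, or GoogLeNet. A fixed-input fully connected head is rewritten as a convolution so the network accepts arbitrary input sizes and emits a coarse grid of class scores, which is then upsampled to pixel resolution.

```python
# Sketch (not the authors' code): "convolutionalizing" a classifier head so the
# net takes input of arbitrary size and produces a correspondingly-sized output.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21  # e.g. PASCAL VOC: 20 object classes + background

# A toy "classification" backbone whose fully connected head only works on a
# fixed 32x32 input (feature map is 128 x 4 x 4 after three 2x poolings).
features = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
)
fc = nn.Linear(128 * 4 * 4, num_classes)

# A Linear layer over a 4x4 window is equivalent to a 4x4 convolution, so the
# same weights can simply be reshaped into a conv kernel.
score = nn.Conv2d(128, num_classes, kernel_size=4)
with torch.no_grad():
    score.weight.copy_(fc.weight.view(num_classes, 128, 4, 4))
    score.bias.copy_(fc.bias)

# The fully convolutional net now accepts any input size and produces a
# subsampled grid of class scores instead of a single prediction.
x = torch.randn(1, 3, 256, 256)
coarse = score(features(x))
print(coarse.shape)   # torch.Size([1, 21, 29, 29])

# In-network upsampling recovers pixelwise predictions; fixed bilinear
# interpolation is used here for brevity, whereas the paper learns
# deconvolution filters initialized to bilinear upsampling.
dense = F.interpolate(coarse, size=x.shape[2:], mode="bilinear", align_corners=False)
print(dense.shape)    # torch.Size([1, 21, 256, 256])
```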
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. What can be done to navigate this spectrum from location to semantics? How can local decisions respect global structure? It is not immediately clear that deep networks for image classification yield representations sufficient for accurate, pixelwise recognition.

In the conference version of this paper [17], we cast pre-trained networks into fully convolutional form, and augment them with a skip architecture that takes advantage of the full feature spectrum. The skip architecture fuses the feature hierarchy to combine deep, coarse, semantic information and shallow, fine, appearance information (see Section 4.3 and Figure 3). In this light, deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid.

This journal paper extends our earlier work [17] through further tuning, analysis, and more results. Alternative choices, ablations, and implementation details better cover the space of FCNs. Tuning optimization leads to more accurate networks and a means to learn skip architectures all-at-once instead of in stages. Experiments that mask foreground and background investigate the role of context and shape. Results on the object and scene labeling of PASCAL-Context reinforce merging object segmentation and scene parsing as unified pixelwise prediction.

In the next section, we review related work on deep classification nets, FCNs, recent approaches to semantic segmentation using convnets, and extensions to FCNs.
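The skip architecture described above can be sketched roughly as follows. This is our own illustrative PyTorch approximation, not the released FCN model: layer names, channel counts, and strides are assumptions, and fixed bilinear interpolation stands in for the learned deconvolution that the paper initializes to bilinear upsampling. Class scores from a deep, coarse layer are upsampled and summed with scores from a shallower, finer layer before the final upsampling to pixel resolution.

```python
# Sketch of skip fusion in the FCN-16s spirit: fuse coarse, semantic scores
# with finer, shallower scores, then upsample to the image size.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusionHead(nn.Module):
    def __init__(self, num_classes=21, deep_channels=512, shallow_channels=256):
        super().__init__()
        # 1x1 convolutions predict class scores at each feature resolution.
        self.score_deep = nn.Conv2d(deep_channels, num_classes, kernel_size=1)
        self.score_shallow = nn.Conv2d(shallow_channels, num_classes, kernel_size=1)

    def forward(self, shallow_feat, deep_feat, out_size):
        coarse = self.score_deep(deep_feat)        # deep: semantic but coarse
        fine = self.score_shallow(shallow_feat)    # shallow: finer localization
        # Upsample the coarse scores to the shallow layer's resolution and fuse.
        coarse_up = F.interpolate(coarse, size=fine.shape[2:], mode="bilinear",
                                  align_corners=False)
        fused = fine + coarse_up
        # Final upsampling back to pixel resolution.
        return F.interpolate(fused, size=out_size, mode="bilinear",
                             align_corners=False)

# Example with feature maps at strides 8 and 16 for a 256x256 image.
head = SkipFusionHead()
shallow = torch.randn(1, 256, 32, 32)   # shallower, finer features (assumed)
deep = torch.randn(1, 512, 16, 16)      # deeper, more subsampled features (assumed)
out = head(shallow, deep, out_size=(256, 256))
print(out.shape)                        # torch.Size([1, 21, 256, 256])
```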
Pyramid Scene Parsing Network

Hengshuang Zhao1  Jianping Shi2  Xiaojuan Qi1  Xiaogang Wang1  Jiaya Jia1
1The Chinese University of Hong Kong  2SenseTime Group Limited
{hszhao, xjqi, leojia}@cse.cuhk.edu.hk, [email protected], [email protected]
Abstract

Scene parsing is challenging because of its unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module, together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields the new record of 85.4% mIoU on PASCAL VOC 2012 and 80.2% on Cityscapes.

1. Introduction

Scene parsing, based on semantic segmentation, is a fundamental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing provides complete understanding of the scene: it predicts the label, location, and shape of each element. The topic is of broad interest for potential applications such as automatic driving and robot sensing, to name a few.

The difficulty of scene parsing is closely related to scene and label variety. The pioneering scene parsing task [23] is to classify 33 scenes for 2,688 images on the LMO dataset [22]. The more recent PASCAL VOC semantic segmentation and PASCAL context datasets [8, 29] include more labels with similar context, such as chair and sofa, horse and cow, etc. The new ADE20K dataset [43] is the most challenging one, with a large and unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. Developing an effective algorithm for these datasets requires conquering a few difficulties.

Figure 1. Illustration of complex scenes in the ADE20K dataset.

State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. Deep convolutional neural network (CNN) based methods boost dynamic object understanding, yet still face challenges given diverse scenes and an unrestricted vocabulary. One example is shown in the first row of Fig. 2, where a boat is mistaken for a car. These errors are due to the similar appearance of the objects. But when viewing the image with the context prior that the scene is described as a boathouse near a river, the correct prediction should be yielded.

Towards accurate scene perception, the knowledge graph relies on prior information of scene context. We found that the major issue for current FCN-based models is the lack of a suitable strategy to utilize global scene category clues. For typical complex scene understanding, to get a global image-level feature, spatial pyramid pooling [18] was previously widely employed, where spatial statistics provide a good descriptor for overall scene interpretation. The spatial pyramid pooling network [12] further enhances this ability.

Different from these methods, to incorporate suitable global features, we propose the pyramid scene parsing network (PSPNet). In addition to the traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature to the specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable.
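The pyramid pooling idea can be sketched as follows. This is our own illustrative PyTorch sketch, not the released PSPNet code: the bin sizes (1, 2, 3, 6) follow the paper, while channel widths and normalization choices are assumptions. The feature map is average-pooled into several bin sizes, each pooled map is reduced with a 1x1 convolution, upsampled, and concatenated with the original features so that every position carries both local and global context.

```python
# Sketch of a pyramid pooling module in the spirit of PSPNet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                      # region/global pooling
                nn.Conv2d(in_channels, reduced, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled context map back to the input resolution.
        context = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                 align_corners=False) for stage in self.stages]
        return torch.cat([x] + context, dim=1)                # local + global clues

ppm = PyramidPooling()
ppm.eval()   # eval mode: the 1x1 global bin gives too few values per channel for training-mode BatchNorm
feat = torch.randn(1, 2048, 60, 60)   # e.g. dilated-backbone features at 1/8 resolution (assumed)
out = ppm(feat)
print(out.shape)                      # torch.Size([1, 4096, 60, 60])
```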
Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen  George Papandreou  Florian Schroff  Hartwig Adam
Google Inc.
{lcchen, gpapan, fschroff, hadam}@google.com

Abstract

In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, and further boost performance. We also elaborate on implementation details and share our experience in training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

1. Introduction

For the task of semantic segmentation [20, 63, 14, 97, 7], we consider two challenges in applying Deep Convolutional Neural Networks (DCNNs) [50]. The first one is the reduced feature resolution caused by consecutive pooling operations or convolution striding, which allows DCNNs to learn increasingly abstract feature representations. However, this invariance to local image transformation may impede dense prediction tasks, where detailed spatial information is desired. To overcome this problem, we advocate the use of atrous convolution [36, 26, 74, 66], which has been shown to be effective for semantic image segmentation [10, 90, 11]. Atrous convolution, also known as dilated convolution, allows us to repurpose ImageNet [72] pretrained networks to extract denser feature maps by removing the downsampling operations from the last few layers and upsampling the corresponding filter kernels, equivalent to inserting holes ('trous' in French) between filter weights. With atrous convolution, one is able to control the resolution at which feature responses are computed within DCNNs without requiring learning extra parameters.

Figure 1. Atrous convolution with kernel size 3 × 3 and different rates. Standard convolution corresponds to atrous convolution with rate = 1. Employing a large atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales.
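To make the mechanics of atrous convolution concrete, here is a small sketch of our own (in PyTorch, an assumption; it is not a snippet from the paper). The same 3x3 kernel is applied with different dilation rates, echoing the rates of Fig. 1: the parameter count and output resolution stay fixed while the effective field-of-view grows from 3x3 to (2·rate + 1) on each side.

```python
# Sketch: one 3x3 kernel, several atrous rates, same parameters and resolution.
import torch
import torch.nn as nn

x = torch.randn(1, 256, 65, 65)          # an example feature map (assumed size)

for rate in (1, 6, 24):                  # rates echoing Fig. 1
    # padding = rate keeps the spatial size for a stride-1 3x3 convolution.
    conv = nn.Conv2d(256, 256, kernel_size=3, dilation=rate, padding=rate)
    y = conv(x)
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"rate={rate:2d}  out={tuple(y.shape[2:])}  params={n_params}")

# Each rate yields the same 65x65 output and an identical parameter count;
# only the spacing ("holes") between the sampled filter taps changes.
```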
Another difficulty comes from the existence of objects at multiple scales. Several methods have been proposed to handle the problem, and we mainly consider four categories in this work, as illustrated in Fig. 2. First, the DCNN is applied to an image pyramid to extract features for each scale of input [22, 19, 69, 55, 12, 11], where objects at different scales become prominent at different feature maps. Second, the encoder-decoder structure [3, 71, 25, 54, 70, 68, 39] exploits multi-scale features from the encoder part and recovers the spatial resolution from the decoder part. Third, extra modules are cascaded on top of the original network to capture long-range information. In particular, DenseCRF [45] is employed to encode pixel-level pairwise similarities [10, 96, 55, 73], while [59, 90] develop several extra convolutional layers in cascade to gradually capture long-range context. Fourth, spatial pyramid pooling [11, 95] probes an incoming feature map with filters or pooling operations at multiple rates and multiple effective fields-of-view, thus capturing objects at multiple scales.

In this work, we revisit applying atrous convolution, which allows us to effectively enlarge the field of view of filters to incorporate multi-scale context, in the framework of both cascaded modules and spatial pyramid pooling. In particular, our proposed module consists of atrous convolution with various rates and batch normalization layers.
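As a rough illustration of the parallel, spatial-pyramid-pooling flavor of this design, here is a sketch of an ASPP-style module with image-level features. It is our own PyTorch approximation, not the released DeepLabv3 implementation: the rates (6, 12, 18), channel widths, and batch normalization placement are assumptions chosen for illustration.

```python
# Sketch of an ASPP-style module: parallel atrous convolutions at several
# rates, a 1x1 branch, and a global (image-level) pooling branch, concatenated
# and fused with a 1x1 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_channels=2048, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(kernel, dilation):
            padding = 0 if kernel == 1 else dilation
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel, padding=padding,
                          dilation=dilation, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            )
        self.branches = nn.ModuleList(
            [branch(1, 1)] + [branch(3, r) for r in rates])
        # Image-level features: global average pooling -> 1x1 conv -> upsample.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w), mode="bilinear",
                               align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = ASPP()
aspp.eval()  # eval mode: the image-level branch yields 1x1 maps, too few values for training-mode BatchNorm
out = aspp(torch.randn(1, 2048, 33, 33))
print(out.shape)             # torch.Size([1, 256, 33, 33])
```

The global pooling branch is what "image-level features encoding global context" refers to in the abstract: without it, very large atrous rates degenerate toward 1x1 convolutions because most kernel taps fall outside the feature map.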