Semantic Segmentation (Image)
2016-2017
Supervised
Fully Convolutional Networks
for Semantic Segmentation
Evan Shelhamer*, Jonathan Long*, and Trevor Darrell, Member, IEEE
*Authors contributed equally. E. Shelhamer, J. Long, and T. Darrell are with the Department of Electrical Engineering and Computer Science (CS Division), UC Berkeley. E-mail: {shelhamer,jonlong,trevor}@cs.berkeley.edu.
Abstract—Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks
by themselves, trained end-to-end, pixels-to-pixels, improve on the previous best result in semantic segmentation. Our key insight is to
build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference
and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction
tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet)
into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a
skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer
to produce accurate and detailed segmentations. Our fully convolutional network achieves improved segmentation of PASCAL VOC
(30% relative improvement to 67.2% mean IU on 2012), NYUDv2, SIFT Flow, and PASCAL-Context, while inference takes one tenth of
a second for a typical image.
Index Terms—Semantic Segmentation, Convolutional Networks, Deep Learning, Transfer Learning
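For reference, mean IU averages the per-class intersection over union across all classes. Below is a minimal sketch of the computation from a class confusion matrix, assuming NumPy; the function name and array layout are illustrative and not the authors' evaluation code.

```python
import numpy as np

def mean_iu(conf: np.ndarray) -> float:
    """Mean intersection over union from a (C x C) confusion matrix,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    true_pos = np.diag(conf)                          # correctly labeled pixels per class
    union = conf.sum(axis=1) + conf.sum(axis=0) - true_pos
    valid = union > 0                                 # skip classes absent from both truth and prediction
    return float((true_pos[valid] / union[valid]).mean())

# toy 3-class example: perfect on class 0, partial confusion on classes 1 and 2
conf = np.array([[10, 0, 0],
                 [ 2, 6, 2],
                 [ 0, 1, 9]])
print(mean_iu(conf))   # ~0.71
```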
1 INTRODUCTION
Convolutional networks are driving advances in
recognition. Convnets are not only improving for
whole-image classification [1], [2], [3], but also making
progress on local tasks with structured output. These in-
clude advances in bounding box object detection [4], [5], [6],
part and keypoint prediction [7], [8], and local correspon-
dence [8], [9].
The natural next step in the progression from coarse to
fine inference is to make a prediction at every pixel. Prior
approaches have used convnets for semantic segmentation
[10], [11], [12], [13], [14], [15], [16], in which each pixel is
labeled with the class of its enclosing object or region, but
with shortcomings that this work addresses.
We show that fully convolutional networks (FCNs)
trained end-to-end, pixels-to-pixels on semantic segmen-
tation exceed the previous best results without further
machinery. To our knowledge, this is the first work to
train FCNs end-to-end (1) for pixelwise prediction and (2)
from supervised pre-training. Fully convolutional versions
of existing networks predict dense outputs from arbitrary-
sized inputs. Both learning and inference are performed
whole-image-at-a-time by dense feedforward computation
and backpropagation. In-network upsampling layers enable
pixelwise prediction and learning in nets with subsampling.
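As a rough illustration of these two ingredients (a PyTorch sketch under assumed layer sizes, not the paper's architecture): fully connected classifier layers are recast as 1x1 convolutions so the net accepts arbitrary-sized inputs, and a learnable transposed-convolution layer upsamples the coarse scores back to input resolution.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy fully convolutional net: a small backbone whose final layers are
    convolutions (no fully connected layers), followed by in-network
    upsampling so the output matches the input's spatial size."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # subsampling x2
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # subsampling x4 total
        )
        # a "fully connected" classifier recast as a 1x1 convolution -> per-location scores
        self.score = nn.Conv2d(32, num_classes, kernel_size=1)
        # learnable in-network upsampling back to input resolution (stride 4)
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=8, stride=4, padding=2)

    def forward(self, x):
        h = self.features(x)
        h = self.score(h)
        return self.upsample(h)    # dense, pixelwise class scores

x = torch.randn(1, 3, 224, 224)    # arbitrary-sized inputs give correspondingly-sized outputs
print(TinyFCN()(x).shape)          # torch.Size([1, 21, 224, 224])
```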
This method is efficient, both asymptotically and ab-
solutely, and precludes the need for the complications in
other works. Patchwise training is common [10], [11], [12],
[13], [16], but lacks the efficiency of fully convolutional
training. Our approach does not make use of pre- and post-
processing complications, including superpixels [12], [14],
proposals [14], [15], or post-hoc refinement by random fields
or local classifiers [12], [14]. Our model transfers recent
success in classification [1], [2], [3] to dense prediction by
reinterpreting classification nets as fully convolutional and
fine-tuning from their learned representations. In contrast,
previous works have applied small convnets without super-
vised pre-training [10], [12], [13].
Semantic segmentation faces an inherent tension be-
tween semantics and location: global information resolves
what while local information resolves where. What can be
done to navigate this spectrum from location to semantics?
How can local decisions respect global structure? It is not
immediately clear that deep networks for image classifica-
tion yield representations sufficient for accurate, pixelwise
recognition.
In the conference version of this paper [17], we cast
pre-trained networks into fully convolutional form, and
augment them with a skip architecture that takes advantage
of the full feature spectrum. The skip architecture fuses
the feature hierarchy to combine deep, coarse, semantic
information and shallow, fine, appearance information (see
Section 4.3 and Figure 3). In this light, deep feature hierar-
chies encode location and semantics in a nonlinear local-to-
global pyramid.
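The skip idea can be sketched roughly as follows (PyTorch; the module name and score-map sizes are assumptions, not the paper's FCN-16s/FCN-8s definitions): coarse scores from a deep layer are upsampled and summed with scores from a shallower, finer layer before the final upsampling to full resolution.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Fuse coarse, semantic scores from a deep layer with fine scores from a
    shallow layer (both already projected to num_classes channels)."""
    def __init__(self, num_classes: int = 21):
        super().__init__()
        # 2x learnable upsampling of the coarse score map
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=4, stride=2, padding=1)

    def forward(self, coarse_scores, fine_scores):
        # upsample deep/coarse predictions to the shallow layer's resolution,
        # then combine by elementwise summation (the skip connection)
        return self.up2(coarse_scores) + fine_scores

fuse = SkipFusion()
coarse = torch.randn(1, 21, 16, 16)   # from a deep, heavily subsampled layer
fine = torch.randn(1, 21, 32, 32)     # from a shallower, finer layer
print(fuse(coarse, fine).shape)       # torch.Size([1, 21, 32, 32])
```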
This journal paper extends our earlier work [17] through
further tuning, analysis, and more results. Alternative
choices, ablations, and implementation details better cover
the space of FCNs. Tuning optimization leads to more accu-
rate networks and a means to learn skip architectures all-at-
once instead of in stages. Experiments that mask foreground
and background investigate the role of context and shape.
Results on the object and scene labeling of PASCAL-Context
reinforce merging object segmentation and scene parsing as
unified pixelwise prediction.
In the next section, we review related work on deep
classification nets, FCNs, recent approaches to semantic seg-
mentation using convnets, and extensions to FCNs.
Pyramid Scene Parsing Network
Hengshuang Zhao1 Jianping Shi2 Xiaojuan Qi1 Xiaogang Wang1 Jiaya Jia1
1The Chinese University of Hong Kong 2SenseTime Group Limited
{hszhao, xjqi, leojia}@cse.cuhk.edu.hk, [email protected], [email protected]
Abstract
Scene parsing is challenging due to its unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module, together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective for producing good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The proposed approach achieves state-of-the-art performance on various datasets. It came first in the ImageNet scene parsing challenge 2016, the PASCAL VOC 2012 benchmark, and the Cityscapes benchmark. A single PSPNet yields the new record of 85.4% mIoU on PASCAL VOC 2012 and 80.2% on Cityscapes.
1. Introduction
Scene parsing, based on semantic segmentation, is a fundamental topic in computer vision. The goal is to assign each pixel in the image a category label. Scene parsing provides complete understanding of the scene: it predicts the label, location, and shape of each element. The topic is of broad interest for potential applications such as autonomous driving and robot sensing, to name a few.
The difficulty of scene parsing is closely related to scene and label variety. The pioneering scene parsing task [23] classifies 33 scenes across 2,688 images in the LMO dataset [22]. The more recent PASCAL VOC semantic segmentation and PASCAL-Context datasets [8, 29] include more labels with similar context, such as chair and sofa or horse and cow. The new ADE20K dataset [43] is the most challenging, with a large, unrestricted open vocabulary and more scene classes. A few representative images are shown in Fig. 1. Developing an effective algorithm for these datasets requires overcoming several difficulties.
State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [26]. Deep convolutional neural network (CNN) based methods boost dynamic object understanding, yet they still face challenges with diverse scenes and an unrestricted vocabulary. One example is shown in the first row of Fig. 2, where a boat is mistaken for a car. Such errors are due to the similar appearance of objects. But when the image is viewed with the context prior that the scene is a boathouse near a river, the correct prediction should follow.
Figure 1. Illustration of complex scenes in the ADE20K dataset.
Towards accurate scene perception, the knowledge graph relies on prior information about the scene context. We found that the major issue for current FCN-based models is the lack of a suitable strategy to exploit global scene category clues. For typical complex scene understanding, spatial pyramid pooling [18] was previously widely employed to obtain a global image-level feature, where spatial statistics provide a good descriptor for overall scene interpretation. The spatial pyramid pooling network [12] further enhances this ability.
Different from these methods, to incorporate suitable global features we propose the pyramid scene parsing network (PSPNet). In addition to a traditional dilated FCN [3, 40] for pixel prediction, we extend the pixel-level feature with a specially designed global pyramid pooling one. The local and global clues together make the final prediction more reliable. We also propose an optimization strategy with deeply supervised loss.
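A minimal sketch of a pyramid pooling module of this kind, assuming PyTorch (the bin sizes 1, 2, 3, and 6 follow the paper; channel counts and names are illustrative): features are average-pooled over several grid resolutions, projected by 1x1 convolutions, upsampled, and concatenated with the original feature map as a global prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Aggregate context from different sub-regions by pooling the feature map
    into 1x1, 2x2, 3x3 and 6x6 grids, then concatenating the upsampled
    results with the original features."""
    def __init__(self, in_channels: int = 2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_c = in_channels // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_channels, out_c, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        priors = [F.interpolate(stage(x), size=(h, w),
                                mode='bilinear', align_corners=False)
                  for stage in self.stages]
        return torch.cat([x] + priors, dim=1)   # global prior + local features

ppm = PyramidPooling().eval()    # eval mode avoids BatchNorm's single-value restriction in the 1x1 bin
feat = torch.randn(1, 2048, 60, 60)   # e.g. dilated-FCN features at 1/8 input resolution
print(ppm(feat).shape)                # torch.Size([1, 4096, 60, 60])
```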
Rethinking Atrous Convolution for Semantic Image Segmentation
Liang-Chieh Chen George Papandreou Florian Schroff Hartwig Adam
Google Inc.
{lcchen, gpapan, fschroff, hadam}@google.com
Abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust a filter's field-of-view and to control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the context of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context, further boosting performance. We also elaborate on implementation details and share our experience in training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-the-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
1. Introduction
For the task of semantic segmentation [20, 63, 14, 97, 7],
we consider two challenges in applying Deep Convolutional
Neural Networks (DCNNs) [50]. The first one is the reduced
feature resolution caused by consecutive pooling operations
or convolution striding, which allows DCNNs to learn in-
creasingly abstract feature representations. However, this
invariance to local image transformation may impede dense
prediction tasks, where detailed spatial information is de-
sired. To overcome this problem, we advocate the use of
atrous convolution [36, 26, 74, 66], which has been shown
to be effective for semantic image segmentation [10, 90, 11].
Atrous convolution, also known as dilated convolution, al-
lows us to repurpose ImageNet [72] pretrained networks
to extract denser feature maps by removing the downsam-
pling operations from the last few layers and upsampling
the corresponding filter kernels, equivalent to inserting holes
(‘trous’ in French) between filter weights. With atrous convo-
lution, one is able to control the resolution at which feature
responses are computed within DCNNs without requiring learning extra parameters.
Figure 1. Atrous convolution with kernel size 3 × 3 and different rates. Standard convolution corresponds to atrous convolution with rate = 1. Employing a large value of atrous rate enlarges the model's field-of-view, enabling object encoding at multiple scales.
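In code terms (a PyTorch sketch, not the authors' implementation), atrous convolution is an ordinary convolution with a dilation argument; increasing the rate enlarges the field-of-view while leaving both the parameter count and the output resolution unchanged.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)          # an intermediate DCNN feature map

# Same 3x3 kernel and same number of parameters at each rate; only the
# effective field-of-view changes (3x3, 13x13, 49x49 for rates 1, 6, 24).
for rate in (1, 6, 24):
    conv = nn.Conv2d(256, 256, kernel_size=3, dilation=rate, padding=rate)
    y = conv(x)
    # spatial resolution is preserved (64x64) and the parameter count is identical
    print(rate, y.shape, sum(p.numel() for p in conv.parameters()))
```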
Another difficulty comes from the existence of objects
at multiple scales. Several methods have been proposed to
handle the problem and we mainly consider four categories
in this work, as illustrated in Fig. 2. First, the DCNN is
applied to an image pyramid to extract features for each
scale input [22, 19, 69, 55, 12, 11] where objects at different
scales become prominent at different feature maps. Sec-
ond, the encoder-decoder structure [3, 71, 25, 54, 70, 68, 39]
exploits multi-scale features from the encoder part and re-
covers the spatial resolution from the decoder part. Third,
extra modules are cascaded on top of the original network for
capturing long range information. In particular, DenseCRF
[45] is employed to encode pixel-level pairwise similarities
[10, 96, 55, 73], while [59, 90] develop several extra convo-
lutional layers in cascade to gradually capture long range
context. Fourth, spatial pyramid pooling [11, 95] probes
an incoming feature map with filters or pooling operations
at multiple rates and multiple effective fields-of-view, thus
capturing objects at multiple scales.
In this work, we revisit applying atrous convolution,
which allows us to effectively enlarge the field of view of
filters to incorporate multi-scale context, in the framework of
both cascaded modules and spatial pyramid pooling. In par-
ticular, our proposed module consists of atrous convolution
with various rates and batch normalization layers, which we found important to train as well.
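A rough sketch of such a parallel atrous module, assuming PyTorch (rates 6, 12, 18 and the image-level pooling branch follow the paper's description; channel counts and layer names are illustrative): atrous convolutions with different rates run in parallel, each followed by batch normalization, and their outputs are concatenated with image-level features before a final 1x1 projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelAtrous(nn.Module):
    """Parallel atrous convolutions at several rates plus an image-level
    (global average pooling) branch, in the spirit of atrous spatial pyramid
    pooling augmented with global context."""
    def __init__(self, in_c: int = 2048, out_c: int = 256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, rate):
            pad = 0 if k == 1 else rate
            return nn.Sequential(
                nn.Conv2d(in_c, out_c, k, padding=pad, dilation=rate, bias=False),
                nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [branch(1, 1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_c, out_c, 1, bias=False),
            nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_c * (len(rates) + 2), out_c, 1, bias=False),
            nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        # broadcast image-level features back to the spatial resolution
        feats.append(F.interpolate(self.image_pool(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))

module = ParallelAtrous().eval()   # eval mode keeps the 1x1 image-pool branch's BatchNorm well-defined
print(module(torch.randn(1, 2048, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])
```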