"UUFOUJPOػߏͲ͏ͬͯྲྀߦͬͨͷ͔ʁ
Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight $\alpha_{ij}$ of the annotation of the j-th source word for the i-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.
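As a rough illustration of how such alignment weights can be computed, the sketch below implements additive (concat) scoring followed by a softmax in NumPy; the function and parameter names (alignment_weights, W_a, U_a, v_a) are illustrative stand-ins, not the paper's exact parameterization.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_weights(s_prev, annotations, W_a, U_a, v_a):
    """Soft alignment of one target step against all source annotations.

    s_prev      : (n,)    previous decoder state s_{i-1}
    annotations : (Tx, m) encoder annotations h_1 .. h_Tx
    W_a, U_a    : projection matrices, v_a : scoring vector
    Returns alpha : (Tx,) weights that sum to 1 (one row of the heat map).
    """
    # Additive score e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    e = np.tanh(annotations @ U_a.T + s_prev @ W_a.T) @ v_a
    return softmax(e)

# Toy usage: 6 source words, annotation size 8, decoder state size 4.
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))
s = rng.normal(size=4)
W_a, U_a, v_a = rng.normal(size=(5, 4)), rng.normal(size=(5, 8)), rng.normal(size=5)
alpha = alignment_weights(s, h, W_a, U_a, v_a)
print(alpha, alpha.sum())  # weights over the 6 source words, summing to 1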
One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may make the basic encoder–decoder approach underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more. This superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).
Neural Machine Translation by Jointly Learning to Align and Translate
Effective Approaches to Attention-based Neural Machine Translation
Figure 3: Local attention model – the model first predicts a single aligned position $p_t$ for the current target word. A window centered around the source position $p_t$ is then used to compute a context vector $c_t$, a weighted average of the source hidden states in the window. The weights $a_t$ are inferred from the current target state $h_t$ and the source states $\bar{h}_s$ in the window.
The soft attention refers to the global attention approach in which weights are placed "softly" over all patches in the source image. The hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.
Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in the soft attention and, at the same time, is easier to train than the hard attention approach. Concretely, the model first generates an aligned position $p_t$ for each target word at time t. The context vector $c_t$ is then derived as a weighted average over the set of source hidden states within the window $[p_t - D, p_t + D]$; D is empirically selected.
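A minimal NumPy sketch of the windowing described above, under simplifying assumptions: the aligned position p_t is taken as given, and the weights are a Gaussian taper centered at p_t rather than the scores the model actually infers from the hidden states; local_context and its arguments are hypothetical names.

import numpy as np

def local_context(source_states, p_t, D, sigma=None):
    """Weighted average of source hidden states inside [p_t - D, p_t + D].

    source_states : (S, d) array of encoder hidden states h_bar_s
    p_t           : aligned source position for the current target word
    D             : half window size (chosen empirically)
    Returns (context vector, weights over the window positions).
    """
    S = source_states.shape[0]
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    positions = np.arange(lo, hi)
    window = source_states[lo:hi]
    # Simplified weights: a Gaussian centered at p_t (the paper derives a_t
    # from the target and source states; only the windowing is sketched here).
    sigma = sigma if sigma is not None else D / 2.0
    a_t = np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    a_t /= a_t.sum()
    c_t = a_t @ window                  # (d,) context vector
    return c_t, a_t

# Toy usage: 12 source positions, hidden size 4, window half-width D = 3.
rng = np.random.default_rng(1)
h_bar = rng.normal(size=(12, 4))
c_t, a_t = local_context(h_bar, p_t=5.3, D=3)
print(c_t.shape, a_t.round(3))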
Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding words).
We describe approaches to caption generation with two variants: a "hard" attention mechanism and a "soft" attention mechanism. We also show how one advantage of including attention is the ability to visualize what the model "sees". Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient parts of an image while generating its caption.
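To make the soft/hard distinction concrete, the toy sketch below (NumPy, with illustrative names) computes soft attention as a weighted sum over all patch features and hard attention as a sample of a single patch from the same distribution; the sampling step is what makes the hard variant non-differentiable.

import numpy as np

def soft_attend(patch_feats, weights):
    """Soft attention: expected feature under the attention distribution."""
    return weights @ patch_feats            # (d,)

def hard_attend(patch_feats, weights, rng):
    """Hard attention: sample one patch to look at (non-differentiable step)."""
    idx = rng.choice(len(weights), p=weights)
    return patch_feats[idx], idx

rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 16))            # e.g. a 3x3 grid of patch features
logits = rng.normal(size=9)
w = np.exp(logits - logits.max()); w /= w.sum()

z_soft = soft_attend(feats, w)              # differentiable, uses every patch
z_hard, chosen = hard_attend(feats, w, rng) # stochastic, needs REINFORCE-style training
print(z_soft.shape, chosen)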
2. Related Work
In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recurrent neural networks and inspired by the successful use of sequence to sequence training with neural networks for machine translation (Cho et al., 2014; Bahdanau et al., 2014).
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Residual Attention Network for Image Classification
Figure 3: The receptive field comparison between mask branch and trunk branch. (Diagram: the soft mask branch runs down-sampling and up-sampling stages alongside the trunk branch's convolutions, and the receptive fields of the two branches are compared.)
The output is normalized to the range [0, 1] after two consecutive 1 × 1 convolution layers. We also added skip connections between bottom-up and top-down parts to capture information from different scales. The full module is illustrated in Fig. 2.
The bottom-up top-down structure has been applied to image segmentation and human pose estimation. However, the difference between our structure and the previous one lies in its intention: our mask branch aims at improving the features of the trunk branch.
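A toy sketch of the attention-residual gating this section builds toward, assuming the (1 + M(x)) · F(x) form suggested by the figure, with a sigmoid-squashed soft mask M(x); the trunk and mask features below are random stand-ins rather than the actual branches.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_residual(trunk_feat, mask_logits):
    """H(x) = (1 + M(x)) * F(x): the mask modulates trunk features but,
    because of the '1 +', can never wipe out good trunk features entirely."""
    M = sigmoid(mask_logits)        # soft mask in [0, 1]
    return (1.0 + M) * trunk_feat

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 14, 14))    # stand-in trunk branch features (C, H, W)
m = rng.normal(size=(8, 14, 14))    # stand-in mask branch pre-activations
H = attention_residual(F, m)
print(H.shape)                      # same shape as the trunk features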
Activation function variants f1(·), f2(·), f3(·).
Table 1: The network architecture, layer by layer: Conv1, max pooling, Residual Unit, Attention Module, Residual Unit, Attention Module, Residual Unit, Attention Module, Residual Unit, average pooling, FC + Softmax.
NLP
NLP · CV
CV
The papers' main claims ...
Stacked Hourglass Networks for
Human Pose Estimation
Alejandro Newell, Kaiyu Yang, and Jia Deng
University of Michigan, Ann Arbor
{alnewell,yangky,jiadeng}@umich.edu
Abstract. This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
Keywords: Human Pose Estimation
Fig. 1. Our network for pose estimation consists of multiple stacked hourglass modules
which allow for repeated bottom-up, top-down inference.
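The repeated bottom-up, top-down processing can be sketched structurally as below (pure NumPy, no learned convolutions): each hourglass pools down, recurses, upsamples, and adds a skip branch at every scale, and several hourglasses are stacked; feature_block, pool2 and upsample2 are placeholder stand-ins for the real residual modules.

import numpy as np

def feature_block(x):
    # Placeholder for the residual/conv modules used in the real network.
    return np.tanh(x)

def pool2(x):
    # 2x2 max pooling on a (C, H, W) tensor (H and W assumed even).
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def upsample2(x):
    # Nearest-neighbour upsampling back to twice the spatial size.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def hourglass(x, depth):
    """One bottom-up/top-down pass with a skip connection at every scale."""
    skip = feature_block(x)
    if depth == 0:
        return skip
    down = feature_block(pool2(x))
    inner = hourglass(down, depth - 1)
    return skip + upsample2(feature_block(inner))

def stacked_hourglass(x, num_stacks=2, depth=3):
    # Stacking lets later modules re-evaluate earlier intermediate estimates.
    for _ in range(num_stacks):
        x = hourglass(x, depth)
    return x

out = stacked_hourglass(np.random.default_rng(0).normal(size=(4, 64, 64)))
print(out.shape)  # (4, 64, 64): full resolution is restored by the top-down pass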
1 Introduction
A key step toward understanding people in images and video is accurate pose estimation. Given a single RGB image, we wish to determine the precise pixel location of important keypoints of the body. Achieving an understanding of a person's posture and limb articulation is useful for higher level tasks like action recognition, and also serves as a fundamental tool in fields such as human-computer interaction and animation.
Stacked Hourglass Networks for Human Pose Estimation
Highway Networks
1.1. Notation
We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as $\sigma(x) = \frac{1}{1 + e^{-x}}, \; x \in \mathbb{R}$.
2. Highway Networks
A plain feedforward neural network typically consists of L layers where the l-th layer (l ∈ {1, 2, ..., L}) applies a non-linear transform H (parameterized by $W_{H,l}$) on its input $x_l$ to produce its output $y_l$. Thus, $x_1$ is the input to the network and $y_L$ is the network's output. Omitting the layer index and biases for clarity,

$y = H(x, W_H)$.  (1)

H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms.
For a highway network, we additionally define two non-linear transforms $T(x, W_T)$ and $C(x, W_C)$ such that

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$.  (2)

We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$.  (3)

The dimensionality of x, y, $H(x, W_H)$ and $T(x, W_T)$ must be the same for Equation (3) to be valid. Just as a plain network consists of multiple computing units such that the i-th unit computes $y_i = H_i(x)$, a highway network consists of multiple blocks such that the i-th block computes a block state $H_i(x)$ and transform gate output $T_i(x)$. Finally, it produces the block output $y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x))$, which is connected to the next layer.
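A minimal NumPy sketch of one fully connected highway layer implementing Equation (3); tanh is chosen here as the nonlinearity inside H, the sizes are arbitrary, and the negative transform-gate bias anticipates the initialization discussed in Section 2.2.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x))  (Equation (3)).

    x is (d,); W_H and W_T are (d, d) so that x, y, H(x), T(x) share a dimension.
    """
    H = np.tanh(W_H @ x + b_H)      # block state (affine transform + nonlinearity)
    T = sigmoid(W_T @ x + b_T)      # transform gate in (0, 1)
    return H * T + x * (1.0 - T)    # blend transformed input and carried input

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W_H = rng.normal(size=(d, d)) * 0.1
W_T = rng.normal(size=(d, d)) * 0.1
b_H = np.zeros(d)
b_T = np.full(d, -3.0)              # negative bias: layer starts out mostly carrying x
y = highway_layer(x, W_H, b_H, W_T, b_T)
print(np.abs(y - x).max())          # small: the layer is close to the identity at init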
2.1. Constructing Highway Networks
As mentioned earlier, Equation (3) requires that the dimensionality of x, y, $H(x, W_H)$ and $T(x, W_T)$ be the same. In cases when it is desirable to change the size of the representation, one can replace x with $\hat{x}$ obtained by suitably sub-sampling or zero-padding x. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.
Convolutional highway layers are constructed similar to fully connected layers. Weight-sharing and local receptive fields are utilized for both H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input.
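A sketch of the alternative used in the study, with arbitrary illustrative sizes: an ordinary (plain) affine layer changes the representation size once, after which highway layers of that fixed size are stacked; the layer form is the same Equation (3) as in the previous sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def plain_layer(x, W, b):
    # Ordinary (non-highway) layer, used here only to change the representation size.
    return np.tanh(W @ x + b)

def highway_layer(x, W_H, b_H, W_T, b_T):
    T = sigmoid(W_T @ x + b_T)
    return np.tanh(W_H @ x + b_H) * T + x * (1.0 - T)

rng = np.random.default_rng(0)
d_in, d = 20, 8                        # change size 20 -> 8, then keep 8 throughout
x = rng.normal(size=d_in)
h = plain_layer(x, rng.normal(size=(d, d_in)) * 0.1, np.zeros(d))
for _ in range(5):                     # stack of highway layers at the fixed size d
    params = (rng.normal(size=(d, d)) * 0.1, np.zeros(d),
              rng.normal(size=(d, d)) * 0.1, np.full(d, -1.0))
    h = highway_layer(h, *params)
print(h.shape)                         # (8,): dimensionality changed once, up front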
2.2. Training Deep Highway Networks
For plain deep networks, training with SGD stalls at the beginning unless a specific weight initialization scheme is used such that the variance of the signals during forward and backward propagation is preserved initially (Glorot & Bengio, 2010; He et al., 2015). This initialization depends on the exact functional form of H.
For highway layers, we use the transform gate defined as $T(x) = \sigma(W_T^{\top} x + b_T)$, where $W_T$ is the weight matrix and $b_T$ the bias vector for the transform gates. This suggests a simple initialization scheme which is independent of the nature of H: $b_T$ can be initialized with a negative value (e.g. −1, −3, etc.) such that the network is initially biased towards carry behavior.
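To see why a negative b_T biases the layer toward carrying its input, the short check below (arbitrary sizes, roughly unit-variance pre-activations) compares the average transform-gate activation for b_T = 0, −1 and −3.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 64
W_T = rng.normal(size=(d, d)) / np.sqrt(d)   # roughly unit-variance pre-activations
x = rng.normal(size=(1000, d))               # a batch of random inputs

for b in (0.0, -1.0, -3.0):
    T = sigmoid(x @ W_T.T + b)               # transform gate T(x) = sigma(W_T x + b_T)
    print(f"b_T = {b:4.1f}  mean gate = {T.mean():.2f}")
# With b_T = -3 the average gate is small (well below 0.1), so y is approximately x
# and the layer initially just carries its input, regardless of the form of H.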