"UUFOUJPOػߏͲ͏ͬͯྲྀߦͬͨͷ͔ʁ


Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight α_ij of the annotation of the j-th source word for the i-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.

One of the motivations behind the proposed approach was the use of a fixed-length context vector in the basic encoder–decoder approach. We conjectured that this limitation may make the basic encoder–decoder approach underperform with long sentences. In Fig. 2, we see that the performance of RNNencdec dramatically drops as the length of the sentences increases. On the other hand, both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more. The superiority of the proposed model over the basic encoder–decoder is further confirmed by the fact that RNNsearch-30 even outperforms RNNencdec-50 (see Table 1).
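As a rough illustration of what these alignment weights are, here is a minimal NumPy sketch of soft attention over source annotations. The dot-product score and all shapes are placeholders (the paper uses a small feed-forward alignment model), but the softmax weights α and the weighted-sum context vector follow Eq. (6).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: T_x source annotations h_j (size d) and one decoder
# state s_{i-1}; a dot-product score stands in for the paper's alignment MLP.
T_x, d = 12, 8
rng = np.random.default_rng(0)
annotations = rng.normal(size=(T_x, d))   # h_1 ... h_{T_x}
s_prev = rng.normal(size=d)               # decoder state s_{i-1}

scores = annotations @ s_prev             # e_ij: one score per source word
alpha = softmax(scores)                   # alpha_ij: the weights visualized in Fig. 3
context = alpha @ annotations             # c_i: weighted sum of the annotations
print(alpha.round(3), context.shape)
```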


Neural Machine Translation by Jointly Learning to Align and Translate

Effective Approaches to Attention-based Neural Machine Translation

Figure 3: Local attention model – the model first predicts a single aligned position p_t for the current target word. A window centered around the source position p_t is then used to compute a context vector c_t, a weighted average of the source hidden states in the window. The weights a_t are inferred from the current target state h_t and those source states h̄_s in the window.

Soft attention refers to the global attention approach in which weights are placed "softly" over all patches in the source image. Hard attention, on the other hand, selects one patch of the image to attend to at a time. While less expensive at inference time, the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train.
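To make the soft/hard distinction concrete, here is a small NumPy sketch under assumed shapes: soft attention takes a differentiable weighted average over all patch features, while hard attention samples a single patch index, which is why it needs variance-reduction or REINFORCE-style training.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(14, 8))        # one feature vector per image patch (illustrative)
logits = rng.normal(size=14)              # unnormalized attention scores

weights = np.exp(logits - logits.max())
weights /= weights.sum()

soft_context = weights @ patches          # soft: differentiable average over all patches

idx = rng.choice(len(weights), p=weights) # hard: sample one patch; the sampling step is non-differentiable
hard_context = patches[idx]
```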

Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has the advantage of avoiding the expensive computation incurred in soft attention and, at the same time, is easier to train than the hard attention approach. In concrete detail, the model first generates an aligned position p_t for each target word at time t. The context vector c_t is then derived as a weighted average over the set of source hidden states within the window [p_t − D, p_t + D]; D is empirically selected. Unlike the global approach, the local alignment vector a_t is now fixed-dimensional, i.e., a_t ∈ R^{2D+1}.
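A minimal sketch of this local attention idea, with made-up shapes and a fixed p_t (in the model, p_t is predicted from the target state h_t); the Gaussian term centered on p_t mirrors the paper's local-p variant, and D is a free choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

S, d, D = 30, 16, 4                        # source length, hidden size, window half-width (illustrative)
rng = np.random.default_rng(0)
src_states = rng.normal(size=(S, d))       # source hidden states h_bar_s
h_t = rng.normal(size=d)                   # current target hidden state

p_t = 17.3                                 # aligned position (predicted by the model; fixed here)
lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)

scores = src_states[lo:hi] @ h_t           # content-based scores inside the window only
align = softmax(scores)

# Gaussian term centered at p_t (sigma = D / 2), favouring states near the predicted position.
positions = np.arange(lo, hi)
align = align * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
align /= align.sum()

c_t = align @ src_states[lo:hi]            # context vector over the window
```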

Figure 3. Examples of attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word).

We describe approaches to caption generation that incorporate a form of attention with two variants: a "hard" attention mechanism and a "soft" attention mechanism. We also show how one advantage of including attention is the ability to visualize what the model "sees". Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient parts of an image while generating its caption.

2. Related Work

In this section we provide relevant background on previous work on image caption generation and attention. Recently, several methods have been proposed for generating image descriptions. Many of these methods are based on recurrent neural networks and inspired by the successful use of sequence-to-sequence training with neural networks for machine translation (Cho et al., 2014; Bahdanau et al., 2014).

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Residual Attention Network for Image Classification

[Figure: the soft mask branch uses down-sampling and up-sampling while the trunk branch applies convolutions; their outputs are combined as (1 + M(x)) · F(x).]

Figure 3: The receptive field comparison between mask branch and trunk branch.

A sigmoid layer normalizes the output range to [0, 1] after two consecutive 1 × 1 convolution layers. We also added skip connections between bottom-up and top-down parts to capture information from different scales. The full module is illustrated in Fig. 2.

The bottom-up top-down structure has been applied to image segmentation and human pose estimation. However, the difference between our structure and the previous ones lies in its intention: our mask branch aims at improving the features of the trunk branch rather than solving a complex problem directly.
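A toy NumPy sketch of this attention residual idea: the mask M(x) from a sigmoid stays in [0, 1] and modulates the trunk features F(x) as (1 + M(x)) · F(x). The trunk and mask computations below are random stand-ins for the actual branches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 32))                             # a feature map (H, W, C), illustrative

trunk = np.maximum(x + 0.1 * rng.normal(size=x.shape), 0)   # F(x): stand-in for the trunk branch
mask_logits = rng.normal(size=x.shape)                      # stand-in for the bottom-up/top-down mask branch
mask = sigmoid(mask_logits)                                 # M(x) in [0, 1]

out = (1.0 + mask) * trunk   # attention residual: mask modulates features, identity term preserves them
```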

[Table 1 (truncated in this excerpt): the network stacks a Conv layer, Max pooling, and alternating Residual Units and Attention Modules, followed by Average pooling and an FC + Softmax classifier.]

NLP

NLP → CV

CV

The paper's main claim...

Stacked Hourglass Networks for

Human Pose Estimation

Alejandro Newell, Kaiyu Yang, and Jia Deng

University of Michigan, Ann Arbor

{alnewell,yangky,jiadeng}@umich.edu

Abstract. This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a "stacked hourglass" network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks, outcompeting all recent methods.

Keywords: Human Pose Estimation

Fig. 1. Our network for pose estimation consists of multiple stacked hourglass modules which allow for repeated bottom-up, top-down inference.

1 Introduction

A key step toward understanding people in images and video is accurate pose estimation. Given a single RGB image, we wish to determine the precise pixel location of important keypoints of the body. Achieving an understanding of a person's posture and limb articulation is useful for higher level tasks like action recognition, and also serves as a fundamental tool in fields such as human-computer interaction and animation.


Stacked Hourglass Networks for Human Pose Estimation

Highway Networks

1.1. Notation

We use boldface letters for vectors and matrices, and italicized capital letters to denote transformation functions. 0 and 1 denote vectors of zeros and ones respectively, and I denotes an identity matrix. The function σ(x) is defined as σ(x) = 1 / (1 + e^{-x}), x ∈ R.

2. Highway Networks

A plain feedforward neural network typically consists of L layers where the l-th layer (l ∈ {1, 2, ..., L}) applies a non-linear transform H (parameterized by W_{H,l}) on its input x_l to produce its output y_l. Thus, x_1 is the input to the network and y_L is the network's output. Omitting the layer index and biases for clarity,

y = H(x, W_H).   (1)

H is usually an affine transform followed by a non-linear activation function, but in general it may take other forms. For a highway network, we additionally define two non-linear transforms T(x, W_T) and C(x, W_C) such that

y = H(x, W_H) · T(x, W_T) + x · C(x, W_C).   (2)

We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T, giving

y = H(x, W_H) · T(x, W_T) + x · (1 − T(x, W_T)).   (3)

The dimensionality of x, y, H(x, W_H) and T(x, W_T) must be the same for Equation (3) to be valid. Note that this layer transformation is much more flexible than Equation (1).

Just as a plain layer consists of multiple computing units such that the i-th unit computes y_i = H_i(x), a highway network consists of multiple blocks such that the i-th block computes a block state H_i(x) and a transform gate output T_i(x). Finally, it produces the block output y_i = H_i(x) · T_i(x) + x_i · (1 − T_i(x)), which is connected to the next layer.
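Putting Equations (1)–(3) together, a single fully connected highway layer can be sketched in a few lines of NumPy. The tanh choice for H and the specific sizes are assumptions; the gating itself follows Equation (3).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 16
rng = np.random.default_rng(0)
x = rng.normal(size=d)

W_H, b_H = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W_T, b_T = rng.normal(size=(d, d)) * 0.1, np.full(d, -2.0)  # negative bias: initially biased towards carry

H = np.tanh(W_H @ x + b_H)       # block state H(x, W_H); tanh is one possible non-linearity
T = sigmoid(W_T @ x + b_T)       # transform gate T(x, W_T)
y = H * T + x * (1.0 - T)        # Equation (3) with carry gate C = 1 - T
```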

2.1. Constructing Highway Networks

As mentioned earlier, Equation (3) requires that the dimensionality of x, y, H(x, W_H) and T(x, W_T) be the same. In cases when it is desirable to change the size of the representation, one can replace x with x̂ obtained by suitably sub-sampling or zero-padding x. Another alternative is to use a plain layer (without highways) to change dimensionality and then continue with stacking highway layers. This is the alternative we use in this study.

Convolutional highway layers are constructed similarly to fully connected layers. Weight-sharing and local receptive fields are utilized for both the H and T transforms. We use zero-padding to ensure that the block state and transform gate feature maps are the same size as the input.
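A small sketch of the dimension-change strategy described above: one plain (non-highway) layer maps the input to the new size, after which highway layers are stacked at that fixed size. All sizes and the tanh choice are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x_in = rng.normal(size=32)                  # input of size 32
W_plain = rng.normal(size=(16, 32)) * 0.1
x = np.tanh(W_plain @ x_in)                 # plain layer changes dimensionality: 32 -> 16

for _ in range(3):                          # stacked highway layers at the fixed size 16
    W_H = rng.normal(size=(16, 16)) * 0.1
    W_T = rng.normal(size=(16, 16)) * 0.1
    H = np.tanh(W_H @ x)
    T = sigmoid(W_T @ x - 2.0)              # negative bias term, biased towards carry
    x = H * T + x * (1.0 - T)
```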

2.2. Training Deep Highway Networks

For plain deep networks, training with SGD stalls at the beginning unless a specific weight initialization scheme is used such that the variance of the signals during forward and backward propagation is preserved initially (Glorot & Bengio, 2010; He et al., 2015). This initialization depends on the exact functional form of H.

For highway layers, we use the transform gate defined as T(x) = σ(W_T^T x + b_T), where W_T is the weight matrix and b_T the bias vector for the transform gates. This suggests a simple initialization scheme which is independent of the nature of H: b_T can be initialized with a negative value (e.g. −1, −3, etc.) such that the network is initially biased towards carry behavior. This scheme is strongly inspired by the way gate biases are initialized in LSTM networks.
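To see why a negative b_T biases the network towards carrying, note the gate value at a zero pre-activation (a minimal check, not part of the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# With b_T negative, the transform gate starts near zero, so
# y = H*T + x*(1-T) ~ x: the layer initially just carries its input.
for b_T in (-1.0, -3.0):
    T0 = sigmoid(0.0 + b_T)      # gate activation when W_T^T x = 0
    print(b_T, round(T0, 3))     # -1 -> ~0.269, -3 -> ~0.047
```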

CV

CV