Every Picture Tells a Story: Generating Sentences from Images
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
Slide 5
images sentences
Slide 6
Felzenszwalb Detector
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro F. Felzenszwalb, David McAllester and Deva Ramanan
Slide 7
Linear SVM
Felzenszwalb detector
Hoiem 3D scene model
GIST scene features (+Adaboost)
Node features + scores
Slide 9
Edge Potentials
• Given a test image:
• Find its k-nearest-neighbor (k-NN) training examples and average their node features
• From the image side: node features for similar images
• From the sentence side: sentence representations for similar images
• Multi-label Markov Random Field
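The k-NN averaging step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `knn_average`, the Euclidean distance, and the toy feature vectors are all assumptions made for the example.

```python
import math

def knn_average(test_feat, train_feats, train_node_feats, k=3):
    """Average the node features of the k training images nearest to test_feat.

    Assumption: similarity is Euclidean distance in image-feature space.
    """
    # Rank training images by distance to the test image.
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(test_feat, train_feats[i]))
    nearest = order[:k]
    dim = len(train_node_feats[0])
    # Element-wise mean of the neighbours' node-feature vectors.
    return [sum(train_node_feats[i][d] for i in nearest) / len(nearest)
            for d in range(dim)]

# Toy data (hypothetical): 4 training images with 2-D image features
# and 2-D node features.
train_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
avg = knn_average([0.1, 0.1], train_feats, train_node_feats, k=3)
```

The same routine could be applied on the sentence side by passing sentence-derived vectors of the neighbouring images in place of `train_node_feats`.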
Slide 10
images sentences
Slide 11
Curran & Clark Tools
• Maximum Entropy Tagger
• POS Tagger
• Combinatory Categorial Grammar (CCG)
• Chunker
• Named Entity Recognizer
Slide 12
C&C Parser
Dependency Parse
Subject/Direct Object
Head nouns from prepositional phrases (“X in the background”)
Scene information
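A rough sketch of how subject, direct object, and scene cues might be read off a dependency parse. The `(relation, head, dependent)` tuple format and the relation labels `subj`, `dobj`, and `prep_noun` are hypothetical placeholders, not the actual output format of the C&C parser.

```python
def extract_triplet_cues(deps):
    """Collect subject, direct object, and scene nouns from dependencies.

    deps: list of (relation, head, dependent) tuples in a hypothetical format.
    """
    cues = {"subject": None, "object": None, "scene": []}
    for rel, head, dep in deps:
        if rel == "subj" and cues["subject"] is None:
            cues["subject"] = dep          # first subject found
        elif rel == "dobj" and cues["object"] is None:
            cues["object"] = dep           # first direct object found
        elif rel == "prep_noun":
            cues["scene"].append(dep)      # head noun of a prepositional phrase
    return cues

# Hand-written dependencies for "A man rides a horse in the field".
deps = [("subj", "rides", "man"),
        ("dobj", "rides", "horse"),
        ("prep_noun", "in", "field")]
cues = extract_triplet_cues(deps)
```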
Slide 16
Structure Learning
Find the weights of the linear combination of node and edge potentials so that the ground-truth triplet scores highest
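To make the learning objective concrete, here is a small sketch that uses a structured perceptron as a simple stand-in; the slides do not say which solver the authors used, so this is an assumption. Each candidate triplet is assumed to come with a feature vector of node and edge potentials, and the toy numbers are invented for illustration.

```python
def score(w, feats):
    """Linear score of one candidate triplet."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def train_perceptron(examples, dim, epochs=10):
    """examples: list of (candidates, truth).

    candidates maps a triplet to its feature vector (node and edge potentials).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for candidates, truth in examples:
            pred = max(candidates, key=lambda t: score(w, candidates[t]))
            if pred != truth:
                # Move the weights toward the ground truth, away from the mistake.
                for d in range(dim):
                    w[d] += candidates[truth][d] - candidates[pred][d]
    return w

# Toy data (hypothetical): two images, two candidate triplets each.
examples = [
    ({("dog", "run", "street"): [0.4, 0.8, 0.1],
      ("horse", "ride", "field"): [0.9, 0.2, 0.7]},
     ("horse", "ride", "field")),
    ({("horse", "ride", "field"): [0.3, 0.9, 0.2],
      ("dog", "run", "street"): [0.8, 0.3, 0.6]},
     ("dog", "run", "street")),
]
w = train_perceptron(examples, dim=3)
```

After training on this toy set, the ground-truth triplet is the highest-scoring candidate for both images.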
Slide 17
Seeing What You’re Told: Sentence-Guided Activity Recognition In Video
N. Siddharth, Andrei Barbu, Jeffrey Mark Siskind
Object Detection
Track
Event Recognizer
Sentences
Per-Object/Per-frame
• Position
• Velocity
• Acceleration
• Aspect Ratio
Agent+Instrument
• Distance
• Orientation
A time series of feature vectors
Train with a Hidden Markov Model (one per word in the lexicon)
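The feature lists above can be turned into a time series roughly as follows. This sketch assumes each track is a list of per-frame bounding boxes `(x1, y1, x2, y2)` and uses backward finite differences for velocity and acceleration; the function names and these choices are illustrative assumptions, not taken from the slides.

```python
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def object_features(track):
    """Per-frame (x, y, vx, vy, ax, ay, aspect) for one track of boxes."""
    centers = [center(b) for b in track]
    feats = []
    for t, (x1, y1, x2, y2) in enumerate(track):
        cx, cy = centers[t]
        # Backward differences; zero where no earlier frame exists.
        if t >= 1:
            vx, vy = cx - centers[t - 1][0], cy - centers[t - 1][1]
        else:
            vx, vy = 0.0, 0.0
        if t >= 2:
            pvx = centers[t - 1][0] - centers[t - 2][0]
            pvy = centers[t - 1][1] - centers[t - 2][1]
            ax, ay = vx - pvx, vy - pvy
        else:
            ax, ay = 0.0, 0.0
        aspect = (x2 - x1) / (y2 - y1)  # width over height
        feats.append((cx, cy, vx, vy, ax, ay, aspect))
    return feats

def pair_features(agent_track, instrument_track):
    """Per-frame (distance, orientation in radians) between box centers."""
    out = []
    for a, b in zip(agent_track, instrument_track):
        (ax, ay), (bx, by) = center(a), center(b)
        out.append((math.hypot(bx - ax, by - ay),
                    math.atan2(by - ay, bx - ax)))
    return out

# Toy tracks (hypothetical): an agent moving right, a stationary instrument.
agent = [(0, 0, 2, 4), (1, 0, 3, 4), (3, 0, 5, 4)]
instrument = [(4, 0, 6, 2)] * 3
feats = object_features(agent)
pairs = pair_features(agent, instrument)
```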
Slide 25
Recognize with HMM
Maximize a linear combination of observation and state-transition scores
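In log space, the most likely HMM state sequence maximizes a sum of observation and transition log-scores, which the Viterbi algorithm finds by dynamic programming. A minimal sketch with made-up two-state parameters (the numbers are illustrative assumptions):

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return (best log-score, best state path).

    log_obs[t][k] is the log-score of frame t under state k.
    """
    n_states = len(log_init)
    delta = [log_init[k] + log_obs[0][k] for k in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        new_delta, ptr = [], []
        for k in range(n_states):
            # Best predecessor state for landing in state k at frame t.
            best_j = max(range(n_states),
                         key=lambda j: delta[j] + log_trans[j][k])
            ptr.append(best_j)
            new_delta.append(delta[best_j] + log_trans[best_j][k]
                             + log_obs[t][k])
        delta = new_delta
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda k: delta[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

def logs(row):
    return [math.log(p) for p in row]

log_init = logs([0.8, 0.2])
log_trans = [logs([0.7, 0.3]), logs([0.1, 0.9])]
log_obs = [logs([0.9, 0.1]), logs([0.2, 0.8]), logs([0.1, 0.9])]
best_score, best_path = viterbi(log_init, log_trans, log_obs)
```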
Slide 26
Sentence Tracker
Determine whether a set of tracks matches a sentence by maximizing the probability over the cross-product lattice
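One way to picture the cross-product lattice is to run a single Viterbi pass over joint nodes that pair a candidate detection with an HMM state. The sketch below is a simplified assumption covering one track and one word; the score tables, their names, and the toy values are hypothetical.

```python
from itertools import product

def joint_viterbi(det, motion, log_init, log_trans, obs):
    """Best joint log-score and path over (detection, HMM-state) pairs.

    det[t][i]       : log-score of candidate detection i in frame t
    motion[t][i][j] : log-score of linking detection i (frame t) to j (frame t+1)
    log_init[k], log_trans[k][l] : HMM initial and transition log-scores
    obs[t][i][k]    : log-score of detection i in frame t under HMM state k
    """
    n_states = len(log_init)

    def nodes(t):
        # Cross product of candidate detections and HMM states at frame t.
        return list(product(range(len(det[t])), range(n_states)))

    delta = {(i, k): det[0][i] + log_init[k] + obs[0][i][k]
             for i, k in nodes(0)}
    back = []
    for t in range(1, len(det)):
        new_delta, ptr = {}, {}
        for j, l in nodes(t):
            prev = max(delta, key=lambda n: delta[n]
                       + motion[t - 1][n[0]][j] + log_trans[n[1]][l])
            new_delta[(j, l)] = (delta[prev] + motion[t - 1][prev[0]][j]
                                 + log_trans[prev[1]][l]
                                 + det[t][j] + obs[t][j][l])
            ptr[(j, l)] = prev
        delta = new_delta
        back.append(ptr)
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

# Toy tables (hypothetical): 2 frames, 2 detections per frame, 2 HMM states.
det = [[0.0, -1.0], [-0.5, 0.0]]
motion = [[[0.0, -2.0], [-2.0, 0.0]]]
log_init = [0.0, -1.0]
log_trans = [[0.0, -1.0], [-5.0, 0.0]]
obs = [[[0.0, -3.0], [-3.0, 0.0]],
       [[-3.0, 0.0], [0.0, -3.0]]]
joint_score, joint_path = joint_viterbi(det, motion, log_init, log_trans, obs)
```

Comparing `joint_score` against a threshold would then give a yes/no decision on whether the track matches the word; how that threshold is chosen is left open here.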