A look at two approaches to using words to search for images and video, and to producing textual descriptions of them.
The papers discussed are http://0xab.com/papers/cvpr2014 and https://www.cs.cmu.edu/~afarhadi/papers/sentence.pdf
The video demo of the second paper (slide 18) is available here: http://0xab.com/research/video-events.html
Presented in MIT 9.S915: Aspects of a Computational Theory of Intelligence. http://cs.wellesley.edu/~vision/