Slide 1

Slide 1 text

Creating Recipes from Videos
Misha Fain, Cookpad Ltd.
CSS Bristol guest lecture, 27 November 2018, Bristol University

Slide 2

Slide 2 text

Recipe Creation on Cookpad Platform

Slide 3

Slide 3 text

Recipe Creation on Cookpad Platform
1. Start cooking your meal
2. Take pictures along the way
3. Eat your food and have a nap
4. Open Cookpad website or app and start creating a recipe
5. Enter all the ingredients and quantities
6. Write down all the steps
7. Attach the images to the corresponding steps

Slide 4

Slide 4 text

Recipe Creation on Cookpad Platform

Slide 5

Slide 5 text

What if you could do it like this instead?
1. Start recording video
2. Cook your meal
3. Eat it
4. Have your recipe generated for you

Slide 6

Slide 6 text

How to generate a step from a short video clip?

Slide 7

Slide 7 text

How to generate a step from a short video clip?

Slide 8

Slide 8 text

Encoder (from video to vector)
[Diagram: video frames over time -> CNN (ResNet) per frame -> recurrent network (GRU) -> internal representation, a vector such as 0.1 -0.34 0.435 ... -0.01 1.324 0.32]

Slide 9

Slide 9 text

Encoder (from video to vector)
[Diagram: video frames over time -> CNN (ResNet) per frame -> recurrent network (GRU) -> internal representation, a vector such as 0.1 -0.34 0.435 ... -0.01 1.324 0.32]

Slide 10

Slide 10 text

Encoder: image-to-vector (Convolutional Neural Network)
Adapted from https://towardsdatascience.com/build-your-own-convolution-neural-network-in-5-mins-4217c2cf964f
Vector: 0.1 -0.34 0.435 ... -0.01 1.324 0.32
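To make the image-to-vector idea concrete, here is a minimal sketch in plain Python. It is not the ResNet used in the talk: the tiny image, the two hand-written kernels, and the function names are all assumptions chosen for illustration. It only shows the basic pattern of convolving, applying a non-linearity, and pooling down to one number per filter.

```python
def conv2d_valid(image, kernel):
    """Slide a small kernel over a 2-D grayscale image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def image_to_vector(image, kernels):
    """One number per kernel: convolve, apply ReLU, then average over all positions."""
    vector = []
    for kernel in kernels:
        feature_map = conv2d_valid(image, kernel)
        activations = [max(0.0, v) for row in feature_map for v in row]
        vector.append(sum(activations) / len(activations))
    return vector


# Hypothetical 4x4 "image" containing a vertical edge (dark left, bright right).
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Hypothetical hand-written kernels; a real CNN learns these during training.
kernels = [
    [[-1.0, 1.0], [-1.0, 1.0]],   # responds to dark-to-bright vertical edges
    [[-1.0, -1.0], [1.0, 1.0]],   # responds to dark-to-bright horizontal edges
]
vector = image_to_vector(image, kernels)
```

The first kernel fires on the vertical edge and the second does not, so the two entries of `vector` differ. A real network stacks many such layers with learned kernels and produces a much longer vector.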

Slide 11

Slide 11 text

Encoder (from video to vector)
[Diagram: video frames over time -> CNN (ResNet) per frame -> recurrent network (GRU) -> internal representation, a vector such as 0.1 -0.34 0.435 ... -0.01 1.324 0.32]

Slide 12

Slide 12 text

Encoder (from video to vector)
[Diagram: video frames over time -> CNN (ResNet) per frame -> recurrent network (GRU) -> internal representation, a vector such as 0.1 -0.34 0.435 ... -0.01 1.324 0.32]

Slide 13

Slide 13 text

Encoder (from sequence of image vectors to vector)
From http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
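As a rough sketch of how a recurrent network folds a sequence of per-frame vectors into one vector, the following plain-Python GRU cell may help. It assumes one common form of the GRU update (conventions differ on which side the update gate weights), omits bias terms, and uses random untrained weights. The class and function names, the sizes, and the two example frame vectors are illustrative assumptions, not the model from the talk.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell without biases; weights are random, not trained."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One (W, U) pair per gate: update "z", reset "r", candidate "h".
        self.W = {g: random_matrix(hidden_size, input_size, rng) for g in "zrh"}
        self.U = {g: random_matrix(hidden_size, hidden_size, rng) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        candidate = [math.tanh(a + b)
                     for a, b in zip(matvec(self.W["h"], x),
                                     matvec(self.U["h"], gated_h))]
        # Blend the old state and the candidate, element by element.
        return [(1.0 - zi) * hi + zi * ci
                for zi, hi, ci in zip(z, h, candidate)]


def encode(frame_vectors, cell):
    """Run the cell over all frames; the final hidden state represents the clip."""
    h = [0.0] * cell.hidden_size
    for x in frame_vectors:
        h = cell.step(x, h)
    return h


# Two made-up frame vectors standing in for CNN outputs.
frames = [[0.1, -0.34, 0.435], [-0.01, 1.324, 0.32]]
cell = GRUCell(input_size=3, hidden_size=4)
clip_vector = encode(frames, cell)
```

The key point is that `clip_vector` has a fixed size regardless of how many frames the clip contains, which is what lets a decoder consume it later.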

Slide 14

Slide 14 text

Encoder (from video to vector)
[Diagram: video frames over time -> CNN (ResNet) per frame -> recurrent network (GRU) -> internal representation, a vector such as 0.1 -0.34 0.435 ... -0.01 1.324 0.32]

Slide 15

Slide 15 text

Encoder (from video to vector)

Slide 16

Slide 16 text

How to generate a step from a short video clip?

Slide 17

Slide 17 text

How to generate a step from a short video clip?

Slide 18

Slide 18 text

Decoder (from vector to text)
[Diagram: internal representation vector (0.1 -0.34 0.435 ... -0.01 1.324 0.32) -> recurrent network (GRU) unrolled over time -> words: cut the potatoes slices]
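The decoding loop can be sketched as follows. This is a simplified illustration, not the talk's model: it swaps the GRU for a plain tanh recurrent step to stay short, uses random untrained weights, and assumes a tiny vocabulary with hypothetical `<start>` and `<end>` tokens. The output is therefore meaningless; the point is only the loop structure, where the clip vector seeds the hidden state and each predicted word is fed back in as the next input.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has thousands of words.
VOCAB = ["<start>", "<end>", "cut", "the", "potatoes", "slices"]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def greedy_decode(clip_vector, max_words=10, seed=0):
    """Emit one word per time step, always picking the highest-scoring word."""
    rng = random.Random(seed)
    hidden_size = len(clip_vector)
    embed = random_matrix(len(VOCAB), hidden_size, rng)   # one row per word
    w_in = random_matrix(hidden_size, hidden_size, rng)
    w_rec = random_matrix(hidden_size, hidden_size, rng)
    w_out = random_matrix(len(VOCAB), hidden_size, rng)

    h = list(clip_vector)            # encoder output initialises the decoder state
    word = VOCAB.index("<start>")
    words = []
    for _ in range(max_words):
        x = embed[word]
        # Plain tanh recurrence (a stand-in for the GRU step).
        h = [math.tanh(a + b)
             for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
        scores = matvec(w_out, h)
        word = max(range(len(VOCAB)), key=lambda i: scores[i])
        if VOCAB[word] == "<end>":
            break
        words.append(VOCAB[word])
    return words


step_words = greedy_decode([0.1, -0.34, 0.435, -0.01])
```

Picking the single best word at every step is the simplest choice; beam search, which keeps several candidate sentences alive, is a common alternative.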

Slide 19

Slide 19 text

How to generate a step from a short video clip?

Slide 20

Slide 20 text

Data
Public dataset YouCookII: annotated YouTube cooking videos
http://cmos.eecs.umich.edu/static/YouCookII/youcookii_readme.pdf

Slide 21

Slide 21 text

Training Procedure
Training = Optimization
w_opt = argmin_w L(w)
w: neural network parameters, millions of them
L: loss function, problem-dependent (maps parameters to a number)
Optimization method: gradient descent, aka steepest descent (intuition: a reckless runner with a short attention span gets lost in the fog in the mountains)
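Gradient descent itself fits in a few lines. The sketch below minimises a made-up two-parameter quadratic loss, chosen only because its minimum (w = [3, -1]) is known in advance. A real network has millions of parameters and obtains its gradient by backpropagation rather than from a hand-written formula; the learning rate and step count here are arbitrary assumptions.

```python
def loss(w):
    """Toy loss with its minimum at w = [3, -1]."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad(w):
    """Gradient of the toy loss, derived by hand."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=200):
    """Repeatedly take a small step in the direction of steepest descent."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


w_opt = gradient_descent([0.0, 0.0])
```

Each step looks only at the local slope, which is the "short attention span" in the runner analogy. On a simple bowl-shaped loss like this one that is enough to reach the bottom, but on the bumpy loss surface of a neural network it can end up somewhere other than the global minimum.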

Slide 22

Slide 22 text

Results by Evander DaCosta

Slide 23

Slide 23 text

Results - what does not work by Evander DaCosta

Slide 24

Slide 24 text

Summary
● We have seen how a Sequence-to-Sequence model can be used to predict recipe text from videos
● Predicting the actions is easier than identifying the ingredients
● While it somewhat works, it is still not nearly as good as we'd like. Lots of research is going on elsewhere to get better results, e.g. https://arxiv.org/pdf/1804.00819.pdf