
GDG DevFest 2019 - "Deep Neural Networks for Video Applications"

Alex Conway
November 30, 2019

Deep neural networks can be used to convert real-time video into actionable data which can be used to trigger real-time anomaly alerts and optimize complex business processes.

In addition to commercial applications, deep learning can be used to analyze large amounts of video recorded from the point of view of animals to study complex behavior patterns otherwise impossible to analyze.

This talk will give an intuition for how deep neural networks work and how anyone can get started using them for video applications. Several real-world examples will be presented, with code examples in Python and details on how to train and deploy models using Google Cloud Platform.

Feel free to ask me questions on Twitter: @alxcnwy

#numberboost #deeplearning #machinelearning #deeplearningforvideo #convolutionalneuralnetworks #recurrentneuralnetworks #centroidtracking #objectdetection #deepfakes #poseestimation #videomachinelearning

Talk details here: https://devfest.co.za/schedule/2019-11-30?sessionId=412


Transcript

  1. DEEP NEURAL NETWORKS for Video Applications. Alex Conway, alex @ numberboost.com. Google Developer Group Cape Town #DevFest 2019. Neither confidential nor proprietary - please distribute ;)
  2. Note about these slides: the original version of this presentation is over 500 MB (lots of videos), which is too big to upload, but there are links to most of the videos in this compressed PDF version* (*you might have to download the PDF to click on the links).
  3. 2016 MultiChoice Innovation Competition 1st Prize Winners; 2017 Mercedes-Benz Innovation Competition 1st Prize Winners; 2018 Lloyd’s Register Innovation Competition 1st Prize Winners; 2019 NTT & Dimension Data Innovation Competition 1st Prize Winners
  4. [Figure: an image represented as matrices of pixel-intensity values (e.g. 13.37), one width × height grid per colour channel]
  5. Video tensor (60-second clip @ 24 fps): 500 x 500 x 3 x 24 x 60 = 1,080,000,000 values. 4 dimensions: width, height, colour, time.
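To make the arithmetic concrete, here is a minimal NumPy sketch of the slide's numbers; the time-first axis order is an assumption (conventions vary between libraries):

```python
import numpy as np

# Video tensor shape: time x height x width x colour channels
shape = (24 * 60, 500, 500, 3)   # 1440 frames of 500x500 RGB
n_values = int(np.prod(shape))
print(n_values)                  # 1080000000, as on the slide

# The full clip is ~1 GB even at 1 byte per value, so in practice
# models usually process one (much smaller) frame at a time:
frame = np.zeros((500, 500, 3), dtype=np.uint8)
```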
  6. HOW DOES A NEURAL NETWORK LEARN? The gradient answers: “How much does the error increase when we increase this weight?” New weight = Old weight - Learning rate x (Gradient of Error with respect to Weight)
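The update rule can be sketched in plain Python. The one-weight model y = w * x and the single data point below are hypothetical, chosen only to show the rule converging:

```python
# One gradient-descent step for a single weight, fitting y = w * x
# to the point (x=2.0, y=4.0) with squared error.
def grad_step(w, x, y, lr=0.1):
    error = w * x - y          # prediction minus target
    grad = 2 * error * x       # d(error^2)/dw
    return w - lr * grad       # new weight = old weight - lr * gradient

w = 0.0
for _ in range(50):
    w = grad_step(w, x=2.0, y=4.0)
print(round(w, 3))             # 2.0 - the weight that makes the error zero
```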
  7. [image-only slide]

  8. Zeiler, M.D. and Fergus, R., 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833).
  9. Zeiler, M.D. and Fergus, R., 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833).
  10. IMAGENET TOP-5 ERROR RATE: Traditional Image Processing Methods; AlexNet (8 layers); ZFNet (8 layers); GoogLeNet (22 layers); ResNet (152 layers); SENet Ensemble; TSNet Ensemble
  11. class probabilities = f_CNN(image): a vector of 1000 rows, one probability per class
  12. [Figure: the same f(image) → class-probabilities mapping, shown for an example image]
  13. [image-only slide]

  14. feature vector = f_CNN(image): the “encoding”
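A toy sketch of the idea, with no dependence on any particular framework: a hand-rolled "encoder" with two hard-coded filters, ReLU, and global average pooling maps an image to a short feature vector. A real application would use a pretrained CNN (e.g. via Keras) instead of these made-up filters:

```python
import numpy as np

# Toy "CNN encoder": convolution + ReLU + global average pooling
# turns an image into a 1-D feature vector (the "encoding").
def encode(image, filters):
    """image: (H, W) array; filters: list of (k, k) kernels."""
    k = filters[0].shape[0]
    H, W = image.shape
    features = []
    for f in filters:
        out = np.zeros((H - k + 1, W - k + 1))   # valid convolution
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + k, j:j + k] * f)
        out = np.maximum(out, 0)                 # ReLU
        features.append(out.mean())              # global average pooling
    return np.array(features)

img = np.random.rand(8, 8)
filters = [np.ones((3, 3)), np.eye(3)]           # made-up filters
vec = encode(img, filters)
print(vec.shape)                                 # (2,) - one number per filter
```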
  15. Fine-tuning a CNN to solve a new problem: 96.3% accuracy in under 2 minutes for classifying products into categories (WITH ONLY 3467 TRAINING IMAGES!!1!)
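A hedged sketch of the fine-tuning recipe in Keras: the tiny convolutional "base" below is a stand-in for a real pretrained backbone (e.g. keras.applications.ResNet50 with weights="imagenet"), and the 10 categories and 64x64 input size are illustrative only:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a pretrained convolutional base; in practice load
# e.g. keras.applications.ResNet50(weights="imagenet", include_top=False).
base = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])
base.trainable = False          # freeze the "pretrained" weights

n_categories = 10               # hypothetical number of product categories
model = keras.Sequential([
    base,
    layers.Dense(n_categories, activation="softmax"),  # new classification head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, epochs=5)  # train only the new head
```

Freezing the base means only the small new head is trained, which is why a few thousand images and a couple of minutes can be enough.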
  16. list of bounding boxes + classes = f_ObjectDetectionModel(image)
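The detector's output can be post-processed in plain Python; the dict format and (x1, y1, x2, y2) box convention below are assumptions for illustration, not any particular library's API:

```python
# A detector returns (box, label, score) triples; a common first step
# is to keep only confident detections.
def filter_detections(detections, min_score=0.5):
    return [d for d in detections if d["score"] >= min_score]

# Hypothetical raw detector output for one frame:
detections = [
    {"box": (10, 10, 50, 80), "label": "person", "score": 0.97},
    {"box": (60, 20, 90, 70), "label": "person", "score": 0.31},
    {"box": (5, 40, 30, 60),  "label": "ball",   "score": 0.88},
]
kept = filter_detections(detections)
print([d["label"] for d in kept])   # ['person', 'ball']
```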
  17. CNN outputs one probability per character class … P(A) = 0.005, P(B) = 0.002, P(C) = 0.98, P(9) = 0.001, P(0) = 0.03
  18. OBJECT TRACKING (https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/): start by assigning each object in the first frame an ID (ID #1, ID #2); the centroid is shown as a dot at the center of each object
  19. OBJECT TRACKING: for each object in frame t+1, compute the Euclidean distance between its centroid and the centroid of every object in frame t
  20. OBJECT TRACKING: assign each object in frame t+1 the ID of the nearest object from frame t, provided the distance is less than a threshold (ID #1, ID #2)
  21. OBJECT TRACKING: if no object from the previous frame is within the threshold distance, assign a new ID (ID #1, ID #2, ID #3)
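The steps above (nearest centroid, distance threshold, fresh IDs) can be sketched as a minimal tracker; the 50-pixel threshold is an arbitrary example value:

```python
import math

# Minimal centroid tracker: match each new centroid to the nearest
# existing ID within a distance threshold, else assign a fresh ID.
class CentroidTracker:
    def __init__(self, max_distance=50.0):
        self.next_id = 1
        self.objects = {}                     # id -> (x, y) centroid
        self.max_distance = max_distance

    def update(self, centroids):
        new_objects = {}
        for (x, y) in centroids:
            best_id, best_dist = None, self.max_distance
            for obj_id, (px, py) in self.objects.items():
                d = math.hypot(x - px, y - py)     # Euclidean distance
                if d < best_dist and obj_id not in new_objects:
                    best_id, best_dist = obj_id, d
            if best_id is None:                    # no match -> new ID
                best_id = self.next_id
                self.next_id += 1
            new_objects[best_id] = (x, y)
        self.objects = new_objects
        return new_objects

tracker = CentroidTracker()
print(tracker.update([(10, 10), (100, 100)]))  # {1: (10, 10), 2: (100, 100)}
print(tracker.update([(12, 11), (103, 99)]))   # {1: (12, 11), 2: (103, 99)}
```

This greedy nearest-centroid matching is simple and fast; production trackers add refinements like tolerating objects that disappear for a few frames.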
  22. [Diagram: a CNN classifies each frame independently (“Rugby”), then the per-frame CNN outputs feed into an RNN]
  23. clip / frame prediction = f_RNN(CNN frame encodings): the CNN encodes each frame into a feature vector, and the RNN consumes the sequence of encodings
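A hedged Keras sketch of the CNN + RNN pattern: a small CNN encodes each frame, TimeDistributed applies it across the time axis, and an LSTM produces one clip-level prediction. All shapes and sizes (16 frames of 64x64 RGB, 5 classes) are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Per-frame CNN encoder (toy-sized; a real model would be much deeper).
frame_encoder = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(8, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),           # -> 8-dim frame encoding
])

model = keras.Sequential([
    keras.Input(shape=(16, 64, 64, 3)),        # (time, H, W, colour)
    layers.TimeDistributed(frame_encoder),     # CNN applied to every frame
    layers.LSTM(16),                           # RNN over frame encodings
    layers.Dense(5, activation="softmax"),     # one prediction per clip
])
```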
  24. [image-only slide]

  25. Frame CNN model test accuracy = 71%; Video CNN + RNN model test accuracy = 93%
  26. Frame CNN model test accuracy = 51%; Video CNN + RNN model test accuracy = 87%
  27. Network 1: CNN embedder compresses faces & landmarks to a vector (Few-Shot Adversarial Learning of Realistic Neural Talking Head Models)
  28. Network 3: Discriminator learns to tell apart real and synthesized photos (Few-Shot Adversarial Learning of Realistic Neural Talking Head Models)
  29. GREAT FREE RESOURCES: Deep Learning Indaba http://www.deeplearningindaba.com; Jeremy Howard & Rachel Thomas http://course.fast.ai; Andrej Karpathy’s class on Computer Vision http://cs231n.github.io; Richard Socher’s class on NLP (great RNN resource) http://web.stanford.edu/class/cs224n/; Keras docs https://keras.io/