PyConZA 2019 Keynote - Deep Neural Networks for Video Applications

DEEP NEURAL NETWORKS for Video Applications
Alex Conway
alex @ numberboost.com
PYCONZA Keynote 2019
Neither confidential nor proprietary - please distribute ;)

2016 MultiChoice Innovation Competition 1st Prize Winners
2017 Mercedes-Benz Innovation Competition 1st Prize Winners
2018 Lloyd’s Register Innovation Competition 1st Prize Winners
2019 NTT & Dimension Data Innovation Competition 1st Prize Winners

HANDS UP!

https://www.youtube.com/watch?v=Gz0QZP2RKWA

https://twitter.com/goodfellow_ian/status/1084973596236144640

https://twitter.com/quasimondo/status/1100016467213516801

https://www.youtube.com/watch?feature=youtu.be&v=r6zZPn-6dPY&app=desktop

ORIGINAL FILM: Rear Window (1954)
PIX2PIX MODEL OUTPUT: Fully Automated
RE-MASTERED BY HAND: Painstakingly
https://hackernoon.com/remastering-classic-films-in-tensorflow-with-pix2pix-f4d551fa0503

INPUT OUTPUT ORIGINAL https://arstechnica.com/information-technology/2017/02/google-brain-super-resolution-zoom-enhance/

https://techcrunch.com/2016/06/20/twitter-is-buying-magic-pony-technology-which-uses-neural-networks-to-improve-images/

CONTENT IMAGE + STYLE IMAGE = STYLE TRANSFER OUTPUT
https://arxiv.org/abs/1508.06576

https://github.com/junyanz/CycleGAN

https://news.developer.nvidia.com/ai-can-transform-anyone-into-a-professional-dancer/

https://github.com/JoYoungjoo/SC-FEGAN

https://www.linkedin.com/feed/update/urn:li:activity:6498172448196820993

https://motherboard.vice.com/en_us/article/gydydm/gal-gadot-fake-ai-porn

https://www.youtube.com/watch?v=MVBe6_o4cMI

https://twitter.com/XHNews/status/1098173090448629760

https://www.youtube.com/watch?v=aE1kA0Jy0Xg

https://www.youtube.com/watch?v=xhp47v5OBXQ

https://www.reddit.com/r/Cyberpunk/comments/ddplms/hk_wearable_face_projector_to_avoid_face/

https://twitter.com/x0rz/status/1104744170529439744

f (video) = useful data

f (video) = clip label

f (video) = frame label

f (video) = object count

f (video) = object activity

f (video) = object poses

f (video) = facial expressions

f (video) = higher res video

f (video) = video with new faces

Neural Networks Crash Course

NEURAL NETWORKS
A set of connected neurons with randomly initialized weights and non-linear activation functions, arranged in a network and optimized (learned) using training data to minimize prediction error

http://playground.tensorflow.org

WHAT IS A NEURON?

LINEAR

NON-LINEAR

NON-LINEAR ACTIVATION FUNCTIONS Tanh Sigmoid ReLU
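
The three activation functions named above can be written in a few lines. This is a minimal NumPy sketch; the sample input values are made up for illustration.

```python
import numpy as np

def tanh(x):
    """Squashes any input into the range (-1, 1)."""
    return np.tanh(x)

def sigmoid(x):
    """Squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Keeps positive values and replaces negative values with 0."""
    return np.maximum(0.0, x)

# Illustrative inputs: one negative, one zero, one positive.
x = np.array([-2.0, 0.0, 2.0])
relu_out = relu(x)
sigmoid_out = sigmoid(x)
tanh_out = tanh(x)
```

All three are applied element by element, so they work on a single number or on a whole layer of neuron outputs.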

(DEEP) NEURAL NETWORKS
Inputs -> hidden layer 1 -> hidden layer 2 -> hidden layer 3 -> outputs
Note: Outputs of one layer are inputs into the next layer
This (non-convolutional) architecture is called a “multi-layer perceptron”
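
A rough sketch of how "outputs of one layer are inputs into the next layer" looks in code. The layer sizes, the ReLU choice, and the random weights are all assumptions for illustration; nothing is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Assumed sizes: 4 inputs, three hidden layers of 8 neurons, 2 outputs.
layer_sizes = [4, 8, 8, 8, 2]

# Randomly initialized weights and zero biases, one pair per layer.
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Pass a batch of inputs through every layer in turn."""
    activation = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = activation @ w + b                  # linear part of each neuron
        is_last = i == len(weights) - 1
        # Non-linear part, applied to hidden layers only.
        activation = z if is_last else relu(z)
    return activation

batch = rng.normal(size=(5, 4))   # 5 examples with 4 input features each
outputs = forward(batch)          # shape (5, 2): one row of outputs per example
```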

HOW DOES A NEURAL NETWORK LEARN?
New weight = Old weight - (Learning rate) x (“How much error increases when we increase this weight”)
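
The update rule above can be tried on a toy problem with a single weight. This is a sketch under made-up assumptions: the data follows y = 3x, the error is mean squared error, and the learning rate and step count are arbitrary.

```python
# Toy data generated by y = 3 * x, so the best weight is 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def mean_squared_error(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """How much the error increases when we increase w (derivative of the MSE)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

learning_rate = 0.01
w = 0.0  # arbitrary starting weight
for _ in range(500):
    # new weight = old weight - learning rate * gradient
    w = w - learning_rate * gradient(w)
```

After the loop, w has moved from 0 to very close to 3. A real network applies the same rule to every weight at once, with the gradients computed by backpropagation.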

GRADIENT DESCENT http://scs.ryerson.ca/~aharley/neural-networks/

1
1, 3, 3, 7, …
[[1, 2, 3] [3, 2, 1] [3, 4, 5] [7, 8, 9] …]
[[1, 2, 3] [3, 2, 1] [3, 4, 5] [7, 8, 9] …] [[1, 2, 3] [3, 2, 1] [3, 4, 5] [7, 8, 9] …]

image tensor: 500 x 500 x 3 = 750’000
60 second video at 10 FPS tensor: 500 x 500 x 3 x 10 x 60 = 450’000’000
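
The same arithmetic as a quick check in code. The axis order (frames, height, width, channels) is an assumption; other layouts give the same totals.

```python
import math

image_shape = (500, 500, 3)                    # height, width, colour channels
fps, seconds = 10, 60
video_shape = (fps * seconds,) + image_shape   # frames, height, width, channels

image_values = math.prod(image_shape)   # numbers in one image
video_values = math.prod(video_shape)   # numbers in the whole clip
```

At one byte per value (8-bit pixels), the one-minute clip is roughly 450 MB uncompressed, which is why video models usually work on resized frames or on per-frame features.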

Convolutional Neural Networks (CNNs)

INPUT 28 x 28 pixel grayscale images = 784 numbers

2 LAYER NEURAL NETWORK
0 1 2 3 4 5 6 7 8 9

https://www.youtube.com/watch?v=aircAruvnKk 3 LAYER NEURAL NETWORK

https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py (99.25% test accuracy in 192 seconds and 46 lines of code)

3 KEY CONVOLUTIONAL NETWORK ARCHITECTURE IDEAS:
1. Local receptive fields
2. Shared weights
3. Subsampling
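
The first two ideas can be seen in a bare-bones convolution. This is a NumPy sketch: the 5x5 image and the hand-picked vertical-edge kernel are made up, and a real CNN learns its kernels from data.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs (no kernel flip, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]    # local receptive field
            out[i, j] = np.sum(patch * kernel)   # the same weights at every position
    return out

# Dark on the left, bright on the right: a vertical edge.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)
```

The feature map is 3 wherever the window straddles the edge and 0 where the window sees only bright pixels. Only 9 weights are used no matter how large the image is.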

VGGNet

http://setosa.io/ev/image-kernels

http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html

Zeiler, M.D. and Fergus, R., 2014, September. Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833).


Convolutional Nets Learn Hierarchical Features

SUBSAMPLING aka “POOLING”
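
One common form of pooling is max pooling over non-overlapping windows. A minimal NumPy sketch follows; the 2x2 window size and the input values are assumptions for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    trimmed = feature_map[:h, :w]
    # Split into (row block, row in block, column block, column in block).
    windows = trimmed.reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 1],
                        [3, 4, 1, 0],
                        [0, 1, 5, 6],
                        [1, 0, 7, 8]], dtype=float)

pooled = max_pool2d(feature_map)   # 4x4 input becomes 2x2 output
```

Each output value is the largest number in its 2x2 window, so the map shrinks by a factor of four while keeping the strongest responses.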

VGGNet

we need labelled training data

ImageNet
14,197,122 images, 21,841 synsets indexed
ILSVRC: 1,200,000 images, 1,000 categories


IMAGENET TOP-5 ERROR RATE
Traditional Image Processing Methods
AlexNet (8 Layers)
ZFNet (8 Layers)
GoogLeNet (22 Layers)
ResNet (152 Layers)
SENet (Ensemble)
TSNet (Ensemble)

https://arxiv.org/abs/1611.01578

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

Example: Use CNN to Classify Product Images
https://github.com/alexcnwy/DeepLearning4ComputerVision

TRANSFER LEARNING

USING A CNN AS A FEATURE EXTRACTOR
Feature Extractor (“ENCODER”) -> Classifier

Extracting Features from an Image

feature vector = f ( )

Adding a New Classifier
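
The idea of a frozen feature extractor plus a new classifier can be sketched without a deep learning library. Everything here is a stand-in: a fixed random projection plays the role of the pre-trained CNN, the "images" are synthetic, and the classifier is plain logistic regression. It is not the code behind the results in this talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained CNN encoder: a fixed random projection plus ReLU.
# In practice this would be a real network with frozen, pre-trained weights.
encoder_weights = rng.normal(size=(64, 16))

def extract_features(images):
    """Map each flattened 'image' (64 numbers) to a 16-number feature vector."""
    return np.maximum(0.0, images @ encoder_weights)

# Synthetic two-class dataset: class 1 images are brighter on average.
labels = rng.integers(0, 2, size=200)
images = rng.normal(size=(200, 64)) + labels[:, None] * 2.0

features = extract_features(images)   # the encoder is never updated
features = (features - features.mean(0)) / features.std(0)

# New classifier on top of the features: logistic regression via gradient descent.
w = np.zeros(16)
b = 0.0
for _ in range(500):
    probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    error = probs - labels
    w -= 0.1 * features.T @ error / len(labels)
    b -= 0.1 * error.mean()

# Training accuracy of the new classifier.
accuracy = np.mean((features @ w + b > 0) == (labels == 1))
```

Only the 17 classifier parameters are trained. That is why this approach can work with few labelled images: the expensive part, learning good features, was already done on a large dataset.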

Fine-tuning A CNN To Solve A New Problem
96.3% accuracy in under 2 minutes for classifying products into categories (WITH ONLY 3467 TRAINING IMAGES!!1!)

https://www.youtube.com/watch?v=X4Q6C915sUY

https://www.pyimagesearch.com/2019/06/03/fine-tuning-with-keras-and-deep-learning/

IMAGE & VIDEO MODERATION TODO

Object Detection

https://www.youtube.com/watch?v=VOC3huqHrss

1.5 million object instances 80 object categories http://cocodataset.org

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

DEMO (HOLD THUMBS)

https://github.com/tzutalin/labelImg CUSTOM OBJECT DETECTION

https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9

CNN … P(A) = 0.005, P(B) = 0.002, P(C) = 0.98, P(9) = 0.001, P(0) = 0.03
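
Per-class probabilities like these usually come from a softmax over the network's raw output scores. The slide does not name the function, so treat softmax as an assumption; the class names and scores below are made up.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    biggest = max(scores)   # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three character classes.
classes = ["A", "B", "C"]
scores = [1.0, 0.5, 6.0]

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]   # the class with the highest probability
```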

https://www.reddit.com/r/southafrica/comments/asl4n5/when_a_little_is_just_not_enough/

HOW DOES IT WORK?
Video Frame Stream -> Person Detector + Door Detector -> Person Tracking -> Vanish in Door? -> Passenger Count
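
One way the "Vanish in Door?" step could work, as a hedged sketch. It assumes a tracker has already produced a list of centroids for each person ID, and that the door detector returns one bounding box. The function names, coordinates, and the rule itself (last seen position inside the door box) are illustrative assumptions, not the production logic.

```python
def inside(box, point):
    """True if the point (x, y) lies inside the box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def count_passengers(finished_tracks, door_box):
    """Count finished tracks whose last known position falls inside the door box."""
    return sum(1 for centroids in finished_tracks.values()
               if inside(door_box, centroids[-1]))

# Hypothetical door bounding box and two finished person tracks.
door_box = (100, 0, 200, 80)
finished_tracks = {
    1: [(150, 300), (150, 200), (150, 60)],    # walks into the door and vanishes
    2: [(400, 300), (420, 310), (430, 305)],   # never reaches the door
}

passenger_count = count_passengers(finished_tracks, door_box)
```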

OBJECT TRACKING https://www.pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/
Start by assigning each object in the first frame an ID (ID #1, ID #2)
Centroid shown as a dot in the center of each object

OBJECT TRACKING
Objects in frame t shown in green, objects in frame t+1 shown in red

OBJECT TRACKING
For each object in frame t+1, compute the Euclidean distance between its centroid and the centroid of every object in frame t

OBJECT TRACKING
Assign each object in frame t+1 the ID of the nearest object from frame t, provided the distance is less than a threshold distance (ID #1 stays ID #1, ID #2 stays ID #2)

OBJECT TRACKING
If no object from frame t is within the threshold distance, assign a new ID (ID #3)
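
The tracking steps above can be sketched as a small function. This is a simplified greedy version: each new centroid takes the nearest unclaimed ID in turn, which differs from the implementation in the linked tutorial. The function name, coordinates, and threshold are made up.

```python
import math
from itertools import count

def update_tracks(previous, detections, max_distance, next_id):
    """Match frame t+1 centroids to frame t IDs by nearest Euclidean distance.

    previous:   dict of id -> (x, y) centroid from frame t
    detections: list of (x, y) centroids from frame t+1
    next_id:    iterator that yields fresh IDs
    """
    current = {}
    unclaimed = dict(previous)
    for centroid in detections:
        best_id, best_dist = None, max_distance
        for obj_id, old_centroid in unclaimed.items():
            d = math.dist(centroid, old_centroid)
            if d < best_dist:
                best_id, best_dist = obj_id, d
        if best_id is None:
            best_id = next(next_id)   # nothing close enough: treat as a new object
        else:
            del unclaimed[best_id]    # each old ID is reused at most once
        current[best_id] = centroid
    return current

ids = count(1)
# First frame: every object gets a new ID.
frame_t = {next(ids): c for c in [(10, 10), (100, 100)]}
# Next frame: two objects moved slightly, one new object appeared.
frame_t1 = update_tracks(frame_t, [(12, 11), (103, 98), (300, 300)],
                         max_distance=50, next_id=ids)
```

Here the two slightly moved objects keep IDs 1 and 2, and the far-away detection becomes ID 3. A production tracker would also need to handle objects that disappear for a few frames.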

https://www.youtube.com/watch?v=FfU22I-_dI4

https://www.youtube.com/watch?v=NW-rXqCl7us

Recurrent Neural Networks (RNNs)

SPATIO-TEMPORAL

SPORTS 1-M

SPATIAL … THEN TEMPORAL
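
"Spatial then temporal" in code: first turn every frame into a feature vector with a CNN, then run a recurrent network over the sequence of vectors. In this sketch random numbers stand in for the CNN features, the recurrent part is a plain (Elman) RNN rather than an LSTM, and all sizes and weights are untrained assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_frames, feat_dim, hidden_dim, n_classes = 12, 32, 16, 7

# Spatial step (assumed already done): one CNN feature vector per frame.
frame_features = rng.normal(size=(n_frames, feat_dim))

# Temporal step: a plain RNN with random, untrained weights.
w_in = rng.normal(0.0, 0.1, size=(feat_dim, hidden_dim))
w_rec = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
w_out = rng.normal(0.0, 0.1, size=(hidden_dim, n_classes))

hidden = np.zeros(hidden_dim)
for features in frame_features:
    # The new state mixes the current frame with a summary of earlier frames.
    hidden = np.tanh(features @ w_in + hidden @ w_rec)

clip_scores = hidden @ w_out   # one score per class for the whole clip
```

The hidden state is what lets the model use information from earlier frames, which a frame-by-frame classifier cannot do.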

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

feature vector = f ( )

Frame model accuracy <<< Video model accuracy

https://i.imgur.com/mGXdpdp.gifv

Frame-level Action Recognition (7 classes)

Frame model accuracy <<< Video model accuracy

https://github.com/alxcnwy/Deep-Neural-Networks-for-Video-Classification

MORE (CRAZY) APPLICATIONS

VIDEO Q&A https://www.youtube.com/watch?v=UeheTiBJ0Io


https://github.com/wuhuikai/FaceSwap FACE SWAP

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models https://www.youtube.com/watch?v=p1b5aiTrGzY

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Network 1: CNN embedder compresses faces & landmarks to a vector
Network 2: Generator takes landmarks and synthesizes a photo
Network 3: Discriminator learns to tell apart real and synthesized photos

POSE ESTIMATION https://www.youtube.com/watch?v=pW6nZXeWlGM

https://github.com/CMU-Perceptual-Computing-Lab/openpose

DISTRACTED DRIVING DETECTION https://www.affectiva.com/product/affectiva-automotive-ai-for-driver-monitoring-solutions/

SELF-DRIVING CARS https://www.youtube.com/watch?v=nuMQ4LNMWu8

https://arstechnica.com/cars/2019/08/elon-musk-says-driverless-cars-dont-need-lidar-experts-arent-so-sure/

REMEMBER

f (video) = useful data

Don’t be scared to git clone functions and use deep learning!

GREAT FREE RESOURCES
Deep Learning Indaba http://www.deeplearningindaba.com
Jeremy Howard & Rachel Thomas http://course.fast.ai
Andrej Karpathy’s Class on Computer Vision http://cs231n.github.io
Richard Socher’s Class on NLP (great RNN resource) http://web.stanford.edu/class/cs224n/
Keras docs https://keras.io/

THANK YOU! @alxcnwy alex @ numberboost.com
