Slide 1

Slide 1 text

Leveraging AI for Scene Detection in Video
By learning a distance measure between shots
By: Rik Heijdens

Slide 2

Slide 2 text

Video Structure

Slide 3

Slide 3 text

Scene Detection
The task of finding Logical Story Units in video
Why?
● Automatic content indexing of (large) video libraries
● Automatic advertisement insertion

Slide 4

Slide 4 text

Architecture Overview

Slide 5

Slide 5 text

Feature extraction

Slide 6

Slide 6 text

Extracting Visual Features
● Encode frames using Google's Inception CNN

Slide 7

Slide 7 text

Extracting Audible Features
● Audio is often used to underline the development of a story
● Short-time Fourier Transforms (STFTs)
● Mel-scaled power spectrograms
S. Dieleman et al., "End-to-end learning for music audio" (ICASSP 2014)
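A minimal sketch of how an STFT can be turned into a mel-scaled power spectrogram, using only NumPy. The frame size, hop length, number of mel bands, sample rate, and the mel formula used here are illustrative assumptions; the slides do not state the settings used in the talk.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (assumed; several variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mel_power_spectrogram(signal, sample_rate, n_fft=512, hop=256, n_mels=40):
    """Short-time Fourier transform followed by mel-band pooling of the power."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    return power @ mel_filterbank(n_mels, n_fft, sample_rate).T  # (n_frames, n_mels)

# Example input: one second of a 440 Hz tone (a stand-in for a shot's audio track)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
mel_spec = mel_power_spectrogram(tone, sample_rate)
```

Each row of `mel_spec` describes one short time frame, each column one mel band. In practice a log is often applied to the result before it is fed to a network.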

Slide 8

Slide 8 text

Extracting Textual Features
Extracted from transcripts
Word2Vec embeddings
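A minimal sketch of one way to turn a shot's transcript into a fixed-size vector from word embeddings. The slides do not say how word vectors are combined per shot, so averaging is an assumption here, and `toy_vectors` is a hypothetical 3-dimensional stand-in for a trained Word2Vec model.

```python
import numpy as np

# Hypothetical toy vectors standing in for a trained Word2Vec model
toy_vectors = {
    "police": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.7, 0.3, 0.1]),
    "dinner": np.array([0.0, 0.8, 0.5]),
}

def shot_text_embedding(transcript, vectors, dim=3):
    """Average the vectors of all known words; return zeros if none are known."""
    words = transcript.lower().split()
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

text_embedding = shot_text_embedding("Police car", toy_vectors)
```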

Slide 9

Slide 9 text

Feed the extracted features into a Neural Network
1. Concatenate all the features into a single dense feature vector.
2. Feed this feature vector into a multimodal neural network that learns how to weight the components and maps high-dimensional feature vectors to lower-dimensional shot embeddings.
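The two steps above can be sketched as follows. All dimensions, the two-layer architecture, and the final normalization are illustrative assumptions. The weights here are random and untrained, whereas in the talk's setup they would be learned; the training objective is not described on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-shot feature sizes for the three modalities
visual = rng.normal(size=2048)
audio = rng.normal(size=128)
text = rng.normal(size=300)

# Step 1: concatenate all features into a single dense vector
features = np.concatenate([visual, audio, text])

# Step 2: a small feed-forward network (untrained, random weights)
hidden_dim, embed_dim = 256, 64
w1 = rng.normal(scale=0.01, size=(features.size, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.01, size=(hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

def embed_shot(x):
    """Map a high-dimensional feature vector to a unit-length shot embedding."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU activation
    out = hidden @ w2 + b2
    return out / np.linalg.norm(out)

shot_embedding = embed_shot(features)
```

The first weight matrix is where a trained network would learn how much each modality's components contribute to the embedding.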

Slide 10

Slide 10 text

Clustering
Similarity Matrix
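A minimal sketch of building a shot-by-shot similarity matrix from shot embeddings. Cosine similarity is an assumption; the slides do not name the similarity measure. The four 2-dimensional embeddings are made-up values for illustration.

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise cosine similarity between rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

# Made-up embeddings: shots 0-1 resemble each other, as do shots 2-3
shots = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
sim = similarity_matrix(shots)
```

Shots that belong to the same scene show up as bright square blocks along the diagonal of such a matrix.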

Slide 11

Slide 11 text

Clustering
Plot of similarity scores for temporally adjacent shots
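A minimal sketch of how the similarity scores for temporally adjacent shots can be read off a similarity matrix: they sit on the first off-diagonal. The matrix values below are made up for illustration.

```python
import numpy as np

# Made-up similarity matrix for four consecutive shots
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# Entry i is the similarity between shot i and shot i + 1
adjacent = np.diagonal(sim, offset=1)

# The lowest adjacent score marks the most likely boundary between scenes
lowest = int(np.argmin(adjacent))
```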

Slide 12

Slide 12 text

Clustering
Scene breaks and their confidence scores plotted on top of the Similarity Matrix
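A minimal sketch of deriving scene breaks with a confidence score from adjacent-shot similarities. This simple thresholding rule is a stand-in, not the clustering method used in the talk, which the slides do not spell out. The threshold value and the choice of `1 - similarity` as confidence are assumptions.

```python
def detect_scene_breaks(adjacent_similarity, threshold=0.5):
    """Return (shot_index, confidence) pairs.

    adjacent_similarity[i] is the similarity between shot i and shot i + 1.
    A break at shot_index k means a new scene starts at shot k.
    """
    breaks = []
    for i, score in enumerate(adjacent_similarity):
        if score < threshold:
            breaks.append((i + 1, 1.0 - score))
    return breaks

# Made-up adjacent-shot similarity scores for six consecutive shots
adjacent_scores = [0.95, 0.9, 0.2, 0.85, 0.4]
scene_breaks = detect_scene_breaks(adjacent_scores)
```

With these made-up scores, breaks are reported before shots 3 and 5, with the break at shot 3 receiving the higher confidence.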

Slide 13

Slide 13 text

Questions?
Approach me after the talk or send an email to [email protected]