Karim Jedda - Analyzing TV Series/Movies with Python – part 2

About me
• Big data architect in the Data Technology team @ Prosieben
• Data science education @ France (NLP, image recognition, …)
• Several hacks (Kaggle, data mining a flat, robotax, whatsappcli, …)
• Python enthusiast
• Still trying to figure out JavaScript and booting up Eclipse

Some background • Find a good series/movie to watch next • Using the content • and Python and my cat

Initial approach: Subtitles • Amount of data

Initial approach: Subtitles • Information in the data

Goal: Movie/Series Recommendation system • What should I watch next?

What is a recommendation system and why should I care?

Item based recommendation

Similarity: Feature definition (text only)
• Talk time
• Talk frequency
• Episode/Movie duration
• Idle time
• Number of words
• Number of sentences
• Most used words
• Word length
• Sentence length
• Vocabulary richness
• Time to read
• SMOG grade
• Topic modelling
• Summary
• Polarity
• Word usage
• Sentence beginnings
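A few of these text features can be sketched with the standard library alone. This is a minimal illustration, not the code from the talk: the sample subtitle text, the regular expressions, and the function names `parse_srt` and `text_features` are assumptions, and "vocabulary richness" is taken here to mean the ratio of distinct words to total words.

```python
import re
from collections import Counter

# Hypothetical two-cue SRT snippet; real input would be a full subtitle file.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Tonight's the night.

2
00:00:05,000 --> 00:00:07,000
And it's going to happen again and again.
"""

# One cue: start time, end time, then text up to a blank line or end of input.
CUE_RE = re.compile(
    r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)\n(.*?)(?:\n\n|\Z)",
    re.S,
)


def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(text):
    """Return a list of (start, end, text) tuples, times in seconds."""
    cues = []
    for match in CUE_RE.finditer(text):
        g = match.groups()
        cues.append((to_seconds(*g[0:4]), to_seconds(*g[4:8]), g[8].strip()))
    return cues


def text_features(cues, duration):
    """Compute a handful of the listed features for one episode or movie."""
    full_text = " ".join(text for _, _, text in cues)
    words = re.findall(r"[a-z']+", full_text.lower())
    sentences = [s for s in re.split(r"[.!?]+", full_text) if s.strip()]
    talk_time = sum(end - start for start, end, _ in cues)
    n = len(words) or 1  # avoid division by zero on empty subtitles
    return {
        "talk_time": talk_time,
        "idle_time": duration - talk_time,
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_length": sum(map(len, words)) / n,
        "vocabulary_richness": len(set(words)) / n,
        "most_used_words": Counter(words).most_common(3),
    }


features = text_features(parse_srt(SAMPLE_SRT), duration=10.0)
```

Features such as the SMOG grade, polarity, or topic modelling would need additional libraries (nltk is among those named later in the talk) and are left out of this sketch.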

Let's do it

Recommender using the subtitles

Unprocessed/Not Analyzed data
• Audio
• Video
• Most important information!
• Let's play a bit with it

Video analysis and processing
• Succession of frames
• Has different formats
• Easy to process with Python
• Contains more information than we can analyze today
So we need to know what to look for in videos

Ideas?

Sex & NSFW stuff in Series and Movies

Sex & NSFW stuff in Series and Movies • How?!

Sex & NSFW stuff in Series and Movies • How?! + Open NSFW =

open NSFW

open NSFW
root@7bb4e0bc0da0:/workspace/open_nsfw# python ./classify_nsfw.py \
    --model_def nsfw_model/deploy.prototxt \
    --pretrained_model nsfw_model/resnet_50_1by2_nsfw.caffemodel \
    the_image
Test images: IMG_5983.JPG, christy_mack_test.JPG
NSFW score: 0.0174936428666
NSFW score: 0.852280557156

Let's apply this to series/movies
Tools used:
• Python
• Caffe
• NSFW Model
• MoviePy
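One way to combine these tools is to sample a frame at regular intervals, score each frame, and merge high-scoring samples into time spans. The sketch below shows only that control flow and assumes two callables supplied by the caller: `get_frame(t)`, which in practice could be MoviePy's `VideoFileClip(path).get_frame`, and `score_frame(frame)`, a hypothetical wrapper around the Caffe NSFW model. The function names, the one-second step, and the 0.8 threshold are illustrative choices, not values taken from the talk.

```python
def nsfw_timeline(duration, get_frame, score_frame, step=1.0):
    """Score one frame every `step` seconds; return [(time, score), ...]."""
    timeline = []
    i = 0
    while i * step < duration:
        t = i * step
        timeline.append((t, score_frame(get_frame(t))))
        i += 1
    return timeline


def flag_segments(timeline, threshold=0.8, step=1.0):
    """Merge consecutive samples scoring >= threshold into (start, end) spans."""
    segments = []
    start = None
    for t, score in timeline:
        if score >= threshold:
            if start is None:
                start = t  # a new flagged span begins here
        elif start is not None:
            segments.append((start, t))  # span ends at the first clean sample
            start = None
    if start is not None:
        # The video ended while still inside a flagged span.
        segments.append((start, timeline[-1][0] + step))
    return segments


# Stand-ins for a real video and model: the "frame" is just the timestamp,
# and the fake scorer flags the interval from 2 s up to (not including) 4 s.
demo_timeline = nsfw_timeline(
    duration=6.0,
    get_frame=lambda t: t,
    score_frame=lambda frame: 0.9 if 2 <= frame < 4 else 0.1,
)
demo_segments = flag_segments(demo_timeline)
```

Plotting the `(time, score)` pairs per episode would give a timeline view of a series; the sampling step trades accuracy for processing time.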

Results

Desperate Housewives

Desperate Housewives NSFW SFW ??

Desperate Housewives

Dexter

Game of Thrones

Narcos

Sex and the City

Californication

Californication 10/10 would watch!

Recap
• Automatic tagging of videos
• Not limited to NSFW
• Generating more usable data
• Models can be trained easily

Similarity: Feature definition for Videos
• NSFW score
• Color palette
• Brightness
• Length
• Scenes & Rhythm
• ...
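Two of these video features, brightness and color palette, can be illustrated on a single frame represented as a flat list of `(r, g, b)` tuples. This is a simplified sketch under stated assumptions: brightness is taken as mean luma with the Rec. 601 weights, and the palette is obtained by coarse per-channel quantization. Both definitions, and the function names, are choices made here rather than the ones used in the talk.

```python
from collections import Counter


def brightness(pixels):
    """Mean luma in the range 0..255, using Rec. 601 channel weights."""
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / len(pixels)


def palette(pixels, levels=4, top=3):
    """Return the `top` most common colors after quantizing each channel
    into `levels` equally sized buckets (bucket centers are returned)."""
    bucket = 256 // levels
    counts = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    return [
        tuple(c * bucket + bucket // 2 for c in color)
        for color, _ in counts.most_common(top)
    ]


# Tiny hypothetical frame: three white pixels and one black pixel.
frame = [(255, 255, 255)] * 3 + [(0, 0, 0)]
frame_brightness = brightness(frame)
frame_palette = palette(frame, top=1)
```

Averaging these per-frame values over sampled frames would give one number (or one small palette) per episode to feed into a similarity measure.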

Ok, but what else? • Let's put Video and Text together in a practical example

Video summarization • Let's generate trailers

IBM Watson did it • Using AI • Complex • https://www.youtube.com/watch?v=gJEzuYynaiw

Ok • What do you do when you don't have a supercomputer?

Libraries used
• pyAudioAnalysis
• PySceneDetect
• youtube-dl
• MoviePy
• pysrt
• nltk
• re
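The talk does not spell out how these libraries are combined, so the following is only one plausible sketch of subtitle-driven summarization: score each subtitle cue by how frequent its words are across the whole episode, keep the best cues, and turn them into time intervals to cut. The stopword list, the scoring rule, the padding, and the function names are all assumptions; only the standard library is used.

```python
import re
from collections import Counter

# Small hypothetical stopword list; nltk ships a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
             "i", "you", "we", "that", "this"}


def tokenize(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def pick_cues(cues, keep=2):
    """cues: list of (start, end, text). Return the `keep` best cues in time order."""
    freq = Counter(w for _, _, text in cues for w in tokenize(text))

    def score(cue):
        words = tokenize(cue[2])
        # Average corpus frequency of the cue's words; empty cues score zero.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(cues, key=score, reverse=True)[:keep]
    return sorted(best)  # back to chronological order


def to_intervals(cues, padding=0.5):
    """Pad each cue by `padding` seconds and merge overlapping spans."""
    intervals = []
    for start, end, _ in cues:
        start, end = max(0.0, start - padding), end + padding
        if intervals and start <= intervals[-1][1]:
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            intervals.append((start, end))
    return intervals


# Hypothetical cues, times in seconds.
demo_cues = [
    (0.0, 2.0, "The dragon is coming"),
    (10.0, 12.0, "I like tea"),
    (12.5, 14.0, "The dragon burns the city"),
    (30.0, 31.0, "Okay"),
]
demo_intervals = to_intervals(pick_cues(demo_cues, keep=2))
```

The resulting intervals could then be cut from the video and joined, for example with MoviePy 1.x's `subclip` and `concatenate_videoclips`. Scene boundaries from PySceneDetect or audio cues from pyAudioAnalysis could refine where each cut starts and ends.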

Output

Demo • Original: https://www.youtube.com/watch?v=9VDvgL58h_Y • Summarized version using subtitles and Python: https://youtu.be/6Yvy1wHItSA

Now let's try to • Boil down the series to their NSFW content

Now let's try to • Boil down the series to their NSFW content ... No, just kidding :)

Back to business...

An easy recommendation engine concept • Goal: Similarity

An easy recommendation engine concept • Goal: Similarity • Fast hack: Use Elasticsearch's built-in "More Like This" feature • Better option: build your own similarity measure

Elasticsearch
• Powerful search engine
• Very easy to use and to install
• Similar to Solr
• Has lots of hidden features

An easy recommendation engine concept • MLT selects a set of representative terms of these input documents, forms a query using these terms, executes the query and returns the results. The user controls the input documents, how the terms should be selected and how the query is formed. more_like_this can be shortened to mlt.
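The three steps in that description (select representative terms, form a query, execute it) can be mimicked in a few lines of plain Python. This toy version is not Elasticsearch's implementation: it ranks terms by a basic tf-idf weight and scores other documents by the summed idf of the query terms they contain, whereas Elasticsearch relies on Lucene's own scoring. The function names and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def terms(text):
    return re.findall(r"[a-z]+", text.lower())


def more_like_this(doc_id, docs, max_query_terms=3):
    """Toy MLT over `docs` (id -> text). Returns (query_terms, ranked_ids)."""
    tokenized = {key: terms(text) for key, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized.values() for t in set(toks))

    def idf(term):
        return math.log(n_docs / df[term])

    # Step 1: select representative terms of the input document by tf-idf
    # (ties broken alphabetically so the result is deterministic).
    tf = Counter(tokenized[doc_id])
    query = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))[:max_query_terms]

    # Steps 2 and 3: form a disjunctive query from those terms and score
    # every other document by the idf of the query terms it contains.
    scores = {}
    for other, toks in tokenized.items():
        if other == doc_id:
            continue
        present = set(toks)
        score = sum(idf(t) for t in query if t in present)
        if score > 0:
            scores[other] = score
    ranked = sorted(scores, key=lambda key: (-scores[key], key))
    return query, ranked


# Hypothetical keyword summaries per series.
demo_docs = {
    "dexter": "blood killer police miami killer",
    "hannibal": "killer police blood dinner",
    "friends": "coffee friends apartment",
}
demo_query, demo_ranked = more_like_this("dexter", demo_docs)
```

In this toy corpus the query for "dexter" is built from its most distinctive terms, and only documents sharing at least one of them are returned.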

An easy recommendation engine concept 1. create an index based on the calculated features

An easy recommendation engine 2. query it
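As a rough sketch of these two steps, the snippet below builds the JSON bodies one might send to Elasticsearch: one document per series holding text-like features, and a `more_like_this` query that uses an indexed document as its input. The index name `series`, the field names `top_words` and `topics`, and the helper names are hypothetical. It assumes the features are stored as text fields, since More Like This works on terms. The parameter values are illustrative.

```python
import json

INDEX = "series"  # hypothetical index name


def feature_doc(title, top_words, topics):
    """Step 1: one document per series, with term-like features as text."""
    return {
        "title": title,
        "top_words": " ".join(top_words),
        "topics": " ".join(topics),
    }


def mlt_query(doc_id, fields=("top_words", "topics"), size=5):
    """Step 2: ask for documents similar to an already indexed one."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Use an indexed document, referenced by id, as the input.
                "like": [{"_index": INDEX, "_id": doc_id}],
                # Lowered from the defaults because the demo corpus is tiny.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


demo_doc = feature_doc("Dexter", ["blood", "killer"], ["crime"])
demo_body = json.dumps(mlt_query("dexter"), indent=2)
```

The document would be stored under an id such as `dexter`, and the query body sent to the `_search` endpoint of the index. The exact request paths and client calls depend on the Elasticsearch version in use.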

First Results: What should I watch after Dexter?

Conclusion
• We test the capabilities of machine learning methods to enhance our dataset
• Size of the data does matter, but variety matters more
• Trying out out-of-the-box solutions is always rewarding and motivating
• Stay tuned for part 3 on http://funnybretzel.com or @KarimJDDA
Interested in analyzing unique data in an innovative environment and working on super cool projects?
• Big Data Engineer x1
• Data Scientist x1
[email protected]

Thank you!

Combining Audio, Video and Text: Building and scaling a unique content recommendation system with open source tools
