iubu.it


What?
• An automatic book recommendation system
• It suggests books based on the video you're watching on YouTube
• It is completely web-based
• Built mainly in JavaScript (client side) and PHP (server side)
• It makes use of the Google Books API

How does it work?
• It uses a bookmarklet to load an external JavaScript file
• It injects an IFRAME into the YouTube page's DOM tree
javascript:(function(){_iubu_js=document.createElement('SCRIPT');_iubu_js.type='text/javascript';_iubu_js.src='http://iubu.it/bm.js';document.getElementsByTagName('head')[0].appendChild(_iubu_js);})();
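
For readability, the one-line bookmarklet can be unrolled into a named function. The sketch below also shows what an IFRAME injection step could look like. The deck does not show the contents of bm.js, so `injectPanel`, its panel URL parameter, and the inline styling are hypothetical, and `doc` is passed in only so the functions can be exercised outside a browser.

```javascript
// Readable equivalent of the bookmarklet: append a <script> tag to <head>.
function loadExternalScript(doc, src) {
  const script = doc.createElement('script');
  script.type = 'text/javascript';
  script.src = src;
  doc.getElementsByTagName('head')[0].appendChild(script);
  return script;
}

// Hypothetical sketch of the injection step the loaded script might perform.
// The panel URL and the styling are assumptions, not taken from bm.js.
function injectPanel(doc, panelUrl) {
  const frame = doc.createElement('iframe');
  frame.src = panelUrl;
  frame.style.cssText =
    'position:fixed;top:0;right:0;width:320px;height:100%;border:0;z-index:9999';
  doc.body.appendChild(frame);
  return frame;
}
```

In a browser, the first function would be called as `loadExternalScript(document, 'http://iubu.it/bm.js')`.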

What does it look like?

Feature extraction
• Basically: keyword extraction from the YouTube video's page
• Sources:
  • Title
  • Meta Keywords
  • Description
  • Comments
  • Related
  • Category

Feature extraction
• First: stopword removal
• Second: word stemming
• Third: stem frequency distribution
• Resulting dataset: {"stem1":f1, "stem2":f2, ... , "stem9":f9}
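
The three steps above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the deck's actual code: the stopword list is deliberately tiny, `naiveStem` is a crude suffix stripper standing in for a real stemmer (such as Porter's), and the function names are made up for the example. Frequencies are raw counts.

```javascript
// Tiny illustrative stopword list; a real one would be much longer.
const STOPWORDS = new Set(['the', 'a', 'an', 'of', 'and', 'in', 'on', 'to', 'is', 'for']);

// Naive suffix stripper standing in for a real stemming algorithm.
function naiveStem(word) {
  for (const suffix of ['ing', 'es', 's']) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Stopword removal, then stemming, then a stem -> count map.
function extractFeatures(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const freq = {};
  for (const word of words) {
    if (STOPWORDS.has(word)) continue;
    const stem = naiveStem(word);
    freq[stem] = (freq[stem] || 0) + 1;
  }
  return freq;
}
```

For example, `extractFeatures('Books and the book on cooking')` yields an object with `book` counted twice and `cook` once.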

In an ideal world
• Collection of N → ∞ patterns
• Clustering of the patterns based on their semantic distance
• K clusters, matched against book categories
• Do the same for books
• Boom!

The (not so) sad reality
• Collecting a sufficiently large number N of patterns would have been impractical
• The Google Books API doesn't allow tag extraction or category filtering (!)
• Data sources are quite volatile: we need to do it *in real time*
• In the end: not enough time and resources to build such a system

So what?
• Full-text search of the video's top keywords on Google Books
• Categorization made to support the search process
• Main principle:
  • Let's be specific first (keyword search)
  • More "relaxed" search in case of failure (category + 1 keyword)
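
The "specific first, relaxed on failure" principle could look roughly like this. It is a sketch under assumptions: `search` is a hypothetical async function that takes a query string and resolves to an array of books (for instance a thin wrapper around a Google Books full-text query), and the choice of three top keywords is arbitrary.

```javascript
// Build the two queries: top keywords first, then category + best keyword.
function buildQueries(features, category, topN = 3) {
  const keywords = Object.entries(features)
    .sort((a, b) => b[1] - a[1]) // highest frequency first
    .slice(0, topN)
    .map(([stem]) => stem);
  const specific = keywords.join(' ');
  const relaxed = [category, keywords[0]].filter(Boolean).join(' ');
  return [specific, relaxed].filter((query) => query.length > 0);
}

// Try each query in order and return the first non-empty result set.
// `search` is a hypothetical async function: query string -> array of books.
async function recommendBooks(features, category, search) {
  for (const query of buildQueries(features, category)) {
    const books = await search(query);
    if (books.length > 0) return books;
  }
  return [];
}
```

Keeping query construction separate from the network call makes the fallback order easy to check without hitting any API.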

Categorization
• Two-level categorization: macro categories > subcategories
• YouTube categories mapped to macro categories
• Example:
  • sports (macro category)
    ‣ football, baseball, swimming (subcategories)
• Categorization based on semantic distance

WordNet

Semantic distance
[Figure: diagram with nodes labeled word, sense, and synset]

Semantic distance
• Between two words: minimum number of hops (synsets) between them
• We can use a typical shortest path algorithm
• Similarity ∝ 1 / distance
• Semantic similarity between a category and a {"stem1":f1, "stem2":f2, ...} pattern: $\sum_i \mathrm{similarity}(\mathrm{category}, \mathrm{stem}_i) \cdot f_i$
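
A minimal sketch of the scoring above, using breadth-first search as the shortest path algorithm on an unweighted graph. `TOY_GRAPH` is an invented stand-in for WordNet, not real WordNet data, and the function names are illustrative. One assumption is worth flagging: similarity is computed as 1 / (1 + distance) rather than 1 / distance, so that identical words score 1 instead of dividing by zero.

```javascript
// Toy undirected graph standing in for WordNet; the edges are invented.
const TOY_GRAPH = {
  sport: ['football', 'swimming'],
  football: ['sport', 'ball'],
  swimming: ['sport', 'water'],
  ball: ['football'],
  water: ['swimming'],
};

// Breadth-first search: minimum hop count, or Infinity if unreachable.
function hopDistance(graph, from, to) {
  if (from === to) return 0;
  const seen = new Set([from]);
  let frontier = [from];
  for (let hops = 1; frontier.length > 0; hops++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] || []) {
        if (neighbor === to) return hops;
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return Infinity;
}

// Assumption: 1 / (1 + distance); unreachable words get similarity 0.
function similarity(graph, a, b) {
  return 1 / (1 + hopDistance(graph, a, b));
}

// Frequency-weighted sum of similarities between a category and each stem.
function categoryScore(graph, category, features) {
  let score = 0;
  for (const [stem, freq] of Object.entries(features)) {
    score += similarity(graph, category, stem) * freq;
  }
  return score;
}
```

The category with the highest `categoryScore` for a given pattern would then be the one assigned to the video.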

Validation
• Human-based: user feedback (yes, no) per suggested book
• Not very sophisticated indeed: false-positive removal only
• Remove on: no > 5 AND no / (yes+no) > 0.6
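
The removal rule translates directly into a predicate. The function name and argument order are illustrative; the thresholds are the ones stated above.

```javascript
// True when a suggested book should be dropped as a false positive:
// more than 5 "no" votes AND "no" votes exceed 60% of all votes.
function shouldRemove(yes, no) {
  return no > 5 && no / (yes + no) > 0.6;
}
```

Because the first condition requires at least six "no" votes, the ratio is never computed on an empty vote count.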

Demo
• Let's see it in action!
• Disclaimer: it's still pretty rough, but it works (sometimes, somehow)

maffeis
