iubu.it

maffeis
September 12, 2012

Read your videos

Transcript

  1. What?
    • An automatic book recommendation system
    • It suggests books based on the video you’re watching on YouTube
    • It is completely web-based
    • Built mainly in JavaScript (client side) and PHP (server side)
    • It makes use of the Google Books API
  2. How does it work?
    • It uses a bookmarklet to load an external JavaScript file
    • It injects an IFRAME into the YouTube page’s DOM tree

    javascript:(function(){_iubu_js=document.createElement('SCRIPT');_iubu_js.type='text/javascript';_iubu_js.src='http://iubu.it/bm.js';document.getElementsByTagName('head')[0].appendChild(_iubu_js);})();
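
The IFRAME injection step could look roughly like the sketch below. The slides do not show the contents of bm.js, so the function name `injectPanel`, the `url` query parameter, the panel URL, and the styling are all hypothetical.

```javascript
// Hypothetical helper: the real bm.js is not shown in the slides.
// `doc` is the page's document object, `panelUrl` an assumed panel address,
// `pageUrl` the address of the video page being watched.
function injectPanel(doc, panelUrl, pageUrl) {
  const frame = doc.createElement('iframe');
  // Assumption: the panel learns which video is playing from a query parameter.
  frame.src = panelUrl + '?url=' + encodeURIComponent(pageUrl);
  // Assumed styling: pin the panel to the right edge of the page.
  frame.style.cssText =
    'position:fixed;top:0;right:0;width:320px;height:100%;border:0;z-index:9999';
  doc.body.appendChild(frame);
  return frame;
}
```

In a browser this would be called as `injectPanel(document, panelUrl, location.href)`. Taking the document as a parameter keeps the function easy to exercise outside a real page.
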
  3. Feature extraction
    • Basically: keyword extraction from the YouTube video’s page
    • Sources:
      • Title
      • Meta Keywords
      • Description
      • Comments
      • Related videos
      • Category
  4. Feature extraction
    • First: stopword removal
    • Second: word stemming
    • Third: stem frequency distribution
    • Resulting dataset: {“stem1”:f1, “stem2”:f2, ... , “stem9”:f9}
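
The three steps above can be sketched as follows. This is only an illustration: the stopword list is a tiny made-up sample, `naiveStem` is a crude suffix stripper standing in for a real stemmer, and the default limit of 9 stems is an assumption read off the “stem9” in the example dataset.

```javascript
// Tiny illustrative stopword list; a real one would be far longer.
const STOPWORDS = new Set([
  'a', 'an', 'and', 'the', 'of', 'to', 'in', 'is', 'it', 'for', 'on', 'with',
]);

// Naive suffix stripping, standing in for a real stemmer such as Porter's.
function naiveStem(word) {
  for (const suffix of ['ing', 'ers', 'er', 'es', 's']) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Returns the `limit` most frequent stems as a {stem: frequency} object.
function extractFeatures(text, limit = 9) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  for (const word of words) {
    if (STOPWORDS.has(word)) continue;              // 1. stopword removal
    const stem = naiveStem(word);                   // 2. stemming
    counts.set(stem, (counts.get(stem) || 0) + 1);  // 3. frequency count
  }
  const top = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
  return Object.fromEntries(top);
}
```

In the real system the input text would be the concatenation of the sources listed earlier (title, keywords, description and so on).
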
  5. In an ideal world
    • Collection of N → ∞ patterns
    • Clustering of the patterns based on their semantic distance
    • K clusters, matched against book categories
    • Do the same for books
    • Boom!
  6. The (not so) sad reality
    • Collecting a sufficiently large number N of patterns would have been impractical
    • Google Books API doesn’t allow tag extraction or category filtering (!)
    • Data sources are quite volatile: we need to do it *in real time*
    • In the end: not enough time and resources to build such a system
  7. So what?
    • Full-text search of the video’s top keywords on Google Books
    • Categorization made to support the search process
    • Main principle:
      • Let’s be specific first (keyword search)
      • More “relaxed” search in case of failure (category + 1 keyword)
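
A minimal sketch of the “specific first, relaxed on failure” principle. The `searchBooks` parameter is a hypothetical function that takes a query string and returns an array of books; in practice it would wrap a Google Books request and be asynchronous, but it is kept synchronous here for brevity. Using the single top keyword in the fallback is one reading of “category + 1 keyword”.

```javascript
// `searchBooks(query)` is a hypothetical search function supplied by the caller.
// `keywords` is assumed to be sorted by descending frequency.
function suggestBooks(keywords, category, searchBooks) {
  // Specific first: search on all top keywords together.
  const specific = searchBooks(keywords.join(' '));
  if (specific.length > 0) return specific;
  // Relaxed fallback: the category plus only the top keyword.
  return searchBooks(category + ' ' + keywords[0]);
}
```

Passing the search function in keeps the fallback logic independent of how the book search itself is performed.
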
  8. Categorization
    • Two-level categorization: macro categories > subcategories
    • YouTube categories mapped to macro categories
    • Example:
      • sports (macro category)
        ‣ football, baseball, swimming (subcategories)
    • Categorization based on semantic distance
  9. Semantic distance
    • Between two words: minimum hop count between their synsets
    • We can use a typical shortest path algorithm
    • Similarity ∝ 1 / distance
    • Semantic similarity between a category and a {“stem1”:f1, “stem2”:f2, ...} pattern: ∑ similarity(category, stem_i) * f_i, summed over all stems i
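
The scoring above can be sketched with a breadth-first search, which finds the minimum hop count in an unweighted graph. The graph below is a made-up toy standing in for real synset links, and giving identical words a similarity of 1 is an assumption, since the slide leaves the distance-0 case undefined.

```javascript
// Toy undirected graph standing in for synset links; words and edges are made up.
const GRAPH = {
  sport: ['football', 'swimming', 'game'],
  football: ['sport', 'ball'],
  swimming: ['sport', 'water'],
  game: ['sport'],
  ball: ['football'],
  water: ['swimming'],
};

// Breadth-first search: minimum hop count between two nodes.
function distance(graph, from, to) {
  if (from === to) return 0;
  const seen = new Set([from]);
  let frontier = [from];
  for (let hops = 1; frontier.length > 0; hops++) {
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] || []) {
        if (neighbor === to) return hops;
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return Infinity; // no path between the two words
}

// Similarity taken as 1 / distance; unreachable words score 0.
// Assumption: identical words (distance 0) score 1.
function similarity(graph, a, b) {
  const d = distance(graph, a, b);
  return d === 0 ? 1 : 1 / d;
}

// Sum over stems of similarity(category, stem) * frequency.
function categoryScore(graph, category, pattern) {
  return Object.entries(pattern).reduce(
    (sum, [stem, f]) => sum + similarity(graph, category, stem) * f,
    0
  );
}
```

The category with the highest score for a given pattern would then be the natural choice, for example by calling `categoryScore` once per candidate category and keeping the maximum.
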
  10. Validation
    • Human-based: user feedback (yes, no) per suggested book
    • Not very sophisticated indeed: false-positive removal only
    • Remove when: no > 5 AND no / (yes+no) > 0.6
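
The removal rule translates directly into a predicate. The function name `shouldRemove` is hypothetical; the thresholds are the ones given on the slide.

```javascript
// True when a suggestion should be dropped as a false positive:
// more than 5 "no" votes AND "no" votes above 60% of all votes.
function shouldRemove(yes, no) {
  return no > 5 && no / (yes + no) > 0.6;
}
```

Requiring both conditions means a book is never removed on a handful of votes, and a book with many votes is only removed when the negative ones clearly dominate. For instance, 4 yes and 6 no gives a ratio of exactly 0.6, which does not exceed the threshold, so the book stays.
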
  11. Demo
    • Let’s see it in action!
    • Disclaimer: it’s still pretty rough, but it works (sometimes, somehow)