iubu.it

maffeis
September 12, 2012

Read your videos

Transcript

  1. What?
    • An automatic book recommendation system
    • It suggests books based on the video you’re watching on YouTube
    • It is completely web based
    • Built mainly in JavaScript (client side) and PHP (server side)
    • It makes use of the Google Books API
  2. How does it work?
    • It uses a bookmarklet to load an external JavaScript file
    • It injects an IFRAME into the YouTube page’s DOM tree
    javascript:(function(){_iubu_js=document.createElement('SCRIPT');_iubu_js.type='text/javascript';_iubu_js.src='http://iubu.it/bm.js';document.getElementsByTagName('head')[0].appendChild(_iubu_js);})();
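The slides do not show the loaded script itself, so the following is only a sketch of how an IFRAME injection of this kind could look. The function name `injectPanel`, the element id `iubu-panel`, the `v` query parameter and the styling are all hypothetical; the document object is passed in so the function can be tried outside a browser.

```javascript
// Hypothetical sketch: the real bm.js is not shown in the slides.
// Creates an IFRAME pointing at a panel URL and appends it to the page body.
function injectPanel(doc, panelUrl, videoId) {
  const existing = doc.getElementById('iubu-panel'); // assumed element id
  if (existing) {
    return existing; // do not add a second panel if the bookmarklet is clicked twice
  }
  const frame = doc.createElement('iframe');
  frame.id = 'iubu-panel';
  // Assumed convention: the video id travels in a "v" query parameter.
  frame.src = panelUrl + '?v=' + encodeURIComponent(videoId);
  frame.style.cssText =
    'position:fixed;top:0;right:0;width:320px;height:100%;border:0;z-index:9999';
  doc.body.appendChild(frame);
  return frame;
}
```

In a browser this would be called with the page's own `document` and whatever panel URL the server side exposes.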
  3. Feature extraction
    • Basically: keyword extraction from the YouTube video’s page
    • Sources:
      • Title
      • Meta keywords
      • Description
      • Comments
      • Related
      • Category
  4. Feature extraction
    • First: stopword removal
    • Second: word stemming
    • Third: stem frequency distribution
    • Resulting dataset: {“stem1”:f1, “stem2”:f2, ... , “stem9”:f9}
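The three steps above can be sketched as a small pipeline. This is an illustration under assumptions, not the project's code: the stopword list is a tiny made-up sample, and `naiveStem` is a crude suffix stripper standing in for whatever real stemmer (for example a Porter stemmer) the system used.

```javascript
// Tiny illustrative stopword list; a real one would be much longer.
const STOPWORDS = new Set(['the', 'a', 'an', 'of', 'and', 'to', 'in', 'is', 'on', 'for']);

// Crude suffix stripper used as a placeholder for a real stemming algorithm.
function naiveStem(word) {
  for (const suffix of ['ing', 'es', 's']) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Turns free text into a {stem: frequency} pattern.
function extractFeatures(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  const freq = Object.create(null); // no inherited keys such as "constructor"
  for (const word of words) {
    if (STOPWORDS.has(word)) continue;      // first: stopword removal
    const stem = naiveStem(word);           // second: stemming
    freq[stem] = (freq[stem] || 0) + 1;     // third: frequency distribution
  }
  return freq;
}
```

For the text "Cooking the pasta: pasta recipes and cooking tips" this yields the stems cook (2), pasta (2), recip (1) and tip (1).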
  5. In an ideal world
    • Collection of N → ∞ patterns
    • Clustering of the patterns based on their semantic distance
    • K clusters, matched against book categories
    • Do the same for books
    • Boom!
  6. The (not so) sad reality
    • Collecting a sufficiently big N of patterns would have been impractical
    • The Google Books API doesn’t allow tag extraction or category filtering (!)
    • Data sources are quite volatile: we need to do it *in real time*
    • In the end: not enough time and resources to build such a system
  7. So what?
    • Full-text search of the video’s top keywords on Google Books
    • Categorization made to support the search process
    • Main principle:
      • Let’s be specific first (keyword search)
      • More “relaxed” search in case of failure (category + 1 keyword)
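The "specific first, relaxed on failure" principle could be sketched as below. Everything here is an assumption made for illustration: the helper names, the cut-off of three top keywords, and the injected `search` function (query string in, promise of an array of books out), which keeps the strategy testable without network access. Only the Google Books volumes endpoint in `booksSearchUrl` is a real URL.

```javascript
// Builds a Google Books volumes search URL for a free-text query.
function booksSearchUrl(query) {
  return 'https://www.googleapis.com/books/v1/volumes?q=' + encodeURIComponent(query);
}

// Returns the `count` most frequent stems of a {stem: frequency} pattern.
function topKeywords(pattern, count) {
  return Object.keys(pattern)
    .sort((a, b) => pattern[b] - pattern[a])
    .slice(0, count);
}

// Tries the specific keyword search first, then the relaxed one.
async function recommend(pattern, category, search) {
  const keywords = topKeywords(pattern, 3); // 3 is an assumed cut-off
  if (keywords.length > 0) {
    const specific = await search(keywords.join(' '));
    if (specific.length > 0) return specific;
  }
  // Relaxed search: category plus the single top keyword, if there is one.
  const relaxed = keywords.length > 0 ? category + ' ' + keywords[0] : category;
  return search(relaxed);
}
```

A real `search` could fetch `booksSearchUrl(query)` and return the `items` array of the JSON response, falling back to an empty array when it is missing.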
  8. Categorization
    • Two-level categorization: macro categories > subcategories
    • YouTube categories mapped to macro categories
    • Example:
      • sports (macro category)
        ‣ football, baseball, swimming (subcategories)
    • Categorization based on semantic distance
  9. Semantic distance
    • Between two words: minimum hop (synset) count
    • We can use a typical shortest-path algorithm
    • Similarity ∝ 1 / distance
    • Semantic similarity between category and {“stem1”:f1, “stem2”:f2, ...} pattern: ∑_i similarity(category, stem_i) * f_i
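A minimal sketch of the scoring idea, using breadth-first search as the shortest-path algorithm on an unweighted graph. The graph below is made-up toy data standing in for a synset network, and the similarity is computed as 1 / (1 + distance): the extra 1 is an assumption added here so that identical words get a finite score, while unreachable words score 0.

```javascript
// Toy undirected graph standing in for a synset network (made-up data).
const GRAPH = {
  sport: ['football', 'swimming', 'game'],
  football: ['sport', 'ball'],
  swimming: ['sport', 'water'],
  game: ['sport'],
  ball: ['football'],
  water: ['swimming'],
  cooking: ['food'],
  food: ['cooking'],
};

// Breadth-first search: minimum hop count between two words, or Infinity.
function hopDistance(graph, from, to) {
  if (from === to) return 0;
  const seen = new Set([from]);
  let frontier = [from];
  let hops = 0;
  while (frontier.length > 0) {
    hops += 1;
    const next = [];
    for (const node of frontier) {
      for (const neighbor of graph[node] || []) {
        if (neighbor === to) return hops;
        if (!seen.has(neighbor)) {
          seen.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return Infinity;
}

// Assumed form: 1 / (1 + distance); unreachable pairs give 1 / Infinity = 0.
function similarity(graph, a, b) {
  return 1 / (1 + hopDistance(graph, a, b));
}

// Sum over stems of similarity(category, stem_i) * f_i.
function categoryScore(graph, category, pattern) {
  let score = 0;
  for (const stem of Object.keys(pattern)) {
    score += similarity(graph, category, stem) * pattern[stem];
  }
  return score;
}
```

With this toy graph, the pattern {football: 2, ball: 3} scores higher for "sport" than for "cooking", which is the behaviour the categorization step relies on.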
  10. Validation
    • Human based: user feedback (yes, no) per suggested book
    • Not very sophisticated indeed: false-positive removal only
    • Remove on: no > 5 AND no / (yes+no) > 0.6
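The removal rule above translates almost directly into code. The function name `shouldRemove` is hypothetical, and the explicit guard for a book with no feedback is an addition for clarity (the `no > 5` condition already excludes that case).

```javascript
// Returns true when a suggested book should be dropped as a false positive:
// more than 5 "no" votes AND "no" votes above 60% of all votes.
function shouldRemove(yes, no) {
  const total = yes + no;
  if (total === 0) return false; // no feedback yet
  return no > 5 && no / total > 0.6;
}
```

For example, 3 "yes" and 6 "no" votes trigger removal (6 of 9 is about 0.67), while 4 "yes" and 6 "no" do not, because the ratio is exactly 0.6 and the rule requires it to be strictly greater.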
  11. Demo
    • Let’s see it in action!
    • Disclaimer: it’s still pretty rough, but it works (sometimes, somehow)