
Building a Simple Japanese Content-Based Recommender System in Python

Charles VDulac
September 21, 2016

Talk given at PyCon JP 2016 on the 21st of September, 2016.

Online stores such as Amazon, as well as news sites and blogs, suffer from information overload. Customers can easily get lost among the millions of products or articles on offer. Recommendation engines help users narrow down this variety by presenting relevant suggestions. In this talk, I will show how to create a simple Japanese content-based recommendation system in Python for blog posts.

Talk page: https://pycon.jp/2016/en/schedule/presentation/43/
Code: https://github.com/charles-vdulac/japanese-content-engine

  1. ABOUT ME
    • Playing with Python for ~10 years.
    • Currently software engineer at Locarise, Tokyo.
    • Father for 1 week.
  2. LOCARISE
    Locarise collects data from several sources, then processes and analyzes it to provide business intelligence support to the retail industry.
    #Python, #DataScience, #Clojure, #Micro-services, #Django, #ReactJS
  3. AGENDA
    1. What are content-based recommendation engines?
    2. Concepts used in content-based recommendation engines
    3. Of course, a little bit of code ;)
  4. 1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
    Recommender systems typically produce a list of recommendations in one of two ways:
    • Collaborative filtering approaches: we build a model from a user's past behavior.
    • Content-based filtering approaches: we utilize a series of discrete characteristics of the items.
  5. How does it work?
    • … the documents' discrete characteristics (only one time)
    • Compute the current document's characteristics (each time a user clicks on a link)
    • Compute document similarities
    • Return the closest documents
    [Diagram: CURRENT DOCUMENT, SIMILARITIES, SUGGESTIONS]
  6. 2. CONCEPTS
    2.1. We tokenize the text. It’s mostly:
    • Removing markup
    • Forcing UTF-8
    • Splitting words on `spaces`
    • Removing stop words
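The four steps above can be sketched for a space-delimited language as follows. This is only an illustration, not the code used in the talk: the `tokenize` helper, the regular expression used to strip tags and the tiny stop-word list are all assumptions.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

# Naive tag matcher, good enough for a sketch (not a full HTML parser).
TAG_RE = re.compile(r"<[^>]+>")


def tokenize(raw):
    """Turn raw markup (bytes or str) into a list of lowercase tokens."""
    # Force UTF-8: decode bytes, replacing anything that is not valid UTF-8.
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    # Remove markup by replacing every tag with a space.
    text = TAG_RE.sub(" ", raw)
    # Split on whitespace, then drop stop words.
    words = text.lower().split()
    return [word for word in words if word not in STOP_WORDS]


print(tokenize(b"<p>The cat is <b>in</b> the garden</p>"))
# ['cat', 'garden']
```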
  7. 2. CONCEPTS
    But splitting on spaces doesn’t work with most East Asian languages…
    A solution is to run a morphological analyzer based on a dictionary of words with parts of speech, and find the best sequence of words in the dictionary that matches the input character sequence.
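To make the idea of "best sequence of dictionary words" concrete, here is a toy sketch that uses dynamic programming over a made-up dictionary. It is a simplification under stated assumptions: the dictionary entries, parts of speech and costs are invented for illustration, and the score is just the sum of per-word costs, whereas real analyzers such as MeCab also score how well adjacent words connect.

```python
# Hypothetical toy dictionary: surface form -> (part of speech, cost).
# Lower cost means "more plausible word"; the numbers are made up.
DICTIONARY = {
    "東京都": ("noun", 3),
    "東京": ("noun", 3),
    "京都": ("noun", 3),
    "東": ("noun", 5),
    "都": ("suffix", 4),
    "に": ("particle", 1),
    "住む": ("verb", 2),
}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text):
    """Return the cheapest split of `text` into (word, part of speech) pairs.

    best[i] holds (total cost, words) for the cheapest way to cover
    text[:i], or None when no sequence of dictionary words covers it.
    """
    best = [None] * (len(text) + 1)
    best[0] = (0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if best[start] is None or word not in DICTIONARY:
                continue
            pos, cost = DICTIONARY[word]
            total = best[start][0] + cost
            if best[end] is None or total < best[end][0]:
                best[end] = (total, best[start][1] + [(word, pos)])
    if best[-1] is None:
        raise ValueError("text cannot be covered by the dictionary")
    return best[-1][1]


print(segment("東京都に住む"))
# [('東京都', 'noun'), ('に', 'particle'), ('住む', 'verb')]
```

With these costs, the split 東京都 / に / 住む (total cost 6) beats 東京 / 都 / に / 住む (10) and 東 / 京都 / に / 住む (11).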
  8. 2. CONCEPTS
    2.2. Vector Space Model (VSM)
    We create a feature vector for the document, where each feature is a word (a term) and the feature's value is a term weight.
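As a minimal sketch of such a feature vector, the snippet below uses the raw term count as the term weight. The sample documents and the `to_vector` helper are hypothetical; libraries usually store these vectors sparsely rather than as dense lists.

```python
from collections import Counter

# Hypothetical, already tokenized documents.
documents = [
    ["python", "recommender", "python"],
    ["japanese", "recommender"],
]

# Vocabulary: one feature (vector dimension) per distinct term, sorted.
vocabulary = sorted({term for doc in documents for term in doc})


def to_vector(tokens):
    """Dense vector whose i-th value is the raw count of vocabulary[i]."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]


vectors = [to_vector(doc) for doc in documents]
print(vocabulary)  # ['japanese', 'python', 'recommender']
print(vectors)     # [[0, 2, 1], [1, 0, 1]]
```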
  9. 2. CONCEPTS
    2.3. VSM Transformations
    We project the vector to a different space:
    • To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
    • To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).
  10. 2. CONCEPTS
    TF-IDF (term frequency–inverse document frequency)
    This transformation is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
    Intuitively:
    • If a word appears frequently in a document, it's important. Give the word a high score.
    • But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
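The intuition above can be sketched with one common textbook variant of the weight: relative term frequency multiplied by the natural log of (number of documents / number of documents containing the term). The `tf_idf` helper and the sample documents are assumptions for illustration; libraries such as Gensim apply their own variants of the formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Hypothetical, already tokenized documents.
docs = [
    ["sushi", "sushi", "tokyo"],
    ["ramen", "tokyo"],
    ["tempura", "tokyo"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "sushi" is frequent and appears nowhere else, so it gets the highest weight; "tokyo" appears in every document, so its weight is 0.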
  11. 2. CONCEPTS
    2.4. Computing the Similarities Between Documents
    This is commonly accomplished by comparing the angle between each document vector and the original query vector. This measure is the cosine similarity (the cosine distance is one minus it).
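A minimal sketch of this comparison, assuming documents are represented as sparse `{term: weight}` dictionaries (the `cosine_similarity` helper and the sample vectors are hypothetical):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


query = {"sushi": 1.0, "tokyo": 1.0}
same_direction = {"sushi": 2.0, "tokyo": 2.0}
unrelated = {"ramen": 3.0}

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, unrelated))       # 0.0: no shared terms
```

Because only the angle matters, a long document and a short one with the same word proportions score as nearly identical.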
  12. 2. CONCEPTS
    2.5. Returning the closest documents
    We sort the output by similarity score and return the first N items whose score is greater than a threshold.
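This selection step can be sketched as follows. The `closest_documents` helper, its default values and the sample scores are assumptions made for illustration.

```python
def closest_documents(scores, n=3, threshold=0.1):
    """Return up to `n` (doc_id, score) pairs with score > threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > threshold][:n]


# Hypothetical similarity scores between the current document and the corpus.
scores = {"doc_a": 0.82, "doc_b": 0.05, "doc_c": 0.40, "doc_d": 0.31, "doc_e": 0.12}
print(closest_documents(scores, n=3, threshold=0.1))
# [('doc_a', 0.82), ('doc_c', 0.4), ('doc_d', 0.31)]
```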

  13. Going to Use:
    • Gensim. A Python library for topic modelling, document indexing and similarity retrieval with large corpora.
    • Mecab-ipadic-neologd. This dictionary includes many neologisms (new words), which are extracted from many language resources on the Web.
    • Mecab-python3. A Python wrapper for MeCab.
    • The latest Japanese Wikipedia dump (~2.3 GB bz2 file).
  14. 3. BUILDING A JAPANESE ENGINE
    First script: make_corpus.py
    1. Parse the Japanese Wikipedia dump
    2. Tokenize all the documents and save tokens in a huge dictionary
    3. Compute TF-IDF for all tokens
    4. Save results
  15. 3. BUILDING A JAPANESE ENGINE
    Second script: content_engine.py
    1. Load trained TF-IDF corpus
    2. For a given Wikipedia page title, retrieve the content and apply TF-IDF
    3. Compute similarities between the document and the trained corpus
    4. Return the most similar documents
  16. 3. BUILDING A JAPANESE ENGINE
    Code is available on GitHub.

  17. NEXT STEPS
    We can improve the results by:
    - Applying other transformations like LDA, which extracts topics (topic modeling).
    - Using a taxonomy (are we talking about Apple or an apple?). We can download taxonomies or play with deep learning algorithms like Word2vec to find one automatically. Word2vec may find relationships like (Tokyo, Japan), which extends the suggestions.