Building a Simple Japanese Content-Based Recommender System in Python

Talk given at PyCon JP 2016 on September 21, 2016.

Online stores such as Amazon, as well as news and blog websites, suffer from information overload. Customers can easily get lost among millions of products or articles. Recommendation engines help users narrow down this variety by presenting relevant suggestions. In this talk, I will show how to create a simple Japanese content-based recommendation system in Python for blog posts.

Talk page: https://pycon.jp/2016/en/schedule/presentation/43/
Code: https://github.com/charles-vdulac/japanese-content-engine

Charles Vallantin Dulac

September 21, 2016

Transcript

  1. Building a Simple Japanese
    Content-Based
    Recommender System in Python
    Charles Vallantin Dulac

    @cvallantindulac

  2. PROBLEM
    How to create a simple Japanese content-based recommendation
    system in Python for articles?

  3. ABOUT ME
    • Playing with Python for ~10 years.
    • Currently a software engineer at Locarise, Tokyo.
    • Father for 1 week.

  4. LOCARISE
    Locarise collects data from several sources,
    processes and analyzes them to provide
    business intelligence support to the retail industry
    #Python, #DataScience,
    #Clojure, #Micro-services,
    #Django, #ReactJS

  5. AGENDA
    1. What are content-based recommendation engines?

    2. Concepts used in content-based recommendation engines

    3. Of course, a little bit of code ;)

  6. 1. What Are Content-based
    Recommendation
    Engines?

  7. 1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
    Recommender systems typically produce a list of
    recommendations in one of two ways:
    • Collaborative filtering approaches: we build a model from a
    user's past behavior.
    • Content-based filtering approaches: we use a series of discrete
    characteristics of the items.

  8. 1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
    How does it work?
    1. TRAIN MODEL: compute the discrete characteristics of all the
    documents (only one time).
    2. CURRENT DOCUMENT: compute the characteristics of the current
    document (each time a user clicks on a link).
    3. SIMILARITIES: compute the document similarities.
    4. SUGGESTIONS: return the closest documents.

  9. 2. What Are the Concepts Used in
    Content-based Recommendation
    Engines?

  10. 2. CONCEPTS
    2.1. We tokenize the text. It’s mostly:
    • Removing markup
    • Forcing UTF-8
    • Splitting words on spaces
    • Removing stop words

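The four tokenization steps above can be sketched with the standard library only. This is a minimal illustration for space-delimited text, not the code from the talk: the `tokenize` function, the regular expression and the tiny `STOP_WORDS` set are all assumptions made for the example.

```python
import html
import re
import string

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

# Naive tag matcher; a real HTML parser is more robust.
TAG_RE = re.compile(r"<[^>]+>")


def tokenize(raw):
    """Strip markup, force UTF-8, split on spaces, drop stop words."""
    if isinstance(raw, bytes):
        # Forcing UTF-8: undecodable bytes are replaced instead of raising.
        raw = raw.decode("utf-8", errors="replace")
    # Removing markup: drop tags, then decode entities such as &amp;.
    text = html.unescape(TAG_RE.sub(" ", raw))
    # Splitting on whitespace, then trimming surrounding punctuation.
    words = (w.strip(string.punctuation) for w in text.lower().split())
    # Removing stop words (and tokens that were only punctuation).
    return [w for w in words if w and w not in STOP_WORDS]
```

For example, `tokenize(b"<p>The cat is in the <b>garden</b>.</p>")` keeps only `cat` and `garden`.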

  11. 2. CONCEPTS
    But splitting on spaces doesn’t work with most East Asian languages…
    A solution is to run a morphological analyzer based on a dictionary of words with
    parts of speech, and find the best sequence of words in the dictionary that
    matches the input character sequence.

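To make the idea of "finding the best sequence of dictionary words" concrete, here is a toy segmenter that picks the cheapest split with dynamic programming. It is only a sketch of the principle and not how MeCab is implemented: the `DICTIONARY` entries, their costs and `UNKNOWN_COST` are invented for the example, and real analyzers also score transitions between parts of speech.

```python
# Hypothetical toy dictionary: surface form -> (part of speech, cost).
DICTIONARY = {
    "東京": ("noun", 1),
    "京都": ("noun", 1),
    "東京都": ("noun", 1),
    "東": ("noun", 2),
    "都": ("noun", 2),
    "に": ("particle", 1),
    "住む": ("verb", 1),
}
UNKNOWN_COST = 10  # penalty for a character missing from the dictionary


def segment(text):
    """Return the lowest-cost split of `text` into dictionary words."""
    # best[i] holds (total cost, tokens) for the prefix text[:i].
    best = [(0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(end):
            word = text[start:end]
            if word in DICTIONARY:
                cost = DICTIONARY[word][1]
            elif end - start == 1:
                # Unknown single characters are allowed, but expensive.
                cost = UNKNOWN_COST
            else:
                continue
            prev_cost, prev_tokens = best[start]
            candidates.append((prev_cost + cost, prev_tokens + [word]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]
```

With this dictionary, `segment("東京都に住む")` prefers the single entry `東京都` over `東京` followed by `都`, because the total cost is lower.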

  12. 2. CONCEPTS
    2.2. Vector Space Model (VSM)
    We create a feature vector for the document
    where each feature is a word (a term) and the
    feature's value is a term weight

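A minimal sketch of such a feature vector, assuming documents are already tokenized and using the raw term count as the term weight (the simplest possible choice). The helper names `build_vocabulary` and `to_vector` are made up for this example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct term to a fixed position in the vector."""
    terms = sorted({term for doc in documents for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_vector(tokens, vocabulary):
    """Dense feature vector: one slot per term, raw count as the weight."""
    counts = Counter(tokens)
    # The vocabulary dict iterates in insertion (here: sorted) order.
    return [counts[term] for term in vocabulary]


documents = [["cat", "sat", "cat"], ["dog", "sat"]]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]
```

Here the vocabulary is `cat`, `dog`, `sat`, so the first document becomes `[2, 0, 1]` and the second `[0, 1, 1]`.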

  13. 2. CONCEPTS
    2.3. VSM Transformations
    We project the vector to a different space:
    • To bring out hidden structure in the corpus, discover
    relationships between words and use them to describe the
    documents in a new and (hopefully) more semantic way.
    • To make the document representation more compact.
    This improves both efficiency (the new representation
    consumes fewer resources) and efficacy (marginal data
    trends are ignored, which reduces noise).

  14. 2. CONCEPTS
    TF-IDF (term frequency–inverse document frequency)
    This transformation is a numerical statistic that is
    intended to reflect how important a word is to a
    document in a collection or corpus.
    Intuitively:
    • If a word appears frequently in a document, it's
    important. Give the word a high score.
    • But if a word appears in many documents, it's not a
    unique identifier. Give the word a low score.

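The two intuitions above can be sketched in a few lines. This uses one common textbook variant (term frequency times the natural log of the inverse document frequency); libraries such as Gensim apply their own weighting and normalization defaults, so the numbers will differ. The `tf_idf` function name and the toy documents are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            # Frequent in this document -> high; common everywhere -> low.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


documents = [
    ["sushi", "tokyo", "sushi"],
    ["ramen", "tokyo"],
    ["tokyo", "tower"],
]
weights = tf_idf(documents)
```

In this toy corpus `tokyo` appears in every document, so its weight is zero, while `sushi` is frequent in the first document only and gets the highest weight there.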

  15. 2. CONCEPTS
    2.4. Computing the Similarities Between Documents
    This is commonly accomplished by comparing the angle
    between each document vector and the original query
    vector. The cosine of that angle is the cosine similarity;
    one minus it is the cosine distance.

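A small sketch of this comparison on sparse vectors stored as `{term: weight}` dicts. The function names are chosen for the example; treating an all-zero vector as having similarity 0 is a convention assumed here, not something stated in the talk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance counterpart: 0 for the same direction, 1 for orthogonal."""
    return 1.0 - cosine_similarity(a, b)
```

Two vectors pointing in the same direction, such as `{"x": 1, "y": 1}` and `{"x": 2, "y": 2}`, have a similarity of about 1 regardless of their length, which is why the measure suits documents of different sizes.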

  16. 2. CONCEPTS
    2.5. Returning the closest documents
    We sort the output by distance and return the first N
    items whose score is greater than a threshold.

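This last step can be sketched as follows, assuming the score is a similarity where higher means closer. The `closest_documents` name and the default values for `top_n` and `threshold` are illustrative choices, not values from the talk.

```python
def closest_documents(scores, top_n=3, threshold=0.2):
    """Return up to `top_n` (doc_id, score) pairs above `threshold`.

    `scores` maps a document id to its similarity with the current
    document; the best match comes first.
    """
    kept = [(doc_id, s) for doc_id, s in scores.items() if s > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7}
suggestions = closest_documents(scores, top_n=2)
```

Filtering before truncating means that fewer than N documents are returned when not enough of them pass the threshold, rather than padding the list with weak matches.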

  17. 3. Building a Japanese content-based
    recommendation engine

  18. 3. BUILDING A JAPANESE ENGINE
    To Help Us, We Are Going to Use:
    • Gensim: a Python library for topic modelling, document indexing and
    similarity retrieval with large corpora.
    • Mecab-ipadic-neologd: a dictionary that includes many neologisms (new words),
    which are extracted from many language resources on the Web.
    • Mecab-python3: a Python wrapper for MeCab.
    • The latest Japanese Wikipedia dump: a ~2.3 GB bz2 file.

  19. 3. BUILDING A JAPANESE ENGINE
    First script: make_corpus.py
    1. Parse the Japanese Wikipedia dump
    2. Tokenize all the documents and save tokens in a huge dictionary
    3. Compute TF-IDF for all tokens
    4. Save results

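The actual make_corpus.py lives in the linked repository and relies on Gensim and MeCab; the sketch below does not reproduce it. It only illustrates the shape of steps 2 and 4 with the standard library, on documents that are assumed to be tokenized already: build a token dictionary, store each document as sparse `[token id, count]` pairs, keep the document frequencies that a TF-IDF step would need, and save everything. The function names, the JSON layout and the toy `DOCUMENTS` are all assumptions.

```python
import json
import os
import tempfile
from collections import Counter

# Toy, pre-tokenized documents standing in for parsed Wikipedia pages.
DOCUMENTS = [["東京", "寿司", "東京"], ["京都", "寿司"]]


def build_corpus(documents):
    """Build a token dictionary, sparse bags of words and document frequencies."""
    token_ids = {}
    doc_freq = Counter()
    bows = []
    for tokens in documents:
        counts = Counter(tokens)
        for token in counts:
            token_ids.setdefault(token, len(token_ids))
            doc_freq[token] += 1
        # Sparse representation: only the terms present in the document.
        bows.append(sorted([token_ids[t], c] for t, c in counts.items()))
    return {"token_ids": token_ids, "doc_freq": dict(doc_freq), "bows": bows}


def save_corpus(corpus, path):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(corpus, handle, ensure_ascii=False)


def load_corpus(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage: write the toy corpus to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "corpus.json")
    save_corpus(build_corpus(DOCUMENTS), corpus_path)
    restored = load_corpus(corpus_path)
```

Saving the result matters because this step runs only once over the whole dump, while the lookup script runs on every request and should only have to load it.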

  20. 3. BUILDING A JAPANESE ENGINE

  21.

  22. 3. BUILDING A JAPANESE ENGINE

  23. 3. BUILDING A JAPANESE ENGINE

  24. 3. BUILDING A JAPANESE ENGINE
    Second script: content_engine.py
    1. Load the trained TF-IDF corpus
    2. For a given Wikipedia page title, retrieve the content and apply TF-IDF
    3. Compute similarities between the document and the trained corpus
    4. Return the most similar documents

  25.

  26. 3. BUILDING A JAPANESE ENGINE

  27. 3. BUILDING A JAPANESE ENGINE

  28. 3. BUILDING A JAPANESE ENGINE
    Code is available on GitHub:
    https://github.com/charles-vdulac/japanese-content-engine

  29. Demo (if I’m lucky and it works)

  30. Next steps

  31. NEXT STEPS
    We can improve the results by:
    - Applying other transformations like LDA, which extracts topics (topic
    modeling).
    - Using a taxonomy (are we talking about Apple or an apple?).
    We can download taxonomies or play with deep learning algorithms like
    Word2vec to find one automatically. Word2vec may find relationships like
    (Tokyo, Japan), and so extend the suggestions.

  32. Thanks!
    Questions?
