

Charles VDulac
September 21, 2016

Building a Simple Japanese Content-Based Recommender System in Python

Talk given at PyCon JP 2016 on the 21st of September, 2016.

Online stores such as Amazon, as well as news and blog websites, suffer from information overload: customers can easily get lost among millions of products or articles. Recommendation engines help users narrow down this variety by presenting relevant suggestions. In this talk, I will show how to create a simple Japanese content-based recommendation system in Python for blog posts.

Talk page: https://pycon.jp/2016/en/schedule/presentation/43/
Code: https://github.com/charles-vdulac/japanese-content-engine


Transcript

  1. ABOUT ME
     • Playing with Python for ~10 years.
     • Currently a software engineer at Locarise, Tokyo.
     • Father for 1 week.
  2. LOCARISE
     Locarise collects data from several sources, then processes and analyzes it to provide business-intelligence support to the retail industry.
     #Python, #DataScience, #Clojure, #Micro-services, #Django, #ReactJS
  3. AGENDA
     1. What are content-based recommendation engines?
     2. Concepts used in content-based recommendation engines
     3. Of course, a little bit of code ;)
  4. 1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
     Recommender systems typically produce a list of recommendations in one of two ways:
     • Collaborative filtering approaches: we build a model from a user's past behavior.
     • Content-based filtering approaches: we utilize a series of discrete characteristics of the items.
  5. 1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
     How does it work?
     • TRAIN MODEL: compute the discrete characteristics of all the documents (only one time).
     • CURRENT DOCUMENT: compute the current document's characteristics (each time a user clicks on a link).
     • SIMILARITIES: compute the similarities between documents.
     • SUGGESTIONS: return the closest documents.
  6. 2. CONCEPTS
     2.1. We tokenize the text. It's mostly:
     • Removing markup
     • Forcing UTF-8
     • Splitting words on `spaces`
     • Removing stop words
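A minimal sketch of this naive pipeline (the stop-word list and helper name are illustrative, not from the talk):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "on"}  # hypothetical stop-word list

def naive_tokenize(raw):
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")    # force UTF-8
    text = re.sub(r"<[^>]+>", " ", raw)                # remove markup
    tokens = text.lower().split()                      # split words on spaces
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(naive_tokenize(b"<p>The cat is on the mat</p>"))  # ['cat', 'mat']
```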
  7. 2. CONCEPTS
     But splitting on spaces doesn't work with most East Asian languages…
     A solution is to run a morphological analyzer based on a dictionary of words with parts of speech, and find the best sequence of words in the dictionary that matches the input character sequence.
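A sketch of that idea with mecab-python3 (introduced properly in section 3). The dictionary path is an assumption; adjust it to wherever mecab-ipadic-neologd is installed on your system:

```python
import MeCab

NEOLOGD = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd"  # assumed install path
tagger = MeCab.Tagger("-Owakati -d " + NEOLOGD)

def tokenize_ja(text):
    # -Owakati asks MeCab to output the input split into words by spaces
    return tagger.parse(text).split()

print(tokenize_ja("日本語のテキストを分かち書きします。"))
```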
  8. 2. CONCEPTS
     2.2. Vector Space Model (VSM)
     We create a feature vector for the document, where each feature is a word (a term) and the feature's value is a term weight.
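A minimal, self-contained sketch of this with Gensim (also introduced in section 3); here the term weight is still a raw count:

```python
from gensim import corpora

tokenized_docs = [                      # toy corpus of pre-tokenized documents
    ["recommendation", "engine", "python"],
    ["python", "tokenizer", "japanese", "python"],
]

dictionary = corpora.Dictionary(tokenized_docs)              # term -> integer id
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
print(corpus[1])  # sparse feature vector: [(term_id, count), ...]
```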
  9. 2. CONCEPTS
     2.3. VSM Transformations
     We project the vector to a different space:
     • To bring out hidden structure in the corpus: discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
     • To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored; noise reduction).
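The talk uses TF-IDF (next slide), but as one illustration of such a projection, here is Gensim's Latent Semantic Indexing applied to the toy `corpus` and `dictionary` built above:

```python
from gensim.models import LsiModel

# Map each document into a small latent-topic space (2 topics for the toy data)
lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)
print(lsi[corpus[0]])  # the document as weights over 2 latent dimensions
```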
  10. 2. CONCEPTS
     TF-IDF (term frequency–inverse document frequency)
     This transformation is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Intuitively:
     • If a word appears frequently in a document, it's important. Give the word a high score.
     • But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
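With Gensim, the transformation is learned from the corpus and then applied to each bag-of-words vector (continuing the toy example above):

```python
from gensim.models import TfidfModel

tfidf = TfidfModel(corpus)     # learn IDF statistics from the whole corpus
tfidf_corpus = tfidf[corpus]   # re-weight every document
print(list(tfidf_corpus)[0])   # [(term_id, tfidf_weight), ...]
```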
  11. 2. CONCEPTS
     2.4. Computing the Similarities Between Documents
     This is commonly accomplished by comparing the deviation of angles between each document vector and the original query vector. This is the cosine distance.
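Gensim's similarity index implements exactly this cosine measure. A sketch, continuing from the TF-IDF example:

```python
from gensim import similarities

# cos(a, b) = (a . b) / (|a| * |b|): 1.0 means same direction, 0.0 unrelated
index = similarities.MatrixSimilarity(tfidf_corpus, num_features=len(dictionary))
query = tfidf[dictionary.doc2bow(["python", "engine"])]
sims = index[query]            # cosine similarity to every document
print(list(enumerate(sims)))   # [(doc_id, score), ...]
```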
  12. 2. CONCEPTS
     2.5. Returning the Closest Documents
     We sort the output by distance and return the first N items whose score is greater than a threshold.
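Continuing the sketch, with an illustrative threshold and N:

```python
THRESHOLD = 0.2   # hypothetical cut-off
TOP_N = 10

ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
suggestions = [(doc_id, score) for doc_id, score in ranked
               if score > THRESHOLD][:TOP_N]
print(suggestions)
```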
  13. 3. BUILDING A JAPANESE ENGINE
     To help us, we are going to use:
     • Gensim: a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
     • mecab-ipadic-neologd: this dictionary includes many neologisms (new words), which are extracted from many language resources on the Web.
     • mecab-python3: a Python wrapper for MeCab.
     • The latest Japanese Wikipedia dump (a ~2.3 GB bz2 file).
  14. 3. BUILDING A JAPANESE ENGINE
     First script: make_corpus.py
     1. Parse the Japanese Wikipedia dump
     2. Tokenize all the documents and save the tokens in a huge dictionary
     3. Compute TF-IDF for all tokens
     4. Save the results
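A condensed sketch of that pipeline (the generator `iter_wiki_articles` and the file names are assumptions; see make_corpus.py in the repo for the real code). It reuses `tokenize_ja` from the MeCab sketch above:

```python
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import TfidfModel

def tokenized_articles():
    # 1. + 2. Parse the dump and tokenize every document.
    # `iter_wiki_articles` is a hypothetical generator yielding article texts.
    for text in iter_wiki_articles("jawiki-latest-pages-articles.xml.bz2"):
        yield tokenize_ja(text)

dictionary = Dictionary(tokenized_articles())   # 2. huge token dictionary
tfidf = TfidfModel(dictionary.doc2bow(doc)      # 3. TF-IDF for all tokens
                   for doc in tokenized_articles())

# 4. Save the results for the query-time script
dictionary.save("wiki.dict")
tfidf.save("wiki.tfidf")
MmCorpus.serialize("wiki_tfidf.mm",
                   (tfidf[dictionary.doc2bow(doc)] for doc in tokenized_articles()))
```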
  15. 3. BUILDING A JAPANESE ENGINE
     Second script: content_engine.py
     1. Load the trained TF-IDF corpus
     2. For a given Wikipedia page title, retrieve the content and apply TF-IDF
     3. Compute similarities between the document and the trained corpus
     4. Return the most similar documents
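And a condensed sketch of the query side (again, the file names and the `fetch_wikipedia_page` helper are assumptions; the real code is content_engine.py in the repo):

```python
from gensim import similarities
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import TfidfModel

# 1. Load the trained TF-IDF corpus
dictionary = Dictionary.load("wiki.dict")
tfidf = TfidfModel.load("wiki.tfidf")
index = similarities.Similarity("./sim_shard", MmCorpus("wiki_tfidf.mm"),
                                num_features=len(dictionary))

# 2. Retrieve the page content for a given title and apply TF-IDF
text = fetch_wikipedia_page("富士山")        # hypothetical helper
doc_vec = tfidf[dictionary.doc2bow(tokenize_ja(text))]

# 3. + 4. Compute similarities and return the most similar documents
sims = sorted(enumerate(index[doc_vec]), key=lambda p: p[1], reverse=True)
print(sims[:10])
```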
  16. 3. BUILDING A JAPANESE ENGINE
     Code is available on GitHub: https://github.com/charles-vdulac/japanese-content-engine
  17. NEXT STEPS
     We can improve the results by:
     • Applying other transformations like LDA, which extracts topics (topic modeling).
     • Using a taxonomy (are we talking about Apple or an apple?). We can download taxonomies, or play with deep learning algorithms like Word2vec to find one automatically. Word2vec may find relationships like (Tokyo, Japan), and thus extend the suggestions.
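For instance, a tiny Word2vec sketch with Gensim (hyperparameters are illustrative; `vector_size` is the gensim 4.x name, older releases call it `size`):

```python
from gensim.models import Word2Vec

# `tokenized_docs` is a list of token lists, e.g. the toy corpus from earlier
model = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, min_count=1)
print(model.wv.most_similar("python", topn=2))  # nearby words to extend suggestions
```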