Building a Simple Japanese Content-Based Recommender System in Python
Charles Vallantin Dulac
@cvallantindulac
Slide 2
PROBLEM
How do we create a simple content-based recommendation system in Python for Japanese articles?
Slide 3
ABOUT ME
• Playing with Python for ~10 years.
• Currently a software engineer at Locarise, Tokyo.
• A father for one week.
Slide 4
LOCARISE
Locarise collects data from several sources, then processes and analyzes them to provide business intelligence support to the retail industry.
#Python, #DataScience, #Clojure, #Micro-services, #Django, #ReactJS
Slide 5
AGENDA
1. What are content-based recommendation engines?
2. Concepts used in content-based recommendation engines
3. Of course, a little bit of code ;)
Slide 6
1. What Are Content-based Recommendation Engines?
Slide 7
1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
Recommender systems typically produce a list of recommendations in one of two ways:
• Collaborative filtering approaches: we build a model from a user's past behavior.
• Content-based filtering approaches: we utilize a series of discrete characteristics of the items.
Slide 8
1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
How does it work?
1. TRAIN MODEL: compute the discrete characteristics of all the documents (only one time).
2. CURRENT DOCUMENT: compute the characteristics of the current document (each time a user clicks on a link).
3. SIMILARITIES: compute the similarities between documents.
4. SUGGESTIONS: return the closest documents.
Slide 9
2. What Are the Concepts Used in Content-based Recommendation Engines?
Slide 10
2. CONCEPTS
2.1. We tokenize the text. It’s mostly:
• Removing markup
• Forcing UTF-8
• Splitting words on spaces
• Removing stop words
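For a space-delimited language, these four steps fit in a few lines of Python. A minimal sketch (the stop-word list below is a toy one; real lists are much longer):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # toy list

def tokenize(raw: bytes) -> list[str]:
    text = raw.decode("utf-8", errors="replace")       # force UTF-8
    text = re.sub(r"<[^>]+>", " ", text)               # remove markup
    tokens = text.lower().split()                      # split words on spaces
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(tokenize(b"<p>The cat sat on the mat</p>"))
# ['cat', 'sat', 'on', 'mat']
```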
Slide 11
2. CONCEPTS
But splitting on spaces doesn’t work with most East Asian languages…
A solution is to run a morphological analyzer based on a dictionary of words with parts of speech, and find the best sequence of words in the dictionary that matches the input character sequence.
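For Japanese, that analyzer is MeCab. A minimal sketch with mecab-python3 in wakati (word-splitting) mode; add -d <path> to the Tagger options to use a custom dictionary such as mecab-ipadic-neologd:

```python
import MeCab

# "wakati" mode returns the input split into space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

print(tagger.parse("今日は良い天気です").split())
# ['今日', 'は', '良い', '天気', 'です']
```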
Slide 12
2. CONCEPTS
2.2. Vector Space Model (VSM)
We create a feature vector for the document, where each feature is a word (a term) and the feature's value is a term weight.
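With gensim, this is a Dictionary plus doc2bow. A minimal sketch on two toy tokenized documents, using the simplest possible weight, the raw term count:

```python
from gensim import corpora

docs = [["東京", "python", "勉強会"],
        ["大阪", "python", "初心者"]]

dictionary = corpora.Dictionary(docs)  # maps each term to an integer id
vector = dictionary.doc2bow(docs[0])   # [(term_id, term_weight), ...]
print(vector)                          # weights here are raw counts
```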
Slide 13
2. CONCEPTS
2.3. VSM Transformations
We project the vector to a different space:
• To bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and (hopefully) more semantic way.
• To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored: noise reduction).
Slide 14
2. CONCEPTS
TF-IDF (term frequency–inverse document frequency)
This transformation is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Intuitively:
• If a word appears frequently in a document, it's important. Give the word a high score.
• But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
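Both intuitions at work with gensim's TfidfModel, on the same toy documents. Note that with gensim's default weighting, a term appearing in every document gets an IDF of zero and disappears from the output:

```python
from gensim import corpora, models

docs = [["東京", "python", "勉強会"],
        ["大阪", "python", "初心者"]]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

tfidf = models.TfidfModel(bow_corpus)  # learns IDF weights from the corpus
print(tfidf[bow_corpus[0]])
# "python" occurs in every document, so its weight drops to 0 and gensim
# drops it; the rarer terms keep a high weight.
```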
Slide 15
2. CONCEPTS
2.4. Computing the Similarities Between Documents
This is commonly accomplished by comparing the deviation of angles between each document vector and the original query vector. This is the cosine distance.
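A minimal sketch of that computation on sparse {term_id: weight} vectors; the cosine distance is simply 1 minus this similarity:

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity({0: 0.5, 1: 0.8}, {0: 0.4, 2: 0.9}))  # ~0.215
```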
Slide 16
2. CONCEPTS
2.5. Returning the Closest Documents
We sort the output by similarity and return the first N items whose score is greater than a threshold.
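A minimal sketch; N and the threshold are illustrative tuning knobs:

```python
def top_matches(scores, n=10, threshold=0.2):
    # `scores` is a list of (document_id, similarity) pairs.
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(doc, s) for doc, s in ranked[:n] if s > threshold]

print(top_matches([("a", 0.9), ("b", 0.1), ("c", 0.4)], n=2))
# [('a', 0.9), ('c', 0.4)]
```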
Slide 17
3. Building a Japanese Content-based Recommendation Engine
Slide 18
3. BUILDING A JAPANESE ENGINE
To Help Us, We Are Going to Use:
Gensim. A Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Mecab-ipadic-neologd. This dictionary includes many neologisms (new words), which are extracted from many language resources on the Web.
Mecab-python3. A Python wrapper for MeCab.
The latest Japanese Wikipedia dump. A ~2.3 GB bz2 file.
Slide 19
3. BUILDING A JAPANESE ENGINE
First script: make_corpus.py
1. Parse the Japanese Wikipedia dump
2. Tokenize all the documents and save the tokens in a huge dictionary
3. Compute TF-IDF for all the tokens
4. Save the results
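A condensed sketch of those four steps, assuming a recent gensim (for the tokenizer_func hook) and illustrative file names; the author's real code is in the GitHub repository linked on slide 28:

```python
# make_corpus.py (sketch)
import MeCab
from gensim import corpora, models
from gensim.corpora import WikiCorpus

tagger = MeCab.Tagger("-Owakati")  # add -d <path> for mecab-ipadic-neologd

def mecab_tokenize(content, token_min_len=2, token_max_len=15, lower=True):
    # Signature expected by WikiCorpus's tokenizer_func hook.
    return tagger.parse(content).split()

# 1 + 2: parse the dump, tokenize every article with MeCab, and collect
# the tokens into one huge gensim Dictionary.
wiki = WikiCorpus("jawiki-latest-pages-articles.xml.bz2",
                  tokenizer_func=mecab_tokenize)
wiki.dictionary.save("jawiki.dict")

# Stream the bag-of-words vectors to disk, then 3: compute TF-IDF over them.
corpora.MmCorpus.serialize("jawiki_bow.mm", wiki)
bow = corpora.MmCorpus("jawiki_bow.mm")
tfidf = models.TfidfModel(bow)

# 4: save the trained model for content_engine.py.
tfidf.save("jawiki.tfidf")
```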
Slides 20-23
3. BUILDING A JAPANESE ENGINE
(code screenshots of make_corpus.py; see the repository linked on slide 28)
Slide 24
3. BUILDING A JAPANESE ENGINE
Second script: content_engine.py
1. Load the trained TF-IDF corpus
2. For a given Wikipedia page title, retrieve the content and apply TF-IDF
3. Compute similarities between the document and the trained corpus
4. Return the most similar documents
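A condensed sketch of the second script; the wikipedia package used to fetch the page content is an assumption, and the file names must match the ones saved by make_corpus.py:

```python
# content_engine.py (sketch)
import MeCab
import wikipedia  # assumed fetch method, not necessarily the author's
from gensim import corpora, models, similarities

tagger = MeCab.Tagger("-Owakati")

# 1: load the dictionary, TF-IDF model, and corpus trained by make_corpus.py.
dictionary = corpora.Dictionary.load("jawiki.dict")
tfidf = models.TfidfModel.load("jawiki.tfidf")
bow = corpora.MmCorpus("jawiki_bow.mm")
index = similarities.Similarity("./sim_index", tfidf[bow],
                                num_features=len(dictionary))

# 2: retrieve the content of a given Wikipedia page and apply TF-IDF.
wikipedia.set_lang("ja")
content = wikipedia.page("東京").content
query = tfidf[dictionary.doc2bow(tagger.parse(content).split())]

# 3: compute similarities between this document and the trained corpus.
scores = index[query]

# 4: return the most similar documents.
print(sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:10])
```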
Slides 25-27
3. BUILDING A JAPANESE ENGINE
(code screenshots of content_engine.py; see the repository linked on slide 28)
Slide 28
3. BUILDING A JAPANESE ENGINE
The code is available on GitHub:
https://github.com/charles-vdulac/japanese-content-engine
Slide 29
Demo (if I’m lucky and it works)
Slide 30
Next steps
Slide 31
NEXT STEPS
We can improve the results by:
• Applying other transformations like LDA, which extracts topics (topic modeling).
• Using a taxonomy (are we talking about Apple or an apple?). We can download taxonomies, or play with deep learning algorithms like Word2vec to find one automatically. Word2vec may find relationships like (Tokyo, Japan), and thus extend the suggestions.
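For the first item, a minimal LDA sketch reusing the files saved by make_corpus.py (same assumed file names as above; num_topics is illustrative):

```python
from gensim import corpora, models

dictionary = corpora.Dictionary.load("jawiki.dict")
bow = corpora.MmCorpus("jawiki_bow.mm")

lda = models.LdaModel(bow, id2word=dictionary, num_topics=100)
print(lda.print_topics(5))  # the 5 most prominent topics
print(lda[bow[0]])          # the topic mixture of the first article
```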