Building a Simple Japanese Content-Based Recommender System in Python
Charles Vallantin Dulac
@cvallantindulac
Slide 2
PROBLEM
How do we create a simple content-based recommendation system in Python for Japanese articles?
Slide 3
ABOUT ME
• Playing with Python for ~10 years.
• Currently a software engineer at Locarise, Tokyo.
• A father for one week.
Slide 4
LOCARISE
Locarise collects data from several sources, then processes and analyzes them to provide business intelligence support to the retail industry.
#Python, #DataScience, #Clojure, #Micro-services, #Django, #ReactJS
Slide 5
AGENDA
1. What are content-based recommendation engines?
2. Concepts used in content-based recommendation engines
3. Of course, a little bit of code ;)
Slide 6
1. What Are Content-based Recommendation Engines?
Slide 7
1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
Recommender systems typically produce a list of recommendations in one of two ways:
• Collaborative filtering approaches: we build a model from a user's past behavior.
• Content-based filtering approaches: we utilize a series of discrete characteristics of the items.
Slide 8
1. WHAT ARE CONTENT-BASED RECOMMENDATION ENGINES?
How does it work?
1. TRAIN MODEL: compute the discrete characteristics of all the documents (only one time).
2. CURRENT DOCUMENT: compute the characteristics of the current document (each time a user clicks on a link).
3. SIMILARITIES: compute the similarities between documents.
4. SUGGESTIONS: return the closest documents.
Slide 9
2. What Are the Concepts Used in Content-based Recommendation Engines?
Slide 10
2. CONCEPTS
2.1. We tokenize the text. It’s mostly:
• Removing markup
• Forcing UTF-8
• Splitting words on spaces
• Removing stop words
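For a space-delimited language, these four steps fit in a few lines of Python. A minimal sketch (the stop-word list below is a toy one; real lists are much longer):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # toy list

def tokenize(raw: bytes) -> list[str]:
    text = raw.decode("utf-8", errors="replace")       # force UTF-8
    text = re.sub(r"<[^>]+>", " ", text)               # remove markup
    tokens = text.lower().split()                      # split words on spaces
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(tokenize(b"<p>The cat sat on the mat</p>"))
# ['cat', 'sat', 'on', 'mat']
```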
Slide 11
2. CONCEPTS
But splitting on spaces doesn’t work with most East Asian languages…
A solution is to run a morphological analyzer based on a dictionary of words with parts of speech, and find the best sequence of words in the dictionary that matches the input character sequence.
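For Japanese, that analyzer is MeCab. A minimal sketch with mecab-python3 in wakati (word-splitting) mode; add -d <path> to the Tagger options to use a custom dictionary such as mecab-ipadic-neologd:

```python
import MeCab

# "wakati" mode returns the input split into space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

print(tagger.parse("今日は良い天気です").split())
# ['今日', 'は', '良い', '天気', 'です']
```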
Slide 12
2. CONCEPTS
2.2. Vector Space Model (VSM)
We create a feature vector for the document, where each feature is a word (a term) and the feature's value is a term weight.
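With gensim, this is a Dictionary plus doc2bow. A minimal sketch on two toy tokenized documents, using the simplest possible weight, the raw term count:

```python
from gensim import corpora

docs = [["東京", "python", "勉強会"],
        ["大阪", "python", "初心者"]]

dictionary = corpora.Dictionary(docs)  # maps each term to an integer id
vector = dictionary.doc2bow(docs[0])   # [(term_id, term_weight), ...]
print(vector)                          # weights here are raw counts
```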
Slide 13
2. CONCEPTS
2.3. VSM Transformations
We project the vector to a different space:
• To bring out hidden structure in the corpus, discover relationships between words, and use them to describe the documents in a new and (hopefully) more semantic way.
• To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored: noise reduction).
Slide 14
2. CONCEPTS
TF-IDF (term frequency–inverse document frequency)
This transformation is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
Intuitively:
• If a word appears frequently in a document, it's important. Give the word a high score.
• But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
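Both intuitions at work with gensim's TfidfModel, on the same toy documents. Note that with gensim's default weighting, a term appearing in every document gets an IDF of zero and disappears from the output:

```python
from gensim import corpora, models

docs = [["東京", "python", "勉強会"],
        ["大阪", "python", "初心者"]]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

tfidf = models.TfidfModel(bow_corpus)  # learns IDF weights from the corpus
print(tfidf[bow_corpus[0]])
# "python" occurs in every document, so its weight drops to 0 and gensim
# drops it; the rarer terms keep a high weight.
```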
Slide 15
2. CONCEPTS
2.4. Computing the Similarities Between Documents
This is commonly accomplished by comparing the deviation of angles between each document vector and the original query vector. This is the cosine distance.
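A minimal sketch of that computation on sparse {term_id: weight} vectors; the cosine distance is simply 1 minus this similarity:

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity({0: 0.5, 1: 0.8}, {0: 0.4, 2: 0.9}))  # ~0.215
```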
Slide 16
2. CONCEPTS
2.5. Returning the Closest Documents
We sort the output by similarity and return the first N items whose score is greater than a threshold.
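A minimal sketch; N and the threshold are illustrative tuning knobs:

```python
def top_matches(scores, n=10, threshold=0.2):
    # `scores` is a list of (document_id, similarity) pairs.
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(doc, s) for doc, s in ranked[:n] if s > threshold]

print(top_matches([("a", 0.9), ("b", 0.1), ("c", 0.4)], n=2))
# [('a', 0.9), ('c', 0.4)]
```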
Slide 17
3. Building a Japanese Content-based Recommendation Engine
Slide 18
3. BUILDING A JAPANESE ENGINE
To Help Us, We Are Going to Use:
Gensim. A Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Mecab-ipadic-neologd. This dictionary includes many neologisms (new words), which are extracted from many language resources on the Web.
Mecab-python3. A Python wrapper for MeCab.
The latest Japanese Wikipedia dump. A ~2.3 GB bz2 file.
Slide 19
3. BUILDING A JAPANESE ENGINE
First script: make_corpus.py
1. Parse the Japanese Wikipedia dump
2. Tokenize all the documents and save the tokens in a huge dictionary
3. Compute TF-IDF for all the tokens
4. Save the results
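A condensed sketch of those four steps, assuming a recent gensim (for the tokenizer_func hook) and illustrative file names; the author's real code is in the GitHub repository linked on slide 28:

```python
# make_corpus.py (sketch)
import MeCab
from gensim import corpora, models
from gensim.corpora import WikiCorpus

tagger = MeCab.Tagger("-Owakati")  # add -d <path> for mecab-ipadic-neologd

def mecab_tokenize(content, token_min_len=2, token_max_len=15, lower=True):
    # Signature expected by WikiCorpus's tokenizer_func hook.
    return tagger.parse(content).split()

# 1 + 2: parse the dump, tokenize every article with MeCab, and collect
# the tokens into one huge gensim Dictionary.
wiki = WikiCorpus("jawiki-latest-pages-articles.xml.bz2",
                  tokenizer_func=mecab_tokenize)
wiki.dictionary.save("jawiki.dict")

# Stream the bag-of-words vectors to disk, then 3: compute TF-IDF over them.
corpora.MmCorpus.serialize("jawiki_bow.mm", wiki)
bow = corpora.MmCorpus("jawiki_bow.mm")
tfidf = models.TfidfModel(bow)

# 4: save the trained model for content_engine.py.
tfidf.save("jawiki.tfidf")
```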
Slides 20-23
3. BUILDING A JAPANESE ENGINE
(code screenshots of make_corpus.py; see the repository linked on slide 28)
Slide 24
3. BUILDING A JAPANESE ENGINE
Second script: content_engine.py
1. Load the trained TF-IDF corpus
2. For a given Wikipedia page title, retrieve the content and apply TF-IDF
3. Compute similarities between the document and the trained corpus
4. Return the most similar documents
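A condensed sketch of the second script; the wikipedia package used to fetch the page content is an assumption, and the file names must match the ones saved by make_corpus.py:

```python
# content_engine.py (sketch)
import MeCab
import wikipedia  # assumed fetch method, not necessarily the author's
from gensim import corpora, models, similarities

tagger = MeCab.Tagger("-Owakati")

# 1: load the dictionary, TF-IDF model, and corpus trained by make_corpus.py.
dictionary = corpora.Dictionary.load("jawiki.dict")
tfidf = models.TfidfModel.load("jawiki.tfidf")
bow = corpora.MmCorpus("jawiki_bow.mm")
index = similarities.Similarity("./sim_index", tfidf[bow],
                                num_features=len(dictionary))

# 2: retrieve the content of a given Wikipedia page and apply TF-IDF.
wikipedia.set_lang("ja")
content = wikipedia.page("東京").content
query = tfidf[dictionary.doc2bow(tagger.parse(content).split())]

# 3: compute similarities between this document and the trained corpus.
scores = index[query]

# 4: return the most similar documents.
print(sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)[:10])
```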
Slides 25-27
3. BUILDING A JAPANESE ENGINE
(code screenshots of content_engine.py; see the repository linked on slide 28)
Slide 28
3. BUILDING A JAPANESE ENGINE
The code is available on GitHub:
https://github.com/charles-vdulac/japanese-content-engine
Slide 29
Demo (if I’m lucky and it works)
Slide 30
Next steps
Slide 31
NEXT STEPS
We can improve the results by:
• Applying other transformations like LDA, which extracts topics (topic modeling).
• Using a taxonomy (are we talking about Apple or an apple?). We can download taxonomies, or play with deep learning algorithms like Word2vec to find one automatically. Word2vec may find relationships like (Tokyo, Japan), and thus extend the suggestions.
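For the first item, a minimal LDA sketch reusing the files saved by make_corpus.py (same assumed file names as above; num_topics is illustrative):

```python
from gensim import corpora, models

dictionary = corpora.Dictionary.load("jawiki.dict")
bow = corpora.MmCorpus("jawiki_bow.mm")

lda = models.LdaModel(bow, id2word=dictionary, num_topics=100)
print(lda.print_topics(5))  # the 5 most prominent topics
print(lda[bow[0]])          # the topic mixture of the first article
```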