Treasure Data
Summer Internship 2017
Final Report
Kento NOZAWA (@nzw0301)
Sep. 29, 2017
Slide 2
Who am I?
• Kento NOZAWA (@nzw0301)
• Master's student at the University of Tsukuba
• I will be a Ph.D. student next year
• Research: unsupervised machine learning
• Graph data and topic models
• OSS contributions: Keras and documentation translation
• Keras is a very popular deep learning framework
Slide 3
What I did: Adding New UDFs to Hivemall
1. F-Measure: an evaluation metric for classification models
• An easy first task
• To learn Hive and Hivemall
• Merged
2. SLIM: a fast recommendation algorithm
• The hardest task for me…
• Merged
3. word2vec: an unsupervised word feature learning algorithm
• A challenging task
• Under review
Slide 4
1. F-Measure
Slide 5
Background: Binary Classification Problem
• Predict a label for each data point, given two categories
• e.g., positive/negative book reviews, …
• Train an ML model on a labeled dataset
(figure: labeled dataset → train ML model)
Slide 6
Background: Binary Classification Problem
• Predict a label for each data point, given two categories
• e.g., user gender, positive/negative book reviews, …
• The prediction model is trained on a labeled dataset
(figure: labeled dataset → train ML model)
Which ML model is better?
How do we choose better parameters for an ML model?
Slide 7
F-Measure
• Widely used evaluation metric for classification models
• A higher F-measure indicates a better model
F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}

(table: confusion matrix of truth labels vs. predicted labels)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Slide 8
F-Measure
• Widely used evaluation metric for classification models
• A higher F-measure indicates a better model
• When β = 1, it is called the F1-score/F-score
• Hivemall already supported the F1-score
F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}
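For concreteness, here is a minimal Java sketch of the formula above (an illustration only, not Hivemall's actual implementation; the class and method names are made up):

```java
public final class FBeta {
    /** F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). */
    static double fbeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        double denom = b2 * precision + recall;
        return denom == 0.d ? 0.d : (1.d + b2) * precision * recall / denom;
    }

    public static void main(String[] args) {
        System.out.println(fbeta(0.8, 0.5, 1.0)); // F1-score: 2 * 0.8 * 0.5 / 1.3 ~= 0.615
    }
}
```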
Slide 9
My Tasks
• Users can pass β as an argument in the query
• Two averaging methods for binary classification
• Micro average
• Binary average
• Support multi-label classification
Slide 10
Usage of fmeasure for Binary Classification
set β & average
Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/binary_classification_measures.html
Slide 11
Usage of fmeasure for Multi-label Classification
Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/multilabel_classification_measures.html
Slide 12
2. SLIM
Slide 13
Background: Recommendation
• Suggest some items to the user
• If the user buys one of them, they will be satisfied with it
Slide 14
Background: Recommendation
• Suggest some items to the user
• If the user buys one of them, they will be satisfied with it
• Recommendation is based on the user's purchase history
Slide 15
Top-N Recommendation
• Predict the top-N items per user based on their scores
• Each predicted item has a score
• e.g., a future rating or #stars
(figure: dataset → train ML model → top-3 book recommendation with scores 4.7, 4.6, 4.2)
Slide 16
About SLIM
• Sparse LInear Method
• Fast top-N recommendation algorithm
Xia Ning and George Karypis. SLIM: Sparse Linear Methods for Top-N Recommender Systems.
In ICDM, 2011.
\min_{w_j} \; \frac{1}{2} \| a_j - A w_j \|_2^2 + \frac{\beta}{2} \| w_j \|_2^2 + \lambda \| w_j \|_1
\quad \text{subject to} \quad W \ge 0, \; \mathrm{diag}(W) = 0

(figure: A × W ≒ A′, where A is the users × items rating matrix and W is the sparse item × item weight matrix)
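Each column w_j is solved independently, which is what makes training parallelizable (next slide). As a hedged illustration, the standard elastic-net coordinate-descent update for this objective could look like the Java sketch below; this is not Hivemall's actual code, and a real implementation would precompute column norms and only update the top-k similar items.

```java
public final class SlimCoordinateDescent {
    /**
     * One coordinate-descent update of w_j[k] for
     *   min_{w_j >= 0, w_j[j] = 0} 1/2 ||a_j - A w_j||^2 + beta/2 ||w_j||^2 + lambda ||w_j||_1.
     * A is the (users x items) rating matrix; illustrative dense version only.
     */
    static void updateCoordinate(double[][] A, double[] wj, int j, int k,
                                 double beta, double lambda) {
        if (k == j) { wj[k] = 0.d; return; } // enforce diag(W) = 0
        double dot = 0.d, norm = 0.d;
        for (double[] row : A) {
            double residual = row[j]; // partial residual excluding item k's contribution
            for (int i = 0; i < wj.length; i++) {
                if (i != k) residual -= row[i] * wj[i];
            }
            dot += row[k] * residual;
            norm += row[k] * row[k];
        }
        // soft-thresholding for the L1 term, clipped at 0 for the nonnegativity constraint
        wj[k] = Math.max(0.d, (dot - lambda) / (norm + beta));
    }
}
```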
Slide 17
Why is SLIM Fast?
1. Training
• Training is parallelized per column w_j using coordinate descent
• Approximates the matrix product AW
• Only uses the top-k similar items per item
2. Prediction
• Approximates the matrix product AW
• The weight matrix W is sparse
Slide 18
Explanation of the train_slim Function: i and j
• i and j are item indices
• j is one of the top-k similar items of item i
(figure: the i-th book)
Slide 19
Explanation of the train_slim Function: r_i and r_j
• r_i is a map storing all users' ratings of item i
• key: user id
• value: rating
• r_j is the same, for item j
(figure: the i-th book, rated ×5, ×5, ×2 by three users)
Slide 20
Explanation of the train_slim Function: knn_i
• knn_i is a map of the top-k items most similar to item i, with their ratings (see the data-layout sketch below)
• A larger k gives better recommendations, but memory usage and training time increase
(figure: the k books most similar to the i-th book, with ratings ×5, ×5, ×4)
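For illustration, here is the argument layout expressed with Java collection types; the variable names mirror the slide, while the concrete ids and ratings are made up:

```java
import java.util.HashMap;
import java.util.Map;

public final class TrainSlimArgs {
    public static void main(String[] args) {
        int i = 10;                                  // index of the target item i
        Map<Integer, Double> r_i = new HashMap<>();  // user id -> rating of item i
        r_i.put(1, 5.0); r_i.put(2, 5.0); r_i.put(3, 2.0);

        // knn_i: for each of the top-k items most similar to i, its own rating map
        Map<Integer, Map<Integer, Double>> knn_i = new HashMap<>();
        knn_i.put(42, Map.of(1, 5.0, 3, 4.0));       // item 42 is one of i's top-k neighbors

        int j = 42;                                  // one of the top-k similar items of i
        Map<Integer, Double> r_j = knn_i.get(j);     // user id -> rating of item j
        System.out.println("i=" + i + " r_i=" + r_i + " j=" + j + " r_j=" + r_j);
    }
}
```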
Slide 21
Prediction
• Uses only HiveQL
• Inputs: known ratings and train_slim's output
• Output value: the future rating of itemid by userid
• Computed as a matrix product (see the sketch below)
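Conceptually, each output value is one entry of the matrix product AW, and the sparsity of W keeps it cheap. A hedged Java sketch of a single entry (the actual computation is done in HiveQL, as the slide says; names and numbers here are illustrative):

```java
import java.util.Map;

public final class SlimPredict {
    /** Predicted rating of item j by a user: sum_i rating(user, i) * W[i][j]. */
    static double predict(Map<Integer, Double> userRatings, Map<Integer, Double> columnJofW) {
        double score = 0.d;
        for (Map.Entry<Integer, Double> e : userRatings.entrySet()) {
            Double w = columnJofW.get(e.getKey()); // nonzero for only a few items: W is sparse
            if (w != null) score += e.getValue() * w;
        }
        return score;
    }

    public static void main(String[] args) {
        // user rated item 1 with 5.0 and item 2 with 3.0; w_{1j} = 0.4, w_{2j} = 0.1
        System.out.println(predict(Map.of(1, 5.0, 2, 3.0), Map.of(1, 0.4, 2, 0.1))); // 2.3
    }
}
```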
Slide 22
Top-N Item Recommendation for Each User
Use each_top_k based on SLIM’s predicted values
Full queries: https://hivemall.incubator.apache.org/userguide/recommend/movielens_slim.html
Slide 23
3. word2vec
Slide 24
“java” - “compiler” + “interpreter” ?
Slide 25
“java” - “compiler” + “interpreter” ?
A. rexx
Slide 26
word2vec
• Unsupervised algorithms for obtaining word vectors
• Uses only a document dataset, such as Wikipedia
• High-impact algorithms
• Very fast
• Simple models
• Applications in other domains
• e.g., item purchase histories, graph data
Slide 27
Word Vector
• Each word is represented as a dense, low-dimensional vector
• About 100–1000 dimensions
• Features
• Similar words have similar vectors
• Finding synonyms
• Good features for other ML tasks
• Word analogies
• King - Man + Woman ~= Queen
• Reading - Watching + Watched ~= Read
• France - Paris + Tokyo ~= Japan
(figure: King/Man and Queen/Woman vector offsets)
Slide 28
(same slide as above, annotated: top similar words of "java")
Slide 29
(repeats the previous Word Vector slide)
Slide 30
High-Impact Papers
• There are many *2vec papers…
• doc2vec, pin2vec, node2vec, query2vec, emoji2vec, dna2vec, …
• At least 51 papers
List: https://gist.github.com/nzw0301/333afc00bd508501268fa7bf40cafe4e
Slide 31
word2vec Models
• word2vec is the name of a tool that includes two models:
• Skip-gram
• Continuous Bag-of-Words
• Hivemall supports both algorithms
Original code: https://github.com/tmikolov/word2vec
Slide 32
Concept of Skip-Gram
• Train word vectors by predicting nearby words given the current word (see the pair-generation sketch below)
(example sentence: "Alice was beginning to get very")
Cited from T. Mikolov, et al., Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.
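To make the training signal concrete, the following Java sketch (an illustration, not Hivemall's code) enumerates the (current word, nearby word) pairs that Skip-gram trains on, with a fixed window size:

```java
import java.util.ArrayList;
import java.util.List;

public final class SkipGramPairs {
    /** Enumerate (center, context) pairs within +-window of each position. */
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int t = 0; t < tokens.length; t++) {
            for (int c = Math.max(0, t - window); c <= Math.min(tokens.length - 1, t + window); c++) {
                if (c != t) out.add(new String[]{tokens[t], tokens[c]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sent = {"Alice", "was", "beginning", "to", "get", "very"};
        for (String[] p : pairs(sent, 2)) {
            System.out.println(p[0] + " -> " + p[1]); // predict p[1] given the current word p[0]
        }
    }
}
```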
Slide 33
Concept of Continuous Bag-of-Words (CBoW)
(example sentence: "Alice was beginning to get very")
• Train word vectors by predicting the current word from nearby words
• It is a better model for obtaining low-frequency word vectors
Cited from T. Mikolov, et al., Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.
Slide 34
Usage of train_word2vec
• negative_table: see the next slides
• words: array of string/int
• Last string argument: training parameters
Slide 35
Negative Sampling
• The output layer's activation function is softmax
• O(V), where V is the vocabulary size
• Negative sampling approximates the softmax
• Uses "negative words" sampled from a noise distribution as negative examples (see the sketch below)
• The number of negative samples is 5–25
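The original word2vec defines the noise distribution as the unigram distribution raised to the 3/4 power; here is a minimal Java sketch under that assumption (not necessarily Hivemall's exact code):

```java
public final class NoiseDistribution {
    /** Unigram counts raised to the 0.75 power, normalized to probabilities. */
    static double[] noiseDistribution(long[] wordCounts) {
        double[] p = new double[wordCounts.length];
        double z = 0.d;
        for (int i = 0; i < wordCounts.length; i++) {
            p[i] = Math.pow(wordCounts[i], 0.75);
            z += p[i];
        }
        for (int i = 0; i < p.length; i++) p[i] /= z; // normalize
        return p;
    }

    public static void main(String[] args) {
        // toy vocabulary: The, Hive, word2vec, is
        double[] p = noiseDistribution(new long[]{40, 20, 25, 25});
        for (double v : p) System.out.println(v);
    }
}
```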
Slide 36
Word Sampling from the Noise Distribution
(figure: noise distribution over V = 4 words: The 0.3, Hive 0.2, word2vec 0.25, is 0.25)
Traditional search algorithms are too slow…
• Linear search: O(V)
• Binary search: O(log(V))
Slide 37
Original Implementation
(figure: the same distribution expanded into a 100-element array)
Elements of an array with 100 elements:
• 0–29: The
• 30–49: Hive
• 50–74: word2vec
• 75–99: is
Sampling by nextInt(100): O(1) (see the sketch below)
But the array must be very long relative to the vocabulary size V
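A hedged sketch of this original approach: expand the distribution into a fixed-size table so each word fills a number of slots proportional to its probability, then draw a slot uniformly. The table size of 100 follows the slide's example; the real word2vec implementation uses a much larger table, which is exactly the memory problem noted above.

```java
import java.util.Random;

public final class UnigramTable {
    /** Expand probabilities into a table; word i fills roughly p[i] * tableSize slots. */
    static int[] buildTable(double[] p, int tableSize) {
        int[] table = new int[tableSize];
        int word = 0;
        double cumulative = p[0];
        for (int slot = 0; slot < tableSize; slot++) {
            if (slot / (double) tableSize >= cumulative && word < p.length - 1) {
                word++;
                cumulative += p[word];
            }
            table[slot] = word;
        }
        return table;
    }

    public static void main(String[] args) {
        double[] p = {0.3, 0.2, 0.25, 0.25};  // The, Hive, word2vec, is
        int[] table = buildTable(p, 100);     // slots 0-29 -> The, 30-49 -> Hive, ...
        Random rnd = new Random();
        System.out.println(table[rnd.nextInt(table.length)]); // O(1) sampling
    }
}
```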
Slide 38
Alias Method
• Saves memory
• Same sampling cost: O(1)
(figure: probabilities {0.3, 0.2, 0.25, 0.25} for The, Hive, word2vec, is are scaled by V (= 4) into {1.2, 0.8, 1.0, 1.0})
Slide 39
Alias Method: Split the Array, O(V)
(figure: the scaled masses {1.2, 0.8, 1.0, 1.0} are redistributed into V bins; e.g., Hive's bin keeps mass 0.8 and borrows the remaining 0.2 from The)
Each bin holds at most two words (see the construction sketch below)
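Below is a Java sketch of the table construction, using the standard two-queue alias-method build (Hivemall's actual code may differ in details). On the slide's example it reproduces the figure: Hive's bin keeps probability 0.8 with The as its alias, and every other bin keeps probability 1.0.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public final class AliasTable {
    final double[] prob;  // acceptance threshold per bin
    final int[] alias;    // fallback word per bin

    /** Build the alias tables in O(V): scale by V, then pair small and large bins. */
    AliasTable(double[] p) {
        int V = p.length;
        prob = new double[V];
        alias = new int[V];
        double[] scaled = new double[V];
        Deque<Integer> small = new ArrayDeque<>(), large = new ArrayDeque<>();
        for (int i = 0; i < V; i++) {
            scaled[i] = p[i] * V;              // e.g. {0.3, 0.2, 0.25, 0.25} -> {1.2, 0.8, 1.0, 1.0}
            if (scaled[i] < 1.d) small.add(i); else large.add(i);
        }
        while (!small.isEmpty() && !large.isEmpty()) {
            int s = small.poll(), l = large.poll();
            prob[s] = scaled[s];               // bin s keeps its own word with prob scaled[s]
            alias[s] = l;                      // ...and borrows the remainder from word l
            scaled[l] -= 1.d - scaled[s];      // word l donated some of its mass
            if (scaled[l] < 1.d) small.add(l); else large.add(l);
        }
        while (!large.isEmpty()) prob[large.poll()] = 1.d;
        while (!small.isEmpty()) prob[small.poll()] = 1.d; // numerical leftovers
    }

    public static void main(String[] args) {
        AliasTable t = new AliasTable(new double[]{0.3, 0.2, 0.25, 0.25}); // The, Hive, word2vec, is
        System.out.println(java.util.Arrays.toString(t.prob));   // [1.0, 0.8, 1.0, 1.0]
        System.out.println(java.util.Arrays.toString(t.alias));  // alias[1] == 0 (The)
    }
}
```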
Slide 40
Alias Method: Sampling, O(1)
(figure: alias table with bins 0–3 for The, Hive, word2vec, is; thresholds 0.8 and 1.0)
1. Sample a bin index by nextInt(V)
Slide 41
Alias Method: Sampling, O(1)
(same figure)
1. Sample a bin index by nextInt(V)
• e.g., nextInt(V) = 1
Slide 42
Alias Method: Sampling, O(1)
(same figure)
1. Sample a bin index by nextInt(V)
2. Sample a double by nextDouble()
Slide 43
Alias Method: Sampling, O(1)
(same figure)
1. Sample a bin index by nextInt(V)
2. Sample a double by nextDouble()
• e.g., nextDouble() = 0.7
• If the random value < 0.8, return bin 1's own word: Hive
• Else, return the other word stored in bin 1: The
(see the sampling sketch below)
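Continuing the AliasTable sketch from the split-array slide, the two steps above come down to a few lines (again illustrative):

```java
// Method to add to the AliasTable sketch above: O(1) sampling.
int sampleWord(java.util.Random rnd) {
    int i = rnd.nextInt(prob.length);      // 1. sample a bin index uniformly
    return rnd.nextDouble() < prob[i]      // 2. sample a double in [0, 1)
            ? i                            //    keep the bin's own word (e.g. Hive for bin 1)
            : alias[i];                    //    else take the bin's alias word (e.g. The)
}
```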
Slide 44
Characteristic Features of Hivemall's word2vec
1. Uses the alias method for negative sampling
• Fast sampling from the noise distribution
• Saves memory
• The original implementation uses an array that is very long relative to the vocabulary size
2. Data-parallel training
• No parameter synchronization
• Guarantees the same initialization of the vector weights (see the sketch below)
• By using the word index as the seed value
• Unfortunately, the vector quality is not good…
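A hedged sketch of the seeding trick (illustrative; the exact initialization in Hivemall may differ): because every worker seeds its RNG with the same word index, all workers produce identical initial vectors without any communication.

```java
import java.util.Random;

public final class SeededInit {
    /** Deterministic initial vector for a word: every worker gets the same result. */
    static float[] initVector(int wordIndex, int dim) {
        Random rnd = new Random(wordIndex);         // word index as the seed value
        float[] v = new float[dim];
        for (int d = 0; d < dim; d++) {
            v[d] = (rnd.nextFloat() - 0.5f) / dim;  // small uniform init, word2vec-style
        }
        return v;
    }
}
```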
Slide 45
Future Work
• Compare the implemented algorithms with other algorithms
• e.g., recommendation quality and speed for SLIM
• Improve vector quality during data-parallel training
• A parameter server?
• If we can do this, I also want to write a paper based on the Hivemall implementation…
Slide 46
Impressions
• This was my first long-term internship
• Distributed machine learning is an exciting task
• Reimplemented ML algorithms in Hivemall
• Programming skills: Java, Hive, and Hivemall