Slide 1

Treasure Data Summer Internship 2017 Final Report
Kento NOZAWA (@nzw0301)
Sep. 29, 2017

Slide 2

Who am I?
• Kento NOZAWA (@nzw0301)
• Master's student at the University of Tsukuba
  • I will be a Ph.D. student next year
• Research: unsupervised machine learning
  • Graph data and topic models
• OSS contributions: Keras and its documentation translation
  • Keras is a very popular deep learning framework

Slide 3

What I did: Adding new UDFs to Hivemall
1. F-Measure: an evaluation metric for classification models
  • An easy first task, to learn Hive and Hivemall
  • Merged
2. SLIM: a fast recommendation algorithm
  • The hardest work for me…
  • Merged
3. word2vec: an unsupervised word feature learning algorithm
  • A challenging task
  • Under review

Slide 4

1. F-Measure

Slide 5

Background: Binary Classification Problem
• Predict the label of each data point given two categories
  • e.g., positive/negative book reviews, …
• Train an ML model on a labeled dataset
(Figure: a labeled dataset is used to train an ML model)

Slide 6

Background: Binary Classification Problem
• Predict the label of each data point given two categories
  • e.g., user gender, positive/negative book reviews, …
• The prediction model is trained on a labeled dataset
(Figure: a labeled dataset is used to train an ML model)
• Which ML model is better? How do we choose better parameters for the ML model?

Slide 7

F-Measure
• Widely used evaluation metric for classification models
• A higher F-measure indicates a better model

F_β = (1 + β²) × precision × recall / (β² × precision + recall)

(Figure: a confusion matrix of truth labels vs. predicted labels, with the definitions of precision and recall)

Slide 8

F-Measure
• Widely used evaluation metric for classification models
• A higher F-measure indicates a better model
• When β = 1, it is called the F1-score (or F-score)
• Hivemall already supported the F1-score (a sketch of the formula in code follows below)

F_β = (1 + β²) × precision × recall / (β² × precision + recall)
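For reference, a minimal Java sketch of the F_β formula above, computed from true-positive, false-positive, and false-negative counts; it only illustrates the metric itself, not Hivemall's UDF code.

```java
// Minimal sketch of the F-beta score from TP/FP/FN counts.
// This illustrates the formula on the slide, not Hivemall's internal code.
public final class FMeasure {
    static double fbeta(long tp, long fp, long fn, double beta) {
        double precision = tp == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = tp == 0 ? 0.0 : (double) tp / (tp + fn);
        double b2 = beta * beta;
        double denom = b2 * precision + recall;
        return denom == 0.0 ? 0.0 : (1.0 + b2) * precision * recall / denom;
    }

    public static void main(String[] args) {
        // beta = 1 reduces to the usual F1-score.
        System.out.println(fbeta(8, 2, 4, 1.0));
    }
}
```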

Slide 9

My Tasks
• Users can pass β as an argument in the query
• Two averaging methods for binary classification (see the sketch below)
  • Micro average
  • Binary average
• Support multi-label classification
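To make the difference between the two averaging modes concrete, here is a hedged Java sketch: micro averaging pools the TP/FP/FN counts of both classes before computing the score, while binary averaging scores the positive class only. This mirrors the general definitions, not Hivemall's exact implementation.

```java
// Sketch: micro vs. binary averaging for a binary classifier.
// Micro average pools counts over both classes; binary average
// scores the positive class only. General definitions, not Hivemall code.
public final class Averaging {
    static double f1(long tp, long fp, long fn) {
        double p = tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        double r = tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        long tpPos = 8, fpPos = 2, fnPos = 4;    // positive class counts
        long tpNeg = 20, fpNeg = 4, fnNeg = 2;   // negative class counts

        double binary = f1(tpPos, fpPos, fnPos); // positive class only
        double micro = f1(tpPos + tpNeg, fpPos + fpNeg, fnPos + fnNeg);

        System.out.println("binary = " + binary + ", micro = " + micro);
    }
}
```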

Slide 10

Usage of fmeasure for Binary Classification
(Query screenshot, annotated: set β & average)
• Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/binary_classification_measures.html

Slide 11

Usage of fmeasure for Multi-label Classification
• Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/multilabel_classification_measures.html

Slide 12

2. SLIM

Slide 13

Background: Recommendation
• Suggest some items to the user
• If the user gets one of them, they will be satisfied with it

Slide 14

Background: Recommendation
• Suggest some items to the user
• If the user gets one of them, they will be satisfied with it
• Recommendation is based on the user's purchase history

Slide 15

Top-N Recommendation
• Predict the top-N items per user based on their scores
• Each predicted item has a score
  • e.g., a future rating, number of stars
(Figure: a dataset is used to train an ML model that outputs a top-3 book recommendation with scores 4.7, 4.6, and 4.2)

Slide 16

About SLIM
• Sparse LInear Method
• Fast top-N recommendation algorithm

minimize over w_j:  (1/2)‖a_j − A w_j‖₂² + (β/2)‖w_j‖₂² + λ‖w_j‖₁
subject to W ≥ 0, diag(W) = 0

(Figure: the users × items rating matrix A multiplied by the item–item weight matrix W approximates the score matrix A′)

Xia Ning and George Karypis. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In ICDM, 2011.

Slide 17

Why is SLIM Fast?
1. Training
  • Training is parallelized per column a_j by coordinate descent
  • The matrix product AW is approximated
    • Only the top-k similar items per item are used
2. Prediction
  • The matrix product AW is approximated
    • The weight matrix W is sparse (see the sketch below)
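To illustrate why a sparse W makes prediction cheap, here is a hedged Java sketch of the approximate score computation: the predicted score of item j for a user only touches the top-k items similar to j that the user has rated. This is a conceptual sketch, not Hivemall's implementation.

```java
import java.util.Map;

// Sketch: approximate SLIM prediction score for (user u, item j).
// score(u, j) = sum over i in knn(j) of rating(u, i) * W[i][j],
// so only the top-k similar items of j are touched (W is sparse).
// Conceptual illustration, not Hivemall's implementation.
public final class SlimPredict {
    static double score(Map<Integer, Double> userRatings,      // itemId -> rating by user u
                        Map<Integer, Double> weightsOfItemJ) {  // similar itemId i -> W[i][j]
        double s = 0.0;
        for (Map.Entry<Integer, Double> e : weightsOfItemJ.entrySet()) {
            Double r = userRatings.get(e.getKey());
            if (r != null) {
                s += r * e.getValue();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        Map<Integer, Double> ratings = Map.of(1, 5.0, 2, 4.0);
        Map<Integer, Double> wj = Map.of(1, 0.6, 3, 0.9);  // top-k weights of item j
        System.out.println(score(ratings, wj));            // 5.0 * 0.6 = 3.0
    }
}
```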

Slide 18

Explanation of train_slim Function: i and j
• i and j are item indices
• j is one of the top-k similar items of i
(Figure: the i-th book and its similar items)

Slide 19

Explanation of train_slim Function: r_i and r_j
• r_i is a map that stores all user ratings of item i
  • key: user id
  • value: rating
• r_j is the same for item j
(Figure: the i-th book with example ratings ×5, ×5, ×2)

Slide 20

Explanation of train_slim Function: knn_i
• knn_i is a map of the top-k similar items of item i with their ratings
• A larger k gives better recommendations, but memory usage and training time increase
(Figure: the top-k similar books of item i with example ratings ×5, ×5, ×4; a sketch of these argument shapes follows below)
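As a rough illustration of the shapes of the arguments described on the last three slides (an assumption for illustration, not the exact train_slim UDF signature), the data passed per (i, j) pair could look like this in Java:

```java
import java.util.Map;

// Rough sketch of the argument shapes described above.
// Illustration only; the exact UDF signature and key nesting are assumptions.
public final class TrainSlimArgs {
    int i;                                   // item index i
    Map<Integer, Double> ri;                 // user id -> rating of item i
    Map<Integer, Map<Integer, Double>> knnI; // nested map of ratings restricted to the
                                             // top-k similar items of i (key order assumed)
    int j;                                   // one of the top-k similar items of i
    Map<Integer, Double> rj;                 // user id -> rating of item j
}
```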

Slide 21

Prediction
• Uses only HiveQL
  • Inputs: the known ratings and train_slim's output
• Output value: the predicted future rating of itemid by userid
(Figure: the matrix product A × W)

Slide 22

Top-N Item Recommendation for Each User
• Use each_top_k on SLIM's predicted values
• Full queries: https://hivemall.incubator.apache.org/userguide/recommend/movielens_slim.html

Slide 23

3. word2vec

Slide 24

“java” - “compiler” + “interpreter” ?

Slide 25

“java” - “compiler” + “interpreter” ? A. rexx

Slide 26

word2vec
• An unsupervised algorithm to obtain word vectors
• Only needs a document dataset such as Wikipedia
• High-impact algorithm
  • Very fast
  • Simple model
  • Applications in other domains
    • e.g., item purchase histories, graph data

Slide 27

Word Vector
• Each word is represented as a dense, low-dimensional vector
  • About 100 – 1,000 dimensions
• Features
  • Similar words have similar vectors
    • Finding synonyms
    • Good features for other ML tasks
  • Word analogy (see the sketch below)
    • King - Man + Woman ≈ Queen
    • Reading - Watching + Watched ≈ Read
    • France - Paris + Tokyo ≈ Japan
(Figure: the King, Man, Queen, and Woman vectors)
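A minimal Java sketch of how such an analogy query can be answered once word vectors exist: add and subtract the vectors and return the vocabulary word with the highest cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real vectors come from training.

```java
import java.util.Map;

// Toy sketch of a word-analogy query over trained word vectors:
// answer = argmax over vocabulary of cosine(v[king] - v[man] + v[woman], v[word]).
// The 3-dimensional vectors are made up for illustration only.
public final class Analogy {
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int d = 0; d < x.length; d++) {
            dot += x[d] * y[d];
            nx += x[d] * x[d];
            ny += y[d] * y[d];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    public static void main(String[] args) {
        Map<String, double[]> v = Map.of(
                "king", new double[]{0.8, 0.6, 0.1},
                "man", new double[]{0.7, 0.1, 0.1},
                "woman", new double[]{0.1, 0.1, 0.8},
                "queen", new double[]{0.2, 0.6, 0.8});

        // query = king - man + woman
        double[] q = new double[3];
        for (int d = 0; d < 3; d++) {
            q[d] = v.get("king")[d] - v.get("man")[d] + v.get("woman")[d];
        }

        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : v.entrySet()) {
            if (e.getKey().equals("king") || e.getKey().equals("man") || e.getKey().equals("woman")) {
                continue;  // exclude the query words, as word2vec's analogy tool does
            }
            double sim = cosine(q, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = e.getKey();
            }
        }
        System.out.println(best);  // expected: queen
    }
}
```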

Slide 28

Word Vector (continued: same bullets as the previous slide)
(Figure: the top similar words of “java”)

Slide 30

High-Impact Papers
• There are many *2vec papers…
  • doc2vec, pin2vec, node2vec, query2vec, emoji2vec, dna2vec, …
  • At least 51 papers
• List: https://gist.github.com/nzw0301/333afc00bd508501268fa7bf40cafe4e

Slide 31

word2vec Models
• word2vec is the name of the tool, which includes two models
  • Skip-gram
  • Continuous Bag-of-Words
• Hivemall supports both algorithms
• Original code: https://github.com/tmikolov/word2vec

Slide 32

Concept of Skip-Gram
• Train word vectors by predicting nearby words given the current word
  • Example text: “Alice was beginning to get very …”
(Figure cited from T. Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.)

Slide 33

Concept of Continuous Bag-of-Words (CBoW)
• Train word vectors by predicting the current word based on nearby words (a sketch of the training pairs of both models follows below)
• It is a better model for obtaining low-frequency words
  • Example text: “Alice was beginning to get very …”
(Figure cited from T. Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.)
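To make the two training objectives concrete, here is a hedged Java sketch that enumerates the (input, target) pairs each model would train on over the example sentence with a window size of 2; it only illustrates the data each model sees, not the actual training code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: what Skip-gram and CBoW predict, with a window size of 2.
// Skip-gram: current word -> each nearby word.
// CBoW: all nearby words -> current word.
// Illustration of the training pairs only, not actual training code.
public final class Word2vecPairs {
    public static void main(String[] args) {
        String[] words = "alice was beginning to get very".split(" ");
        int window = 2;

        for (int t = 0; t < words.length; t++) {
            List<String> context = new ArrayList<>();
            for (int c = Math.max(0, t - window); c <= Math.min(words.length - 1, t + window); c++) {
                if (c != t) {
                    context.add(words[c]);
                }
            }
            // Skip-gram pairs: the current word predicts each context word.
            for (String ctx : context) {
                System.out.println("skip-gram: " + words[t] + " -> " + ctx);
            }
            // CBoW pair: the context words together predict the current word.
            System.out.println("cbow: " + context + " -> " + words[t]);
        }
    }
}
```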

Slide 34

Usage of train_word2vec
• negative_table: explained in the next slides
• words: an array of string/int
• The last string argument: training parameters

Slide 35

Negative Sampling
• The output layer's activation function is softmax
  • O(V), where V is the vocabulary size
• Negative sampling approximates softmax (see the sketch below)
  • It uses “negative words” sampled from a noise distribution as negative examples
  • The number of negative samples is 5 – 25
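For reference, a minimal Java sketch of the skip-gram negative-sampling objective for one (target, context) pair: the softmax over all V words is replaced by one positive logistic term and a handful of negative terms. This states the standard objective only, not Hivemall's code.

```java
import java.util.Random;

// Sketch of the skip-gram negative-sampling loss for one (target, context) pair:
// loss = -log sigmoid(v_ctx . v_tgt) - sum over negatives of log sigmoid(-v_neg . v_tgt).
// Standard objective only, not Hivemall's implementation.
public final class NegativeSamplingLoss {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int d = 0; d < a.length; d++) {
            s += a[d] * b[d];
        }
        return s;
    }

    static double loss(double[] target, double[] context, double[][] negatives) {
        double l = -Math.log(sigmoid(dot(context, target)));
        for (double[] neg : negatives) {
            l -= Math.log(sigmoid(-dot(neg, target)));
        }
        return l;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int dim = 4, numNegatives = 5;  // 5 - 25 negatives in practice
        double[] target = rnd.doubles(dim, -0.5, 0.5).toArray();
        double[] context = rnd.doubles(dim, -0.5, 0.5).toArray();
        double[][] negatives = new double[numNegatives][];
        for (int k = 0; k < numNegatives; k++) {
            // In practice these are drawn from the noise distribution.
            negatives[k] = rnd.doubles(dim, -0.5, 0.5).toArray();
        }
        System.out.println(loss(target, context, negatives));
    }
}
```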

Slide 36

Word Sampling from Noise Distribution
(Figure: a probability array A over [0, 1] for V = 4 words: The 0.3, Hive 0.2, word2vec 0.25, is 0.25)
• Traditional search algorithms are too slow…
  • Linear search: O(V)
  • Binary search: O(log V)

Slide 37

Original Implementation
(Figure: the same distribution expanded into an array of 100 elements)
• Elements of an array with 100 elements
  • 0–29: The
  • 30–49: Hive
  • 50–74: word2vec
  • 75–99: is
• Sampling by nextint(100): O(1)
• Problem: the array becomes too long for a large vocabulary V (a sketch of this table follows below)
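A hedged Java sketch of this table-based sampling: every word gets a number of slots proportional to its probability, and drawing a word is a single random index lookup. The 100-slot table mirrors the example above; the real implementation uses a far larger table, which is what the alias method below avoids.

```java
import java.util.Random;

// Sketch of table-based sampling: each word receives slots in proportion
// to its probability, so drawing a word is one nextInt() lookup, O(1).
// The table size of 100 mirrors the slide; real tables are much larger.
public final class UnigramTable {
    public static void main(String[] args) {
        String[] words = {"The", "Hive", "word2vec", "is"};
        double[] probs = {0.3, 0.2, 0.25, 0.25};

        int tableSize = 100;
        String[] table = new String[tableSize];
        int pos = 0;
        for (int w = 0; w < words.length; w++) {
            int slots = (int) Math.round(probs[w] * tableSize);
            for (int s = 0; s < slots && pos < tableSize; s++) {
                table[pos++] = words[w];
            }
        }
        while (pos < tableSize) {          // fill any rounding leftovers
            table[pos++] = words[words.length - 1];
        }

        Random rnd = new Random();
        System.out.println(table[rnd.nextInt(tableSize)]);  // one O(1) draw
    }
}
```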

Slide 38

Alias Method
• Saves memory
• Same sampling cost: O(1)
(Figure: the probabilities 0.3, 0.2, 0.25, 0.25 of The, Hive, word2vec, is are multiplied by V (= 4), giving the scaled values 1.2, 0.8, 1.0, 1.0)

Slide 39

Alias Method: Split Array, O(V)
(Figure: the scaled values 1.2, 0.8, 1.0, 1.0 are redistributed into V bins of height 1; the excess of The fills the Hive bin above 0.8)
• Each bin holds at most two words

Slide 40

Alias Method: Sampling, O(1)
(Figure: the split array with bin indices 0–3)
1. Sample a bin index by nextint(V)

Slide 41

Alias Method: Sampling, O(1)
(Figure: the split array with bin indices 0–3)
1. Sample a bin index by nextint(V)
  • e.g., nextint(V) = 1

Slide 42

Alias Method: Sampling, O(1)
(Figure: the split array with bin indices 0–3)
1. Sample a bin index by nextint(V)
2. Sample a double by nextDouble()

Slide 43

Alias Method: Sampling, O(1)
(Figure: the split array with bin indices 0–3)
1. Sample a bin index by nextint(V)
2. Sample a double by nextDouble()
  • e.g., nextDouble() = 0.7
  • If the random value < 0.8, return the 1st bin's own word: Hive
  • Else, return the other word stored in the 1st bin: The
(A full sketch of the alias method follows below.)
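A hedged Java sketch of the whole alias method (Vose's variant) for the 4-word example: two arrays of length V are built in O(V), and each draw costs one nextInt() plus one nextDouble(), i.e. O(1). This is a conceptual sketch, not Hivemall's code, so the exact bin/alias assignment may differ from the figure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Sketch of the alias method (Vose's variant): O(V) build, O(1) per draw.
// Conceptual illustration only, not Hivemall's implementation.
public final class AliasSampler {
    final double[] prob;   // probability of keeping the bin's own word
    final int[] alias;     // the other word stored in the bin
    final Random rnd = new Random();

    AliasSampler(double[] p) {
        int n = p.length;
        prob = new double[n];
        alias = new int[n];
        double[] scaled = new double[n];
        Deque<Integer> small = new ArrayDeque<>();
        Deque<Integer> large = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            scaled[i] = p[i] * n;                       // e.g. 0.3 * 4 = 1.2
            (scaled[i] < 1.0 ? small : large).push(i);
        }
        while (!small.isEmpty() && !large.isEmpty()) {
            int s = small.pop(), l = large.pop();       // top bin s up to height 1 with word l
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] -= 1.0 - scaled[s];
            (scaled[l] < 1.0 ? small : large).push(l);
        }
        while (!large.isEmpty()) prob[large.pop()] = 1.0;
        while (!small.isEmpty()) prob[small.pop()] = 1.0;  // numerical leftovers
    }

    int sample() {
        int bin = rnd.nextInt(prob.length);                       // 1. sample a bin
        return rnd.nextDouble() < prob[bin] ? bin : alias[bin];   // 2. keep it or take the alias
    }

    public static void main(String[] args) {
        String[] words = {"The", "Hive", "word2vec", "is"};
        AliasSampler sampler = new AliasSampler(new double[]{0.3, 0.2, 0.25, 0.25});
        System.out.println(words[sampler.sample()]);
    }
}
```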

Slide 44

Characteristic Features of Hivemall's word2vec
1. Uses the alias method for negative sampling
  • Fast sampling from the noise distribution
  • Saves memory
    • The original implementation uses an array that is too long for a large vocabulary
2. Data-parallel training
  • No parameter synchronization
  • Guarantees the same initialization of the vector weights (see the sketch below)
    • By using the word index as the seed value
  • Unfortunately, the vector quality is not good…
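A small sketch of the seeding idea in point 2, assuming a plain java.util.Random seeded per word; the actual RNG and initialization range in Hivemall may differ. Because the seed depends only on the word index, every worker initializes the same word to the same vector without any synchronization.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: seeding the initializer with the word index makes the initial
// vector of a word identical on every worker, with no synchronization.
// The RNG choice and the init range here are assumptions for illustration.
public final class DeterministicInit {
    static float[] initVector(int wordIndex, int dim) {
        Random rnd = new Random(wordIndex);     // seed = word index
        float[] v = new float[dim];
        for (int d = 0; d < dim; d++) {
            v[d] = (rnd.nextFloat() - 0.5f) / dim;
        }
        return v;
    }

    public static void main(String[] args) {
        // Two "workers" initialize word 42 independently and get the same vector.
        System.out.println(Arrays.equals(initVector(42, 100), initVector(42, 100)));  // true
    }
}
```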

Slide 45

Future Work
• Compare the implemented algorithms with other algorithms
  • e.g., recommendation quality and speed for SLIM
• Improve quality during data-parallel training
  • Parameter server?
• If possible, I also want to write a paper based on the Hivemall implementation…

Slide 46

Impression
• This was my first long-term internship
• Distributed machine learning is an exciting task
  • Reimplementing ML algorithms on Hivemall
• Programming skills: Java, Hive, and Hivemall