Treasure Data Summer Internship 2017

Kento Nozawa
September 29, 2017


Transcript

  1. Treasure Data Summer Internship 2017 Final Report Kento NOZAWA (@nzw0301)

    Sep. 29, 2017
  2. Who am I? • Kento NOZAWA (@nzw0301) • Master’s

    student at the University of Tsukuba • I will be a Ph.D. student next year • Research: unsupervised machine learning • Graph data and topic models • OSS contributions: Keras and its documentation translation • Keras is a very popular deep learning framework
  3. What I did: Adding new UDFs to Hivemall 1. F-Measure:

    evaluation metric for classification models • First, an easy task • To learn Hive and Hivemall • Merged 2. SLIM: a fast recommendation algorithm • The hardest work for me… • Merged 3. word2vec: an unsupervised word-feature learning algorithm • A challenging task • Under review
  4. 1. F-Measure

  5. • Predict a label for each data point given two categories

    • e.g.: positive/negative book review, … • Train an ML model on a labeled dataset Background: Binary Classification Problem (Figure: dataset → train ML model)
  6. • Predict a label for each data point given two categories

    • e.g.: user gender, positive/negative book review, … • The prediction model is trained on a labeled dataset Background: Binary Classification Problem (Figure: dataset → train ML model) Which ML model is better? How do we choose better parameters for the ML model?
  7. F-Measure • Widely used evaluation metric for classification models •

    Higher F-measure indicates a better model: F_β = (1 + β²) · (precision · recall) / (β² · precision + recall) (Figure: confusion matrix of truth labels vs. predicted labels, defining Precision and Recall)
  8. F-Measure • Widely used evaluation metric for classification models •

    Higher F-measure indicates a better model • When β = 1, it is called the F1-score/F-score • Hivemall previously supported only the F1-score: F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)
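The F-measure formula can be checked with a small sketch (the class and method names here are mine, not Hivemall's UDF API):

```java
// Minimal sketch of the F-beta formula (not Hivemall's implementation).
public class FMeasureExample {
    // tp/fp/fn: true-positive, false-positive, false-negative counts
    public static double fbeta(int tp, int fp, int fn, double beta) {
        double precision = (double) tp / (tp + fp);
        double recall = (double) tp / (tp + fn);
        double b2 = beta * beta;
        return (1.0 + b2) * precision * recall / (b2 * precision + recall);
    }

    public static void main(String[] args) {
        // precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667
        System.out.println(fbeta(8, 2, 4, 1.0)); // F1 = 8/11 ≈ 0.727
    }
}
```

With β > 1 the score leans toward recall; with β < 1, toward precision.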
  9. My Tasks • Users can pass β as an argument in

    the query • Two averaging methods for binary classification • Micro average • Binary average • Support multi-label classification
  10. Usage of fmeasure for Binary Classification: set β & average

    Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/binary_classification_measures.html
  11. Usage of fmeasure for Multi-label Classification. Detailed usage: https://hivemall.incubator.apache.org/userguide/eval/multilabel_classification_measures.html

  12. 2. SLIM

  13. • Suggest some items to the user • If the

    user picks one, they will be satisfied with it Background: Recommendation
  14. • Suggest some items to the user • If the

    user picks one, they will be satisfied with it • Recommendation based on the user's purchase history Background: Recommendation
  15. • Predict the top-N items per user based on their scores

    • Each predicted item has a score • e.g., a future rating, #stars Top-N Recommendation (Figure: dataset → train ML model → top-3 book recommendation with scores 4.7, 4.6, 4.2)
  16. About SLIM • Sparse LInear Method • Fast top-N recommendation

    algorithm. Xia Ning and George Karypis. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In ICDM, 2011. min_{w_j} (1/2)‖a_j − A w_j‖₂² + (β/2)‖w_j‖₂² + λ‖w_j‖₁ subject to W ≥ 0, diag(W) = 0 (Figure: A × W ≈ A′, where A is the users × items rating matrix)
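The per-item objective on this slide can be written as a short sketch (dense arrays for clarity; Hivemall's actual code works on sparse structures, and these names are mine):

```java
// Sketch: evaluate SLIM's per-item objective for one column w_j.
public class SlimObjective {
    // aj: ratings of item j; Aw: the product A * w_j (predicted ratings); w: w_j
    public static double objective(double[] aj, double[] Aw, double[] w,
                                   double beta, double lambda) {
        double loss = 0.0;
        for (int u = 0; u < aj.length; u++) {
            double d = aj[u] - Aw[u];
            loss += 0.5 * d * d;            // (1/2) ||a_j - A w_j||^2
        }
        double l2 = 0.0, l1 = 0.0;
        for (double v : w) {
            l2 += v * v;                    // ||w_j||^2
            l1 += Math.abs(v);              // ||w_j||_1 (w_j >= 0 in SLIM)
        }
        return loss + 0.5 * beta * l2 + lambda * l1;
    }
}
```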
  17. Why is SLIM Fast? 1. Training • Training is parallelized per

    item column a_j by coordinate descent • Approximate the matrix product AW • Only use the top-k similar items per item 2. Prediction • Approximate matrix product AW • The weight matrix W is sparse
  18. Explanation of train_slim Function: i and j • i and

    j are item indices • j is one of the top-k similar items of i (Figure: the i-th book)
  19. Explanation of train_slim Function: r_i and r_j • r_i is

    a map storing all user ratings of item i • key: user id • value: rating • r_j is the same for item j (Figure: the i-th book with ratings ×5, ×5, ×2)
  20. Explanation of train_slim Function: knn_i • knn_i is a map

    of the top-k similar items of item i with their ratings • A larger k makes recommendations better, but memory usage and training time increase (Figure: the i-th book and its k most similar books with ratings)
  21. Prediction • Uses only HiveQL • Inputs: known ratings and train_slim’s

    output • Output value: the future rating of itemid by userid (matrix product)
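The prediction above is a sparse matrix product; here is a minimal sketch of one (user, item) score. In Hivemall this step is done entirely in HiveQL, so the class and method names below are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: SLIM prediction for one (user, item) pair as a sparse dot product.
public class SlimPredict {
    // userRatings: itemid -> the user's known rating
    // itemWeights: itemid -> learned weight for the target item (train_slim output)
    public static double score(Map<Integer, Double> userRatings,
                               Map<Integer, Double> itemWeights) {
        double s = 0.0;
        for (Map.Entry<Integer, Double> e : itemWeights.entrySet()) {
            Double r = userRatings.get(e.getKey());
            if (r != null) s += r * e.getValue(); // only overlapping items contribute
        }
        return s;
    }
}
```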
  22. Top-N Item Recommendation for Each User Use each_top_k based on

    SLIM’s predicted values. Full queries: https://hivemall.incubator.apache.org/userguide/recommend/movielens_slim.html
  23. 3. word2vec

  24. “java” - “compiler” + “interpreter” ?

  25. “java” - “compiler” + “interpreter” ? A. rexx

  26. word2vec • Unsupervised algorithms to obtain word vectors • Only

    requires a document corpus such as Wikipedia • A high-impact algorithm • Very fast • Simple model • Applications in other domains • e.g., item purchase history, graph data
  27. Word Vector • Each word is represented as a dense, low-

    dimensional vector • About 100–1000 dimensions • Features • Similar words have similar vectors • Finding synonyms • Good features for other ML tasks • Word analogy • King − Man + Woman ≈ Queen • Reading − Watching + Watched ≈ Read • France − Paris + Tokyo ≈ Japan (Figure: King/Man/Queen/Woman vectors)
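The analogy arithmetic above can be illustrated with made-up 2-D toy vectors (real word2vec vectors have hundreds of dimensions, and query words are usually excluded from the nearest-neighbor search):

```java
// Toy illustration of word-vector analogy arithmetic with made-up 2-D vectors.
public class Analogy {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // e.g. analogy(king, man, woman) = king - man + woman
    public static double[] analogy(double[] x, double[] minus, double[] plus) {
        double[] r = new double[x.length];
        for (int i = 0; i < x.length; i++) r[i] = x[i] - minus[i] + plus[i];
        return r;
    }
}
```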
  28. Word Vector • Each word is represented as a dense, low-

    dimensional vector • About 100–1000 dimensions • Features • Similar words have similar vectors • Finding synonyms • Good features for other ML tasks • Word analogy • King − Man + Woman ≈ Queen • Reading − Watching + Watched ≈ Read • France − Paris + Tokyo ≈ Japan (Figure: top similar words of “java”)
  30. High Impact Papers • There are many *2vec papers… •

    doc2vec, pin2vec, node2vec, query2vec, emoji2vec, dna2vec, … • At least 51 papers. List: https://gist.github.com/nzw0301/333afc00bd508501268fa7bf40cafe4e
  31. word2vec Models • word2vec is the name of a tool that includes two

    models • Skip-gram • Continuous Bag-of-Words • Hivemall supports both algorithms. Original code: https://github.com/tmikolov/word2vec
  32. Concept of Skip-Gram • Train word vectors by predicting nearby

    words given the current word. Example sentence: Alice was beginning to get very … Cited from T. Mikolov et al., Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.
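The skip-gram training pairs (current word → nearby word) can be sketched as follows; the window size is a training parameter, and the names here are mine:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generate skip-gram training pairs (center word -> nearby word)
// within a fixed window around each position.
public class SkipGramPairs {
    public static List<String[]> pairs(String[] sentence, int window) {
        List<String[]> out = new ArrayList<>();
        for (int c = 0; c < sentence.length; c++) {
            int lo = Math.max(0, c - window);
            int hi = Math.min(sentence.length - 1, c + window);
            for (int j = lo; j <= hi; j++) {
                if (j != c) out.add(new String[]{sentence[c], sentence[j]});
            }
        }
        return out;
    }
}
```

For the sentence fragment "alice was beginning" with window 1, this yields pairs like (alice, was) and (was, beginning).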
  33. Concept of Continuous Bag-of-Words (CBoW) • Train word vectors by

    predicting the current word from nearby words • CBoW is several times faster to train; skip-gram tends to give better vectors for low-frequency words. Example sentence: Alice was beginning to get very … Cited from T. Mikolov et al., Efficient Estimation of Word Representations in Vector Space. In ICLR, 2013.
  34. Usage of train_word2vec • negative_table: see the next slides • words: array

    of string/int • The last string argument: training parameters
  35. Negative Sampling • The output layer’s activation function is softmax •

    It costs O(V), where V is the vocabulary size • Negative sampling approximates the softmax • Uses “negative words” sampled from a noise distribution as negative examples • The number of negative samples is typically 5–25
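The noise distribution itself can be sketched like this. word2vec's default raises the unigram counts to the 0.75 power before normalizing (the class and method names are mine):

```java
// Sketch: a noise distribution from word counts, as in word2vec's default
// (counts raised to the 0.75 power, then normalized to sum to 1).
public class NoiseDistribution {
    public static double[] probabilities(long[] counts, double power) {
        double[] p = new double[counts.length];
        double z = 0.0;
        for (int i = 0; i < counts.length; i++) {
            p[i] = Math.pow(counts[i], power);
            z += p[i];
        }
        for (int i = 0; i < p.length; i++) p[i] /= z; // normalize
        return p;
    }
}
```

Raising counts to a power below 1 flattens the distribution, so rare words are sampled a bit more often than their raw frequency suggests.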
  36. Word Sampling from Noise Distribution (Figure: example vocabulary of V =

    4 words — The, Hive, word2vec, is — with probabilities 0.3, 0.2, 0.25, 0.25) Traditional search algorithms are too slow… • Linear search: O(V) • Binary search: O(log V)
  37. Original Implementation (Figure: the same 4-word distribution, 0.3,

    0.2, 0.25, 0.25) An array with 100 elements • 0–29: The • 30–49: Hive • 50–74: word2vec • 75–99: is • Sampling by nextInt(100): O(1), but the array is far longer than V
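The original-implementation table above can be sketched as follows; `build`/`sample` are my names, and the rounding scheme is a simplification of the original C code:

```java
import java.util.Random;

// Sketch of the original word2vec unigram table: fill a long array
// proportionally to each word's probability, then sample with one nextInt.
public class UnigramTable {
    public static int[] build(double[] probs, int tableSize) {
        int[] table = new int[tableSize];
        int pos = 0;
        double cumulative = 0.0;
        for (int w = 0; w < probs.length; w++) {
            cumulative += probs[w];
            // fill entries up to the cumulative boundary with word index w
            while (pos < tableSize && pos < Math.round(cumulative * tableSize)) {
                table[pos++] = w;
            }
        }
        while (pos < tableSize) table[pos++] = probs.length - 1; // rounding leftover
        return table;
    }

    public static int sample(int[] table, Random rnd) {
        return table[rnd.nextInt(table.length)]; // O(1) per sample
    }
}
```

With probabilities {0.3, 0.2, 0.25, 0.25} and table size 100, word 0 ("The") fills indices 0–29, exactly as on the slide; the cost is that the table must be much longer than V to represent the probabilities accurately.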
  38. Alias Method • Saves memory • Same sampling cost: O(1)

    (Figure: each probability multiplied by V (= 4), turning 0.3, 0.2, 0.25, 0.25 into 1.2, 0.8, 1.0, 1.0)
  39. Alias Method: Split Array O(V) (Figure: the bars of height 1.2,

    0.8, 1.0, 1.0 are redistributed so every bar has height 1; e.g. the "Hive" bar of height 0.8 is topped up with a slice of "The") • Each bar holds at most two words
  40. Alias Method: Sampling O(1) (Figure: four unit bars, indices 0–3)

    1. Sample an index by nextInt(V)
  41. Alias Method: Sampling O(1) (Figure: four unit bars, indices 0–3)

    1. Sample an index by nextInt(V) • Here nextInt(V) = 1
  42. Alias Method: Sampling O(1) (Figure: four unit bars, indices 0–3)

    1. Sample an index by nextInt(V) 2. Sample a double by nextDouble()
  43. Alias Method: Sampling O(1) (Figure: four unit bars, indices 0–3)

    1. Sample an index by nextInt(V) 2. Sample a double by nextDouble() • Here nextDouble() = 0.7 • If the random value < 0.8, take the word at index 1: Hive • Else take the other word sharing index 1: The
  44. Characteristic Features of Hivemall’s word2vec 1. Uses the alias method for

    negative sampling • Fast sampling from the noise distribution • Saves memory • The original uses an array far longer than the vocabulary size 2. Data-parallel training • No parameter synchronization • Guarantees the same initialization of the vector weights • By using the word index as the seed value • Unfortunately, the vector quality is not good yet…
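The seeding trick above can be sketched like this: seeding with the word index means every worker produces identical initial vectors without any synchronization. The uniform range below is an assumption for illustration, not Hivemall's exact constant:

```java
import java.util.Random;

// Sketch: deterministic per-word vector initialization. Every worker that
// calls this with the same word index gets the same initial vector, so no
// parameter synchronization is needed for initialization.
public class WordVectorInit {
    public static float[] initVector(int wordIndex, int dim) {
        Random rnd = new Random(wordIndex);        // word index as the seed
        float[] v = new float[dim];
        for (int i = 0; i < dim; i++) {
            v[i] = (rnd.nextFloat() - 0.5f) / dim; // small values around zero
        }
        return v;
    }
}
```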
  45. Future work • Comparison between the implemented algorithms and other algorithms

    • e.g., recommendation quality and speed for SLIM • Improving quality during data-parallel training • A parameter server? • If it works out, I also want to write a paper based on the Hivemall implementation…
  46. Impression • This was my first long-term internship • Distributed

    machine learning is an exciting task • Reimplemented ML algorithms in Hivemall • Programming skills gained: Java, Hive, and Hivemall