Building a Recommendation Engine with Spark and Apache PredictionIO

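Recommendation System with Spark and PredictionIO, a Machine Learning Platform Built in Scala | JJUG CCC 2017 SPRING
#ccc_a3
This session gives an overview of Apache PredictionIO, a machine learning server built on widely watched open source projects such as Spark, MLlib, HDFS, and Elasticsearch, and shares lessons learned from developing a recommendation system with it. With Apache PredictionIO, describing a machine learning method in a template is enough to run training tasks as distributed jobs on Spark. Beyond that, it is a platform that also provides, in an integrated way, an API server that returns predictions from trained models and accepts new event data in real time.
The session covers the PredictionIO architecture, which is also a useful reference for machine learning infrastructure design. Drawing on the development of a recommendation system that suggests the most suitable jobs to users from the log data of one of Japan's largest job search engines, it also covers the know-how and difficulties of a real-world rollout: how to build the training logic, how to evaluate and improve trained models, and how to tune Spark MLlib and avoid its pitfalls. We hope to convey the benefits of using PredictionIO when introducing machine learning into a web system, so please join us.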

takahiro-hagino

May 20, 2017

Transcript

  1. x ML

    Matching job seekers with job postings (Search Quality and Recommendation), Salary Prediction, Industry and Job Category Prediction, Prediction of Job Characteristics
  8. Machine Learning Stacks

    Apps: API Server (Tornado…)
    Algorithm: scikit-learn, SparkML … DL: Caffe2, DL4J, TensorFlow, Chainer …
    Processing: Hadoop, Spark, Storm …
    Datastore: Elasticsearch, HBase, Redshift …
  9. Machine Learning Stacks

    Apps: API Server
    Algorithm: scikit-learn, SparkML … DL: Caffe2, DL4J, TensorFlow, Chainer …
    Processing: Hadoop, Spark, Storm …
    Datastore: Elasticsearch, HBase, Redshift …
    PredictionIO
  10. The most-starred repositories on GitHub?

    spark (apache/spark) ★ 12.8k
    incubator-predictionio (apache/incubator-predictionio) ★ 10.2k
    playframework (playframework/playframework) ★ 9.3k
    scala (scala/scala) ★ 8.2k
  11. Apache PredictionIO?

    Apache PredictionIO (incubating) is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task.
  12. Apache PredictionIO?

    A machine learning server that combines state-of-the-art open source software.
  13. Apache PredictionIO?

    A machine learning server that combines state-of-the-art open source software. It lets you build a predictive engine for any machine learning task.
  14. Apache PredictionIO lets you

    quickly build and deploy an engine as a web service on production with customizable templates (create a template for each target problem and deploy it right away);
    respond to dynamic queries in real-time once deployed as a web service (there is an API that takes a query and returns a result);
  15. Apache PredictionIO lets you

    evaluate and tune multiple engine variants systematically (there are also mechanisms for tuning errors and for evaluation);
    unify data from multiple platforms in batch or in real-time for comprehensive predictive analytics (there is an interface for registering training data in batch or in real time);
  16. PIO CLI

    eventserver: Launch an Event Server
    app: Manage apps that are used by the Event Server
    build: Build an engine at the current
    train: Kick off a training using an engine
    deploy: Deploy an engine as an engine server
  17. System Architecture

    Apache Hadoop up to 2.7.2 (required only if YARN and HDFS are needed)
    Apache HBase up to 1.2.4
    Apache Spark up to 1.6.3 for Hadoop 2.6 (not Spark 2.x version)
    Elasticsearch up to 1.7.5 (not the Elasticsearch 2.x version)
  18. Recommendation? JOB A JOB B Cafe Waiter Shibuya JOB C

    View Restaurant Waiter Shibuya Startup Programmer Roppongi
  20. Recommendation? JOB A JOB B Cafe Waiter Shibuya JOB C

    View Restaurant Waiter Shibuya Startup Programmer Roppongi Item-Based Recommendation
  21. Collaborative Filtering

    User      Job A     Job B     Job C     Similarity
    User X    View      Through   -         1
    User A    View      Through   View      1
    User B    Through   View      Through   -1
    User C    View      View      View      0.5
    Recommended: 1.5
  22. Collaborative Filtering

    User      Job A     Job B     Job C     Similarity
    User X    View      Through   -
    User A    View      Through   View      1
    User B    Through   View      Through   -1
    User C    View      View      View      0.5
    Recommended: 1.5
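The "Recommended 1.5" figure on the collaborative filtering slide can be reproduced with a similarity-weighted sum. The sketch below is one reading of the slide, not the speaker's code: it assumes "View" is encoded as 1 and "Through" (shown but skipped) as 0, it copies the similarity values from the slide instead of computing them, and it assumes the score belongs to Job C, the only job User X has not yet seen. The function and variable names are illustrative.

```python
# Assumed encoding of the slide's interaction table: View = 1, Through = 0.
VIEW, THROUGH = 1, 0

interactions = {
    "User A": {"Job A": VIEW, "Job B": THROUGH, "Job C": VIEW},
    "User B": {"Job A": THROUGH, "Job B": VIEW, "Job C": THROUGH},
    "User C": {"Job A": VIEW, "Job B": VIEW, "Job C": VIEW},
}

# Similarity of each neighbour to User X, copied from the slide (not computed here).
similarity_to_x = {"User A": 1.0, "User B": -1.0, "User C": 0.5}


def score(job, interactions, similarity):
    """Sum every neighbour's interaction with `job`, weighted by similarity."""
    return sum(similarity[user] * jobs[job] for user, jobs in interactions.items())


job_c_score = score("Job C", interactions, similarity_to_x)
print(job_c_score)  # 1.5
```

Users who behave like User X pull the score up, and users who behave in the opposite way contribute nothing or pull it down, which is the intuition the table illustrates.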
  23. Click Log Favorite Log Elasticsearch v5.3 cluster Event Server ALS

    Template pio import Data Spark 2 node cluster RDD
  24. Click Log Favorite Log Elasticsearch v5.3 cluster Event Server ALS

    Template pio import Data LOCALFS Spark 2 node cluster RDD Model
  25. Click Log Favorite Log Elasticsearch v5.3 cluster Event Server ALS

    Template pio import Data LOCALFS Spark 2 node cluster RDD Model Query Predicted Result
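The two HTTP touchpoints in this diagram, sending events to the Event Server and querying the deployed engine, can be sketched as follows. This is a minimal illustration under assumptions: the host names, the ports (7070 and 8000 are the usual PredictionIO defaults), the access key placeholder, and the `user`/`num` query fields (taken from the stock recommendation template) are not from the slides, and the talk's own engine may use a different query schema. The code only builds the requests and does not send them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

EVENT_SERVER = "http://localhost:7070"   # assumed Event Server address
ENGINE_SERVER = "http://localhost:8000"  # assumed deployed engine address
ACCESS_KEY = "YOUR_ACCESS_KEY"           # placeholder, issued per app


def build_event_request(user_id, job_id, event="view"):
    """Build a POST that records one user-to-job event on the Event Server."""
    payload = {
        "event": event,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": job_id,
    }
    query_string = urlencode({"accessKey": ACCESS_KEY})
    return Request(
        f"{EVENT_SERVER}/events.json?{query_string}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_request(user_id, num=5):
    """Build a POST that asks the deployed engine for `num` recommendations."""
    payload = {"user": user_id, "num": num}
    return Request(
        f"{ENGINE_SERVER}/queries.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event_req = build_event_request("user-1", "job-c")
query_req = build_query_request("user-1")
# To send either request, pass it to urllib.request.urlopen() with the servers running.
```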
  26. D-A-S-E

    D: Data Source and Data Preparator
    A: Algorithm
    S: Serving
    E: Evaluation Metrics
  27. Machine Learning Flow

    Training Data, Machine Learning Algorithm, Predictive Model, Preprocessing, Input Data, Prediction Server, Predicted Result
  33. Machine Learning Flow

    Training Data, Machine Learning Algorithm, Predictive Model, Preprocessing, Input Data, Predictive Model, Predicted Result
    D: Data Source & Preparator
  34. Machine Learning Flow

    Training Data, Machine Learning Algorithm, Predictive Model, Preprocessing, Input Data, Predictive Model, Predicted Result
    D: Data Source & Preparator
    A: Algorithm
  35. Machine Learning Flow

    Training Data, Machine Learning Algorithm, Predictive Model, Preprocessing, Input Data, Predictive Model, Predicted Result
    D: Data Source & Preparator
    A: Algorithm
    S: Serving
  36. Machine Learning Flow

    Training Data, Machine Learning Algorithm, Predictive Model, Preprocessing, Input Data, Predictive Model, Predicted Result
    D: Data Source & Preparator
    A: Algorithm
    S: Serving
    E: Evaluation Metrics
  37. D

  38. D A

  40. D

  43. A

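  44. Algorithm

    • Implement train()
    • Responsible for training the predictive model
    • Invoked by the pio train command
    • The trained model is stored in HDFS (LocalFS)
    • Implement predict()
    • Called in real time for each query after deployment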
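The division of labor between train() and predict() can be sketched outside PredictionIO. The real controller API is Scala, so the Python below only mirrors the shape described on the slide: a train() that produces a model, a model persisted to local storage, and a predict() that answers a query from the stored model. The class name, the popularity-count model (swapped in here for ALS to keep the example short), the JSON file format, and the `seen`/`num` query fields are all hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


class PopularityAlgorithm:
    """Hypothetical stand-in for an engine's Algorithm component."""

    def train(self, events):
        # Batch step: count views per job from (user, job) pairs.
        return dict(Counter(job for _, job in events))

    def predict(self, model, query):
        # Online step: rank jobs by view count, skipping ones already seen.
        seen = set(query.get("seen", []))
        ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))
        return [job for job, _ in ranked if job not in seen][: query["num"]]


def save_model(model, path):
    Path(path).write_text(json.dumps(model))


def load_model(path):
    return json.loads(Path(path).read_text())


events = [("u1", "job-a"), ("u2", "job-a"), ("u2", "job-c"),
          ("u3", "job-c"), ("u3", "job-b"), ("u4", "job-c")]

algo = PopularityAlgorithm()
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.json"
    save_model(algo.train(events), model_path)   # the training side
    model = load_model(model_path)               # the serving side
    result = algo.predict(model, {"seen": ["job-a"], "num": 2})

print(result)  # ['job-c', 'job-b']
```

Keeping the model on disk between the two steps is what lets training run as an occasional batch job while prediction stays cheap enough to answer queries in real time.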
  47. Precision@k

    Precision@5 / Threshold = 2.0
    Predicted: A, B, C, D, E
    Validation: A ★☆☆, B ★★★, X ★★☆, D ☆☆☆, E ★★☆
  49. Precision@k

    Precision@5 / Threshold = 2.0
    Predicted: A, B, C, D, E
    Validation: A ★☆☆, B ★★★, X ★★☆, D ☆☆☆, E ★★☆
    PositiveCount: 2.0
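The Precision@k example can be worked through in a few lines. This sketch makes several assumptions that the slide does not state: the stars are read as numeric ratings from 0 to 3, an item counts as relevant when its rating is at least the threshold, the slide's "PositiveCount: 2.0" is taken to be the number of predicted items that are relevant, and the final ratio uses the textbook denominator k. Other denominators are also in use, so the ratio may differ from what the engine's own metric reports.

```python
def precision_at_k(predicted, ratings, k, threshold):
    """Return (hits, precision) for the top-k predicted items.

    An item is a hit if it appears in `ratings` with a rating >= threshold.
    """
    top_k = predicted[:k]
    relevant = {item for item, rating in ratings.items() if rating >= threshold}
    hits = sum(1 for item in top_k if item in relevant)
    return hits, hits / k


predicted = ["A", "B", "C", "D", "E"]
# Validation ratings from the slide, with stars converted to numbers.
validation = {"A": 1.0, "B": 3.0, "X": 2.0, "D": 0.0, "E": 2.0}

hits, precision = precision_at_k(predicted, validation, k=5, threshold=2.0)
print(hits, precision)  # 2 0.4
```

Items B and E are the two hits. X is relevant but was never predicted, and A and D fall below the threshold.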
  50. Prediction for Reject Ratio (document screening pass rate, job offer rate, offer acceptance rate)

    Salary Prediction for job postings
    Job description writing-bot (automatic generation of job posting content)
  51. JPIOUG Meetup 02, Open Source Machine Learning Server

    FRI, June 30, 19:30 @Shibuya. Urgently seeking lightning talk (LT) speakers.
  52. JPIOUG Meetup 03, Open Source Machine Learning Server

    WED, August 30, 19:30 @Shibuya. Casually seeking lightning talk (LT) speakers.