Arm Treasure Data Internship 2018 Final Report

Masaki KOBAYASHI

September 20, 2018

Transcript

  1. Arm Treasure Data Summer Intern 2018 Final Report
    - 2018/09/20
    - Masaki Kobayashi
    - #td_intern
  2. Who am I?
    - Masaki Kobayashi (@_Makky_, @makky3939)
    - Master's student, University of Tsukuba
    - Research Topics: Crowdsourcing
    - Hobbies: JavaScript, Photography
  3. About My Internship
    - 2018/08/01 ~ 2018/09/28
    - Mentors: @makimoto, @myui
  4. What I did
    - I developed LDA-board
    - A topic model visualization tool using LDA
    - https://github.com/treasure-data/lda-board
  5. Outline of This Talk
    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  7. Background: Understanding Users from Collected Data
    - We can grasp what kinds of people are using our service
    - We can provide user-oriented features, e.g. suggesting or recommending something based on user data (or activity logs)
  8. LDA: Latent Dirichlet Allocation
    - One of the most popular document clustering methods
    - [Figure: documents grouped into topics A, B, C]
  9. LDA: Latent Dirichlet Allocation
    - LDA can be applied to any kind of clustering, including user clustering (not only documents).
    - TD users can use LDA from Hive QL: Hivemall has an implementation of LDA.
    - A digdag workflow template for running LDA is already available.
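To make the "users as documents" idea on slide 9 concrete, here is a minimal Python sketch. It assumes a hypothetical activity log of `(user_id, token)` events. The function names and sample data are illustrative only and are not part of Hivemall or LDA-board. The `word:count` strings are meant to resemble the feature strings that Hivemall's LDA takes as input; treat that format as an assumption to verify against the Hivemall documentation.

```python
from collections import Counter, defaultdict


def build_user_documents(activity_log):
    """Group (user_id, token) events into one bag of words per user."""
    documents = defaultdict(Counter)
    for user_id, token in activity_log:
        documents[user_id][token] += 1
    return documents


def to_features(bag_of_words):
    """Render a bag of words as sorted "word:count" strings (assumed format)."""
    return [f"{word}:{count}" for word, count in sorted(bag_of_words.items())]


# Hypothetical activity log: each event is (user_id, token).
activity_log = [
    ("u1", "sushi"), ("u1", "sushi"), ("u1", "ramen"),
    ("u2", "javascript"), ("u2", "rails"), ("u2", "javascript"),
]

user_documents = build_user_documents(activity_log)
for user_id in sorted(user_documents):
    print(user_id, to_features(user_documents[user_id]))
```

Each user then plays the role of one document, so a topic model groups users with similar activity in the same way it groups documents with similar vocabulary.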
  10. Treasure Data Customers Use Case
    - [Figure: users grouped into topics A, B, C, with example terms Sushi, JavaScript, Rails]
  11. Topic Model Visualization: understanding the meaning of each topic
    - How were the documents divided?
    - Which topic contains this document (or user)?
    - What is the relationship between topics? (close? nested?)
  12. Related Projects
    - pyLDAvis [1]
    - BigML [2]
    - Limitations noted on the slide: can't check the documents of each topic; can't filter topics by term
    [1] http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
    [2] https://bigml.com/shared/topicmodel/shSAsibY4TdQGIZvjjwtjs85cXx/topics
  13. Issues
    - There is no simple way to visualize the output of Hivemall's LDA.
    - Of course, we can use pyLDAvis if we transform Hivemall's output.
    - But not everyone can use Python / Jupyter Notebook.
    - In Treasure Data customers' use cases, the tool is mainly used for user analysis.
    - They want to inspect various information about the users belonging to each topic.
  14. Development Requirements
    1. Search and explore by term
    2. Visualize clusters in two-dimensional space
    3. Retrieve docids/userids in the specified topic
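Requirements 1 and 3 on slide 14 can be sketched in a few lines of Python. The data layout below (topic-to-term weights and document-to-topic proportions as plain dictionaries) and both function names are assumptions made for illustration; they do not describe LDA-board's actual data model. Requirement 2 (the two-dimensional layout) is not covered here.

```python
def filter_topics_by_term(topic_terms, term, min_weight=0.0):
    """Return the ids of topics whose weight for `term` exceeds `min_weight`."""
    return sorted(
        topic for topic, weights in topic_terms.items()
        if weights.get(term, 0.0) > min_weight
    )


def docids_in_topic(doc_topics, topic):
    """Return docids (or userids) whose most probable topic is `topic`."""
    return sorted(
        docid for docid, proportions in doc_topics.items()
        if max(proportions, key=proportions.get) == topic
    )


# Hypothetical model output.
topic_terms = {
    "A": {"sushi": 0.6, "ramen": 0.3},
    "B": {"javascript": 0.5, "react": 0.4},
    "C": {"rails": 0.7, "javascript": 0.2},
}
doc_topics = {
    "u1": {"A": 0.8, "B": 0.1, "C": 0.1},
    "u2": {"A": 0.1, "B": 0.6, "C": 0.3},
    "u3": {"A": 0.2, "B": 0.3, "C": 0.5},
}

print(filter_topics_by_term(topic_terms, "javascript"))  # topics B and C
print(docids_in_topic(doc_topics, "C"))  # only u3 has C as its top topic
```

Assigning each document to its single most probable topic is one simple choice; a threshold on the topic proportion would let a document appear under several topics.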
  15. Outline of This Talk
    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  16. LDA-board
    - A topic model visualization tool using LDA
    - Focused on user data (e.g. activity logs)
    - Makes it easy to visualize data collected in Treasure Data
  17. Quick start
    0. Register your LDA workflow
    1. Sign in with TD API Key
    2. Run the workflow
    3. Visualize
  22. Outline of This Talk
    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  23. Features of LDA-board
    1. Manage workflow executions
    2. Run a workflow with parameter settings
    3. Fetch the result of the workflow from TD
    4. Visualize the result
  24. Manage workflow executions

  25. Run a workflow with parameter settings

  26. Run a workflow with parameter settings
    - Based on https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/lda
    - Task (lda.dig): + tokenize, + prepare_input_table, + train_lda, + post_train
  27. Run a workflow with parameter settings
    https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/lda
    Task (lda.dig):
      _export:
        !include : config/params.yml
        num_topics: "${typeof(session_num_topics) === 'undefined' ? default_num_topics : session_num_topics}"
      + tokenize
      + prepare_input_table
      + train_lda
      + post_train
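The `num_topics` expression on slide 27 uses a per-session value when one is supplied and otherwise falls back to a default. The same logic, written as a small Python sketch for readers less familiar with the embedded JavaScript ternary: the dictionaries stand in for the workflow's parameter sources, and the default of 10 is a made-up value.

```python
def resolve_num_topics(session_params, defaults):
    """Prefer a per-session value; otherwise fall back to the default."""
    if "session_num_topics" in session_params:
        return session_params["session_num_topics"]
    return defaults["default_num_topics"]


defaults = {"default_num_topics": 10}  # hypothetical default value

print(resolve_num_topics({}, defaults))                            # falls back to the default
print(resolve_num_topics({"session_num_topics": 20}, defaults))    # session value wins
```

Checking for the key rather than for a truthy value mirrors the `typeof ... === 'undefined'` test: only a missing parameter triggers the fallback.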
  28. Run a workflow with parameter settings
    https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/lda
    Task (lda.dig): + tokenize, + prepare_input_table, + train_lda, + post_train, + post_predict, + dimension_reduction, + topic_proportion
    Task detail shown on the slide:
      + prepare_pca_input:
      + run_pca:
        py>: tasks.DimensionReduction.jspca
        docker:
          image: "python:3.6.5"
        _env:
          python_apikey: ${secret:python_apikey}
  29. Fetch the result of the workflow from TD

  30. Visualize the result

  31. Visualize the result - Filter topics by term

  32. Show topic terms and elements

  33. Additional contents in Topic elements

  35. Architecture
    - Browser: Client App (React / Redux)
    - Amazon ECS: API Server (Rails)
    - Amazon RDS: PostgreSQL
    - External APIs: https://api.treasuredata.com, https://api-workflow.treasuredata.com
  36. Try with local docker-compose

  37. Outline of This Talk
    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  38. Evaluation
    - Development Requirements
    - Feedback from TD Customers (LDA users)
  39. Development Requirements
    1. Search and explore by term
    2. Visualize clusters in two-dimensional space
    3. Retrieve docids/userids in the specified topic
  40. Feedback from TD Customers (LDA users)
    - Want to use <Link> in the Additional contents area, e.g. to link to a user's detail page
    - Want to show Additional contents for Topic terms, e.g. terms -> stores
    - Want to run only the post_predict part, to avoid the LDA training time
  41. Future Tasks (things I want to try in the final week)
    - Prepare for publication
    - Documentation
    - Add more examples for expected use cases
  42. Conclusion
    - I developed LDA-board
    - A topic model visualization tool using LDA
    - github.com/treasure-data/lda-board
  43. Acknowledgements
    - Special thanks to @makimoto-san, @myui-san, and all Treasure Data members