Arm Treasure Data Internship 2018 Final Report

Masaki KOBAYASHI

September 20, 2018

Transcript

  1. Who am I?

    - Masaki Kobayashi (@_Makky_, @makky3939)
    - Master's student, University of Tsukuba
    - Research Topics: Crowdsourcing
    - Hobbies: JavaScript, Photography
  2. What I did

    - I developed LDA-board
    - A topic model visualization tool using LDA
    - https://github.com/treasure-data/lda-board
  3. Outline of This Talk

    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  5. Background: Understanding Users from Collected Data

    • We can grasp what kind of people are using our service
    • We can provide user-oriented features, e.g. suggesting or recommending something based on user data (or activity logs)
  6. LDA: Latent Dirichlet Allocation

    One of the most popular document clustering methods.
    [Figure: documents grouped into topics A, B, and C]
  7. LDA: Latent Dirichlet Allocation

    - LDA is also applicable to other kinds of clustering, including user clustering (not only documents).
    - TD users can use LDA from Hive QL: Hivemall has an implementation of LDA.
    - A digdag workflow template for using LDA is already prepared.
  8. Topic Model Visualization: Understanding the meaning of each topic

    - How were the documents divided?
    - Which topic contains this document (or user)?
    - What is the relationship between topics? (close? nested?)
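The last two questions on this slide can be made concrete with a small sketch. It is not code from LDA-board: the data, the function names, and the choice of Jensen-Shannon divergence as the measure of how "close" two topics are are all assumptions made for illustration.

```python
import math


def dominant_topic(proportions):
    """Return the topic label with the largest proportion for one document."""
    return max(proportions, key=proportions.get)


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two term distributions.

    p and q map term -> probability; a missing term counts as probability 0.
    The result is 0 for identical distributions and 1 for disjoint ones.
    """
    terms = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a):
        # KL(a || mix); terms with zero probability contribute nothing.
        return sum(a[t] * math.log2(a[t] / mix[t]) for t in a if a[t] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical per-user topic proportions and per-topic term distributions.
doc_topics = {
    "user_1": {"A": 0.7, "B": 0.2, "C": 0.1},
    "user_2": {"A": 0.1, "B": 0.3, "C": 0.6},
}
topic_terms = {
    "A": {"camera": 0.6, "lens": 0.4},
    "B": {"camera": 0.5, "lens": 0.5},
    "C": {"recipe": 0.7, "oven": 0.3},
}

closest_to_a = min(
    (t for t in topic_terms if t != "A"),
    key=lambda t: js_divergence(topic_terms["A"], topic_terms[t]),
)
```

Here `dominant_topic` answers "which topic contains this user?", and `closest_to_a` picks the topic whose term distribution is nearest to topic A.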
  9. Related Projects

    - pyLDAvis [1]
    - BigML [2]

    Limitations noted on the slide: can't check the documents of each topic; can't filter topics by term.

    [1] http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
    [2] https://bigml.com/shared/topicmodel/shSAsibY4TdQGIZvjjwtjs85cXx/topics
  10. Issues

    - There is no simple method to visualize Hivemall's LDA output.
    - Of course, we can use pyLDAvis if we transform the output of Hivemall.
    - Not everyone can use Python / Jupyter Notebook.
    - In Treasure Data users' use cases, the tool is mainly used for user analysis: they want to observe various information about the users belonging to each topic.
  11. Development Requirements

    1. Search and explore by term
    2. Visualize clusters in two-dimensional space
    3. Retrieve docids/userids in the specified topic
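Requirements 1 and 3 amount to two lookups over the model output. The sketch below assumes that output is available as plain rows of `(topic, term, weight)` and `(docid, topic, probability)`. Those row shapes, the sample data, and both function names are hypothetical, not taken from LDA-board or Hivemall.

```python
from collections import defaultdict

# Hypothetical model output rows.
topic_term_rows = [
    (0, "camera", 0.5), (0, "lens", 0.3), (0, "tripod", 0.2),
    (1, "recipe", 0.6), (1, "oven", 0.3), (1, "camera", 0.1),
]
doc_topic_rows = [
    ("u1", 0, 0.8), ("u1", 1, 0.2),
    ("u2", 0, 0.3), ("u2", 1, 0.7),
    ("u3", 0, 0.9), ("u3", 1, 0.1),
]


def topics_containing(term, rows, top_n=10):
    """Return the topics whose top_n terms (by weight) include `term`."""
    by_topic = defaultdict(list)
    for topic, word, weight in rows:
        by_topic[topic].append((weight, word))
    hits = []
    for topic, pairs in by_topic.items():
        top_words = [word for _, word in sorted(pairs, reverse=True)[:top_n]]
        if term in top_words:
            hits.append(topic)
    return sorted(hits)


def docids_in_topic(topic, rows):
    """Return the docids whose most probable topic is `topic`."""
    best = {}  # docid -> (topic, probability) with the highest probability seen
    for docid, label, prob in rows:
        if docid not in best or prob > best[docid][1]:
            best[docid] = (label, prob)
    return sorted(docid for docid, (label, _) in best.items() if label == topic)
```

Assigning each docid to its single most probable topic is one possible reading of "in the specified topic"; a probability threshold would be an equally reasonable alternative.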
  12. Outline of This Talk

    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  13. LDA-board

    - A topic model visualization tool using LDA
    - Focused on user data (e.g. activity logs)
    - Can easily visualize data collected in Treasure Data
  14. Quick start

    0. Register your LDA workflow
    1. Sign in with TD API Key
    2. Run the workflow
    3. Visualize
  19. Outline of This Talk

    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  20. Features of LDA-board

    1. Manage workflow executions
    2. Run a workflow with specified params
    3. Fetch the workflow result from TD
    4. Visualize the result
  21. Run a workflow with specified params

    https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/lda

    Task (lda.dig):

        _export:
          !include : config/params.yml
          num_topics: "${typeof(session_num_topics) === 'undefined' ? default_num_topics : session_num_topics}"

        +tokenize
        +prepare_input_table
        +train_lda
        +post_train
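The `num_topics` expression on this slide is a fallback: use the value supplied for this session if there is one, otherwise use the default from the params file. The same rule as a minimal Python sketch (the function name and the dictionary of session parameters are hypothetical; the real evaluation happens inside digdag's `${...}` expression):

```python
def resolve_num_topics(session_params, default_num_topics):
    """Return the per-session topic count if one was supplied, else the default."""
    if "session_num_topics" in session_params:
        return session_params["session_num_topics"]
    return default_num_topics


# No override supplied: fall back to the default.
without_override = resolve_num_topics({}, default_num_topics=10)
# Override supplied for this run: it wins over the default.
with_override = resolve_num_topics({"session_num_topics": 20}, default_num_topics=10)
```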
  22. Run a workflow with specified params

    https://github.com/treasure-data/workflow-examples/tree/master/machine-learning/lda

    Task (lda.dig):

        +tokenize
        +prepare_input_table
        +train_lda
        +post_train
        +post_predict
        +dimension_reduction
        +topic_proportion

    Task detail shown on the slide:

        +prepare_pca_input:
        +run_pca:
          py>: tasks.DimensionReduction.jspca
          docker:
            image: "python:3.6.5"
          _env:
            python_apikey: ${secret:python_apikey}
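The slide names a `jspca` task but does not show its implementation. As a stand-in, the sketch below uses plain PCA (power iteration with deflation, standard library only) to place topic vectors in two dimensions. It illustrates the general idea of dimension reduction and should not be read as the method LDA-board uses; the function names and sample vectors are hypothetical.

```python
import math
import random


def _matvec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def _top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in this matrix
            return 0.0, [0.0] * len(v)
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, _matvec(matrix, v)))
    return eigenvalue, v


def pca_2d(rows):
    """Project each row vector onto the first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    lam1, pc1 = _top_eigenpair(cov)
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - lam1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = _top_eigenpair(deflated)
    return [(sum(a * b for a, b in zip(r, pc1)),
             sum(a * b for a, b in zip(r, pc2))) for r in centered]


# Hypothetical topic-term distributions: one row per topic, one column per term.
topic_vectors = [
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.1, 0.6, 0.3],
]
coords = pca_2d(topic_vectors)
```

Each entry of `coords` is an (x, y) pair that could be plotted directly; topics with similar term distributions end up near each other.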
  23. Architecture

    [Architecture diagram]
    - Browser: Client App (React / Redux)
    - Amazon ECS: API Server (Rails)
    - Amazon RDS: PostgreSQL
    - Treasure Data endpoints: https://api.treasuredata.com, https://api-workflow.treasuredata.com
  24. Outline of This Talk

    - Background & Related Projects
    - What is LDA-board?
    - Features & Architecture
    - Evaluation & Conclusion
  25. Development Requirements

    1. Search and explore by term
    2. Visualize clusters in two-dimensional space
    3. Retrieve docids/userids in the specified topic
  26. Feedback from TD Customers (LDA users)

    - Want to use <Link> in the Additional contents area, e.g. to link to a user's detail page
    - Want to show Additional contents in Topic terms, e.g. terms -> stores
    - Want to run just the post_predict part, to avoid the LDA training time
  27. Future Tasks (want to try in the final week)

    - Prepare for publication
    - Documentation
    - Add more examples for expected use cases
  28. Conclusion

    - I developed LDA-board
    - A topic model visualization tool using LDA
    - github.com/treasure-data/lda-board