カンファレンスセッションの選択傾向を知りたい / Let’s study trends of entry to conference sessions

Slide 1

Slide 1 text

カンファレンスセッションの選択傾向を知りたい / Let’s study trends of entry to conference sessions
Kanazawa.rb meetup #84
Takayuki Takagi

Slide 2

Slide 2 text

Who am I?
● Takayuki Takagi (高木貴之 / ニボシーニョ)
● @TAKAyuki_atkwsk / takayukiatkwsk
● Freelance programmer
● Remote work
● Scala, Ruby, Python, AWS, Docker, etc.
● Likes beer and gyoza

Slide 3

Slide 3 text

Today’s topic
● I want to find trends in the sessions selected for conferences
  ○ Extract characteristic words from session titles and descriptions with NLP (Natural Language Processing)
  ○ I’m NOT familiar with NLP, so I want to use the easiest tools available
  ○ Easy tools: Cloud APIs

Slide 4

Slide 4 text

Cloud APIs for NLP
● Amazon Comprehend API (AWS)
● Cloud Natural Language API (GCP)
  ○ Syntactic analysis
● Text Analytics API (Azure)
  ○ Key-phrase extraction API
-> These APIs can be used directly on Japanese text!!

Slide 5

Slide 5 text

Make input data
● Copy session titles and descriptions into a spreadsheet
  ○ Japan Container Days 2018 (no descriptions)
  ○ Scala Kansai Summit 2018
  ○ JAWS DAYS 2019
  ○ Scala Matsuri 2019
  ○ Google Cloud Next Tokyo 2019
● Export as CSV (script input)
  ○ id, title
  ○ id, title + description
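
The CSV export described above could be read with a few lines of Python. This is a minimal sketch, not the code from the talk's repository: it assumes each file has two columns (an id and a text field) and no header row. The function name `load_texts` and the sample titles are made up for illustration.

```python
import csv
import io


def load_texts(csv_file):
    """Read rows of (id, text) into a dict keyed by session id.

    Assumes two columns and no header row (an assumption, not confirmed
    by the slides). Rows with fewer than two columns are skipped.
    """
    texts = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue
        session_id, text = row[0], row[1]
        texts[session_id] = text.strip()
    return texts


# Made-up sample standing in for the exported "id, title" file.
sample = io.StringIO('1,"Kubernetes 入門"\n2,"Scala で学ぶ関数型, 基礎編"\n')
sessions = load_texts(sample)
```

With a real export, `open("titles.csv", newline="", encoding="utf-8")` would take the place of the `io.StringIO` sample; the file name is hypothetical.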

Slide 6

Slide 6 text

Analysis methods
1. Extract key-phrases with Text Analytics API
2. Analyze syntax with Cloud Natural Language API
3. Analyze syntax with MeCab + NEologd (for comparison)
● Source code
  ○ https://github.com/TAKAyukiatkwsk/session_analytics_sample

Slide 7

Slide 7 text

Extract key-phrases with Text Analytics API
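
As a rough illustration of this step, the sketch below builds a request body and reads a response for key-phrase extraction. It assumes the JSON shapes of the Text Analytics v2.1 REST API (a `documents` array of `id` / `language` / `text` objects in the request, and `id` / `keyPhrases` objects in the response). The helper names are hypothetical, and no HTTP call is made, so the endpoint URL and subscription key are left out.

```python
def build_key_phrase_request(texts, language="ja"):
    """Turn {id: text} into a request body for key-phrase extraction.

    The body shape is an assumption based on the v2.1 REST API.
    """
    return {
        "documents": [
            {"id": str(doc_id), "language": language, "text": text}
            for doc_id, text in texts.items()
        ]
    }


def parse_key_phrase_response(body):
    """Turn a response body into {id: [key phrases]}.

    Documents that failed are reported under "errors" and are ignored here.
    """
    return {doc["id"]: doc["keyPhrases"] for doc in body.get("documents", [])}


# Made-up example data, not taken from the talk.
request_body = build_key_phrase_request({"1": "Kubernetes 入門"})
sample_response = {
    "documents": [{"id": "1", "keyPhrases": ["Kubernetes"]}],
    "errors": [],
}
phrases_by_session = parse_key_phrase_response(sample_response)
```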

Slide 8

Slide 8 text

Key-phrase frequency
● [title] Finds technical words, but their frequency is low (max = 4, mostly 1)
● [title + description] Frequencies are higher (max = 9), but the frequent phrases are not technical words
● “Kubernetes” appears 4 times in title, but only 3 times in title + description
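
The frequencies discussed here can be thought of as a simple count of how many sessions each key phrase appears in. A minimal sketch with `collections.Counter` follows; the function name and the sample phrases are made up, and the talk's own script may count differently.

```python
from collections import Counter


def key_phrase_frequency(phrases_by_session):
    """Count in how many sessions each key phrase appears."""
    counter = Counter()
    for phrases in phrases_by_session.values():
        # set() guards against a phrase being listed twice for one session.
        counter.update(set(phrases))
    return counter


# Made-up key phrases for three sessions.
sample_phrases = {
    "1": ["Kubernetes", "入門"],
    "2": ["Kubernetes", "運用"],
    "3": ["Scala"],
}
frequency = key_phrase_frequency(sample_phrases)
top_phrase = frequency.most_common(1)
```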

Slide 9

Slide 9 text

Analyze syntax with Cloud Natural Language API

Slide 10

Slide 10 text

N-gram Frequency
● [title unigram] More general topics (e.g. Scala, Kubernetes, サービス (service), コンテナ (container), Cloud)
● Trends: Scala, Kubernetes, Akka, 機械学習 (machine learning), サーバーレス (serverless), Cloud Spanner, マイクロサービス (microservices)

Slide 11

Slide 11 text

N-gram Frequency
● [title + desc bigram] Produces more understandable phrases than the title bigram
● “分散トレーシング” (distributed tracing) is a characteristic phrase
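
Unigram and bigram frequencies like the ones on these slides can be computed from any tokenizer's output. The sketch below assumes the tokens have already been produced (by the Natural Language API or MeCab) and uses hand-made token lists instead; the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequency(tokenized_docs, n):
    """Count n-grams over all documents; n-grams never span two documents."""
    counter = Counter()
    for tokens in tokenized_docs:
        counter.update(ngrams(tokens, n))
    return counter


# Hand-made tokens standing in for real tokenizer output.
docs = [
    ["分散", "トレーシング", "入門"],
    ["分散", "トレーシング", "の", "実践"],
]
unigram_counts = ngram_frequency(docs, 1)
bigram_counts = ngram_frequency(docs, 2)
```

Counting per document keeps the last word of one session from pairing with the first word of the next.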

Slide 12

Slide 12 text

Analyze syntax with MeCab

Slide 13

Slide 13 text

N-gram Frequency
● “型” (type) is tokenized as a noun (the Natural Language API treats it as an affix)
● “機械学習” (machine learning) and “サーバーレス” (serverless) are each tokenized as one word
● “関数型” (functional) is a characteristic phrase

Slide 14

Slide 14 text

N-gram Frequency
● No abstract words such as “よう”, “こと”, or “ため” appear
● “GraphQL” and “マイクロサービス” (microservices) are each tokenized as one word (not in chart)
● “分散トレーシング” (distributed tracing) is a characteristic phrase

Slide 15

Slide 15 text

Results
● Trends: Kubernetes, Serverless, Scala, 分散トレーシング (distributed tracing), 関数型 (functional), Akka, Cloud Spanner
  ○ The frequencies depend on the quality of the titles and descriptions
● Cloud APIs are useful
● Key-phrase extraction alone is not enough; using N-grams as well works better
● MeCab + NEologd can analyze better than the Natural Language API (in Japanese / a specific category?)

Slide 16

Slide 16 text

Results
● If you are familiar with NLP, please teach me NLP and analysis methods!!!

Slide 17

Slide 17 text

References
● Cloud Natural Language | Cloud Natural Language API | Google Cloud
  ○ https://cloud.google.com/natural-language/?hl=ja
● Text Analytics API とは - 機能 - Azure Cognitive Services | Microsoft Docs
  ○ https://docs.microsoft.com/ja-jp/azure/cognitive-services/text-analytics/overview
● Amazon Comprehend（テキストのインサイトや関係性を検出） | AWS
  ○ https://aws.amazon.com/jp/comprehend/
● TF-IDFで見る評価の高いラーメン屋の口コミ傾向（自然言語処理, TF-IDF, Mecab, wordcloud, 形態素解析、分かち書き） - ギークなエンジニアを目指す男
  ○ https://www.takapy.work/entry/2019/01/14/142128
● N-gramモデルを利用したテキスト分析 ―インデックスページ―
  ○ http://www.shuiren.org/chuden/teach/n-gram/index-j.html