Today’s topic
● I want to know the trends in conference session topics
  ○ Extract characteristic words from session titles and descriptions with NLP (Natural Language Processing)
  ○ But I’m NOT familiar with NLP, so I want tools that are as easy to use as possible
  ○ Easy tools - Cloud APIs
Cloud APIs for NLP
● Amazon Comprehend API (AWS)
● Cloud Natural Language API (GCP)
  ○ Syntactic analysis
● Text Analytics API (Azure)
  ○ Key-phrase extraction
-> These APIs are directly available in Japanese!!
Make input data
● Copy session titles and descriptions to a spreadsheet
  ○ Japan Container Days 2018 (no descriptions)
  ○ Scala Kansai Summit 2018
  ○ JAWS DAYS 2019
  ○ Scala Matsuri 2019
  ○ Google Cloud Next Tokyo 2019
● Export as CSV (script input)
  ○ id, title
  ○ id, title + description
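The CSV export step above can be sketched as follows. This is a minimal illustration, not the repository's actual script; the session rows and helper name are invented for the example. It produces the two variants the slide lists: `id, title` and `id, title + description`.

```python
import csv
import io

# Hypothetical session rows copied from the spreadsheet: (id, title, description).
sessions = [
    (1, "Kubernetes入門", "コンテナオーケストレーションの基礎を解説します"),
    (2, "Akkaで学ぶアクターモデル", "Scalaと分散システムの実践"),
]

def to_csv(rows):
    """Serialize (id, text) rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue()

# Variant 1: id, title
titles_csv = to_csv([(sid, title) for sid, title, _ in sessions])
# Variant 2: id, title + description (concatenated into one text field)
full_csv = to_csv([(sid, f"{title} {desc}") for sid, title, desc in sessions])
```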
Analysis methods
1. Extract key-phrases with Text Analytics API
2. Analyze syntax with Cloud Natural Language API
3. Analyze syntax with MeCab + NEologd (for comparison)
● Source code
  ○ https://github.com/TAKAyukiatkwsk/session_analytics_sample
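For step 1, the request body sent to the Text Analytics key-phrase endpoint looks roughly like the sketch below. The endpoint URL and API version are assumptions (based on the v3.0 REST API shape), not taken from the talk; the point is that each document carries an `id`, a `language` code (`"ja"` selects Japanese), and the text from the exported CSV.

```python
import json

# Hypothetical endpoint; the resource name and API version are assumptions.
ENDPOINT = "https://<resource>.cognitiveservices.azure.com/text/analytics/v3.0/keyPhrases"

def build_key_phrase_request(rows):
    """Build the JSON body for a key-phrase extraction request.

    rows: iterable of (id, text) tuples read from the exported CSV.
    """
    return {
        "documents": [
            {"id": str(sid), "language": "ja", "text": text}
            for sid, text in rows
        ]
    }

body = build_key_phrase_request([(1, "Kubernetesで始めるマイクロサービス")])
payload = json.dumps(body, ensure_ascii=False)
```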
Key-phrase frequency
● [title] technical words are found, but at low frequency (max = 4, mostly 1)
● [title + description] higher frequency (max = 9), but the frequent words are not technical
● “Kubernetes” appears 4 times in title, but only 3 times in title + description
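Counting key-phrase frequency across sessions amounts to flattening the per-document phrase lists and tallying them. A minimal sketch, with invented phrase lists standing in for the API's responses:

```python
from collections import Counter

# Hypothetical key-phrase lists returned per session (one list per document).
key_phrases_per_session = [
    ["Kubernetes", "コンテナ"],
    ["Kubernetes", "サーバーレス"],
    ["Scala", "Kubernetes"],
]

# Count how often each phrase appears across all sessions.
freq = Counter(p for phrases in key_phrases_per_session for p in phrases)
top = freq.most_common(3)  # highest-frequency phrases first
```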
N-gram Frequency
● “型” (“type”) is tokenized as a noun (the Natural Language API treats it as an affix)
● “機械学習” (“machine learning”) and “サーバーレス” (“serverless”) are tokenized as single words
● “関数型” (“functional”) is a characteristic phrase
N-gram Frequency
● Abstract words like “よう”, “こと”, and “ため” (roughly “way”, “thing”, “purpose”) do not appear
● “GraphQL” and “マイクロサービス” (“microservices”) are tokenized as single words (not in chart)
● “分散トレーシング” (“distributed tracing”) is a characteristic phrase
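The n-gram frequencies in these two slides can be computed from tokenized titles as sketched below. The token lists are invented stand-ins for MeCab + NEologd output (NEologd keeps compounds like “機械学習” and “サーバーレス” as single tokens, which is why they show up whole in the counts).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token lists, standing in for MeCab + NEologd output.
tokenized_titles = [
    ["機械学習", "入門"],
    ["サーバーレス", "で", "機械学習"],
    ["関数", "型", "プログラミング"],
]

# Count bigrams across all titles; characteristic phrases like
# ("関数", "型") surface as frequent adjacent pairs.
bigram_freq = Counter(
    g for tokens in tokenized_titles for g in ngrams(tokens, 2)
)
```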
Results
● Trends: Kubernetes, Serverless, Scala, 分散トレーシング (distributed tracing), 関数型 (functional), Akka, Cloud Spanner
  ○ The frequency depends on the quality of titles and descriptions
● Cloud APIs are useful
● Key-phrases alone are not enough; combining them with N-grams is better
● MeCab + NEologd analyzes better than the Natural Language API (for Japanese / this specific category?)