Today’s topic
● I want to know the trends in conference session topics
  ○ Extract characteristic words from session titles and descriptions with NLP (Natural Language Processing)
  ○ But I’m NOT familiar with NLP, so I want tools that are as easy to use as possible
  ○ Easy tools - Cloud APIs
Cloud APIs for NLP
● Amazon Comprehend API (AWS)
● Cloud Natural Language API (GCP)
  ○ Syntactic analysis
● Text Analytics API (Azure)
  ○ Key-phrase extraction
-> These APIs are directly available in Japanese!!
Make input data
● Copy session titles and descriptions to a spreadsheet
  ○ Japan Container Days 2018 (no descriptions)
  ○ Scala Kansai Summit 2018
  ○ JAWS DAYS 2019
  ○ Scala Matsuri 2019
  ○ Google Cloud Next Tokyo 2019
● Export as CSV (script input)
  ○ id, title
  ○ id, title + description
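The CSV export step above can be sketched as follows. This is a minimal illustration, not the repository's actual script; the session rows and helper name are invented for the example. It produces the two variants the slide lists: `id, title` and `id, title + description`.

```python
import csv
import io

# Hypothetical session rows copied from the spreadsheet: (id, title, description).
sessions = [
    (1, "Kubernetes入門", "コンテナオーケストレーションの基礎を解説します"),
    (2, "Akkaで学ぶアクターモデル", "Scalaと分散システムの実践"),
]

def to_csv(rows):
    """Serialize (id, text) rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue()

# Variant 1: id, title
titles_csv = to_csv([(sid, title) for sid, title, _ in sessions])
# Variant 2: id, title + description (concatenated into one text field)
full_csv = to_csv([(sid, f"{title} {desc}") for sid, title, desc in sessions])
```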
Analysis methods
1. Extract key-phrases with Text Analytics API
2. Analyze syntax with Cloud Natural Language API
3. Analyze syntax with MeCab + NEologd (for comparison)
● Source code
  ○ https://github.com/TAKAyukiatkwsk/session_analytics_sample
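For step 1, the request body sent to the Text Analytics key-phrase endpoint looks roughly like the sketch below. The endpoint URL and API version are assumptions (based on the v3.0 REST API shape), not taken from the talk; the point is that each document carries an `id`, a `language` code (`"ja"` selects Japanese), and the text from the exported CSV.

```python
import json

# Hypothetical endpoint; the resource name and API version are assumptions.
ENDPOINT = "https://<resource>.cognitiveservices.azure.com/text/analytics/v3.0/keyPhrases"

def build_key_phrase_request(rows):
    """Build the JSON body for a key-phrase extraction request.

    rows: iterable of (id, text) tuples read from the exported CSV.
    """
    return {
        "documents": [
            {"id": str(sid), "language": "ja", "text": text}
            for sid, text in rows
        ]
    }

body = build_key_phrase_request([(1, "Kubernetesで始めるマイクロサービス")])
payload = json.dumps(body, ensure_ascii=False)
```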
Key-phrase frequency
● [title] technical words are found, but at low frequency (max = 4, mostly 1)
● [title + description] higher frequency (max = 9), but the frequent words are not technical
● “Kubernetes” appears 4 times in title, but only 3 times in title + description
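Counting key-phrase frequency across sessions amounts to flattening the per-document phrase lists and tallying them. A minimal sketch, with invented phrase lists standing in for the API's responses:

```python
from collections import Counter

# Hypothetical key-phrase lists returned per session (one list per document).
key_phrases_per_session = [
    ["Kubernetes", "コンテナ"],
    ["Kubernetes", "サーバーレス"],
    ["Scala", "Kubernetes"],
]

# Count how often each phrase appears across all sessions.
freq = Counter(p for phrases in key_phrases_per_session for p in phrases)
top = freq.most_common(3)  # highest-frequency phrases first
```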
N-gram Frequency
● “型” (“type”) is tokenized as a noun (the Natural Language API treats it as an affix)
● “機械学習” (“machine learning”) and “サーバーレス” (“serverless”) are tokenized as single words
● “関数型” (“functional”) is a characteristic phrase
N-gram Frequency
● Abstract words like “よう”, “こと”, and “ため” (roughly “way”, “thing”, “purpose”) do not appear
● “GraphQL” and “マイクロサービス” (“microservices”) are tokenized as single words (not in chart)
● “分散トレーシング” (“distributed tracing”) is a characteristic phrase
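The n-gram frequencies in these two slides can be computed from tokenized titles as sketched below. The token lists are invented stand-ins for MeCab + NEologd output (NEologd keeps compounds like “機械学習” and “サーバーレス” as single tokens, which is why they show up whole in the counts).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token lists, standing in for MeCab + NEologd output.
tokenized_titles = [
    ["機械学習", "入門"],
    ["サーバーレス", "で", "機械学習"],
    ["関数", "型", "プログラミング"],
]

# Count bigrams across all titles; characteristic phrases like
# ("関数", "型") surface as frequent adjacent pairs.
bigram_freq = Counter(
    g for tokens in tokenized_titles for g in ngrams(tokens, 2)
)
```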
Results
● Trends: Kubernetes, Serverless, Scala, 分散トレーシング (distributed tracing), 関数型 (functional), Akka, Cloud Spanner
  ○ The frequency depends on the quality of titles and descriptions
● Cloud APIs are useful
● Key-phrases alone are not enough; combining them with N-grams is better
● MeCab + NEologd analyzes better than the Natural Language API (for Japanese / this specific category?)