Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
240
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
320
The right tool for the job
juliasilge
0
51
Good practices for applied machine learning
juliasilge
0
210
Applied machine learning with tidymodels
juliasilge
0
130
Maintaining an R Package
juliasilge
0
380
Publishing the Stack Overflow Developer Survey
juliasilge
2
77
Text Mining Using Tidy Data Principles
juliasilge
0
140
North American Developer Hiring Landscape
juliasilge
0
56
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
ユーザーの声とAI検証で進める、プロダクトディスカバリー
sansantech
PRO
1
140
[Keynote] What do you need to know about DevEx in 2025
salaboy
0
180
AI時代こそ求められる設計力- AWSクラウドデザインパターン3選で信頼性と拡張性を高める-
kenichirokimura
3
330
カンファレンスに託児サポートがあるということ / Having Childcare Support at Conferences
nobu09
1
580
[Codex Meetup Japan #1] Codex-Powered Mobile Apps Development
korodroid
2
870
生成AI時代のセキュアコーディングとDevSecOps
yuriemori
0
110
OCI Network Firewall 概要
oracle4engineer
PRO
2
7.9k
OAuthからOIDCへ ― 認可の仕組みが認証に拡張されるまで
yamatai1212
0
130
Bill One 開発エンジニア 紹介資料
sansan33
PRO
4
14k
「れきちず」のこれまでとこれから - 誰にでもわかりやすい歴史地図を目指して / FOSS4G 2025 Japan
hjmkth
1
310
「使い方教えて」「事例教えて」じゃもう遅い! Microsoft 365 Copilot を触り倒そう!
taichinakamura
0
390
Performance Insights 廃止から Database Insights 利用へ/transition-from-performance-insights-to-database-insights
emiki
0
280
Featured
See All Featured
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
127
53k
Facilitating Awesome Meetings
lara
56
6.6k
YesSQL, Process and Tooling at Scale
rocio
173
14k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.2k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
The Cult of Friendly URLs
andyhume
79
6.6k
Making the Leap to Tech Lead
cromwellryan
135
9.6k
Mobile First: as difficult as doing things right
swwweet
224
10k
Speed Design
sergeychernyshev
32
1.2k
How to Think Like a Performance Engineer
csswizardry
27
2k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.6k
A Tale of Four Properties
chriscoyier
161
23k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE