Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
360
The right tool for the job
juliasilge
0
76
Good practices for applied machine learning
juliasilge
0
240
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
430
Publishing the Stack Overflow Developer Survey
juliasilge
2
94
Text Mining Using Tidy Data Principles
juliasilge
0
180
North American Developer Hiring Landscape
juliasilge
0
82
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
主催・運営として"場をつくる”というアウトプットのススメ
_mossann_t
0
110
【関西電力KOI×VOLTMIND 生成AIハッカソン】空間AIブレイン ~⼤阪おばちゃんフィジカルAIに続く道~
tanakaseiya
0
150
OPENLOGI Company Profile
hr01
0
83k
すごいぞManaged Kubernetes
harukasakihara
1
320
AIを活用したアクセシビリティ改善フロー
degudegu2510
1
140
AWS DevOps Agent or Kiro の使いどころを考える_20260402
masakiokuda
0
180
AWSで2番目にリリースされたサービスについてお話しします(諸説あります)
yama3133
0
120
ブラックボックス化したMLシステムのVertex AI移行 / mlops_community_62
visional_engineering_and_design
1
280
Oracle AI Database@AWS:サービス概要のご紹介
oracle4engineer
PRO
3
2.1k
「できない」のアウトプット 同人誌『精神を壊してからの』シリーズ出版を 通して得られたこと
comi190327
3
570
OPENLOGI Company Profile for engineer
hr01
1
62k
OCI技術資料 : ロード・バランサ 概要 - FLB・NLB共通
ocise
4
27k
Featured
See All Featured
It's Worth the Effort
3n
188
29k
エンジニアに許された特別な時間の終わり
watany
106
240k
Skip the Path - Find Your Career Trail
mkilby
1
94
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
34
2.7k
Digital Ethics as a Driver of Design Innovation
axbom
PRO
1
250
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
WENDY [Excerpt]
tessaabrams
9
37k
Noah Learner - AI + Me: how we built a GSC Bulk Export data pipeline
techseoconnect
PRO
0
160
Marketing Yourself as an Engineer | Alaka | Gurzu
gurzu
0
170
4 Signs Your Business is Dying
shpigford
187
22k
New Earth Scene 8
popppiees
2
2k
What’s in a name? Adding method to the madness
productmarketing
PRO
24
4k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE