Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
TIDYTEXT
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
EXPLORATORY DATA ANALYSIS
N-GRAMS AND MORE WORDS
MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY
INVERSE DOCUMENT FREQUENCY
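As a rough illustration of the term frequency and inverse document frequency idea on this slide, here is a minimal sketch using tidytext's `unnest_tokens()` and `bind_tf_idf()`. The two-document corpus and the names `docs`, `word_counts`, and `doc_tf_idf` are invented for this example and are not taken from the talk.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus: two tiny "documents" made up for illustration
docs <- tibble(
  document = c("austen", "wells"),
  text = c("the pride and the prejudice",
           "the time machine and the war")
)

# One row per document-word pair, with a count n
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(document, word)

# tf = n / total words in the document
# idf = ln(number of documents / number of documents containing the word)
doc_tf_idf <- word_counts %>%
  bind_tf_idf(word, document, n) %>%
  arrange(desc(tf_idf))

doc_tf_idf
```

Words that appear in every document, such as "the" and "and" here, get an idf of zero, so they drop to the bottom of the ranking, while words specific to one document rise to the top.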
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
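The slides for this part are images that did not survive in the transcript. As one possible sketch of the n-gram and negation idea, the code below tokenizes into bigrams with `unnest_tokens(token = "ngrams", n = 2)` and keeps the words that directly follow "not". The example sentence and the names `reviews`, `bigrams`, and `negated` are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-sentence corpus, invented for illustration
reviews <- tibble(
  id = 1,
  text = "I am not happy and not sad"
)

# Tokenize into overlapping pairs of consecutive words (bigrams)
bigrams <- reviews %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram into its two words and keep the pairs led by "not"
negated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not")

negated
```

Counting `word2` in a table like `negated` shows which words a single-word analysis would read with the wrong sense because a negation comes before them.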
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
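The two bullets above can be made concrete with a small base R sketch. It does not fit a topic model. It only shows how document-topic proportions and topic-word probabilities combine into the word probabilities for each document. All numbers and the names `doc_topics`, `topic_words`, and `word_probs` are made up for illustration.

```r
# Each DOCUMENT = mixture of topics (each row sums to 1); hypothetical values
doc_topics <- matrix(
  c(0.9, 0.1,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("topic1", "topic2"))
)

# Each TOPIC = mixture of words (each row sums to 1); hypothetical values
topic_words <- matrix(
  c(0.5, 0.4, 0.1,
    0.1, 0.2, 0.7),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), c("data", "model", "ship"))
)

# Probability of each word in each document: sum over topics of
# P(topic | document) * P(word | topic)
word_probs <- doc_topics %*% topic_words

word_probs
```

In a fitted LDA model these two matrices are estimated from the word counts. When tidying such a model with tidytext, they correspond to the `"gamma"` (document-topic) and `"beta"` (topic-word) matrices.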
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
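TEXT CLASSIFICATION
library(glmnet)
library(doMC)
registerDoMC(cores = 8)  # register a multicore backend for parallel = TRUE

# Response: TRUE for rows that come from "Pride and Prejudice"
is_jane <- books_joined$title == "Pride and Prejudice"

# Cross-validated regularized logistic regression on the sparse word matrix
model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                   parallel = TRUE, keep = TRUE)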
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash