Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
250
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
330
The right tool for the job
juliasilge
0
61
Good practices for applied machine learning
juliasilge
0
210
Applied machine learning with tidymodels
juliasilge
0
140
Maintaining an R Package
juliasilge
0
390
Publishing the Stack Overflow Developer Survey
juliasilge
2
80
Text Mining Using Tidy Data Principles
juliasilge
0
150
North American Developer Hiring Landscape
juliasilge
0
63
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
re:Invent2025 3つの Frontier Agents を紹介 / introducing-3-frontier-agents
tomoki10
0
150
1人1サービス開発しているチームでのClaudeCodeの使い方
noayaoshiro
1
170
ChatGPTで論⽂は読めるのか
spatial_ai_network
9
29k
エンジニアリングをやめたくないので問い続ける
estie
2
1.2k
[デモです] NotebookLM で作ったスライドの例
kongmingstrap
0
150
文字列の並び順 / Unicode Collation
tmtms
3
590
チーリンについて
hirotomotaguchi
6
2k
学習データって増やせばいいんですか?
ftakahashi
2
350
意外とあった SQL Server 関連アップデート + Database Savings Plans
stknohg
PRO
0
330
re:Invent 2025 ~何をする者であり、どこへいくのか~
tetutetu214
0
220
IAMユーザーゼロの運用は果たして可能なのか
yama3133
1
400
[CMU-DB-2025FALL] Apache Fluss - A Streaming Storage for Real-Time Lakehouse
jark
0
120
Featured
See All Featured
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.8k
Documentation Writing (for coders)
carmenintech
76
5.2k
Java REST API Framework Comparison - PWX 2021
mraible
34
9k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
How GitHub (no longer) Works
holman
316
140k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.8k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
17k
Balancing Empowerment & Direction
lara
5
800
Six Lessons from altMBA
skipperchong
29
4.1k
Code Reviewing Like a Champion
maltzj
527
40k
Music & Morning Musume
bryan
46
7k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.3k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE