Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
• EXPLORATORY DATA ANALYSIS
• N-GRAMS AND MORE WORDS
• MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY
INVERSE DOCUMENT FREQUENCY
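As a rough illustration of the idea on this slide: tf-idf weights a word by how often it appears in one document (term frequency) and discounts words that appear in many documents (inverse document frequency). The sketch below uses bind_tf_idf() from tidytext on a made-up three-document corpus. The corpus and the column names doc and text are assumptions for illustration, not data from the talk.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus: one row per document
docs <- tibble(
  doc  = c("a", "b", "c"),
  text = c("the moon orbits the earth",
           "the rover landed on mars",
           "the earth and mars orbit the sun")
)

# One row per (document, word) pair with its count n
word_counts <- docs %>%
  unnest_tokens(word, text) %>%   # tokenize into lowercase words
  count(doc, word, sort = TRUE)

# Adds tf, idf, and tf_idf columns; idf is ln(n_docs / n_docs containing the word)
doc_tf_idf <- word_counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

doc_tf_idf
```

In this toy corpus "the" appears in every document, so its idf and tf-idf are zero, while a word such as "moon" that appears in only one document gets a positive weight.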
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL: N-GRAMS, NETWORKS, & NEGATION
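The slides for this section are images, so the following is only a guess at the kind of workflow the title points to. It tokenizes text into bigrams with unnest_tokens(token = "ngrams", n = 2) and then looks at which words follow "not", one simple way to study negation. The sentence is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-document example
docs <- tibble(text = "I am not happy and I am not sad")

# One row per bigram, then split each bigram into its two words
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Words that are preceded by "not"
negated <- bigrams %>%
  filter(word1 == "not") %>%
  count(word2, sort = TRUE)

negated
```

The word1/word2 pairs can also be treated as edges of a graph, which is one way to get from n-grams to a word network.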
TAKING TIDYTEXT TO THE NEXT LEVEL: TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
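The two bullets above can be made concrete with a small worked example in base R. If gamma holds each document's topic proportions and beta holds each topic's word probabilities, then the implied word distribution of each document is the matrix product of the two. All numbers and names below are invented for illustration; a fitted model (for example LDA) would estimate them from data.

```r
# gamma: per-document topic proportions (documents x topics), hypothetical values
gamma <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("topic1", "topic2")))

# beta: per-topic word probabilities (topics x words), hypothetical values
beta <- matrix(c(0.5, 0.4, 0.1,
                 0.1, 0.2, 0.7),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("rocket", "orbit", "novel")))

# Each document's word distribution is a mixture of the topic word distributions
doc_word <- gamma %*% beta
doc_word

# Every row is a probability distribution over words
rowSums(doc_word)
```

Here doc1 is mostly topic1, so its word distribution leans toward "rocket" and "orbit", while doc2 is mostly topic2 and leans toward "novel".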
TAKING TIDYTEXT TO THE NEXT LEVEL: TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
> library(glmnet)
> library(doMC)
> registerDoMC(cores = 8)
>
> is_jane <- books_joined$title == "Pride and Prejudice"
>
> model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
+                    parallel = TRUE, keep = TRUE)
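The transcript does not show how sparse_words or books_joined were built. The sketch below is one plausible way to get from tidy text to a sparse document-term matrix and a cross-validated regularized logistic regression, assuming cast_sparse() from tidytext. The simulated "books", the vocabularies, and the names lines and is_book_a are hypothetical, and the parallel backend is left out to keep the example small.

```r
library(dplyr)
library(tidytext)
library(glmnet)

set.seed(2019)

# Hypothetical toy data: 40 short "lines" drawn from two made-up vocabularies
vocab_a <- c("pride", "estate", "letter", "sister", "walk")
vocab_b <- c("ship", "island", "storm", "captain", "walk")
lines <- tibble(
  document = 1:40,
  title    = rep(c("Book A", "Book B"), each = 20),
  text     = c(
    replicate(20, paste(sample(vocab_a, 6, replace = TRUE), collapse = " ")),
    replicate(20, paste(sample(vocab_b, 6, replace = TRUE), collapse = " "))
  )
)

# Tidy text -> word counts -> sparse matrix (rows = documents, columns = words)
sparse_words <- lines %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Align the labels with the matrix rows via the row names
row_docs  <- as.integer(rownames(sparse_words))
is_book_a <- lines$title[match(row_docs, lines$document)] == "Book A"

# Cross-validated lasso logistic regression (5 folds because the data are tiny)
model <- cv.glmnet(sparse_words, is_book_a, family = "binomial", nfolds = 5)

# Coefficients at the cross-validated lambda: which words push toward Book A?
coef(model, s = "lambda.min")
```

Matching labels to rows through the row names, rather than relying on row order, guards against the matrix rows ending up in a different order than the original data frame.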
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash