Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Transcript
TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
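A minimal sketch of what "tidy data principles + count-based methods" can look like in practice, assuming the `dplyr`, `tibble`, and `tidytext` packages are installed. The two-document corpus and the object names (`docs`, `tidy_words`, `word_counts`) are hypothetical and not taken from the talk.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy corpus (an assumption for illustration only).
docs <- tibble(
  doc  = c("a", "b"),
  text = c("Tidy data makes text mining easier.",
           "Text mining with tidy data principles.")
)

# Tidy text format: one token (here, one lowercased word) per row.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Count-based method: how often each word appears across the corpus.
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

print(word_counts)
```

Because the result is an ordinary data frame, the usual `dplyr` and `ggplot2` verbs apply to it directly.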
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
TIDYTEXT
• EXPLORATORY DATA ANALYSIS
• N-GRAMS AND MORE WORDS
• MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
• TERM FREQUENCY
• INVERSE DOCUMENT FREQUENCY
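One way to compute term frequency and inverse document frequency with tidytext is `bind_tf_idf()`, sketched below. The word counts are made up for illustration and are not data from the talk; the packages `dplyr`, `tibble`, and `tidytext` are assumed to be installed.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical per-document word counts (not data from the talk).
counts <- tribble(
  ~doc, ~word,   ~n,
  "a",  "the",   4,
  "a",  "orbit", 1,
  "b",  "the",   3,
  "b",  "ocean", 2
)

# bind_tf_idf() adds three columns:
#   tf     = n / total words in the document
#   idf    = ln(number of documents / documents containing the word)
#   tf_idf = tf * idf
doc_tf_idf <- counts %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

print(doc_tf_idf)
```

A word such as "the" that occurs in every document gets an idf of zero, so its tf-idf is zero no matter how often it appears. Words concentrated in a single document rise to the top.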
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
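A small sketch of bigram tokenization and a negation check, assuming `dplyr`, `tibble`, `tidyr`, and `tidytext` are installed. The sentences and the object names (`bigrams`, `negated`) are hypothetical; the talk's own examples are not reproduced here.

```r
library(dplyr)
library(tibble)
library(tidyr)
library(tidytext)

# Hypothetical toy sentences (an assumption for illustration only).
docs <- tibble(
  doc  = c("a", "b"),
  text = c("I am not happy today", "I am happy today")
)

# Tokenize into overlapping pairs of adjacent words (bigrams),
# then split each bigram into its two component words.
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Negation: words that directly follow "not".
negated <- bigrams %>%
  filter(word1 == "not") %>%
  count(word2, sort = TRUE)

print(negated)
```

The `word1`/`word2` pairs also work as an edge list, which is one possible starting point for a word network.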
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
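The two mixtures can be inspected directly after fitting a topic model. The sketch below uses latent Dirichlet allocation from the `topicmodels` package as one possible choice; the slide text does not say which model the talk used. It assumes `dplyr`, `tibble`, `tidytext`, `topicmodels`, and `tm` (needed by `cast_dtm()`) are installed, and the counts are invented for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four short documents (not from the talk).
counts <- tribble(
  ~document, ~term,    ~count,
  "d1",      "rocket", 3L,
  "d1",      "orbit",  2L,
  "d2",      "orbit",  3L,
  "d2",      "moon",   2L,
  "d3",      "ocean",  3L,
  "d3",      "wave",   2L,
  "d4",      "wave",   3L,
  "d4",      "tide",   2L
)

# Cast the tidy counts to a document-term matrix and fit a 2-topic LDA.
dtm <- counts %>% cast_dtm(document, term, count)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Each TOPIC = mixture of words: per-topic word probabilities (beta).
topic_words <- tidy(lda, matrix = "beta")

# Each DOCUMENT = mixture of topics: per-document topic proportions (gamma).
doc_topics <- tidy(lda, matrix = "gamma")

print(topic_words)
print(doc_topics)
```

The beta values sum to one within each topic, and the gamma values sum to one within each document, which is what "mixture" means in both bullets.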
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
TEXT CLASSIFICATION
> library(glmnet)
> library(doMC)
> registerDoMC(cores = 8)
>
> is_jane <- books_joined$title == "Pride and Prejudice"
>
> model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
+                    parallel = TRUE, keep = TRUE)
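The slide uses `sparse_words` without showing how it was built. One way such a matrix could be produced from tidy word counts is `cast_sparse()` from tidytext, sketched below. The counts and the document labels are hypothetical, and this is not necessarily how the talk constructed its matrix; `dplyr`, `tibble`, `tidytext`, and `Matrix` are assumed to be installed.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical word counts per document (not the talk's data).
word_counts <- tribble(
  ~document, ~word,      ~n,
  "line_1",  "darcy",    2,
  "line_1",  "pride",    1,
  "line_2",  "martians", 3,
  "line_2",  "pride",    1
)

# One row per document, one column per word, counts as values.
# The result is a sparse matrix from the Matrix package, which is
# an input type that glmnet accepts.
sparse_words <- word_counts %>%
  cast_sparse(document, word, n)

print(dim(sparse_words))
```

The response vector passed to `cv.glmnet()` has to line up with the rows of this matrix, so the row names are worth checking before fitting.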
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash