Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TIDYTEXT
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
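As a rough illustration of what "tidy data principles + count-based methods" can look like in practice, the sketch below tokenizes a tiny made-up corpus into one word per row and counts the words. The `docs` data frame and its contents are hypothetical; `unnest_tokens()` comes from tidytext and `count()` from dplyr.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus (not from the talk): one row per document
docs <- tibble(
  doc  = c("a", "b"),
  text = c("Text data is increasingly important.",
           "Tidy data principles make text data easier to handle.")
)

# Tidy text format: one token (here, one word) per row.
# unnest_tokens() lowercases and strips punctuation by default.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Count-based method: word frequencies across the whole corpus
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

word_counts
```

Because the result is an ordinary data frame, the usual dplyr and ggplot2 verbs apply to it directly.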
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
• EXPLORATORY DATA ANALYSIS
• N-GRAMS AND MORE WORDS
• MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
• TERM FREQUENCY
• INVERSE DOCUMENT FREQUENCY
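A minimal sketch of how term frequency and inverse document frequency might be combined with tidytext, assuming a made-up two-document corpus (the `docs` data and the name `doc_tf_idf` are hypothetical). `bind_tf_idf()` computes tf as a word's share of its document's words and idf as the natural log of (number of documents / number of documents containing the word), so words that appear in every document score zero.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus (not from the talk)
docs <- tibble(
  doc  = c("mars", "ocean"),
  text = c("rover data from the mars rover mission",
           "buoy data from the ocean survey")
)

# Word counts per document
doc_words <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

# Add tf, idf, and tf_idf columns, then sort by the most distinctive words
doc_tf_idf <- doc_words %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

doc_tf_idf
```

In this toy corpus "data", "from", and "the" occur in both documents and get a tf-idf of zero, while "rover" ranks first because it is frequent in one document and absent from the other.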
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL: N-GRAMS, NETWORKS, & NEGATION
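One way the n-gram and negation ideas could be sketched: tokenize into bigrams instead of single words, then look at which words follow "not". The sentence below is invented for illustration, and the object names are hypothetical; `token = "ngrams"` is an option of tidytext's `unnest_tokens()`, and `separate()` comes from tidyr.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-document corpus (not from the talk)
docs <- tibble(
  doc  = 1,
  text = "I am not happy and I am not amused"
)

# One bigram (pair of consecutive words) per row
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Split each bigram and keep the words that are preceded by "not"
negated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%
  count(word2, sort = TRUE)

negated
```

The same `word1`/`word2` pairs could also serve as the edge list for a word network.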
TAKING TIDYTEXT TO THE NEXT LEVEL: TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
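The two bullets above can be made concrete with a small latent Dirichlet allocation fit. This is only a sketch under assumptions: the talk does not say which topic-model package was used, so `LDA()` from the topicmodels package is a swapped-in choice, and the four-document corpus and all object names are invented. `cast_dtm()` and the `tidy()` method for LDA models come from tidytext (`cast_dtm()` also needs the tm package installed).

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus (not from the talk)
docs <- tibble(
  doc  = c("d1", "d2", "d3", "d4"),
  text = c("rocket orbit launch rocket",
           "orbit launch satellite orbit",
           "garden flower soil garden",
           "flower soil seed flower")
)

# Tidy word counts -> document-term matrix
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

# Fit a two-topic model; the seed makes the fit reproducible
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# beta: each TOPIC as a mixture of words (per-topic word probabilities)
topic_words <- tidy(lda, matrix = "beta")

# gamma: each DOCUMENT as a mixture of topics (per-document topic proportions)
doc_topics <- tidy(lda, matrix = "gamma")
```

Within each topic the `beta` values sum to 1, and within each document the `gamma` values sum to 1, which is exactly the "mixture" reading on the slide.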
TAKING TIDYTEXT TO THE NEXT LEVEL: TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
TEXT CLASSIFICATION
> library(glmnet)
> library(doMC)
> registerDoMC(cores = 8)
>
> is_jane <- books_joined$title == "Pride and Prejudice"
>
> model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
+                    parallel = TRUE, keep = TRUE)
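The slide does not show how `sparse_words` and `books_joined` were built. The sketch below shows one plausible way to get from tidy word counts to a sparse document-by-word matrix and an aligned label vector, using tidytext's `cast_sparse()`. The four-row `lines` data frame, its text, and the second title are hypothetical stand-ins, and the toy data is far too small to fit with `cv.glmnet()`.

```r
library(dplyr)
library(tidytext)

# Hypothetical labelled documents (not the talk's data)
lines <- tibble(
  document = 1:4,
  title    = c("Pride and Prejudice", "Pride and Prejudice",
               "Some Other Book", "Some Other Book"),
  text     = c("a truth universally acknowledged",
               "a single man in possession of a fortune",
               "the chances of anything coming were small",
               "no one would have believed it")
)

# Tidy counts -> sparse matrix with one row per document, one column per word
sparse_words <- lines %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Align the labels with the matrix rows via the row names (document ids)
row_ids <- as.integer(rownames(sparse_words))
is_jane <- lines$title[match(row_ids, lines$document)] == "Pride and Prejudice"
```

With a realistically sized corpus, a matrix and logical vector shaped like these are what the `cv.glmnet()` call on the slide takes as its first two arguments.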
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash