Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
220
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
210
The right tool for the job
juliasilge
0
24
Good practices for applied machine learning
juliasilge
0
180
Applied machine learning with tidymodels
juliasilge
0
94
Maintaining an R Package
juliasilge
0
320
Publishing the Stack Overflow Developer Survey
juliasilge
2
55
Text Mining Using Tidy Data Principles
juliasilge
0
120
North American Developer Hiring Landscape
juliasilge
0
34
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.4k
Other Decks in Technology
See All in Technology
2025年の挑戦 コーポレートエンジニアの技術広報/techpr5
nishiuma
0
140
今年一年で頑張ること / What I will do my best this year
pauli
1
220
embedパッケージを深掘りする / Deep Dive into embed Package in Go
task4233
1
200
今から、 今だからこそ始める Terraform で Azure 管理 / Managing Azure with Terraform: The Perfect Time to Start
nnstt1
0
190
完全自律型AIエージェントとAgentic Workflow〜ワークフロー構築という現実解
pharma_x_tech
0
320
FODにおけるホーム画面編成のレコメンド
watarukudo
PRO
2
250
Unsafe.BitCast のすゝめ。
nenonaninu
0
190
The future we create with our own MVV
matsukurou
0
2k
WantedlyでのKotlin Multiplatformの導入と課題 / Kotlin Multiplatform Implementation and Challenges at Wantedly
kubode
0
240
2024年活動報告会(人材育成推進WG・ビジネスサブWG) / 20250114-OIDF-J-EduWG-BizSWG
oidfj
0
100
データ基盤におけるIaCの重要性とその運用
mtpooh
2
250
re:Invent 2024のふりかえり
beli68
0
110
Featured
See All Featured
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
226
22k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
7
570
Fireside Chat
paigeccino
34
3.1k
How STYLIGHT went responsive
nonsquared
96
5.3k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
127
18k
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
Writing Fast Ruby
sferik
628
61k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
666
120k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
38
1.9k
Measuring & Analyzing Core Web Vitals
bluesmoon
5
210
Building a Modern Day E-commerce SEO Strategy
aleyda
38
7k
Fashionably flexible responsive web design (full day workshop)
malarkey
406
66k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE