Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
220
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
210
The right tool for the job
juliasilge
0
27
Good practices for applied machine learning
juliasilge
0
180
Applied machine learning with tidymodels
juliasilge
0
96
Maintaining an R Package
juliasilge
0
330
Publishing the Stack Overflow Developer Survey
juliasilge
2
56
Text Mining Using Tidy Data Principles
juliasilge
0
120
North American Developer Hiring Landscape
juliasilge
0
35
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.4k
Other Decks in Technology
See All in Technology
MC906491 を見据えた Microsoft Entra Connect アップグレード対応
tamaiyutaro
1
530
地方拠点で エンジニアリングマネージャーってできるの? 〜地方という制約を楽しむオーナーシップとコミュニティ作り〜
1coin
1
220
データの品質が低いと何が困るのか
kzykmyzw
6
1.1k
モノレポ開発のエラー、誰が見る?Datadog で実現する適切なトリアージとエスカレーション
biwashi
6
800
関東Kaggler会LT: 人狼コンペとLLM量子化について
nejumi
3
570
エンジニアが加速させるプロダクトディスカバリー 〜最速で価値ある機能を見つける方法〜 / product discovery accelerated by engineers
rince
1
140
エンジニアの育成を支える爆速フィードバック文化
sansantech
PRO
3
1k
ホワイトボードチャレンジ 説明&実行資料
ichimichi
0
130
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
6
57k
Developers Summit 2025 浅野卓也(13-B-7 LegalOn Technologies)
legalontechnologies
PRO
0
650
速くて安いWebサイトを作る
nishiharatsubasa
10
12k
データマネジメントのトレードオフに立ち向かう
ikkimiyazaki
6
840
Featured
See All Featured
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
10
1.3k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.7k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
Music & Morning Musume
bryan
46
6.3k
Embracing the Ebb and Flow
colly
84
4.6k
BBQ
matthewcrist
87
9.5k
Art, The Web, and Tiny UX
lynnandtonic
298
20k
The MySQL Ecosystem @ GitHub 2015
samlambert
250
12k
For a Future-Friendly Web
brad_frost
176
9.5k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2.1k
GitHub's CSS Performance
jonrohan
1030
460k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE