Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
230
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
270
The right tool for the job
juliasilge
0
42
Good practices for applied machine learning
juliasilge
0
200
Applied machine learning with tidymodels
juliasilge
0
110
Maintaining an R Package
juliasilge
0
360
Publishing the Stack Overflow Developer Survey
juliasilge
2
67
Text Mining Using Tidy Data Principles
juliasilge
0
130
North American Developer Hiring Landscape
juliasilge
0
50
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
Javaで作る RAGを活用した Q&Aアプリケーション
recruitengineers
PRO
1
120
Lambda Web Adapterについて自分なりに理解してみた
smt7174
5
130
Абьюзим random_bytes(). Фёдор Кулаков, разработчик Lamoda Tech
lamodatech
0
350
フィンテック養成勉強会#54
finengine
0
180
なぜ私はいま、ここにいるのか? #もがく中堅デザイナー #プロダクトデザイナー
bengo4com
0
480
登壇ネタの見つけ方 / How to find talk topics
pinkumohikan
5
530
エンジニア向け技術スタック情報
kauche
1
270
「Chatwork」の認証基盤の移行とログ活用によるプロダクト改善
kubell_hr
1
200
【TiDB GAME DAY 2025】Shadowverse: Worlds Beyond にみる TiDB 活用術
cygames
0
1.1k
AIのAIによるAIのための出力評価と改善
chocoyama
2
580
製造業からパッケージ製品まで、あらゆる領域をカバー!生成AIを利用したテストシナリオ生成 / 20250627 Suguru Ishii
shift_evolve
PRO
1
140
Tech-Verse 2025 Global CTO Session
lycorptech_jp
PRO
0
100
Featured
See All Featured
What's in a price? How to price your products and services
michaelherold
246
12k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
7
710
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
46
9.6k
VelocityConf: Rendering Performance Case Studies
addyosmani
331
24k
Adopting Sorbet at Scale
ufuk
77
9.4k
The Invisible Side of Design
smashingmag
300
51k
Building Applications with DynamoDB
mza
95
6.5k
The Straight Up "How To Draw Better" Workshop
denniskardys
234
140k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
Automating Front-end Workflow
addyosmani
1370
200k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
130
19k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
34
5.9k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE