Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Julia Silge
March 04, 2019
Technology
260
1
Share
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
370
The right tool for the job
juliasilge
0
86
Good practices for applied machine learning
juliasilge
0
250
Applied machine learning with tidymodels
juliasilge
0
170
Maintaining an R Package
juliasilge
0
440
Publishing the Stack Overflow Developer Survey
juliasilge
2
100
Text Mining Using Tidy Data Principles
juliasilge
0
190
North American Developer Hiring Landscape
juliasilge
0
87
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.6k
Other Decks in Technology
See All in Technology
Swift Sequence の便利 API 再発見
treastrain
1
290
SpeechTranscriber + AIによる文字起こし機能
kazuki1220
0
120
論文紹介:Pixal3D (SIGGRAPH 2026)
tenten0727
0
630
"うちにはまだ早い"は本当? ─ 小さく始めるPlatform Engineering入門
harukasakihara
7
650
Oracle AI Database@Google Cloud:サービス概要のご紹介
oracle4engineer
PRO
6
1.4k
Claude Code で使える DuckDB Skills を試してみた / DuckDB Skills and Claude Code
masahirokawahara
1
1.8k
最新技術を"今は選ばない"という技術選定
leveragestech
PRO
0
330
ECSのTerraformモジュールにコントリビュートした話
harukasakihara
0
260
ルール・ロール・ツールを創る / Creating Rules, Roles and Tools
ks91
PRO
0
130
実例から学ぶ GuardDuty(SSH BruteForce)調査の全体フローと勘所【SecurityJAWS】
cscengineer
PRO
0
170
その英語学習、AWSで代替できませんか?
suzutatsu
1
160
The Making of AI Chips
pfn
PRO
0
530
Featured
See All Featured
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
790
The Curious Case for Waylosing
cassininazir
1
350
DevOps and Value Stream Thinking: Enabling flow, efficiency and business value
helenjbeal
1
190
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3.2k
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
70
39k
The browser strikes back
jonoalderson
0
1.1k
Stewardship and Sustainability of Urban and Community Forests
pwiseman
0
200
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
210
Six Lessons from altMBA
skipperchong
29
4.2k
Reality Check: Gamification 10 Years Later
codingconduct
0
2.1k
Statistics for Hackers
jakevdp
799
230k
Joys of Absence: A Defence of Solitary Play
codingconduct
1
360
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE