Text Mining: Exploratory Data Analysis to Machine Learning
Julia Silge
March 04, 2019
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
TIDYTEXT
HELLO
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
EXPLORATORY DATA ANALYSIS
N-GRAMS AND MORE WORDS
MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
TERM FREQUENCY, INVERSE DOCUMENT FREQUENCY
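To make the term frequency and inverse document frequency idea concrete, here is a minimal base R sketch on a made-up three-document corpus. The corpus and all variable names are hypothetical and are not taken from the talk. The weighting shown (tf as a term's share of a document's tokens, idf as the natural log of the number of documents divided by the number of documents containing the term) is one common definition; tidytext's `bind_tf_idf()` computes the same quantities from a tidy table of counts.

```r
# Toy corpus (hypothetical): three tiny documents, already tokenized.
docs <- list(
  doc1 = c("the", "cat", "sat", "on", "the", "mat"),
  doc2 = c("the", "dog", "sat", "on", "the", "log"),
  doc3 = c("the", "bird", "flew")
)

# One row per token, in tidy (document, term) form.
tokens <- data.frame(
  document = rep(names(docs), lengths(docs)),
  term = unlist(docs, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Count each term within each document, dropping zero counts.
counts <- as.data.frame(
  table(document = tokens$document, term = tokens$term),
  responseName = "n", stringsAsFactors = FALSE
)
counts <- counts[counts$n > 0, ]

# Term frequency: the term's share of all tokens in its document.
doc_totals <- tapply(counts$n, counts$document, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$document])

# Inverse document frequency: ln(total documents / documents containing the term).
docs_with_term <- table(counts$term)
counts$idf <- log(length(docs) / as.numeric(docs_with_term[counts$term]))

# tf-idf is the product; terms that appear in every document score zero.
counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

Because "the" appears in all three documents, its idf is ln(3/3) = 0, so it drops to the bottom of the ranking, while words unique to one document rise to the top.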
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
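As a rough illustration of the n-gram and negation ideas named on this slide, the base R sketch below splits a made-up sentence into bigrams and pulls out the words that follow "not". The sentence and variable names are hypothetical. In a tidytext workflow the tokenizing step would typically be `unnest_tokens()` with `token = "ngrams"` and `n = 2`.

```r
# Hypothetical example sentence; any character string would do.
text <- "This talk is not boring and the slides are not bad"

# Lowercase and split on anything that is not a letter or apostrophe.
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[nzchar(words)]

# A bigram is each word paired with the word that follows it.
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Negation: words preceded by "not", whose sentiment a
# single-word analysis would likely read the wrong way round.
negated <- bigrams$word2[bigrams$word1 == "not"]
negated
```

Treating `word1` and `word2` as the two ends of an edge is also what makes bigram counts usable as a word network.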
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
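The two bullets above can be written as a single equation. The slide text does not name a specific algorithm, so this is a sketch that assumes a probabilistic topic model with K topics, such as latent Dirichlet allocation. The symbols are introduced here for illustration only.

```latex
% Probability of seeing word w in document d, assuming K topics:
% mix the per-topic word distributions using the document's topic proportions.
p(w \mid d) \;=\; \sum_{k=1}^{K}
    \underbrace{p(w \mid z = k)}_{\text{topic } k \text{ as a mixture of words}}
    \;\underbrace{p(z = k \mid d)}_{\text{document } d \text{ as a mixture of topics}}
```

When such a model is tidied with tidytext, the per-topic word probabilities are conventionally called beta and the per-document topic proportions gamma.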
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
TEXT CLASSIFICATION
> library(glmnet)
> library(doMC)
> registerDoMC(cores = 8)
>
> is_jane <- books_joined$title == "Pride and Prejudice"
>
> model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
+                    parallel = TRUE, keep = TRUE)
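For readers unfamiliar with glmnet, the objective that a call like the one above minimizes can be sketched as follows. The notation is an assumption made for this note: x_i stands for row i of `sparse_words`, y_i for the corresponding 0/1 value of `is_jane`, and N for the number of rows. Since the call does not set `alpha`, glmnet's default of alpha = 1 applies, which is the lasso penalty.

```latex
% Penalized logistic regression as fit by glmnet with family = "binomial":
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
           - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
           + \alpha\,\lVert\beta\rVert_1\Bigr]
% With alpha = 1 only the L1 term remains, so many word
% coefficients are shrunk exactly to zero.
```

`cv.glmnet()` fits this along a sequence of lambda values and uses cross-validation to compare them. `parallel = TRUE` spreads the folds across the cores registered with `registerDoMC()`, and `keep = TRUE` retains the out-of-fold fitted values.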
THANK YOU
JULIA SILGE
@juliasilge https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash