Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Text Mining: Exploratory Data Analysis to Machi...
Search
Julia Silge
March 04, 2019
Technology
1
220
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Julia Silge
March 04, 2019
Tweet
Share
More Decks by Julia Silge
See All by Julia Silge
Introducing Positron
juliasilge
1
240
The right tool for the job
juliasilge
0
31
Good practices for applied machine learning
juliasilge
0
190
Applied machine learning with tidymodels
juliasilge
0
100
Maintaining an R Package
juliasilge
0
350
Publishing the Stack Overflow Developer Survey
juliasilge
2
61
Text Mining Using Tidy Data Principles
juliasilge
0
120
North American Developer Hiring Landscape
juliasilge
0
41
Understanding Principal Component Analysis Using Stack Overflow Data
juliasilge
13
4.5k
Other Decks in Technology
See All in Technology
サーバレス、コンテナ、データベース特化型機能をご紹介。CloudWatch をもっと使いこなそう!
o11yfes2023
0
180
コスト最適重視でAurora PostgreSQLのログ分析基盤を作ってみた #jawsug_tokyo
non97
0
330
React ABC Questions
hirotomoyamada
0
440
watsonx.data上のベクトル・データベース Milvusを見てみよう/20250418-milvus-dojo
mayumihirano
0
120
Рекомендации с нуля: как мы в Lamoda превратили главную страницу в ключевую точку входа для персонализированного шоппинга. Данил Комаров, Data Scientist, Lamoda Tech
lamodatech
0
750
Amazon CloudWatch を使って NW 監視を行うには
o11yfes2023
0
170
AIと開発者の共創: エージェント時代におけるAIフレンドリーなDevOpsの実践
bicstone
1
320
QA/SDETの現在と、これからの挑戦
imtnd
0
130
ブラウザのレガシー・独自機能を愛でる-Firefoxの脆弱性4選- / Browser Crash Club #1
masatokinugawa
1
480
Linuxのパッケージ管理とアップデート基礎知識
go_nishimoto
0
360
AWSの新機能検証をやる時こそ、Amazon Qでプロンプトエンジニアリングを駆使しよう
duelist2020jp
1
250
4/16/25 - SFJug - Java meets AI: Build LLM-Powered Apps with LangChain4j
edeandrea
PRO
2
110
Featured
See All Featured
GraphQLとの向き合い方2022年版
quramy
46
14k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
227
22k
KATA
mclloyd
29
14k
GitHub's CSS Performance
jonrohan
1030
460k
Mobile First: as difficult as doing things right
swwweet
223
9.6k
RailsConf 2023
tenderlove
30
1.1k
BBQ
matthewcrist
88
9.6k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
Making Projects Easy
brettharned
116
6.1k
The Cost Of JavaScript in 2023
addyosmani
49
7.7k
Navigating Team Friction
lara
184
15k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
34
2.2k
Transcript
T E X T M I N I N G
EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO T I D Y T E X T Data
Scientist at Stack Overflow @juliasilge https://juliasilge.com/ I’m Julia Silge
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT
T I D Y T E X T TEXT DATA
IS INCREASINGLY IMPORTANT NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = T I D
Y T E X T
https://github.com/juliasilge/tidytext
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
T I D Y T E X T EXPLORATORY DATA
ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS T I D Y T E X
T
from the Washington Post’s Wonkblog
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT? T I D Y T
E X T TERM FREQUENCY INVERSE DOCUMENT FREQUENCY
None
None
• As part of the NASA Datanauts program, I worked
on a project to understand NASA datasets • Metadata includes title, description, keywords, etc
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L N-GRAMS, NETWORKS, & NEGATION
None
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TOPIC MODELING
TOPIC MODELING T I D Y T E X T
•Each DOCUMENT = mixture of topics •Each TOPIC = mixture of words
None
None
None
None
T A K I N G T I D Y
T E X T T O T H E N E X T L E V E L TEXT CLASSIFICATION
TRAIN A GLMNET MODEL T I D Y T E
X T
TEXT CLASSIFICATION T I D Y T E X T
> library(glmnet) > library(doMC) > registerDoMC(cores = 8) > > is_jane <- books_joined$title == "Pride and Prejudice" > > model <- cv.glmnet(sparse_words, is_jane, family = "binomial", + parallel = TRUE, keep = TRUE)
None
None
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com JULIA SILGE
THANK YOU T I D Y T E X T
@juliasilge https://juliasilge.com Author portraits from Wikimedia Photos by Glen Noble and Kimberly Farmer on Unsplash JULIA SILGE