Julia Silge
March 04, 2019
Text Mining: Exploratory Data Analysis to Machine Learning
March 2019 talk at WiDS Salt Lake City regional event
Transcript
TEXT MINING: EXPLORATORY DATA ANALYSIS TO MACHINE LEARNING
HELLO
I’m Julia Silge, Data Scientist at Stack Overflow
@juliasilge
https://juliasilge.com/
TEXT DATA IS INCREASINGLY IMPORTANT
NLP TRAINING IS SCARCE ON THE GROUND
TIDY DATA PRINCIPLES + COUNT-BASED METHODS = TIDYTEXT
https://github.com/juliasilge/tidytext
http://tidytextmining.com/
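The "tidy data principles + count-based methods" idea can be illustrated with a minimal sketch. The toy corpus and the object names (`docs`, `tidy_words`, `word_counts`) are assumptions made for illustration and do not come from the talk; `unnest_tokens()` is the tidytext tokenizer and `count()` comes from dplyr.

```r
library(dplyr)
library(tidytext)

# Toy corpus (made up for illustration; not from the talk)
docs <- tibble(
  doc  = c("a", "b"),
  text = c("Text data is increasingly important.",
           "Tidy data principles make text data easier to handle.")
)

# Tidy text format: one row per document-word pair.
# unnest_tokens() lowercases and strips punctuation by default.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Count-based method: how often does each word occur overall?
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

print(word_counts)
```

Because the result is an ordinary data frame, the usual dplyr and ggplot2 verbs apply to it without any text-specific data structure.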
EXPLORATORY DATA ANALYSIS N-GRAMS AND MORE WORDS MACHINE LEARNING
EXPLORATORY DATA ANALYSIS
from the Washington Post’s Wonkblog
D3 visualization on Glitch
WHAT IS A DOCUMENT ABOUT?
• TERM FREQUENCY
• INVERSE DOCUMENT FREQUENCY
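As a rough sketch of how term frequency and inverse document frequency combine, the example below uses tidytext's `bind_tf_idf()`, which computes tf as a word's share of its document's words and idf as the natural log of (number of documents / number of documents containing the word). The three toy documents and the names `docs`, `doc_words`, and `scored` are assumptions for illustration, not data from the talk.

```r
library(dplyr)
library(tidytext)

# Toy corpus (made up for illustration; not from the talk)
docs <- tibble(
  doc  = c("a", "b", "c"),
  text = c("the rocket launch the rocket",
           "the ocean data",
           "the climate data")
)

# Word counts per document
doc_words <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

# Add tf, idf, and tf_idf columns, then rank by tf_idf
scored <- doc_words %>%
  bind_tf_idf(word, doc, n) %>%
  arrange(desc(tf_idf))

print(scored)
```

A word such as "the" that appears in every document gets idf = ln(3/3) = 0, so its tf-idf is zero no matter how frequent it is, while a word concentrated in one document rises to the top.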
• As part of the NASA Datanauts program, I worked on a project to understand NASA datasets
• Metadata includes title, description, keywords, etc.
TAKING TIDYTEXT TO THE NEXT LEVEL
N-GRAMS, NETWORKS, & NEGATION
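One possible way to work with n-grams and negation in tidy form is sketched below. It tokenizes into bigrams with `unnest_tokens(token = "ngrams", n = 2)`, splits each bigram with tidyr's `separate()`, and keeps the words that follow "not". The toy sentences and the names `docs`, `bigrams`, and `negated` are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy corpus (made up for illustration; not from the talk)
docs <- tibble(
  doc  = c("a", "b"),
  text = c("I am not happy with this",
           "I am happy and not sad")
)

# One row per bigram, split into its two words
bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Negation: which words are preceded by "not"?
negated <- bigrams %>%
  filter(word1 == "not") %>%
  count(word2, sort = TRUE)

print(negated)
```

The `word1`/`word2` columns also form an edge list, so counts of word pairs can be handed to a graph package to draw a word network.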
TAKING TIDYTEXT TO THE NEXT LEVEL
TOPIC MODELING
TOPIC MODELING
• Each DOCUMENT = mixture of topics
• Each TOPIC = mixture of words
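The two mixtures above can be made concrete with a small sketch. The slide does not name a specific model; this example assumes latent Dirichlet allocation via the topicmodels package, with tidytext's `cast_dtm()` (which needs the tm package installed) and `tidy()` moving data in and out of tidy form. The toy corpus, `k = 2`, the seed, and the object names are all assumptions for illustration.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus (made up for illustration; not from the talk)
docs <- tibble(
  doc  = c("a", "b", "c", "d"),
  text = c("rocket launch orbit rocket engine",
           "orbit launch engine rocket",
           "ocean climate ice ocean water",
           "climate water ice ocean")
)

# Tidy counts -> document-term matrix, the input LDA() expects
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

# Fit a two-topic model; k and seed are arbitrary choices here
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Each TOPIC = mixture of words (per-topic word probabilities)
topic_words <- tidy(lda, matrix = "beta")

# Each DOCUMENT = mixture of topics (per-document topic proportions)
doc_topics <- tidy(lda, matrix = "gamma")

print(doc_topics)
```

In both tidied outputs the probabilities sum to one within a topic (`beta`) or within a document (`gamma`), which is the "mixture" the slide refers to.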
TAKING TIDYTEXT TO THE NEXT LEVEL
TEXT CLASSIFICATION
TRAIN A GLMNET MODEL
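TEXT CLASSIFICATION

    library(glmnet)
    library(doMC)
    registerDoMC(cores = 8)   # parallel backend for cross-validation

    # Response: TRUE for documents from "Pride and Prejudice"
    is_jane <- books_joined$title == "Pride and Prejudice"

    # Cross-validated regularized logistic regression on the sparse word counts
    model <- cv.glmnet(sparse_words, is_jane, family = "binomial",
                       parallel = TRUE, keep = TRUE)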
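The slide does not show how `sparse_words` and `books_joined` are built. The sketch below is one plausible way to prepare them, assuming a data frame with one row per document and using tidytext's `cast_sparse()` to produce the sparse matrix that `cv.glmnet()` accepts. The toy `books` table, its `document` and `text` columns, and the title "Another Book" are hypothetical; no model is fitted here.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per document, labeled by book title
books <- tibble(
  document = 1:4,
  title = c("Pride and Prejudice", "Pride and Prejudice",
            "Another Book", "Another Book"),
  text = c("it is a truth universally acknowledged",
           "a single man in possession of a good fortune",
           "the engine hummed as the ship left orbit",
           "the crew watched the planet recede")
)

# Documents as rows, words as columns, counts as values
sparse_words <- books %>%
  unnest_tokens(word, text) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Align the labels with the row order of the sparse matrix
books_joined <- tibble(document = as.integer(rownames(sparse_words))) %>%
  left_join(books %>% select(document, title), by = "document")

is_jane <- books_joined$title == "Pride and Prejudice"
```

Joining on the matrix row names keeps the response vector in the same order as the predictor rows, which matters because `cv.glmnet()` matches the two by position.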
THANK YOU
JULIA SILGE
@juliasilge
https://juliasilge.com
Author portraits from Wikimedia
Photos by Glen Noble and Kimberly Farmer on Unsplash