text_mining_slides_20180512
Leo Lu
May 12, 2018
Technology
Transcript
Text Mining and Data Viz
2018-05-12
leoluyi@iii
Slides: http://pcse.pw/6WHWJ
© leoluyi, 2018
About me
- Leo Lu
- ݣय़ૡᓕ
- ፓ獮ෝᰂᣟ禂๐率
- Build data products
  - ETL
  - Models
  - Text mining
  - Viz
  - ...
Text Mining: Pipeline and Tools
Old-generation tools vs. new-generation tools
In the past, we all used packages that others had written: tm + tmcn, Rwordseg
But these packages often hide unexpected pitfalls when used on Chinese text
Today we will use some newer tools
Pipeline
Get data ➜ Tokenize ➜ Embedding ➜ Viz ➜ Model
Get data
PTT ฎ疌疌ጱঅ๏
ྯॠ᮷磪盄ग़盄ग़ጱ䔂承碘
Write your own crawler yourself
devtools::install_github("leoluyi/PTTr")
Cleaning and preprocessing text
Keep the information, remove the noise
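The cleaning step can be sketched in base R with a few regular expressions. This is a minimal illustration, not the deck's actual code: the sample strings and the `clean_text` helper are hypothetical, and which patterns count as noise depends on the data source.

```r
# Hypothetical raw posts, invented for illustration.
raw <- c(
  "Check this out!!! http://example.com/a?b=1   so   cool",
  "<b>R</b> is great :) https://example.org"
)

# Hypothetical helper: strip common noise, keep the words.
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)                 # drop HTML tags
  x <- gsub("https?://[^[:space:]]+", " ", x)  # drop URLs (before punctuation)
  x <- gsub("[[:punct:]]+", " ", x)            # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text(raw)
print(cleaned)
```

URLs are removed before punctuation on purpose; stripping punctuation first would break the URL pattern.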
Tokenize
Transform whole text into parts
For English
- normalization
- stemming
- lemmatization
- POS tagging
- ...
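For English, the first items on this list can be approximated in base R. The sketch below covers only normalization (case folding), whitespace tokenization, and stopword removal; the `tokenize_en` helper and its tiny stopword list are assumptions for illustration. Stemming, lemmatization, and POS tagging need dedicated packages and are not shown.

```r
# Hypothetical helper: lower-case, split on non-alphanumerics, drop stopwords.
tokenize_en <- function(x, stopwords = c("the", "a", "an", "of", "and", "is")) {
  x <- tolower(x)                      # normalization: case folding
  x <- gsub("[^a-z0-9]+", " ", x)      # keep only letters and digits
  tokens <- strsplit(trimws(x), " ", fixed = TRUE)
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stopwords)])
}

tokens <- tokenize_en(c("The quick brown fox.", "Text mining is FUN!"))
print(tokens)
```

The result is a list with one character vector of tokens per input document.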
Chinese is trickier
- word segmentation
- no word segmentation
- POS tagging
- ...
Semantic Parsing vs. Bag-of-Words
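To make the bag-of-words side of this comparison concrete, here is a small base R sketch that turns tokenized documents into a document-term matrix of raw counts. The documents are invented for illustration; packages such as tm or text2vec would do this at scale.

```r
# Hypothetical, already tokenized documents.
docs <- list(
  d1 = c("text", "mining", "is", "fun"),
  d2 = c("mining", "text", "data"),
  d3 = c("data", "viz", "is", "fun", "fun")
)

vocab <- sort(unique(unlist(docs)))

# One row per document, one column per term; each cell is a raw count.
dtm <- t(vapply(
  docs,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab
print(dtm)
```

Word order is discarded: d1 and d2 share "text" and "mining" even though the words appear in a different order, which is exactly what a semantic parser would not ignore.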
R tools
- stringr
- jiebaR
Embedding (Encode, Feature Extraction)
Embedding
In a nutshell, Word Embedding turns text into numbers.
- Embedding Layer [1]
- Word2Vec
- GloVe
- doc2vec
- sense2vec
[1] https://machinelearningmastery.com/what-are-word-embeddings/
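As a rough illustration of "turning text into numbers", the sketch below builds word vectors from a co-occurrence matrix and a truncated SVD in base R. This is a simple count-based stand-in, not Word2Vec, GloVe, or any of the other methods listed; the sentences and the choice of `k = 2` are assumptions made for the example.

```r
# Hypothetical tokenized sentences.
sentences <- list(
  c("cat", "eats", "fish"),
  c("dog", "eats", "meat"),
  c("cat", "chases", "dog")
)

vocab <- sort(unique(unlist(sentences)))
cooc <- matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))

# Count how often two distinct words appear in the same sentence.
for (s in sentences) {
  for (a in s) for (b in s) if (a != b) cooc[a, b] <- cooc[a, b] + 1
}

# Truncated SVD: keep the top-k left singular vectors, scaled by singular values.
k <- 2
sv <- svd(cooc)
embedding <- sv$u[, 1:k, drop = FALSE] %*% diag(sv$d[1:k], nrow = k)
rownames(embedding) <- vocab
print(round(embedding, 3))
```

Each row of `embedding` is a dense k-dimensional vector for one word; words that share contexts end up with similar rows.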
Demo: Information Retrieval
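The demo itself is not in the transcript. As a stand-in, here is a minimal base R sketch of one common retrieval approach: weight terms by TF-IDF and rank documents by cosine similarity to the query. The documents and the helper names (`count_terms`, `cosine`, `search`) are hypothetical.

```r
# Hypothetical tokenized documents.
docs <- list(
  d1 = c("r", "text", "mining", "tools"),
  d2 = c("python", "web", "crawler"),
  d3 = c("text", "mining", "with", "r", "and", "r", "packages")
)

vocab <- sort(unique(unlist(docs)))
# Terms outside the vocabulary are silently ignored.
count_terms <- function(tok) as.numeric(table(factor(tok, levels = vocab)))

tf <- t(vapply(docs, count_terms, numeric(length(vocab))))
colnames(tf) <- vocab
idf <- log(length(docs) / colSums(tf > 0))   # rarer terms weigh more
tfidf <- sweep(tf, 2, idf, `*`)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Rank all documents against a tokenized query, best match first.
search <- function(query_tokens) {
  q <- count_terms(query_tokens) * idf
  scores <- apply(tfidf, 1, cosine, b = q)
  sort(scores, decreasing = TRUE)
}

ranking <- search(c("text", "mining"))
print(ranking)
```

The shorter document d1 ranks above d3 because cosine similarity normalizes for document length, and d2 scores zero because it shares no terms with the query.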
Visualize
- Dimension Reduction
  - t-SNE
  - PCA
- Clustering
- Interactive or static plots
Visualize
- tsne::tsne()
- prcomp()
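A short sketch of the PCA route with `prcomp()` from base R follows. The input is random data standing in for an embedding matrix, so the plot itself carries no meaning; the point is the shape of the workflow. `tsne::tsne()` would take the same matrix as input but needs the tsne package, so it is not run here.

```r
set.seed(42)

# Stand-in for an embedding matrix: 20 hypothetical words x 10 dimensions.
vectors <- matrix(
  rnorm(20 * 10), nrow = 20,
  dimnames = list(paste0("word", 1:20), NULL)
)

# Project onto the first two principal components.
pca <- prcomp(vectors, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Static scatter plot with each point labelled by its word.
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))
```

PCA is linear and fast, which makes it a reasonable first look before trying a slower non-linear method such as t-SNE.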
Model
Tasks
- Classification
- Clustering
  - finding similar items
- Generative models
  - automatic generation
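Of these tasks, clustering is the easiest to sketch without labels. The example below groups documents by the similarity of their term counts using hierarchical clustering from base R. The count matrix is invented, and average linkage on length-normalized rows is one reasonable choice among many, not the deck's stated method.

```r
# Hypothetical document-term counts: two sports posts, two finance posts.
dtm <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 0,
    0, 0, 3, 2,
    0, 1, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("sports1", "sports2", "finance1", "finance2"),
    c("ball", "team", "stock", "market")
  )
)

# Scale each row to unit length so distance reflects word mix, not document length.
unit <- dtm / sqrt(rowSums(dtm^2))

tree <- hclust(dist(unit), method = "average")
clusters <- cutree(tree, k = 2)
print(clusters)
```

Documents that use similar vocabulary land in the same cluster, here the two sports posts in one group and the two finance posts in the other.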
In the end, you will want to write your own toolkit
- Sparse Matrix manipulation
- Information retrieval tools
- ...
Summary
1. Problem definition & specific goal: Get Curious About Text
2. Finding Your Data
3. Preprocessing Your Data
   - Removing stopwords, Stemming, Segmentation, ...
4. Feature Extraction
   - Document-Term Matrix: tm, text2vec
   - Named Entity Recognition, POS tagging
   - Word embeddings: word2vec, GloVe
5. More Text Mining Skills
   - sentiment analysis
   - topicmodels, LDAvis: LDA
6. More Than Words - Visualizing Your Results
Thank you for listening
Leo Lu
leoluyi@github
https://leoluyi.github.io