Slide 1

Slide 1 text

Text Mining and Data Viz 2018-05-12 leoluyi@iii Slides http://pcse.pw/6WHWJ © leoluyi, 2018 1

Slide 2

Slide 2 text

About me
- Leo Lu
- NTU Business Administration
- Currently working in the financial industry
- Build data products
- ETL
- Models
- Text mining
- Viz
- ...

Slide 3

Slide 3 text

Text Mining: Workflow and Tools

Slide 4

Slide 4 text

Old-generation tools vs. new-generation tools

Slide 5

Slide 5 text

In the past, we all used packages written by developers abroad: tm + tmcn, Rwordseg

Slide 6

Slide 6 text

But these packages often have unknown pitfalls when applied to Chinese text

Slide 7

Slide 7 text

Today we will use some newer tools

Slide 8

Slide 8 text

Workflow: Get data ➜ Tokenize ➜ Embedding ➜ Viz ➜ Model

Slide 9

Slide 9 text

Get data

Slide 10

Slide 10 text

PTT is a geek's good friend

Slide 11

Slide 11 text

Every day it produces a huge volume of posts

Slide 12

Slide 12 text

Write your own crawler yourself: devtools::install_github("leoluyi/PTTr")

Slide 13

Slide 13 text

Cleaning and preprocessing text: keep the information, remove the noise
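
As a rough illustration of "keeping the information and removing the noise", the sketch below strips URLs, HTML-like tags, and extra whitespace from one string. The sample string and the choice of what counts as noise are assumptions made for this example, and base R is used to avoid dependencies; stringr::str_replace_all() could be used in the same way.

```r
# Hypothetical raw post; the content is made up for illustration.
raw <- "Check   this out: https://example.com/post?id=1 <b>great</b> post!!  "

# 1. Drop URLs (anything starting with http:// or https:// up to the next space).
clean <- gsub("https?://[^[:space:]]+", " ", raw)

# 2. Drop HTML-like tags such as <b> and </b>.
clean <- gsub("<[^>]+>", " ", clean)

# 3. Collapse runs of whitespace and trim both ends.
clean <- gsub("[[:space:]]+", " ", clean)
clean <- trimws(clean)

print(clean)
```

Which patterns are noise depends on the task; for sentiment analysis, for example, the repeated punctuation kept here may itself carry signal.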

Slide 14

Slide 14 text

Tokenize: Transform whole text into parts

Slide 15

Slide 15 text

For English
- normalization
- stemming (extracting the word stem)
- lemmatization (restoring the base word form)
- POS tagging
- ...
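
A minimal sketch of the first two steps for English text, normalization followed by whitespace tokenization, in base R. The sample sentence is made up. Stemming and lemmatization are not shown because they need an additional package (for example SnowballC, which is an assumption and is not named in these slides).

```r
# Hypothetical input sentence.
text <- "The cats are running; the Cat RAN!"

# Normalization: lower-case everything and replace punctuation with spaces.
normalized <- tolower(text)
normalized <- gsub("[[:punct:]]+", " ", normalized)

# Tokenization: split on runs of whitespace.
tokens <- strsplit(trimws(normalized), "[[:space:]]+")[[1]]

print(tokens)
```

Note that "cats" and "cat", or "running" and "ran", remain distinct tokens here; merging them is exactly what stemming and lemmatization are for.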

Slide 16

Slide 16 text

Chinese seems more troublesome
- Word segmentation
- No word segmentation
- POS tagging
- ...

Slide 17

Slide 17 text

Semantic Parsing vs. Bag-of-Words
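
To make the bag-of-words side concrete, here is a small sketch that turns two toy documents into a document-term matrix of word counts, ignoring word order entirely. The documents are invented for the example, and base R is used instead of tm or text2vec to keep it self-contained.

```r
# Two hypothetical documents.
docs <- c(d1 = "the cat sat on the mat", d2 = "the dog sat")

# Tokenize on single spaces (the toy text is already clean).
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct token, sorted for a stable column order.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
dtm <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(dtm) <- vocab

print(dtm)
```

Each row is now a numeric vector for one document; "the cat sat" and "sat the cat" would produce identical rows, which is the information bag-of-words gives up compared with semantic parsing.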

Slide 18

Slide 18 text

R tools
- stringr
- jiebaR
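
A hedged sketch of Chinese word segmentation with jiebaR, using its worker() and segment() functions with default settings. The sample sentence is made up, and the exact tokens depend on the installed dictionary, so no expected output is shown. The code only runs the segmenter if jiebaR is installed.

```r
# Hypothetical sentence: "I like doing text mining with R".
text <- "我喜歡用R做文字探勘"

tokens <- NULL
if (requireNamespace("jiebaR", quietly = TRUE)) {
  seg <- jiebaR::worker()               # segmenter with default settings
  tokens <- jiebaR::segment(text, seg)  # character vector of tokens
  print(tokens)
} else {
  message("jiebaR is not installed; try install.packages('jiebaR').")
}
```

stringr fits in before this step, for cleaning the raw string, and after it, for filtering or reshaping the resulting tokens.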

Slide 19

Slide 19 text

Embedding (Encode, Feature Extraction)

Slide 20

Slide 20 text

Embedding
In a nutshell, Word Embedding turns text into numbers.
- Embedding Layer [1]
- Word2Vec
- GloVe
- doc2vec
- sense2vec
[1] https://machinelearningmastery.com/what-are-word-embeddings/
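
To show what "turning text into numbers" buys you, the sketch below uses a tiny hand-made embedding table and compares words by cosine similarity. The three vectors are invented for illustration and do not come from Word2Vec, GloVe, or any trained model.

```r
# Toy 3-dimensional embedding table; the numbers are made up.
emb <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.9, 0.7, 0.2),
  apple = c(0.1, 0.2, 0.9)
)

# Cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim_kq <- cosine(emb["king", ], emb["queen", ])
sim_ka <- cosine(emb["king", ], emb["apple", ])

# A simple document vector: the average of its word vectors.
doc_vec <- colMeans(emb[c("king", "queen"), , drop = FALSE])

print(c(king_queen = sim_kq, king_apple = sim_ka))
```

With trained embeddings the table would have one row per vocabulary word and typically tens to hundreds of columns, but lookup and comparison work the same way.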

Slide 21

Slide 21 text


Slide 22

Slide 22 text

Demo: Information Retrieval

Slide 23

Slide 23 text

Visualize
- Dimension Reduction
  - t-SNE
  - PCA
- Clustering
- Interactive or static plots

Slide 24

Slide 24 text

Visualize
- tsne::tsne()
- prcomp()
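
A minimal sketch of the PCA route with prcomp(): project a matrix of word vectors down to two dimensions and draw a static labelled scatter plot. The matrix here is random noise standing in for real embeddings, so the picture itself is meaningless; only the mechanics are the point. tsne::tsne() also takes a numeric matrix and could be tried in place of prcomp() for a non-linear projection.

```r
set.seed(42)

# Stand-in for real embeddings: 20 "words" with 10 dimensions each.
emb <- matrix(
  rnorm(20 * 10), nrow = 20,
  dimnames = list(paste0("word", 1:20), NULL)
)

# PCA: centre the columns, then keep the first two principal components.
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Static plot with one text label per word.
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(emb))
```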

Slide 25

Slide 25 text

Model

Slide 26

Slide 26 text

Tasks
- Classification
  - Text classification
- Clustering
  - Finding similar texts
- Generative models
  - Automatic text generation
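
As one example of these tasks, the sketch below clusters four toy documents by their term counts using hierarchical clustering from base R. The counts and term names are invented, and Euclidean distance with average linkage is just one reasonable choice among many.

```r
# Hypothetical document-term counts: two finance-like and two film-like documents.
dtm <- rbind(
  d1 = c(2, 1, 0, 0),
  d2 = c(1, 2, 0, 0),
  d3 = c(0, 0, 2, 1),
  d4 = c(0, 0, 1, 2)
)
colnames(dtm) <- c("stock", "bank", "movie", "actor")

# Hierarchical clustering on pairwise Euclidean distances between documents.
hc <- hclust(dist(dtm), method = "average")

# Cut the tree into two groups.
groups <- cutree(hc, k = 2)
print(groups)
```

Documents that share vocabulary end up in the same group; swapping the count vectors for embedding-based document vectors leaves the clustering code unchanged.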

Slide 27

Slide 27 text

In the end, you will want to write your own toolkit
- Sparse matrix manipulation
- Information retrieval tools
- ...
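
The two items above can be sketched together: store a document-term matrix as a sparse matrix with the Matrix package, then rank documents against a query by cosine similarity. The documents and the query are made up, and this is only a toy ranking function, not a full retrieval toolkit.

```r
library(Matrix)

# Hypothetical tokenized documents.
docs <- list(
  d1 = c("text", "mining", "r"),
  d2 = c("data", "viz", "r"),
  d3 = c("text", "mining", "text")
)
vocab <- sort(unique(unlist(docs)))

# Sparse document-term matrix; repeated (i, j) pairs are summed into counts.
dtm <- sparseMatrix(
  i = rep(seq_along(docs), lengths(docs)),
  j = match(unlist(docs), vocab),
  x = 1,
  dims = c(length(docs), length(vocab)),
  dimnames = list(names(docs), vocab)
)

# Query vector over the same vocabulary.
q <- as.numeric(vocab %in% c("text", "mining"))

# Cosine similarity between each document and the query.
scores <- as.vector(dtm %*% q) /
  (sqrt(Matrix::rowSums(dtm^2)) * sqrt(sum(q^2)))
names(scores) <- rownames(dtm)

ranking <- names(sort(scores, decreasing = TRUE))
print(ranking)
```

Only the non-zero counts are stored, which matters once the vocabulary grows to tens of thousands of terms and most entries are zero.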

Slide 28

Slide 28 text

Summary
1. Problem definition & specific goal: Get Curious About Text
2. Finding Your Data
3. Preprocessing Your Data
   - Removing stopwords, Stemming, Segmentation, ...
4. Feature Extraction
   - Document-Term Matrix: tm, text2vec
   - Named Entity Recognition, POS tagging
   - Word embeddings: word2vec, GloVe
5. More Text Mining Skills
   - sentiment analysis
   - topicmodels, LDAvis: LDA
6. More Than Words - Visualizing Your Results

Slide 29

Slide 29 text

Thank you
Leo Lu
leoluyi@github
https://leoluyi.github.io