NLP introduction in R 1
kur0cky
November 29, 2019
Lecture materials
Transcript
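Introduction to Natural Language Processing I
黒木 裕鷹 (kur0cky)

Agenda
• What is natural language processing?
• Preprocessing natural language data
• Analysis examples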
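
What is natural language processing?

What is natural language processing?
• A natural language is a language that people use, such as Japanese or English
• The goal of natural language processing (NLP) is to extract useful insights from natural language
Note: the author does not specialize in NLP, so this material may contain errors.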

Examples
• Machine translation
• Analysis of customer reviews
• Automatic answering of frequently asked questions
• Negative/positive analysis of news
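
Basic analysis flow
Text → (preprocessing, morphological analysis) → morpheme sequence → (preprocessing) → vectors → (machine learning, statistical analysis) → results

Recent developments
• Deep learning is state of the art on a wide range of tasks
• The breakthrough brought by BERT
  • Learns context in both the forward and backward directions
  • A single model can be reused for a variety of tasks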

Preprocessing natural language

Preprocessing
• Removing ruby (furigana), symbols, emoji, and similar noise
• Morphological analysis
• Removing stop words
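
Morphological analysis
Morpheme:
• The smallest unit of a character or symbol sequence that carries meaning
Morphological analysis:
• Performed by an analysis program together with a dictionary
  • The quality of the resulting features depends on the quality of the dictionary
  • For example, whether proper nouns and technical terms are recognized

Example of morphological analysis with MeCab
Input sentence (roughly): "I want to join the lab, but I do not have enough [...]."
Part-of-speech tags of the output, in order: noun, noun, noun, particle, verb, auxiliary verb, particle, symbol, noun, particle, verb, auxiliary verb, symbol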
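
MeCab
• A morphological analysis engine developed by the Kyoto University Graduate School of Informatics and NTT Communication Science Laboratories
• Uses the IPADic dictionary by default
• Available from R through the RMeCab package

Installation (mac)
Terminal:
    brew install mecab
    brew install mecab-ipadic
R:
    install.packages("RMeCab", repos = "http://rmecab.jp/R", type = "source")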

Try it out
    library(RMeCab)
    words <- RMeCabC(str = "すもももももももものうち", mypref = 0)
    do.call(c, words)  # RMeCabC returns a list; flatten it into a vector
    # With mypref = 1, each morpheme is converted to its base (dictionary) form
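
Vectorization
• To analyze documents, they first have to be turned into numbers
• The simplest representation is frequency: Term Frequency (TF)
(Table: one row per novel, one column per morpheme, such as 私 "I", いる "be", する "do", and a rarer word beginning with 煙; each cell holds the count of that morpheme in that novel. The cell values are not preserved in the transcript.)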
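
As a rough illustration of the TF table described above, the sketch below counts terms in a toy corpus with base R. The documents, token lists, and names (`docs`, `vocab`, `tf`) are hypothetical and stand in for the output of a morphological analyzer; nothing here comes from the lecture's dataset.

```r
# Toy corpus (hypothetical): each document is already split into morphemes.
docs <- list(
  novel1 = c("I", "be", "do", "I", "smoke"),
  novel2 = c("I", "do", "do", "be"),
  novel3 = c("be", "be", "I")
)

# Shared vocabulary across all documents.
vocab <- sort(unique(unlist(docs)))

# For each document, count how often every vocabulary term occurs.
# match() maps tokens to vocabulary positions; tabulate() counts them.
tf <- t(vapply(
  docs,
  function(tokens) tabulate(match(tokens, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(tf) <- vocab

print(tf)  # rows: documents, columns: terms
```

Each row of `tf` is the vector that represents one document in the later steps.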
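
Weighting schemes
• Boolean weighting
• TF-IDF weighting
  • With raw TF, very frequent morphemes such as いる ("be") and する ("do") receive too much weight
  • We would like to account for morphemes that are rare in general but characterize a document, such as the word beginning with 煙 in the table
(The same document-term table is shown again on this slide.)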
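
TF-IDF weighting
• TF (Term Frequency) × IDF (Inverse Document Frequency)

$$\text{TF-IDF}_{ij} = tf_{ij} \times \log\frac{N}{df_j}$$

where
• $tf_{ij}$: the number of times term $j$ appears in document $i$
• $df_j$: the number of texts in which term $j$ appears
• $N$: the total number of texts
The factor $\log(N / df_j)$ reflects how rare the term is across texts.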
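
To make the formula concrete, here is a minimal base R sketch that applies tf × log(N / df) to a small TF matrix. The matrix values and the names `doc_freq`, `idf`, and `tf_idf` are made up for illustration.

```r
# Hypothetical TF matrix: rows are documents, columns are terms.
tf <- matrix(
  c(2, 1, 1, 1,
    1, 1, 2, 0,
    1, 2, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("novel1", "novel2", "novel3"),
                  c("I", "be", "do", "smoke"))
)

N        <- nrow(tf)          # total number of texts
doc_freq <- colSums(tf > 0)   # df_j: number of texts containing term j
idf      <- log(N / doc_freq) # log(N / df_j)

# Scale every column j of tf by idf[j].
tf_idf <- sweep(tf, 2, idf, "*")

print(round(tf_idf, 3))
```

Terms that occur in every text ("I" and "be" here) get weight 0 because log(N / N) = 0, while "smoke", which occurs in only one text, gets the largest IDF.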
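
TF-IDF that accounts for document length
• Even for the same TF, the weight should differ depending on the length of the document

$$\text{TF-IDF}_{ij} = \frac{tf_{ij} \times \log\frac{N}{df_j}}{\sum_{j} tf_{ij}}$$

The weight is scaled down by the total number of words in each document.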
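
The effect of dividing by the document's total word count can be seen in a small sketch. The three toy documents below are hypothetical; two of them contain the term "smoke" the same number of times but differ in length.

```r
# Hypothetical TF matrix: "short" and "long" share the same TF for "smoke".
tf <- matrix(
  c(2, 1,
    2, 8,
    0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("short", "long", "other"), c("smoke", "I"))
)

idf    <- log(nrow(tf) / colSums(tf > 0))  # log(N / df_j)
tf_idf <- sweep(tf, 2, idf, "*")           # unnormalized TF-IDF

# Divide each row by that document's total term count (sum over j of tf_ij).
tf_idf_norm <- tf_idf / rowSums(tf)

print(round(tf_idf_norm, 3))
```

Before normalization "smoke" has the same weight in both documents; afterwards it weighs more in the short document, where it makes up a larger share of the text.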
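
Stop words
• Words that are of no use for the analysis
  • Function words such as particles and auxiliary verbs
  • "the", "a", and so on
• How to remove them
  • Use a dictionary
    Reference: http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt
  • Use occurrence frequency
    The top-ranked words account for most of all occurrences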
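
The frequency-based approach can be sketched as follows. This is only one possible reading of "use occurrence frequency": the token vector and the cutoff `top_k` are hypothetical, and in practice the cutoff would have to be chosen by inspecting the data.

```r
# Hypothetical token sequence.
tokens <- c("the", "a", "the", "cat", "the", "a", "sat", "the", "mat", "a")

# Frequency of each word, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Cumulative share of all occurrences covered by the top-ranked words.
share <- cumsum(freq) / sum(freq)
print(share)

# Treat the top_k most frequent words as stop word candidates (assumed cutoff).
top_k <- 2
stop_candidates <- names(freq)[seq_len(top_k)]

# Drop the candidates from the token sequence.
filtered <- tokens[!tokens %in% stop_candidates]
print(filtered)
```

Here the two most frequent words already cover 70% of all tokens, which mirrors the observation that a few top-ranked words dominate the counts.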
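
Practice in R: preprocessing

Setup
    library(tidyverse)  # the usual
    library(RMeCab)     # for morphological analysis

    # Load the dataset
    df <- read_csv("aozora.csv")

    # As an example, extract "No Longer Human" (人間失格)
    ningen <- df %>%
      filter(title == "人間失格") %>%
      pull(main_text)

    print(df)
    print(ningen)
• A dataset of novels from Aozora Bunko has been prepared: Osamu Dazai, Soseki Natsume, and Ryunosuke Akutagawa.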

Opening of "No Longer Human" (人間失格), as printed from the dataset:

はしがき\r\n\r\n　私は、その男の写真を三葉、見たことがある。\r\n　一葉は、その男の、幼年時代、とでも言うべきであろうか、十歳前後かと推定される頃の写真であって、その子供が大勢の女のひとに取りかこまれ、（それは、その子供の姉たち、妹たち、それから、従姉妹（いとこ）たちかと想像される）庭園の池のほとりに、荒い縞の袴（はかま）をはいて立ち、首を三十度ほど左に傾け、醜く笑っている写真である。醜く？　けれども、鈍い人たち（つまり、美醜などに関心を持たぬ人たち）は、面白くも何とも無いような顔をして、\r\n「可愛い坊ちゃんですね」\r\n　といい加減なお世辞を言っても、まんざら空（から）お世辞に聞えないくらいの、謂（い）わば通俗の「可愛らしさ」みたいな影もその子供の笑顔に無いわけではないのだが、しかし、いささかでも、美醜に就いての訓練を経て来たひとなら、ひとめ見てすぐ、\r\n「なんて、いやな子供だ」\r\n　と頗（すこぶ）る不快そうに呟（つぶや）き、毛虫でも払いのける時のような手つきで、その写真をほうり投げるかも知れない。\r\n　まったく、その子供の笑顔は、よく見れば見るほど、何とも知れず、イヤな薄気味悪いものが感ぜられて来る。どだい、それは、笑顔でない。この子は、少しも笑ってはいないのだ。その証拠には、この子は、両方のこぶしを固く握って立っている。人間は、こぶしを固く握りながら笑えるものでは無いのである。猿だ。

• The text contains ruby (readings in parentheses)
• It contains "\n" line-break markers, spaces, and the like
• Explanatory parentheses should not be removed
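
Removal with regular expressions

Regular expressions
• A way of expressing a set of strings with a single string (Wikipedia)
• Example: {"くろき", "くりき", "くさき"} → "く.{1}き"

    ningen <- str_remove_all(ningen, "\n|\r|（.{1,10}）| |　|一")

• "." matches any single character
• "{}" specifies the number of repetitions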
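
The sketch below shows how a pattern of this shape treats short ruby parentheses versus longer explanatory ones. It uses base R `gsub()` instead of `str_remove_all()` so that it runs without extra packages, writes the Japanese characters as `\u` escapes so the result does not depend on the file encoding, and leaves out the final `|一` alternative. The sample strings are made up for illustration.

```r
# Pattern: newline, carriage return, full-width parentheses enclosing
# 1 to 10 characters, half-width space, or full-width space (U+3000).
pattern <- "\n|\r|\uff08.{1,10}\uff09| |\u3000"

# Sample 1: 空（から）<CR><LF><full-width space>お
# A kanji followed by its ruby in full-width parentheses, then a line break.
x <- "\u7a7a\uff08\u304b\u3089\uff09\r\n\u3000\u304a"
cleaned <- gsub(pattern, "", x)
print(cleaned)  # the ruby, the line break, and the space are removed

# Sample 2: full-width parentheses with 11 characters inside.
# This is longer than the {1,10} limit, so the parenthetical is kept.
y <- "\uff08explanatory\uff09"
kept <- gsub(pattern, "", y)
print(kept)
```

Limiting the match to at most 10 characters is a heuristic: it removes short readings while leaving longer explanatory parentheticals in place.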
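
Morphological analysis and stop word removal
Use the list linked earlier as the stop word list.

    stopwords <- scan(
      file = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt",
      what = ""  # tells scan() that the input is character data
    )

    ningen_words <- RMeCabC(ningen, mypref = 1) %>%  # morphological analysis, base-form conversion
      unlist() %>%                                   # list to vector
      tibble(verb = names(.),                        # to a data frame; add the part-of-speech column
             word = .) %>%
      filter(!word %in% stopwords)                   # remove stop words; %in% is TRUE when the value is contained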

Computing TF
    # Symbols, particles, and auxiliary verbs dominate
    count(ningen_words, verb, sort = TRUE)
    count(ningen_words, word, sort = TRUE)

    # Count only nouns, adjectives, and verbs
    ningen_words %>%
      filter(verb %in% c("名詞", "形容詞", "動詞")) %>%
      count(word, sort = TRUE)

Class exercise
• Pick one novel you like and extract its nouns with the highest TF

Morphological analysis of multiple novels
    res_mecab <- df %>%
      filter(author == "芥川龍之介") %>%
      mutate(mecab = map(main_text, RMeCabC, mypref = 1)) %>%
      select(author, title, mecab)

    df2 <- tibble()
    for (i in seq_len(nrow(res_mecab))) {
      # try() lets the loop continue if one novel fails to process
      try({
        df2 <- tibble(verb = sapply(res_mecab$mecab[[i]], names),
                      word = sapply(res_mecab$mecab[[i]], c)) %>%
          filter(verb %in% c("名詞", "形容詞", "動詞")) %>%
          filter(!word %in% stopwords) %>%
          mutate(author = res_mecab$author[[i]],
                 title = res_mecab$title[[i]]) %>%
          bind_rows(df2, .)
      })
    }
You can simply run this code without worrying about the details. Then inspect df2.

Class exercise
• Fill in the ***** blanks in the following code and compute TF-IDF

    TF <- df2 %>%
      count(*****, *****, name = "TF")

    IDF <- df2 %>%
      distinct(title, word) %>%
      group_by(word) %>%
      summarise(IDF = *****)

    TF_IDF <- TF %>%
      left_join(IDF, by = *****) %>%
      mutate(TF_IDF = TF * IDF) %>%
      group_by(title) %>%
      mutate(TF_IDF = TF_IDF / sum(TF)) %>%
      ungroup() %>%
      arrange(desc(TF_IDF))
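
Data analysis example

Example
• Which novels resemble which? Run a principal component analysis (PCA) and visualize the result
  • Run morphological analysis on all novels
  • Compute TF-IDF
  • Run PCA using only the words with the highest TF (using every word would be too many)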
(Figure: the words that contribute most to the principal components)
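
A minimal sketch of this kind of analysis with base R's `prcomp()` is shown below. The TF-IDF values, document names, and word names are invented for illustration, and the lecture's actual code and settings (for example, whether variables were scaled) are not known from the slides.

```r
# Hypothetical TF-IDF matrix: rows are novels, columns are selected words.
tfidf <- matrix(
  c(0.9, 0.1, 0.0,
    0.8, 0.2, 0.1,
    0.1, 0.9, 0.7,
    0.0, 0.8, 0.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("novel", 1:4), c("word_a", "word_b", "word_c"))
)

# PCA on the centered data (no scaling; this is an assumption).
pca <- prcomp(tfidf, center = TRUE, scale. = FALSE)

# Coordinates of each novel on the first two components:
# novels with similar word usage end up close together.
scores <- pca$x[, 1:2]
print(round(scores, 3))

# Loadings of the first component: words with large absolute values
# contribute most to that component.
loadings  <- pca$rotation[, 1]
top_words <- names(sort(abs(loadings), decreasing = TRUE))
print(top_words)
```

Plotting `scores` gives the kind of similarity map described above, and sorting the loadings gives a ranking of the words that drive each component.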
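
Assignment

Assignment
• Research one application of natural language processing and summarize it (fine technical details may be omitted)
• In fact, the analysis example above is not a very good one. Based on its results and procedure, consider what the problems are and list them.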