NLP introduction in R 1

kur0cky
November 29, 2019

Lecture materials


Transcript

  1. Introduction to Natural Language Processing I (Yutaka Kuroki)

  2. Agenda
     • What is natural language processing?
     • Preprocessing natural-language data
     • Analysis example

  3. What is natural language processing?

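  4. What is natural language processing?
     • A natural language is a language that people use, such as Japanese or English.
     • The goal of natural language processing is to extract useful knowledge from natural language.
     Note: Kuroki does not specialize in natural language processing, so this material may contain errors.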
  5. Examples
     • Machine translation
     • Analysis of customer reviews
     • Automatic answers to frequently asked questions
     • Negative/Positive analysis of news

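  6. Basic analysis flow
     text → (preprocessing, morphological analysis) → morpheme sequence → (preprocessing) → numeric vector → (machine learning, statistical analysis) → result
     Recent developments
     • Deep learning is the state of the art on a wide range of tasks.
     • BERT was a breakthrough:
       • it learns context in both directions, from the preceding and the following text;
       • a single model can be reused for many different tasks.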
  7. Preprocessing natural language

  8. Preprocessing
     • Remove ruby (furigana), symbols, emoji and the like
     • Morphological analysis
     • Remove stop words

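  9. Morphological analysis
     Morpheme:
     • The smallest unit of a character or symbol sequence that carries meaning
     • Analysis relies on an analysis program together with a dictionary
       → the features you can build depend on the quality of the dictionary, for example on whether proper nouns and technical terms are recognized
     Example of morphological analysis with MeCab:
       Input: 「研究室に入りたいが，成績が足りない．」 ("I want to join the lab, but my grades are not good enough.")
       Part-of-speech tags as listed on the slide: noun, noun, noun, particle, verb, auxiliary verb, particle, symbol, noun, particle, verb, auxiliary verb, symbol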
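  10. MeCab (和布蕪)
     • A morphological analysis engine from the Graduate School of Informatics, Kyoto University, and NTT Communication Science Laboratories
     • The IPADic dictionary is used by default
     • Available from R through the RMeCab package
     Installation (mac, terminal):
         brew install mecab
         brew install mecab-ipadic
     Installation (mac, R):
         install.packages("RMeCab", repos = "http://rmecab.jp/R", type = "source")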
  11. Let's try it
         library(RMeCab)
         words <- RMeCabC(str = "すもももももももものうち", mypref = 0)
         do.call(c, words)  # RMeCabC returns a list; flatten it into a vector
         # With mypref = 1, each morpheme is converted to its base form
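To see what `do.call(c, words)` does without installing MeCab, the sketch below mimics the shape of the `RMeCabC()` result: a list with one length-one character vector per morpheme, each named by its part of speech. The `mock_words` object is a hand-written, hypothetical stand-in, not real MeCab output.

```r
# Hypothetical stand-in for the list RMeCabC() returns:
# one named length-1 character vector per morpheme (name = part of speech).
mock_words <- list(
  c("名詞" = "すもも"),
  c("助詞" = "も"),
  c("名詞" = "もも"),
  c("助詞" = "も"),
  c("名詞" = "もも"),
  c("助詞" = "の"),
  c("名詞" = "うち")
)

# do.call(c, x) calls c(x[[1]], x[[2]], ...) and so returns one named vector.
flat <- do.call(c, mock_words)

print(flat)         # morphemes, named by part of speech
print(names(flat))  # the part-of-speech labels only
```

For a list of this shape, `unlist(mock_words)` gives the same named vector, which is the form used later in the slides.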
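  12. Converting text to numeric vectors
     • To analyze documents, they must first be turned into numbers.
     • The simplest choice is the frequency of each term (Term Frequency, TF).
     [Table: term counts with one row per novel and one column per morpheme (私, いる, する, 夜, 煙草, …); the counts are not preserved in this transcript.]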
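A document-term count table of this kind can be built in base R with `table()`. The sketch below uses a tiny made-up corpus that is assumed to be already split into morphemes; the document names and tokens are illustrative only.

```r
# Toy corpus: each document is already split into morphemes (hypothetical data).
docs <- list(
  novel_a = c("私", "夜", "煙草", "する", "いる", "いる"),
  novel_b = c("私", "いる", "する", "する"),
  novel_c = c("夜", "夜", "いる")
)

# Long format: one row per (document, morpheme) occurrence.
long <- data.frame(
  doc  = rep(names(docs), lengths(docs)),
  word = unlist(docs, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Cross-tabulate to get the document-term matrix of raw counts (TF).
tf <- unclass(table(long$doc, long$word))
print(tf)
```

Each row of `tf` is the numeric vector that represents one document.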
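  13. Weighting schemes
     • Boolean weighting
     • TF-IDF weighting
       • With raw TF, very frequent morphemes such as 「いる」 and 「する」 carry too much weight.
       • We want to take into account morphemes such as 「煙草」 (tobacco) that are uncommon in general but characterize a document.
     [Table: one row per novel and one column per morpheme (私, いる, する, 夜, 煙草, …); the values are not preserved in this transcript.]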
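As a minimal sketch of the first scheme, Boolean weighting replaces each count by 1 if the morpheme occurs in the document and 0 otherwise. The TF matrix below is made up for illustration.

```r
# Hypothetical TF matrix: rows are documents, columns are morphemes.
tf <- matrix(
  c(3, 5, 0,
    0, 2, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("novel_a", "novel_b", "novel_c"),
                  c("私", "いる", "煙草"))
)

# Boolean weighting: 1 if the morpheme occurs in the document at all, else 0.
bool_w <- (tf > 0) * 1
print(bool_w)

# Share of each document's tokens taken up by its most frequent morpheme,
# which shows how strongly raw TF is driven by common words.
print(apply(tf, 1, max) / rowSums(tf))
```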
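  14. TF-IDF weighting
     • TF (Term Frequency) × IDF (Inverse Document Frequency)
     $$\mathrm{TFIDF}_{ij} = tf_{ij} \times \log\frac{N}{df_j}$$
     • $tf_{ij}$: number of times term $j$ appears in document $i$
     • $df_j$: number of texts in which term $j$ appears
     • $N$: total number of texts
     The factor $\log(N/df_j)$ measures how rare the term is: the fewer texts contain it, the larger the weight.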
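A small worked instance may help. The numbers are made up, and the natural logarithm is an assumption because the slide does not state the base. Take $N = 4$ texts and a term that appears $tf_{ij} = 3$ times in document $i$:

```latex
\begin{align*}
df_j = 1:&\quad \mathrm{TFIDF}_{ij} = 3 \times \ln\frac{4}{1} \approx 3 \times 1.386 \approx 4.16 \\
df_j = 2:&\quad \mathrm{TFIDF}_{ij} = 3 \times \ln\frac{4}{2} \approx 3 \times 0.693 \approx 2.08 \\
df_j = 4:&\quad \mathrm{TFIDF}_{ij} = 3 \times \ln\frac{4}{4} = 0
\end{align*}
```

A term that appears in every text gets weight zero however often it occurs, while the same count for a term confined to one text is weighted most heavily.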
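  15. TF-IDF that accounts for document length
     • The same TF should not carry the same weight in documents of different lengths.
     $$\mathrm{TFIDF}_{ij} = \frac{tf_{ij} \times \log\frac{N}{df_j}}{\sum_{j} tf_{ij}}$$
     The denominator normalizes by the total number of words in each document.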
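The length-normalized formula can be sketched in a few lines of base R. The matrix is invented for illustration, and the natural logarithm is an assumption since the slides do not state the base.

```r
# Hypothetical TF matrix: rows are documents i, columns are terms j.
tf <- matrix(
  c(2, 4, 2,
    0, 1, 1,
    3, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("rare", "common", "mid"))
)

N   <- nrow(tf)         # total number of texts
df  <- colSums(tf > 0)  # number of texts containing each term
idf <- log(N / df)      # natural log; the base is an assumption

# tf_ij * idf_j: sweep() multiplies column j by idf[j].
tfidf <- sweep(tf, 2, idf, `*`)

# Divide every row by that document's total token count.
tfidf_norm <- tfidf / rowSums(tf)
print(round(tfidf_norm, 3))
```

The term `common` occurs in all three documents, so its weight is zero everywhere regardless of its count.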
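  16. Stop words
     • Words that are of no use in the analysis
       • Function words such as particles and auxiliary verbs
       • "the", "a", etc.
     • How to remove them
       • Use a dictionary
         Reference: http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt
       • Use occurrence frequency: the top few words account for most of the tokens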
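The frequency-based approach can be sketched as follows. The token stream is made up, and the cut-off `k` is a hypothetical free parameter rather than a value taken from the slides.

```r
# Hypothetical token stream after morphological analysis.
tokens <- c(rep("の", 30), rep("に", 25), rep("は", 20), rep("する", 15),
            "研究室", "成績", "煙草", "写真", "笑顔",
            "子供", "猿", "夜", "袴", "庭園")

# Frequency of each distinct token, most frequent first.
freq <- sort(table(tokens), decreasing = TRUE)

# Cumulative share of all tokens covered by the top-k types.
cum_share <- cumsum(as.vector(freq)) / length(tokens)
names(cum_share) <- names(freq)
print(round(head(cum_share, 5), 2))

# Treat the top-k most frequent types as stop words (k is a free choice).
k <- 4
stop_by_freq <- names(freq)[seq_len(k)]
kept <- tokens[!tokens %in% stop_by_freq]
print(kept)
```

In this toy data four function words cover most of the tokens, which is the pattern the slide describes.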
  17. Practice in R: preprocessing

  18. Setup
     • A dataset of novels (Dazai, Soseki, Akutagawa) was prepared from Aozora Bunko.
         library(tidyverse)  # the usual
         library(RMeCab)     # for morphological analysis
         # Read the dataset
         df <- read_csv("aozora.csv")
         # As an example, extract 人間失格 (No Longer Human)
         ningen <- df %>%
           filter(title == "人間失格") %>%
           pull(main_text)
         print(df)
         print(ningen)
  19. Opening of 『人間失格』
     はしがき\r\n\r\n　私は、その男の写真を三葉、見たことがある。\r\n　一葉は、その男の、幼年時代、とでも言うべきであろうか、十歳前後かと推定される頃の写真であって、その子供が大勢の女のひとに取りかこまれ、（それは、その子供の姉たち、妹たち、それから、従姉妹（いとこ）たちかと想像される）庭園の池のほとりに、荒い縞の袴（はかま）をはいて立ち、首を三十度ほど左に傾け、醜く笑っている写真である。醜く？　けれども、鈍い人たち（つまり、美醜などに関心を持たぬ人たち）は、面白くも何とも無いような顔をして、\r\n「可愛い坊ちゃんですね」\r\n　といい加減なお世辞を言っても、まんざら空（から）お世辞に聞えないくらいの、謂（い）わば通俗の「可愛らしさ」みたいな影もその子供の笑顔に無いわけではないのだが、しかし、いささかでも、美醜に就いての訓練を経て来たひとなら、ひとめ見てすぐ、\r\n「なんて、いやな子供だ」\r\n　と頗（すこぶ）る不快そうに呟（つぶや）き、毛虫でも払いのける時のような手つきで、その写真をほうり投げるかも知れない。\r\n　まったく、その子供の笑顔は、よく見れば見るほど、何とも知れず、イヤな薄気味悪いものが感ぜられて来る。どだい、それは、笑顔でない。この子は、少しも笑ってはいないのだ。その証拠には、この子は、両方のこぶしを固く握って立っている。人間は、こぶしを固く握りながら笑えるものでは無いのである。猿だ。
     • The text contains ruby (readings given in parentheses).
     • It contains "\n" markers for line breaks, spaces and similar noise.
     • Explanatory parentheses should not be removed.
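  20. Removal with regular expressions
     Regular expression
     • A way of expressing a set of strings with a single string (Wikipedia)
     • Example: every string in {"くろき", "くりき", "くさき"} is matched by "く.{2}"
       • "." matches any character
       • "{}" specifies the number of repetitions
         ningen <- str_remove_all(ningen, "\n|\r|（.{1,10}）| |　|一")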
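The effect of the length limit in `（.{1,10}）` is easier to see on a short sample. The sketch below uses base R `gsub()` instead of the slide's `str_remove_all()` so that it runs without extra packages, and it applies only part of the slide's pattern (the `一` alternative is left out); the sample string is stitched together from phrases in the excerpt.

```r
# Short sample in the same shape as the raw text: ruby in full-width
# parentheses, a longer parenthetical remark, and literal line breaks.
raw <- "荒い縞の袴（はかま）をはいて立ち、\r\n鈍い人たち（つまり、美醜などに関心を持たぬ人たち）は、\r\n　猿だ。"

# Remove CR/LF, short parenthesised ruby (1 to 10 characters),
# half-width spaces and full-width spaces.
pattern <- "\n|\r|（.{1,10}）| |　"
cleaned <- gsub(pattern, "", raw)

print(cleaned)
```

The ruby （はかま） is short enough to match and is removed, while the longer explanatory parenthetical exceeds ten characters and survives, which is the behaviour the previous slide asks for.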
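  21. Morphological analysis and stop word removal
     Use the list linked earlier as the stop words.
         stopwords <- scan(file = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt",
                           what = "")                     # read the entries as character strings
         ningen_words <- RMeCabC(ningen, mypref = 1) %>%  # morphological analysis, converted to base forms
           unlist() %>%                                   # list to vector
           tibble(verb = names(.),                        # convert to a data frame, adding a part-of-speech column
                  word = .) %>%
           filter(!word %in% stopwords)                   # remove stop words; %in% is TRUE when the element is contained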
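The same two steps, turning a named vector into a part-of-speech/word table and then dropping stop words with `%in%`, can be tried on made-up data in base R. Both `morphs` and the three-entry `stopwords` vector are hypothetical stand-ins for the MeCab output and the downloaded list.

```r
# Hypothetical named vector in the shape produced by unlist(RMeCabC(...)):
# values are morphemes, names are parts of speech.
morphs <- c("名詞" = "研究", "名詞" = "室", "助詞" = "に", "動詞" = "入る",
            "助動詞" = "たい", "名詞" = "成績", "動詞" = "足りる")

# Hypothetical stop word list (the slides read a published list with scan()).
stopwords <- c("に", "たい", "する")

# One row per morpheme, with the part of speech as its own column.
words_df <- data.frame(verb = names(morphs), word = unname(morphs),
                       stringsAsFactors = FALSE)

# x %in% y is TRUE for each element of x that appears in y; negate to keep the rest.
kept <- words_df[!words_df$word %in% stopwords, ]
print(kept)
```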
  22. Computing TF
         # Symbols, particles and auxiliary verbs dominate
         count(ningen_words, verb, sort = TRUE)
         count(ningen_words, word, sort = TRUE)
         # Count only nouns, adjectives and verbs
         ningen_words %>%
           filter(verb %in% c("名詞", "形容詞", "動詞")) %>%
           count(word, sort = TRUE)
  23. In-class exercise
     • Pick a novel you like and extract its highest-TF nouns.

  24. Morphological analysis of multiple novels
     You can simply run this code without worrying about the details. Then inspect df2.
         res_mecab <- df %>%
           filter(author == "芥川龍之介") %>%
           mutate(mecab = map(main_text, RMeCabC, mypref = 1)) %>%
           select(author, title, mecab)
         df2 <- tibble()
         for (i in seq_len(nrow(res_mecab))) {
           try({
             df2 <- tibble(verb = sapply(res_mecab$mecab[[i]], names),
                           word = sapply(res_mecab$mecab[[i]], c)) %>%
               filter(verb %in% c("名詞", "形容詞", "動詞")) %>%
               filter(!word %in% stopwords) %>%
               mutate(author = res_mecab$author[[i]],
                      title  = res_mecab$title[[i]]) %>%
               bind_rows(df2, .)
           })
         }
  25. In-class exercise
     • Fill in the blanks (*****) in the following code and compute TF-IDF.

  26. In-class exercise
         TF <- df2 %>%
           count( ***** , ***** , name = "TF")
         IDF <- df2 %>%
           distinct(title, word) %>%
           group_by(word) %>%
           summarise(IDF = ***** )
         TF_IDF <- TF %>%
           left_join(IDF, by = ***** ) %>%
           mutate(TF_IDF = TF * IDF) %>%
           group_by(title) %>%
           mutate(TF_IDF = TF_IDF / sum(TF)) %>%
           ungroup() %>%
           arrange(desc(TF_IDF))
  27. Data analysis example

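  28. Example
     • Visualize which novels resemble which by running a principal component analysis
       1. Morphological analysis of all novels
       2. Compute TF-IDF
       3. Run the principal component analysis on the top words by TF (using every word would be too many)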
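The last step can be sketched with `prcomp()` from base R's stats package. The matrix below is random and purely illustrative; it stands in for a novels-by-words TF-IDF matrix restricted to the selected top words, and the column names `w1` to `w4` are hypothetical.

```r
# Hypothetical TF-IDF matrix: rows are novels, columns are selected words.
set.seed(1)
tfidf <- matrix(runif(6 * 4), nrow = 6,
                dimnames = list(paste0("novel_", 1:6),
                                c("w1", "w2", "w3", "w4")))

# PCA on centred and scaled columns.
pca <- prcomp(tfidf, center = TRUE, scale. = TRUE)

# Scores: coordinates of each novel on the first two principal components.
scores <- pca$x[, 1:2]
print(round(scores, 2))

# Loadings: contribution of each word to each component.
print(round(pca$rotation[, 1:2], 2))

# Proportion of variance explained by each component.
print(round(pca$sdev^2 / sum(pca$sdev^2), 2))
```

Plotting the two score columns against each other gives a similarity map of the novels, and the loadings show which words drive each axis.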
  29. [Figure; no text on this slide]

  30. Top words with the largest contribution to the principal components

  31. Assignment

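  32. Assignment
     1. Look up one application of natural language processing and summarize it. Fine technical details may be omitted.
     2. In fact, the analysis example is not a very good one. Consider the results and the procedure, identify the problems, and list them.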