
Hands-on Introduction to Machine Learning with Python, Session 2: Sentiment Analysis

yoppe
September 25, 2016


Transcript

  1. Instructor
 • Yohei Kikuta
 • Ph.D. (Science)
 • Currently works on data analysis at a consulting firm
 • Areas of expertise
 ・Theoretical aspects of machine learning
 ・Recommendation algorithms
 ・Image analysis (Deep Learning)
 • Contact
 Feel free to get in touch about anything
 Email : [email protected]
 Facebook : https://www.facebook.com/yohei.kikuta.3
 Linkedin : https://jp.linkedin.com/in/yohei-kikuta-983b29117
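  2. The rise of CGM
 Consumer Generated Media (CGM) is now widely used
 • Concrete examples
 Amazon reviews, Twitter tweets, word-of-mouth posts on travel sites, and so on
 • Characteristics
 Users generate the data themselves
 Depending on the service, the volume of data is very large
 • Why it is useful
 The unfiltered voice of ordinary consumers about a target can be extracted
 Public opinion as a whole can be judged from a large volume of data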
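  3. What we want to do with sentiment analysis
 • Polarity classification
 Classify whether a text expresses a positive or a negative opinion
 • Automatic dictionary construction
 Register the words that appear in a collection of text data
 • User profiling
 For example, infer the gender of a text's author
 • Content summarization
 For example, extract topics and trends in public opinion from a collection of text data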
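  4. What we want to do with sentiment analysis
 • Polarity classification
 Classify whether a text expresses a positive or a negative opinion
 Machine learning can be used for the classification
 Review A: "This product is very easy to use, so I recommend it!" → positive
 Review B: "This product performs poorly for its price." → negative
 Review C: "This product is not so bad that you lose out by buying it." → positive (or neutral)
 Applications)
 Judge whether a product is rated well by the public
 Collect negative reviews of a product and identify points to improve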
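  5. What we want to do with sentiment analysis
 • Polarity classification
 The classes are often made finer, for example a five-level rating
 The task can also be treated as a learning-to-rank or regression problem
 Review A: "This product is very easy to use, so I recommend it!" → ★★★★★
 Review B: "This product performs poorly for its price." → ★★
 Review C: "This product is not so bad that you lose out by buying it." → ★★★
 Applications)
 Combine with other information about a user to recommend products that the user is likely to rate highly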
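  6. What we want to do with sentiment analysis
 • Polarity classification
 A more advanced variant classifies polarity per aspect
 Machine learning can also be used to extract the aspects themselves
 Review: "The rooms at this hotel are very clean and nice. However, the food is low in quality and I would like it improved. The staff's service is courteous and leaves a good impression."
 Room quality : positive
 Food quality : negative
 Service quality : positive
 Applications)
 Extract per-aspect ratings from reviews that mix positive and negative statements, to understand what users think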
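  7. What we want to do with sentiment analysis
 • Automatic dictionary construction
 Register the words that appear in a collection of text data
 Applications)
 From a large volume of text data, automatically determine the part of speech and the positive / negative polarity of each word, register them, and build a DB that also normalizes spelling variants
 Review A: "This product is very easy to use, so I recommend it!"
 Review B: "This product performs poorly for its price."
 Review C: "This product is not so bad that you lose out by buying it."
 Dictionary DB
 easy to use : positive
 broken : negative
 level : neutral
 ・・・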
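  8. What we want to do with sentiment analysis
 • Content summarization
 For example, extract topics and trends in public opinion from a collection of text data
 Applications)
 From a large volume of text data, extract currently trending topics or predict election results
 Text A: "In this election the LDP still looks solid."
 Text B: "In this situation there is no choice but to vote for the LDP."
 Text C: "~ people flocked to the XX Party's street speech."
 Source : http://japan.cnet.com/news/society/35034916/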
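  9. History of sentiment analysis
 The field has developed from simple classification toward more complex tasks
 • 1990s
 Polarity classification of adjectives (positive or negative)
 • 2000s
 First half: polarity classification of review texts (positive or negative)
 Second half: more complex tasks such as content summarization with topic models
 • 2010s
 Analysis of SNS such as Twitter
 Distributed representations of words, starting with Word2vec
 Analysis that combines text with other data such as images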
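  10. Text analysis is hard
 Computers can only handle numerical data
 → Text is a sequence of words
 → It has to be turned into numerical data that carries its semantic structure
 → But text data is hard to measure on scales such as magnitude or order
   Ex.) In general, "ideal" and "reality" have no notion of which is larger or which comes first
 → So what should we do!?
 → This course briefly introduces the typical analysis steps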
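  11. Typical analysis steps
 • Prepare the text data
 • Morphological analysis
 • Dependency parsing
 • Feature construction
 • Run the target analysis
 Classification or regression if supervised, clustering if unsupervised, and so on
 ※These are only the typical steps; many other patterns exist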
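  12. Typical analysis steps
 • Prepare the text data
 Prepare the text data to be analyzed
 Watch out for character encodings!
 ・Multi-byte characters such as Japanese need particular care
 ・Python 2 and Python 3 implement strings differently (Python 2 has separate str and unicode types; Python 3 unifies on Unicode)
   Unless there is a special reason, use Python 3
 Possible data sources include the following
 ・Collecting tweets with the Twitter API (https://dev.twitter.com/overview/documentation)
 ・Aozora Bunko (http://www.aozora.gr.jp/)
 ・English movie review data (http://www.cs.cornell.edu/people/pabo/movie-review-data/)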
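
The encoding caveat above can be sketched in Python 3 as follows. This is a minimal illustration, not code from the course notebooks; the sample string and the choice of UTF-8 and Shift_JIS are assumptions made for the example.

```python
# Python 3: str holds Unicode text, bytes hold what is stored on disk.
text = "評判分析"                       # 4 characters

utf8_bytes = text.encode("utf-8")       # 3 bytes per character for these kanji
sjis_bytes = text.encode("shift_jis")   # 2 bytes per character for these kanji

n_chars = len(text)
n_utf8 = len(utf8_bytes)
n_sjis = len(sjis_bytes)

# Decoding with the codec that was used for encoding restores the text.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec does not: errors="replace" keeps this
# sketch from raising and shows the garbled result instead.
garbled = utf8_bytes.decode("shift_jis", errors="replace")

# When reading files, pass the encoding explicitly, for example:
#   with open(path, encoding="utf-8") as f: ...
```

The same four characters occupy 12 bytes in UTF-8 and 8 bytes in Shift_JIS, which is why the codec has to be known before the bytes can be interpreted.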
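  13. Typical analysis steps
 • Morphological analysis
 Split the text into its smallest meaningful units
 Languages such as English, where words are separated by spaces, are easy
   Ex.) This is a pen. → This / is / a / pen / .
 Japanese is hard
 An analysis library is needed; MeCab (http://taku910.github.io/mecab/) is well known
   Ex.) すもももももものうち → すもも / も / もも / も / もも / の / うち
 If possible, use a dictionary to normalize spelling variants and attach parts of speech, which improves results further
   Ex.) おこなう, 行う, 行なう → 行う (three spellings of the same verb)
   Ex.) 美しい → (美しい, adjective)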
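
For English, the splitting step above can be sketched with a regular expression. This is an illustrative sketch only: the `tokenize` and `normalize` helpers are hypothetical names, and the variant dictionary simply reuses the spelling-variant example from the slide. Japanese text would need a morphological analyzer such as MeCab rather than this approach.

```python
import re

def tokenize(text):
    """Split English text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical normalization dictionary: spelling variant -> canonical form.
VARIANTS = {"おこなう": "行う", "行なう": "行う"}

def normalize(token):
    """Map a known spelling variant to its canonical form; leave others as-is."""
    return VARIANTS.get(token, token)

tokens = tokenize("This is a pen.")
# tokens == ['This', 'is', 'a', 'pen', '.']

canonical = [normalize(t) for t in ["おこなう", "行う", "行なう"]]
# canonical == ['行う', '行う', '行う']
```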
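  14. Typical analysis steps
 • Dependency parsing
 Determine which morphemes modify which
 This is very hard because of language-specific properties and ambiguity
   Ex.) I think that that that that boy used is wrong.
   Ex.) 黒い髪の美しい女性がいる。 ("There is a beautiful woman with black hair"; the modifiers can attach in more than one way)
 Either use a library or simply skip dependency parsing
 For Japanese, CaboCha (https://taku910.github.io/cabocha/) is a well-known library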
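  15. Typical analysis steps
 • Feature construction
 The basic approach is one-hot encoding, which represents the presence or absence of a word with {0,1}
   Ex.) 私 → [1,0,0,0, …], は → [0,1,0,0, …], 人間 → [0,0,1,0, …]
 With this, a document matrix can be built as follows
 Sentence | 私 | あなた | は | 人間 | だ | で | ない | 。 | …
 私は人間だ。 (I am human.) | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | …
 あなたは人間でない。 (You are not human.) | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | …
 This method is simple and easy to handle
 However, the similarity between any two distinct words comes out the same, so meaning is lost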
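
The one-hot encoding and the document matrix described above can be sketched as follows. The sentences are assumed to be tokenized already, and the helper names (`one_hot`, `doc_row`, `dot`) are illustrative, not taken from the course notebooks.

```python
# Vocabulary in the same column order as the slide.
vocab = ["私", "あなた", "は", "人間", "だ", "で", "ない", "。"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a single word."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def doc_row(tokens):
    """Return one row of the document matrix: 1 if the word occurs, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

row_a = doc_row(["私", "は", "人間", "だ", "。"])          # 私は人間だ。
row_b = doc_row(["あなた", "は", "人間", "で", "ない", "。"])  # あなたは人間でない。

# Any two distinct one-hot vectors are orthogonal, so every pair of
# different words looks equally dissimilar.
sim_watashi_anata = dot(one_hot("私"), one_hot("あなた"))
sim_watashi_ningen = dot(one_hot("私"), one_hot("人間"))
```

Both similarities are 0, which is the loss of meaning the slide points out: the encoding cannot express that 私 ("I") is closer to あなた ("you") than to 。.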
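  16. Typical analysis steps
 • Feature construction: what we use in this analysis
 Many kinds of features have been devised to reflect the meaning a text carries
 ・N-gram
   Treat N adjacent words as a single unit. N = 1, 2 is most common
     Ex.) bi-gram 私は人間だ。→ (私, は), (は, 人間), (人間, だ), (だ, 。)
 ・Bag of Words (BoW)
   Count how often each word appears and use the counts as features
     Ex.) 私は私を信じる。→ (私, 2), (は, 1), (を, 1), (信じる, 1), (。, 1)
 ・tf-idf (term frequency and inverse document frequency)
   Weight each word's frequency by the proportion of documents in which it appears
     Ex.) 「私」 occurs frequently but appears in many documents, so it gets a low score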
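
The three features above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the inputs are already tokenized, and the tf-idf variant used here (relative term frequency times log(N / document frequency)) is one common definition chosen for illustration; the notebooks may use a different one.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

doc_a = ["私", "は", "人間", "だ", "。"]
doc_b = ["私", "は", "私", "を", "信じる", "。"]

bigrams = ngrams(doc_a, 2)
bow = bag_of_words(doc_b)
scores = tf_idf([doc_a, doc_b])
```

In this two-document example 私 appears in both documents, so its idf is log(2/2) = 0 and its tf-idf score is 0, while 信じる appears in only one document and gets a positive score.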
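  17. Typical analysis steps
 • Feature construction: others
 ・Dimensionality reduction
   Compress the document matrix to a low dimension, for example to extract topics
   Singular value decomposition, probabilistic latent semantic analysis, Latent Dirichlet Allocation, and so on
 ・Distributions
   Word distributions, part-of-speech ratio distributions, and so on
 ・Distributed representations of words
   Learn representations in which vector arithmetic is meaningful, with Word2Vec as the representative example
     Ex.) king - man ≒ queen - woman
 ・etc…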
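  18. The analysis we run this time
 • Prepare the text data
 Use movie review data written in English
 • Morphological analysis
 Process with simple whitespace splitting and a simple hand-built dictionary
 • Dependency parsing
 • Feature construction
 Build a Bag of Words from uni-grams; also build some bi-grams and use tf-idf
 • Run the target analysis
 Classify whether each review text is positive or negative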
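  19. Let's actually run the analysis!
 • Problem setting
 Classify whether a given text expresses a positive or a negative opinion
 Reference paper: http://www.aclweb.org/anthology/W02-1011
 • Data
 Collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/
 700+700 texts tagged as positive or negative
 One review text is stored per file
 • Goals
 Experience the basic flow of text analysis
 Beat the accuracy of the reference paper!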
  20. Preparing the Notebook
 1. Clone https://github.com/yosukekatada/python_ml_study from Github
 2. Move to 20160930_second_meeting
 3. The sentiment analysis uses the following
 - data/
 - ML_2_2_normal.ipynb
 - ML_2_2_advanced.ipynb
 4. Open jupyter notebook (or ipython notebook)
 5. Let's start the analysis!
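  21. Model : Support Vector Machine
 Reference : https://en.wikipedia.org/wiki/Support_vector_machine
 Project data with a complex decision boundary into a feature space where it is linearly separable
 Place the decision boundary as far from the data points as possible (margin maximization)
 The computation stays feasible even when projecting to a high-dimensional (even infinite-dimensional!) space (kernel trick)
 [Figure: original space → feature space]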
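
The kernel trick mentioned above can be illustrated with a small numerical check. This sketch is not an SVM implementation; it only shows, for an assumed degree-2 polynomial kernel on 2-D inputs, that evaluating the kernel in the original space gives the same value as an inner product in an explicit feature space. The names `poly_kernel` and `phi` are illustrative.

```python
import math

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = (x . y)^2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map R^2 -> R^3 that corresponds to poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x = (1.0, 2.0)
y = (3.0, -1.0)

k_implicit = poly_kernel(x, y)        # computed in the original space
k_explicit = dot(phi(x), phi(y))      # computed in the feature space
```

Both values equal 1 for these inputs (up to floating-point rounding). The kernel never builds `phi(x)` explicitly, which is what keeps very high-dimensional projections computable.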
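  22. Model : Naive Bayes classifier
 Reference : https://en.wikipedia.org/wiki/Naive_Bayes_classifier
 Build a classifier from an independence assumption on the variables and Bayes' theorem

 $$P(C \mid X) = P(C \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C)\, P(C)}{P(X)}$$

 Here C is the class, either 1 (positive) or 0 (negative), and each x can be thought of as a Bag of Words feature built from unigrams
 The denominator does not depend on the class C, so it can be dropped; using conditional independence then gives

 $$P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

 Assume a Gaussian or Bernoulli distribution and learn the parameters from data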
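
The classifier above can be sketched for the Bernoulli case as follows. This is a minimal sketch, not the notebook's implementation: features are binary word-presence flags, the Laplace smoothing term `alpha` is an added assumption to avoid zero probabilities, both classes are assumed to appear in the training data, and the four training rows are made-up toy data.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: list of binary feature lists, y: list of 0/1 labels.
    Returns {class: (prior, [P(x_i = 1 | class), ...])}."""
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Laplace-smoothed estimate of P(x_i = 1 | C = c)
        probs = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict(model, x):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xi, p in zip(x, probs):
            score += math.log(p if xi else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (hypothetical): features are [contains "good", contains "bad"].
X_train = [[1, 0], [1, 0], [0, 1], [0, 1]]
y_train = [1, 1, 0, 0]

model = train_bernoulli_nb(X_train, y_train)
pred_pos = predict(model, [1, 0])
pred_neg = predict(model, [0, 1])
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when the vocabulary is large.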
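  23. Model : Random Forest
 Reference : https://en.wikipedia.org/wiki/Random_forest
 Combine multiple decision trees and predict by majority vote
 Also explained in the materials for the first study session : http://www.slideshare.net/ssuserb5817c/python-66169435/1
 Each tree branches so that it splits the data well
 The overall prediction is the average of the individual trees' predictions
 [Figure: example tree splits]
 Does the text contain the word 好き ("like")?
 Does the text contain the word 駄作 ("poor work")?
 Does "?" appear 5 or more times?
 ・・・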
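
Only the voting step described above is sketched here. The three "trees" are hand-written single-question rules that mirror the example splits on the slide; they stand in for trained decision trees and are purely hypothetical. A real random forest would learn each tree from a bootstrap sample of the data with random feature subsets.

```python
# Each stand-in "tree" returns 1 (positive) or 0 (negative).
def tree_contains_suki(text):
    return 1 if "好き" in text else 0           # contains "like"

def tree_contains_dasaku(text):
    return 0 if "駄作" in text else 1           # contains "poor work"

def tree_question_marks(text):
    return 0 if text.count("?") >= 5 else 1     # "?" appears 5 or more times

TREES = [tree_contains_suki, tree_contains_dasaku, tree_question_marks]

def forest_predict(text):
    """Average the trees' votes and threshold at 0.5 (majority vote)."""
    votes = [tree(text) for tree in TREES]
    mean_vote = sum(votes) / len(votes)
    label = 1 if mean_vote > 0.5 else 0
    return label, mean_vote

label_a, mean_a = forest_predict("この映画が好き")    # votes: 1, 1, 1
label_b, mean_b = forest_predict("駄作だ?????")       # votes: 0, 0, 0
label_c, mean_c = forest_predict("好きだが駄作")      # votes: 1, 0, 1
```

With an odd number of trees and binary votes, thresholding the mean at 0.5 is the same as taking the majority, which is how the slide's "majority vote" and "average of predictions" descriptions fit together.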
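  24. Summary
 • What sentiment analysis is
 Determining positive / negative polarity and extracting topics from text data
 • Basics of analysis with text data
 Difficult because the data is unstructured and differs from language to language
 The basic flow is morphological analysis → dependency parsing → feature construction → running the analysis
 • Analysis with movie review data
 Experienced the basic processing steps of text analysis
 Achieved accuracy above the earlier paper by refining the feature construction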