
Introduction to Machine Learning with Hands-On Python, Session 2: Sentiment Analysis (評判分析)

yoppe
September 25, 2016


Transcript

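  1. Instructor introduction
 • Yohei Kikuta (菊田 遥平)
 • Ph.D. (Science)
 • Currently works on data analysis at a consulting firm
 • Areas of expertise
 ・Theoretical aspects of machine learning
 ・Recommendation algorithms
 ・Image analysis (Deep Learning)
 • Contact
 Please feel free to get in touch about anything.
 Email : [email protected]
 Facebook : https://www.facebook.com/yohei.kikuta.3
 Linkedin : https://jp.linkedin.com/in/yohei-kikuta-983b29117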
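  2. The rise of CGM
 Consumer Generated Media (CGM) is now widely used.
 • Concrete examples
 Amazon reviews, Twitter tweets, word-of-mouth posts on travel sites, and so on
 • Characteristics
 The users themselves generate the data
 Depending on the service, the volume of data is very large
 • Why it is useful
 The unfiltered voice of ordinary consumers about a target can be extracted
 Overall public opinion can be judged from a large volume of data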
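  3. What we want to do with sentiment analysis
 • Polarity classification
 Classify whether a text expresses a positive or a negative opinion
 • Automatic dictionary construction
 Register the words that appear in a collection of text data
 • User profiling
 For example, infer the gender of a text's author
 • Content summarization
 For example, extract topics and trends in public opinion from a collection of text data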
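  4. What we want to do with sentiment analysis
 • Polarity classification
 Classify whether a text expresses a positive or a negative opinion
 Machine learning can be used for the classification
 Review A: "This product is very easy to use, so I recommend it!" → positive
 Review B: "This product performs poorly for its price." → negative
 Review C: "This product is not so bad that you lose out by buying it." → positive (or neutral)
 Application examples)
 Judge whether a product is well regarded by the public
 Collect the negative reviews of a product and identify points to improve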
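  5. What we want to do with sentiment analysis
 • Polarity classification
 The classes are often made finer, for example a five-level rating
 The task can also be treated as a learning-to-rank or regression problem
 Review A: "This product is very easy to use, so I recommend it!" → ★★★★★
 Review B: "This product performs poorly for its price." → ★★
 Review C: "This product is not so bad that you lose out by buying it." → ★★★
 Application example)
 Combine the ratings with other user information and recommend products that a given user is likely to rate highly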
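  6. What we want to do with sentiment analysis
 • Polarity classification
 A more advanced variant classifies polarity per aspect
 Machine learning can also be used to extract the aspects themselves
 Review: "The rooms in this hotel are very clean and nice. However, the food is of low quality and needs improvement. The staff are courteous and make a good impression."
 Room quality : positive
 Food quality : negative
 Service quality : positive
 Application example)
 From reviews that mix positive and negative statements, extract the rating for each aspect to understand users' opinions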
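  7. What we want to do with sentiment analysis
 • Automatic dictionary construction
 Register the words that appear in a collection of text data
 Application example)
 From a large volume of text data, automatically determine and register the part of speech and the positive / negative polarity of each word that appears, and build a DB in which spelling variants are also normalized
 Review A: "This product is very easy to use, so I recommend it!"
 Review B: "This product performs poorly for its price."
 Review C: "This product is not so bad that you lose out by buying it."
 Dictionary DB
 easy to use (使いやすい) : positive
 broken (壊れている) : negative
 level (レベル) : neutral
 ・・・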
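  8. What we want to do with sentiment analysis
 • Content summarization
 For example, extract topics and trends in public opinion from a collection of text data
 Application example)
 From a large volume of text data, extract currently trending topics or predict the outcome of an election
 Text A: "In this bleak election landscape, the LDP (自民党) is the solid choice after all."
 Text B: "In this situation there is nothing to do but vote for the LDP."
 Text C: "~ people flocked to the street speech of the 〇〇 party."
 Source : http://japan.cnet.com/news/society/35034916/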
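  9. The history of sentiment analysis
 The field has developed from simple classification toward complex tasks.
 • 1990s
 Polarity classification of adjectives (positive or negative)
 • 2000s
 First half: polarity classification of review texts (positive or negative)
 Second half onward: more complex tasks such as content summarization with topic models
 • 2010s
 Analysis of SNS such as Twitter
 Distributed representations of words, starting with Word2vec
 Analysis that combines text with other data such as images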
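  10. Text analysis is hard
 Computers can only handle numerical data
 → A text is a sequence of words
 → It must be converted into numerical data that retains its semantic structure
 → However, text data is hard to measure on scales such as magnitude or order
   Ex.) Between "ideal" (理想) and "reality" (現実) there is, in general, no notion of which is larger or which comes first
 → So what should we do!?
 → This course briefly introduces the typical analysis steps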
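  11. Typical analysis steps
 • Prepare the text data
 • Morphological analysis
 • Dependency parsing
 • Feature engineering
 • Run the target analysis
 Classification or regression if supervised, clustering if unsupervised, and so on
 ※ These are only the typical steps; many other patterns exist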
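  12. Typical analysis steps
 • Prepare the text data
 Prepare the text data to be analyzed
 Watch out for character encodings!
 ・Multibyte characters, as in Japanese, need particular care
 ・Python 2 and Python 3 differ in implementation (Python 2 has both a str type and a unicode type; Python 3 unifies text as Unicode)
   Unless there is a special reason, use Python 3
 Possible data sources include the following
 ・Tweets collected with the Twitter API (https://dev.twitter.com/overview/documentation)
 ・Aozora Bunko (http://www.aozora.gr.jp/)
 ・English movie review data (http://www.cs.cornell.edu/people/pabo/movie-review-data/)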
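 The encoding caveat above can be sketched as follows, assuming Python 3. The sample string and the choice of UTF-8 and Shift_JIS are illustrative assumptions, not something the slides specify.

```python
# Python 3: str holds Unicode text; bytes appear only when encoding/decoding.
text = "評判分析"  # illustrative sample string (assumption)

utf8_bytes = text.encode("utf-8")      # each of these kanji takes 3 bytes
sjis_bytes = text.encode("shift_jis")  # each of these kanji takes 2 bytes

# Same 4 characters, different byte lengths depending on the encoding.
print(len(text), len(utf8_bytes), len(sjis_bytes))

# Decoding with the codec that produced the bytes restores the text.
roundtrip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec fails or yields garbage, so always
# state the encoding explicitly when reading files.
try:
    sjis_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

 When reading a file, passing `encoding=` to `open()` makes the same choice explicit instead of relying on the platform default.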
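  13. Typical analysis steps
 • Morphological analysis
 Split the text into its smallest meaningful units
 Languages such as English, where words are separated by spaces, are easy
   Ex.) This is a pen. → This / is / a / pen / .
 Japanese is hard
 A parsing library is needed; MeCab (http://taku910.github.io/mecab/) is well known
   Ex.) すもももももももものうち → すもも / も / もも / も / もも / の / うち
 If possible, it is even better to use a dictionary to normalize spelling variants and to attach parts of speech
   Ex.) おこなう, 行う, 行なう → 行う
   Ex.) 美しい → (美しい, adjective)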
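 For the English case above, a minimal tokenizer can be sketched with the standard `re` module. The function name `tokenize` and the regular expression are illustrative assumptions. Japanese text would need a morphological analyzer such as MeCab, which is not shown here.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("This is a pen.")
print(" / ".join(tokens))  # This / is / a / pen / .
```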
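  14. Typical analysis steps
 • Dependency parsing
 Specify, for each morpheme, which elements modify which
 This is very hard because of language-specific properties and ambiguity
   Ex.) I think that that that that boy used is wrong.
   Ex.) 黒い髪の美しい女性がいる。 ("There is a beautiful woman with black hair"; what 黒い, "black", modifies is ambiguous)
 Either use a library, or simply skip dependency parsing
 For Japanese, Cabocha (https://taku910.github.io/cabocha/) is a well-known library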
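  15. Typical analysis steps
 • Feature engineering
 The basic approach is one-hot encoding, which represents the presence or absence of a word as {0,1}
   Ex.) 私 (I) → [1,0,0,0, …], は (topic particle) → [0,1,0,0, …], 人間 (human) → [0,0,1,0, …]
 With this, a document matrix can be built as follows
 Document | 私 | あなた | は | 人間 | だ | で | ない | 。 | …
 私は人間だ。 ("I am human.") | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
 あなたは人間でない。 ("You are not human.") | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
 …
 This method is simple and easy to handle
 However, the similarity between any two distinct words comes out the same, so the meaning is lost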
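 The document matrix above can be sketched in plain Python. The two sentences are the slide's examples, assumed to be already tokenized by hand; the vocabulary follows the slide's column order, and the helper name `one_hot` is illustrative.

```python
# Vocabulary in the column order used on the slide.
vocab = ["私", "あなた", "は", "人間", "だ", "で", "ない", "。"]

# The slide's two example sentences, tokenized by hand (assumption).
docs = [
    ["私", "は", "人間", "だ", "。"],
    ["あなた", "は", "人間", "で", "ない", "。"],
]

# One row per document: 1 if the word occurs in it, else 0.
matrix = [[1 if word in doc else 0 for word in vocab] for doc in docs]
for row in matrix:
    print(row)

def one_hot(word):
    """One-hot vector of a single word over the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

# Any two distinct words have dot product 0, so one-hot vectors
# carry no notion of semantic similarity.
dot = sum(a * b for a, b in zip(one_hot("私"), one_hot("人間")))
```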
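  16. Typical analysis steps
 • Feature engineering: features used in this session's analysis
 Various features have been devised to reflect the meaning that a text carries
 ・N-gram
   Treat N adjacent words as a single unit. N = 1, 2 are the most common
     Ex.) bi-gram: 私は人間だ。→ (私, は), (は, 人間), (人間, だ), (だ, 。)
 ・Bag of Words (BoW)
   Count how often each word appears and use the counts as features
     Ex.) 私は私を信じる。("I believe in myself.") → (私, 2), (は, 1), (を, 1), (信じる, 1), (。, 1)
 ・tf-idf (term frequency and inverse document frequency)
   Weight each word's frequency by the inverse of the fraction of documents in which the word appears
     Ex.) 私 ("I") occurs frequently, but because it appears in many documents its score is low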
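 The three features above can be sketched with the standard library. The function names are illustrative, and the tf-idf formula used here (relative term frequency times log(N / df)) is one common variant chosen as an assumption; libraries often use smoothed versions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All runs of n adjacent tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Word -> number of occurrences."""
    return Counter(tokens)

def tf_idf(docs):
    """Per-document scores: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
         for tok, cnt in Counter(doc).items()}
        for doc in docs
    ]

doc_a = ["私", "は", "人間", "だ", "。"]
doc_b = ["私", "は", "私", "を", "信じる", "。"]

bigrams = ngrams(doc_a, 2)
bow = bag_of_words(doc_b)
scores = tf_idf([doc_a, doc_b])

print(bigrams)
print(bow["私"])          # 2
print(scores[1]["私"])    # 0.0: the word appears in every document
```

 A word that appears in every document gets an idf of log(1) = 0, which is the effect the slide describes for 私.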
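  17. Typical analysis steps
 • Feature engineering: other approaches
 ・Dimensionality reduction
   Compress the document matrix to a low dimension, for example to extract topics
   Singular value decomposition, probabilistic latent semantic analysis, Latent Dirichlet Allocation, and so on
 ・Distributions
   Word distributions, part-of-speech ratio distributions, and so on
 ・Distributed representations of words
   Learn representations in which vector arithmetic is meaningful, as typified by Word2Vec
     Ex.) king - man ≒ queen - woman
 ・etc…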
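 The vector arithmetic in the Word2Vec example can be illustrated with hand-made toy vectors. The 2-D coordinates below are assumptions chosen so that the relation holds exactly; they are not learned by Word2Vec, where the relation holds only approximately.

```python
# Hand-made 2-D toy vectors with axes (royalty, maleness).
# These are illustrative assumptions, NOT output of a trained model.
vectors = {
    "king":  (1.0, 1.0),
    "queen": (1.0, 0.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, 0.0),
}

def sub(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))

left = sub(vectors["king"], vectors["man"])       # removes "maleness"
right = sub(vectors["queen"], vectors["woman"])   # leaves "royalty"
print(left, right)
```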
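  18. The analysis carried out in this session
 • Prepare the text data
 Use movie review data written in English
 • Morphological analysis
 Simple whitespace splitting, plus processing with a simple hand-built dictionary
 • Dependency parsing
 • Feature engineering
 Build a Bag of Words from uni-grams; also build some bi-grams and use tf-idf
 • Run the target analysis
 Classify whether a given review text is positive or negative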
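  19. Let's actually run the analysis!
 • Problem setting
 Classify whether a given text expresses a positive or a negative opinion
 Reference paper: http://www.aclweb.org/anthology/W02-1011
 • Data
 Collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/
 700+700 texts tagged as positive or negative
 Each file contains one review text
 • Goals
 Experience the basic flow of text analysis
 Beat the accuracy of the reference paper!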
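 Since each file holds one review, loading the data can be sketched as below. The directory layout (`pos/` and `neg/` subdirectories of `.txt` files) and the function name `load_reviews` are hypothetical; the actual layout under `data/` may differ. The demo writes two made-up files to a temporary directory so the sketch runs on its own.

```python
import tempfile
from pathlib import Path

def load_reviews(root):
    """Return (texts, labels) from root/pos/*.txt and root/neg/*.txt.

    The pos/neg layout is an assumption for illustration only.
    Labels: 1 = positive, 0 = negative.
    """
    texts, labels = [], []
    for label, sub in ((1, "pos"), (0, "neg")):
        for path in sorted(Path(root, sub).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels

# Demo with two made-up reviews in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for sub, body in (("pos", "a fine film"), ("neg", "a dull film")):
        folder = Path(tmp, sub)
        folder.mkdir()
        (folder / "0.txt").write_text(body, encoding="utf-8")
    texts, labels = load_reviews(tmp)

print(labels)
```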
  20. Preparing the notebook
 1. Clone https://github.com/yosukekatada/python_ml_study from Github
 2. Move into 20160930_second_meeting
 3. The sentiment analysis uses the following
 - data/
 - ML_2_2_normal.ipynb
 - ML_2_2_advanced.ipynb
 4. Open jupyter notebook (or ipython notebook)
 5. Let's start the analysis!
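  21. Model : Support Vector Machine
 Reference : https://en.wikipedia.org/wiki/Support_vector_machine
 Project data with a complex decision boundary into a feature space where it is linearly separable
 Place the decision boundary as far from the data points as possible (margin maximization)
 The computation remains feasible even when projecting to high (even infinite!) dimensions (kernel trick)
 [Figure: original space → feature space]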
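 The kernel trick can be checked numerically on a small case. The sketch below assumes the degree-2 polynomial kernel k(x, y) = (x·y)² on 2-D inputs, chosen for illustration; the slides do not say which kernel the notebooks use. The inner product in the explicit 3-D feature space equals the kernel evaluated in the original space.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map for k(x, y) = (x . y)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Same inner product, computed without building phi(x) or phi(y)."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # work done in the 3-D feature space
implicit = poly_kernel(x, y)     # work done in the original 2-D space
print(explicit, implicit)
```

 For kernels whose feature space is infinite-dimensional, only the implicit form is computable, which is why the trick matters.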
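  22. Model : Naive Bayes classifier
 Reference : https://en.wikipedia.org/wiki/Naive_Bayes_classifier
 Build a classifier from the assumption that the variables are independent and from Bayes' theorem:
 $$P(C \mid X) = P(C \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C)\, P(C)}{P(X)}$$
 Here C is the class, either 1 (positive) or 0 (negative), and each x can be thought of as a Bag of Words feature built from unigrams
 The denominator does not depend on the class C, so drop it; applying conditional independence then gives
 $$P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$
 Assume a Gaussian or Bernoulli distribution for P(x_i | C) and learn its parameters from the data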
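 The formula above can be sketched as a small Bernoulli Naive Bayes classifier over word presence. The function names, the Laplace smoothing parameter `alpha`, and the four toy reviews are assumptions for illustration; this is not the notebooks' implementation.

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate P(C) and P(word present | C) with Laplace smoothing."""
    vocab = sorted({w for d in docs for w in d})
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)
        cond[c] = {
            w: (sum(w in d for d in class_docs) + alpha)
               / (len(class_docs) + 2 * alpha)
            for w in vocab
        }
    return vocab, prior, cond

def predict(model, doc):
    """Return argmax over C of log P(C) + sum_i log P(x_i | C)."""
    vocab, prior, cond = model
    present = set(doc)
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Made-up toy reviews (assumption): 1 = positive, 0 = negative.
docs = [["great", "fun"], ["great", "acting"],
        ["boring", "plot"], ["boring", "awful"]]
labels = [1, 1, 0, 0]
model = train_bernoulli_nb(docs, labels)
print(predict(model, ["great", "fun"]), predict(model, ["boring", "awful"]))
```

 Working in log space turns the product over features into a sum and avoids numerical underflow when the vocabulary is large.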
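  23. Model : Random Forest
 Reference : https://en.wikipedia.org/wiki/Random_forest
 Combine multiple decision trees and predict by majority vote
 Also explained in the materials for the first study session : http://www.slideshare.net/ssuserb5817c/python-66169435/1
 Each individual tree is branched so that it splits the data well
 The overall prediction is obtained by averaging the predictions of the individual trees
 [Figure: several decision trees with example node questions: "Does the text contain the word 好き (like)?", "Does the text contain the word 駄作 (dud)?", "Does '?' appear 5 or more times?"]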
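 Only the voting step is sketched here. The three "trees" are hand-written single-question rules modeled on the questions in the slide's figure, with English keywords substituted as an assumption; in a real random forest each tree is learned from a bootstrap sample of the data.

```python
from collections import Counter

# Hand-written stand-ins for trained trees (assumption).
# Each returns 1 (positive) or 0 (negative).
def tree_like(text):
    # Plain substring check; it would also match "dislike".
    return 1 if "like" in text.lower() else 0

def tree_dud(text):
    return 0 if "dud" in text.lower() else 1

def tree_question_marks(text):
    return 0 if text.count("?") >= 5 else 1

TREES = [tree_like, tree_dud, tree_question_marks]

def forest_predict(text):
    """Majority vote; an odd number of binary voters cannot tie."""
    votes = Counter(tree(text) for tree in TREES)
    return votes.most_common(1)[0][0]

print(forest_predict("I like this film"))
print(forest_predict("What a dud. Why? Why? Why? Why? Why?"))
```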
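  24. Summary
 • What sentiment analysis is
 Classifying positive / negative polarity and extracting topics from text data
 • Basics of analysis with text data
 It is difficult because the data is unstructured and because languages differ
 The basic flow is morphological analysis → dependency parsing → feature engineering → running the analysis
 • Analysis using movie review data
 Experienced the basic processing steps of text analysis
 Achieved accuracy above that of the earlier paper by refining the feature engineering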