
Analyzing Chinese Lyrics with Python



Andy Dai

June 05, 2016

Transcript

  1. Big Data, Better Decision www.gliacloud.com Andy Dai andydai@gliacloud.com Analyzing Chinese Lyrics with Python
  2. WHO AM I? • Andy Dai • Organizer of Taipei.py • PyCon volunteer (2012~) • GliaCloud CTO, an engineer who has to do a bit of everything
  3. What I want to share with everyone today: analyzing Chinese-language data is actually quite simple

  4. Here is how it all started…

  5. One day, while I was wondering what to submit to PyCon, this song suddenly started playing in my ears

  6. ౯犋ݢ胼䨝訵膒

  7. The first step of data analysis: getting the data

  8. None
  9. None
  10. Fine, let's scrape it ourselves…

  11. None
  12. Scrapy

  13. Cleaning up the data

  14. • Extract the data we need • Drop the data we don't need • Duplicate songs
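
The duplicate-removal step can be sketched in plain Python. This is only an illustration, not the speaker's code (the deck points to pandas for this kind of work): the record fields `title`, `lyricist` and `lyrics` are hypothetical, and so is the rule that two records are duplicates when their lyrics match after whitespace is removed.

```python
def normalize(lyrics):
    """Remove all whitespace so trivially different copies compare equal."""
    return "".join(lyrics.split())


def drop_duplicate_songs(songs):
    """Keep the first record for each distinct (normalized) lyrics text."""
    seen = set()
    unique = []
    for song in songs:
        key = normalize(song["lyrics"])
        if key in seen:
            continue  # the same lyrics were already kept from an earlier record
        seen.add(key)
        unique.append(song)
    return unique


# Hypothetical records, shaped the way a scraper might return them.
songs = [
    {"title": "Song A", "lyricist": "X", "lyrics": "la la la\nla la"},
    {"title": "Song A (live)", "lyricist": "X", "lyrics": "la la la la la"},
    {"title": "Song B", "lyricist": "Y", "lyrics": "na na na"},
]
unique_songs = drop_duplicate_songs(songs)
print([song["title"] for song in unique_songs])
```

Keying on the lyrics rather than the title is a design choice: live or remastered versions often carry a different title but the same text.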

  15. Let's first look at some numbers

  16. pandas is a data engineer's good friend

  17. pandas + pymongo

  18. Simple statistics • 141054 songs in total • 21150 lyricists • 6120 singers

  19. The top ten lyricists

  20. ຋ॗ 3459 讙狰෈ 1452 檔੝ባ 1139 蟞㾴࿯ 1061 ব舙谍 1057

    ຋瞺䔶 1007 珏聱斝 903 战ଉ盓 786 珏因舯 758 ব拹 754
  21. matplotlib is a great helper when you draw charts: import matplotlib.pyplot as plt; plt.bar(…)

  22. None
  23. These ten people account for 8.7% of all Chinese-language works; 林夕 alone accounts for 2.4%
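
The two percentages can be checked against the figures on the earlier slides. The sketch below assumes that the ten song counts on slide 20 and the total of 141054 songs on slide 18 are the inputs. The results are about 8.77% and 2.45%, so the slide's 8.7% and 2.4% look truncated rather than rounded.

```python
# Total number of songs (slide 18) and songs per top-ten lyricist (slide 20).
total_songs = 141054
top_ten_counts = [3459, 1452, 1139, 1061, 1057, 1007, 903, 786, 758, 754]

top_ten_share = sum(top_ten_counts) / total_songs
top_one_share = top_ten_counts[0] / total_songs

print(f"top ten lyricists: {top_ten_share:.2%}")
print(f"most prolific lyricist: {top_one_share:.2%}")
```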

  24. Next step: word segmentation

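  25. Word segmentation is the foundation of natural language processing. Example: 今年 PyCon 台灣在中研院舉辦 ("This year PyCon Taiwan is held at Academia Sinica") becomes 今年/PyCon/台灣/在/中研院/舉辦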
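
To make the idea concrete, here is a toy segmenter based on forward maximum matching: at each position it takes the longest entry of a word list that matches, and falls back to a single character otherwise. This is not how jieba works internally and it is far less capable; the function name `segment` and the tiny vocabulary are made up for this illustration, and the spaces of the example sentence (今年 PyCon 台灣在中研院舉辦) are dropped to keep the sketch short.

```python
def segment(text, vocabulary):
    """Split text greedily, preferring the longest vocabulary entry at each position."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical word list; a real segmenter ships a much larger dictionary.
vocabulary = {"今年", "PyCon", "台灣", "中研院", "舉辦"}
tokens = segment("今年PyCon台灣在中研院舉辦", vocabulary)
print("/".join(tokens))
```

Greedy matching goes wrong whenever a long dictionary entry overlaps the correct boundary, which is one reason to reach for a dedicated library instead.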

  26. In 2016, still using 結巴 (jieba) • pip install jieba • Python • Traditional Chinese segmentation • Custom dictionaries • Quality is not bad
  27. Word segmentation
    /熬過/了/多久/患難/
    /濕/了/多少/眼眶/
    /才能/知道/傷感/是/愛的/遺產/
    /流浪/幾張/雙人/床/
    /換過/幾次/信仰/
    /才/讓/戒指/義無反顧/的/交換/
    /把/一個/人/的/溫暖/
    /轉移/到/另/一個/的/胸膛/
    /讓/上次/犯/的/錯/反省/出/夢想/
    /每個/人/都/是/這樣/
    /享受/過/提心吊膽/
    /才/拒絕/做/愛情/代罪/的/羔羊
  28. With segmentation done, let's do some analysis

  29. Which words are used the most? Of course, you could write a for loop + dictionary to get it done, but don't forget that Python has Counter:
    >>> from collections import Counter
    >>> counter = Counter(['a', 'a', 'b', 'c'])
    >>> counter.most_common(1)
    [('a', 2)]
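
The two approaches on this slide give the same counts, as the short comparison below shows. The word list is a made-up sample standing in for the segmented lyrics.

```python
from collections import Counter

# Hypothetical tokens, standing in for the output of the segmentation step.
words = ["愛情", "我們", "愛情", "自己", "我們", "愛情"]

# Manual version: a for loop plus a dictionary.
manual_counts = {}
for word in words:
    manual_counts[word] = manual_counts.get(word, 0) + 1

# Counter does the same bookkeeping and also offers most_common().
counter = Counter(words)
top_two = counter.most_common(2)
print(top_two)
```

To count across a whole corpus, `counter.update(tokens)` can be called once per song instead of building one huge list first.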
  30. ڜڊ獮皃ݷ㬵፡፡ ౯㮉 Ӟ㮆̴䷱磪̴Ջ讕̴ᛔ૩̴眢眐̴犋ᥝ Ӯኴ̴࿞螐̴Ꭳ螇̴Ӟ蚏̴犋䨝̴ெ讕̴盠禼 ݢ犥 Ӟ獥̴ইຎ̴ଛᐰ̴眤憽̴聅讀 ਿ疖 ࢩ傶̴櫝樄̴፥ጱ̴ݝ磪̴ஞӾ̴Ӟኞ̴碻樌 ፥ጱ̴蛪螲̴ፘמ̴疰ᓒ̴匍ࣁ ࢧ䛂

    伩礖
  31. The vocabulary of lyrics really is very different from everyday language

  32. Word cloud: pip install wordcloud

  33. Word cloud

  34. Word cloud for 林夕

  35. Word cloud for 方文山

  36. 檔姤揕ጱ෈ਁ襇

  37. Vocabulary richness (word density): number of unique words / total number of words, i.e. len(set(word_list))/len(word_list)
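
The formula on this slide can be wrapped in a small function. The name `word_density` and the guard for an empty list are additions for this sketch, and the two sample word lists are made up.

```python
def word_density(word_list):
    """Unique words divided by total words; 0.0 for an empty list."""
    if not word_list:
        return 0.0
    return len(set(word_list)) / len(word_list)


# Hypothetical token lists: one repetitive chorus, one more varied verse.
repetitive = ["la", "la", "la", "la", "oh", "la", "la", "oh"]
varied = ["every", "line", "uses", "a", "new", "word", "a", "line"]

print(word_density(repetitive))
print(word_density(varied))
```

Because the ratio tends to fall as a text gets longer, comparing lyricists this way is fairest when the amount of text per lyricist is similar.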

  38. Vocabulary richness: average word density = 0.175

  39. አ扃穉斃ጱ掘੄ጱ֢扃Ո (word density > 0.20) •  ྎᵜ (ৼ磣牏抑ฎ聲य़牏ူේ虭…҂ •  ຋纩櫝

    (犋傶抑ᘒ֢ጱ稧牏肯ၹ…) •  檔禼ᣟ (眤௮ጱஞ牏ॠॠమ֦…҂ •  暼ᤶ皐ҁ胙玳牏ம疃䩚᪠蚎Ԝ螁…҂ •  皰襁ኞҁݗฎஞ覍牏ণট牏Bad Boy…҂ •  玭磥৙ (臺ஞ牏櫝Ո…҂
  40. አ扃穉斃ጱ … ጱ֢扃Ո (word density < 0.15) •  檔椆૝ 0.138ҁሴঈ牏ᰀ悚蝿瞁҂

    •  磷疍ፐ 0.134ҁ౯ฎӞ櫇ੜੜ澆牏覿ఉ牏妔ᛔ૩ጱ 稧҂ •  ᴨמ 0.116ҁԲ์ॠ羬ڜ…҂
  41. Takeaway: using a wider range of words does not necessarily make you better

  42. Ԇ氂獤ຉ物ଛᐰጱ眢眐ฎਿ疖ጱ

  43. 稧ใ吚Ӿڊ匍螂 “眢眐” ጱ穉ֺ ຋ॗ 11.6% 讙狰෈ 6.9% 檔੝ባ 5.1% 蟞㾴࿯

    4.7% ব舙谍 17.1% ຋瞺䔶 1.4% 珏聱斝 9.4% 战ଉ盓 25.4% 珏因舯 3.9% ব拹 33%
  44. 稧ใ吚Ӿڊ匍螂 “ਿ疖” ጱ穉ֺ ຋ॗ 9.2% 讙狰෈ 7.5% 檔੝ባ 9.3% 蟞㾴࿯

    4.1% ব舙谍 21.8% ຋瞺䔶 5.2% 珏聱斝 7.9% 战ଉ盓 21.8% 珏因舯 5.1% ব拹 26%
  45. 稧ใ吚Ӿڊ匍螂 “ଛᐰ” ጱ穉ֺ ຋ॗ 7.4% 讙狰෈ 9.0% 檔੝ባ 6.4% 蟞㾴࿯

    2.7% ব舙谍 29.3% ຋瞺䔶 3.9% 珏聱斝 5.1% 战ଉ盓 18.4% 珏因舯 2.7% ব拹 10.6%
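
Slides 43 to 45 report, for each lyricist, the share of songs in which a given word appears. A minimal sketch of that calculation follows, assuming each song is a record with a `lyricist` and an already segmented `words` list; both field names, the function name `share_containing` and the sample data are hypothetical.

```python
from collections import defaultdict


def share_containing(songs, word):
    """Map each lyricist to the fraction of their songs whose tokens include word."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for song in songs:
        totals[song["lyricist"]] += 1
        if word in song["words"]:
            hits[song["lyricist"]] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Made-up songs; real input would come from the scraped and segmented corpus.
songs = [
    {"lyricist": "A", "words": ["love", "is", "a", "legacy"]},
    {"lyricist": "A", "words": ["wandering", "double", "bed"]},
    {"lyricist": "B", "words": ["love", "scapegoat"]},
    {"lyricist": "B", "words": ["move", "love", "elsewhere"]},
]
shares = share_containing(songs, "love")
print(shares)
```

Testing membership in the token list, rather than searching the raw text, avoids counting a word that only appears as part of a longer word.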
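  46. There is a lot more you could do… • Which words appear together • Which songs have highly repetitive lyrics • Word usage across different eras… • If you want to borrow from other people's lyrics…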
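
The first idea, finding words that appear together, could start from something like the sketch below. It is one possible approach, not something shown in the talk: it counts how many songs contain each unordered pair of words, and the sample data is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(songs):
    """Count, for every unordered word pair, the number of songs containing both."""
    pair_counts = Counter()
    for words in songs:
        # sorted(set(...)) makes each pair appear once per song, in a stable order.
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical segmented songs.
songs = [
    ["love", "night", "rain"],
    ["love", "rain", "rain"],
    ["night", "dream"],
]
pairs = cooccurrence_counts(songs)
print(pairs.most_common(1))
```

The number of pairs grows quadratically with the vocabulary of each song, so on a large corpus it may help to drop very common words first.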

  47. What I didn't have time to cover today - jupyter

  48. What I didn't have time to cover today - elasticsearch

  49. Elasticsearch • Provides full-text search • Every operation is exposed through a REST API • Nicer to write when combined with ElasticSearchDSL • http://www.slideshare.net/daikeren/search-search-search
  50. What I didn't have time to cover today - gensim

  51. Applications of gensim • word2vec • doc2vec • Clustering • Similarity • Combos with Machine Learning
  52. What comes next: if I have time, use Deep Learning for automatic lyrics generation. Stay tuned (?) for a future Taipei.py

  53. Recap • Package for scraping websites – scrapy • Word segmentation – jieba • Data analysis – pure Python, pandas • Visualization – wordcloud, matplotlib • jupyter • gensim • elasticsearch
  54. ૡ珶๐率 •  虵搚秚೴䜗ภ讨䨝磪碝氂ፓ •  ᐟᑃੜ因ᇔ •  懿஑݄覿糫牧磪ࠧ蟸牦牦牦 •  ӥ܌槼襎纨ݢ犥肯肯虵搚秚ጱ硲Ԫ

  55. THANK YOU