
Analyzing Chinese Lyrics with Python

Andy Dai
June 05, 2016



Transcript

  1. WHO AM I?
     •  Andy Dai
     •  Organizer of Taipei.py
     •  PyCon பૡ (2012~)
     •  GliaCloud CTO ࠨ᮷஑狶ጱૡ纷䒍
  2. ຋ॗ 3459
     讙狰෈ 1452
     檔੝ባ 1139
     蟞㾴࿯ 1061
     ব舙谍 1057
     ຋瞺䔶 1007
     珏聱斝 903
     战ଉ盓 786
     珏因舯 758
     ব拹 754
  3. 2016 螭ࣁ媣媲አ奾૬ (jieba)
     •  pip install jieba
     •  Python
     •  耆誢獤扃
     •  ᛔ懪ਁَ
     •  Quality 犋癩
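A minimal sketch of the word-segmentation step that jieba covers in this deck. `jieba.lcut` is the library's list-returning segmenter; the `segment` helper and its `cut` parameter are illustrative assumptions, not something shown in the talk.

```python
def segment(text, cut=None):
    """Split text into tokens, dropping empty and whitespace-only tokens.

    `cut` is any callable mapping a string to a list of tokens
    (hypothetical parameter, added so the tokenizer can be swapped).
    When omitted, jieba.lcut is used, which needs `pip install jieba`.
    """
    if cut is None:
        import jieba  # third-party; imported lazily, only when actually needed
        cut = jieba.lcut
    return [token for token in cut(text) if token.strip()]


# With jieba installed, a call would look like:
#     tokens = segment(one_line_of_lyrics)
```

Passing the tokenizer in as an argument keeps the rest of the analysis independent of jieba, so a different segmenter can be tried without touching the counting code.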
  4. Which words are used the most? A for loop plus a dictionary would of course do the job.
     >>> from collections import Counter
     >>> counter = Counter(['a', 'a', 'b', 'c'])
     >>> counter.most_common(1)
     [('a', 2)]
     Don't forget that Python has Counter.
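To make the comparison on this slide concrete, here is a small sketch that counts the same tokens twice, once with a for loop and a dictionary and once with `collections.Counter`. The token list and the `count_with_dict` helper are made up for illustration.

```python
from collections import Counter


def count_with_dict(tokens):
    """Count tokens by hand with a for loop and a plain dictionary."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


# Hypothetical, already-segmented lyrics.
tokens = ["love", "you", "love", "night", "love", "you"]

manual = count_with_dict(tokens)
counter = Counter(tokens)

# Both approaches produce the same counts.
assert manual == dict(counter)

# Counter additionally sorts by frequency for free.
top_two = counter.most_common(2)
print(top_two)
```

The hand-written loop works, but `most_common` saves the extra sorting step when only the top few words matter.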
  5. አ扃穉斃ጱ掘੄ጱ֢扃Ո (word density > 0.20)
     •  ྎᵜ (ৼ磣牏抑ฎ聲य़牏ူේ虭…)
     •  ຋纩櫝 (犋傶抑ᘒ֢ጱ稧牏肯ၹ…)
     •  檔禼ᣟ (眤௮ጱஞ牏ॠॠమ֦…)
     •  暼ᤶ皐 (胙玳牏ம疃䩚᪠蚎Ԝ螁…)
     •  皰襁ኞ (ݗฎஞ覍牏ণট牏Bad Boy…)
     •  玭磥৙ (臺ஞ牏櫝Ո…)
  6. አ扃穉斃ጱ … ጱ֢扃Ո (word density < 0.15)
     •  檔椆૝ 0.138 (ሴঈ牏ᰀ悚蝿瞁)
     •  磷疍ፐ 0.134 (౯ฎӞ櫇ੜੜ澆牏覿ఉ牏妔ᛔ૩ጱ稧)
     •  ᴨמ 0.116 (Բ์ॠ羬ڜ…)
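The slides do not define "word density". One plausible reading, and only an assumption here, is the ratio of distinct words to total words across a lyricist's segmented lyrics (a type-token ratio). The sketch below computes that ratio on a made-up token list; `word_density` is a hypothetical helper name.

```python
def word_density(tokens):
    """Distinct tokens divided by total tokens.

    This definition is an assumption; the deck only reports the
    resulting numbers, not how they were calculated.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical segmented lyrics: 4 distinct words out of 10 tokens.
sample = ["la", "la", "la", "love", "you", "love", "you", "night", "la", "la"]
density = word_density(sample)
print(round(density, 3))
```

Under this reading a higher value means less repetition, but a ratio like this also tends to fall as the amount of text grows, so comparing lyricists with very different output sizes needs care.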
  7. Share of songs in which “眢眐” appears
     ຋ॗ 11.6%
     讙狰෈ 6.9%
     檔੝ባ 5.1%
     蟞㾴࿯ 4.7%
     ব舙谍 17.1%
     ຋瞺䔶 1.4%
     珏聱斝 9.4%
     战ଉ盓 25.4%
     珏因舯 3.9%
     ব拹 33%
  8. Share of songs in which “ਿ疖” appears
     ຋ॗ 9.2%
     讙狰෈ 7.5%
     檔੝ባ 9.3%
     蟞㾴࿯ 4.1%
     ব舙谍 21.8%
     ຋瞺䔶 5.2%
     珏聱斝 7.9%
     战ଉ盓 21.8%
     珏因舯 5.1%
     ব拹 26%
  9. Share of songs in which “ଛᐰ” appears
     ຋ॗ 7.4%
     讙狰෈ 9.0%
     檔੝ባ 6.4%
     蟞㾴࿯ 2.7%
     ব舙谍 29.3%
     ຋瞺䔶 3.9%
     珏聱斝 5.1%
     战ଉ盓 18.4%
     珏因舯 2.7%
     ব拹 10.6%
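Assuming the per-lyricist percentages above are the share of each lyricist's songs that contain a given word, they could be computed roughly as follows. The `share_containing` helper, the lyricist labels and the toy songs are all hypothetical.

```python
from collections import defaultdict


def share_containing(songs, word):
    """Return {lyricist: percentage of their songs containing `word`}.

    `songs` is an iterable of (lyricist, tokens) pairs, where `tokens`
    is the segmented lyric of one song.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for lyricist, tokens in songs:
        total[lyricist] += 1
        if word in tokens:
            hits[lyricist] += 1
    return {name: 100.0 * hits[name] / total[name] for name in total}


# Made-up data: two lyricists with two songs each.
songs = [
    ("lyricist_a", ["lonely", "night"]),
    ("lyricist_a", ["happy", "day"]),
    ("lyricist_b", ["lonely", "heart"]),
    ("lyricist_b", ["lonely", "road"]),
]
shares = share_containing(songs, "lonely")
print(shares)
```

Counting songs rather than raw occurrences keeps one heavily repeated chorus from dominating the figure.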
  10. Elasticsearch
     •  Provides full-text search
     •  Every operation is exposed through a REST API
     •  Easier to write when paired with ElasticSearchDSL
     •  http://www.slideshare.net/daikeren/search-search-search
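As a rough illustration of the REST API mentioned on this slide, the sketch below builds a full-text `match` query and posts it to the `_search` endpoint using only the standard library. The index name `songs`, the field name `lyrics`, the node address and both helper functions are assumptions for the example, not details from the talk.

```python
import json
from urllib import request


def build_match_query(field, text, size=10):
    """Build the JSON body for a full-text match query."""
    return {"size": size, "query": {"match": {field: text}}}


def search(base_url, index, body):
    """POST the query body to <base_url>/<index>/_search and decode the reply."""
    req = request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical field name and search term.
body = build_match_query("lyrics", "lonely")
print(json.dumps(body))

# Needs a running Elasticsearch node, so it is left commented out:
# result = search("http://localhost:9200", "songs", body)
```

A DSL library wraps the same JSON in Python objects, which is mostly a readability gain once queries get nested.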
  11. Recap
     •  Web scraping package – scrapy
     •  Word segmentation – jieba
     •  Data analysis – pure Python, pandas
     •  Visualization – wordcloud, matplotlib
     •  jupyter
     •  gensim
     •  elasticsearch