Analyzing Chinese Lyrics with Python
Andy Dai
June 05, 2016
Technology
Transcript
Big Data, Better Decision www.gliacloud.com Andy Dai
[email protected]
Analyzing Chinese Lyrics with Python
WHO AM I?
• Andy Dai
• Organizer of Taipei.py
• PyCon volunteer (2012~)
• GliaCloud CTO, an engineer who does a bit of everything
What I want to share with everyone today: analyzing Chinese-language data is actually quite simple
Here is how it started…
One day, while I was wondering what to submit to PyCon, this song suddenly started playing in my ears
౯犋ݢ胼䨝訵膒
The first step of data analysis: getting the data
OK, let's scrape it ourselves…
Scrapy
Cleaning up the data
• Extract the data we need
• Drop the data we don't need
• Duplicate songs
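The slides do not show how duplicate songs were handled. As a minimal sketch, assuming each scraped song is a dict with hypothetical `title`, `lyricist`, and `lyrics` fields, duplicates could be dropped by comparing normalized copies of those fields:

```python
def normalize(text):
    """Remove all whitespace and lowercase, so trivially different copies compare equal."""
    return "".join(text.split()).lower()


def drop_duplicate_songs(songs):
    """Keep the first record seen for each (title, lyricist, lyrics) combination."""
    seen = set()
    unique = []
    for song in songs:
        key = (
            normalize(song["title"]),
            normalize(song["lyricist"]),
            normalize(song["lyrics"]),
        )
        if key in seen:
            continue  # an equivalent record was already kept
        seen.add(key)
        unique.append(song)
    return unique


# Made-up records, only to illustrate the idea.
songs = [
    {"title": "Song A", "lyricist": "X", "lyrics": "la la la"},
    {"title": "Song A ", "lyricist": "X", "lyrics": "la  la la"},
    {"title": "Song B", "lyricist": "Y", "lyrics": "na na"},
]
unique_songs = drop_duplicate_songs(songs)
```

The field names and the normalization rule are assumptions; a real crawl would likely need a looser match (for example, different releases of the same song).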
Let's look at some numbers first
pandas is a data engineer's good friend
pandas + pymongo
Simple summary statistics
• 141,054 songs in total
• 21,150 lyricists
• 6,120 singers
The top ten lyricists (number of songs)
• 林夕 3459
• 讙狰 1452
• 檔ባ 1139
• 蟞㾴 1061
• ব舙谍 1057
• 瞺䔶 1007
• 珏聱斝 903
• 战ଉ盓 786
• 珏因舯 758
• ব拹 754
matplotlib is a handy helper for drawing charts
import matplotlib.pyplot as plt
plt.bar(…)
These ten people account for 8.7% of all Chinese-language works; 林夕 alone accounts for 2.4%
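These shares can be re-derived from the counts on the earlier slides. The sketch below uses only the song total (141054) and the ten per-lyricist counts quoted in the deck; the variable names are illustrative.

```python
# Figures taken from the statistics and top-ten slides.
TOTAL_SONGS = 141054
top_ten_counts = [3459, 1452, 1139, 1061, 1057, 1007, 903, 786, 758, 754]

top_ten_share = sum(top_ten_counts) / TOTAL_SONGS
top_one_share = top_ten_counts[0] / TOTAL_SONGS

print(f"top ten: {top_ten_share:.2%}")  # 8.77%
print(f"top one: {top_one_share:.2%}")  # 2.45%
```

The results agree with the slide's 8.7% and 2.4% if those figures were rounded down rather than rounded to nearest.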
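Next step: word segmentation
Word segmentation is the foundation of natural language processing. Example: "PyCon Taiwan is being held at Academia Sinica this year" is segmented as this year / PyCon / Taiwan / at / Academia Sinica / held
In 2016 we are still using 結巴 (jieba)
• pip install jieba
• Python
• Traditional Chinese segmentation
• Custom dictionaries
• Quality is not bad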
䥁扃 /籃螂/ԧ/ग़ԋ/ఋ櫞/ /倀/ԧ/ग़裾/縄፮/ /胼/Ꭳ螇/㰁眤/ฎ/眢ጱ/螣叨/ /窕窚/皃皰/櫕Ո/ଥ/ /矦螂/皃稞/מի/ //虏/瞲/嬝篷玱觎/ጱ/Ի矦/ //Ӟ㮆/Ո/ጱ/伩ำ/ /旉ᑏ/ک/ݚ/Ӟ㮆/ጱ/胷腔/ /虏/Ӥ稞/ᇨ/ጱ/梊/玱/ڊ/瓵మ/
/ྯ㮆/Ո/᮷/ฎ/蝡䰬/ /Ձݑ/螂/ஞݵ腭/ //瞩妃/狶/眢眐/դ耻/ጱ/ᗧᗤ
With segmentation done, let's do some analysis
Which words are used the most? You could of course write a for loop + dictionary to get it done
>>> from collections import Counter
>>> counter = Counter(['a', 'a', 'b', 'c'])
>>> counter.most_common(1)
[('a', 2)]
Python has Counter, which you can use for this
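For comparison, here is a small sketch of the "for loop + dictionary" approach next to `Counter`, using the same toy list as the slide. The helper name `count_with_loop` is made up for illustration.

```python
from collections import Counter


def count_with_loop(words):
    """Count occurrences with a plain dict, the manual way."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


words = ["a", "a", "b", "c"]

loop_counts = count_with_loop(words)  # {'a': 2, 'b': 1, 'c': 1}
counter = Counter(words)              # same counts, plus helpers like most_common()
top_word = counter.most_common(1)     # [('a', 2)]
```

Both give the same counts; `Counter` mainly saves the bookkeeping and the sorting step needed to find the most frequent words.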
ڜڊ獮皃ݷ㬵፡፡ ౯㮉 Ӟ㮆̴䷱磪̴Ջ讕̴ᛔ૩̴眢眐̴犋ᥝ Ӯኴ̴螐̴Ꭳ螇̴Ӟ蚏̴犋䨝̴ெ讕̴盠禼 ݢ犥 Ӟ獥̴ইຎ̴ଛᐰ̴眤憽̴聅讀 ਿ疖 ࢩ傶̴櫝樄̴፥ጱ̴ݝ磪̴ஞӾ̴Ӟኞ̴碻樌 ፥ጱ̴蛪螲̴ፘמ̴疰ᓒ̴匍ࣁ ࢧ䛂
伩礖
The vocabulary of lyrics really is very different from everyday language
Word cloud: pip install wordcloud
Word cloud
Word cloud for 林夕
Word cloud for ොઊ
Word cloud for 檔姤揕
Vocabulary richness (word density): number of unique words / total number of words
len(set(word_list))/len(word_list)
Vocabulary richness: average word density - 0.175
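The word-density ratio above is short enough to wrap in a helper. This is a minimal sketch, assuming each lyricist's segmented words are already collected into one list; the names and sample data are hypothetical.

```python
def word_density(word_list):
    """Unique words divided by total words; 0.0 for an empty list."""
    if not word_list:
        return 0.0
    return len(set(word_list)) / len(word_list)


# Made-up segmented words per lyricist, only to show the calculation.
words_by_lyricist = {
    "lyricist_a": ["love", "love", "you", "love"],
    "lyricist_b": ["sky", "sea", "wind", "rain"],
}

density_by_lyricist = {
    name: word_density(words) for name, words in words_by_lyricist.items()
}
```

Here `lyricist_a` scores 0.5 and `lyricist_b` scores 1.0. Note that this ratio tends to fall as the amount of text grows, so comparing a lyricist with thousands of songs against one with a handful may need some normalization.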
Lyricists with a richer vocabulary (word density > 0.20)
• ྎᵜ (ৼ磣, 抑ฎ聲य़, ူේ虭…)
• 林秋離 (犋傶抑ᘒ֢ጱ稧, 肯ၹ…)
• 陳樂融 (眤௮ጱஞ, 天天想你…)
• 暼ᤶ皐 (胙玳, ம疃䩚᪠蚎Ԝ螁…)
• 皰襁ኞ (ݗฎஞ覍, ণট, Bad Boy…)
• 玭磥 (臺ஞ, 櫝Ո…)
Lyricists whose word choice is relatively … (word density < 0.15)
• 檔椆 0.138 (ሴঈ, ᰀ悚蝿瞁)
• 李宗盛 0.134 (我是一隻小小鳥, 覿ఉ, 給自己的歌)
• 阿信 0.116 (五月天 series…)
ஞ物አ扃穉斃ग़犋Ӟਧ玭疏
Ԇ氂獤ຉ物ଛᐰጱ眢眐ฎਿ疖ጱ
Share of songs in which “眢眐” appears
林夕 11.6%, 讙狰 6.9%, 檔ባ 5.1%, 蟞㾴 4.7%, ব舙谍 17.1%, 瞺䔶 1.4%, 珏聱斝 9.4%, 战ଉ盓 25.4%, 珏因舯 3.9%, ব拹 33%
Share of songs in which “ਿ疖” appears
林夕 9.2%, 讙狰 7.5%, 檔ባ 9.3%, 蟞㾴 4.1%, ব舙谍 21.8%, 瞺䔶 5.2%, 珏聱斝 7.9%, 战ଉ盓 21.8%, 珏因舯 5.1%, ব拹 26%
Share of songs in which “ଛᐰ” appears
林夕 7.4%, 讙狰 9.0%, 檔ባ 6.4%, 蟞㾴 2.7%, ব舙谍 29.3%, 瞺䔶 3.9%, 珏聱斝 5.1%, 战ଉ盓 18.4%, 珏因舯 2.7%, ব拹 10.6%
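The deck does not show how these per-lyricist shares were computed. One plausible approach, sketched here with hypothetical field names (`lyricist`, `words`) and made-up data, is to count for each lyricist the fraction of songs whose segmented words include the keyword:

```python
from collections import defaultdict


def keyword_share(songs, keyword):
    """For each lyricist, the fraction of their songs that contain the keyword."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for song in songs:
        totals[song["lyricist"]] += 1
        if keyword in song["words"]:
            hits[song["lyricist"]] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Made-up, already-segmented songs, only to show the shape of the data.
songs = [
    {"lyricist": "lyricist_a", "words": ["love", "night"]},
    {"lyricist": "lyricist_a", "words": ["rain", "street"]},
    {"lyricist": "lyricist_b", "words": ["love", "tears"]},
]
shares = keyword_share(songs, "love")
```

A song counts once no matter how often the word repeats in it, which matches the "share of songs" wording on the slides; whether the original analysis did the same is an assumption.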
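There is a lot more that could be done…
• Which words appear together
• Which songs have highly repetitive lyrics
• Word usage in different periods…
• If you want to draw on other people's lyric writing…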
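The first idea in the list above (which words appear together) is not worked out in the deck. A minimal sketch, assuming each song is already a list of segmented words, is to count every unordered pair of distinct words once per song:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(songs_words):
    """Count, for each pair of distinct words, how many songs contain both."""
    pair_counts = Counter()
    for words in songs_words:
        unique_words = sorted(set(words))  # sort so each pair has one canonical order
        pair_counts.update(combinations(unique_words, 2))
    return pair_counts


# Made-up segmented songs for illustration.
songs_words = [
    ["love", "tears", "night"],
    ["love", "tears"],
    ["night", "rain"],
]
pairs = cooccurrence_counts(songs_words)
most_common_pair = pairs.most_common(1)  # [(('love', 'tears'), 2)]
```

The number of pairs grows quadratically with a song's vocabulary, so on a corpus of this size it would probably be worth restricting the count to frequent words first.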
What I didn't have time to cover today - jupyter
What I didn't have time to cover today - elasticsearch
Elasticsearch
• Provides search functionality
• Every operation is available through a REST API
• Nicer to write when combined with ElasticSearchDSL
• http://www.slideshare.net/daikeren/search-search-search
What I didn't have time to cover today - gensim
Uses of gensim
• word2vec
• doc2vec
• Clustering
• Similarity
• Machine Learning combos
What I'll do next: if I have time, use Deep Learning to do automatic lyrics generation. Stay tuned(?) for a future Taipei.py
Recap
• Web scraping package – scrapy
• Word segmentation – jieba
• Data analysis – pure Python, pandas
• Visualization – wordcloud, matplotlib
• jupyter
• gensim
• elasticsearch
ૡ珶๐率 • 虵搚秚䜗ภ讨䨝磪碝氂ፓ • ᐟᑃੜ因ᇔ • 懿݄覿糫牧磪ࠧ蟸牦牦牦 • ӥ܌槼襎纨ݢ犥肯肯虵搚秚ጱ硲Ԫ
THANK YOU