Analyzing Chinese Lyrics with Python
Andy Dai
June 05, 2016
Technology
Transcript
Big Data, Better Decision www.gliacloud.com Andy Dai
[email protected]
Analyzing Chinese
Lyrics with Python
WHO AM I? • Andy Dai • Organizer of Taipei.py
• PyCon volunteer (2012~) • GliaCloud CTO, an engineer who does a bit of everything
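What I want to share today: analyzing Chinese-language data is actually quite simple
This is how it all started…
One day, while I was thinking about what to submit to PyCon, this song suddenly started playing in my ears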
౯犋ݢ胼䨝訵膒
Step one of data analysis: get the data
Fine, I'll scrape it myself…
Scrapy
Cleaning the data
• Extract the data you need • Drop the data you don't need • Duplicate songs
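The cleaning steps on this slide (keep the needed fields, drop unwanted records, remove duplicate songs) could look roughly like the sketch below. The field names `title`, `lyricist`, and `lyrics`, and the choice of `(title, lyricist)` as the duplicate key, are assumptions for illustration; the talk does not say how its records were structured.

```python
def clean_songs(songs):
    """Keep only the needed fields, drop incomplete records, remove duplicates."""
    seen = set()
    cleaned = []
    for song in songs:
        # Hypothetical field names; missing values are treated as empty strings.
        title = (song.get("title") or "").strip()
        lyricist = (song.get("lyricist") or "").strip()
        lyrics = (song.get("lyrics") or "").strip()
        if not (title and lyricist and lyrics):
            continue  # drop records missing anything we need
        key = (title, lyricist)  # assumed definition of "the same song"
        if key in seen:
            continue  # duplicate song
        seen.add(key)
        cleaned.append({"title": title, "lyricist": lyricist, "lyrics": lyrics})
    return cleaned


raw = [
    {"title": "Song A", "lyricist": "X", "lyrics": "la la", "ad": "banner"},
    {"title": "Song A", "lyricist": "X", "lyrics": "la la"},
    {"title": "Song B", "lyricist": "Y", "lyrics": ""},
]
cleaned = clean_songs(raw)
```

Only the first record survives here: the second is a duplicate and the third has no lyrics.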
Let's look at some numbers first
pandas is a data engineer's good friend
pandas + pymongo
Simple statistics • 141054 songs in total • 21150 lyricists • 6120 singers
Top 10 lyricists (number of songs)
林夕 3459
讙狰 1452
檔ባ 1139
蟞㾴 1061
ব舙谍 1057
林振強 1007
珏聱斝 903
許常德 786
珏因舯 758
ব拹 754
matplotlib is a handy helper for drawing charts
import matplotlib.pyplot as plt
plt.bar(…)
These ten people account for 8.7% of all Chinese-language songs; 林夕 alone accounts for 2.4%
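As a quick sanity check, the shares can be recomputed from the counts on the top-ten slide and the total of 141054 songs. This is only arithmetic on the figures shown in the deck; the variable names are made up for the example.

```python
# Song counts for the ten most prolific lyricists, as listed on the slide.
top10_counts = [3459, 1452, 1139, 1061, 1057, 1007, 903, 786, 758, 754]
total_songs = 141054

top10_share = sum(top10_counts) / total_songs
leader_share = top10_counts[0] / total_songs

print(f"top 10: {top10_share:.2%}, most prolific lyricist: {leader_share:.2%}")
```

The results land close to the rounded 8.7% and 2.4% quoted on the slide.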
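Next step: word segmentation
Word segmentation is the foundation of natural language processing
今年 PyCon 台灣在中研院舉辦 (PyCon Taiwan is held at Academia Sinica this year)
今年/PyCon/台灣/在/中研院/舉辦
In 2016, still using 結巴 (jieba) • pip install jieba • Python • Traditional Chinese segmentation • Custom dictionary • Quality is decent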
Word segmentation
/熬過/了/多久/患難/
/濕/了/多少/眼眶/
/才能/知道/傷感/是/愛的/遺產/
/流浪/幾張/雙人/床/
/換過/幾次/信仰/
/才/讓/戒指/義無反顧/的/交換/
/把/一個/人/的/溫暖/
/轉移/到/另/一個/的/胸膛/
/讓/上次/犯/的/錯/反省/出/夢想/
/每個/人/都/是/這樣/
/享受/過/提心吊膽/
/才/拒絕/做/愛情/代罪/的/羔羊
With segmentation done, let's do some analysis
Which words are used the most? Of course you can write a for loop + dictionary to get it done
>>> from collections import Counter
>>> counter = Counter(['a', 'a', 'b', 'c'])
>>> counter.most_common(1)
[('a', 2)]
Don't forget that Python has Counter
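For comparison, here is a minimal sketch of the "for loop + dictionary" approach next to `Counter`. The helper name `count_words` is made up for this example; both approaches produce the same counts.

```python
from collections import Counter


def count_words(words):
    """Count occurrences with a plain for loop and a dictionary."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


words = ["a", "a", "b", "c"]
manual = count_words(words)   # plain dict of word -> count
counter = Counter(words)      # same counts, plus helpers like most_common()
top = counter.most_common(1)
```

`Counter` saves the bookkeeping and adds `most_common()`, which is why the slide recommends it.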
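Listing the top few words: 我們 (we), 一個 (one), 沒有 (without), 什麼 (what), 自己 (oneself), 愛情 (love), 不要 (don't), 世界 (world), 永遠 (forever), 知道 (know), 一起 (together), 不會 (won't), 怎麼 (how), 快樂 (happy), 可以 (can), 一切 (everything), 如果 (if), 幸福 (happiness), 感覺 (feeling), 聅讀, 寂寞 (lonely), 因為 (because), 離開 (leave), 真的 (really), 只有 (only), 心中 (in the heart), 一生 (a lifetime), 時間 (time), 身邊 (by one's side), 相信 (believe), 就算 (even if), 現在 (now), 回憶 (memories)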
伩礖
Lyric vocabulary really is very different from everyday language
Word cloud: pip install wordcloud
Word cloud
林夕's word cloud
方文山's word cloud
檔姤揕's word cloud
Vocabulary richness (word density): number of unique words / total number of words
len(set(word_list))/len(word_list)
Vocabulary richness: average word density is 0.175
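The word density formula from the slide can be wrapped in a small function. This is a minimal sketch; the empty-list guard and the sample tokens are additions for illustration, and `word_list` is assumed to be the already-segmented words of one lyricist or one song.

```python
def word_density(word_list):
    """Unique words divided by total words, as defined on the slide."""
    if not word_list:
        return 0.0  # assumption: treat an empty list as density 0
    return len(set(word_list)) / len(word_list)


tokens = ["愛情", "的", "寂寞", "的", "幸福"]
density = word_density(tokens)  # 4 unique words out of 5 tokens
```

A higher value means fewer repeated words. Short inputs naturally score higher, so comparisons are more meaningful over similarly sized bodies of text.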
Lyricists with a richer vocabulary (word density > 0.20) • ྎᵜ (ৼ磣, 抑ฎ聲य़, ူේ虭…) • 林秋離 (不為誰而作的歌, 聽海…) • 陳樂融 (感恩的心, 天天想你…) • 暼ᤶ皐 (胙玳, ம疃䩚᪠蚎Ԝ螁…) • 張雨生 (口是心非, 姊妹, Bad Boy…) • 玭磥 (臺ஞ, 離人…)
Lyricists with a comparatively … vocabulary (word density < 0.15) • 檔椆 0.138 (ሴঈ, ᰀ悚蝿瞁) • 李宗盛 0.134 (我是一隻小小鳥, 領悟, 給自己的歌) • 阿信 0.116 (五月天 series…)
ஞ物አ扃穉斃ग़犋Ӟਧ玭疏
Theme analysis: happy love is lonely
Share of each lyricist's songs in which the word appears
Lyricist    愛情 (love)    寂寞 (loneliness)    幸福 (happiness)
林夕        11.6%          9.2%                 7.4%
讙狰        6.9%           7.5%                 9.0%
檔ባ        5.1%           9.3%                 6.4%
蟞㾴        4.7%           4.1%                 2.7%
ব舙谍      17.1%          21.8%                29.3%
林振強      1.4%           5.2%                 3.9%
珏聱斝      9.4%           7.9%                 5.1%
許常德      25.4%          21.8%                18.4%
珏因舯      3.9%           5.1%                 2.7%
ব拹        33%            26%                  10.6%
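Per-lyricist shares like these can be computed by counting, for each lyricist, how many songs contain the keyword. The sketch below assumes each song is a dict with a hypothetical `lyricist` field and a `words` list of segmented tokens; whether the talk matched whole tokens or substrings is not stated, so token matching here is an assumption.

```python
from collections import defaultdict


def keyword_share(songs, keyword):
    """Fraction of each lyricist's songs whose token list contains `keyword`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for song in songs:
        totals[song["lyricist"]] += 1
        if keyword in song["words"]:
            hits[song["lyricist"]] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Toy data, invented for the example.
songs = [
    {"lyricist": "A", "words": ["愛情", "的", "寂寞"]},
    {"lyricist": "A", "words": ["幸福", "的", "回憶"]},
    {"lyricist": "B", "words": ["時間", "的", "世界"]},
]
shares = keyword_share(songs, "愛情")
```

With this toy data, lyricist A uses 愛情 in half of their songs and lyricist B in none.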
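There is a lot more you could do… • Which words appear together • Which songs have a very high word repetition rate • Word usage in different periods… • If you want to imitate other people's lyric writing…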
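No time to cover today - jupyter
No time to cover today - elasticsearch
Elasticsearch • Provides full-text search • Every operation is exposed through a REST API • Nicer to write when paired with ElasticSearchDSL • http://www.slideshare.net/daikeren/search-search-search
No time to cover today - gensim
Uses of gensim • word2vec • doc2vec • Clustering • Similarity • Combining with Machine Learning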
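What comes next: if I have time, use Deep Learning for automatic lyric generation. Stay tuned(?) for a future Taipei.py
Recap • Package for scraping websites – scrapy • Word segmentation – jieba • Data analysis – pure Python, pandas • Visualization – wordcloud, matplotlib • jupyter • gensim • elasticsearch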
ૡ珶๐率 • 虵搚秚䜗ภ讨䨝磪碝氂ፓ • ᐟᑃੜ因ᇔ • 懿݄覿糫牧磪ࠧ蟸牦牦牦 • ӥ܌槼襎纨ݢ犥肯肯虵搚秚ጱ硲Ԫ
THANK YOU