Analyzing Chinese Lyrics with Python
Andy Dai
June 05, 2016
Technology
Transcript
Big Data, Better Decision www.gliacloud.com Andy Dai
[email protected]
Analyzing Chinese Lyrics with Python
WHO AM I?
• Andy Dai
• Organizer of Taipei.py
• PyCon volunteer (2012~)
• GliaCloud CTO, an engineer who does a bit of everything
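What I want to share with everyone today: analyzing Chinese-language data is actually quite simple
This is how it all started…
One day, while I was wondering what to submit to PyCon, this song suddenly started playing in my ears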
౯犋ݢ胼䨝訵膒
The first step of data analysis: collecting data
Fine, I'll scrape it myself…
Scrapy
Cleaning the data
• Extract the data we need
• Drop the data we don't need
• Duplicate songs
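The cleaning steps above can be sketched in plain Python. This is only an illustration: the record layout and the field names (`title`, `lyricist`, `lyrics`) are assumptions, and the talk itself mentions pandas for this kind of work.

```python
# Hypothetical scraped records; the field names are assumptions for this sketch.
songs = [
    {"title": "Song A", "lyricist": "X", "lyrics": "la la la", "page_html": "<div>..."},
    {"title": "Song A", "lyricist": "X", "lyrics": "la la la", "page_html": "<div>..."},  # duplicate
    {"title": "Song B", "lyricist": "Y", "lyrics": "", "page_html": "<div>..."},  # nothing to analyze
    {"title": "Song C", "lyricist": "Z", "lyrics": "do re mi", "page_html": "<div>..."},
]


def clean(records):
    """Keep only the needed fields, drop empty lyrics, and remove duplicate songs."""
    seen = set()
    cleaned = []
    for rec in records:
        lyrics = rec.get("lyrics", "").strip()
        if not lyrics:
            continue  # drop records we cannot analyze
        key = (rec["title"], rec["lyricist"])
        if key in seen:
            continue  # drop repeated songs
        seen.add(key)
        # Extract only the fields the analysis needs.
        cleaned.append({"title": rec["title"], "lyricist": rec["lyricist"], "lyrics": lyrics})
    return cleaned


cleaned_songs = clean(songs)
print([song["title"] for song in cleaned_songs])
```

Treating a (title, lyricist) pair as the identity of a song is a design choice for this sketch; a real dataset may need a looser or stricter notion of "duplicate".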
Let's first look at some numbers
pandas is a data engineer's good friend
pandas + pymongo
Simple statistics
• 141054 songs in total
• 21150 lyricists
• 6120 singers
The top ten of the lyric-writing world
ॗ 3459
讙狰 1452
檔ባ 1139
蟞㾴 1061
ব舙谍 1057
瞺䔶 1007
珏聱斝 903
战ଉ盓 786
珏因舯 758
ব拹 754
matplotlib is a handy helper for plotting
import matplotlib.pyplot as plt
plt.bar(…)
These ten people wrote 8.7% of all Chinese-language works
ॗ alone wrote 2.4%
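The two shares can be checked from the counts on the earlier slides (141054 songs in total and the ten per-lyricist song counts). A small sketch of the arithmetic; the variable names are made up for illustration:

```python
# Figures taken from the slides: total songs and the top-ten song counts.
TOTAL_SONGS = 141054
top_ten_counts = [3459, 1452, 1139, 1061, 1057, 1007, 903, 786, 758, 754]

top_ten_share = sum(top_ten_counts) / TOTAL_SONGS
top_one_share = top_ten_counts[0] / TOTAL_SONGS

print(f"top ten: {top_ten_share:.2%}")
print(f"top one: {top_one_share:.2%}")
```

Both values come out slightly above the slide's figures, which is consistent with the slide truncating rather than rounding.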
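Next step: word segmentation
Word segmentation is the foundation of natural language processing
今年 PyCon 台灣在中研院舉辦 (This year's PyCon Taiwan is being held at Academia Sinica)
今年/PyCon/台灣/在/中研院/舉辦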
In 2016, still using 結巴 (jieba)
• pip install jieba
• Python
• Traditional Chinese segmentation
• Custom dictionaries
• Quality is not bad
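A minimal sketch of calling jieba for segmentation. The sample sentence is an assumption, not taken from the talk, and the character-level fallback exists only so the snippet still runs where jieba is not installed; it is not a substitute for real segmentation.

```python
try:
    import jieba  # third-party: pip install jieba
except ImportError:  # keep the sketch runnable without jieba
    jieba = None


def segment(text):
    """Split `text` into a list of tokens."""
    if jieba is not None:
        # jieba.cut returns a generator of tokens in default (accurate) mode.
        return list(jieba.cut(text))
    # Fallback used only by this sketch: one token per character.
    return list(text)


sample = "我愛自然語言處理"  # hypothetical sample sentence
tokens = segment(sample)
print("/".join(tokens))
```

The exact token boundaries depend on the jieba version and dictionary in use, so no expected output is shown here.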
䥁扃 /籃螂/ԧ/ग़ԋ/ఋ櫞/ /倀/ԧ/ग़裾/縄፮/ /胼/Ꭳ螇/㰁眤/ฎ/眢ጱ/螣叨/ /窕窚/皃皰/櫕Ո/ଥ/ /矦螂/皃稞/מի/ //虏/瞲/嬝篷玱觎/ጱ/Ի矦/ //Ӟ㮆/Ո/ጱ/伩ำ/ /旉ᑏ/ک/ݚ/Ӟ㮆/ጱ/胷腔/ /虏/Ӥ稞/ᇨ/ጱ/梊/玱/ڊ/瓵మ/
/ྯ㮆/Ո/᮷/ฎ/蝡䰬/ /Ձݑ/螂/ஞݵ腭/ //瞩妃/狶/眢眐/դ耻/ጱ/ᗧᗤ
With segmentation done, let's do some analysis
Which words are used the most?
Of course you could write a for loop + dictionary to get it done
>>> from collections import Counter
>>> counter = Counter(['a', 'a', 'b', 'c'])
>>> counter.most_common(1)
[('a', 2)]
Don't forget that Python has Counter
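For comparison, here is a sketch of the two approaches side by side on the same toy list used on the slide. The variable names are chosen for illustration only.

```python
from collections import Counter

words = ["a", "a", "b", "c"]

# Manual approach: a for loop plus a dictionary.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
manual_top = max(counts.items(), key=lambda item: item[1])

# Counter approach: counting and ranking are built in.
counter = Counter(words)
counter_top = counter.most_common(1)[0]

print(manual_top, counter_top)
```

Both give the same answer; `Counter` simply removes the bookkeeping and adds `most_common` for ranking.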
ڜڊ獮皃ݷ㬵፡፡ ౯㮉 Ӟ㮆̴䷱磪̴Ջ讕̴ᛔ૩̴眢眐̴犋ᥝ Ӯኴ̴螐̴Ꭳ螇̴Ӟ蚏̴犋䨝̴ெ讕̴盠禼 ݢ犥 Ӟ獥̴ইຎ̴ଛᐰ̴眤憽̴聅讀 ਿ疖 ࢩ傶̴櫝樄̴፥ጱ̴ݝ磪̴ஞӾ̴Ӟኞ̴碻樌 ፥ጱ̴蛪螲̴ፘמ̴疰ᓒ̴匍ࣁ ࢧ䛂
伩礖
The vocabulary of lyrics really is quite different from everyday vocabulary
Word cloud: pip install wordcloud
Word cloud
ॗ's word cloud
ොઊ's word cloud
檔姤揕's word cloud
Word density
unique word count / total word count
len(set(word_list))/len(word_list)
Average word density: 0.175
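The formula can be wrapped in a small function and averaged over lyricists. A sketch under stated assumptions: the token lists below are toy data, and averaging per-lyricist densities is only one plausible reading of how the slide's average was computed.

```python
def word_density(word_list):
    """Unique word count divided by total word count (0.0 for an empty list)."""
    if not word_list:
        return 0.0
    return len(set(word_list)) / len(word_list)


# Hypothetical segmented lyrics, grouped by lyricist.
words_by_lyricist = {
    "lyricist_a": ["love", "love", "you", "you"],
    "lyricist_b": ["rain", "city", "night", "train"],
}

densities = {name: word_density(words) for name, words in words_by_lyricist.items()}
average_density = sum(densities.values()) / len(densities)

print(densities)
print(average_density)
```

A lower value means the same words are reused more often; a value of 1.0 means no word is ever repeated.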
አ扃穉斃ጱ掘ጱ֢扃Ո (word density > 0.20) • ྎᵜ (ৼ磣牏抑ฎ聲य़牏ူේ虭…҂ • 纩櫝
(犋傶抑ᘒ֢ጱ稧牏肯ၹ…) • 檔禼ᣟ (眤௮ጱஞ牏ॠॠమ֦…҂ • 暼ᤶ皐ҁ胙玳牏ம疃䩚᪠蚎Ԝ螁…҂ • 皰襁ኞҁݗฎஞ覍牏ণট牏Bad Boy…҂ • 玭磥 (臺ஞ牏櫝Ո…҂
አ扃穉斃ጱ … ጱ֢扃Ո (word density < 0.15) • 檔椆 0.138ҁሴঈ牏ᰀ悚蝿瞁҂
• 磷疍ፐ 0.134ҁ౯ฎӞ櫇ੜੜ澆牏覿ఉ牏妔ᛔ૩ጱ 稧҂ • ᴨמ 0.116ҁԲ์ॠ羬ڜ…҂
ஞ物አ扃穉斃ग़犋Ӟਧ玭疏
Ԇ氂獤ຉ物ଛᐰጱ眢眐ฎਿ疖ጱ
Share of songs in which “眢眐” appears
ॗ 11.6%
讙狰 6.9%
檔ባ 5.1%
蟞㾴 4.7%
ব舙谍 17.1%
瞺䔶 1.4%
珏聱斝 9.4%
战ଉ盓 25.4%
珏因舯 3.9%
ব拹 33%
Share of songs in which “ਿ疖” appears
ॗ 9.2%
讙狰 7.5%
檔ባ 9.3%
蟞㾴 4.1%
ব舙谍 21.8%
瞺䔶 5.2%
珏聱斝 7.9%
战ଉ盓 21.8%
珏因舯 5.1%
ব拹 26%
Share of songs in which “ଛᐰ” appears
ॗ 7.4%
讙狰 9.0%
檔ባ 6.4%
蟞㾴 2.7%
ব舙谍 29.3%
瞺䔶 3.9%
珏聱斝 5.1%
战ଉ盓 18.4%
珏因舯 2.7%
ব拹 10.6%
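Per-lyricist shares like these can be computed by counting, for each lyricist, how many songs contain the target word at least once. A sketch with toy data; the `(lyricist, tokens)` layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def share_containing(songs, word):
    """Return {lyricist: fraction of that lyricist's songs containing `word`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for lyricist, tokens in songs:
        totals[lyricist] += 1
        if word in tokens:  # count each song once, however often the word repeats
            hits[lyricist] += 1
    return {lyricist: hits[lyricist] / totals[lyricist] for lyricist in totals}


# Hypothetical segmented songs as (lyricist, tokens) pairs.
songs = [
    ("lyricist_a", ["love", "you", "love"]),
    ("lyricist_a", ["rain", "city"]),
    ("lyricist_b", ["love", "night"]),
]

shares = share_containing(songs, "love")
print(shares)
```

Matching on segmented tokens rather than raw substrings avoids counting a word that only appears inside a longer word.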
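There is much more that could be done…
• Which words appear together
• Which songs have highly repetitive lyrics
• Word usage across different periods…
• If you want to draw on other people's lyric writing…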
No time to cover today - jupyter
No time to cover today - elasticsearch
Elasticsearch
• Provides search functionality
• Every operation is exposed through a REST API
• Nicer to write when paired with ElasticSearchDSL
• http://www.slideshare.net/daikeren/search-search-search
No time to cover today - gensim
gensim ጱ䛑አ • word2vec • doc2vec • 獤ᗭ • ፘ犲ଶ
• Machine Learning 奲ݳದ
What comes next
If I have time, use Deep Learning for automatic lyric generation. Stay tuned (?)
A future Taipei.py
Recap
• Scraping websites – scrapy
• Word segmentation – jieba
• Data analysis – pure Python, pandas
• Visualization – wordcloud, matplotlib
• jupyter
• gensim
• elasticsearch
ૡ珶๐率 • 虵搚秚䜗ภ讨䨝磪碝氂ፓ • ᐟᑃੜ因ᇔ • 懿݄覿糫牧磪ࠧ蟸牦牦牦 • ӥ܌槼襎纨ݢ犥肯肯虵搚秚ጱ硲Ԫ
THANK YOU