Mining Interest Topics from Plurk by Using Python
Ken Lee
May 24, 2013
Programming
Transcript
Mining Interest Topics from Plurk by Using Python
[email protected]
About Me
* Ken Lee
* @echain
* Developer at Synology Inc.
* Worked for NCTU CSCC
* Just graduated from NCTU
A long, long time ago...
D-1268 #GoPlurk
>>> import Java
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named Java
>>>
>>> import PHP
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named PHP
D-848 NSC College Student Creative Research Award
D-701 #ECPlurk
>>> from multiprocessing import Process
>>> import multiprocessing.connection
>>> import MySQLdb
>>> import re
>>> import urllib2
>>>
>>> import ecspy
>>> from matplotlib import pyplot
D-270 Happy b-day to Taeyeon
D-260 Happy b-day to Emily
"Time to get to work!!! First, pull the data back." \(▔□▔")/
# pacman -S python2-pip
# cd /usr/bin
# ln -s pip2 pip
# pip install plurk-oauth
$ python2
>>> from PlurkAPI import PlurkAPI
"Ugh... the department workstation has resource limits, and multiprocessing eats up a lot of memory." \(▔□▔")/
# pip install gevent
$ python2
>>> from gevent import monkey
>>> from gevent.pool import Pool
>>> from gevent.queue import Queue
"The crawl speed is not great; it seems to be because the httplib used by oauth2 performs poorly." \(▔□▔")/
# pip install urllib3
$ python2
>>> import urllib3
>>> urllib3.connection_from_url()
D-230 Happy b-day to Jessica
"Ha ha ha... the friend and fan relationships have finally been re-crawled." (・ω・。)
MariaDB [plurk]> SHOW TABLE STATUS;
+---------+-----------+------------+
| Name    | Data      | Index      |
+---------+-----------+------------+
| fans    | 233720343 |  901548032 |
| friends | 421496919 | 1625860096 |
| users   | 189420940 |   79509504 |
+---------+-----------+------------+
"Why on earth is the index fatter than the data?" ヾ(゚▽゚*)ノ
# pacman -S redis
# pip install redis
$ python2
>>> import redis
>>> from retry import retry
「A Multi-Level Aggregation Method for optimizing modularity; Louvain method」 ヾ(゚▽゚*)ノ
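The Louvain method named on this slide greedily moves nodes between communities to increase modularity, then merges each community into a single node and repeats. As a rough illustration of the quantity being optimized, here is a minimal pure-Python sketch of the modularity score for a fixed partition. It is not the full Louvain method and not the code used in the talk; the function name `modularity` and its arguments are assumptions made for this example, and it targets Python 3 rather than the Python 2 shown in the slides.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Return Newman's modularity Q for a partition of an undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    (Both names are hypothetical, chosen for this sketch.)
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1

    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

For two triangles joined by a single edge, putting each triangle in its own community scores higher than lumping all six nodes together, which is the kind of comparison the method uses to decide whether a move is worthwhile.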
"Uh-oh, running @jserv's test case on CPython takes seven and a half minutes to finish." ヾ(゚▽゚*)ノ
# pacman -S pypy
# pypy get-pip.py
# pip install networkx
$ pypy
>>> import networkx
>>> import community
"Nice, now it takes at most thirty seconds to run a human-flesh search on @jserv." ヾ(゚▽゚*)ノ
D-188 Happy b-day to Yoona
"Professor, the lab has run out of storage space." ( ̄﹁ ̄)
The professor said: "Let there be machines." And there were machines. ( ̄﹁ ̄)
"Now there are twenty machines to use, so let's try out a new crawling architecture!" ( ̄﹁ ̄)
# pacman -S zeromq
# pip install pyzmq
# pip install msgpack-python
$ python2
>>> import zmq
>>> import msgpack
"Uh... why is CPU usage so high? Has I/O bound turned into CPU bound?!" ಠ_ಠ
$ free -h
             total       used
Mem:           15G        15G
-/+ buffers/cache:       8.9G
Swap:         251M       3.8M
$ cat /proc/cpuinfo
...
model name: Dual Core AMD Opteron(tm) Processor 270
...
"An antique CPU, no surprise... computing HMAC-SHA1 and decoding JSON take too much time." (´・ω・`)
# pacman -S openssl
# pip install ujson
$ python2
>>> import hmac_sha1
>>> import ujson
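To make the bottleneck above concrete, the following sketch uses only the Python 3 standard library to compute an HMAC-SHA1 digest and to time it against JSON decoding. It is an illustration under assumptions, not the talk's code: the slides switch to an OpenSSL-backed `hmac_sha1` module and `ujson`, whose APIs are not shown, so the helper names `hmac_sha1_hex` and `seconds_per_call` and the sample payload are hypothetical.

```python
import hashlib
import hmac
import json
import timeit


def hmac_sha1_hex(key: bytes, message: bytes) -> str:
    """Return the HMAC-SHA1 of message under key as a hex string."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def seconds_per_call(func, number=10000):
    """Average wall-clock seconds for one call of func over `number` runs."""
    return timeit.timeit(func, number=number) / number


if __name__ == "__main__":
    # Hypothetical inputs standing in for a signed API request and its reply.
    key = b"hypothetical-consumer-secret&hypothetical-token-secret"
    base_string = b"GET&http%3A%2F%2Fexample.invalid%2Fapi&oauth_nonce%3D1"
    reply = json.dumps(
        {"plurks": [{"id": i, "content": "hello"} for i in range(200)]}
    )

    sign_cost = seconds_per_call(lambda: hmac_sha1_hex(key, base_string))
    parse_cost = seconds_per_call(lambda: json.loads(reply), number=1000)
    print("HMAC-SHA1 per call: %.2e s" % sign_cost)
    print("json.loads per call: %.2e s" % parse_cost)
```

Measuring both costs per request this way shows which of the two dominates on a given machine before any library is swapped out.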
D-159 Happy b-day to Seohyun
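"Storing plurks in MySQL feels a bit unintuitive, so let's try the trendy MongoDB." ( ̄﹁ ̄)
"20 = (1+9) * 2 = two Girls' Generation lineups + two managers" \⊙▽⊙/
"Then let's build these machines into a Replicated Sharded MongoDB cluster." ‧★,:*:‧\( ̄▽ ̄)/‧:*‧°★*
"All the plurks have been fetched, but the content is a mix of English, Chinese, and 'Martian' internet slang..." ヾ(゚▽゚*)ノ
"The school's CKIP mirror was broken into and is shut down." \(▔□▔")/ "And CKIP cannot read English; it can only segment Chinese words."
"Roll my own word segmentation system." ヾ(゚▽゚*)ノ
# pip install regex
# pip install marisa-trie
# pip install sqlalchemy
$ python2
>>> import regex
>>> import marisa_trie
>>> import sqlalchemy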
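The slides do not say how the home-made segmenter works, so the following is only a guess at the general shape: a regular expression splits mixed text into ASCII runs and CJK runs, and each CJK run is cut by forward maximum matching against a lexicon. Everything here is an assumption for illustration: the algorithm choice, the toy `LEXICON`, and the function names. The standard-library `re` module and a plain `set` stand in for the `regex` and `marisa_trie` packages imported above, and the code targets Python 3.

```python
import re

# Hypothetical toy lexicon; a real system would load a large word list.
LEXICON = {"少女", "時代", "少女時代", "生日", "快樂"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)

# ASCII word runs stay whole; runs of CJK ideographs are segmented below.
# Anything else (punctuation, emoticons) is skipped in this sketch.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]+")


def segment_cjk(run):
    """Forward maximum matching: take the longest lexicon word at each step."""
    tokens = []
    i = 0
    while i < len(run):
        for size in range(min(MAX_WORD_LEN, len(run) - i), 0, -1):
            piece = run[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in LEXICON:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment(text):
    """Split mixed English/Chinese text into a flat list of tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        run = match.group()
        if run[0].isascii():
            tokens.append(run)
        else:
            tokens.extend(segment_cjk(run))
    return tokens
```

A trie would make the longest-prefix lookup cheaper than the repeated slicing used here, which may be why a trie package appears in the imports, though the slides do not confirm that.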
D-120 Happy b-day to Helena
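The professor said: "Don't write the system's web front end in PHP." Σヽ(゚Д ゚; )ノ
"If I give up now, I will have to get ready for military service." _(:3」∠)_
# pacman -S mod_wsgi2
# pacman -S memcached
# pip install flask
$ python2
>>> from gevent.wsgi import WSGIServer
>>> import flask
The professor said: "The system must let many people connect at the same time, and computing communities must be fast enough." Σヽ(゚Д ゚; )ノ
"Our lab's workstation has only 16 cores available." ヾ(゚▽゚*)ノ
"Unity is strength." \⊙▽⊙/
$ yaourt -S rabbitmq
# pip install celery
$ python2
>>> import celery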
D-55 National Day of Taiwan
"The thesis is almost finished, so let's give this system a name." ( ̄﹁ ̄)
「Social Networking Service Discovery (SNSD) system」 ヾ(゚▽゚*)ノ
"What a coincidence! The abbreviation for Girls' Generation is also SNSD!" \⊙▽⊙/
$ whois snsd.tw
Domain Name: snsd.tw
Contact: Ken Lee [email protected]
Record expires on 2013-10-11 (YYYY-MM-DD)
Record created on 2012-10-10 (YYYY-MM-DD)
Registration Service Provider: APT
D-31 Happy b-day to Anne
D-7 Thesis oral presentation
D-Day Graduated!!!
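Conclusion
* Consider using NoSQL to solve the problem
* Not fast enough? Write a C extension
* Not comfortable with C? Then leave it to PyPy!
* PyPI has a lot of packages; look around and compare
* There are too many people to thank, so let me just thank Girls' Generation!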