Mining Interest Topics from Plurk by Using Python - Taipei.py

Ken Lee

July 25, 2013

Transcript

  1. About Me
     * Ken Lee
     * @echain
     * Developer at Synology Inc.
     * Worked for NCTU CSCC
     * Just graduated from NCTU
  2. >>> import Java
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     ImportError: No module named Java
  3. >>> import Java
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     ImportError: No module named Java
     >>>
     >>> import PHP
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     ImportError: No module named PHP
  4. >>> from multiprocessing import Process
     >>> import multiprocessing.connection
     >>> import MySQLdb
     >>> import re
     >>> import urllib2
     >>>
     >>> import ecspy
     >>> from matplotlib import pyplot
  5. # pacman -S python2-pip
     # cd /usr/bin
     # ln -s pip2 pip
     # pip install plurk-oauth
     $ python2
     >>> from PlurkAPI import PlurkAPI
  6. # pip install gevent
     $ python2
     >>> from gevent import monkey
     >>> from gevent.pool import Pool
     >>> from gevent.queue import Queue
  7. MariaDB [plurk]> SHOW TABLE STATUS;
     +---------+-----------+------------+
     | Name    | Data      | Index      |
     +---------+-----------+------------+
     | fans    | 233720343 |  901548032 |
     | friends | 421496919 | 1625860096 |
     | users   | 189420940 |   79509504 |
     +---------+-----------+------------+
  8. # pacman -S redis
     # pip install redis
     $ python2
     >>> import redis
     >>> from retry import retry
  9. # pacman -S pypy
     # pypy get-pip.py
     # pip install networkx
     $ pypy
     >>> import networkx
     >>> import community
  10. # pacman -S zeromq
      # pip install pyzmq
      # pip install msgpack-python
      $ python2
      >>> import zmq
      >>> import msgpack
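  11. import redis
      from time import time

      r = redis.Redis()

      def execute(uid):
          try:
              # Lease the uid for 300 seconds by scoring it with a deadline.
              _t = int(time() + 300)
              r.zadd('WIP', uid, _t)
              CRAWL_FOR_THE_TARGET_AND_STORE(uid)
              r.sadd('DONE', uid)
              r.zrem('WIP', uid)
          except:
              # On failure, release the lease without marking the uid done.
              r.zrem('WIP', uid)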
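  12. def fetch_targets():
          targets = []
          _t = int(time())
          # Re-issue uids whose lease deadline has already passed.
          for _ in r.zrangebyscore('WIP', 0, _t):
              targets.append(_)
          # Top up with at most 9 fresh uids from the TODO list.
          for _ in xrange(9):
              target = r.lpop('TODO')
              if target is None:
                  break  # TODO is empty; do not append None
              targets.append(target)
          return targets

      def add_todo(uid):
          # Queue a uid only if it is neither in progress nor done.
          if r.zscore('WIP', uid) is None \
                  and not r.sismember('DONE', uid):
              r.rpush('TODO', uid)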
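Slides 11 and 12 together describe a small work queue: uids wait in a `TODO` list, a uid being crawled holds a timed lease in `WIP`, and finished uids land in `DONE`. The bookkeeping can be sketched without a Redis server, using in-memory structures as stand-ins. This is only an illustrative model under stated assumptions: the class name `CrawlQueue`, the `crawl` callback, and the explicit `now` argument are hypothetical and do not appear in the slides.

```python
from collections import deque


class CrawlQueue:
    """In-memory stand-in for the TODO list, WIP sorted set, and DONE set."""

    def __init__(self, lease_seconds=300):
        self.lease_seconds = lease_seconds
        self.todo = deque()  # plays the role of the Redis list 'TODO'
        self.wip = {}        # uid -> lease deadline, like the sorted set 'WIP'
        self.done = set()    # plays the role of the Redis set 'DONE'

    def add_todo(self, uid):
        # Queue a uid only if it is neither in progress nor finished.
        if uid not in self.wip and uid not in self.done:
            self.todo.append(uid)

    def fetch_targets(self, now, batch=9):
        # First re-issue uids whose lease deadline has passed,
        # then top up with fresh uids from the front of TODO.
        targets = [uid for uid, deadline in self.wip.items() if deadline <= now]
        for _ in range(batch):
            if not self.todo:
                break
            targets.append(self.todo.popleft())
        return targets

    def execute(self, uid, crawl, now):
        # Take a lease, run the (hypothetical) crawl callback, record success.
        self.wip[uid] = now + self.lease_seconds
        try:
            crawl(uid)
            self.done.add(uid)
        except Exception:
            pass  # as on the slide, a failed uid is released, not marked done
        finally:
            self.wip.pop(uid, None)
```

The point of the lease is that a uid whose worker stalls past its deadline is returned by a later `fetch_targets` call, so no separate failure detector is needed.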
  13. $ free -h
                          total   used
      Mem:                  15G    15G
      -/+ buffers/cache:          8.9G
      Swap:                251M   3.8M
      $ cat /proc/cpuinfo
      ...
      model name: Dual Core AMD Opteron(tm) Processor 270
      ...
  14. # pacman -S openssl
      # pip install ujson
      $ python2
      >>> import hmac_sha1
      >>> import ujson
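Slide 14 imports an `hmac_sha1` module whose source is not shown. Assuming it computes HMAC-SHA1 digests for signing Plurk's OAuth 1.0 requests (an assumption, suggested only by the `plurk-oauth` install on slide 5), the same computation can be sketched with the Python standard library. The function names below are hypothetical and are not the author's module.

```python
import base64
import hashlib
import hmac


def hmac_sha1(key: bytes, message: bytes) -> bytes:
    """Return the raw 20-byte HMAC-SHA1 digest of message under key."""
    return hmac.new(key, message, hashlib.sha1).digest()


def oauth_signature(base_string: str, consumer_secret: str,
                    token_secret: str = "") -> str:
    """Sign an OAuth 1.0 signature base string with HMAC-SHA1.

    Assumes both secrets are already percent-encoded. The key is the two
    secrets joined by '&', and the digest is base64-encoded.
    """
    key = f"{consumer_secret}&{token_secret}".encode("utf-8")
    digest = hmac_sha1(key, base_string.encode("utf-8"))
    return base64.b64encode(digest).decode("ascii")
```

A crawler signs every API request, so this function sits on the hot path, which may be why the slide pairs it with an OpenSSL install.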
  15. # pip install regex
      # pip install marisa-trie
      # pip install sqlalchemy
      $ python2
      >>> import regex
      >>> import marisa_trie
      >>> import sqlalchemy
  16. # pacman -S memcached
      # pip install flask
      # pip install gunicorn
      $ python2
      >>> import flask
  17. from kombu import Queue

      CELERY_QUEUES = (
          Queue('default', routing_key='pypy'),
          Queue('bsd', routing_key='python27'),
      )
      CELERY_DEFAULT_QUEUE = 'default'
      CELERY_DEFAULT_ROUTING_KEY = 'pypy'
      CELERY_ROUTES = {
          'tasks.communities': {
              'queue': 'default',
              'routing_key': 'pypy',
          },
      }

      linux1 $ celery worker -Q default --autoscale=16,8
      bsd1 $ celery worker -Q bsd --autoscale=16,8
      random $ celery worker -Q default,bsd --autoscale=20,2
  18. $ whois snsd.tw
      Domain Name: snsd.tw
      Contact: Ken Lee [email protected]
      Record expires on 2013-10-11 (YYYY-MM-DD)
      Record created on 2012-10-10 (YYYY-MM-DD)
      Registration Service Provider: APT
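  19. * Consider using NoSQL to solve the problem
      * Not fast enough? Write a C extension
      * Not comfortable with C? Then leave it to PyPy!
      * PyPI has plenty of packages; look around and compare them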
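  20. * Consider using NoSQL to solve the problem
      * Not fast enough? Write a C extension
      * Not comfortable with C? Then leave it to PyPy!
      * PyPI has plenty of packages; look around and compare them
      * There are too many people to thank, so I'll just thank
      * Girls' Generation (少女時代)!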