My Python Crawler Chronicle (我的PYTHON爬蟲編年史)
bingroom
January 09, 2019
Invited talk at the Python Hsinchu User Group
Transcript
My Python Crawler Chronicle (我的PYTHON爬蟲編年史), Bingroom
About Me • Data engineer in • Comprehending DevOps •
Addiction to and
C# PHP JAVA Python窕 أ㭏ၹ
History Events: Mechanize, Selenium + Tkinter, Web API, PyQuery
Outline
• Mechanize: web automation
• Selenium (with Tkinter): build a tool to load dynamic content
• Web API: find it, use it
• PyQuery: fast crawler prototyping
Challenge 1: PTT
The Soup Everyone Loves
import mechanize
import mechanize: Python 2 only (at the time of this talk; later mechanize releases added Python 3 support)
Python 3: MechanicalSoup (package name: mechanicalsoup)
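What mechanize and MechanicalSoup automate can be sketched with the standard library alone: find a form's fields in the HTML, override some values, and encode the submission. The sample HTML, the field names, and the helper names below are hypothetical, and this illustrates the idea rather than either library's API.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the action URL and named <input> values of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute start out empty.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(html, overrides):
    """Return (action, urlencoded body) with default fields plus overrides."""
    parser = FormFieldParser()
    parser.feed(html)
    data = dict(parser.fields)
    data.update(overrides)
    return parser.action, urlencode(data)


# Hypothetical login form used only for this sketch.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="passwd">
</form>
"""

action, body = build_submission(SAMPLE_HTML, {"user": "guest", "passwd": "secret"})
print(action)
print(body)
```

Hidden fields such as the token are carried over automatically, which is the part that is easy to forget when building requests by hand.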
Missing part in mechanicalsoup https://findbiz.nat.gov.tw/fts/query/QueryBar/queryInit.do
Challenge 2: Facebook Personal IDs
Personal ID?
Personal ID: OK
How to query personal pages? By Facebook's vulnerability
How to deal with AJAX? Selenium: for loading dynamic content
• PhantomJS: headless browser (now Headless Chrome)
• PIL: crop reCAPTCHA
• Tkinter: GUI design
• cx_Freeze: everything to executable
Good Game
Challenge 3: 22 Government Official Sites
News & Attachment
Taipei City (tpe): https://health.gov.taipei/Default.aspx?tabid=36&mid=442
Taichung City (txg): http://www.health.taichung.gov.tw/26216/26204/26216/26204/26201/26204/27056/Lpsimplelist
Keelung City (klu): http://www.klchb.gov.tw/ch/news/newspaper/list.aspx?c0=640
Tainan City (tnn): https://health.tainan.gov.tw/list.asp?nsub=A0A600&topage=1
Kaohsiung City (khh): http://khd.kcg.gov.tw/Main.aspx?sn=398
New Taipei City (ntpe): http://www.health.ntpc.gov.tw
Yilan County (ilc): http://www.ilshb.gov.tw/index.php?catid=23&cid=4
Taoyuan City (tyc): http://dph.tycg.gov.tw/home.jsp?id=13
Chiayi City (cyi): https://www.cichb.gov.tw/news/indexpda.asp
Hsinchu County (hsh): http://www.hcshb.gov.tw/home.jsp?mserno=200802220002&serno=200802220015&menudata=HcshbMenu&
Hsinchu City (hsc): http://www.hccg.gov.tw/MunicipalNews?websitedn=ou=hcchb,ou=ap_root,o=hccg,c=tw&language=chinese
Miaoli County (mal): https://www.mlshb.gov.tw/tc/PressRelease.aspx?pn=1&department=7
Nantou County (nto): http://www.ntshb.gov.tw/Default.aspx
Changhua County (cwh): http://www.chshb.gov.tw/news/?type_id=127&top=0
Yunlin County (yun): http://www.ylshb.gov.tw/news/index.php?m=9&m1=14&m2=35
Chiayi County (chy): https://cyshb.cyhg.gov.tw/News.aspx?n=E236E03D6AE796F8&sms=A55ECAF6D99EACF8
Pingtung County (pch): http://www.ptshb.gov.tw/News.aspx?CategorySN=1894&n=CC774672B906BC5E
Hualien County (hwa): http://www.hlshb.gov.tw/files/40-1006-27-1.php
Taitung County (ttt): http://www.ttshb.gov.tw/files/40-1000-12-1.php?Lang=zh-tw
Kinmen County (kmn): http://phb.kinmen.gov.tw/News.aspx?n=87FB8DD4C759A8B5&sms=A2C62D68901B977C
Penghu County (peh): https://www.phchb.gov.tw/home.jsp?id=23
Lienchiang County (lnn): http://www.matsuhb.gov.tw/2009web/news/news_contents.php?room=news1
https://aji.tw/slides/pycon2017
PyQuery == jQuery
• parent element > child element
• # for id
• . for class
• . in place of each space when a class attribute contains a space (i.e. the element has 2 classes)
Chrome Inspector: #table_0 > tbody > tr:nth-child(1) > td.CCMS_jGridView_td_Class_1
Firefox Inspector (Firebug): #table_0 > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(2)
Usage (1) Generate DOM object by URL opener
Usage (2) Traverse elements and list target items
Usage (3) Laugh.
#table_0 > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(2)
( ) td ጱ ᒫԫ㮆 ጱ text
( ) td ጱ ᒫԫ㮆 a ጱ 痀 href
BeautifulSoup vs. PyQuery
Preparation for timeout issue
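One way to prepare for timeouts is to give every request an explicit socket timeout and retry a few times before giving up. This is a minimal sketch using only the standard library; the helper names `fetch` and `with_retries` and their defaults are assumptions, not taken from the talk.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Fetch a URL with an explicit socket timeout (in seconds)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def with_retries(func, attempts=3, delay=1.0, exceptions=(OSError,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between failures.

    URLError and socket timeouts are subclasses of OSError, so the default
    covers the usual network failures.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Example call (not executed here because it needs network access):
# html = with_retries(lambda: fetch("https://health.gov.taipei/", timeout=10))
```

Keeping the retry logic separate from the fetch function makes it easy to wrap each site's crawler the same way and to record which sites still fail after the last attempt.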
22 sites: tpe, txg, klu, tnn, khh, ntpe, ilc, tyc, cyi, hsh, hsc, mal, nto, cwh, yun, chy, pch, hwa, ttt, kmn, peh, lnn
Ӟ෭玖ṛ
䌃ਠᵍॠፗ矑硬粚
encoding=Big5
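Pages served as Big5 need to be decoded explicitly, or the text comes out garbled. A minimal sketch, assuming the raw bytes are already downloaded: try the declared charset first, then fall back through a short list. The function name and the fallback order are assumptions.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "big5")):
    """Decode page bytes, trying the declared charset first, then fallbacks."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Last resort: keep going with replacement characters rather than crash.
    return raw.decode(candidates[-1], errors="replace")


# Simulate a Big5-encoded page body ("announcement").
big5_bytes = "公告".encode("big5")
print(decode_page(big5_bytes, declared="big5"))
print(decode_page(big5_bytes))
```

With `urllib`, the declared charset can be read from the response via `resp.headers.get_content_charset()`. UTF-8 is tried before Big5 because UTF-8 decoding fails loudly on most non-UTF-8 input, while a more permissive codec can silently produce wrong text.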
Date in ROC (Minguo) calendar years (用民國年)
猂硬用民國年
֦磪ݣ傀㰷独㻟牫
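Assuming the date caption above refers to dates published in ROC (Minguo) calendar years, as is common on Taiwanese government sites, a normalization step might look like the sketch below. The ROC year is the Gregorian year minus 1911; the accepted input formats and the function name are assumptions.

```python
import re
from datetime import date


def roc_to_gregorian(text):
    """Parse a date like '108/01/09' or '108-1-9' (ROC year) into a datetime.date."""
    match = re.fullmatch(r"\s*(\d{2,3})\D(\d{1,2})\D(\d{1,2})\s*", text)
    if not match:
        raise ValueError(f"unrecognized ROC date: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    # ROC year 1 corresponds to 1912, so add 1911 to get the Gregorian year.
    return date(year + 1911, month, day)


print(roc_to_gregorian("108/01/09"))
```

Sites that mix both calendars need an extra check, for example treating four-digit years as already Gregorian before calling this helper.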
/ 22 https://github.com/bingroom/city-health-news-TW
8 / 22 FAILED
22 / 22 Successful
Challenge 4: Hack Nielsen DAR System
How to get an API from a website?
• Guess
• Inspect
Nielsen DAR Ad
• Login by mechanize
• Hold the header
• Enjoy(?) the API
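"Hold the header" can be read as: capture the headers of the logged-in session once, then attach them to every later API request. A minimal sketch with the standard library follows; the class name, the URL, and the header values are hypothetical and do not describe the Nielsen system.

```python
import urllib.request


class HeldSession:
    """Keep headers captured after login and attach them to every API request."""

    def __init__(self, headers):
        self.headers = dict(headers)

    def build_request(self, url, extra_headers=None):
        # Start from the held headers, then apply per-request additions.
        headers = dict(self.headers)
        headers.update(extra_headers or {})
        return urllib.request.Request(url, headers=headers)


# Hypothetical values; in practice these are copied from the logged-in session.
session = HeldSession({
    "Cookie": "SESSIONID=example",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
})
request = session.build_request(
    "https://example.com/api/report?page=1",
    extra_headers={"Accept": "application/json"},
)
print(request.header_items())
```

Passing `request` to `urllib.request.urlopen(request, timeout=10)` would send it. Note that `urllib` normalizes header names, so `User-Agent` is stored as `User-agent`.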
Keywords for Web APIs
• Sites: Facebook, Gmail Inbox, Dcard, PChome
• Keywords: Graph API, Multiprocess, OAuth2.0, XML, base64, JSON, regex, random UA
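Of the keywords above, "random UA" is the simplest to illustrate: pick a different User-Agent string for each request so that repeated calls look less uniform. The strings below are sample values and the helper names are assumptions.

```python
import random

# Sample User-Agent strings for illustration; replace with current ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.0.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0",
]


def request_headers(rng=random):
    """Build request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept": "application/json",
    }


print(request_headers())
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which helps when debugging a crawler run.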
Summary
Past:
• Mechanize
• BeautifulSoup
• Selenium + PhantomJS
• APIs: less documentation, less limitation
Now:
• MechanicalSoup
• PyQuery
• Selenium + ChromeDriver
• APIs: more documentation, more limitation
PTT GOV websites Pchome Dcard Facebook Almost everything!
Sikuli: PTT Crawler
Sikuli: Game Farming
Thanks for listening! httpstatusdogs.com