My Python Crawler Chronicle (我的PYTHON爬蟲編年史)
bingroom
January 09, 2019
Programming
Invited talk at the Python Hsinchu User Group
Transcript
My Python Crawler Chronicle (我的PYTHON爬蟲編年史) Bingroom
About Me
• Data engineer in [logo not captured]
• Comprehending DevOps
• Addicted to [logo not captured] and [logo not captured]
C# PHP JAVA Python [caption not recoverable from the extraction]
History Events
• Mechanize
• Selenium + Tkinter
• Web API
• PyQuery
Outline
• Mechanize: web automation
• Selenium (with Tkinter): build a tool to load dynamic content
• Web API: find it, use it
• PyQuery: fast crawler prototyping
Challenge 1: PTT
The Soup Everyone Loves
import mechanize: Only in Python2
Python3: mechanicalsoup
Missing part in mechanicalsoup: https://findbiz.nat.gov.tw/fts/query/QueryBar/queryInit.do
Challenge 2: Facebook Personal IDs
Personal ID?
Personal ID: OK
How to query personal pages? By Facebook's vulnerability
How to deal with AJAX? Selenium: for loading dynamic content
• PhantomJS: headless browser
• Headless Chrome
• PIL: crop reCAPTCHA
• Tkinter: GUI design
• cx_Freeze: everything to executable
Good Game
Challenge 3: 22 Government Official Sites
News & Attachment
Taipei City (tpe): https://health.gov.taipei/Default.aspx?tabid=36&mid=442
Taichung City (txg): http://www.health.taichung.gov.tw/26216/26204/26216/26204/26201/26204/27056/Lpsimplelist
Keelung City (klu): http://www.klchb.gov.tw/ch/news/newspaper/list.aspx?c0=640
Tainan City (tnn): https://health.tainan.gov.tw/list.asp?nsub=A0A600&topage=1
Kaohsiung City (khh): http://khd.kcg.gov.tw/Main.aspx?sn=398
New Taipei City (ntpe): http://www.health.ntpc.gov.tw
Yilan County (ilc): http://www.ilshb.gov.tw/index.php?catid=23&cid=4
Taoyuan City (tyc): http://dph.tycg.gov.tw/home.jsp?id=13
Chiayi City (cyi): https://www.cichb.gov.tw/news/indexpda.asp
Hsinchu County (hsh): http://www.hcshb.gov.tw/home.jsp?mserno=200802220002&serno=200802220015&menudata=HcshbMenu&
Hsinchu City (hsc): http://www.hccg.gov.tw/MunicipalNews?websitedn=ou=hcchb,ou=ap_root,o=hccg,c=tw&language=chinese
Miaoli County (mal): https://www.mlshb.gov.tw/tc/PressRelease.aspx?pn=1&department=7
Nantou County (nto): http://www.ntshb.gov.tw/Default.aspx
Changhua County (cwh): http://www.chshb.gov.tw/news/?type_id=127&top=0
Yunlin County (yun): http://www.ylshb.gov.tw/news/index.php?m=9&m1=14&m2=35
Chiayi County (chy): https://cyshb.cyhg.gov.tw/News.aspx?n=E236E03D6AE796F8&sms=A55ECAF6D99EACF8
Pingtung County (pch): http://www.ptshb.gov.tw/News.aspx?CategorySN=1894&n=CC774672B906BC5E
Hualien County (hwa): http://www.hlshb.gov.tw/files/40-1006-27-1.php
Taitung County (ttt): http://www.ttshb.gov.tw/files/40-1000-12-1.php?Lang=zh-tw
Kinmen County (kmn): http://phb.kinmen.gov.tw/News.aspx?n=87FB8DD4C759A8B5&sms=A2C62D68901B977C
Penghu County (peh): https://www.phchb.gov.tw/home.jsp?id=23
Lienchiang County (lnn): http://www.matsuhb.gov.tw/2009web/news/news_contents.php?room=news1
https://aji.tw/slides/pycon2017
PyQuery == jQuery
• parent element > child element
• # for id
• . for class
• . in place of each space when a class name has spaces (i.e. the element has 2 classes)
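The selector rules above can be sketched as a small string builder. This is only an illustration: `css_for`, `child_path`, and the extra class name `highlight` are hypothetical names, not part of PyQuery or of the talk.

```python
def css_for(tag="", element_id="", class_attr=""):
    """Build a CSS selector from a tag name, an id, and a raw class attribute."""
    selector = tag
    if element_id:
        selector += "#" + element_id  # '#' selects by id
    # A class attribute containing spaces holds several classes;
    # each one is chained with its own '.'.
    for name in class_attr.split():
        selector += "." + name
    return selector


def child_path(*parts):
    """Join selectors with ' > ' so each part is a direct child of the previous one."""
    return " > ".join(parts)


# 'highlight' is a made-up second class used to show the space rule.
cell = css_for("td", class_attr="CCMS_jGridView_td_Class_1 highlight")
print(child_path(css_for(element_id="table_0"), "tbody", "tr:nth-child(1)", cell))
# #table_0 > tbody > tr:nth-child(1) > td.CCMS_jGridView_td_Class_1.highlight
```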
Chrome Inspector: #table_0 > tbody > tr:nth-child(1) > td.CCMS_jGridView_td_Class_1
Firefox Inspector (Firebug): #table_0 > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(2)
Usage (1) Generate DOM object by URL opener
Usage (2) Traverse elements and list target items
Usage (3) Laugh.
#table_0 > tbody:nth-child(2) > tr:nth-child(1) > td:nth-child(2)
• the text of the second td
• the href attribute of the a in the second td
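The slides' own PyQuery code was not captured in the transcript, so here is a stand-in that walks the same path (table `#table_0`, first row, second cell) using only the standard library's `xml.etree.ElementTree` and its limited XPath. With PyQuery the CSS selector string would be passed directly instead. The sample markup and the function name `second_cell` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a news table; real pages are messier
# and usually need a forgiving HTML parser instead of an XML parser.
SAMPLE = """
<html><body>
<table id="table_0"><tbody>
<tr><td>2019-01-09</td><td><a href="/news/1">Flu notice</a></td></tr>
<tr><td>2019-01-08</td><td><a href="/news/2">Clinic hours</a></td></tr>
</tbody></table>
</body></html>
"""


def second_cell(html):
    """Return (text, href) for the second td of the first row of #table_0."""
    root = ET.fromstring(html.strip())
    # XPath equivalent of: #table_0 > tbody > tr:nth-child(1) > td:nth-child(2)
    td = root.find(".//table[@id='table_0']/tbody/tr[1]/td[2]")
    text = "".join(td.itertext()).strip()  # text of the cell
    href = td.find("a").get("href")        # href attribute of the link inside it
    return text, href


print(second_cell(SAMPLE))  # ('Flu notice', '/news/1')
```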
BeautifulSoup vs. PyQuery
Preparation for timeout issue
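One way to prepare for timeouts is an explicit per-request timeout plus a few retries. The sketch below uses the standard library. The names `fetch` and `fetch_with_retry`, the retry count, and the back-off delay are assumptions and not the talk's actual code.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Download a page, giving up after `timeout` seconds."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, opener=fetch, retries=3, delay=1.0):
    """Call `opener(url)` up to `retries` times, waiting longer after each failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return opener(url)
        except (socket.timeout, urllib.error.URLError) as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay * attempt)  # linear back-off between attempts
    raise last_error
```

Passing the opener as a parameter keeps the retry logic testable without touching the network.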
Taipei City (tpe), Taichung City (txg), Keelung City (klu), Tainan City (tnn), Kaohsiung City (khh), New Taipei City (ntpe), Yilan County (ilc), Taoyuan City (tyc), Chiayi City (cyi), Hsinchu County (hsh), Hsinchu City (hsc), Miaoli County (mal), Nantou County (nto), Changhua County (cwh), Yunlin County (yun), Chiayi County (chy), Pingtung County (pch), Hualien County (hwa), Taitung County (ttt), Kinmen County (kmn), Penghu County (peh), Lienchiang County (lnn)
[two slide captions not recoverable from the extraction]
encoding=Big5
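Some of the sites serve Big5 rather than UTF-8, so the raw bytes need an explicit decode. A minimal sketch, assuming a fallback order of declared charset, then UTF-8, then Big5. The helper name `decode_page` is hypothetical.

```python
def decode_page(raw, declared=None):
    """Decode response bytes, trying the declared charset first, then common fallbacks."""
    candidates = [declared] if declared else []
    candidates += ["utf-8", "big5"]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next one
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode("utf-8", errors="replace")


raw = "新聞".encode("big5")  # simulate bytes from a Big5 page ("news")
print(decode_page(raw))
print(decode_page(raw, declared="big5"))
```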
Date: uses the ROC (Minguo) calendar year
[one slide caption not recoverable from the extraction]
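Taiwanese government pages often print dates in the ROC (Minguo) calendar, where the Gregorian year is the ROC year plus 1911. Assuming that is what the date note on these slides refers to, a converter could look like the sketch below. The function name and the accepted separators are assumptions.

```python
import re
from datetime import date

ROC_OFFSET = 1911  # ROC (Minguo) year 1 corresponds to Gregorian 1912


def parse_roc_date(text):
    """Parse strings such as '108/01/09', '108-1-9' or '108.01.09' into a datetime.date."""
    match = re.fullmatch(r"\s*(\d{2,3})[/.\-](\d{1,2})[/.\-](\d{1,2})\s*", text)
    if not match:
        raise ValueError(f"not a ROC-style date: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year + ROC_OFFSET, month, day)


print(parse_roc_date("108/01/09"))  # 2019-01-09
```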
https://github.com/bingroom/city-health-news-TW
8 / 22 FAILED
22 / 22 Successful
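A tally like "8 / 22 FAILED" can come from running every site crawler in isolation and collecting the failures. The sketch below is not taken from the linked repository. `run_all`, `report`, and the two toy crawlers are hypothetical.

```python
def run_all(crawlers):
    """Run each site crawler, isolating failures so one bad site cannot stop the rest."""
    ok, failed = [], {}
    for code, crawl in crawlers.items():
        try:
            crawl()
            ok.append(code)
        except Exception as exc:  # broad on purpose: any site-specific error is a failure
            failed[code] = exc
    return ok, failed


def report(ok, failed):
    """Format the result in the style used on the slides."""
    total = len(ok) + len(failed)
    if failed:
        return f"{len(failed)} / {total} FAILED: {', '.join(sorted(failed))}"
    return f"{total} / {total} Successful"


def crawl_tpe():  # toy crawler that succeeds
    return ["news item"]


def crawl_klu():  # toy crawler that fails
    raise TimeoutError("site did not respond")


ok, failed = run_all({"tpe": crawl_tpe, "klu": crawl_klu})
print(report(ok, failed))  # 1 / 2 FAILED: klu
```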
Challenge 4: Hack Nielsen DAR System
How to get an API from a website?
• Guess
• Inspect
Nielsen DAR Ad
• Login by mechanize
• Hold the header
• Enjoy(?) the API
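"Hold the header" can be read as: capture the headers from the logged-in session once, then attach them to every later API request. The sketch below only builds the request objects with the standard library and sends nothing. The header values, the endpoint URL, and the class name `HeaderHolder` are placeholders, since the talk does not show the real ones.

```python
import urllib.request


class HeaderHolder:
    """Remember headers captured after login and attach them to later API requests."""

    def __init__(self, headers):
        self.headers = dict(headers)

    def build(self, url):
        """Create a request for `url` that carries the held headers."""
        return urllib.request.Request(url, headers=self.headers)


# Placeholder values standing in for what a logged-in browser session would send.
holder = HeaderHolder({
    "Cookie": "session=PLACEHOLDER",
    "User-Agent": "Mozilla/5.0",
})
req = holder.build("https://example.invalid/api/report?id=1")
print(req.get_header("Cookie"))  # session=PLACEHOLDER
```

Note that `urllib.request.Request` stores header names capitalized, so `User-Agent` is read back as `User-agent`.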
Keywords for Web APIs
• Facebook • Gmail Inbox • Dcard • PChome
Graph API, Multiprocess, OAuth2.0, Xml, base64, JSON, regex, random UA
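Several of these keywords map onto small standard-library helpers. The sketch below shows a random User-Agent header, a regex that pulls a JSON object out of a script tag, and a base64 decode. All names, the sample User-Agent strings, and the `var NAME = {...};` page layout are assumptions, not details from any of the sites listed.

```python
import base64
import json
import random
import re

USER_AGENTS = [  # sample strings; any realistic list would do
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def random_headers(rng=random):
    """Pick a different User-Agent per request (the 'random UA' keyword)."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def extract_embedded_json(html, var_name):
    """Find `var NAME = {...};` with a regex and parse it as JSON (flat objects only)."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;"
    match = re.search(pattern, html, re.S)
    return json.loads(match.group(1)) if match else None


def decode_b64_field(value):
    """Decode a base64-encoded text field."""
    return base64.b64decode(value).decode("utf-8")


page = '<script>var data = {"title": "aGVsbG8="};</script>'
payload = extract_embedded_json(page, "data")
print(decode_b64_field(payload["title"]))  # hello
```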
Summary
Past
• Mechanize
• BeautifulSoup
• Selenium + PhantomJS
• APIs: less documentation, less limitation
Now
• MechanicalSoup
• PyQuery
• Selenium + Chromedriver
• APIs: more documentation, more limitation
PTT GOV websites Pchome Dcard Facebook Almost everything!
Sikuli: PTT Crawler
Sikuli: Game Farming
Thanks for listening! httpstatusdogs.com