PYCON KR 2017: Web Crawlers from Scratch (처음부터 알아보는 웹 크롤러)
Beomi
August 13, 2017
Slides from the PYCON KR 2017 session "Web Crawlers from Scratch" (presented August 13, 2017).
Transcript
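Web Crawlers from Scratch
Lee Junbeom (Beomi)
Back to the Basic

Speaker introduction
Lee Junbeom ([email protected])
- DjangoGirls Seoul
- Writing the series <나만의 웹 크롤러 만들기> (Building Your Own Web Crawler)
- Tutorial: <나만의 웹 크롤러 만들기>
- Python + Django = <3
- Intern at Woowa Tech (Woowa Brothers)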
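Crawling?

Today's agenda
• The development environment for crawling
• Getting the most out of Chrome; what is a CSS Selector?
• Crawling the conference session list
• Logging in to ONOFFMIX and crawling
• Crawling with Chrome
• Crawling without a visible browser window
• Crawling on a schedule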
What environment do we work in?
• Python 3.6.x (3.4.x at the very least; using Python 2 will be painful)
• Requests 2.18.x (any 2.1x.x or later is fine)
• beautifulsoup4 4.6.x (4.5 or later works)
• selenium 3.4.x (always use the latest selenium version)
• Chrome v60 or later (Headless mode is supported from version 60)

    pip install requests bs4 selenium
What is Requests?
Python HTTP Requests for Humans

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    '{"type":"User"...'
What is BeautifulSoup?
"designed for quick turnaround projects like screen-scraping"
• Turns the data fetched with Requests into objects that Python understands
• Keeps the HTML DOM structure as it is!

    from bs4 import BeautifulSoup
HTML DOM?

    <!DOCTYPE html>
    <html>
      <head>
        <title>Title</title>
      </head>
      <body>
        <h1>The biggest heading</h1>
        <div> 2017 KR </div>
      </body>
    </html>

DOM tree:

    HTML
      HEAD
        TITLE
      BODY
        H1
        DIV
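To make the tree on this slide concrete, here is a minimal sketch that walks the same sample page and records each tag with its nesting depth. It uses only the standard library's `html.parser` rather than BeautifulSoup, and the names `TreeCollector` and `dom_outline` are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Title</title>
  </head>
  <body>
    <h1>The biggest heading</h1>
    <div> 2017 KR </div>
  </body>
</html>"""


class TreeCollector(HTMLParser):
    """Collects (depth, tag) pairs in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then descend one level.
        self.nodes.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Leaving an element: climb back up one level.
        self.depth -= 1


def dom_outline(html):
    """Return the tags of `html` as a list of (depth, tag) pairs."""
    collector = TreeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


if __name__ == "__main__":
    for depth, tag in dom_outline(SAMPLE_HTML):
        print("  " * depth + tag.upper())
```

This sketch assumes every tag is explicitly closed, as in the sample; real pages with void elements such as `<br>` would need extra handling, which is one reason the talk relies on BeautifulSoup instead.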
Using the Chrome Inspector

What is a CSS Selector?

    body > div.frontpage > div.onsky > nav

• body, div and nav are HTML tags; frontpage and onsky are CSS classes
• A CSS class is written .~~~
• An ID is written #~~~
• A direct child is written >
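The notation above can be read mechanically. As a rough sketch, the hypothetical helper `split_selector` below breaks a selector built from `>` steps into its tag, ID, and class parts. It is an illustration only: it deliberately ignores pseudo-classes such as `:nth-of-type(1)` and other combinators, and it is not how BeautifulSoup or Chrome parse selectors.

```python
import re


def split_selector(selector):
    """Split a '>'-separated selector into (tag, id, classes) steps.

    Hypothetical helper for illustration; pseudo-classes are ignored.
    """
    steps = []
    for part in selector.split(">"):
        part = part.strip()
        tag = re.match(r"[A-Za-z][A-Za-z0-9]*", part)   # leading tag name, if any
        element_id = re.search(r"#([\w-]+)", part)       # '#name' marks an ID
        classes = re.findall(r"\.([\w-]+)", part)        # '.name' marks a class
        steps.append((
            tag.group(0) if tag else None,
            element_id.group(1) if element_id else None,
            classes,
        ))
    return steps


if __name__ == "__main__":
    for step in split_selector("body > div.frontpage > div.onsky > nav"):
        print(step)
```

Each tuple is one level of nesting, so the example selector reads as "a nav directly inside a div with class onsky, directly inside a div with class frontpage, directly inside body".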
Crawling a simple web page

Let's fetch the session list
https://www.pycon.kr/2017/program/list/
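    body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

nth-child is not supported by BeautifulSoup's select() in the beautifulsoup4 4.6.x used for this talk (4.7.0 and later delegate select() to Soup Sieve, which does support it), so rewrite it with nth-of-type:

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a

A shorter selector:

    div > div.col-md-9.content > ul > li > a

compared with the full one:

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a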
nth-child vs nth-of-type
• NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
• div:nth-child(5): the 5th element among those sharing the same parent. If that element is not a div, it counts as no match.
• div:nth-of-type(5): the 5th div among those sharing the same parent.
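The difference between the two pseudo-classes is easiest to see by counting siblings by hand. The sketch below mimics both rules on a tiny made-up document using the standard library's `xml.etree.ElementTree`. The functions `nth_child` and `nth_of_type` are hypothetical helpers written for this illustration, not part of BeautifulSoup.

```python
import xml.etree.ElementTree as ET


def nth_child(parent, tag, n):
    """Mimic `tag:nth-child(n)`: the n-th child overall, but only if it is a `tag`."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """Mimic `tag:nth-of-type(n)`: the n-th child counted among `tag` siblings only."""
    same_type = [child for child in parent if child.tag == tag]
    if 1 <= n <= len(same_type):
        return same_type[n - 1]
    return None


# Made-up sample: children are p, div, p, div (in that order).
parent = ET.fromstring(
    "<section><p>a</p><div>b</div><p>c</p><div>d</div></section>"
)

if __name__ == "__main__":
    print(nth_child(parent, "div", 2).text)    # 2nd child overall is a div
    print(nth_child(parent, "div", 3))         # 3rd child overall is a p, so no match
    print(nth_of_type(parent, "div", 2).text)  # 2nd div among the div siblings
```

This is why a selector copied from Chrome with `nth-child` usually needs a different index once it is rewritten with `nth-of-type`.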
    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Turn the HTML into a Soup object that Python understands.
    soup = bs(html, 'html.parser')
    # Select everything that matches the CSS selector (the result is iterable).
    session_list = soup.select('body > div.container > div:nth-of-type(1) '
                               '> div.col-md-9.content > ul:nth-of-type(1) '
                               '> li:nth-of-type(16) > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Turn the HTML into a Soup object that Python understands.
    soup = bs(html, 'html.parser')
    # Select everything that matches the CSS selector (the result is iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
When login is required

Fetching the ONOFFMIX registration list

How do we log in?

Taking the ONOFFMIX login apart
http://onoffmix.com/account/login

Use a requests Session
    import requests
    from bs4 import BeautifulSoup as bs


    def onoffmix():
        # A Session keeps cookies between requests, so the login carries over.
        with requests.Session() as s:
            # Update (rather than replace) the default headers with a browser-like User-Agent.
            s.headers.update({
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            })
            # Submit the login form.
            login = s.post('https://onoffmix.com/account/login', data={
                'email': '[email protected]',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Fetch a page that is only available after logging in.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)
Fetching data after login

    #eventListHolder > div:nth-child(1) > ul > li.title > a
    #eventListHolder > div > ul > li.title > a
Logging in is too hard. Can't I just use a browser?

Let's crawl with Chrome
Selenium + Chrome (v60)

    pip install selenium

https://sites.google.com/a/chromium.org/chromedriver/downloads

    /Users/<username>/Downloads/chromedriver
    from selenium import webdriver

    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Implicit wait: Selenium keeps retrying for up to 3 seconds when locating elements.
    driver.implicitly_wait(3)
    # Open the Naver front page.
    driver.get('https://naver.com')
If you get an error like this

    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

This happens because 'chromedriver' is not registered in PATH.
https://sites.google.com/a/chromium.org/chromedriver/home
Download the latest driver from the address above, unzip it, and specify the exact path to it.
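Logging in to Naver

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')
    # Locate the ID field, the password field, and the login button.
    # (find_element_by_* is the Selenium 3 API used in this talk;
    #  Selenium 4 replaces it with find_element(By.CSS_SELECTOR, ...).)
    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector('#frmNIDLogin > fieldset > span > input[type="submit"]')
    # Type the credentials and submit the form.
    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()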
Crawling the Naver Pay point purchase history
https://order.pay.naver.com/home?tabMenu=POINT_TOTAL

    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p
    #_listContentArea > ul > li > div > div.item_content > div.info_space > p
Checking it in the Chrome console first
You can crawl without writing out the entire CSS selector!
Note, however, that the Chrome console is JS.
Let's fetch it

    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()
I don't want a browser window popping up

    driver.set_window_position(-10000,0)

But this isn't really what we want.
Headless Browser like Headless Chrome
Facebook screenshot
Check your Chrome version at chrome://settings/help
Headless Chrome is available in v60 and later.

    from selenium import webdriver

    # Run Chrome without a visible window, at a fixed viewport size.
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)
    # Locate the login form fields.
    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')
    email.send_keys('[email protected]')
    password.send_keys('ilovepython')
    login.click()
    # Save what the headless browser rendered, then shut it down.
    driver.get_screenshot_as_file('facebook.png')
    driver.quit()
I want to keep the crawler running 24 hours a day
Cron job + deploy to a server
(We won't cover this today.)
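I want to crawl once every 10 minutes

    crontab -e

    */10 * * * * /usr/bin/python3 /home/beomi/parser.py

• The fields are minute, hour, day, month, and day of week, followed by the Python location and the Python file location.
• Use a plain number for an exact time, and */number for "every N minutes/hours".
• Separate the fields with one space each.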
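To see what `*/10` in the minute field actually selects, the following sketch expands a single cron field into the values it matches and splits the crontab line from this slide into its schedule and command. Both `expand_field` and `split_crontab_line` are hypothetical helpers for illustration. They cover only `*`, `*/N`, and plain numbers, not the ranges and lists that real cron also accepts.

```python
def expand_field(field, low, high):
    """Expand one cron field ('*', '*/N', or a plain number) into matching values.

    Hypothetical helper; ranges ('1-5') and lists ('1,15') are not handled.
    """
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


def split_crontab_line(line):
    """Split a crontab line into its five schedule fields and the command."""
    parts = line.split(None, 5)  # at most five splits, so the command stays intact
    return parts[:5], parts[5]


if __name__ == "__main__":
    schedule, command = split_crontab_line(
        "*/10 * * * * /usr/bin/python3 /home/beomi/parser.py"
    )
    print(expand_field(schedule[0], 0, 59))  # minutes on which the job runs
    print(command)
```

Under these assumptions, the entry on the slide fires at minutes 0, 10, 20, 30, 40 and 50 of every hour.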
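Appendix: Crawling a bit more like a human
• Take breaks. With a time.sleep(3) between requests, the server won't get angry.
• Randomize. Pause for a random interval, as in time.sleep(2 + random.random() * 4).
• Set the User-Agent to look like a regular browser rather than a bot.
• On JavaScript-heavy sites, requests can quickly run into trouble.
• Respect robots.txt. Server resources are not unlimited.
• Mobile pages are usually easier to crawl than PC pages. (For a start, they have no Flash or ActiveX.)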
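Two of the tips above, random pauses and respecting robots.txt, can be sketched with the standard library alone. The helper names `polite_delay` and `build_robot_rules`, the `my-crawler` user agent, the `example.com` URLs, and the sample robots.txt content are all assumptions made for this illustration. A real crawler would load the target site's own robots.txt.

```python
import random
import urllib.robotparser


def polite_delay(base=2.0, jitter=4.0):
    """Return a pause in seconds drawn from [base, base + jitter).

    With the defaults this matches the slide's 2 + random.random() * 4.
    """
    return base + random.random() * jitter


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Made-up robots.txt content, used only for this example.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(EXAMPLE_ROBOTS)

if __name__ == "__main__":
    for url in ("https://example.com/program/list/",
                "https://example.com/private/page"):
        if rules.can_fetch("my-crawler", url):
            # In a real crawler: time.sleep(polite_delay()) and then fetch the page.
            print("allowed:", url, "after pausing", round(polite_delay(), 1), "seconds")
        else:
            print("skipped:", url)
```

In practice the value from `polite_delay()` would be passed to `time.sleep()` before each request; the sketch only prints it so that it runs instantly.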
QnA

Thank you :D