Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PYCON KR 2017: 처음부터 알아보는 웹 크롤러
Search
Beomi
August 13, 2017
Technology
3
16k
PYCON KR 2017: 처음부터 알아보는 웹 크롤러
PYCON KR 2017 (2017. 08. 13) 처음부터 알아보는 웹 크롤러 세션 발표자료
Beomi
August 13, 2017
Tweet
Share
More Decks by Beomi
See All by Beomi
[PyConKR 2019] 온라인 뉴스 댓글은 정말 사람들의 목소리일까? - PART2
beomi
3
2.8k
AWS Lambda를 통한 Tensorflow 및 Keras 기반 추론 모델 서비스하기
beomi
4
1.1k
GDG Campus SummerParty: 쓸데많은 웹 크롤러 만들기 with Python
beomi
3
1.3k
PYCON KR 2017 튜토리얼: 나만의 웹 크롤러 만들기
beomi
1
780
굥대생의 "HelloWorld!"
beomi
0
850
로지스틱 회귀 분석 | 밑바닥부터 시작하는 데이터 과학 16장
beomi
0
440
DjangoGirls Seoul | Django Study #2 Django MTV
beomi
0
140
Other Decks in Technology
See All in Technology
ABEMAにおけるLLMを用いたコンテンツベース推薦システム導入と効果検証
cyberagentdevelopers
PRO
1
720
簡単に始めるSnowflakeの機械学習
nayuts
1
190
AIアシスタントの活用で品質の向上と開発ワークフローのスピードアップ
nagix
1
200
DevIO2024_レガシー運用からの脱却 -クラウド活用の実践事例とベストプラクティス-
jun2882
0
210
ゆめみのアクセシビリティの現在地と今後
ryokatsuse
3
290
E2Eテスト自動化プラットフォームにおけるAIの活用
shift_evolve
0
180
公共領域から学ぶ クラウド移行についてエンジニアが意識していること
kawakawa2222
0
140
スレットハンティングについて知っておきたいこと
hacket
0
130
【基調講演】変える、今ここから ― IoTとAIで紡ぐ未来
soracom
PRO
0
320
Matterport を使ってクラスメソッド各拠点のバーチャルオフィスツアーを作成してみた
wakatsuki
0
160
What is DRE? - Road to SRE NEXT@広島
chanyou0311
3
630
MySQLのロックの種類とその競合
yoku0825
6
1.6k
Featured
See All Featured
Documentation Writing (for coders)
carmenintech
63
4.2k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
248
20k
Java REST API Framework Comparison - PWX 2021
mraible
PRO
20
7.2k
jQuery: Nuts, Bolts and Bling
dougneiner
61
7.4k
Raft: Consensus for Rubyists
vanstee
134
6.5k
5 minutes of I Can Smell Your CMS
philhawksworth
200
19k
StorybookのUI Testing Handbookを読んだ
zakiyama
15
4.9k
How to Ace a Technical Interview
jacobian
274
23k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
90
47k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
36
9.1k
Optimizing for Happiness
mojombo
373
69k
YesSQL, Process and Tooling at Scale
rocio
166
14k
Transcript
ࠗఠঌইࠁחਢ܀۞ ળߧ KVO!CFPNJOFU Back to the Basic
ߊ ࣗѐ ળߧ (
[email protected]
) - DjangoGirls Seoul -
<ա݅ ਢ ܀۞ ٜ݅ӝ> दܻૉ ো - ౚషܻ: <ա݅ ਢ ܀۞ ٜ݅ӝ> - Python + Django = <3 - ইೠ ప(ইೠ ഋઁٜ) ੋఢ 2
܀݂? 3
য়ט ೡ Ѫ 4 ܀݂ೡ ѐߊജ҃ ܁ ੜ ॄࠁӝ, CSS
Selectorۆ? ߊࣁ࣌ ܀݂ೞӝ ৡয়झ ۽ӒੋೞҊ ܀݂ೞӝ ਵ۽ ܀݂ೞӝ ( )ചݶহ ܀݂ೞӝ ӝਵ۽ ܀݂ೞӝ
যڃ ജ҃ীࢲ সೞաਃ? • Python 3.6.x (ইޖܻ ծইب 3.4.x /
Python2ॳݶ Ҋా߉ইਃ) • Requests 2.18.x (2.1x.x ࢚ߡݶ ޖդפ.) • beautifulsoup4 4.6.x (4.5 ࢚ ߡݶ ؾפ.) • selenium 3.4.x (selenium ઁա ୭न ߡਸ ਊ೧ࣁਃ.) • Chrome v60 ࢚ (Headlessݽ٘ח 60ߡ ࢚ ਗؾפ.) pip install requests bs4 selenium 5
Requestsۆ? 6 1ZUIPO)5513FRVFTUTGPS)VNBOT >>> import requests >>> r = requests.get('https://api.github.com/user',
auth=('user', 'pass')) >>> r.status_code 200 >>> r.headers['content-type'] 'application/json; charset=utf8' >>> r.encoding 'utf-8' >>> r.text u'{"type":"User"...'
BeautifulSoupۆ? EFTJHOFEGPSRVJDLUVSOBSPVOEQSPKFDUT MJLFTDSFFOTDSBQJOH • Requests۽ ߉ইৡ ؘఠܳ ॆ ೧ೞח ё۽
ٜ݅ӝ • HTML DOM ҳઑ Ӓ۽! 7 from bs4 import BeautifulSoup
HTML DOM? 8 <!DOCTYPE html> <html> <head> <title>ఋౣ</title> </head> <body>
<h1>ઁੌ ઁݾ</h1> <div> 2017 KR </div> </body> </html> TITLE HEAD DIV H1 BODY HTML
܁ Inspector ࢎਊೞӝ 9
CSS Selectorۆ? 10 body > div.frontpage > div.onsky > nav
HTML TAG CSS Class CSS Class CSS Class HTML TAG CSS Classח .~~~ IDח #~~~ ߄۽ ইېח >
рױೠ ਢ ಕ ܀݂ೞӝ 11
ࣁ࣌ ݾ۾ਸ оઉ৬ࠇद 12 https://www.pycon.kr/2017/program/list/
13
14 body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3)
> li:nth-child(16) > a nth-childח BeautifulSoup select()ীࢲ ਗೞ ঋח body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
15 div > div.col-md-9.content > ul > li > a
body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
nth-child vs nth-of-type • NotImplementedError: Only the following pseudo-classes are
implemented: nth-of-type. • div:nth-child(5): э ࠗݽܳ о Element 5ߣ૩. ݅ড divо ইפݶ হחѦ۽ ஜ • div:nth-of-type(5): э ࠗݽܳ о div 5ߣ૩. 16
17 import requests from bs4 import BeautifulSoup as bs #
Getߑधਵ۽ ࣗझܳ оઉ২פ. req = requests.get('https://www.pycon.kr/2017/program/list/') # ؊ࠗ࠙ ইצ HTTP Body(Text)ܳ оઉ২פ. html = req.text # HTMLਸ ॆ ೧ೞח Soup ё۽ यפ. soup = bs(html, 'html.parser') # CSS Selectorܳ ా೧ ղਊޛਸ ݽف ࢶఖפ.(iterable) session_list = soup.select('body > div.container > div:nth-of-type(1) '\ '> div.col-md-9.content > ul:nth-of-type(1) '\ '> li:nth-of-type(16) > a') for session in session_list: # HTML DOMё ղਊޛ(text)݅ ࠇפ. print(session.text)
18 import requests from bs4 import BeautifulSoup as bs #
Getߑधਵ۽ ࣗझܳ оઉ২פ. req = requests.get('https://www.pycon.kr/2017/program/list/') # ؊ࠗ࠙ ইצ HTTP Body(Text)ܳ оઉ২פ. html = req.text # HTMLਸ ॆ ೧ೞח Soup ё۽ यפ. soup = bs(html, 'html.parser') # CSS Selectorܳ ా೧ ղਊޛਸ ݽف ࢶఖפ.(iterable) session_list = soup.select('div > div.col-md-9.content > ul > li > a') for session in session_list: # HTML DOMё ղਊޛ(text)݅ ࠇפ. print(session.text)
19
۽Ӓੋ ਃೠ ҃ 20
ONOFFMIX नݾ۾ оઉয়ӝ 21
۽Ӓੋ যڌѱ ೞաਃ? 22
ONOFFMIX ۽Ӓੋ ڳযࠁӝ 23 http://onoffmix.com/account/login
requests Session ਊ 24
25 import requests from bs4 import BeautifulSoup as bs def
onoffmix(): with requests.Session() as s: s.headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' } login = s.post('https://onoffmix.com/account/login', data={ 'email': '
[email protected]
', 'pw': 'mypassword1234', 'proc': 'login' }) html = s.get('http://onoffmix.com/account/event') soup = bs(html.text, 'html.parser') event_list = soup.select('#eventListHolder > div > ul > li.title > a') for event in event_list: print(event.text)
۽Ӓੋ റ ܐ оઉয়ӝ 26 #eventListHolder > div:nth-child(1) > ul
> li.title > a #eventListHolder > div > ul > li.title > a
27 import requests from bs4 import BeautifulSoup as bs def
onoffmix(): with requests.Session() as s: s.headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' } login = s.post('https://onoffmix.com/account/login', data={ 'email': '
[email protected]
', 'pw': 'mypassword1234', 'proc': 'login' }) html = s.get('http://onoffmix.com/account/event') soup = bs(html.text, 'html.parser') event_list = soup.select('#eventListHolder > div > ul > li.title > a') for event in event_list: print(event.text) 27
28
۽Ӓੋ ցޖ য۰ਕਃ Ӓր ࠳ۄ ॳݶ উغաਃ 29
܁ਵ۽ ܀݂೧ࠇद 30 Selenium + Chrome(v60) pip install selenium https://sites.google.com/a/chromium.org/chromedriver/downloads
31 /Users/ࢎਊܴ/Downloads/chromedriver
32 from selenium import webdriver driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver') # Selenium
ݽٚ ਗਸ оઉয়ӝө 3ୡܳ ӝ۰સפ driver.implicitly_wait(3) # ֎ߡ ചݶਸ оઉ৬ࠇद driver.get('https://naver.com')
33
݅ড ۠ ী۞о լݶ 34 selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs
to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home ‘chromedriver'о PATH ী ١۾غ ঋইࢲ ࢤӝח ޙઁ. https://sites.google.com/a/chromium.org/chromedriver/home ਤ ࣗীࢲ latest driverܳ ߉ই ୷ਸ ಽযળ ҃۽ܳ ഛೞѱ ೞӝ
֎ߡ ۽Ӓੋ ೧ࠁӝ 35 from selenium import webdriver driver =
webdriver.Chrome(‘/Users/username/Downloads/chromedriver') driver.implicitly_wait(3) driver.get('https://naver.com') input_id = driver.find_element_by_css_selector('#id') input_pw = driver.find_element_by_css_selector('#pw') login_button = driver.find_element_by_css_selector('#frmNIDLogin > fieldset > span > input[type="submit"]') input_id.send_keys('someid') input_pw.send_keys(‘mypassword1234!') login_button.click()
֎ߡ ನੋ ҳݒղ ܀݂ 36 #_listContentArea > ul > li:nth-child(1)
> div > div.item_content > div.info_space > p #_listContentArea > ul > li > div > div.item_content > div.info_space > p https://order.pay.naver.com/home?tabMenu=POINT_TOTAL
܁ ࣛীࢲ ܻ ഛੋ೧ࠁӝ 37 CSS Selectorܳ ࠗ ॳ ঋইب
܀݂ оמೞ! ೞ݅ ܁ ࣛ JS
оઉ৬ ࠇद 38 # ਤ ٘ যࢲ login_button.click() driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL') point_list
= driver.find_elements_by_css_selector('div.info_space > p') for point in point_list: print(point.text) driver.quit() ]
ചݶਸ ڸחѱ फযਃ 39 driver.set_window_position(-10000,0) ೞ݅ Ѥ ܻо ߄ۄחѱ ইפભ
Headless Browser like Headless Chrome
ಕझ࠘ झܽࢫ 40 chrome://settings/help ীࢲ ܁ ߡਸ ഛੋ೧ࣁਃ Headless Chrome
v60࢚ীࢲ ࢎਊоמפ. from selenium import webdriver options = webdriver.ChromeOptions() options.add_argument('headless') options.add_argument('window-size=1920x1080') driver = webdriver.Chrome('chromedriver', chrome_options=options) driver.get('https://facebook.com') driver.implicitly_wait(3) email = driver.find_element_by_css_selector('input[type=email]') password = driver.find_element_by_css_selector('input[type=password]') login = driver.find_element_by_css_selector('input[type="submit"]') email.send_keys('
[email protected]
') password.send_keys('ilovepython') login.click() driver.get_screenshot_as_file('facebook.png') driver.quit()
24दр ܀݂ جܻҊ रযਃ 41 Cron স + ࢲߡী ৢܻӝ
Ѥয়טೞঋইਃ
10࠙݃ ೠߣঀ ܀݂ ೞҊरযਃ 42 */10 * * * *
/usr/bin/python3 /home/beomi/parser.py ࠙ द ੌ ਘ ਃੌ ॆ ਤ ॆ ੌ ਤ crontab -e ഛೠ ‘दп’ ं, ‘ݻ ࠙/दр݃’ח */ं ೠঀ झಕझ
43 • एযࣁਃ. time.sleep(3) एযоݶ ࢲߡח ചղ ঋইਃ. • ےؒ.
time.sleep(2 + random.random() * 4)ۢ ےؒਵ۽ एযࣁਃ. • User-Agent ܳ ࠈ ইצ ੌ߈ ࠳ۄۢ ֍যࣁਃ. • ߄झ݀о ݆ ҃ীח requestsח Әߑ Ѧܾࣻ णפ. • robots.txt ܳ ઓ೧ࣁਃ. ࢲߡਗ ޖೠ ইפѢٚਃ. • ࠁా ݽ߄ੌ ಕо PCಕࠁ ܀݂ೞӝ औणפ. (ੌױ Flash৬ ActiveXо হणפ) ࠗ۾: ખ ؊ ࢎۈۢ ܀݂ೞӝ
QnA 44
хࢎפ :D 45