PYCON KR 2017: Web Crawlers from the Ground Up
Beomi
August 13, 2017
Slides for the PYCON KR 2017 (August 13, 2017) session "Web Crawlers from the Ground Up".
Transcript
Web Crawlers from the Ground Up
Junbeom Lee
Back to the Basic
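About the speaker
Junbeom Lee ([email protected])
- DjangoGirls Seoul
- Author of the ongoing series <Build Your Own Web Crawler>
- Tutorial: <Build Your Own Web Crawler>
- Python + Django = <3
- Intern at Woowahan Tech Camp (Woowa Brothers)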
Crawling?
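Today's agenda
- The development environment for crawling
- Getting the most out of Chrome; what is a CSS Selector?
- Crawling the talk sessions
- Logging in to ONOFFMIX and crawling
- Crawling with Chrome
- Crawling without a screen
- Crawling periodically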
What environment are we working in?
• Python 3.6.x (3.4.x at the very least; Python 2 will only bring you pain)
• Requests 2.18.x (any 2.1x.x or later version is fine)
• beautifulsoup4 4.6.x (4.5 or later works)
• selenium 3.4.x (always use the latest selenium release)
• Chrome v60 or later (headless mode is supported from version 60)

    pip install requests bs4 selenium
What is Requests?
Python HTTP Requests for Humans

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'
What is BeautifulSoup?
"designed for quick turnaround projects like screen-scraping"
• Turns the data fetched with Requests into an object that parses and interprets it
• Keeps the HTML DOM structure as-is!

    from bs4 import BeautifulSoup
HTML DOM?

    <!DOCTYPE html>
    <html>
      <head>
        <title>Title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <div> 2017 KR </div>
      </body>
    </html>

DOM tree: HTML contains HEAD (which holds TITLE) and BODY (which holds H1 and DIV).
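The tree on this slide can be reproduced with a few lines of standard-library Python. This is only a sketch to make the nesting visible: `TreePrinter`, `dom_outline` and `VOID_TAGS` are hypothetical names invented here, the sample markup is a shortened English version of the slide's HTML, and the talk itself uses BeautifulSoup rather than `html.parser` directly.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not deepen the tree.
VOID_TAGS = {"meta", "br", "img", "input", "link", "hr"}


class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


SAMPLE = """<!DOCTYPE html>
<html>
<head><title>Title</title></head>
<body><h1>Main heading</h1><div>2017 KR</div></body>
</html>"""


def dom_outline(html):
    """Return the tag hierarchy of `html` as a list of indented tag names."""
    parser = TreePrinter()
    parser.feed(html)
    return parser.lines


print("\n".join(dom_outline(SAMPLE)))
```

Each level of indentation in the printed outline corresponds to one parent-child step in the DOM, which is the relationship the `>` combinator in a CSS selector follows.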
Using the Chrome Inspector
What is a CSS Selector?

    body > div.frontpage > div.onsky > nav

Here body, div and nav are HTML tags, while .frontpage and .onsky are CSS classes.
A CSS class is written as .~~~, an ID as #~~~, and a direct child as >.
Crawling a simple web page
Let's fetch the session list
https://www.pycon.kr/2017/program/list/
    body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

nth-child is not supported by select() in the beautifulsoup4 4.6.x used here, so the selector is rewritten with nth-of-type:

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
    div > div.col-md-9.content > ul > li > a

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
nth-child vs nth-of-type
• NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
• div:nth-child(5): the 5th element among those sharing the same parent. If that element is not a div, nothing matches.
• div:nth-of-type(5): the 5th div among those sharing the same parent.
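The difference between the two pseudo-classes can be made concrete by modelling one parent's children as a plain list of tag names. The sketch below is illustrative only: `nth_child`, `nth_of_type` and the `children` list are hypothetical and are not part of BeautifulSoup.

```python
def nth_child(siblings, tag, n):
    """Index of the element matched by `tag:nth-child(n)`, or None.

    `siblings` lists the tag names under one parent, in document order.
    """
    if 1 <= n <= len(siblings) and siblings[n - 1] == tag:
        return n - 1
    return None


def nth_of_type(siblings, tag, n):
    """Index of the element matched by `tag:nth-of-type(n)`, or None."""
    same_type = [i for i, name in enumerate(siblings) if name == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


# Hypothetical children of a single parent element.
children = ["ul", "h2", "ul", "div"]

print(nth_child(children, "ul", 3))    # 3rd child overall, and it is a ul
print(nth_of_type(children, "ul", 2))  # 2nd ul, which is the same element
print(nth_child(children, "div", 2))   # 2nd child is an h2, so no match
```

With this layout `ul:nth-child(3)` and `ul:nth-of-type(2)` pick the same element, which is the kind of renumbering needed when converting a selector from one form to the other.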
    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Build a Soup object that parses and interprets the HTML.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (iterable).
    session_list = soup.select(
        'body > div.container > div:nth-of-type(1) '
        '> div.col-md-9.content > ul:nth-of-type(1) '
        '> li:nth-of-type(16) > a'
    )
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Build a Soup object that parses and interprets the HTML.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
When login is required
Fetching the ONOFFMIX registration list
How do we log in?
Taking the ONOFFMIX login apart
http://onoffmix.com/account/login
Using a requests Session
    import requests
    from bs4 import BeautifulSoup as bs

    def onoffmix():
        with requests.Session() as s:
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            }
            login = s.post('https://onoffmix.com/account/login', data={
                'email': '[email protected]',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)
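What the Session contributes in the code above is cookie persistence: the cookie set by the login response is sent again on the next request. The sketch below shows that effect with the standard library only, against a toy local server rather than ONOFFMIX. `DemoSite`, `fetch_events`, `run_demo`, the `/login` and `/events` paths and the `sid=demo` cookie are all made up for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Toy site: POST /login sets a cookie, GET /events requires it."""

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Drain the form body, then hand out a session cookie.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self._reply(200, b"logged in", cookie="sid=demo; Path=/")

    def do_GET(self):
        if "sid=demo" in self.headers.get("Cookie", ""):
            self._reply(200, b"my events")
        else:
            self._reply(403, b"login required")

    def log_message(self, *args):  # keep the demo output quiet
        pass


def fetch_events(base_url, keep_cookies):
    """Log in, then request /events with or without a shared cookie jar."""
    handlers = [urllib.request.ProxyHandler({})]  # ignore any system proxy
    if keep_cookies:
        handlers.append(urllib.request.HTTPCookieProcessor(CookieJar()))
    opener = urllib.request.build_opener(*handlers)
    opener.open(base_url + "/login", data=b"email=me&pw=secret").close()
    try:
        with opener.open(base_url + "/events") as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_demo():
    """Start the toy server and return (status with jar, status without)."""
    server = HTTPServer(("127.0.0.1", 0), DemoSite)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = "http://127.0.0.1:%d" % server.server_port
    try:
        return fetch_events(base_url, True), fetch_events(base_url, False)
    finally:
        server.shutdown()
        server.server_close()


print(run_demo())
```

The client that keeps a cookie jar gets through to the protected page, while the one that forgets cookies between requests is rejected. `requests.Session` takes care of this bookkeeping automatically.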
Fetching data after login

    #eventListHolder > div:nth-child(1) > ul > li.title > a
    #eventListHolder > div > ul > li.title > a
Logging in is too hard. Can't we just use a browser?
Let's crawl with Chrome
Selenium + Chrome(v60)

    pip install selenium

https://sites.google.com/a/chromium.org/chromedriver/downloads
/Users/<username>/Downloads/chromedriver
    from selenium import webdriver

    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Let Selenium wait up to 3 seconds for resources to load.
    driver.implicitly_wait(3)
    # Load the Naver main page.
    driver.get('https://naver.com')
If you see an error like this

    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

This happens because 'chromedriver' is not registered in PATH.
https://sites.google.com/a/chromium.org/chromedriver/home
Download the latest driver from the address above, unzip it, and specify the exact path to the extracted file.
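Before handing a driver location to Selenium, it can help to check that an executable actually exists there or somewhere on PATH. The following is a small sketch using only the standard library. `find_driver` is a hypothetical helper, not a Selenium API.

```python
import os
import shutil


def find_driver(candidate="chromedriver"):
    """Return a usable driver path, or None if nothing executable is found.

    `candidate` may be an explicit file path or a bare command name.
    """
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return os.path.abspath(candidate)
    # Fall back to searching the directories listed in PATH.
    return shutil.which(candidate)


path = find_driver("chromedriver")
print(path or "chromedriver not found: pass its full path or add it to PATH")
```

If the function returns None, the WebDriverException shown above is the likely outcome, so the fix is either to pass the full path of the unzipped file or to move it into a directory that is on PATH.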
Logging in to Naver

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')
    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector(
        '#frmNIDLogin > fieldset > span > input[type="submit"]')
    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()
Crawling the Naver Pay point purchase history

    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p
    #_listContentArea > ul > li > div > div.item_content > div.info_space > p

https://order.pay.naver.com/home?tabMenu=POINT_TOTAL
Checking in the Chrome console first
You can crawl without writing out the entire CSS Selector! Keep in mind, though, that the Chrome console is JS.
Let's fetch it

    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()
I don't want a window to pop up

    driver.set_window_position(-10000,0)

But this is not really what we want.
Headless Browser like Headless Chrome
Facebook screenshot
Check your Chrome version at chrome://settings/help
Headless Chrome is available from v60 onward.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # Chrome expects the window size as width,height.
    options.add_argument('window-size=1920,1080')
    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)
    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')
    email.send_keys('[email protected]')
    password.send_keys('ilovepython')
    login.click()
    driver.get_screenshot_as_file('facebook.png')
    driver.quit()
I want to keep crawling 24 hours a day
A cron job + deploying to a server
We won't cover this today.
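I want to crawl once every 10 minutes

    crontab -e
    */10 * * * * /usr/bin/python3 /home/beomi/parser.py

The five fields are minute, hour, day, month and weekday, followed by the location of Python and the location of the Python file.
Use a plain number for an exact time and */number for "every N minutes/hours". Separate the fields with one space each.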
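To see which minutes a field such as `*/10` selects, the rule can be sketched in a few lines of Python. `minute_matches` is a hypothetical helper that understands only the three forms mentioned here ("*", a plain number, and "*/N"). Real cron accepts more syntax, such as lists and ranges, which this sketch does not handle.

```python
def minute_matches(field, minute):
    """Check one cron minute field against a minute in the range 0-59.

    Handles only three forms: "*", a plain number, and "*/N".
    """
    if field == "*":
        return True
    if field.startswith("*/"):
        return minute % int(field[2:]) == 0
    return minute == int(field)


run_minutes = [m for m in range(60) if minute_matches("*/10", m)]
print(run_minutes)  # [0, 10, 20, 30, 40, 50]
```

With the other four fields left as `*`, the crontab line above would therefore start the parser six times an hour, at minutes 0, 10, 20, 30, 40 and 50.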
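Appendix: crawling a bit more like a human
• Take breaks. If you pause with time.sleep(3), the server won't get angry.
• Randomize. Pause for random intervals, as in time.sleep(2 + random.random() * 4).
• Set the User-Agent to look like a regular browser rather than a bot.
• On JavaScript-heavy sites, requests may get caught quickly.
• Respect robots.txt. Server resources are not unlimited.
• Mobile pages are usually easier to crawl than PC pages. (For one thing, there is no Flash or ActiveX.)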
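Two of the tips above, the random pause and respecting robots.txt, can be sketched with the standard library alone. The rules in `ROBOTS_TXT`, the `my-crawler` agent name, the example.com URLs and the helpers `polite_delay` and `allowed` are assumptions made for illustration. A real crawler would download the site's own robots.txt.

```python
import random
from urllib import robotparser

# Hypothetical robots.txt rules for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def polite_delay(base=2.0, jitter=4.0):
    """Seconds to wait: `base` plus a random extra amount below `jitter`."""
    return base + random.random() * jitter


def allowed(robots_txt, user_agent, url):
    """Ask the standard-library parser whether `url` may be fetched."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


for url in ["https://example.com/program/list/", "https://example.com/private/me"]:
    if allowed(ROBOTS_TXT, "my-crawler", url):
        print("fetch", url, "then sleep", round(polite_delay(), 1), "s")
    else:
        print("skip ", url)
```

In an actual crawl loop, the value from `polite_delay()` would be passed to `time.sleep()` between requests, which gives the same 2 to 6 second range as the expression on the slide.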
QnA
Thank you :D