Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
PYCON KR 2017: 처음부터 알아보는 웹 크롤러
Search
Beomi
August 13, 2017
Technology
16k
3
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
PYCON KR 2017: 처음부터 알아보는 웹 크롤러
PYCON KR 2017 (2017. 08. 13) 처음부터 알아보는 웹 크롤러 세션 발표자료
Beomi
August 13, 2017
More Decks by Beomi
See All by Beomi
[Devfest Incheon 2025] 모두를 위한 친절한 언어모델(LLM) 학습 가이드
beomi
1
1.6k
[I/O Extended 2025 인천] 1인 개발 서비스를 위한 Gemini CLI 사용기
beomi
0
200
1인개발로 AI서비스앱 만들기: 1, 10, 100, 1000, 10000, 그리고 100000명까지 (feat. Smart Spam Filter)
beomi
0
100
[2024.11.27] SK WaveHill Meetup - LLM Fine-tuning
beomi
0
240
[PyCon Korea 2024 Keynote] 커뮤니티와 파이썬, 그리고 우리
beomi
0
240
[PyCon Korea 2024 Session] 우리 모두는 스팸에서 자유로울 권리가 있다 - Smart Spam Filter 개발기
beomi
0
240
[2024 창구 성장 지원 세미나] LLM과 온디바이스LM으로 스팸필터 앱 서비스 만들기
beomi
0
120
[2024.08.30] Gemma-Ko, 오픈 언어모델에 한국어 입히기 @ 머신러닝부트캠프2024
beomi
0
1.3k
[PyConKR 2019] 온라인 뉴스 댓글은 정말 사람들의 목소리일까? - PART2
beomi
3
3.2k
Other Decks in Technology
See All in Technology
AIのReact習熟度を測る
uhyo
2
590
Bucharest Tech Week 2026 - Reinventing testing practices in the AI era
edeandrea
PRO
1
160
入門!AWS Blocks
ysuzuki
1
130
小さくはじめるSLI/SLO ~育てながら組織に定着させる実践知~ / Starting Small with SLI/SLOs: Building Adoption Through Continuous Growth
nari_ex
7
2k
AmazonRoute 53ではじめてのドメイン取得!HTTPS化までの道のりを整理してみた
usanchuu
3
140
就職⽀援サービスにおけるキャリアアドバイザーのシフトスケジューリング
recruitengineers
PRO
1
150
新しいUbuntu/GNOMEが使いたいからXからWaylandへ移行頑張ってるの巻 2026-06-20
nobutomurata
0
120
AIエージェントが名古屋の猛暑からあなたを守る
happysamurai294
0
120
Kubernetesにおける学習基盤とLLMOpsの概要
ry
1
310
2026 TECHFRESH 畢業分享會 - 開發日常大解密!從領域驅動到企業級上線
line_developers_tw
PRO
0
1.1k
SONiC Scale-Up Working Group から探る Scale-UpやUltraEthernet機能の実装方法
ebiken
PRO
2
350
2026TECHFRESH畢業分享會 - AI 時代的人生存檔點
line_developers_tw
PRO
0
1.1k
Featured
See All Featured
sira's awesome portfolio website redesign presentation
elsirapls
0
280
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
62
54k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Building Adaptive Systems
keathley
44
3.1k
Max Prin - Stacking Signals: How International SEO Comes Together (And Falls Apart)
techseoconnect
PRO
0
180
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
310
We Are The Robots
honzajavorek
0
250
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
150
Agile Leadership in an Agile Organization
kimpetersen
PRO
0
160
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
28
3.5k
Raft: Consensus for Rubyists
vanstee
141
7.5k
AI Search: Implications for SEO and How to Move Forward - #ShenzhenSEOConference
aleyda
1
1.3k
Transcript
ࠗఠঌইࠁחਢ܀۞ ળߧ KVO!CFPNJOFU Back to the Basic
ߊ ࣗѐ ળߧ (
[email protected]
) - DjangoGirls Seoul -
<ա݅ ਢ ܀۞ ٜ݅ӝ> दܻૉ ো - ౚషܻ: <ա݅ ਢ ܀۞ ٜ݅ӝ> - Python + Django = <3 - ইೠ ప(ইೠ ഋઁٜ) ੋఢ 2
܀݂? 3
য়ט ೡ Ѫ 4 ܀݂ೡ ѐߊജ҃ ܁ ੜ ॄࠁӝ, CSS
Selectorۆ? ߊࣁ࣌ ܀݂ೞӝ ৡয়झ ۽ӒੋೞҊ ܀݂ೞӝ ਵ۽ ܀݂ೞӝ ( )ചݶহ ܀݂ೞӝ ӝਵ۽ ܀݂ೞӝ
যڃ ജ҃ীࢲ সೞաਃ? • Python 3.6.x (ইޖܻ ծইب 3.4.x /
Python2ॳݶ Ҋా߉ইਃ) • Requests 2.18.x (2.1x.x ࢚ߡݶ ޖդפ.) • beautifulsoup4 4.6.x (4.5 ࢚ ߡݶ ؾפ.) • selenium 3.4.x (selenium ઁա ୭न ߡਸ ਊ೧ࣁਃ.) • Chrome v60 ࢚ (Headlessݽ٘ח 60ߡ ࢚ ਗؾפ.) pip install requests bs4 selenium 5
Requestsۆ? 6 1ZUIPO)5513FRVFTUTGPS)VNBOT >>> import requests >>> r = requests.get('https://api.github.com/user',
auth=('user', 'pass')) >>> r.status_code 200 >>> r.headers['content-type'] 'application/json; charset=utf8' >>> r.encoding 'utf-8' >>> r.text u'{"type":"User"...'
BeautifulSoupۆ? EFTJHOFEGPSRVJDLUVSOBSPVOEQSPKFDUT MJLFTDSFFOTDSBQJOH • Requests۽ ߉ইৡ ؘఠܳ ॆ ೧ೞח ё۽
ٜ݅ӝ • HTML DOM ҳઑ Ӓ۽! 7 from bs4 import BeautifulSoup
HTML DOM? 8 <!DOCTYPE html> <html> <head> <title>ఋౣ</title> </head> <body>
<h1>ઁੌ ઁݾ</h1> <div> 2017 KR </div> </body> </html> TITLE HEAD DIV H1 BODY HTML
܁ Inspector ࢎਊೞӝ 9
CSS Selectorۆ? 10 body > div.frontpage > div.onsky > nav
HTML TAG CSS Class CSS Class CSS Class HTML TAG CSS Classח .~~~ IDח #~~~ ߄۽ ইېח >
рױೠ ਢ ಕ ܀݂ೞӝ 11
ࣁ࣌ ݾ۾ਸ оઉ৬ࠇद 12 https://www.pycon.kr/2017/program/list/
13
14 body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3)
> li:nth-child(16) > a nth-childח BeautifulSoup select()ীࢲ ਗೞ ঋח body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
15 div > div.col-md-9.content > ul > li > a
body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
nth-child vs nth-of-type • NotImplementedError: Only the following pseudo-classes are
implemented: nth-of-type. • div:nth-child(5): э ࠗݽܳ о Element 5ߣ૩. ݅ড divо ইפݶ হחѦ۽ ஜ • div:nth-of-type(5): э ࠗݽܳ о div 5ߣ૩. 16
17 import requests from bs4 import BeautifulSoup as bs #
Getߑधਵ۽ ࣗझܳ оઉ২פ. req = requests.get('https://www.pycon.kr/2017/program/list/') # ؊ࠗ࠙ ইצ HTTP Body(Text)ܳ оઉ২פ. html = req.text # HTMLਸ ॆ ೧ೞח Soup ё۽ यפ. soup = bs(html, 'html.parser') # CSS Selectorܳ ా೧ ղਊޛਸ ݽف ࢶఖפ.(iterable) session_list = soup.select('body > div.container > div:nth-of-type(1) '\ '> div.col-md-9.content > ul:nth-of-type(1) '\ '> li:nth-of-type(16) > a') for session in session_list: # HTML DOMё ղਊޛ(text)݅ ࠇפ. print(session.text)
18 import requests from bs4 import BeautifulSoup as bs #
Getߑधਵ۽ ࣗझܳ оઉ২פ. req = requests.get('https://www.pycon.kr/2017/program/list/') # ؊ࠗ࠙ ইצ HTTP Body(Text)ܳ оઉ২פ. html = req.text # HTMLਸ ॆ ೧ೞח Soup ё۽ यפ. soup = bs(html, 'html.parser') # CSS Selectorܳ ా೧ ղਊޛਸ ݽف ࢶఖפ.(iterable) session_list = soup.select('div > div.col-md-9.content > ul > li > a') for session in session_list: # HTML DOMё ղਊޛ(text)݅ ࠇפ. print(session.text)
19
۽Ӓੋ ਃೠ ҃ 20
ONOFFMIX नݾ۾ оઉয়ӝ 21
۽Ӓੋ যڌѱ ೞաਃ? 22
ONOFFMIX ۽Ӓੋ ڳযࠁӝ 23 http://onoffmix.com/account/login
requests Session ਊ 24
25 import requests from bs4 import BeautifulSoup as bs def
onoffmix(): with requests.Session() as s: s.headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' } login = s.post('https://onoffmix.com/account/login', data={ 'email': '
[email protected]
', 'pw': 'mypassword1234', 'proc': 'login' }) html = s.get('http://onoffmix.com/account/event') soup = bs(html.text, 'html.parser') event_list = soup.select('#eventListHolder > div > ul > li.title > a') for event in event_list: print(event.text)
۽Ӓੋ റ ܐ оઉয়ӝ 26 #eventListHolder > div:nth-child(1) > ul
> li.title > a #eventListHolder > div > ul > li.title > a
27 import requests from bs4 import BeautifulSoup as bs def
onoffmix(): with requests.Session() as s: s.headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) ' 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' } login = s.post('https://onoffmix.com/account/login', data={ 'email': '
[email protected]
', 'pw': 'mypassword1234', 'proc': 'login' }) html = s.get('http://onoffmix.com/account/event') soup = bs(html.text, 'html.parser') event_list = soup.select('#eventListHolder > div > ul > li.title > a') for event in event_list: print(event.text) 27
28
۽Ӓੋ ցޖ য۰ਕਃ Ӓր ࠳ۄ ॳݶ উغաਃ 29
܁ਵ۽ ܀݂೧ࠇद 30 Selenium + Chrome(v60) pip install selenium https://sites.google.com/a/chromium.org/chromedriver/downloads
31 /Users/ࢎਊܴ/Downloads/chromedriver
32 from selenium import webdriver driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver') # Selenium
ݽٚ ਗਸ оઉয়ӝө 3ୡܳ ӝ۰સפ driver.implicitly_wait(3) # ֎ߡ ചݶਸ оઉ৬ࠇद driver.get('https://naver.com')
33
݅ড ۠ ী۞о լݶ 34 selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs
to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home ‘chromedriver'о PATH ী ١۾غ ঋইࢲ ࢤӝח ޙઁ. https://sites.google.com/a/chromium.org/chromedriver/home ਤ ࣗীࢲ latest driverܳ ߉ই ୷ਸ ಽযળ ҃۽ܳ ഛೞѱ ೞӝ
֎ߡ ۽Ӓੋ ೧ࠁӝ 35 from selenium import webdriver driver =
webdriver.Chrome(‘/Users/username/Downloads/chromedriver') driver.implicitly_wait(3) driver.get('https://naver.com') input_id = driver.find_element_by_css_selector('#id') input_pw = driver.find_element_by_css_selector('#pw') login_button = driver.find_element_by_css_selector('#frmNIDLogin > fieldset > span > input[type="submit"]') input_id.send_keys('someid') input_pw.send_keys(‘mypassword1234!') login_button.click()
֎ߡ ನੋ ҳݒղ ܀݂ 36 #_listContentArea > ul > li:nth-child(1)
> div > div.item_content > div.info_space > p #_listContentArea > ul > li > div > div.item_content > div.info_space > p https://order.pay.naver.com/home?tabMenu=POINT_TOTAL
܁ ࣛীࢲ ܻ ഛੋ೧ࠁӝ 37 CSS Selectorܳ ࠗ ॳ ঋইب
܀݂ оמೞ! ೞ݅ ܁ ࣛ JS
оઉ৬ ࠇद 38 # ਤ ٘ যࢲ login_button.click() driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL') point_list
= driver.find_elements_by_css_selector('div.info_space > p') for point in point_list: print(point.text) driver.quit() ]
ചݶਸ ڸחѱ फযਃ 39 driver.set_window_position(-10000,0) ೞ݅ Ѥ ܻо ߄ۄחѱ ইפભ
Headless Browser like Headless Chrome
ಕझ࠘ झܽࢫ 40 chrome://settings/help ীࢲ ܁ ߡਸ ഛੋ೧ࣁਃ Headless Chrome
v60࢚ীࢲ ࢎਊоמפ. from selenium import webdriver options = webdriver.ChromeOptions() options.add_argument('headless') options.add_argument('window-size=1920x1080') driver = webdriver.Chrome('chromedriver', chrome_options=options) driver.get('https://facebook.com') driver.implicitly_wait(3) email = driver.find_element_by_css_selector('input[type=email]') password = driver.find_element_by_css_selector('input[type=password]') login = driver.find_element_by_css_selector('input[type="submit"]') email.send_keys('
[email protected]
') password.send_keys('ilovepython') login.click() driver.get_screenshot_as_file('facebook.png') driver.quit()
24दр ܀݂ جܻҊ रযਃ 41 Cron স + ࢲߡী ৢܻӝ
Ѥয়טೞঋইਃ
10࠙݃ ೠߣঀ ܀݂ ೞҊरযਃ 42 */10 * * * *
/usr/bin/python3 /home/beomi/parser.py ࠙ द ੌ ਘ ਃੌ ॆ ਤ ॆ ੌ ਤ crontab -e ഛೠ ‘दп’ ं, ‘ݻ ࠙/दр݃’ח */ं ೠঀ झಕझ
43 • एযࣁਃ. time.sleep(3) एযоݶ ࢲߡח ചղ ঋইਃ. • ےؒ.
time.sleep(2 + random.random() * 4)ۢ ےؒਵ۽ एযࣁਃ. • User-Agent ܳ ࠈ ইצ ੌ߈ ࠳ۄۢ ֍যࣁਃ. • ߄झ݀о ݆ ҃ীח requestsח Әߑ Ѧܾࣻ णפ. • robots.txt ܳ ઓ೧ࣁਃ. ࢲߡਗ ޖೠ ইפѢٚਃ. • ࠁా ݽ߄ੌ ಕо PCಕࠁ ܀݂ೞӝ औणפ. (ੌױ Flash৬ ActiveXо হणפ) ࠗ۾: ખ ؊ ࢎۈۢ ܀݂ೞӝ
QnA 44
хࢎפ :D 45