PYCON KR 2017: Web Crawlers from the Ground Up (처음부터 알아보는 웹 크롤러)
Beomi
August 13, 2017
Slides for the PYCON KR 2017 (2017. 08. 13) session "Web Crawlers from the Ground Up".
Transcript

Web Crawlers from the Ground Up
Back to the Basic
Beomi

About the speaker
Beomi ([email protected])
- DjangoGirls Seoul
- Author of the blog series "Build Your Own Web Crawler"
- Tutorial: "Build Your Own Web Crawler"
- Python + Django = <3
- Intern at Woowa Tech (Woowa Brothers)

Crawling?
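
Today's agenda
- Setting up a development environment for crawling
- Getting good at Chrome; what is a CSS Selector?
- Crawling the talk sessions
- Logging in to ONOFFMIX and crawling
- Crawling with Chrome
- Crawling without a screen
- Crawling on a schedule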

What environment are we working in?
• Python 3.6.x (3.4.x at the very lowest; Python 2 will only bring you pain)
• Requests 2.18.x (any 2.1x.x or later is fine)
• beautifulsoup4 4.6.x (4.5 or later works)
• selenium 3.4.x (always use the latest selenium release)
• Chrome v60 or later (headless mode is supported from v60)

    pip install requests bs4 selenium

What is Requests?
Python HTTP Requests for Humans

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'

What is BeautifulSoup?
"designed for quick turnaround projects like screen-scraping"
• Turns the data fetched with Requests into an object that Python can understand
• Keeps the HTML DOM structure exactly as it is!

    from bs4 import BeautifulSoup

HTML DOM?

    <!DOCTYPE html>
    <html>
      <head>
        <title>Title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <div> 2017 KR </div>
      </body>
    </html>

    HTML
      HEAD
        TITLE
      BODY
        H1
        DIV
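
The nesting on the HTML DOM slide can be made visible with a few lines of standard-library Python. This is only a minimal sketch, not how BeautifulSoup builds its tree: `TreePrinter`, `dom_outline` and the English placeholder text in `SAMPLE_HTML` are names and values assumed here for illustration.

```python
from html.parser import HTMLParser

# Stand-in for the slide's sample page (text content is a placeholder).
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Main heading</h1>
<div>2017 KR</div>
</body>
</html>"""


class TreePrinter(HTMLParser):
    """Records each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def dom_outline(html):
    """Return the tag hierarchy of `html` as a list of indented lines."""
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


print("\n".join(dom_outline(SAMPLE_HTML)))
```

The printed outline has html at the top, head and body one level in, and title, h1 and div below them, which is the same parent and child structure a CSS selector walks through.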

Using the Chrome Inspector

What is a CSS Selector?

    body > div.frontpage > div.onsky > nav

The pieces are HTML tags (body, div, nav) and CSS classes (.frontpage, .onsky).
• A CSS class is written .~~~
• An ID is written #~~~
• A direct child is written >
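
To make the three notations concrete, here is a small sketch that splits a selector on `>` and labels each piece as a tag, a class, or an ID. `describe_step` and `describe_selector` are hypothetical helpers written for this note; they are not part of BeautifulSoup, and they deliberately ignore pseudo-classes such as `:nth-of-type(...)`.

```python
import re

# One token is an optional "." or "#" prefix followed by a name.
TOKEN = re.compile(r"([.#]?)([\w-]+)")


def describe_step(step):
    """Label the parts of one selector step such as 'div.frontpage'."""
    tag, classes, element_id = None, [], None
    for prefix, name in TOKEN.findall(step):
        if prefix == ".":
            classes.append(name)      # .name is a CSS class
        elif prefix == "#":
            element_id = name         # #name is an ID
        else:
            tag = name                # a bare name is an HTML tag
    return {"tag": tag, "classes": classes, "id": element_id}


def describe_selector(selector):
    """Split on '>' (direct child) and describe every step in order."""
    return [describe_step(step.strip()) for step in selector.split(">")]


for step in describe_selector("body > div.frontpage > div.onsky > nav"):
    print(step)
```

Each printed dictionary corresponds to one level of nesting, read from the outermost element to the one finally selected.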

Crawling a simple web page

Let's fetch the session list
https://www.pycon.kr/2017/program/list/

    body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

nth-child is not supported by BeautifulSoup's select() in the 4.6.x release used here, so use nth-of-type instead:

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a

    div > div.col-md-9.content > ul > li > a

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a

nth-child vs nth-of-type
• NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
• div:nth-child(5): the 5th element among those sharing the same parent. If that element is not a div, nothing matches.
• div:nth-of-type(5): the 5th div among those sharing the same parent.
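
The difference between the two pseudo-classes can be sketched without any HTML at all by treating the children of one parent as a list of tag names. The sibling list below is made up for illustration (it is not the real layout of the PyCon page), and `nth_child` and `nth_of_type` are hypothetical helpers that return the list index of the match, or None.

```python
def nth_child(siblings, tag, n):
    """tag:nth-child(n): the n-th child, but only if it is a `tag`."""
    if 1 <= n <= len(siblings) and siblings[n - 1] == tag:
        return n - 1
    return None


def nth_of_type(siblings, tag, n):
    """tag:nth-of-type(n): the n-th child counting only `tag` elements."""
    same_type = [i for i, name in enumerate(siblings) if name == tag]
    if 1 <= n <= len(same_type):
        return same_type[n - 1]
    return None


# Hypothetical children of one parent element, in document order.
siblings = ["ul", "p", "ul", "div"]

print(nth_child(siblings, "ul", 3))     # third child is a ul
print(nth_of_type(siblings, "ul", 2))   # second ul is that same element
print(nth_child(siblings, "div", 2))    # second child is a p, so no match
print(nth_of_type(siblings, "div", 1))  # first (and only) div
```

With this sibling list, `ul:nth-child(3)` and `ul:nth-of-type(2)` land on the same element, which is why a selector copied with nth-child can be rewritten with a different number under nth-of-type.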

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Turn the HTML into a Soup object that Python can understand.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select('body > div.container > div:nth-of-type(1) '
                               '> div.col-md-9.content > ul:nth-of-type(1) '
                               '> li:nth-of-type(16) > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Turn the HTML into a Soup object that Python can understand.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)

When you need to log in

Fetching your ONOFFMIX registration list

How do we log in?

Taking the ONOFFMIX login apart
http://onoffmix.com/account/login

Using a requests Session

    import requests
    from bs4 import BeautifulSoup as bs


    def onoffmix():
        # The Session keeps cookies, so the login carries over to later requests.
        with requests.Session() as s:
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            }
            # Submit the login form.
            login = s.post('https://onoffmix.com/account/login', data={
                'email': '[email protected]',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Request a page that is only available after logging in.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)

Fetching data after login

    #eventListHolder > div:nth-child(1) > ul > li.title > a

    #eventListHolder > div > ul > li.title > a

Logging in is too hard. Can't I just use a browser?

Let's crawl with Chrome
Selenium + Chrome (v60)

    pip install selenium

https://sites.google.com/a/chromium.org/chromedriver/downloads

    /Users/<username>/Downloads/chromedriver

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Implicit wait: element lookups keep retrying for up to 3 seconds
    # before they fail.
    driver.implicitly_wait(3)
    # Load the Naver front page.
    driver.get('https://naver.com')

If you get an error like this

    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

This happens because 'chromedriver' is not registered on PATH.
Download the latest driver from https://sites.google.com/a/chromium.org/chromedriver/home, unzip it, and specify the path to the extracted file exactly.
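
Before launching Selenium, it can help to check whether the driver is actually reachable. The sketch below uses only the standard library; `find_chromedriver` is a hypothetical helper name, and the check only covers the PATH case, not a driver passed by full path.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver is not on PATH; pass its full path to webdriver.Chrome()")
else:
    print("found chromedriver at", path)
```

If this prints the first message, either add the folder containing chromedriver to PATH or keep passing the full path as the slides do.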

Logging in to Naver

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')
    # Locate the ID field, the password field and the submit button.
    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector(
        '#frmNIDLogin > fieldset > span > input[type="submit"]')
    # Type the credentials and submit the form.
    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()

Crawling the Naver point purchase history
https://order.pay.naver.com/home?tabMenu=POINT_TOTAL

    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p

    #_listContentArea > ul > li > div > div.item_content > div.info_space > p

Check it in the Chrome console first
You can crawl without writing out the whole CSS selector!
Keep in mind, though, that the Chrome console is JS.

Let's fetch it

    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()

I don't want a browser window popping up

    driver.set_window_position(-10000,0)

But this isn't really what we want.
Headless Browser, like Headless Chrome

Facebook screenshot
Check your Chrome version at chrome://settings/help
Headless Chrome is available from v60.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)
    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')
    email.send_keys('[email protected]')
    password.send_keys('ilovepython')
    login.click()
    driver.get_screenshot_as_file('facebook.png')
    driver.quit()

I want to keep crawling 24 hours a day
Cron job + deploying to a server
(We won't cover this today.)
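
I want to crawl once every 10 minutes

    */10 * * * * /usr/bin/python3 /home/beomi/parser.py

Fields, in order: minute, hour, day, month, weekday, location of Python, location of the Python file.
Edit the schedule with crontab -e.
An exact time is written as a number; "every N minutes/hours" is written as */number.
Separate the fields with one space each.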
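
To see what `*/10` means in practice, the crontab entry can be taken apart in plain Python. This is a sketch for illustration only, not how cron itself is implemented: `split_cron_line` and `field_matches` are hypothetical helpers, and `field_matches` handles just the three forms mentioned above (`*`, `*/n`, a plain number) for the minute and hour fields, whose ranges start at 0.

```python
def split_cron_line(line):
    """Split one crontab entry into its five time fields and the command."""
    minute, hour, day, month, weekday, command = line.split(None, 5)
    return {
        "minute": minute, "hour": hour, "day": day,
        "month": month, "weekday": weekday, "command": command,
    }


def field_matches(field, value):
    """Check a minute or hour field against a value ("*", "*/n" or "n")."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return int(field) == value


entry = split_cron_line("*/10 * * * * /usr/bin/python3 /home/beomi/parser.py")
print(entry["command"])

# Minutes of every hour at which the job would start.
runs_at = [m for m in range(60) if field_matches(entry["minute"], m)]
print(runs_at)  # [0, 10, 20, 30, 40, 50]
```

Because the hour, day, month and weekday fields are all `*`, the job starts at those six minute marks in every hour of every day.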
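
Appendix: Crawling a bit more like a human
• Take breaks. If you pause with time.sleep(3) between requests, the server won't get angry.
• Randomize. Pause for a random interval, as in time.sleep(2 + random.random() * 4).
• Set the User-Agent to look like an ordinary browser, not a bot.
• On JavaScript-heavy sites, requests can get caught quickly.
• Respect robots.txt. Server resources are not unlimited.
• Mobile pages are usually easier to crawl than PC pages. (For one thing, they have no Flash or ActiveX.)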
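
Two of the tips above, random pauses and respecting robots.txt, can be sketched with the standard library alone. The helper names (`polite_delay`, `pause`, `allowed_by_robots`), the `my-crawler` user agent, the example.com URLs and the sample rules in `EXAMPLE_ROBOTS` are all assumptions made for this example; a real crawler would download the site's own robots.txt.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only to demonstrate the check.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""


def polite_delay(base=2.0, jitter=4.0):
    """Pick a random pause of at least `base` and under `base + jitter` seconds."""
    return base + random.random() * jitter


def pause():
    """Sleep for a randomized interval between two requests."""
    time.sleep(polite_delay())


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for url in ["https://example.com/", "https://example.com/private/data"]:
    verdict = "fetch" if allowed_by_robots(EXAMPLE_ROBOTS, "my-crawler", url) else "skip"
    print(verdict, url)
```

In a crawl loop, `pause()` would be called after each allowed request, so that the gaps between requests vary instead of arriving at a fixed rhythm.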

QnA

Thank you :D