PYCON KR 2017: Web Crawlers from Scratch (처음부터 알아보는 웹 크롤러)

Beomi
August 13, 2017

Slides for the session "Web Crawlers from Scratch" at PYCON KR 2017 (2017. 08. 13)

Transcript

  1. Web Crawlers from Scratch: Back to the Basic

    이준범 (Junbum Lee), jun@beomi.net

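  2. About the speaker

    이준범 (Junbum Lee) ( jun@beomi.net )
    - DjangoGirls Seoul
    - Author of the blog series <Building My Own Web Crawler>
    - PyCon tutorial: <Building My Own Web Crawler>
    - Python + Django = <3
    - Intern at Woowahan Tech Camp (Woowa Brothers)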
  3. Crawling?

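  4. What we'll do today

    • The development environment for crawling
    • Making good use of Chrome; what is a CSS Selector?
    • Crawling the PyCon talk sessions
    • Logging in to ONOFFMIX and crawling
    • Crawling with …
    • ( ) Crawling without a screen
    • Crawling periodically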
  5. What environment are we working in?

    • Python 3.6.x (3.4.x at the very least; Python 2 will make you suffer)
    • Requests 2.18.x (any 2.1x.x or later is fine)
    • beautifulsoup4 4.6.x (4.5 or later works)
    • selenium 3.4.x (always use the latest selenium release)
    • Chrome v60 or later (Headless mode is supported from v60 onward)

    pip install requests bs4 selenium
  6. What is Requests?

    Python HTTP Requests for Humans

    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    u'{"type":"User"...'
  7. What is BeautifulSoup?

    "designed for quick turnaround projects like screen-scraping"

    • Turns the data fetched with Requests into objects that Python understands
    • Keeps the HTML DOM structure as it is!

    from bs4 import BeautifulSoup
  8. HTML DOM?

    <!DOCTYPE html>
    <html>
      <head>
        <title>Title</title>
      </head>
      <body>
        <h1>Biggest heading</h1>
        <div> PyCon 2017 KR </div>
      </body>
    </html>

    DOM tree: HTML contains HEAD (which contains TITLE) and BODY (which contains H1 and DIV).
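To make the tree idea concrete, here is a minimal sketch that parses the HTML from this slide (with its text translated to English) and walks the resulting tree. It assumes beautifulsoup4 is installed; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The HTML from the slide, with the text translated to English.
html = """<!DOCTYPE html>
<html>
  <head>
    <title>Title</title>
  </head>
  <body>
    <h1>Biggest heading</h1>
    <div>PyCon 2017 KR</div>
  </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Each tag becomes a node, and a child tag is reachable as an attribute of its parent.
title_text = soup.html.head.title.get_text()
heading_text = soup.body.h1.get_text()
div_text = soup.body.div.get_text(strip=True)

# Tag names of the direct element children of <body>, in document order.
body_children = [child.name for child in soup.body.find_all(True, recursive=False)]

print(title_text, heading_text, div_text, body_children)
```

Attribute-style access such as `soup.body.h1` returns only the first matching tag, which is why selectors are the more practical tool on real pages.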
  9. Using the Chrome Inspector

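  10. What is a CSS Selector?

    body > div.frontpage > div.onsky > nav

    In this selector, body, div and nav are HTML tags, and .frontpage and .onsky are CSS classes.
    • A CSS class is written as .~~~
    • An ID is written as #~~~
    • A direct child is written with >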
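The three pieces of notation can be tried out on a small document. The sketch below uses made-up markup shaped like the selector on the slide (the `id` values and links are hypothetical) and assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the selector on the slide.
html = """
<body>
  <div class="frontpage">
    <div class="onsky">
      <nav id="main-nav"><a href="/a">A</a><a href="/b">B</a></nav>
    </div>
    <nav id="footer-nav"><a href="/c">C</a></nav>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# tag > tag.class > tag.class > tag: each ">" steps down to a direct child.
by_path = soup.select("body > div.frontpage > div.onsky > nav")

# "#..." selects by the id attribute.
by_id = soup.select("#footer-nav")

# A space matches descendants at any depth, while ">" matches direct children only.
all_links = soup.select("div.frontpage a")
direct_links = soup.select("div.frontpage > a")

print([nav["id"] for nav in by_path], len(by_id), len(all_links), len(direct_links))
```

No `<a>` is a direct child of `div.frontpage` here, so the last selector returns an empty list even though three links sit somewhere below that div.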
  11. Crawling a simple web page

  12. Let's fetch the PyCon session list

    https://www.pycon.kr/2017/program/list/

  13. 13

  14. body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

    nth-child is not supported by BeautifulSoup's select() (as of the 4.6.x release used in this talk).

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
  15. div > div.col-md-9.content > ul > li > a

    body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
  16. nth-child vs nth-of-type

    • NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
    • div:nth-child(5): the 5th element among those sharing the same parent. If that element is not a div, nothing matches.
    • div:nth-of-type(5): the 5th div among those sharing the same parent.
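The difference between the two pseudo-classes shows up as soon as siblings of different types are mixed. The sketch below uses hypothetical markup and assumes beautifulsoup4 4.7 or later, where select() is backed by the soupsieve package and accepts nth-child as well; on the 4.6.x release used in the talk, the nth-child lines would raise the NotImplementedError quoted above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the element children of <section> are h2, div, p, div.
html = """
<section>
  <h2>Sessions</h2>
  <div>first div</div>
  <p>note</p>
  <div>second div</div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# nth-of-type(2): the 2nd div among the divs under the same parent.
second_of_type = [d.get_text() for d in soup.select("section > div:nth-of-type(2)")]

# nth-child(2): the 2nd child of the parent, matched only because it is a div.
second_child = [d.get_text() for d in soup.select("section > div:nth-child(2)")]

# The 3rd child is a <p>, so div:nth-child(3) matches nothing.
third_child = [d.get_text() for d in soup.select("section > div:nth-child(3)")]

print(second_of_type, second_child, third_child)
```

Selectors copied from the Chrome Inspector use nth-child, which is why the slides rewrite them with nth-of-type and adjust the index where other element types precede the target.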
  17.

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Parse the HTML into a Soup object that Python understands.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select(
        'body > div.container > div:nth-of-type(1) '
        '> div.col-md-9.content > ul:nth-of-type(1) '
        '> li:nth-of-type(16) > a'
    )
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
  18.

    import requests
    from bs4 import BeautifulSoup as bs

    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Parse the HTML into a Soup object that Python understands.
    soup = bs(html, 'html.parser')
    # Select every matching element with a CSS selector (the result is iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each HTML DOM object.
        print(session.text)
  19. 19

  20. When login is required

  21. Fetching the ONOFFMIX registration list

  22. How do we log in?

  23. Taking apart the ONOFFMIX login

    http://onoffmix.com/account/login

  24. Using requests' Session

  25.

    import requests
    from bs4 import BeautifulSoup as bs


    def onoffmix():
        # One Session keeps the login cookies for every later request.
        with requests.Session() as s:
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            }
            # Submit the login form.
            login = s.post('https://onoffmix.com/account/login', data={
                'email': 'usermail@gmail.com',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Fetch a page that is only available after login.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)
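Why a Session helps can be checked without contacting any server. The sketch below is only an illustration: the cookie name and value, the User-Agent string and the example.com URL are all hypothetical stand-ins for what a real login response would provide. It uses `Session.prepare_request()` to build a request the way the session would send it.

```python
import requests

with requests.Session() as s:
    # Headers set on the session are sent with every request made through it.
    s.headers.update({"User-Agent": "Mozilla/5.0 (hypothetical example browser)"})

    # Stand-in for the cookie a real login response would set (hypothetical value).
    s.cookies.set("sessionid", "abc123", domain="example.com", path="/")

    # prepare_request() merges the session's headers and cookies into the request
    # without sending anything over the network.
    prepared = s.prepare_request(
        requests.Request("GET", "http://example.com/account/event")
    )

sent_user_agent = prepared.headers["User-Agent"]
sent_cookie = prepared.headers.get("Cookie")
print(sent_user_agent, sent_cookie)
```

A plain `requests.get()` call starts with an empty cookie jar each time, so the cookie obtained by a login POST would not be attached to the next request.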
  26. Fetching data after login

    #eventListHolder > div:nth-child(1) > ul > li.title > a

    #eventListHolder > div > ul > li.title > a
  28. 28

  29. Logging in is too hard. Can't we just use a browser?

  30. Let's crawl with Chrome

    Selenium + Chrome(v60)
    pip install selenium
    https://sites.google.com/a/chromium.org/chromedriver/downloads

  31. /Users/username/Downloads/chromedriver

  32.

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Tell Selenium to wait up to 3 seconds for elements to appear (implicit wait).
    driver.implicitly_wait(3)
    # Load the Naver front page.
    driver.get('https://naver.com')
  33. 33

  34. If you get an error like this

    selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

    This happens because 'chromedriver' is not registered on PATH.
    https://sites.google.com/a/chromium.org/chromedriver/home
    Download the latest driver from the address above, unzip it, and specify the extracted path exactly.
  35. Logging in to Naver

    from selenium import webdriver

    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')

    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector(
        '#frmNIDLogin > fieldset > span > input[type="submit"]')

    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()
  36. Crawling the Naver Point purchase history

    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p

    #_listContentArea > ul > li > div > div.item_content > div.info_space > p

    https://order.pay.naver.com/home?tabMenu=POINT_TOTAL
  37. Checking in the Chrome console first

    You can crawl without writing out the whole CSS selector!
    But the Chrome console is JS
  38. Let's fetch it

    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()
  39. I don't want a browser window to open

    driver.set_window_position(-10000,0)

    But this is not what we want.

    Headless Browser like Headless Chrome
  40. Facebook screenshot

    Check your Chrome version at chrome://settings/help
    Headless Chrome is available in v60 and later.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')

    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)

    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')

    email.send_keys('username@mail.com')
    password.send_keys('ilovepython')
    login.click()

    driver.get_screenshot_as_file('facebook.png')
    driver.quit()
  41. I want to keep crawling 24 hours a day

    Cron job + deploying to a server (not covered today)
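  42. I want to crawl once every 10 minutes

    crontab -e

    */10 * * * * /usr/bin/python3 /home/beomi/parser.py

    Fields, in order: minute, hour, day, month, day of week, path to Python, path to the Python file.
    • An exact time is written as a number; "every N minutes/hours" is written as */number.
    • Separate the fields with a single space each.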
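To see which minutes a field such as `*/10` selects, the matching rule can be sketched in a few lines of Python. This is only an illustration of the three forms named on the slide for the minute and hour fields (whose ranges start at 0), not a cron implementation; `minute_field_matches` is a hypothetical helper name.

```python
def minute_field_matches(field: str, value: int) -> bool:
    """Return True if a cron minute (or hour) field matches the given value.

    Handles only "*", "*/N" and a plain number.
    """
    if field == "*":
        return True  # "*" matches every value.
    if field.startswith("*/"):
        return value % int(field[2:]) == 0  # "*/N" matches every N-th value from 0.
    return value == int(field)  # A plain number matches that exact value.


# Minutes of an hour at which "*/10" in the minute field fires.
every_10 = [m for m in range(60) if minute_field_matches("*/10", m)]

# A plain number fires at that exact minute only.
at_30 = [m for m in range(60) if minute_field_matches("30", m)]

print(every_10, at_30)
```

With `*` in the remaining four fields, the crontab line on the slide therefore runs parser.py six times an hour, every hour of every day.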
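  43. Appendix: crawling a bit more like a human

    • Take breaks. If you pause with time.sleep(3), the server won't get angry.
    • Randomize. Pause for a random interval, as in time.sleep(2 + random.random() * 4).
    • Set the User-Agent to look like a regular browser rather than a bot.
    • On JavaScript-heavy pages, requests can quickly get caught.
    • Respect robots.txt. Server resources are not infinite.
    • Mobile pages are usually easier to crawl than PC pages. (For one thing, they have no Flash or ActiveX.)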
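Two of these tips, random pauses and respecting robots.txt, can be combined in a small loop. The sketch below uses only the standard library; the robots.txt content, the `crawl` helper, its `fetch` and `delay` parameters and the `example-crawler` user agent are hypothetical, and a real crawler would download the target site's own robots.txt.

```python
import random
import time
from urllib import robotparser


def polite_delay(base: float = 2.0, jitter: float = 4.0) -> float:
    """Return a random delay in [base, base + jitter) seconds, as in the slide's formula."""
    return base + random.random() * jitter


# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def crawl(urls, fetch, user_agent="example-crawler", delay=polite_delay):
    """Call fetch(url) for each URL that robots.txt allows, pausing between requests."""
    results = []
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # Skip paths the site asks crawlers to avoid.
        results.append(fetch(url))
        time.sleep(delay())  # Pause for a random interval before the next request.
    return results
```

Passing the page-fetching function in as `fetch` keeps the politeness logic separate from the HTTP client, so the same loop could wrap either requests or Selenium.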
  44. QnA

  45. Thank you :D