PYCON KR 2017: Web Crawlers from Scratch (처음부터 알아보는 웹 크롤러)

Beomi
August 13, 2017

PYCON KR 2017 (2017. 08. 13) slides for the session 처음부터 알아보는 웹 크롤러 (Web Crawlers from Scratch)


Transcript

  1. About the speaker: Junbum Lee ( [email protected] )

    - DjangoGirls Seoul
    - Author of the series <나만의 웹 크롤러 만들기> (Building My Own Web Crawler)
    - PyCon tutorial: <나만의 웹 크롤러 만들기>
    - Python + Django = <3
    - Intern at Woowa Tech Camp (Woowa Brothers)
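  2. Today's agenda

    • The development environment for crawling
    • Getting the most out of Chrome; what is a CSS Selector?
    • Crawling the PyCon session list
    • Logging in to onoffmix and crawling
    • Crawling with Selenium
    • Crawling without a screen (headless)
    • Crawling periodically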
  3. What environment do we work in?

    • Python 3.6.x (3.4.x at the very least; using Python 2 will be painful)
    • Requests 2.18.x (any 2.1x.x or later release is fine)
    • beautifulsoup4 4.6.x (4.5 or later works)
    • selenium 3.4.x (always use the latest selenium release)
    • Chrome v60 or later (Headless mode is supported from version 60)

        pip install requests bs4 selenium
  4. What is Requests?

    Python HTTP Requests for Humans

        >>> import requests
        >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
        >>> r.status_code
        200
        >>> r.headers['content-type']
        'application/json; charset=utf8'
        >>> r.encoding
        'utf-8'
        >>> r.text
        '{"type":"User"...'
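The auth=('user', 'pass') tuple in the call above makes Requests send HTTP Basic authentication. As a rough sketch of what ends up in the request, the header value can be rebuilt with the standard library alone. The helper name basic_auth_header is made up for this illustration and is not part of Requests; the sketch assumes ASCII credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value used by HTTP Basic auth.

    Hypothetical helper for illustration only; with auth=(user, password)
    Requests produces an equivalent header by itself.
    """
    # Basic auth is just "username:password", base64-encoded.
    token = base64.b64encode(f'{username}:{password}'.encode('utf-8'))
    return 'Basic ' + token.decode('ascii')


if __name__ == '__main__':
    print(basic_auth_header('user', 'pass'))
```

Base64 is an encoding, not encryption, so this kind of login should only travel over HTTPS, as it does in the slide's example URL.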
  5. HTML DOM?

        <!DOCTYPE html>
        <html>
          <head>
            <title>Title</title>
          </head>
          <body>
            <h1>The biggest heading</h1>
            <div> PyCon 2017 KR </div>
          </body>
        </html>

    Resulting DOM tree:

        HTML
          HEAD
            TITLE
          BODY
            H1
            DIV
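To see how the markup on this slide turns into a tree, here is a small sketch that walks the same document with the standard-library html.parser module and records one indented line per opening tag. The class name TreePrinter and the function dom_outline are invented for this example, and the sample text is an English rendering of the slide's HTML.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Title</title>
  </head>
  <body>
    <h1>The biggest heading</h1>
    <div> PyCon 2017 KR </div>
  </body>
</html>"""


class TreePrinter(HTMLParser):
    """Collect one indented line per opening tag to expose the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Deeper elements get more indentation.
        self.lines.append('  ' * self.depth + tag.upper())
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up one level.
        self.depth -= 1


def dom_outline(html):
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == '__main__':
    print('\n'.join(dom_outline(SAMPLE_HTML)))
```

The sketch only works for documents where every tag is closed explicitly, as in this sample; real pages are better handled by a full parser such as BeautifulSoup.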
  6. What is a CSS Selector?

        body > div.frontpage > div.onsky > nav

    • body, div, and nav are HTML tags; frontpage and onsky are CSS classes
    • A CSS class is written as .~~~
    • An ID is written as #~~~
    • A direct child is written as >
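The notation above (tag names, .class, #id, and > for a direct child) can be made concrete by splitting a selector into its steps. The following is only a sketch for reading selectors, not a CSS engine: describe_selector is a hypothetical helper, it understands just the child combinator, and it ignores pseudo-classes such as :nth-of-type().

```python
import re


def describe_selector(selector):
    """Split a 'a > b.c > #d' style selector into tag / id / classes per step."""
    steps = []
    for part in selector.split('>'):
        part = part.strip()
        # A step may start with a tag name; '#id' or '.class' alone has none.
        tag_match = re.match(r'[A-Za-z][A-Za-z0-9]*', part)
        ids = re.findall(r'#([\w-]+)', part)
        steps.append({
            'tag': tag_match.group(0) if tag_match else None,
            'id': ids[0] if ids else None,
            'classes': re.findall(r'\.([\w-]+)', part),
        })
    return steps


if __name__ == '__main__':
    for step in describe_selector('body > div.frontpage > div.onsky > nav'):
        print(step)
```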
  7. (Image-only slide.)

  8. nth-child is not supported by BeautifulSoup's select()

    Selector with nth-child:

        body > div.container > div:nth-child(1) > div.col-md-9.content > ul:nth-child(3) > li:nth-child(16) > a

    Rewritten with nth-of-type:

        body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
  9. Simplified selector:

        div > div.col-md-9.content > ul > li > a

    Full selector:

        body > div.container > div:nth-of-type(1) > div.col-md-9.content > ul:nth-of-type(2) > li:nth-of-type(16) > a
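  10. nth-child vs nth-of-type

    • With the beautifulsoup4 4.6.x used here, nth-child raises: NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
    • div:nth-child(5): the 5th element among all children of the same parent. If that element is not a div, nothing matches.
    • div:nth-of-type(5): the 5th div among the children of the same parent.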
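The difference between the two pseudo-classes is easiest to see on a tiny tree. Below is a minimal sketch using the standard-library xml.etree.ElementTree instead of BeautifulSoup; the sample markup and the helper names nth_child and nth_of_type are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# One parent with mixed children: h2, div#a, p, div#b, div#c
PARENT = ET.fromstring(
    '<section><h2/><div id="a"/><p/><div id="b"/><div id="c"/></section>'
)


def nth_child(parent, tag, n):
    """tag:nth-child(n): the n-th child overall, but only if it is a <tag>."""
    children = list(parent)
    if not 1 <= n <= len(children):
        return None
    candidate = children[n - 1]
    return candidate if candidate.tag == tag else None


def nth_of_type(parent, tag, n):
    """tag:nth-of-type(n): the n-th child counting only <tag> siblings."""
    same_type = [child for child in parent if child.tag == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


if __name__ == '__main__':
    third_child = nth_child(PARENT, 'div', 3)      # 3rd child is <p>
    third_div = nth_of_type(PARENT, 'div', 3)      # 3rd <div> is div#c
    print(third_child is None, third_div.get('id'))
```

This is why a selector copied with nth-child(3) can turn into nth-of-type(2) after conversion: the index has to be recounted among siblings of the same tag.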
  11. Crawling with the full selector

        import requests
        from bs4 import BeautifulSoup as bs

        # Fetch the page source with a GET request.
        req = requests.get('https://www.pycon.kr/2017/program/list/')
        # Take the HTTP body (text), not the headers.
        html = req.text
        # Parse the HTML into a Soup object that Python can work with.
        soup = bs(html, 'html.parser')
        # Select every matching element with a CSS selector (iterable).
        session_list = soup.select(
            'body > div.container > div:nth-of-type(1) '
            '> div.col-md-9.content > ul:nth-of-type(1) '
            '> li:nth-of-type(16) > a'
        )
        for session in session_list:
            # Print only the text content of each HTML DOM object.
            print(session.text)
  12. Crawling with the simplified selector

        import requests
        from bs4 import BeautifulSoup as bs

        # Fetch the page source with a GET request.
        req = requests.get('https://www.pycon.kr/2017/program/list/')
        # Take the HTTP body (text), not the headers.
        html = req.text
        # Parse the HTML into a Soup object that Python can work with.
        soup = bs(html, 'html.parser')
        # Select every matching element with a CSS selector (iterable).
        session_list = soup.select('div > div.col-md-9.content > ul > li > a')
        for session in session_list:
            # Print only the text content of each HTML DOM object.
            print(session.text)
  13. (Image-only slide.)

  14. Logging in to onoffmix with a session

        import requests
        from bs4 import BeautifulSoup as bs


        def onoffmix():
            # A Session keeps cookies, so the login carries over to later requests.
            with requests.Session() as s:
                s.headers = {
                    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/59.0.3071.115 Safari/537.36'
                }
                # Submit the login form.
                login = s.post('https://onoffmix.com/account/login', data={
                    'email': '[email protected]',
                    'pw': 'mypassword1234',
                    'proc': 'login'
                })
                # Fetch a page that is only available after logging in.
                html = s.get('http://onoffmix.com/account/event')
                soup = bs(html.text, 'html.parser')
                event_list = soup.select('#eventListHolder > div > ul > li.title > a')
                for event in event_list:
                    print(event.text)
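The login form fields ('email', 'pw', 'proc') are taken from the slide. Writing a real password into the script makes it easy to leak, so one option is to read the credentials from environment variables and build the form data from them. This is only a sketch: the variable names ONOFFMIX_EMAIL and ONOFFMIX_PW and the helper build_login_payload are hypothetical and not part of the talk.

```python
import os


def build_login_payload(env=None):
    """Build the login form data from environment variables.

    ONOFFMIX_EMAIL and ONOFFMIX_PW are made-up names for this example.
    """
    env = os.environ if env is None else env
    missing = [key for key in ('ONOFFMIX_EMAIL', 'ONOFFMIX_PW') if not env.get(key)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {
        'email': env['ONOFFMIX_EMAIL'],
        'pw': env['ONOFFMIX_PW'],
        'proc': 'login',  # form field name and value as shown on the slide
    }


if __name__ == '__main__':
    demo_env = {'ONOFFMIX_EMAIL': 'me@example.com', 'ONOFFMIX_PW': 'secret'}
    print(build_login_payload(demo_env))
```

The returned dict could then be passed as the data argument of the s.post(...) call in place of the literal values.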
  15. Fetching data after logging in

    Selector with nth-child:

        #eventListHolder > div:nth-child(1) > ul > li.title > a

    Simplified selector:

        #eventListHolder > div > ul > li.title > a
  16. The same onoffmix() code as slide 14.
  17. (Image-only slide.)

  18. Opening a page with Selenium

        from selenium import webdriver

        driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
        # Implicit wait: give element lookups up to 3 seconds while resources load.
        driver.implicitly_wait(3)
        # Load the Naver front page.
        driver.get('https://naver.com')
  19. (Image-only slide.)

  20. If you run into this error

        selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home

    • The cause: 'chromedriver' is not registered in PATH.
    • Download the latest driver from https://sites.google.com/a/chromium.org/chromedriver/home, unzip it, and pass the exact path of the extracted file.
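Before starting the browser it can help to check where chromedriver actually is, so that a missing driver fails with a clear message. A minimal sketch using only the standard library follows; locate_driver is a hypothetical helper name, not a Selenium function.

```python
import shutil
from pathlib import Path


def locate_driver(explicit_path=None, name='chromedriver'):
    """Return a usable driver path, or None if it cannot be found.

    If explicit_path is given, only that file is checked; otherwise the
    directories listed in PATH are searched for an executable called name.
    """
    if explicit_path is not None:
        path = Path(explicit_path).expanduser()
        return str(path) if path.is_file() else None
    return shutil.which(name)


if __name__ == '__main__':
    found = locate_driver()
    if found is None:
        print('chromedriver not found in PATH; pass its full path instead')
    else:
        print('using driver at', found)
```

A non-None result is the kind of path that is passed to webdriver.Chrome(...) in the slides.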
  21. Logging in to Naver

        from selenium import webdriver

        driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
        driver.implicitly_wait(3)
        driver.get('https://naver.com')
        # Locate the ID field, the password field, and the login button.
        input_id = driver.find_element_by_css_selector('#id')
        input_pw = driver.find_element_by_css_selector('#pw')
        login_button = driver.find_element_by_css_selector(
            '#frmNIDLogin > fieldset > span > input[type="submit"]')
        # Type the credentials and submit.
        input_id.send_keys('someid')
        input_pw.send_keys('mypassword1234!')
        login_button.click()
  22. Crawling the Naver point purchase history

    https://order.pay.naver.com/home?tabMenu=POINT_TOTAL

    Selector with nth-child:

        #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p

    Simplified selector:

        #_listContentArea > ul > li > div > div.item_content > div.info_space > p
  23. Let's fetch it

        # Continuing from the code above
        login_button.click()
        driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
        point_list = driver.find_elements_by_css_selector('div.info_space > p')
        for point in point_list:
            print(point.text)
        driver.quit()
  24. Facebook screenshot

    • Check your Chrome version at chrome://settings/help
    • Headless Chrome is available in v60 and later.

        from selenium import webdriver

        options = webdriver.ChromeOptions()
        # Run Chrome without a visible window, at a fixed viewport size.
        options.add_argument('headless')
        options.add_argument('window-size=1920x1080')
        driver = webdriver.Chrome('chromedriver', chrome_options=options)
        driver.get('https://facebook.com')
        driver.implicitly_wait(3)
        email = driver.find_element_by_css_selector('input[type=email]')
        password = driver.find_element_by_css_selector('input[type=password]')
        login = driver.find_element_by_css_selector('input[type="submit"]')
        email.send_keys('[email protected]')
        password.send_keys('ilovepython')
        login.click()
        # Save what the headless browser rendered.
        driver.get_screenshot_as_file('facebook.png')
        driver.quit()
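  25. I want to crawl once every 10 minutes

    Edit the schedule with crontab -e:

        */10 * * * * /usr/bin/python3 /home/beomi/parser.py

    • Fields, in order: minute, hour, day, month, day of week, path to Python, path to the Python file
    • An exact time is written as a number; "every N minutes/hours" is written as */number
    • Separate the fields with one space each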
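To make the */number notation concrete, the sketch below expands a single cron field into the values it matches and splits the crontab line from the slide into schedule and command. It covers only the three forms mentioned on the slide (a plain number, *, and */number), not the full crontab syntax; expand_cron_field and split_crontab_line are hypothetical helper names.

```python
def expand_cron_field(field, low, high):
    """Expand 'N', '*' or '*/N' into the list of matching values in [low, high]."""
    if field == '*':
        return list(range(low, high + 1))
    if field.startswith('*/'):
        step = int(field[2:])
        if step <= 0:
            raise ValueError('step must be positive')
        # '*/10' on minutes means 0, 10, 20, ... counted from the lowest value.
        return list(range(low, high + 1, step))
    return [int(field)]


def split_crontab_line(line):
    """Split a crontab line into its five time fields and the command."""
    minute, hour, day, month, weekday, command = line.split(None, 5)
    return [minute, hour, day, month, weekday], command


if __name__ == '__main__':
    fields, command = split_crontab_line(
        '*/10 * * * * /usr/bin/python3 /home/beomi/parser.py')
    print(expand_cron_field(fields[0], 0, 59))
    print(command)
```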
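  26. Appendix: crawling a bit more like a human

    • Take breaks. If you pause with time.sleep(3), the server won't get angry.
    • Randomize. Pause for random intervals, e.g. time.sleep(2 + random.random() * 4).
    • Set the User-Agent to that of a regular browser rather than a bot.
    • On JavaScript-heavy sites, plain requests traffic can get caught quickly.
    • Respect robots.txt. Server resources are not unlimited.
    • Mobile pages are usually easier to crawl than PC pages (for one thing, they have no Flash or ActiveX).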
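Two of the tips above, random pauses and respecting robots.txt, can be sketched with the standard library. The delay formula is the one from the slide; the robots.txt rules, the example.com URLs, the 'my-crawler' agent name, and the helper names polite_delay and is_allowed are all made up for illustration. The rules are parsed from a list of lines so the sketch runs without network access.

```python
import random
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example.
SAMPLE_ROBOTS = [
    'User-agent: *',
    'Disallow: /account/',
]


def polite_delay(base=2.0, jitter=4.0):
    """Return a random pause in seconds: base + random.random() * jitter."""
    return base + random.random() * jitter


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == '__main__':
    for url in ('https://example.com/event', 'https://example.com/account/event'):
        if is_allowed(SAMPLE_ROBOTS, 'my-crawler', url):
            print('fetching', url)
            time.sleep(polite_delay(0.1, 0.2))  # shortened pause for the demo
        else:
            print('skipping (disallowed by robots.txt)', url)
```

In a real crawler the rules would come from the site's own robots.txt, for example via RobotFileParser.set_url() and read().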