PYCON KR 2017: Web Crawlers from the Ground Up (처음부터 알아보는 웹 크롤러)

Beomi
August 13, 2017

Slides for the PyCon KR 2017 session "Web Crawlers from the Ground Up" (2017-08-13).

Transcript

  1. Web Crawlers from the Ground Up
    Lee Junbeom (이준범)
    jun@beomi.net
    Back to the Basic

  2. About the speaker
    Lee Junbeom ( [email protected] )
    - DjangoGirls Seoul
    - Writing the series "Building My Own Web Crawler" (나만의 웹 크롤러 만들기)
    - PyCon tutorial: "Building My Own Web Crawler"
    - Python + Django = <3
    - Intern at Woowahan Tech Camp (Woowa Brothers)

  3. Crawling?

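  4. Today's agenda
    Setting up a development environment for crawling
    Getting the most out of Chrome; what is a CSS selector?
    Crawling the PyCon session list
    Logging in to ONOFFMIX and crawling
    Crawling with Chrome
    Crawling without a visible browser window
    Crawling on a schedule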

  5. What environment are we working in?
    • Python 3.6.x (3.4.x at the very least; Python 2 will only cause you pain)
    • Requests 2.18.x (any 2.1x.x or later release should be fine)
    • beautifulsoup4 4.6.x (4.5 or later works)
    • selenium 3.4.x (always use the latest Selenium release)
    • Chrome v60 or later (headless mode is supported from version 60)

    pip install requests bs4 selenium
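
A quick way to compare your own setup against the versions listed above is to ask Python which packages are installed. This is only a sketch using the standard library (importlib.metadata, Python 3.8+); the helper names are made up for illustration, and it reports versions without judging them.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("requests", "beautifulsoup4", "selenium")):
    """Map 'python' and each package name to its version (None if absent)."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for package in packages:
        report[package] = installed_version(package)
    return report


# Print one line per component, e.g. "requests: 2.18.4" or "selenium: not installed".
for name, version in environment_report().items():
    print(f"{name}: {version or 'not installed'}")
```

Note that the pip name bs4 is a thin wrapper that pulls in beautifulsoup4, which is why the report looks up beautifulsoup4.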

  6. What is Requests?
    Python HTTP Requests for Humans
    >>> import requests
    >>> r = requests.get('https://api.github.com/user',
    ...                  auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    '{"type":"User"...'

  7. What is BeautifulSoup?
    "designed for quick turnaround projects like screen-scraping"
    • Turns the data fetched with Requests into objects Python can work with
    • Keeps the HTML DOM structure exactly as it is
    from bs4 import BeautifulSoup

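  8. HTML DOM?
    Text of the example page: "Title" (타이틀), "The biggest heading" (제일 큰 제목), "PyCon 2017 KR" (파이콘 2017 KR)
    Elements labelled in the DOM diagram: HTML, HEAD, TITLE, BODY, H1, DIV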
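
To make the nesting of a DOM concrete, the sketch below walks a small page with the standard library's html.parser and collects an indented outline. The markup is an assumption: the slide's own HTML did not survive in this transcript, so SAMPLE_HTML is a plausible reconstruction from the element labels and text above, and the helper names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed markup: reconstructed from the slide's labels, not the original source.
SAMPLE_HTML = """
<html>
  <head><title>Title</title></head>
  <body>
    <h1>The biggest heading</h1>
    <div>PyCon 2017 KR</div>
  </body>
</html>
"""


class OutlineBuilder(HTMLParser):
    """Collect an indented outline of tags and text, one entry per line."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag.upper())
        self.depth += 1  # children of this tag are indented one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # assumes every opened tag is explicitly closed

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(html):
    """Return the outline lines for an HTML string."""
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.lines


print("\n".join(outline(SAMPLE_HTML)))
```

BeautifulSoup builds the same kind of tree, but as objects you can query; this outline only shows which element sits inside which.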

  9. Using the Chrome Inspector

  10. What is a CSS selector?
    body > div.frontpage > div.onsky > nav
    body, div and nav are HTML tags; frontpage and onsky are CSS classes.
    A CSS class is written .name, an ID is written #name, and > selects a direct child.
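
The three notations above (tag, .class, #id, joined by >) can be pulled apart mechanically. The following sketch is not how BeautifulSoup parses selectors; it is a small, hypothetical helper that handles only simple steps joined by the child combinator, which is enough to label the selector from the slide.

```python
import re

# One simple step: an optional tag name followed by any number of .class / #id parts.
STEP_PATTERN = re.compile(r"^(?P<tag>[a-zA-Z][\w-]*)?(?P<rest>(?:[.#][\w-]+)*)$")


def describe_step(step):
    """Split one step such as 'div.frontpage' into tag, id and classes."""
    match = STEP_PATTERN.match(step)
    if match is None:
        raise ValueError(f"unsupported selector step: {step!r}")
    parts = re.findall(r"([.#])([\w-]+)", match.group("rest"))
    return {
        "tag": match.group("tag"),
        "id": next((name for mark, name in parts if mark == "#"), None),
        "classes": [name for mark, name in parts if mark == "."],
    }


def describe_selector(selector):
    """Describe each step of a selector that uses only '>' child combinators."""
    return [describe_step(step.strip()) for step in selector.split(">")]


for step in describe_selector("body > div.frontpage > div.onsky > nav"):
    print(step)
```

Pseudo-classes such as :nth-of-type(1) are deliberately out of scope here and raise ValueError.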

  11. Crawling a simple web page

  12. Let's fetch the PyCon session list
    https://www.pycon.kr/2017/program/list/

  13.

  14.
    body > div.container > div:nth-child(1) >
    div.col-md-9.content > ul:nth-child(3) >
    li:nth-child(16) > a
    nth-child is not supported by select() in BeautifulSoup 4.6.x (support arrived in 4.7), so use nth-of-type instead:
    body > div.container > div:nth-of-type(1) >
    div.col-md-9.content > ul:nth-of-type(2) >
    li:nth-of-type(16) > a

  15.
    div > div.col-md-9.content > ul > li > a
    body > div.container > div:nth-of-type(1) >
    div.col-md-9.content > ul:nth-of-type(2) >
    li:nth-of-type(16) > a

  16. nth-child vs nth-of-type
    • NotImplementedError: Only the following pseudo-classes
      are implemented: nth-of-type.
    • div:nth-child(5):
      the 5th element among all children of the same parent; if that element is not a div, nothing matches.
    • div:nth-of-type(5):
      the 5th div among the children of the same parent.
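
The difference between the two pseudo-classes is easiest to see on a parent whose children are of mixed types. The sketch below re-implements both rules by hand with the standard library's ElementTree; the markup and the function names are invented for illustration and are not BeautifulSoup's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one parent whose children mix <p> and <div> elements.
PARENT = ET.fromstring(
    "<section>"
    "<p>intro</p>"
    "<div>first div</div>"
    "<div>second div</div>"
    "<p>note</p>"
    "<div>third div</div>"
    "</section>"
)


def nth_child(parent, tag, n):
    """tag:nth-child(n): the n-th child overall, but only if it is a <tag>."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """tag:nth-of-type(n): the n-th child counting only the <tag> children."""
    same_type = [child for child in parent if child.tag == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


# The first child is a <p>, so div:nth-child(1) matches nothing,
# while div:nth-of-type(1) skips the <p> and finds the first <div>.
print(nth_child(PARENT, "div", 1))
print(nth_of_type(PARENT, "div", 1).text)
```

This is why a selector copied from Chrome with nth-child(3) can turn into nth-of-type(2): the index changes once siblings of other types are no longer counted.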

  17.
    import requests
    from bs4 import BeautifulSoup as bs
    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Parse the HTML into a Soup object that Python can work with.
    soup = bs(html, 'html.parser')
    # Select every element matching the CSS selector (returns an iterable).
    session_list = soup.select('body > div.container > div:nth-of-type(1) '
                               '> div.col-md-9.content > ul:nth-of-type(1) '
                               '> li:nth-of-type(16) > a')
    for session in session_list:
        # Print only the text content of each element.
        print(session.text)

  18.
    import requests
    from bs4 import BeautifulSoup as bs
    # Fetch the page source with a GET request.
    req = requests.get('https://www.pycon.kr/2017/program/list/')
    # Take the HTTP body (text), not the headers.
    html = req.text
    # Parse the HTML into a Soup object that Python can work with.
    soup = bs(html, 'html.parser')
    # Select every element matching the CSS selector (returns an iterable).
    session_list = soup.select('div > div.col-md-9.content > ul > li > a')
    for session in session_list:
        # Print only the text content of each element.
        print(session.text)

  19.

  20. When login is required

  21. Fetching your ONOFFMIX registration list

  22. How do we log in?

  23. Dissecting the ONOFFMIX login
    http://onoffmix.com/account/login

  24. Using a requests Session

  25.
    import requests
    from bs4 import BeautifulSoup as bs
    def onoffmix():
        # A Session keeps the login cookies for the requests that follow.
        with requests.Session() as s:
            # Present an ordinary browser User-Agent instead of the requests default.
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            }
            # Submit the login form.
            login = s.post('https://onoffmix.com/account/login', data={
                'email': '[email protected]',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Fetch the page that lists your events, then print each title.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)

  26. Fetching data after login
    #eventListHolder > div:nth-child(1) > ul > li.title > a
    #eventListHolder > div > ul > li.title > a

  27.
    import requests
    from bs4 import BeautifulSoup as bs
    def onoffmix():
        # A Session keeps the login cookies for the requests that follow.
        with requests.Session() as s:
            # Present an ordinary browser User-Agent instead of the requests default.
            s.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
            }
            # Submit the login form.
            login = s.post('https://onoffmix.com/account/login', data={
                'email': '[email protected]',
                'pw': 'mypassword1234',
                'proc': 'login'
            })
            # Fetch the page that lists your events, then print each title.
            html = s.get('http://onoffmix.com/account/event')
            soup = bs(html.text, 'html.parser')
            event_list = soup.select('#eventListHolder > div > ul > li.title > a')
            for event in event_list:
                print(event.text)
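
The login code above never looks at the login response, so a failed request goes unnoticed until the selector returns nothing. A minimal sketch of a more defensive version follows. It assumes the session object behaves like requests.Session (post, get, and responses with raise_for_status and text); the function name is hypothetical. A 200 status does not prove the credentials were accepted, since many sites answer a bad login with a normal page, so treat this as a first check only.

```python
def fetch_after_login(session, login_url, credentials, target_url):
    """Log in with POST, then GET a page that needs the login cookies.

    `session` is expected to behave like requests.Session, so the cookies
    set by the login response are sent again on the follow-up GET.
    """
    login_response = session.post(login_url, data=credentials)
    # Stop early on HTTP errors (4xx/5xx) instead of parsing an error page.
    login_response.raise_for_status()

    page = session.get(target_url)
    page.raise_for_status()
    return page.text
```

With the real library this would be called as fetch_after_login(requests.Session(), login_url, form_fields, page_url), and the returned HTML passed to BeautifulSoup exactly as in the slide. Taking the session as a parameter also lets you try the function with a stand-in object before pointing it at a live site.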

  28.

  29. Logging in is too hard.
    Can't we just use a browser?

  30. Let's crawl with Chrome
    Selenium + Chrome(v60)
    pip install selenium
    https://sites.google.com/a/chromium.org/chromedriver/downloads

  31.
    /Users/<username>/Downloads/chromedriver

  32.
    from selenium import webdriver
    driver = webdriver.Chrome('/Users/beomi/Downloads/chromedriver')
    # Implicit wait: allow up to 3 seconds for elements to become available.
    driver.implicitly_wait(3)
    # Load the Naver front page.
    driver.get('https://naver.com')

  33.

  34. If you get an error like this
    selenium.common.exceptions.WebDriverException:
    Message: 'chromedriver' executable needs to be in PATH.
    Please see https://sites.google.com/a/chromium.org/chromedriver/home
    This happens because 'chromedriver' is not registered on PATH.
    https://sites.google.com/a/chromium.org/chromedriver/home
    Download the latest driver from the address above, unzip it, and pass the exact path of the extracted file.
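
The PATH error above can be caught before Selenium starts by checking where chromedriver actually is. The sketch below uses only the standard library; find_chromedriver is a hypothetical helper name, and the check assumes a Unix-like system where the driver file carries the executable bit.

```python
import os
import shutil


def find_chromedriver(explicit_path=None):
    """Return a usable chromedriver path or raise FileNotFoundError.

    An explicit path wins; otherwise the directories on PATH are searched.
    """
    if explicit_path is not None:
        expanded = os.path.expanduser(explicit_path)
        if os.path.isfile(expanded) and os.access(expanded, os.X_OK):
            return expanded
        raise FileNotFoundError(f"no executable chromedriver at {expanded!r}")

    found = shutil.which("chromedriver")
    if found is None:
        raise FileNotFoundError("chromedriver is not on PATH; pass its full path")
    return found
```

Calling find_chromedriver('~/Downloads/chromedriver') first, and handing the result to webdriver.Chrome, turns a confusing WebDriverException into a plain "file not found" message.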

  35. Logging in to Naver
    from selenium import webdriver
    driver = webdriver.Chrome('/Users/username/Downloads/chromedriver')
    driver.implicitly_wait(3)
    driver.get('https://naver.com')
    # Locate the ID field, the password field and the submit button.
    input_id = driver.find_element_by_css_selector('#id')
    input_pw = driver.find_element_by_css_selector('#pw')
    login_button = driver.find_element_by_css_selector('#frmNIDLogin > fieldset > span > input[type="submit"]')
    # Type the credentials and submit the form.
    input_id.send_keys('someid')
    input_pw.send_keys('mypassword1234!')
    login_button.click()

  36. Crawling the Naver Point purchase history
    #_listContentArea > ul > li:nth-child(1) > div > div.item_content > div.info_space > p
    #_listContentArea > ul > li > div > div.item_content > div.info_space > p
    https://order.pay.naver.com/home?tabMenu=POINT_TOTAL

  37. Checking beforehand in the Chrome console
    You can crawl without writing out the entire CSS selector!
    Keep in mind, though, that the Chrome console runs JavaScript.

  38. Let's fetch it
    # Continuing from the code above
    login_button.click()
    driver.get('https://order.pay.naver.com/home?tabMenu=POINT_TOTAL')
    point_list = driver.find_elements_by_css_selector('div.info_space > p')
    for point in point_list:
        print(point.text)
    driver.quit()

  39. I don't want a browser window popping up
    driver.set_window_position(-10000,0)
    But this is not really what we want.
    Headless Browser
    like Headless Chrome

  40. Facebook screenshot
    Check your Chrome version at chrome://settings/help.
    Headless Chrome is available from v60.
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    # Run Chrome without a visible window, at a fixed window size.
    options.add_argument('headless')
    options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome('chromedriver', chrome_options=options)
    driver.get('https://facebook.com')
    driver.implicitly_wait(3)
    email = driver.find_element_by_css_selector('input[type=email]')
    password = driver.find_element_by_css_selector('input[type=password]')
    login = driver.find_element_by_css_selector('input[type="submit"]')
    email.send_keys('[email protected]')
    password.send_keys('ilovepython')
    login.click()
    # Save what the headless browser rendered, then close it.
    driver.get_screenshot_as_file('facebook.png')
    driver.quit()
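
The slides target Selenium 3.4.x. If you follow the advice to install the latest Selenium, be aware that Selenium 4 releases no longer accept the find_element_by_css_selector helpers, the chrome_options keyword, or the driver path as a positional argument. The sketch below shows the Selenium 4 spelling of the same ideas; the function names are hypothetical, and the Selenium imports are placed inside the functions so the argument helper can be used on its own.

```python
def headless_arguments(width=1920, height=1080):
    """Chrome switches for a headless run with a fixed window size."""
    return ["--headless", f"--window-size={width},{height}"]


def element_texts(driver, css_selector):
    """Selenium 4 replacement for find_elements_by_css_selector(...)."""
    from selenium.webdriver.common.by import By

    return [element.text for element in driver.find_elements(By.CSS_SELECTOR, css_selector)]


def screenshot(url, output_path, driver_path=None):
    """Open `url` in headless Chrome and save a screenshot (Selenium 4 API)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for argument in headless_arguments():
        options.add_argument(argument)

    # Selenium 4 takes the driver path through Service and the options by keyword.
    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        return driver.save_screenshot(output_path)
    finally:
        driver.quit()  # always release the browser, even if loading fails
```

For example, screenshot('https://facebook.com', 'facebook.png') corresponds to the slide's script without the login step, and element_texts(driver, 'div.info_space > p') corresponds to the point-history loop shown earlier.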

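  41. I want my crawler running 24 hours a day
    Cron job
    +
    Deploying to a server
    (not covered today)

  42. I want to crawl once every 10 minutes
    */10 * * * * /usr/bin/python3 /home/beomi/parser.py
    Fields: minute, hour, day of month, month, day of week, path to Python, path to the Python file
    Edit the schedule with: crontab -e
    An exact time is written as a number; "every N minutes/hours" is written as */number.
    Separate the fields with a single space each.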
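
The field layout of a crontab line can be checked in code before it is pasted into crontab -e. This is a small sketch with hypothetical helper names; it only builds and splits the text of a line and does not install or validate a schedule with cron itself.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def every_n_minutes(n, command):
    """Build a crontab line that runs `command` every `n` minutes."""
    if not 1 <= n <= 59:
        raise ValueError("n must be between 1 and 59")
    return f"*/{n} * * * * {command}"


def parse_cron_line(line):
    """Split a crontab line into its five schedule fields and the command."""
    parts = line.split(None, 5)  # at most 6 pieces; the command keeps its spaces
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    entry = dict(zip(FIELD_NAMES, parts[:5]))
    entry["command"] = parts[5]
    return entry


line = every_n_minutes(10, "/usr/bin/python3 /home/beomi/parser.py")
print(line)
print(parse_cron_line(line))
```

One caveat: */n restarts at the top of every hour, so a value of n that does not divide 60 gives an uneven gap at the hour boundary.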

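  43. Appendix: crawling a bit more like a human
    • Take breaks. Pause with time.sleep(3) between requests and the server won't get angry.
    • Randomize. Vary the pause, e.g. time.sleep(2 + random.random() * 4).
    • Set a User-Agent that looks like an ordinary browser rather than a bot.
    • On JavaScript-heavy sites, plain requests can get caught quickly.
    • Respect robots.txt. Server resources are not infinite.
    • Mobile pages are usually easier to crawl than desktop pages.
      (For a start, they have no Flash or ActiveX.)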
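
Two of the tips above, random pauses and respecting robots.txt, combine naturally into one loop. The sketch below uses only the standard library; the function names and the fetch parameter are assumptions for illustration. The robots.txt text is passed in as a string so the rules can be tried offline, and fetch can be any callable that takes a URL, such as a thin wrapper around requests.get.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def polite_delay(base=2.0, spread=4.0):
    """Return a random pause length in seconds, in [base, base + spread)."""
    return base + random.random() * spread


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl(urls, rules, fetch, user_agent="*", sleep=time.sleep):
    """Fetch each allowed URL, pausing for a random interval after each one."""
    results = {}
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, so skip it
        results[url] = fetch(url)
        sleep(polite_delay())
    return results
```

The defaults mirror the slide's time.sleep(2 + random.random() * 4). Passing sleep as a parameter is a design choice that lets you replace the real pause with a no-op while experimenting.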

  44. QnA

  45. Thank you :D