A Practical Introduction to Web Scraping with Python (Pythonで始めるウェブスクレイピング実践入門) / pyconjp-2018

sin-tanaka
September 18, 2018

Slides from a talk given at PyCon JP 2018.

Transcript

  1. A Practical Introduction to Web Scraping with Python
    ## level: All.
    ## Shintaro Tanaka (@_sin_tanaka)

  2. Intended audience
    • You can write Python
    • You have a rough grasp of the web (HTTP requests and responses)
    • You have never done any scraping
    • You have, but assume it just means BeautifulSoup4 and Selenium

  3. Agenda
    1. Library overview
    2. Basics
    3. Practice

  4. Terms used in this session

  5. What is scraping?
    • Collecting resources on the web
    • Accessing the target web page
    • Parsing HTML
    In short: fetching resources from the web and converting them into structured data that can be analyzed

  6. What is a crawler?
    • An automated version of the flow carried out in scraping
    • For example, the Google and Yahoo search engines run crawlers that go as far as building the index
    • Not covered in this session

  7. Shintaro Tanaka (@_sin_tanaka)
    株式会社日本システム技研 / GeekLab.NAGANO staff
    GitHub / Medium: @sin_tanaka
    Likes music and late-night radio
    Has sent pull requests to serverless-wsgi

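  8. What this talk covers
    • The recent library landscape (requests-html, pyppeteer)
    • Scraping basics
    • Common sticking points in scraping (worked examples)
    What this talk does not cover
    • How to build a crawler
    • How to put the collected data to use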

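  9. Motivation for scraping
    • Any data visible in the browser can be collected
    • Data can be collected even when there is no API
    • You can narrow collection to just the data you need
    Compared with using an API:
    • The barrier to entry is low!
    • The degree of freedom is high!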

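  10. Caveats
    "Unfortunately, some researchers crawl Cookpad abusively in order to use its data."
    (quoted from the Cookpad developer blog)
    • Crawling increases server load
    • Do not redistribute the data you collect
    • Read the terms of service carefully
    Scrape with restraint and ethics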

  11. Scraping libraries today
    There is a lot of code from here on.
    Reading it closely will probably make you sleepy, so just skim it.
    It is the last session of day 2, after all.

  12. Kinds of libraries
    • Browser automation: Selenium, puppeteer
    • HTML parsing: html.parser, lxml, html5lib
    • Tree traversal: BeautifulSoup4, pyquery

  13. The standard choices
    BeautifulSoup4 / Selenium

  14. BeautifulSoup4 (bs4)
    $ pip install beautifulsoup4
    • MIT license
    • Latest version at the time of the talk: 4.6.3
    • Developed on Launchpad, an OSS development platform
    • Parses and traverses HTML
    • Plenty of accumulated know-how

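  15. BeautifulSoup4 (bs4)
    Code example
        from bs4 import BeautifulSoup
        import requests

        resp = requests.get('https://www.example.com/')
        # Pass the raw bytes (resp.content) and name the parser explicitly
        bs_obj = BeautifulSoup(resp.content, 'lxml')

        # Text of every <h1> element
        print([i.text for i in bs_obj.find_all('h1')])
    Points
    • In the BeautifulSoup(...) call, pass the byte string from Response().content
      → bs4 works out the character encoding for you
      → detection is faster when the cchardet module is installed
    • In the same call, specify the parser
      → the official documentation recommends lxml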
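Why the byte string matters can be seen without any third-party library. The sketch below is not how bs4 detects encodings internally; it is a deliberately simple illustration in which the helper `sniff_charset` and the sample page are both made up for this example. It shows that a page encoded as Shift_JIS cannot be decoded with a guessed UTF-8, while the raw bytes still carry enough information (the `<meta charset>` declaration) to pick the right codec.

```python
import re


def sniff_charset(raw, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none is found.

    Hypothetical helper for illustration only; real detectors do far more.
    """
    match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:1024], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


# A made-up page, encoded the way an older Japanese site might serve it.
page = '<html><head><meta charset="shift_jis"></head><body><h1>こんにちは</h1></body></html>'
raw = page.encode("shift_jis")

# Guessing UTF-8 fails outright on these bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading the declaration from the bytes gives the codec that works.
charset = sniff_charset(raw)
text = raw.decode(charset)
print(utf8_ok, charset, "こんにちは" in text)
```

Handing a parser the undecoded bytes leaves this decision to code that is built for it, instead of committing to a codec too early.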

  16. Selenium
    $ pip install selenium
    $ # e.g. brew cask install chromedriver
    • Apache License 2.0
    • OSS
    • A browser automation tool built for automating UI tests
    • Drives a real browser, so JavaScript runs
    • Usable from many languages
    ⭐ Stars: over 11,800

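  17. Selenium
    Code example
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run Chrome without a visible window

        driver = webdriver.Chrome(options=options)
        driver.get('https://www.example.com/')

        print(driver.title)

        driver.quit()  # end the session and the driver process
    Points
    • Once you have created a driver with webdriver.Chrome(), always call WebDriver.quit()!
      → quit() and close() both exist and do different things (close() only closes the current window)
      → this keeps zombie processes from piling up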

  18. Selenium
        with webdriver.Chrome(options=options) as driver:
            driver.get('https://www.example.com/')
    The WebDriver class calls quit() in __exit__, so a with statement works too
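The value of the with form is that cleanup runs even when navigation raises. The sketch below shows that guarantee using a hypothetical `FakeDriver` class that only records calls; it stands in for a real WebDriver so the example runs without a browser, and it is not Selenium's implementation.

```python
class FakeDriver:
    """Hypothetical stand-in for a WebDriver; it only records what was called."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(("get", url))
        if "broken" in url:
            raise RuntimeError("navigation failed")

    def quit(self):
        self.calls.append(("quit",))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and on exceptions alike.
        self.quit()
        return False  # do not swallow the exception


driver = FakeDriver()
try:
    with driver:
        driver.get("https://example.com/broken")
except RuntimeError:
    pass

print(driver.calls)
```

Without the with statement, the same guarantee needs an explicit try/finally around every use of the driver.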

  19. These days there are other options too!

  20. requests-html

  21. requests-html
    • https://github.com/kennethreitz/requests-html
    • OSS by the author of requests and pipenv
    • "HTML Parsing for Humans"
    • At 0.9.0 at the time of the talk, so it has not reached a major version yet
    • Parses and traverses HTML
    • Can run JavaScript!
    $ pip install requests_html
    ⭐ Stars: over 8,300

  22. requests-html
    Code example
        from requests_html import HTMLSession

        session = HTMLSession()
        resp = session.get('https://www.example.com/')
        # resp.html is the parsed document; find() takes a CSS selector
        print([i.text for i in resp.html.find('h1')])
    Points
    • Response().html, used in the print line
      → the response is actually an extension of requests.Response
      → it grows an extra property called html

  23. pyppeteer
    • https://github.com/miyakogi/pyppeteer
    • A Python implementation of puppeteer
    • Browser automation
    • The name is very easy to mistype
    $ pip install pyppeteer
    ⭐ Stars: over 900

  24. pyppeteer
    Code example
        import asyncio
        from pyppeteer import launch


        async def main():
            browser = await launch()
            page = await browser.newPage()
            await page.goto('https://example.com')
            print(await page.title())
            await browser.close()


        # Python 3.7+: create an event loop, run main(), then close the loop
        asyncio.run(main())
    Points
    Because it is a port from Node.js:
      → the API is built around returning coroutines
      → method names, argument names and so on are camelCase
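"Returns coroutines" means that calling such a method does no work until the result is awaited. A minimal sketch with only the standard library follows; `fake_title` is a hypothetical stand-in for a `page.goto()` plus `page.title()` pair and does no real browsing.

```python
import asyncio


async def fake_title(url):
    """Hypothetical stand-in for navigating to a page and reading its title."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network wait would
    return f"title of {url}"


async def main():
    pending = fake_title("https://example.com")  # a coroutine object; nothing has run yet
    is_coro = asyncio.iscoroutine(pending)
    title = await pending  # the body runs here
    return is_coro, title


is_coro, title = asyncio.run(main())
print(is_coro, title)
```

Forgetting the await is the usual mistake with this style of API: the call appears to succeed but hands back an unstarted coroutine instead of a value.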

  25. pyquery
    • https://github.com/gawel/pyquery
    • jQuery-like API
    • Even more intuitive than bs4
    • A selector copied from the browser can be passed in unchanged
    • Recommended if you know jQuery
    $ pip install pyquery
    ⭐ Stars: over 1,500

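  26. pyquery
    Code example
        from pyquery import PyQuery

        # Passing url= makes pyquery send the HTTP request itself
        pq = PyQuery(url='https://example.com')
        print(pq.find('h1').text())
    Points
    • Passing just a url to PyQuery(...) is enough; it takes care of the HTTP request too
      → as with bs4, you can of course pass bytes or str instead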

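  27. Summary so far
    • The standard choices are bs4 and Selenium
    • Plenty of accumulated know-how
    • That said, there is a wide range of OSS these days
    • New libraries keep appearing
    • Contribute to the OSS you like!
    • Know which layer a library covers
    • Browser automation / parsing / traversal?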

  28. Rough procedure
    1. Decide on the target page
    2. Send an HTTP request / get the response
    3. Parse the HTML
    4. Extract the data you need

  29. Rough procedure
    Some of these steps are the part the library does for you

  30. Rough procedure
    The others are the part you decide yourself

  31. Scenario
    Get the list of sponsors from the PyCon JP 2017 site
    (idea borrowed from PythonBootCamp)
    URL: https://pycon.jp/2017/ja/sponsors/

  32. Use the developer tools

  33. [Screenshot]
    The corresponding code

  34. [Screenshot]
    The element as it actually appears
    The corresponding code

  35. What we found
    • The URL is https://pycon.jp/2017/ja/sponsors/
    • Finding the h4 tags inside the sponsor-content class and taking their text should do it
    Turn this into code
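What the selector `.sponsor-content h4` means can be tried offline before touching the real site. The sketch below uses only the standard library `html.parser`. The sample markup is invented for this example and is not the actual PyCon JP page; the class `SponsorNames` is likewise hypothetical. A selector-based library does the same job in one line.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SponsorNames(HTMLParser):
    """Collect the text of every <h4> nested inside an element with class sponsor-content."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting depth inside a sponsor-content element (0 = outside)
        self.in_h4 = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
            if tag == "h4":
                self.in_h4 = True
                self.names.append("")
        elif "sponsor-content" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        if self.in_h4:
            self.names[-1] += data


# Invented markup that mimics the structure described above.
SAMPLE_HTML = """
<h4>Not a sponsor</h4>
<div class="sponsor-content gold"><img src="a.png"><h4>Example Corp</h4><p>text</p></div>
<div class="sponsor-content"><h4>Sample <b>Inc.</b></h4></div>
"""

parser = SponsorNames()
parser.feed(SAMPLE_HTML)
parser.close()
names = parser.names
print(names)
```

The h4 outside any sponsor-content block is skipped, which is exactly the descendant relationship the space in the selector expresses.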

  36. Scraping PyCon JP 2017
        from requests_html import HTMLSession

        url = 'https://pycon.jp/2017/ja/sponsors/'

        session = HTMLSession()
        resp = session.get(url)

        # h4 elements nested anywhere inside class="sponsor-content"
        sel = '.sponsor-content h4'
        elems = resp.html.find(sel)
        print([i.text for i in elems])

  38. Result
        $ python scraping.py
        ['株式会社SQUEEZE',
         '株式会社MonotaRO',
         'LINE株式会社',
         'Retty株式会社',
         'iRidge, Inc.',
         '株式会社いい生活',
         ...
         'Togetter',
         'CodeZine',
         'エンジニアtype']

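  39. Do try-except properly

  40. Do try-except properly
        import requests

        # session is the HTMLSession from the earlier slides; some_url is the page to fetch
        try:
            resp = session.get(some_url)
            # requests raises HTTPError only when raise_for_status() is called
            resp.raise_for_status()
        except requests.exceptions.ConnectionError:
            print('NetworkError')
        except requests.exceptions.TooManyRedirects:
            print('TooManyRedirects')
        except requests.exceptions.HTTPError:
            print('BadResponse')
    Handling roughly these cases should be enough:
    • the network connection is flaky
    • there are too many redirects
    • the response is bad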

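  41. Use the @retry decorator
    $ pip install retry

  42. Use the @retry decorator
        from retry import retry

        # tries: 5, initial delay: 2 sec, backoff multiplier: 2
        @retry(tries=5, delay=2, backoff=2)
        def fetch():  # illustrative function name, not from the slide
            ...
    • A module that re-runs the decorated function a set number of times when an exception is raised inside it
    • The number of retries is easy to set
    • The interval between retries can grow exponentially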
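How tries, delay and backoff interact is easiest to see in a hand-rolled version. The sketch below is not the `retry` package's implementation. The name `retry_sketch` and the `sleep` parameter are assumptions made for this example, the latter so the waits can be recorded instead of actually slept.

```python
import functools
import time


def retry_sketch(tries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Re-run the decorated function up to `tries` times, multiplying the wait by `backoff`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: let the caller see the error
                    sleep(wait)
                    wait *= backoff

        return wrapper

    return decorator


waits = []     # records each wait instead of sleeping
attempts = []  # one entry per call of flaky()


@retry_sketch(tries=5, delay=2, backoff=2, sleep=waits.append)
def flaky():
    attempts.append(1)
    if len(attempts) < 4:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, len(attempts), waits)
```

With delay=2 and backoff=2 the waits double after each failure, so a server that is briefly overloaded gets progressively more room to recover.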

  43. The final code
        import sys

        from requests.exceptions import ConnectionError, TooManyRedirects, HTTPError
        from requests_html import HTMLSession
        from retry import retry


        @retry(tries=3, delay=2, backoff=2)
        def get_resp():
            try:
                session = HTMLSession()
                resp = session.get('https://pycon.jp/2017/ja/sponsors/')
                resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
                return resp
            except ConnectionError:
                print('NetworkError')
                raise
            except TooManyRedirects:
                print('TooManyRedirects')
                raise
            except HTTPError:
                print('BadResponse')
                raise


        try:
            resp = get_resp()
        except Exception:
            # Every retry failed, so there is nothing to parse
            print('Response not found')
            sys.exit(1)

        sel = '.sponsor-content h4'
        elems = resp.html.find(sel)
        print([i.text for i in elems])

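  44. Basics: summary
    • Decide what you are going to scrape
    • Learn the basics of the browser developer tools
    • Do try-except thoroughly
    • Avoid sending unnecessary requests
    • Avoid the "ran it overnight, woke up to no data" outcome
    • The @retry decorator makes retrying easy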

  45. Cases that tripped me up
    1. Paging
    2. Member pages that require login
    3. An element visible in the browser cannot be found by the program

  46. Case 1: paging (scenario)
    We want to collect the reviews listed on Metacritic.com

  47. Case 1: paging (scenario)
    • Paging
    • Pager
    • Also called pagination
    We want the elements whose displayed content is switched by paging

  48. What we found
    • Taking the text inside the product_row class should do it
    • Switching pages changes the URL:
      https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=1
    • Incrementing the page parameter in the query string should do it!
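Incrementing the page parameter can be done with the standard library rather than string concatenation, which keeps the encoding of the query string correct. In the sketch below the generator name `page_urls` and the page range are assumptions for illustration; the base URL is the one shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.metacritic.com/browse/games/score/metascore/all/all/filtered"


def page_urls(base_url, first_page, last_page, **params):
    """Yield one URL per page, with `page` appended to the other query parameters."""
    for page in range(first_page, last_page + 1):
        query = urlencode({**params, "page": page})
        yield f"{base_url}?{query}"


urls = list(page_urls(BASE_URL, 1, 3, sort="desc"))
for url in urls:
    print(url)
```

Each generated URL can then be fetched in turn, with a pause between requests to the same domain.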

  49. Code
        from requests_html import HTMLSession
        import time

        # The path was elided on the slide
        base_url = 'https://www.metacritic.com/~~(omitted)'
        qs = 'sort=desc&page='

        session = HTMLSession()  # one session reused for every page

        for page_num in range(200):
            resp = session.get(f'{base_url}?{qs}{page_num}')

            elems = resp.html.find('.product_row')
            print([i.text for i in elems])

            time.sleep(2)  # pause between requests to the same domain

  50. Code
    When you access the same domain repeatedly, call sleep() between requests

  51. Result
        > $ python scraping.py
        ['1.\n99\nThe Legend of Zelda: Ocarina of Time (N64)\nUser: 9.1\nNov 23, 1998',
         "2.\n98\nTony Hawk's Pro Skater 2 (PS)\nUser: 7.4\nSep 20, 2000",
         '3.\n98\nGrand Theft Auto IV (PS3)\nUser: 7.5\nApr 29, 2008',
         '4.\n98\nSoulCalibur (DC)\nUser: 8.6\nSep 8, 1999',
         '5.\n98\nGrand Theft Auto IV (X360)\nUser: 7.9\nApr 29, 2008',
         '6.\n97\nSuper Mario Galaxy (WII)\nUser: 9.0\nNov 12, 2007',
         '7.\n97\nSuper Mario Galaxy 2 (WII)\nUser: 9.1\nMay 23, 2010',
         '8.\n97\nGrand Theft Auto V (XONE)\nUser: 7.8\nNov 18, 2014',
         ...
         '93.\n94\nJet Grind Radio (DC)\nUser: 8.0\nOct 30, 2000',
         '94.\n94\nMetal Gear Solid (PS)\nUser: 9.2\nOct 21, 1998',
         '95.\n94\nGrim Fandango (PC)\nUser: 9.1\nOct 14, 1998',
         "96.\n94\nTom Clancy's Splinter Cell Chaos Theory (XBOX)\nUser: 8.9\nMar 28, 2005",
         '97.\n94\nBurnout 3: Takedown (XBOX)\nUser: 7.4\nSep 7, 2004',
         '98.\n94\nDiablo (PC)\nUser: 8.7\nDec 31, 1996',
         '99.\n94\nMetal Gear Solid 3: Subsistence (PS2)\nUser: 9.0\nMar 14, 2006',
         '100.\n94\nCall of Duty: Modern Warfare 2 (X360)\nUser: 6.4\nNov 10, 2009']

  52. Case 2: pages that require login (scenario)
    • We want to fetch the information on our own member page at Money Forward

  53. Without logging in, the member page cannot be viewed

  54. Log in through browser automation

  55. Case 2: pages that require login (scenario)
    Login page:
    https://moneyforward.com/users/sign_in

  57. Code
        import asyncio

        from pyppeteer import launch


        async def main():
            browser = await launch()
            page = await browser.newPage()
            await page.goto('https://moneyforward.com/users/sign_in')

            # Placeholder credentials; the address on the slide was redacted by the hosting page
            await page.type('#sign_in_session_service_email', 'you@example.com')
            await page.type('#sign_in_session_service_password', 'your_password')
            btn_elem = await page.querySelector('#login-btn-sumit')
            await btn_elem.click()

            # Give the post-login page 5 seconds to load, then capture it
            await page.waitFor(5000)
            await page.screenshot({'path': 'logined.png', 'fullPage': True})
            await browser.close()


        # Python 3.7+: create an event loop, run main(), then close the loop
        asyncio.run(main())

  59. Case 3: elements rendered by JavaScript (scenario)
    We want to get the list of sessions at PyCon JP 2018

  60. Case 3: elements rendered by JavaScript
    • The URL is https://pycon.jp/2018/event/sessions
    • Finding the h3 tags inside the session-summary class and taking their text should do it
    No paging and no login, so this should be easy!

  61. Code
        from requests_html import HTMLSession

        session = HTMLSession()

        resp = session.get('https://pycon.jp/2018/event/sessions')

        sel = '.session-summary h3'
        elems = resp.html.find(sel)

        print([i.text for i in elems])

  62. Result
        > $ python scraping.py
        []

  63. Case 3: elements rendered by JavaScript
    Dig a little deeper
    • Search the text of the response
        $ python
        >>> from requests_html import HTMLSession
        >>>
        >>> session = HTMLSession()
        >>> resp = session.get('https://pycon.jp/2018/event/sessions')
        >>> # Look for part of this talk's own title in the page body
        >>> print('スクレイピング実践入門' in resp.html.find('body', first=True).text)
        False
        >>>
    The element we want does not seem to be in the response at all...

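  64. Case 3: elements rendered by JavaScript
    Cause: the elements are rendered by JavaScript
    • Since Ajax arrived, the server no longer necessarily returns all of the HTML
    • With the rise of SPAs (single-page applications), quite a few pages render every element with JavaScript
    • Viewing the source of something like the Twitter home page makes this obvious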
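The effect can be reproduced without a network connection. In the sketch below the markup is invented for this example and is not the PyCon JP page; `H3Collector` is a hypothetical helper. The response body contains the title only as a string inside a script, so a parser that does not execute JavaScript finds no h3 element at all.

```python
from html.parser import HTMLParser

# Invented page: the <h3> exists only after the script runs in a browser.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = "<h3>Session title</h3>";
  </script>
</body></html>
"""


class H3Collector(HTMLParser):
    """Collect the text of every <h3> element present in the markup itself."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.titles.append(data)


parser = H3Collector()
parser.feed(STATIC_HTML)
parser.close()

titles = parser.titles
in_source = "Session title" in STATIC_HTML
print(titles, in_source)
```

The text is present in the raw source but not as an element, which is why an empty list from a selector does not by itself mean the selector is wrong.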

  65. The JavaScript has to be executed by a browser

  66. Code
        from requests_html import HTMLSession

        session = HTMLSession()
        resp = session.get('https://pycon.jp/2018/event/sessions')

        # Execute the page's JavaScript and wait 5 seconds for it to finish
        resp.html.render(sleep=5)

        sel = '.session-summary h3'
        elems = resp.html.find(sel)

        print([i.text for i in elems])
    Waiting for the JS to run
    • One option is to render with a browser-automation library such as Selenium or pyppeteer
    • In fact, requests-html can do it all by itself

  68. Result
        > $ python scraping.py
        ['「リモートペアプロでマントルを突き抜けろ!」AWS Cloud9でリモートペアプロ&楽々サーバーレス開発',
         '1次元畳み込みフィルターを利用した音楽データのオートエンコーダ',
         'Adding JWT Authentication to Python and Django REST Framework Using Auth0',
         'AltJSとしてのPython - フロントエンドをPythonで書こう',
         'Applying serverless architecture pattern to distributed data processing',
         'Build text classification models ( CBOW and Skip-gram) with FastText in python',
         'Building Maintainable Python Web App using Flask',
         'C拡張と共に乗り切るPython 2→3移行術',
         'Django REST Framework におけるAPI実装プラクティス',
         ...
         'テキストマイニングによるTwitter個人アカウントの性格推定',
         'Why your Django account registration should use a Turing test…',
         '医学研究者が深層学習環境の立ち上げの際に苦労した話',
         '暗号通貨技術・ブロックチェーン技術を活用するCrypto-Fintech Lab.',
         '安全なサンドボックス構築の裏側 ~投資アルゴリズム構築環境QuantX Factoryの事例~',
         'diff 最小化原理で導く Zen of Python',
         'Python × Investment ~投資信託をPythonで分析して、その結果を公開するサービス作った話~',
         'Pythonの軽量フレームワークによるシンプルで高速なWebAPIの作り方',
         'システム開発素人が深層学習を用いた画像認識で麻雀点数計算するLINEbot作った話',
         '【poke2vec】ポケモンの役割ベクトルの学習とその分析・可視化',
         'asyncio + aiohttp で作るウェブサービス',
         'PyCon JP 傾向と対策']

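  69. Practice: summary
    • Know what a URL points to
    • The structure of the scheme, the query string, and so on
    • Know what the browser is doing
    • Where does it sit in HTTP's client-server model?
    • Interpreting HTML / applying CSS / executing JavaScript
    • Sessions / cookies / local storage
    • Put yourself in the browser's shoes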
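Taking a URL apart is a quick way to see what it points to. The sketch below uses the standard library `urllib.parse` on the Metacritic URL from the paging case; the variable names are chosen for this example.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=1"

parts = urlsplit(url)          # scheme, host, path, query, fragment
query = parse_qs(parts.query)  # query string as a dict of lists

print(parts.scheme)
print(parts.netloc)
print(parts.path)
print(query)
```

Seeing `page` as a separate key makes it obvious which part of the URL to vary when walking through paged results.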

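  70. Summary of today's talk
    • There are many OSS libraries
    • Use the one that suits you
    • Contribute to the ones you like
    • As a rule, whatever you can see, you can collect
    • The barrier to entry is low
    • On the other hand, you get stuck if you do not know how HTTP(S) works
    • Learning the mechanics through concrete cases works well (I still have plenty to learn myself...)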

  71. Thank you for listening!