
Practical Introduction to Web Scraping with Python / pyconjp-2018

sin-tanaka
September 18, 2018

Presentation slides for PyCon JP 2018.

Transcript

  1. $ pip install beautifulsoup4

     BeautifulSoup4 (bs4)
     • MIT license
     • Latest version at the time of writing: 4.6.3
     • OSS, developed on the Launchpad development platform
     • Parses and searches HTML
     • Plenty of community know-how out there
  2. BeautifulSoup4 (bs4)

     Code example:

     1 from bs4 import BeautifulSoup
     2 import requests
     3
     4 resp = requests.get('https://www.example.com/')
     5 bs_obj = BeautifulSoup(resp.content, 'lxml')
     6
     7 [i.text for i in bs_obj.find_all('h1')]

     Points:
     L5. Pass the byte string via Response().content
       → bs4 detects the character encoding for you
       → install the cchardet module and that detection gets much faster
     L5. Specify a parser explicitly
       → the official recommendation is lxml
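     A minimal sketch of that encoding behavior (original_encoding is a bs4
     attribute; cchardet is an optional speed-up, not required):

         import requests
         from bs4 import BeautifulSoup

         resp = requests.get('https://www.example.com/')

         # Handing bs4 the raw bytes lets it detect the encoding itself;
         # with cchardet installed, detection uses it and runs faster.
         soup = BeautifulSoup(resp.content, 'lxml')
         print(soup.original_encoding)  # the encoding bs4 settled on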
  3. Selenium
     • Apache 2.0 license
     • OSS
     • Browser automation tool built for automating UI tests
     • Drives a real browser, so it can execute JS
     • Can be driven from many languages

     $ pip install selenium
     $ # e.g. brew cask install chromedriver

     ⭐ Stars: over 11,800
  4. Selenium

     Code example:

     1 from selenium import webdriver
     2
     3 options = webdriver.ChromeOptions()
     4 options.add_argument('--headless')
     5
     6 driver = webdriver.Chrome(options=options)
     7 driver.get('https://www.example.com/')
     8
     9 print(driver.title)
     10
     11 driver.quit()

     Points:
     L6, L11. Always call WebDriver.quit()!
       → both quit() and close() exist, and they do different things
       → this keeps zombie browser processes from piling up
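     A minimal sketch of one way to make sure quit() runs even when the
     scraping code in between raises:

         from selenium import webdriver

         options = webdriver.ChromeOptions()
         options.add_argument('--headless')

         driver = webdriver.Chrome(options=options)
         try:
             driver.get('https://www.example.com/')
             print(driver.title)
         finally:
             # quit() tears down the whole browser and the driver process;
             # close() only closes the current window, which is what can
             # leave zombies behind.
             driver.quit()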
  5. requests-html
     • https://github.com/kennethreitz/requests-html
     • OSS by the author of requests and pipenv
     • "HTML Parsing for Humans"
     • Currently at 0.9.0, so there is no major release yet
     • Parses and searches HTML
     • Can execute JS!

     $ pip install requests_html

     ⭐ Stars: over 8,300
  6. requests-html

     Code example:

     1 from requests_html import HTMLSession
     2
     3 session = HTMLSession()
     4 resp = session.get('https://www.example.com/')
     5 print([i.text for i in resp.html.find('h1')])

     Points:
     L5. Response().html
       → the response is actually an extension of requests.Response
       → it simply grows an extra html property
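     A small sketch of that point: the familiar requests.Response attributes
     keep working, with the html property layered on top.

         from requests_html import HTMLSession

         session = HTMLSession()
         resp = session.get('https://www.example.com/')

         # Plain requests.Response behavior is still all there...
         print(resp.status_code)
         print(resp.headers.get('Content-Type'))

         # ...plus the html property added by requests-html
         print(resp.html.find('h1', first=True).text)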
  7. pyppeteer

     Code example:

     import asyncio
     from pyppeteer import launch


     async def main():
         browser = await launch()
         page = await browser.newPage()
         await page.goto('https://example.com')
         print(await page.title())
         await browser.close()


     loop = asyncio.get_event_loop()
     loop.run_until_complete(main())

     Points:
     It is a port of the Node.js puppeteer
       → the API is built to return coroutines
       → method and argument names stay camelCase, as in the original
  8. pyquery

     Code example:

     1 from pyquery import PyQuery
     2
     3 pq = PyQuery(url='https://example.com')
     4 print(pq.find('h1').text())

     Points:
     L3. Just passing a url makes pyquery handle everything up to the
     HTTP request for you
       → of course you can also pass bytes or str, bs4-style
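     A minimal sketch of the bs4-style usage, fetching the page yourself and
     handing pyquery the raw content:

         import requests
         from pyquery import PyQuery

         resp = requests.get('https://example.com')

         # pyquery also accepts raw markup (bytes or str) instead of a url
         pq = PyQuery(resp.content)
         print(pq.find('h1').text())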
  9. Summary so far
     • The classics are bs4 and Selenium
     • Lots of accumulated know-how for both
     • That said, there is plenty of other OSS these days
     • New libraries keep appearing, too
     • Contribute to the OSS you like!
     • Know which layer each library can cover
     • Browser automation / parsing / searching?
  10. Scraping PyCon JP 2017

     from requests_html import HTMLSession

     url = 'https://pycon.jp/2017/ja/sponsors/'

     session = HTMLSession()
     resp = session.get(url)

     sel = '.sponsor-content h4'
     elems = resp.html.find(sel)
     print([i.text for i in elems])

  13. Do your try-excepts properly

     try:
         resp = session.get(some_url)
     except requests.exceptions.ConnectionError:
         print('NetworkError')
     except requests.exceptions.TooManyRedirects:
         print('TooManyRedirects')
     except requests.exceptions.HTTPError:
         print('BadResponse')

     Covering roughly these cases should be enough:
     ・the network connection is flaky
     ・there are too many redirects
     ・the response is bad
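     One caveat worth spelling out: requests raises HTTPError for 4xx/5xx
     responses only if you ask it to, via Response.raise_for_status(). A
     minimal sketch of the same handler with that call added:

         import requests

         some_url = 'https://www.example.com/'  # stand-in for the target URL

         try:
             resp = requests.get(some_url)
             resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
         except requests.exceptions.ConnectionError:
             print('NetworkError')
         except requests.exceptions.TooManyRedirects:
             print('TooManyRedirects')
         except requests.exceptions.HTTPError:
             print('BadResponse')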
  14. ""

  15. Use the @retry decorator

     from retry import retry

     # tries: 5, delay: 2 sec, backoff multiplier: 2
     @retry(tries=5, delay=2, backoff=2)
     def fetch():  # hypothetical stub: the decorator goes on your fetch function
         ...

     • A module that re-runs the decorated function a given number of times
       whenever an Exception is raised inside it
     • The number of retries is easy to configure
     • The retry interval can grow exponentially: with delay=2 and backoff=2
       the waits become 2 s, 4 s, 8 s, 16 s
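     A toy sketch of that behavior (the flaky function is made up for the
     demonstration); each failed attempt waits delay seconds, and the delay
     is multiplied by backoff after every attempt:

         from retry import retry

         attempts = 0

         @retry(tries=5, delay=2, backoff=2)
         def flaky():
             global attempts
             attempts += 1
             print(f'attempt {attempts}')
             raise ConnectionError('still failing')

         try:
             flaky()
         except ConnectionError:
             # after 5 attempts (waiting 2 s, 4 s, 8 s, 16 s in between)
             # the exception finally propagates to the caller
             print(f'gave up after {attempts} attempts')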
  16. Final code

     from requests.exceptions import ConnectionError, TooManyRedirects, HTTPError
     from requests_html import HTMLSession
     from retry import retry


     @retry(tries=3, delay=2, backoff=2)
     def get_resp():
         try:
             session = HTMLSession()
             return session.get('https://pycon.jp/2017/ja/sponsors/')
         except ConnectionError:
             print('NetworkError')
             raise
         except TooManyRedirects:
             print('TooManyRedirects')
             raise
         except HTTPError:
             print('BadResponse')
             raise


     try:
         resp = get_resp()
     except Exception:
         print('Response not found')
         raise  # without re-raising, the lines below would hit an undefined resp

     sel = '.sponsor-content h4'
     elems = resp.html.find(sel)
     print([i.text for i in elems])
  17. """

  18. Code

     from requests_html import HTMLSession
     import time

     base_url = 'https://www.metacritic.com/~~略'  # path elided on the slide
     qs = 'sort=desc&page='

     for page_num in range(200):
         session = HTMLSession()
         resp = session.get(f'{base_url}?{qs}{page_num}')

         elems = resp.html.find('.product_row')
         print([i.text for i in elems])

         time.sleep(2)

     sleep() between requests when you hit the same domain repeatedly
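     A variant sketch (my adaptation, not from the slides): building one
     HTMLSession outside the loop lets requests reuse the underlying
     connection instead of reconnecting for every page:

         from requests_html import HTMLSession
         import time

         base_url = 'https://www.metacritic.com/~~略'  # path elided on the slide
         qs = 'sort=desc&page='

         session = HTMLSession()  # one session, pooled connections
         for page_num in range(200):
             resp = session.get(f'{base_url}?{qs}{page_num}')
             print([i.text for i in resp.html.find('.product_row')])
             time.sleep(2)  # stay polite to the host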
  20. Result

     > $ python scraping.py
     ['1.\n99\nThe Legend of Zelda: Ocarina of Time (N64)\nUser: 9.1\nNov 23, 1998',
      "2.\n98\nTony Hawk's Pro Skater 2 (PS)\nUser: 7.4\nSep 20, 2000",
      '3.\n98\nGrand Theft Auto IV (PS3)\nUser: 7.5\nApr 29, 2008',
      '4.\n98\nSoulCalibur (DC)\nUser: 8.6\nSep 8, 1999',
      '5.\n98\nGrand Theft Auto IV (X360)\nUser: 7.9\nApr 29, 2008',
      '6.\n97\nSuper Mario Galaxy (WII)\nUser: 9.0\nNov 12, 2007',
      '7.\n97\nSuper Mario Galaxy 2 (WII)\nUser: 9.1\nMay 23, 2010',
      '8.\n97\nGrand Theft Auto V (XONE)\nUser: 7.8\nNov 18, 2014',
      …
      '93.\n94\nJet Grind Radio (DC)\nUser: 8.0\nOct 30, 2000',
      '94.\n94\nMetal Gear Solid (PS)\nUser: 9.2\nOct 21, 1998',
      '95.\n94\nGrim Fandango (PC)\nUser: 9.1\nOct 14, 1998',
      "96.\n94\nTom Clancy's Splinter Cell Chaos Theory (XBOX)\nUser: 8.9\nMar 28, 2005",
      '97.\n94\nBurnout 3: Takedown (XBOX)\nUser: 7.4\nSep 7, 2004',
      '98.\n94\nDiablo (PC)\nUser: 8.7\nDec 31, 1996',
      '99.\n94\nMetal Gear Solid 3: Subsistence (PS2)\nUser: 9.0\nMar 14, 2006',
      '100.\n94\nCall of Duty: Modern Warfare 2 (X360)\nUser: 6.4\nNov 10, 2009']
  21. """

  22. Code

     import asyncio

     from pyppeteer import launch


     async def main():
         browser = await launch()
         page = await browser.newPage()
         await page.goto('https://moneyforward.com/users/sign_in')

         await page.type('#sign_in_session_service_email', '[email protected]')
         await page.type('#sign_in_session_service_password', 'your_password')
         btn_elem = await page.querySelector('#login-btn-sumit')
         await btn_elem.click()

         await page.waitFor(5000)
         await page.screenshot({'path': 'logined.png', 'fullPage': True})
         await browser.close()


     loop = asyncio.get_event_loop()
     loop.run_until_complete(main())

  25. Code

     from requests_html import HTMLSession

     session = HTMLSession()

     resp = session.get('https://pycon.jp/2018/event/sessions')

     sel = '.session-summary h3'
     elems = resp.html.find(sel)

     print([i.text for i in elems])

  27. Case 3: the elements are drawn by JavaScript

     Digging further: search the text of the response to check.

     $ python
     >>> from requests_html import HTMLSession
     >>>
     >>> session = HTMLSession()
     >>> resp = session.get('https://pycon.jp/2018/event/sessions')
     >>> print('スクレイピング実践入門' in resp.html.find('body', first=True).text)
     False
     >>>

     The element we want does not seem to be in the response at all…
  28. Code

     from requests_html import HTMLSession

     session = HTMLSession()
     resp = session.get('https://pycon.jp/2018/event/sessions')

     resp.html.render(sleep=5)

     sel = '.session-summary h3'
     elems = resp.html.find(sel)

     print([i.text for i in elems])

     Wait for the JS to run:
     ・draw the page with a browser automation library such as Selenium or pyppeteer
     ・or, as here: requests-html can actually do the whole job by itself
  30. Result

     > $ python scraping.py
     > ['「リモートペアプロでマントルを突き抜けろ!」AWS Cloud9でリモートペアプロ&楽々サーバーレス開発',
     '1次元畳み込みフィルターを利用した音楽データのオートエンコーダ',
     'Adding JWT Authentication to Python and Django REST Framework Using Auth0',
     'AltJSとしてのPython - フロントエンドをPythonで書こう',
     'Applying serverless architecture pattern to distributed data processing',
     'Build text classification models ( CBOW and Skip-gram) with FastText in python',
     'Building Maintainable Python Web App using Flask',
     'C拡張と共に乗り切るPython 2→3移行術',
     'Django REST Framework におけるAPI実装プラクティス',
     …
     'テキストマイニングによるTwitter個人アカウントの性格推定',
     'Why your Django account registration should use a Turing test…',
     '医学研究者が深層学習環境の立ち上げの際に苦労した話',
     '暗号通貨技術・ブロックチェーン技術を活用するCrypto-Fintech Lab.',
     '安全なサンドボックス構築の裏側 ~投資アルゴリズム構築環境QuantX Factoryの事例~',
     'diff 最小化原理で導く Zen of Python',
     'Python × Investment ~投資信託をPythonで分析して、その結果を公開するサービス作った話~',
     'Pythonの軽量フレームワークによるシンプルで高速なWebAPIの作り方',
     'システム開発素人が深層学習を用いた画像認識で麻雀点数計算するLINEbot作った話',
     '【poke2vec】ポケモンの役割ベクトルの学習とその分析・可視化',
     'asyncio + aiohttp で作るウェブサービス',
     'PyCon JP 傾向と対策']
  31. """

  32. Summary of today's talk
     • There are all kinds of OSS libraries
     • Use the one that fits you
     • And contribute to the ones you like
     • Basically, anything you can see in a browser can be scraped
     • The barrier to entry is low
     • On the other hand, you will get stuck without understanding how HTTP(S) works
     • Studying the underlying mechanisms through real cases pays off (I still have plenty to learn myself…)