A Practical Introduction to Web Scraping with Python / pyconjp-2018

sin-tanaka
September 18, 2018

Slides from a talk given at PyCon JP 2018.

Transcript

  1. A Practical Introduction to Web Scraping with Python ## level: All. ## 田中 慎太郎 (@_sin_tanaka)

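  2. Target audience
    • You can write Python
    • You have a rough grasp of the web (HTTP requests and responses)
    • You have never done any scraping
    • Or you have, but you assume scraping just means BeautifulSoup4 and Selenium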
  3. Agenda
    1. Library overview
    2. Basics
    3. Practical examples

  4. Terms used in this session

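  5. What is scraping?
    • Collecting resources on the web
    • Accessing the target web page
    • Parsing HTML
    In short: fetching resources from the web and converting them into structured data that can be analyzed.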
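
The definition above (fetch a resource, then turn it into structured data) can be illustrated with a minimal sketch. It uses only the standard-library `html.parser`, and the HTML string is a made-up stand-in for a real HTTP response body, so nothing here is taken from the talk itself.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every <a> element as a {'href': ..., 'text': ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current = {'href': dict(attrs).get('href'), 'text': ''}

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'a' and self._current is not None:
            self._current['text'] = self._current['text'].strip()
            self.records.append(self._current)
            self._current = None


# Hypothetical page body; in practice this would come from an HTTP response.
SAMPLE_HTML = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
records = collector.records
print(records)
```

The libraries introduced later in the talk do the same job with far less code; the point is only that the output is a list of records rather than raw markup.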

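  6. What is a crawler?
    • An automated version of the flow you carry out when scraping
    • For example, search engines such as Google and Yahoo run crawlers that go as far as building the index
    • Not covered in this session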

  7. 田中 慎太郎 (@_sin_tanaka)
    • 日本システム技研 Co., Ltd. / GeekLab.NAGANO staff
    • GitHub / Medium: @sin_tanaka
    • Likes music and late-night radio
    • Has sent pull requests to serverless-wsgi, among other things
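  8. What this talk covers
    • The recent library landscape (requests-html, pyppeteer)
    • The basics of scraping
    • Where scraping tends to get stuck (practical examples)
    What this talk does not cover
    • How to build a crawler
    • How to make use of the data you collect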
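  9. Why scrape?
    • If the data is visible in a browser, it can be fetched
    • You can get data even when there is no API
    • You can narrow the fetch down to just the data you need
    Compared with using an API:
    • The barrier to entry is low!
    • You have a lot of freedom!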
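  10. Caveats
    "Unfortunately, some researchers crawl abusively in order to use Cookpad's data." (quoted from the Cookpad developer blog)
    • Scraping increases server load
    • Do not redistribute or otherwise reuse the data you collect
    • Read the terms of service carefully
    Scrape with moderation and a sense of ethics.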

  11. Recent scraping libraries
    From here on there is a lot of code. Reading it closely will probably make you sleepy, so just skim it. This is the last session of day 2, after all.

  12. Types of library
    • Browser automation: Selenium, puppeteer
    • HTML parsing: html.parser, lxml, html5lib
    • Tree traversal: BeautifulSoup4, pyquery
  13. The standards: BeautifulSoup4 / Selenium

  14. BeautifulSoup4 (bs4)
        $ pip install beautifulsoup4
    • MIT license
    • Latest version at the time of the talk: 4.6.3
    • OSS, developed on the Launchpad development platform
    • Parses and traverses HTML
    • Plenty of know-how is available
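  15. BeautifulSoup4 (bs4): code example
        from bs4 import BeautifulSoup
        import requests

        resp = requests.get('https://www.example.com/')
        # Pass the raw bytes and name the parser explicitly
        bs_obj = BeautifulSoup(resp.content, 'lxml')

        print([i.text for i in bs_obj.find_all('h1')])
    Points
    • Pass the byte string from resp.content to BeautifulSoup()
      → bs4 works out the character encoding for you
      → detection is faster when the cchardet module is installed
    • Specify the parser as the second argument
      → the official recommendation is lxml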
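  16. Selenium
    • Apache 2.0 license
    • OSS
    • A browser automation tool built for automating UI tests
    • Because it drives a real browser, it can execute JS
    • Can be driven from many languages
        $ pip install selenium
        $ # e.g. brew cask install chromedriver
    ⭐ Stars: over 11,800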
  17. Selenium: code example
        from selenium import webdriver

        options = webdriver.ChromeOptions()
        options.add_argument('--headless')

        driver = webdriver.Chrome(options=options)
        driver.get('https://www.example.com/')

        print(driver.title)

        # Always shut the browser down when finished
        driver.quit()
    Points
    • After creating the driver, always call WebDriver.quit()!
      → quit() and close() both exist and do different things
      → this keeps zombie processes from piling up
  18. Selenium
        with webdriver.Chrome(options=options) as driver:
            driver.get('https://www.example.com/')
    The WebDriver class calls quit() in __exit__, so a with statement also works.
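
Why the `with` form is safe can be sketched with a stand-in class. `FakeDriver` below is hypothetical and is not Selenium's implementation; it only mimics the pattern the slide describes, where `__exit__` calls `quit()` so that cleanup runs even when the body raises.

```python
class FakeDriver:
    """Stand-in for a browser driver; NOT Selenium's real class."""

    def __init__(self):
        self.quit_called = False

    def get(self, url):
        if not url.startswith('http'):
            raise ValueError(f'bad url: {url}')

    def quit(self):
        self.quit_called = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.quit()   # always release the browser process
        return False  # do not swallow the exception


driver = FakeDriver()
try:
    with driver:
        driver.get('not-a-url')  # raises inside the with block
except ValueError:
    pass

print(driver.quit_called)
```

Without the `with` block, the same guarantee needs an explicit `try` / `finally` around `driver.quit()`.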

  19. These days there are other options too!

  20. requests-html

  21. requests-html
    • https://github.com/kennethreitz/requests-html
    • OSS by the author of requests and pipenv
    • "HTML Parsing for Humans"
    • At 0.9.0 at the time of the talk, so it has not reached a major version yet
    • Parses and traverses HTML
    • Can run JS!
        $ pip install requests_html
    ⭐ Stars: over 8,300
  22. requests-html: code example
        from requests_html import HTMLSession

        session = HTMLSession()
        resp = session.get('https://www.example.com/')
        print([i.text for i in resp.html.find('h1')])
    Points
    • resp.html
      → the response is actually an extension of requests.Response
      → it has grown an extra property called html
  23. pyppeteer

  24. pyppeteer
    • https://github.com/miyakogi/pyppeteer
    • A Python implementation of puppeteer
    • Browser automation
    • The name is very easy to mistype
        $ pip install pyppeteer
    ⭐ Stars: over 900
  25. pyppeteer: code example
        import asyncio
        from pyppeteer import launch


        async def main():
            browser = await launch()
            page = await browser.newPage()
            await page.goto('https://example.com')
            print(await page.title())
            await browser.close()


        # asyncio.run() replaces the older get_event_loop() / run_until_complete() pair
        asyncio.run(main())
    Points (it is a port of a Node.js library)
    • The API is built to return coroutines
    • Method names and argument names are camelCase
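
What "the API returns coroutines" means in practice can be shown without a browser. `FakePage` below is a hypothetical stand-in, not pyppeteer's class: calling one of its methods only creates a coroutine object, and the value is obtained only once that object is awaited.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page; not pyppeteer's class."""

    async def title(self):
        await asyncio.sleep(0)  # pretend to talk to the browser
        return 'Example Domain'


async def main():
    page = FakePage()
    pending = page.title()                  # calling only creates a coroutine object
    is_coro = asyncio.iscoroutine(pending)
    title = await pending                   # awaiting runs it and yields the result
    return is_coro, title


is_coro, title = asyncio.run(main())
print(is_coro, title)
```

Forgetting the `await` is the usual mistake with this style of API: the call appears to succeed, but the result is a coroutine object rather than the page title.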
  26. pyquery

  27. pyquery
    • https://github.com/gawel/pyquery
    • jQuery-like API
    • Even more intuitive than bs4
    • A selector copied from the browser can be passed in as-is
    • Recommended if you already know jQuery
        $ pip install pyquery
    ⭐ Stars: over 1,500
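  28. pyquery: code example
        from pyquery import PyQuery

        # Passing url= makes PyQuery issue the HTTP request itself
        pq = PyQuery(url='https://example.com')
        print(pq.find('h1').text())
    Points
    • Just passing the URL is enough: PyQuery takes care of the HTTP request as well
      → you can of course pass bytes or str instead, as with bs4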
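  29. Summary so far
    • The standards are bs4 and Selenium
      • Lots of accumulated know-how
    • That said, there is plenty of other OSS these days
      • New libraries keep appearing
      • Contribute to the OSS you like!
    • Know which layer each library handles
      • Browser automation, parsing, or traversal?
  30. Basics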

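  31. Rough procedure
    1. Decide on the target page
    2. Send the HTTP request / receive the response
    3. Parse the HTML
    4. Extract the data you need

  32. Rough procedure: the part the library does for you
    1. Decide on the target page
    2. Send the HTTP request / receive the response
    3. Parse the HTML
    4. Extract the data you need

  33. Rough procedure: the part you decide yourself
    1. Decide on the target page
    2. Send the HTTP request / receive the response
    3. Parse the HTML
    4. Extract the data you need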
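  34. Scenario
    Get the list of sponsors from the PyCon JP 2017 site (idea borrowed from Python Boot Camp).
    URL: https://pycon.jp/2017/ja/sponsors/

  35. Use the developer tools

  36. [Screenshot: a sponsor photo on the page and the corresponding code]

  37. [Screenshot: a sponsor photo, the element actually visible on the page, and the corresponding code]

  38. What we learned
    • The URL is https://pycon.jp/2017/ja/sponsors/
    • Finding the h4 tags inside the sponsor-content class and taking their text looks like it will work
    Now turn that into code.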
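
The finding above (take the text of each `h4` inside an element with class `sponsor-content`) can be tried offline. This is a minimal sketch using only the standard-library `html.parser`; the sample markup and the company names in it are invented for illustration and are not the real PyCon JP page.

```python
from html.parser import HTMLParser

VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}  # tags with no end tag


class SponsorNames(HTMLParser):
    """Collect the text of every <h4> nested inside class="sponsor-content"."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._depth = 0      # > 0 while inside a sponsor-content element
        self._in_h4 = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return           # void tags never close, so they must not change depth
        classes = (dict(attrs).get('class') or '').split()
        if self._depth:
            self._depth += 1
        elif 'sponsor-content' in classes:
            self._depth = 1
        if tag == 'h4' and self._depth:
            self._in_h4 = True
            self.names.append('')

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == 'h4':
            self._in_h4 = False
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._in_h4:
            self.names[-1] += data.strip()


# Hypothetical markup that only mimics the structure described above.
SAMPLE_HTML = """
<div class="sponsor-content"><h4>Alpha Inc.</h4><p>text</p></div>
<div class="other"><h4>Not a sponsor</h4></div>
<div class="sponsor-content"><img src="x.png"><h4>Beta LLC</h4></div>
"""

parser = SponsorNames()
parser.feed(SAMPLE_HTML)
names = parser.names
print(names)
```

With a selector-aware library the whole class collapses to a single CSS selector, `.sponsor-content h4`.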

  39. Scraping the PyCon JP 2017 site
        from requests_html import HTMLSession

        url = 'https://pycon.jp/2017/ja/sponsors/'

        session = HTMLSession()
        resp = session.get(url)

        # h4 tags inside elements with class sponsor-content
        sel = '.sponsor-content h4'
        elems = resp.html.find(sel)
        print([i.text for i in elems])
  41. Result
        $ python scraping.py
        ['株式会社SQUEEZE', '株式会社MonotaRO', 'LINE株式会社', 'Retty株式会社', 'iRidge, Inc.',
         '株式会社いい生活', … 'Togetter', 'CodeZine', 'エンジニアtype']
  43. Do your try-except properly

  44. Do your try-except properly
        try:
            resp = session.get(some_url)
            # HTTPError is raised only when raise_for_status() is called
            resp.raise_for_status()
        except requests.exceptions.ConnectionError:
            print('NetworkError')
        except requests.exceptions.TooManyRedirects:
            print('TooManyRedirects')
        except requests.exceptions.HTTPError:
            print('BadResponse')
    Handling roughly these cases should be enough:
    • The network connection is unstable
    • There are too many redirects
    • The response is invalid
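  46. Use the @retry decorator
        $ pip install retry

  47. Use the @retry decorator
        from retry import retry

        # tries: 5, delay: 2 s, backoff multiplier: 2
        @retry(tries=5, delay=2, backoff=2)
        def get_resp():
            ...  # the code to retry goes here
    • A module that re-runs the decorated function a given number of times when an Exception is raised inside it
    • Makes it easy to set the number of retries
    • Lets the interval between retries grow exponentially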
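
How `tries`, `delay` and `backoff` interact can be sketched with a small hand-written decorator. `simple_retry` is a hypothetical illustration, not the `retry` package's implementation; the `sleep` parameter is an addition so the example can record the waits instead of actually pausing.

```python
import functools
import time


def simple_retry(tries, delay, backoff, sleep=time.sleep):
    """Retry the wrapped function, waiting delay, delay*backoff, ... between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise        # out of attempts: re-raise the last error
                    sleep(wait)
                    wait *= backoff  # the interval grows exponentially
        return wrapper
    return decorator


waits = []        # records the requested pauses instead of really sleeping
calls = {'n': 0}


@simple_retry(tries=5, delay=2, backoff=2, sleep=waits.append)
def flaky():
    calls['n'] += 1
    if calls['n'] < 4:
        raise ConnectionError('temporary failure')
    return 'ok'


result = flaky()
print(result, waits)
```

In this run the function fails three times and succeeds on the fourth call, so the recorded waits are 2, 4 and 8 seconds.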
  48. The final code
        from requests.exceptions import ConnectionError, TooManyRedirects, HTTPError
        from requests_html import HTMLSession
        from retry import retry


        @retry(tries=3, delay=2, backoff=2)
        def get_resp():
            try:
                session = HTMLSession()
                resp = session.get('https://pycon.jp/2017/ja/sponsors/')
                resp.raise_for_status()  # needed for HTTPError to be raised
                return resp
            except ConnectionError:
                print('NetworkError')
                raise
            except TooManyRedirects:
                print('TooManyRedirects')
                raise
            except HTTPError:
                print('BadResponse')
                raise


        try:
            resp = get_resp()
        except Exception:
            print('Response not found')
            raise SystemExit(1)  # stop here: resp is undefined past this point

        sel = '.sponsor-content h4'
        elems = resp.html.find(sel)
        print([i.text for i in elems])
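  50. Basics: summary
    • Decide what you are going to scrape
    • Learn the basics of the browser developer tools
    • Do your try-except properly
      • It prevents sending unnecessary requests
      • It prevents "ran it overnight, woke up to no data at all"
    • The @retry decorator makes retrying easy
  51. Practical examples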

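  52. Cases where I got stuck
    1. Paging
    2. Member pages that require login
    3. Elements visible in the browser that the program cannot find

  53. Case 1: paging. Scenario
    I want to collect the reviews listed on Metacritic.com.

  54. Case 1: paging. Scenario
    • Paging, also called a pager or pagination
    I want to collect elements whose displayed content is switched by paging.

  55. What we learned
    • Taking the text inside the product_row class looks like it will work
    • Switching pages changes the URL:
      https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=1
    • Incrementing the page query-string parameter looks like it will work!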
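
Incrementing the `page` parameter can be sketched with the standard library alone. The base URL is the one shown on the slide; the helper name `page_urls` and its inclusive range are illustrative choices, not part of the talk.

```python
from urllib.parse import urlencode

BASE_URL = ('https://www.metacritic.com/browse/games/score/'
            'metascore/all/all/filtered')


def page_urls(first, last):
    """Yield one URL per page number from first to last, inclusive."""
    for page in range(first, last + 1):
        yield f"{BASE_URL}?{urlencode({'sort': 'desc', 'page': page})}"


urls = list(page_urls(1, 3))
for url in urls:
    print(url)
```

Building the query string with `urlencode` rather than string concatenation keeps the parameters correctly escaped if more are added later.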

  56. Code
        from requests_html import HTMLSession
        import time

        # The URL is abbreviated on the slide (略 means "omitted")
        base_url = 'https://www.metacritic.com/~~略'
        qs = 'sort=desc&page='

        session = HTMLSession()
        for page_num in range(200):
            resp = session.get(f'{base_url}?{qs}{page_num}')

            elems = resp.html.find('.product_row')
            print([i.text for i in elems])

            time.sleep(2)  # pause between requests to the same site
  57. Code (same listing as slide 56, with time.sleep(2) highlighted)
    When you access the same domain repeatedly, call sleep() between requests.
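
One way to make the pause between requests explicit is a small throttle object. This is a sketch under assumptions: the `Throttle` class, its parameters and the fake clock are all hypothetical helpers, introduced only so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Ensure at least `interval` seconds pass between calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Fake clock so the demonstration runs instantly.
fake_now = [100.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()      # first call: nothing to wait for
fake_now[0] += 0.5   # pretend the request itself took 0.5 s
throttle.wait()      # sleeps only the remaining 1.5 s
print(slept)
```

Compared with a fixed `time.sleep(2)`, this only waits for whatever part of the interval the request has not already used up.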
  58. Result
        $ python scraping.py
        ['1.\n99\nThe Legend of Zelda: Ocarina of Time (N64)\nUser: 9.1\nNov 23, 1998',
         "2.\n98\nTony Hawk's Pro Skater 2 (PS)\nUser: 7.4\nSep 20, 2000",
         '3.\n98\nGrand Theft Auto IV (PS3)\nUser: 7.5\nApr 29, 2008',
         '4.\n98\nSoulCalibur (DC)\nUser: 8.6\nSep 8, 1999',
         '5.\n98\nGrand Theft Auto IV (X360)\nUser: 7.9\nApr 29, 2008',
         '6.\n97\nSuper Mario Galaxy (WII)\nUser: 9.0\nNov 12, 2007',
         '7.\n97\nSuper Mario Galaxy 2 (WII)\nUser: 9.1\nMay 23, 2010',
         '8.\n97\nGrand Theft Auto V (XONE)\nUser: 7.8\nNov 18, 2014',
         …
         '93.\n94\nJet Grind Radio (DC)\nUser: 8.0\nOct 30, 2000',
         '94.\n94\nMetal Gear Solid (PS)\nUser: 9.2\nOct 21, 1998',
         '95.\n94\nGrim Fandango (PC)\nUser: 9.1\nOct 14, 1998',
         "96.\n94\nTom Clancy's Splinter Cell Chaos Theory (XBOX)\nUser: 8.9\nMar 28, 2005",
         '97.\n94\nBurnout 3: Takedown (XBOX)\nUser: 7.4\nSep 7, 2004',
         '98.\n94\nDiablo (PC)\nUser: 8.7\nDec 31, 1996',
         '99.\n94\nMetal Gear Solid 3: Subsistence (PS2)\nUser: 9.0\nMar 14, 2006',
         '100.\n94\nCall of Duty: Modern Warfare 2 (X360)\nUser: 6.4\nNov 10, 2009']
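  60. Case 2: pages that require login. Scenario
    • I want to get the information on my own member page at Money Forward

  61. Without logging in, the member page cannot be viewed

  62. Log in through browser automation

  63. Case 2: pages that require login. Scenario
    Login page: https://moneyforward.com/users/sign_in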

  65. Code
        import asyncio

        from pyppeteer import launch


        async def main():
            browser = await launch()
            page = await browser.newPage()
            await page.goto('https://moneyforward.com/users/sign_in')

            # Fill in the login form (replace with your own credentials)
            await page.type('#sign_in_session_service_email', 'your@mail.com')
            await page.type('#sign_in_session_service_password', 'your_password')
            btn_elem = await page.querySelector('#login-btn-sumit')
            await btn_elem.click()

            # Wait for the member page to load, then capture it
            await page.waitFor(5000)
            await page.screenshot({'path': 'logined.png', 'fullPage': True})
            await browser.close()


        asyncio.run(main())
  67. Result [screenshot]

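  69. Case 3: elements are rendered by JavaScript. Scenario
    I want to get the list of sessions at PyCon JP 2018.

  70. Case 3: elements are rendered by JavaScript
    • The URL is https://pycon.jp/2018/event/sessions
    • Finding the h3 tags inside the session-summary class and taking their text looks like it will work
    No paging, no login needed: this should be easy!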

  71. Code
        from requests_html import HTMLSession

        session = HTMLSession()

        resp = session.get('https://pycon.jp/2018/event/sessions')

        sel = '.session-summary h3'
        elems = resp.html.find(sel)

        print([i.text for i in elems])
  72. Result
        $ python scraping.py
        []

  74. Case 3: elements are rendered by JavaScript. Digging further
    • Try searching the text of the response
        $ python
        >>> from requests_html import HTMLSession
        >>>
        >>> session = HTMLSession()
        >>> resp = session.get('https://pycon.jp/2018/event/sessions')
        >>> print('スクレイピング実践入門' in resp.html.find('body', first=True).text)
        False
        >>>
    The response does not seem to contain the elements we want in the first place…
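
The same check can be reproduced offline. The sketch below is an illustration under assumptions: the HTML is a made-up single-page-application shell, not the real PyCon JP response, and `VisibleText` is a hypothetical helper built on the standard-library `html.parser`. It shows why a text search on the static response comes back `False` when the content only exists inside a script.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gather text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


# Hypothetical SPA shell: the list is filled in later by JavaScript.
STATIC_HTML = """
<html><body>
  <h1>Sessions</h1>
  <div id="app"></div>
  <script>render(["Practical web scraping"]);</script>
</body></html>
"""

parser = VisibleText()
parser.feed(STATIC_HTML)
text = ' '.join(parser.chunks)
found = 'Practical web scraping' in text
print(text, found)
```

The title is present in the markup, but only as an argument to a script that has not run, so no element containing it exists yet.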
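  75. Case 3: elements are rendered by JavaScript
    Cause: the elements are rendered by JavaScript
    • Since the arrival of Ajax, the server no longer necessarily returns all of the HTML
    • With the rise of SPAs (Single-Page Applications), quite a few pages render every element with JS
    • This is very obvious if you look at the source of something like the Twitter home page
  76. The JavaScript has to be executed by a browser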

  77. Code
        from requests_html import HTMLSession

        session = HTMLSession()
        resp = session.get('https://pycon.jp/2018/event/sessions')

        # Execute the page's JS, then wait 5 seconds for rendering to finish
        resp.html.render(sleep=5)

        sel = '.session-summary h3'
        elems = resp.html.find(sel)

        print([i.text for i in elems])
    Wait for the JS to run
    • You could render with a browser automation library such as Selenium or pyppeteer
    • In fact, requests-html can do it all by itself
  79. Result
        $ python scraping.py
        ['「リモートペアプロでマントルを突き抜けろ!」AWS Cloud9でリモートペアプロ&楽々サーバーレス開発',
         '1次元畳み込みフィルターを利用した音楽データのオートエンコーダ',
         'Adding JWT Authentication to Python and Django REST Framework Using Auth0',
         'AltJSとしてのPython - フロントエンドをPythonで書こう',
         'Applying serverless architecture pattern to distributed data processing',
         'Build text classification models ( CBOW and Skip-gram) with FastText in python',
         'Building Maintainable Python Web App using Flask',
         'C拡張と共に乗り切るPython 2→3移行術',
         'Django REST Framework におけるAPI実装プラクティス',
         …
         'テキストマイニングによるTwitter個人アカウントの性格推定',
         'Why your Django account registration should use a Turing test…',
         '医学研究者が深層学習環境の立ち上げの際に苦労した話',
         '暗号通貨技術・ブロックチェーン技術を活用するCrypto-Fintech Lab.',
         '安全なサンドボックス構築の裏側 ~投資アルゴリズム構築環境QuantX Factoryの事例~',
         'diff 最小化原理で導く Zen of Python',
         'Python × Investment ~投資信託をPythonで分析して、その結果を公開するサービス作った話~',
         'Pythonの軽量フレームワークによるシンプルで高速なWebAPIの作り方',
         'システム開発素人が深層学習を用いた画像認識で麻雀点数計算するLINEbot作った話',
         '【poke2vec】ポケモンの役割ベクトルの学習とその分析・可視化',
         'asyncio + aiohttp で作るウェブサービス',
         'PyCon JP 傾向と対策']
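  81. Practical examples: summary
    • Know what a URL actually points to
      • The structure of the scheme, the structure of the query string, and so on
    • Know what the browser is doing
      • Where does it sit in the HTTP client/server exchange?
      • Interpreting HTML / applying CSS / executing JavaScript
      • Sessions / cookies / local storage
    • Put yourself in the browser's shoes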
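
Taking a URL apart is a quick way to see what each piece points to. The sketch below uses the standard-library `urllib.parse` on the Metacritic URL from the paging example; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.metacritic.com/browse/games/score/'
       'metascore/all/all/filtered?sort=desc&page=1')

parts = urlsplit(url)
query = parse_qs(parts.query)  # each value is a list, since keys may repeat

print(parts.scheme)   # protocol used to reach the server
print(parts.netloc)   # host that receives the request
print(parts.path)     # resource on that host
print(query)          # parameters that vary the response
```

Once the query string is visible as a dictionary, it is easier to spot which parameter (here `page`) to vary when walking through a paginated listing.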
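  82. Summary of today's talk
    • There are all sorts of OSS libraries
      • Use the library that suits you
      • Contribute to the ones you like
    • Basically, anything you can see can be fetched
      • The barrier to entry is low
    • On the other hand, you will get stuck if you do not know how HTTP(S) works
      • Working through cases like these is a good way to learn the mechanics too (I still have plenty to learn myself…)
  83. Thank you for listening!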