Slide 1

Slide 1 text

Getting Started with Web Scraping in Python: A Practical Introduction ## level: All. ## Shintaro Tanaka (@_sin_tanaka)

Slide 2

Slide 2 text

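Who this talk is for
• You can write Python
• You have a rough idea of how the Web works (HTTP requests / responses)
• You have never done scraping
• You have done scraping, but think "it's just BeautifulSoup4 and Selenium, right?"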

Slide 3

Slide 3 text

Agenda
1. Library overview
2. Basics
3. Practice

Slide 4

Slide 4 text

Terms used in this session

Slide 5

Slide 5 text

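What is scraping?
• Collecting resources on the web
• Accessing the target web page
• Parsing HTML

Fetching resources from the web and converting them into structured data that can be analyzed.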
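The "HTML in, structured data out" step can be sketched with only the standard library. This is a minimal illustration, not code from the talk: `SAMPLE_HTML`, `LinkCollector`, and the sponsor names are hypothetical, and a real scraper would fetch the HTML over HTTP first.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1>Sponsors</h1>
  <ul>
    <li><a href="/sponsors/alpha">Alpha Inc.</a></li>
    <li><a href="/sponsors/beta">Beta LLC</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect every <a> element as a {'href': ..., 'text': ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
records = collector.records  # a list of dicts, ready for analysis
print(records)
```

The libraries introduced later in the talk do this parsing and searching for you; the sketch only shows what "structured data" means here.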

Slide 6

Slide 6 text

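What is a crawler?
• An automated version of the flow you follow when scraping
• For example, the search engines of Google and Yahoo run crawlers that go as far as building the index
• Not covered in this session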

Slide 7

Slide 7 text

Shintaro Tanaka (@_sin_tanaka)
Nihon System Giken Co., Ltd. / GeekLab.NAGANO staff
GitHub / Medium: @sin_tanaka
Likes music and late-night radio
Has sent pull requests to serverless-wsgi, among other things

Slide 8

Slide 8 text

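What this talk covers
• The recent library landscape (requests-html, pyppeteer)
• Scraping basics
• Where scraping gets stuck (practical examples)

What this talk does not cover
• How to build a crawler
• How to make use of the data you collect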

Slide 9

Slide 9 text

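Motivation for scraping
• Any data you can see in the browser can be retrieved
• You can get data even when there is no API
• You can retrieve only the data you need

Compared with using an API:
• The barrier to entry is low!
• The flexibility is high!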

Slide 10

Slide 10 text

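Caveats

"Unfortunately, some researchers crawl abusively in order to use Cookpad's data."
(quoted from the Cookpad developer blog)

• Scraping increases server load
• Do not reuse or redistribute the data you collect
• Read the terms of service carefully

Scrape with moderation and ethics.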

Slide 11

Slide 11 text

Today's scraping libraries

There is a lot of code from here on. Reading it carefully will probably make you sleepy, so just skim it. This is the last session of day 2, after all.

Slide 12

Slide 12 text

Types of libraries

Browser automation
• Selenium
• puppeteer

HTML parsing
• html.parser
• lxml
• html5lib

Tree traversal
• BeautifulSoup4
• pyquery

Slide 13

Slide 13 text

The standard choices: BeautifulSoup4 / Selenium

Slide 14

Slide 14 text

BeautifulSoup4 (bs4)
• MIT license
• Latest version at the time of the talk: 4.6.3
• OSS, developed on the Launchpad development platform
• Parses and traverses HTML
• Plenty of existing know-how

$ pip install beautifulsoup4

Slide 15

Slide 15 text

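BeautifulSoup4 (bs4): code example

    from bs4 import BeautifulSoup
    import requests

    resp = requests.get('https://www.example.com/')
    # Pass the raw bytes and name the parser explicitly.
    bs_obj = BeautifulSoup(resp.content, 'lxml')

    print([i.text for i in bs_obj.find_all('h1')])

Points (both concern the BeautifulSoup(...) call)
• Pass the byte string from Response.content
  → bs4 works out the character encoding for you
  → with the cchardet module installed, detection is faster
• Specify the parser
  → the official recommendation is lxml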
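Why passing bytes matters can be sketched without bs4. The example below is a rough illustration only: `sniff_charset` is a hypothetical helper that reads a `<meta charset>` declaration, which is not how bs4 actually detects encodings. It shows that decoding with the wrong encoding garbles the text, while working from the raw bytes leaves room to pick the right one.

```python
import re

# Hypothetical page body encoded as Shift_JIS, as a server might send it.
raw = (
    '<html><head><meta charset="shift_jis"></head>'
    '<body><h1>日本語</h1></body></html>'
).encode("shift_jis")


def sniff_charset(raw_bytes, default="utf-8"):
    """Rough stand-in for encoding detection: read <meta charset=...>."""
    match = re.search(rb'<meta\s+charset="?([\w-]+)', raw_bytes, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default


# Decoding early with a guessed encoding produces garbled text.
wrong = raw.decode("utf-8", errors="replace")

# Keeping the bytes lets the encoding be chosen from the content itself.
right = raw.decode(sniff_charset(raw))

print("日本語" in wrong, "日本語" in right)
```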

Slide 16

Slide 16 text

Selenium
• Apache 2.0 license
• OSS
• A browser automation tool for automating UI tests
• Because it drives a real browser, it can execute JS
• Can be driven from many languages
• ⭐ Stars: over 11,800

$ pip install selenium
$ # e.g. brew cask install chromedriver

Slide 17

Slide 17 text

Selenium: code example

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    driver = webdriver.Chrome(options=options)
    driver.get('https://www.example.com/')

    print(driver.title)

    driver.quit()

Point (the webdriver.Chrome(...) and driver.quit() lines)
• Always call WebDriver.quit()!
  → quit() and close() both exist and do different things
  → this keeps zombie processes from piling up

Slide 18

Slide 18 text

Selenium

    # `webdriver` and `options` as defined in the previous example
    with webdriver.Chrome(options=options) as driver:
        driver.get('https://www.example.com/')

The WebDriver class calls quit() in __exit__, so a with statement works too.
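The mechanism behind the with statement can be sketched with a stand-in class. `FakeDriver` is hypothetical and is not Selenium's implementation; it only shows that `__exit__` runs, and therefore `quit()` runs, even when the body raises.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; not Selenium's class."""

    def __init__(self):
        self.running = True  # pretend a browser process was started

    def get(self, url):
        if url.startswith("bad:"):
            raise ValueError(f"cannot open {url}")
        return f"loaded {url}"

    def quit(self):
        self.running = False  # pretend the browser process was shut down

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.quit()   # runs on normal exit and on exceptions alike
        return False  # do not swallow the exception


driver = FakeDriver()
try:
    with driver:
        driver.get("bad://example")  # raises inside the with block
except ValueError:
    pass

print(driver.running)  # the driver was still shut down
```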

Slide 19

Slide 19 text

These days there are other options too!

Slide 20

Slide 20 text

requests-html

Slide 21

Slide 21 text

requests-html
• https://github.com/kennethreitz/requests-html
• OSS by the author of requests and pipenv
• "HTML Parsing for Humans"
• At 0.9.0 at the time of the talk, so it has not reached a major version yet
• Parses and traverses HTML
• Can run JS!
• ⭐ Stars: over 8,300

$ pip install requests-html

Slide 22

Slide 22 text

requests-html: code example

    from requests_html import HTMLSession

    session = HTMLSession()
    resp = session.get('https://www.example.com/')
    print([i.text for i in resp.html.find('h1')])

Point (the resp.html access on the last line)
• Response().html
  → the response is actually an extension of requests.Response
  → it has grown an extra property called html

Slide 23

Slide 23 text

pyppeteer

Slide 24

Slide 24 text

pyppeteer
• https://github.com/miyakogi/pyppeteer
• A Python implementation of puppeteer
• Browser automation
• The name is really easy to mistype
• ⭐ Stars: over 900

$ pip install pyppeteer

Slide 25

Slide 25 text

pyppeteer: code example

    import asyncio
    from pyppeteer import launch


    async def main():
        browser = await launch()
        page = await browser.newPage()
        await page.goto('https://example.com')
        print(await page.title())
        await browser.close()


    # Equivalent to get_event_loop().run_until_complete(main()) on Python 3.7+
    asyncio.run(main())

Point: it is a port of a Node.js library
  → the API is built to return coroutines
  → method names, argument names and so on are camelCase
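What "the API returns coroutines" means in practice can be sketched without a browser. `FakePage` is a hypothetical stand-in, not pyppeteer's class: calling an async method only creates a coroutine object, and the result arrives when it is awaited.

```python
import asyncio
import inspect


class FakePage:
    """Hypothetical page object whose methods are coroutines."""

    async def title(self):
        await asyncio.sleep(0)  # stands in for a round trip to the browser
        return "Example Domain"


async def main():
    page = FakePage()
    pending = page.title()                  # no await: just a coroutine object
    is_coro = inspect.iscoroutine(pending)
    title = await pending                   # await: the actual result
    return is_coro, title


is_coro, title = asyncio.run(main())
print(is_coro, title)
```

Forgetting an `await` is therefore a common source of confusion with this style of API: the call appears to succeed but yields a coroutine instead of a value.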

Slide 26

Slide 26 text

pyquery

Slide 27

Slide 27 text

pyquery
• https://github.com/gawel/pyquery
• jQuery-like API
• Even more intuitive than bs4
• You can pass a copied selector as-is
• Recommended if you know jQuery
• ⭐ Stars: over 1,500

$ pip install pyquery

Slide 28

Slide 28 text

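pyquery: code example

    from pyquery import PyQuery

    pq = PyQuery(url='https://example.com')
    print(pq.find('h1').text())

Point (the PyQuery(url=...) call)
• Passing the URL directly is enough: pyquery takes care of the HTTP request too
  → of course, you can also pass bytes or str, as with bs4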

Slide 29

Slide 29 text

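Summary so far
• The standard choices are bs4 and Selenium
  • Plenty of accumulated know-how
• That said, there are many OSS options these days
  • New ones keep appearing
  • Contribute to the OSS you like!
• Know which layer each library covers
  • Browser automation / parsing / traversal?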

Slide 30

Slide 30 text

Basics

Slide 31

Slide 31 text

Rough procedure
1. Decide on the target page
2. Send an HTTP request / receive the response
3. Parse the HTML
4. Extract the data you need

Slide 32

Slide 32 text

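Rough procedure
1. Decide on the target page
2. Send an HTTP request / receive the response
3. Parse the HTML
4. Extract the data you need

(Highlighted on the slide: the parts the library does for you)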

Slide 33

Slide 33 text

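Rough procedure
1. Decide on the target page
2. Send an HTTP request / receive the response
3. Parse the HTML
4. Extract the data you need

(Highlighted on the slide: the parts you decide yourself)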

Slide 34

Slide 34 text

Scenario
Get the list of sponsors from the PyCon JP 2017 site (an idea borrowed from Python Boot Camp)
URL: https://pycon.jp/2017/ja/sponsors/

Slide 35

Slide 35 text

Use the developer tools

Slide 36

Slide 36 text

(Screenshot) The corresponding code

Slide 37

Slide 37 text

(Screenshot) The element as actually displayed / the corresponding code

Slide 38

Slide 38 text

What we found out
• The URL is https://pycon.jp/2017/ja/sponsors/
• Finding the h4 tags inside the sponsor-content class and taking their text looks like it will work

Next: turn this into code

Slide 39

Slide 39 text

Scraping PyCon JP 2017

    from requests_html import HTMLSession

    url = 'https://pycon.jp/2017/ja/sponsors/'

    session = HTMLSession()
    resp = session.get(url)

    # h4 tags nested inside elements with class "sponsor-content"
    sel = '.sponsor-content h4'
    elems = resp.html.find(sel)
    print([i.text for i in elems])
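To see what the selector `.sponsor-content h4` asks for without touching the network, the matching rule can be sketched with the standard library. This is an illustration under assumptions: `SAMPLE_HTML` and `SponsorNames` are hypothetical, the markup is assumed to be well formed with no void tags such as `<br>`, and real libraries implement CSS selectors far more completely.

```python
from html.parser import HTMLParser

# Hypothetical markup shaped like the sponsor page described above.
SAMPLE_HTML = """
<div class="sponsor-content"><h4>Alpha Inc.</h4><p>Gold sponsor</p></div>
<div class="sponsor-content"><h4>Beta LLC</h4></div>
<div class="footer"><h4>Not a sponsor</h4></div>
"""


class SponsorNames(HTMLParser):
    """Collect text of <h4> tags nested inside a 'sponsor-content' element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._stack = []     # one bool per open tag: is it sponsor-content?
        self._in_h4 = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Descendant match: an h4 with any sponsor-content ancestor.
        if tag == "h4" and any(self._stack):
            self._in_h4 = True
            self.names.append("")
        self._stack.append("sponsor-content" in classes)

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_h4 = False
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._in_h4:
            self.names[-1] += data.strip()


parser = SponsorNames()
parser.feed(SAMPLE_HTML)
names = parser.names
print(names)
```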

Slide 41

Slide 41 text

Result

    >> python scraping.py
    ['株式会社SQUEEZE', '株式会社MonotaRO', 'LINE株式会社', 'Retty株式会社', 'iRidge, Inc.', '株式会社いい生活', … 'Togetter', 'CodeZine', 'エンジニアtype']

Slide 43

Slide 43 text

Handle errors properly with try-except

Slide 44

Slide 44 text

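Handle errors properly with try-except

    import requests

    # `session` and `some_url` are assumed to be defined as in the earlier examples
    try:
        resp = session.get(some_url)
        resp.raise_for_status()  # without this, 4xx/5xx responses do not raise HTTPError
    except requests.exceptions.ConnectionError:
        print('NetworkError')
    except requests.exceptions.TooManyRedirects:
        print('TooManyRedirects')
    except requests.exceptions.HTTPError:
        print('BadResponse')

It is probably enough to account for:
• a flaky network connection
• too many redirects
• an invalid response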

Slide 46

Slide 46 text

Use the @retry decorator

$ pip install retry

Slide 47

Slide 47 text

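Use the @retry decorator

    from retry import retry

    # tries: 5, initial delay: 2 sec, backoff multiplier: 2
    @retry(tries=5, delay=2, backoff=2)
    def fetch():
        ...  # code that may raise goes here

• A module that re-runs the decorated function a given number of times when an Exception is raised inside it
• The number of retries is easy to set
• The retry interval can be made to grow exponentially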
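How `tries`, `delay`, and `backoff` interact can be sketched with a small hand-written decorator. This is not the retry package's implementation: `simple_retry` and its `sleep` parameter are hypothetical, and the sleep function is injectable only so the waits can be recorded instead of actually waiting.

```python
import functools
import time


def simple_retry(tries, delay, backoff, sleep=time.sleep):
    """Retry the wrapped function on any Exception, up to `tries` attempts."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise        # out of attempts: re-raise the last error
                    sleep(wait)      # wait before the next attempt
                    wait *= backoff  # grow the wait: 2, 4, 8, ...

        return wrapper

    return decorator


waits = []          # records the waits instead of really sleeping
calls = {"n": 0}    # counts how many times flaky() ran


@simple_retry(tries=5, delay=2, backoff=2, sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, waits)
```

With these settings the function fails twice and succeeds on the third attempt, so two waits are recorded, the second twice as long as the first.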

Slide 48

Slide 48 text

Final code

    from requests.exceptions import ConnectionError, TooManyRedirects, HTTPError
    from requests_html import HTMLSession
    from retry import retry


    @retry(tries=3, delay=2, backoff=2)
    def get_resp():
        try:
            session = HTMLSession()
            resp = session.get('https://pycon.jp/2017/ja/sponsors/')
            resp.raise_for_status()  # turn 4xx/5xx responses into HTTPError
            return resp
        except ConnectionError:
            print('NetworkError')
            raise
        except TooManyRedirects:
            print('TooManyRedirects')
            raise
        except HTTPError:
            print('BadResponse')
            raise


    try:
        resp = get_resp()
    except Exception:
        print('Response not found')
    else:
        # Only parse when a response was actually obtained
        sel = '.sponsor-content h4'
        elems = resp.html.find(sel)
        print([i.text for i in elems])

Slide 50

Slide 50 text

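Basics: summary
• Decide what you are going to scrape
  • Know the basics of the browser developer tools
• Do try-except properly
  • Avoid sending unnecessary requests
  • Avoid "ran it overnight, woke up to no data"
  • The @retry decorator makes retrying easy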

Slide 51

Slide 51 text

Practice

Slide 52

Slide 52 text

Cases where I got stuck
1. Paging
2. Member pages that require login
3. Elements visible in the browser cannot be found by the program

Slide 53

Slide 53 text

Case 1: paging (scenario)
We want to collect the reviews published on Metacritic.com

Slide 54

Slide 54 text

Case 1: paging (scenario)
• Also called paging, a pager, or pagination

We want to collect elements whose displayed content changes as you page through

Slide 55

Slide 55 text

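What we found out
• Taking the text inside the product_row class looks like it will work
• The URL changes when you switch pages:
  https://www.metacritic.com/browse/games/score/metascore/all/all/filtered?sort=desc&page=1

Incrementing the page query-string parameter looks like it will work!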
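Incrementing the `page` parameter can be sketched with the standard library alone. The URL is the one shown above; `page_url` is a hypothetical helper, and starting the numbering at 0 is an assumption that should be checked against the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.metacritic.com/browse/games/score/metascore/all/all/filtered"


def page_url(page_num):
    """Build the listing URL for one page by setting the `page` parameter."""
    query = urlencode({"sort": "desc", "page": page_num})
    return f"{BASE_URL}?{query}"


# The first three page URLs; a real run would loop further.
urls = [page_url(n) for n in range(3)]
for u in urls:
    print(u)
```

Building the query with `urlencode` rather than string concatenation keeps the URL valid if a parameter value ever needs escaping.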

Slide 57

Slide 57 text

Code

    from requests_html import HTMLSession
    import time

    base_url = 'https://www.metacritic.com/...'  # rest of the URL omitted on the slide
    qs = 'sort=desc&page='

    session = HTMLSession()  # one session reused for every page

    for page_num in range(200):
        resp = session.get(f'{base_url}?{qs}{page_num}')

        elems = resp.html.find('.product_row')
        print([i.text for i in elems])

        time.sleep(2)  # pause between requests to the same domain

When you access the same domain repeatedly, call sleep() between requests.
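The "sleep between requests" rule can be pulled into a small helper. This is a sketch under assumptions: `fetch_all` is hypothetical, the `fetch` callable stands in for something like `session.get`, and the sleep function is injectable only so the pauses can be observed without waiting.

```python
import time


def fetch_all(urls, fetch, delay=2.0, sleep=time.sleep):
    """Call `fetch` for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results


pauses = []  # records the pauses instead of really sleeping
pages = fetch_all(
    [
        "https://example.com/?page=0",
        "https://example.com/?page=1",
        "https://example.com/?page=2",
    ],
    fetch=lambda url: f"body of {url}",  # hypothetical stand-in for session.get(url)
    sleep=pauses.append,
)
print(len(pages), pauses)
```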

Slide 58

Slide 58 text

Result

    > $ python scraping.py
    ['1.\n99\nThe Legend of Zelda: Ocarina of Time (N64)\nUser: 9.1\nNov 23, 1998',
     "2.\n98\nTony Hawk's Pro Skater 2 (PS)\nUser: 7.4\nSep 20, 2000",
     '3.\n98\nGrand Theft Auto IV (PS3)\nUser: 7.5\nApr 29, 2008',
     '4.\n98\nSoulCalibur (DC)\nUser: 8.6\nSep 8, 1999',
     '5.\n98\nGrand Theft Auto IV (X360)\nUser: 7.9\nApr 29, 2008',
     '6.\n97\nSuper Mario Galaxy (WII)\nUser: 9.0\nNov 12, 2007',
     '7.\n97\nSuper Mario Galaxy 2 (WII)\nUser: 9.1\nMay 23, 2010',
     '8.\n97\nGrand Theft Auto V (XONE)\nUser: 7.8\nNov 18, 2014',
     …
     '93.\n94\nJet Grind Radio (DC)\nUser: 8.0\nOct 30, 2000',
     '94.\n94\nMetal Gear Solid (PS)\nUser: 9.2\nOct 21, 1998',
     '95.\n94\nGrim Fandango (PC)\nUser: 9.1\nOct 14, 1998',
     "96.\n94\nTom Clancy's Splinter Cell Chaos Theory (XBOX)\nUser: 8.9\nMar 28, 2005",
     '97.\n94\nBurnout 3: Takedown (XBOX)\nUser: 7.4\nSep 7, 2004',
     '98.\n94\nDiablo (PC)\nUser: 8.7\nDec 31, 1996',
     '99.\n94\nMetal Gear Solid 3: Subsistence (PS2)\nUser: 9.0\nMar 14, 2006',
     '100.\n94\nCall of Duty: Modern Warfare 2 (X360)\nUser: 6.4\nNov 10, 2009']

Slide 60

Slide 60 text

Case 2: pages that require login (scenario)
• We want to get the information on our own member page at Money Forward

Slide 61

Slide 61 text

You cannot see the member page without logging in

Slide 62

Slide 62 text

Log in through browser automation

Slide 63

Slide 63 text

Case 2: pages that require login (scenario)
Login page: https://moneyforward.com/users/sign_in

Slide 65

Slide 65 text

Code

    import asyncio

    from pyppeteer import launch


    async def main():
        browser = await launch()
        page = await browser.newPage()
        await page.goto('https://moneyforward.com/users/sign_in')

        # Placeholder credentials; the address on the slide was lost in extraction
        await page.type('#sign_in_session_service_email', 'your_email@example.com')
        await page.type('#sign_in_session_service_password', 'your_password')
        btn_elem = await page.querySelector('#login-btn-sumit')
        await btn_elem.click()

        await page.waitFor(5000)  # wait 5 seconds for the login to complete
        await page.screenshot({'path': 'logined.png', 'fullPage': True})
        await browser.close()


    # Equivalent to get_event_loop().run_until_complete(main()) on Python 3.7+
    asyncio.run(main())

Slide 67

Slide 67 text

Result

Slide 69

Slide 69 text

Case 3: elements are rendered by JavaScript (scenario)
We want to get the list of sessions at PyCon JP 2018

Slide 70

Slide 70 text

Case 3: elements are rendered by JavaScript
• The URL is https://pycon.jp/2018/event/sessions
• Finding the h3 tags inside the session-summary class and taking their text looks like it will work

No paging, no login: this should be easy!

Slide 71

Slide 71 text

Code

    from requests_html import HTMLSession

    session = HTMLSession()

    resp = session.get('https://pycon.jp/2018/event/sessions')

    sel = '.session-summary h3'
    elems = resp.html.find(sel)

    print([i.text for i in elems])

Slide 72

Slide 72 text

Result

    > $ python scraping.py
    []

Slide 74

Slide 74 text

Case 3: elements are rendered by JavaScript

Investigating further
• Search the text inside the response

    $ python
    >>> from requests_html import HTMLSession
    >>>
    >>> session = HTMLSession()
    >>> resp = session.get('https://pycon.jp/2018/event/sessions')
    >>> print('スクレイピング実践入門' in resp.html.find('body', first=True).text)
    False
    >>>

The element we want does not seem to be in the response in the first place…

Slide 75

Slide 75 text

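Case 3: elements are rendered by JavaScript

Cause: the elements are rendered by JavaScript
• Since the arrival of Ajax, the server no longer necessarily returns all of the HTML
• With the rise of SPAs (Single-Page Applications), quite a few pages render every element with JS
• This is obvious if you look at the source of, say, the Twitter home page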
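The cause can be reproduced offline with a small sketch. `STATIC_HTML` is a hypothetical response in which a script would insert the session list in a real browser, and `TagCounter` is a hypothetical helper. An HTML parser only reads the markup it is given and does not run the script, so the `h3` never appears.

```python
from html.parser import HTMLParser

# Hypothetical server response: an empty container plus a script that
# would fill it in when executed by a browser.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById('app').innerHTML =
      '<div class="session-summary"><h3>Some talk</h3></div>';
  </script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


counter = TagCounter()
counter.feed(STATIC_HTML)

# The script body is treated as plain text, so no <h3> element is reported.
print(counter.tags, "h3" in counter.tags)
```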

Slide 76

Slide 76 text

The JavaScript has to be executed by a browser

Slide 77

Slide 77 text

Code

    from requests_html import HTMLSession

    session = HTMLSession()
    resp = session.get('https://pycon.jp/2018/event/sessions')

    resp.html.render(sleep=5)  # wait for the JS to run

    sel = '.session-summary h3'
    elems = resp.html.find(sel)

    print([i.text for i in elems])

• Render the page with a browser automation library such as Selenium or pyppeteer
• In fact, requests-html can do it all by itself

Slide 79

Slide 79 text

Result

    > $ python scraping.py
    > ['「リモートペアプロでマントルを突き抜けろ!」AWS Cloud9でリモートペアプロ&楽々サーバーレス開発',
     '1次元畳み込みフィルターを利用した音楽データのオートエンコーダ',
     'Adding JWT Authentication to Python and Django REST Framework Using Auth0',
     'AltJSとしてのPython - フロントエンドをPythonで書こう',
     'Applying serverless architecture pattern to distributed data processing',
     'Build text classification models ( CBOW and Skip-gram) with FastText in python',
     'Building Maintainable Python Web App using Flask',
     'C拡張と共に乗り切るPython 2→3移行術',
     'Django REST Framework におけるAPI実装プラクティス',
     …
     'テキストマイニングによるTwitter個人アカウントの性格推定',
     'Why your Django account registration should use a Turing test…',
     '医学研究者が深層学習環境の立ち上げの際に苦労した話',
     '暗号通貨技術・ブロックチェーン技術を活用するCrypto-Fintech Lab.',
     '安全なサンドボックス構築の裏側 ~投資アルゴリズム構築環境QuantX Factoryの事例~',
     'diff 最小化原理で導く Zen of Python',
     'Python × Investment ~投資信託をPythonで分析して、その結果を公開するサービス作った話~',
     'Pythonの軽量フレームワークによるシンプルで高速なWebAPIの作り方',
     'システム開発素人が深層学習を用いた画像認識で麻雀点数計算するLINEbot作った話',
     '【poke2vec】ポケモンの役割ベクトルの学習とその分析・可視化',
     'asyncio + aiohttp で作るウェブサービス',
     'PyCon JP 傾向と対策']

Slide 81

Slide 81 text

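Practice: summary
• Know what a URL points to
  • The structure of the scheme, the structure of the query string, and so on
• Know what the browser is doing
  • Where does it sit in the HTTP client/server picture?
  • Interpreting HTML / applying CSS / executing JavaScript
  • Sessions / cookies / local storage
• Put yourself in the browser's shoes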
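Taking a URL apart is a quick way to see what it points to. The sketch below uses the standard library on the Metacritic URL from case 1; nothing here goes over the network.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://www.metacritic.com/browse/games/score/metascore/all/all/filtered"
    "?sort=desc&page=1"
)

parts = urlsplit(url)
query = parse_qs(parts.query)  # each value is a list, since keys may repeat

print(parts.scheme)   # protocol part of the URL
print(parts.netloc)   # host the request goes to
print(parts.path)     # resource on that host
print(query)          # parameters the server uses to pick the content
```

Seeing `page` as a separate parameter is what made the paging case straightforward: only that one value has to change between requests.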

Slide 82

Slide 82 text

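Summary of today's talk
• There are many OSS libraries
  • Use the library that suits you
  • Contribute to the ones you like
• As a rule, anything you can see can be retrieved
  • The barrier to entry is low
  • On the other hand, you will get stuck if you do not know how HTTP(S) works
  • Studying the mechanisms through concrete cases is a good approach (I still have plenty to learn myself…)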

Slide 83

Slide 83 text

Thank you for listening!