Python Scraping and Web Apps for G's ACADEMY TOKYO

Materials for day 2 of the Python course taught at G's Academy Tokyo. Day 2 covers web scraping and building web applications with Python.
Presented by http://www.yoheim.net

Yohei Munesada

August 30, 2017

Transcript

  1. Scraping and Web Apps: from the world to you, from you to the world. Yohei Munesada

  2.  about me: Yohei Munesada (宗定洋平), G's ACADEMY TOKYO mentor, http://www.yoheim.net

  3.  course catalog (Basic / Advanced)

    - What is Python
    - Python basics
    - Modules and packages
    - Web scraping
    - Web server
    - Python and machine learning
  4.  re: agenda

    Basic 1
    - Running a program
    - Indentation style
    - Conditionals and loops
    - Variable definitions and data types
    - Data structures (List, Dictionary, Set, Tuple)
    - Functions
    Basic 2
  5.  re: agenda

    - Defining and using modules
    - Using packages
    - Using standard library modules
    - Using external modules

  6.  web scraping: web scraping with BeautifulSoup

  7.  web scraping steps ?

  8.  web scraping (3 steps)

    With urllib.request and BeautifulSoup, scraping is straightforward.
    1. Fetch the HTML from the server.
    2. Load the HTML with BeautifulSoup.
    3. Extract the information you want from the DOM.
  9.  web scraping

        from urllib.request import urlopen
        from bs4 import BeautifulSoup

        # 1. Get the HTML.
        with urlopen("http://www.yoheim.net") as res:
            html = res.read().decode("utf-8")

        # 2. Load the HTML with BeautifulSoup.
        soup = BeautifulSoup(html, "html.parser")

        # 3. Get the items you want.
        titles = soup.select(".articleListItem h2")
        titles = [t.string for t in titles]
  10.  web scraping

        # Check results.
        from pprint import pprint
        pprint(titles[:4])

        ['[Linux] 最終更新日や最終アクセス日を指定して、ファイルを検索/削除する',
         '[フロントエンド] Yamlというデータ構造に入門する',
         '[Docker] DockerのインストールとLinux起動まで',
         '[Javascript] 絵文字(サロゲートペア)を含んだ文字列の文字数を正しく取得する']
  11.  selectors in BeautifulSoup

    BeautifulSoup has three main ways to select elements: find, find_all, and select.

        soup.find("h1")
        soup.find(id="header_subtitle")
        soup.find(class_="articleListItem")

        soup.find_all("h2")
        soup.find_all(id="header_subtitle")
        soup.find_all(class_="pubDate")

        soup.select(".articleListItem h2")
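
To see how the three selectors differ in what they return, here is a minimal sketch run against a small inline HTML string rather than a live page. The markup and its text are invented for illustration; only the class names and the id are borrowed from the slide.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class/id names come from the slide.
html = """
<div id="header_subtitle">My blog</div>
<div class="articleListItem"><h2>First post</h2><span class="pubDate">2017-08-01</span></div>
<div class="articleListItem"><h2>Second post</h2><span class="pubDate">2017-08-02</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element, or None when nothing matches.
first_h2 = soup.find("h2")
missing = soup.find("h1")

# find_all returns every matching element as a list-like ResultSet.
dates = [d.string for d in soup.find_all(class_="pubDate")]

# select takes a CSS selector and also returns a list of elements.
titles = [t.string for t in soup.select(".articleListItem h2")]

print(first_h2.string, missing)
print(dates)
print(titles)
```

Because find can return None, code that scrapes real pages usually checks the result before reading `.string` or an attribute from it.
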
  12.  extractors in BeautifulSoup

    There are two main ways to get a value from a DOM element: its text and its attributes.

        # text: <h1>My Special App</h1>
        elm.string

        # attribute: <img src="/my_secret.png"/>
        elm["src"]
  13.  practice for web scraping

    Try the weather API. The API below returns weather results as RSS (XML); fetch the information you want from it.
    http://www.yoheim.net/blog.php?q=…
    Try typing the code out yourself.
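
Since the API returns RSS, the same three steps carry over to XML. A minimal sketch, assuming a feed whose `item` elements contain `title` and `description`; the sample feed is invented, and the standard library's `xml.etree.ElementTree` is used here in place of BeautifulSoup so the snippet has no dependencies.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real feed would come from urlopen(...).read().
rss = """<rss version="2.0">
  <channel>
    <title>Weather</title>
    <item><title>Tokyo</title><description>Sunny</description></item>
    <item><title>Osaka</title><description>Cloudy</description></item>
  </channel>
</rss>"""

# Parse the XML, then walk every <item> and pull out the two child texts.
root = ET.fromstring(rss)
forecasts = {
    item.findtext("title"): item.findtext("description")
    for item in root.iter("item")
}
print(forecasts)
```

The element names of the real feed may differ, so it is worth printing the raw response once before writing the extraction.
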
  14.  web scraping advance: practical web scraping

  15.  But before that…

  16.  read / write a file

    A quick look at reading and writing files.

        # Read.
        f = open("my.txt")
        txt = f.read()
        f.close()

        # Write.
        f = open("my2.txt", "w")
        f.write("writewrite")
        f.close()

        # Automatic resource management: the file is closed when the block exits.
        with open("my.txt", "r") as f:
            txt = f.read()

        # Binary mode.
        f = open("my.txt", "rb")
        txt = f.read().decode("utf-8")
        f.close()
  17.  case 1 : download all images http://gsacademy.tokyo/mentor/

  18.  case 1 : download all images ?

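  19.  case 1 : download all images

    1. Get the HTML.
    2. Check how to reach the img elements in the DOM.
    3. Get the image URLs from the img tags.
    4. (Convert the image URLs into a downloadable form.)
    5. Download the images one at a time.
    6. Save each one to a file.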
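
Step 4 is the least obvious one: `src` values are often relative, so they have to be resolved against the page URL before they can be downloaded. The sketch below is one possible way to do steps 3 and 4 with the standard library only; it is not the author's implementation, and `ImgSrcCollector`, `image_targets`, and the sample markup are hypothetical names and data.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def image_targets(page_url, html):
    """Return (absolute_url, local_filename) pairs for each image in html."""
    parser = ImgSrcCollector()
    parser.feed(html)
    targets = []
    for src in parser.srcs:
        absolute = urljoin(page_url, src)  # step 4: resolve relative URLs
        filename = os.path.basename(urlparse(absolute).path)
        targets.append((absolute, filename))
    return targets


# Invented markup standing in for the fetched page.
sample = '<img src="/img/a.png"><img src="b.jpg"><img src="http://example.com/c.gif">'
targets = image_targets("http://gsacademy.tokyo/mentor/", sample)
for url, name in targets:
    print(url, "->", name)
    # Steps 5 and 6 would go here, for example:
    # with urlopen(url) as res, open(name, "wb") as f:
    #     f.write(res.read())
```

Opening the output file in `"wb"` mode matters in steps 5 and 6, because image data is binary.
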
  20.  case 1 : download all images https://gist.github.com/yoheiMune/33a3e1c2066fe3f4ba43b13ea7fa53fa

  21.  case 2 : fake the request https://www.ebay.com/sch/sch/allcategories/all-categories

  22.  case 2 : fake the request

        $ pip3 install --upgrade requests

        import requests
        r = requests.get("https://www.ebay.com/sch/sch/allcategories/all-categories")
        html = r.text

    Reference: how to use the requests module http://www.yoheim.net/blog.php?q=20170802
  23.  Something is different...

  24.  https://www.charlesproxy.com

  25.  fake the request

  26.  fake the request

        import requests
        url = "https://www.ebay.com/sch/sch/allcategories/all-categories"
        headers = {"user-agent": "Mozilla/5.0 (Macintosh..."}
        r = requests.get(url, headers=headers)
        html = r.text

    Here we set the User-Agent request header.
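
The same idea works with the standard library used in the earlier slides. A minimal sketch that only builds the request and inspects it, without touching the network; the User-Agent value is a placeholder, not a real browser string.

```python
from urllib.request import Request

url = "https://www.ebay.com/sch/sch/allcategories/all-categories"

# Placeholder value; substitute the User-Agent string of a real browser.
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (example)"})

# urllib stores header names capitalized, so the key becomes "User-agent".
user_agent = req.get_header("User-agent")
print(user_agent)

# Sending it would be: urlopen(req).read()
```

Inspecting the request object like this is a quick way to confirm what will be sent before reaching for a proxy tool.
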
  27.  case 3 : Javascript page rendering https://dokusho-ojikan.jp/original

  28.  case 3 : Javascript page rendering: BeautifulSoup x Selenium x PhantomJS
  29.  http://www.seleniumhq.org/

  30.  http://phantomjs.org/

  31.  case 3 : Javascript page rendering

    Install PhantomJS and Selenium.
    PhantomJS: http://phantomjs.org/
    Selenium:

        $ pip3 install --upgrade selenium
  32.  Supplement : installing PhantomJS

    Download PhantomJS from its site and place it somewhere on your PATH.
    1. Download PhantomJS: get the PhantomJS executable (binary) from the URL below.
       http://phantomjs.org/download.html
    2. Place it on your PATH: on a Mac, for example, copy it as follows.

        $ cd ~/Downloads/phantomjs-2.1.1-macosx/bin/
        $ cp phantomjs /usr/local/bin/
  33.

        from selenium import webdriver
        from bs4 import BeautifulSoup

        driver = webdriver.PhantomJS()
        driver.get("https://dokusho-ojikan.jp/original/#!top")
        html = driver.page_source

        # Get image URLs.
        bs = BeautifulSoup(html, "html.parser")
        img_urls = [img.get("src") for img in bs.select("#unique-pickup img")]
        print(img_urls)

        # Screenshot.
        driver.save_screenshot("ss.png")
        driver.quit()
  34.  practice for web scraping

    Scrape a web page you like. For example:
    - Hatena Bookmark
    - Yahoo News
    - Livedoor News
    - and so on
    Try typing the code out yourself.
  35.  course catalog (Basic / Advanced)

    - What is Python
    - Python basics
    - Modules and packages
    - Web scraping
    - Web server
    - Python and machine learning
  36.  web application: web applications with Flask

  37.  agenda

    - Major web app libraries
    - What is Flask
    - Installing Flask
    - Small start
    - Routing
    - GET and POST
    - Templates and static files

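  38.  major web app libraries

    Among web app libraries, Django and Flask are the two leaders.
    - django: a heavyweight library used for large-scale development.
    - flask: an ultra-lightweight library that is very easy to use.
    - pyramid
    - bottle
    - …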
  39.  major web app libraries https://goo.gl/DVu9SE

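  40.  major web app libraries

    Among web app libraries, Django and Flask are the two leaders.
    - django: a heavyweight library used for large-scale development.
    - flask: an ultra-lightweight library that is very easy to use.
    - pyramid
    - bottle
    - …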
  41.  Flask is

    - A microframework
    - Centered on request/response handling
    - Simple, but functionally sufficient
    - Application design is up to you
    - Database access and the like are also yours to provide

  42.  https://www.getpostman.com/

  43.  install

    Flask can be installed via pip.

        $ pip3 install --upgrade Flask

  44.  small start

    Flask makes it easy to start a server.

        from flask import Flask
        app = Flask(__name__)

        @app.route("/")
        def index():
            return "Hello from flask"

        if __name__ == "__main__":
            app.run()

        $ python3 app.py
  45.  routing

    Routing uses a decorator (@app.route).

        @app.route("/")
        def index():
            return "Hello from flask"

        @app.route("/api/hello")
        def api_hello():
            return "api_hello"

        @app.route("/api/items/<int:item_id>")
        def api2(item_id):
            return "item_id is %d" % item_id

    What is a decorator: http://www.yoheim.net/blog.php?q=20160607
  46.  GET and POST

    The HTTP method is specified in the decorator.

        from flask import Flask, request

        @app.route("/api/users", methods=["GET"])
        def api_users_get():
            # Read ?user_id=... from the query string.
            user_id = request.args.get("user_id")
            return "user_id is %s" % user_id

        @app.route("/api/users/<int:user_id>", methods=["POST"])
        def api_users_update(user_id):
            # Read user_name from the submitted form data.
            user_name = request.form.get("user_name")
            return "user_id=%d, username=%s" % (user_id, user_name)
  47.  template and static files

    This covers templates and serving static files.

    Files:
        app.py
        templates/
            index.html
        …

        from flask import Flask, render_template

        @app.route("/mypage")
        def mypage():
            title = "Hello G's members !!"
            return render_template("index.html", title=title)

    template:

        <html>
        <body>
            <h1>{{ title }}</h1>
        </body>
        </html>
  48.  template and static files

    This covers templates and serving static files.

    Files:
        app.py
        static/
            main.js
        …

    static files: no code !
  49.  practice for flask application

    Supporting GET and POST: make a single API handle both the GET method and the POST method.
    Type the code out yourself and run it.
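
One way to approach this exercise is to list both methods on a single route and branch on `request.method`. The sketch below is only one possible answer; the `/api/echo` path and the `name` field are hypothetical, and Flask's built-in test client is used so nothing has to be started in a browser.

```python
from flask import Flask, request

app = Flask(__name__)


# One route, two methods; /api/echo and the "name" field are hypothetical.
@app.route("/api/echo", methods=["GET", "POST"])
def api_echo():
    if request.method == "POST":
        # POST: read from the submitted form data.
        name = request.form.get("name", "")
    else:
        # GET: read from the query string.
        name = request.args.get("name", "")
    return "%s name=%s" % (request.method, name)


# Exercise the route with Flask's test client (no running server needed).
client = app.test_client()
get_body = client.get("/api/echo?name=alice").get_data(as_text=True)
post_body = client.post("/api/echo", data={"name": "bob"}).get_data(as_text=True)
print(get_body)
print(post_body)
```

The alternative is two functions registered on the same path with different `methods` lists, which keeps each handler shorter at the cost of repeating the route.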

  50.  web application advance: web applications with Flask

  51.  case 1 : cookie

    Cookies are very easy to work with.

        from datetime import datetime
        from flask import Flask, make_response

        @app.route("/cookie")
        def cookie():
            # Contents
            response = make_response("OK")
            # Create cookie
            max_age = 60 * 60 * 24 * 30  # 30 days
            expires = int(datetime.now().timestamp()) + max_age
            response.set_cookie("gscookie", value="valval", expires=expires)
            # Response
            return response
  52.  case 1 : cookie

    Cookies are very easy to work with.

        from flask import Flask, request

        @app.route("/get_from_cookie")
        def get_from_cookie():
            # Fall back to an empty string so the view never returns None.
            val = request.cookies.get("gscookie", "")
            return val
  53.  case 2 : session

    Sessions are just as easy to work with.

        from flask import Flask, session

        app.secret_key = 'my_special_secret_key'

        @app.route("/session")
        def session_sample():
            val = int(session.get("num", 1))
            session["num"] = val + 1
            return "This is visit number %d!" % val
  54.  case 3 : divide controllers

    Blueprint lets you split controllers across files.

        # api.py
        from flask import Blueprint

        app = Blueprint('api', __name__)

        @app.route('/api/hello')
        def hello():
            return "api_hello"
  55.  case 3 : divide controllers

    Blueprint lets you split controllers across files.

        # app.py
        from api import app as api_app
        app.register_blueprint(api_app)

    Reference: http://www.yoheim.net/blog.php?q=20160507
  56.  case 4 : logging

    In real-world web apps, logging is an important concern.
    Purpose for logging ?

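  57.  case 4 : logging

    Design your logging around its purpose:
    - Confirming that the application is running normally
    - Investigating the cause when a failure occurs
    - Security auditing
    - and so on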

  58.  https://docs.python.jp/3/library/logging.html

  59.  case 4 : logging

        import logging
        from flask import Flask

        # Send INFO and above to info.log.
        info_handler = logging.FileHandler('info.log')
        info_handler.setLevel(logging.INFO)
        app.logger.addHandler(info_handler)

        # Send ERROR and above to error.log.
        error_handler = logging.FileHandler('error.log')
        error_handler.setLevel(logging.ERROR)
        app.logger.addHandler(error_handler)
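
One detail worth checking when copying this pattern: a handler's level only filters records that the logger itself lets through, so the logger's own level has to be at least as permissive as the most verbose handler. The sketch below shows this with a plain `logging.Logger` and in-memory streams instead of `app.logger` and files; the logger name is arbitrary.

```python
import io
import logging

logger = logging.getLogger("handler_level_demo")  # arbitrary name
logger.setLevel(logging.INFO)   # let INFO and above reach the handlers
logger.propagate = False        # keep the demo output out of the root logger

info_stream, error_stream = io.StringIO(), io.StringIO()

# First handler: receives INFO and above.
info_handler = logging.StreamHandler(info_stream)
info_handler.setLevel(logging.INFO)
logger.addHandler(info_handler)

# Second handler: receives ERROR and above only.
error_handler = logging.StreamHandler(error_stream)
error_handler.setLevel(logging.ERROR)
logger.addHandler(error_handler)

logger.debug("debug")   # dropped by the logger itself
logger.info("info")     # reaches the info handler only
logger.error("error")   # reaches both handlers

info_lines = info_stream.getvalue().splitlines()
error_lines = error_stream.getvalue().splitlines()
print(info_lines)
print(error_lines)
```

If an info log file stays empty in a real app, the logger's effective level is a reasonable first thing to inspect.
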
  60.  case 4 : logging

        @app.route("/logging")
        def logging_sample():
            app.logger.info('Info log...')
            app.logger.warning('Warning log...')
            app.logger.error('Error log...')
            try:
                1 / 0
            except ZeroDivisionError:
                # Logs at ERROR level and appends the traceback.
                app.logger.exception("Exception log...")
            # Response.
            return "ok"
  61.  case 5 : deployment

    Deploying web apps to a server takes a bit of effort.
    http://www.yoheim.net/blog.php?q=20170206

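  62.  practice for flask application

    Type the code out yourself and run it.
    Read the tutorial: Flask has many more features than the ones introduced here. Read the tutorial (quickstart) page under http://flask.pocoo.org/docs/ and learn from it (push through the English!).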

  63.  1st demo app: let's build a web chatbot

  64.  https://goo.gl/0v5dSj today’s demo

  65.  how it works

    The app combines web scraping with a Flask application.
    Components: http://localhost:5000, index.html, /api/recommend_articles, web scraping

  66.  how to create

    1. Clone the repository.
    2. Load the list of libraries.
    3. Start the app.
    4. Implement the web scraping logic.
    5. Test that it works.
    https://github.com/yoheimune-python-lecture/chatbot-news

  67.  extends the app

    - Answer "What's the weather today?"
    - Answer "What recipe do you recommend?"
    - Beyond that, modify it however you like
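
A simple starting point for these extensions is to route each message to a handler by keyword. This is a sketch under stated assumptions, not the structure of the chatbot-news repository: `reply`, the keywords, and the placeholder answers are all hypothetical, and the lambdas stand in for real scraping functions.

```python
def reply(message):
    """Pick an answer by keyword; the handlers are stand-ins for scraping code."""
    handlers = [
        ("weather", lambda: "Today's weather: (scraped forecast goes here)"),
        ("recipe", lambda: "Recommended recipe: (scraped recipe goes here)"),
    ]
    text = message.lower()
    for keyword, handler in handlers:
        if keyword in text:
            return handler()
    # No keyword matched.
    return "Sorry, I don't understand yet."


weather_answer = reply("What's the weather today?")
recipe_answer = reply("What recipe do you recommend?")
fallback_answer = reply("Hello")
print(weather_answer)
print(recipe_answer)
print(fallback_answer)
```

Adding a new kind of question then means appending one keyword and one handler function to the list.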

  68.  course catalog (Basic / Advanced)

    - What is Python
    - Python basics
    - Modules and packages
    - Web scraping
    - Web server
    - Python and machine learning
  69.  enjoy your python world