Python Scraping and Web Apps for G's ACADEMY TOKYO

These are the Day 2 materials for the Python course held at G's Academy Tokyo. Day 2 covers web scraping and building web applications with Python.
Presented by http://www.yoheim.net

Yohei Munesada

August 30, 2017

Transcript

  1.  course catalog (Basic to Advanced)

    - What is Python?
    - Python basics
    - Modules and packages
    - Web scraping
    - Web servers
    - Python and machine learning
  2.  web scraping

        from urllib.request import urlopen
        from bs4 import BeautifulSoup

        # 1. Get the HTML.
        with urlopen("http://www.yoheim.net") as res:
            html = res.read().decode("utf-8")

        # 2. Load the HTML with BeautifulSoup.
        soup = BeautifulSoup(html, "html.parser")

        # 3. Get the items you want.
        titles = soup.select(".articleListItem h2")
        titles = [t.string for t in titles]
  3.  web scraping

        # Check results.
        from pprint import pprint
        pprint(titles[:4])

    Output (article titles as they appear on the site):

        ['[Linux] 最終更新日や最終アクセス日を指定して、ファイルを検索/削除する',
         '[フロントエンド] Yamlというデータ構造に入門する',
         '[Docker] DockerのインストールとLinux起動まで',
         '[Javascript] 絵文字(サロゲートペア)を含んだ文字列の文字数を正しく取得する']
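  4.  selectors in BeautifulSoup

    BeautifulSoup has three main ways to select elements: find, find_all, and select.

        # find: returns the first matching element.
        soup.find("h1")
        soup.find(id="header_subtitle")
        soup.find(class_="articleListItem")

        # find_all: returns a list of every matching element.
        soup.find_all("h2")
        soup.find_all(id="header_subtitle")
        soup.find_all(class_="pubDate")

        # select: takes a CSS selector and returns a list of matches.
        soup.select(".articleListItem h2")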
  5.  read / write a file

    This slide explains how to read and write files.

        # Read.
        f = open("my.txt")
        txt = f.read()
        f.close()

        # Write.
        f = open("my2.txt", "w")
        f.write("writewrite")
        f.close()

        # Automatic resource management: the with statement closes the file.
        with open("my.txt", "r") as f:
            txt = f.read()

        # Binary mode: read() returns bytes, so decode them to get a str.
        f = open("my.txt", "rb")
        txt = f.read().decode("utf-8")
        f.close()
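The difference between text mode and binary mode can be checked with a short round trip. The following is a minimal sketch rather than part of the original slides: the file name `sample.txt` and the temporary directory are hypothetical, and UTF-8 is assumed as the encoding.

```python
import os
import tempfile

# Hypothetical file inside a throwaway directory, so nothing real is overwritten.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")

# Write in text mode; the with statement closes the file automatically.
with open(path, "w", encoding="utf-8") as f:
    f.write("writewrite")

# Read back in text mode: the result is a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Read back in binary mode: the result is bytes until it is decoded.
with open(path, "rb") as f:
    raw = f.read()

print(type(text).__name__, type(raw).__name__)
print(raw.decode("utf-8") == text)
```

Passing `encoding` explicitly avoids depending on the platform default, which is one reason the binary-mode example in the slide decodes by hand.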
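  6.  case 1 : download all images

    - Get the HTML.
    - Check how to reach the img elements in the DOM.
    - Get the image URLs from the img tags.
    - (Convert the image URLs into a form that can be downloaded.)
    - Download the images one at a time.
    - Save each one to a file.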
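The URL-extraction and URL-conversion steps in this list might look like the sketch below. It is only an illustration under assumptions: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages, and the page URL, the markup, and the names `ImgSrcParser` and `absolute_image_urls` are all hypothetical. Downloading and saving are left out so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def absolute_image_urls(page_url, html):
    """Returns the image URLs found in html, resolved against page_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    # urljoin turns relative paths into absolute, downloadable URLs.
    return [urljoin(page_url, src) for src in parser.srcs]


# Hypothetical page URL and markup, used only for illustration.
page_url = "http://example.com/blog/index.html"
html = (
    '<img src="/img/a.png">'
    '<img src="b.jpg">'
    '<img src="http://cdn.example.com/c.gif">'
)
image_urls = absolute_image_urls(page_url, html)
print(image_urls)
```

Each resulting URL could then be fetched with `urlopen` and written to disk with a file opened in `"wb"` mode.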
  7.  case 2 : fake the request

        $ pip3 install --upgrade requests

        import requests

        r = requests.get("https://www.ebay.com/sch/sch/allcategories/all-categories")
        html = r.text

    Reference: how to use the requests module
    http://www.yoheim.net/blog.php?q=20170802
  8.  fake the request

        import requests

        url = "https://www.ebay.com/sch/sch/allcategories/all-categories"
        headers = {"user-agent": "Mozilla/5.0 (Macintosh..."}
        r = requests.get(url, headers=headers)
        html = r.text

    Here, the User-Agent request header is set.
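To see what setting the header does without sending anything, a request object can be built and inspected locally. The sketch below swaps in the standard library's `urllib.request.Request` for the requests module, purely so it runs without extra packages. The User-Agent value is kept truncated exactly as on the slide.

```python
from urllib.request import Request

url = "https://www.ebay.com/sch/sch/allcategories/all-categories"
# Truncated User-Agent string, copied as-is from the slide.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh..."}

# Building the Request does not open a connection; it only stores the settings.
req = Request(url, headers=headers)

# urllib normalizes header names to the form "User-agent".
print(req.get_header("User-agent"))
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request with that header attached.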
  9.  Supplement : installing PhantomJS

    Download PhantomJS from its site and place it in a directory that is on your PATH.

    Downloading PhantomJS
    Get the PhantomJS executable (binary) from the following URL.
    http://phantomjs.org/download.html

    Placing it on the PATH
    On a Mac, for example, copy it as follows.

        $ cd ~/Downloads/phantomjs-2.1.1-macosx/bin/
        $ cp phantomjs /usr/local/bin/
  10.  PhantomJS with selenium

        from selenium import webdriver
        from bs4 import BeautifulSoup

        # Note: webdriver.PhantomJS was deprecated in later Selenium 3 releases
        # and removed in Selenium 4.
        driver = webdriver.PhantomJS()
        driver.get("https://dokusho-ojikan.jp/original/#!top")
        html = driver.page_source

        # Get image urls.
        bs = BeautifulSoup(html, "html.parser")
        img_urls = [img.get("src") for img in bs.select("#unique-pickup img")]
        print(img_urls)

        # ScreenShot
        driver.save_screenshot("ss.png")
        driver.quit()
  11.  course catalog (Basic to Advanced)

    - What is Python?
    - Python basics
    - Modules and packages
    - Web scraping
    - Web servers
    - Python and machine learning
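  12.  major web app libraries

    Among web app libraries, Django and Flask are the two leading choices.

    - django: a heavyweight library used for large-scale development.
    - flask: an ultra-lightweight library that is very easy to use.
    - pyramid
    - bottle
    - …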
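  13.  major web app libraries

    Among web app libraries, Django and Flask are the two leading choices.

    - django: a heavyweight library used for large-scale development.
    - flask: an ultra-lightweight library that is very easy to use.
    - pyramid
    - bottle
    - …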
  14.  small start

    With Flask, starting a server takes very little code.

        from flask import Flask

        app = Flask(__name__)

        @app.route("/")
        def index():
            return "Hello from flask"

        if __name__ == "__main__":
            app.run()

        $ python3 app.py
  15.  routing

    Routing is done with a decorator (@app.route).

        @app.route("/")
        def index():
            return "Hello from flask"

        @app.route("/api/hello")
        def api_hello():
            return "api_hello"

        # <int:item_id> converts that part of the path to an int.
        @app.route("/api/items/<int:item_id>")
        def api2(item_id):
            return "item_id is %d" % item_id

    What is a decorator: http://www.yoheim.net/blog.php?q=20160607
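How a decorator can register a function under a path is easier to see in a toy version. The sketch below is not Flask's implementation and handles only fixed paths; `routes`, `route`, and `dispatch` are hypothetical names chosen for illustration.

```python
# Toy route table: maps a URL path to the function that handles it.
routes = {}


def route(path):
    """Returns a decorator that registers the decorated function under path."""
    def decorator(func):
        routes[path] = func
        return func  # the function itself is left unchanged
    return decorator


@route("/")
def index():
    return "Hello from flask"


@route("/api/hello")
def api_hello():
    return "api_hello"


def dispatch(path):
    """Looks up the handler registered for path and calls it."""
    handler = routes.get(path)
    if handler is None:
        return "404 Not Found"
    return handler()


print(dispatch("/api/hello"))
print(dispatch("/missing"))
```

The point is that `@route("/")` runs at definition time and records the function, so the framework can find it later when a request for that path arrives.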
  16.  GET and POST

    The HTTP method is specified in the decorator.

        from flask import Flask, request

        app = Flask(__name__)

        @app.route("/api/users", methods=["GET"])
        def api_users_get():
            # Query-string parameters are available through request.args.
            user_id = request.args.get("user_id")
            return "user_id is %s" % user_id

        @app.route("/api/users/<int:user_id>", methods=["POST"])
        def api_users_update(user_id):
            # Form-encoded body parameters are available through request.form.
            user_name = request.form.get("user_name")
            return "user_id=%d, username=%s" % (user_id, user_name)
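As a rough illustration of where the two kinds of parameters come from, the sketch below parses a query string and a form-encoded body with the standard library instead of Flask. The URL, the port, and the values `42` and `yohei` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# GET: parameters travel in the query string of the URL.
get_url = "http://localhost:5000/api/users?user_id=42"
query_params = parse_qs(urlsplit(get_url).query)

# POST: form parameters travel in the request body, in the same key=value format.
post_body = "user_name=yohei"
form_params = parse_qs(post_body)

# parse_qs returns a list per key because a key may be repeated.
print(query_params)
print(form_params)
```

In the Flask handlers above, `request.args` corresponds to the first dictionary and `request.form` to the second.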
  17.  template and static files

    This slide explains the template feature and serving static files.

    Directory layout:

        app.py
        templates
            index.html
        …

    app.py:

        from flask import Flask, render_template

        app = Flask(__name__)

        @app.route("/mypage")
        def mypage():
            title = "Hello G's members !!"
            return render_template("index.html", title=title)

    template:

        <html>
        <body>
          <h1>{{ title }}</h1>
        </body>
        </html>
  18.  case 1 : cookie

    Cookies are very easy to work with.

        from datetime import datetime
        from flask import Flask, make_response

        app = Flask(__name__)

        @app.route("/cookie")
        def cookie():
            # Contents
            response = make_response("OK")
            # Create cookie
            max_age = 60 * 60 * 24 * 30  # 30 days
            expires = int(datetime.now().timestamp()) + max_age
            response.set_cookie("gscookie", value="valval", expires=expires)
            # Response
            return response
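  19.  case 1 : cookie

    Cookies are very easy to work with.

        from flask import Flask, request

        app = Flask(__name__)

        @app.route("/get_from_cookie")
        def get_from_cookie():
            # Fall back to an empty string so the view never returns None
            # when the cookie has not been set yet.
            val = request.cookies.get("gscookie", "")
            return val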
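Under the hood, setting a cookie amounts to sending a `Set-Cookie` response header, and reading one amounts to parsing the `Cookie` request header. The sketch below shows both with the standard library's `http.cookies` rather than Flask. It uses `max-age` instead of `expires` for simplicity, and the `path` value is an assumption.

```python
from http.cookies import SimpleCookie

max_age = 60 * 60 * 24 * 30  # 30 days, as in the slide

# Build the cookie a server would send.
cookie = SimpleCookie()
cookie["gscookie"] = "valval"
cookie["gscookie"]["max-age"] = max_age
cookie["gscookie"]["path"] = "/"  # assumed path, not from the slide

# The value that would follow "Set-Cookie: " in the response.
header_value = cookie["gscookie"].OutputString()
print(header_value)

# Parse what a browser would send back in its Cookie request header.
received = SimpleCookie()
received.load("gscookie=valval")
print(received["gscookie"].value)
```

Flask's `response.set_cookie` and `request.cookies` wrap these two directions of the exchange.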
  20.  case 2 : session

    Sessions are also very easy to work with.

        from flask import Flask, session

        app = Flask(__name__)
        # The secret key is used to sign the session cookie.
        app.secret_key = 'my_special_secret_key'

        @app.route("/session")
        def session_sample():
            val = int(session.get("num", 1))
            session["num"] = val + 1
            return "This is visit number %d!" % val
  21.  case 3 : divide controllers

    Controllers can be split up using Blueprint.

        # api.py
        from flask import Blueprint

        app = Blueprint('api', __name__)

        @app.route('/api/hello')
        def hello():
            return "api_hello"
  22.  case 3 : divide controllers

    Controllers can be split up using Blueprint.

        # app.py
        from flask import Flask
        from api import app as api_app

        app = Flask(__name__)
        app.register_blueprint(api_app)

    Reference: http://www.yoheim.net/blog.php?q=20160507
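  23.  case 4 : logging

        import logging
        from flask import Flask

        app = Flask(__name__)
        # The logger's own level must also let INFO records through.
        app.logger.setLevel(logging.INFO)

        # INFO and above go to info.log.
        info_handler = logging.FileHandler('info.log')
        info_handler.setLevel(logging.INFO)
        app.logger.addHandler(info_handler)

        # ERROR and above go to error.log.
        error_handler = logging.FileHandler('error.log')
        error_handler.setLevel(logging.ERROR)
        app.logger.addHandler(error_handler)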
  24.  case 4 : logging

        @app.route("/logging")
        def logging_sample():
            app.logger.info('Info log...')
            app.logger.warning('Warning log...')
            app.logger.error('Error log...')
            try:
                1 / 0
            except ZeroDivisionError:
                # exception() logs at ERROR level and appends the traceback.
                app.logger.exception("Exception log...")
            # Response.
            return "ok"
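Which messages end up in which file can be checked with the standard `logging` module alone. The sketch below mirrors the two-handler setup without Flask. The logger name `logging_sketch` and the temporary directory are hypothetical, and the logger level is set explicitly because a record has to pass the logger's level before any handler sees it.

```python
import logging
import os
import tempfile

# Write the log files into a throwaway directory.
log_dir = tempfile.mkdtemp()
info_path = os.path.join(log_dir, "info.log")
error_path = os.path.join(log_dir, "error.log")

logger = logging.getLogger("logging_sketch")  # hypothetical logger name
logger.setLevel(logging.DEBUG)  # let every record reach the handlers
logger.propagate = False  # keep output out of the root logger

# INFO and above go to info.log.
info_handler = logging.FileHandler(info_path)
info_handler.setLevel(logging.INFO)
logger.addHandler(info_handler)

# ERROR and above go to error.log.
error_handler = logging.FileHandler(error_path)
error_handler.setLevel(logging.ERROR)
logger.addHandler(error_handler)

logger.info("Info log...")
logger.warning("Warning log...")
logger.error("Error log...")

# Detach and close the handlers so the files are flushed and released.
for handler in (info_handler, error_handler):
    logger.removeHandler(handler)
    handler.close()

with open(info_path) as f:
    info_lines = f.read().splitlines()
with open(error_path) as f:
    error_lines = f.read().splitlines()

print(info_lines)
print(error_lines)
```

Each handler filters independently, so an ERROR record is written to both files while INFO and WARNING records reach only the first.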
  25.  course catalog (Basic to Advanced)

    - What is Python?
    - Python basics
    - Modules and packages
    - Web scraping
    - Web servers
    - Python and machine learning