Slide 1

Slide 1 text

Scraping and Web Apps ~ From the world to you, from you to the world ~ Yohei Munesada

Slide 2

Slide 2 text

about me
Yohei Munesada
Mentor at G's ACADEMY TOKYO
http://www.yoheim.net

Slide 3

Slide 3 text

course catalog (Basic to Advanced)
- What is Python?
- Python basics
- Modules and packages
- Web scraping
- Web servers
- Python and machine learning

Slide 4

Slide 4 text

re: agenda
Basic 1
- Running a program
- Indentation style
- Conditional branching and loops
- Variable definition and data types
- Data structures (List, Dictionary, Set, Tuple)
- Functions
Basic 2

Slide 5

Slide 5 text

re: agenda
- Defining and using modules
- Using packages
- Using standard modules
- Using external modules

Slide 6

Slide 6 text

web scraping
Web scraping with BeautifulSoup

Slide 7

Slide 7 text

 web scraping steps ?

Slide 8

Slide 8 text

web scraping
With urllib.request and BeautifulSoup, scraping is easy. 3 steps:
1. Fetch the HTML from the server
2. Load the HTML with BeautifulSoup
3. Extract information from the DOM

Slide 9

Slide 9 text

web scraping

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # 1. Get the HTML.
    with urlopen("http://www.yoheim.net") as res:
        html = res.read().decode("utf-8")

    # 2. Load the HTML with BeautifulSoup.
    soup = BeautifulSoup(html, "html.parser")

    # 3. Get the items you want.
    titles = soup.select(".articleListItem h2")
    titles = [t.string for t in titles]

Slide 10

Slide 10 text

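web scraping

    # Check results.
    from pprint import pprint
    pprint(titles[:4])

    ['[Linux] 最終更新日や最終アクセス日を指定して、ファイルを検索/削除する',
     '[フロントエンド] Yamlというデータ構造に入門する',
     '[Docker] DockerのインストールとLinux起動まで',
     '[Javascript] 絵文字(サロゲートペア)を含んだ文字列の文字数を正しく取得する']

(The titles read: finding and deleting files by last-modified or last-access date; an introduction to the YAML data format; installing Docker and booting Linux; correctly counting characters in strings that contain emoji (surrogate pairs).)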

Slide 11

Slide 11 text

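selectors in BeautifulSoup
BeautifulSoup has three main ways to select elements: find, find_all, and select.

    # find: returns the first matching element.
    soup.find("h1")
    soup.find(id="header_subtitle")
    soup.find(class_="articleListItem")

    # find_all: returns every matching element.
    soup.find_all("h2")
    soup.find_all(id="header_subtitle")
    soup.find_all(class_="pubDate")

    # select: returns every element matching a CSS selector.
    soup.select(".articleListItem h2")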

Slide 12

Slide 12 text

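extractors in BeautifulSoup
There are two main ways to get a value out of a DOM element: its text and its attributes.

    # text: for an element whose content is "My Special App"
    elm.string

    # attribute: for an element that has a src attribute
    elm["src"]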
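The selection and extraction calls can be tried together on a small inline document. The sketch below is only an illustration: the HTML snippet, its ids, and its image paths are made up for the example and are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch.
SAMPLE_HTML = """
<html><body>
  <h1 id="header_title">My Special App</h1>
  <div class="articleListItem"><h2>First post</h2><img src="/img/a.png"></div>
  <div class="articleListItem"><h2>Second post</h2><img src="/img/b.png"></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find: first match only (None when nothing matches).
first_h1 = soup.find("h1")
# find_all: every match, as a list.
all_h2 = soup.find_all("h2")
# select: every match for a CSS selector.
images = soup.select(".articleListItem img")

# Text extraction with .string, attribute extraction with ["src"].
heading_text = first_h1.string
post_titles = [h2.string for h2 in all_h2]
image_srcs = [img["src"] for img in images]
print(heading_text, post_titles, image_srcs)
```

Note that .string is only reliable when the element contains a single text node, as it does here.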

Slide 13

Slide 13 text

practice for web scraping
Try a weather API.
The API below returns weather results as RSS (XML). Try extracting the information you want.
http://www.yoheim.net/blog.php?q=…
Type the code out yourself.

Slide 14

Slide 14 text

web scraping advance
Practical web scraping

Slide 15

Slide 15 text

But before that...

Slide 16

Slide 16 text

read / write a file
Reading and writing files.

    # Read.
    f = open("my.txt")
    txt = f.read()
    f.close()

    # Write.
    f = open("my2.txt", "w")
    f.write("writewrite")
    f.close()

    # Automatic resource management: the file is closed when the block exits.
    with open("my.txt", "r") as f:
        txt = f.read()

    # Binary mode: read bytes, then decode them to str.
    f = open("my.txt", "rb")
    txt = f.read().decode("utf-8")
    f.close()

Slide 17

Slide 17 text

 case 1 : download all images http://gsacademy.tokyo/mentor/

Slide 18

Slide 18 text

 case 1 : download all images ?

Slide 19

Slide 19 text

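case 1 : download all images
1. Get the HTML
2. Check how to reach the img elements in the DOM
3. Get the image URLs from the img tags
4. (Convert the image URLs into a downloadable form)
5. Download the images one at a time
6. Save each one to a file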
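The image-download flow can be sketched roughly as follows. This is not the deck's own solution (that is in the gist linked on the next slide): to stay dependency-free it swaps BeautifulSoup for the standard library's html.parser, and the class name, function names, sample markup, and example.com URL are all hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def extract_image_urls(html, page_url):
    """Find <img> tags, read src, and make each URL absolute."""
    parser = ImgSrcCollector()
    parser.feed(html)
    # urljoin turns relative paths into URLs that can be downloaded directly.
    return [urljoin(page_url, src) for src in parser.srcs]


def download_all(image_urls, out_dir="images"):
    """Fetch images one at a time and save each as a binary file."""
    os.makedirs(out_dir, exist_ok=True)
    for url in image_urls:
        filename = os.path.basename(urlsplit(url).path) or "index"
        with urlopen(url) as res, open(os.path.join(out_dir, filename), "wb") as f:
            f.write(res.read())


# Hypothetical markup standing in for the real page (no network access needed).
sample = '<img src="/img/a.png"><img src="b.jpg"><img alt="no src">'
urls = extract_image_urls(sample, "http://example.com/mentor/")
print(urls)
```

Calling download_all(urls) would perform the actual network requests; it is left uncalled here so the sketch runs offline.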

Slide 20

Slide 20 text

case 1 : download all images
https://gist.github.com/yoheiMune/33a3e1c2066fe3f4ba43b13ea7fa53fa

Slide 21

Slide 21 text

 case 2 : fake the request https://www.ebay.com/sch/sch/allcategories/all-categories

Slide 22

Slide 22 text

case 2 : fake the request

    $ pip3 install --upgrade requests

    import requests
    r = requests.get("https://www.ebay.com/sch/sch/allcategories/all-categories")
    html = r.text

Reference: how to use the requests module
http://www.yoheim.net/blog.php?q=20170802

Slide 23

Slide 23 text

Something is different...

Slide 24

Slide 24 text

 https://www.charlesproxy.com

Slide 25

Slide 25 text

 fake the request

Slide 26

Slide 26 text

fake the request

    import requests
    url = "https://www.ebay.com/sch/sch/allcategories/all-categories"
    headers = {"user-agent": "Mozilla/5.0 (Macintosh..."}
    r = requests.get(url, headers=headers)
    html = r.text

Here we set the User-Agent request header.
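To see which header value would actually go out without touching the network, a request object can be built and inspected first. The sketch below uses the standard library's urllib.request.Request instead of the requests module shown on the slide, and the User-Agent value is a placeholder rather than a real browser string.

```python
from urllib.request import Request

url = "https://www.ebay.com/sch/sch/allcategories/all-categories"
# Placeholder value; copy the real string from your own browser.
headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}

req = Request(url, headers=headers)

# urllib stores header names in capitalized form, hence "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)

# Passing `req` to urllib.request.urlopen would send it over the network.
```

A tool such as Charles, mentioned earlier in the deck, can then confirm what the server really receives.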

Slide 27

Slide 27 text

 case 3 : Javascript page rendering https://dokusho-ojikan.jp/original

Slide 28

Slide 28 text

 case 3 : Javascript page rendering BeautifulSoup Selenium PhantomJS x x

Slide 29

Slide 29 text

 http://www.seleniumhq.org/

Slide 30

Slide 30 text

 http://phantomjs.org/

Slide 31

Slide 31 text

case 3 : Javascript page rendering
Install PhantomJS and Selenium.
PhantomJS: http://phantomjs.org/
Selenium:

    $ pip3 install --upgrade selenium

Slide 32

Slide 32 text

Supplement : installing PhantomJS
Download PhantomJS from its site and place it in a directory that is on your PATH.
1. Download PhantomJS
   Get the PhantomJS executable (binary) from the URL below.
   http://phantomjs.org/download.html
2. Place it in a directory on your PATH
   On a Mac, for example, copy it as follows.

    $ cd ~/Downloads/phantomjs-2.1.1-macosx/bin/
    $ cp phantomjs /usr/local/bin/

Slide 33

Slide 33 text

    from selenium import webdriver
    from bs4 import BeautifulSoup

    # Requires the phantomjs binary to be on PATH.
    driver = webdriver.PhantomJS()
    driver.get("https://dokusho-ojikan.jp/original/#!top")
    html = driver.page_source

    # Get image urls.
    bs = BeautifulSoup(html, "html.parser")
    img_urls = [img.get("src") for img in bs.select("#unique-pickup img")]
    print(img_urls)

    # Screenshot.
    driver.save_screenshot("ss.png")
    driver.quit()

Slide 34

Slide 34 text

practice for web scraping
Scrape a web page you like. Examples:
- Hatena Bookmark
- Yahoo News
- Livedoor News
- and so on
Type the code out yourself.

Slide 35

Slide 35 text

course catalog (Basic to Advanced)
- What is Python?
- Python basics
- Modules and packages
- Web scraping
- Web servers
- Python and machine learning

Slide 36

Slide 36 text

web application
Web applications with Flask

Slide 37

Slide 37 text

agenda
- Major web app libraries
- What is Flask?
- Installing Flask
- Small start
- Routing
- GET and POST
- Templates and static files

Slide 38

Slide 38 text

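major web app libraries
Among Python web app libraries, Django and Flask are the two leaders.
- django: a heavyweight library used for large-scale development.
- flask: an ultra-lightweight library that is very easy to pick up.
- pyramid
- bottle
- …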

Slide 39

Slide 39 text

 major web app libraries https://goo.gl/DVu9SE

Slide 40

Slide 40 text

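major web app libraries
Among Python web app libraries, Django and Flask are the two leaders.
- django: a heavyweight library used for large-scale development.
- flask: an ultra-lightweight library that is very easy to pick up.
- pyramid
- bottle
- …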

Slide 41

Slide 41 text

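Flask is
- A microframework
- Centered on request/response handling
- Simple, but functionally sufficient
- You design the application structure yourself
- You also provide things like database access yourself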

Slide 42

Slide 42 text

 https://www.getpostman.com/

Slide 43

Slide 43 text

install
Flask can be installed via pip.

    $ pip3 install --upgrade Flask

Slide 44

Slide 44 text

small start
With Flask, starting a server is easy.

    # app.py
    from flask import Flask
    app = Flask(__name__)

    @app.route("/")
    def index():
        return "Hello from flask"

    if __name__ == "__main__":
        app.run()

    $ python3 app.py

Slide 45

Slide 45 text

routing
Routing uses a decorator (@app.route).

    @app.route("/")
    def index():
        return "Hello from flask"

    @app.route("/api/hello")
    def api_hello():
        return "api_hello"

    # <int:item_id> captures the path segment and passes it in as an int.
    @app.route("/api/items/<int:item_id>")
    def api2(item_id):
        return "item_id is %d" % item_id

About decorators: http://www.yoheim.net/blog.php?q=20160607

Slide 46

Slide 46 text

GET and POST
The HTTP method is specified in the decorator.

    from flask import Flask, request
    app = Flask(__name__)

    @app.route("/api/users", methods=["GET"])
    def api_users_get():
        # Query-string parameters are available via request.args.
        user_id = request.args.get("user_id")
        return "user_id is %s" % user_id

    @app.route("/api/users/<int:user_id>", methods=["POST"])
    def api_users_update(user_id):
        # Form-encoded body fields are available via request.form.
        user_name = request.form.get("user_name")
        return "user_id=%d, username=%s" % (user_id, user_name)
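A single route can also accept both methods and branch on request.method. The following is a minimal sketch under assumed names (the route, the response strings, and the sample values are illustrative); it uses Flask's built-in test client so it runs without starting a server.

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/api/users", methods=["GET", "POST"])
def api_users():
    if request.method == "POST":
        # POST: values arrive in the form-encoded request body.
        user_name = request.form.get("user_name", "")
        return "created user_name=%s" % user_name
    # GET: values arrive in the query string.
    user_id = request.args.get("user_id", "")
    return "user_id is %s" % user_id


# Exercise both methods in-process, without starting a server.
client = app.test_client()
get_body = client.get("/api/users?user_id=42").get_data(as_text=True)
post_body = client.post("/api/users", data={"user_name": "taro"}).get_data(as_text=True)
print(get_body)
print(post_body)
```

Against a running server, a tool such as Postman can send the same two requests.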

Slide 47

Slide 47 text

template and static files
Template rendering and static file delivery.

    app.py
    templates/
        index.html
    …

    # app.py
    from flask import Flask, render_template
    app = Flask(__name__)

    @app.route("/mypage")
    def mypage():
        title = "Hello G's members !!"
        return render_template("index.html", title=title)

    template (templates/index.html):
    {{ title }}

Slide 48

Slide 48 text

template and static files
Template rendering and static file delivery.

    app.py
    static/
        main.js
    …

static files: no code needed! Flask serves files placed under static/ without any route definition.

Slide 49

Slide 49 text

practice for flask application
Supporting GET and POST
Make a single API respond to both the GET method and the POST method.
Type the code out and run it.

Slide 50

Slide 50 text

web application advance
Web applications with Flask

Slide 51

Slide 51 text

case 1 : cookie
Cookies are very easy to handle.

    from datetime import datetime
    from flask import Flask, make_response
    app = Flask(__name__)

    @app.route("/cookie")
    def cookie():
        # Contents
        response = make_response("OK")
        # Create cookie
        max_age = 60 * 60 * 24 * 30  # 30 days
        expires = int(datetime.now().timestamp()) + max_age
        response.set_cookie("gscookie", value="valval", expires=expires)
        # Response
        return response

Slide 52

Slide 52 text

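case 1 : cookie
Cookies are very easy to handle.

    from flask import Flask, request
    app = Flask(__name__)

    @app.route("/get_from_cookie")
    def get_from_cookie():
        # A view must not return None, so fall back to "" when the cookie is missing.
        val = request.cookies.get("gscookie", "")
        return val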
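The set-then-read round trip can be checked in one process with Flask's test client, which keeps cookies between requests much like a browser. This is a sketch, not the deck's code: it uses max_age rather than expires, and the "(no cookie)" fallback string is an assumption added for the example.

```python
from flask import Flask, make_response, request

app = Flask(__name__)


@app.route("/cookie")
def cookie():
    response = make_response("OK")
    # max_age is the cookie lifetime in seconds (30 days here).
    response.set_cookie("gscookie", value="valval", max_age=60 * 60 * 24 * 30)
    return response


@app.route("/get_from_cookie")
def get_from_cookie():
    # Fall back to a marker string when the cookie has not been set yet.
    return request.cookies.get("gscookie", "(no cookie)")


client = app.test_client()  # keeps cookies between requests
before = client.get("/get_from_cookie").get_data(as_text=True)
set_header = client.get("/cookie").headers.get("Set-Cookie")
after = client.get("/get_from_cookie").get_data(as_text=True)
print(before, "->", after)
```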

Slide 53

Slide 53 text

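case 2 : session
Sessions are also very easy to handle.

    from flask import Flask, session
    app = Flask(__name__)
    # The session cookie is signed with this key; keep it secret in real apps.
    app.secret_key = 'my_special_secret_key'

    @app.route("/session")
    def session_sample():
        val = int(session.get("num", 1))
        session["num"] = val + 1
        return "This is visit number %d!" % val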

Slide 54

Slide 54 text

case 3 : divide controllers
Controllers can be split up using Blueprint.

    # api.py
    from flask import Blueprint
    app = Blueprint('api', __name__)

    @app.route('/api/hello')
    def hello():
        return "api_hello"

Slide 55

Slide 55 text

case 3 : divide controllers
Controllers can be split up using Blueprint.

    # app.py
    from api import app as api_app
    app.register_blueprint(api_app)

Reference: http://www.yoheim.net/blog.php?q=20160507
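For a quick experiment, the blueprint and the application can live in one file. The sketch below is an assumption-laden variant of the two-file layout: it names the blueprint object api and uses url_prefix so the route itself is just "/hello".

```python
from flask import Blueprint, Flask

# In a real project this blueprint would live in its own module (api.py).
api = Blueprint("api", __name__, url_prefix="/api")


@api.route("/hello")
def hello():
    return "api_hello"


app = Flask(__name__)
# Registering the blueprint attaches all of its routes to the application.
app.register_blueprint(api)

client = app.test_client()
response = client.get("/api/hello")
status = response.status_code
body = response.get_data(as_text=True)
print(status, body)
```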

Slide 56

Slide 56 text

case 4 : logging
In production web apps, log output is an important concern.
Purpose for logging ?

Slide 57

Slide 57 text

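case 4 : logging
Design log output with its purpose in mind:
- Confirming that the application is running normally
- Investigating the cause when a failure occurs
- Security auditing
- and so on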

Slide 58

Slide 58 text

 https://docs.python.jp/3/library/logging.html

Slide 59

Slide 59 text

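case 4 : logging

    import logging
    from flask import Flask
    app = Flask(__name__)

    # Without this, the logger's default threshold drops INFO records
    # before they ever reach the handlers.
    app.logger.setLevel(logging.INFO)

    # info.log receives INFO and above.
    info_handler = logging.FileHandler('info.log')
    info_handler.setLevel(logging.INFO)
    app.logger.addHandler(info_handler)

    # error.log receives ERROR and above only.
    error_handler = logging.FileHandler('error.log')
    error_handler.setLevel(logging.ERROR)
    app.logger.addHandler(error_handler)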

Slide 60

Slide 60 text

case 4 : logging

    @app.route("/logging")
    def logging_sample():
        app.logger.info('Info log...')
        app.logger.warning('Warning log...')
        app.logger.error('Error log...')
        try:
            1 / 0
        except ZeroDivisionError:
            # exception() logs at ERROR level and includes the traceback.
            app.logger.exception("Exception log...")
        # Response.
        return "ok"
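How handler levels split the output can be seen with the standard logging module alone. In this sketch the logger name is hypothetical and in-memory streams stand in for info.log and error.log, so nothing is written to disk.

```python
import io
import logging

logger = logging.getLogger("logging_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)   # the logger itself must let INFO through
logger.propagate = False        # keep the demo output out of the root logger

info_stream, error_stream = io.StringIO(), io.StringIO()

info_handler = logging.StreamHandler(info_stream)
info_handler.setLevel(logging.INFO)      # receives INFO and above
error_handler = logging.StreamHandler(error_stream)
error_handler.setLevel(logging.ERROR)    # receives ERROR and above only

logger.addHandler(info_handler)
logger.addHandler(error_handler)

logger.info("Info log...")
logger.warning("Warning log...")
logger.error("Error log...")
try:
    1 / 0
except ZeroDivisionError:
    # exception() logs at ERROR level and appends the traceback.
    logger.exception("Exception log...")

info_lines = info_stream.getvalue().splitlines()
error_lines = error_stream.getvalue().splitlines()
print(len(info_lines), len(error_lines))
```

Every record reaches the INFO handler, while the ERROR handler sees only the error and the exception record.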

Slide 61

Slide 61 text

case 5 : deployment
Deploying web apps to a server takes a bit of effort.
http://www.yoheim.net/blog.php?q=20170206

Slide 62

Slide 62 text

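practice for flask application
Type the code out and run it.
Read the tutorial.
Flask has many more features than the ones introduced here. Read the tutorial page below to learn more (it is in English, so give it your best!).
http://flask.pocoo.org/docs/…/quickstart/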

Slide 63

Slide 63 text

1st demo app
Let's build a web chatbot

Slide 64

Slide 64 text

today's demo
https://goo.gl/0v5dSj

Slide 65

Slide 65 text

how it works
The demo combines web scraping with a Flask application.
Components: http://localhost:5000, index.html, /api/recommend_articles, web scraping
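One way the pieces might fit together is an API route that calls a scraping function and returns its result as JSON for the page to display. The sketch below is not the demo repository's code: scrape_article_titles and its fixed return value are hypothetical stand-ins so the example runs offline, and only the /api/recommend_articles path comes from the slide.

```python
from flask import Flask, jsonify

app = Flask(__name__)


def scrape_article_titles():
    """Hypothetical stand-in for the scraping step.

    A real version would fetch a page and parse it with BeautifulSoup;
    fixed data keeps this sketch runnable offline.
    """
    return ["Article A", "Article B"]


@app.route("/api/recommend_articles")
def recommend_articles():
    # The front end (index.html) would call this endpoint and render the result.
    return jsonify({"articles": scrape_article_titles()})


client = app.test_client()
payload = client.get("/api/recommend_articles").get_json()
print(payload)
```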

Slide 66

Slide 66 text

how to create
1. Clone the repository
2. Load the list of libraries
3. Start it up
4. Implement the web scraping logic
5. Test that it works
https://github.com/yoheimune-python-lecture/chatbot-news

Slide 67

Slide 67 text

extends the app
- Make it answer "What's the weather today?"
- Make it answer "What recipe do you recommend?"
- Beyond that, modify it however you like

Slide 68

Slide 68 text

course catalog (Basic to Advanced)
- What is Python?
- Python basics
- Modules and packages
- Web scraping
- Web servers
- Python and machine learning

Slide 69

Slide 69 text

 enjoy your python world