Slide 1

Slide 1 text

Let's Analyze Hiroshima's Private Lodging (Minpaku) Data! 2016/11/12 Hironori Sekine PyCon mini Hiroshima 2016

Slide 2

Slide 2 text

Self-introduction • Hironori Sekine • SQUEEZE Inc. • Twitter: @checkpoint

Slide 3

Slide 3 text

Involvement with Python (1) • PyCon JP 2014 staff • PyCon JP 2015 vice chair (program) • PyCon JP 2016 staff • Python mokumoku-kai, a quiet co-working meetup (organizer)

Slide 4

Slide 4 text

Involvement with Python (2) • LLDiver • PyCon JP 2014 • Phone Symposium Tokyo 2015 • PyCon mini Hiroshima 2015 • PyCon mini Hiroshima 2016 • "Pythonエンジニア養成読本" (a Python engineer training guide; co-author)

Slide 5

Slide 5 text

Python at work • A service for managing and operating minpaku listings

Slide 6

Slide 6 text

Python at work • A service for analyzing minpaku listings

Slide 7

Slide 7 text

Technologies in use

Slide 8

Slide 8 text

Agenda • Minpaku in Hiroshima Prefecture • Collecting data with Python • Analyzing data with Python

Slide 9

Slide 9 text

What is minpaku? Staying overnight in an ordinary private home (it takes various forms)

Slide 10

Slide 10 text

Platforms COPYRIGHT (C) 2014-2016 SQUEEZE Inc. ALL RIGHTS RESERVED.

Slide 11

Slide 11 text

Platforms (Japan)

Slide 12

Slide 12 text

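Minpaku in Hiroshima Prefecture • The largest city in the Chugoku and Shikoku regions (Hiroshima City) • Rich tourism resources, including World Heritage sites • The visit by US President Obama • The Hiroshima Carp winning the Central League pennant • Minpaku should be picking up too!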

Slide 13

Slide 13 text

Analyzing minpaku data • Collecting the data • Analyzing the data • Displaying the data

Slide 14

Slide 14 text

Collecting data • Crawling • Scraping • Statistical data • Statistics Bureau, Ministry of Internal Affairs and Communications • Data catalog site

Slide 15

Slide 15 text

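Crawling • The English word means "to creep, to move slowly" • Follows the links found on web pages • Downloads and collects the content of web pages • Sometimes retrieves data from a Web API instead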
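A minimal sketch of the link-following idea, assuming a tiny in-memory "site" so that it runs without network access. The `SITE` mapping and the `crawl` helper are hypothetical names introduced for illustration; a real crawler would download each page and extract its links instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the URLs it links to.
SITE = {
    "/": ["/search", "/about"],
    "/search": ["/rooms/1", "/rooms/2", "/"],
    "/rooms/1": ["/search"],
    "/rooms/2": ["/rooms/1"],
    "/about": [],
}


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl: visit each reachable URL once, in discovery order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Skip URLs that are already queued or visited.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda url: SITE.get(url, []))
print(visited)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and `max_pages` is a simple safety limit.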

Slide 16

Slide 16 text

Scraping • The English word means "to scrape off" • Extracts the needed information from the content of a page

Slide 17

Slide 17 text

A "batteries included" language + powerful third-party libraries

Slide 18

Slide 18 text

Useful libraries • Standard library • requests • BeautifulSoup • Scrapy • Selenium

Slide 19

Slide 19 text

Standard library • Python's standard library is very rich • Networking, regular expressions, etc. • All you need is a Python interpreter • Practical enough for simple scraping

Slide 20

Slide 20 text

Sample
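The slide's own sample code is not part of this transcript. As a stand-in, here is a minimal sketch of simple scraping with only the standard library (`html.parser`, `urllib`). The `LinkExtractor` class, the helper names, the `example.com` URLs and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect an absolute URL for every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10.0):
    """Download a page; UTF-8 is an assumed encoding, not detected."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page content, used instead of a real download.
SAMPLE_HTML = (
    '<ul><li><a href="/rooms/1">Room 1</a></li>'
    '<li><a href="https://example.com/rooms/2">Room 2</a></li></ul>'
)
links = extract_links(SAMPLE_HTML, "https://example.com/search")
print(links)
```

For a live page, `extract_links(fetch(url), url)` would combine the two helpers.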

Slide 21

Slide 21 text

Requests • An HTTP client written in Python • A human-friendly interface • Above all, easy to understand • Simple yet powerful

Slide 22

Slide 22 text

Sample from the official site

Slide 23

Slide 23 text

Sample (requests version)
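The slide's code was not captured in this transcript. The sketch below shows what a requests-based fetch might look like, assuming the third-party `requests` package is installed. The helper names `fetch_html` and `build_url` and the `example.com` URL are hypothetical; `build_url` only prepares a request, so it can be tried without network access.

```python
import requests


def fetch_html(url, params=None, timeout=10.0):
    """GET a page and return its body as text, raising on HTTP errors."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def build_url(url, params):
    """Show the final URL requests would send, without sending anything."""
    return requests.Request("GET", url, params=params).prepare().url


print(build_url("https://example.com/search", {"q": "hiroshima"}))
```

Passing query parameters as a dictionary lets requests handle URL encoding, and an explicit `timeout` stops a crawl from hanging on a slow server.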

Slide 24

Slide 24 text

Beautiful Soup • A library that has existed since around 2004 • Extracts data from HTML and XML • The latest version is the Beautiful Soup 4 series • Supports Python 2.7 and Python 3.2

Slide 25

Slide 25 text

Sample
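The slide's code is not included in this transcript. As an illustration only, the sketch below extracts fields from HTML with Beautiful Soup 4, assuming the third-party `bs4` package is installed. The markup, the CSS class names and the listing values are invented sample data, not taken from any real site.

```python
from bs4 import BeautifulSoup

# Invented sample markup; a real page would be downloaded first.
SAMPLE_HTML = """
<div class="listing"><h2 class="title">Listing A</h2><span class="price">5,400</span></div>
<div class="listing"><h2 class="title">Listing B</h2><span class="price">9,800</span></div>
"""


def parse_listings(html):
    """Return one dict per listing block, with the price as an integer."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for node in soup.select("div.listing"):
        title = node.select_one(".title").get_text(strip=True)
        price_text = node.select_one(".price").get_text(strip=True)
        listings.append({"title": title, "price": int(price_text.replace(",", ""))})
    return listings


listings = parse_listings(SAMPLE_HTML)
print(listings)
```

The built-in `"html.parser"` backend is used here so that no extra parser needs to be installed.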

Slide 26

Slide 26 text

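Scrapy: Scrapy is a fast, high-level scraping and crawling framework. It is used to crawl websites and extract structured data. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.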

Slide 27

Slide 27 text

Features of Scrapy • A crawling and scraping framework • Influenced by Django (middleware, etc.) • Provides all the functionality needed for scraping • Well documented

Slide 28

Slide 28 text

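Main functions of Scrapy • Download, extract, save • Caching of downloaded documents • A powerful command-line shell • Parsing of robots.txt • Asynchronous, concurrent downloads (using Twisted) • Crawl-interval adjustment per domain or IP address • Retry on error • Log output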
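Several of these functions are switched on through a project's `settings.py`. The fragment below is a sketch with illustrative values, not the configuration used in the talk; the setting names are standard Scrapy settings, and the project name is taken from the `startproject` example.

```python
# settings.py of a Scrapy project (all values are illustrative assumptions)
BOT_NAME = "scrapy_sample"

ROBOTSTXT_OBEY = True               # fetch robots.txt and respect its rules
HTTPCACHE_ENABLED = True            # cache downloaded responses on disk
DOWNLOAD_DELAY = 1.0                # seconds to wait between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap on parallel requests per domain
RETRY_ENABLED = True                # retry failed requests
RETRY_TIMES = 2                     # retries after the first failed attempt
LOG_LEVEL = "INFO"                  # verbosity of the log output
```

A conservative delay and a low per-domain concurrency keep the load on the target site small.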

Slide 29

Slide 29 text

Development steps • Create a Scrapy project • Create a Spider (link extraction, downloading) • Save the data with an Item pipeline

Slide 30

Slide 30 text

Creating the project $ scrapy startproject scrapy_sample

Slide 31

Slide 31 text

Sample

Slide 32

Slide 32 text

Creating a Spider (from the official site)

Slide 33

Slide 33 text

Running it $ scrapy crawl dmoz_spider -o scraped_data.json

Slide 34

Slide 34 text

Details • Introduction to Scrapy (1) • Introduction to Scrapy (2)

Slide 35

Slide 35 text

Sample (1)

Slide 36

Slide 36 text

Sample (2)

Slide 37

Slide 37 text

A real-world example • Analyzing minpaku data for Hiroshima Prefecture • Listing information • Price information

Slide 38

Slide 38 text

Listing information (from a certain site)

Slide 39

Slide 39 text

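Development flow • Create spiders that fetch listing and price information • The spiders save temporary data (JSON) • A batch job cleanses and saves the listings and prices • An aggregation batch analyzes the data and saves the results to a DB • Display the aggregated data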
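The flow above can be sketched end to end with the standard library. This is an illustration under stated assumptions, not the production pipeline from the talk: the JSON records, the field names, the table names and the use of an in-memory SQLite database are all hypothetical.

```python
import json
import sqlite3

# Hypothetical spider output: one JSON object per scraped price record.
RAW_JSON = """
[
  {"listing_id": "a1", "date": "2016-11-12", "price": "8,000"},
  {"listing_id": "b2", "date": "2016-11-12", "price": "6000"},
  {"listing_id": "c3", "date": "2016-11-12", "price": ""},
  {"listing_id": "a1", "date": "2016-11-13", "price": "5,000"}
]
"""


def cleanse(records):
    """Drop records without a usable price and convert prices to integers."""
    cleaned = []
    for record in records:
        price_text = str(record.get("price", "")).replace(",", "").strip()
        if not price_text.isdigit():
            continue
        cleaned.append((record["listing_id"], record["date"], int(price_text)))
    return cleaned


def aggregate(rows):
    """Save cleansed rows, build a per-date summary table, and return it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (listing_id TEXT, date TEXT, price INTEGER)")
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE daily_summary AS "
        "SELECT date, COUNT(DISTINCT listing_id) AS listings, AVG(price) AS avg_price "
        "FROM prices GROUP BY date"
    )
    summary = conn.execute(
        "SELECT date, listings, avg_price FROM daily_summary ORDER BY date"
    ).fetchall()
    conn.close()
    return summary


summary = aggregate(cleanse(json.loads(RAW_JSON)))
for date, listings, avg_price in summary:
    print(date, listings, avg_price)
```

Keeping the raw JSON separate from the cleansed table means the cleansing rules can be changed and rerun without crawling again.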

Slide 40

Slide 40 text

Demo

Slide 41

Slide 41 text

[Chart: number of listings; y-axis 0 to 500, x-axis weekly dates from 2016/2/15 into December 2016]

Slide 42

Slide 42 text

Number of listings • 461 listings • Roughly doubled in one year (240 => 461) • 10th most in Japan • Tokyo, Osaka, Kyoto, Okinawa, Hokkaido, Fukuoka, Kanagawa, Nagano, Aichi, Chiba, Hiroshima

Slide 43

Slide 43 text

[Chart: average price; y-axis 0 to 12000]

Slide 44

Slide 44 text

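Average price • The average price is low in winter (in the 5,000 yen range) • Weekends in August, October and November are high (in the 8,000 yen range) • The year-end and New Year holidays are the peak (in the 10,000 yen range)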

Slide 45

Slide 45 text

[Chart: occupancy rate; y-axis 0 to 100]

Slide 46

Slide 46 text

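Occupancy rate • The summer season is the peak (just under 80%) • Weekends in October and November are high (70% or more) • The winter season is low (40% or less) • 10/15 (Sat) and 10/29 (Sat) were high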
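The talk does not state how the occupancy rate was computed. One common definition, assumed here, is booked listing-nights divided by all listing-nights on a given date, expressed as a percentage. The calendar records below are invented sample data and `occupancy_by_date` is a hypothetical helper.

```python
from collections import defaultdict

# Invented calendar records: (listing_id, date, is_booked).
CALENDAR = [
    ("a1", "2016-10-15", True),
    ("b2", "2016-10-15", True),
    ("c3", "2016-10-15", False),
    ("a1", "2016-10-16", False),
    ("b2", "2016-10-16", True),
    ("c3", "2016-10-16", False),
    ("d4", "2016-10-16", False),
]


def occupancy_by_date(calendar):
    """Return {date: percentage of listings that are booked on that date}."""
    booked = defaultdict(int)
    total = defaultdict(int)
    for _listing_id, date, is_booked in calendar:
        total[date] += 1
        if is_booked:
            booked[date] += 1
    return {date: 100.0 * booked[date] / total[date] for date in sorted(total)}


rates = occupancy_by_date(CALENDAR)
for date, rate in rates.items():
    print(date, round(rate, 1))
```

With scraped data, a date shown as unavailable is often treated as booked, which can overstate occupancy when hosts simply block their calendars.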

Slide 47

Slide 47 text

Summary • There are many approaches to scraping with Python. • Scrapy is recommended because it takes care of the tedious parts. • Minpaku in Hiroshima should keep growing!

Slide 48

Slide 48 text

Thank you for your attention.