Vacation Rentals of Hiroshima
hsekine
November 12, 2016
Programming
Let's analyze Hiroshima's vacation rental (minpaku) data!
Transcript
Let's Analyze Hiroshima's Vacation Rental Data! 2016/11/12 Hironori Sekine, PyCon mini Hiroshima 2016
Self-introduction • Hironori Sekine • SQUEEZE Inc. • Twitter: @checkpoint
My involvement with Python (1) • PyCon JP 2014 staff • PyCon JP 2015 vice chair (program) • PyCon JP 2016 staff • Python Mokumoku-kai (organizer)
My involvement with Python (2) • LLDiver • PyCon JP 2014 • Phone Symposium Tokyo 2015 • PyCon mini Hiroshima 2015 • PyCon mini Hiroshima 2016 • Pythonエンジニア養成読本 (co-author)
Python at work • A service for managing and operating vacation rental properties
Python at work • A service for analyzing vacation rental properties
Technologies we use
Agenda • Vacation rentals in Hiroshima Prefecture • Collecting data with Python • Analyzing data with Python
What is minpaku? Staying overnight in an ordinary private home (in various forms)
Platforms COPYRIGHT (C) 2014-2016 SQUEEZE Inc. ALL RIGHTS RESERVED.
Platforms (Japan)
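Vacation rentals in Hiroshima Prefecture • The largest city in the Chugoku and Shikoku regions (Hiroshima City) • Rich tourism resources, including World Heritage sites • U.S. President Obama's visit • The Hiroshima Carp's Central League championship • Vacation rentals should be booming too!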
Analyzing vacation rental data • Collecting the data • Analyzing the data • Displaying the data
Collecting data • Crawling • Scraping • Statistical data • Statistics Bureau, Ministry of Internal Affairs and Communications • Data catalog sites
Crawling • The English word means "to creep, to move slowly" • Follow the links in web pages • Download and collect the content of web pages • Sometimes the data comes from a Web API
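The link-following idea can be sketched with the standard library alone. This is a minimal illustration, not code from the talk: the `crawl` function, the `fetch` callable, and the in-memory `SITE` dictionary are all assumptions made so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl from start_url, staying on the same host.

    `fetch` is any callable mapping a URL to HTML text (hypothetical here;
    in practice it would wrap an HTTP client).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# In-memory stand-in for a website (made-up pages).
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="http://other.example/x">X</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="b">B</a>',
    "http://example.com/b": "<p>leaf page</p>",
}

pages = crawl("http://example.com/", SITE.__getitem__)
print(sorted(pages))
```

The off-site link is skipped and each page is visited once, which is the part of crawling that frameworks usually handle for you.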
Scraping • The English word means "to scrape off" • Extract the needed information from page content
A batteries-included language + powerful third-party libraries
Useful libraries • Standard library • requests • BeautifulSoup • Scrapy • Selenium
Standard library • Python's standard library is very rich • Networking, regular expressions, etc. • All you need is a Python interpreter • Practical enough for simple scraping
Sample
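The code on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of standard-library scraping with `urllib.request` and `re`. The helper names, the sample HTML, and the assumption that pages are UTF-8 are all illustrative, not taken from the talk.

```python
import re
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a page and decode it as UTF-8 (an assumed encoding)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Return the text of the first <title> element, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def extract_prices(html):
    """Return every yen amount written like '5,000円' as an integer."""
    return [int(m.replace(",", "")) for m in re.findall(r"(\d[\d,]*)円", html)]


# Made-up markup so the sketch runs offline; fetch(url) would supply real HTML.
SAMPLE_HTML = """
<html><head><title> Hiroshima listings </title></head>
<body><li>Room A: 5,000円</li><li>Room B: 8000円</li></body></html>
"""

title = extract_title(SAMPLE_HTML)
prices = extract_prices(SAMPLE_HTML)
print(title, prices)
```

Regular expressions are fine for simple, regular markup like this; for anything nested or irregular an HTML parser is the safer choice.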
Requests • An HTTP client for Python • A human-friendly interface • Above all, easy to understand • Simple yet powerful
Sample from the official site
Sample (requests version)
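The original requests sample is not in the transcript. The sketch below shows the general shape of a requests call under some assumptions: the endpoint URL and the `area` and `page` parameters are hypothetical, and the request is only prepared at the bottom (not sent) so the example runs offline.

```python
import requests


def build_search_request(base_url, area, page):
    """Prepare (but do not send) a GET request with query parameters."""
    request = requests.Request("GET", base_url, params={"area": area, "page": page})
    return request.prepare()


def fetch_json(base_url, area, page, timeout=10):
    """Send the same request and decode a JSON body, raising on HTTP errors."""
    response = requests.get(base_url, params={"area": area, "page": page}, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Hypothetical endpoint; inspecting the prepared request needs no network.
prepared = build_search_request("https://example.com/api/listings", "hiroshima", 2)
print(prepared.method, prepared.url)
```

Passing `params` as a dictionary lets requests handle the URL encoding, and an explicit `timeout` keeps a crawler from hanging on a slow server.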
Beautiful Soup • A library that has existed since around 2004 • Extracts data from HTML and XML • The latest version is the Beautiful Soup 4 series • Supports Python 2.7 and Python 3.2
Sample
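Scrapy • Scrapy is a fast, high-level scraping and crawling framework. It is used to crawl websites and extract structured data. It serves a wide range of purposes, from data mining to monitoring and automated testing.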
Features of Scrapy • A crawling and scraping framework • Influenced by Django (middleware, etc.) • Has all the features needed for scraping • Well documented
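Main features of Scrapy • Download, extract, store • Caching of downloaded documents • A powerful command-line shell • robots.txt parsing • Asynchronous, concurrent downloads (using Twisted) • Crawl-interval throttling per domain or IP address • Retry on error • Log output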
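Several of these features are switched on through a project's `settings.py`. The fragment below is a sketch of how that might look; the setting names are Scrapy's, but every value is an illustrative assumption rather than a configuration from the talk.

```python
# settings.py of a Scrapy project (all values are illustrative assumptions)

ROBOTSTXT_OBEY = True                # parse robots.txt and respect its rules

DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests per domain

RETRY_ENABLED = True                 # retry requests that fail
RETRY_TIMES = 2                      # retries in addition to the first attempt

HTTPCACHE_ENABLED = True             # cache downloaded responses on disk
HTTPCACHE_EXPIRATION_SECS = 3600     # treat cached responses as stale after one hour

LOG_LEVEL = "INFO"                   # verbosity of log output
```

A conservative delay and a low concurrency cap are a reasonable starting point when crawling a site you do not operate.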
Development steps • Create a Scrapy project • Create a Spider (link extraction, download) • Save data with an Item pipeline
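The last step can be sketched as a small pipeline class. This is a minimal illustration, assuming items behave like dictionaries; the class name and output path are hypothetical. The three method names follow Scrapy's item pipeline interface, and the bottom of the block calls them by hand, outside Scrapy, only to show the order in which they run.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Item pipeline that writes each scraped item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines also see it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Manual run with a made-up item, standing in for what Scrapy would do.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Room A", "price": 5000}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project the class would be registered in the `ITEM_PIPELINES` setting, for example under a module path such as `scrapy_sample.pipelines.JsonLinesPipeline` (a hypothetical path).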
Creating a project $ scrapy startproject scrapy_sample
Sample
Creating a Spider (from the official site)
Running $ scrapy crawl dmoz_spider -o scraped_data.json
Details • Introduction to Scrapy (1) • Introduction to Scrapy (2)
Sample (1)
Sample (2)
A real-world case study • Analyzing vacation rental data for Hiroshima Prefecture • Property information • Price information
Property information (site)
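Development flow • Create spiders to fetch property and price information • The spiders save temporary data (JSON) • A batch job stores the properties and prices (cleansing) • An aggregation batch analyzes the data and saves it to a DB • Display the aggregated data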
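The flow above can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the field names, the sample records, the monthly-average aggregate, and the use of in-memory SQLite in place of whatever database the real service uses.

```python
import json
import re
import sqlite3
from collections import defaultdict

# Raw spider output, one JSON object per line (made-up records).
RAW_LINES = [
    '{"listing_id": "a1", "date": "2016-08-13", "price": "8,000円"}',
    '{"listing_id": "a2", "date": "2016-08-13", "price": "10,000円"}',
    '{"listing_id": "a1", "date": "2016-12-01", "price": "5,000円"}',
    '{"listing_id": "a3", "date": "2016-12-01", "price": ""}',
]


def cleanse(line):
    """Parse one raw record; return None when the price is missing."""
    record = json.loads(line)
    digits = re.sub(r"[^0-9]", "", record["price"])  # drop commas and the yen sign
    if not digits:
        return None
    return {
        "listing_id": record["listing_id"],
        "month": record["date"][:7],
        "price": int(digits),
    }


def average_price_by_month(records):
    """Return {month: average price} for the cleansed records."""
    prices = defaultdict(list)
    for record in records:
        prices[record["month"]].append(record["price"])
    return {month: sum(values) / len(values) for month, values in prices.items()}


# Cleansing batch, then aggregation batch.
records = [r for r in map(cleanse, RAW_LINES) if r is not None]
averages = average_price_by_month(records)

# Save the aggregate to a database and read it back for display.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_price (month TEXT PRIMARY KEY, avg_price REAL)")
conn.executemany("INSERT INTO monthly_price VALUES (?, ?)", sorted(averages.items()))
rows = conn.execute("SELECT month, avg_price FROM monthly_price ORDER BY month").fetchall()
print(rows)
```

Keeping the raw JSON separate from the cleansed and aggregated tables means a cleansing bug can be fixed and the batches rerun without crawling again.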
Demo
[Chart: number of properties (y-axis 0 to 500), weekly from 2016/2/15 through December 2016]
Number of properties • 461 properties • Doubled in one year (240 => 461) • 10th most in Japan • Tokyo, Osaka, Kyoto, Okinawa, Hokkaido, Fukuoka, Kanagawa, […], Aichi, Chiba, Hiroshima
[Chart: average price (y-axis 0 to 12000)]
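Average price • The average price is low in winter (5,000 yen) • Weekends in August, October, and November are high (8,000) • […] and the New Year holiday are the peak (10,000 yen)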
[Chart: occupancy rate (y-axis 0 to 100)]
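Occupancy rate • The summer season is the peak (just under 80%) • Weekends in October and November are high (70% or more) • The winter season is low (40% or less) • 10/15 (Sat) and 10/29 (Sat) were high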
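The talk does not say how occupancy was computed. One common definition is booked listing-nights divided by available listing-nights, and the sketch below applies it to Saturdays versus other days. The calendar numbers are made up for illustration and are not the talk's data.

```python
from datetime import date

# (date, listings available, listings booked): made-up numbers.
CALENDAR = [
    (date(2016, 10, 14), 100, 45),  # Friday
    (date(2016, 10, 15), 100, 75),  # Saturday
    (date(2016, 10, 16), 100, 50),  # Sunday
    (date(2016, 10, 29), 100, 80),  # Saturday
]


def occupancy_rate(rows):
    """Booked listing-nights over available listing-nights, in percent."""
    available = sum(a for _, a, _ in rows)
    booked = sum(b for _, _, b in rows)
    return 100.0 * booked / available if available else 0.0


# date.weekday() counts from Monday = 0, so Saturday is 5.
saturdays = [row for row in CALENDAR if row[0].weekday() == 5]
others = [row for row in CALENDAR if row[0].weekday() != 5]

saturday_rate = occupancy_rate(saturdays)
other_rate = occupancy_rate(others)
print(saturday_rate, other_rate)
```

Summing nights before dividing, rather than averaging daily percentages, weights each day by how many listings were actually available.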
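Summary • There are many approaches to scraping with Python. • Scrapy is recommended because it handles the tedious parts for you • Vacation rentals in Hiroshima should take off from here!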
Thank you for listening