Vacation Rentals of Hiroshima
Let's Analyze Hiroshima's Vacation Rental (Minpaku) Data!
hsekine
November 12, 2016
Transcript
Let's Analyze Hiroshima's Vacation Rental Data! 2016/11/12 Hironori Sekine (関根裕紀) PyCon mini Hiroshima 2016
Self-introduction • Hironori Sekine • SQUEEZE Inc. • Twitter: @checkpoint
My involvement with Python (1) • PyCon JP 2014 staff • PyCon JP 2015 vice chair (program) • PyCon JP 2016 staff • Python mokumoku-kai (organizer)
My involvement with Python (2) • LLDiver • PyCon JP 2014 • Phone Symposium Tokyo 2015 • PyCon mini Hiroshima 2015 • PyCon mini Hiroshima 2016 • Python Engineer Training Book (Pythonエンジニア養成読本, co-author)
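Python at work • A service for managing and operating vacation rental listings
Python at work • A service for analyzing vacation rental listings
Technologies we use
Agenda • Vacation rentals in Hiroshima Prefecture • Collecting data with Python • Analyzing data with Python
What is minpaku (vacation rental)? Staying overnight in an ordinary private home (it takes various forms)
Platforms COPYRIGHT (C) 2014-2016 SQUEEZE Inc. ALL RIGHTS RESERVED.
Platforms (Japan)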
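Vacation rentals in Hiroshima Prefecture • Home to the largest city in the Chugoku and Shikoku regions (Hiroshima City) • Rich tourism resources, including World Heritage sites • The visit by US President Obama • The Hiroshima Carp's Central League championship • Vacation rentals should be booming too!
Analyzing vacation rental data • Collecting the data • Analyzing the data • Displaying the data
Collecting the data • Crawling • Scraping • Statistical data • Statistics Bureau, Ministry of Internal Affairs and Communications • Data catalog sites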
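Crawling • The English word "crawl" means to creep or move slowly • Follow the links found in web pages • Download and collect the content of web pages • Sometimes the data comes from a Web API instead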
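As a rough illustration of the crawling idea on this slide (follow links, download each page once), here is a minimal standard-library sketch. It is not the speaker's code. The `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_SITE` pages are all assumptions made up for this example so that it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None if missing."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be downloaded
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP downloads.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/missing">gone</a>',
}

pages = crawl("https://example.com/", FAKE_SITE.get)
print(list(pages))
```

In a real crawler, `fetch` would perform an HTTP request and should respect robots.txt and a polite request interval.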
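Scraping • The English word "scrape" means to scrape or shave something off • Extract the information you need from the content of a page
A batteries-included language + powerful third-party libraries
Useful libraries • Standard library • requests • BeautifulSoup • Scrapy • Selenium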
Standard library • Python's standard library is very rich • Networking, regular expressions, etc. • All you need is a Python interpreter • Practical enough for simple scraping
Sample
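The code shown on the sample slide did not survive in the transcript. As a stand-in, here is a small sketch of scraping with only the standard library (`re` and `urllib.request`). The function names, the sample HTML, and the "yen" price format are assumptions for illustration. Regular expressions are fine for simple pages like this one but fragile on real-world HTML.

```python
import re
from urllib.request import urlopen

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
PRICE_RE = re.compile(r"([\d,]+)\s*yen")


def fetch(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the text of the <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_prices(html):
    """Return every 'N yen' amount in the page as an integer."""
    return [int(p.replace(",", "")) for p in PRICE_RE.findall(html)]


# Made-up sample page so the sketch runs offline; use fetch(url) for a real one.
SAMPLE_HTML = """<html><head><title> Listings in Hiroshima </title></head>
<body><p>Room A: 5,000 yen</p><p>Room B: 8,000 yen</p></body></html>"""

print(extract_title(SAMPLE_HTML))
print(extract_prices(SAMPLE_HTML))
```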
Requests • An HTTP client for Python • A human-friendly interface • Above all, easy to understand • Simple yet powerful
Sample from the official site
Sample (requests version)
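The requests sample itself is not in the transcript, so the following is only a sketch of typical usage. `build_request` and `fetch_html` are hypothetical helper names, and the URL and query parameters are placeholders. Only `fetch_html` touches the network; the demonstration at the bottom just prepares a request to show how query parameters are encoded.

```python
import requests


def build_request(url, params=None):
    """Prepare (but do not send) a GET request, to inspect the final URL."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_html(url, params=None, timeout=10):
    """Send a GET request and return the decoded body; raises on HTTP errors."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


# Placeholder URL and parameters; nothing is sent over the network here.
prepared = build_request(
    "https://example.com/search", {"area": "hiroshima", "page": 1}
)
print(prepared.url)
```

Passing an explicit `timeout` is worth the extra argument: without one, `requests.get` can wait indefinitely on an unresponsive server.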
Beautiful Soup • A library that has existed since around 2004 • Extracts data from HTML and XML • The latest version is the Beautiful Soup 4 series • Supports Python 2.7 and Python 3.2
Sample
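Since the sample slide's code is missing from the transcript, here is a minimal Beautiful Soup 4 sketch in its place. The HTML snippet, the CSS classes, and the `parse_listings` function are invented for illustration. It uses Python's built-in `html.parser` backend, so no additional parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup; a real listing page will have different structure.
SAMPLE_HTML = """
<ul id="listings">
  <li class="listing"><a href="/rooms/1">Room near the station</a>
      <span class="price">5,000</span></li>
  <li class="listing"><a href="/rooms/2">House by the sea</a>
      <span class="price">8,000</span></li>
</ul>
"""


def parse_listings(html):
    """Extract title, link, and price from each <li class="listing">."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for item in soup.select("li.listing"):
        link = item.find("a")
        price = item.find("span", class_="price")
        listings.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": int(price.get_text(strip=True).replace(",", "")),
        })
    return listings


for listing in parse_listings(SAMPLE_HTML):
    print(listing)
```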
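Scrapy Scrapy is a fast, high-level scraping and crawling framework. It is used to crawl websites and extract structured data, and it serves a wide range of purposes, from data mining to monitoring and automated testing.
Features of Scrapy • A crawling and scraping framework • Influenced by Django (middleware, etc.) • Has all the features needed for scraping • Well documented
Main functions of Scrapy • Download, extract, save • Caching of downloaded documents • A powerful command-line shell • robots.txt parsing • Asynchronous, concurrent downloads (using Twisted) • Crawl interval control per domain or IP address • Retry on error • Log output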
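Several of the functions listed above are switched on through a Scrapy project's `settings.py`. The fragment below is a sketch of how that might look. The setting names are standard Scrapy settings, but every value is an illustrative assumption rather than a recommendation from the talk.

```python
# settings.py (fragment): values are illustrative assumptions only

ROBOTSTXT_OBEY = True               # parse and respect robots.txt
HTTPCACHE_ENABLED = True            # cache downloaded documents on disk
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # limit parallel downloads per domain
DOWNLOAD_DELAY = 1.0                # seconds between requests to one site
AUTOTHROTTLE_ENABLED = True         # adapt the delay to server response times
RETRY_ENABLED = True                # retry failed requests
RETRY_TIMES = 2                     # retries in addition to the first attempt
LOG_LEVEL = "INFO"                  # verbosity of log output
```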
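Development steps • Create a Scrapy project • Write a Spider (link extraction, downloading) • Save the data with an Item pipeline
Creating a project $ scrapy startproject scrapy_sample
Sample
Writing a Spider (from the official site)
Running it $ scrapy crawl dmoz_spider -o scraped_data.json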
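Details • Introduction to Scrapy (1) • Introduction to Scrapy (2)
Sample (1)
Sample (2)
A real-world example • Analyzing vacation rental data for Hiroshima Prefecture • Listing information • Price information
Listing information (site)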
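Development flow • Write spiders that fetch listing and price information • The spiders save temporary data (JSON) • A batch job saves the listings and prices (cleansing) • An aggregation batch analyzes the data and saves it to a DB • Display the aggregated data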
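The flow above (temporary JSON, cleansing, aggregation, database) can be sketched end to end with the standard library. This is an assumption-heavy illustration, not the speaker's system: the record fields, the cleansing rules, the `daily_price` table, and the use of SQLite are all invented here.

```python
import json
import sqlite3
from collections import defaultdict
from statistics import mean

# Made-up spider output: note the empty price and the duplicated first row.
RAW_JSON = """[
  {"listing_id": "a1", "date": "2016-10-15", "price": "8,000"},
  {"listing_id": "a1", "date": "2016-10-16", "price": "6,000"},
  {"listing_id": "b2", "date": "2016-10-15", "price": "10,000"},
  {"listing_id": "b2", "date": "2016-10-16", "price": ""},
  {"listing_id": "a1", "date": "2016-10-15", "price": "8,000"}
]"""


def cleanse(records):
    """Drop rows without a usable price; de-duplicate on (listing_id, date)."""
    seen = set()
    clean = []
    for row in records:
        digits = str(row.get("price") or "").replace(",", "")
        key = (row.get("listing_id"), row.get("date"))
        if not digits.isdigit() or key in seen:
            continue
        seen.add(key)
        clean.append({"listing_id": key[0], "date": key[1], "price": int(digits)})
    return clean


def average_price_by_date(rows):
    """Aggregate cleansed rows into an average price per date."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["date"]].append(row["price"])
    return {date: mean(values) for date, values in sorted(prices.items())}


def save_summary(conn, summary):
    """Store the aggregated figures so a display layer can query them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_price "
        "(date TEXT PRIMARY KEY, avg_price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_price VALUES (?, ?)", summary.items()
    )
    conn.commit()


rows = cleanse(json.loads(RAW_JSON))
summary = average_price_by_date(rows)
conn = sqlite3.connect(":memory:")
save_summary(conn, summary)
stored = conn.execute(
    "SELECT date, avg_price FROM daily_price ORDER BY date"
).fetchall()
print(stored)
```

Keeping the raw JSON separate from the cleansed table means the cleansing rules can be changed and re-run later without crawling the site again.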
Demo
[Chart: number of listings by week, 2016/2/15 through December 2016; y-axis 0 to 500]
Number of listings • 461 listings • Doubled in one year (240 => 461) • 10th highest in Japan • Tokyo, Osaka, Kyoto, Okinawa, Hokkaido, Fukuoka, Kanagawa, [one prefecture name unreadable], Aichi, Chiba, Hiroshima
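[Chart: average price; y-axis 0 to 12,000 yen]
Average price • The average price is low in winter (5,000 yen) • Weekends in August, October, and November are high (8,000 yen) • The New Year holiday is the peak (10,000 yen)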
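[Chart: occupancy rate; y-axis 0 to 100%]
Occupancy rate • The summer season is the peak (just under 80%) • Weekends in October and November are high (70% or more) • The winter season is low (40% or less) • 10/15 (Sat) and 10/29 (Sat) were high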
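The slides do not say how the occupancy rate was calculated. One common definition, assumed here, is booked nights divided by available nights. The function name and the figures in the example are hypothetical.

```python
def occupancy_rate(booked_nights, available_nights):
    """Occupancy in percent: booked nights as a share of available nights."""
    if available_nights <= 0:
        raise ValueError("available_nights must be positive")
    return 100.0 * booked_nights / available_nights


# Hypothetical week: 30 listings x 7 nights = 210 available, 84 of them booked.
rate = occupancy_rate(booked_nights=84, available_nights=30 * 7)
print(f"{rate:.1f}%")
```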
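Summary • There are many approaches to scraping with Python. • Scrapy is recommended because it takes care of the tedious processing for you. • Vacation rentals in Hiroshima should keep growing from here!
Thank you for your attention.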