Can't I Go Down as the First Person to Build a Crawler with Headless Chrome? (俺が最初にヘッドレスChromeでクローラ作った事になんねーかな)
yujiosaka
February 22, 2018
Transcript
Yuji Isobe. Can't I Go Down as the First Person to Build a Crawler with Headless Chrome? Node学園 29時限目 (Node Gakuen, 29th session)
Project manager at emin. @yujiosaka https://speakerdeck.com/yujiosaka/hitasurale-sitedeipuraningu
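This talk: how I built a crawler
✓ Why a crawler now, of all times?
✓ What I was aiming for when I built it
✓ What I was thinking about while I built it
✓ The future of the crawler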
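Last year I did all sorts of things…
A serialized column on ECZine http://eczine.jp/article/detail/4869
A debut as an e-commerce expert http://amzn.asia/aOkwFjH
Then I had a problem (´・ω・｀)
People at work stopped seeing me as an engineer orz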
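The line you must not cross
Tuning for each client company ← "Well, I'm an AI engineer"
Going along on sales visits ← "Technical sales, I suppose"
Proposing new products ← "That's BizDev, isn't it"
Starting to write sales materials ← "Uh, okay…"
Starting to write press releases ← "That guy isn't an engineer anymore"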
If I can, I want to earn my living as an engineer for the rest of my life
Reclaim my dignity as an engineer at work
Then one day…
I learned about headless Chrome https://developers.google.com/web/updates/2017/04/headless-chrome?hl=ja
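What is headless Chrome?
✓ Chrome can be launched in headless mode
✓ Just add "--headless" to Chrome's launch options
✓ PhantomJS has been the best-known headless browser
✓ It runs fast and stably
✓ It is quick to support new standards (ES2017 async/await is available)
✓ Its main uses are test automation and dynamic crawlers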
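Static crawlers vs. dynamic crawlers (* my own naming)
✓ Static crawlers (wget, curl)
✓ Request only the document (the HTML file)
✓ Fast, because they only parse the file
✓ Do not work on SPA sites built with AngularJS, React, or Vue.js
✓ Dynamic crawlers (PhantomJS, headless Chrome)
✓ Load images, JavaScript, and CSS and go as far as rendering
✓ Generally slow, because they also execute JavaScript
✓ Work on SPA sites the same way as on conventional sites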
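The difference between the two kinds of crawler can be sketched without a browser. The snippet below is only an illustration: `extractLinksStatically` and both sample pages are hypothetical, and a regular expression stands in for a real HTML parser. It shows why a crawler that only parses the downloaded document finds nothing on an SPA shell.

```javascript
// Hypothetical helper: what a static crawler does, reduced to its core.
// It only parses the HTML text it was given; no JavaScript is executed.
function extractLinksStatically(html) {
  const links = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// A conventional server-rendered page: the links are in the document itself.
const traditionalPage = `
  <html><body>
    <a href="/about">About</a>
    <a href="/news">News</a>
  </body></html>`;

// A typical SPA shell: the links only exist after the script has run.
const spaShell = `
  <html><body>
    <div id="app"></div>
    <script src="/bundle.js"></script>
  </body></html>`;

console.log(extractLinksStatically(traditionalPage)); // [ '/about', '/news' ]
console.log(extractLinksStatically(spaShell)); // []
```

A dynamic crawler would load `/bundle.js`, let it render the page, and only then look for links, which is where the extra cost comes from.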
Chrome DevTools Protocol https://chromedevtools.github.io/devtools-protocol/
✓ The latest specification is a JSON file in the Chromium source code
✓ It is copied to the GitHub repository once an hour
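For a sense of how low-level the protocol is, here is a minimal sketch of building one raw command. It assumes the usual JSON message shape of `id`, `method`, and `params`; `buildCommand` is a hypothetical helper, and the WebSocket transport that would carry the message to Chrome is left out.

```javascript
// Hypothetical helper: serialize one protocol command.
// Each command carries an id so the reply can be matched to it later.
let nextId = 0;
function buildCommand(method, params = {}) {
  nextId += 1;
  return JSON.stringify({ id: nextId, method, params });
}

// Ask the page to navigate; in a real session this string would be
// sent over the browser's debugging WebSocket.
const message = buildCommand('Page.navigate', { url: 'https://example.com/' });
console.log(message);
```

Everything a crawler needs (navigation, waiting for load, evaluating scripts) has to be assembled from messages like this one.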
Benchmark https://hackernoon.com/benchmark-headless-chrome-vs-phantomjs-e7f44c6956c
RIP PhantomJS https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE
If you are starting now, go with headless Chrome
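But the problems are piling up
✓ The API is too low-level and hard to handle
✓ The specification is still unstable and hard to keep up with
✓ You get caught by security blocks
✓ User protections such as Content Security Policy kick in
✓ You run into bugs casually
✓ The implementation of setRequestInterception is still experimental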
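GoogleChrome / puppeteer https://github.com/GoogleChrome/puppeteer
✓ Maintained by the Google Chrome team
✓ A wrapper that lets you drive headless Chrome through a high-level API
✓ v1.0.0 was released in January
✓ A Slack group has been set up, and responses are courteous and quick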
Building crawlers with headless Chrome is super trendy right now~~~
Can't I go down as the first person who built one?
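Then I noticed something
puppeteer / examples https://github.com/GoogleChrome/puppeteer/tree/master/examples
It is all "I tried it out" posts and "explainers"; few entries are practical
Let's build the first practical crawler on headless Chrome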
Other reasons
✓ Existing crawlers did not support Promise
✓ There was no Node.js crawler that ran in a distributed environment
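Setting the goals
✓ It has the features a practical crawler needs
✓ The documentation is written in English
✓ It is sufficiently covered by tests
✓ It runs in a distributed environment
✓ The API is kept simple
✓ It gets listed in puppeteer / examples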
With this, I will reclaim my dignity as an engineer
…
Done https://github.com/yujiosaka/headless-chrome-crawler
Goal achieved https://github.com/GoogleChrome/puppeteer/tree/master/examples
Also listed on Google Developers https://developers.google.com/web/tools/puppeteer/examples
Freaked out by the jump in traffic
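Demo
  HCCrawler.launch({
    maxDepth: …, // maximum depth to explore
    maxConcurrency: …, // maximum number of parallel requests
    allowedDomains: ['www.emin.co.jp'], // domains that are allowed
    evaluatePage: () => $('title').text(), // function evaluated on the page
    onSuccess: (result) => { // function evaluated on success
      console.log(`${result.options.url}\t${result.result}`);
    },
  })
    .then(async (crawler) => {
      crawler.queue('https://www.emin.co.jp');
      await crawler.onIdle();
      await crawler.close();
    });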
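How the crawler came together
What is the most minimal crawler?
✓ "Crawling" and "scraping" are different things
✓ Crawling: finding links in HTML
✓ Scraping: finding the information you want in HTML
✓ Neither is much use on its own
What do the two have in common?
Finding ___ in HTML
Isn't jQuery good enough for that?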
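The "finding ___ in HTML" observation can be sketched as one shared primitive. This is not the library's implementation: `findAll`, `findLinks`, and `findTitle` are hypothetical names, and a regular expression stands in for the jQuery selector so that the snippet runs in plain Node.js without a browser.

```javascript
// Hypothetical primitive shared by crawling and scraping:
// return the first capture group of every match of `pattern` in `html`.
function findAll(html, pattern) {
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Crawling: find links in HTML.
const findLinks = (html) => findAll(html, /<a\s[^>]*href="([^"]*)"/g);

// Scraping: find the wanted information (here, the title) in HTML.
const findTitle = (html) => findAll(html, /<title>([^<]*)<\/title>/g)[0];

const html = `
  <html>
    <head><title>Example page</title></head>
    <body><a href="/next">Next</a></body>
  </html>`;

console.log(findLinks(html)); // [ '/next' ]
console.log(findTitle(html)); // Example page
```

In a real page the same role is played by a selector such as `$('a')` or `$('title')`, which is why one injected copy of jQuery can serve both jobs.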
v1.0.0 release
  jQuery: true, // automatically inject jQuery into the page
example
  HCCrawler.launch({
    jQuery: true,
    evaluatePage: () => $('title').text(),
    onSuccess: (result) => {
      console.log(`${result.options.url}\t${result.result}`);
    },
  })
    .then(async (crawler) => {
      crawler.queue('https://www.emin.co.jp');
      await crawler.onIdle();
      await crawler.close();
    });
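A crawler that doesn't get on your nerves
✓ If you are used to static crawlers, it feels really slow
✓ It is seriously deflating to find it has quietly stopped on an error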
Running in a distributed environment
✓ Use Redis for the task queue and the cache
✓ Share Redis across multiple servers
v1.3.0 release
  cache: new RedisCache(), // use Redis as the cache storage
example (the same snippet appears three times on the slide)
  HCCrawler.launch({
    cache: new RedisCache({ host: …, port: … }),
    evaluatePage: () => $('title').text(),
    onSuccess: (result) => {
      console.log(`${result.options.url}\t${result.result}`);
    },
  })
    .then(async (crawler) => {
      crawler.queue('https://www.amazon.co.jp');
      await crawler.onIdle();
      await crawler.close();
    });
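The idea of several servers sharing one task queue and one cache can be sketched as follows. This is a simplified illustration, not the library's code: `SharedStore` is a hypothetical in-memory stand-in for the shared Redis instance, and the two "servers" simply take turns inside a single process.

```javascript
// In-memory stand-in for the shared Redis instance (an assumption for
// illustration; a real deployment would reach Redis over the network).
class SharedStore {
  constructor() {
    this.queue = []; // task queue shared by every server
    this.seen = new Set(); // cache of URLs that were already queued
  }

  // Enqueue a URL unless some server has already queued it.
  push(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Hand out the next URL, or undefined when the queue is empty.
  pop() {
    return this.queue.shift();
  }
}

const store = new SharedStore();
store.push('https://example.com/a');
store.push('https://example.com/b');
// The cache rejects a URL that was queued before, whoever queued it.
const duplicateAccepted = store.push('https://example.com/a');

// Two hypothetical servers take turns pulling work from the same queue.
const servers = ['server-1', 'server-2'];
const log = [];
let turn = 0;
let url;
while ((url = store.pop()) !== undefined) {
  log.push(`${servers[turn % servers.length]} crawled ${url}`);
  turn += 1;
}

console.log(duplicateAccepted); // false
console.log(log);
```

Because the deduplication lives in the shared store rather than in each process, adding another server needs no coordination beyond pointing it at the same store.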
Other features
✓ Breadth-first search (BFS) and depth-first search (DFS)
✓ Obeys robots.txt
✓ XML sitemap discovery
✓ Device emulation
✓ Page screenshots
✓ JSON/CSV output
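The difference between the two search orders, combined with a depth limit, can be sketched on a small in-memory site. The `site` map and `crawlOrder` are hypothetical, and counting the start page as depth 1 is an assumption made for this example rather than a statement about the library.

```javascript
// Hypothetical site: each page lists the pages it links to.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a/1', '/a/2'],
  '/b': ['/b/1'],
  '/a/1': [],
  '/a/2': [],
  '/b/1': [],
};

// Return the order in which pages are visited.
// Assumption: the start page counts as depth 1.
function crawlOrder(start, strategy, maxDepth) {
  const visited = new Set();
  const order = [];
  const pending = [{ url: start, depth: 1 }];
  while (pending.length > 0) {
    // BFS takes the oldest pending page, DFS the newest.
    const { url, depth } = strategy === 'bfs' ? pending.shift() : pending.pop();
    if (visited.has(url)) continue;
    visited.add(url);
    order.push(url);
    if (depth >= maxDepth) continue; // do not follow links past the limit
    const links = site[url] || [];
    // For DFS, push in reverse so links are still visited in page order.
    const next = strategy === 'bfs' ? links : [...links].reverse();
    for (const link of next) pending.push({ url: link, depth: depth + 1 });
  }
  return order;
}

console.log(crawlOrder('/', 'bfs', 3)); // [ '/', '/a', '/b', '/a/1', '/a/2', '/b/1' ]
console.log(crawlOrder('/', 'dfs', 3)); // [ '/', '/a', '/a/1', '/a/2', '/b', '/b/1' ]
console.log(crawlOrder('/', 'bfs', 2)); // [ '/', '/a', '/b' ]
```

BFS reaches every top-level section before going deeper, which tends to suit a depth-limited crawl; DFS finishes one branch before starting the next.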
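The future of the crawler
Current challenges
✓ I don't have the nerve to line up 100 servers just to crawl with this crawler, and it would be a hassle anyway
✓ I want it deployed to a distributed environment with a single command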
What is the Serverless Framework?
✓ Closer to a configuration management "tool"
✓ Makes it easy to deploy and run AWS Lambda, Azure Functions, and Google CloudFunctions
✓ Supports Node.js, Python, Java, Scala, C#, F#, Go, Groovy, Kotlin, PHP & Swift
✓ Lots of handy plugins
v2.0.0 will be…
  yarn (npm run) deploy: deploy to AWS Lambda
  yarn (npm run) start: start crawling in parallel
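Can't I go down as the first person to build a crawler with headless Chrome?
Can't I go down as the first person to build a practical crawler with headless Chrome?
But honestly, I want to write more code at work
WE ARE HIRING https://www.emin.co.jp/blog/news/1527/ Sales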