I wonder if I could pass as the first person to build a crawler with headless Chrome
yujiosaka
February 22, 2018
Transcript
Yuji Isobe
I wonder if I could pass as the first person to build a crawler with headless Chrome
Node Gakuen, 29th session
Project manager at emin, @yujiosaka https://speakerdeck.com/yujiosaka/hitasurale-sitedeipuraningu
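This time I built a crawler
✓ Why a crawler, at this late date?
✓ What I was aiming for when I built it
✓ What I was thinking about while I built it
✓ Where the crawler goes from here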
Last year I did all sorts of things…
A serialized column in ECZine http://eczine.jp/article/detail/4869
My debut as an e-commerce expert http://amzn.asia/aOkwFjH
That left me in a bind (´・ω・`)
People at my company have stopped thinking of me as an engineer orz
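The line that must not be crossed
Tuning for each individual client ← "Well, I am an AI engineer"
Going along on sales visits ← "Technical sales, maybe?"
Proposing new products ← "That's BizDev, right?"
Starting to write sales materials ← "Uh, o-okay…"
Starting to write press releases ← "That guy isn't an engineer anymore"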
If I can, I want to earn my living as an engineer for the rest of my life
Win back my dignity as an engineer at my company
And then one day…
I found out about headless Chrome https://developers.google.com/web/updates/2017/04/headless-chrome?hl=ja
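What is headless Chrome?
✓ Chrome can be started in headless mode
✓ Just add "--headless" to Chrome's launch options
✓ Until now, the best-known headless browser has been PhantomJS
✓ Runs fast and stably
✓ Quick to support standards (ES2017 async/await is available)
✓ Main uses are test automation and dynamic crawlers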
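As a rough illustration of the point that headless mode is only a launch option, the sketch below builds an argument list for starting Chrome headless from Node.js. The helper name `headlessArgs`, the port number, and the binary name `google-chrome` in the usage comment are assumptions for illustration and do not come from the talk.

```javascript
// Sketch: build the command line for starting Chrome in headless mode.
// `headlessArgs` is a hypothetical helper; the two flags are Chrome switches.
function headlessArgs(url, debugPort = 9222) {
  return [
    '--headless',                           // run without a visible window
    `--remote-debugging-port=${debugPort}`, // expose the DevTools Protocol endpoint
    url,
  ];
}

// Usage (assumes a Chrome binary named `google-chrome` is on PATH):
// const { spawn } = require('child_process');
// spawn('google-chrome', headlessArgs('https://example.com/'));
```

In practice a wrapper library would normally assemble these arguments for you; the sketch only shows how little is needed at the command-line level.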
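Static crawlers vs. dynamic crawlers (※ my own naming)
✓ Static crawlers (wget, curl)
✓ Request only the document (the HTML file)
✓ Fast, because they only parse the file
✓ Do not work on SPA sites built with AngularJS, React, or Vue.js
✓ Dynamic crawlers (PhantomJS, headless Chrome)
✓ Load images, JavaScript, and CSS, and go as far as rendering the page
✓ Generally slower, because they also execute JavaScript
✓ Work on SPA sites the same way as on traditional sites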
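To make the static side of this comparison concrete, here is a minimal sketch of what a static crawler can see. It treats the HTML document as text and never runs JavaScript, so an SPA shell yields nothing. The helper `extractLinks` and both sample pages are made up for illustration, and the regex is a simplification; a real crawler would use an HTML parser.

```javascript
// Sketch: a static crawler only parses the document it downloaded.
// `extractLinks` is a hypothetical helper, not taken from any library.
function extractLinks(html) {
  const links = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// A traditional server-rendered page: the links are in the document itself.
const traditionalPage =
  '<html><body><a href="/about">About</a> <a href="/news">News</a></body></html>';

// An SPA shell: the links only appear after /app.js has run in a browser.
const spaShell =
  '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';

console.log(extractLinks(traditionalPage)); // [ '/about', '/news' ]
console.log(extractLinks(spaShell));        // []
```

A dynamic crawler avoids the empty result by letting a real browser execute the page first and reading the rendered DOM, at the cost of speed.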
Chrome DevTools Protocol https://chromedevtools.github.io/devtools-protocol/
✓ The latest specification is a JSON file in the Chromium source code
✓ It is copied to a GitHub repository once an hour
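For readers who have not seen the protocol, the sketch below shows the general shape of a DevTools Protocol command: a JSON object with an `id`, a `method`, and `params`, which a client sends over a WebSocket. `createCommandBuilder` is a hypothetical helper written for this illustration; `Page.navigate` and `Runtime.evaluate` are protocol methods.

```javascript
// Sketch: building Chrome DevTools Protocol commands as JSON strings.
// `createCommandBuilder` is a hypothetical helper, not part of any library.
function createCommandBuilder() {
  let nextId = 1; // each command carries an id; the response echoes it back
  return (method, params = {}) => JSON.stringify({ id: nextId++, method, params });
}

const command = createCommandBuilder();
const navigate = command('Page.navigate', { url: 'https://example.com/' });
const evaluate = command('Runtime.evaluate', { expression: 'document.title' });

console.log(navigate);
// {"id":1,"method":"Page.navigate","params":{"url":"https://example.com/"}}
```

Working at this level means tracking ids and matching responses yourself, which gives a sense of how low-level the raw interface is.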
Benchmark https://hackernoon.com/benchmark-headless-chrome-vs-phantomjs-e7f44c6956c
RIP PhantomJS https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE
If you are starting now, go with headless Chrome
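But the problems pile up
✓ The API is too low-level and hard to work with
✓ The specification is still unstable and hard to keep up with
✓ You get caught by security blocks
✓ User protections such as Content Security Policy kick in
✓ You casually run into bugs
✓ The implementation of setRequestInterception is still experimental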
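GoogleChrome / puppeteer https://github.com/GoogleChrome/puppeteer
✓ Maintained by the Google Chrome team
✓ A wrapper that lets you drive headless Chrome through a high-level API
✓ v1.0.0 was released in January
✓ A Slack group has been set up, and responses are courteous and fast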
Building crawlers with headless Chrome is getting super popular~~~
I wonder if I could pass as the first person to have built one
Then I noticed
puppeteer / examples https://github.com/GoogleChrome/puppeteer/tree/master/examples
It is all "I tried it out" posts and "explainers", with very little that is practical
Let's build the first practical crawler with headless Chrome
Other reasons
✓ Existing crawlers do not support Promises
✓ There was no Node.js crawler that runs in a distributed environment
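Setting the goals
✓ Has the features a practical crawler needs
✓ Documentation is written in English
✓ Well covered by tests
✓ Runs in a distributed environment
✓ Keeps the API simple
✓ Gets listed in puppeteer / examples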
With this I win back my dignity as an engineer
…
Done https://github.com/yujiosaka/headless-chrome-crawler
Goal achieved https://github.com/GoogleChrome/puppeteer/tree/master/examples
Also reprinted on Google Developers https://developers.google.com/web/tools/puppeteer/examples
Freaking out at how much the traffic went up
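Demo
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  maxDepth: …, // maximum depth to explore (the value did not survive in the transcript)
  maxConcurrency: …, // maximum number of parallel requests (value likewise lost)
  allowedDomains: ['www.emin.co.jp'], // domains that may be crawled
  evaluatePage: (() => $('title').text()), // function evaluated on the page
  onSuccess: (result => { // function evaluated on success
    console.log(`${result.options.url}\t${result.result}`);
  }),
})
  .then(async crawler => {
    crawler.queue('https://www.emin.co.jp/');
    await crawler.onIdle();
    await crawler.close();
  });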
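The demo's `maxDepth` and `allowedDomains` options limit which links the crawler follows. A minimal sketch of such a filter is below. `shouldFollow` is a hypothetical helper, and the way depth is counted here is an assumption made for this illustration, not the library's actual implementation.

```javascript
// Sketch: deciding whether a discovered link should be queued.
// `shouldFollow` is hypothetical; the depth counting rule is an assumption.
function shouldFollow(link, depth, options) {
  const { maxDepth, allowedDomains } = options;
  if (depth > maxDepth) return false; // too far from the start page
  let hostname;
  try {
    hostname = new URL(link).hostname;
  } catch (err) {
    return false; // not an absolute URL
  }
  return allowedDomains.includes(hostname);
}

// The maxDepth value here is a placeholder chosen for the example.
const crawlOptions = { maxDepth: 2, allowedDomains: ['www.emin.co.jp'] };
console.log(shouldFollow('https://www.emin.co.jp/blog/', 1, crawlOptions)); // true
console.log(shouldFollow('https://example.com/', 1, crawlOptions));         // false
console.log(shouldFollow('https://www.emin.co.jp/blog/', 3, crawlOptions)); // false
```

Keeping the check in one small function makes it easy to apply the same rule to every link before it enters the queue.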
How the crawler came together
The most minimal crawler
✓ "Crawling" and "scraping" are different things
✓ Crawling: finding links in HTML
✓ Scraping: finding the information you want in HTML
✓ Neither one is meaningful on its own
What do the two have in common?
Finding ___ in HTML
Isn't jQuery good enough for that?
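The observation that crawling and scraping are both "finding something in HTML" can be sketched with a single lookup helper serving both jobs. `findAll` and the sample markup are hypothetical, and regular expressions stand in for jQuery selectors only so that the example runs without a browser.

```javascript
// Sketch: one "find something in HTML" helper used for both crawling and scraping.
// `findAll` is a hypothetical helper; regexes stand in for jQuery selectors.
function findAll(html, pattern) {
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

const samplePage = '<html><head><title>Top</title></head>' +
  '<body><a href="/a">A</a><a href="/b">B</a></body></html>';

// Crawling: find the links to visit next.
const nextLinks = findAll(samplePage, /<a\s[^>]*href="([^"]*)"/g);
// Scraping: find the information you actually want.
const pageTitles = findAll(samplePage, /<title>([^<]*)<\/title>/g);

console.log(nextLinks);  // [ '/a', '/b' ]
console.log(pageTitles); // [ 'Top' ]
```

Inside a real browser page, the two patterns would simply become two selectors passed to the same selector engine.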
v1.0.0 release
jQuery: true, // automatically injects jQuery into the page
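example
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  jQuery: true,
  evaluatePage: (() => $('title').text()),
  onSuccess: (result => {
    console.log(`${result.options.url}\t${result.result}`);
  }),
})
  .then(async crawler => {
    crawler.queue('https://www.emin.co.jp/');
    await crawler.onIdle();
    await crawler.close();
  });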
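A crawler that doesn't drive you up the wall
✓ If you are used to static crawlers, it feels terribly slow
✓ Finding out that it quietly stopped on an error is seriously demoralizing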
Running in a distributed environment
✓ Use Redis for the task queue and the cache
✓ Share Redis across multiple servers
v1.3.0 release
cache: new RedisCache(), // use Redis as the cache storage
example
const HCCrawler = require('headless-chrome-crawler');
const RedisCache = require('headless-chrome-crawler/cache/redis');

HCCrawler.launch({
  cache: new RedisCache({ host: …, port: … }), // host and port values did not survive in the transcript
  evaluatePage: (() => $('title').text()),
  onSuccess: (result => {
    console.log(`${result.options.url}\t${result.result}`);
  }),
})
  .then(async crawler => {
    crawler.queue('https://www.amazon.co.jp/');
    await crawler.onIdle();
    await crawler.close();
  });
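The idea of several servers sharing one Redis instance can be sketched as follows: whichever worker claims a URL first processes it, and the others skip it. `SharedCache` and `runWorker` are hypothetical names, and the in-memory `Set` is only a stand-in for Redis so that the example runs on its own.

```javascript
// Sketch: several crawler workers sharing one cache so each URL is handled once.
// `SharedCache` is a hypothetical in-memory stand-in for a shared Redis instance.
class SharedCache {
  constructor() {
    this.seen = new Set();
  }

  // Returns true only for the first caller that claims the URL.
  claim(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    return true;
  }
}

// Each worker keeps only the URLs it managed to claim.
function runWorker(name, urls, cache) {
  return urls.filter((url) => cache.claim(url)).map((url) => `${name}:${url}`);
}

const sharedCache = new SharedCache();
const allUrls = ['/a', '/b', '/c'];
const firstServer = runWorker('server1', allUrls.slice(0, 2), sharedCache);
const secondServer = runWorker('server2', allUrls, sharedCache);

console.log(firstServer);  // [ 'server1:/a', 'server1:/b' ]
console.log(secondServer); // [ 'server2:/c' ]
```

With a real Redis server, a command such as SADD, which reports whether the member was newly added, could play the role of `claim` across machines.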
Other features
✓ Breadth-first search (BFS) and depth-first search (DFS)
✓ Obeys robots.txt
✓ Follows XML sitemaps
✓ Device emulation
✓ Page screenshots
✓ JSON/CSV output
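The difference between the two search orders listed above comes down to how the list of pending pages is consumed. The sketch below walks a tiny made-up link graph both ways. `traverse` and `linkGraph` are hypothetical and do not reflect how the crawler implements it.

```javascript
// Sketch: breadth-first vs. depth-first traversal over a tiny link graph.
// `traverse` and `linkGraph` are hypothetical; only the frontier handling differs.
function traverse(graph, start, mode) {
  const frontier = [start];
  const visited = new Set([start]);
  const order = [];
  while (frontier.length > 0) {
    // BFS takes the oldest entry (queue); DFS takes the newest entry (stack).
    const page = mode === 'bfs' ? frontier.shift() : frontier.pop();
    order.push(page);
    for (const link of graph[page] || []) {
      if (!visited.has(link)) {
        visited.add(link);
        frontier.push(link);
      }
    }
  }
  return order;
}

const linkGraph = { '/': ['/a', '/b'], '/a': ['/a1'], '/b': ['/b1'] };
console.log(traverse(linkGraph, '/', 'bfs')); // [ '/', '/a', '/b', '/a1', '/b1' ]
console.log(traverse(linkGraph, '/', 'dfs')); // [ '/', '/b', '/b1', '/a', '/a1' ]
```

Breadth-first order reaches every page near the start first, which pairs naturally with a depth limit; depth-first order follows one branch to its end before backing up.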
Where the crawler goes from here
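Current challenges
✓ I don't have the nerve to line up 100 servers just to crawl with this crawler, and it would be a hassle anyway
✓ I want it to deploy to a distributed environment with a single command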
What is the Serverless Framework?
✓ Close to a configuration management "tool"
✓ Makes it easy to deploy and run on AWS Lambda, Azure Functions, and Google CloudFunctions
✓ Supports Node.js, Python, Java, Scala, C#, F#, Go, Groovy, Kotlin, PHP & Swift
✓ Lots of handy plugins
v2.0.0 will be…
yarn (npm run) deploy: deploys to AWS Lambda
yarn (npm run) start: starts crawling in parallel
I wonder if I could pass as the first person to build a crawler with headless Chrome
I wonder if I could pass as the first person to build a practical crawler with headless Chrome
But honestly, I'd like to write more code in my actual job
WE ARE HIRING https://www.emin.co.jp/blog/news/1527/ Sales