I wonder if I could get credit for being the first to build a crawler with headless Chrome

yujiosaka
February 22, 2018

Transcript

  1. Yuji Isobe
    I wonder if I could get credit for being the first to build a crawler with headless Chrome
    Node Gakuen, 29th session

  2. Project Manager at emin
    @yujiosaka
    https://speakerdeck.com/yujiosaka/hitasurale-sitedeipuraningu

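  3. This talk: the story of building a crawler
    ✓ Why a crawler, at this late date?
    ✓ What I was aiming for when I built it
    ✓ What I was thinking about while building it
    ✓ Where the crawler goes from here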

  4. Last year I did all sorts of things...

  5. A serialized column in ECZine
    http://eczine.jp/article/detail/4869

  6. Debut as an e-commerce expert
    http://amzn.asia/aOkwFjH

  7. I was in a bind (´・ω・｀)

  8. At work, people have stopped thinking of me as an engineer orz

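  9. The line that must not be crossed
    Tuning for each individual client ← "Well, I'm an AI engineer"
    Going along on sales visits ← "Sales engineering, I guess"
    Proposing new products ← "That's BizDev, right?"
    Starting to write sales materials ← "Uh, o-okay..."
    Starting to write press releases ← "That guy isn't an engineer anymore"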

  10. If I can, I want to make my living as an engineer for the rest of my life

  11. Win back my dignity as an engineer at work

  12. And then one day...

  13. I learned about headless Chrome
    https://developers.google.com/web/updates/2017/04/headless-chrome?hl=ja

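  14. What is headless Chrome?
    ✓ Chrome can be launched in headless mode
    ✓ Just add "--headless" to Chrome's launch options
    ✓ The best-known headless browser has been PhantomJS
    ✓ Runs fast and stably
    ✓ Quick to support standards (ES2017 and async/await are available)
    ✓ Main uses are test automation and dynamic crawlers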

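  15. Static crawlers vs. dynamic crawlers
    ✓ Static crawlers (wget, curl, etc.)
    ✓ Request only the document (an HTML file, etc.)
    ✓ Fast, because they only parse the file
    ✓ Do not work on SPA sites built with AngularJS, React, or Vue.js
    ✓ Dynamic crawlers (PhantomJS, headless Chrome, etc.)
    ✓ Load images, JavaScript, and CSS, and go as far as rendering
    ✓ Generally slow, because they also execute JavaScript
    ✓ Work on SPA sites the same way as on traditional sites
    ※ These names are my own invention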
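
To make the SPA limitation concrete, here is a minimal sketch in plain Node.js of what a static crawler gets to see. The `extractLinks` helper and both HTML strings are made up for this illustration, and a real crawler would use a proper HTML parser rather than a regular expression.

```javascript
// Hypothetical helper: collect href values from anchor tags in raw HTML.
// A regular expression is enough for this illustration, not for real pages.
function extractLinks(html) {
  const links = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// A traditional, server-rendered page: the links are already in the HTML.
const serverRendered =
  '<ul><li><a href="/about">About</a></li><li><a href="/news">News</a></li></ul>';

// An SPA shell: the links only appear after a browser runs the script.
const spaShell = '<div id="app"></div><script src="/bundle.js"></script>';

console.log(extractLinks(serverRendered)); // [ '/about', '/news' ]
console.log(extractLinks(spaShell));       // []
```

The second call finds nothing because the markup it needs is produced by JavaScript, which is the step a dynamic crawler adds.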

  16. Chrome DevTools Protocol
    https://chromedevtools.github.io/devtools-protocol/
    ✓ The latest specification is a JSON file in the Chromium source code
    ✓ It is copied to the GitHub repository once an hour
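
For a sense of what working at this level looks like, the sketch below builds the kind of JSON command the protocol exchanges. `Page.navigate` and `Runtime.evaluate` are real protocol methods; the `buildCommand` helper, the id counter, and the target URL are assumptions made for this illustration, and nothing is actually sent to a browser.

```javascript
// Hypothetical helper: serialize one DevTools Protocol command.
// Each command carries a caller-chosen id so its response can be matched to it.
let nextId = 0;
function buildCommand(method, params) {
  nextId += 1;
  return JSON.stringify({ id: nextId, method, params });
}

const navigate = buildCommand('Page.navigate', { url: 'https://example.com/' });
const evaluate = buildCommand('Runtime.evaluate', { expression: 'document.title' });

console.log(navigate);
console.log(evaluate);
```

Matching responses to ids, waiting for load events, and handling errors are all left to the caller, which is the bookkeeping a higher-level wrapper takes over.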

  17. Benchmark
    https://hackernoon.com/benchmark-headless-chrome-vs-phantomjs-e7f44c6956c

  18. RIP PhantomJS
    https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE

  19. If you are starting now, go with headless Chrome

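  20. But the problems pile up
    ✓ The API is too low-level and hard to work with
    ✓ The specification is still unstable and hard to keep up with
    ✓ You run into security blocks
    ✓ User protections such as Content Security Policy kick in
    ✓ You casually step on bugs
    ✓ The implementation of setRequestInterception is still experimental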

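  21. GoogleChrome / puppeteer
    https://github.com/GoogleChrome/puppeteer
    ✓ Maintained by the Google Chrome team
    ✓ A wrapper that lets you drive headless Chrome through a high-level API
    ✓ v1.0.0 was released in January
    ✓ A Slack group has been set up, and responses are courteous and quick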

  22. View Slide

  23. View Slide

  24. Crawlers with headless Chrome

  25. ...are super trendy right now~~~

  26. I wonder if I could get credit for being the first to build one

  27. Then I noticed

  28. puppeteer / examples
    https://github.com/GoogleChrome/puppeteer/tree/master/examples

  29. It is all "I tried it" posts and "explainers"; few entries are practical

  30. Let's build the first practical crawler with headless Chrome

  31. Other reasons
    ✓ Existing crawlers did not support Promises
    ✓ There was no Node.js crawler that ran in a distributed environment

  32. Setting the goals
    ✓ Has the features a practical crawler needs
    ✓ Documentation is written in English
    ✓ Well covered by tests
    ✓ Runs in a distributed environment
    ✓ Keep the API simple
    ✓ Get it listed in puppeteer / examples

  33. With this, I win back my dignity as an engineer

  34. View Slide

  35. Done
    https://github.com/yujiosaka/headless-chrome-crawler

  36. Goal achieved
    https://github.com/GoogleChrome/puppeteer/tree/master/examples

  37. Republished on Google Developers
    https://developers.google.com/web/tools/puppeteer/examples

  38. Spooked by the jump in traffic

  39. The End

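  40. Demo
    HCCrawler.launch({
      maxDepth: …, // maximum depth to explore
      maxConcurrency: …, // maximum number of parallel requests
      allowedDomains: …, // domains that are allowed
      evaluatePage: (() => $('title').text()), // function evaluated on the page
      onSuccess: (result => { // function evaluated on success
        console.log(`${result.options.url}\t${result.result}`);
      }),
    })
      .then(async crawler => {
        crawler.queue('https://www.emin.co.jp/');
        await crawler.onIdle();
        await crawler.close();
      });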

  41. How the crawler came together

  42. The most minimal crawler
    ✓ "Crawling" and "scraping" are different things
    ✓ Crawling: finding links in HTML
    ✓ Scraping: finding the information you want in HTML
    ✓ Neither is much use on its own
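
Both halves come down to finding something in HTML, which a small sketch can show. The `findLinks` and `findTitle` helpers and the sample page are hypothetical and written for this illustration only; regular expressions stand in for a real HTML parser.

```javascript
// Sample page, made up for this illustration.
const html =
  '<html><head><title>Top</title></head>' +
  '<body><a href="/a">A</a> <a href="/b">B</a></body></html>';

// Crawling: find links in the HTML (where to go next).
function findLinks(source) {
  return Array.from(source.matchAll(/<a\s[^>]*href="([^"]*)"/g), (m) => m[1]);
}

// Scraping: find the wanted information in the HTML (here, the page title).
function findTitle(source) {
  const match = source.match(/<title>([^<]*)<\/title>/);
  return match ? match[1] : null;
}

console.log(findLinks(html)); // [ '/a', '/b' ]
console.log(findTitle(html)); // Top
```

A crawler needs both: the links decide which pages get visited, and the scraped values are what it reports for each page.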

  43. What do the two have in common?

  44. Finding ___ in HTML

  45. Wouldn't jQuery do for that?

  46. v1.0.0 release
    jQuery: true,
    Automatically injects jQuery into the page

  47. example
    HCCrawler.launch({
      jQuery: true,
      evaluatePage: (() => $('title').text()),
      onSuccess: (result => {
        console.log(`${result.options.url}\t${result.result}`);
      }),
    })
      .then(async crawler => {
        crawler.queue('https://www.emin.co.jp/');
        await crawler.onIdle();
        await crawler.close();
      });

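  48. A crawler that doesn't drive you up the wall
    ✓ If you are used to static crawlers, it feels awfully slow
    ✓ It is seriously deflating to find it has quietly stopped on an error

  49. Running in a distributed environment
    ✓ Use Redis for the task queue and the cache
    ✓ Share Redis across multiple servers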
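
The following is a minimal sketch of why a shared queue and cache let several servers cooperate without crawling the same page twice. A plain in-memory object stands in for Redis, and the "workers" are just turns in a single loop; `SharedStore`, `enqueue`, `dequeue`, and the link graph are all hypothetical and are not the library's API.

```javascript
// Stand-in for Redis: one store that every worker reads from and writes to.
class SharedStore {
  constructor() {
    this.queue = [];       // task queue of URLs waiting to be crawled
    this.seen = new Set(); // cache of URLs that have already been queued
  }

  // Queue a URL only if no worker has queued it before.
  enqueue(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  dequeue() {
    return this.queue.shift();
  }
}

// Made-up link graph: page -> links found on that page.
const pages = {
  '/': ['/a', '/b'],
  '/a': ['/b', '/'],
  '/b': ['/a'],
};

const store = new SharedStore();
store.enqueue('/');

// Two workers take turns pulling from the same queue.
const crawledBy = { worker1: [], worker2: [] };
const workers = ['worker1', 'worker2'];
let turn = 0;
let url;
while ((url = store.dequeue()) !== undefined) {
  const worker = workers[turn % workers.length];
  crawledBy[worker].push(url);
  pages[url].forEach((link) => store.enqueue(link));
  turn += 1;
}

console.log(crawledBy); // { worker1: [ '/', '/b' ], worker2: [ '/a' ] }
```

Because the "seen" set lives in the shared store rather than in each worker, the pages are split between the workers and each one is crawled exactly once.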

  50. v1.3.0 release
    cache: new RedisCache(),
    Specifies Redis as the cache storage

  51. example (the slide shows this same snippet three times)
    HCCrawler.launch({
      cache: new RedisCache({ host: …, port: … }),
      evaluatePage: (() => $('title').text()),
      onSuccess: (result => {
        console.log(`${result.options.url}\t${result.result}`);
      }),
    })
      .then(async crawler => {
        crawler.queue('https://www.amazon.co.jp/');
        await crawler.onIdle();
        await crawler.close();
      });

  52. Other features
    ✓ Breadth-first search (BFS) & depth-first search (DFS)
    ✓ Obeys robots.txt
    ✓ XML sitemap discovery
    ✓ Device emulation
    ✓ Page screenshots
    ✓ JSON/CSV output
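
To illustrate the first item, this sketch shows how the two search orders visit the same small site differently. The link graph and the `crawlOrder` function are invented for this example and say nothing about how the crawler itself implements either mode.

```javascript
// Made-up link graph: page -> links found on that page.
const links = {
  '/': ['/a', '/b'],
  '/a': ['/a/1', '/a/2'],
  '/b': ['/b/1'],
  '/a/1': [],
  '/a/2': [],
  '/b/1': [],
};

// Visit pages from `start`, taking the next page from the front of the
// pending list (breadth-first) or from the back of it (depth-first).
function crawlOrder(start, depthFirst) {
  const pending = [start];
  const visited = [];
  while (pending.length > 0) {
    const page = depthFirst ? pending.pop() : pending.shift();
    if (visited.includes(page)) continue;
    visited.push(page);
    // For depth-first, push links in reverse so the first link is visited first.
    const next = depthFirst ? [...links[page]].reverse() : links[page];
    pending.push(...next);
  }
  return visited;
}

console.log(crawlOrder('/', false)); // [ '/', '/a', '/b', '/a/1', '/a/2', '/b/1' ]
console.log(crawlOrder('/', true));  // [ '/', '/a', '/a/1', '/a/2', '/b', '/b/1' ]
```

Breadth-first reaches every top-level page before going deeper, while depth-first finishes one branch before starting the next.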

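  53. Where the crawler goes from here

  54. Current challenges
    ✓ Nobody is going to line up 100 servers just to crawl with this crawler, and it is a hassle
    ✓ I want to deploy to a distributed environment with a single command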

  55. View Slide

  56. What is the Serverless Framework?
    ✓ Closer to a configuration management "tool"
    ✓ Easily deploys to and runs on AWS Lambda, Azure Functions, and Google Cloud Functions
    ✓ Supports Node.js, Python, Java, Scala, C#, F#, Go, Groovy, Kotlin, PHP & Swift
    ✓ Plenty of handy plugins as well

  57. v2.0.0 will be…
    yarn (npm run) deploy ← deploys to AWS Lambda
    yarn (npm run) start ← starts crawling in parallel

  58. I wonder if I could get credit for being the first to build a crawler with headless Chrome

  59. I wonder if I could get credit for being the first to build a practical crawler with headless Chrome

  60. But the truth is, I want to write more code at work

  61. WE ARE HIRING
    https://www.emin.co.jp/blog/news/1527/
    Sales too