Can't We Just Say I Was the First to Build a Crawler with Headless Chrome?

yujiosaka
February 22, 2018

Transcript

  1. Yuji Isobe: Can't We Just Say I Was the First to Build a Crawler with Headless Chrome? Node Gakuen (Node学園), 29th session

  2. Project manager at emin
     @yujiosaka https://speakerdeck.com/yujiosaka/hitasurale-sitedeipuraningu

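  3. This talk is about building a crawler
    ✓ Why a crawler, at this point?
    ✓ What I was aiming for when I built it
    ✓ What I was thinking about while I built it
    ✓ The future of crawlers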

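  4. Last year I did all sorts of things…

  5. A serialized column in ECZine http://eczine.jp/article/detail/4869

  6. My debut as an e-commerce expert http://amzn.asia/aOkwFjH

  7. I had a problem (´・ω・`)

  8. At work, people had stopped thinking of me as an engineer orz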

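  9. The line that must not be crossed
    Tuning for each client company ← "Well, I'm an AI engineer."
    Tagging along on sales calls ← "Technical sales, maybe?"
    Proposing new products ← "That's BizDev, right?"
    Starting to write sales materials ← "Uh, okay…"
    Starting to write press releases ← "That guy's not an engineer anymore."

  10. If I can, I want to make my living as an engineer for the rest of my life

  11. Win back my dignity as an engineer at work

  12. And then one day…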

  13. I found out about headless Chrome https://developers.google.com/web/updates/2017/04/headless-chrome?hl=ja

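  14. What is headless Chrome?
    ✓ Chrome can be launched in headless mode
    ✓ You only need to add "--headless" to Chrome's launch options
    ✓ Until now, the best-known headless browser has been PhantomJS
    ✓ It runs fast and is stable
    ✓ It is quick to support new standards (ES2017 and async/await are available)
    ✓ Its main uses are test automation and dynamic crawlers

  15. Static crawlers vs. dynamic crawlers (※ the names are my own)
    ✓ Static crawlers (wget, curl, and the like)
    ✓ They only request the document (the HTML file and so on)
    ✓ They are fast because all they do is parse the file
    ✓ They do not work on SPA sites built with AngularJS, React, or Vue.js
    ✓ Dynamic crawlers (PhantomJS, headless Chrome, and the like)
    ✓ They load images, JavaScript, and CSS, and go as far as rendering the page
    ✓ They are generally slow because they also execute JavaScript
    ✓ They work on SPA sites the same way they do on conventional sites

  16. Chrome DevTools Protocol https://chromedevtools.github.io/devtools-protocol/
    ✓ The latest specification is a set of JSON files in the Chromium source code
    ✓ It is copied to a GitHub repository once an hour

  17. Benchmark https://hackernoon.com/benchmark-headless-chrome-vs-phantomjs-e7f44c6956c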

  18. RIP PhantomJS https://groups.google.com/forum/#!topic/phantomjs/9aI5d-LDuNE

  19. If you are starting now, go with headless Chrome

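  20. But there is also a mountain of problems
    ✓ The API is too low-level and hard to work with
    ✓ The specification is still unstable and hard to keep up with
    ✓ You run into security blocks
    ✓ User protections such as Content Security Policy kick in
    ✓ You casually step on bugs
    ✓ The implementation of setRequestInterception is still experimental

  21. GoogleChrome / puppeteer https://github.com/GoogleChrome/puppeteer
    ✓ Maintained by the Google Chrome team
    ✓ A wrapper that lets you drive headless Chrome through a high-level API
    ✓ v1.0.0 was released in January
    ✓ A Slack group has been set up, and responses are courteous and quick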
  22. None
  23. None
  24. Crawlers on headless Chrome

  25. …are, like, super trendy right now ~~~

  26. Can't we just say I built one first?

  27. Then I noticed something

  28. puppeteer / examples https://github.com/GoogleChrome/puppeteer/tree/master/examples

  29. It is all "I tried it out" posts and "explainers"; very few are practical

  30. Let's build the first practical crawler on headless Chrome

  31. Other reasons
    ✓ Existing crawlers do not support Promises
    ✓ There was no Node.js crawler that runs in a distributed environment

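  32. Setting the goals
    ✓ It has the features a practical crawler needs
    ✓ The documentation is written in English
    ✓ It is sufficiently covered by tests
    ✓ It runs in a distributed environment
    ✓ The API is kept simple
    ✓ It gets listed in puppeteer / examples

  33. With this, I win back my dignity as an engineer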

  34. Done https://github.com/yujiosaka/headless-chrome-crawler

  35. Goal achieved https://github.com/GoogleChrome/puppeteer/tree/master/examples

  36. Republished on Google Developers https://developers.google.com/web/tools/puppeteer/examples

  37. Freaking out at the jump in traffic

  38. The End

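  39. Demo

        const HCCrawler = require('headless-chrome-crawler');

        HCCrawler.launch({
          maxDepth: …, // maximum depth to explore
          maxConcurrency: …, // maximum concurrency
          allowedDomains: ['www.emin.co.jp'], // domains that are allowed
          evaluatePage: (() => $('title').text()), // function evaluated in the page
          onSuccess: (result => { // function evaluated on success
            console.log(`${result.options.url}\t${result.result}`);
          }),
        })
          .then(async crawler => {
            crawler.queue('https://www.emin.co.jp/');
            await crawler.onIdle();
            await crawler.close();
          });

  40. How the crawler came together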

  41. The most minimal crawler
    ✓ "Crawling" and "scraping" are different things
    ✓ Crawling: finding links in the HTML
    ✓ Scraping: finding the information you want in the HTML
    ✓ Neither one is much use on its own
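
The distinction above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `findLinks` and `findTitle` and the sample page are hypothetical, and regular expressions stand in for a real HTML parser purely to keep the example dependency-free.

```javascript
// Crawling and scraping both boil down to "find something in the HTML".
// Hypothetical helpers; regexes are used only to avoid dependencies and
// would not be robust enough for real-world pages.

// Crawling: find the links in a page and resolve them against the page URL.
function findLinks(html, baseUrl) {
  const links = [];
  const pattern = /<a\s[^>]*href=["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(new URL(match[1], baseUrl).href);
  }
  return links;
}

// Scraping: find the information you want (here, the page title).
function findTitle(html) {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Hypothetical sample page.
const sampleHtml = `
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/about">About</a>
    <a class="ext" href="https://example.org/docs">Docs</a>
  </body>
</html>`;

console.log(findLinks(sampleHtml, 'https://example.com/'));
console.log(findTitle(sampleHtml));
```

Both functions take the same input and differ only in what they look for, which is the point the following slides build on.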

  42. What do the two have in common?

  43. Finding ___ in the HTML

  44. Couldn't you just use jQuery for that?

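  45. v1.0.0 released

        jQuery: true, // automatically inject jQuery into the page

  46. example

        const HCCrawler = require('headless-chrome-crawler');

        HCCrawler.launch({
          jQuery: true,
          evaluatePage: (() => $('title').text()),
          onSuccess: (result => {
            console.log(`${result.options.url}\t${result.result}`);
          }),
        })
          .then(async crawler => {
            crawler.queue('https://www.emin.co.jp/');
            await crawler.onIdle();
            await crawler.close();
          });

  47. A crawler that doesn't get on your nerves
    ✓ If you are used to static crawlers, it feels incredibly slow
    ✓ Finding out it quietly died on an error is seriously demoralizing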

  48. Making it run in a distributed environment
    ✓ Use Redis for the task queue and the cache
    ✓ Share that Redis across multiple servers
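
The idea of several workers sharing one queue and one cache can be sketched as follows. This is not how the library implements it. Everything here is an assumption made for illustration: `SharedStore` is a hypothetical in-memory stand-in for Redis, `site` is a made-up link graph, and the 1 ms sleep stands in for fetching and rendering a page.

```javascript
// Hypothetical in-memory stand-in for a store shared by every worker.
// In a real deployment these methods would be backed by Redis instead.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class SharedStore {
  constructor() {
    this.queue = [];       // task queue: URLs waiting to be crawled
    this.seen = new Set(); // cache: URLs that were already queued once
    this.inFlight = 0;     // URLs currently being processed by some worker
  }
  // Queue a URL unless some worker has already queued it.
  async push(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }
  // Claim the next URL; returns undefined when the queue is empty.
  async pop() {
    const url = this.queue.shift();
    if (url !== undefined) this.inFlight += 1;
    return url;
  }
  // Mark a claimed URL as finished.
  async done() {
    this.inFlight -= 1;
  }
  // True once nothing is queued and nothing is being processed.
  async isDrained() {
    return this.queue.length === 0 && this.inFlight === 0;
  }
}

// Made-up site: each URL maps to the links found on that page.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/b', '/c'],
  '/b': ['/c'],
  '/c': ['/'],
};

// Each worker keeps claiming URLs from the shared store until it is drained.
async function runWorker(name, store, visited) {
  for (;;) {
    const url = await store.pop();
    if (url === undefined) {
      if (await store.isDrained()) return;
      await sleep(1); // another worker may still add links
      continue;
    }
    await sleep(1); // pretend to fetch and render the page
    visited.push({ worker: name, url });
    for (const link of site[url] || []) {
      await store.push(link);
    }
    await store.done();
  }
}

async function crawlWithWorkers(workerNames) {
  const store = new SharedStore();
  const visited = [];
  await store.push('/');
  await Promise.all(workerNames.map(name => runWorker(name, store, visited)));
  return visited;
}

crawlWithWorkers(['worker-1', 'worker-2']).then(visited => console.log(visited));
```

Because the "seen" set lives in the shared store instead of inside each worker, every URL is crawled once no matter how many workers are running.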

  49. v1.3.0 released

        cache: new RedisCache(), // use Redis as the cache storage

  50. example (the slide repeats the same snippet three times)

        const HCCrawler = require('headless-chrome-crawler');
        const RedisCache = require('headless-chrome-crawler/cache/redis');

        HCCrawler.launch({
          cache: new RedisCache({ host: …, port: … }),
          evaluatePage: (() => $('title').text()),
          onSuccess: (result => {
            console.log(`${result.options.url}\t${result.result}`);
          }),
        })
          .then(async crawler => {
            crawler.queue('https://www.amazon.co.jp/');
            await crawler.onIdle();
            await crawler.close();
          });

  51. Other features
    ✓ Breadth-first search (BFS) and depth-first search (DFS)
    ✓ Obeys robots.txt
    ✓ XML sitemap crawling
    ✓ Device emulation
    ✓ Page screenshots
    ✓ JSON/CSV output
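
As a rough illustration of the first item, the only difference between the two traversal orders is which pending URL is taken next. The link graph and the `crawlOrder` function below are hypothetical and do not reflect the library's own options or implementation.

```javascript
// Made-up link graph: each URL maps to the links found on that page.
const linkGraph = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': ['/b1'],
  '/a1': [],
  '/a2': [],
  '/b1': [],
};

// Returns the order in which pages would be visited.
function crawlOrder(start, strategy) {
  const pending = [start];
  const seen = new Set([start]);
  const order = [];
  while (pending.length > 0) {
    // BFS takes the oldest pending URL (queue); DFS takes the newest (stack).
    const url = strategy === 'bfs' ? pending.shift() : pending.pop();
    order.push(url);
    for (const next of linkGraph[url] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        pending.push(next);
      }
    }
  }
  return order;
}

console.log(crawlOrder('/', 'bfs')); // [ '/', '/a', '/b', '/a1', '/a2', '/b1' ]
console.log(crawlOrder('/', 'dfs')); // [ '/', '/b', '/b1', '/a', '/a2', '/a1' ]
```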
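  52. The future of crawlers

  53. Current challenges
    ✓ Nobody is going to line up 100 servers just to crawl with this crawler, and it would be a pain anyway
    ✓ I want one command that deploys it to a distributed environment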

  54. None
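  55. What is the Serverless Framework?
    ✓ Closer to a configuration-management "tool"
    ✓ Easily deploys and runs AWS Lambda, Azure Functions, and Google Cloud Functions
    ✓ Supports Node.js, Python, Java, Scala, C#, F#, Go, Groovy, Kotlin, PHP & Swift
    ✓ Plenty of handy plugins too

  56. v2.0.0 will be…

        yarn (npm run) deploy   # deploy to AWS Lambda
        yarn (npm run) start    # start crawling in parallel

  57. Can't we just say I was the first to build a crawler with headless Chrome?

  58. Can't we just say I was the first to build a practical crawler with headless Chrome?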

  59. But what I really want is to write more code at work

  60. WE ARE HIRING https://www.emin.co.jp/blog/news/1527/ (sales roles too)