Behind the Scenes of a Large-Scale Crawler That Crawls Several Million Items per Day / How IQON crawler works
Takehiro Shiozaki
August 01, 2017
Technology
Transcript
© 2017 VASILY,Inc.
Behind the Scenes of a Large-Scale Crawler That Crawls Several Million Items per Day
Speee Cafe Meetup
Takehiro Shiozaki

Self-introduction
- Takehiro Shiozaki
- Joined VASILY as a new graduate
- Backend engineer
- Ruby
- Google BigQuery
- Apache Solr
- Embulk
- Digdag
- $ crontab -l
  0 0 7 8 * /bin/increment_age
- VASILY's technical advisor: Matz
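
Company profile
- Company name: VASILY, Inc.
- Founded: […]
- Location: Tokyo
- Employees: […]
- Capital: […]
- CEO: Yuki Kanayama
- Directors: Masayuki Imamura, […] Chiba
- Major shareholders: Globis Venture Capital, ITOCHU Technology Ventures, GMO Venture Partners, KDDI Corporation, Kodansha Ltd.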

IQON
- Lists more than […] items from more than […] fashion EC sites
- One of Japan's largest fashion sites, used by more than […] people per month

Agenda
- About the IQON crawler
  - Scale and the information it collects
- Technologies behind distributed crawling
  - Asynchronous processing with SQS and Shoryuken
- Scalable infrastructure
  - Autoscaling with Docker, Mesos, and Marathon
- Summary

The IQON crawler: scale
- Partner sites: more than […]
- Items available for purchase at any time: more than […]

The IQON crawler: information collected
- Collects […] fields per item, including:
  - Photos
  - Price
  - Stock
  - Brand
  - Product name
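
Distributed crawling
- The number of target pages is huge, so crawling is distributed
- There is no ready-made framework for distributed crawling
- Implemented from scratch in Ruby
- "Scrapy doesn't provide any built-in facility for running crawls in a distribute (multi-server) manner."
  https://doc.scrapy.org/en/latest/topics/practices.html#distributed-crawls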
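
Distributed, asynchronous processing through a task queue
- SQS is used as the task queue
- The processes that run tasks (workers) are loosely coupled to one another
- Shoryuken is used as the asynchronous processing library
- Diagram: worker, enqueue, SQS, dequeue, worker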
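
The loose coupling described above can be illustrated without AWS. The following is a minimal sketch, assuming Ruby's in-process Queue as a stand-in for SQS; the JSON message shape and the example.com URLs are invented for illustration and are not taken from the talk. The producer and the workers share nothing except the queue and the message format.

```ruby
require 'json'

# In-process stand-in for an SQS queue (assumption: the real crawler uses SQS).
queue = Queue.new
results = Queue.new

# Two workers that know nothing about each other or about the producer.
workers = 2.times.map do |id|
  Thread.new do
    # Queue#pop returns nil once the queue is closed and drained.
    while (body = queue.pop)
      task = JSON.parse(body)
      results << "worker #{id} handled #{task['url']}"
    end
  end
end

# Producer: enqueue JSON messages, then close the queue to signal completion.
%w[https://example.com/a https://example.com/b https://example.com/c].each do |url|
  queue << JSON.generate('url' => url)
end
queue.close
workers.each(&:join)

handled = []
handled << results.pop until results.empty?
puts handled.sort
```

Which worker handles which message is not deterministic, which is the point: any worker can take any task.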

Shoryuken
- An asynchronous processing framework written in Ruby
- Manages multiple queues
- Multithreaded

    require 'shoryuken'

    class HelloWorker
      include Shoryuken::Worker
      shoryuken_options queue: 'hello', auto_delete: true

      # Called by Shoryuken for each message received from the 'hello' queue.
      def perform(sqs_msg, name)
        puts "Hello, #{name}"
      end
    end

    HelloWorker.perform_async('joe')  # async call

Overall architecture
- Pipeline: CloudWatch + Lambda, start-crawler, list-page-enqueue, item-page-enqueue, item-page-download, item-page-parse
- The crawl is broken down into multiple small tasks

CloudWatch + Lambda
- A CloudWatch Event fires periodically and starts a Lambda function
- The Lambda function puts a task into SQS, which starts the crawl
- Diagram: CloudWatch, SNS, Lambda (invoke), SQS (enqueue)

start-crawler worker
- Fetches information about the sites to crawl from RDS
- Enqueues a task to crawl each of those sites
- Diagram: start-crawler fetches the target site information from RDS and enqueues to SQS, once per EC site

list-page-enqueue worker
- Scrapes the pagination section of the EC site
- Enqueues a task to analyze the URL of every list page
- Diagram: list-page-enqueue analyzes the pagination of the EC site (https://example.com/items?page=N) and enqueues to SQS, once per list page
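
The pagination step above can be sketched as follows. This is a minimal illustration, assuming the pagination links carry a `page` query parameter as in the slide's example.com URLs; the helper name `list_page_urls` and the sample HTML are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: find the highest ?page=N among the pagination links
# and return one list-page URL per page (these would become one task each).
def list_page_urls(base_url, pagination_html)
  pages = pagination_html.scan(/[?&]page=(\d+)/).flatten.map(&:to_i)
  last_page = pages.max || 1  # no pagination links: treat as a single page
  (1..last_page).map do |n|
    uri = URI.parse(base_url)
    uri.query = URI.encode_www_form(page: n)
    uri.to_s
  end
end

html = '<a href="/items?page=1">1</a> <a href="/items?page=2">2</a> ' \
       '<a href="/items?page=3">3</a>'
urls = list_page_urls('https://example.com/items', html)
puts urls
```

A real site may need a different rule (for example, following "next" links), so the rule would have to be configurable per site.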

item-page-enqueue worker
- Extracts the detail page URLs from each list page
- Enqueues a task to analyze the URL of every page
- Diagram: item-page-enqueue reads a list page of the EC site, collects the detail page URLs (https://example.com/items/N), and enqueues to SQS, once per page
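
Extracting detail page URLs from a list page might look like the sketch below. It assumes that detail pages live under `/items/<id>`, as in the slide's example URLs, and uses a plain regular expression instead of an HTML parser to stay within the standard library; `detail_page_urls` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: collect links that look like detail pages, make them
# absolute relative to the list page URL, and drop duplicates.
def detail_page_urls(list_page_url, html)
  html.scan(/href="([^"]+)"/).flatten
      .select { |href| href.match?(%r{/items/\d+\z}) }
      .map { |href| URI.join(list_page_url, href).to_s }
      .uniq
end

html = '<a href="/items/101">A</a> <a href="/items/102">B</a> ' \
       '<a href="/items?page=2">next</a> <a href="/items/101">A again</a>'
detail_urls = detail_page_urls('https://example.com/items?page=1', html)
puts detail_urls
```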
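
item-page-download worker
- Downloads the HTML of each detail page
- Uses a distributed lock in Redis (described later) to control the download interval
- Enqueues a task to parse the HTML
- Diagram: item-page-download acquires the lock, fetches the detail page HTML from the EC site (<!DOCTYPE> <HTML><HEAD><TITLE>トップス...), and enqueues to SQS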
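
item-page-parse worker
- Parses the HTML with XPath and regular expressions
- The parse settings (XPath, regular expressions) are entered through a web tool
- Enqueues a task that writes the parse result to the DB
- Diagram: parse settings are entered into RDS; item-page-parse reads them, produces a result such as { "title": "トップス", "price": 9800, and enqueues to SQS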
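
Settings-driven parsing, as described above, can be sketched like this. The settings format, field names, and sample HTML are assumptions for illustration, not the talk's actual schema. Only regular-expression rules are shown; XPath rules would need an HTML parser and are left out.

```ruby
require 'json'

# Hypothetical per-site parse settings, as they might be stored after being
# entered in a web tool. Each rule captures one field with a regex group.
PARSE_SETTINGS = {
  'title' => { 'pattern' => '<title>([^<]+)</title>', 'type' => 'string' },
  'price' => { 'pattern' => 'class="price">¥?([\d,]+)<', 'type' => 'integer' }
}.freeze

def parse_item(html, settings)
  settings.each_with_object({}) do |(field, rule), item|
    match = Regexp.new(rule['pattern'], Regexp::IGNORECASE).match(html)
    next unless match  # leave the field out when the pattern does not match

    value = match[1].strip
    item[field] = rule['type'] == 'integer' ? value.delete(',').to_i : value
  end
end

html = '<html><head><title>トップス</title></head>' \
       '<body><span class="price">¥9,800</span></body></html>'
item = parse_item(html, PARSE_SETTINGS)
puts JSON.generate(item)
```

Keeping the rules as data means a site layout change can be handled by editing settings rather than redeploying workers.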
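
Processing after this point
- Omitted today for lack of time
- Writing the crawl results to the DB
- Image processing
  - Background transparency
  - Representative color extraction
- Automatic category classification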

Download interval
- A distributed lock in Redis is used to control the download interval
  https://redis.io/topics/distlock
- Diagram (download worker A and download worker B over time): get_lock success, locked, get_lock fail, download, get_lock fail, expire, get_lock success
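
The timeline above can be reproduced with a small simulation. This is a sketch under stated assumptions: `FakeRedis` is an in-memory stand-in that mimics only `SET key value NX PX <ms>` (with the redis-rb gem the equivalent call is `redis.set(key, value, nx: true, px: ms)`), and the key name, worker names, and interval are invented. It assumes the lock is left to expire rather than released, so its TTL acts as the minimum interval between downloads, which is how the diagram reads.

```ruby
# In-memory stand-in for SET key value NX PX <ms>. The clock is injectable
# so the example runs without sleeping.
class FakeRedis
  def initialize(clock)
    @clock = clock
    @store = {}
  end

  def set(key, value, nx:, px:)
    now = @clock.call
    entry = @store[key]
    @store.delete(key) if entry && entry[:expires_at] <= now  # expired lock
    return false if nx && @store.key?(key)                    # still locked

    @store[key] = { value: value, expires_at: now + px / 1000.0 }
    true
  end
end

# Hypothetical helper: a worker may download from a site only if it wins the
# site's lock; otherwise it skips and tries again later.
def try_download(redis, site, worker, interval_ms)
  return :skipped unless redis.set("lock:#{site}", worker, nx: true, px: interval_ms)

  :downloaded
end

now = 0.0
redis = FakeRedis.new(-> { now })

log = []
log << try_download(redis, 'example.com', 'A', 1000)  # A: get_lock success
log << try_download(redis, 'example.com', 'B', 1000)  # B: get_lock fail
now += 1.0                                            # the lock expires
log << try_download(redis, 'example.com', 'B', 1000)  # B: get_lock success
p log
```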

Character encoding
- The meta charset declaration is not trusted
- The encoding auto-detection of Kconv (an nkf wrapper) is used instead
- ::Kconv.toutf8(str) is all that is needed
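
Supporting SPAs (Single Page Applications)
- The HTML of an SPA contains almost none of the information
- Downloads go through a proxy built with PhantomJS
- The content is captured after the onLoad event has fired
- Diagram: download worker, download, EC site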

SQS size limit
- SQS can only store text data of 256 KB or less
- The HTML of some pages exceeds this
- The HTML is compressed with HtmlCompressor + Zlib + Base64
- Compressed to roughly […] of the original size

    Base64.encode64(
      Zlib::Deflate.deflate(
        HtmlCompressor::Compressor.new.compress(html)
      )
    )
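
A round trip of the compression step can be checked with the standard library alone. This is a minimal sketch: `minify_html` is a naive stand-in for the HtmlCompressor gem (it only collapses whitespace between tags), `Base64.strict_encode64` is used instead of `encode64` to avoid embedded newlines, and the 256 KB constant reflects the SQS limit at the time of the talk. The sample page is synthetic and very repetitive, so its compression ratio says nothing about real pages.

```ruby
require 'zlib'
require 'base64'

SQS_MAX_BYTES = 256 * 1024  # SQS message size limit at the time of the talk

# Naive stand-in for HtmlCompressor (assumption for this sketch).
def minify_html(html)
  html.gsub(/>\s+</, '><').strip
end

# Minify, deflate, then Base64-encode so the body is plain text for SQS.
def pack_html(html)
  Base64.strict_encode64(Zlib::Deflate.deflate(minify_html(html)))
end

# Reverse of pack_html; inflate returns binary, so restore the UTF-8 label.
def unpack_html(body)
  Zlib::Inflate.inflate(Base64.strict_decode64(body)).force_encoding(Encoding::UTF_8)
end

html = "<html>\n  <body>\n" + "    <p>トップス 9,800円</p>\n" * 20_000 + "  </body>\n</html>\n"
body = pack_html(html)
puts "raw: #{html.bytesize} bytes, packed: #{body.bytesize} bytes"
puts "fits in one message: #{body.bytesize <= SQS_MAX_BYTES}"
```

Base64 adds about a third to the deflated size, so the check against the limit has to be made on the encoded body, not on the deflated bytes.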

Infrastructure diagram
- EC2 spot fleet instances running workers in containers
- Deploy Container
- SQS (enqueue / dequeue)
- Lambda
- CloudWatch (Watch Metrics, Auto Scale)

Apache Mesos + Marathon
- Apache Mesos
  - "A distributed systems kernel"
  - Abstracts multiple machines into a single pool of computing resources
- Marathon
  - A container orchestration tool that runs on Mesos
  - Runs Mesos tasks as long-lived daemons

Utilization of the download worker
- Chart: number of containers vs. processing, with / without
- Because the download worker uses a lock, […] is required
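
Multiplexing the download queue
- Download queues and distributed locks are independent for each EC site
- Diagram: each download worker dequeues from a randomly chosen queue and acquires the distributed lock for the corresponding site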
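
The per-site queues and locks can be modelled in memory to show why one slow site does not block the others. This is a sketch under assumptions: the site names, the 1-second interval, the 0.25-second polling step, and the choice to skip a turn when the chosen site is locked are all invented for illustration.

```ruby
# Hypothetical in-memory model: one download queue and one lock per EC site.
queues = {
  'site-a.example' => ['https://site-a.example/items/1', 'https://site-a.example/items/2'],
  'site-b.example' => ['https://site-b.example/items/1'],
  'site-c.example' => []
}
locked_until = Hash.new(0.0)  # site => time until which its lock is held
INTERVAL = 1.0                # assumed per-site download interval in seconds

# One worker step: pick a random non-empty queue and dequeue from it only if
# that site's lock is free. Returns the URL to download, or nil.
def next_download(queues, locked_until, now, rng)
  site = queues.reject { |_, tasks| tasks.empty? }.keys.sample(random: rng)
  return nil if site.nil? || locked_until[site] > now

  locked_until[site] = now + INTERVAL
  queues[site].shift
end

rng = Random.new(1)
downloads = []
now = 0.0
while queues.values.any? { |tasks| !tasks.empty? }
  url = next_download(queues, locked_until, now, rng)
  downloads << [now, url] if url
  now += 0.25
end
downloads.each { |time, url| puts format('%.2f %s', time, url) }
```

Downloads from different sites can interleave freely, while two downloads from the same site are always at least one interval apart.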
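
Metrics for autoscaling
- For the download queues, monitoring the number of unprocessed tasks is meaningless
- Instead, the number of queues that still have unprocessed tasks is monitored
- CloudWatch cannot monitor this
- A Lambda function calls the SQS API
- Diagram: CloudWatch Event invokes Lambda; Lambda gets the number of non-empty download queues from SQS and changes the number of containers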
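
The metric described above reduces to counting non-empty queues. The sketch below assumes the per-queue depths have already been fetched (with SQS this would be the `ApproximateNumberOfMessages` queue attribute); the function name, the `min` and `max` bounds, and the sample numbers are hypothetical.

```ruby
# Hypothetical sizing rule: each site's lock lets only one download proceed
# at a time, so the useful number of download containers follows the number
# of non-empty queues, not the total number of waiting tasks.
def desired_containers(queue_depths, min: 1, max: 50)
  non_empty = queue_depths.count { |_, depth| depth.positive? }
  non_empty.clamp(min, max)
end

depths = { 'site-a' => 120_000, 'site-b' => 0, 'site-c' => 35, 'site-d' => 0 }
puts desired_containers(depths)
```

Here 120,035 tasks are waiting, but only two sites have work, so more than two download containers would mostly wait on locks.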

For more information
- Building an IQON crawler that is […] times faster with Docker + Apache Mesos + Marathon
  http://tech.vasily.jp/entry/iqon-crawler-by-docker-and-mesos-and-marathon
- […] tips for running Apache Mesos + Marathon in production
  http://tech.vasily.jp/entry/apache-mesos-and-marathon-tips
- Production deployment of the Docker container with Marathon
  https://speakerdeck.com/kotatsu/production-deployment-of-the-docker-container-with-marathon
- Apache Mesos with Amazon EC2 Spot Fleet
  https://speakerdeck.com/kotatsu/apache-mesos-with-amazon-ec2-spot-fleet
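
Summary
- The IQON crawler crawls several million items every day
- A large-scale distributed crawler implemented from scratch in Ruby
- A flexible application built on asynchronous processing
- Faster crawling and lower cost on top of scalable infrastructure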

We're Hiring
https://vasily.jp/recruit