Running Scraping Reliably: What Was Hard and How We Worked Around It

shida
August 21, 2016

Bayside Tech Bridge 2016.08.21
Crawling specialists talk about what goes on behind the scenes of running crawlers!

Transcript

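  1. Running scraping reliably: what was hard and how we worked around it. #BaysideTechBridge: crawling specialists talk about what goes on behind the scenes of running crawlers! 志田裕樹 (株式会社ビー・アジャイル and ジークラウド株式会社)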

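  2. 株式会社ビー・アジャイル (mostly freelance, since 2012), and concurrently a director at ジークラウド株式会社.
    A group of freelancers doing team development: contract development (80%) and in-house service development (20%).
    Lean Startup and Agile (planning, development, operations). Rails, Swift, Java for Android.
    Actively looking for development projects and for freelancers who would like to work with us!
    This talk covers what was hard, and what we did about it, in keeping our in-house service running reliably.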
  3. What was hard

  4. Without JavaScript execution, the data cannot be retrieved, because the site builds its pages dynamically with JS

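  5. Scraping with Poltergeist
    Stack: Ruby -> Capybara -> Poltergeist -> PhantomJS -> target site
    PhantomJS: a headless browser running the same JS engine as Safari
    Poltergeist: the PhantomJS driver for Capybara
    Capybara: a testing framework for acceptance tests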
  6. Scraping with Poltergeist

        require 'capybara/poltergeist'

        Capybara.register_driver :poltergeist do |app|
          Capybara::Poltergeist::Driver.new(app)
        end
        Capybara.default_driver = :poltergeist

        agent = Capybara.current_session
        agent.visit('URL')
        number = agent.find('CSS selector').text.to_i
  7. Without user authentication, the data cannot be retrieved

  8. Authenticate first, then access the target URL

        agent.visit login_url
        agent.find('input[name="email"]').set(email)
        agent.find('input[name="password"]').set(password)
        agent.find('#login-btn').trigger('click')
        agent.visit account_url

  9. Re-authenticating on every run is slow

  10. Authentication via cookies

        require 'base64'
        require 'json'

        # Serialize the session's cookies and store them on the user record.
        def save_cookie(agent, user)
          cookies_str = Base64.encode64(
            Marshal.dump(agent.driver.browser.cookies))
          user.update_attributes(cookies: cookies_str)
        end

        # Restore previously saved cookies into the current session.
        def load_cookie(agent, user)
          cookies = Marshal.load(Base64.decode64(user.cookies))
          cookies.values.each do |cookie|
            cookie_hash = JSON.parse(cookie.to_json)["attributes"]
            agent.driver.browser.set_cookie(cookie_hash)
          end
        end
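
The save/load idea on this slide can be tried without a browser. The sketch below is only an illustration under stated assumptions: `Cookie` is a hypothetical stand-in struct (not Poltergeist's cookie class), and it swaps `Marshal` for JSON so the stored string stays readable and is safe to load. It also shows one possible way to skip cookies that have already expired.

```ruby
require 'base64'
require 'json'

# Hypothetical stand-in for a browser cookie; not Poltergeist's cookie class.
Cookie = Struct.new(:name, :value, :domain, :expires)

# Encode a list of cookies as a Base64 string that fits in a text column.
def dump_cookies(cookies)
  Base64.strict_encode64(JSON.generate(cookies.map(&:to_h)))
end

# Decode the stored string back into Cookie objects.
def restore_cookies(encoded)
  JSON.parse(Base64.strict_decode64(encoded)).map do |attrs|
    Cookie.new(attrs['name'], attrs['value'], attrs['domain'], attrs['expires'])
  end
end

# Treat a cookie as usable only if it has no expiry or has not expired yet.
# `expires` is assumed to be a Unix timestamp in seconds.
def usable?(cookie, now = Time.now)
  cookie.expires.nil? || cookie.expires > now.to_i
end
```

Checking `usable?` before restoring a cookie avoids one wasted request, but the server can still invalidate a session early, so a retry path is needed either way.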
  11. Sometimes the cookie has already expired

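  12. Retry if the cookie has expired

        scrape(need_login: true) do
          agent.visit('URL')
          agent.find('CSS selector').text.to_i
        end

        # Shared wrapper method for every piece of scraping code
        def scrape(need_login: false)
          yield
        rescue => e
          # If the session is no longer logged in, log in again and retry
          if need_login && !login?
            login
            retry
          end
          raise # any other failure is re-raised instead of being swallowed
        end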
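
Because repeated login attempts can get an account locked, it may be worth capping how often the wrapper re-authenticates. The following is a minimal sketch of that idea, not the speaker's code: `SessionExpired`, `FakeSession`, and `scrape_with_relogin` are hypothetical names, and the fake session exists only so the wrapper can be run without a browser.

```ruby
# Raised by the hypothetical session when a page requires a fresh login.
class SessionExpired < StandardError; end

# Run the block, re-authenticating and retrying at most max_logins times.
def scrape_with_relogin(session, max_logins: 1)
  logins = 0
  begin
    yield session
  rescue SessionExpired
    raise if logins >= max_logins # give up rather than hammer the login form
    logins += 1
    session.login
    retry
  end
end

# Hypothetical in-memory session used only to exercise the wrapper.
class FakeSession
  attr_reader :login_count

  def initialize
    @logged_in = false
    @login_count = 0
  end

  def login
    @logged_in = true
    @login_count += 1
  end

  def fetch_number
    raise SessionExpired unless @logged_in
    42
  end
end
```

With a fresh `FakeSession`, `scrape_with_relogin(session) { |s| s.fetch_number }` fails once, logs in once, and then returns the value; if logging in never helps, the error propagates after `max_logins` attempts.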
  13. Too many authentication attempts get the account locked

  14. Vary the User-Agent every so often

        agent.driver.headers = {
          "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) " \
                          "AppleWebKit/537.36 (KHTML, like Gecko) " \
                          "Chrome/47.0.2526.106 Safari/537.36 #{Time.now.to_f.to_s}"
        }
  15. Hammering the site with requests gets you blocked

  16. Change the access source every so often
    App server -> Proxy1 (AWS) -> target site
    ① The proxy gets blocked
    ② A new proxy is launched
  17. The HTML structure changes and scraping fails, and so on

  18. The site appears to run A/B tests, so the HTML differs from account to account

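  19. Avoid scraping from display-oriented markup wherever possible
    Information display screen: "Number of guests: 9 people" rendered as text
    Form screen: "Number of guests: [9] people", with the value in a form field
    Form markup is tied to the server-side program, so it rarely changes.
        <div data-bootstrap-data="{a: 'b', ... }" />
    Places where data is passed to JavaScript as a JSON string also rarely change.
        http://example.com/users/12345678
    URLs also rarely change.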
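
As a rough illustration of reading from these more stable sources, the sketch below pulls a JSON payload out of a data attribute and an ID out of a URL using only the standard library. The sample markup, its keys, and the helper names are made up for this example. A real page would more likely be read through the browser session or an HTML parser, and attribute values may need HTML-unescaping first.

```ruby
require 'json'
require 'uri'

# Hypothetical page fragment; the attribute name follows the slide, while the
# payload and its keys are invented for illustration.
SAMPLE_HTML = %q(<div data-bootstrap-data='{"guests": 9, "user_id": 12345678}'></div>)

# Pull the JSON payload out of the data attribute and parse it.
# Returns nil when the attribute is not present.
def bootstrap_data(html)
  match = html.match(/data-bootstrap-data='([^']*)'/)
  return nil unless match
  JSON.parse(match[1])
end

# Take the numeric ID from the last path segment of a URL, or nil if the
# last segment is not a number.
def id_from_url(url)
  segment = URI.parse(url).path.split('/').last.to_s
  segment.match?(/\A\d+\z/) ? segment.to_i : nil
end
```

Returning `nil` instead of raising keeps the "structure changed" decision with the caller, which can then report it as a permanent error.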
  20. The site is slow, and it occasionally times out or goes down

  21. Wait longer for page loads and rendering

        Capybara.register_driver :poltergeist do |app|
          Capybara::Poltergeist::Driver.new(app, :timeout => 60)
        end
        Capybara.default_driver = :poltergeist
        Capybara.default_max_wait_time = 30

        agent = Capybara.current_session
        # Waits up to 60 seconds
        agent.visit('URL')
        # Waits up to 30 seconds for asynchronous JavaScript updates to finish
        number = agent.find('CSS selector').text.to_i
  22. Scraping failures will always happen
    Changes in HTML structure: permanent errors
    Connection errors: transient errors
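
One way to act on this distinction is to classify exceptions before deciding whether to retry or to alert. The sketch below is an assumption about how that could look, not something shown in the talk: `StructureChanged` is a hypothetical error the scraper itself would raise when an expected element is missing, and the list of transient error classes is only an example.

```ruby
require 'socket'
require 'timeout'

# Hypothetical error raised by our own scraper when an expected element is missing.
class StructureChanged < StandardError; end

# Example set of errors that usually go away on their own.
TRANSIENT_ERRORS = [
  Timeout::Error,
  Errno::ECONNRESET,
  Errno::ECONNREFUSED,
  SocketError
].freeze

# Classify an exception as :transient (worth retrying) or
# :permanent (retrying will not help; the scraper needs a fix).
def classify(error)
  TRANSIENT_ERRORS.any? { |klass| error.is_a?(klass) } ? :transient : :permanent
end
```

Anything not explicitly listed is treated as permanent here, so unknown failures surface quickly instead of being retried silently.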

  23. Run the unit tests automatically on a schedule

        project='repository name'
        branch='master'
        api_token='API token'
        url="https://circleci.com/api/v1/project/${project}/tree/${branch}?circle-token=${api_token}"
        curl \
          --header "Accept: application/json" \
          --header "Content-Type: application/json" \
          --request POST "${url}"

    Trigger the CircleCI build periodically from cron through the API.
    If the build fails, CircleCI notifies Slack.
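  24. Using Sidekiq

        # Suppose there are 1,000 shops
        shops.each do |shop|
          # If processing stops on an error at the 10th shop,
          # the remaining 990 shops are left unprocessed
          shop.scrape
        end

    Transient connection errors
    Unexpected errors that occur only for a specific shop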
  25. Using Sidekiq: even if the job for one shop fails, the jobs for the other shops still run
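
The isolation property described here can be illustrated in plain Ruby. This sketch does not use Sidekiq, which gets the same effect by running each shop as its own job. `Shop` and `scrape_all` are hypothetical names, and the point is only that one failing shop does not stop the others.

```ruby
# Hypothetical shop object; scrape raises for shops flagged as broken.
Shop = Struct.new(:id, :broken) do
  def scrape
    raise "scrape failed for shop #{id}" if broken
    "data for shop #{id}"
  end
end

# Process every shop independently, collecting results and failures
# instead of letting the first error abort the whole loop.
def scrape_all(shops)
  results = {}
  failures = {}
  shops.each do |shop|
    begin
      results[shop.id] = shop.scrape
    rescue StandardError => e
      failures[shop.id] = e.message
    end
  end
  [results, failures]
end
```

The collected `failures` hash is what would feed a notification or a retry queue in a real batch.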

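  26. Other benefits of Sidekiq
    Even if the job for one shop fails, the jobs for the other shops still run.
    The number of threads and the wait before jobs start can be controlled, so the target site is not overloaded.
    It notifies Slack when a job fails.
    It retries failed threads:
        the maximum number of retries can be set;
        retry intervals widen progressively (15, 16, 31, 96, 271, ... );
        Slack notifications on retries can be thinned out.
    Parallel processing and scaling out servers improve performance.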
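
The intervals quoted on the slide match a delay of `retry_count ** 4 + 15` seconds, which is the base of Sidekiq's default backoff. The sketch below reproduces that base and adds a random jitter term. The jitter shown is an assumption for illustration, since the exact term differs between Sidekiq versions.

```ruby
# Base delay in seconds before the next retry (retry_count starts at 0).
def base_retry_delay(retry_count)
  (retry_count**4) + 15
end

# Base delay plus random jitter, so that many failed jobs do not all
# retry at the same moment. The jitter range here is illustrative.
def retry_delay_with_jitter(retry_count, rng: Random.new)
  base_retry_delay(retry_count) + rng.rand(30) * (retry_count + 1)
end

puts (0..4).map { |n| base_retry_delay(n) }.inspect # prints [15, 16, 31, 96, 271]
```

Because the delay grows with the fourth power of the attempt number, a site that is down for a few minutes is retried quickly, while one that is down for hours is not hit repeatedly.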
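  27. Summary
    Naturally, work to avoid scraping errors and connection errors as far as possible.
    Even so, they cannot be avoided completely.
    Set up a mechanism that detects scraping errors quickly.
    When a transient connection error or an unexpected error occurs, let the affected job fail but keep the rest of the scheduled batch running.
    Retry transient errors (connection errors, expired cookies).
    Stakeholders need to understand the time and cost of operating scraping early in the project (and in some cases users do too).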
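  28. Finally, a plug: リンスタカフェ オンライン
    A community for practitioners who plan and build services based on Lean Startup (and similar) methods.
    Eight offline study sessions held so far: how to run interviews, how to run user tests, how to decide on an MVP, war stories from in-house startups, and more.
    This is the online version of those sessions: every Wednesday at 21:30 on Google Hangouts.
    Currently 4 members.
    Participants report on the state of the services they work on, share the problems they face and advise one another, and hold reading groups on Lean Startup books.
    If you are interested, please get in touch!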