
Running Scrapers Reliably: What We Struggled With and How We Worked Around It

shida
August 21, 2016


Bayside Tech Bridge 2016.08.21
Crawling specialists talk about what goes on behind the scenes of running crawlers!


Transcript

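  1. Running Scrapers Reliably:
    What We Struggled With and How We Worked Around It
    Bayside Tech Bridge
    Crawling specialists talk about what goes on behind the scenes of running crawlers!
    志田裕樹 (shida), ビー・アジャイル Inc. and ジークラウド Inc.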

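  2. ビー・アジャイル Inc. (essentially freelance, since 2012)
    Also a director at ジークラウド Inc.
    A group of freelancers developing as a team
    Contract development (80%), in-house service development (20%)
    Lean Startup, Agile (planning, development, operations)
    Rails, Swift, Java for Android
    Actively looking for development projects and for freelancers to work with!!!
    This talk covers what we struggled with, and what we devised, to keep our in-house service running reliably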

  3. What we struggled with

  4. Without JavaScript execution, the information cannot be retrieved
    The site builds its pages dynamically with JS

  5. Scraping with Poltergeist
    Ruby
    Capybara: a testing framework for acceptance tests
    Poltergeist: Capybara's PhantomJS driver
    PhantomJS: a headless browser that runs the same JS engine as Safari
    Target site

  6. Scraping with Poltergeist
    require 'capybara/poltergeist'
    # Register PhantomJS (via Poltergeist) as a Capybara driver
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app)
    end
    Capybara.default_driver = :poltergeist
    agent = Capybara.current_session
    agent.visit('URL')
    # Read the text of the matched element as an integer
    number = agent.find('CSS selector').text.to_i

  7. Without user authentication, the information cannot be retrieved

  8. Log in first, then access the target URL
    agent.visit login_url
    agent.find('input[name="email"]').set(email)
    agent.find('input[name="password"]').set(password)
    agent.find('#login-btn').trigger('click')
    agent.visit account_url

  9. Logging in again on every run is expensive

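  10. Authenticating with cookies
    require 'base64'
    require 'json'
    # Serialize the browser's cookies and store them on the user record
    def save_cookie(agent, user)
      cookies_str = Base64.encode64(
        Marshal.dump(agent.driver.browser.cookies))
      user.update_attributes(cookies: cookies_str)
    end
    # Restore the stored cookies into the browser session
    def load_cookie(agent, user)
      cookies = Marshal.load(Base64.decode64(user.cookies))
      cookies.values.each do |cookie|
        # The "attributes" key assumes ActiveSupport's Object#to_json (as in Rails)
        cookie_hash = JSON.parse(cookie.to_json)["attributes"]
        agent.driver.browser.set_cookie(cookie_hash)
      end
    end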

  11. Sometimes the cookie has expired

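  12. Retry if the cookie has expired
    # Common wrapper for the places that scrape
    def scrape(need_login: false)
      yield
    rescue => e
      # If the failure is explained by a lost login, log in and try again
      if need_login && !login?
        login
        retry
      end
      # Otherwise surface the error instead of silently swallowing it
      raise e
    end

    scrape(need_login: true) do
      agent.visit('URL')
      agent.find('CSS selector').text.to_i
    end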
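A retry that logs in again can loop indefinitely if the login itself keeps failing, and repeated login attempts are exactly what gets an account locked. The following is a minimal sketch of a bounded variant. `FakeSession`, `scrape_with_limit`, and `MAX_LOGIN_ATTEMPTS` are hypothetical names introduced for illustration and are not from the talk; the fake session stands in for the Capybara session so the example runs on its own.

```ruby
# Hypothetical stand-in for a browser session whose cookie has expired.
class FakeSession
  attr_reader :login_count

  def initialize
    @logged_in = false
    @login_count = 0
  end

  def login?
    @logged_in
  end

  def login
    @login_count += 1
    @logged_in = true
  end

  # Raises unless logged in, like a page that hides its data from guests.
  def fetch_number
    raise "not logged in" unless @logged_in
    42
  end
end

MAX_LOGIN_ATTEMPTS = 2 # assumed limit, not a value from the talk

def scrape_with_limit(session, need_login: false)
  attempts = 0
  begin
    yield
  rescue StandardError
    # Re-login and retry only while the attempt budget lasts.
    if need_login && !session.login? && attempts < MAX_LOGIN_ATTEMPTS
      attempts += 1
      session.login
      retry
    end
    raise # anything else, or an exhausted budget, is re-raised
  end
end

session = FakeSession.new
result = scrape_with_limit(session, need_login: true) { session.fetch_number }
```

Here the first attempt fails, one login happens, and the second attempt returns the value.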

  13. Too many login attempts get the account locked

  14. Vary the User-Agent from time to time
    # Append the current timestamp so the User-Agent string differs between runs
    agent.driver.headers = {
      "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) " \
                      "AppleWebKit/537.36 (KHTML, like Gecko) " \
                      "Chrome/47.0.2526.106 Safari/537.36 #{Time.now.to_f}"
    }

  15. Hitting the site too often gets you blocked

  16. Change the source of the requests from time to time
    App server -> Proxy1 (AWS) -> Target site
    ① Proxy1 gets blocked
    ② Launch a new proxy (AWS)
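The talk launches a fresh proxy on AWS when one is blocked. The sketch below covers only the client side of that idea: switching to the next proxy address and producing the `--proxy=host:port` command-line option that PhantomJS accepts, which Poltergeist can receive through its `:phantomjs_options` driver option. `ProxyPool` and the addresses are hypothetical; a fixed list is assumed in place of launching instances.

```ruby
# Hypothetical pool of proxy endpoints; the addresses are placeholders.
class ProxyPool
  def initialize(proxies)
    @proxies = proxies.dup
  end

  def current
    @proxies.first
  end

  # Drop the blocked proxy and move on to the next one.
  def rotate!
    @proxies.shift
    raise "no proxies left" if @proxies.empty?
    current
  end

  # Options in the form PhantomJS accepts on its command line.
  def phantomjs_options
    ["--proxy=#{current}"]
  end
end

pool = ProxyPool.new(["10.0.0.1:3128", "10.0.0.2:3128"])
first_options = pool.phantomjs_options
pool.rotate! # called when the current proxy is found to be blocked
second_options = pool.phantomjs_options
```

After a rotation the driver would need to be registered again with the new options, since PhantomJS reads them at start-up.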

  17. The HTML structure changes and scraping fails, among other things

  18. They appear to be running A/B tests
    The HTML differs from account to account

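  19. Avoid scraping from display-oriented pages wherever possible
    Number of guests: 9 people (information display page)
    Number of guests: 9 (form page)

    The markup of forms is tied to the server-side program, so it rarely changes

    Places where information is handed to JavaScript as a JSON string also rarely change
    http://example.com/users/12345678
    URLs also rarely change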
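The three stable sources named on this slide (form values, JSON passed to JavaScript, and URLs) can be illustrated with a small sketch. The page fragment, the `guests` field, and the `reservation` variable are invented for the example. Regular expressions are used only to keep it free of external gems; with Capybara the form value would normally be read through the element itself.

```ruby
require 'json'
require 'uri'

# Hypothetical page fragment; the field and variable names are invented.
html = <<~HTML
  <p class="summary">Guests: 9 people</p>
  <input type="text" name="guests" value="9">
  <script>var reservation = {"guests": 9, "nights": 2};</script>
HTML

# 1. Form field: read the raw value attribute instead of the display text.
guests_from_form = html[/name="guests" value="(\d+)"/, 1].to_i

# 2. JSON handed to JavaScript: cut out the object literal and parse it.
json_text = html[/var reservation = (\{.*?\});/, 1]
guests_from_json = JSON.parse(json_text)["guests"]

# 3. URL: take the ID from the last path segment.
user_id = URI.parse("http://example.com/users/12345678").path.split("/").last
```

None of the three reads depends on the wording or layout of the display paragraph, which is the part most likely to change.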

  20. The site is slow and occasionally times out or goes down

  21. Extend the page-load and rendering wait times
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, :timeout => 60)
    end
    Capybara.default_driver = :poltergeist
    Capybara.default_max_wait_time = 30
    agent = Capybara.current_session
    # Waits up to 60 seconds
    agent.visit('URL')
    # Waits up to 30 seconds for asynchronous JavaScript updates and the like to finish
    number = agent.find('CSS selector').text.to_i

  22. Scraping failures are inevitable
    Changes in HTML structure: permanent errors
    Connection errors: temporary errors
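One way to act on this distinction is to classify an error before deciding whether to retry. The sketch below is an assumption about how that could look, not the talk's implementation: `StructureChangedError` is a hypothetical error for a missing element, and the list of temporary errors is illustrative.

```ruby
require 'timeout'

# Hypothetical error raised when an expected element is missing,
# i.e. the HTML structure has probably changed.
class StructureChangedError < StandardError; end

# Errors treated as temporary here; this list is an assumption.
TRANSIENT_ERRORS = [Timeout::Error, Errno::ECONNRESET, Errno::ECONNREFUSED].freeze

# Returns :permanent, :temporary, or :unknown for the given exception.
def classify(error)
  case error
  when StructureChangedError then :permanent
  when *TRANSIENT_ERRORS then :temporary
  else :unknown
  end
end

kinds = [
  classify(StructureChangedError.new("selector not found")),
  classify(Timeout::Error.new("execution expired")),
  classify(ArgumentError.new("something else"))
]
```

Temporary errors are candidates for a retry, while permanent ones need a code change and are better reported straight away.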

  23. Run the unit tests automatically on a schedule
    project='repository name'
    branch='master'
    api_token='API token'
    url="https://circleci.com/api/v1/project/${project}/tree/${branch}?circle-token=${api_token}"
    # Trigger a build of the branch through the CircleCI API
    curl \
      --header "Accept: application/json" \
      --header "Content-Type: application/json" \
      --request POST "${url}"
    Trigger the CircleCI build through its API from cron on a schedule
    If it fails, CircleCI notifies Slack

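  24. Using Sidekiq
    # Suppose there are 1,000 shops
    shops.each do |shop|
      # If an error stops processing at the 10th shop,
      # the remaining 990 shops are left unprocessed
      shop.scrape
    end
    Temporary connection errors
    Unexpected errors that occur only for specific shops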

  25. Using Sidekiq
    Even if one shop fails, processing for the other shops is still invoked
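Sidekiq gets this isolation by running each shop as its own job. The plain-Ruby sketch below is not Sidekiq code; it only imitates the effect with a per-shop `rescue`, so that one failing shop does not stop the loop. `Shop` and its `broken` flag are hypothetical.

```ruby
# Hypothetical shop whose scrape may fail.
Shop = Struct.new(:name, :broken) do
  def scrape
    raise "unexpected HTML for #{name}" if broken
    "#{name}: ok"
  end
end

shops = [Shop.new("A", false), Shop.new("B", true), Shop.new("C", false)]

results = []
failures = []
shops.each do |shop|
  begin
    results << shop.scrape
  rescue StandardError => e
    # Record the failure and carry on with the remaining shops.
    failures << [shop.name, e.message]
  end
end
```

Shop B fails, yet shops A and C are both processed and the failure is kept for reporting.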

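  26. Other benefits of Sidekiq
    Even if one shop fails, processing for the other shops is still invoked
    Controls the number of threads and the delay before they start, so the target site is not overloaded
    Notifies Slack when a job fails
    Retries failed threads
    The maximum number of retries can be set
    The retry interval widens gradually (15, 16, 31, 96, 271, ... )
    Slack notifications for retries can be thinned out
    Better performance through parallel processing and scaling out servers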
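The intervals quoted on this slide match `count**4 + 15` seconds, which appears to be the deterministic part of Sidekiq's default retry delay. Sidekiq also adds a random component on top, omitted here. The sketch below reproduces only those base values; `base_retry_delay` is a hypothetical helper name.

```ruby
# Base retry delay in seconds for the given retry count (0-based),
# without the random jitter that Sidekiq adds on top.
def base_retry_delay(count)
  (count**4) + 15
end

delays = (0..4).map { |count| base_retry_delay(count) }
# delays => [15, 16, 31, 96, 271]
```

Because the delay grows with the fourth power of the retry count, later retries are spaced far apart, which also keeps load off a site that is already struggling.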

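  27. Summary
    Naturally, work to avoid scraping errors and connection errors as far as possible
    Even so, they cannot be avoided completely
    Have a mechanism that detects scraping errors promptly
    When a temporary connection error or an unexpected error occurs, let that job fail but keep the scheduled batch running
    Retry temporary errors (connection errors, expired cookies)
    Stakeholders (and in some cases users) need to understand, early in the project, the time and cost of operating scrapers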

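  28. Finally, a plug: リンスタカフェ Online
    A community for practitioners who plan and build services based on Lean Startup (and similar) methods
    Eight offline study sessions held so far
    Topics: how to run interviews, how to run user tests, how to decide on an MVP, war stories from internal startups, and more
    This is the online edition
    Every Wednesday at 21:30 on Google Hangouts
    Currently 4 members
    Participants report on the services they are involved in, share the problems they face and advise one another, and hold reading groups on Lean Startup books
    Get in touch if you are interested!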