Bayside Tech Bridge, 2016.08.21: Crawling specialists reveal the behind-the-scenes of running crawlers!
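Keeping scraping running reliably: what was hard, and what we did about it
#BaysideTechBridge: Crawling specialists reveal the behind-the-scenes of running crawlers!
志田裕樹, Be Agile Inc. (株式会社ビー・アジャイル) and ジークラウド株式会社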
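About the speaker
- Be Agile Inc. (株式会社ビー・アジャイル): essentially a freelancer, since 2012
- ジークラウド株式会社: director, held concurrently
- Freelancers who have come together to develop as a team
- Contract development (80%), in-house service development (20%)
- Lean Startup and agile (planning, development, operations)
- Rails, Swift, Java for Android
- Actively looking for development projects and for freelancers who want to work with us!
- This talk covers what was hard, and what we devised, in keeping our in-house service running reliably.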
What was hard
The information cannot be retrieved unless JavaScript runs
The target site builds its pages dynamically with JS.
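Scraping with Poltergeist
Stack: Ruby -> Capybara -> Poltergeist -> PhantomJS -> target site
- PhantomJS: a headless browser that runs the same JS engine as Safari
- Poltergeist: the PhantomJS driver for Capybara
- Capybara: a testing framework for acceptance tests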
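Scraping with Poltergeist

    require 'capybara/poltergeist'

    # Register PhantomJS (through Poltergeist) as a Capybara driver
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app)
    end
    Capybara.default_driver = :poltergeist

    agent = Capybara.current_session
    agent.visit('URL')
    # Read the element's text after the page's JavaScript has run
    number = agent.find('CSS selector').text.to_i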
The information cannot be retrieved without user authentication
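Authenticate, then access the target URL

    # Fill in and submit the login form
    agent.visit login_url
    agent.find('input[name="email"]').set(email)
    agent.find('input[name="password"]').set(password)
    agent.find('#login-btn').trigger('click')

    # The session is now authenticated
    agent.visit account_url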
Authenticating on every run is slow and heavy
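Authentication via cookies

    require 'base64'
    require 'json'

    # Serialize the browser's cookies and store them on the user record
    def save_cookie(agent, user)
      cookies_str = Base64.encode64(Marshal.dump(agent.driver.browser.cookies))
      user.update_attributes(cookies: cookies_str)
    end

    # Restore the stored cookies into the browser before scraping
    def load_cookie(agent, user)
      cookies = Marshal.load(Base64.decode64(user.cookies))
      cookies.values.each do |cookie|
        cookie_hash = JSON.parse(cookie.to_json)["attributes"]
        agent.driver.browser.set_cookie(cookie_hash)
      end
    end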
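The serialization step on this slide can be tried without a browser. The following is a minimal sketch, assuming the cookie jar is a plain hash (the `session_id` cookie and its attributes are made up for illustration). It uses `pack('m0')` and `unpack1('m0')`, the core-Ruby equivalents of `Base64.strict_encode64` and `Base64.strict_decode64`, so it needs no `require`.

```ruby
# Hypothetical stand-in for the browser's cookie jar: name => attribute hash.
cookies = {
  'session_id' => {
    'name' => 'session_id', 'value' => 'abc123', 'domain' => 'example.com'
  }
}

# Serialize to a single-line, text-safe string that fits in a text column.
def dump_cookies(cookies)
  [Marshal.dump(cookies)].pack('m0')
end

# Rebuild the original object. Only load data you wrote yourself:
# Marshal.load on untrusted input is unsafe.
def restore_cookies(str)
  Marshal.load(str.unpack1('m0'))
end

stored = dump_cookies(cookies)
restored = restore_cookies(stored)
```

Here `restored` is equal to `cookies`, which is the property the save and load methods rely on.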
Sometimes the cookie has expired
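Retry when the cookie has expired

    # Shared wrapper for every scraping step
    def scrape(need_login: false)
      yield
    rescue => e
      # If the session has expired, log in again and rerun the block
      if need_login && !login?
        login
        retry
      end
      raise e  # re-raise anything else instead of silently swallowing it
    end

    scrape(need_login: true) do
      agent.visit('URL')
      agent.find('CSS selector').text.to_i
    end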
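The retry wrapper above can loop indefinitely if logging in never succeeds. Below is a minimal sketch of a bounded variant. `FakeSession`, `fetch_number`, and the `max_logins` parameter are hypothetical names introduced for illustration, not part of the original slides.

```ruby
# Hypothetical stand-in for a browser session that may need to log in.
class FakeSession
  attr_reader :login_count

  def initialize
    @logged_in = false
    @login_count = 0
  end

  def login?
    @logged_in
  end

  def login
    @login_count += 1
    @logged_in = true
  end

  # Raises while the session is not authenticated.
  def fetch_number
    raise 'not logged in' unless login?
    42
  end
end

# Log in again and retry at most max_logins times; otherwise re-raise.
def scrape(session, need_login: false, max_logins: 1)
  logins = 0
  begin
    yield
  rescue StandardError
    raise unless need_login && !session.login? && logins < max_logins
    logins += 1
    session.login
    retry
  end
end

session = FakeSession.new
number = scrape(session, need_login: true) { session.fetch_number }
```

Capping the number of logins also limits repeated authentication attempts against the target site.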
Repeated authentication attempts get the account locked
Vary the User-Agent every so often

    # Append the current timestamp so the User-Agent differs on each run
    agent.driver.headers = {
      "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) " \
                      "AppleWebKit/537.36 (KHTML, like Gecko) " \
                      "Chrome/47.0.2526.106 Safari/537.36#{Time.now.to_f}"
    }
Hammering the site with requests gets you blocked
Change the access source every so often
App server -> Proxy1 (AWS) -> target site
(1) The proxy gets blocked
(2) Launch a new proxy on AWS
The HTML structure changes and scraping fails
The site appears to run A/B tests, so the HTML differs from account to account
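Avoid scraping from display-oriented markup wherever possible
- Information display screen (for example "Number of guests: 9名"): the source to avoid where possible
- Form screen (for example "Number of guests: [9] 名"): the markup inside a form is tied to the server-side program, so it rarely changes
- Places where information is passed to JavaScript as a JSON string: these rarely change
- URLs (for example http://example.com/users/12345678): these rarely change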
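To make the idea on this slide concrete, here is a minimal sketch that reads the same value from a form field and from a JSON string embedded in a script, instead of from display markup. The page source, the `window.__DATA__` variable, and the `guests` field name are all made up for illustration. Regular expressions stand in for a real browser session to keep the example self-contained.

```ruby
require 'json'

# Hypothetical page source showing one value in three places.
html = <<~HTML
  <p class="summary">Guests: <b>9</b> people</p>
  <form><input type="text" name="guests" value="9"></form>
  <script>window.__DATA__ = {"user_id": 12345678, "guests": 9};</script>
HTML

# Pull out the JSON literal assigned in the script and parse it.
json_text = html[/window\.__DATA__\s*=\s*(\{.*?\});/m, 1]
data = JSON.parse(json_text)

# Read the form field by its name attribute rather than by display markup.
form_guests = html[/<input[^>]*name="guests"[^>]*value="(\d+)"/, 1].to_i
```

Both `data['guests']` and `form_guests` give 9 without depending on the `<p class="summary">` markup, which is the part most likely to be restyled.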
The site is slow and occasionally times out or goes down
Lengthen the load and render wait times

    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, :timeout => 60)
    end
    Capybara.default_driver = :poltergeist
    Capybara.default_max_wait_time = 30

    agent = Capybara.current_session
    # Waits up to 60 seconds for the page to load
    agent.visit('URL')
    # Waits up to 30 seconds for asynchronous JavaScript updates to finish
    number = agent.find('CSS selector').text.to_i
Scraping failures will always happen
- Changes to the HTML structure: persistent errors
- Connection errors: transient errors
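The two kinds of failure call for different handling, so it can help to classify exceptions explicitly. This is a minimal sketch under assumptions: `StructureChanged` is a hypothetical error class, and the list of transient errors is illustrative rather than complete.

```ruby
require 'timeout'

# Hypothetical error raised when an expected element is missing,
# which usually means the HTML structure has changed.
class StructureChanged < StandardError; end

# Errors that tend to clear up on their own (illustrative, not exhaustive).
TRANSIENT_ERRORS = [
  Timeout::Error, Errno::ECONNRESET, Errno::ECONNREFUSED
].freeze

# Returns :transient (retry later), :persistent (needs a code fix),
# or :unknown for anything else.
def classify(error)
  case error
  when *TRANSIENT_ERRORS then :transient
  when StructureChanged then :persistent
  else :unknown
  end
end
```

A caller could retry on `:transient` and alert a developer on `:persistent`.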
Run the unit tests automatically on a schedule

    project='repository name'
    branch='master'
    api_token='API token'
    url="https://circleci.com/api/v1/project/${project}/tree/${branch}?circle-token=${api_token}"
    curl \
      --header "Accept: application/json" \
      --header "Content-Type: application/json" \
      --request POST "${url}"

- Trigger the CircleCI build through its API, on a schedule, from cron
- If the build fails, CircleCI notifies Slack
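Using Sidekiq

    # Suppose there are 1,000 shops
    shops.each do |shop|
      # If an error stops processing at the 10th shop,
      # the remaining 990 shops are left unprocessed
      shop.scrape
    end

Errors that can stop the loop:
- Transient connection errors
- Unexpected errors that occur only for a specific shop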
Using Sidekiq
Even if one shop fails, the jobs for the other shops still run.
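The isolation property described here can be illustrated without Sidekiq. The sketch below is plain Ruby, not Sidekiq itself. The `Shop` struct and the failure at shop 10 are hypothetical. It only shows that handling each shop independently lets the rest of the batch finish.

```ruby
# Hypothetical shop record whose scrape fails for one specific shop.
Shop = Struct.new(:id) do
  def scrape
    raise "unexpected markup for shop #{id}" if id == 10
    "data-#{id}"
  end
end

# Process every shop independently, collecting failures instead of aborting.
def scrape_all(shops)
  results = {}
  failures = {}
  shops.each do |shop|
    begin
      results[shop.id] = shop.scrape
    rescue StandardError => e
      failures[shop.id] = e.message
    end
  end
  [results, failures]
end

shops = (1..1000).map { |id| Shop.new(id) }
results, failures = scrape_all(shops)
```

With Sidekiq, the same effect comes from enqueuing one job per shop (for example via a worker's `perform_async`), so a failed job does not affect the others.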
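Other benefits of Sidekiq
- Even if one shop fails, the jobs for the other shops still run
- The number of threads started, and the wait before each starts, can be controlled so the target site is not overloaded
- Slack is notified when a job fails
- Failed threads are retried
  - A maximum retry count can be set
  - The retry interval widens progressively (15, 16, 31, 96, 271, ...)
  - Slack notifications for retries can be thinned out
- Parallel processing
  - Scaling out servers improves performance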
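The interval sequence quoted on this slide matches the expression `count**4 + 15`, in seconds, where `count` is the number of retries so far. The sketch below reproduces only that base value. Sidekiq's documented default also adds a random jitter term, which is omitted here, and `base_retry_delay` is a name chosen for illustration.

```ruby
# Base delay in seconds before retry number `count` (0-based), without jitter.
def base_retry_delay(count)
  (count**4) + 15
end

delays = (0..4).map { |count| base_retry_delay(count) }
# delays == [15, 16, 31, 96, 271]
```

Because the delay grows with the fourth power of the retry count, later retries are spaced hours apart, which gives a struggling target site time to recover.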
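Summary
- Naturally, work to avoid scraping errors and connection errors as far as possible
- Even so, have a mechanism in place to quickly detect the scraping errors that cannot be fully avoided
- When a transient connection error or an unforeseen error occurs, let that job terminate abnormally while the scheduled batch processing continues
- Retry on transient errors (connection errors, expired cookies)
- Stakeholders, and in some cases users, need to understand at the start of the project how much time and cost operating a scraper takes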
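Finally, a plug: リンスタカフェ オンライン
- A community for practitioners who plan and build services based on Lean Startup and similar methods
- Eight offline study sessions held so far
  - How to run interviews, how to run user tests, how to decide on an MVP, the hardships of startups, and so on
- This is the online version
  - Every Wednesday at 21:30, on Google Hangouts
  - Currently 4 members
  - Participants report on the state of the services they are involved in, share the problems they face and advise one another, and hold reading groups on Lean Startup books
- If you are interested, please get in touch!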