Primum non nocere

Principles for building and running web scrapers that are good internet citizens.

Lindsay Holmwood

March 28, 2017

Transcript

  1. Telecommunications Offences and Other Measures Act (No. 2) 2004, Section 474.6 (5)

    A person is guilty of an offence if:
    (a) the person uses or operates any apparatus or device (whether or not it is comprised in, connected to or used in connection with a telecommunications network); and
    (b) this conduct results in hindering the normal operation of a carriage service supplied by a carriage service provider.
    Penalty: Imprisonment for 2 years.
  2. def get(url, opts={})
       options = { :cache => true }.merge!(opts)
       @agent ||= Mechanize.new

       case
       # Cache bypass
       when !options[:cache]
         page = @agent.get(url)
       # Cache hit
       when cached?(url)
         cache_fetch(url)
       # Cache miss
       else
         page = @agent.get(url)
         cache_store(url, page.body.to_s)
       end
     end
  3. def cache_store(url, content)
       cache_path(url).open('w') {|f| f << content} unless cached?(url)
       Nokogiri::HTML(content)
     end

     def cache_path(url)
       base = Pathname.new(__FILE__).parent.join('cache')
       hash = Digest::MD5.hexdigest(url)
       directory = base.join(hash[0])
       directory.mkpath unless directory.directory?
       directory.join(hash[1..-1])
     end
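
The two code slides above show only part of the caching layer: `cached?` and `cache_fetch` are called but never defined. The following is a minimal sketch of how the pieces might fit together, not the author's implementation. It assumes a hypothetical `CachingFetcher` class, guesses at `cached?` and `cache_fetch`, and takes the HTTP call as an injected block in place of Mechanize so that it runs on the Ruby standard library alone. It also returns raw response bodies rather than Nokogiri documents.

```ruby
require 'digest/md5'
require 'pathname'
require 'tmpdir'

# Hypothetical wrapper around the caching idea from the slides.
# The class name and the injected fetcher block are illustrative assumptions.
class CachingFetcher
  # cache_dir: directory that holds cached response bodies.
  # fetcher:   any block that takes a URL and returns the body as a String
  #            (in a real scraper this would wrap Mechanize or Net::HTTP).
  def initialize(cache_dir, &fetcher)
    @base = Pathname.new(cache_dir)
    @fetcher = fetcher
  end

  def get(url, cache: true)
    return @fetcher.call(url) unless cache    # cache bypass: always hit the site
    return cache_fetch(url) if cached?(url)   # cache hit: no request is made
    cache_store(url, @fetcher.call(url))      # cache miss: fetch once, then store
  end

  # Assumed helper (not on the slides): a URL is cached if its file exists.
  def cached?(url)
    cache_path(url).file?
  end

  # Assumed helper (not on the slides): read the stored body back from disk.
  def cache_fetch(url)
    cache_path(url).read
  end

  def cache_store(url, content)
    path = cache_path(url)
    path.dirname.mkpath
    path.write(content)
    content
  end

  # Same layout as the slides: the first hex digit of the MD5 of the URL
  # names the directory, the remaining digits name the file.
  def cache_path(url)
    hash = Digest::MD5.hexdigest(url)
    @base.join(hash[0], hash[1..-1])
  end
end

# Usage with a stub fetcher, so that no network request is made here.
fetcher = CachingFetcher.new(Dir.mktmpdir) { |url| "<html>#{url}</html>" }
fetcher.get('http://example.com/')                # miss: calls the block, stores the body
fetcher.get('http://example.com/')                # hit: served from disk
fetcher.get('http://example.com/', cache: false)  # bypass: calls the block again
```

Splitting the hash into a one-character directory and a file name spreads cached pages across up to 16 subdirectories, which keeps any single directory from growing too large. More to the point of the talk, each cache hit is a request the scraped site never has to serve.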