Slide 1

Slide 1 text

Primum non nocere
Lindsay Holmwood

Slide 2

Slide 2 text

Being good internet citizens

Slide 3

Slide 3 text

“Be nice”

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Don’t disrupt the service for others

Slide 6

Slide 6 text

Your own private DoS:
max out bandwidth
max out the app
take out a database

Slide 7

Slide 7 text

Don’t incur costs

Slide 8

Slide 8 text

Bandwidth

Slide 9

Slide 9 text

Hosting

Slide 10

Slide 10 text

Don’t create (unnecessary) work

Slide 11

Slide 11 text

Alerts

Slide 12

Slide 12 text

Logs exhaust storage

Slide 13

Slide 13 text

Don’t break the law

Slide 14

Slide 14 text

Telecommunications Offences and Other Measures Act (No. 2) 2004
Section 474.6 (5)
A person is guilty of an offence if:
(a) the person uses or operates any apparatus or device (whether or not it is comprised in, connected to or used in connection with a telecommunications network); and
(b) this conduct results in hindering the normal operation of a carriage service supplied by a carriage service provider.
Penalty: Imprisonment for 2 years.

Slide 15

Slide 15 text

Be mindful of organisational factors

Slide 16

Slide 16 text

Charities

Slide 17

Slide 17 text

Government services

Slide 18

Slide 18 text

Strategies for your scrapers

Slide 19

Slide 19 text

Identification

Slide 20

Slide 20 text

Set a user agent

Slide 21

Slide 21 text

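require 'mechanize'

a = Mechanize.new
# user_agent_alias only accepts Mechanize's built-in browser aliases,
# so set a custom string through user_agent instead.
a.user_agent = 'My Scraper'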
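A user agent is most useful to the site operator when it says who you are and how to reach you. The following is a minimal sketch of that idea using only Ruby's standard library; the helper names, the scraper name, and the contact URL are all placeholders, not anything from the talk. It builds the request without sending it.

```ruby
require 'net/http'
require 'uri'

# Build a User-Agent string that names the scraper and gives operators a
# way to reach you. Name, version, and contact URL are placeholders.
def scraper_user_agent(name, version, contact)
  "#{name}/#{version} (+#{contact})"
end

# Prepare (but do not send) a GET request carrying that header.
def build_request(url, user_agent)
  request = Net::HTTP::Get.new(URI(url))
  request['User-Agent'] = user_agent
  request
end

ua = scraper_user_agent('ExampleScraper', '0.1', 'https://example.com/scraper-info')
request = build_request('http://example.com/listing', ua)
puts request['User-Agent']
```

The same string can be assigned to a Mechanize agent's user_agent.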

Slide 22

Slide 22 text

Capture and replay responses

Slide 23

Slide 23 text

[Diagram: scraper, website, capture. Labels as extracted: scraper / website, scraper / capture, scraper / capture]

Slide 24

Slide 24 text

Testing tools do this already

Slide 25

Slide 25 text

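require 'webmock'
include WebMock::API

WebMock.enable!
# Block all real network access except the session-times pages.
WebMock.disable_net_connect!(allow: %r{unitedcinemas.com.au/session-times})

# Replay a canned response for session data requests.
# (with matches on the request; to_return sets the stubbed response.)
stub_request(:get, %r{http://www.unitedcinemas.com.au/session_data\.php\?date=.*&l=\w+&sort=title}).
  to_return(:body => "blah")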

Slide 26

Slide 26 text

Test for failure, latency

Slide 27

Slide 27 text

require 'webmock'
include WebMock::API

WebMock.enable!
WebMock.disable_net_connect!(allow: %r{unitedcinemas.com.au/session-times})

stub_request(:get, %r{http://www.unitedcinemas.com.au/session_data\.php\?date=.*&l=\w+&sort=title}).
  to_return { |request|
    # Simulate up to one second of latency before responding.
    sleep(rand(0.0..1.0))
    {
      :body => "blah",
      :headers => { 'Content-Type' => 'text/html; charset=UTF-8' }
    }
  }

Slide 28

Slide 28 text

Use a cache

Slide 29

Slide 29 text

Intercept and serve a previously fetched document

Slide 30

Slide 30 text

Squid & Nginx

Slide 31

Slide 31 text

agent = Mechanize.new
agent.set_proxy 'localhost', 8000

Slide 32

Slide 32 text

HTTPS?

Slide 33

Slide 33 text

Roll your own

Slide 34

Slide 34 text

page = get(url, cache: true)

Slide 35

Slide 35 text

require 'mechanize'
require 'digest/md5'
require 'pathname'

def get(url, opts={})
  options = { :cache => true }.merge!(opts)
  @agent ||= Mechanize.new

  case
  # Cache bypass
  when !options[:cache]
    page = @agent.get(url)
  # Cache hit
  when cached?(url)
    cache_fetch(url)
  # Cache miss
  else
    page = @agent.get(url)
    cache_store(url, page.body.to_s)
  end
end

Slide 36

Slide 36 text

def cached?(url)
  cache_path(url).exist?
end

def cache_fetch(url)
  body = cache_path(url).read
  Nokogiri::HTML(body)
end

Slide 37

Slide 37 text

def cache_store(url, content)
  cache_path(url).open('w') {|f| f << content} unless cached?(url)
  Nokogiri::HTML(content)
end

def cache_path(url)
  # Shard cache files into directories named after the first hex character of the URL's MD5.
  base = Pathname.new(__FILE__).parent.join('cache')
  hash = Digest::MD5.hexdigest(url)
  directory = base.join(hash[0])
  directory.mkpath unless directory.directory?
  directory.join(hash[1..-1])
end
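To see the hit/miss behaviour without touching a real site, here is a self-contained sketch of the same idea: a file cache keyed by the MD5 of the URL. FileCache and its fetch method are hypothetical names, and the block passed to fetch stands in for the real HTTP request. It stores raw bodies rather than parsed documents, and writes into a temporary directory.

```ruby
require 'digest/md5'
require 'pathname'
require 'tmpdir'

# A tiny file cache keyed by the MD5 of the URL, sharded by the first hex
# character of the hash. FileCache and fetch are hypothetical names.
class FileCache
  def initialize(base)
    @base = Pathname.new(base)
  end

  def path_for(url)
    hash = Digest::MD5.hexdigest(url)
    directory = @base.join(hash[0])
    directory.mkpath
    directory.join(hash[1..-1])
  end

  # Return the cached body if present; otherwise run the block (the real
  # fetch), store its result, and return it.
  def fetch(url)
    path = path_for(url)
    return path.read if path.exist?
    body = yield
    path.write(body)
    body
  end
end

Dir.mktmpdir do |dir|
  cache = FileCache.new(dir)
  fetches = 0
  2.times do
    cache.fetch('http://example.com/page') { fetches += 1; '<p>blah</p>' }
  end
  puts "upstream fetches: #{fetches}" # prints "upstream fetches: 1"
end
```

The second call is served from disk, so the upstream site sees one request no matter how many times the scraper is re-run during development.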

Slide 38

Slide 38 text

Rate limit requests

Slide 39

Slide 39 text

intentionally introduce a delay to web requests
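One way to introduce that delay inside the scraper itself is to enforce a minimum interval between requests. This is a sketch under assumptions: Throttle is a hypothetical helper, the 0.2 second interval is arbitrary, and the block stands in for the real fetch. The clock and sleeper are injectable so the behaviour can be checked without actually waiting.

```ruby
# Enforce a minimum interval between requests. Throttle is a hypothetical
# helper; the clock and sleeper are injectable for testing.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_request_at = nil
  end

  # Wait until at least min_interval seconds have passed since the previous
  # call, then run the block (the actual request) and return its result.
  def call
    if @last_request_at
      wait = @min_interval - (@clock.call - @last_request_at)
      @sleeper.call(wait) if wait > 0
    end
    @last_request_at = @clock.call
    yield
  end
end

throttle = Throttle.new(0.2)
3.times do |i|
  throttle.call { puts "request #{i}" } # replace the block with a real fetch
end
```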

Slide 40

Slide 40 text

serve errors

Slide 41

Slide 41 text

introduce delays

Slide 42

Slide 42 text

In the proxy
Nginx
https://lincolnloop.com/blog/rate-limiting-nginx/

Slide 43

Slide 43 text

In the proxy
HAProxy
http://blog.serverfault.com/2010/08/26/1016491873/

Slide 44

Slide 44 text

In the scraper
Backoff:
linear
exponential
middleware
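Backoff in the scraper can be as small as a retry loop that waits longer after each failure. The sketch below is one possible shape, not the talk's implementation: with_backoff and its parameters are hypothetical names, and the failing block simulates a struggling server.

```ruby
# Retry a block with exponential backoff. with_backoff and its parameters
# are hypothetical names; the sleeper is injectable for testing.
def with_backoff(attempts: 5, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts # give up and re-raise the last error
    # Exponential: base, 2x, 4x, ... For linear backoff use base_delay * tries.
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end

calls = 0
result = with_backoff(base_delay: 0.01) do
  calls += 1
  raise 'server busy' if calls < 3 # simulate two failures
  'ok'
end
puts "#{result} after #{calls} calls" # prints "ok after 3 calls"
```

Rescuing StandardError is deliberately broad for the sketch; a real scraper would likely narrow it to the network and HTTP errors its client raises.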

Slide 45

Slide 45 text

Development vs Production

Slide 46

Slide 46 text

Limit impact during development

Slide 47

Slide 47 text

Doesn’t apply in production

Slide 48

Slide 48 text

Eventually you have to interact with the real thing

Slide 49

Slide 49 text

“Be nice”

Slide 50

Slide 50 text

Don’t disrupt the service for others

Slide 51

Slide 51 text

Don’t incur costs

Slide 52

Slide 52 text

Don’t create (unnecessary) work

Slide 53

Slide 53 text

Be mindful of organisational factors

Slide 54

Slide 54 text

I’m Lindsay

Slide 55

Slide 55 text

❤ Thank you!