Slide 1

Slide 1 text

Web Scraping Best Practices Shane Evans @shaneaevans

Slide 2

Slide 2 text

About Shane ● 12y python, 8y scraping ● Scrapy, Portia, Frontera.. ● Co-founded Scrapinghub

Slide 3

Slide 3 text

Why Scrape? ● Sources of data -> the Internet holds a vast amount of data ● APIs - availability, limited data, throttling, privacy ● The web is broken - microdata, microformats, RDFa ● Endless use cases: monitor prices, lead generation, e-commerce, research, specialized news...

Slide 4

Slide 4 text

Web Scraping Traffic

Slide 5

Slide 5 text

Badly Written Bots Bots can: ● use excessive website resources ● be unreliable ● be hard to maintain

Slide 6

Slide 6 text

What is Web Scraping? A technique for extracting information from websites. It includes: ● downloading web pages (which may involve crawling - extracting and following links) ● scraping - extracting data from downloaded pages

Slide 7

Slide 7 text

Web Scraping Example Scrapinghub gets scraped all the time - our jobs get scraped and posted to jobs websites the day they are published! How would you build that web scraper?

Slide 8

Slide 8 text

Downloading with Requests

For simple cases, use requests:

    import requests

    url = 'https://ep2015.europython.eu/en/speakers/'
    r = requests.get(url)
    r.text

Nice API, clean code!

Slide 9

Slide 9 text

Crawling with Requests

    import requests

    class Crawler(object):
        # class wrapper and base_url added so the snippet stands alone
        base_url = 'https://ep2015.europython.eu'

        def __init__(self):
            self.session = requests.Session()

        def make_request(self, url):
            try:
                return self.session.get(url)
            except (requests.exceptions.HTTPError,
                    requests.exceptions.ConnectTimeout,
                    requests.exceptions.ConnectionError) as e:
                # TODO: some retry logic with logging and waits
                pass

        def start_crawl(self):
            response = self.make_request(self.base_url + '/en/speakers/')
            # TODO: extract links to speakers, call make_request for each, then extract

Slide 10

Slide 10 text

Crawling with Scrapy

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class EPSpeakerSpider(CrawlSpider):
        name = 'epspeakers_crawlspider'
        start_urls = ['https://ep2015.europython.eu/en/speakers/']
        rules = [
            Rule(LinkExtractor(allow=('/conference/',)),
                 callback='parse_speakerdetails')
        ]

        def parse_speakerdetails(self, response):
            ...

Scrapy is your friend

Slide 11

Slide 11 text

Crawling Multiple Websites Scrapy encourages best practices for scraping multiple websites: ● separate spiders for each different website (or crawl logic) ● common logic in appropriate places (middlewares, item loaders, etc.) ● lots more: tool support, common patterns, etc. See the demo at https://github.com/scrapinghub/pycon-speakers and the base-spider sketch below.
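A minimal sketch of that pattern - shared logic in a base spider, one small subclass per website (the class names and the single-field item are illustrative assumptions, not the layout of the demo repository):

    import scrapy

    class BaseSpeakerSpider(scrapy.Spider):
        # common output shape; subclasses only supply site-specific XPaths
        def parse(self, response):
            yield {'name': self.extract_name(response), 'source': self.name}

    class EuroPythonSpider(BaseSpeakerSpider):
        name = 'europython'
        start_urls = ['https://ep2015.europython.eu/en/speakers/']

        def extract_name(self, response):
            return response.xpath('//h1/text()').extract_first()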

Slide 12

Slide 12 text

Crawling - following links Crawling tips: ● find good sources of links e.g. sitemaps ● consider the crawl order - depth first, breadth first, priority ● canonicalize and remove duplicates ● beware of spider traps! - always add limits
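For the canonicalize-and-deduplicate tip, a minimal sketch using w3lib (the URL library Scrapy itself depends on); the seen-set below is an illustration, not Scrapy's built-in duplicate filter:

    from w3lib.url import canonicalize_url

    seen = set()

    def should_crawl(url):
        # normalises query parameter order, encoding and fragments before deduping
        canonical = canonicalize_url(url)
        if canonical in seen:
            return False
        seen.add(canonical)
        return True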

Slide 13

Slide 13 text

Crawling at Scale ● Lots of data: visited and discovered URLs ● Batch vs. incremental crawling ● Different strategies for deciding what to crawl: discover new content, revisit pages that are likely to have changed, prioritize relevant content ● Maintain politeness!

Slide 14

Slide 14 text

Frontera ● open source crawl frontier library, written in python ● python API, or integrated with Scrapy ● multiple crawl ordering algorithms (priority, HITS, pagerank, etc.) ● configurable back ends: ○ integrated or distributed ○ different crawl orderings ○ sqlite, hbase, etc. ● works at scale
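A sketch of wiring Frontera into a Scrapy project through its settings; the middleware and scheduler paths below are assumptions recalled from the Frontera documentation, so verify them against the version you install:

    # settings.py - module paths are assumptions, check the Frontera docs
    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    FRONTERA_SETTINGS = 'myproject.frontera_settings'  # hypothetical module name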

Slide 15

Slide 15 text

Downloading Summary ● requests is a great library for HTTP ● 'scrapify' early (move to Scrapy), especially if you do any crawling ● frontera for advanced URL orderings or larger scale crawling

Slide 16

Slide 16 text

Extraction - Standard Library We have a great standard library: ● string handling: slice, split, strip, lower, etc. ● regular expressions Usually used with other techniques
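A small illustration of how far the standard library alone can go (the price string is invented for the example):

    import re

    text = 'Price: 1,299.00 EUR (incl. VAT)'
    # strip/split/lower plus a regular expression are often enough
    match = re.search(r'([\d,]+\.\d{2})\s*EUR', text)
    price = float(match.group(1).replace(',', '')) if match else None
    # price == 1299.0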

Slide 17

Slide 17 text

Extraction - HTML Parsers HTML Parsers are the go-to tools for web scrapers! Useful when data can be extracted via the structure of the HTML document

Slide 18

Slide 18 text

Extraction - HTML Parsers [diagram in the original slide: an example HTML document tree whose text nodes are labelled TEXT-1, TEXT-2, TEXT-3 and TEXT-4; the XPath slides that follow select against this example]

Slide 19

Slide 19 text

Extraction - HTML Parsers HTML

Slide 20

Slide 20 text

Extraction - XPath XPath //b

Slide 21

Slide 21 text

Extraction - XPath XPath //div/b

Slide 22

Slide 22 text

Extraction - XPath XPath //div[2]/text()

Slide 23

Slide 23 text

Extraction - XPath XPath //div[2]//text()
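The original slides evaluated these XPath expressions against the example document from slide 18. A minimal sketch that reproduces the idea with a Scrapy Selector - the HTML below is an assumed reconstruction chosen to be consistent with the TEXT-1..TEXT-4 labels, not the slide's exact markup:

    from scrapy.selector import Selector

    html = '<div><b>TEXT-1</b></div><div>TEXT-2 <b>TEXT-3</b></div>TEXT-4'
    sel = Selector(text=html)

    sel.xpath('//b/text()').extract()        # ['TEXT-1', 'TEXT-3']
    sel.xpath('//div/b/text()').extract()    # ['TEXT-1', 'TEXT-3']
    sel.xpath('//div[2]/text()').extract()   # ['TEXT-2 '] - direct text of the 2nd div only
    sel.xpath('//div[2]//text()').extract()  # ['TEXT-2 ', 'TEXT-3'] - all descendant text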

Slide 24

Slide 24 text

Scrapy Selectors

    name = response.xpath(
        '//section[@class="profile-name"]//h1/text()').extract_first()
    avatar = response.urljoin(
        response.xpath('//img[@class="avatar"]/@src').extract_first())
    for talk in response.xpath('//div[@class="speaker-talks well"]//li'):
        talk_title = talk.xpath('.//text()').extract_first()

xpath using Scrapy selectors

Slide 25

Slide 25 text

BeautifulSoup

    name = soup.find('section', attrs={'class': 'profile-name'}).h1.text
    item['avatar'] = self.base_url + soup.find('img', attrs={'class': 'avatar'})['src']
    for talk in soup.find('div', attrs={'class': 'speaker-talks well'}).dl.dd.ul.find_all('li'):
        title = talk.text

BeautifulSoup uses python objects to interact with the parse tree

Slide 26

Slide 26 text

Someone might have solved it! Don’t reinvent the wheel! Lots of small tools that can make your life easier, for example: ● Scrapy loginform fills in login forms ● dateparser - parse text dates ● webpager helps with pagination
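For example, dateparser turns human-readable dates into datetime objects; a minimal sketch (results for relative dates depend on when you run it):

    import dateparser

    dateparser.parse('12 minutes ago')       # relative dates
    dateparser.parse('23 July 2015')         # absolute dates
    dateparser.parse('Le 11 Décembre 2014')  # many languages are supported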

Slide 27

Slide 27 text

Visual Data Extraction Portia is a Visual Scraper written in Python! ● train by annotating web pages ● run with Scrapy

Slide 28

Slide 28 text

Scaling Extraction At scale, use methods that do not require additional work per website ● boilerplate removal - python-goose, readability, justext... ● analyze text with nltk ● scikit-learn and scikit-image for classification, feature extraction ● webstruct - NER with HTML
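As an example of boilerplate removal, a sketch using readability-lxml, one of the libraries in the family listed above (python-goose and justext offer similar one-call APIs); the article URL is a placeholder:

    import requests
    from readability import Document

    html = requests.get('https://example.com/article').text
    doc = Document(html)
    doc.title()    # page title
    doc.summary()  # main article HTML with boilerplate stripped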

Slide 29

Slide 29 text

Named Entity Recognition ● finds and classifies elements in text into predefined categories ● examples include person names, job titles, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. For English it is often solved using machine learning

Slide 30

Slide 30 text

Web Page Annotation Web pages need to be annotated manually Useful tools include: ● https://github.com/xtannier/WebAnnotator ● https://gate.ac.uk/ ● http://brat.nlplab.org/

Slide 31

Slide 31 text

Labeling A named entity -> one or more tokens This data format is not convenient for ML algorithms

Slide 32

Slide 32 text

Encoding IOB encoding ● Tokens 'outside' named entities - tag O ● The first token of an entity - tag B-ENTITY ● Other tokens of an entity - tag I-ENTITY
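A tiny illustration of IOB tags; the sentence and the ORG/LOC categories are made up for the example:

    tokens = ['Cafe',  'de',    'Flore', 'is', 'in', 'Paris']
    tags   = ['B-ORG', 'I-ORG', 'I-ORG', 'O',  'O',  'B-LOC']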

Slide 33

Slide 33 text

Classification Task The problem is reduced to a "standard" ML classification task ● Input data - information about tokens (==features) ● Output data - named entity label, encoded as IOB ● Use a classifier which takes the order of predicted labels into account (Conditional Random Fields is a common choice)

Slide 34

Slide 34 text

Feature Examples ● token == "Cafe"? ● is the first letter uppercase? ● is the token the name of a month? ● are the two previous tokens "© 2014"? ● is the token inside an HTML element?
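Features like these are typically computed by a small function per token; a minimal sketch, with an illustrative (not canonical) feature set:

    MONTHS = {'january', 'february', 'march', 'april', 'may', 'june', 'july',
              'august', 'september', 'october', 'november', 'december'}

    def token_features(tokens, i):
        token = tokens[i]
        return {
            'token': token,
            'is_title': token.istitle(),                  # first letter uppercase?
            'is_month': token.lower() in MONTHS,          # name of a month?
            'prev-2': ' '.join(tokens[max(i - 2, 0):i]),  # e.g. '© 2014'
        }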

Slide 35

Slide 35 text

Putting it together One way to do it: ● WebAnnotator to annotate pages manually ● WebStruct to load training data (annotated pages) and encode named entity labels to IOB ● write Python functions to extract features (and/or use some of the WebStruct feature extraction functions) ● train a CRF model using python-crfsuite ● WebStruct to combine all the pieces
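A sketch of the training and tagging steps with python-crfsuite, assuming X holds one feature sequence per page and y the matching IOB label sequences (for example as produced with WebStruct):

    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X, y):   # one (features, IOB labels) pair per page
        trainer.append(xseq, yseq)
    trainer.train('ner.crfsuite')  # writes the model to disk

    tagger = pycrfsuite.Tagger()
    tagger.open('ner.crfsuite')
    tagger.tag(X[0])               # predicted IOB labels for the first page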

Slide 36

Slide 36 text

Data Extraction Summary

Slide 37

Slide 37 text

Saatchi Global Gallery Guide ● Scrape 11k+ gallery websites ● Extract artworks, artist and exhibitions

Slide 38

Slide 38 text

Saatchi Global Gallery Guide Crawling: ● Use Scrapy, running a batch of sites per process. Many processes at once. ● Deploy on Scrapy Cloud ● Rank links to follow, prioritizing likely sources of content ● scikit-learn for selecting links to crawl ● webpager to follow pagination ● Limit crawl depth and requests per website

Slide 39

Slide 39 text

Saatchi Global Gallery Guide Extraction: ● webstruct for all structured data extraction ● scikit-learn for feature extraction ● scikit-image - face recognition to distinguish between artists and artworks ● fuzzywuzzy string matching ● goose to clean input html ● Store item hashes and only export updates

Slide 40

Slide 40 text

Saatchi Global Gallery Guide

    r = requests.get('http://oh-wow.com/artists/charlie-billingham/')
    response = scrapy.TextResponse(r.url, body=r.text, encoding='utf-8',
                                   headers={'content-type': 'text/html'})
    ext = ArtworkExtractor()
    print ext.extract_artworks(response)

    [{'MEDIUM': u'Oil and acrylic on linen',
      'PHOTO': 'http://oh-wow.com/wp-content/uploads/2014/0...',
      'SIZE': u'39.5 x 31.5 inches 100.3 x 80 cm',
      'TITLE': u'Unforced Error',
      'YEAR': u'2015'}]

CHARLIE BILLINGHAM Unforced Error, 2015

Artwork extraction example

Slide 41

Slide 41 text

Saatchi Global Gallery Guide evaluation: ● measure accuracy (precision and recall) ● avoid false positives ● test everything, improve iteratively

Slide 42

Slide 42 text

Web Scraping Challenges Difficulty typically depends on the size of data, complexity of extracted items and accuracy requirements

Slide 43

Slide 43 text

Web Scraping Challenges But getting clean data from the web is a dirty business! Some things will kill your scraping..

Slide 44

Slide 44 text

Irregular Structure HTML parsers, our go-to tool, require sane and consistent HTML structure. In practice, some websites will: ● use many different templates ● run multivariate testing ● have very, very broken HTML

Slide 45

Slide 45 text

JavaScript Many sites require JavaScript, or browser rendering, to be scraped. ● Splash is a scriptable browser available via an API. Works well with Scrapy. ● Automate web browser interaction with Selenium ● Most JS-heavy sites call APIs. You can do that too! Use browser tools to inspect. ● js2xml can make JS easier to parse; sometimes you can pull data out with a regexp ● Maybe the mobile site is better?
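A minimal sketch of rendering a page through Splash's HTTP API, assuming a Splash instance is running on localhost:8050 (for example via its Docker image):

    import requests

    rendered = requests.get('http://localhost:8050/render.html',
                            params={'url': 'https://example.com/', 'wait': 0.5})
    html = rendered.text  # the page HTML after JavaScript has run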

Slide 46

Slide 46 text

Proxies ● Sometimes you need to crawl from a specific location ● Many hosting centres (e.g. EC2) are frequently banned outright ● Privacy can be important ● Multiple proxies for sustained, reliable crawling, e.g. Tor, Luminati, open proxies, private providers and Crawlera
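Routing traffic through a proxy is a one-liner with requests; the proxy address and credentials below are placeholders:

    import requests

    proxies = {'http': 'http://user:pass@proxy.example.com:8010',
               'https': 'http://user:pass@proxy.example.com:8010'}
    r = requests.get('https://ep2015.europython.eu/', proxies=proxies)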

Slide 47

Slide 47 text

Ethics ● is your web scraping causing harm? ● crawl at a reasonable rate, especially on smaller websites ● identify your bot via a user agent ● respect robots.txt on broad crawls
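Identifying your bot and honouring robots.txt takes only a few lines; a sketch using the standard library's robot parser and requests (the user agent string is an example):

    import requests
    from urllib.robotparser import RobotFileParser  # 'import robotparser' on Python 2

    USER_AGENT = 'examplebot/1.0 (+https://example.com/bot)'

    rp = RobotFileParser('https://ep2015.europython.eu/robots.txt')
    rp.read()
    if rp.can_fetch(USER_AGENT, 'https://ep2015.europython.eu/en/speakers/'):
        requests.get('https://ep2015.europython.eu/en/speakers/',
                     headers={'User-Agent': USER_AGENT})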

Slide 48

Slide 48 text

Q & A Ask me anything..

Slide 49

Slide 49 text

THANK YOU! Shane Evans @shaneaevans Visit our booth Talk to us at the recruiting session Stick around for the “Dive into Scrapy” talk