
Web Scraping Best Practises

Python is a fantastic language for writing web scrapers. There is a large ecosystem of useful projects and a great developer community. However, it can be confusing once you go beyond the simpler scrapers typically covered in tutorials.

In this talk, from EuroPython 2015, we explore some common real-world scraping tasks. You will learn best practises and get a deeper understanding of what tools and techniques can be used and how to deal with the most challenging of web scraping projects.

Shane Evans

July 21, 2015

Transcript

  1. Web Scraping Best Practises
    Shane Evans
    @shaneaevans


  2. About Shane
    ● 12 years Python, 8 years scraping
    ● Scrapy, Portia, Frontera..
    ● Co-founded Scrapinghub


  3. Why Scrape?
    Sources of data: the Internet holds a vast amount of data
    APIs: availability, limited data, throttling, privacy
    The web is broken - microdata, microformats and RDFa are only patchily adopted
    Endless use cases: price monitoring, lead generation, e-commerce,
    research, specialized news...


  4. Web Scraping Traffic


  5. Badly Written Bots
    Bots can:
    ● use excessive website resources
    ● be unreliable
    ● be hard to maintain


  6. What is Web Scraping
    A technique for extracting information from websites. It includes:
    ● downloading web pages (which may involve
    crawling - extracting and following links)
    ● scraping - extracting data from the downloaded pages


  7. Web Scraping Example
    Scrapinghub gets scraped all the time: our jobs get scraped and
    posted to jobs websites the day they are published!
    How would you build that web scraper?


  8. Downloading with Requests
    For simple cases, use requests
    import requests

    url = 'https://ep2015.europython.eu/en/speakers/'
    r = requests.get(url)
    r.text
    Nice API, clean code!


  9. Crawling with Requests
    import requests

    # methods of a hand-rolled crawler class (class definition omitted on the slide)
    def __init__(self):
        self.session = requests.Session()

    def make_request(self, url):
        try:
            return self.session.get(url)
        except (requests.exceptions.HTTPError,
                requests.exceptions.ConnectTimeout,
                requests.exceptions.ConnectionError) as e:
            # TODO: some retry logic with logging and waits
            pass

    def start_crawl(self):
        response = self.make_request(self.base_url + '/en/speakers/')
        # TODO: extract links to speakers, call make_request for each, then extract


  10. Crawling with Scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class EPSpeakerSpider(CrawlSpider):
        name = 'epspeakers_crawlspider'
        start_urls = ['https://ep2015.europython.eu/en/speakers/']
        rules = [
            Rule(LinkExtractor(allow=('/conference/',)),
                 callback='parse_speakerdetails'),
        ]

        def parse_speakerdetails(self, response):
            ...
    Scrapy is your friend


  11. Crawling Multiple Websites
    Scrapy encourages best practices for scraping multiple websites:
    ● separate spiders for each different website (or crawl logic)
    ● common logic in appropriate places (middlewares, item loaders, etc.)
    ● lots more: tool support, common patterns, etc.
    See the demo at https://github.com/scrapinghub/pycon-speakers


  12. Crawling - following links
    Crawling tips:
    ● find good sources of links e.g. sitemaps
    ● consider the crawl order - depth first,
    breadth first, priority
    ● canonicalize URLs and remove duplicates (see the sketch below)
    ● beware of spider traps! - always add limits
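    A minimal sketch of the canonicalize-and-deduplicate tip, using
    canonicalize_url from w3lib (the URL utility library used by Scrapy); the
    should_visit helper and the in-memory set are just illustrative choices:

    from w3lib.url import canonicalize_url

    seen = set()

    def should_visit(url):
        # Sorting query arguments, stripping fragments, etc. makes trivially
        # different URLs collapse to the same key
        key = canonicalize_url(url)
        if key in seen:
            return False
        seen.add(key)
        return True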


  13. Crawling at Scale
    Lots of data: visited and discovered URLs
    Batch vs. Incremental crawling
    Different strategies for deciding what to crawl:
    - discover new content
    - revisit pages that are likely to have changed
    - prioritize relevant content
    Maintain politeness!


  14. Frontera
    ● open source crawl frontier library, written in python
    ● python API, or integrated with Scrapy
    ● multiple crawl ordering algorithms (priority, HITS,
    pagerank, etc.)
    ● configurable back ends:
    ○ integrated or distributed
    ○ different crawl orderings
    ○ sqlite, hbase, etc.
    ● works at scale


  15. Downloading Summary
    ● requests is a great library for HTTP
    ● scrapify early, especially if you do any
    crawling
    ● frontera for advanced URL orderings or
    larger scale crawling


  16. Extraction - Standard Library
    We have a great standard library:
    ● string handling: slice, split, strip,
    lower, etc.
    ● regular expressions
    Usually used with other techniques
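    For instance, a price can often be pulled out of a text fragment with
    nothing but str methods and a regular expression (the sample string below
    is made up for the example):

    import re

    text = 'Price:  EUR 1,299.00  (incl. VAT)'
    match = re.search(r'EUR\s*([\d.,]+)', text)
    if match:
        price = float(match.group(1).replace(',', ''))  # 1299.0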


  17. Extraction - HTML Parsers
    HTML Parsers are the go-to
    tools for web scrapers!
    Useful when data can be
    extracted via the structure of
    the HTML document


  18. Extraction - HTML Parsers



    [Sample HTML document with text nodes TEXT-1 … TEXT-4; the markup itself
    did not survive the transcript - the XPath slides that follow query it]



  19. Extraction - HTML Parsers
    HTML


  20. Extraction - XPath
    XPath
    //b


  21. Extraction - XPath
    XPath
    //div/b


  22. Extraction - XPath
    XPath
    //div[2]/text()


  23. Extraction - XPath
    XPath
    //div[2]//text()
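    The sample document itself is not preserved in this transcript, so the
    snippet below uses a hypothetical HTML fragment with the same TEXT-n
    labels to show how the four expressions above differ (using parsel, the
    selector library behind Scrapy):

    from parsel import Selector

    html = ('<body>'
            '  <b>TEXT-1</b>'
            '  <div>TEXT-2</div>'
            '  <div><b>TEXT-3</b> TEXT-4</div>'
            '</body>')
    sel = Selector(text=html)
    sel.xpath('//b/text()').extract()        # ['TEXT-1', 'TEXT-3']
    sel.xpath('//div/b/text()').extract()    # ['TEXT-3'] - only <b> inside a <div>
    sel.xpath('//div[2]/text()').extract()   # [' TEXT-4'] - direct text of the 2nd div
    sel.xpath('//div[2]//text()').extract()  # ['TEXT-3', ' TEXT-4'] - all descendant text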


  24. Scrapy Selectors
    name = response.xpath('//section[@class="profile-name"]//h1/text()').extract_first()
    avatar = response.urljoin(
        response.xpath('//img[@class="avatar"]/@src').extract_first())
    for talk in response.xpath('//div[@class="speaker-talks well"]//li'):
        talk_title = talk.xpath('.//text()').extract_first()
    XPath using Scrapy selectors


  25. BeautifulSoup
    name = soup.find('section', attrs={'class': 'profile-name'}).h1.text
    item['avatar'] = self.base_url + soup.find('img', attrs={'class': 'avatar'})['src']
    for talk in soup.find('div', attrs={'class': 'speaker-talks well'}).dl.dd.ul.find_all('li'):
        title = talk.text
    BeautifulSoup uses Python objects to interact with the parse tree


  26. Someone might have solved it!
    Don’t reinvent the wheel!
    Lots of small tools that can make your
    life easier, for example:
    ● Scrapy loginform fills in login
    forms
    ● dateparser - parses text dates (example below)
    ● webpager helps with pagination
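    For example, dateparser (assuming the dateparser package is installed)
    turns human-readable date strings, including relative ones, into datetime
    objects:

    import dateparser

    dateparser.parse('21 July 2015')   # datetime.datetime(2015, 7, 21, 0, 0)
    dateparser.parse('2 days ago')     # relative to the current time
    dateparser.parse('12 juin 2015')   # many languages are handled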


  27. Visual Data Extraction
    Portia is a Visual Scraper
    written in Python!
    ● train by annotating web
    pages
    ● run with Scrapy


  28. Scaling Extraction
    At scale, use methods that do not
    require additional work per website
    ● boilerplate removal - python-goose, readability, justext... (sketch below)
    ● analyze text with nltk
    ● scikit-learn and scikit-image for
    classification, feature extraction
    ● webstruct - NER with HTML
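    A minimal sketch of the boilerplate-removal idea with readability-lxml
    (one of the libraries listed above); the URL is only a placeholder:

    import requests
    from readability import Document

    html = requests.get('https://example.com/some-article').text
    doc = Document(html)
    doc.title()    # best-guess title of the main article
    doc.summary()  # cleaned-up HTML of the main content, boilerplate removed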


  29. Named Entity Recognition
    ● finds and classifies elements in text into predefined
    categories
    ● examples include person names or job titles,
    organizations, locations, expressions of times, quantities,
    monetary values, percentages, etc.
    For English it is often solved using machine learning


  30. Web Page Annotation
    Web pages need to be annotated manually
    Useful tools include:
    ● https://github.com/xtannier/WebAnnotator
    ● https://gate.ac.uk/
    ● http://brat.nlplab.org/


  31. Labeling
    A named entity spans one or more tokens
    This data format is not convenient for ML algorithms


  32. Encoding
    IOB encoding
    • tokens outside named entities get the tag O
    • the first token of an entity gets the tag B-ENTITY
    • the remaining tokens of an entity get the tag I-ENTITY
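    On a concrete (made-up) token sequence the scheme looks like this:

    tokens = ['Open', 'daily', 'at', 'Cafe',  'Milano', ',', 'Berlin']
    labels = ['O',    'O',     'O',  'B-ORG', 'I-ORG',  'O', 'B-CITY']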


  33. Classification Task
    The problem is reduced to a "standard" ML classification task
    ● Input data - information about tokens (==features)
    ● Output data - named entity label, encoded as IOB
    ● Use a classifier which takes the order of predicted labels
    into account (Conditional Random Fields is a common
    choice)


  34. Feature Examples
    ● token == "Cafe"?
    ● is the first letter uppercase?
    ● is token a name of a month?
    ● are the two previous tokens "© 2014"?
    ● is the token inside a particular HTML element?
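    In code, features like these are usually computed per token as a dict,
    which is the input format CRF libraries expect; this function is only a
    sketch, with a deliberately truncated month list:

    def token_features(tokens, i):
        token = tokens[i]
        return {
            'token': token,
            'is_title': token.istitle(),  # first letter uppercase?
            'is_month': token.lower() in {'january', 'february', 'march'},
            'prev_is_copyright_2014': tokens[max(i - 2, 0):i] == ['©', '2014'],
        }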


  35. Putting it together
    One way to do it:
    ● WebAnnotator to annotate pages manually
    ● WebStruct to load training data (annotated pages)
    and encode named entity labels to IOB
    ● write Python functions to extract features (and/or use
    some of the WebStruct feature extraction functions)
    ● train a CRF model using python-crfsuite (sketch below)
    ● WebStruct to combine all the pieces
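    A rough sketch of the training step with python-crfsuite; the tiny
    X_train/y_train below stand in for the feature dicts and IOB labels that
    the previous steps would produce:

    import pycrfsuite

    X_train = [[{'token': 'Cafe', 'is_title': True},
                {'token': 'Milano', 'is_title': True}]]
    y_train = [['B-ORG', 'I-ORG']]

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.train('ner-model.crfsuite')   # writes the model to disk

    tagger = pycrfsuite.Tagger()
    tagger.open('ner-model.crfsuite')
    tagger.tag(X_train[0])                # predicted IOB labels for one sequence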


  36. Data Extraction Summary


  37. Saatchi Global Gallery Guide
    ● Scrape 11k+ gallery websites
    ● Extract artworks, artists and exhibitions


  38. Saatchi Global Gallery Guide
    Crawling:
    ● Use Scrapy, running a batch of sites
    per process. Many processes at once.
    ● Deploy on Scrapy Cloud
    ● Rank links to follow, prioritizing likely
    sources of content
    ● scikit-learn for selecting links to crawl
    ● webpager to follow pagination
    ● Limit crawl depth and requests per
    website


  39. Saatchi Global Gallery Guide
    Extraction:
    ● webstruct for all structured data
    extraction
    ● scikit-learn for feature extraction
    ● scikit-image - face recognition to
    distinguish between artists and
    artworks
    ● fuzzywuzzy string matching (example below)
    ● goose to clean input html
    ● Store item hashes and only export
    updates
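    For instance, fuzzywuzzy scores string similarity, which helps match
    artist names that are spelled or ordered slightly differently:

    from fuzzywuzzy import fuzz

    fuzz.ratio('Charlie Billingham', 'Charlie  Billingham')             # high score
    fuzz.token_sort_ratio('Billingham, Charlie', 'Charlie Billingham')  # 100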


  40. Saatchi Global Gallery Guide
    Artwork extraction example:

    r = requests.get('http://oh-wow.com/artists/charlie-billingham/')
    response = scrapy.TextResponse(r.url, body=r.text,
                                   encoding='utf-8',
                                   headers={'content-type': 'text/html'})
    ext = ArtworkExtractor()
    print ext.extract_artworks(response)

    [{'MEDIUM': u'Oil and acrylic on linen',
      'PHOTO': 'http://oh-wow.com/wp-content/uploads/2014/0...',
      'SIZE': u'39.5 x 31.5 inches 100.3 x 80 cm',
      'TITLE': u'Unforced Error',
      'YEAR': u'2015'}]

    (the slide shows the matching artwork: CHARLIE BILLINGHAM, Unforced Error, 2015)


  41. Saatchi Global Gallery Guide
    Evaluation:
    ● measure accuracy (precision
    and recall)
    ● avoid false positives
    ● test everything, improve
    iteratively


  42. Web Scraping Challenges
    Difficulty typically
    depends on the size of
    data, complexity of
    extracted items and
    accuracy requirements


  43. Web Scraping Challenges
    But getting clean data from the
    web is a dirty business!
    Some things will kill your
    scraping..


  44. Irregular Structure
    HTML parsers, our go-to tool, require
    sane and consistent HTML structure.
    In practice, some websites will:
    ● use many different templates
    ● run multivariate testing
    ● have very, very broken HTML (see the parser sketch below)
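    Broken HTML is usually survivable if you pick a forgiving parser. A small
    sketch with BeautifulSoup and the html5lib backend (requires the html5lib
    package), which rebuilds the tree the way a browser would:

    from bs4 import BeautifulSoup

    broken = '<div><b>Price: <span>10 EUR</div>'   # unclosed <b> and <span>
    soup = BeautifulSoup(broken, 'html5lib')
    soup.span.text   # '10 EUR'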


  45. JavaScript
    Many sites require JavaScript, or browser
    rendering, to be scraped.
    ● Splash is a scriptable browser available via
    an API. Works well with Scrapy (sketch below).
    ● Automate web browser interaction with
    Selenium
    ● Most JS-heavy sites call APIs. You can do that
    too! use browser tools to inspect
    ● js2xml can make JS easier to parse,
    sometimes pull data with regexp
    ● Maybe the mobile site is better?
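    The Splash option, sketched via its HTTP API; this assumes a Splash
    instance is running locally on port 8050 (e.g. started via Docker) and the
    page URL is only a placeholder:

    import requests

    rendered = requests.get('http://localhost:8050/render.html',
                            params={'url': 'https://example.com/js-heavy-page',
                                    'wait': 0.5})
    html = rendered.text   # the page's HTML after JavaScript has run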


  46. Proxies
    ● Sometimes need to crawl from a
    specific location
    ● Many hosting centres (e.g. EC2) are
    frequently entirely banned
    ● Privacy can be important
    ● Multiple proxies for sustained
    reliable crawling
    e.g. Tor, Luminati, open proxies, private providers and Crawlera
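    Routing requests through a proxy is a one-liner with requests; the proxy
    URL below is a placeholder for whichever provider you use:

    import requests

    proxies = {'http': 'http://user:pass@proxy.example.com:8010',
               'https': 'http://user:pass@proxy.example.com:8010'}
    r = requests.get('https://ep2015.europython.eu/', proxies=proxies)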


  47. Ethics
    ● is your web scraping causing harm?
    ● crawl at a reasonable rate,
    especially on smaller websites
    ● identify your bot via a user agent
    ● respect robots.txt on broad crawls
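    Checking robots.txt before fetching needs nothing beyond the standard
    library (urllib.robotparser in Python 3; the module is called robotparser
    in Python 2):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('https://ep2015.europython.eu/robots.txt')
    rp.read()
    rp.can_fetch('my-scraper-bot', 'https://ep2015.europython.eu/en/speakers/')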


  48. Q & A
    Ask me anything..


  49. THANK YOU!
    Shane Evans
    @shaneaevans
    Visit our booth
    Talk to us at the recruiting session
    Stick around for the “Dive into Scrapy” talk
