
Writing a Crawler


How to write a crawler using Scrapy.

Mingshen Sun

October 19, 2015

Transcript

  1. ANSR Lab Group Study
    Writing a Crawler
    Bob Mingshen Sun
    June 1, 2015


  2. Writing a Crawler — Outline
    1 Introduction
    Definition
    Goals
    Challenges
    2 Scrapy at a Glance
    3 Writing a Crawler
    Item
    Spider
    Item Pipeline
    Architecture
    Middleware
    Command Line Tools

  3. Writing a Crawler — Definition
    Definition
    A Web crawler is an Internet bot that systematically browses the World
    Wide Web, typically for the purpose of Web indexing. A Web crawler
    may also be called a Web spider, an ant, an automatic indexer, or a
    Web scutter.

  8. Writing a Crawler — Goals
    Web Scraping: extracting information from websites
    downloading resources (e.g., Android apps)
    extracting OSN information (e.g., Alice follows Bob)
    harvesting scores or comments for movies or products (e.g., The
    Godfather scores 9.2 on IMDB)

  9. Writing a Crawler — Challenges
    Pagination
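    Pagination is typically handled by following the "next page" link and reusing the same
    callback. A minimal sketch using Scrapy (introduced a few slides later); the URL and the
    XPath for the "next" link are placeholders, not taken from any site in these slides:

    from urlparse import urljoin            # Python 2, as used throughout these slides

    from scrapy import Spider
    from scrapy.http import Request
    from scrapy.selector import Selector

    class PaginatedSpider(Spider):
        name, start_urls = 'paginatedspider', ['http://example.com/list?page=1']  # placeholder

        def parse(self, response):
            sel = Selector(response)
            # ... extract items from the current page here ...

            # Follow the "next page" link, if any, and parse it with the same callback.
            next_links = sel.xpath("//a[@rel='next']/@href").extract()
            if next_links:
                yield Request(urljoin(response.url, next_links[0]), callback=self.parse)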

  10. Writing a Crawler — Challenges
    Crawler Detection

  11. Writing a Crawler — Challenges
    Cookie

  12. Writing a Crawler — Challenges
    pagination
    crawler/bot detection
    cookie, referer, user agent
    JavaScript

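    For the cookie, referer and user-agent issues above, a hedged sketch of how Scrapy
    (introduced on the next slides) lets you set all three per request; the URL, header
    values and cookie below are placeholders for illustration only:

    from scrapy import Spider
    from scrapy.http import Request

    class PoliteSpider(Spider):
        name = 'politespider'

        def start_requests(self):
            # Placeholder URL, header values and cookie.
            yield Request('http://example.com/',
                          headers={'User-Agent': 'Mozilla/5.0 (compatible; mycrawler)',
                                   'Referer': 'http://example.com/index'},
                          cookies={'sessionid': 'placeholder'},
                          callback=self.parse)

        def parse(self, response):
            pass  # parsing logic goes here

    A project-wide user agent can also be set with the USER_AGENT setting in settings.py.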

  13. Writing a Crawler — Introducing Scrapy

  15. Writing a Crawler — Introducing Scrapy
    An open source and collaborative framework for extracting the data
    you need from websites. In a fast, simple, yet extensible way.
    Fast and powerful: write the rules to extract the data and let
    Scrapy do the rest
    Easily extensible: extensible by design, plug new functionality
    easily without having to touch the core
    Portable, Python: written in Python and runs on Linux, Windows,
    Mac and BSD

  16. Writing a Crawler — Introducing Scrapy
    EXCITED!!! Python???

  17. Writing a Crawler — Learning Python
    A quick & dirty introduction to Python programming language.
    from scrapy import Spider, Item, Field
    class Post(Item):
    title = Field()
    class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://blog.scrapinghub.com']
    def parse(self, response):
    res = []
    for e in response.css("h2 a::text"):
    res.append(Post(title=e.extract()))
    return res

  18. Writing a Crawler — Scrapy at a Glance
    $ scrapy runspider myspider.py
    ...
    2015-05-28 21:59:41+0800 [blogspider] DEBUG: Crawled (200) (refer
    2015-05-28 21:59:41+0800 [blogspider] DEBUG: Scraped from <200 http://blog.scrapinghub.com>
    'title': u'Gender Inequality Across Programming\xa0Languages'}
    2015-05-28 21:59:42+0800 [blogspider] INFO: Dumping Scrapy stats:
    'downloader/request_bytes': 219,
    'downloader/request_count': 1,
    'downloader/request_method_count/GET': 1,
    'downloader/response_bytes': 91182,
    'downloader/response_count': 1,
    'downloader/response_status_count/200': 1,
    'finish_reason': 'finished',
    ...

  19. Writing a Crawler — Scrapy at a Glance
    1 pick a website
    2 define the data you want to scrape
    3 write a spider to extract the data
    4 run the spider to extract the data
    5 review scraped data

  20. Writing a Crawler — Scrapy at a Glance
    pick a website: IMDB Top 250

  21. Writing a Crawler — Scrapy at a Glance
    define the data you want to scrape:
    rank
    movie title
    publish year
    IMDB rating

    from scrapy.item import Item, Field

    class MovieItem(Item):
        rank = Field()
        title = Field()
        year = Field()
        rating = Field()

  22. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data
    1 read HTML code
    2 find common patterns
    3 determine XPATH
    4 write a spider

  23. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data: 1. read HTML code

  24. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data: 2. find common patterns

  25. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data: 3. determine XPATH
    HTML Preliminaries



  26. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data: 3. determine XPATH
    rank:
    //td[@class='titleColumn']/span[@name='ir']/text()
    title: //td[@class='titleColumn']/a/text()
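    A convenient way to try XPath expressions like these before writing the spider is
    Scrapy's interactive shell (mentioned again a few slides later). A rough session
    might look like this; the extracted lists are printed directly in the shell:

    $ scrapy shell 'http://www.imdb.com/chart/top'
    ...
    >>> sel.xpath("//td[@class='titleColumn']/a/text()").extract()
    >>> sel.xpath("//td[@class='ratingColumn imdbRating']/strong/text()").extract()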

  27. Writing a Crawler — Scrapy at a Glance
    write a spider to extract the data: 4. write a spider

    from scrapy import Item, Field, Spider
    from scrapy.selector import Selector

    class MovieItem(Item):
        rank = Field()
        title = Field()
        year = Field()
        rating = Field()

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            rank_list = sel.xpath("//td[@class='titleColumn']/span[@name='ir']/text()").extract()
            title_list = sel.xpath("//td[@class='titleColumn']/a/text()").extract()
            year_list = sel.xpath("//td[@class='titleColumn']/span[@name='rd']/text()").extract()
            rating_list = sel.xpath("//td[@class='ratingColumn imdbRating']/strong/text()").extract()
            movie_items = []
            for i in range(len(rank_list)):
                movie_item = MovieItem()
                movie_item['rank'] = rank_list[i][:-1]    # drop the trailing '.' of the rank text
                movie_item['title'] = title_list[i]
                movie_item['year'] = year_list[i][1:-1]   # drop the parentheses around the year text
                movie_item['rating'] = rating_list[i]
                movie_items.append(movie_item)
            return movie_items

  28. Writing a Crawler — Scrapy at a Glance
    run the spider to extract the data
    $ scrapy runspider imdb_spider.py -o imdb.json -t json

  29. Writing a Crawler — Scrapy at a Glance
    review scraped data
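    One quick way to review the scraped data is to load the exported file back in Python.
    A minimal sketch, assuming the imdb.json file produced by the runspider command above:

    import json

    with open('imdb.json') as f:
        movies = json.load(f)            # the JSON exporter writes one array of items

    for movie in movies[:5]:             # spot-check the first few entries
        print movie['rank'], movie['title'], movie['year'], movie['rating']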

  30. Writing a Crawler — Scrapy at a Glance
    What else?
    Built-in support for selecting and
    extracting data from HTML and XML
    sources
    JSON, CSV, XML, Storage
    middleware: user-agent spoofing
    Interactive shell console
    Support for creating spiders based
    on pre-defined templates
    Service

  31. Wanna crawl millions of
    webpages?


  34. Stay tuned. Let’s do it from
    scratch.


  35. Writing a Crawler — Installation
    Installing Scrapy: Python 2.7, pip, lxml, OpenSSL
    $ pip install Scrapy
    Platform-specific installation notes
    Windows (forget it)
    Ubuntu
    Don't use the python-scrapy package from the official apt repositories.
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
    echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
    sudo apt-get update && sudo apt-get install scrapy-0.24
    Arch Linux: sudo pacman -S scrapy

  36. Writing a Crawler — Creating a Project
    Creating a project: scrapy startproject tutorial
    scrapy.cfg: configuration file
    tutorial/: the project's Python module; you'll later import your code from here
    tutorial/items.py: items file
    tutorial/pipelines.py: pipelines file
    tutorial/settings.py: settings file
    tutorial/spiders/: a directory where you'll later put your spiders

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                ...

  37. Writing a Crawler — Item
    Items are containers:
    they work like simple Python dicts
    but provide additional protection against populating undeclared fields, to prevent typos

    from scrapy.item import Item, Field

    class MovieItem(Item):
        id = Field()
        title = Field()
        year = Field()
        rating = Field()
        # ...

    class ReviewItem(Item):
        id = Field()
        star_rating = Field()
        time = Field()
        country = Field()
        review = Field()
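    A short usage sketch of the dict-like behaviour and the typo protection described above,
    using the MovieItem declared on this slide:

    movie = MovieItem(title='The Godfather', year='1972')
    movie['rating'] = '9.2'        # fine: 'rating' is a declared field
    print movie['title']           # items support dict-style access
    movie['ratng'] = '9.2'         # KeyError: 'ratng' is not a declared field, so the typo is caught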

  38. Writing a Crawler — Spider
    Spiders are user-written classes used to scrape information from a
    domain (or group of domains). A spider defines:
    an initial list of URLs to download
    how to follow links
    how to parse the contents of those pages to extract items

  39. Writing a Crawler — Spider
    To create a Spider, you must subclass scrapy.spider.Spider, and
    define the three main, mandatory, attributes:
    name
    start_urls
    parse() is a method of the spider
    will be called with the downloaded Response object of each start
    URL
    the response is passed to the method as the first and only
    argument
    for parsing the response data and extracting scraped data and
    more URLs to follow

  40. Writing a Crawler — Spider

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            return movie_item

  41. We want to crawl all movies
    in IMDB!


  42. Writing a Crawler — Spider

    from scrapy.http import Request   # needed for following links

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            yield movie_item
            # follow every link found on the page and parse it with the same callback
            for url in sel.xpath('//a/@href').extract():
                yield Request(url, callback=self.parse)
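    One caveat about the link-following loop above: hrefs extracted from the page are often
    relative, so in practice they usually need to be turned into absolute URLs before being
    wrapped in a Request, for example:

    from urlparse import urljoin   # Python 2

    for url in sel.xpath('//a/@href').extract():
        yield Request(urljoin(response.url, url), callback=self.parse)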

  43. Writing a Crawler — CrawlSpider
    class scrapy.contrib.spiders.CrawlSpider
    the most commonly used spider for crawling regular websites
    it provides a convenient mechanism for following links by defining
    a set of rules
    start from it and override it as needed for more custom
    functionality
    or just implement your own spider

  44. Writing a Crawler — CrawlSpider

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class IMDBSpider(CrawlSpider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']
        rules = (
            # follow matching links without parsing them
            Rule(SgmlLinkExtractor(allow=('category\.php', ),
                                   deny=('subsection\.php', ))),
            # follow matching links and parse them with the given callbacks
            Rule(SgmlLinkExtractor(allow=('regular expression', )),
                 callback='parse_review'),
            Rule(SgmlLinkExtractor(allow=('regular expression', )),
                 callback='parse_movie'),
        )

        def parse_movie(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            yield movie_item

  45. Writing a Crawler — Item Pipeline
    After an item has been scraped by a spider, it is sent to the Item
    Pipeline, which processes it through several components that are
    executed sequentially.
    Each item pipeline component is a Python class that implements a
    simple method. It
    receives an Item and performs an action on it
    decides whether the Item should continue through the pipeline
    or be dropped and no longer processed

  46. Writing a Crawler — Item Pipeline
    Typical uses for item pipelines are:
    cleansing HTML data
    validating scraped data (checking that the items contain certain fields)
    checking for duplicates (and dropping them; see the sketch below)
    storing the scraped item in a database
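    As a sketch of the duplicate-checking use case listed above, assuming items carry a
    unique id field like the MovieItem and ReviewItem declared earlier:

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['id'])
            return item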

  47. Writing a Crawler — Item Pipeline

    from scrapy.exceptions import DropItem

    class PricePipeline(object):
        vat_factor = 1.15

        def process_item(self, item, spider):
            if item['price']:
                if item['price_excludes_vat']:
                    item['price'] = item['price'] * self.vat_factor
                return item
            else:
                raise DropItem("Missing price in %s" % item)

  48. Writing a Crawler — Item Pipeline

    import json

    class JsonWriterPipeline(object):
        def __init__(self):
            self.file = open('items.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item

  49. Writing a Crawler — Item Pipeline
    Activating an Item Pipeline component: to activate a component, add its
    class to the ITEM_PIPELINES setting, as in the following example:

    ITEM_PIPELINES = {
        'myproject.pipeline.PricePipeline': 300,
        'myproject.pipeline.JsonWriterPipeline': 800,
    }

  50. Writing a Crawler — Scrapy Architecture

  51. Writing a Crawler — Downloader Middlewares
    The downloader middleware is a framework of hooks into Scrapy’s
    request/response processing. It’s a light, low-level system for globally
    altering Scrapy’s requests and responses.
    CookiesMiddleware
    DefaultHeadersMiddleware
    DownloadTimeoutMiddleware
    HttpAuthMiddleware
    UserAgentMiddleware
    AjaxCrawlMiddleware
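    A minimal sketch of a custom downloader middleware that rewrites the User-Agent header
    on every outgoing request, and how it would be activated; the module path, header value
    and order number below are placeholders:

    class CustomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            # Called for every request before it is sent to the downloader.
            request.headers['User-Agent'] = 'Mozilla/5.0 (compatible; mycrawler)'
            return None            # None means: continue processing this request normally

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomUserAgentMiddleware': 400,
    }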

  52. Writing a Crawler — Spider Middleware
    The spider middleware is a framework of hooks into Scrapy’s spider
    processing mechanism where you can plug custom functionality to
    process the responses that are sent to Spiders for processing and to
    process the requests and items that are generated from spiders.
    DepthMiddleware
    HttpErrorMiddleware
    UrlLengthMiddleware
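    And a hedged sketch of a custom spider middleware that drops items missing a title from
    whatever a spider callback returns; again, the class and module names are placeholders:

    from scrapy.item import Item

    class RequireTitleMiddleware(object):
        def process_spider_output(self, response, result, spider):
            # result is the iterable of items/requests produced by the spider callback.
            for r in result:
                if isinstance(r, Item) and not r.get('title'):
                    continue       # silently drop items without a title
                yield r

    # settings.py
    SPIDER_MIDDLEWARES = {
        'myproject.middlewares.RequireTitleMiddleware': 543,
    }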

  53. Writing a Crawler — Command Line Tools
    Command line tools for Scrapy
    create a project: scrapy startproject myproject
    create a new spider:
    scrapy genspider mydomain mydomain.com
    start crawling: scrapy crawl mydomain (the argument is the spider's name)
    pause and continue:
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

  54. Summary
    1 Introduction
    Definition
    Goals
    Challenges
    2 Scrapy at a Glance
    3 Writing a Crawler
    Item
    Spider
    Item Pipeline
    Architecture
    Middleware
    Command Line Tools

  55. Thank you!
    Q&A and Demo


  56. References
    http://www.improgrammer.net/linux-world-map/
    http://thehackernews.com/2013/06/most-sophisticated-android-malware-ever.html
    http://www.makelinux.net/android/internals/
    http://wiki.tei-c.org/index.php/File:0xbabaf000l.png
    http://phdcomics.com/comics/archive.php?comicid=1796
    Scrapy: http://scrapy.org/