
Writing a Crawler

How to write a crawler using Scrapy.

Mingshen Sun

October 19, 2015

Transcript

  1. Writing a Crawler — Outline
     1 Introduction: Definition, Goals, Challenges
     2 Scrapy at a Glance
     3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools
  2. Writing a Crawler — Definition
     A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.
  3. Writing a Crawler — Goals
     Web scraping: extracting information from websites, for example
     - downloading resources (e.g., Android apps)
     - extracting online social network (OSN) information (e.g., Alice follows Bob)
     - harvesting scores or comments for movies or products (e.g., The Godfather scores 9.2 on IMDB)
  4. Writing a Crawler — Challenges
     - pagination
     - crawler/bot detection
     - cookie, referer, user agent
     - JavaScript
     - …
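     Several of these challenges can be handled per request in Scrapy. A minimal sketch, assuming illustrative URLs, header values, and selector patterns that are not from the slides, of setting the cookie, referer, and user agent on a request and following pagination links:

     from scrapy import Spider
     from scrapy.http import Request

     class StealthSpider(Spider):
         name = 'stealthspider'

         def start_requests(self):
             # Send an explicit user agent, referer, and cookie so the request
             # looks more like an ordinary browser session.
             yield Request(
                 'http://example.com/list?page=1',
                 headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
                          'Referer': 'http://example.com/'},
                 cookies={'sessionid': 'placeholder'},
                 callback=self.parse)

         def parse(self, response):
             # Pagination: follow the "next page" link, if any, with the same callback.
             for href in response.xpath("//a[@rel='next']/@href").extract():
                 yield Request(response.urljoin(href), callback=self.parse)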
  5. Writing a Crawler — Introducing Scrapy
     An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
     - Fast and powerful: write the rules to extract the data and let Scrapy do the rest
     - Easily extensible: extensible by design, plug in new functionality easily without having to touch the core
     - Portable, Python: written in Python and runs on Linux, Windows, Mac and BSD
  6. Writing a Crawler — Learning Python
     A quick & dirty introduction to the Python programming language:

     from scrapy import Spider, Item, Field

     class Post(Item):
         title = Field()

     class BlogSpider(Spider):
         name, start_urls = 'blogspider', ['http://blog.scrapinghub.com']

         def parse(self, response):
             res = []
             for e in response.css("h2 a::text"):
                 res.append(Post(title=e.extract()))
             return res
  7. Writing a Crawler — Scrapy at a Glance
     $ scrapy runspider myspider.py
     ...
     2015-05-28 21:59:41+0800 [blogspider] DEBUG: Crawled (200) <GET http://blog.scrapinghub.com> (refer
     2015-05-28 21:59:41+0800 [blogspider] DEBUG: Scraped from <200 http://blog.scrapinghub.com>
         {'title': u'Gender Inequality Across Programming\xa0Languages'}
     2015-05-28 21:59:42+0800 [blogspider] INFO: Dumping Scrapy stats:
         {'downloader/request_bytes': 219,
          'downloader/request_count': 1,
          'downloader/request_method_count/GET': 1,
          'downloader/response_bytes': 91182,
          'downloader/response_count': 1,
          'downloader/response_status_count/200': 1,
          'finish_reason': 'finished',
          ...
  8. Writing a Crawler — Scrapy at a Glance
     1 pick a website
     2 define the data you want to scrape
     3 write a spider to extract the data
     4 run the spider to extract the data
     5 review the scraped data
  9. Writing a Crawler — Scrapy at a Glance
     Pick a website: IMDB Top 250
  10. Writing a Crawler — Scrapy at a Glance
     Define the data you want to scrape: rank, movie title, publish year, IMDB rating.

     from scrapy.item import Item, Field

     class MovieItem(Item):
         rank = Field()
         title = Field()
         year = Field()
         rating = Field()
  11. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data:
     1 read the HTML code
     2 find common patterns
     3 determine the XPath
     4 write a spider
  12. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 1. read the HTML code
  13. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 2. find common patterns
  14. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 3. determine the XPath
     HTML preliminaries:

     <div id="logo"></div>
     <div class="rating"></div>
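     For the two snippets above, the matching XPath expressions select elements by attribute value:

     //div[@id='logo']         matches the <div> whose id is "logo"
     //div[@class='rating']    matches the <div> whose class attribute is exactly "rating"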
  15. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 3. determine the XPath

     rank:  //td[@class='titleColumn']/span[@name='ir']/text()
     title: //td[@class='titleColumn']/a/text()
  16. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 4. write a spider

     from scrapy import Item, Field, Spider
     from scrapy.selector import Selector

     class MovieItem(Item):
         rank = Field()
         title = Field()
         year = Field()
         rating = Field()

     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             rank_list = sel.xpath("//td[@class='titleColumn']/span[@name='ir']/text()").extract()
             title_list = sel.xpath("//td[@class='titleColumn']/a/text()").extract()
             year_list = sel.xpath("//td[@class='titleColumn']/span[@name='rd']/text()").extract()
             rating_list = sel.xpath("//td[@class='ratingColumn imdbRating']/strong/text()").extract()
             movie_items = []
             for i in range(len(rank_list)):
                 movie_item = MovieItem()
                 movie_item['rank'] = rank_list[i][:-1]    # strip the trailing character of the rank text (e.g. "1." -> "1")
                 movie_item['title'] = title_list[i]
                 movie_item['year'] = year_list[i][1:-1]   # strip the surrounding parentheses (e.g. "(1994)" -> "1994")
                 movie_item['rating'] = rating_list[i]
                 movie_items.append(movie_item)
             return movie_items
  17. Writing a Crawler — Scrapy at a Glance
     Run the spider to extract the data:

     $ scrapy runspider imdb_spider.py -o imdb.json -t json
  18. Writing a Crawler — Scrapy at a Glance
     Review the scraped data.
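     The original slide shows a screenshot of the resulting imdb.json. An illustrative fragment (values representative, not copied from the slide) of what the reviewed output looks like:

     [
       {"rank": "1", "title": "The Shawshank Redemption", "year": "1994", "rating": "9.2"},
       {"rank": "2", "title": "The Godfather", "year": "1972", "rating": "9.2"},
       ...
     ]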
  19. Writing a Crawler — Scrapy at a Glance
     What else?
     - built-in support for selecting and extracting data from HTML and XML sources
     - JSON, CSV, XML export and storage
     - middleware: user-agent spoofing, …
     - interactive shell console
     - support for creating spiders based on pre-defined templates
     - service
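     The interactive shell console mentioned above is handy for trying selectors against a live page before putting them in a spider. A rough sketch of a session (output abbreviated and illustrative):

     $ scrapy shell 'http://www.imdb.com/chart/top'
     ...
     >>> response.xpath("//td[@class='titleColumn']/a/text()").extract()[:3]
     [u'The Shawshank Redemption', u'The Godfather', u'The Godfather: Part II']
     >>> exit()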
  20. Writing a Crawler — Installation
     Installing Scrapy requires Python 2.7, pip, lxml, and OpenSSL:

     $ pip install Scrapy

     Platform-specific installation notes:
     - Windows: forget it
     - Ubuntu: don't use the python-scrapy package from the official apt repositories; use the Scrapy repository instead:

       sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
       echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
       sudo apt-get update && sudo apt-get install scrapy-0.24

     - Arch Linux: sudo pacman -S scrapy
  21. Writing a Crawler — Creating a Project
     Creating a project: scrapy startproject tutorial

     tutorial/
         scrapy.cfg          configuration file
         tutorial/           the project's Python module; you'll later import your code from here
             __init__.py
             items.py        items file
             pipelines.py    pipelines file
             settings.py     settings file
             spiders/        a directory where you'll later put your spiders
                 __init__.py
                 ...
  22. Writing a Crawler — Item
     Items are containers: they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.

     from scrapy.item import Item, Field

     class MovieItem(Item):
         id = Field()
         title = Field()
         year = Field()
         rating = Field()
         # ...

     class ReviewItem(Item):
         id = Field()
         star_rating = Field()
         time = Field()
         country = Field()
         review = Field()
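     Concretely, the protection against undeclared fields means that setting a key that was never declared as a Field raises a KeyError instead of silently creating it; a quick illustration in a Python shell:

     >>> item = MovieItem(title='The Godfather', year='1972')
     >>> item['rating'] = '9.2'    # declared field: fine
     >>> item['ratng'] = '9.2'     # typo: this field was never declared
     Traceback (most recent call last):
       ...
     KeyError: 'MovieItem does not support field: ratng'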
  23. Writing a Crawler — Spider
     Spiders are user-written classes used to scrape information from a domain (or group of domains). A spider defines:
     - an initial list of URLs to download
     - how to follow links
     - how to parse the contents of those pages to extract items
  24. Writing a Crawler — Spider
     To create a Spider, you must subclass scrapy.spider.Spider and define the three main, mandatory attributes:
     - name
     - start_urls
     - parse(): a method of the spider that will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument, and the method is responsible for parsing the response data and extracting scraped data and more URLs to follow.
  25. Writing a Crawler — Spider
     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             return movie_item
  26. Writing a Crawler — Spider
     from scrapy.http import Request

     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             yield movie_item
             # follow every link on the page and parse it with the same callback
             for url in sel.xpath('//a/@href').extract():
                 yield Request(url, callback=self.parse)
  27. Writing a Crawler — CrawlSpider
     class scrapy.contrib.spiders.CrawlSpider
     - the most commonly used spider for crawling regular websites
     - it provides a convenient mechanism for following links by defining a set of rules
     - start from it and override it as needed for more custom functionality, or just implement your own spider
  28. Writing a Crawler — CrawlSpider
     from scrapy.contrib.spiders import CrawlSpider, Rule
     from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

     class IMDBSpider(CrawlSpider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']
         rules = (
             # follow links matching category.php (but not subsection.php); no callback, just follow
             Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
             # parse matching pages with the given callbacks
             Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_review'),
             Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_movie'),
         )

         def parse_movie(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             yield movie_item
  29. Writing a Crawler — Item Pipeline
     After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class that implements a simple method. It:
     - receives an Item and performs an action over it
     - decides whether the Item should continue through the pipeline or be dropped and no longer processed
  30. Writing a Crawler — Item Pipeline
     Typical uses for item pipelines are:
     - cleansing HTML data
     - validating scraped data (checking that the items contain certain fields)
     - checking for duplicates (and dropping them)
     - storing the scraped item in a database
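     A minimal sketch of the "checking for duplicates" case, in the spirit of the example from the Scrapy documentation; it assumes the id field (declared in the MovieItem/ReviewItem classes earlier) uniquely identifies an item:

     from scrapy.exceptions import DropItem

     class DuplicatesPipeline(object):
         def __init__(self):
             self.ids_seen = set()

         def process_item(self, item, spider):
             # Drop any item whose id was already scraped during this crawl.
             if item['id'] in self.ids_seen:
                 raise DropItem("Duplicate item found: %s" % item)
             self.ids_seen.add(item['id'])
             return item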
  31. Writing a Crawler — Item Pipeline
     from scrapy.exceptions import DropItem

     class PricePipeline(object):
         vat_factor = 1.15

         def process_item(self, item, spider):
             if item['price']:
                 if item['price_excludes_vat']:
                     item['price'] = item['price'] * self.vat_factor
                 return item
             else:
                 raise DropItem("Missing price in %s" % item)
  32. Writing a Crawler — Item Pipeline
     import json

     class JsonWriterPipeline(object):
         def __init__(self):
             self.file = open('items.jl', 'wb')

         def process_item(self, item, spider):
             line = json.dumps(dict(item)) + "\n"
             self.file.write(line)
             return item
  33. Writing a Crawler — Item Pipeline
     Activating an Item Pipeline component: to activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

     ITEM_PIPELINES = {
         'myproject.pipeline.PricePipeline': 300,
         'myproject.pipeline.JsonWriterPipeline': 800,
     }
  34. Writing a Crawler — Downloader Middlewares
     The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses. Built-in examples include:
     - CookiesMiddleware
     - DefaultHeadersMiddleware
     - DownloadTimeoutMiddleware
     - HttpAuthMiddleware
     - UserAgentMiddleware
     - AjaxCrawlMiddleware
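     A minimal sketch of a custom downloader middleware for the user-agent spoofing case; the class name, module path, and header strings are illustrative and not from the slides:

     import random

     class RandomUserAgentMiddleware(object):
         # Illustrative pool of user-agent strings to rotate through.
         user_agents = [
             'Mozilla/5.0 (X11; Linux x86_64)',
             'Mozilla/5.0 (Windows NT 6.1; WOW64)',
         ]

         def process_request(self, request, spider):
             # Called for every request before it reaches the downloader.
             request.headers['User-Agent'] = random.choice(self.user_agents)
             return None  # let Scrapy continue processing the request

     # Enabled in settings.py (illustrative module path):
     # DOWNLOADER_MIDDLEWARES = {
     #     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     # }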
  35. Writing a Crawler — Spider Middleware
     The spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can plug in custom functionality to process the responses that are sent to spiders for processing, and to process the requests and items that are generated from spiders. Built-in examples include:
     - DepthMiddleware
     - HttpErrorMiddleware
     - UrlLengthMiddleware
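     For comparison, a minimal sketch of a custom spider middleware that filters out items missing a title before they reach the item pipelines; the class name, module path, and the filtering rule are illustrative:

     from scrapy.item import Item

     class RequireTitleMiddleware(object):
         def process_spider_output(self, response, result, spider):
             # Called with everything (items and requests) the spider yields for a
             # given response; anything not re-yielded here is dropped.
             for element in result:
                 if isinstance(element, Item) and not element.get('title'):
                     continue  # silently drop items without a title
                 yield element

     # Enabled in settings.py (illustrative module path):
     # SPIDER_MIDDLEWARES = {
     #     'myproject.middlewares.RequireTitleMiddleware': 500,
     # }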
  36. Writing a Crawler — Command Line Tools
     Command line tools for Scrapy:
     - create a project:     scrapy startproject myproject
     - create a new spider:  scrapy genspider mydomain mydomain.com
     - start crawling:       scrapy crawl myproject
     - pause and continue:   scrapy crawl somespider -s JOBDIR=crawls/somespider-1
  37. Summary
     1 Introduction: Definition, Goals, Challenges
     2 Scrapy at a Glance
     3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools