
Writing a Crawler

How to write a crawler using Scrapy.

Mingshen Sun

October 19, 2015

Transcript

  1. Writing a Crawler — Outline
     1 Introduction: Definition, Goals, Challenges
     2 Scrapy at a Glance
     3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools
  2. Writing a Crawler — Definition
     A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.
  3. Writing a Crawler — Goals
     Web scraping: extracting information from websites, for example
     - downloading resources (e.g., Android apps)
     - extracting online social network (OSN) information (e.g., Alice follows Bob)
     - harvesting scores or comments for movies or products (e.g., The Godfather scores 9.2 on IMDB)
  4. Writing a Crawler — Challenges
     - pagination
     - crawler/bot detection
     - cookie, referer, user agent
     - JavaScript
     - …
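     Several of these challenges can be handled per request in Scrapy. A minimal sketch, assuming illustrative URLs, header values, and selector patterns that are not from the slides, of setting the cookie, referer, and user agent on a request and following pagination links:

     from scrapy import Spider
     from scrapy.http import Request

     class StealthSpider(Spider):
         name = 'stealthspider'

         def start_requests(self):
             # Send an explicit user agent, referer, and cookie so the request
             # looks more like an ordinary browser session.
             yield Request(
                 'http://example.com/list?page=1',
                 headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
                          'Referer': 'http://example.com/'},
                 cookies={'sessionid': 'placeholder'},
                 callback=self.parse)

         def parse(self, response):
             # Pagination: follow the "next page" link, if any, with the same callback.
             for href in response.xpath("//a[@rel='next']/@href").extract():
                 yield Request(response.urljoin(href), callback=self.parse)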
  5. Writing a Crawler — Introducing Scrapy
     An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
     - Fast and powerful: write the rules to extract the data and let Scrapy do the rest
     - Easily extensible: extensible by design, plug in new functionality easily without having to touch the core
     - Portable, Python: written in Python and runs on Linux, Windows, Mac and BSD
  6. Writing a Crawler — Learning Python
     A quick & dirty introduction to the Python programming language:

     from scrapy import Spider, Item, Field

     class Post(Item):
         title = Field()

     class BlogSpider(Spider):
         name, start_urls = 'blogspider', ['http://blog.scrapinghub.com']

         def parse(self, response):
             res = []
             for e in response.css("h2 a::text"):
                 res.append(Post(title=e.extract()))
             return res
  7. Writing a Crawler — Scrapy at a Glance
     $ scrapy runspider myspider.py
     ...
     2015-05-28 21:59:41+0800 [blogspider] DEBUG: Crawled (200) <GET http://blog.scrapinghub.com> (refer
     2015-05-28 21:59:41+0800 [blogspider] DEBUG: Scraped from <200 http://blog.scrapinghub.com>
         {'title': u'Gender Inequality Across Programming\xa0Languages'}
     2015-05-28 21:59:42+0800 [blogspider] INFO: Dumping Scrapy stats:
         {'downloader/request_bytes': 219,
          'downloader/request_count': 1,
          'downloader/request_method_count/GET': 1,
          'downloader/response_bytes': 91182,
          'downloader/response_count': 1,
          'downloader/response_status_count/200': 1,
          'finish_reason': 'finished',
          ...
  8. Writing a Crawler — Scrapy at a Glance
     1 pick a website
     2 define the data you want to scrape
     3 write a spider to extract the data
     4 run the spider to extract the data
     5 review the scraped data
  9. Writing a Crawler — Scrapy at a Glance
     Pick a website: IMDB Top 250
  10. Writing a Crawler — Scrapy at a Glance
     Define the data you want to scrape: rank, movie title, publish year, IMDB rating.

     from scrapy.item import Item, Field

     class MovieItem(Item):
         rank = Field()
         title = Field()
         year = Field()
         rating = Field()
  11. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data:
     1 read the HTML code
     2 find common patterns
     3 determine the XPath
     4 write a spider
  12. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 1. read the HTML code
  13. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 2. find common patterns
  14. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 3. determine the XPath
     HTML preliminaries:

     <div id="logo"></div>
     <div class="rating"></div>
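     For the two snippets above, the matching XPath expressions select elements by attribute value:

     //div[@id='logo']         matches the <div> whose id is "logo"
     //div[@class='rating']    matches the <div> whose class attribute is exactly "rating"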
  15. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 3. determine the XPath

     rank:  //td[@class='titleColumn']/span[@name='ir']/text()
     title: //td[@class='titleColumn']/a/text()
  16. Writing a Crawler — Scrapy at a Glance
     Write a spider to extract the data: 4. write a spider

     from scrapy import Item, Field, Spider
     from scrapy.selector import Selector

     class MovieItem(Item):
         rank = Field()
         title = Field()
         year = Field()
         rating = Field()

     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             rank_list = sel.xpath("//td[@class='titleColumn']/span[@name='ir']/text()").extract()
             title_list = sel.xpath("//td[@class='titleColumn']/a/text()").extract()
             year_list = sel.xpath("//td[@class='titleColumn']/span[@name='rd']/text()").extract()
             rating_list = sel.xpath("//td[@class='ratingColumn imdbRating']/strong/text()").extract()
             movie_items = []
             for i in range(len(rank_list)):
                 movie_item = MovieItem()
                 movie_item['rank'] = rank_list[i][:-1]    # strip the trailing character of the rank text (e.g. "1." -> "1")
                 movie_item['title'] = title_list[i]
                 movie_item['year'] = year_list[i][1:-1]   # strip the surrounding parentheses (e.g. "(1994)" -> "1994")
                 movie_item['rating'] = rating_list[i]
                 movie_items.append(movie_item)
             return movie_items
  17. Writing a Crawler — Scrapy at a Glance
     Run the spider to extract the data:

     $ scrapy runspider imdb_spider.py -o imdb.json -t json
  18. Writing a Crawler — Scrapy at a Glance
     Review the scraped data.
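     The original slide shows a screenshot of the resulting imdb.json. An illustrative fragment (values representative, not copied from the slide) of what the reviewed output looks like:

     [
       {"rank": "1", "title": "The Shawshank Redemption", "year": "1994", "rating": "9.2"},
       {"rank": "2", "title": "The Godfather", "year": "1972", "rating": "9.2"},
       ...
     ]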
  19. Writing a Crawler — Scrapy at a Glance
     What else?
     - built-in support for selecting and extracting data from HTML and XML sources
     - JSON, CSV, XML export and storage
     - middleware: user-agent spoofing, …
     - interactive shell console
     - support for creating spiders based on pre-defined templates
     - service
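     The interactive shell console mentioned above is handy for trying selectors against a live page before putting them in a spider. A rough sketch of a session (output abbreviated and illustrative):

     $ scrapy shell 'http://www.imdb.com/chart/top'
     ...
     >>> response.xpath("//td[@class='titleColumn']/a/text()").extract()[:3]
     [u'The Shawshank Redemption', u'The Godfather', u'The Godfather: Part II']
     >>> exit()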
  20. Writing a Crawler — Installation
     Installing Scrapy requires Python 2.7, pip, lxml, and OpenSSL:

     $ pip install Scrapy

     Platform-specific installation notes:
     - Windows: forget it
     - Ubuntu: don't use the python-scrapy package from the official apt repositories; use the Scrapy repository instead:

       sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
       echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
       sudo apt-get update && sudo apt-get install scrapy-0.24

     - Arch Linux: sudo pacman -S scrapy
  21. Writing a Crawler — Creating a Project
     Creating a project: scrapy startproject tutorial

     tutorial/
         scrapy.cfg          configuration file
         tutorial/           the project's Python module; you'll later import your code from here
             __init__.py
             items.py        items file
             pipelines.py    pipelines file
             settings.py     settings file
             spiders/        a directory where you'll later put your spiders
                 __init__.py
                 ...
  22. Writing a Crawler — Item
     Items are containers: they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.

     from scrapy.item import Item, Field

     class MovieItem(Item):
         id = Field()
         title = Field()
         year = Field()
         rating = Field()
         # ...

     class ReviewItem(Item):
         id = Field()
         star_rating = Field()
         time = Field()
         country = Field()
         review = Field()
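     Concretely, the protection against undeclared fields means that setting a key that was never declared as a Field raises a KeyError instead of silently creating it; a quick illustration in a Python shell:

     >>> item = MovieItem(title='The Godfather', year='1972')
     >>> item['rating'] = '9.2'    # declared field: fine
     >>> item['ratng'] = '9.2'     # typo: this field was never declared
     Traceback (most recent call last):
       ...
     KeyError: 'MovieItem does not support field: ratng'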
  23. Writing a Crawler — Spider
     Spiders are user-written classes used to scrape information from a domain (or group of domains). A spider defines:
     - an initial list of URLs to download
     - how to follow links
     - how to parse the contents of those pages to extract items
  24. Writing a Crawler — Spider
     To create a Spider, you must subclass scrapy.spider.Spider and define the three main, mandatory attributes:
     - name
     - start_urls
     - parse(): a method of the spider that will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument, and the method is responsible for parsing the response data and extracting scraped data and more URLs to follow.
  25. Writing a Crawler — Spider
     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             return movie_item
  26. Writing a Crawler — Spider
     from scrapy.http import Request

     class IMDBSpider(Spider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

         def parse(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             yield movie_item
             # follow every link on the page and parse it with the same callback
             for url in sel.xpath('//a/@href').extract():
                 yield Request(url, callback=self.parse)
  27. Writing a Crawler — CrawlSpider
     class scrapy.contrib.spiders.CrawlSpider
     - the most commonly used spider for crawling regular websites
     - it provides a convenient mechanism for following links by defining a set of rules
     - start from it and override it as needed for more custom functionality, or just implement your own spider
  28. Writing a Crawler — CrawlSpider
     from scrapy.contrib.spiders import CrawlSpider, Rule
     from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

     class IMDBSpider(CrawlSpider):
         name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']
         rules = (
             # follow links matching category.php (but not subsection.php); no callback, just follow
             Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
             # parse matching pages with the given callbacks
             Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_review'),
             Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_movie'),
         )

         def parse_movie(self, response):
             sel = Selector(response)
             movie_item = MovieItem()
             movie_item['title'] = sel.xpath("...").extract()
             movie_item['year'] = sel.xpath("...").extract()
             movie_item['rating'] = sel.xpath("...").extract()
             yield movie_item
  29. Writing a Crawler — Item Pipeline
     After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class that implements a simple method. It:
     - receives an Item and performs an action over it
     - decides whether the Item should continue through the pipeline or be dropped and no longer processed
  30. Writing a Crawler — Item Pipeline
     Typical uses for item pipelines are:
     - cleansing HTML data
     - validating scraped data (checking that the items contain certain fields)
     - checking for duplicates (and dropping them)
     - storing the scraped item in a database
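     A minimal sketch of the "checking for duplicates" case, in the spirit of the example from the Scrapy documentation; it assumes the id field (declared in the MovieItem/ReviewItem classes earlier) uniquely identifies an item:

     from scrapy.exceptions import DropItem

     class DuplicatesPipeline(object):
         def __init__(self):
             self.ids_seen = set()

         def process_item(self, item, spider):
             # Drop any item whose id was already scraped during this crawl.
             if item['id'] in self.ids_seen:
                 raise DropItem("Duplicate item found: %s" % item)
             self.ids_seen.add(item['id'])
             return item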
  31. Writing a Crawler — Item Pipeline
     from scrapy.exceptions import DropItem

     class PricePipeline(object):
         vat_factor = 1.15

         def process_item(self, item, spider):
             if item['price']:
                 if item['price_excludes_vat']:
                     item['price'] = item['price'] * self.vat_factor
                 return item
             else:
                 raise DropItem("Missing price in %s" % item)
  32. Writing a Crawler — Item Pipeline
     import json

     class JsonWriterPipeline(object):
         def __init__(self):
             self.file = open('items.jl', 'wb')

         def process_item(self, item, spider):
             line = json.dumps(dict(item)) + "\n"
             self.file.write(line)
             return item
  33. Writing a Crawler — Item Pipeline
     Activating an Item Pipeline component: to activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

     ITEM_PIPELINES = {
         'myproject.pipeline.PricePipeline': 300,
         'myproject.pipeline.JsonWriterPipeline': 800,
     }
  34. Writing a Crawler — Downloader Middlewares
     The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses. Built-in examples include:
     - CookiesMiddleware
     - DefaultHeadersMiddleware
     - DownloadTimeoutMiddleware
     - HttpAuthMiddleware
     - UserAgentMiddleware
     - AjaxCrawlMiddleware
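     A minimal sketch of a custom downloader middleware for the user-agent spoofing case; the class name, module path, and header strings are illustrative and not from the slides:

     import random

     class RandomUserAgentMiddleware(object):
         # Illustrative pool of user-agent strings to rotate through.
         user_agents = [
             'Mozilla/5.0 (X11; Linux x86_64)',
             'Mozilla/5.0 (Windows NT 6.1; WOW64)',
         ]

         def process_request(self, request, spider):
             # Called for every request before it reaches the downloader.
             request.headers['User-Agent'] = random.choice(self.user_agents)
             return None  # let Scrapy continue processing the request

     # Enabled in settings.py (illustrative module path):
     # DOWNLOADER_MIDDLEWARES = {
     #     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     # }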
  35. Writing a Crawler — Spider Middleware
     The spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can plug in custom functionality to process the responses that are sent to spiders for processing, and to process the requests and items that are generated from spiders. Built-in examples include:
     - DepthMiddleware
     - HttpErrorMiddleware
     - UrlLengthMiddleware
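     For comparison, a minimal sketch of a custom spider middleware that filters out items missing a title before they reach the item pipelines; the class name, module path, and the filtering rule are illustrative:

     from scrapy.item import Item

     class RequireTitleMiddleware(object):
         def process_spider_output(self, response, result, spider):
             # Called with everything (items and requests) the spider yields for a
             # given response; anything not re-yielded here is dropped.
             for element in result:
                 if isinstance(element, Item) and not element.get('title'):
                     continue  # silently drop items without a title
                 yield element

     # Enabled in settings.py (illustrative module path):
     # SPIDER_MIDDLEWARES = {
     #     'myproject.middlewares.RequireTitleMiddleware': 500,
     # }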
  36. Writing a Crawler — Command Line Tools
     Command line tools for Scrapy:
     - create a project:     scrapy startproject myproject
     - create a new spider:  scrapy genspider mydomain mydomain.com
     - start crawling:       scrapy crawl myproject
     - pause and continue:   scrapy crawl somespider -s JOBDIR=crawls/somespider-1
  37. Summary
     1 Introduction: Definition, Goals, Challenges
     2 Scrapy at a Glance
     3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools