Slide 1

Slide 1 text

Writing a Crawler
Bob Mingshen Sun
ANSR Lab Group Study
June 1, 2015

Slide 2

Slide 2 text

Writing a Crawler — Outline
1 Introduction: Definition, Goals, Challenges
2 Scrapy at a Glance
3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools

Slide 3

Slide 3 text

Writing a Crawler — Definition
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Writing a Crawler — Goals
Web scraping: extracting information from websites, e.g.,
- downloading resources (e.g., Android apps)
- extracting online social network (OSN) information (e.g., Alice follows Bob)
- harvesting scores or comments for movies or products (e.g., The Godfather scores 9.2 on IMDB)

Slide 9

Slide 9 text

Writing a Crawler — Challenges: Pagination

Slide 10

Slide 10 text

Writing a Crawler — Challenges: Crawler Detection

Slide 11

Slide 11 text

Writing a Crawler — Challenges: Cookies

Slide 12

Slide 12 text

Writing a Crawler — Challenges
- pagination
- crawler/bot detection
- cookies, referer, user agent
- JavaScript
- …
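Several of these challenges are handled at the request level. A minimal, hedged sketch (the URLs, header values, and cookie name are placeholders, not from the slides) of attaching a custom user agent, referer, and cookie to a Scrapy request:

    from scrapy import Spider
    from scrapy.http import Request

    class PoliteSpider(Spider):
        name = 'politespider'                        # hypothetical spider
        start_urls = ['http://example.com/page/1']

        def parse(self, response):
            # request the next page with spoofed headers and a session cookie
            yield Request(
                'http://example.com/page/2',
                headers={
                    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',  # look like a browser
                    'Referer': response.url,                          # current page as referer
                },
                cookies={'sessionid': 'PLACEHOLDER'},                 # assumed cookie name
                callback=self.parse,
            )

JavaScript-heavy pages are a separate problem: the rendered content is not in the downloaded HTML, so they typically need a headless browser or the site's underlying API.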

Slide 13

Slide 13 text

Writing a Crawler — Introducing Scrapy

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Writing a Crawler — Introducing Scrapy
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
- Fast and powerful: write the rules to extract the data and let Scrapy do the rest
- Easily extensible: extensible by design, plug new functionality in easily without having to touch the core
- Portable, Python: written in Python and runs on Linux, Windows, Mac and BSD

Slide 16

Slide 16 text

Writing a Crawler — Introducing Scrapy
EXCITED!!! Python???

Slide 17

Slide 17 text

Writing a Crawler — Learning Python
A quick & dirty introduction to the Python programming language.

    from scrapy import Spider, Item, Field

    class Post(Item):
        title = Field()

    class BlogSpider(Spider):
        name, start_urls = 'blogspider', ['http://blog.scrapinghub.com']

        def parse(self, response):
            res = []
            for e in response.css("h2 a::text"):
                res.append(Post(title=e.extract()))
            return res

Slide 18

Slide 18 text

Writing a Crawler — Scrapy at a Glance

    $ scrapy runspider myspider.py
    ...
    2015-05-28 21:59:41+0800 [blogspider] DEBUG: Crawled (200) (refer
    2015-05-28 21:59:41+0800 [blogspider] DEBUG: Scraped from <200 http://blog.scrapinghub.com>
        {'title': u'Gender Inequality Across Programming\xa0Languages'}
    2015-05-28 21:59:42+0800 [blogspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 219,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 91182,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         ...

Slide 19

Slide 19 text

Writing a Crawler — Scrapy at a Glance
1 pick a website
2 define the data you want to scrape
3 write a spider to extract the data
4 run the spider to extract the data
5 review scraped data

Slide 20

Slide 20 text

Writing a Crawler — Scrapy at a Glance
pick a website: IMDB Top 250

Slide 21

Slide 21 text

Writing a Crawler — Scrapy at a Glance
define the data you want to scrape: rank, movie title, publish year, IMDB rating

    from scrapy.item import Item, Field

    class MovieItem(Item):
        rank = Field()
        title = Field()
        year = Field()
        rating = Field()
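A MovieItem is then instantiated and populated like a dict. A minimal sketch (the values are illustrative, not scraped data):

    item = MovieItem(rank='1', title='The Shawshank Redemption', year='1994', rating='9.2')
    print(item['title'])    # fields are read back with dict-style indexing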

Slide 22

Slide 22 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data:
1 read HTML code
2 find common patterns
3 determine XPath
4 write a spider

Slide 23

Slide 23 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data: 1. read HTML code

Slide 24

Slide 24 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data: 2. find common patterns

Slide 25

Slide 25 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data: 3. determine XPath
HTML preliminaries

Slide 26

Slide 26 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data: 3. determine XPath

    rank:  //td[@class='titleColumn']/span[@name='ir']/text()
    title: //td[@class='titleColumn']/a/text()
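XPath expressions like these can be tested interactively before they go into a spider. A hedged sketch of such a session, assuming the scrapy shell command and the sel object it exposes:

    $ scrapy shell 'http://www.imdb.com/chart/top'
    >>> sel.xpath("//td[@class='titleColumn']/a/text()").extract()
    # returns the list of title strings if the expression matches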

Slide 27

Slide 27 text

Writing a Crawler — Scrapy at a Glance
write a spider to extract the data: 4. write a spider

    from scrapy import Item, Field, Spider
    from scrapy.selector import Selector

    class MovieItem(Item):
        rank = Field()
        title = Field()
        year = Field()
        rating = Field()

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            rank_list = sel.xpath("//td[@class='titleColumn']/span[@name='ir']/text()").extract()
            title_list = sel.xpath("//td[@class='titleColumn']/a/text()").extract()
            year_list = sel.xpath("//td[@class='titleColumn']/span[@name='rd']/text()").extract()
            rating_list = sel.xpath("//td[@class='ratingColumn imdbRating']/strong/text()").extract()
            movie_items = []
            for i in range(len(rank_list)):
                movie_item = MovieItem()
                movie_item['rank'] = rank_list[i][:-1]    # drop the last character of the rank text (e.g. a trailing '.')
                movie_item['title'] = title_list[i]
                movie_item['year'] = year_list[i][1:-1]   # strip the surrounding parentheses from the year
                movie_item['rating'] = rating_list[i]
                movie_items.append(movie_item)
            return movie_items

Slide 28

Slide 28 text

Writing a Crawler — Scrapy at a Glance
run the spider to extract the data

    $ scrapy runspider imdb_spider.py -o imdb.json -t json

Slide 29

Slide 29 text

Writing a Crawler — Scrapy at a Glance
review scraped data

Slide 30

Slide 30 text

Writing a Crawler — Scrapy at a Glance
What else?
- Built-in support for selecting and extracting data from HTML and XML sources
- Export to JSON, CSV, XML, and storage backends
- Middleware: e.g., user-agent spoofing
- Interactive shell console
- Support for creating spiders based on pre-defined templates
- Service
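A hedged sketch of two of these features from the command line (the spider and file names are placeholders): genspider creates a spider from a pre-defined template, and -o/-t write a feed export.

    $ scrapy genspider -l                       # list the pre-defined spider templates
    $ scrapy genspider -t crawl imdb imdb.com   # generate a spider from the 'crawl' template
    $ scrapy crawl imdb -o movies.csv -t csv    # export scraped items to CSV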

Slide 31

Slide 31 text

Wanna crawl millions of webpages?

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Stay tuned. Let’s do it from scratch.

Slide 35

Slide 35 text

Writing a Crawler — Installation
Installing Scrapy: Python 2.7, pip, lxml, OpenSSL

    $ pip install Scrapy

Platform specific installation notes
- Windows (forget it)
- Ubuntu: don't use the python-scrapy package from the official apt repositories
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
      echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
      sudo apt-get update && sudo apt-get install scrapy-0.24
- Arch Linux: sudo pacman -S scrapy

Slide 36

Slide 36 text

Writing a Crawler — Creating a Project
Creating a project: scrapy startproject tutorial
- scrapy.cfg: configuration file
- tutorial/: python module, you'll later import your code from here
- tutorial/items.py: items file
- tutorial/pipelines.py: pipelines file
- tutorial/settings.py: settings file
- tutorial/spiders/: a directory where you'll later put your spiders

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                ...

Slide 37

Slide 37 text

Writing a Crawler — Item
Items are containers: they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.

    from scrapy.item import Item, Field

    class MovieItem(Item):
        id = Field()
        title = Field()
        year = Field()
        rating = Field()
        # ...

    class ReviewItem(Item):
        id = Field()
        star_rating = Field()
        time = Field()
        country = Field()
        review = Field()
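A minimal sketch of that protection, assuming the MovieItem above: assigning to a field that was never declared raises an error instead of silently creating a new key.

    movie = MovieItem()
    movie['title'] = 'The Godfather'    # declared field: fine
    movie['titel'] = 'The Godfather'    # typo in the field name: raises KeyError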

Slide 38

Slide 38 text

Writing a Crawler — Spider
Spiders are user-written classes used to scrape information from a domain (or group of domains). A spider defines:
- an initial list of URLs to download
- how to follow links
- how to parse the contents of those pages to extract items

Slide 39

Slide 39 text

Writing a Crawler — Spider
To create a spider, you must subclass scrapy.spider.Spider and define the three main, mandatory attributes:
- name
- start_urls
- parse(): a method of the spider
  - will be called with the downloaded Response object of each start URL
  - the response is passed to the method as the first and only argument
  - responsible for parsing the response data and extracting scraped data and more URLs to follow

Slide 40

Slide 40 text

Writing a Crawler — Spider

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            return movie_item

Slide 41

Slide 41 text

We want to crawl all movies on IMDB!

Slide 42

Slide 42 text

Writing a Crawler — Spider

    from scrapy.http import Request

    class IMDBSpider(Spider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']

        def parse(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            yield movie_item
            # follow every link on the page and parse it with the same callback
            # (hrefs scraped this way are often relative and would need to be joined with response.url)
            for url in sel.xpath('//a/@href').extract():
                yield Request(url, callback=self.parse)
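As written, the loop above yields a Request for every link on the page, including links that leave IMDB. A hedged sketch of one common guard (an assumption, not from the slides): declaring allowed_domains, which lets the default OffsiteMiddleware drop requests to other domains.

    class IMDBSpider(Spider):
        name = 'imdbspider'
        allowed_domains = ['imdb.com']                  # requests to other domains are filtered out
        start_urls = ['http://www.imdb.com/chart/top']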

Slide 43

Slide 43 text

Writing a Crawler — CrawlSpider
class scrapy.contrib.spiders.CrawlSpider
- the most commonly used spider for crawling regular websites
- provides a convenient mechanism for following links by defining a set of rules
- start from it and override it as needed for more custom functionality, or just implement your own spider

Slide 44

Slide 44 text

Writing a Crawler — CrawlSpider

    class IMDBSpider(CrawlSpider):
        name, start_urls = 'imdbspider', ['http://www.imdb.com/chart/top']
        rules = (
            Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
            Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_review'),
            Rule(SgmlLinkExtractor(allow=('regular expression', )), callback='parse_movie'),
        )

        def parse_movie(self, response):
            sel = Selector(response)
            movie_item = MovieItem()
            movie_item['title'] = sel.xpath("...").extract()
            movie_item['year'] = sel.xpath("...").extract()
            movie_item['rating'] = sel.xpath("...").extract()
            yield movie_item

Slide 45

Slide 45 text

Writing a Crawler — Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class that implements a simple method (process_item). It:
- receives an Item and performs an action over it
- decides whether the Item should continue through the pipeline or be dropped and no longer processed

Slide 46

Slide 46 text

Writing a Crawler — Item Pipeline
Typical uses for item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
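A hedged sketch of the duplicate-checking case (not from the slides; it assumes the id field uniquely identifies a movie):

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):
        def __init__(self):
            self.seen_ids = set()

        def process_item(self, item, spider):
            # drop any item whose id was already seen in this crawl
            if item['id'] in self.seen_ids:
                raise DropItem("Duplicate item found: %s" % item)
            self.seen_ids.add(item['id'])
            return item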

Slide 47

Slide 47 text

Writing a Crawler — Item Pipeline

    from scrapy.exceptions import DropItem

    class PricePipeline(object):
        vat_factor = 1.15

        def process_item(self, item, spider):
            if item['price']:
                if item['price_excludes_vat']:
                    item['price'] = item['price'] * self.vat_factor
                return item
            else:
                raise DropItem("Missing price in %s" % item)

Slide 48

Slide 48 text

Writing a Crawler — Item Pipeline

    import json

    class JsonWriterPipeline(object):
        def __init__(self):
            self.file = open('items.jl', 'wb')

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item

Slide 49

Slide 49 text

Writing a Crawler — Item Pipeline
Activating an Item Pipeline component
To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, as in the following example; the integer values determine the order in which the pipelines run (items go through lower-valued pipelines first):

    ITEM_PIPELINES = {
        'myproject.pipeline.PricePipeline': 300,
        'myproject.pipeline.JsonWriterPipeline': 800,
    }

Slide 50

Slide 50 text

Writing a Crawler — Scrapy Architecture

Slide 51

Slide 51 text

Writing a Crawler — Downloader Middlewares
The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses. Built-in examples include:
- CookiesMiddleware
- DefaultHeadersMiddleware
- DownloadTimeoutMiddleware
- HttpAuthMiddleware
- UserAgentMiddleware
- AjaxCrawlMiddleware
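A hedged sketch of a custom downloader middleware (not from the slides; the class name and user-agent strings are made up) that rotates the User-Agent header, one way to soften the crawler-detection problem from earlier:

    import random

    class RandomUserAgentMiddleware(object):
        # a small illustrative pool of user-agent strings
        USER_AGENTS = [
            'Mozilla/5.0 (X11; Linux x86_64)',
            'Mozilla/5.0 (Windows NT 6.1; WOW64)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10)',
        ]

        def process_request(self, request, spider):
            # pick a user agent per request; returning None lets processing continue normally
            request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
            return None

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, analogous to ITEM_PIPELINES above.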

Slide 52

Slide 52 text

Writing a Crawler — Spider Middleware
The spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can plug in custom functionality to process the responses that are sent to spiders, and to process the requests and items that are generated from spiders. Built-in examples include:
- DepthMiddleware
- HttpErrorMiddleware
- UrlLengthMiddleware
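A hedged sketch of a custom spider middleware (not from the slides; the class name and the title check are assumptions) that filters a spider's output, forwarding requests untouched and dropping items without a title; it would be enabled via the SPIDER_MIDDLEWARES setting:

    from scrapy.http import Request

    class RequireTitleMiddleware(object):
        def process_spider_output(self, response, result, spider):
            # result is the iterable of Requests and Items produced by a spider callback
            for element in result:
                if isinstance(element, Request):
                    yield element               # always forward follow-up requests
                elif element.get('title'):
                    yield element               # forward items that have a non-empty title
                # items with no title are silently dropped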

Slide 53

Slide 53 text

Writing a Crawler — Command Line Tools
Command line tools for Scrapy:
- create a project: scrapy startproject myproject
- create a new spider: scrapy genspider mydomain mydomain.com
- start crawling: scrapy crawl myspider (takes the spider's name)
- pause and continue: scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Slide 54

Slide 54 text

Summary
1 Introduction: Definition, Goals, Challenges
2 Scrapy at a Glance
3 Writing a Crawler: Item, Spider, Item Pipeline, Architecture, Middleware, Command Line Tools

Slide 55

Slide 55 text

Thank you! Q&A and Demo

Slide 56

Slide 56 text

References
- http://www.improgrammer.net/linux-world-map/
- http://thehackernews.com/2013/06/most-sophisticated-android-malware-ever.html
- http://www.makelinux.net/android/internals/
- http://wiki.tei-c.org/index.php/File:0xbabaf000l.png
- http://phdcomics.com/comics/archive.php?comicid=1796
- Scrapy: http://scrapy.org/