Europython - Dive into Scrapy

July 21, 2015


  1. DIVE INTO SCRAPY Juan Riaza @juanriaza

  2. Who am I? Juan Riaza @juanriaza
     • Developer Evangelist @ Scrapinghub
     • Pythonista, Djangonaut from 0.96
     • I like to reverse engineer stuff
  3. Why Web Scraping • APIs • Semantic web

  4. What is Web Scraping The main goal in scraping is

    to extract structured data from unstructured sources, typically web pages.
  5. What for • Monitor prices • Lead generation • Aggregate

    information • Your imagination is the limit
  6. Do you speak HTTP?
     • Methods: GET, POST, PUT, HEAD…
     • Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999
     • Headers: Accept-Language, UA*…
     • Query string
     • Persistence: cookies
  7. Let’s perform a request • Standard Library: urllib2 • HTTP for humans: requests

  8. Show me the code!
     import requests

     req = requests.get('http://scrapinghub.com/about/')

  9. HTML is not a regular language

  10. HTML Parsers
      • lxml: Pythonic binding for the C libraries libxml2 and libxslt
      • beautifulsoup: html.parser, lxml, html5lib
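For the stdlib option named above, here is a minimal sketch using Python 3's built-in html.parser module (the LinkExtractor class and the sample HTML are illustrative, not from the talk):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="/about/">About</a></body></html>')
print(parser.links)  # ['/about/']
```

The stdlib parser is event-driven and lower-level than lxml or beautifulsoup, but it needs no third-party dependencies.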
  11. Show me the code!
      import requests
      import lxml.html

      req = requests.get('https://ep2015.europython.eu/schedule/list/')
      tree = lxml.html.fromstring(req.text)
      for tr in tree.xpath('//td[@data-talk]/div[@class="name"]/a'):
          name = tr.xpath('text()')
          url = tr.xpath('@href')
  12. “Those who don't understand XPath are cursed to reinvent it, poorly.”

  13. Scrapy-ify early on

  14. “An open source and collaborative framework for extracting the data

    you need from websites. In a fast, simple, yet extensible way.”
  15. $ scrapy shell <url> An interactive shell console Invaluable tool

    for developing and debugging your spiders
  16. An interactive shell console
      >>> response.url
      'http://example.com'
      >>> response.xpath('//h1/text()')
      [<Selector xpath='//h1/text()' data=u'Example Domain'>]
      >>> view(response)  # open in browser
      >>> fetch('http://www.google.com')  # fetch other URL
  17. Starting a project
      $ scrapy startproject <name>
      europython
      ├── europython
      │   ├── __init__.py
      │   ├── items.py
      │   ├── pipelines.py
      │   ├── settings.py
      │   └── spiders
      │       └── __init__.py
      └── scrapy.cfg
  18. What is a spider

  19. What is a Spider?
      import scrapy

      class MySpider(scrapy.Spider):
          name = 'example.com'
          allowed_domains = ['example.com']
          start_urls = [
              'http://www.example.com/1.html',
          ]

          def parse(self, response):
              msg = 'A response from %s just arrived!' % response.url
              self.logger.info(msg)
  20. What is a Spider?
      import scrapy
      from europython.items import MyItem

      class MySpider(scrapy.Spider):
          name = 'example.com'
          allowed_domains = ['example.com']

          def start_requests(self):
              yield scrapy.Request('http://www.example.com/1.html', self.parse)
              yield scrapy.Request('http://www.example.com/2.html', self.parse)
              yield scrapy.Request('http://www.example.com/3.html', self.parse)

          def parse(self, response):
              for h3 in response.xpath('//h3').extract():
                  yield MyItem(title=h3)
              for url in response.xpath('//a/@href').extract():
                  yield scrapy.Request(url, callback=self.parse)
  21. What is a Spider? 1.0
      import scrapy

      class MySpider(scrapy.Spider):
          name = 'example.com'
          allowed_domains = ['example.com']
          start_urls = ['http://www.example.com/1.html']

          def parse(self, response):
              for h3 in response.xpath('//h3/text()').extract():
                  yield {'title': h3}
              for url in response.xpath('//a/@href').extract():
                  yield scrapy.Request(url, callback=self.parse)
  22. Items
      import scrapy

      class Product(scrapy.Item):
          name = scrapy.Field()
          price = scrapy.Field()
          stock = scrapy.Field()
          last_updated = scrapy.Field(serializer=str)
  23. Item Loaders

  24. Item exporters
      • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends
      • Backends: FTP, S3, local filesystem
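In Scrapy 1.0, these feed exports are usually driven from the command line or from settings.py; a sketch (the spider name, output path, and bucket are illustrative):

```shell
# Export scraped items as JSON; the format is inferred from the file extension
scrapy crawl example.com -o items.json

# Or configure a feed in settings.py (the S3 URI shown is illustrative):
#   FEED_URI = 's3://mybucket/items.csv'
#   FEED_FORMAT = 'csv'
```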
  25. DjangoItem
      from django.db import models

      class Person(models.Model):
          name = models.CharField(max_length=255)
          age = models.IntegerField()

      from scrapy.contrib.djangoitem import DjangoItem

      class PersonItem(DjangoItem):
          django_model = Person
  26. Under the hood

  27. Architecture

  28. Item pipelines • Default field value • Validating scraped data

    • Checking for duplicates • Storing items • 3rd party integrations
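A duplicates-checking pipeline along the lines of the bullets above might look like this sketch (the `id` field and class name are illustrative; the try/except fallback exists only so the snippet runs without Scrapy installed):

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch runs standalone; in a real project Scrapy provides DropItem
    class DropItem(Exception):
        pass

class DuplicatesPipeline:
    """Drop any item whose 'id' field has already been seen this crawl."""
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item['id'] in self.seen_ids:
            raise DropItem('Duplicate item: %r' % item['id'])
        self.seen_ids.add(item['id'])
        return item
```

In a real project the pipeline is enabled via the ITEM_PIPELINES setting, and Scrapy calls process_item for every item the spiders yield.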
  29. Middlewares • Session handling • Retrying bad requests • Modifying

    requests • Randomize User Agent
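The "Randomize User Agent" point above can be sketched as a downloader middleware; this is a minimal illustration (the class name and the user-agent pool are made up for the example):

```python
import random

# Hypothetical pool; in a real project this would typically live in settings.py
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/600.1.4',
]

class RandomUserAgentMiddleware:
    """Downloader middleware: set a random User-Agent header on each request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request
```

Enabled via the DOWNLOADER_MIDDLEWARES setting, process_request runs for every outgoing request before it reaches the downloader.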
  30. Batteries included • Logging • Stats collection • Testing: contracts

    • Telnet console: inspect a Scrapy process
  31. scrapinghub/pycon-speakers Project Example

  32. Avoid getting banned • Rotate your User Agent • Disable

    cookies • Randomized download delays • Use a pool of rotating IPs • Crawlera
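Several of these knobs map directly to Scrapy settings; a settings.py sketch (the values are illustrative):

```python
# settings.py: politeness settings for the points above (values are illustrative)
COOKIES_ENABLED = False          # disable cookies
DOWNLOAD_DELAY = 2               # base delay (seconds) between requests to a site
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x of DOWNLOAD_DELAY
```

User-Agent rotation and rotating IP pools are handled by custom middlewares (or by Crawlera) rather than by built-in settings.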
  33. Broad Crawls: Frontera

  34. Deployment 1.0
      scrapyd: a service daemon to run Scrapy spiders
      $ scrapyd-deploy
  35. Scrapy Cloud $ shub deploy

  36. About us • TONS of Open Source • Fully remote distributed team

  37. Mandatory Sales Slide
      Products • Scrapy Cloud •
      Professional Services

  38. We’re hiring!

  39. Gracias. Juan Riaza @juanriaza