Europython - Dive into Scrapy

juanriaza

July 21, 2015

Transcript

  1. Who am I? @juanriaza Juan Riaza
    • Developer Evangelist @ Scrapinghub
    • Pythonista, Djangonaut since 0.96
    • I like to reverse engineer stuff

  2. What is Web Scraping? The main goal of web scraping is to extract
    structured data from unstructured sources, typically web pages.

  3. What for? • Monitor prices • Lead generation • Aggregate
    information • Your imagination is the limit

  4. Do you speak HTTP?
    • Methods: GET, POST, PUT, HEAD…
    • Status Codes: 2XX, 3XX, 4XX, 418 (I'm a teapot), 5XX, 999
    • Headers: Accept-Language, User-Agent…
    • Query String
    • Persistence: Cookies

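A quick sketch of those HTTP pieces using the requests library (the
example.com URLs, header values and form data below are placeholders):

    import requests

    # custom headers: Accept-Language and a User-Agent string
    resp = requests.get(
        'http://example.com',
        headers={'Accept-Language': 'en', 'User-Agent': 'my-crawler/0.1'},
    )
    print(resp.status_code)   # e.g. 200 (a 2XX success code)

    # cookies give persistence; a Session carries them across requests
    session = requests.Session()
    session.post('http://example.com/login', data={'user': 'me'})  # POST
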
  5. HTML Parsers
    • lxml: Pythonic binding for the C libraries libxml2 and libxslt
    • beautifulsoup: wraps a parser of your choice (html.parser, lxml, html5lib)

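A minimal sketch of the BeautifulSoup side (the HTML snippet is made up;
swap 'html.parser' for 'lxml' or 'html5lib' if they are installed):

    from bs4 import BeautifulSoup

    html = '<html><body><h1>Hello EuroPython</h1></body></html>'
    soup = BeautifulSoup(html, 'html.parser')  # choose the underlying parser
    print(soup.h1.text)                        # 'Hello EuroPython'
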
  6. Show me the code!
    import requests
    import lxml.html

    # fetch the schedule page and build an element tree
    req = requests.get('https://ep2015.europython.eu/schedule/list/')
    tree = lxml.html.fromstring(req.text)

    # each talk link holds the name as text and the URL in @href
    for a in tree.xpath('//td[@data-talk]/div[@class="name"]/a'):
        name = a.xpath('text()')
        url = a.xpath('@href')

  7. “An open source and collaborative framework for extracting the data
    you need from websites. In a fast, simple, yet extensible way.”

  8. An interactive shell console
    $ scrapy shell <url>
    An invaluable tool for developing and debugging your spiders

  9. An interactive shell console
    >>> response.url
    'http://example.com'
    >>> response.xpath('//h1/text()')
    [<Selector xpath='//h1/text()' data=u'Example Domain'>]
    >>> view(response)                  # open the response in a browser
    >>> fetch('http://www.google.com')  # fetch another URL in place

  10. Starting a project
    $ scrapy startproject <name>

    europython
    ├── europython
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg

  11. What is a Spider?
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
        ]

        def parse(self, response):
            # default callback, run on each downloaded start URL
            msg = 'A response from %s just arrived!' % response.url
            self.logger.info(msg)

  12. What is a Spider?
    import scrapy
    from europython.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']

        def start_requests(self):
            # build the initial requests by hand instead of start_urls
            yield scrapy.Request('http://www.example.com/1.html', self.parse)
            yield scrapy.Request('http://www.example.com/2.html', self.parse)
            yield scrapy.Request('http://www.example.com/3.html', self.parse)

        def parse(self, response):
            # yield items and follow links from the same callback
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

  13. What is a Spider? 1.0
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/1.html']

        def parse(self, response):
            # since Scrapy 1.0, plain dicts can be yielded instead of Items
            for h3 in response.xpath('//h3/text()').extract():
                yield {'title': h3}
            for url in response.xpath('//a/@href').extract():
                yield scrapy.Request(url, callback=self.parse)

  14. Items
    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)

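As a usage sketch (the values are made up): items behave like dicts, but
reject fields that were not declared:

    >>> product = Product(name='Desktop PC', price=1000)
    >>> product['name']
    'Desktop PC'
    >>> product['stock'] = 27      # fields work like dict keys
    >>> product['lala'] = 'test'   # undeclared field
    Traceback (most recent call last):
        ...
    KeyError: 'Product does not support field: lala'
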
  15. Item exporters
    • Built-in support for generating feed exports in multiple formats
      (JSON, CSV, XML) and storing them in multiple backends (examples below)
    • Backends: FTP, S3, local filesystem

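For instance, feed exports can be driven from the command line, with the
format inferred from the output extension (the spider name and the FTP
credentials below are placeholders):

    $ scrapy crawl example.com -o items.json
    $ scrapy crawl example.com -o items.csv
    $ scrapy crawl example.com -o ftp://user:pass@ftp.example.com/items.xml
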
  16. DjangoItem
    from django.db import models

    class Person(models.Model):
        name = models.CharField(max_length=255)
        age = models.IntegerField()

    from scrapy.contrib.djangoitem import DjangoItem

    class PersonItem(DjangoItem):
        django_model = Person

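A usage sketch (values are made up): calling save() on a DjangoItem
persists a model instance through the Django ORM and returns it:

    >>> p = PersonItem(name='John', age=22)
    >>> person = p.save()
    >>> person.name
    'John'
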
  17. Item pipelines
    • Default field values
    • Validating scraped data
    • Checking for duplicates (see the sketch below)
    • Storing items
    • 3rd party integrations

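A sketch of the duplicates case, assuming items carry a unique 'name'
field (the module path below follows the project layout from slide 10):

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline(object):

        def __init__(self):
            self.seen = set()

        def process_item(self, item, spider):
            # drop any item whose 'name' was already scraped
            if item['name'] in self.seen:
                raise DropItem('Duplicate item found: %s' % item)
            self.seen.add(item['name'])
            return item

Enable it in settings.py with
ITEM_PIPELINES = {'europython.pipelines.DuplicatesPipeline': 300}.
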
  18. Avoid getting banned
    • Rotate your User-Agent
    • Disable cookies (see the settings sketch below)
    • Randomized download delays
    • Use a pool of rotating IPs
    • Crawlera
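
A settings.py sketch covering the cookie and delay points (values are
illustrative; User-Agent rotation usually lives in a downloader middleware):

    # settings.py
    COOKIES_ENABLED = False           # do not persist cookies
    DOWNLOAD_DELAY = 2                # base delay between requests, in seconds
    RANDOMIZE_DOWNLOAD_DELAY = True   # jitter between 0.5x and 1.5x the delay
    USER_AGENT = 'Mozilla/5.0 ...'    # static UA; rotate it via middleware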