Europython - Dive into Scrapy

July 21, 2015


  1. DIVE INTO SCRAPY Juan Riaza @juanriaza

  2. Who am I?
    Juan Riaza (@juanriaza)
    • Developer Evangelist @ Scrapinghub
    • Pythonista, Djangonaut since 0.96
    • I like to reverse engineer stuff
  3. Why Web Scraping
    • APIs
    • Semantic web

  4. What is Web Scraping
    The main goal in scraping is to extract structured data from unstructured sources, typically web pages.
  5. What for
    • Monitor prices
    • Lead generation
    • Aggregate information
    • Your imagination is the limit
  6. Do you speak HTTP?
    • Methods: GET, POST, PUT, HEAD…
    • Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999
    • Headers, query string: Accept-language, UA*…
    • Persistence: cookies
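
The pieces on this slide can be sketched with nothing but the Python 3 standard library (`urllib.request`, the successor of the `urllib2` module named on the next slide). The URL, query parameters, and header values below are assumptions made up for illustration, and the request is only built, never sent.

```python
from http import HTTPStatus
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical URL and parameters, for illustration only; nothing is sent.
query = urlencode({"page": 2, "lang": "en"})  # query string
request = Request(
    "http://example.com/search?" + query,
    headers={"Accept-Language": "en", "User-Agent": "my-crawler/0.1"},
    method="HEAD",  # method
)


def status_class(code):
    """Map a status code to its class: 2 for 2XX, 4 for 4XX, and so on."""
    return code // 100


print(request.get_method(), request.full_url)
print(status_class(HTTPStatus.NOT_FOUND))
```

Note that `urllib.request.Request` normalises header names, so the header is read back with `request.get_header("Accept-language")`.
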
  7. Let’s perform a request
    • urllib2: Standard Library
    • HTTP for humans

  8. Show me the code!
    import requests
    req = requests.get('http://scrapinghub.com/about/')
    What

  9. HTML is not a regular language
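
A small sketch of why this matters, assuming a made-up nested snippet: a regular expression cannot keep track of nesting, while even the standard library's `html.parser` can. The `DepthTracker` class is a hypothetical helper written for this illustration.

```python
import re
from html.parser import HTMLParser

HTML = "<div>outer <div>inner</div> tail</div>"

# A non-greedy regex stops at the first closing tag and cuts the outer div short.
regex_match = re.search(r"<div>(.*?)</div>", HTML).group(1)


class DepthTracker(HTMLParser):
    """Records each text chunk together with its <div> nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        self.texts.append((self.depth, data.strip()))


tracker = DepthTracker()
tracker.feed(HTML)
tracker.close()

print(repr(regex_match))  # the truncated regex capture
print(tracker.texts)      # text chunks with their nesting depth
```
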

  10. HTML Parsers
    • lxml: Pythonic binding for the C libraries libxml2 and libxslt
    • beautifulsoup: html.parser, lxml, html5lib
  11. Show me the code!
    import requests
    import lxml.html

    req = requests.get('https://ep2015.europython.eu/schedule/list/')
    tree = lxml.html.fromstring(req.text)
    for tr in tree.xpath('//td[@data-talk]/div[@class="name"]/a'):
        name = tr.xpath('text()')
        url = tr.xpath('@href')
  12. “Those who don't understand xpath are cursed to reinvent it,

  13. Scrapy-ify early on

  14. “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
  15. An interactive shell console
    $ scrapy shell <url>
    Invaluable tool for developing and debugging your spiders
  16. An interactive shell console
    >>> response.url
    'http://example.com'
    >>> response.xpath('//h1/text()')
    [<Selector xpath='//h1/text()' data=u'Example Domain'>]
    >>> view(response)  # open in browser
    >>> fetch('http://www.google.com')  # fetch other URL
  17. Starting a project
    $ scrapy startproject <name>
    europython
    ├── europython
    │   ├── __init__.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg
  18. What is a Spider?

  19. What is a Spider?
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
        ]

        def parse(self, response):
            msg = 'A response from %s just arrived!' % response.url
            self.logger.info(msg)
  20. What is a Spider?
    import scrapy
    from europython.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']

        def start_requests(self):
            yield scrapy.Request('http://www.example.com/1.html', self.parse)
            yield scrapy.Request('http://www.example.com/2.html', self.parse)
            yield scrapy.Request('http://www.example.com/3.html', self.parse)

        def parse(self, response):
            for h3 in response.xpath('//h3').extract():
                yield MyItem(title=h3)
            for url in response.xpath('//a/@href').extract():
                # hrefs may be relative, so resolve them against the page URL
                yield scrapy.Request(response.urljoin(url), callback=self.parse)
  21. What is a Spider? (1.0)
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html'
        ]

        def parse(self, response):
            # Since Scrapy 1.0 a callback can yield plain dicts as items
            for h3 in response.xpath('//h3/text()').extract():
                yield {'title': h3}
            for url in response.xpath('//a/@href').extract():
                # hrefs may be relative, so resolve them against the page URL
                yield scrapy.Request(response.urljoin(url), callback=self.parse)
  22. Items
    import scrapy

    class Product(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        last_updated = scrapy.Field(serializer=str)
  23. Item Loaders

  24. Item exporters
    • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends
    • Backends: FTP, S3, local filesystem
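
To make the idea of a feed export concrete, here is a rough standard-library sketch that serialises the same scraped items to two formats. It is not Scrapy's exporter code; the item values and the `to_json_lines` and `to_csv` helpers are hypothetical.

```python
import csv
import io
import json

# Made-up items standing in for whatever a spider yields.
items = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.50"},
]


def to_json_lines(rows):
    """Serialise one JSON object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def to_csv(rows, fields):
    """Serialise rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json_lines(items))
print(to_csv(items, ["name", "price"]))
```

With Scrapy itself none of this is written by hand: running `scrapy crawl <spider> -o items.json` asks the built-in feed exports to do the serialisation and storage.
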
  25. DjangoItem
    from django.db import models

    class Person(models.Model):
        name = models.CharField(max_length=255)
        age = models.IntegerField()

    from scrapy.contrib.djangoitem import DjangoItem

    class PersonItem(DjangoItem):
        django_model = Person
  26. Under the hood

  27. Architecture

  28. Item pipelines
    • Default field value
    • Validating scraped data
    • Checking for duplicates
    • Storing items
    • 3rd party integrations
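
A minimal sketch of the first three bullets, assuming dict-like items with `name`, `price`, and `stock` fields. An item pipeline is a class with a `process_item(self, item, spider)` method; the class name and field names here are hypothetical, and `DropItem` is a local stand-in for `scrapy.exceptions.DropItem` so the sketch runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class CleanProductPipeline:
    """Hypothetical pipeline: default value, validation, duplicate check."""

    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        item.setdefault("stock", 0)           # default field value
        if not item.get("price"):             # validate scraped data
            raise DropItem("missing price: %r" % (item,))
        if item["name"] in self.seen_names:   # check for duplicates
            raise DropItem("duplicate: %r" % (item["name"],))
        self.seen_names.add(item["name"])
        return item                           # hand the item to the next pipeline


pipeline = CleanProductPipeline()
print(pipeline.process_item({"name": "Widget", "price": 10}, spider=None))
```

In a real project the pipeline would be switched on through the `ITEM_PIPELINES` setting.
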
  29. Middlewares
    • Session handling
    • Retrying bad requests
    • Modifying requests
    • Randomize User Agent
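
The last bullet could look roughly like this. A downloader middleware may define `process_request(self, request, spider)`; the class name and the user agent strings are placeholders invented for this sketch, and the tiny `FakeRequest` only exists so the example runs without Scrapy.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a User-Agent per request."""

    # Placeholder strings, not real browser identifiers.
    USER_AGENTS = ["example-agent/1.0", "example-agent/2.0", "example-agent/3.0"]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to continue handling the request


class FakeRequest:
    """Stand-in for scrapy.Request, exposing only a dict-like `headers`."""

    def __init__(self):
        self.headers = {}


demo_request = FakeRequest()
RandomUserAgentMiddleware().process_request(demo_request, spider=None)
print(demo_request.headers["User-Agent"])
```

In a real project such a class would be enabled through the `DOWNLOADER_MIDDLEWARES` setting.
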
  30. Batteries included
    • Logging
    • Stats collection
    • Testing: contracts
    • Telnet console: inspect a Scrapy process
  31. Project Example: scrapinghub/pycon-speakers

  32. Avoid getting banned
    • Rotate your User Agent
    • Disable cookies
    • Randomized download delays
    • Use a pool of rotating IPs
    • Crawlera
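
Two of these bullets map directly onto Scrapy settings. A possible `settings.py` fragment, where the delay value of 2 seconds is an assumption chosen for illustration:

```python
# settings.py (fragment)
COOKIES_ENABLED = False          # disable cookies
DOWNLOAD_DELAY = 2               # base delay in seconds between requests to the same site (assumed value)
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x DOWNLOAD_DELAY
```

Rotating user agents and IPs are not covered by a single setting; they are typically handled in a downloader middleware or by an external service.
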
  33. Broad Crawls: Frontera

  34. Deployment (1.0)
    scrapyd: a service daemon to run Scrapy spiders
    $ scrapyd-deploy
  35. Scrapy Cloud
    $ shub deploy

  36. About us
    TONS of Open Source
    Fully remote distributed team

  37. Mandatory Sales Slide
    Professional Services
    Products
    • Scrapy Cloud •

  38. We’re hiring!

  39. Gracias. Juan Riaza @juanriaza