
Crawling the web like a boss


On Saturday I was pleased to speak at HackAgilize 2016. It was a small event hosted by [Agilize](http://www.agilize.com.br), aimed at building projects through Coding Dojos and discussing technologies, stacks, and more. My presentation covered the main concepts behind Scrapy and then quickly put the theory into practice by building some projects with the participants.

Victor Martinez

October 15, 2016


Transcript

  1. An open source and collaborative framework for extracting the data
    you need from websites. In a fast, simple, yet extensible way.
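    A minimal, self-contained spider shows the idea; this is an illustrative
    sketch, not from the deck (quotes.toscrape.com is a public scraping sandbox):

    ```python
    # minimal_spider.py -- run with: scrapy runspider minimal_spider.py -o quotes.json
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # quotes.toscrape.com is a public demo site for scraping exercises
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Pull each quote's text and author with CSS selectors
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    "author": quote.css("small.author::text").extract_first(),
                }
    ```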
  2. DOWNLOADER MIDDLEWARE
    - Hooks into Scrapy’s request/response processing
    - Low-level system for globally altering Scrapy’s requests and responses
    SPIDER MIDDLEWARE
    - Hooks into Scrapy’s spider processing mechanism
    - Custom functionality for responses sent to spiders and for requests
      and items generated from spiders
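    As an illustrative sketch (the class name and header are invented, not
    from the deck), the downloader middleware hooks look like this:

    ```python
    # middlewares.py -- a minimal downloader middleware (illustrative sketch)
    class CustomHeaderMiddleware(object):

        def process_request(self, request, spider):
            # Runs for every outgoing request: globally alter it here
            request.headers.setdefault("X-Example", "hackagilize")
            return None  # None = continue normal processing

        def process_response(self, request, response, spider):
            # Runs for every response before it reaches the spider
            return response
    ```

    It would be enabled through the DOWNLOADER_MIDDLEWARES setting in
    settings.py; spider middlewares are registered analogously via
    SPIDER_MIDDLEWARES.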
  3. PIP
    Ubuntu:
    $ apt-get -y install python-pip
    $ pip --help
    OSX:
    $ sudo easy_install pip
    $ pip --help
  4. VirtualEnv (on top of the PIP setup above):
    $ pip install virtualenv
    $ virtualenv venv
    $ source venv/bin/activate
  5. Install Scrapy (inside the activated virtualenv):
    $ pip install scrapy
  6. $ scrapy startproject <project_name>
    $ scrapy startproject agilize
    New Scrapy project 'agilize', using template directory
    '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/victormartinez/Workspace/python/agilize
    You can start your first spider with:
    cd agilize
    scrapy genspider example example.com
  7. $ scrapy startproject <project_name>
    $ scrapy startproject agilize
    agilize/
    ├── scrapy.cfg
    └── agilize/
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py
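    The generated items.py and pipelines.py start out as stubs. As a hedged
    illustration (the class names and fields are invented), an Item declares
    the fields you scrape and a pipeline post-processes every item:

    ```python
    # items.py -- declare the fields a scraped item carries (illustrative)
    import scrapy


    class PostItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()


    # pipelines.py -- every yielded item flows through the enabled pipelines
    class NormalizeTitlePipeline(object):

        def process_item(self, item, spider):
            # Example post-processing: trim whitespace from the title
            if item.get("title"):
                item["title"] = item["title"].strip()
            return item
    ```

    A pipeline only takes effect once it is listed in the ITEM_PIPELINES
    setting in settings.py.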
  8. $ scrapy genspider <spider> <url>
    $ scrapy genspider myspider 'www.blog.agilize.com.br'

    # -*- coding: utf-8 -*-
    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["blog.agilize.com.br"]
        start_urls = (
            'http://www.blog.agilize.com.br/',
        )

        def parse(self, response):
            pass
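    The generated parse() is only a stub. A hedged sketch of filling it in
    for a blog listing (the CSS selectors are hypothetical, not taken from
    the real site):

    ```python
    def parse(self, response):
        # Selectors are illustrative -- adapt them to the actual markup
        for post in response.css("article.post"):
            yield {
                "title": post.css("h2 a::text").extract_first(),
                "url": post.css("h2 a::attr(href)").extract_first(),
            }
    ```

    Running $ scrapy crawl myspider -o posts.json then writes the scraped
    items to a JSON file.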
  9. # -*- coding: utf-8 -*-
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class MySpider(CrawlSpider):
        name = "myspider"
        allowed_domains = ["blog.agilize.com.br"]
        start_urls = (
            'http://www.blog.agilize.com.br/',
        )

        rules = (
            Rule(  # Rule 1
                LinkExtractor(),
                callback='parse_item',
            ),
        )

        def parse_item(self, response):
            pass
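    An empty LinkExtractor() matches every link on the page. In practice the
    rule is usually constrained; a hedged example (the allow pattern is
    invented for illustration):

    ```python
    rules = (
        Rule(
            # Follow only URLs that look like dated blog posts (hypothetical pattern)
            LinkExtractor(allow=r"/\d{4}/\d{2}/"),
            callback='parse_item',
            follow=True,  # also keep crawling links found on matched pages
        ),
    )
    ```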
  10. Side by side: the CrawlSpider from slide 9 and the plain Spider from
    slide 8 (same code as above, shown together for comparison).