
Crawling the web like a boss

On Saturday I was pleased to be a speaker at HackAgilize 2016. It was a small event hosted by [Agilize](http://www.agilize.com.br), aimed at building some projects through Coding Dojos and discussing technologies, stacks, etc. My presentation focused on introducing the main concepts behind Scrapy and then quickly putting the theory into practice by creating some projects with the participants.


Victor Martinez

October 15, 2016

Transcript

  1. Crawling the web like a boss. Victor (Frodo) Martinez, Software Engineer @ Jusbrasil. HackAgilize 2016
  2. Victor Frodo Martinez, Software Engineer @ Jusbrasil. twitter: vcrmartinez | email: vcrmartinez@gmail.com | blog: victormartinez.github.io
  3. None
  4. ~200 spiders ~202 K news/month

  5. None
  6. [Automated] Web Data Extraction

  7. Maintenance Automation Knowledge Controlled Environment

  8. None
  9. An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
  10. None
  11. SPIDERS, SCHEDULER, DOWNLOADER, ITEM PIPELINES, ENGINE, SPIDER MIDDLEWARE, DOWNLOADER MIDDLEWARE

  12. ENGINE: Controls the data flow between all components and triggers events when certain actions occur
  13. SPIDERS: Classes written by users to parse responses and extract items
  14. SCHEDULER: Receives requests from the engine and enqueues them, feeding them back when the engine asks for them
  15. SPIDER MIDDLEWARE: Hooks into Scrapy’s spider processing mechanism; custom functionality applied to the responses sent to spiders and to the requests and items generated from spiders. DOWNLOADER MIDDLEWARE: Hooks into Scrapy’s request/response processing; a low-level system for globally altering Scrapy’s requests and responses. (A minimal middleware sketch appears after the transcript.)
  16. DOWNLOADER: Fetches web pages and feeds them to the engine

  17. ITEM PIPELINES: Process the items once they have been extracted [cleansing, validation, persistence] (see the item pipeline sketch after the transcript)
  18. None
  19. Talk is cheap. Show me the code. - Linus Torvalds

  20. http://scrapy.org/

  21. http://scrapy.org/

  22. http://scrapy.org/

  23. PIP
      Ubuntu: $ apt-get -y install python-pip
      OSX: $ sudo easy_install pip
      Check it: $ pip --help
  24. VirtualEnv (after installing pip as on slide 23)
      $ pip install virtualenv
      $ virtualenv venv
      $ source venv/bin/activate
  25. Install Scrapy
      $ pip install scrapy
  26. $ scrapy startproject <project_name> $ scrapy startproject agilize

  27. $ scrapy startproject <project_name>
      $ scrapy startproject agilize
      New Scrapy project 'agilize', using template directory
      '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
          /Users/victormartinez/Workspace/python/agilize
      You can start your first spider with:
          cd agilize
          scrapy genspider example example.com
  28. $ scrapy startproject agilize
      agilize/
      ├── scrapy.cfg
      └── agilize/
          ├── __init__.py
          ├── items.py
          ├── pipelines.py
          ├── settings.py
          └── spiders/
              └── __init__.py
  29. $ scrapy genspider <spider> <url>

  30. $ scrapy genspider <spider> <url> $ scrapy genspider myspider 'www.blog.agilize.com.br'

  31. $ scrapy genspider <spider> <url>
      $ scrapy genspider myspider 'www.blog.agilize.com.br'
      # -*- coding: utf-8 -*-
      import scrapy

      class MySpider(scrapy.Spider):
          name = "myspider"
          allowed_domains = ["blog.agilize.com.br"]
          start_urls = (
              'http://www.blog.agilize.com.br/',
          )

          def parse(self, response):
              pass
  32. # -*- coding: utf-8 -*-
      from scrapy.spiders import CrawlSpider, Rule
      from scrapy.linkextractors import LinkExtractor

      class MySpider(CrawlSpider):
          name = "myspider"
          allowed_domains = ["blog.agilize.com.br"]
          start_urls = (
              'http://www.blog.agilize.com.br/',
          )
          rules = (
              Rule(  # Rule 1
                  LinkExtractor(),
                  callback='parse_item',
              ),
          )

          def parse_item(self, response):
              pass
  33. (Side by side: the CrawlSpider from slide 32 next to the plain Spider generated on slide 31.)
  34. def parse(self, response):
          title = response.xpath("//div[@class='posts']/div[@class='post']/div[@class='post-content']/h1/text()").extract_first()
          body = response.xpath("//div[@class='post-content']/div[@class='post-excerpt']").extract_first()
          return {'title': title, 'body': body}
      (The scrapy shell sketch after the transcript shows how to try these XPaths interactively.)
  35. $ scrapy crawl <spider_name>

  36. $ scrapy crawl <spider_name> -o <filename>.json
      $ scrapy crawl agilize -o items.json
  37. None
  38. Extract quotes from the main page http://quotes.toscrape.com/tag/humor/ (a starter spider sketch appears after the transcript)

  39. Extract all blog posts in the first three pages of blog.agilize.com.br

  40. Extract lecturers’ information from the “Portal da Transparência” website http://www.portaldatransparencia.gov.br/ [Nome (name), Tipo do Servidor (type of public servant), Matrícula (registration number), Cargo (position), Jornada de Trabalho (working hours)]
  41. None
  42. Thanks! Victor Frodo Martinez, Software Engineer @ Jusbrasil. twitter: vcrmartinez | email: vcrmartinez@gmail.com | blog: victormartinez.github.io
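
A few illustrative sketches to complement the transcript. First, slide 15: a minimal downloader middleware sketch, assuming a hypothetical `agilize/middlewares.py` module; the class name and the header it sets are illustrative, not from the deck.

```python
# agilize/middlewares.py  (hypothetical module and class names)
class CustomHeaderMiddleware:
    """Downloader middleware: hooks into Scrapy's request/response processing."""

    def process_request(self, request, spider):
        # Called for each request the engine sends to the downloader.
        request.headers.setdefault('User-Agent', 'agilize-crawler')
        return None  # None = continue processing this request normally

    def process_response(self, request, response, spider):
        # Called for each response before it is handed back to the spider.
        spider.logger.debug("Fetched %s (status %s)", response.url, response.status)
        return response
```

It would be enabled in `settings.py` with something like `DOWNLOADER_MIDDLEWARES = {'agilize.middlewares.CustomHeaderMiddleware': 543}`.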
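
Slide 17: a minimal item pipeline sketch doing cleansing and validation, placed in the `pipelines.py` from the project layout on slide 28; the `title` field and the validation rule are just assumptions for the example.

```python
# agilize/pipelines.py
from scrapy.exceptions import DropItem


class BlogPostPipeline:
    """Item pipeline: runs over every item the spiders yield."""

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            # Validation: drop items without a title.
            raise DropItem("Missing title in %r" % item)
        item['title'] = title  # cleansing: strip stray whitespace
        return item            # pass the item along (e.g. to an exporter)
```

Enabled with `ITEM_PIPELINES = {'agilize.pipelines.BlogPostPipeline': 300}` in `settings.py`.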
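
Slide 34: the XPath expressions are easiest to develop interactively with `scrapy shell`; a session against the blog might look like this (selectors copied from the slide, output omitted).

```
$ scrapy shell 'http://www.blog.agilize.com.br/'
>>> response.xpath("//div[@class='posts']/div[@class='post']/div[@class='post-content']/h1/text()").extract_first()
>>> response.xpath("//div[@class='post-content']/div[@class='post-excerpt']").extract_first()
```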
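
Finally, a starter sketch for the first exercise (slide 38), extracting quotes from http://quotes.toscrape.com/tag/humor/; the CSS selectors are assumptions about that page's markup, so verify them in `scrapy shell` before relying on them.

```python
# quotes_spider.py -- hypothetical starter for the slide-38 exercise
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # One item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```

Run it standalone with `scrapy runspider quotes_spider.py -o quotes.json`, or drop it into a project's spiders/ directory and use `scrapy crawl quotes -o quotes.json` as on slide 36.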