
Crawling the web like a boss

On Saturday I was pleased to be a speaker at HackAgilize 2016. It was a small event hosted by [Agilize](http://www.agilize.com.br), aimed at building some projects through Coding Dojos and discussing technologies, stacks, etc. My presentation focused on introducing the main concepts behind Scrapy and then quickly putting the theory into practice by creating some projects with the participants.


Victor Martinez

October 15, 2016

Transcript

  1. Crawling the web like a boss. Victor (Frodo) Martinez, Software Engineer @ Jusbrasil. HackAgilize 2016
  2. Victor Frodo Martinez, Software Engineer @ Jusbrasil. twitter: vcrmartinez | email: vcrmartinez@gmail.com | blog: victormartinez.github.io
  3. None
  4. ~200 spiders ~202 K news/month

  5. None
  6. [Automated] Web Data Extraction

  7. Maintenance Automation Knowledge Controlled Environment

  8. None
  9. An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
  10. None
  11. SPIDERS, SCHEDULER, DOWNLOADER, ITEM PIPELINES, ENGINE, SPIDER MIDDLEWARE, DOWNLOADER MIDDLEWARE

  12. ENGINE: Controls the data flow between all components and triggers events when certain actions occur
  13. SPIDERS: Classes written by users to parse responses and extract items
  14. SCHEDULER: Receives requests from the engine and enqueues them, feeding them back when the engine asks for them
  15. SPIDER MIDDLEWARE: Hooks into Scrapy’s spider processing mechanism; custom functionality applied to the responses sent to spiders and to the requests and items generated from spiders. DOWNLOADER MIDDLEWARE: Hooks into Scrapy’s request/response processing; a low-level system for globally altering Scrapy’s requests and responses. (A minimal middleware sketch appears after the transcript.)
  16. DOWNLOADER: Fetches web pages and feeds them to the engine

  17. ITEM PIPELINES: Process the items once they have been extracted [cleansing, validation, persistence] (see the item pipeline sketch after the transcript)
  18. None
  19. Talk is cheap. Show me the code. - Linus Torvalds

  20. http://scrapy.org/

  21. http://scrapy.org/

  22. http://scrapy.org/

  23. PIP
      Ubuntu: $ apt-get -y install python-pip
      OSX: $ sudo easy_install pip
      Check it: $ pip --help
  24. VirtualEnv (after installing pip as on slide 23)
      $ pip install virtualenv
      $ virtualenv venv
      $ source venv/bin/activate
  25. Install Scrapy
      $ pip install scrapy
  26. $ scrapy startproject <project_name> $ scrapy startproject agilize

  27. $ scrapy startproject <project_name>
      $ scrapy startproject agilize
      New Scrapy project 'agilize', using template directory
      '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
          /Users/victormartinez/Workspace/python/agilize
      You can start your first spider with:
          cd agilize
          scrapy genspider example example.com
  28. $ scrapy startproject agilize
      agilize/
      ├── scrapy.cfg
      └── agilize/
          ├── __init__.py
          ├── items.py
          ├── pipelines.py
          ├── settings.py
          └── spiders/
              └── __init__.py
  29. $ scrapy genspider <spider> <url>

  30. $ scrapy genspider <spider> <url> $ scrapy genspider myspider 'www.blog.agilize.com.br'

  31. $ scrapy genspider <spider> <url>
      $ scrapy genspider myspider 'www.blog.agilize.com.br'
      # -*- coding: utf-8 -*-
      import scrapy

      class MySpider(scrapy.Spider):
          name = "myspider"
          allowed_domains = ["blog.agilize.com.br"]
          start_urls = (
              'http://www.blog.agilize.com.br/',
          )

          def parse(self, response):
              pass
  32. # -*- coding: utf-8 -*-
      from scrapy.spiders import CrawlSpider, Rule
      from scrapy.linkextractors import LinkExtractor

      class MySpider(CrawlSpider):
          name = "myspider"
          allowed_domains = ["blog.agilize.com.br"]
          start_urls = (
              'http://www.blog.agilize.com.br/',
          )
          rules = (
              Rule(  # Rule 1
                  LinkExtractor(),
                  callback='parse_item',
              ),
          )

          def parse_item(self, response):
              pass
  33. (Side by side: the CrawlSpider from slide 32 next to the plain Spider generated on slide 31.)
  34. def parse(self, response):
          title = response.xpath("//div[@class='posts']/div[@class='post']/div[@class='post-content']/h1/text()").extract_first()
          body = response.xpath("//div[@class='post-content']/div[@class='post-excerpt']").extract_first()
          return {'title': title, 'body': body}
      (The scrapy shell sketch after the transcript shows how to try these XPaths interactively.)
  35. $ scrapy crawl <spider_name>

  36. $ scrapy crawl <spider_name> -o <filename>.json
      $ scrapy crawl agilize -o items.json
  37. None
  38. Extract quotes from the main page http://quotes.toscrape.com/tag/humor/ (a starter spider sketch appears after the transcript)

  39. Extract all blog posts in the first three pages of blog.agilize.com.br

  40. Extract lecturers’ information from the “Portal da Transparência” website http://www.portaldatransparencia.gov.br/ [Nome (name), Tipo do Servidor (type of public servant), Matrícula (registration number), Cargo (position), Jornada de Trabalho (working hours)]
  41. None
  42. Thanks! Victor Frodo Martinez, Software Engineer @ Jusbrasil. twitter: vcrmartinez | email: vcrmartinez@gmail.com | blog: victormartinez.github.io
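
A few illustrative sketches to complement the transcript. First, slide 15: a minimal downloader middleware sketch, assuming a hypothetical `agilize/middlewares.py` module; the class name and the header it sets are illustrative, not from the deck.

```python
# agilize/middlewares.py  (hypothetical module and class names)
class CustomHeaderMiddleware:
    """Downloader middleware: hooks into Scrapy's request/response processing."""

    def process_request(self, request, spider):
        # Called for each request the engine sends to the downloader.
        request.headers.setdefault('User-Agent', 'agilize-crawler')
        return None  # None = continue processing this request normally

    def process_response(self, request, response, spider):
        # Called for each response before it is handed back to the spider.
        spider.logger.debug("Fetched %s (status %s)", response.url, response.status)
        return response
```

It would be enabled in `settings.py` with something like `DOWNLOADER_MIDDLEWARES = {'agilize.middlewares.CustomHeaderMiddleware': 543}`.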
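
Slide 17: a minimal item pipeline sketch doing cleansing and validation, placed in the `pipelines.py` from the project layout on slide 28; the `title` field and the validation rule are just assumptions for the example.

```python
# agilize/pipelines.py
from scrapy.exceptions import DropItem


class BlogPostPipeline:
    """Item pipeline: runs over every item the spiders yield."""

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            # Validation: drop items without a title.
            raise DropItem("Missing title in %r" % item)
        item['title'] = title  # cleansing: strip stray whitespace
        return item            # pass the item along (e.g. to an exporter)
```

Enabled with `ITEM_PIPELINES = {'agilize.pipelines.BlogPostPipeline': 300}` in `settings.py`.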
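
Slide 34: the XPath expressions are easiest to develop interactively with `scrapy shell`; a session against the blog might look like this (selectors copied from the slide, output omitted).

```
$ scrapy shell 'http://www.blog.agilize.com.br/'
>>> response.xpath("//div[@class='posts']/div[@class='post']/div[@class='post-content']/h1/text()").extract_first()
>>> response.xpath("//div[@class='post-content']/div[@class='post-excerpt']").extract_first()
```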
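
Finally, a starter sketch for the first exercise (slide 38), extracting quotes from http://quotes.toscrape.com/tag/humor/; the CSS selectors are assumptions about that page's markup, so verify them in `scrapy shell` before relying on them.

```python
# quotes_spider.py -- hypothetical starter for the slide-38 exercise
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # One item per quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
```

Run it standalone with `scrapy runspider quotes_spider.py -o quotes.json`, or drop it into a project's spiders/ directory and use `scrapy crawl quotes -o quotes.json` as on slide 36.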