Slide 1

Slide 1 text

Crawling the web like a boss
Victor (Frodo) Martinez, Software Engineer @ Jusbrasil
HackAgilize 2016

Slide 2

Slide 2 text

Victor (Frodo) Martinez, Software Engineer @ Jusbrasil
Twitter: vcrmartinez
Email: vcrmartinez@gmail.com
Blog: victormartinez.github.io

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

~200 spiders ~202 K news/month

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

[Automated] Web Data Extraction

Slide 7

Slide 7 text

- Maintenance
- Automation
- Knowledge
- Controlled Environment

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Architecture components: Engine, Scheduler, Downloader, Spiders, Item Pipelines, Spider Middleware, Downloader Middleware

Slide 12

Slide 12 text

ENGINE
- Controls the data flow between all components
- Triggers events when certain actions occur

Slide 13

Slide 13 text

SPIDERS
- Classes written by users to parse responses and extract items

Slide 14

Slide 14 text

SCHEDULER
- Receives requests from the engine
- Enqueues requests and feeds them back when the engine asks for them

Slide 15

Slide 15 text

SPIDER MIDDLEWARE
- Hooks into Scrapy's spider processing mechanism
- Custom functionality for responses sent to spiders and for requests and items generated by spiders

DOWNLOADER MIDDLEWARE
- Hooks into Scrapy's request/response processing
- Low-level system for globally altering Scrapy's requests and responses
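To make the downloader-middleware hook concrete, here is a minimal sketch (not from the talk; the class name, header value, and priority number are illustrative, while process_request/process_response are Scrapy's standard hook methods):

# middlewares.py (hypothetical module inside the project)
class CustomHeaderMiddleware(object):
    """Downloader middleware: globally alter requests and responses."""

    def process_request(self, request, spider):
        # Called for each request before it reaches the Downloader.
        request.headers.setdefault('User-Agent', 'agilize-bot')
        return None  # continue processing this request

    def process_response(self, request, response, spider):
        # Called with each response before it reaches the Spider.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response

# settings.py
# DOWNLOADER_MIDDLEWARES = {'agilize.middlewares.CustomHeaderMiddleware': 543}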

Slide 16

Slide 16 text

DOWNLOADER
- Fetches web pages and feeds them to the engine

Slide 17

Slide 17 text

ITEM PIPELINES
- Process the items once they have been extracted [cleansing, validation, persistence]
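A minimal item pipeline sketch, not from the slides, assuming the title/body fields produced by the parse example later in the deck; it would be enabled through the ITEM_PIPELINES setting:

# pipelines.py (hypothetical)
from scrapy.exceptions import DropItem

class CleanPostPipeline(object):
    def process_item(self, item, spider):
        # Validation: drop items without a title.
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        # Cleansing: normalize whitespace in the title.
        item['title'] = ' '.join(item['title'].split())
        return item

# settings.py
# ITEM_PIPELINES = {'agilize.pipelines.CleanPostPipeline': 300}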

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Talk is cheap. Show me the code. - Linus Torvalds

Slide 20

Slide 20 text

http://scrapy.org/

Slide 21

Slide 21 text

http://scrapy.org/

Slide 22

Slide 22 text

http://scrapy.org/

Slide 23

Slide 23 text

PIP

Ubuntu:
$ apt-get -y install python-pip
$ pip --help

OSX:
$ sudo easy_install pip
$ pip --help

Slide 24

Slide 24 text

PIP

Ubuntu:
$ apt-get -y install python-pip
$ pip --help

OSX:
$ sudo easy_install pip
$ pip --help

VirtualEnv
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate

Slide 25

Slide 25 text

PIP

Ubuntu:
$ apt-get -y install python-pip
$ pip --help

OSX:
$ sudo easy_install pip
$ pip --help

VirtualEnv
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate

Install Scrapy
$ pip install scrapy
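As a quick sanity check (not shown on the slide), the install can be verified from inside the virtualenv:

$ scrapy version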

Slide 26

Slide 26 text

$ scrapy startproject

$ scrapy startproject agilize

Slide 27

Slide 27 text

$ scrapy startproject

$ scrapy startproject agilize
New Scrapy project 'agilize', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/victormartinez/Workspace/python/agilize

You can start your first spider with:
    cd agilize
    scrapy genspider example example.com

Slide 28

Slide 28 text

$ scrapy startproject

$ scrapy startproject agilize

agilize/
├── scrapy.cfg
└── agilize/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
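For context, not on the slide: items.py is where structured Item classes can be declared. A minimal sketch, assuming the title/body fields used later in the deck:

# agilize/items.py (sketch)
import scrapy

class PostItem(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()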

Slide 29

Slide 29 text

$ scrapy genspider

Slide 30

Slide 30 text

$ scrapy genspider

$ scrapy genspider myspider 'www.blog.agilize.com.br'

Slide 31

Slide 31 text

$ scrapy genspider

$ scrapy genspider myspider 'www.blog.agilize.com.br'

# -*- coding: utf-8 -*-
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["blog.agilize.com.br"]
    start_urls = (
        'http://www.blog.agilize.com.br/',
    )

    def parse(self, response):
        pass

Slide 32

Slide 32 text

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["blog.agilize.com.br"]
    start_urls = (
        'http://www.blog.agilize.com.br/',
    )
    rules = (
        Rule(  # Rule 1
            LinkExtractor(),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        pass

Slide 33

Slide 33 text

# CrawlSpider version (rule-based link following)
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["blog.agilize.com.br"]
    start_urls = (
        'http://www.blog.agilize.com.br/',
    )
    rules = (
        Rule(  # Rule 1
            LinkExtractor(),
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        pass


# Plain Spider version (single parse callback)
# -*- coding: utf-8 -*-
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["blog.agilize.com.br"]
    start_urls = (
        'http://www.blog.agilize.com.br/',
    )

    def parse(self, response):
        pass

Slide 34

Slide 34 text

def parse(self, response):
    title = response.xpath(
        "//div[@class='posts']/div[@class='post']/div[@class='post-content']/h1/text()"
    ).extract_first()
    body = response.xpath(
        "//div[@class='post-content']/div[@class='post-excerpt']"
    ).extract_first()
    return {'title': title, 'body': body}
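A convenient way to develop XPath expressions like the ones above, not covered on the slide, is Scrapy's interactive shell; the selector below is only an example and must match the real page markup:

$ scrapy shell 'http://blog.agilize.com.br/'
>>> response.xpath("//div[@class='post-content']/h1/text()").extract_first()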

Slide 35

Slide 35 text

$ scrapy crawl

Slide 36

Slide 36 text

$ scrapy crawl <spider_name> -o <filename>.json

$ scrapy crawl agilize -o items.json
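Scrapy's feed exports are not limited to JSON; changing the file extension switches the format (CSV, XML, JSON lines), for example:

$ scrapy crawl agilize -o items.csv
$ scrapy crawl agilize -o items.jl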

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Extract quotes from the main page http://quotes.toscrape.com/tag/humor/
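A possible starting point for this exercise (a sketch, not the official solution; the quote/text/author CSS classes are assumptions based on the quotes.toscrape.com markup):

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # Each quote sits in a <div class="quote"> block.
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
                'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
            }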

Slide 39

Slide 39 text

Extract all blog posts from the first three pages of blog.agilize.com.br
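One way to attack this exercise with the CrawlSpider pattern shown earlier (a sketch; the /page/N/ pagination pattern and the post selectors are assumptions to be adjusted after inspecting blog.agilize.com.br):

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogSpider(CrawlSpider):
    name = 'agilize_blog'
    allowed_domains = ['blog.agilize.com.br']
    start_urls = ['http://blog.agilize.com.br/']

    rules = (
        # Follow pagination links (assumed to look like /page/2/ and /page/3/).
        Rule(LinkExtractor(allow=r'/page/[23]/$'), callback='parse_page', follow=False),
    )

    def parse_page(self, response):
        # Extract every post on the page (selector assumed, adjust to the real markup).
        for post in response.xpath("//div[@class='post']"):
            yield {
                'title': post.xpath(".//h1/text()").extract_first(),
            }

    def parse_start_url(self, response):
        # CrawlSpider hook: also extract posts from the first page itself.
        return self.parse_page(response)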

Slide 40

Slide 40 text

Extract lecturers' information from the “Portal da Transparência” website http://www.portaldatransparencia.gov.br/ [Nome (name), Tipo do Servidor (servant type), Matrícula (registration number), Cargo (position), Jornada de Trabalho (working hours)]
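A neutral skeleton for this exercise (a sketch only; the portal's search URL and page structure are not known here and have to be discovered by inspecting the site, so the extraction is left as a placeholder):

# -*- coding: utf-8 -*-
import scrapy


class ServidoresSpider(scrapy.Spider):
    name = 'servidores'
    allowed_domains = ['portaldatransparencia.gov.br']
    # start_urls must point at a concrete listing/search page on the portal (assumption).
    start_urls = ['http://www.portaldatransparencia.gov.br/']

    def parse(self, response):
        # Selectors are placeholders; inspect the real page to fill them in.
        yield {
            'nome': None,
            'tipo_do_servidor': None,
            'matricula': None,
            'cargo': None,
            'jornada_de_trabalho': None,
        }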

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Victor (Frodo) Martinez, Software Engineer @ Jusbrasil
Twitter: vcrmartinez
Email: vcrmartinez@gmail.com
Blog: victormartinez.github.io

Thanks!