Slide 1

Slide 1 text

WEB SITES TO JSON DATA IN 30 MINUTES WITH SCRAPY (Blame me if it takes longer!)

Slide 2

Slide 2 text

Emanuele Palazzetti (Python backend Developer) @ https://evonove.it git.palazzetti.me :: @palazzem :: github.com/palazzem

Slide 3

Slide 3 text

Before we begin, I want to tell you a story. An unhappy story of me searching for a "magic" framework: SCRAPY

Slide 4

Slide 4 text

(During an ordinary day) Everything goes right... ...until a customer knocks on your door... ...he wants to talk about a really important project...

Slide 5

Slide 5 text

Default approach: listen to the customer (Not always the right thing to do)

Slide 6

Slide 6 text

Customer: "I've a brilliant idea!" Probably this is the "point of no return"

Slide 7

Slide 7 text

Customer: "Let's gather the Internet searching for companies information"

Slide 8

Slide 8 text

Google Custom Search:
- search everything you need
- JSON/XML/... data extraction
- save all results in your (web) application
Ok, we have a deal!

Slide 9

Slide 9 text

Customer: IT'S TOO EXPENSIVE

Slide 10

Slide 10 text

Customer: WE MUST BYPASS GOOGLE

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

I found myself in a maelstrom. Me: "Ok. What's your plan?" (I'll never say this again)

Slide 13

Slide 13 text

Customer: "Surf the Internet and you're done!" [...] "No?"

Slide 14

Slide 14 text

In this situation you should say what is good and what is bad. Especially when we talk about ideas. Especially when we talk about "surf".

Slide 15

Slide 15 text

Good idea / Bad idea

Slide 16

Slide 16 text

Customer requirements:
- gather information from customers' websites
- but we don't have customers' websites
- we have VAT numbers, phone numbers, and other heterogeneous data
- use the standard Google search page
- follow the first N links
- grab (steal) emails!

Slide 17

Slide 17 text

ScraperWiki? SCRAPY!

$ pip install scrapy
$ scrapy startproject harvester

Slide 18

Slide 18 text

$ tree .
├── harvester
│   ├── __init__.py
│   ├── items.py        # Item class
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders         # Spiders class folder
│       └── __init__.py
└── scrapy.cfg

Slide 19

Slide 19 text

Item
- All web pages contain unstructured data
- Item class is a container used to collect scraped data
- Provides a dictionary-like API
- Class attributes are Field() objects

Slide 20

Slide 20 text

CompanyItem

from scrapy.item import Item, Field

class CompanyItem(Item):
    link = Field()
    emails = Field()

Just a note:

>>> company['name'] = 'Evonove'  # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'CompanyItem does not support field: name'

Slide 21

Slide 21 text

Spider defines the custom behavior during scraping:
- how Spider follows links
- how Spider finds unstructured data through Selectors
- how to populate Item() objects

Slide 22

Slide 22 text

My first Spider

from scrapy.spider import Spider

class SpiderBot(Spider):
    name = 'bot'
    start_urls = [
        "https://evonove.it/contacts/"
    ]

Slide 23

Slide 23 text

Parse the response!

def parse(self, response):
    selector = Selector(response)
    # Use regex to find valid emails
    emails = selector.re(self.EMAIL_REGEX)
    if emails:
        item = CompanyItem()
        item['link'] = response.url
        item['emails'] = emails
        return item
    return

Easy thanks to Selector!
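The parse() method relies on self.EMAIL_REGEX, which the slides never show. A minimal sketch of such a class attribute (the pattern below is a common-case assumption, not the one used in the talk):

class SpiderBot(Spider):
    # Hypothetical pattern: matches typical "user@domain.tld" addresses;
    # the regex actually used in the talk is not shown in the slides
    EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}'

Selector.re() returns the list of strings matched by the regex, so emails ends up as a (possibly empty) list of addresses.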

Slide 24

Slide 24 text

We don't have customers' websites

def __init__(self, filename=None, *args, **kwargs):
    super(SpiderBot, self).__init__(*args, **kwargs)
    # Get start urls from file
    self.start_urls = search_urls_from_file(filename) if filename else []

search_urls_from_file builds Google search URLs from the template:
https://www.google.it/search?q={}
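The search_urls_from_file helper is not shown in the deck; a minimal sketch, assuming the input file holds one search query per line (the SEARCH_URL constant name and the naive space-to-'+' encoding are my assumptions):

# Hypothetical helper: the deck only gives its name and the URL template
SEARCH_URL = "https://www.google.it/search?q={}"

def search_urls_from_file(filename):
    # One search query per line -> one Google search start URL per query
    with open(filename) as f:
        queries = [line.strip() for line in f if line.strip()]
    # Naive encoding: spaces become '+'; real code would URL-encode properly
    return [SEARCH_URL.format(query.replace(' ', '+')) for query in queries]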

Slide 25

Slide 25 text

What is missing?
- scrape Google search!
- find all results
- grab first 5 links
- follow them
- scrape resulting page!
Did I tell you Scrapy is batteries included?

Slide 26

Slide 26 text

Hello CrawlSpider!

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SpiderBot(CrawlSpider):
    name = 'bot'
    rules = (
        Rule(
            SgmlLinkExtractor(),
        ),
    )

Slide 27

Slide 27 text

Ok maybe not *all* links

class SpiderBot(CrawlSpider):
    name = 'bot'
    rules = (
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
        ),
    )

Slide 28

Slide 28 text

And only the first five

class SpiderBot(CrawlSpider):
    name = 'bot'
    rules = (
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
            process_links='filter_links',
        ),
    )

    def filter_links(self, links):
        # Get only first 5 links if available
        return links[0:5]

Slide 29

Slide 29 text

Oh yes... I should parse all responses!

class SpiderBot(CrawlSpider):
    name = 'bot'
    rules = (
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
            process_links='filter_links',
            callback='parse_item',  # parse() is now parse_item()
        ),
    )

    def filter_links(self, links):
        # Get only first 5 links if available
        return links[0:5]

Slide 30

Slide 30 text

Customer's spider!

class SpiderBot(CrawlSpider):
    name = 'bot'
    rules = (
        Rule(
            SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
            process_links='filter_links',
            callback='parse_item',
        ),
    )

    def filter_links(self, links):
        # Get only first 5 links if available
        return links[0:5]

    def __init__(self, filename=None, *args, **kwargs):
        super(SpiderBot, self).__init__(*args, **kwargs)
        # Get start urls from file
        self.start_urls = urls_from_file(filename) if filename else []

    def parse_item(self, response):
        selector = Selector(response)
        # Same logic as parse() shown earlier
        emails = selector.re(self.EMAIL_REGEX)
        if emails:
            item = CompanyItem()
            item['link'] = response.url
            item['emails'] = emails
            return item
        return

Slide 31

Slide 31 text

Launch our Spider!

$ scrapy crawl bot -a filename=urls.txt

Slide 32

Slide 32 text

[...] TO JSON DATA [...]
- We need to collect scraped CompanyItem objects
- We need to serialize CompanyItem
- We need to write serialized data
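The next slide solves this with Scrapy's built-in feed export, but the same three steps could also be done by hand in pipelines.py (seen in the project tree earlier). A hypothetical sketch; the JsonWriterPipeline name and the emails.json path are my assumptions, not from the talk:

import json

class JsonWriterPipeline(object):
    # Hypothetical item pipeline: collect, serialize, and write scraped items

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # Collect each scraped CompanyItem as a plain dict
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Serialize and write everything once the crawl ends
        with open('emails.json', 'w') as f:
            json.dump(self.items, f)

A pipeline like this would also have to be enabled through ITEM_PIPELINES in settings.py.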

Slide 33

Slide 33 text

Scrapy... my old Scrapy...

$ scrapy crawl bot -a filename=urls.txt -o emails.json -t json

Slide 34

Slide 34 text

(At the end of an ordinary day) The customer is happy... ...he has all the data he wants...

Slide 35

Slide 35 text

...and I need to get my invoice paid before Google blacklists his IPs! Thanks for your attention!

Slide 36

Slide 36 text

Any questions? git.palazzetti.me :: @palazzem :: github.com/palazzem github.com/palazzem/google-blacklist-me