From website to JSON data in 30 minutes with Scrapy

Scrapy is an open source framework for easily extracting data from websites. Built for web scraping, it can also be used for other purposes such as crawling, monitoring, and automated web application testing.

After an introduction to its major features, a real use case shows how to build a fully functional crawling system that scrapes web pages and stores the unstructured data gathered from them.

Emanuele Palazzetti

May 24, 2014

Transcript

  1. WEB SITES TO JSON DATA IN 30 MINUTES WITH SCRAPY

    (Blame me if it takes more time!)
  2. Before we begin, I want to tell you a story

    An unhappy story of me searching for a "magic" framework: SCRAPY
  3. (During an ordinary day) Everything goes right... ...until a customer

    knocks on your door... ...he wants to talk about a really important project...
  4. Google Custom Search

    search everything you need
    JSON/XML/... data extraction
    save all the results in your (web) application
    Ok, we have a deal!
  5. I found myself in a maelstrom

    Me: "Ok. What's your plan?" (I'll never say this again)
  6. In this situation you should say what is good and

    what is bad, especially when we talk about ideas, and especially when we talk about "surf"
  7. Customer requirements:

    gather information from the customers' websites
    but we don't have the customers' websites
    we have VAT numbers, phone numbers, and other heterogeneous data
    use the standard Google search page
    follow the first N links
    grab (steal) emails!
  8. $ tree .

    .
    ├── harvester
    │   ├── __init__.py
    │   ├── items.py        # Item class
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders         # Spiders class folder
    │       └── __init__.py
    └── scrapy.cfg
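
    This is the standard project skeleton that Scrapy's startproject command generates; assuming the project is called "harvester", something like:

    $ scrapy startproject harvester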
  9. Item

    All web pages contain unstructured data
    The Item class is a container used to collect scraped data
    It provides a dictionary-like API
    Class attributes are Field() objects
  10. CompanyItem

    from scrapy.item import Item, Field

    class CompanyItem(Item):
        link = Field()
        emails = Field()

    Just a note:

    >>> company['name'] = 'Evonove'  # setting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'CompanyItem does not support field: name'
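
    For completeness, the dictionary-like API works the obvious way for declared fields (the values below are just placeholders):

    >>> company = CompanyItem()
    >>> company['link'] = 'http://example.com'
    >>> company['emails'] = ['info@example.com']
    >>> company['link']
    'http://example.com'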
  11. Spider

    defines the custom behavior during scraping:
    how the Spider follows links
    how the Spider finds unstructured data through Selectors
    how to populate the Item() object
  12. Parse the response!

    def parse(self, response):
        selector = Selector(response)
        # Use regex to find valid emails
        emails = selector.re(self.EMAIL_REGEX)
        if emails:
            item = CompanyItem()
            item['link'] = response.url
            item['emails'] = emails
            return item
        return

    Easy thanks to Selector!
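
    The EMAIL_REGEX class attribute is never shown in the deck; a hypothetical pattern that works with Selector.re() could be:

    # Hypothetical pattern (not in the original slides): matches plain
    # "user@domain.tld" addresses anywhere in the page text
    EMAIL_REGEX = r'[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}'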
  13. We don't have the customers' websites

    def __init__(self, filename=None, *args, **kwargs):
        super(SpiderBot, self).__init__(*args, **kwargs)
        # Get start urls from file
        self.start_urls = search_urls_from_file(filename) if filename else []

    search_urls_from_file builds the start URLs from the template:
    https://www.google.it/search?q={}
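
    The helper itself is not shown in the deck; a minimal sketch, assuming the input file holds one search term (VAT number, phone number, company name, ...) per line:

    # Hypothetical implementation of the helper referenced on the slide
    from urllib import quote_plus  # urllib.parse.quote_plus on Python 3

    GOOGLE_SEARCH = 'https://www.google.it/search?q={}'

    def search_urls_from_file(filename):
        # Read one search term per line and build a Google search URL for each
        with open(filename) as terms:
            return [GOOGLE_SEARCH.format(quote_plus(line.strip()))
                    for line in terms if line.strip()]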
  14. What is missing?

    scrape the Google search page!
    find all results
    grab the first 5 links
    follow them
    scrape the resulting pages!

    Did I tell you Scrapy is batteries included?
  15. Ok, maybe not *all* links

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
            ),
        )
  16. And only the first five

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]
  17. Oh yes... I should parse all responses!

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
                callback='parse_item',  # parse() is now parse_item()
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]
  18. Customer's spider!

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
                callback='parse_item',
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]

        def __init__(self, filename=None, *args, **kwargs):
            super(SpiderBot, self).__init__(*args, **kwargs)
            # Get start urls from file
            self.start_urls = urls_from_file(filename) if filename else []

        def parse_item(self, response):
            selector = Selector(response)
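
    To launch the crawl from the project root, assuming the input file is called customers.txt (the file name is made up), Scrapy's built-in feed exports can already write the items as JSON (older Scrapy versions may also need -t json to force the output format):

    $ scrapy crawl bot -a filename=customers.txt -o companies.json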
  19. [...] TO JSON DATA [...]

    We need to collect the scraped CompanyItem objects
    We need to serialize each CompanyItem
    We need to write the serialized data
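
    This is the job of an item pipeline. The deck does not show the pipeline code; a minimal sketch of a pipelines.py class that writes every scraped item as one JSON object per line (the class name and output file are assumptions) could be:

    import json

    class JsonWriterPipeline(object):

        def open_spider(self, spider):
            # Called when the spider starts: open the output file once
            self.file = open('companies.jl', 'w')

        def close_spider(self, spider):
            # Called when the spider finishes: close the output file
            self.file.close()

        def process_item(self, item, spider):
            # Serialize the scraped CompanyItem and append it as a JSON line
            self.file.write(json.dumps(dict(item)) + '\n')
            return item

    The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.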
  20. (At the end of an ordinary day) The customer is

    happy... ...he has all the data he wants...
  21. ...and I need to get my invoice paid before Google

    blacklists his IPs! Thanks for your attention!