From website to JSON data in 30 minutes with Scrapy

Scrapy is an open source framework for easily extracting data from websites. Built for web scraping, it can also be used for other purposes such as crawling, monitoring, and automated web application testing.

After an introduction to its major features, a real use case shows how to build a fully functional crawling system that scrapes web pages and stores the unstructured data gathered from them.

Emanuele Palazzetti

May 24, 2014

Transcript

  1. WEB SITES TO JSON DATA IN 30 MINUTES WITH SCRAPY

    (Blame me if it takes more time!)
  2. Before we begin, I want to tell you a story

    An unhappy story of me searching for a "magic" framework: SCRAPY
  3. (During an ordinary day) Everything goes right... ...until a customer

    knocks on your door... ...he wants to talk about a really important project...
  4. Google Custom Search

    search everything you need
    JSON/XML/... data extraction
    save all the results in your (web) application
    Ok, we have a deal!
  5. I found myself in a maelstrom

    Me: "Ok. What's your plan?" (I'll never say this again)
  6. In this situation you should say what is good and

    what is bad, especially when we talk about ideas, and especially when we talk about "surf"
  7. Customer requirements:

    gather information from the customers' websites
    but we don't have the customers' websites
    we have VAT numbers, phone numbers, and other heterogeneous data
    use the standard Google search page
    follow the first N links
    grab (steal) emails!
  8. $ tree .

    .
    ├── harvester
    │   ├── __init__.py
    │   ├── items.py        # Item class
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders         # Spiders class folder
    │       └── __init__.py
    └── scrapy.cfg
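
    This is the standard project skeleton that Scrapy's startproject command generates; assuming the project is called "harvester", something like:

    $ scrapy startproject harvester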
  9. Item

    All web pages contain unstructured data
    The Item class is a container used to collect scraped data
    It provides a dictionary-like API
    Class attributes are Field() objects
  10. CompanyItem

    from scrapy.item import Item, Field

    class CompanyItem(Item):
        link = Field()
        emails = Field()

    Just a note:

    >>> company['name'] = 'Evonove'  # setting unknown field
    Traceback (most recent call last):
        ...
    KeyError: 'CompanyItem does not support field: name'
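
    For completeness, the dictionary-like API works the obvious way for declared fields (the values below are just placeholders):

    >>> company = CompanyItem()
    >>> company['link'] = 'http://example.com'
    >>> company['emails'] = ['info@example.com']
    >>> company['link']
    'http://example.com'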
  11. Spider

    defines the custom behavior during scraping:
    how the Spider follows links
    how the Spider finds unstructured data through Selectors
    how to populate the Item() object
  12. Parse the response!

    def parse(self, response):
        selector = Selector(response)
        # Use regex to find valid emails
        emails = selector.re(self.EMAIL_REGEX)
        if emails:
            item = CompanyItem()
            item['link'] = response.url
            item['emails'] = emails
            return item
        return

    Easy thanks to Selector!
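
    The EMAIL_REGEX class attribute is never shown in the deck; a hypothetical pattern that works with Selector.re() could be:

    # Hypothetical pattern (not in the original slides): matches plain
    # "user@domain.tld" addresses anywhere in the page text
    EMAIL_REGEX = r'[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}'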
  13. We don't have the customers' websites

    def __init__(self, filename=None, *args, **kwargs):
        super(SpiderBot, self).__init__(*args, **kwargs)
        # Get start urls from file
        self.start_urls = search_urls_from_file(filename) if filename else []

    search_urls_from_file builds the start URLs from the template:
    https://www.google.it/search?q={}
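
    The helper itself is not shown in the deck; a minimal sketch, assuming the input file holds one search term (VAT number, phone number, company name, ...) per line:

    # Hypothetical implementation of the helper referenced on the slide
    from urllib import quote_plus  # urllib.parse.quote_plus on Python 3

    GOOGLE_SEARCH = 'https://www.google.it/search?q={}'

    def search_urls_from_file(filename):
        # Read one search term per line and build a Google search URL for each
        with open(filename) as terms:
            return [GOOGLE_SEARCH.format(quote_plus(line.strip()))
                    for line in terms if line.strip()]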
  14. What is missing?

    scrape the Google search page!
    find all results
    grab the first 5 links
    follow them
    scrape the resulting pages!

    Did I tell you Scrapy is batteries included?
  15. Ok, maybe not *all* links

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
            ),
        )
  16. And only the first five

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]
  17. Oh yes... I should parse all responses!

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
                callback='parse_item',  # parse() is now parse_item()
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]
  18. Customer's spider!

    class SpiderBot(CrawlSpider):
        name = 'bot'
        rules = (
            Rule(
                SgmlLinkExtractor(restrict_xpaths=('//li//h3',)),
                process_links='filter_links',
                callback='parse_item',
            ),
        )

        def filter_links(self, links):
            # Get only the first 5 links, if available
            return links[0:5]

        def __init__(self, filename=None, *args, **kwargs):
            super(SpiderBot, self).__init__(*args, **kwargs)
            # Get start urls from file
            self.start_urls = urls_from_file(filename) if filename else []

        def parse_item(self, response):
            selector = Selector(response)
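
    To launch the crawl from the project root, assuming the input file is called customers.txt (the file name is made up), Scrapy's built-in feed exports can already write the items as JSON (older Scrapy versions may also need -t json to force the output format):

    $ scrapy crawl bot -a filename=customers.txt -o companies.json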
  19. [...] TO JSON DATA [...]

    We need to collect the scraped CompanyItem objects
    We need to serialize each CompanyItem
    We need to write the serialized data
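
    This is the job of an item pipeline. The deck does not show the pipeline code; a minimal sketch of a pipelines.py class that writes every scraped item as one JSON object per line (the class name and output file are assumptions) could be:

    import json

    class JsonWriterPipeline(object):

        def open_spider(self, spider):
            # Called when the spider starts: open the output file once
            self.file = open('companies.jl', 'w')

        def close_spider(self, spider):
            # Called when the spider finishes: close the output file
            self.file.close()

        def process_item(self, item, spider):
            # Serialize the scraped CompanyItem and append it as a JSON line
            self.file.write(json.dumps(dict(item)) + '\n')
            return item

    The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py.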
  20. (At the end of an ordinary day) The customer is

    happy... ...he has all the data he wants...
  21. ...and I need to get my invoice paid before Google

    blacklists his IPs! Thanks for your attention!