
Scrapy: it GETs the web

Scrapy lets you pull data out of the web straightforwardly. It helps you retry if a site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads pages asynchronously, with high performance. You program to a simple model, and it works well for web APIs, too.

Talk info: https://us.pycon.org/2013/schedule/presentation/135/

Full slides and code: https://github.com/paulproteus/scrapy-slides-sphinx

PyCon 2013

March 16, 2013

Transcript

  1. Part 0. My history with scraping

     2004: Taught 3-week mini-class on mechanize + BeautifulSoup
     2009: "Volunteer opportunity finder" within OpenHatch
     2011: vidscraper. multiprocessing? gevent?
     2012: oh-bugimporters rewrite w/ Scrapy

  2. Part I. Scraping is easy

     Download with urllib2, or requests, or mechanize, or ...
     Examine with browser inspectors
     Parse pages with lxml/BeautifulSoup
     Select with XPath or CSS selectors

  3. Part II. Rewriting some non-scrapy code

     Task: Get a list of speakers

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'

  4. Part II. Rewriting some non-scrapy code

     Task: Get a list of speakers

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'

     import requests
     import lxml.html

     data = requests.get(SCHED_PAGE)
     parsed = lxml.html.fromstring(data.content)
     print [x.text_content()
            for x in parsed.cssselect('span.speaker')]

  5. Part II. Rewriting some non-scrapy code

     Task: Get a list of speakers and talk titles

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'

     import requests
     import lxml.html

     data = requests.get(SCHED_PAGE)
     parsed = lxml.html.fromstring(data.content)
     print [x.text_content()
            for x in parsed.cssselect('span.speaker')]

  6. Now capture preso titles

     def store_datum(author, preso_title):
         pass  # actual logic here...

     def main():
         data = requests.get(SCHED_PAGE)
         parsed = lxml.html.fromstring(data.content)
         for speaker_span in parsed.cssselect('span.speaker'):
             text = speaker_span.text_content()
             store_datum(author=text, preso_title=None)  # FIXME: no title yet

  7. Now capture preso titles

     def store_datum(author, preso_title):
         pass  # actual logic here...

     def main():
         data = requests.get(SCHED_PAGE)
         parsed = lxml.html.fromstring(data.content)
         for speaker_span in parsed.cssselect('span.speaker'):
             text = speaker_span.text_content()
             store_datum(author=text, preso_title=None)  # FIXME: no title yet

  8. scrapy.item.Item

     import scrapy.item
     from scrapy.item import Field

     class PyConPreso(scrapy.item.Item):
         author = Field()
         preso = Field()

     # Similar to... {'author': None, 'preso': None}

     >>> p['preso_name'] = 'Asheesh'
     KeyError: 'PyConPreso does not support field: preso_name'

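     For contrast with the KeyError above, a quick interactive sketch of normal
     field access (assuming the PyConPreso item defined on this slide):

     >>> p = PyConPreso()
     >>> p['author'] = 'Asheesh'
     >>> p['author']
     'Asheesh'
     >>> dict(p)
     {'author': 'Asheesh'}
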
  9. Better

     def store_datum(author, preso_title):
         pass  # actual logic here...

     def get_data():
         out_data = []
         # ...
         for speaker_span in parsed.cssselect('span.speaker'):
             preso = None  # FIXME
             text = speaker_span.text_content()
             item = PyConPreso(
                 author=text.strip(),
                 preso=preso)
             out_data.append(item)
         return out_data

  10. Data is complicated

     >>> p['author']
     'Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi'

     Scrapy-ify early on.
     Maybe you'll need multiple HTTP requests.
     Maybe you'll just want testable code.

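     Untangling that combined field is plain string work; a small sketch (the
     split_authors helper is made up here, not part of the deck's code):

     def split_authors(speaker_text):
         # 'A, B, C' -> ['A', 'B', 'C']
         return [name.strip()
                 for name in speaker_text.split(',')
                 if name.strip()]

     >>> split_authors('Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi')
     ['Asheesh Laroia', 'Jessica McKellar', 'Dana Bauer', 'Daniel Choi']
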
  11. scrapy.spider.BaseSpider

     import lxml.html
     from scrapy.spider import BaseSpider

     START_URL = '...'

     class PyConSiteSpider(BaseSpider):
         name = 'Demo'  # matches the log output on the next slide
         start_urls = [START_URL]

         def parse(self, response):
             parsed = lxml.html.fromstring(
                 response.body_as_unicode())
             speakers = parsed.cssselect('span.speaker')
             for speaker in speakers:
                 author = None  # placeholder
                 preso = None  # placeholder
                 yield PyConPreso(
                     author=author, preso=preso)

  12. How you run it

     $ scrapy runspider your_spider.py
     2013-03-12 18:04:07-0700 [Demo] DEBUG: Crawled (200) <GET ...> (referer: None)
     2013-03-12 18:04:07-0700 [Demo] DEBUG: Scraped from <200 ...> {}
     2013-03-12 18:04:07-0700 [Demo] INFO: Closing spider (finished)
     2013-03-12 18:04:07-0700 [Demo] INFO: Dumping spider stats:
         {'downloader/request_bytes': 513,
          'downloader/request_count': 2,
          'downloader/request_method_count/GET': 2,
          'downloader/response_bytes': 75142,
          'downloader/response_count': 2,
          'downloader/response_status_count/200': 1,
          'downloader/response_status_count/301': 1,
          'finish_reason': 'finished',
          'finish_time': datetime.datetime(2013, 3, 13, 1, 4, 7, 567078),
          'item_scraped_count': 1,
          'scheduler/memory_enqueued': 2,
          'start_time': datetime.datetime(2013, 3, 13, 1, 4, 5, 144944)}
     2013-03-12 18:04:07-0700 [Demo] INFO: Spider closed (finished)

  13. Scrapy wants you to make a project

     $ scrapy startproject tutorial

     creates

     tutorial/
         scrapy.cfg
         tutorial/
             __init__.py
             items.py
             pipelines.py
             settings.py
             spiders/
                 __init__.py

  14. Awesome features...

     telnet localhost 6023

     Gives

     >>> est()
     Execution engine status

     time()-engine.start_time : 21.3188259602
     engine.is_idle()         : False
     …

  15. Awesome features...

     telnet localhost 6023

     Gives

     >>> est()
     Execution engine status

     time()-engine.start_time : 21.3188259602
     engine.is_idle()         : False
     …

     >>> import os; os.system('eject')
     0
     >>> # Hmm.

  16. If you're not done, say so

     def parse(self, response):
         # do some work...
         req = Request(new_url)
         yield req

  17. If you're not done, say so

     def parse(self, response):
         # do some work...
         req = Request(new_url, callback=self.next_page_handler)
         yield req

     def next_page_handler(self, response):
         # do some work...
         yield Item()

  18. If you're not done, say so

     def parse(self, response):
         # do some work...
         req = Request(new_url, callback=self.next_page_handler)
         req.meta['data'] = 'to keep around'
         yield req

     def next_page_handler(self, response):
         data = response.meta['data']  # pull data out
         # do some work...
         yield Item()

  19. Performance

     Crawl 500 projects' bug trackers: 26 hours
     Add multiprocessing: +1-10 MB * N workers
     After Scrapy: N=200 simultaneous requests, 1 hour 10 min

  20. Part V. Testing

     class PyConSiteSpider(BaseSpider):
         def parse(self, response):
             # ...
             for speaker in speakers:
                 # ...
                 yield PyConPreso(
                     author=author, preso=preso)

  21. Part V. Testing

     class PyConSiteSpider(BaseSpider):
         def parse(self, response):
             # ...
             for speaker in speakers:
                 # ...
                 yield PyConPreso(
                     author=author, preso=preso)

     test:

     def test_spider():
         resp = HtmlResponse(url='',
                             body=open('saved-data.html').read())
         spidey = PyConSiteSpider()
         expected = [PyConPreso(author=a, preso=b), ...]
         items = list(spidey.parse(resp))
         assert items == expected

  22. More testing

     def test_spider(self):
         spidey = PyConSiteSpider()
         request_iterable = spidey.start_requests()
         url2filename = {'http://example.com/':
                         'path/to/sample.html'}
         expected = [...]
         ar = autoresponse.Autoresponder(
             url2filename=url2filename,
             url2errors={})
         items = ar.respond_recursively(request_iterable)
         self.assertEqual(expected, items)

  23. A setting for everything

     settings.USER_AGENT
     settings.CONCURRENT_REQUESTS_PER_DOMAIN (= e.g. 1)
     settings.CONCURRENT_REQUESTS (= e.g. 800)
     settings.RETRY_ENABLED (= True by default)
     settings.RETRY_TIMES
     settings.RETRY_HTTP_CODES

     Great intro-to-scraping docs

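     A sketch of where these go: a project's settings.py. The concurrency
     numbers are the slide's examples; the user-agent string and retry values
     are invented for illustration.

     # settings.py -- illustrative values
     USER_AGENT = 'pycon-demo-bot (+http://example.com/contact)'
     CONCURRENT_REQUESTS = 800             # global cap on in-flight requests
     CONCURRENT_REQUESTS_PER_DOMAIN = 1    # be gentle to any single site
     RETRY_ENABLED = True                  # on by default
     RETRY_TIMES = 2                       # extra attempts per failed request
     RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
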
  24. JavaScript

     import spidermonkey

     def parse(self, response):
         # to get a tag... (doc is the parsed lxml document)
         script_content = doc.xpath('//script')[0].text_content()
         # to run the JS...
         r = spidermonkey.Runtime()
         cx = r.new_context()
         n = cx.eval_script("1 + 2") + 3  # n == 6

  25. JavaScript

     import spidermonkey

     def parse(self, response):
         script_content = doc.xpath('//script')[0].text_content()  # get tag
         r = spidermonkey.Runtime()
         cx = r.new_context()
         n = cx.eval_script(script_content)  # execute script

     import selenium

     class MySpider(BaseSpider):
         def __init__(self):
             self.browser = selenium.selenium(...)  # configure
             self.browser.start()  # synchronously launch

         def parse(self, response):
             self.browser.open(response.url)  # GET by browser
             self.browser.select('//ul')  # in-browser XPath

  26. Django

     # in scrapy items.py
     from scrapy.contrib.djangoitem import DjangoItem
     from myapp.models import Poll

     class PollItem(DjangoItem):
         django_model = Poll

  27. Django

     # in scrapy items.py
     from scrapy.contrib.djangoitem import DjangoItem
     from myapp.models import Poll

     class PollItem(DjangoItem):
         django_model = Poll

     # in scrapy pipelines.py
     class PollPipeline(object):
         def process_item(self, item, spider):
             item.save()
             return item  # pipelines hand the item on

  28. Django

     # in scrapy items.py
     from scrapy.contrib.djangoitem import DjangoItem
     from myapp.models import Poll

     class PollItem(DjangoItem):
         django_model = Poll

     # in scrapy pipelines.py
     class PollPipeline(object):
         def process_item(self, item, spider):
             item.save()
             return item  # pipelines hand the item on

     Or just write a Django management command to deal with the JSON.

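     A minimal sketch of that alternative, assuming the spider's output was
     exported to a polls.json feed via Scrapy's feed exports and that the JSON
     keys match Poll's field names (both assumptions, not from the deck):

     # myapp/management/commands/import_polls.py
     import json

     from django.core.management.base import BaseCommand
     from myapp.models import Poll

     class Command(BaseCommand):
         help = 'Load scraped items from a Scrapy JSON export'

         def handle(self, *args, **options):
             # polls.json: a list of dicts written by the JSON feed exporter
             with open('polls.json') as f:
                 items = json.load(f)
             for item in items:
                 # assumes item keys line up with Poll's model fields
                 Poll.objects.create(**item)
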