
Scrapy: it GETs the web

Scrapy lets you straightforwardly pull data out of the web. It helps you retry if the site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads asynchronously with high performance. You program to a simple model, and it's good for web APIs, too.

Talk info: https://us.pycon.org/2013/schedule/presentation/135/

Full slides and code: https://github.com/paulproteus/scrapy-slides-sphinx

PyCon 2013

March 16, 2013

Transcript

  1. Scrapy: it GETs the Web
     Asheesh Laroia
     asheesh@asheesh.org

  2. Part 0. My history with scraping
     2004: Taught 3-week mini-class on mechanize + BeautifulSoup
     2009: "Volunteer opportunity finder" within OpenHatch
     2011: vidscraper. multiprocessing? gevent?
     2012: oh-bugimporters rewrite w/ Scrapy

  3. Part I. Scraping is easy
     Download with urllib2, or requests, or mechanize, or ...
     Examine with browser inspectors
     Parse pages with lxml/BeautifulSoup
     Select with XPath or CSS selectors

  4. Part II. Rewriting some non-scrapy code
     Task: Get a list of speakers

  5. Part II. Rewriting some non-scrapy code
     Task: Get a list of speakers

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'

  6. Part II. Rewriting some non-scrapy code
     Task: Get a list of speakers

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'
     import requests
     import lxml.html

     data = requests.get(SCHED_PAGE)
     parsed = lxml.html.fromstring(data.content)
     print [x.text_content() for x in parsed.cssselect('span.speaker')]

  7. Part II. Rewriting some non-scrapy code
     Task: Get a list of speakers and talk titles

     SCHED_PAGE = 'https://us.pycon.org/2013/schedule/'
     import requests
     import lxml.html

     data = requests.get(SCHED_PAGE)
     parsed = lxml.html.fromstring(data.content)
     print [x.text_content() for x in parsed.cssselect('span.speaker')]

  8. Now capture preso titles
     You could:

     def store_datum(author, preso_title):
         pass  # actual logic here...

  9. Now capture preso titles
     def store_datum(author, preso_title):
         pass  # actual logic here...

     def main():
         data = requests.get(SCHED_PAGE)
         parsed = lxml.html.fromstring(data.content)
         for speaker_span in parsed.cssselect('span.speaker'):
             text = speaker_span.text_content()
             store_datum(author, preso_title)  # FIXME: author and preso_title aren't defined yet

  10. Now capture preso titles
      def store_datum(author, preso_title):
          pass  # actual logic here...

      def main():
          data = requests.get(SCHED_PAGE)
          parsed = lxml.html.fromstring(data.content)
          for speaker_span in parsed.cssselect('span.speaker'):
              text = speaker_span.text_content()
              store_datum(author, preso_title)  # FIXME: author and preso_title aren't defined yet

  11. scrapy.item.Item
      class PyConPreso(scrapy.item.Item):
          author = Field()
          preso = Field()

  12. scrapy.item.Item
      class PyConPreso(scrapy.item.Item):
          author = Field()
          preso = Field()

      # Similar to...
      # {'author': None, 'preso': None}

  13. scrapy.item.Item
      class PyConPreso(scrapy.item.Item):
          author = Field()
          preso = Field()

      # Similar to...
      # {'author': None, 'preso': None}

      >>> p['preso_name'] = 'Asheesh'
      KeyError: 'PyConPreso does not support field: preso_name'

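      Spelled out with its imports, the same Item might look roughly like this
      (a sketch assuming the 2013-era scrapy.item module; the sample values are
      illustrative):

      from scrapy.item import Item, Field

      class PyConPreso(Item):
          author = Field()
          preso = Field()

      p = PyConPreso(author='Asheesh Laroia', preso='Scrapy: it GETs the Web')
      print p['author']        # dict-style access
      # p['preso_name'] = 'x'  # -> KeyError: unsupported field
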
  14. Better
      def store_datum(author, preso_title):
          pass  # actual logic here...

      def get_data():
          # ...
          for speaker_span in parsed.cssselect('span.speaker'):
              preso = None  # FIXME
              text = speaker_span.text_content()
              item = PyConPreso(
                  author=text.strip(),
                  preso=preso)
              out_data.append(item)
          return out_data

  15. Data is complicated
      >>> p.author
      'Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi'

  16. Data is complicated
      >>> p.author
      'Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi'

      Scrapy-ify early on. Maybe you'll need multiple HTTP requests.
      Maybe you'll just want testable code.

  17. scrapy.spider.BaseSpider
      import lxml.html

      START_URL = '...'

      class PyConSiteSpider(BaseSpider):
          start_urls = [START_URL]

          def parse(self, response):
              parsed = lxml.html.fromstring(
                  response.body_as_unicode())
              speakers = parsed.cssselect('span.speaker')
              for speaker in speakers:
                  author = None  # placeholder
                  preso = None   # placeholder
                  yield PyConPreso(
                      author=author, preso=preso)

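      Filled out with its imports and the Item from earlier, a runnable version of
      that spider might look like the sketch below (it assumes Scrapy 0.x-era APIs
      such as scrapy.spider.BaseSpider; the spider name is illustrative):

      import lxml.html
      from scrapy.spider import BaseSpider
      from scrapy.item import Item, Field

      START_URL = 'https://us.pycon.org/2013/schedule/'

      class PyConPreso(Item):
          author = Field()
          preso = Field()

      class PyConSiteSpider(BaseSpider):
          name = 'pycon-site'
          start_urls = [START_URL]

          def parse(self, response):
              parsed = lxml.html.fromstring(response.body_as_unicode())
              for speaker in parsed.cssselect('span.speaker'):
                  # real code would also pull out the talk title
                  yield PyConPreso(author=speaker.text_content().strip(),
                                   preso=None)
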
  18. How you run it
      $ scrapy runspider your_spider.py

  19. How you run it
      $ scrapy runspider your_spider.py
      2013-03-12 18:04:07-0700 [Demo] DEBUG: Crawled (200) <GET ...> (referer: None)
      2013-03-12 18:04:07-0700 [Demo] DEBUG: Scraped from <200 ...>
          {}
      2013-03-12 18:04:07-0700 [Demo] INFO: Closing spider (finished)
      2013-03-12 18:04:07-0700 [Demo] INFO: Dumping spider stats:
          {'downloader/request_bytes': 513,
           'downloader/request_count': 2,
           'downloader/request_method_count/GET': 2,
           'downloader/response_bytes': 75142,
           'downloader/response_count': 2,
           'downloader/response_status_count/200': 1,
           'downloader/response_status_count/301': 1,
           'finish_reason': 'finished',
           'finish_time': datetime.datetime(2013, 3, 13, 1, 4, 7, 567078),
           'item_scraped_count': 1,
           'scheduler/memory_enqueued': 2,
           'start_time': datetime.datetime(2013, 3, 13, 1, 4, 5, 144944)}
      2013-03-12 18:04:07-0700 [Demo] INFO: Spider closed (finished)

  20. How you run it
      $ scrapy runspider your_spider.py -L ERROR
      $

  21. Customizing output
      $ scrapy runspider your_spider.py -s FEED_URI=myfile.out
      $

  22. None
  23. None
  24. Part III. An aside about Scrapy
      >>> 'Pablo Hoffman' > 'Asheesh Laroia'
      True

  25. Part III. An aside about Scrapy
      Scrapy: 9,000

  26. Part III. An aside about Scrapy
      Scrapy: 9,000
      Mechanize: 20,000

  27. Part III. An aside about Scrapy
      Scrapy: 9,000
      Mechanize: 20,000
      Requests: 475,000

  28. Scrapy wants you to make a project
      $ scrapy startproject tutorial
      creates:
      tutorial/
          scrapy.cfg
          tutorial/
              __init__.py
              items.py
              pipelines.py
              settings.py
              spiders/
                  __init__.py

  29. Awesome features

  30. Awesome features... telnet localhost 6023

  31. Awesome features... telnet localhost 6023
      Gives:
      >>> est()
      Execution engine status
      time()-engine.start_time : 21.3188259602
      engine.is_idle()         : False
      …

  32. Awesome features... telnet localhost 6023
      Gives:
      >>> est()
      Execution engine status
      time()-engine.start_time : 21.3188259602
      engine.is_idle()         : False
      …
      >>> import os; os.system('eject')
      0
      >>> # Hmm.

  33. Awesome features...
      $ scrapy runspider your_spider.py -s TELNETCONSOLE_ENABLED=0 -s WEBSERVICE_ENABLED=0

  34. Awesome features... Semi-complex integration with other pieces of code.

  35. Part IV. Async

  36. If you're not done, say so
      def parse(response):
          # do some work...

  37. If you're not done, say so
      def parse(response):
          # do some work...
          req = Request(new_url)
          yield req

  38. If you're not done, say so
      def parse(response):
          # do some work...
          req = Request(new_url, callback=next_page_handler)
          yield req

      def next_page_handler(response):
          # do some work...
          yield Item()

  39. If you're not done, say so
      def parse(response):
          # do some work...
          req = Request(new_url, callback=next_page_handler)
          req.meta['data'] = 'to keep around'
          yield req

      def next_page_handler(response):
          data = response.meta['data']  # pull data out
          # do some work...
          yield Item()

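      Put together inside a spider, the two-step pattern might look like this
      sketch (Request comes from scrapy.http; the URLs, the 'data' meta key and
      the PageItem class are illustrative, not from the deck):

      from scrapy.http import Request
      from scrapy.item import Item, Field
      from scrapy.spider import BaseSpider

      class PageItem(Item):
          data = Field()

      class TwoStepSpider(BaseSpider):
          name = 'two-step'
          start_urls = ['http://example.com/']

          def parse(self, response):
              # first page done; ask for one more page and stash state on the request
              req = Request('http://example.com/detail',
                            callback=self.next_page_handler)
              req.meta['data'] = 'to keep around'
              yield req

          def next_page_handler(self, response):
              # the stashed value rides along on response.meta
              yield PageItem(data=response.meta['data'])
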
  40. Performance
      Crawl 500 projects' bug trackers: 26 hours

  41. Performance
      Crawl 500 projects' bug trackers: 26 hours
      Add multiprocessing: +1-10 MB * N workers

  42. Performance
      Crawl 500 projects' bug trackers: 26 hours
      Add multiprocessing: +1-10 MB * N workers
      After Scrapy: N=200 simultaneous requests, 1 hour 10 min

  43. Part V. Testing
      class PyConSiteSpider(BaseSpider):
          def parse(self, response):
              # ...
              for speaker in speakers:
                  # ...
                  yield PyConPreso(
                      author=author, preso=preso)

  44. Part V. Testing
      class PyConSiteSpider(BaseSpider):
          def parse(self, response):
              # ...
              for speaker in speakers:
                  # ...
                  yield PyConPreso(
                      author=author, preso=preso)

      test:
      def test_spider():
          resp = HtmlResponse(url='',
              body=open('saved-data.html').read())
          spidey = PyConSiteSpider()
          expected = [PyConPreso(author=a, preso=b), ...]
          items = list(spidey.parse(resp))
          assert items == expected

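      With the imports written out, that test might look like the following sketch
      (HtmlResponse lives in scrapy.http; 'saved-data.html' is a fixture you save
      yourself, and your_spider is whatever module holds the spider):

      from scrapy.http import HtmlResponse
      from your_spider import PyConSiteSpider, PyConPreso  # illustrative module name

      def test_spider():
          body = open('saved-data.html').read()
          resp = HtmlResponse(url='https://us.pycon.org/2013/schedule/', body=body)
          spidey = PyConSiteSpider()
          expected = [PyConPreso(author='...', preso='...')]  # fill in from the fixture
          items = list(spidey.parse(resp))
          assert items == expected
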
  45. More testing
      def test_spider(self):
          spidey = PyConSiteSpider()
          request_iterable = spidey.start_requests()
          url2filename = {'http://example.com/': 'path/to/sample.html'}
          expected = [...]
          ar = autoresponse.Autoresponder(
              url2filename=url2filename,
              url2errors={})
          items = ar.respond_recursively(request_iterable)
          self.assertEqual(expected, items)

  46. Part VI. Wacky tricks

  47. A setting for everything
      settings.USER_AGENT
      settings.CONCURRENT_REQUESTS_PER_DOMAIN (= e.g. 1)
      settings.CONCURRENT_REQUESTS (= e.g. 800)
      settings.RETRY_ENABLED (= True by default)
      settings.RETRY_TIMES
      settings.RETRY_HTTP_CODES
      Great intro-to-scraping docs

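      The same settings can be passed with -s on the command line or kept in the
      project's settings.py; a sketch of the latter (the values are examples, not
      recommendations):

      # settings.py
      USER_AGENT = 'my-crawler (contact: you@example.com)'
      CONCURRENT_REQUESTS = 800
      CONCURRENT_REQUESTS_PER_DOMAIN = 1
      RETRY_ENABLED = True   # the default
      RETRY_TIMES = 5
      RETRY_HTTP_CODES = [500, 502, 503, 504]
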
  48. JavaScript
      import spidermonkey

      def parse(self, response):
          # to get a tag...
          script_content = doc.xpath('//script')[0].text_content()
          # to run the JS...
          r = spidermonkey.Runtime()
          ctx = r.new_context()
          n = ctx.eval_script("1 + 2") + 3  # n == 6

  49. JavaScript
      import spidermonkey

      def parse(self, response):
          script_content = doc.xpath('//script')[0].text_content()  # get tag
          r = spidermonkey.Runtime()
          ctx = r.new_context()
          n = ctx.eval_script(script_content)  # execute script

      import selenium

      class MySpider(BaseSpider):
          def __init__(self):
              self.browser = selenium.selenium(...)  # configure
              self.browser.start()  # synchronously launch

          def parse(self, response):
              self.browser.open(response.url)  # GET by browser
              self.browser.select('//ul')  # in-browser XPath

  50. Django
      from scrapy.contrib.djangoitem import DjangoItem

  51. Django
      from scrapy.contrib.djangoitem import DjangoItem
      from myapp.models import Poll

      # in scrapy items.py
      class PollItem(DjangoItem):
          django_model = Poll

  52. Django
      from scrapy.contrib.djangoitem import DjangoItem
      from myapp.models import Poll

      # in scrapy items.py
      class PollItem(DjangoItem):
          django_model = Poll

      # in scrapy pipelines.py
      class PollPipeline(object):
          def process_item(self, item, spider):
              item.save()

  53. Django
      from scrapy.contrib.djangoitem import DjangoItem
      from myapp.models import Poll

      # in scrapy items.py
      class PollItem(DjangoItem):
          django_model = Poll

      # in scrapy pipelines.py
      class PollPipeline(object):
          def process_item(self, item, spider):
              item.save()

      Or just write a Django management command to deal with the JSON.

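      The management-command route is simple too; a sketch, assuming the crawl was
      exported as a JSON feed (e.g. FEED_URI=items.json) and that the scraped field
      names line up with the Poll model's fields (the command and file names are
      illustrative):

      # myapp/management/commands/import_scraped.py
      import json

      from django.core.management.base import BaseCommand
      from myapp.models import Poll

      class Command(BaseCommand):
          help = 'Load items from a Scrapy JSON feed export'

          def handle(self, *args, **options):
              with open('items.json') as f:
                  for item in json.load(f):
                      # assumes scraped field names match model fields
                      Poll.objects.create(**item)
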
  54. Best-case integration
      Leave your HTTP to Scrapy.
      Impatient? Item Pipeline.
      Patient? Feed Exporter.

  55. Twisted minus Twisted