Scrapy: it GETs the web

Scrapy makes it straightforward to pull data out of the web. It helps you retry if a site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads pages asynchronously, with high performance. You program to a simple model, and it's good for web APIs, too.

Talk info: https://us.pycon.org/2013/schedule/presentation/135/

Full slides and code: https://github.com/paulproteus/scrapy-slides-sphinx

PyCon 2013

March 16, 2013

Transcript

  1. Scrapy: it GETs the Web
    Asheesh Laroia
    [email protected]

  2. Part 0. My history with scraping
    2004: Taught 3-week mini-class on mechanize + BeautifulSoup
    2009: "Volunteer opportunity finder" within OpenHatch
    2011: vidscraper. multiprocessing? gevent?
    2012: oh-bugimporters rewrite w/ Scrapy

  3. Part I. Scraping is easy
    Download with urllib2, or requests, or mechanize, or ...
    Examine with browser inspectors
    Parse pages with lxml/BeautifulSoup
    Select with XPath or CSS selectors

  4. Part II. Rewriting some non-scrapy code
    Task: Get a list of speakers

  5. Part II. Rewriting some non-scrapy code
    Task: Get a list of speakers
    SCHED_PAGE='https://us.pycon.org/2013/schedule/'

  6. Part II. Rewriting some non-scrapy code
    Task: Get a list of speakers
    SCHED_PAGE='https://us.pycon.org/2013/schedule/'
    import requests
    import lxml.html
    data = requests.get(SCHED_PAGE)
    parsed = lxml.html.fromstring(data.content)
    print [x.text_content()
           for x in parsed.cssselect('span.speaker')]

  7. Part II. Rewriting some non-scrapy code
    Task: Get a list of speakers and talk titles
    SCHED_PAGE='https://us.pycon.org/2013/schedule/'
    import requests
    import lxml.html
    data = requests.get(SCHED_PAGE)
    parsed = lxml.html.fromstring(data.content)
    print [x.text_content()
           for x in parsed.cssselect('span.speaker')]

  8. Now capture preso titles
    You could
    def store_datum(author, preso_title):
        pass  # actual logic here...

  9. Now capture preso titles
    def store_datum(author, preso_title):
        pass  # actual logic here...

    def main():
        data = requests.get(SCHED_PAGE)
        parsed = lxml.html.fromstring(data.content)
        for speaker_span in parsed.cssselect('span.speaker'):
            text = speaker_span.text_content()
            store_datum(author, preso_title)  # FIXME: author and preso_title aren't extracted yet

  10. Now capture preso titles
    def store_datum(author, preso_title):
        pass  # actual logic here...

    def main():
        data = requests.get(SCHED_PAGE)
        parsed = lxml.html.fromstring(data.content)
        for speaker_span in parsed.cssselect('span.speaker'):
            text = speaker_span.text_content()
            store_datum(author, preso_title)  # FIXME: author and preso_title aren't extracted yet

  11. scrapy.item.Item
    class PyConPreso(scrapy.item.Item):
        author = Field()
        preso = Field()

  12. scrapy.item.Item
    class PyConPreso(scrapy.item.Item):
        author = Field()
        preso = Field()

    # Similar to...
    {'author': None,
     'preso': None}

  13. scrapy.item.Item
    class PyConPreso(scrapy.item.Item):
        author = Field()
        preso = Field()

    # Similar to...
    {'author': None,
     'preso': None}

    >>> p['preso_name'] = 'Asheesh'
    KeyError: 'PyConPreso does not support field: preso_name'

  14. Better
    def store_datum(author, preso_title):
        pass  # actual logic here...

    def get_data():
        # ...
        for speaker_span in parsed.cssselect('span.speaker'):
            preso = None  # FIXME
            text = speaker_span.text_content()
            item = PyConPreso(
                author=text.strip(),
                preso=preso)
            out_data.append(item)
        return out_data

  15. Data is complicated
    >>> p['author']
    'Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi'

  16. Data is complicated
    >>> p['author']
    'Asheesh Laroia, Jessica McKellar, Dana Bauer, Daniel Choi'
    Scrapy-ify early on.
    Maybe you'll need multiple HTTP requests. Maybe you'll
    just want testable code.

  17. scrapy.spider.BaseSpider
    import lxml.html
    from scrapy.spider import BaseSpider

    START_URL = '...'

    class PyConSiteSpider(BaseSpider):
        start_urls = [START_URL]

        def parse(self, response):
            parsed = lxml.html.fromstring(
                response.body_as_unicode())
            speakers = parsed.cssselect('span.speaker')
            for speaker in speakers:
                author = None  # placeholder
                preso = None  # placeholder
                yield PyConPreso(
                    author=author, preso=preso)

  18. How you run it
    $ scrapy runspider your_spider.py

  19. How you run it
    $ scrapy runspider your_spider.py
    2013-03-12 18:04:07-0700 [Demo] DEBUG: Crawled (200) (referer: None)
    2013-03-12 18:04:07-0700 [Demo] DEBUG: Scraped from <200 ...>
    {}
    2013-03-12 18:04:07-0700 [Demo] INFO: Closing spider (finished)
    2013-03-12 18:04:07-0700 [Demo] INFO: Dumping spider stats:
    {'downloader/request_bytes': 513,
    'downloader/request_count': 2,
    'downloader/request_method_count/GET': 2,
    'downloader/response_bytes': 75142,
    'downloader/response_count': 2,
    'downloader/response_status_count/200': 1,
    'downloader/response_status_count/301': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2013, 3, 13, 1, 4, 7, 567078),
    'item_scraped_count': 1,
    'scheduler/memory_enqueued': 2,
    'start_time': datetime.datetime(2013, 3, 13, 1, 4, 5, 144944)}
    2013-03-12 18:04:07-0700 [Demo] INFO: Spider closed (finished)

  20. How you run it
    $ scrapy runspider your_spider.py -L ERROR
    $

  21. Customizing output
    $ scrapy runspider your_spider.py -s FEED_URI=myfile.out
    $
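
    You can also pick the serialization format with FEED_FORMAT; a small sketch of the
    same run writing JSON (the output file name here is just illustrative):

    $ scrapy runspider your_spider.py -s FEED_URI=presos.json -s FEED_FORMAT=json
    $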

  24. Part III. An aside about Scrapy
    >>> 'Pablo Hoffman' > 'Asheesh Laroia'
    True

  25. Part III. An aside about Scrapy
    Scrapy: 9,000

  26. Part III. An aside about Scrapy
    Scrapy: 9,000
    Mechanize: 20,000

  27. Part III. An aside about Scrapy
    Scrapy: 9,000
    Mechanize: 20,000
    Requests: 475,000

  28. Scrapy wants you to make a project
    $ scrapy startproject tutorial
    creates
    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
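
    Inside a project, spiders live in tutorial/spiders/ and carry a name attribute, so you
    run them with "scrapy crawl <name>" from the project directory instead of pointing at a
    file. A minimal sketch (the module path and spider name are illustrative):

    # tutorial/spiders/pycon_site.py
    from scrapy.spider import BaseSpider

    class PyConSiteSpider(BaseSpider):
        name = 'pycon_site'
        start_urls = ['https://us.pycon.org/2013/schedule/']

        def parse(self, response):
            pass  # extract and yield items here, as on the earlier slides

    # then, from the project directory:
    # $ scrapy crawl pycon_site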

  29. Awesome features

  30. Awesome features...
    telnet localhost 6023

  31. Awesome features...
    telnet localhost 6023
    Gives
    >>> est()
    Execution engine status
    time()-engine.start_time : 21.3188259602
    engine.is_idle() : False

  32. Awesome features...
    telnet localhost 6023
    Gives
    >>> est()
    Execution engine status
    time()-engine.start_time : 21.3188259602
    engine.is_idle() : False

    >>> import os; os.system('eject')
    0
    >>> # Hmm.

  33. Awesome features...
    $ scrapy runspider your_spider.py \
        -s TELNETCONSOLE_ENABLED=0 \
        -s WEBSERVICE_ENABLED=0

  34. Awesome features...
    Semi-complex integration with other pieces of code.

  35. Part IV. Async

  36. If you're not done, say so
    def parse(response):
        # do some work...

  37. If you're not done, say so
    def parse(response):
        # do some work...
        req = Request(new_url)
        yield req

  38. If you're not done, say so
    def parse(response):
        # do some work...
        req = Request(new_url,
                      callback=next_page_handler)
        yield req

    def next_page_handler(response):
        # do some work...
        yield Item()

  39. If you're not done, say so
    def parse(response):
        # do some work...
        req = Request(new_url,
                      callback=next_page_handler)
        req.meta['data'] = 'to keep around'
        yield req

    def next_page_handler(response):
        data = response.meta['data']  # pull data out
        # do some work...
        yield Item()

  40. Performance
    Crawl 500 projects' bug trackers:
    26 hours

  41. Performance
    Crawl 500 projects' bug trackers:
    26 hours
    Add multiprocessing:
    +1-10 MB * N workers

  42. Performance
    Crawl 500 projects' bug trackers:
    26 hours
    Add multiprocessing:
    +1-10 MB * N workers
    After Scrapy:
    N=200 simultaneous requests
    1 hour 10 min
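
    The knob behind "N=200" is Scrapy's concurrency setting; a sketch of how you might ask
    for that many requests in flight (the spider filename is illustrative):

    $ scrapy runspider bug_spider.py -s CONCURRENT_REQUESTS=200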

  43. Part V. Testing
    class PyConSiteSpider(BaseSpider):
        def parse(self, response):
            # ...
            for speaker in speakers:
                # ...
                yield PyConPreso(
                    author=author, preso=preso)

  44. Part V. Testing
    class PyConSiteSpider(BaseSpider):
        def parse(self, response):
            # ...
            for speaker in speakers:
                # ...
                yield PyConPreso(
                    author=author, preso=preso)

    test:

    def test_spider():
        resp = HtmlResponse(url='',
                            body=open('saved-data.html').read())
        spidey = PyConSiteSpider()
        expected = [PyConPreso(author=a, preso=b), ...]
        items = list(spidey.parse(resp))
        assert items == expected

  45. More testing
    def test_spider(self):
        spidey = PyConSiteSpider()
        request_iterable = spidey.start_requests()
        url2filename = {'http://example.com/':
                        'path/to/sample.html'}
        expected = [...]
        ar = autoresponse.Autoresponder(
            url2filename=url2filename,
            url2errors={})
        items = ar.respond_recursively(request_iterable)
        self.assertEqual(expected, items)

  46. Part VI. Wacky tricks

  47. A setting for everything
    settings.USER_AGENT
    settings.CONCURRENT_REQUESTS_PER_DOMAIN (= e.g. 1)
    settings.CONCURRENT_REQUESTS (= e.g. 800)
    settings.RETRY_ENABLED (= True by default)
    settings.RETRY_TIMES
    settings.RETRY_HTTP_CODES
    Great intro-to-scraping docs
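
    In a project these usually live in settings.py rather than on the command line; a small
    sketch with illustrative values:

    # settings.py (values are illustrative)
    USER_AGENT = 'my-crawler (+http://example.com/about-this-bot)'
    CONCURRENT_REQUESTS = 800
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    RETRY_ENABLED = True
    RETRY_TIMES = 2
    RETRY_HTTP_CODES = [500, 502, 503, 504]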

  48. JavaScript
    import spidermonkey

    def parse(self, response):
        # to get a <script> tag... (doc is the lxml-parsed page)
        script_content = doc.xpath('//script')[0].text_content()
        # to run the JS...
        r = spidermonkey.Runtime()
        ctx = r.new_context()
        n = ctx.eval_script("1 + 2") + 3
        # n == 6

  49. JavaScript
    import spidermonkey

    def parse(self, response):
        script_content = doc.xpath('//script')[0].text_content()  # get tag
        r = spidermonkey.Runtime()
        ctx = r.new_context()
        n = ctx.eval_script(script_content)  # execute script

    import selenium

    class MySpider(BaseSpider):
        def __init__(self):
            self.browser = selenium.selenium(...)  # configure
            self.browser.start()  # synchronously launch

        def parse(self, response):
            self.browser.open(response.url)  # GET by browser
            self.browser.select('//ul')  # in-browser XPath

  50. Django
    from scrapy.contrib.djangoitem import DjangoItem

  51. Django
    from scrapy.contrib.djangoitem import DjangoItem
    from myapp.models import Poll

    # in scrapy items.py
    class PollItem(DjangoItem):
        django_model = Poll

  52. Django
    from scrapy.contrib.djangoitem import DjangoItem
    from myapp.models import Poll

    # in scrapy items.py
    class PollItem(DjangoItem):
        django_model = Poll

    # in scrapy pipelines.py
    class PollPipeline(object):
        def process_item(self, item, spider):
            item.save()
            return item  # pipelines should hand the item onward

  53. Django
    from scrapy.contrib.djangoitem import DjangoItem
    from myapp.models import Poll

    # in scrapy items.py
    class PollItem(DjangoItem):
        django_model = Poll

    # in scrapy pipelines.py
    class PollPipeline(object):
        def process_item(self, item, spider):
            item.save()
            return item  # pipelines should hand the item onward
    Or just write a Django management command to deal
    with the JSON.
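
    A minimal sketch of that management-command route (the file path, command name, and
    field handling are illustrative):

    # myapp/management/commands/import_polls.py
    import json

    from django.core.management.base import BaseCommand
    from myapp.models import Poll

    class Command(BaseCommand):
        help = 'Load polls from a Scrapy JSON feed'

        def handle(self, *args, **options):
            feed_path = args[0]  # path to the file Scrapy exported
            with open(feed_path) as f:
                for row in json.load(f):
                    Poll.objects.get_or_create(**row)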

  54. Best-case integration
    Leave your HTTP to Scrapy.
    Impatient? Item Pipeline.
    Patient? Feed Exporter.

  55. Twisted minus Twisted
