Web Crawling & Metadata Extraction in Python

Web crawling is a hard problem and the web is messy. There is no shortage of semantic web standards -- basically, everyone has one. How do you make sense of the noise across a web of billions of pages?

This talk presents two key technologies that can be used: Scrapy, an open source & scalable web crawling framework, and Mr. Schemato, a new, open source semantic web validator and distiller.

Talk given by Andrew Montalenti, CTO of Parse.ly. See http://parse.ly

Slides were built with reST and S5, and thus are available in raw text form here (quite pleasant to browse): https://raw.github.com/Parsely/python-crawling-slides/master/index.rst

You can also view these slides directly in the browser, using your arrow keys to navigate. http://bit.ly/crawling-slides

Andrew Montalenti

October 27, 2012

Transcript

  1. Web Crawling and Metadata with Python
    Author: Andrew Montalenti
    Date: 2012-10-26

  2. Meta Information
    Me: I've been using Python for 10 years. I use Python full-time, and have for the last 3 years.
    Startup: I'm co-founder/CTO of Parse.ly, a tech startup in the digital media space.
    E-mail me: [email protected]
    Follow me on Twitter: @amontalenti
    Connect on LinkedIn: http://linkedin.com/in/andrewmontalenti

  3. Parse.ly
    What do we do?
    How do we do it?

  4. Meta Slide
    http://bit.ly/crawling-slides

  5. Crawler
    "A computer program that browses the World Wide Web in a methodical, automated
    manner or in an orderly fashion."
    Open source examples:
    • Apache Nutch: built by Doug Cutting, creator of Lucene/Hadoop
    • Heritrix: built by the Internet Archive

  6. Web Data
    40 billion pages on the Web today (Google)
    Growing: the size was "just" 15 billion in October 2010
    The "Deep Web" makes it even bigger

  7. Crawling, Spidering, Scraping
    These terms are almost synonymous, but sometimes carry different meanings and connotations.
    My take (made concrete in the sketch below):
    • Crawling: downloads and processes web content at one or more URLs
    • Spidering: walks links found in web content, either for a single domain or across the web
    • Scraping: uses knowledge of HTML pages to convert them into structured data
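
    To make the distinction concrete, here is a minimal sketch (mine, not from
    the talk) of all three activities, using urllib2 and lxml from the Python 2
    era of these slides:

    import urllib2
    from lxml import html

    def crawl(url):
        # Crawling: download and process the content at a URL.
        return html.fromstring(urllib2.urlopen(url).read())

    def spider(doc):
        # Spidering: walk the links found in the content.
        return doc.xpath("//a/@href")

    def scrape(doc):
        # Scraping: use knowledge of the HTML to produce structured data.
        return {"title": doc.xpath("//title/text()")}

    doc = crawl("http://example.com")
    print spider(doc)
    print scrape(doc)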

  8. My first experience with crawlers
    Parse.ly Reader: a personalized news reader built in mid-2009
    Crawled 500K web sources for content personalized to individual interests
    First crawler was really dumb: based on RSS/Atom detection and feed fetching
    Seeded from domains appearing on top aggregators like Google News
    Technology: multiprocessing, Postgres, Solr

  9. My current experience
    Parse.ly shifted into web publisher analytics and APIs in 2010/2011
    Upgrades: Scrapy, MongoDB, Redis, Solr, Celery

  10. Parse.ly Network Stats
    >200 top publishing domains (Quantcast top-10,000 sites)
    >3B pageviews/month across network
    >10M unique URLs in our index
    >1TB of hot production data running in memory

  11. Parse.ly Crawl Infrastructure
    Have written 125 custom Scrapy crawlers with >10K lines of custom crawler code
    (Not proud of this fact; more on this later)
    Production environment in Rackspace Cloud; several worker nodes
    Implementation and QA runs in Scrapy Cloud
    Caching and retry strategy implemented atop Redis
    Eventual storage in MongoDB, Solr
    Implemented as Scrapy Components and Pipelines
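
    The slides don't show that pipeline code, but a minimal sketch of a storage
    pipeline's general shape, assuming the pymongo 2.x API of the time (the
    class, database, and collection names are hypothetical, not Parse.ly's):

    import pymongo

    class MongoStoragePipeline(object):
        def __init__(self):
            # Hypothetical connection details (pymongo 2.x Connection API).
            conn = pymongo.Connection("localhost", 27017)
            self.collection = conn["crawldb"]["items"]

        def process_item(self, item, spider):
            # Scrapy items behave like dicts, so they can be stored directly.
            self.collection.save(dict(item))
            return item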

  12. Our Business
    We're not in the crawling business.
    We're in the analytics and APIs business.

  13. Our Strategy
    We aim to be the #1 technology partner for large-scale publishers.
    Crawling: means to an end.
    URLs => Structured Metadata.
    Metadata Soup: Schema.org, rNews, OpenGraph, hNews, HTML5, ...
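
    To see why it's a soup: the same headline can be marked up several
    different ways on a single page. A small illustration (the HTML here is
    fabricated) of pulling it out with lxml:

    from lxml import html

    # Fabricated fragment: one headline, three standards.
    page = html.fromstring("""
    <html><head>
      <meta property="og:title" content="Big Story"/>
    </head><body>
      <article itemscope itemtype="http://schema.org/NewsArticle">
        <h1 itemprop="headline">Big Story</h1>
      </article>
    </body></html>""")

    print page.xpath("//meta[@property='og:title']/@content")  # OpenGraph
    print page.xpath("//*[@itemprop='headline']/text()")       # Schema.org
    print page.xpath("//h1/text()")                            # plain HTML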

  14. Parse.ly Demo Time!
    Yay!

  15. Reflections on Scaling Crawlers
    You don't want to write your own crawler infrastructure from scratch.
    TRUST ME.
    Lots of hidden problems:
    Abstractions: asynchronous network I/O (Twisted), data processing pipelines
    HTTP/web: retries, throttling, backoff, concurrency, cookie/form handling
    Infrastructure: crawling queues, health monitoring
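
    Scrapy exposes most of the HTTP/web knobs above as settings. An
    illustrative settings.py fragment (the values are made up, not a
    recommendation):

    RETRY_ENABLED = True
    RETRY_TIMES = 3                      # retries
    DOWNLOAD_DELAY = 0.5                 # throttling between requests
    AUTOTHROTTLE_ENABLED = True          # adaptive backoff
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # concurrency cap
    COOKIES_ENABLED = True               # cookie handling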

  16. Don't use Nutch, Heritrix
    Didier and I tried to understand, and even customize, Nutch in the early days.
    We love Lucene/Solr, so we figured it'd be a good fit.
    But no -- it's a WORLD OF PAIN.
    (They are for building search engines and archives -- not structured metadata.)

  17. Use Scrapy
    It's really Pythonic.
    It's built on proven tools, like Twisted, w3lib, and lxml.
    It's getting better and better.
    Just trust me: use Scrapy.

  18. Scrapy Overview
    $ git clone git://github.com/scrapy/dirbot.git
    $ cd dirbot
    $ mkvirtualenv dirbot
    $ pip install scrapy
    $ pip install ipython
    $ scrapy list
    dmoz
    $ scrapy crawl dmoz
    [scrapy] INFO: Scrapy 0.16.0 started (bot: dirbot)
    ...

  19. Example Output
    [dmoz] DEBUG: Crawled (200)
    [dmoz] DEBUG: Crawled (200)
    [dmoz] DEBUG: Scraped from <200 http://dmoz.org/Comp.../Python/Resources/>
    Website: name=[u'Top'] url=[u'/']
    [dmoz] DEBUG: Scraped from <200 http://dmoz.org/Comp.../Python/Resources/>
    Website: name=[u'Computers'] url=[u'/Computers/']
    [dmoz] DEBUG: Scraped from <200 http://dmoz.org/Comp.../Python/Resources/>
    Website: name=[u'Programming'] url=[u'/Computers/Programming/']
    ...
    [dmoz] DEBUG: Scraped from <200 http://dmoz.org/.../Python/Books/>
    Website: name=[u'Text Processing in Python'] url=[u'http://gnosis.cx/TPiP/']
    [dmoz] INFO: Spider closed (finished)
    Links:
    • http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
    • http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

  20. Spider Example
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dirbot.items import Website

    class DmozSpider(BaseSpider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li')
            items = []
            for site in sites:
                item = Website()
                item['name'] = site.select('a/text()').extract()
                item['url'] = site.select('a/@href').extract()
                item['description'] = site.select('text()').extract()
                items.append(item)
            return items

  21. Live Demos!
    Examples: DailyCaller.com, ArsTechnica.com
    • https://www.stypi.com/pixelmonkey/dailycaller.py
    • https://www.stypi.com/pixelmonkey/arstechnica.py

  22. Item
    from scrapy.item import Item, Field

    class DbItem(Item):
        title = Field()
        link = Field()

  23. DailyCaller: imperative style
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dirbot.items import DbItem

    class DailycallerSpider(BaseSpider):
        name = "dailycaller.com"
        allowed_domains = ["dailycaller.com"]
        start_urls = ["http://dailycaller.com"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            item = DbItem()
            item["title"] = hxs.select("//h1/text()").extract()[0]
            item["link"] = hxs.select("//link[@rel='canonical']/@href").extract()[0]
            return item
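
    Run it from the project directory the same way as the dmoz example:
    $ scrapy crawl dailycaller.com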

  24. ArsTechnica: declarative style
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dirbot.items import DbItem
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import TakeFirst

    class ArstechnicaSpider(BaseSpider):
        name = "arstechnica.com"
        allowed_domains = ["arstechnica.com"]
        start_urls = ["http://arstechnica.com"]

        def parse(self, response):
            # Declarative style: extraction rules are registered on a loader.
            loader = XPathItemLoader(item=DbItem(), response=response)
            loader.add_xpath("title", "//meta[@property='og:title']/@content")
            loader.add_xpath("link", "//link[@rel='canonical']/@href")
            item = loader.load_item()
            item["title"] = loader.get_value(item["title"], TakeFirst(), unicode.title)
            item["link"] = loader.get_value(item["link"], TakeFirst())
            return item

  25. Live Spider Shell
    >>> fetch("http://dailycaller.com/2012...-most-rallies/")
    [dailycaller.com] INFO: Spider opened
    [dailycaller.com] DEBUG: Crawled (200)
    [s] Available Scrapy objects:
    [s] hxs
    [s] item Website: name=None url=None
    [s] request
    [s] response <200 http://dailycaller.com/...es/>
    [s] settings
    [s] spider
    [s] Useful shortcuts:
    [s] shelp() Shell help
    [s] fetch(req_or_url) Fetch request (or URL) and update local objects
    [s] view(response) View response in a browser
    >>> hxs.select("//title/text()")

  26. Scrapy Cloud Demo
    How we host, test, and QA our spiders across millions of pages.

  27. Schemato Overview
    Domo arigato, Mr. Schemato!

  28. Schemato Distilling
    Each Distill lists candidate fields from different standards in priority
    order; the distiller falls back to later sources when earlier ones are
    missing (see the sources output two slides down).

    from distillers import Distill, Distiller

    class NewsDistiller(Distiller):
        title = Distill("s:headline", "og:title")
        image_url = Distill("s:associatedMedia.ImageObject/url", "og:image")
        pub_date = Distill("s:datePublished")
        author = Distill("s:author", "s:creator.Person/name")
        section = Distill("s:articleSection")
        description = Distill("s:description", "og:description")
        link = Distill("s:url", "og:url")
        site = Distill("og:site_id")
        id = Distill("s:identifier")

  29. Schemato Distilling in Action
    >>> from distillery import NewsDistiller
    >>> from schemato import Schemato
    >>> lnk = "http://www.cnn.com/2012/10/26/world/europe/italy-berlusconi-convicted/index.html"
    >>> cnn = Schemato(lnk)
    >>> distiller = NewsDistiller(cnn)
    >>> distiller.distill()
    {'author': 'Ben Wedeman',
     'id': None,
     'image_url': 'http://i2.cdn.turner.com/cnn/...-video-tease.jpg',
     'link': 'http://www.cnn.com/2012/10/26/world/europe/italy-berlusconi-convicted/index.html',
     'pub_date': '2012-10-26T14:36:35Z',
     'section': 'world',
     'title': 'Ex-Italian PM Berlusconi handed 4-year prison term for tax fraud',
     'site': 'CNN',
     'description': 'Flamboyant former Italian Prime Minister...'}

  30. Schemato: Bridging Gaps Between Standards
    Facebook OpenGraph provided image_url and link.
    Schema.org NewsArticle provided the rest.
    >>> distiller.sources
    {'author': 's:author',
     'id': None,
     'image_url': 'og:image',
     'link': 'og:url',
     'pub_date': 's:datePublished',
     'section': 's:articleSection',
     'title': 's:headline',
     'site': 'og:site_id',
     'description': 's:description'}

  31. Data sets to get started
    Are you interested in tackling some of these web crawling problems on your own project?
    If so, you may want some data to get started.
    I currently sell a few news data sets that help with this:
    • 30M news headlines and 500K web sources, 30 GB of JSON data ($300)
    • the 15K most popular news domains in the US market ($100)
    You could use either of these to build your own Google News, for example.
    Interested? Find me afterwards or tweet me: @amontalenti

  32. Schemato: A Call to Action
    The time is ripe for the semantic web.
    Want to build the ultimate web metadata validator, distiller, and extractor?
    Want to work on getting Schemato to run across millions of URLs?
    Want your contributions open-sourced on GitHub?
    Find me at the sprints on Sunday!

  33. Baby Turtles
    Use your powers wisely, and always remember...

  34. Magic Turtles!
    It's turtles all the way down!

  35. Tweet and Meet
    What did you think?
    Tweet @amontalenti with the #pydata hashtag!
    Rate this talk! http://bit.ly/rate-andrew
    Connect on LinkedIn: http://linkedin.com/in/andrewmontalenti
