Web Crawling & Metadata Extraction in Python

Web crawling is a hard problem, and the web is messy. There is no shortage of semantic web standards -- basically, everyone has one. How do you make sense of a noisy web of billions of pages?

This talk presents two key technologies that help: Scrapy, an open source and scalable web crawling framework, and Mr. Schemato, a new, open source semantic web validator and distiller. (A small crawling sketch follows below.)
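
To make the Scrapy half concrete, here is a minimal sketch of a spider that crawls pages and extracts basic metadata. It uses the modern Scrapy API rather than the 2012-era one shown in the talk, and the start URL and field names are placeholders, not taken from the slides.

# Minimal Scrapy spider sketch (illustrative only; not the talk's code).
# Crawls from a placeholder start URL, yields basic page metadata,
# and follows in-page links to keep crawling.
import scrapy


class MetadataSpider(scrapy.Spider):
    name = "metadata"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Pull a few common metadata fields out of the page <head>.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "og_title": response.css(
                'meta[property="og:title"]::attr(content)'
            ).get(),
        }
        # Follow links found on the page and parse them the same way.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider metadata_spider.py -o items.json" to collect the yielded metadata records as JSON; Scrapy handles scheduling, deduplication, and politeness settings for you.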

Talk given by Andrew Montalenti, CTO of Parse.ly. See http://parse.ly

Slides were built with reST and S5, and thus are available in raw text form here (quite pleasant to browse): https://raw.github.com/Parsely/python-crawling-slides/master/index.rst

You can also view these slides directly in the browser, using your arrow keys to navigate. http://bit.ly/crawling-slides

Andrew Montalenti

October 27, 2012