Web Scraping with Python and Scrapy

Slides from my presentation, given on May 15th at DevCoMO in Columbia, MO, about a tool built with Python, Scrapy, and Selenium.

Ceili Cornelison

May 16, 2013

Transcript

  2. Some background on me...

    Developer at Delta Systems
    NOT a Python developer
    Student at the University of Missouri
    @cornelisonc
  4. Some background on scraping

    “a computer software technique of extracting information from websites”
    Transformation of unstructured web data into structured, storable, analyzable data
    Online price comparison, weather data monitoring, and extracting data from really unfriendly web apps with no public API
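As a rough illustration of the "unstructured web data into structured, storable data" idea on this slide, here is a minimal sketch that uses only the Python 3 standard library (not Scrapy). The HTML fragment, the `TableRowParser` class, and the `table_to_records` helper are all made up for this example; a real page would need its own handling.

```python
import json
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first row as field names and turn each later row into a dict."""
    parser = TableRowParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>Widget</td><td>4.99</td></tr>
  <tr><td>Gadget</td><td>12.50</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
print(json.dumps(records, indent=2))  # structured data, ready to store or analyze
```

Once the rows are plain dictionaries, they can be written to JSON, CSV, or a database without any further knowledge of the page layout.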
  9. With great power...

    eBay, Inc. v. Bidder’s Edge, Inc.
    ‘Trespass to Chattels’ - don’t cause people’s stuff problems
    United States v. Andrew Auernheimer (weev)
    Computer Fraud and Abuse Act
    Can exist in a legal grey area
    But Google does it!
  13. Signs you may be up to no good

    Evading CAPTCHAs, bypassing firewalls, and other security features
    Potential for server overload
    SHOULD you be accessing this data?
    Read the Terms and Conditions! (No, really.)
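One way to reduce the "potential for server overload" is to honor a site's robots.txt and pause between requests. The sketch below assumes Python 3 and the standard-library `urllib.robotparser`; the robots.txt text, the `demo-bot` user agent, the example.com URLs, and the `build_rules` and `polite_urls` helpers are hypothetical. It does not replace reading the site's Terms and Conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_urls(rules, user_agent, urls, default_delay=1.0, sleep=time.sleep):
    """Yield only the URLs robots.txt allows, pausing between consecutive ones."""
    # Fall back to default_delay when robots.txt gives no Crawl-delay.
    delay = rules.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt: skip it entirely
        if not first:
            sleep(delay)
        first = False
        yield url


rules = build_rules(ROBOTS_TXT)
candidates = [
    "https://example.com/public/a",
    "https://example.com/private/b",
    "https://example.com/public/c",
]
pauses = []  # record the pauses instead of really sleeping in this demo
allowed = list(polite_urls(rules, "demo-bot", candidates, sleep=pauses.append))
print(allowed, pauses)
```

Passing `sleep` as a parameter keeps the demo fast; leaving it at the default `time.sleep` gives the real delay between requests.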
  15. Scrapy - open source Python web scraping

    “Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing”
    Dependencies: Python 2.6 or 2.7, OpenSSL, and either the pip or easy_install Python package manager
  16. THE PROBLEM...

    https://www.courts.mo.gov/casenet/base/welcome.do
    Information is available, but not easy to access
    Time consuming, repetitive, could be done by a monkey... OR A COMPUTER!
  18. THE SOLUTION???

    Scrapy!
    How to access the data? XPath selectors.
    But there’s a form... and all the links are JavaScript...
    There’s a library for that.
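To give a feel for what XPath selectors do, here is a small sketch using the limited XPath subset in Python 3's standard-library `xml.etree.ElementTree`. It is a stand-in only: Scrapy's own selectors accept full XPath and tolerate messy real-world HTML, which ElementTree does not. The page markup, the class names, the `extract_cases` helper, and the case data are invented for illustration and are not taken from any real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
SAMPLE_PAGE = """
<html>
  <body>
    <table id="results">
      <tr class="case"><td class="number">A-001</td><td class="style">Smith v. Jones</td></tr>
      <tr class="case"><td class="number">A-002</td><td class="style">Doe v. Roe</td></tr>
      <tr class="footer"><td>2 records</td></tr>
    </table>
  </body>
</html>
"""


def extract_cases(page):
    """Select the data rows with XPath and return one dict per row."""
    root = ET.fromstring(page.strip())
    cases = []
    # Rows of the results table whose class attribute is exactly "case".
    for row in root.findall(".//table[@id='results']/tr[@class='case']"):
        cases.append({
            "number": row.findtext("td[@class='number']"),
            "style": row.findtext("td[@class='style']"),
        })
    return cases


for case in extract_cases(SAMPLE_PAGE):
    print(case["number"], case["style"])
```

The footer row is skipped because its class does not match the predicate, which is the kind of filtering that makes XPath convenient for picking data out of a page. Forms and JavaScript-driven links are a separate problem that selectors alone do not solve.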