Web Scraping

Michael Rawlings

November 19, 2012

Transcript

  1. Overview • When to use • HTTP Requests • GET • POST • Cookies • User Agent • Parsing • HTML • Text Parsing (Regex) • Road Blocks • JavaScript & AJAX • Captchas
  2. When to use • As a last resort. • First, check whether the site you are scraping provides an API or feed. • Then read the site's Terms of Service, or speak with the site owner, to determine whether scraping is acceptable. • Not for spamming purposes. /\w\S*\@\w\S*\.\w+/
  3. HTTP Requests • HTTP = HyperText Transfer Protocol • Client-server protocol built on top of TCP • Two (primary) methods: GET and POST • The request message consists of the following: • A request line • Headers • An optional message body
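The three parts of a request message can be made concrete with a small sketch. The helper name `build_request`, the `/search` path, and the `example.com` host are assumptions made for illustration; they are not from the talk, and a real scraper would normally let an HTTP library assemble this for it.

```python
def build_request(method, path, headers, body=b""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, optional body."""
    lines = [f"{method} {path} HTTP/1.1"]                              # request line
    lines += [f"{name}: {value}" for name, value in headers.items()]  # headers
    head = "\r\n".join(lines) + "\r\n\r\n"                             # blank line ends the headers
    return head.encode("ascii") + body                                 # optional message body


# A GET request carries no body.
get_request = build_request(
    "GET", "/wiki/Hypertext_Transfer_Protocol", {"Host": "en.wikipedia.org"}
)

# A POST request carries its parameters in the body (hypothetical endpoint).
post_body = b"q=web+scraping"
post_request = build_request(
    "POST",
    "/search",
    {
        "Host": "example.com",
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(post_body)),
    },
    post_body,
)
```

Printing `get_request.decode()` shows the same layout as the header dump on the next slide, only shorter.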
  4. HTTP Request Headers • GET /wiki/Hypertext_Transfer_Protocol HTTP/1.1 • Host: en.wikipedia.org • Connection: keep-alive • Cache-Control: max-age=0 • User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11 • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 • Referer: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDQQFjAA&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHypertext_Transfer_Protocol&ei=fgakUIbNNsX2qQHT8YCIAg&usg=AFQjCNEF3onc9o56WnZh-og7sukbTo6JCg&sig2=LiVlhdsMyG7UdeiMfungXg • Accept-Encoding: gzip,deflate,sdch • Accept-Language: en-US,en;q=0.8 • Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3 • Cookie: [value was stripped] • If-Modified-Since: Wed, 14 Nov 2012 16:54:48 GMT
  5. HTTP Response • The response message consists of the following: • A status line • Headers • An optional message body • 5 classes of status codes: • 1xx – Informational • 2xx – Success (200) • 3xx – Redirect (301, 302) • 4xx – Client Error (403, 404, 405) • 5xx – Server Error (500, 503)
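As a rough illustration of that structure, the sketch below splits a raw response into status line, headers, and body, and maps the status code to one of the five classes. The function names and the sample response are assumptions made for this example; an HTTP client library does this parsing for you in practice.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirect",
    4: "Client Error",
    5: "Server Error",
}


def parse_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")        # blank line separates head from body
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Simplification: a repeated header name overwrites the earlier value.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


def status_class(code):
    """Return the class name for a status code, based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


raw = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"\r\n"
    b"<h1>Not Found</h1>"
)
code, reason, headers, body = parse_response(raw)
```

A scraper typically checks `status_class(code)` before it bothers parsing the body.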
  6. HTTP Response Headers • HTTP/1.0 304 Not Modified • Date: Wed, 14 Nov 2012 20:49:00 GMT • Content-Type: text/html; charset=UTF-8 • Last-Modified: Wed, 14 Nov 2012 16:54:48 GMT • Age: 942 • X-Cache: HIT from cp1010.eqiad.wmnet • X-Cache-Lookup: HIT from cp1010.eqiad.wmnet:3128 • X-Cache: MISS from cp1015.eqiad.wmnet • X-Cache-Lookup: MISS from cp1015.eqiad.wmnet:80 • Connection: keep-alive
  7. Cookies • Key-value pairs stored on the client • Sent with every request to the server • Set by a response header from the server • Often used by server-side languages to keep track of a user (session ID cookie) • You may need to send these in order to get the data you want
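The round trip can be sketched with Python's standard `http.cookies` module: the server sets a cookie in a `Set-Cookie` response header, and the client echoes the name and value back in a `Cookie` request header. The cookie name `sessionid` and its value are made up for this example.

```python
from http.cookies import SimpleCookie

# Value of a Set-Cookie response header (hypothetical session cookie).
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)          # parse name, value, and attributes

session_value = jar["sessionid"].value
session_path = jar["sessionid"]["path"]

# Only name=value pairs go back to the server; attributes such as Path stay on the client.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
```

The resulting `cookie_header` string is what a scraper would send as the `Cookie` header on later requests in the same session.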
  8. User Agent • Tells the web server what kind of browser/program you are using to make the request. • You can easily spoof this, but if you have to do that, you probably shouldn’t be scraping the site. • E.g. Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
  9. Robots.txt • Instructions for bots accessing the site. • Your scraper is a bot. • You should listen to what this file says. • Basically a list of User-Agents and Disallowed paths
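One way to listen to the file is Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `example-scraper` user agent below are invented for illustration; a real scraper would load the site's actual file (for instance with `set_url()` and `read()`) instead of an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a user-agent line followed by disallowed paths.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, user_agent="example-scraper"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Calling `allowed(url)` before each request keeps the scraper out of the paths the site owner has excluded.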
  10. Why Not? • HTML is a context-free language • Fundamentally more complex than what regular expressions can describe • Type 2 vs. Type 3 (Chomsky hierarchy)
  11. XML Parser? • Only if everyone used XHTML and actually followed the standard • <script><![CDATA[ alert('hi there'); ]]></script> • <br>, <hr> • <ol><li>first<li>second<li>third</ol>
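A quick check with Python's standard `xml.etree.ElementTree` shows why a strict XML parser is a poor fit for real-world pages. The helper name `is_well_formed_xml` is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Valid HTML, but not well-formed XML: the <li> and <br> tags are never closed.
tag_soup = "<ol><li>first<li>second<li>third</ol>"
unclosed_br = "<p>line<br>break</p>"

# The XHTML-style equivalent closes every element.
xhtml = "<ol><li>first</li><li>second</li><li>third</li></ol>"
```

Here `is_well_formed_xml(xhtml)` succeeds, while the two HTML-style snippets are rejected even though browsers render them without complaint.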
  12. HTML Parser! • Very similar to an XML parser but specifically for HTML (and the stupid things that web developers write). • Most parsers mimic the jQuery API. • All you need to know are CSS selectors.
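To show the tolerance of an HTML parser, here is a minimal sketch using Python's standard `html.parser`. Note that this is an event-based parser, not one of the jQuery-style, CSS-selector libraries the slide refers to; it is used here only because it ships with Python. The class name `ListItemCollector` is made up for the example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of each <li>, even when closing tags are missing."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self.items.append("")
            self._in_li = True

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


collector = ListItemCollector()
collector.feed("<ol><li>first<li>second<li>third</ol>")
collector.close()
items = [text.strip() for text in collector.items]
```

With a selector-based library, the same extraction would usually be a single query for `li` elements.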
  13. Text Parsing – Use Regex • My phone number is 540.521.1492 • /\d{3}\.\d{3}\.\d{4}/ • There are 20,300 people online. • /There are (.*) people online\./
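The two patterns on this slide translate directly to Python's `re` module. The variable names are illustrative, and the final conversion to an integer is an extra step not shown on the slide.

```python
import re

phone_pattern = re.compile(r"\d{3}\.\d{3}\.\d{4}")
online_pattern = re.compile(r"There are (.*) people online\.")

# Match the whole phone number.
phone = phone_pattern.search("My phone number is 540.521.1492").group(0)

# Capture only the part between the fixed words.
count_text = online_pattern.search("There are 20,300 people online.").group(1)

# Extra step: drop the thousands separator to get a number.
count = int(count_text.replace(",", ""))
```

The capture group in the second pattern is what lets the fixed wording act as an anchor while only the variable part is extracted.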
  14. Road Blocks • JavaScript • You could try parsing the script with regex • OR you could emulate the browser environment. • AJAX • Just make the request yourself and use whatever you need from the response – typically you’ll get back XML or JSON, which is easier to deal with anyway. • Captcha • Don’t do it. But if you must…
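For the AJAX case, a JSON response body can be read with Python's standard `json` module, with no HTML parsing at all. The payload below is invented for illustration; the real field names depend entirely on the site's endpoint.

```python
import json

# Hypothetical body returned by an AJAX endpoint.
ajax_body = '{"online": 20300, "users": [{"name": "alice"}, {"name": "bob"}]}'

data = json.loads(ajax_body)

online = data["online"]
names = [user["name"] for user in data["users"]]
```

Compared with scraping the rendered page, the values arrive already typed and structured.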
  15. Libraries • PHP • http – Zend HTTP Client • parsing – phpQuery • Node.js • http – http (core module) • parsing – jsdom + jQuery • Ruby • http – Net::HTTP • parsing – Nokogiri • ColdFusion • http – cfhttp (built in) • parsing – jsoup (jar file)
  16. Steps 1. Make an HTTP request with the necessary headers and parameters 2. Send the response body to the HTML parser 3. Use CSS selectors to get the text contents of the element(s) you are looking for 4. Use regular expressions to parse data from the text
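The four steps can be strung together in a short sketch that uses only Python's standard library. Several things here are assumptions made for illustration: the `example-scraper/0.1` user agent, the `status` element id, the sample page, and all function and class names. The standard library has no CSS selector support, so a small id-matching parser stands in for a `#status` selector; a dedicated HTML parsing library would replace that class in practice.

```python
import re
import urllib.request
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements with no closing tag


def fetch(url, user_agent="example-scraper/0.1"):
    """Step 1: make an HTTP request with the necessary headers."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextById(HTMLParser):
    """Steps 2-3: collect the text inside the element with the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0          # nesting depth inside the target element
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def scrape_online_count(html):
    """Steps 2-4: parse the HTML, select the element, apply the regex."""
    parser = TextById("status")
    parser.feed(html)
    parser.close()
    match = re.search(r"There are (.*) people online\.", parser.text)
    return int(match.group(1).replace(",", "")) if match else None


# Offline sample standing in for a fetched page.
page = (
    '<html><body><div id="status">There are <b>20,300</b> people online.</div>'
    "<p>footer</p></body></html>"
)
```

Against a live site, `scrape_online_count(fetch(url))` would run all four steps; the sample `page` string lets the parsing steps be tried without any network access.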