Web Scraping

Web Scraping Michael Rawlings Embark Web Solutions

Overview
• When to use
• HTTP Requests
  • GET
  • POST
  • Cookies
  • User Agent
• Parsing
  • HTML
  • Text Parsing (Regex)
• Road Blocks
  • JavaScript & AJAX
  • Captchas

When to use
• As a last resort
• First, check whether the site that you are scraping provides some kind of API or feed.
• Then, read the Terms of Service for the site or speak with the site owner to determine whether it is acceptable to scrape the site.
• Not for spamming purposes (for example, harvesting email addresses): /\w\S*\@\w\S*\.\w+/

HTTP Requests
• HTTP = HyperText Transfer Protocol
• Client-server protocol built on top of TCP
• Two (primary) methods: GET and POST
• The request message consists of the following:
  • A request line
  • Headers
  • An optional message body
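To make the three parts of a request message concrete, here is a minimal sketch that assembles one by hand. The helper name `build_request` is invented for this illustration; in practice an HTTP library builds the message for you.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble a raw HTTP/1.1 request message as bytes (illustrative helper)."""
    # 1. The request line: method, path and protocol version.
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    # 2. The headers, one "Name: value" pair per line.
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    # A blank line separates the headers from the optional body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    # 3. The optional message body.
    return head.encode("ascii") + body


raw = build_request(
    "GET",
    "/wiki/Hypertext_Transfer_Protocol",
    "en.wikipedia.org",
    {"Connection": "close"},
)
print(raw.decode("ascii"))
```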

HTTP Request Headers
• GET /wiki/Hypertext_Transfer_Protocol HTTP/1.1
• Host: en.wikipedia.org
• Connection: keep-alive
• Cache-Control: max-age=0
• User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
• Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
• Referer: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CDQQFjAA&url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHypertext_Transfer_Protocol&ei=fgakUIbNNsX2qQHT8YCIAg&usg=AFQjCNEF3onc9o56WnZh-og7sukbTo6JCg&sig2=LiVlhdsMyG7UdeiMfungXg
• Accept-Encoding: gzip,deflate,sdch
• Accept-Language: en-US,en;q=0.8
• Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
• Cookie: [value was stripped]
• If-Modified-Since: Wed, 14 Nov 2012 16:54:48 GMT

HTTP Response
• The response message consists of the following:
  • A status line
  • Headers
  • An optional message body
• 5 classes of status codes:
  • 1xx – Informational
  • 2xx – Success (200)
  • 3xx – Redirect (301, 302)
  • 4xx – Client Error (403, 404, 405)
  • 5xx – Server Error (500, 503)
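As a rough sketch of how those parts come apart, the following splits a raw response into status line, headers and body, and maps the status code to its class by its first digit. The function names and the sample response are made up for this example, and a real client library would handle many more edge cases.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirect",
    4: "Client Error",
    5: "Server Error",
}


def parse_response(raw):
    """Split a raw HTTP response into (code, reason, headers, body)."""
    # The first blank line separates the head from the optional body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line, e.g. "HTTP/1.1 404 Not Found".
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: a repeated header name overwrites the earlier value.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


def status_class(code):
    """Return the class of a status code based on its first digit."""
    return STATUS_CLASSES[code // 100]


sample = b"HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n\r\n<h1>gone</h1>"
code, reason, headers, body = parse_response(sample)
print(code, reason, status_class(code), headers, body)
```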

HTTP Response Headers
• HTTP/1.0 304 Not Modified
• Date: Wed, 14 Nov 2012 20:49:00 GMT
• Content-Type: text/html; charset=UTF-8
• Last-Modified: Wed, 14 Nov 2012 16:54:48 GMT
• Age: 942
• X-Cache: HIT from cp1010.eqiad.wmnet
• X-Cache-Lookup: HIT from cp1010.eqiad.wmnet:3128
• X-Cache: MISS from cp1015.eqiad.wmnet
• X-Cache-Lookup: MISS from cp1015.eqiad.wmnet:80
• Connection: keep-alive

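GET vs POST
• GET sends its parameters in the URL query string: http://google.com/?q=roanoke
• POST sends its parameters in the message body.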
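The difference is easy to see by building both kinds of request without sending them. This sketch uses Python's standard `urllib`; the POST URL `http://example.com/search` is a placeholder, not something from the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "roanoke"}

# GET: the parameters are encoded into the URL's query string.
get_req = Request("http://google.com/?" + urlencode(params))

# POST: the same parameters travel in the message body instead.
# Supplying `data` is what makes urllib treat the request as a POST.
post_req = Request(
    "http://example.com/search",  # placeholder URL
    data=urlencode(params).encode("ascii"),
)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```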

Cookies
• Key-value pairs stored on the client
• Sent with every request to the server
• Set by a response header from the server
• Often used by server-side languages to keep track of a user (session ID cookie)
• Can be necessary to send these in order to get the data you want
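A minimal sketch of that round trip, using the standard `http.cookies` module: read the cookie out of a `Set-Cookie` response header, then build the `Cookie` header to send back on the next request. The cookie name `SESSIONID` and its value are invented for the example.

```python
from http.cookies import SimpleCookie

# Value of a hypothetical "Set-Cookie" response header from the server.
set_cookie_header = "SESSIONID=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)

# Attributes such as Path are metadata for the client; only the
# name=value pairs are sent back to the server.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)

print("Cookie:", cookie_header)
```

For anything beyond a single request, a cookie-aware client (for example `http.cookiejar` with `urllib`) saves doing this by hand.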

User Agent
• Tells the web server what kind of browser/program you are using to make the request.
• You can easily spoof this, but if you have to do that, you probably shouldn’t be scraping the site.
• E.g. Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11
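Rather than spoofing a browser, a scraper can identify itself honestly. The sketch below sets a `User-Agent` header with `urllib`; the agent string and the URLs are made-up placeholders.

```python
from urllib.request import Request

# Hypothetical agent string: a name, a version and a contact URL.
USER_AGENT = "example-scraper/0.1 (+http://example.com/bot-info)"

req = Request("http://example.com/", headers={"User-Agent": USER_AGENT})

# urllib normalises header names with str.capitalize(), so the stored
# key is "User-agent".
print(req.get_header("User-agent"))
```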

Robots.txt
• Instructions for bots accessing the site.
• Your scraper is a bot.
• You should listen to what this file says.
• Basically a list of User-Agents and Disallowed paths
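Python ships a parser for this file, so checking a URL before requesting it takes only a few lines. In this sketch the robots.txt contents, the bot names and the URLs are all invented; a real scraper would load the file from the site (for instance with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt: every bot is kept out of /private/,
# and one specific bot is kept out of the whole site.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: badbot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed_index = rp.can_fetch("example-scraper", "http://example.com/index.html")
allowed_private = rp.can_fetch("example-scraper", "http://example.com/private/data.html")
allowed_badbot = rp.can_fetch("badbot", "http://example.com/index.html")

print(allowed_index, allowed_private, allowed_badbot)  # True False False
```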

Parsing <html> </html>

Can I Parse HTML with Regex?

Why Not?
• HTML is a context-free language
• Fundamentally more complex than what regular expressions can describe
• Type 2 vs Type 3 (in the Chomsky hierarchy)
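A small illustration of the problem, using two toy snippets invented for this example: a regular expression cannot keep track of nesting depth, so both the lazy and the greedy version of the same pattern pair up the wrong tags on some input.

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Lazy match: stops at the first </div>, which closes the INNER div.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Greedy match: runs to the last </div>, swallowing both siblings.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))    # 'outer <div>inner'
print(repr(greedy))  # 'a</div><div>b'
```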

XML Parser?
• If everyone used XHTML and actually followed the standard
• <script><![CDATA[ alert('hi there'); ]]></script>
• <br>, <hr>
• <ol><li>first<li>second<li>third</ol>
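To see why a strict XML parser is not enough, this sketch feeds it the kind of markup listed above. The helper `parses_as_xml` is a name made up for the example; it relies on the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Well-formed: every <li> is explicitly closed.
print(parses_as_xml("<ol><li>first</li><li>second</li></ol>"))

# Common in real HTML, but not well-formed XML: unclosed <li> and <br>.
print(parses_as_xml("<ol><li>first<li>second<li>third</ol>"))
print(parses_as_xml("<p>line one<br>line two</p>"))
```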

HTML Parser!
• Very similar to an XML parser but specifically for HTML (and the stupid things that web developers write).
• Most parsers mimic the jQuery API.
• All you need to know are CSS selectors.
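As a sketch of that tolerance, the example below reads a list with unclosed `<li>` tags using the standard `html.parser` module. Note the assumption: Python's standard library has no CSS selector engine, so the hand-written `ListItemCollector` class (a name invented here) stands in for what a jQuery-style library would express as the selector `li`.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of each <li>, tolerating missing </li> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> implicitly ends the previous one.
            self.items.append("")
            self._in_li = True

    def handle_endtag(self, tag):
        if tag in ("li", "ol", "ul"):
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


collector = ListItemCollector()
collector.feed("<ol><li>first<li>second<li>third</ol>")
collector.close()

items = [item.strip() for item in collector.items]
print(items)  # ['first', 'second', 'third']
```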

Text Parsing – Use Regex
• My phone number is 540.521.1492
  • /\d{3}\.\d{3}\.\d{4}/
• There are 20,300 people online.
  • /There are (.*) people online\./
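The two patterns above can be tried directly with Python's `re` module. The conversion of the captured count to an integer is an extra step added for this sketch.

```python
import re

# Pattern 1: a dotted phone number.
phone_match = re.search(r"\d{3}\.\d{3}\.\d{4}", "My phone number is 540.521.1492")
phone = phone_match.group(0)

# Pattern 2: capture whatever sits between the fixed words.
online_match = re.search(
    r"There are (.*) people online\.", "There are 20,300 people online."
)
# The captured text still contains a thousands separator.
count = int(online_match.group(1).replace(",", ""))

print(phone, count)  # 540.521.1492 20300
```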

Road Blocks
• JavaScript
  • You could try parsing the script with regex
  • OR you could emulate the browser environment.
• AJAX
  • Just make the request yourself and use whatever you need from the response. Typically you’ll get back XML or JSON, which is easier to deal with anyway.
• Captcha
  • Don’t do it. But if you must…
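Two of these workarounds can be sketched in a few lines. Both inputs are invented for the example: `script` imitates an inline script that assigns a JSON literal to a variable, and `ajax_body` imitates the body of a JSON response that the page would otherwise fetch via AJAX.

```python
import json
import re

# Workaround 1: pull a JSON literal out of an inline script with a regex.
script = 'var stats = {"online": 20300};'
literal = re.search(r"var stats = (\{.*?\});", script).group(1)
stats = json.loads(literal)

# Workaround 2: request the AJAX endpoint yourself and parse the JSON body.
ajax_body = '{"online": 20300, "users": [{"name": "alice"}, {"name": "bob"}]}'
data = json.loads(ajax_body)
names = [user["name"] for user in data["users"]]

print(stats["online"], names)  # 20300 ['alice', 'bob']
```

The regex approach only works while the literal is simple and valid JSON; anything computed at runtime needs the browser-emulation route.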

Libraries
• PHP
  • http – Zend HTTP Client
  • parsing – phpQuery
• Node.js
  • http – http (core module)
  • parsing – jsdom + jQuery
• Ruby
  • http – Net::HTTP
  • parsing – Nokogiri
• ColdFusion
  • http – cfhttp (built in)
  • parsing – jsoup (jar file)

Steps
1. Make an HTTP request with the necessary headers and parameters
2. Send the response body to the HTML parser
3. Use CSS selectors to get the text contents of the element(s) you are looking for
4. Use regular expressions to parse data from the text
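The four steps can be strung together roughly as follows, using only Python's standard library. Several things here are assumptions made for the sketch: the function and class names, the agent string, the `id="status"` element and the sample page. Because the standard library has no CSS selector engine, the `TextById` class stands in for the selector `#status`.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url, user_agent="example-scraper/0.1"):
    """Step 1: make the HTTP request with the necessary headers."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TextById(HTMLParser):
    """Steps 2 and 3: collect the text inside the element with a given id.

    Limitation: an unclosed void tag such as <br> inside the target
    element would throw off the depth counter.
    """

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # > 0 while inside the target element
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def scrape_online_count(html):
    parser = TextById("status")
    parser.feed(html)
    parser.close()
    # Step 4: use a regular expression on the extracted text.
    match = re.search(r"There are ([\d,]+) people online\.", parser.text)
    return int(match.group(1).replace(",", "")) if match else None


# Canned page so the example runs without network access.
sample = (
    '<html><body><p id="status">There are <b>20,300</b> '
    "people online.</p></body></html>"
)
print(scrape_online_count(sample))  # 20300
```

Against a live site the call would be `scrape_online_count(fetch(url))`, after checking the site's robots.txt and Terms of Service.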
