Scraping Techniques in Python

Slide 1

Slide 1 text

Scraping Techniques in Python Stefano Cotta Ramusino 2009/05/09

Slide 2

Slide 2 text

Stefano Cotta Ramusino, PYCON3-IT 2009, 2009/05/09
What is scraping?
Origin: scraping data from mainframes, moving it from green text on black screens into new data structures or APIs.
Nowadays: forcing data from old websites into something new (web ≥ 2.0).

Slide 3

Slide 3 text

Why scrape web pages?
To have online resources available in the data structures and files you want, such as XML, databases, PDF and so on.

Slide 4

Slide 4 text

How to scrape?
Necessary elements: fuzzy logic and pattern recognition.
This is a true hacking technique.

Slide 5

Slide 5 text

Why Python?
A lot of libraries.
Regular expressions that are simple but powerful.
More than one technique is available.
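The point about regular expressions can be illustrated with a minimal sketch using the standard `re` module. The HTML fragment, the class names `name` and `price`, and the helper `extract_products` are all hypothetical, invented here for illustration; they do not come from the talk.

```python
import re

# Hypothetical HTML fragment standing in for an old-style page.
HTML = """
<table>
  <tr><td class="name">Widget</td><td class="price">EUR 9.50</td></tr>
  <tr><td class="name">Gadget</td><td class="price">EUR 12.00</td></tr>
</table>
"""

# One product = a name cell followed by a price cell.
# \s* tolerates whitespace or line breaks between the two cells.
ROW = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">EUR\s*(?P<price>\d+\.\d{2})</td>'
)


def extract_products(html):
    """Return a list of (name, price) tuples found in the fragment."""
    return [
        (m.group("name").strip(), float(m.group("price")))
        for m in ROW.finditer(html)
    ]


if __name__ == "__main__":
    for name, price in extract_products(HTML):
        print(name, price)
```

A pattern like this is quick to write but tied to the exact markup it was written for; if the page layout changes, the expression has to change with it.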

Slide 6

Slide 6 text

Some books
Atom and RSS, Leslie Orchard, Wiley Publishing, 2005
Python in a Nutshell, Alex Martelli, O'Reilly, 2006
Beginning Python, Magnus Lie Hetland, Apress, 2008

Slide 7

Slide 7 text

Libraries inside Python
HTMLParser
re

Slide 8

Slide 8 text

HTMLParser

Slide 9

Slide 9 text

HTMLParser
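The code shown on the HTMLParser slides did not survive in this transcript, so the following is only a sketch of how the standard-library parser is typically used, not the author's example. It assumes Python 3, where the module the slides call `HTMLParser` is imported from `html.parser`. The class `LinkCollector`, the helper `collect_links` and the `SAMPLE` markup are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser  # Python 2 named this module "HTMLParser"


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical markup used only to exercise the parser.
SAMPLE = '<p>See <a href="/docs">the <b>docs</b></a> and <a href="/faq">FAQ</a>.</p>'


def collect_links(html):
    """Feed html to a fresh LinkCollector and return the links it found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(collect_links(SAMPLE))
```

The parser is event driven: it calls the `handle_*` methods as it meets tags and text, so the subclass has to keep its own state (here, the currently open link) between calls.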

Slide 10

Slide 10 text

Third-party libraries
Beautiful Soup
mechanize
lxml
html5lib
scrapemark
pyquery
scrapy

Slide 11

Slide 11 text

Third-party libraries
Library         Pros      Cons
Beautiful Soup  pure      some errors
mechanize       simple    parsing
lxml            speed     unusual
scrapemark      template  no flexibility

Slide 12

Slide 12 text

Beautiful Soup
www.crummy.com/software/BeautifulSoup

Slide 13

Slide 13 text

mechanize
wwwsearch.sourceforge.net/mechanize

Slide 14

Slide 14 text

lxml
codespeak.net/lxml

Slide 15

Slide 15 text

scrapemark
arshaw.com/scrapemark

Slide 16

Slide 16 text

Questions and answers
www.whitone.tk
[email protected]