Scraping Techniques in Python

Talk: https://www.youtube.com/watch?v=IbkId9WGvGM

Scraping is a technique for extracting information from websites; search engines, for example, use it to index web content.

Python is well suited to this kind of work. We will discuss methods for parsing web pages, including complex ones, and how to automate login to sites whose authentication forms change structure.

With Python you can automate the browsing of the web pages you visit every day!

Stefano Cotta Ramusino

May 09, 2009

Transcript

  1. Stefano Cotta Ramusino <[email protected]>, PYCON3-IT 2009, 2009/05/09

    What is scraping? Origin: scraping data from mainframes, from green text on black screens, into new data structures or APIs. Nowadays: forcing data from old websites into something new (web ≥ 2.0).
  2. Why scrape web pages?

    To have online resources available in the data structures and files you want, such as XML, databases, PDF and so on.
  3. How to scrape?

    Necessary elements: fuzzy logic and pattern recognition. This is a true hacking technique.
  4. Why Python?

    A lot of libraries. Simple but powerful regular expressions. More than one technique available.
  5. Some books

    Atom and RSS, Leslie Orchard, Wiley Publishing, 2005; Python in a Nutshell, Alex Martelli, O'Reilly, 2006; Beginning Python, Magnus Lie Hetland, Apress, 2008.
  6. Libraries included with Python

    HTMLParser, re
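Slide 6 names the two standard-library modules, HTMLParser and re. As a minimal sketch of the regular-expression approach, `re` can pull link targets and labels out of a page fragment. The HTML snippet, the pattern, and the `extract_links` helper below are made up for illustration and do not come from the talk.

```python
import re

# Hypothetical page fragment, used only for illustration.
HTML = """
<ul>
  <li><a href="/talks/scraping">Scraping Techniques</a></li>
  <li><a href='/talks/regex'>Regular Expressions</a></li>
</ul>
"""

# Match an <a> tag whose href is quoted with either " or ',
# capturing the target and the link text.
LINK_RE = re.compile(
    r"""<a\s[^>]*?href=(["'])(?P<href>.*?)\1[^>]*>(?P<text>.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

def extract_links(html):
    """Return (href, text) pairs for every anchor tag the pattern matches."""
    return [(m.group("href"), m.group("text").strip())
            for m in LINK_RE.finditer(html)]

links = extract_links(HTML)
for href, text in links:
    print(href, "->", text)
```

A pattern like this is quick to write but brittle: it assumes quoted attribute values and non-nested anchors, which is one reason a real parser is often preferred for complex pages.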
  7. HTMLParser
  8. HTMLParser
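The transcript keeps only the titles of the two HTMLParser slides. As a rough sketch of how the module is typically used (not a reconstruction of the code shown in the talk), a subclass overrides the `handle_*` callbacks to collect data while the parser walks the page. The import below is the Python 3 location, `html.parser`; in the Python 2 of 2009 the module was named `HTMLParser`. The class name `TitleAndLinks` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    """Collect the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text is only kept while we are inside <title>...</title>.
        if self.in_title:
            self.title += data

# Hypothetical page, used only for illustration.
PAGE = """<html><head><title>PyCon Italia</title></head>
<body><A HREF="/schedule">Schedule</A> <a href="/speakers">Speakers</a></body></html>"""

parser = TitleAndLinks()
parser.feed(PAGE)
parser.close()
print(parser.title, parser.links)
```

Because the parser normalizes tag and attribute case, the uppercase `<A HREF=...>` in the sample is handled the same way as the lowercase one.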
  9. Third-party libraries

    Beautiful Soup, mechanize, lxml, html5lib, scrapemark, pyquery, scrapy
  10. Third-party libraries: pros and cons

    Library         Pros      Cons
    Beautiful Soup  pure      some errors
    mechanize       simple    parsing
    lxml            speed     unusual
    scrapemark      template  no flexibility
  11. Beautiful Soup

    www.crummy.com/software/BeautifulSoup
  12. mechanize

    wwwsearch.sourceforge.net/mechanize
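The abstract mentions automatic login to sites whose authentication forms change structure, which is the kind of browsing task mechanize is listed for. What follows is a standard-library sketch of the underlying idea, not mechanize's API: rather than hard-coding field names, it parses the form, passes hidden fields through as served, and recognizes the credential fields by their input type. The form markup, the `LoginFormParser` class, and the `build_login_request` helper are all hypothetical, and no request is actually sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request

class LoginFormParser(HTMLParser):
    """Record the action and inputs of the first form with a password field."""

    def __init__(self):
        super().__init__()
        self._current = None  # form currently being parsed, if any
        self.form = None      # first form found that has a password input

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "inputs": []}
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append(attrs)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            has_password = any((i.get("type") or "").lower() == "password"
                               for i in self._current["inputs"])
            if has_password and self.form is None:
                self.form = self._current
            self._current = None

def build_login_request(page_url, html, username, password):
    """Build (but do not send) a POST request that fills in the login form."""
    parser = LoginFormParser()
    parser.feed(html)
    parser.close()
    if parser.form is None:
        raise ValueError("no login form found")
    fields = []
    user_filled = False
    for inp in parser.form["inputs"]:
        name = inp.get("name")
        kind = (inp.get("type") or "text").lower()
        if not name:
            continue
        if kind == "password":
            fields.append((name, password))
        elif kind in ("text", "email") and not user_filled:
            # Assumption: the first text-like field is the user name.
            fields.append((name, username))
            user_filled = True
        elif kind == "hidden":
            # Hidden fields (e.g. session tokens) are sent back unchanged.
            fields.append((name, inp.get("value") or ""))
    data = urlencode(fields).encode("ascii")
    return Request(urljoin(page_url, parser.form["action"]), data=data)

# Hypothetical login page; names like "u_83a1" stand in for names that change per visit.
PAGE = """<form action="/do_login" method="post">
<input type="hidden" name="token" value="abc123">
<input type="text" name="u_83a1"><input type="password" name="p_83a1">
<input type="submit" value="Log in"></form>"""

request = build_login_request("http://example.com/login", PAGE, "alice", "s3cret")
print(request.get_method(), request.full_url, request.data)
```

To actually log in, the request would be sent through an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor())` so that the session cookie is kept for later page fetches.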
  13. lxml

    codespeak.net/lxml
  14. scrapemark

    arshaw.com/scrapemark
  15. Questions and answers

    www.whitone.tk, [email protected]