Slide 1

Slide 1 text

Python tools for webscraping José Manuel Ortega @jmortegac

Slide 2

Slide 2 text

SpeakerDeck space https://speakerdeck.com/jmortega

Slide 3

Slide 3 text

Github repository https://github.com/jmortega/pydata_webscraping

Slide 4

Slide 4 text

Agenda Scraping techniques Introduction to webscraping Python tools for webscraping Scrapy project

Slide 5

Slide 5 text

Scraping techniques
- Screen scraping
- Report mining
- Web scraping
- Spiders / crawlers

Slide 6

Slide 6 text

Screen scraping
- Selenium
- Mechanize
- RoboBrowser

Slide 7

Slide 7 text

Selenium
- Open source framework for automating browsers
- Python module: http://pypi.python.org/pypi/selenium
- pip install selenium
- Firefox driver
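A minimal sketch of driving Firefox through Selenium; the URL is only an example and Firefox with its driver is assumed to be installed:

    from selenium import webdriver

    driver = webdriver.Firefox()      # launch a Firefox instance through its driver
    driver.get("http://pydata.org")   # load a page (example URL)
    print(driver.title)               # the page <title>
    driver.quit()                     # close the browser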

Slide 8

Slide 8 text

Selenium
- find_element_by_link_text('text'): find a link by its visible text
- find_element_by_css_selector: CSS selectors, just like with lxml
- find_element_by_tag_name: 'a' for the first link (find_elements_* for all links)
- find_element_by_xpath: XPath expressions
- find_element_by_class_name: CSS-related, but finds elements of any type that share the same class
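A short sketch of these locator helpers; the page, link text and selectors are made-up examples, and very recent Selenium releases replace these helpers with find_element(By..., ...):

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://pydata.org")                                 # example page

    link    = driver.find_element_by_link_text("Schedule")          # by visible link text (hypothetical text)
    heading = driver.find_element_by_css_selector("h1.title")       # by CSS selector (hypothetical selector)
    links   = driver.find_elements_by_tag_name("a")                 # every <a> element
    header  = driver.find_element_by_xpath("//div[@id='header']")   # by XPath (hypothetical id)
    slots   = driver.find_elements_by_class_name("slot")            # everything sharing a class (hypothetical class)

    driver.quit()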

Slide 9

Slide 9 text

Selenium YouTube

Slide 10

Slide 10 text

Selenium YouTube search
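A sketch of what a YouTube search with Selenium could look like; the search box name "search_query" and the result XPath are assumptions, not taken from the slides:

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://www.youtube.com")

    box = driver.find_element_by_name("search_query")   # assumed name of the search box
    box.send_keys("pydata")
    box.submit()

    # print the titles of the results (the XPath is an assumption)
    for video in driver.find_elements_by_xpath("//a[@title]"):
        print(video.get_attribute("title"))

    driver.quit()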

Slide 11

Slide 11 text

Report mining Miner

Slide 12

Slide 12 text

Webscraping

Slide 13

Slide 13 text

Python tools
- Requests
- Beautiful Soup 4
- PyQuery
- webscraping
- Scrapy

Slide 14

Slide 14 text

Spiders /crawlers  A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider. https://en.wikipedia.org/wiki/Web_crawler

Slide 15

Slide 15 text

Spiders /crawlers

Slide 16

Slide 16 text

Spiders /crawlers scrapinghub.com

Slide 17

Slide 17 text

Requests http://docs.python-requests.org/en/latest

Slide 18

Slide 18 text

Requests
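A minimal Requests sketch; the schedule URL is reused from a later slide:

    import requests

    response = requests.get("http://pydata.org/madrid2016/schedule/")
    print(response.status_code)                 # 200 if everything went well
    print(response.headers["content-type"])     # response headers behave like a dict
    html = response.text                        # body as text, ready to be parsed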

Slide 19

Slide 19 text

Web scraping with Python
1. Download the web page with requests
2. Parse the page with BeautifulSoup / lxml
3. Select elements with regular expressions, XPath or CSS selectors
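The three steps together in one short sketch, using a CSS selector; the class name is an assumption based on the XPath used later in the deck:

    import requests
    from bs4 import BeautifulSoup

    # 1. download the page
    html = requests.get("http://pydata.org/madrid2016/schedule/").text

    # 2. parse it
    soup = BeautifulSoup(html, "lxml")

    # 3. select elements with a CSS selector (assumed class name)
    for cell in soup.select("td.slot-talk"):
        print(cell.get_text(strip=True))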

Slide 20

Slide 20 text

XPath selectors

Expression            Meaning
name                  matches all nodes on the current level with the specified name
name[n]               matches the nth element on the current level with the specified name
/                     selects from the root node
//                    selects matching nodes anywhere below the current node
*                     matches all nodes on the current level
. or ..               selects the current / parent node
@name                 the attribute with the specified name
[@key='value']        all elements with an attribute that matches the specified key/value pair
name[@key='value']    all elements with the specified name and a matching attribute
[text()='value']      all elements with the specified text
name[text()='value']  all elements with the specified name and text
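A small lxml sketch applying a couple of these expressions; the URL and the class value are taken from later slides:

    import requests
    from lxml import html

    page = requests.get("http://pydata.org/madrid2016/schedule/")
    tree = html.fromstring(page.content)

    # all <td> cells with the given class attribute
    talks = tree.xpath('//td[@class="slot slot-talk"]')

    # the text of every <a> inside those cells
    titles = tree.xpath('//td[@class="slot slot-talk"]//a/text()')
    print(titles)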

Slide 21

Slide 21 text

BeautifulSoup
- Supported parsers: lxml, html5lib
- Installation:
  pip install lxml
  pip install html5lib
  pip install beautifulsoup4
- http://www.crummy.com/software/BeautifulSoup

Slide 22

Slide 22 text

BeautifulSoup

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_doc, 'lxml')
    print(soup.prettify())    # print the whole document
    print(soup.get_text())    # print only the text

Slide 23

Slide 23 text

BeautifulSoup functions
- find_all('a'): returns all links
- find('title'): returns the first matching element
- get('href'): returns the value of the href attribute
- (element).text: returns the text inside an element

    for link in soup.find_all('a'):
        print(link.get('href'))
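A self-contained sketch of these functions on a tiny, made-up document:

    from bs4 import BeautifulSoup

    html_doc = ('<html><head><title>PyData</title></head><body>'
                '<a href="http://pydata.org">PyData</a>'
                '<a href="http://scrapy.org">Scrapy</a></body></html>')

    soup = BeautifulSoup(html_doc, 'lxml')
    print(soup.find('title').text)      # first <title>, then its text
    for link in soup.find_all('a'):     # every link in the document
        print(link.get('href'), link.text)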

Slide 24

Slide 24 text

External/internal links

Slide 25

Slide 25 text

External/internal links http://pydata.org/madrid2016
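The slides show the author's own code for this; one possible sketch that splits a page's links into internal and external by comparing domains:

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    base = "http://pydata.org/madrid2016"
    base_domain = urlparse(base).netloc

    soup = BeautifulSoup(requests.get(base).text, "lxml")
    internal, external = [], []
    for a in soup.find_all("a", href=True):
        url = urljoin(base, a["href"])            # resolve relative links
        if urlparse(url).netloc == base_domain:
            internal.append(url)
        else:
            external.append(url)

    print(len(internal), "internal /", len(external), "external links")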

Slide 26

Slide 26 text

Webscraping pip install webscraping #Download instance D = download.Download() #get page html = D.get('http://pydata.org/madrid2016/schedule/') #get element where is located information xpath.search(html, '//td[@class="slot slot-talk"]')

Slide 27

Slide 27 text

Pydata agenda code structure

Slide 28

Slide 28 text

Extract data from pydata agenda

Slide 29

Slide 29 text

PyQuery
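The PyQuery slides are shown as images; a small sketch of its jQuery-style API, with the same assumed URL and class name as in the earlier sketches:

    import requests
    from pyquery import PyQuery as pq

    doc = pq(requests.get("http://pydata.org/madrid2016/schedule/").text)

    # CSS selection, jQuery style (assumed class name)
    for cell in doc("td.slot-talk").items():
        print(cell.text())

    # attributes work the same way
    print(doc("a").eq(0).attr("href"))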

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Scrapy installation pip install scrapy

Slide 33

Slide 33 text

Scrapy Uses a mechanism based on XPath expressions called Xpath Selectors. Uses Parser LXML to find elements Twisted for asyncronous operations

Slide 34

Slide 34 text

Scrapy advantages
- Faster than Mechanize because it uses asynchronous operations (Twisted)
- Better support for HTML parsing
- Better support for Unicode characters, redirections, gzipped responses and encodings
- The extracted data can be exported directly to JSON, XML and CSV

Slide 35

Slide 35 text

Architecture

Slide 36

Slide 36 text

Scrapy Shell

    scrapy shell

    from scrapy.selector import Selector
    hxs = Selector(response)
    info = hxs.xpath('//div[@class="slot-inner"]')

Slide 37

Slide 37 text

Scrapy Shell scrapy shell http://scrapy.org

Slide 38

Slide 38 text

Scrapy project

    $ scrapy startproject tutorial

- scrapy.cfg: the project configuration file
- tutorial/: the project's Python module
- items.py: the project's items file
- pipelines.py: the project's pipelines file
- settings.py: the project's settings file
- spiders/: the spiders directory

Slide 39

Slide 39 text

Pydata conferences

Slide 40

Slide 40 text

Generating spiders

    $ scrapy genspider -t basic <name> <domain>
    $ scrapy list        # lists the project's spiders
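A minimal spider of the kind genspider -t basic scaffolds, filled in with hypothetical names and the schedule XPath used earlier; the repository's real spider may differ:

    import scrapy

    class PydataSpider(scrapy.Spider):
        name = "pydata"                    # used by "scrapy crawl pydata"
        allowed_domains = ["pydata.org"]
        start_urls = ["http://pydata.org/madrid2016/schedule/"]

        def parse(self, response):
            for talk in response.xpath('//td[@class="slot slot-talk"]'):
                yield {
                    "title": talk.xpath('.//a/text()').extract_first(),
                    "href": talk.xpath('.//a/@href').extract_first(),
                }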

Slide 41

Slide 41 text

Pydata spider

Slide 42

Slide 42 text

Pydata spider

Slide 43

Slide 43 text

Pipelines (pipelines.py)

    ITEM_PIPELINES = {
        'pydataSchedule.pipelines.PyDataSQLitePipeline': 100,
        'pydataSchedule.pipelines.PyDataJSONPipeline': 200,
    }
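The pipeline classes above come from the author's project; a generic sketch of what an item pipeline like PyDataJSONPipeline could look like, not the author's actual code:

    # pipelines.py (sketch)
    import json

    class PyDataJSONPipeline(object):
        def open_spider(self, spider):
            self.file = open("items.json", "w")

        def process_item(self, item, spider):
            # write one JSON object per line
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

        def close_spider(self, spider):
            self.file.close()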

Slide 44

Slide 44 text

Pydata SQLitePipeline

Slide 45

Slide 45 text

Execution

    $ scrapy crawl <spider>
    $ scrapy crawl <spider> -o items.json -t json
    $ scrapy crawl <spider> -o items.csv -t csv
    $ scrapy crawl <spider> -o items.xml -t xml

Slide 46

Slide 46 text

Pydata conferences

Slide 47

Slide 47 text

Pydata conferences

Slide 48

Slide 48 text

Pydata conferences

Slide 49

Slide 49 text

Launch spiders without scrapy command
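One way to do this is Scrapy's CrawlerProcess, which runs spiders from a plain Python script; the spider name here is the hypothetical one from the sketch above:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl("pydata")   # spider name (hypothetical)
    process.start()           # blocks until the crawl finishes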

Slide 50

Slide 50 text

Scrapy Cloud
- http://doc.scrapinghub.com/scrapy-cloud.html
- https://dash.scrapinghub.com

    $ pip install shub
    $ shub login
    Insert your ScrapingHub API Key:

Slide 51

Slide 51 text

Scrapy Cloud  scrapy.cfg

    # Project: demo
    [deploy]
    url = https://dash.scrapinghub.com/api/scrapyd/
    # API key
    username = ec6334d7375845fdb876c1d10b2b1622
    password =
    # project identifier
    project = 25767

Slide 52

Slide 52 text

Scrapy Cloud $ shub deploy

Slide 53

Slide 53 text

Scrapy Cloud

Slide 54

Slide 54 text

Scrapy Cloud

Slide 55

Slide 55 text

Scrapy Cloud

Slide 56

Slide 56 text

Scrapy Cloud scheduling

    curl -u APIKEY: https://dash.scrapinghub.com/api/schedule.json -d project=PROJECT -d spider=SPIDER

Slide 57

Slide 57 text

References
- http://www.crummy.com/software/BeautifulSoup
- http://scrapy.org
- https://pypi.python.org/pypi/mechanize
- http://docs.webscraping.com
- http://docs.python-requests.org/en/latest
- http://selenium-python.readthedocs.org/index.html
- https://github.com/REMitchell/python-scraping

Slide 58

Slide 58 text

Books

Slide 59

Slide 59 text

Thank you!