Slide 1
Slide 1 text
Scrapy: an overview
Slide 2
Slide 2 text
/skræpi/
Slide 3
Slide 3 text
Web Crawler vs. Web Scraper
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
Scrapy Framework: Scraping / Crawling / Monitoring / Testing
Slide 7
Slide 7 text
Stable, active, large community
Slide 8
Slide 8 text
~200 pages of docs
Slide 9
Slide 9 text
Commercial support
Slide 10
Slide 10 text
Framework?
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
No content
Slide 14
Slide 14 text
Twisted event loop (reactor)
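To make the reactor idea concrete: a reactor is a single-threaded event loop that keeps many network operations in flight and resumes each one when its data arrives, instead of blocking on them one by one. The sketch below is not Twisted code. It swaps in Python's standard-library asyncio to show the same pattern, and fake_download and its delays are hypothetical stand-ins for real HTTP requests.

```python
import asyncio


async def fake_download(url, delay, finished):
    """Pretend to fetch `url`; sleeping hands control back to the event loop."""
    await asyncio.sleep(delay)
    finished.append(url)


async def crawl():
    finished = []
    # All three "downloads" are in flight at once on a single thread.
    await asyncio.gather(
        fake_download("page-a", 0.3, finished),
        fake_download("page-b", 0.1, finished),
        fake_download("page-c", 0.2, finished),
    )
    return finished


if __name__ == "__main__":
    print(asyncio.run(crawl()))
```

The pages finish in order of their delays rather than the order in which they were started, which is the behaviour an event loop gives a crawler without threads.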
Slide 15
Slide 15 text
No content
Slide 16
Slide 16 text
Your code goes here
Slide 17
Slide 17 text
The scraping logic
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
Spider middlewares: HttpErrorMiddleware, UrlLengthMiddleware, DepthMiddleware
Slide 20
Slide 20 text
Downloader middlewares: HttpProxyMiddleware, HttpCacheMiddleware, RedirectMiddleware
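Most of the built-in middlewares named on these two slides are tuned through project settings rather than code. The fragment below is a sketch of what that could look like in the project's settings.py. The setting names are Scrapy's own, but the values are illustrative assumptions, not recommendations.

```python
# home_news/settings.py (fragment, illustrative values only)

# DepthMiddleware: do not follow links more than 2 hops from the start URLs.
DEPTH_LIMIT = 2

# UrlLengthMiddleware: ignore requests whose URL is longer than this.
URLLENGTH_LIMIT = 2083

# HttpErrorMiddleware: let 404 responses reach the spider callbacks
# (by default, unsuccessful responses are filtered out).
HTTPERROR_ALLOWED_CODES = [404]

# HttpCacheMiddleware: cache responses on disk to speed up repeated runs.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600

# RedirectMiddleware: cap the number of redirects followed per request.
REDIRECT_MAX_TIMES = 5
```

HttpProxyMiddleware is not shown because it usually needs no setting: it picks up proxies from the http_proxy / https_proxy environment variables or from a request's meta["proxy"] key.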
Slide 21
Slide 21 text
Item pipeline: media download, persistence, post-processing
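A pipeline component is a plain Python class with a process_item method, so post-processing and persistence can be sketched without importing Scrapy at all. The two classes below are hypothetical examples, assuming dict-like items and the classic process_item(item, spider) signature. The class names and the items.jl file name are made up for illustration.

```python
import json


class CleanTextPipeline:
    """Post-processing: collapse stray whitespace in every string field."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = " ".join(value.split())
        return item  # hand the item on to the next pipeline component


class JsonLinesPipeline:
    """Persistence: write each item as one line of JSON."""

    def __init__(self, path="items.jl"):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

In a real project these would live in pipelines.py and be switched on through the ITEM_PIPELINES setting, which maps each class path to a number that fixes the order in which components run.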
Slide 22
Slide 22 text
Engine: data flow control
Slide 23
Slide 23 text
Scheduler: queuing
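The division of labour on the preceding slides (a queue of pending requests, a downloader, your scraping logic, and something driving data between them) can be modelled in a few lines. This is a toy model only, not how Scrapy is implemented: FAKE_SITE, download, parse and run are all hypothetical names, and everything runs synchronously in memory.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
FAKE_SITE = {
    "/": ("Home", ["/news/1", "/news/2"]),
    "/news/1": ("First story", ["/", "/news/2"]),
    "/news/2": ("Second story", []),
}


def download(url):
    """Stand-in for the downloader: look the page up instead of using HTTP."""
    return FAKE_SITE[url]


def parse(url, page):
    """Stand-in for the scraping logic: yield one item, then follow-up URLs."""
    title, links = page
    yield {"url": url, "title": title}
    yield from links


def run(start_url):
    queue = deque([start_url])  # queuing: requests waiting to be fetched
    seen = {start_url}          # duplicate filter
    items = []                  # what the item pipeline would receive
    while queue:                # data flow control: loop until nothing is pending
        url = queue.popleft()
        for result in parse(url, download(url)):
            if isinstance(result, dict):
                items.append(result)
            elif result not in seen:
                seen.add(result)
                queue.append(result)
    return items


if __name__ == "__main__":
    for item in run("/"):
        print(item)
```

Each page is visited once even though the pages link back to one another, because a URL is queued only the first time it is seen.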
Slide 24
Slide 24 text
Talk is cheap, show me the code.
Slide 25
Slide 25 text
$ pip install Scrapy
$ scrapy startproject home_news
Slide 26
Slide 26 text
home_news/
    scrapy.cfg
    home_news/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Slide 27
Slide 27 text
home_news/ (top-level directory): project root
Slide 28
Slide 28 text
home_news/scrapy.cfg: project config
Slide 29
Slide 29 text
home_news/home_news/: project module
Slide 30
Slide 30 text
home_news/home_news/items.py: your items
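An item is the structured record a spider emits for each scraped page. Scrapy's traditional way to declare one is a scrapy.Item subclass with scrapy.Field() attributes. Scrapy releases from 2.2 onward also accept plain dataclasses, which is what this sketch uses so that it runs without Scrapy installed. NewsItem and its three fields are hypothetical, chosen to match the link, headline and first-paragraph XPath expressions shown later in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NewsItem:
    """One scraped news story (hypothetical field names)."""

    url: str                     # link found on the home page
    title: Optional[str] = None  # headline of the story
    body: Optional[str] = None   # first paragraph of the story
```

Declaring the fields up front documents what the spider is expected to produce and catches misspelled field names early.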
Slide 31
Slide 31 text
home_news/home_news/pipelines.py: your pipelines
Slide 32
Slide 32 text
home_news/home_news/settings.py: your settings
Slide 33
Slide 33 text
home_news/home_news/spiders/: your spiders...
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
Slide 36
Slide 36 text
//*[@id="glbmateria"]/div[2]/h1/text()
Slide 37
Slide 37 text
//*[@id="materialetra"]/div/div/p[1]/text()
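The three expressions above select, respectively, a story link on the home page, the headline of a story, and its first paragraph. Inside a spider they would typically be passed to response.xpath(...). The sketch below only illustrates how such paths walk the markup, using the standard library's ElementTree, which supports a limited XPath subset. The SAMPLE markup is hypothetical and much shallower than the real pages, so the link path is shortened to match, and one document stands in for both the home page and a story page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real pages are more deeply nested.
SAMPLE = """
<html><body>
  <div id="glbcorpo"><div><a href="/news/1.html">Example headline</a></div></div>
  <div id="glbmateria">
    <div>Section</div>
    <div><h1>Example headline</h1></div>
  </div>
  <div id="materialetra">
    <div><div><p>First paragraph.</p><p>Second paragraph.</p></div></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree cannot end a path in @href or text(), so select the element
# and read the attribute or text from it afterwards.
link = root.find(".//*[@id='glbcorpo']/div/a").get("href")
title = root.find(".//*[@id='glbmateria']/div[2]/h1").text
first_paragraph = root.find(".//*[@id='materialetra']/div/div/p[1]").text

print(link, title, first_paragraph, sep="\n")
```

Positional steps such as div[2] and p[1] count from 1, so div[2] is the second div child and p[1] is the first paragraph. Paths this specific break easily when a site changes its layout.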
Slide 38
Slide 38 text
No content
Slide 39
Slide 39 text
$ pwd
/home/caco/studies/scrapy_news/home_news
Slide 40
Slide 40 text
$ pwd
/home/caco/studies/scrapy_news/home_news    (project root)
Slide 41
Slide 41 text
$ pwd
/home/caco/studies/scrapy_news/home_news
$ scrapy crawl g1 -o scraped_data.json -t json
Slide 44
Slide 44 text
$ pwd
/home/caco/studies/scrapy_news/home_news
$ scrapy crawl g1 -o scraped_data.json -t json
(feed exporters: json, csv, xml)
Slide 45
Slide 45 text
No content
Slide 46
Slide 46 text
No content
Slide 47
Slide 47 text
No content
Slide 48
Slide 48 text
Other nice features
● scrapyd: run as a service
● Webservice (issue commands via HTTP requests)
● Signals
● Stats module
● Contribs (CrawlSpider etc.)
Slide 49
Slide 49 text
Thanks! @cacovsky
Slide 50
Slide 50 text
Images
Spatula: http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html
Spiderman: http://tincan21.deviantart.com/art/muro-spidey-307810412