Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
200
2
Share
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Other Decks in Programming
See All in Programming
When benchmarks go bad - what I learned from measuring performance wrong
hollycummins
0
360
〜バイブコーディングを超えて〜 チームで実験し続けたAI駆動開発
tigertora7571
0
190
ソフトウェア設計の結合バランス #phperkaigi
kajitack
0
490
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
540
Agentic UI in the Frontend: Architectures with Open Standards @JAX 2026 in Mainz
manfredsteyer
PRO
0
100
PHPでローカル環境用のSSL/TLS証明書を発行することはできるのか? #phpconkagawa
akase244
0
350
cloudnative conference 2026 flyle
azihsoyn
0
130
書き換えて学ぶTemporal #fukts
pirosikick
2
360
なぜあなたのコードには「コシ」がないのか?〜AI時代に問う、最後まで美味しい設計と戦略〜 #phpconkagawa / phpconkagawa2026
shogogg
0
150
UaaL×Androidアプリのメモリ計測 — Memory Profilerの先へ
rio432
0
130
Lightning-Fast Method Calls with Ruby 4.1 ZJIT / RubyKaigi 2026
k0kubun
3
2.6k
Firefoxにコントリビューションして得られた学び
ken7253
2
160
Featured
See All Featured
Exploring anti-patterns in Rails
aemeredith
3
350
Being A Developer After 40
akosma
91
590k
Skip the Path - Find Your Career Trail
mkilby
1
120
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Bioeconomy Workshop: Dr. Julius Ecuru, Opportunities for a Bioeconomy in West Africa
akademiya2063
PRO
1
110
How to make the Groovebox
asonas
2
2.2k
Typedesign – Prime Four
hannesfritz
42
3k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
32
2.9k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
10k
Crafting Experiences
bethany
1
140
GitHub's CSS Performance
jonrohan
1032
470k
Digital Projects Gone Horribly Wrong (And the UX Pros Who Still Save the Day) - Dean Schuster
uxyall
0
1.3k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412