Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
200
2
Share
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Other Decks in Programming
See All in Programming
ふりがな Deep Dive try! Swift Tokyo 2026
watura
0
210
Don't Prompt Harder, Structure Better
kitasuke
0
740
3分でわかるatama plusのQA/about atama plus QA
atamaplus
0
160
Oxlintとeslint-plugin-react-hooks 明日から始められそう?
t6adev
0
240
iOS機能開発のAI環境と起きた変化
ryunakayama
0
180
クラウドネイティブなエンジニアに向ける Raycastの魅力と実際の活用事例
nealle
2
180
Alternatives to JPA 2026
debop
0
110
Cache-moi si tu peux : patterns et pièges du cache en production - Devoxx France 2026 - Conférence
slecache
0
210
tRPCの概要と少しだけパフォーマンス
misoton665
2
160
一度始めたらやめられない開発効率向上術 / Findy あなたのdotfilesを教えて!
k0kubun
4
2.9k
「Linuxサーバー構築標準教科書」を読んでみた #ツナギメオフライン.7
akase244
0
1.4k
Offline should be the norm: building local-first apps with CRDTs & Kotlin Multiplatform
renaudmathieu
0
210
Featured
See All Featured
Writing Fast Ruby
sferik
630
63k
The Curse of the Amulet
leimatthew05
1
11k
Ten Tips & Tricks for a 🌱 transition
stuffmc
0
98
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Designing Powerful Visuals for Engaging Learning
tmiket
1
340
How to Ace a Technical Interview
jacobian
281
24k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.9k
Six Lessons from altMBA
skipperchong
29
4.2k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.6k
How to Align SEO within the Product Triangle To Get Buy-In & Support - #RIMC
aleyda
1
1.5k
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
320
Digital Projects Gone Horribly Wrong (And the UX Pros Who Still Save the Day) - Dean Schuster
uxyall
0
1.1k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412