Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
200
2
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Other Decks in Programming
See All in Programming
Javaの型とAI時代に型が大事な理由 / java types and type in AI era
kishida
2
140
Language Server 使ってる? 〜VSCode と Zed の場合〜 / Are you using a Language Server? ~For VS Code and Zed~
handlename
0
790
Inside Stream API
skrb
1
730
New "Type" system on PicoRuby
pocke
1
960
DynamoDBには集計系のクエリがないけどなんとかしたい
musan
1
180
Skillsは効率化、Agentsは"自分の拡張"——Builder時代のエージェント編成(CC Night 2026)
wemra
1
140
正しくソフトウェアを作る、前提を疑うための認知の視点 / doubt-premise
minodriven
21
6.7k
Even G2とAWSで推しのエージェントを召喚しよう!
har1101
1
120
Vue × Nuxt × Oxc どこまで使える?実運用の現在地
andpad
0
260
Snowflake Summitでの新機能 CoCo / CoWork / snowflake-summit-2026-overall-what-new-coco
tatsuhiro
1
150
TypeScript+Orvalで実現する型安全かつ堅牢でスケーラブルなマルチチャネル通知基盤 / TSKaigi Night talks ~after conference~
d0riven
0
340
JJUG CCC 2026 Spring: JSpecify で実現する Kotlin フレンドリーな Java API 設計
ternbusty
1
180
Featured
See All Featured
Exploring anti-patterns in Rails
aemeredith
3
410
Technical Leadership for Architectural Decision Making
baasie
3
410
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
162
16k
The Language of Interfaces
destraynor
162
27k
The Cult of Friendly URLs
andyhume
79
6.9k
Writing Fast Ruby
sferik
630
63k
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
230
Documentation Writing (for coders)
carmenintech
77
5.4k
SERP Conf. Vienna - Web Accessibility: Optimizing for Inclusivity and SEO
sarafernandez
2
1.5k
Designing Powerful Visuals for Engaging Learning
tmiket
1
420
The Director’s Chair: Orchestrating AI for Truly Effective Learning
tmiket
1
200
Build The Right Thing And Hit Your Dates
maggiecrowley
39
3.2k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412