Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
JusBrasil
April 12, 2013
Programming
2
200
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
ポーリング処理廃止によるイベント駆動アーキテクチャへの移行
seitarof
3
1.3k
GoのDB アクセスにおける 「型安全」と「柔軟性」の両立 - Bob という選択肢
tak848
0
290
Java 21/25 Virtual Threads 소개
debop
0
310
「速くなった気がする」をデータで疑う
senleaf24
0
110
おれのAgentic Coding 2026/03
tsukasagr
1
120
Codex CLI でつくる、Issue から merge までの開発フロー
amata1219
0
250
CS教育のDX AIによる育成の効率化
niftycorp
PRO
0
170
「接続」—パフォーマンスチューニングの最後の一手 〜点と点を結ぶ、その一瞬のために〜
kentaroutakeda
4
2.2k
GC言語のWasm化とComponent Modelサポートの実践と課題 - Scalaの場合
tanishiking
0
130
How to stabilize UI tests using XCTest
akkeylab
0
150
Tamach-sre-3_ANDPAD-shimaison93
mane12yurks38
0
200
AI活用のコスパを最大化する方法
ochtum
0
360
Featured
See All Featured
コードの90%をAIが書く世界で何が待っているのか / What awaits us in a world where 90% of the code is written by AI
rkaga
61
43k
Statistics for Hackers
jakevdp
799
230k
技術選定の審美眼(2025年版) / Understanding the Spiral of Technologies 2025 edition
twada
PRO
118
110k
The Curse of the Amulet
leimatthew05
1
11k
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
280
Deep Space Network (abreviated)
tonyrice
0
97
Building an army of robots
kneath
306
46k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
A Soul's Torment
seathinner
5
2.6k
The Power of CSS Pseudo Elements
geoffreycrofte
82
6.2k
Testing 201, or: Great Expectations
jmmastey
46
8.1k
Writing Fast Ruby
sferik
630
63k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412