Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
new(1.26) ← これすき / kamakura.go #8
utgwkk
0
1.5k
ご飯食べながらエージェントが開発できる。そう、Agentic Engineeringならね。
yokomachi
1
280
Agent Skills Workshop - AIへの頼み方を仕組み化する
gotalab555
14
7.8k
2026/02/04 AIキャラクター人格の実装論 口 調の模倣から、コンテキスト制御による 『思想』と『行動』の創発へ
sr2mg4
0
670
守る「だけ」の優しいEMを抜けて、 事業とチームを両方見る視点を身につけた話
maroon8021
3
160
ふつうの Rubyist、ちいさなデバイス、大きな一年
bash0c7
0
100
AWS Infrastructure as Code の新機能 2025 総まとめ 〜SA 4人による怒涛のデモ祭り〜
konokenj
10
3.1k
AI時代のソフトウェア開発でも「人が仕様を書く」から始めよう-医療IT現場での実践とこれから
koukimiura
0
130
文字コードの話
qnighy
43
17k
オブザーバビリティ駆動開発って実際どうなの?
yohfee
3
670
24時間止められないシステムを守る-医療ITにおけるランサムウェア対策の実際
koukimiura
2
180
Ruby x Terminal
a_matsuda
5
570
Featured
See All Featured
How To Stay Up To Date on Web Technology
chriscoyier
790
250k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
Getting science done with accelerated Python computing platforms
jacobtomlinson
2
130
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
79
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.8k
Agile that works and the tools we love
rasmusluckow
331
21k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
31
3.1k
The Limits of Empathy - UXLibs8
cassininazir
1
240
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
27k
Docker and Python
trallard
47
3.8k
16th Malabo Montpellier Forum Presentation
akademiya2063
PRO
0
63
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412