Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
PHPUnitの限界をPlaywrightで補完するテストアプローチ
yuzneri
0
370
[DevinMeetupTokyo2025] コード書かせないDevinの使い方
takumiyoshikawa
2
250
DataformでPythonする / dataform-de-python
snhryt
0
150
実践 Dev Containers × Claude Code
touyu
1
120
Streamlitで実現できるようになったこと、実現してくれたこと
ayumu_yamaguchi
2
270
0から始めるモジュラーモノリス-クリーンなモノリスを目指して
sushi0120
0
250
Dart 参戦!!静的型付き言語界の隠れた実力者
kno3a87
0
160
ZeroETLで始めるDynamoDBとS3の連携
afooooil
0
150
[SRE NEXT] 複雑なシステムにおけるUser Journey SLOの導入
yakenji
1
910
Claude Code で Astro blog を Pages から Workers へ移行してみた
codehex
0
170
PHPカンファレンス関西2025 基調講演
sugimotokei
6
1.1k
JetBrainsのAI機能の紹介 #jjug
yusuke
0
180
Featured
See All Featured
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
34
6k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
Balancing Empowerment & Direction
lara
1
530
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Done Done
chrislema
185
16k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
50k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
130
19k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
The Cost Of JavaScript in 2023
addyosmani
51
8.7k
Automating Front-end Workflow
addyosmani
1370
200k
Why Our Code Smells
bkeepers
PRO
337
57k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412