Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Scrapy Overview
Search
JusBrasil
April 12, 2013
Programming
2
190
Scrapy Overview
An overview of the Scrapy framework by @cacovsky
JusBrasil
April 12, 2013
Tweet
Share
Other Decks in Programming
See All in Programming
CDK開発におけるコーディング規約の運用
yamanashi_ren01
2
120
2024年のkintone API振り返りと2025年 / kintone API look back in 2024
tasshi
0
220
楽しく向き合う例外対応
okutsu
0
120
Software Architecture
hschwentner
6
2.1k
Rubyで始める関数型ドメインモデリング
shogo_tksk
0
110
密集、ドキュメントのコロケーション with AWS Lambda
satoshi256kbyte
0
190
PHPのバージョンアップ時にも役立ったAST
matsuo_atsushi
0
110
How mixi2 Uses TiDB for SNS Scalability and Performance
kanmo
37
14k
Honoのおもしろいミドルウェアをみてみよう
yusukebe
1
210
動作確認やテストで漏れがちな観点3選
starfish719
6
1k
Amazon S3 TablesとAmazon S3 Metadataを触ってみた / 20250201-jawsug-tochigi-s3tables-s3metadata
kasacchiful
0
170
Conform を推す - Advocating for Conform
mizoguchicoji
3
690
Featured
See All Featured
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
47
5.2k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
114
50k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
9
440
A Tale of Four Properties
chriscoyier
158
23k
Site-Speed That Sticks
csswizardry
4
380
The Pragmatic Product Professional
lauravandoore
32
6.4k
Git: the NoSQL Database
bkeepers
PRO
427
64k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
The Power of CSS Pseudo Elements
geoffreycrofte
75
5.5k
Build The Right Thing And Hit Your Dates
maggiecrowley
34
2.5k
StorybookのUI Testing Handbookを読んだ
zakiyama
28
5.5k
Transcript
Scrapy an overview
/skræpi/
Web Crawler vs. Web Scraper
None
None
Scrapy Framework Scraping / Crawling / Monitoring / Testing
Stable Active Large community
~200 pages of docs
Commercial support
Framework?
None
None
None
Twisted event loop (reactor)
None
Your code goes here
The scraping logic
None
HttpErrorMiddleware UrlLengthMiddleware DepthMiddleware
HttpProxyMiddleware HttpCacheMiddleware RedirectMiddleware
Media download Persistence Post-processing
Data flow control
Queuing
Talk is cheap, show me the code.
$ pip install Scrapy $ scrapy startproject home_news
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project root
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project config
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Project module
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your items
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your pipelines
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your settings
home_news/ scrapy.cfg home_news/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ...
Your spiders...
None
//*[@id="glbcorpo"]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div/div/a/@href
//*[@id="glbmateria"]/div[2]/h1/text()
//*[@id="materialetra"]/div/div/p[1]/text()
None
$ pwd /home/caco/studies/scrapy_news/home_news
$ pwd /home/caco/studies/scrapy_news/home_news (project root)
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json
$ pwd /home/caco/studies/scrapy_news/home_news $ scrapy crawl g1 -o scraped_data.json -t
json (feed exporters: json,csv,xml)
None
None
None
Other nice features • scrapyd: run as a service •
Webservice (issue commands via http requests) • Signals • Stats module • Contribs (CrawlSpider etc)
Obrigado! @cacovsky Thanks! @cacovsky
Images Spatula http://www.duebuoi.it/x/uk_usd/catalog/p/spatulas~805-16x10.html Spiderman http://tincan21.deviantart.com/art/muro-spidey-307810412