
Scrapy internals


Scrapy itself is a good example of a modern asynchronous application. Moreover, it's a Swiss Army knife with all kinds of extensions: item pipelines, HTML/CSS selectors, middlewares. In this talk, I explain how Scrapy's internal processing pipeline works, the design of its downloader queue, and everything needed to debug it: the Scrapy shell, the telnet console, and memory-consumption debugging.

Scrapy is a 100% asynchronous web scraping framework built on the Twisted event loop, with 21K GitHub stars!


Alexander Sibiryakov

July 17, 2017

Transcript

  1. Scrapy: web scraping • extraction of structured data, • selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel, • interactive shell, • feed exports in JSON, CSV, XML and storing to FTP, S3 or the local filesystem, • robust encoding support and auto-detection
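The extraction layer the slide points to is Parsel, the library that backs Scrapy's selectors. Below is a minimal sketch of its CSS/XPath/regexp API; the HTML snippet and field names are made up for illustration.

```python
from parsel import Selector

# A made-up HTML snippet for illustration.
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a> <span class="price">9.99</span></li>
    <li class="item"><a href="/p/2">Gadget</a> <span class="price">19.99</span></li>
  </ul>
</body></html>
"""

sel = Selector(text=html)

# CSS selectors: the page title and all product links.
title = sel.css("h1::text").get()                  # 'Products'
links = sel.css("li.item a::attr(href)").getall()  # ['/p/1', '/p/2']

# XPath and regexps work on the same Selector objects.
names = sel.xpath("//li[@class='item']/a/text()").getall()
prices = sel.css("span.price::text").re(r"\d+\.\d+")

print(title, links, names, prices)
```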
  2. Main features • extensible: spiders, signals, middlewares, extensions, and pipelines, • cookies, • robots.txt, • form submission, • telnet console, • graceful shutdown by signal
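To make the extensibility point concrete, here is a hypothetical item pipeline using Scrapy's standard `process_item` hook. The class and field names are invented; `DropItem` and the `ITEM_PIPELINES` setting are the real mechanism.

```python
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    """Hypothetical pipeline illustrating the process_item hook.

    Enabled via the ITEM_PIPELINES setting, e.g.:
    ITEM_PIPELINES = {"myproject.pipelines.PriceValidationPipeline": 300}
    """

    def process_item(self, item, spider):
        # Drop items without a price; otherwise normalize it and pass
        # the item on to the next pipeline in the chain.
        if not item.get("price"):
            raise DropItem(f"Missing price in {item!r}")
        item["price"] = float(item["price"])
        return item
```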
  3. Twisted • event-driven network programming framework, • event loop and Deferreds ("promises"), • protocols and transports: • TCP, UDP, SSL, UNIX sockets, • HTTP, DNS, SMTP/IMAP, IRC, • cross-platform
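A minimal, self-contained sketch of the Deferred ("promise") pattern the slide refers to: the reactor runs the event loop, and callbacks/errbacks fire when a result arrives. The `fetch_later` helper is invented for the demo.

```python
from twisted.internet import reactor, defer

def fetch_later(value, delay=1.0):
    """Return a Deferred (a 'promise') that fires with `value` after `delay` s."""
    d = defer.Deferred()
    # The reactor (event loop) schedules the callback; nothing blocks here.
    reactor.callLater(delay, d.callback, value)
    return d

def on_result(result):
    print("got:", result)
    return result

def on_error(failure):
    print("failed:", failure)

d = fetch_later("page body")
d.addCallbacks(on_result, on_error)   # the callback/errback chain
d.addBoth(lambda _: reactor.stop())   # stop the loop either way

reactor.run()
```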
  4. x86 time sources • Real Time Clock (RTC): absolute time, 1 s precision, • previously the 8254 chip, • HPET (High Precision Event Timer), at least 10 MHz: • a single counter for periodic mode, • many counters for one-shot mode, • compares the actual timer value against a target, • RDTSC/RDTSCP: CPU clock cycles, • proprietary timers
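From Python you see these hardware clocks only through the OS. A small standard-library script shows which OS clock backs each Python timer; the exact implementation strings are platform-dependent, so treat the output as an example.

```python
import time

# Show which OS clock backs each Python timer and its resolution.
for name in ("time", "monotonic", "perf_counter", "process_time"):
    info = time.get_clock_info(name)
    print(f"{name:14s} impl={info.implementation:32s} "
          f"resolution={info.resolution:.0e}s monotonic={info.monotonic}")

# On Linux these typically map to clock_gettime(...) calls; the kernel in
# turn sources them from TSC or HPET (see
# /sys/devices/system/clocksource/clocksource0/current_clocksource).
```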
  5. Throttling between internal components • Downloader, • Scraper, • Item pipelines (cleansing, validating, deduplication, storing, …), • Feed exports (serialization + disk/network I/O), • ?
  6. Flow control: memory • unlimited downloading → unlimited item growth from cascading feed pages, • maintain a limit on the amount of memory used for Responses in the queue (~5 MB)
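A simplified sketch of that memory cap. Scrapy enforces it in the scraper slot (the ~5 MB default corresponds to the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting); the class below is a made-up reduction of the idea, not Scrapy's actual code.

```python
class ResponseQueueBudget:
    """Track the total body size of Responses waiting to be scraped
    and signal when the engine should stop taking on new work."""

    def __init__(self, max_active_size=5_000_000):  # ~5 MB, as on the slide
        self.max_active_size = max_active_size
        self.active_size = 0

    def add_response(self, response):
        # Count at least 1 byte so empty bodies still register.
        self.active_size += max(len(response.body), 1)

    def finish_response(self, response):
        self.active_size -= max(len(response.body), 1)

    def needs_backout(self):
        # While True, don't pull new requests from the scheduler;
        # downloading pauses until scraping catches up.
        return self.active_size > self.max_active_size
```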
  7. Flow control: CPU • callbacks spending more time on CPU than the reactor gets for I/O, • the fix: an artificial delay of 100 ms before firing the callback/errback: > reactor.callLater(0.1, d.errback, _failure)
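The quoted line reflects the pattern of Scrapy's deferred helpers (defer_fail/defer_succeed in scrapy.utils.defer): instead of firing a callback or errback immediately, schedule it ~100 ms later so the reactor can service network readers and writers in between. A paraphrase of that pattern:

```python
from twisted.internet import defer, reactor

def defer_succeed(result):
    """Like twisted.internet.defer.succeed, but delays the callback to a
    later reactor iteration. The ~100 ms penalty gives the reactor a
    chance to go through readers and writers (network I/O) before
    attending pending delayed calls."""
    d = defer.Deferred()
    reactor.callLater(0.1, d.callback, result)
    return d

def defer_fail(_failure):
    """Errback counterpart; this is the line quoted on the slide."""
    d = defer.Deferred()
    reactor.callLater(0.1, d.errback, _failure)
    return d
```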
  8. Summarizing • concurrent item limits, • memory consumption limits, • scheduling of new calls with delays, • if a limit is reached → don't pick up a new request from the scheduler
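These limits meet in a single engine-side check. Below is a simplified rendering of the backout logic, modeled on Scrapy 1.x's ExecutionEngine._needs_backout; the attribute names are assumptions, not a verbatim copy. While it returns True, the engine loop simply does not ask the scheduler for the next request.

```python
def needs_backout(engine):
    """Sketch: should the engine stop pulling requests from the scheduler?"""
    return (
        not engine.running                      # engine is shutting down
        or engine.slot.closing                  # spider is being closed
        or engine.downloader.needs_backout()    # concurrency limits hit
        or engine.scraper.slot.needs_backout()  # queued response bytes over limit
    )
```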
  9. It just stopped… • Why? • Was some Deferred lost? • Where? • How to debug? No silver bullet: > self.heartbeat = task.LoopingCall(nextcall.schedule) + extensive logging
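The quoted line is the engine's heartbeat: a Twisted LoopingCall that periodically re-kicks the "schedule the next request" step, so a lost Deferred cannot stall the crawl forever. A runnable, simplified demo of the pattern (the 5-second interval matches Scrapy's engine; the logging body is invented):

```python
import logging
from twisted.internet import reactor, task

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("heartbeat")

def schedule_next_call():
    # Stand-in for nextcall.schedule from the slide: in Scrapy this nudges
    # the engine's "pull next request" loop even if an event was lost.
    logger.debug("heartbeat: nudging the engine loop")

# Fire the heartbeat every 5 seconds, like the engine's
# self.heartbeat = task.LoopingCall(nextcall.schedule).
heartbeat = task.LoopingCall(schedule_next_call)
heartbeat.start(5.0)

reactor.callLater(16, reactor.stop)  # run a short demo, then exit
reactor.run()
```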