
Scrapy internals


Scrapy itself is a good example of a modern asynchronous application. Moreover, it's a Swiss Army knife with all kinds of extensions: item pipelines, HTML/CSS selectors, middlewares. In this talk, I explain how Scrapy's internal processing pipeline works, the design of its downloader queue, and everything needed to debug it: the Scrapy shell, the telnet console, and memory-consumption debugging.

Scrapy is a 100% asynchronous web scraping framework built on the Twisted event loop, with 21K GitHub stars!


Alexander Sibiryakov

July 17, 2017

Transcript

  1. Scrapy: web scraping • extraction of structured data, • selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel, • interactive shell, • feed exports in JSON, CSV, XML and storing to FTP, S3 or the local filesystem, • robust encoding support and auto-detection
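The extraction layer the slide points to is Parsel, the library that backs Scrapy's selectors. Below is a minimal sketch of its CSS/XPath/regexp API; the HTML snippet and field names are made up for illustration.

```python
from parsel import Selector

# A made-up HTML snippet for illustration.
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a> <span class="price">9.99</span></li>
    <li class="item"><a href="/p/2">Gadget</a> <span class="price">19.99</span></li>
  </ul>
</body></html>
"""

sel = Selector(text=html)

# CSS selectors: the page title and all product links.
title = sel.css("h1::text").get()                  # 'Products'
links = sel.css("li.item a::attr(href)").getall()  # ['/p/1', '/p/2']

# XPath and regexps work on the same Selector objects.
names = sel.xpath("//li[@class='item']/a/text()").getall()
prices = sel.css("span.price::text").re(r"\d+\.\d+")

print(title, links, names, prices)
```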
  2. Main features • extensible: spiders, signals, middlewares, extensions, and pipelines, • cookies, • robots.txt, • form submission, • telnet console, • graceful shutdown by signal
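To make the extensibility point concrete, here is a hypothetical item pipeline using Scrapy's standard `process_item` hook. The class and field names are invented; `DropItem` and the `ITEM_PIPELINES` setting are the real mechanism.

```python
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    """Hypothetical pipeline illustrating the process_item hook.

    Enabled via the ITEM_PIPELINES setting, e.g.:
    ITEM_PIPELINES = {"myproject.pipelines.PriceValidationPipeline": 300}
    """

    def process_item(self, item, spider):
        # Drop items without a price; otherwise normalize it and pass
        # the item on to the next pipeline in the chain.
        if not item.get("price"):
            raise DropItem(f"Missing price in {item!r}")
        item["price"] = float(item["price"])
        return item
```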
  3. Twisted • event-driven network programming framework, • event loop and Deferreds ("promises"), • protocols and transports: • TCP, UDP, SSL, UNIX sockets, • HTTP, DNS, SMTP/IMAP, IRC, • cross-platform
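A minimal, self-contained sketch of the Deferred ("promise") pattern the slide refers to: the reactor runs the event loop, and callbacks/errbacks fire when a result arrives. The `fetch_later` helper is invented for the demo.

```python
from twisted.internet import reactor, defer

def fetch_later(value, delay=1.0):
    """Return a Deferred (a 'promise') that fires with `value` after `delay` s."""
    d = defer.Deferred()
    # The reactor (event loop) schedules the callback; nothing blocks here.
    reactor.callLater(delay, d.callback, value)
    return d

def on_result(result):
    print("got:", result)
    return result

def on_error(failure):
    print("failed:", failure)

d = fetch_later("page body")
d.addCallbacks(on_result, on_error)   # the callback/errback chain
d.addBoth(lambda _: reactor.stop())   # stop the loop either way

reactor.run()
```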
  4. x86 time sources • Real Time Clock (RTC): absolute time, 1 s precision, • previously the 8254 chip, • HPET (High Precision Event Timer), at least 10 MHz: • a single counter for periodic mode, • many counters for one-shot mode, • compares the actual timer value against a target, • RDTSC/RDTSCP: CPU clock cycles, • proprietary timers
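From Python you see these hardware clocks only through the OS. A small standard-library script shows which OS clock backs each Python timer; the exact implementation strings are platform-dependent, so treat the output as an example.

```python
import time

# Show which OS clock backs each Python timer and its resolution.
for name in ("time", "monotonic", "perf_counter", "process_time"):
    info = time.get_clock_info(name)
    print(f"{name:14s} impl={info.implementation:32s} "
          f"resolution={info.resolution:.0e}s monotonic={info.monotonic}")

# On Linux these typically map to clock_gettime(...) calls; the kernel in
# turn sources them from TSC or HPET (see
# /sys/devices/system/clocksource/clocksource0/current_clocksource).
```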
  5. Throttling between internal components • Downloader, • Scraper, • Item pipelines (cleansing, validating, deduplication, storing, …), • Feed exports (serialization + disk/network I/O), • ?
  6. Flow control: memory • unlimited downloading → unlimited item growth from cascading feed pages, • maintain a limit on the amount of memory used for Responses in the queue (~5 MB)
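A simplified sketch of that memory cap. Scrapy enforces it in the scraper slot (the ~5 MB default corresponds to the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting); the class below is a made-up reduction of the idea, not Scrapy's actual code.

```python
class ResponseQueueBudget:
    """Track the total body size of Responses waiting to be scraped
    and signal when the engine should stop taking on new work."""

    def __init__(self, max_active_size=5_000_000):  # ~5 MB, as on the slide
        self.max_active_size = max_active_size
        self.active_size = 0

    def add_response(self, response):
        # Count at least 1 byte so empty bodies still register.
        self.active_size += max(len(response.body), 1)

    def finish_response(self, response):
        self.active_size -= max(len(response.body), 1)

    def needs_backout(self):
        # While True, don't pull new requests from the scheduler;
        # downloading pauses until scraping catches up.
        return self.active_size > self.max_active_size
```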
  7. Flow control: CPU • callbacks spending more time on CPU than the reactor gets for I/O, • the fix: an artificial delay of 100 ms before firing the callback/errback: > reactor.callLater(0.1, d.errback, _failure)
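The quoted line reflects the pattern of Scrapy's deferred helpers (defer_fail/defer_succeed in scrapy.utils.defer): instead of firing a callback or errback immediately, schedule it ~100 ms later so the reactor can service network readers and writers in between. A paraphrase of that pattern:

```python
from twisted.internet import defer, reactor

def defer_succeed(result):
    """Like twisted.internet.defer.succeed, but delays the callback to a
    later reactor iteration. The ~100 ms penalty gives the reactor a
    chance to go through readers and writers (network I/O) before
    attending pending delayed calls."""
    d = defer.Deferred()
    reactor.callLater(0.1, d.callback, result)
    return d

def defer_fail(_failure):
    """Errback counterpart; this is the line quoted on the slide."""
    d = defer.Deferred()
    reactor.callLater(0.1, d.errback, _failure)
    return d
```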
  8. Summarizing • concurrent item limits, • memory consumption limits, • scheduling of new calls with delays, • if a limit is reached → don't pick up a new request from the scheduler
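These limits meet in a single engine-side check. Below is a simplified rendering of the backout logic, modeled on Scrapy 1.x's ExecutionEngine._needs_backout; the attribute names are assumptions, not a verbatim copy. While it returns True, the engine loop simply does not ask the scheduler for the next request.

```python
def needs_backout(engine):
    """Sketch: should the engine stop pulling requests from the scheduler?"""
    return (
        not engine.running                      # engine is shutting down
        or engine.slot.closing                  # spider is being closed
        or engine.downloader.needs_backout()    # concurrency limits hit
        or engine.scraper.slot.needs_backout()  # queued response bytes over limit
    )
```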
  9. It just stopped… • Why? • Was some Deferred lost? • Where? • How to debug? No silver bullet: > self.heartbeat = task.LoopingCall(nextcall.schedule) + extensive logging
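The quoted line is the engine's heartbeat: a Twisted LoopingCall that periodically re-kicks the "schedule the next request" step, so a lost Deferred cannot stall the crawl forever. A runnable, simplified demo of the pattern (the 5-second interval matches Scrapy's engine; the logging body is invented):

```python
import logging
from twisted.internet import reactor, task

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("heartbeat")

def schedule_next_call():
    # Stand-in for nextcall.schedule from the slide: in Scrapy this nudges
    # the engine's "pull next request" loop even if an event was lost.
    logger.debug("heartbeat: nudging the engine loop")

# Fire the heartbeat every 5 seconds, like the engine's
# self.heartbeat = task.LoopingCall(nextcall.schedule).
heartbeat = task.LoopingCall(schedule_next_call)
heartbeat.start(5.0)

reactor.callLater(16, reactor.stop)  # run a short demo, then exit
reactor.run()
```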