
Aaaarrgghh, Spider! Web scraping with Scrapy

A quick yet reasonably thorough introduction to Scrapy, the high-level web crawling & scraping framework for Python, given at PyCon Thailand 2018.

* What is scraping?
* What is Scrapy?
* Core concepts demo
* Problems and solutions
* Resources

Michael Kohl

June 17, 2018

Transcript

  1. Yours truly (PyCon Thailand 2018, locksteplabs.com)
     • Michael Kohl (@citizen428)
     • Based in Bangkok, Thailand
     • CTO @ Lockstep Labs
     • Ruby developer since 2003 (“Splitter!”)
     • Py-curious, but mostly for ML / data science
  2. Outline
     • What is scraping?
     • Scraping in Python
     • Why Scrapy?
     • Core concepts demo
     • Problems and solutions
  3. Scraping
     • AKA screen scraping, web harvesting, data extraction
     • Extracting (structured) data from (unstructured) web sites
     • Use cases:
         • web indexing
         • data mining
         • price comparison
         • change detection
         • data mashups
  4. Danger Zone
     • Legal grey area
         • eBay v. Bidder’s Edge (2000)
         • Intel v. Hamidi (2003)
         • Facebook (2009)
     • Anti-scraping measures
         • Bot detection
         • Captchas
         • robots.txt
  5. Scraping in Python
     • requests
     • lxml
     • cssselect
     • Beautiful Soup
     • Selenium
  6. Scrapy
     • Fast: event-driven networking (Twisted)
     • High-level: useful abstractions
     • Crawling: predefined crawlers, easy to write your own
     • Scraping: selectors & feed exporters
     • Framework: comprehensive toolset
     • Commercial support in the form of Scrapinghub
     • FOSS
  7. Concepts
     • Spiders (Spider, CrawlSpider…)
     • Selectors (XPath, CSS, regex, etc.)
     • Items & item loaders
     • Input & output processors
     • Pipelines
  8. Problems
     • JavaScript. It’s ALWAYS JavaScript. → Selenium, scrapy-splash
     • Captchas → Decaptcha, Death By Captcha
     • Writing scrapers is boring → Scrapely, Portia
     • Deployment → ScrapingHub, Scrapyd
  9. Resources
     • Scrapy: https://scrapy.org
     • scrapy-splash: https://github.com/scrapinghub/splash
     • Decaptcha: https://github.com/pombredanne/decaptcha
     • Death By Captcha: http://www.deathbycaptcha.com
     • Scrapely: https://github.com/scrapy/scrapely
     • Portia: https://github.com/scrapinghub/portia
     • ScrapingHub: https://scrapinghub.com
     • Scrapyd: https://github.com/scrapy/scrapyd
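
The "Danger Zone" slide lists robots.txt among the anti-scraping measures. As a rough sketch of what honouring it looks like, the snippet below uses only the standard library's `urllib.robotparser`. The robots.txt content, the `example-bot` user agent and the example.com URLs are made up for illustration; a real crawler would fetch the file from the target site. In Scrapy itself this behaviour is controlled by the `ROBOTSTXT_OBEY` setting rather than written by hand.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is an assumed user agent name, not one taken from the talk.
print(parser.can_fetch("example-bot", "https://example.com/products/1"))    # True
print(parser.can_fetch("example-bot", "https://example.com/private/data"))  # False
print(parser.crawl_delay("example-bot"))                                    # 5
```

Checking `can_fetch` before each request and sleeping for `crawl_delay` seconds between requests is the polite baseline; it does not settle the legal questions raised on the same slide.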
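
Of the items on the "Concepts" slide, pipelines are the easiest to show without a running crawler: a Scrapy item pipeline is a plain Python class with a `process_item(item, spider)` method, the signature documented around the time of the talk, and Scrapy calls it for every item once the class is listed in the `ITEM_PIPELINES` setting. The sketch below is not from the talk. The class name, the `title` and `price` fields and the sample item are all hypothetical, and the item is a plain dict.

```python
import re


class PriceCleanupPipeline:
    """Hypothetical item pipeline that normalises a scraped item's fields."""

    def process_item(self, item, spider):
        # Collapse runs of whitespace (including newlines) in the title.
        item["title"] = " ".join(item["title"].split())
        # Keep only digits and the decimal point, then convert to a number.
        item["price"] = float(re.sub(r"[^0-9.]", "", item["price"]))
        return item


# Scrapy would normally call process_item; here it is invoked by hand
# with a made-up item and no spider.
raw_item = {"title": "  Pad   Thai \n", "price": "THB 1,250.00"}
clean = PriceCleanupPipeline().process_item(raw_item, spider=None)
print(clean)  # {'title': 'Pad Thai', 'price': 1250.0}
```

Keeping cleanup in a pipeline rather than in the spider means the spider only has to locate the data, while validation and normalisation live in one place that every spider in the project shares.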