Aaaarrgghh, Spider! Web scraping with Scrapy

Michael Kohl
December 14, 2017

An intro to web scraping with Scrapy given at ThaiPy in December 2017.

Transcript

  1. Yours truly
     • Michael Kohl (@citizen428)
     • Based in Bangkok, Thailand
     • CTO @ Lockstep Labs
     • Ruby developer since 2004
     • Always end up speaking at ThaiPy ¯\_(ツ)_/¯
     ThaiPy December 2017 locksteplabs.com
  2. Outline
     • What is scraping?
     • What is Scrapy?
     • Core concepts demo
     • Problems and solutions
     • Resources
  3. Scraping
     • AKA screen scraping, web harvesting, data extraction
     • Extracting data from web sites
     • Turning unstructured data into structured data
     • Use cases: web indexing, data mining, price comparison, change detection, mashups, etc.
     • Anti-scraping measures like robots.txt, captchas, bot detection frameworks, etc.
     • Legal grey area
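To make the robots.txt point concrete, here is a minimal sketch of how a polite scraper could check whether a URL may be fetched before requesting it. It uses only Python's standard-library `urllib.robotparser`, not Scrapy. The robots.txt contents, the `example.com` URLs, the `example-bot` user agent, and the `allowed` helper are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real crawler would download this
# file from the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed robots.txt rules permit fetching the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/public/page"))   # True
print(allowed("https://example.com/private/data"))  # False
```

Scrapy can perform this check for you when the `ROBOTSTXT_OBEY` setting is enabled; the sketch only shows what such a check amounts to.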
  4. Scrapy
     • Fast: event-driven networking (Twisted)
     • High-level: comprehensive, with many useful abstractions and tools
     • Crawling: predefined crawlers, easy to write your own
     • Scraping: selectors & feed exporters
     • Commercial support in the form of Scrapinghub
     • FOSS
  5. Concepts
     • Spiders (Spider, CrawlSpider, etc.)
     • Selectors (XPath, CSS, Regex, etc.)
     • Items & item loaders
     • Input & output processors
     • Pipelines
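The way these concepts fit together can be sketched in plain Python. This is not Scrapy's API: it is a standard-library stand-in that mirrors the data flow, with a parse callback playing the spider, an XPath query (the limited subset supported by `xml.etree.ElementTree`) playing the selector, a dataclass playing the item, small cleaning functions playing the input processors, and a generator playing a pipeline stage. The page fragment and all names (`Product`, `clean_name`, `parse_price`, `drop_expensive`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, well-formed page fragment standing in for a downloaded response.
PAGE = """
<html><body>
  <div class="product"><h2> Widget </h2><span class="price">$9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$24.50</span></div>
</body></html>
"""


@dataclass
class Product:
    """Plays the role of an item: the structured record we want."""
    name: str
    price: float


def clean_name(raw):
    """Input-processor stand-in: strip surrounding whitespace."""
    return raw.strip()


def parse_price(raw):
    """Input-processor stand-in: turn a string like '$9.99' into a float."""
    return float(raw.strip().lstrip("$"))


def parse(page):
    """Spider-callback stand-in: select nodes and yield one item per product."""
    root = ET.fromstring(page)
    for node in root.findall(".//div[@class='product']"):
        yield Product(
            name=clean_name(node.findtext("h2")),
            price=parse_price(node.findtext("span[@class='price']")),
        )


def drop_expensive(items, limit):
    """Pipeline-stage stand-in: pass through only items at or below the limit."""
    for item in items:
        if item.price <= limit:
            yield item


products = list(drop_expensive(parse(PAGE), limit=20.0))
print(products)
```

In Scrapy itself, the spider class, selectors, item loaders, and item pipelines take over these roles, and real pages are rarely well-formed XML, so a proper HTML parser is needed there.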
  6. Problems
     • JavaScript. It’s ALWAYS JavaScript.
       Selenium, scrapy-splash
     • Captchas
       Decaptcha, Death By Captcha
     • Writing scrapers is boring
       Scrapely, Portia
     • Deployment
       ScrapingHub, Scrapyd
  7. Resources
     • Scrapy
       https://scrapy.org
     • scrapy-splash
       https://github.com/scrapinghub/splash
     • Decaptcha
       https://github.com/pombredanne/decaptcha
     • Death By Captcha
       http://www.deathbycaptcha.com
     • Scrapely
       https://github.com/scrapy/scrapely
     • Portia
       https://github.com/scrapinghub/portia
     • ScrapingHub
       https://scrapinghub.com
     • Scrapyd
       https://github.com/scrapy/scrapyd