Aaaarrgghh, Spider! Web scraping with Scrapy

An intro to web scraping with Scrapy given at ThaiPy in December 2017.

Michael Kohl

December 14, 2017

Transcript

  1. Slide 2: Yours truly

    • Michael Kohl (@citizen428)
    • Based in Bangkok, Thailand
    • CTO @ Lockstep Labs
    • Ruby developer since 2004
    • Always end up speaking at ThaiPy ¯\_(ツ)_/¯

    ThaiPy December 2017 locksteplabs.com
  2. Slide 4: Outline

    • What is scraping?
    • What is Scrapy?
    • Core concepts demo
    • Problems and solutions
    • Resources
  3. Slide 5: Scraping

    • AKA screen scraping, web harvesting, data extraction
    • Extracting data from web sites
    • Turning unstructured data into structured data
    • Use cases: web indexing, data mining, price comparison, change detection, mashups etc.
    • Anti-scraping measures like robots.txt, captchas, bot detection frameworks etc.
    • Legal grey area
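
As a rough illustration of the robots.txt point above, here is a minimal sketch using only Python's standard library. The robots.txt content, the user agent name and the example.com URLs are made up for the example; a real scraper would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "examplebot" is an assumed user agent name.
    print(is_allowed(ROBOTS_TXT, "examplebot", "http://example.com/public/page"))
    print(is_allowed(ROBOTS_TXT, "examplebot", "http://example.com/private/data"))
```

Scrapy can perform this check itself when the ROBOTSTXT_OBEY setting is enabled, so a hand-rolled version like this is mostly useful for understanding what the rule matching does.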
  4. Slide 7: Scrapy

    • Fast: event-driven networking (Twisted)
    • High-level: comprehensive, with many useful abstractions and tools
    • Crawling: predefined crawlers, easy to write your own
    • Scraping: selectors & feed exporters
    • Commercial support in the form of Scrapinghub
    • FOSS
  5. Slide 8: Concepts

    • Spiders (Spider, CrawlSpider, etc)
    • Selectors (XPath, CSS, Regex, etc)
    • Items & item loaders
    • Input & output processors
    • Pipelines
  6. Slide 10: Problems

    • JavaScript. It’s ALWAYS JavaScript.
      Selenium, scrapy-splash
    • Captchas
      Decaptcha, Death By Captcha
    • Writing scrapers is boring
      Scrapely, Portia
    • Deployment
      Scrapinghub, Scrapyd
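
To make the deployment item a little more concrete: Scrapyd exposes an HTTP JSON API, and a spider run is scheduled by POSTing to its schedule.json endpoint. The sketch below only builds that request with the standard library and does not send it. The project and spider names are hypothetical, and the base URL assumes a local Scrapyd on its default port 6800.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(
    project: str,
    spider: str,
    base_url: str = "http://localhost:6800",  # assumed local Scrapyd instance
) -> Request:
    """Build (but do not send) a POST request that schedules a spider run."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


if __name__ == "__main__":
    # "myproject" and "quotes" are placeholder names.
    request = build_schedule_request("myproject", "quotes")
    print(request.get_method(), request.full_url, request.data)
```

Passing the request to `urllib.request.urlopen` would submit the job, provided Scrapyd is running and the project has already been deployed to it.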
  7. Slide 11: Resources

    • Scrapy: https://scrapy.org
    • scrapy-splash: https://github.com/scrapinghub/splash
    • Decaptcha: https://github.com/pombredanne/decaptcha
    • Death By Captcha: http://www.deathbycaptcha.com
    • Scrapely: https://github.com/scrapy/scrapely
    • Portia: https://github.com/scrapinghub/portia
    • ScrapingHub: https://scrapinghub.com
    • Scrapyd: https://github.com/scrapy/scrapyd