Aaaarrgghh, Spider! Web scraping with Scrapy

An intro to web scraping with Scrapy given at ThaiPy in December 2017.

Michael Kohl

December 14, 2017

Transcript

  1. Aaaarrgghh, Spider! Web scraping with Scrapy
    Michael Kohl, ThaiPy, December 2017
  2. Yours truly
    • Michael Kohl (@citizen428)
    • Based in Bangkok, Thailand
    • CTO @ Lockstep Labs
    • Ruby developer since 2004
    • Always end up speaking at ThaiPy ¯\_(ツ)_/¯
    ThaiPy December 2017 locksteplabs.com
  3. Aaaarrgghh!
  4. Outline
    • What is scraping?
    • What is Scrapy?
    • Core concepts demo
    • Problems and solutions
    • Resources
  5. Scraping
    • AKA screen scraping, web harvesting, data extraction
    • Extracting data from web sites
    • Turning unstructured data into structured data
    • Use cases: web indexing, data mining, price comparison, change detection, mashups, etc.
    • Anti-scraping measures like robots.txt, captchas, bot detection frameworks, etc.
    • Legal grey area
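
To make "turning unstructured data into structured data" concrete, here is a minimal sketch that uses only Python's standard library `html.parser`, not Scrapy. The HTML snippet, the class names (`product`, `name`, `price`) and the helper `scrape_products` are all made up for illustration and do not come from the talk.

```python
from html.parser import HTMLParser

# Hypothetical product listing; the markup and class names are assumptions.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.50</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">12.00</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # field whose text we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field is not None:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return a list of {'name': str, 'price': float} records."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    # Convert the price text into a number so the records are usable as data.
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


products = scrape_products(SAMPLE_HTML)
```

Hand-written parsers like this become tedious quickly, which is the gap that selector-based tools such as Scrapy aim to fill.
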
  6. Scrapy
    “A fast high-level web crawling & scraping framework for Python”
  7. Scrapy
    • Fast: event-driven networking (Twisted)
    • High-level: Comprehensive with many useful abstractions and tools
    • Crawling: Predefined crawlers, easy to write your own
    • Scraping: Selectors & feed exporters
    • Commercial support in the form of Scrapinghub
    • FOSS
  8. Concepts
    • Spiders (Spider, CrawlSpider, etc.)
    • Selectors (XPath, CSS, Regex, etc.)
    • Items & item loaders
    • Input & output processors
    • Pipelines
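
As an illustration of the last concept, here is a minimal sketch of an item pipeline. In Scrapy, a pipeline component is a plain Python class with a `process_item(self, item, spider)` method, so the sketch needs no imports. The class name `PriceCleanupPipeline` and the `name` and `price` fields are hypothetical and were not shown in the talk.

```python
class PriceCleanupPipeline:
    """Hypothetical pipeline that normalises an item's 'name' and 'price'."""

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item and passes in the spider
        # that produced it; the returned item goes on to the next pipeline.
        item["name"] = item["name"].strip()
        price_text = str(item["price"]).replace("$", "").replace(",", "")
        item["price"] = float(price_text)
        return item


# Outside a Scrapy project the class can be exercised directly.
raw_item = {"name": "  Widget ", "price": "$1,299.50"}
clean_item = PriceCleanupPipeline().process_item(raw_item, spider=None)
```

In a real project the class would be enabled through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PriceCleanupPipeline": 300}`, where `myproject` is a placeholder module path.
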
  9. Demo
  10. Problems
    • JavaScript. It’s ALWAYS JavaScript.
      Selenium, scrapy-splash
    • Captchas
      Decaptcha, Death By Captcha
    • Writing scrapers is boring
      Scrapely, Portia
    • Deployment
      Scrapinghub, Scrapyd
  11. Resources
    • Scrapy
      https://scrapy.org
    • scrapy-splash
      https://github.com/scrapinghub/splash
    • Decaptcha
      https://github.com/pombredanne/decaptcha
    • Death By Captcha
      http://www.deathbycaptcha.com
    • Scrapely
      https://github.com/scrapy/scrapely
    • Portia
      https://github.com/scrapinghub/portia
    • Scrapinghub
      https://scrapinghub.com
    • Scrapyd
      https://github.com/scrapy/scrapyd
  12. Thank you!