Aaaarrgghh, Spider! Web scraping with Scrapy

A quick yet reasonably thorough introduction to Scrapy, the high-level web crawling & scraping framework for Python, given at PyCon Thailand 2018.

* What is scraping?
* What is Scrapy?
* Core concepts demo
* Problems and solutions
* Resources

Michael Kohl

June 17, 2018

Transcript

  1. Aaaarrgghh, Spider! Web scraping with Scrapy
     Michael Kohl, PyCon Thailand 2018
  2. PyCon Thailand 2018 locksteplabs.com No food…yet!

  3. Yours truly
     • Michael Kohl (@citizen428)
     • Based in Bangkok, Thailand
     • CTO @ Lockstep Labs
     • Ruby developer since 2003 (“Splitter!”)
     • Py-curious, but mostly for ML / data science
  4. Aaaarrgghh!

  5. Outline
     • What is scraping?
     • Scraping in Python
     • Why Scrapy?
     • Core concepts demo
     • Problems and solutions
  6. Scraping
     • AKA screen scraping, web harvesting, data extraction
     • Extracting (structured) data from (unstructured) web sites
     • Use cases:
       • web indexing
       • data mining
       • price comparison
       • change detection
       • data mashups
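To make "extracting structured data from unstructured web sites" concrete, here is a minimal sketch that uses only the standard library's `html.parser`, not any of the libraries named in the talk. The HTML snippet, the class names `product`, `name`, and `price`, and the helper `extract_products` are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the name/price spans of each <li class="product"> into dicts."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # name of the span we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return a list of {"name": ..., "price": ...} dicts found in the HTML."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


print(extract_products(SAMPLE_HTML))
```

Hand-written state tracking like this gets tedious quickly, which is the gap that selector-based tools are meant to fill.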
  7. Danger Zone
     • Legal grey area
       • eBay v. Bidder’s Edge (2000)
       • Intel v. Hamidi (2003)
       • Facebook (2009)
     • Anti-scraping measures
       • Bot detection
       • Captchas
       • robots.txt
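As a rough illustration of the robots.txt point, the sketch below checks URLs against a robots.txt file with the standard library's `urllib.robotparser`. The rules, the `example.com` URLs, and the bot name `example-bot` are invented for the example; a real crawler would download the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would fetch this from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is a hypothetical user agent name.
print(parser.can_fetch("example-bot", "https://example.com/products/1"))    # True
print(parser.can_fetch("example-bot", "https://example.com/private/data"))  # False
print(parser.crawl_delay("example-bot"))                                    # 5
```

Scrapy has a `ROBOTSTXT_OBEY` setting that performs this kind of check for a spider, so a hand-rolled version like this is mostly useful outside the framework.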
  8. Scraping in Python
     • requests
     • lxml
     • cssselect
     • Beautiful Soup
     • Selenium
  9. Scrapy
     “A fast high-level web crawling & scraping framework for Python”
  10. Scrapy
      • Fast: event-driven networking (Twisted)
      • High-level: useful abstractions
      • Crawling: predefined crawlers, easy to write your own
      • Scraping: selectors & feed exporters
      • Framework: comprehensive toolset
      • Commercial support in the form of Scrapinghub
      • FOSS
  11. Concepts
      • Spiders (Spider, CrawlSpider…)
      • Selectors (XPath, CSS, Regex, etc.)
      • Items & item loaders
      • Input & output processors
      • Pipelines
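Of the concepts listed, pipelines are the easiest to sketch without installing anything: a Scrapy item pipeline is a plain Python class with a `process_item(self, item, spider)` method. The example below is not the code from the talk's demo. The class name `PriceCleanupPipeline`, the `price` field, and the `run_pipelines` helper are hypothetical, and the helper only approximates how Scrapy hands each scraped item to the enabled pipelines in order.

```python
class PriceCleanupPipeline:
    """Hypothetical pipeline: convert a scraped price string to a float."""

    def process_item(self, item, spider):
        raw = str(item.get("price", ""))
        cleaned = raw.replace("$", "").replace(",", "").strip()
        try:
            item["price"] = float(cleaned)
        except ValueError:
            # A real project might raise scrapy.exceptions.DropItem here;
            # this sketch avoids importing Scrapy and just records None.
            item["price"] = None
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass one item through each pipeline in order (a rough stand-in for Scrapy)."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


cleaned = run_pipelines(
    {"name": "Widget", "price": "$1,299.50"},
    [PriceCleanupPipeline()],
)
print(cleaned)  # {'name': 'Widget', 'price': 1299.5}

# In a Scrapy project the class would be enabled in settings.py, for example
# (the module path "myproject.pipelines" is an assumption):
# ITEM_PIPELINES = {"myproject.pipelines.PriceCleanupPipeline": 300}
```

Keeping cleanup in a pipeline rather than in the spider means the spider only has to locate the data, and the same cleanup can be shared across spiders.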
  12. Demo

  13. Code

  14. Problems
      • JavaScript. It’s ALWAYS JavaScript.
        Selenium, scrapy-splash
      • Captchas
        Decaptcha, Death By Captcha
      • Writing scrapers is boring
        Scrapely, Portia
      • Deployment
        Scrapinghub, Scrapyd
  15. Scrapinghub

  16. Portia

  17. Resources
      • Scrapy https://scrapy.org
      • scrapy-splash https://github.com/scrapinghub/splash
      • Decaptcha https://github.com/pombredanne/decaptcha
      • Death By Captcha http://www.deathbycaptcha.com
      • Scrapely https://github.com/scrapy/scrapely
      • Portia https://github.com/scrapinghub/portia
      • Scrapinghub https://scrapinghub.com
      • Scrapyd https://github.com/scrapy/scrapyd
  18. Thank you!