Aaaarrgghh, Spider! Web scraping with Scrapy

A quick yet reasonably thorough introduction to Scrapy, the high-level web crawling & scraping framework for Python, given at PyCon Thailand 2018.

* What is scraping?
* What is Scrapy?
* Core concepts demo
* Problems and solutions
* Resources

Michael Kohl

June 17, 2018

Transcript

  1. Aaaarrgghh, Spider! Web scraping with Scrapy
     Michael Kohl, PyCon Thailand 2018
  2. PyCon Thailand 2018 locksteplabs.com No food…yet!

  3. Yours truly
     • Michael Kohl (@citizen428)
     • Based in Bangkok, Thailand
     • CTO @ Lockstep Labs
     • Ruby developer since 2003 (“Splitter!”)
     • Py-curious, but mostly for ML / data science
  4. Aaaarrgghh!

  5. Outline
     • What is scraping?
     • Scraping in Python
     • Why Scrapy?
     • Core concepts demo
     • Problems and solutions
  6. Scraping
     • AKA screen scraping, web harvesting, data extraction
     • Extracting (structured) data from (unstructured) web sites
     • Use cases:
       • web indexing
       • data mining
       • price comparison
       • change detection
       • data mashups
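
To make "extracting structured data from unstructured web sites" concrete, here is a minimal sketch of the price comparison use case. It uses only the standard library's `html.parser` so it runs without installing anything; the HTML fragment, the `name`/`price` class names, and the `PriceParser` class are all invented for illustration and do not come from the talk.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the name/price spans found inside each <li> element."""

    def __init__(self):
        super().__init__()
        self.products = []   # structured output, one dict per <li>
        self._field = None   # field the current text node belongs to
        self._current = {}   # product being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.products.append(self._current)
            self._current = {}


# Hypothetical page fragment (assumption, not a real site).
HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""

parser = PriceParser()
parser.feed(HTML)

# Convert the scraped strings into typed, structured records.
products = [
    {"name": p["name"], "price": float(p["price"])} for p in parser.products
]
```

Real pages are rarely this tidy, which is one reason dedicated parsing and selector libraries are usually preferred over hand-written parsers.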
  7. Danger Zone
     • Legal grey area
       • eBay v. Bidder’s Edge (2000)
       • Intel v. Hamidi (2003)
       • Facebook (2009)
     • Anti-scraping measures
       • Bot detection
       • Captchas
       • robots.txt
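
As one way to respect robots.txt before fetching a page, the sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are assumptions made up for this illustration; the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not from a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
# parse() takes an iterable of lines, so no HTTP request is needed here.
robots.parse(ROBOTS_TXT.splitlines())

# Check individual URLs for a hypothetical crawler called "example-bot".
allowed = robots.can_fetch("example-bot", "https://example.com/products/1")
blocked = robots.can_fetch("example-bot", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = robots.crawl_delay("example-bot")
```

Scrapy can perform this check itself when the `ROBOTSTXT_OBEY` setting is enabled, so a hand-rolled check like this is mainly useful for scrapers built on other tools.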
  8. Scraping in Python
     • requests
     • lxml
     • cssselect
     • Beautiful Soup
     • Selenium
  9. Scrapy
     “A fast high-level web crawling & scraping framework for Python”
  10. Scrapy
      • Fast: event-driven networking (Twisted)
      • High-level: useful abstractions
      • Crawling: predefined crawlers, easy to write your own
      • Scraping: selectors & feed exporters
      • Framework: comprehensive toolset
      • Commercial support in the form of Scrapinghub
      • FOSS
  11. Concepts
      • Spiders (Spider, CrawlSpider…)
      • Selectors (XPath, CSS, Regex, etc.)
      • Items & item loaders
      • Input & output processors
      • Pipelines
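
To illustrate the pipeline concept, here is a minimal sketch of an item pipeline. As Scrapy documented it around the time of this talk, a pipeline component is a plain Python class with a `process_item(self, item, spider)` method, so the sketch runs without Scrapy installed. The `PriceNormalizationPipeline` name, the `price` field, and the sample item are hypothetical and not taken from the demo.

```python
class PriceNormalizationPipeline:
    """Item pipeline sketch: turn a scraped price string into a float."""

    def process_item(self, item, spider):
        raw = item["price"]  # e.g. " $1,299.00 "
        # Strip the currency symbol, thousands separators, and whitespace.
        cleaned = raw.replace("$", "").replace(",", "").strip()
        item["price"] = float(cleaned)
        # Returning the item hands it on to the next pipeline stage.
        return item


# Stand-alone usage with a plain dict as the item and no spider (assumption:
# inside a real project Scrapy would call process_item for every item).
item = {"name": "Laptop", "price": " $1,299.00 "}
result = PriceNormalizationPipeline().process_item(item, spider=None)
```

In a Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting, with the number assigned to each pipeline controlling the order in which they run.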
  12. Demo

  13. Code

  14. Problems
      • JavaScript. It’s ALWAYS JavaScript.
        Selenium, scrapy-splash
      • Captchas
        Decaptcha, Death By Captcha
      • Writing scrapers is boring
        Scrapely, Portia
      • Deployment
        Scrapinghub, Scrapyd
  15. Scrapinghub

  16. Portia

  17. Resources
      • Scrapy https://scrapy.org
      • scrapy-splash https://github.com/scrapinghub/splash
      • Decaptcha https://github.com/pombredanne/decaptcha
      • Death By Captcha http://www.deathbycaptcha.com
      • Scrapely https://github.com/scrapy/scrapely
      • Portia https://github.com/scrapinghub/portia
      • Scrapinghub https://scrapinghub.com
      • Scrapyd https://github.com/scrapy/scrapyd
  18. Thank you!