Aaaarrgghh, Spider! Web scraping with Scrapy

A quick yet reasonably thorough introduction to Scrapy, the high-level web crawling & scraping framework for Python, given at PyCon Thailand 2018.

* What is scraping?
* What is Scrapy?
* Core concepts demo
* Problems and solutions
* Resources

Michael Kohl

June 17, 2018

Transcript

  1. Aaaarrgghh, Spider!

    Web scraping with Scrapy
    Michael Kohl

    PyCon Thailand 2018

  2. PyCon Thailand 2018
    locksteplabs.com
    No food…yet!

  3. Yours truly
    • Michael Kohl (@citizen428)

    • Based in Bangkok, Thailand

    • CTO @ Lockstep Labs

    • Ruby developer since 2003 (“Splitter!”)

    • Py-curious, but mostly for ML / data science

  4. Aaaarrgghh!

  5. Outline
    • What is scraping?

    • Scraping in Python

    • Why Scrapy?

    • Core concepts demo

    • Problems and solutions

  6. Scraping
    • AKA screen scraping, web harvesting, data extraction

    • Extracting (structured) data from (unstructured) web sites

    • Use cases:

      • web indexing

      • data mining

      • price comparison

      • change detection

      • data mashups
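
To make "extracting structured data from unstructured web sites" concrete, here is a minimal sketch that uses only the Python standard library. The HTML fragment, the class names and the `extract_products` helper are all invented for illustration; a real scraper would fetch the page over HTTP and would usually rely on one of the dedicated parsing libraries instead.

```python
from html.parser import HTMLParser

# Hypothetical page fragment (an assumption, not taken from the talk).
PAGE = """
<ul class="products">
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> per product."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="product">
        self._field = None   # span class currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self.products.append({})
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Turn markup into a list of structured records."""
    parser = ProductParser()
    parser.feed(html)
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


products = extract_products(PAGE)
```

The result is a plain list of dictionaries, which is the kind of structured output that use cases such as price comparison or change detection start from.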

  7. Danger Zone
    • Legal grey area

      • eBay v. Bidder’s Edge (2000)

      • Intel v. Hamidi (2003)

      • Facebook (2009)

    • Anti-scraping measures

      • Bot detection

      • Captchas

      • robots.txt
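
Of the measures above, robots.txt is the one a scraper can honour programmatically. The following sketch uses the standard library's `urllib.robotparser`; the rules in `ROBOTS_TXT`, the `example-bot` user agent and the example.com URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the
# target site (RobotFileParser.set_url() plus read() can do that).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a queryable RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
private_allowed = rules.can_fetch("example-bot", "https://example.com/private/data.html")
public_allowed = rules.can_fetch("example-bot", "https://example.com/products/")

# Requested pause between requests in seconds, or None if not specified.
delay = rules.crawl_delay("example-bot")
```

Scrapy can perform this check itself when the `ROBOTSTXT_OBEY` setting is enabled in a project's settings.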

  8. Scraping in Python
    • requests

    • lxml

    • cssselect

    • Beautiful Soup

    • Selenium

  9. Scrapy
    “A fast high-level web crawling & scraping framework for Python”

  10. Scrapy
    • Fast: Event-driven networking (Twisted)

    • High-level: Useful abstractions

    • Crawling: Predefined crawlers, easy to write your own

    • Scraping: Selectors & feed exporters

    • Framework: Comprehensive toolset

    • Commercial support in the form of Scrapinghub

    • FOSS

  11. Concepts
    • Spiders (Spider, CrawlSpider…)

    • Selectors (XPath, CSS, regex, etc.)

    • Items & item loaders

    • Input & output processors

    • Pipelines

  12. Demo

  13. Code

  14. Problems
    • JavaScript. It’s ALWAYS JavaScript.

      Selenium, scrapy-splash

    • Captchas

      Decaptcha, Death By Captcha

    • Writing scrapers is boring

      Scrapely, Portia

    • Deployment

      Scrapinghub, Scrapyd

  15. Scrapinghub

  16. Portia

  17. Resources
    • Scrapy https://scrapy.org

    • scrapy-splash https://github.com/scrapinghub/splash

    • Decaptcha https://github.com/pombredanne/decaptcha

    • Death By Captcha http://www.deathbycaptcha.com

    • Scrapely https://github.com/scrapy/scrapely

    • Portia https://github.com/scrapinghub/portia

    • Scrapinghub https://scrapinghub.com

    • Scrapyd https://github.com/scrapy/scrapyd

  18. Thank you!
