
Aaaarrgghh, Spider! Web scraping with Scrapy

Michael Kohl
December 14, 2017

An intro to web scraping with Scrapy given at ThaiPy in December 2017.

Transcript

  1. Aaaarrgghh, Spider!

    Web scraping with Scrapy
    Michael Kohl

    ThaiPy December 2017


  2. Yours truly
    • Michael Kohl (@citizen428)

    • Based in Bangkok, Thailand

    • CTO @ Lockstep Labs

    • Ruby developer since 2004

    • Always end up speaking at ThaiPy ¯\_(ツ)_/¯

    locksteplabs.com

  3. Aaaarrgghh!

  4. Outline
    • What is scraping?

    • What is Scrapy?

    • Core concepts demo

    • Problems and solutions

    • Resources


  5. Scraping
    • AKA screen scraping, web harvesting, data extraction

    • Extracting data from web sites

    • Turning unstructured data into structured data

    • Use cases: web indexing, data mining, price comparison,
    change detection, mashups etc.

    • Anti-scraping measures like robots.txt, captchas, bot
    detection frameworks etc.

    • Legal grey area
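
As a small illustration of the robots.txt point above, here is a minimal sketch of checking a URL against robots.txt rules before fetching it. It uses only the standard library's `urllib.robotparser`; the rules, the bot name `thaipy-demo-bot` and the example.com URLs are made up for illustration and are not from the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the target site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The bot name and URLs are assumptions made for this example.
allowed = is_allowed(ROBOTS_TXT, "thaipy-demo-bot", "https://example.com/products/1")
blocked = is_allowed(ROBOTS_TXT, "thaipy-demo-bot", "https://example.com/private/data")
print(allowed, blocked)
```

Scrapy can do this check itself when the `ROBOTSTXT_OBEY` setting is enabled, so a hand-rolled check like this is mostly useful outside the framework.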


  6. Scrapy
    “A fast high-level web crawling & scraping framework for Python”

  7. Scrapy
    • Fast: event-driven networking (Twisted)

    • High-level: Comprehensive with many useful
    abstractions and tools

    • Crawling: Predefined crawlers, easy to write your own

    • Scraping: Selectors & feed exporters

    • Commercial support in the form of Scrapinghub

    • FOSS


  8. Concepts
    • Spiders (Spider, CrawlSpider, etc.)

    • Selectors (XPath, CSS, Regex, etc.)

    • Items & item loaders

    • Input & output processors

    • Pipelines
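
To make the last concept a little more concrete: a Scrapy item pipeline is a plain Python class with a `process_item(self, item, spider)` method that receives each scraped item and returns it, possibly modified. The sketch below is not taken from the demo; the class name `PriceCleanupPipeline` and the `price` field are hypothetical, and it assumes items are plain dicts.

```python
class PriceCleanupPipeline:
    """Hypothetical pipeline that turns a scraped price string into a float."""

    def process_item(self, item, spider):
        # Scrapy calls this once per item; `spider` is the spider that produced it.
        raw = str(item.get("price", "")).strip()
        # Keep only digits and the decimal point, e.g. "$1,250.50" -> "1250.50".
        digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
        try:
            item["price"] = float(digits)
        except ValueError:
            # Missing or unparseable price: store None rather than guessing.
            item["price"] = None
        return item


# Pipelines can be exercised without running a crawl, since they are plain classes.
pipeline = PriceCleanupPipeline()
cleaned = pipeline.process_item({"name": "widget", "price": " $1,250.50 "}, spider=None)
print(cleaned)
```

In a real project the class would be switched on through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.PriceCleanupPipeline": 300}`, where the module path is an assumption about the project layout.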


  9. Demo

  10. Problems
    • JavaScript. It’s ALWAYS JavaScript.

    Selenium, scrapy-splash

    • Captchas

    Decaptcha, Death By Captcha

    • Writing scrapers is boring

    Scrapely, Portia

    • Deployment

    ScrapingHub, Scrapyd
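
For the deployment point, Scrapyd exposes a small JSON API over HTTP, and a crawl is started by POSTing a project and spider name to its `schedule.json` endpoint. The following is a rough sketch that only builds such a request with the standard library and does not send it; the host `localhost:6800`, the project `demo_project` and the spider `quotes` are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider):
    """Build (but do not send) a POST request for Scrapyd's schedule.json endpoint."""
    # Scrapyd expects form-encoded `project` and `spider` parameters.
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(f"{base_url.rstrip('/')}/schedule.json", data=data, method="POST")


# Host, project and spider names are hypothetical.
req = build_schedule_request("http://localhost:6800", "demo_project", "quotes")
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would submit the job to a running Scrapyd instance that already has the project deployed.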


  11. Resources
    • Scrapy

    https://scrapy.org

    • scrapy-splash

    https://github.com/scrapinghub/splash

    • Decaptcha

    https://github.com/pombredanne/decaptcha

    • Death By Captcha

    http://www.deathbycaptcha.com

    • Scrapely

    https://github.com/scrapy/scrapely

    • Portia

    https://github.com/scrapinghub/portia

    • ScrapingHub

    https://scrapinghub.com

    • Scrapyd

    https://github.com/scrapy/scrapyd


  12. Thank you!