Aaaarrgghh, Spider! Web scraping with Scrapy

Michael Kohl
ThaiPy December 2017

Yours truly
• Michael Kohl (@citizen428)
• Based in Bangkok, Thailand
• CTO @ Lockstep Labs
• Ruby developer since 2004
• Always end up speaking at ThaiPy ¯\_(ツ)_/¯
ThaiPy December 2017 locksteplabs.com

Aaaarrgghh!

Outline
• What is scraping?
• What is Scrapy?
• Core concepts demo
• Problems and solutions
• Resources

Scraping
• AKA screen scraping, web harvesting, data extraction
• Extracting data from web sites
• Turning unstructured data into structured data
• Use cases: web indexing, data mining, price comparison, change detection, mashups, etc.
• Anti-scraping measures like robots.txt, captchas, bot detection frameworks, etc.
• Legal grey area
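A minimal sketch of honouring robots.txt before fetching a page, using only Python's standard library. The robots.txt content, the `thaipy-bot` user agent, and the example.com URLs are made up for illustration; Scrapy also has its own `ROBOTSTXT_OBEY` setting for this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# "thaipy-bot" is an assumed user agent name, not one from the talk.
for url in ("https://example.com/index.html", "https://example.com/private/data"):
    print(url, may_fetch(parser, "thaipy-bot", url))
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would download the file instead of parsing an inlined string.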

Scrapy
“A fast high-level web crawling & scraping framework for Python”

Scrapy
• Fast: event-driven networking (Twisted)
• High-level: comprehensive, with many useful abstractions and tools
• Crawling: predefined crawlers, easy to write your own
• Scraping: selectors & feed exporters
• Commercial support in the form of Scrapinghub
• FOSS

Concepts
• Spiders (Spider, CrawlSpider, etc.)
• Selectors (XPath, CSS, Regex, etc.)
• Items & item loaders
• Input & output processors
• Pipelines
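To make the last concept concrete: a Scrapy item pipeline is a plain Python class with a `process_item(self, item, spider)` method, enabled through the `ITEM_PIPELINES` setting. The sketch below is not from the talk's demo. The class name, the item fields (`title`, `price`), and the sample values are assumptions, and the item is a plain dict so the code runs without Scrapy installed.

```python
class PriceCleanupPipeline:
    """Hypothetical pipeline that tidies up a scraped product item."""

    def process_item(self, item, spider):
        # Strip stray whitespace left over from the HTML source.
        item["title"] = item.get("title", "").strip()

        # Turn a price string such as "$1,299.50" into a float.
        raw_price = item.get("price", "")
        digits = raw_price.replace("$", "").replace(",", "").strip()
        item["price"] = float(digits) if digits else None

        # Returning the item hands it on to the next pipeline stage.
        return item


# Outside of Scrapy the pipeline can be exercised directly; the spider
# argument is unused in this sketch, so None stands in for it.
scraped = {"title": "  Scrapy mug \n", "price": "$1,299.50"}
cleaned = PriceCleanupPipeline().process_item(scraped, spider=None)
print(cleaned)
```

In a real project the spider would yield the item and Scrapy would call `process_item` for each one, in the order given by `ITEM_PIPELINES`.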

Demo

Problems
• JavaScript. It’s ALWAYS JavaScript. → Selenium, scrapy-splash
• Captchas → Decaptcha, Death By Captcha
• Writing scrapers is boring → Scrapely, Portia
• Deployment → Scrapinghub, Scrapyd
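As a rough illustration of the deployment point: Scrapyd exposes a JSON HTTP API, and a spider run is scheduled by POSTing to its `schedule.json` endpoint. The sketch below assumes a Scrapyd instance on its default port 6800 and uses made-up project and spider names (`demo`, `quotes`). It only builds the request unless `schedule()` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed local Scrapyd instance on its default port.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project: str, spider: str,
                           base_url: str = SCRAPYD_URL) -> Request:
    """Build the POST request that asks Scrapyd to run a spider."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule(project: str, spider: str, base_url: str = SCRAPYD_URL) -> dict:
    """Send the request and return Scrapyd's JSON reply as a dict."""
    with urlopen(build_schedule_request(project, spider, base_url)) as resp:
        return json.load(resp)


# "demo" and "quotes" are hypothetical names; nothing is sent here.
request = build_schedule_request("demo", "quotes")
print(request.get_method(), request.full_url, request.data)
```

Calling `schedule("demo", "quotes")` against a running Scrapyd with that project deployed would queue the job.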

Resources
• Scrapy: https://scrapy.org
• scrapy-splash: https://github.com/scrapinghub/splash
• Decaptcha: https://github.com/pombredanne/decaptcha
• Death By Captcha: http://www.deathbycaptcha.com
• Scrapely: https://github.com/scrapy/scrapely
• Portia: https://github.com/scrapinghub/portia
• Scrapinghub: https://scrapinghub.com
• Scrapyd: https://github.com/scrapy/scrapyd

Thank you!
