An intro to web scraping with Scrapy given at ThaiPy in December 2017.
Aaaarrgghh, Spider! Web scraping with Scrapy
Michael Kohl
ThaiPy December 2017
Yours truly
• Michael Kohl (@citizen428)
• Based in Bangkok, Thailand
• CTO @ Lockstep Labs
• Ruby developer since 2004
• Always end up speaking at ThaiPy ¯\_(ツ)_/¯
ThaiPy December 2017, locksteplabs.com
Aaaarrgghh!
Outline
• What is scraping?
• What is Scrapy?
• Core concepts demo
• Problems and solutions
• Resources
Scraping
• AKA screen scraping, web harvesting, data extraction
• Extracting data from web sites
• Turning unstructured data into structured data
• Use cases: web indexing, data mining, price comparison, change detection, mashups etc.
• Anti-scraping measures like robots.txt, captchas, bot detection frameworks etc.
• Legal grey area
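To make "turning unstructured data into structured data" concrete, here is a minimal sketch using only the Python standard library rather than Scrapy. The robots.txt rules, the page markup, the `example-bot` user agent, and the names `allowed`, `ProductParser`, and `extract_products` are all hypothetical and inlined so the sketch runs offline.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules and page markup (assumptions for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

PAGE = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


def allowed(url, user_agent="example-bot"):
    """Check a URL against the inlined robots.txt rules."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    return robots.can_fetch(user_agent, url)


class ProductParser(HTMLParser):
    """Collect one dict per <li class="product"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._field = None  # name of the field whose text comes next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.items.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and self.items:
            self.items[-1][self._field] = data.strip()
            self._field = None


def extract_products(html):
    """Turn the markup into a list of typed records."""
    parser = ProductParser()
    parser.feed(html)
    return [{"name": i["name"], "price": float(i["price"])} for i in parser.items]


if __name__ == "__main__":
    if allowed("https://example.com/products"):
        print(extract_products(PAGE))
```

Hand-written parsing like this gets tedious quickly, which is the gap a framework such as Scrapy is meant to fill.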
Scrapy
“A fast high-level web crawling & scraping framework for Python”
Scrapy
• Fast: event-driven networking (Twisted)
• High-level: Comprehensive with many useful abstractions and tools
• Crawling: Predefined crawlers, easy to write your own
• Scraping: Selectors & feed exporters
• Commercial support in the form of Scrapinghub
• FOSS
Concepts
• Spiders (Spider, CrawlSpider, etc)
• Selectors (XPath, CSS, Regex, etc)
• Items & item loaders
• Input & output processors
• Pipelines
Demo
Problems
• JavaScript. It’s ALWAYS JavaScript.
  Solutions: Selenium, scrapy-splash
• Captchas
  Solutions: Decaptcha, Death By Captcha
• Writing scrapers is boring
  Solutions: Scrapely, Portia
• Deployment
  Solutions: ScrapingHub, Scrapyd
Resources
• Scrapy: https://scrapy.org
• scrapy-splash: https://github.com/scrapinghub/splash
• Decaptcha: https://github.com/pombredanne/decaptcha
• Death By Captcha: http://www.deathbycaptcha.com
• Scrapely: https://github.com/scrapy/scrapely
• Portia: https://github.com/scrapinghub/portia
• ScrapingHub: https://scrapinghub.com
• Scrapyd: https://github.com/scrapy/scrapyd
Thank you!