A quick yet reasonably thorough introduction to Scrapy, the high-level web crawling & scraping framework for Python, given at PyCon Thailand 2018.
* What is scraping?
* What is Scrapy?
* Core concepts demo
* Problems and solutions
* Resources
Aaaarrgghh, Spider! Web scraping with Scrapy
Michael Kohl
PyCon Thailand 2018
No food…yet!
PyCon Thailand 2018, locksteplabs.com
Yours truly
• Michael Kohl (@citizen428)
• Based in Bangkok, Thailand
• CTO @ Lockstep Labs
• Ruby developer since 2003 (“Splitter!”)
• Py-curious, but mostly for ML / data science
Aaaarrgghh!
Outline
• What is scraping?
• Scraping in Python
• Why Scrapy?
• Core concepts demo
• Problems and solutions
Scraping
• AKA screen scraping, web harvesting, data extraction
• Extracting (structured) data from (unstructured) web sites
• Use cases:
  • web indexing
  • data mining
  • price comparison
  • change detection
  • data mashups
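
To make "extracting structured data from unstructured web sites" concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the `product` / `name` / `price` class names, and the `PriceParser` and `extract_products` helpers are all made up for illustration; they are not from the talk.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: free-form markup that hides a small table of products.
PAGE = """
<ul>
  <li class="product"><span class="name">Mango</span> <span class="price">45.00</span></li>
  <li class="product"><span class="name">Durian</span> <span class="price">320.00</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects one dict per element with class="product"."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # name of the field whose text we expect next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "product":
            self.products.append({})
        elif css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Store the first non-blank text that follows a name/price tag.
        if self._field and data.strip():
            self.products[-1][self._field] = data.strip()
            self._field = None


def extract_products(html):
    """Turn the markup into a list of records with typed values."""
    parser = PriceParser()
    parser.feed(html)
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


print(extract_products(PAGE))
```

Hand-written parsers like this get brittle quickly, which is part of the motivation for selector-based tools.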
Danger Zone
• Legal grey area
  • eBay v. Bidder’s Edge (2000)
  • Intel v. Hamidi (2003)
  • Facebook (2009)
• Anti-scraping measures
  • Bot-detection
  • Captchas
  • robots.txt
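
As an illustration of the robots.txt point, a scraper can check whether a URL is allowed before fetching it. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt contents, the `my-bot` user agent, and the `example.com` URLs are assumptions made up for the example. Scrapy has its own `ROBOTSTXT_OBEY` setting for this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request this URL?
print(parser.can_fetch("my-bot", "https://example.com/private/data"))
print(parser.can_fetch("my-bot", "https://example.com/public/page"))

# Requested pause between requests, in seconds (None if not specified).
print(parser.crawl_delay("my-bot"))
```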
Scraping in Python
• requests
• lxml
• cssselect
• Beautiful Soup
• Selenium
Scrapy
“A fast high-level web crawling & scraping framework for Python”
Scrapy
• Fast: event-driven networking (Twisted)
• High-level: Useful abstractions
• Crawling: Predefined crawlers, easy to write your own
• Scraping: Selectors & feed exporters
• Framework: Comprehensive toolset
  • Commercial support in the form of Scrapinghub
• FOSS
Concepts
• Spiders (Spider, CrawlSpider…)
• Selectors (XPath, CSS, Regex, etc)
• Items & item loaders
• Input & output processors
• Pipelines
Demo
Code
Problems
• JavaScript. It’s ALWAYS JavaScript.
  • Selenium, scrapy-splash
• Captchas
  • Decaptcha, Death By Captcha
• Writing scrapers is boring
  • Scrapely, Portia
• Deployment
  • Scrapinghub, Scrapyd
Scrapinghub
Portia
Resources
• Scrapy https://scrapy.org
• scrapy-splash https://github.com/scrapinghub/splash
• Decaptcha https://github.com/pombredanne/decaptcha
• Death By Captcha http://www.deathbycaptcha.com
• Scrapely https://github.com/scrapy/scrapely
• Portia https://github.com/scrapinghub/portia
• Scrapinghub https://scrapinghub.com
• Scrapyd https://github.com/scrapy/scrapyd
Thank you!