
Scrapy


The Scrapy presentation given at the Data Science Meetup.


matiskay

May 05, 2016


Transcript

  1. Dictionary
    • Web Scraping: The process of extracting data from the web.
    • Spider: A piece of software designed to extract links and items from web pages.
    • Crawl: Visit all the pages of interest on a site using your spider.
    • Scrapy Cloud: Hosted crawling at scrapinghub.com
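The "Spider" entry above (extract links and items from a page) can be sketched with the Python standard library alone. This is only an illustration of the idea, not how Scrapy implements it; the names `LinkAndTitleParser`, `parse_page`, and `SAMPLE_HTML`, and the example.com URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute links (to follow) and the page title (an item)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(base_url, html):
    """Return the extracted item and the links a crawl would visit next."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical page content, used instead of a real HTTP request.
SAMPLE_HTML = """
<html><head><title>Blog</title></head>
<body><a href="/post/1">First</a> <a href="https://example.com/about">About</a></body></html>
"""

result = parse_page("https://example.com/", SAMPLE_HTML)
```

A crawl, in the sense used above, would repeat `parse_page` on every URL in `result["links"]` that belongs to the site of interest.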
  2. Types of Spiders
    • Spider: Base class for Scrapy spiders.
    • SitemapSpider: Crawls a site by discovering its URLs through Sitemaps.
    • XMLFeedSpider: Parses XML feeds by iterating over them by a given node name.
    • CrawlSpider: Provides a convenient mechanism for following links by defining a set of rules.
  3. How to deploy to Scrapy Cloud
    • shub login
    • shub deploy
    • shub schedule blogspider
  4. Scrapy Workflow
    • Create two projects, one for production and one for development.
      • project-prod
      • project-dev
    • Data verification and all possible data transformations are done in the development project.
    • shub lets you deploy using the latest commit of the branch.
  5. Tools for Cleaning Data
    • w3lib
      • remove comments or tags from HTML snippets
      • extract the base URL from HTML snippets
      • translate entities in HTML strings
      • convert raw HTTP headers to dicts and vice versa
      • construct HTTP auth headers
      • convert HTML pages to unicode
      • sanitize URLs (like browsers do)
      • extract arguments from URLs
  6. Machine Learning & NLP
    • NLTK (Natural Language Toolkit): http://www.nltk.org/
    • Scikit Learn (Machine Learning in Python): http://scikit-learn.org/
    • Scikit Image (Image Processing in Python): http://scikit-image.org/
    • Gensim (Topic Modelling for Humans): http://radimrehurek.com/gensim/
    • TextBlob (Simplified Text Processing): http://textblob.readthedocs.org/en/dev/
  7. Product Recommendation
    • Extract product information from supermarkets.
    • Use that information to recommend where to buy the products, based on the recipe to be prepared.
  8. Selling Products by Reference
    • Extract information from fashion sites.
    • Use algorithms to extract colors, and use that information in the search engine.
    • Create an ETL system.
  9. Job Recommendation
    • Extract information from job sites.
    • Use Natural Language Processing algorithms to classify the jobs.