
Scrapy


The Scrapy presentation given at the Data Science Meetup.

matiskay

May 05, 2016

Transcript

  1. Dictionary
    • Web Scraping: The process of extracting data from the web.
    • Spider: A piece of software designed to extract links and items from web pages.
    • Crawl: Visit all the pages of interest on a site using your spider.
    • Scrapy Cloud: Hosted crawling at scrapinghub.com
  2. Types of Spiders
    • Spider: Base class for Scrapy spiders.
    • SitemapSpider: Crawls a site by discovering its URLs through Sitemaps.
    • XMLFeedSpider: Parses XML feeds by iterating over them by a given node name.
    • CrawlSpider: Provides a convenient mechanism for following links by defining a set of rules.
  3. How to deploy to Scrapy Cloud
    • shub login
    • shub deploy
    • shub schedule blogspider
  4. Scrapy Workflow
    • Create two projects, one for production and one for development.
      • project-prod
      • project-dev
    • In development, the data is verified and every possible transformation is applied to it.
    • shub can deploy using the latest commit of the branch.
  5. Tools for Cleaning Data
    • w3lib
      • remove comments or tags from HTML snippets
      • extract the base URL from HTML snippets
      • translate entities in HTML strings
      • convert raw HTTP headers to dicts and vice versa
      • construct HTTP auth headers
      • convert HTML pages to unicode
      • sanitize URLs (like browsers do)
      • extract arguments from URLs
  6. Machine Learning & NLP
    • NLTK (Natural Language Toolkit): http://www.nltk.org/
    • Scikit Learn (Machine Learning in Python): http://scikit-learn.org/
    • Scikit Image (Image Processing in Python): http://scikit-image.org/
    • Gensim (Topic Modelling for Humans): http://radimrehurek.com/gensim/
    • TextBlob (Simplified Text Processing): http://textblob.readthedocs.org/en/dev/
  7. Product Recommendation
    • Extract product information from supermarket sites.
    • Use that information to recommend where to buy the products, based on the recipe to be prepared.
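One simple way to read this idea is to total the price of a recipe's ingredients per store and pick the cheapest store that stocks all of them. The sketch below assumes exactly that; the `cheapest_store` function, the store names, and the prices are invented for illustration and are not taken from the talk.

```python
def cheapest_store(prices, recipe):
    """Return (store, total) for the cheapest store stocking every item.

    prices: {store: {product: price_in_cents}}, e.g. built from scraped items.
    recipe: list of product names needed for the recipe.
    Returns None when no single store carries the whole recipe.
    """
    totals = {}
    for store, catalog in prices.items():
        # Only consider stores that carry every ingredient.
        if all(item in catalog for item in recipe):
            totals[store] = sum(catalog[item] for item in recipe)
    if not totals:
        return None
    return min(totals.items(), key=lambda pair: pair[1])


# Hypothetical scraped prices, in cents.
scraped_prices = {
    "store_a": {"rice": 350, "chicken": 1200, "onion": 90},
    "store_b": {"rice": 320, "chicken": 1350, "onion": 80},
    "store_c": {"rice": 300, "onion": 70},
}
best = cheapest_store(scraped_prices, ["rice", "chicken", "onion"])
```

A real recommender would also have to match product names across stores, which is where the cleaning step in development comes in.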
  8. Selling Products by Reference
    • Extract product information from fashion sites.
    • Use algorithms to extract colors and feed that information into the search engine.
    • Build an ETL system.
  9. Job Recommendation
    • Extract job postings from job sites.
    • Use Natural Language Processing algorithms to classify the jobs.
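As a rough illustration of classifying scraped job postings, the sketch below uses scikit-learn (one of the libraries listed in the talk) with TF-IDF features and a naive Bayes classifier. The talk does not say which algorithm was used, so this choice is an assumption, and the tiny training set and category labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled postings; a real dataset would come from the spiders.
postings = [
    "python developer backend django",
    "java software engineer backend",
    "registered nurse hospital care",
    "nurse practitioner clinic patients",
]
labels = ["engineering", "engineering", "health", "health"]

# TF-IDF turns each posting into a weighted bag of words;
# naive Bayes then learns which words are typical of each category.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(postings, labels)

predictions = classifier.predict(["senior python engineer", "hospital nurse"])
```

With a realistic number of postings, the same pipeline can be evaluated on a held-out split before its labels are used for recommendations.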