
Mining the web, no experience required.


How many times have you wanted to find some information on a website, only to be disappointed by the filtering and discovery options available? Learn how to get data from a site and dig out the information that you really care about.

Scrapinghub

October 25, 2015

Transcript

  1. Scrapinghub - Who are we?
     • Provider of cloud based web-crawling solutions
     • Builder of spiders and crawling solutions
     • Creator of open source projects like Scrapy, Portia and Splash
     • Find out more at scrapinghub.com
  2. The Project - Obtain and compare house types and prices across the country
     • Build a spider for daft.ie using Portia
     • Crawl daft.ie to obtain housing data
     • Process the data using Pandas
     • Visualise the data using CartoDB
  3. The Basics
     • Web Scraping - The process of extracting data from the web
     • Spider - A piece of software designed to extract links and items from webpages
     • Crawl - Visit all pages of interest on a site using your spider
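     To make those definitions concrete, here is a minimal sketch of a Scrapy spider that extracts items and follows links; the start URL and CSS selectors are illustrative placeholders, not daft.ie's real markup.

        import scrapy

        class ListingSpider(scrapy.Spider):
            # Illustrative spider: yields one item per listing and follows
            # pagination links so the crawl visits every page of interest.
            name = "listings"
            start_urls = ["https://example.com/for-sale"]  # placeholder URL

            def parse(self, response):
                for listing in response.css("li.listing"):  # assumed selector
                    yield {
                        "address": listing.css("h2::text").get(),
                        "price": listing.css(".price::text").get(),
                    }
                # Follow the "next page" link, if any, and parse it the same way.
                next_page = response.css("a.next::attr(href)").get()
                if next_page is not None:
                    yield response.follow(next_page, callback=self.parse)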
  4. Build a spider using Portia
     • Portia is a tool for building spiders without having to write any code.
     • It has a simple UI for loading pages that you want to extract data from.
     • Create samples by highlighting data that you want on a page.
     • Use these samples to train the extraction algorithm.
     • https://github.com/scrapinghub/portia
  5. Run our spider
     • Scrapy Cloud - Hosted crawling at scrapinghub.com
     • Scrapyd - Run your own server for crawling
     • Portiacrawl - Run the spider locally using Scrapy
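     As a rough sketch of the "run it locally" option: any Scrapy spider can also be started from a short Python script. The spider class and output path below are assumptions carried over from the earlier sketch, and the FEEDS setting requires Scrapy 2.1 or newer.

        from scrapy.crawler import CrawlerProcess
        from myproject.spiders import ListingSpider  # assumed module path

        process = CrawlerProcess(settings={
            # Write scraped items to a CSV feed (FEEDS needs Scrapy >= 2.1).
            "FEEDS": {"houses.csv": {"format": "csv"}},
        })
        process.crawl(ListingSpider)
        process.start()  # blocks until the crawl finishes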
  6. Process our data with Pandas
     • The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie.
     • Clean and normalise the data
     • Add a geopoint column so the houses can be placed on a map.
     • Process fields to prepare them for plotting
     • Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
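     A sketch of that cleaning step with Pandas; the column names ("price", "property_type", "latitude", "longitude") are assumptions about the scraped CSV rather than the actual schema used in the notebook linked above.

        import pandas as pd

        df = pd.read_csv("houses.csv")

        # Turn "€250,000"-style strings into numbers; unparseable prices become NaN.
        df["price"] = pd.to_numeric(
            df["price"].str.replace(r"[€,]", "", regex=True), errors="coerce"
        )

        # Normalise free-text fields such as the property type.
        df["property_type"] = df["property_type"].str.strip().str.title()

        # Drop rows that can't be priced or placed on a map.
        df = df.dropna(subset=["price", "latitude", "longitude"])

        # Build a geopoint column so each house can be plotted.
        df["geopoint"] = df["latitude"].astype(str) + "," + df["longitude"].astype(str)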
  7. Visualise the data using CartoDB
     • Create a dataset from our CSV file
     • Plot our data on a map
     • Compare prices across the country
     • Compare property type
     • Compare BER
     • http://cdb.io/1POBIU8
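     CartoDB can build point geometries from latitude/longitude columns when a CSV is imported, so the last code step is simply writing the cleaned data back out; df and its column names carry over from the (assumed) Pandas sketch above.

        # Export the cleaned data for upload to CartoDB.
        columns = ["address", "property_type", "price", "ber", "latitude", "longitude"]
        df[columns].to_csv("houses_for_cartodb.csv", index=False)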