Mining the web, no experience required.

How many times have you wanted to find some information on a website, only to be disappointed with the filtering and discovery options available? Learn how to get data from a site and look for the information that you really care about.

Scrapinghub

October 25, 2015

Transcript

  1. Scrapinghub - Who are we?

    • Provider of cloud based web-crawling solutions
    • Builder of spiders and crawling solutions
    • Creator of open source projects like Scrapy, Portia and Splash
    • Find out more at scrapinghub.com

    Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015
  2. The Project

    Obtain and compare house types and prices across the country
    • Build a spider for daft.ie using Portia
    • Crawl daft.ie to obtain housing data
    • Process the data using Pandas
    • Visualise the data using CartoDB
  3. The Basics

    Web Scraping - The process of extracting data from the web
    Spider - A piece of software designed to extract links and items from webpages
    Crawl - Visit all pages of interest on a site using your spider
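
The spider and crawl definitions above can be sketched with the Python standard library alone. This is a minimal illustration, not how Portia or Scrapy work internally: the class name `LinkAndTitleParser`, the `crawl` function, the `fetch` callback and the in-memory `PAGES` site are all hypothetical, and page titles stand in for extracted items.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects href values from <a> tags and the text of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None if missing."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)
        # The "item" extracted from each page is just its URL and title.
        items.append({"url": url, "title": parser.title.strip()})
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # visit each page only once
                seen.add(link)
                queue.append(link)
    return items


# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>House A</title><a href="/">back</a>',
    "http://example.test/b": '<title>House B</title><a href="/a">A</a>',
}

items = crawl("http://example.test/", PAGES.get)
```

Passing `fetch` in as a callback keeps the crawl logic separate from how pages are downloaded; a real crawl would also need politeness delays, error handling and a limit on which links to follow.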
  4. Build a spider using Portia

    • Portia is a tool for building spiders without having to write any code.
    • It has a simple UI for loading pages that you want to extract data from.
    • Create Samples by highlighting data that you want on a page.
    • Use these samples to train the extraction algorithm.

    https://github.com/scrapinghub/portia
  5. Run our spider

    • Scrapy Cloud - Hosted crawling at scrapinghub.com
    • Scrapyd - Run your own server for crawling
    • Portiacrawl - Run the spider locally using Scrapy
  6. Process our data with Pandas

    • The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie.
    • Clean and normalise data
    • Add a geopoint column so the houses can be placed on a map.
    • Process fields to prepare them for plotting

    Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
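
The cleaning steps above might look roughly like the following in Pandas. This is a sketch under stated assumptions, not the code from the linked notebook: the column names, the sample rows, the raw string formats and the `COUNTY_COORDS` lookup are all hypothetical, and the latitude and longitude columns stand in for the geopoint column (the notebook may well geocode full addresses instead).

```python
import pandas as pd

# Hypothetical rows standing in for the spider's output.
raw = pd.DataFrame({
    "house_type": [" Detached House", "apartment", "Semi-Detached House "],
    "price": ["€250,000", "€1,200,000", "Price on Application"],
    "ber": ["B2", "ber c1", None],
    "bedrooms": ["3 Bedrooms", "2 Bedrooms", "4 Bedrooms"],
    "address": ["1 Main St, Co. Cork", "2 High St, Co. Dublin", "3 Low Rd, Co. Cork"],
})

df = raw.copy()

# Normalise text fields: trim whitespace and use consistent casing.
df["house_type"] = df["house_type"].str.strip().str.title()
df["ber"] = df["ber"].str.upper().str.replace("BER", "", regex=False).str.strip()

# Keep only the digits of the price; non-numeric prices become NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"\D", "", regex=True), errors="coerce"
)

# Pull the leading number out of strings such as "3 Bedrooms".
df["bedrooms"] = pd.to_numeric(df["bedrooms"].str.extract(r"(\d+)", expand=False))

# Assumed address format: the county follows "Co." in the address.
df["county"] = df["address"].str.extract(r"Co\.\s*(\w+)", expand=False)

# Illustrative lookup of approximate city-centre coordinates per county.
COUNTY_COORDS = {"Cork": (51.8985, -8.4756), "Dublin": (53.3498, -6.2603)}
df["latitude"] = [COUNTY_COORDS.get(c, (None, None))[0] for c in df["county"]]
df["longitude"] = [COUNTY_COORDS.get(c, (None, None))[1] for c in df["county"]]
```

Coercing unparseable prices to NaN, rather than dropping the rows, keeps listings such as "Price on Application" available for the comparisons that do not need a price.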
  7. Visualise the data using CartoDB

    • Create a dataset from our CSV file
    • Plot our data on a map
    • Compare prices across the country
    • Compare property type
    • Compare BER
    • http://cdb.io/1POBIU8
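
The comparisons listed above are done on the map in the talk, but the same questions can be checked in Pandas before uploading. The sketch below uses made-up listings and hypothetical column names; it computes median prices per county and per property type, then serialises the table as CSV text of the kind that would be uploaded as a dataset.

```python
import io

import pandas as pd

# Made-up listings for illustration only; not data from daft.ie.
listings = pd.DataFrame({
    "county": ["Cork", "Cork", "Dublin", "Dublin", "Dublin"],
    "house_type": ["Detached House", "Apartment", "Apartment",
                   "Apartment", "Detached House"],
    "price": [300000, 150000, 250000, 350000, 600000],
})

# Median asking price per county.
median_by_county = listings.groupby("county")["price"].median()

# Median asking price per county and property type, as a small grid.
median_by_type = listings.pivot_table(
    index="county", columns="house_type", values="price", aggfunc="median"
)

# Serialise to CSV text; writing to a file path works the same way.
buffer = io.StringIO()
listings.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The median is used here rather than the mean because a handful of very expensive properties can distort an average.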