Slide 1

Slide 1 text

Mining the web, no experience required. Ruairí Fahy, 25th October 2015

Slide 2

Slide 2 text

Scrapinghub - Who are we?
● Provider of cloud based web-crawling solutions
● Builder of spiders and crawling solutions
● Creator of open source projects like Scrapy, Portia and Splash
● Find out more at scrapinghub.com
Mining the web, no experience required. - Ruairi Fahy, 25 October 2015 - Scrapinghub ⓒ 2015

Slide 3

Slide 3 text

The Project
Obtain and compare house types and prices across the country
● Build a spider for daft.ie using Portia
● Crawl daft.ie to obtain housing data
● Process the data using Pandas
● Visualise the data using CartoDB

Slide 4

Slide 4 text

The Basics
Web Scraping - The process of extracting data from the web
Spider - A piece of software designed to extract links and items from webpages
Crawl - Visit all pages of interest on a site using your spider
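To make the three terms concrete, here is a minimal sketch using only the Python standard library. It is not how Portia or Scrapy work internally. The in-memory `PAGES` site, the `PageParser` class and the choice of `<h1>` text as the "item" are all assumptions made for illustration, so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "http://example.com/": '<a href="/houses?page=2">Next</a><a href="/house/1">House 1</a>',
    "http://example.com/houses?page=2": '<a href="/house/2">House 2</a><a href="/">Home</a>',
    "http://example.com/house/1": "<h1>Detached House</h1>",
    "http://example.com/house/2": "<h1>Apartment</h1>",
}


class PageParser(HTMLParser):
    """Spider logic for one page: collect link targets and <h1> text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.titles = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.titles.append(data.strip())


def crawl(start_url, pages):
    """Crawl: visit every reachable page once, breadth first, and gather items."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(pages[url])
        items.extend({"url": url, "title": title} for title in parser.titles)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return items


items = crawl("http://example.com/", PAGES)
```

The `seen` set is what stops the crawl from looping when pages link back to each other, as the "Home" link does here.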

Slide 5

Slide 5 text

Build a spider using Portia
● Portia is a tool for building spiders without having to write any code.
● It has a simple UI for loading pages that you want to extract data from.
● Create Samples by highlighting data that you want on a page.
● Use these samples to train the extraction algorithm.
https://github.com/scrapinghub/portia

Slide 6

Slide 6 text

Run our spider
● Scrapy Cloud - Hosted crawling at scrapinghub.com
● Scrapyd - Run your own server for crawling
● Portiacrawl - Run the spider locally using Scrapy

Slide 7

Slide 7 text

Process our data with Pandas
● The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie.
● Clean and normalise data
● Add a geopoint column so the houses can be placed on a map.
● Process fields to prepare them for plotting
Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
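The "clean and normalise" step might look roughly like the sketch below. This is not the code from the linked notebook. The column names, the sample rows and the raw text formats (for example "€250,000" or "3 Beds") are assumptions made for illustration, and the geopoint step is left out because it depends on a geocoding source.

```python
import pandas as pd

# Hypothetical rows standing in for the spider's output; the real field
# names and formats in the crawl may differ.
raw = pd.DataFrame({
    "house_type": [" Detached House", "apartment ", "Terraced House"],
    "price": ["€250,000", "Price On Application", "€1,100,000"],
    "ber": ["BER C1", "b2", None],
    "bedrooms": ["3 Beds", "2 Beds", "4 Beds"],
    "address": ["1 Main St, Galway", "2 High St, Cork", "3 Low Rd, Dublin 4"],
})


def clean_listings(df):
    """Return a copy with typed, normalised columns; rows without a numeric price are dropped."""
    out = df.copy()
    # Trim whitespace and use consistent capitalisation for the property type
    out["house_type"] = out["house_type"].str.strip().str.title()
    # Keep digits only, so non-numeric prices become empty strings and then NaN
    digits = out["price"].str.replace(r"[^\d]", "", regex=True)
    out["price"] = pd.to_numeric(digits, errors="coerce")
    # Pull out a standalone rating such as "C1" or "B2"; missing ratings stay missing
    out["ber"] = out["ber"].str.upper().str.extract(r"\b([A-G][1-3]?)\b", expand=False)
    # Take the leading number from strings like "3 Beds"
    beds = out["bedrooms"].str.extract(r"(\d+)", expand=False)
    out["bedrooms"] = pd.to_numeric(beds, errors="coerce")
    return out.dropna(subset=["price"]).reset_index(drop=True)


cleaned = clean_listings(raw)
```

Dropping rows without a price is one possible choice. Keeping them with a missing value would also work if the other fields are still wanted on the map.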

Slide 8

Slide 8 text

Visualise the data using CartoDB
● Create a dataset from our csv file
● Plot our data on a map
● Compare prices across the country
● Compare property type
● Compare BER
● http://cdb.io/1POBIU8

Slide 9

Slide 9 text

We’re Hiring - scrapinghub.com/jobs

Slide 10

Slide 10 text

Thank you! Ruairi Fahy, 25th October 2015 [email protected]