
Mining the web, no experience required.


How many times have you wanted to find some information on a website, only to be disappointed by the filtering and discovery options available? Learn how to get data from a site and dig out the information that you really care about.

Scrapinghub

October 25, 2015

Transcript

  1. Scrapinghub - Who are we?
     • Provider of cloud based web-crawling solutions
     • Builder of spiders and crawling solutions
     • Creator of open source projects like Scrapy, Portia and Splash
     • Find out more at scrapinghub.com
  2. The Project - Obtain and compare house types and prices across the country
     • Build a spider for daft.ie using Portia
     • Crawl daft.ie to obtain housing data
     • Process the data using Pandas
     • Visualise the data using CartoDB
  3. The Basics
     • Web Scraping - The process of extracting data from the web
     • Spider - A piece of software designed to extract links and items from webpages
     • Crawl - Visit all pages of interest on a site using your spider
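     To make those definitions concrete, here is a minimal sketch of a Scrapy spider that extracts items and follows links; the start URL and CSS selectors are illustrative placeholders, not daft.ie's real markup.

        import scrapy

        class ListingSpider(scrapy.Spider):
            # Illustrative spider: yields one item per listing and follows
            # pagination links so the crawl visits every page of interest.
            name = "listings"
            start_urls = ["https://example.com/for-sale"]  # placeholder URL

            def parse(self, response):
                for listing in response.css("li.listing"):  # assumed selector
                    yield {
                        "address": listing.css("h2::text").get(),
                        "price": listing.css(".price::text").get(),
                    }
                # Follow the "next page" link, if any, and parse it the same way.
                next_page = response.css("a.next::attr(href)").get()
                if next_page is not None:
                    yield response.follow(next_page, callback=self.parse)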
  4. Build a spider using Portia
     • Portia is a tool for building spiders without having to write any code.
     • It has a simple UI for loading pages that you want to extract data from.
     • Create samples by highlighting data that you want on a page.
     • Use these samples to train the extraction algorithm.
     • https://github.com/scrapinghub/portia
  5. Run our spider
     • Scrapy Cloud - Hosted crawling at scrapinghub.com
     • Scrapyd - Run your own server for crawling
     • Portiacrawl - Run the spider locally using Scrapy
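     As a rough sketch of the "run it locally" option: any Scrapy spider can also be started from a short Python script. The spider class and output path below are assumptions carried over from the earlier sketch, and the FEEDS setting requires Scrapy 2.1 or newer.

        from scrapy.crawler import CrawlerProcess
        from myproject.spiders import ListingSpider  # assumed module path

        process = CrawlerProcess(settings={
            # Write scraped items to a CSV feed (FEEDS needs Scrapy >= 2.1).
            "FEEDS": {"houses.csv": {"format": "csv"}},
        })
        process.crawl(ListingSpider)
        process.start()  # blocks until the crawl finishes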
  6. Process our data with Pandas
     • The spider has extracted the house type, price, BER, number of bedrooms and address for all houses for sale on daft.ie.
     • Clean and normalise the data
     • Add a geopoint column so the houses can be placed on a map.
     • Process fields to prepare them for plotting
     • Notebook: https://gist.github.com/ruairif/80102746320d0229a0ce
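     A sketch of that cleaning step with Pandas; the column names ("price", "property_type", "latitude", "longitude") are assumptions about the scraped CSV rather than the actual schema used in the notebook linked above.

        import pandas as pd

        df = pd.read_csv("houses.csv")

        # Turn "€250,000"-style strings into numbers; unparseable prices become NaN.
        df["price"] = pd.to_numeric(
            df["price"].str.replace(r"[€,]", "", regex=True), errors="coerce"
        )

        # Normalise free-text fields such as the property type.
        df["property_type"] = df["property_type"].str.strip().str.title()

        # Drop rows that can't be priced or placed on a map.
        df = df.dropna(subset=["price", "latitude", "longitude"])

        # Build a geopoint column so each house can be plotted.
        df["geopoint"] = df["latitude"].astype(str) + "," + df["longitude"].astype(str)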
  7. Visualise the data using CartoDB
     • Create a dataset from our CSV file
     • Plot our data on a map
     • Compare prices across the country
     • Compare property type
     • Compare BER
     • http://cdb.io/1POBIU8
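     CartoDB can build point geometries from latitude/longitude columns when a CSV is imported, so the last code step is simply writing the cleaned data back out; df and its column names carry over from the (assumed) Pandas sketch above.

        # Export the cleaned data for upload to CartoDB.
        columns = ["address", "property_type", "price", "ber", "latitude", "longitude"]
        df[columns].to_csv("houses_for_cartodb.csv", index=False)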