Web Scraping Using Nutch and Solr 3/3

Solr Extracting Data • Start this session with a full
Solr indexed repository – Movie cAiYBD4BQeE showed installation – Movie Th5Scvlyt-E showed Nutch web crawl • This movie will show how to – Extract data from Solr – Extract to xml or csv – Show aim to load into data warehouse • This movie assumes you know Linux

Solr Extracting Data • Progress so far, greyed out area
yet to be examined

Checking Solr Data • Data should have been indexed in
Solr • In Solr Admin window – Set 'Core Selector' = collection1 – Click 'Query' – In Query window set fl field = url – Click Execute Query • The result ( next ) shows the filtered list of urls in Solr

Checking Solr Data

How To Extract • How could we get at Solr
data ? – In admin console via query – Via http solr select – Via curl -o call using solr http select • What format of data – that suits this purpose – Xml – Comma separated variable (csv)

How To Extract • We want to extract two columns
from Solr – tstamp, url • We want to extract as csv ( csv in call below could be xml ) • We want to extract to a file • So we will use an http call – http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv • We will also use a curl call – curl -o <csv file> '<http call>'

How To Extract • Ceate a bash file in Solr
install directory – cd solr-4-2-1/extract ; touch solr_url_extract.bash – chmod 755 solr_url_extract.bash • Add contents to bash file – #!/bin/bash – curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv' – mv result.csv result.csv.$(date +”%Y%m%d.%H%M%S”) • Now run the bash script – ./solr_url_extract.bash

Check Output • Now we check whether we have data
• ls -l shows – result.csv.20130506.124857 • Check the content , wc -l shows 11 lines • Check the content , head -2 shows – tstamp, url – 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ... • Congratulations, you have extracted data from Solr • It's in CSV format ready to be loaded into a data warehouse

Possible Next Steps • Choose more fields to extract from
data • Allow Nutch crawl to go deeper • Allow Nutch crawl to collect a lot more data • Look at facets in Solr data • Load CSV files into Data Warehouse Staging schema • Next movie will show next step in progress

Contact Us • Feel free to contact us at –
www.semtech-solutions.co.nz – [email protected] • We offer IT project consultancy • We are happy to hear about your problems • You can just pay for those hours that you need • To solve your problems

Web Scraping Using Nutch and Solr 3/3

Web Scraping Using Nutch and Solr 3/3

Mike Frampton

More Decks by Mike Frampton

Other Decks in Technology

Featured

Transcript

Solr Extracting Data • Start this session with a full

Solr Extracting Data • Progress so far, greyed out area

Checking Solr Data • Data should have been indexed in

Checking Solr Data

How To Extract • How could we get at Solr

How To Extract • We want to extract two columns

How To Extract • Ceate a bash file in Solr

Check Output • Now we check whether we have data

Possible Next Steps • Choose more fields to extract from

Contact Us • Feel free to contact us at –