Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Web Scraping Using Nutch and Solr 3/3

Web Scraping Using Nutch and Solr 3/3

A short presentation ( part 3 of 3 ) describing the use of
open source code nutch and solr to web crawl the internet
and process the data.

Mike Frampton

July 10, 2013
Tweet

More Decks by Mike Frampton

Other Decks in Technology

Transcript

  1. Solr Extracting Data • Start this session with a full

    Solr indexed repository – Movie cAiYBD4BQeE showed installation – Movie Th5Scvlyt-E showed Nutch web crawl • This movie will show how to – Extract data from Solr – Extract to xml or csv – Show aim to load into data warehouse • This movie assumes you know Linux
  2. Checking Solr Data • Data should have been indexed in

    Solr • In Solr Admin window – Set 'Core Selector' = collection1 – Click 'Query' – In Query window set fl field = url – Click Execute Query • The result ( next ) shows the filtered list of urls in Solr
  3. How To Extract • How could we get at Solr

    data ? – In admin console via query – Via http solr select – Via curl -o call using solr http select • What format of data – that suits this purpose – Xml – Comma separated variable (csv)
  4. How To Extract • We want to extract two columns

    from Solr – tstamp, url • We want to extract as csv ( csv in call below could be xml ) • We want to extract to a file • So we will use an http call – http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv • We will also use a curl call – curl -o <csv file> '<http call>'
  5. How To Extract • Ceate a bash file in Solr

    install directory – cd solr-4-2-1/extract ; touch solr_url_extract.bash – chmod 755 solr_url_extract.bash • Add contents to bash file – #!/bin/bash – curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv' – mv result.csv result.csv.$(date +”%Y%m%d.%H%M%S”) • Now run the bash script – ./solr_url_extract.bash
  6. Check Output • Now we check whether we have data

    • ls -l shows – result.csv.20130506.124857 • Check the content , wc -l shows 11 lines • Check the content , head -2 shows – tstamp, url – 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ... • Congratulations, you have extracted data from Solr • It's in CSV format ready to be loaded into a data warehouse
  7. Possible Next Steps • Choose more fields to extract from

    data • Allow Nutch crawl to go deeper • Allow Nutch crawl to collect a lot more data • Look at facets in Solr data • Load CSV files into Data Warehouse Staging schema • Next movie will show next step in progress
  8. Contact Us • Feel free to contact us at –

    www.semtech-solutions.co.nz – [email protected] • We offer IT project consultancy • We are happy to hear about your problems • You can just pay for those hours that you need • To solve your problems