
Web Scraping Using Nutch and Solr 2/3

A short presentation (part 2 of 3) describing the use of the
open source tools Nutch and Solr to crawl the web
and process the resulting data.

Mike Frampton

July 10, 2013

Transcript

  1. Web Scraping Using Nutch and Solr - Part 2
     • The following example assumes that you have
       – Watched "Web Scraping Using Nutch and Solr" (part 1)
       – The above movie's identity is cAiYBD4BQeE
       – Set up a Linux-based Nutch/Solr environment
       – Run the web scrape from that movie
     • Now we will
       – Clean up that environment
       – Web scrape a parameterised URL
       – View the URLs in the data
  2. Empty Nutch Database
     • Clean up the Nutch crawl database
       – We previously used apache-nutch-1.6/nutch_start.sh
       – The script contained the -dir crawl option
       – This created the apache-nutch-1.6/crawl directory
       – Which contains our Nutch data
     • Clean this with
       – cd apache-nutch-1.6; rm -rf crawl
     • Only because it contained dummy data!
     • The next run of the script will create the directory again
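The cleanup step above can be sketched as a small guarded script. `NUTCH_HOME` is an assumed variable (the slides use a relative apache-nutch-1.6 path); point it at your own install before running.

```shell
# Remove the Nutch crawl database so the next run starts fresh.
# NUTCH_HOME is an assumed variable - point it at your Nutch install.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"

if [ -d "$NUTCH_HOME/crawl" ]; then
  rm -rf "$NUTCH_HOME/crawl"
  echo "crawl directory removed"
else
  echo "no crawl directory found - nothing to clean"
fi
```

The directory check keeps the script safe to re-run; as the slide notes, the next crawl run will recreate the directory.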
  3. Empty Solr Database
     • Clean the Solr database via a URL
       – Bookmark this URL
       – Only use it if you need to empty your data
     • Run the following with curl (with the Solr server running)
       – curl 'http://localhost:8983/solr/update?commit=true' -d '<delete><query>*:*</query></delete>'
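The delete request on the slide is meant to be issued through curl (the -d flag is curl syntax). A minimal sketch, assuming Solr is listening on the default localhost:8983 from part 1:

```shell
# Delete every document in the Solr index and commit the change.
# The update URL assumes the default single-core Solr setup.
DELETE_XML='<delete><query>*:*</query></delete>'
echo "payload: $DELETE_XML"
curl -s "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" --data-binary "$DELETE_XML" \
  || echo "Solr not reachable - start the Solr server first"
```

The `*:*` query matches every document, so this empties the whole index; use it only when you really want to discard the data.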
  4. Set up Nutch
     • Now we will do something more complex
     • Web scrape a URL that has parameters, i.e.
       – http://<site>/<function>?var1=val1&var2=val2
     • This web scrape will
       – Have extra URL characters '?=&'
       – Need greater search depth
       – Need better URL filtering
     • Remember that you need to get permission to scrape a third-party web site
  5. Nutch Configuration
     • Change the seed file for Nutch
       – apache-nutch-1.6/urls/seed.txt
     • In this instance I will use a URL of the form
       – http://somesite.co.nz/Search?DateRange=7&industry=62
       – (this is not a real URL, just an example)
     • Change the conf/regex-urlfilter.txt entry, i.e.
       – # skip URLs containing certain characters
       – -[*!@]
       – # accept anything else
       – +^http://([a-z0-9]*\.)*somesite.co.nz\/Search
     • This will only consider somesite.co.nz Search URLs
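The two configuration changes above can be sketched as a script that writes the seed file and the filter rules. Note the assumptions: `NUTCH_HOME` is a placeholder variable, somesite.co.nz is the slide's example domain, and rewriting regex-urlfilter.txt wholesale is a simplification (in practice you would edit the existing file in place).

```shell
# Write the seed URL and the URL filter rules used in this example.
# NUTCH_HOME and somesite.co.nz are placeholders from the slides.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"
mkdir -p "$NUTCH_HOME/urls" "$NUTCH_HOME/conf"

# Seed file: one start URL per line.
cat > "$NUTCH_HOME/urls/seed.txt" <<'EOF'
http://somesite.co.nz/Search?DateRange=7&industry=62
EOF

# URL filter: '?', '=' and '&' are NOT in the skip list,
# so parameterised URLs survive filtering.
cat > "$NUTCH_HOME/conf/regex-urlfilter.txt" <<'EOF'
# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz/Search
EOF
echo "seed and filter written under $NUTCH_HOME"
```

The accept rule anchors on the Search path, which is what restricts the crawl to the parameterised Search URLs only.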
  6. Run Nutch
     • Now run Nutch using the start script
       – cd apache-nutch-1.6 ; ./nutch_start.bash
     • Monitor for errors in the Solr admin log window
     • The Nutch crawl should end with
       – crawl finished: crawl
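A sketch of what the start script might contain, based on the standard Nutch 1.x one-step crawl command. The depth and topN values here are assumptions, not the presenter's actual settings; adjust them to your site.

```shell
# Run the Nutch crawl and index the results into the local Solr.
# Guarded so the sketch degrades gracefully when Nutch is absent.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"
CRAWL_CMD='bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr/'
echo "would run: $CRAWL_CMD"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  (cd "$NUTCH_HOME" && eval "$CRAWL_CMD")
else
  echo "Nutch not found at $NUTCH_HOME - install it first"
fi
```

A greater -depth is what lets the crawl follow links down into the parameterised Search results mentioned on slide 4.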
  7. Checking Data
     • The data should have been indexed in Solr
     • In the Solr admin window
       – Set 'Core Selector' = collection1
       – Click 'Query'
       – In the query window set the fl field = url
       – Click 'Execute Query'
     • The result (next slide) shows the filtered list of URLs in Solr
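The same query the admin window builds can be issued directly over HTTP; a sketch assuming the default collection1 core on localhost:8983:

```shell
# Match everything (q=*:*) but return only the url field (fl=url)
# of the first 10 indexed documents, as JSON.
QUERY_URL='http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&rows=10'
echo "query: $QUERY_URL"
curl -s "$QUERY_URL" || echo "Solr not reachable - start the Solr server first"
```

Restricting fl to url keeps the response down to just the crawled URL list, which is all this check needs.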
  8. Results
     • Congratulations, you have completed your second crawl
       – With parameterised URLs
       – With more complex URL filtering
       – With a Solr query search
  9. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours that you need to solve your problems