
Web Scraping Using Nutch and Solr 2/3

A short presentation (part 2 of 3) describing the use of the
open source tools Nutch and Solr to crawl the web
and process the resulting data.

Mike Frampton

July 10, 2013

Transcript

  1. Web Scraping Using Nutch and Solr - Part 2
     • The following example assumes that you have
       – Watched "Web Scraping Using Nutch and Solr" (part 1)
       – The above movie's identity is cAiYBD4BQeE
       – Set up a Linux-based Nutch/Solr environment
       – Run the web scrape from that movie
     • Now we will
       – Clean up that environment
       – Web scrape a parameterised URL
       – View the URLs in the data
  2. Empty Nutch Database
     • Clean up the Nutch crawl database
       – We previously used apache-nutch-1.6/nutch_start.sh
       – The script contained the -dir crawl option
       – This created the apache-nutch-1.6/crawl directory
       – Which contains our Nutch data
     • Clean this with
       – cd apache-nutch-1.6; rm -rf crawl
     • Only because it contained dummy data!
     • The next run of the script will create the directory again
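The cleanup step above can be sketched as a small guarded script. `NUTCH_HOME` is an assumed variable (the slides use a relative apache-nutch-1.6 path); point it at your own install before running.

```shell
# Remove the Nutch crawl database so the next run starts fresh.
# NUTCH_HOME is an assumed variable - point it at your Nutch install.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"

if [ -d "$NUTCH_HOME/crawl" ]; then
  rm -rf "$NUTCH_HOME/crawl"
  echo "crawl directory removed"
else
  echo "no crawl directory found - nothing to clean"
fi
```

The directory check keeps the script safe to re-run; as the slide notes, the next crawl run will recreate the directory.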
  3. Empty Solr Database
     • Clean the Solr database via a URL
       – Bookmark this URL
       – Only use it if you need to empty your data
     • Run the following with curl (with the Solr server running)
       – curl 'http://localhost:8983/solr/update?commit=true' -d '<delete><query>*:*</query></delete>'
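The delete request on the slide is meant to be issued through curl (the -d flag is curl syntax). A minimal sketch, assuming Solr is listening on the default localhost:8983 from part 1:

```shell
# Delete every document in the Solr index and commit the change.
# The update URL assumes the default single-core Solr setup.
DELETE_XML='<delete><query>*:*</query></delete>'
echo "payload: $DELETE_XML"
curl -s "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" --data-binary "$DELETE_XML" \
  || echo "Solr not reachable - start the Solr server first"
```

The `*:*` query matches every document, so this empties the whole index; use it only when you really want to discard the data.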
  4. Set up Nutch
     • Now we will do something more complex
     • Web scrape a URL that has parameters, i.e.
       – http://<site>/<function>?var1=val1&var2=val2
     • This web scrape will
       – Have extra URL characters '?=&'
       – Need greater search depth
       – Need better URL filtering
     • Remember that you need to get permission to scrape a third-party web site
  5. Nutch Configuration
     • Change the seed file for Nutch
       – apache-nutch-1.6/urls/seed.txt
     • In this instance I will use a URL of the form
       – http://somesite.co.nz/Search?DateRange=7&industry=62
       – (this is not a real URL, just an example)
     • Change the conf/regex-urlfilter.txt entry, i.e.
       – # skip URLs containing certain characters
       – -[*!@]
       – # accept anything else
       – +^http://([a-z0-9]*\.)*somesite.co.nz\/Search
     • This will only consider somesite.co.nz Search URLs
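The two configuration changes above can be sketched as a script that writes the seed file and the filter rules. Note the assumptions: `NUTCH_HOME` is a placeholder variable, somesite.co.nz is the slide's example domain, and rewriting regex-urlfilter.txt wholesale is a simplification (in practice you would edit the existing file in place).

```shell
# Write the seed URL and the URL filter rules used in this example.
# NUTCH_HOME and somesite.co.nz are placeholders from the slides.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"
mkdir -p "$NUTCH_HOME/urls" "$NUTCH_HOME/conf"

# Seed file: one start URL per line.
cat > "$NUTCH_HOME/urls/seed.txt" <<'EOF'
http://somesite.co.nz/Search?DateRange=7&industry=62
EOF

# URL filter: '?', '=' and '&' are NOT in the skip list,
# so parameterised URLs survive filtering.
cat > "$NUTCH_HOME/conf/regex-urlfilter.txt" <<'EOF'
# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz/Search
EOF
echo "seed and filter written under $NUTCH_HOME"
```

The accept rule anchors on the Search path, which is what restricts the crawl to the parameterised Search URLs only.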
  6. Run Nutch
     • Now run Nutch using the start script
       – cd apache-nutch-1.6 ; ./nutch_start.bash
     • Monitor for errors in the Solr admin log window
     • The Nutch crawl should end with
       – crawl finished: crawl
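A sketch of what the start script might contain, based on the standard Nutch 1.x one-step crawl command. The depth and topN values here are assumptions, not the presenter's actual settings; adjust them to your site.

```shell
# Run the Nutch crawl and index the results into the local Solr.
# Guarded so the sketch degrades gracefully when Nutch is absent.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.6}"
CRAWL_CMD='bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr/'
echo "would run: $CRAWL_CMD"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  (cd "$NUTCH_HOME" && eval "$CRAWL_CMD")
else
  echo "Nutch not found at $NUTCH_HOME - install it first"
fi
```

A greater -depth is what lets the crawl follow links down into the parameterised Search results mentioned on slide 4.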
  7. Checking Data
     • The data should have been indexed in Solr
     • In the Solr admin window
       – Set 'Core Selector' = collection1
       – Click 'Query'
       – In the query window set the fl field = url
       – Click 'Execute Query'
     • The result (next slide) shows the filtered list of URLs in Solr
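The same query the admin window builds can be issued directly over HTTP; a sketch assuming the default collection1 core on localhost:8983:

```shell
# Match everything (q=*:*) but return only the url field (fl=url)
# of the first 10 indexed documents, as JSON.
QUERY_URL='http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&rows=10'
echo "query: $QUERY_URL"
curl -s "$QUERY_URL" || echo "Solr not reachable - start the Solr server first"
```

Restricting fl to url keeps the response down to just the crawled URL list, which is all this check needs.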
  8. Results
     • Congratulations, you have completed your second crawl
       – With parameterised URLs
       – With more complex URL filtering
       – With a Solr query search
  9. Contact Us
     • Feel free to contact us at
       – www.semtech-solutions.co.nz
       – [email protected]
     • We offer IT project consultancy
     • We are happy to hear about your problems
     • You pay only for the hours that you need to solve your problems