Web Scraping Using Nutch and Solr 1/3

Web Scraping Using Nutch and Solr
• A simple example of using open source code
• Web scrape a single web site - ours
• Environment and code
  – Using CentOS V6.2 ( Linux )
  – Apache Nutch 1.6
  – Solr 4.2.1
  – Java 1.6

Nutch and Solr Architecture
• Nutch processes urls and feeds content to Solr
• Solr indexes content

Where to get source code
• Nutch – http://nutch.apache.org
• Solr – http://lucene.apache.org/solr
• Java – http://java.com

Installing Source - Nutch
• Nutch is delivered as
  – apache-nutch-1.6-bin.tar ( 64M )
  – apache-nutch-1.6-src.tar ( 20M )
• Copy each tar file to your desired location
• Install each tar file as
  – tar xvf <tar file>
• Second tar file optional

Installing Source - Solr
• Solr is delivered as
  – solr-4.2.1.zip ( 116M )
• Copy file to your desired location
• Install the zip file as
  – unzip <zip file>

Configuring Nutch Part 1
• Assuming we will crawl a single web site
• Ensure that JAVA_HOME is set
• cd apache-nutch-1.6
• Edit agent name in conf/nutch-site.xml
  <property>
    <name>http.agent.name</name>
    <value>Nutch Spider</value>
  </property>
• mkdir -p urls ; cd urls ; touch seed.txt
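
The JAVA_HOME step above can be checked before going further. What follows is a minimal sketch, not part of the original deck: check_java_home is a hypothetical helper name, and the test for bin/java under JAVA_HOME is an assumption about a typical Java install layout.

```shell
# check_java_home is a hypothetical helper (not part of Nutch).
# It reports whether JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME is set to $JAVA_HOME but bin/java was not found there"
    return 1
  fi
  echo "JAVA_HOME is $JAVA_HOME"
}

# Print a reminder if the check fails, without aborting the shell session.
check_java_home || echo "Set JAVA_HOME before configuring Nutch"
```

If the check fails, export JAVA_HOME to point at your Java install directory and run it again.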

Configuring Nutch Part 2
• Add following url ( ours ) to seed.txt
  – http://www.semtech-solutions.co.nz
• Change url filtering in conf/regex-urlfilter.txt, change the line
  – # accept anything else
  – +.
• To be
  – +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
• This means that urls found during the crawl are kept only if they belong to our own site
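
The seed file and filter rule above can be tried out from the shell. This is a rough sketch that assumes it is run from the apache-nutch-1.6 directory. check_url is a hypothetical helper, and grep -E only approximates the Java regular expressions that Nutch itself uses, although for this particular pattern the two should agree. The example urls other than the seed are made up for illustration.

```shell
# Write the single seed url (this overwrites any existing seed.txt).
mkdir -p urls
echo 'http://www.semtech-solutions.co.nz' > urls/seed.txt

# The accept rule from conf/regex-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*semtech-solutions.co.nz/'

# check_url is a hypothetical helper: it prints "accept" or "reject"
# depending on whether the url matches the pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://www.semtech-solutions.co.nz/'      # our site
check_url 'http://semtech-solutions.co.nz/contact'   # no subdomain, made-up path
check_url 'http://nutch.apache.org/'                 # a different site
```

Note that the dots in semtech-solutions.co.nz are not escaped in the pattern, so they match any character. That makes the rule slightly looser than a literal host name match, which is usually harmless for a single-site crawl.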

Configuring Solr Part 1
• cd solr-4.2.1/example/solr/collection1/conf
• Add some extra fields to schema.xml after _version_ field i.e.

Start Solr Server – Part 1
• Within solr-4.2.1/example
• Run the following command
  – java -jar start.jar
• Now try to access admin web page for solr
  – http://localhost:8983/solr/admin
• You should now see the admin web site
  – ( see next page )
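
As an alternative to opening a browser, the admin page can be probed from a second terminal once Solr has started. This is only a sketch: check_solr is a hypothetical helper, it assumes curl is installed, and it uses the admin address given on the slide.

```shell
# Admin address as given on the slide.
solr_url='http://localhost:8983/solr/admin'

# check_solr is a hypothetical helper. curl flags used:
#   -s silent, -f fail on HTTP errors, -L follow redirects,
#   -o /dev/null discard the page body, --max-time 5 give up after 5 seconds.
check_solr() {
  if curl -sfL -o /dev/null --max-time 5 "$solr_url"; then
    echo "Solr admin page is reachable at $solr_url"
  else
    echo "Solr admin page is not reachable at $solr_url"
  fi
}

check_solr
```

A "not reachable" result straight after launching java -jar start.jar may just mean that Solr is still starting, so it is worth retrying after a few seconds.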

Start Solr Server – Part 2
• Solr Admin web page

Mike Frampton
