Web Scraping Using Nutch and Solr 1/3

A short presentation (part 1 of 3) describing how to use the
open source tools Nutch and Solr to crawl a web site
and process the data.

Mike Frampton

July 10, 2013

Transcript

  1. Web Scraping Using Nutch and Solr
    • A simple example of using open source code
    • Web scrape a single web site ( ours )
    • Environment and code
      – CentOS 6.2 ( Linux )
      – Apache Nutch 1.6
      – Solr 4.2.1
      – Java 1.6
  2. Nutch and Solr Architecture
    • Nutch processes URLs and feeds content to Solr
    • Solr indexes content
  3. Where to get source code
    • Nutch – http://nutch.apache.org
    • Solr – http://lucene.apache.org/solr
    • Java – http://java.com
  4. Installing Source - Nutch
    • Nutch is delivered as
      – apache-nutch-1.6-bin.tar ( 64M )
      – apache-nutch-1.6-src.tar ( 20M )
    • Copy each tar file to your desired location
    • Install each tar file as
      – tar xvf <tar file>
    • The second ( source ) tar file is optional
  5. Installing Source - Solr
    • Solr is delivered as
      – solr-4.2.1.zip ( 116M )
    • Copy the file to your desired location
    • Install the zip file as
      – unzip <zip file>
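The two install slides come down to choosing the right extraction tool for each archive. The following is a minimal shell sketch of that step, assuming tar and unzip are available on the PATH. The function name unpack_archive is hypothetical and is not part of Nutch or Solr.

```shell
#!/bin/sh
# unpack_archive ARCHIVE [DEST]
# Hypothetical helper: extracts ARCHIVE into DEST (default: current directory),
# choosing tar or unzip from the file extension.
unpack_archive() {
    archive=$1
    dest=${2:-.}
    case "$archive" in
        *.tar|*.tar.gz|*.tgz)
            tar xvf "$archive" -C "$dest"
            ;;
        *.zip)
            unzip -q "$archive" -d "$dest"
            ;;
        *)
            echo "unknown archive type: $archive" >&2
            return 1
            ;;
    esac
}

# Example calls, using the file names from the slides:
#   unpack_archive apache-nutch-1.6-bin.tar
#   unpack_archive solr-4.2.1.zip
```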
  6. Configuring Nutch Part 1
    • Assuming we will crawl a single web site
    • Ensure that JAVA_HOME is set
    • cd apache-nutch-1.6
    • Edit agent name in conf/nutch-site.xml
        <property>
          <name>http.agent.name</name>
          <value>Nutch Spider</value>
        </property>
    • mkdir -p urls ; cd urls ; touch seed.txt
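The JAVA_HOME check and the seed file creation from this slide can be sketched as a small shell function. This is an illustration only: the function name prepare_seed_file is hypothetical, and it takes the Nutch install directory as an argument instead of relying on a prior cd.

```shell
#!/bin/sh
# prepare_seed_file NUTCH_DIR
# Hypothetical helper: refuses to continue when JAVA_HOME is unset or empty,
# then creates NUTCH_DIR/urls/seed.txt if it does not exist yet.
prepare_seed_file() {
    nutch_dir=$1
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    mkdir -p "$nutch_dir/urls" && touch "$nutch_dir/urls/seed.txt"
}

# Example call, using the directory name from the slides:
#   prepare_seed_file apache-nutch-1.6
```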
  7. Configuring Nutch Part 2
    • Add the following url ( ours ) to seed.txt
      – http://www.semtech-solutions.co.nz
    • Change url filtering in conf/regex-urlfilter.txt, change the line
      – # accept anything else
      – +.
    • To be
      – +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
    • This means that the urls found are filtered so that only those on the local site are accepted
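To see what the accept pattern on this slide lets through, the expression after the leading "+" can be tried against a few sample URLs with grep. This is only a sketch: Nutch evaluates the pattern as a Java regular expression, while grep -E uses POSIX extended syntax, and the assumption here is that the two agree for this particular pattern. The function name url_allowed and the sample URLs other than the seed site are made up for illustration.

```shell
#!/bin/sh
# The accept pattern from conf/regex-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*semtech-solutions.co.nz/'

# url_allowed URL
# Hypothetical helper: succeeds when URL matches the accept pattern.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try the pattern against a few sample URLs.
for url in \
    "http://www.semtech-solutions.co.nz/" \
    "http://semtech-solutions.co.nz/about" \
    "http://www.example.com/" \
    "https://www.semtech-solutions.co.nz/"
do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The first two URLs are accepted, because the host prefix group may repeat zero or more times. The last two are rejected: one is on a different host, and the other uses https, which the pattern does not allow.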
  8. Configuring Solr Part 1
    • cd solr-4.2.1/example/solr/collection1/conf
    • Add some extra fields to schema.xml after the _version_ field i.e.
  9. Start Solr Server – Part 1
    • Within solr-4.2.1/example
    • Run the following command
      – java -jar start.jar
    • Now try to access the admin web page for Solr
      – http://localhost:8983/solr/admin
    • You should now see the admin web site
      – ( see next page )
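Solr takes a little while to start, so instead of opening the admin page by hand it can be polled from the shell. The sketch below assumes curl is installed and that the admin URL answers with a successful HTTP status once Solr is up. The function name wait_for_url and its retry parameters are hypothetical.

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL until it answers with a successful HTTP
# status, or gives up after ATTEMPTS tries (default 30, one second apart).
wait_for_url() {
    url=$1
    attempts=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl fail on HTTP error codes; -s keeps it quiet.
        if curl -s -f -o /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example use, with the directory and URL from the slide:
#   ( cd solr-4.2.1/example && java -jar start.jar ) &
#   wait_for_url http://localhost:8983/solr/admin && echo "Solr admin page is up"
```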