Nutch

Nutch???

Apache Nutch

Apache Nutch is an open source Web crawler written in Java. It discovers Web page hyperlinks automatically, which cuts down on maintenance work such as checking for broken links, and it creates a copy of every visited page to search over (using Solr).

Multi-protocol, multi-threaded, distributed crawler
Full-text indexer with support for distributed search (Solr)
Plugin-based and highly modular (if you're into Java)

The Crew

Apache Solr: receives parsed documents from Nutch so that they can be searched
Apache Tika: parses any non-HTML documents that Nutch comes across
Apache Hadoop: MapReduce and Hadoop Distributed File System (deploy mode)

Apache Solr

Solr is an open source full-text search framework. With Solr we can search the pages that Nutch has visited. Luckily, integration between Nutch and Solr is pretty straightforward: Apache Nutch supports Solr out of the box. However, Solr needs specific field mappings set up to properly handle indexing requests from Nutch. The Nutch conf directory contains a schema.xml file with the Solr schema that Nutch uses and expects to be present when posting data. A recommended course of action is to use this schema in its own core instance in Solr, assuming you are running Solr in multicore mode.

Nutch-to-Solr Index Structure
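Once Nutch has posted pages into the core, searching them is just an HTTP request against Solr. Here is a minimal Python sketch that only builds the query URL (it does not send it). The core URL is borrowed from the crawl command in this deck; the field names content, title, and url are assumed from the stock Nutch schema.xml, so check them against your own copy. build_select_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Assumption: the Nutch core lives here (same URL as in the crawl command).
SOLR_CORE_URL = "http://localhost:8983/solr/nutch"


def build_select_url(term, rows=10, core_url=SOLR_CORE_URL):
    """Build a Solr /select URL that searches the page text Nutch indexed.

    `term` is expected to be a single word; multi-word queries need
    quoting or escaping, which this sketch does not attempt.
    """
    params = {
        "q": f"content:{term}",  # assumed field: full page text
        "fl": "title,url",       # assumed fields: what to return per hit
        "rows": rows,
        "wt": "json",
    }
    return f"{core_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("hadoop"))
```

Paste the printed URL into a browser, or fetch it with any HTTP client, once Solr is running.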

Apache Tika

Supported Document Formats

Package Formats: tar, jar, zip, bzip2, gz, tgz
Text Document Formats: doc, xls, ppt, rtf, pdf, html, xhtml, OpenDocument, txt
Image Formats: bmp, gif, png, jpeg, tiff
Audio Formats: mp3, aiff, au, midi, wav
Misc Formats: pst (Outlook Mail), xml, class

SIDENOTE: An endpoint can be configured in Solr for parsing and indexing documents without Nutch. I was tinkering with this before I started working with Nutch. I prefer the flexibility of the request handler in Solr, because you can configure it to ignore divs and store paragraph tags so that search results look more authentic.

Apache Hadoop

Like I said before: MapReduce and Hadoop Distributed File System (deploy mode).

If anyone knows anything about Hadoop, that's great. I hear it's a big deal.

Let's Get On With It

The Nutch Workflow

1. Inject
   Create CrawlDB
   Insert seed URLs
   Create LinkDB
2. Generate (fetch list)
3. Fetch
4. Parse
5. Update CrawlDB
6. Update LinkDB
7. Index
8. Repeat 2-7 until finished
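To make the loop concrete, here is a toy, in-memory Python model of the cycle. None of it is Nutch code: FAKE_WEB, crawl(), and the dict-based "databases" are hypothetical stand-ins, and the generate step simply takes the first top_n unfetched URLs, whereas real Nutch ranks candidates by score.

```python
# Toy model of the crawl cycle; every name here is illustrative only.

FAKE_WEB = {  # url -> outlinks discovered when the page is parsed
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}


def crawl(seeds, depth, top_n):
    crawldb = {url: "unfetched" for url in seeds}  # 1. inject seed URLs
    linkdb = {}   # url -> set of pages that link to it
    index = []    # stand-in for the Solr index

    for _ in range(depth):                         # 8. repeat 2-7
        # 2. generate: pick up to top_n URLs that still need fetching
        fetch_list = [u for u, s in crawldb.items() if s == "unfetched"][:top_n]
        if not fetch_list:
            break                                  # nothing left to crawl
        for url in fetch_list:
            outlinks = FAKE_WEB.get(url, [])       # 3. fetch and 4. parse
            crawldb[url] = "fetched"               # 5. update CrawlDB
            for link in outlinks:
                crawldb.setdefault(link, "unfetched")
                linkdb.setdefault(link, set()).add(url)  # 6. update LinkDB
            index.append(url)                      # 7. index
    return crawldb, linkdb, index


if __name__ == "__main__":
    crawldb, linkdb, index = crawl(["http://example.com/"], depth=3, top_n=5)
    print(index)
```

Lowering depth or top_n leaves some URLs in the CrawlDB as "unfetched", which is the same trade-off the -depth and -topN options control on a real crawl.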

Looks something like this

Let's just run it

1. Get Solr up, running, and configured correctly
2. Create urls/seed.txt with seed URLs
3. Modify conf/regex-urlfilter.txt to keep Nutch within your domain
4. bin/nutch crawl urls -solr http://localhost:8983/solr/nutch -depth 3 -topN 5
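The URL filter step is the one that usually trips people up, so here is a small Python sketch of how rules in the regex-urlfilter.txt style behave: each rule is a + (accept) or - (reject) followed by a regex, the first matching rule decides, and a URL that matches nothing is dropped. The rules below are hypothetical, example.com stands in for your own domain, and parse_rules() and accepts() are illustrative helpers rather than anything Nutch ships.

```python
import re

# Hypothetical rules, written the way they would appear in the filter file.
RULES = r"""
# skip common image and archive suffixes
-\.(gif|jpg|png|css|js|zip|gz)$
# accept anything on example.com and its subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line[0] not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.example.com/docs/",
                "http://example.com/logo.png",
                "http://other.org/"):
        print(url, accepts(url, rules))
```

Because order matters, the reject rules go first; if the file ends with a catch-all accept rule, the crawler will wander off your domain.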


nojanrak
