Nutch

Short introduction to Nutch and other involved Apache projects. Given at TST Media's April Developer Lunch (April 25, 2012).

nojanrak

May 24, 2012

Transcript

  1. Nutch??? Apache Nutch

    Apache Nutch is an open source Web crawler written in Java. With it, we can discover Web page hyperlinks automatically, cut down on maintenance work such as checking for broken links, and create a copy of all the visited pages to search over (using Solr).

    - Multi-protocol, multi-threaded, distributed crawler
    - Full-text indexer with support for distributed search (Solr)
    - Plugin-based and highly modular (if you're into Java)
  2. The Crew

    - Solr: receives parsed documents from Nutch so that they can be searched
    - Tika: parses any non-HTML documents that Nutch comes across
    - Hadoop: MapReduce and Hadoop Distributed File System (deploy mode)
  3. Apache Solr

    Solr is an open source full-text search framework; with Solr we can search the pages Nutch has visited. Luckily, integration between Nutch and Solr is pretty straightforward: Apache Nutch supports Solr out of the box. However, Solr needs specific field mappings set up to properly handle indexing requests from Nutch. The Nutch conf directory contains a schema.xml file with the Solr schema that Nutch uses and expects to be present when posting data. A recommended course of action is to use this schema in its own core instance in Solr, assuming you are running Solr in multicore mode.

    Nutch-to-Solr Index Structure
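The schema step described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `install_nutch_schema` and both directory arguments are hypothetical, and the location of a core's conf directory depends on how your Solr install is laid out.

```shell
# Sketch: copy Nutch's Solr schema into the conf directory of a dedicated core.
# install_nutch_schema and its arguments are hypothetical names for illustration.
#   $1 = Nutch install directory (expected to contain conf/schema.xml)
#   $2 = conf directory of the Solr core that will receive Nutch documents
install_nutch_schema() {
  src="$1/conf/schema.xml"
  dest_dir="$2"

  if [ ! -f "$src" ]; then
    echo "schema not found: $src" >&2
    return 1
  fi

  mkdir -p "$dest_dir"

  # Keep a backup of any schema the core already has.
  if [ -f "$dest_dir/schema.xml" ]; then
    cp "$dest_dir/schema.xml" "$dest_dir/schema.xml.bak"
  fi

  cp "$src" "$dest_dir/schema.xml"
}
```

A call might look like `install_nutch_schema /path/to/nutch /path/to/solr/nutch/conf` (both paths are placeholders). Solr typically has to be restarted, or the core reloaded, before a changed schema takes effect.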
  4. Apache Tika

    Supported Document Formats
    Package Formats: tar, jar, zip, bzip2, gz, tgz
    Text Document Formats: doc, xls, ppt, rtf, pdf, html, xhtml, OpenDocument, txt
    Image Formats: bmp, gif, png, jpeg, tiff
    Audio Formats: mp3, aiff, au, midi, wav
    Misc Formats: pst (Outlook Mail), xml, class

    SIDENOTE: An endpoint can be configured in Solr for parsing and indexing documents without Nutch. I was tinkering with this before I started working with Nutch. I prefer the flexibility of the request handler in Solr, because you can configure it to ignore divs and store paragraph tags so that search results look more authentic.
  5. Apache Hadoop

    Like I said before: MapReduce and Hadoop Distributed File System (deploy mode). If anyone knows anything about Hadoop, that's great. I hear it's a big deal.
  6. Let's Get On With It

    The Nutch Workflow
    1. Inject: create CrawlDB, insert seed URLs, create LinkDB
    2. Generate (fetch list)
    3. Fetch
    4. Parse
    5. Update CrawlDB
    6. Update LinkDB
    7. Index
    8. Repeat 2-7 until finished
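The workflow steps above map roughly onto individual `bin/nutch` subcommands. The dry-run sketch below only prints those commands, so it runs without Nutch or Solr installed. The subcommand names follow the Nutch 1.x tutorial as I understand it and should be checked against your version; the function names, the `crawl` directory, and the `SEGMENT_1` placeholder are assumptions made for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print the Nutch 1.x command behind each workflow step.
# Nothing is executed against Nutch or Solr.

# Step 1: seed the CrawlDB from a directory of seed-URL files.
print_inject() {
  crawl_dir=$1
  seed_dir=$2
  echo "bin/nutch inject $crawl_dir/crawldb $seed_dir"
}

# Steps 2-7 for one round. The segment path is passed in because a real
# "generate" run creates a new timestamped directory under segments/.
print_round() {
  crawl_dir=$1
  segment=$2
  solr_url=$3
  # 2. Generate a fetch list
  echo "bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments -topN 5"
  # 3. Fetch
  echo "bin/nutch fetch $segment"
  # 4. Parse
  echo "bin/nutch parse $segment"
  # 5. Update CrawlDB
  echo "bin/nutch updatedb $crawl_dir/crawldb $segment"
  # 6. Update LinkDB
  echo "bin/nutch invertlinks $crawl_dir/linkdb -dir $crawl_dir/segments"
  # 7. Index into Solr
  echo "bin/nutch solrindex $solr_url $crawl_dir/crawldb -linkdb $crawl_dir/linkdb $segment"
}

print_inject crawl urls
print_round crawl crawl/segments/SEGMENT_1 http://localhost:8983/solr/nutch
```

Step 8 would correspond to calling `print_round` again with the next segment. In a real run, the segment path would be taken from the newest directory under `crawl/segments` after each generate step.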
  7. Let's just run it

    - Solr up, running, and configured correctly
    - Create urls/seed.txt with seed URLs
    - Modify conf/regex-urlfilter.txt to keep Nutch within your domain
    - bin/nutch crawl urls -solr http://localhost:8983/solr/nutch -depth 3 -topN 5
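The preparation steps above might look like the following sketch. It writes into a scratch directory so it is safe to run anywhere; `nutch-sketch` and `example.com` are placeholders, not values from the talk. In a real install you would edit the existing conf/regex-urlfilter.txt instead, which (in the Nutch 1.x releases I am familiar with) ends in a catch-all `+.` rule that the domain rule would replace.

```shell
#!/bin/sh
# Sketch: prepare the seed list and URL filter in a scratch directory.
# "nutch-sketch" and example.com are placeholders for illustration.
set -e

NUTCH_DIR=nutch-sketch
mkdir -p "$NUTCH_DIR/urls" "$NUTCH_DIR/conf"

# Seed list: one URL per line.
printf '%s\n' 'http://www.example.com/' > "$NUTCH_DIR/urls/seed.txt"

# URL filter: a leading "+" accepts URLs that match the regex.
# This rule keeps the crawl on example.com and its subdomains.
printf '%s\n' '+^http://([a-z0-9]*\.)*example\.com/' > "$NUTCH_DIR/conf/regex-urlfilter.txt"

# Print the crawl command from the slide rather than running it,
# so the sketch works without Nutch or Solr installed.
echo "(cd $NUTCH_DIR && bin/nutch crawl urls -solr http://localhost:8983/solr/nutch -depth 3 -topN 5)"
```

In the crawl command, `-depth 3` limits how many generate/fetch rounds are run from the seeds, and `-topN 5` caps the number of pages fetched per round, which keeps a first test crawl small.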