Run Nutch in Eclipse on Linux and Windows (Nutch version 1.0)
Tested with
· Nutch release 1.0
· Eclipse 3.3
· Java 1.6
· Ubuntu (should work on most platforms though)
· Windows XP
Steps
For Windows Users
If you are running Windows (tested on Windows XP), you must first install Cygwin.
Download the Cygwin installer from http://www.cygwin.com/setup.exe
Installation instructions for Cygwin are widely available online, so they are omitted here.
After installing Cygwin, follow the rest of these steps.
Install Nutch
· Grab a fresh release of Nutch 1.0 - http://lucene.apache.org/nutch/version_control.html
· Set NUTCH_HOME (the location where you downloaded Nutch 1.0) as an environment variable.
· Set NUTCH_JAVA_HOME (the JDK 1.6 installation directory) as an environment variable.
· Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory
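Before moving on, the prerequisites above can be verified with a small standalone program. This is only a sketch: the class NutchEnvCheck is a made-up helper, not part of Nutch, and it assumes that both variables should point at existing directories.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: verifies the prerequisites listed above.
class NutchEnvCheck {

    // Returns one message per problem found; an empty list means all checks passed.
    static List<String> check(String nutchHome, String nutchJavaHome) {
        List<String> problems = new ArrayList<String>();
        if (nutchHome == null || nutchHome.isEmpty()) {
            problems.add("NUTCH_HOME is not set");
        } else {
            File home = new File(nutchHome);
            if (!home.isDirectory()) {
                problems.add("NUTCH_HOME is not a directory: " + nutchHome);
            }
            // Leftover Eclipse metadata would clash with the project created in the next step.
            for (String name : new String[] {".project", ".classpath"}) {
                File stale = new File(home, name);
                if (stale.exists()) {
                    problems.add("Remove before importing: " + stale.getPath());
                }
            }
        }
        if (nutchJavaHome == null || nutchJavaHome.isEmpty()) {
            problems.add("NUTCH_JAVA_HOME is not set");
        } else if (!new File(nutchJavaHome).isDirectory()) {
            problems.add("NUTCH_JAVA_HOME is not a directory: " + nutchJavaHome);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = check(System.getenv("NUTCH_HOME"), System.getenv("NUTCH_JAVA_HOME"));
        if (problems.isEmpty()) {
            System.out.println("Environment looks ready for the Eclipse import.");
        }
        for (String problem : problems) {
            System.out.println("Problem: " + problem);
        }
    }
}
```

Run it from a terminal (a Cygwin shell on Windows) so that it sees the same environment variables you just set.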
Create a new Java project in Eclipse
· File > New > Project > Java project > click Next
· Name the project (Nutch for instance)
· Select "Create project from existing source" and use the location where you downloaded nutch-1.0
· Click on Next, and wait while Eclipse is scanning the folders
· Add the folder "conf" to the classpath (third tab, "Libraries", then "Add Class Folder")
· Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top.
· Eclipse should have guessed all the Java files that must be added to your classpath. If that is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries.
· Set output dir to "tmp_build", create it if necessary
· DO NOT add "build" to classpath
Configure Nutch
1. Open $NUTCH_HOME/conf/nutch-site.xml and add the following content:
<configuration>
<property>
<name>http.agent.name</name>
<value>my nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
<property>
<name>plugin.folders</name>
<value>E:/nutch-1.0/src/plugin</value>
</property>
</configuration>
Note: Here I set the value of "plugin.folders" to an absolute path; you can also use a relative path.
2. Optionally you may also set http.agent.url and http.agent.email properties.
3. Make sure Nutch is configured correctly before testing it in Eclipse.
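One way to double-check the configuration is to read nutch-site.xml back and confirm that the properties set above are present. The sketch below uses only the XML parser that ships with the JDK; NutchSiteCheck is a made-up helper name, and the default path conf/nutch-site.xml assumes the program is run from the Nutch directory.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads nutch-site.xml with the JDK's DOM parser.
class NutchSiteCheck {

    // Returns the name -> value pairs of every <property> element in the stream.
    static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    // Trimmed text of the first <tag> element under parent, or null if there is none.
    static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the file path is the first argument, else conf/nutch-site.xml is used.
        File site = new File(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        InputStream in = new FileInputStream(site);
        try {
            Map<String, String> props = readProperties(in);
            for (String required : new String[] {"http.agent.name", "plugin.folders"}) {
                String value = props.get(required);
                boolean ok = value != null && !value.isEmpty();
                System.out.println(required + " = " + value + (ok ? "" : "  <-- missing"));
            }
        } finally {
            in.close();
        }
    }
}
```

A parse error here usually means the XML is not well formed, for example a missing closing tag.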
Missing org.farng and com.etranslate
Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.
Download them here:
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path: first refresh the workspace by pressing F5, then right-click the project folder > Build Path > Configure Build Path..., select the Libraries tab, click "Add Jars..." and add each .jar file individually.
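If Eclipse still reports the import errors, it may help to confirm that the jars really landed in the two lib folders. The following is a minimal sketch; PluginJarCheck is a made-up helper, and it only counts .jar files without checking which classes they contain.

```java
import java.io.File;
import java.io.FilenameFilter;

// Hypothetical helper, not part of Nutch: counts the jars in each plugin's lib folder.
class PluginJarCheck {

    // Names of the .jar files directly inside libDir; empty if the folder does not exist.
    static String[] jarsIn(File libDir) {
        String[] names = libDir.list(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.toLowerCase().endsWith(".jar");
            }
        });
        return names == null ? new String[0] : names;
    }

    public static void main(String[] args) {
        // Assumption: the Nutch directory is the first argument, else the current directory.
        File nutchHome = new File(args.length > 0 ? args[0] : ".");
        for (String plugin : new String[] {"parse-mp3", "parse-rtf"}) {
            File lib = new File(nutchHome, "src/plugin/" + plugin + "/lib");
            int count = jarsIn(lib).length;
            System.out.println(lib.getPath() + ": " + count + " jar(s)"
                    + (count == 0 ? "  <-- copy the downloaded jars here" : ""));
        }
    }
}
```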
Build Nutch
If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.
Create Eclipse launcher
· Menu Run > Open Run Dialog..., choose the right project name, and set the main class to
org.apache.nutch.crawl.Crawl
· On the Arguments tab, under Program arguments, enter
urls -dir crawl -depth 3 -topN 50 -threads 10
Here, "urls" is the directory containing the list of URLs to crawl. The options are:
· -dir dir names the directory to put the crawl in.
· -threads threads determines the number of threads that will fetch in parallel.
· -depth depth indicates the link depth from the root page that should be crawled.
· -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
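The "urls" directory has to exist before the first run. As a rough sketch, the program below creates it with a plain text file holding one URL per line. SeedListWriter and the file name seed.txt are arbitrary choices made for this example, and the seed URL is only a placeholder.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper, not part of Nutch: writes a seed list into the "urls" directory.
class SeedListWriter {

    // Creates urlsDir if needed and writes one seed URL per line into seed.txt.
    static File writeSeeds(File urlsDir, String... seeds) throws IOException {
        if (!urlsDir.isDirectory() && !urlsDir.mkdirs()) {
            throw new IOException("Cannot create directory " + urlsDir);
        }
        File seedFile = new File(urlsDir, "seed.txt"); // file name is an arbitrary choice
        PrintWriter out = new PrintWriter(new FileWriter(seedFile));
        try {
            for (String seed : seeds) {
                out.println(seed);
            }
        } finally {
            out.close();
        }
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder seed; replace it with the site you want to crawl.
        File seedFile = writeSeeds(new File("urls"), "http://lucene.apache.org/nutch/");
        System.out.println("Wrote " + seedFile.getPath());
    }
}
```

Run it with the project directory as the working directory, so that the relative path "urls" in the program arguments points at the directory it creates.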
· Under VM arguments, enter
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
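The two -D options become Java system properties inside the launched JVM. The sketch below shows where the log file would end up, assuming that a relative hadoop.log.dir is resolved against the working directory of the launch (normally the project directory in Eclipse). HadoopLogLocation is a made-up helper, not part of Nutch or Hadoop.

```java
import java.io.File;

// Hypothetical helper, not part of Nutch or Hadoop: shows where the log file should appear.
class HadoopLogLocation {

    // Joins the two property values; a relative logDir is resolved against workingDir.
    static File resolve(String workingDir, String logDir, String logFile) {
        File dir = new File(logDir);
        if (!dir.isAbsolute()) {
            dir = new File(workingDir, logDir);
        }
        return new File(dir, logFile);
    }

    public static void main(String[] args) {
        String logDir = System.getProperty("hadoop.log.dir");
        String logFile = System.getProperty("hadoop.log.file");
        if (logDir == null || logFile == null) {
            System.out.println("hadoop.log.dir or hadoop.log.file is not set in VM arguments.");
            return;
        }
        System.out.println("Expected log file: "
                + resolve(System.getProperty("user.dir"), logDir, logFile).getPath());
    }
}
```

Launching this class with the same VM arguments as the crawl should print a path ending in logs/hadoop.log under the working directory.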
Java Heap Size problem
If you find lines similar to these in hadoop.log:
2009-05-09 14:03:09,640 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
You should increase the amount of memory available to applications run from Eclipse.
Set it in:
Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
I've set mine to
-Xms5m -Xmx150m
-Xms sets the initial heap size for applications run from Eclipse; -Xmx sets the maximum heap size.
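To confirm that the new settings are actually picked up, a small program run from Eclipse can print the heap limits of its own JVM. This is only an illustration; HeapCheck is a made-up class name, and the reported maximum is typically close to, but not always exactly, the -Xmx value.

```java
// Hypothetical helper, not part of Nutch: prints the heap limits of the running JVM.
class HeapCheck {

    // Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // maxMemory() reflects -Xmx; totalMemory() is the heap currently reserved.
        System.out.println("Max heap:     " + toMegabytes(runtime.maxMemory()) + " MB");
        System.out.println("Current heap: " + toMegabytes(runtime.totalMemory()) + " MB");
    }
}
```

If the crawl still runs out of memory, raise the -Xmx value and run the check again.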
References:
http://wiki.apache.org/nutch/RunNutchInEclipse0.9
http://wiki.apache.org/nutch/NutchTutorial