Running Nutch 1.0 in Eclipse

Run Nutch in Eclipse on Linux and Windows (Nutch version 1.0)

Tested with

·         Nutch release 1.0

·         Eclipse 3.3

·         Java 1.6

·         Ubuntu (should work on most platforms though)

·         Windows XP

Steps

For Windows Users

If you are running Windows (tested on Windows XP), you must first install Cygwin.

Download cygwin from http://www.cygwin.com/setup.exe

Installation instructions for Cygwin are widely available online, so the steps are omitted here.

After installing Cygwin, follow the rest of these steps.

Install Nutch

·         Grab a fresh release of Nutch 1.0 - http://lucene.apache.org/nutch/version_control.html

·         Set the NUTCH_HOME environment variable to the location where you downloaded Nutch 1.0.

·         Set the NUTCH_JAVA_HOME environment variable to your JDK 1.6 installation directory.

·         Do not build Nutch now. Make sure you have no .project and .classpath files in the Nutch directory.
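Before moving on to Eclipse, it can help to confirm that the settings above are visible to Java. The following is a minimal sketch, not part of Nutch: the class name NutchEnvCheck and its helper methods are made up for illustration, and it assumes that NUTCH_HOME contains a "conf" folder and that the JDK directory contains a "bin" folder.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with Nutch.
class NutchEnvCheck {

    /** Returns a description of the problem, or null if the directory looks usable. */
    static String checkDir(String name, String value, String expectedChild) {
        if (value == null || value.length() == 0) {
            return name + " is not set";
        }
        if (!new File(value, expectedChild).exists()) {
            return name + "=" + value + " does not contain '" + expectedChild + "'";
        }
        return null;
    }

    /** Collects every problem found for the given settings. */
    static List<String> findProblems(String nutchHome, String javaHome) {
        List<String> problems = new ArrayList<String>();
        String p = checkDir("NUTCH_HOME", nutchHome, "conf");
        if (p != null) problems.add(p);
        p = checkDir("NUTCH_JAVA_HOME", javaHome, "bin");
        if (p != null) problems.add(p);
        // Leftover Eclipse metadata would clash with the project created later.
        if (nutchHome != null) {
            for (String stale : new String[] {".project", ".classpath"}) {
                if (new File(nutchHome, stale).exists()) {
                    problems.add("Remove " + stale + " from " + nutchHome);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = findProblems(
                System.getenv("NUTCH_HOME"), System.getenv("NUTCH_JAVA_HOME"));
        if (problems.isEmpty()) {
            System.out.println("Environment looks ready for the Eclipse import.");
        }
        for (String problem : problems) {
            System.out.println("Problem: " + problem);
        }
    }
}
```

Compile and run it from the same shell (Cygwin on Windows) in which the variables were set; any line starting with "Problem:" points at a step above that still needs attention.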

Create a new java project in Eclipse

·         File > New > Project > Java project > click Next

·         Name the project (Nutch for instance)

·         Select "Create project from existing source" and use the location where you downloaded nutch-1.0

·         Click on Next, and wait while Eclipse is scanning the folders

·         Add the folder "conf" to the classpath (on the third tab, "Libraries", click "Add Class Folder")

·         Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top.

·         Eclipse should have detected all the Java source folders that belong on your classpath. If it has not, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries

·         Set output dir to "tmp_build", create it if necessary

·         DO NOT add "build" to classpath

Configure Nutch

1.    Open $NUTCH_HOME/conf/nutch-site.xml and add the following content:

 

<configuration>
        <property>
                <name>http.agent.name</name>
                <value>my nutch agent</value>
        </property>

        <property>
                <name>http.agent.version</name>
                <value>1.0</value>
        </property>

        <property>
                <name>plugin.folders</name>
                <value>E:/nutch-1.0/src/plugin</value>
        </property>
</configuration>

 

Note: Here the value of “plugin.folders” is an absolute path; you can also use a relative path.

2. Optionally you may also set http.agent.url and http.agent.email properties.

3. Make sure Nutch is configured correctly before testing it in Eclipse.

 

Missing org.farng and com.etranslate

Eclipse will complain about some import statements in the parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path: first refresh the workspace by pressing F5, then right-click the project folder > Build Path > Configure Build Path..., select the Libraries tab, click "Add JARs..." and add each .jar file individually.

Build Nutch

If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.

Create Eclipse launcher

From the menu, choose Run > Open Run Dialog..., select the right project name, and set the main class to:

org.apache.nutch.crawl.Crawl

On the Arguments tab, under Program arguments, enter:

urls -dir crawl -depth 3 -topN 50 -threads 10

Here, “urls” is the directory that holds the list of URLs we want to crawl. The options are:

·         -dir dir names the directory to put the crawl in.

·         -threads threads determines the number of threads that will fetch in parallel.

·         -depth depth indicates the link depth from the root page that should be crawled.

·         -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
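The crawl takes its starting points from the “urls” directory named in the first program argument. The sketch below shows one way to prepare that directory, assuming (as in the Nutch tutorial) a plain text file with one URL per line. The class name SeedListWriter, the file name seed.txt and the example URL are placeholders chosen for illustration, not names that Nutch requires.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper, not shipped with Nutch.
class SeedListWriter {

    /** Creates urlsDir if needed and writes one seed URL per line into seed.txt. */
    static File writeSeeds(File urlsDir, String... seeds) throws IOException {
        if (!urlsDir.isDirectory() && !urlsDir.mkdirs()) {
            throw new IOException("Could not create directory " + urlsDir);
        }
        File seedFile = new File(urlsDir, "seed.txt");
        PrintWriter out = new PrintWriter(new FileWriter(seedFile));
        try {
            for (String seed : seeds) {
                out.println(seed);
            }
        } finally {
            out.close();
        }
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // "urls" is relative, so it is resolved against the current working directory.
        File seedFile = writeSeeds(new File("urls"), "http://nutch.apache.org/");
        System.out.println("Wrote " + seedFile.getAbsolutePath());
    }
}
```

Because “urls” and “crawl” are relative paths, they are resolved against the launcher's working directory, which in Eclipse is normally the project directory unless changed on the Arguments tab. Run the sketch from that same directory, or simply create the folder and file by hand.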

 

Under VM arguments, enter:

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

Java Heap Size problem

If you find lines similar to these in hadoop.log:

2009-05-09 14:03:09,640 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space

You should increase the amount of RAM available to applications run from Eclipse.

Set it in:

Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments

I've set mine to

-Xms5m -Xmx150m

-Xms sets the initial (minimum) heap size for running applications; -Xmx sets the maximum.
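To confirm that the new setting actually reaches applications launched from Eclipse, a small check like the following can be run as a Java application in the same project. It is only a sketch: the class name HeapCheck is made up for illustration, and the figure the JVM reports may be slightly lower than the -Xmx value.

```java
// Hypothetical helper, not shipped with Nutch.
class HeapCheck {

    /** Converts a byte count to whole megabytes. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() is the upper limit the JVM will try to use for the heap.
        long maxMb = toMegabytes(Runtime.getRuntime().maxMemory());
        System.out.println("Max heap available to this JVM: " + maxMb + " MB");
    }
}
```

If the printed value does not change after editing the default VM arguments, check that the launcher uses the JRE you edited and does not override the heap size in its own VM arguments.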

 

References:

http://wiki.apache.org/nutch/RunNutchInEclipse0.9

http://wiki.apache.org/nutch/NutchTutorial

 

 
