1. Install the JDK (I recommend installing Sun JDK 6 the native way).
2. Make sure Tomcat is installed and working.
3. Download nutch-1.0, unpack it, and copy it to /opt/.
cd /opt/nutch-1.0
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
If JAVA_HOME and the related environment variables are not set, you will typically see the following error:
[: 72: ==: unexpected operator
Error: JAVA_HOME is not set.
To fix this, edit /etc/environment.
Switch to the root account:
sudo su
Then open the environment file:
gedit /etc/environment
After editing, it should read as follows:
JAVA_HOME="/usr/lib/jvm/jdk/jdk1.6.0_13"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/lib/jvm/jdk/jdk1.6.0_13/bin"
LANGUAGE="zh_CN:zh:en_US:en"
LANG="zh_CN.UTF-8"
CLASSPATH=/usr/lib/jvm/jdk/jdk1.6.0_13/lib:/usr/lib/jvm/jdk/jdk1.6.0_13/jre/lib
Here /usr/lib/jvm/jdk/jdk1.6.0_13 is the directory where I installed the Sun JDK.
Then reboot Ubuntu, or log out and log back in.
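After logging back in, a quick check along the following lines can confirm that the new setting took effect before running Nutch again. This is only a sketch: check_java_home is a hypothetical helper, not part of Nutch or Ubuntu, and it does no more than test that the variable is non-empty and that bin/java exists under it.

```shell
# Hypothetical helper (not part of Nutch): report whether a JAVA_HOME value is usable.
check_java_home() {
    java_home=$1
    if [ -z "$java_home" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$java_home/bin/java" ]; then
        echo "no executable java under $java_home/bin"
        return 1
    fi
    echo "JAVA_HOME looks usable: $java_home"
}

# Check the current environment; a diagnosis is printed either way.
check_java_home "${JAVA_HOME:-}" || true
```

If the first message appears even after editing /etc/environment, the session has most likely not been restarted yet.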
Next, switch to the root account again:
sudo su
and try once more:
cd /opt/nutch-1.0
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
You will probably get another error:
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
[: 72: ==: unexpected operator
[: 132: ==: unexpected operator
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
Could not find the main class: org.apache.nutch.crawl.Crawl. Program will exit.
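At first glance this looks like a wrong CLASSPATH setting, but that is not the whole story.
Open the launcher script with
gedit bin/nutch
to see what is actually going on.
The relevant parts of the script are quoted below: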
# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  #echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
The check just above shows that when JAVA_HOME is not set, the script prints "Error: JAVA_HOME is not set." This is the error we ran into a moment ago.
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# so that filenames w/ spaces are handled correctly in loops below
IFS=
# for developers, add plugins, job & test code to CLASSPATH
if [ -d "$NUTCH_HOME/build/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi
if [ $IS_CORE == 0 ]
then
  for f in $NUTCH_HOME/build/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f;
  done
else
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi
# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi
# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
This last loop is the key to the error. Listing the lib directory shows the following files ($NUTCH_HOME is the directory Nutch runs from, here /opt/nutch-1.0):
apache-solr-common-1.3.0.jar commons-httpclient-3.0.1.jar icu4j-4_0_1.LICENSE.txt jetty-ext native xerces-2_6_2.jar
apache-solr-solrj-1.3.0.jar commons-lang-2.1.jar jakarta-oro-2.0.8.jar junit-3.8.1.jar servlet-api.jar
commons-beanutils-1.8.0.jar commons-logging-1.0.4.jar jets3t-0.6.1.jar junit-3.8.1.LICENSE.txt taglibs-i18n.jar
commons-cli-2.0-SNAPSHOT.jar commons-logging-api-1.0.4.jar jets3t-0.6.1.LICENSE.txt log4j-1.2.15.jar taglibs-i18n.tld
commons-codec-1.3.jar hadoop-0.19.1-core.jar jetty-5.1.4.jar lucene-core-2.4.0.jar tika-0.1-incubating.jar
commons-collections-3.2.1.jar icu4j-4_0_1.jar jetty-5.1.4.LICENSE.txt lucene-misc-2.4.0.jar xerces-2_6_2-apis.jar
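None of these, however, is the JAR that contains org.apache.nutch.crawl.Crawl. The class files have all been packaged into a JAR, so where did that JAR go?
Change to /opt/nutch-1.0
and run ls. There it is, in the root directory of all places: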
CHANGES.txt default.properties KEYS LICENSE.txt NOTICE.txt nutch-1.0.war nutchstart.sh~ README.txt taxurls.txt webapps
build.xml conf docs lib logs nutch-1.0.jar nutch-1.0.job nutchstart.sh plugins src taxweb
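Note nutch-1.0.jar: it packages the core org.apache.nutch class files, Crawl included. So the cause is found. nutch-1.0.jar never makes it onto the CLASSPATH, because it lies outside the $NUTCH_HOME/lib/*.jar pattern, and that is why the class org.apache.nutch.crawl.Crawl cannot be found.
The classpath section of the script then continues: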
for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
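To make the effect of the lib/*.jar loop concrete, the sketch below replays it against a throwaway directory. It is an illustration only: build_lib_classpath is a hypothetical helper that mimics just that one loop (plus the conf entry), not the whole of bin/nutch, and the two empty files merely stand in for real JARs.

```shell
# Hypothetical helper: mimic the lib/*.jar loop from bin/nutch for a given directory.
build_lib_classpath() {
    nutch_home=$1
    cp_value=$nutch_home/conf
    for f in "$nutch_home"/lib/*.jar; do
        # Skip the unexpanded pattern when lib/ holds no jars at all.
        [ -e "$f" ] || continue
        cp_value=$cp_value:$f
    done
    echo "$cp_value"
}

# Throwaway layout: one jar under lib/, the core jar in the root directory.
demo=$(mktemp -d)
mkdir -p "$demo/conf" "$demo/lib"
: > "$demo/lib/lucene-core-2.4.0.jar"
: > "$demo/nutch-1.0.jar"

# The printed value lists conf and the jar under lib/, but not nutch-1.0.jar.
build_lib_classpath "$demo"
```

The glob only ever looks inside lib/, so a JAR sitting one level up is invisible to it no matter what it contains.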
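So the problem is found.
Now copy nutch-1.0.jar into /opt/nutch-1.0/lib.
That does it.
The problem is solved (I searched Google for a long time without finding an answer, so I had to work it out myself):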
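The copy step described above can be sketched as a small function. copy_core_jar is a hypothetical name introduced here for illustration; the only things taken from the post are the file name nutch-1.0.jar and the lib/ target directory.

```shell
# Hypothetical helper: copy the core jar from the Nutch root into its lib/ directory.
copy_core_jar() {
    nutch_home=$1
    jar_file=$nutch_home/nutch-1.0.jar
    if [ ! -f "$jar_file" ]; then
        echo "not found: $jar_file" >&2
        return 1
    fi
    cp "$jar_file" "$nutch_home/lib/"
}

# On the machine used in this post the call would be:
#   copy_core_jar /opt/nutch-1.0
```

Copying rather than moving leaves the original file in place, so anything else that expects the JAR in the root directory keeps working.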
root@fjadmin-webcrawler:/opt/nutch-1.0# sh bin/nutch crawl
[: 72: ==: unexpected operator
[: 132: ==: unexpected operator
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
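The two "==: unexpected operator" lines still appear in this final run, and the post does not explain them. One plausible reading, offered here as an assumption rather than something the post confirms: sh on Ubuntu is typically dash, whose [ builtin accepts = but not ==, while the script contains tests such as if [ $IS_CORE == 0 ]. A test that errors out counts as false, so the else branch would run instead of the branch that adds nutch-*.job. The sketch below shows the portable spelling; is_zero is a hypothetical helper, not part of bin/nutch.

```shell
# Hypothetical helper: POSIX-portable replacement for a "[ $x == 0 ]" style test.
is_zero() {
    # A single "=" is the string comparison that every POSIX sh understands.
    [ "$1" = 0 ]
}

if is_zero 0; then
    echo "zero"
else
    echo "not zero"
fi
```

If this reading is right, starting the script with bash (bash bin/nutch crawl) instead of sh should also make the messages go away.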