To customize Nutch, you first need to be able to read its source code.
Version: 2.2.1 (the latest release at the time of writing)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nutch is run by typing ./bin/nutch. Looking at that file shows that it is a shell script:
root@idc200:/usr/local/nutch-2.2.1/runtime/local/bin# cat nutch|wc -l
244
Let's start by walking through these 244 lines.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# The Nutch command script
#
# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#
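Lines 1-28 above are comments (the license header and the environment variables the script honors), so they need no explanation.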
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next are lines 29-33:
cygwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
esac
Adding the following test right after these lines:
if $cygwin; then
echo "cygwin is true"
else
echo "cygwin is false"
fi
shows that cygwin is false, because I am running directly on Linux rather than on Windows.
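The case pattern can be tried against arbitrary kernel names without leaving Linux. The sketch below wraps the same test in a function; is_cygwin is a name made up for this illustration, and CYGWIN_NT-10.0 is only a sample of the kind of string uname prints under Cygwin.

```shell
# Hypothetical helper (not part of bin/nutch): classify a kernel name
# with the same case pattern the script uses.
is_cygwin() {
  case "$1" in
    CYGWIN*) echo true ;;
    *)       echo false ;;
  esac
}

is_cygwin "$(uname)"         # what the script effectively computes
is_cygwin "CYGWIN_NT-10.0"   # sample Cygwin-style name: prints true
is_cygwin "Linux"            # prints false
```

Because cygwin ends up holding the word true or false, the later "if $cygwin; then" simply runs the shell builtin of that name.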
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next are lines 34-45:
# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done
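What does this block do?
THIS is initialized from $0, the path used to invoke the script (not the first argument); normally that is "./bin/nutch".
What does [ -h "$THIS" ] mean?
-h FILE    FILE exists and is a symbolic link (same as -L)
So -h tests whether $THIS exists and is a symbolic link, and the body between do and done runs for as long as that holds. Each pass reads the link target from the output of ls -ld: a target containing a slash replaces $THIS outright, while a bare file name is resolved relative to the directory of $THIS.
In my case the script is not a symbolic link, so the loop body never runs.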
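To watch the loop actually do something, it can be run against a throwaway chain of links. This is only a sketch: resolve_link is a hypothetical wrapper around the loop, and the directory layout is created in a temporary directory purely for the demonstration.

```shell
# Hypothetical wrapper around the link-resolution loop from bin/nutch.
resolve_link() {
  THIS="$1"
  while [ -h "$THIS" ]; do
    ls=`ls -ld "$THIS"`
    link=`expr "$ls" : '.*-> \(.*\)$'`
    if expr "$link" : '.*/.*' > /dev/null; then
      THIS="$link"                      # target contains a slash
    else
      THIS=`dirname "$THIS"`/"$link"    # bare name: same directory as the link
    fi
  done
  echo "$THIS"
}

demo=$(mktemp -d)
mkdir "$demo/bin"
touch "$demo/bin/nutch"
ln -s "$demo/bin/nutch" "$demo/abs-link"   # absolute target
ln -s abs-link "$demo/rel-link"            # relative target, no slash

# Follows rel-link -> abs-link -> bin/nutch and prints the final path.
resolve_link "$demo/rel-link"
rm -rf "$demo"
```

The loop parses the output of ls -ld, so it relies on the " -> " marker that ls prints for symbolic links.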
~~~~~~~~~~~~
Next are lines 46-73:
# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: nutch COMMAND"
  echo "where COMMAND is one of:"
# echo " crawl one-step crawler for intranets"
  echo " inject         inject new urls into the database"
  echo " hostinject     creates or updates an existing host table from a text file"
  echo " generate       generate new batches to fetch from crawl db"
  echo " fetch          fetch URLs marked during generate"
  echo " parse          parse URLs marked during fetch"
  echo " updatedb       update web table after parsing"
  echo " updatehostdb   update host table after parsing"
  echo " readdb         read/dump records from page database"
  echo " readhostdb     display entries from the hostDB"
  echo " elasticindex   run the elasticsearch indexer"
  echo " solrindex      run the solr indexer on parsed batches"
  echo " solrdedup      remove duplicates from solr"
  echo " parsechecker   check the parser for a given url"
  echo " indexchecker   check the indexing filters for a given url"
  echo " plugin         load a plugin and run one of its classes main()"
  echo " nutchserver    run a (local) Nutch server on a user defined port"
  echo " junit          runs the given JUnit test"
  echo " or"
  echo " CLASSNAME      run the class named CLASSNAME"
  exit 1
fi
This one is simple: $# is the number of arguments, so if the script is run with no arguments at all, it prints the usage text and exits.
~~~~~~~~~~~~~~~~~~
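Next are lines 74-77:
# get arguments
COMMAND=$1
shift
What does this do?
It first stores argument 1 in COMMAND, then shifts the arguments left by one position, so that argument 2 becomes argument 1, argument 3 becomes argument 2, and so on.
Note: $0 is not affected by shift.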
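The effect of shift is easy to check in isolation. In this minimal sketch, show_dispatch is a hypothetical stand-in for the top of the script and the arguments are arbitrary sample values.

```shell
# Hypothetical stand-in for the "get arguments" step of bin/nutch.
show_dispatch() {
  COMMAND=$1
  shift
  # After the shift, $1 is what used to be $2 and $# is one smaller.
  echo "COMMAND=$COMMAND remaining=$# first=${1:-none}"
}

show_dispatch inject urls extra-arg   # prints: COMMAND=inject remaining=2 first=urls
show_dispatch readdb                  # prints: COMMAND=readdb remaining=0 first=none
```

Whatever is left in "$@" after the shift is exactly what the script later hands to the Java class.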
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
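Next are lines 78-81:
# some directories
THIS_DIR=`dirname "$THIS"`
NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`
This part is straightforward: THIS_DIR is the directory that contains the script, and NUTCH_HOME is its parent directory as an absolute path. In my environment the two values are:
THIS_DIR=./bin
NUTCH_HOME=/usr/local/nutch-2.2.1/runtime/local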
~~~~~~~~~~~~~~~~~~~~
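Next are lines 82-93:
# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  #echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
The first test checks whether NUTCH_JAVA_HOME is non-empty and, if so, uses it to override JAVA_HOME. I have not set this environment variable on my Linux machine,
so NUTCH_JAVA_HOME stays empty and JAVA_HOME is left alone.
The second test checks that JAVA_HOME is set; if it is not, the script prints an error and exits. Nothing surprising here.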
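The precedence between the two variables can be sketched as a small function. pick_java_home is a hypothetical name, /opt/other-jdk is a made-up path, and the function returns 1 where the real script calls exit 1.

```shell
# Hypothetical function mirroring the JAVA_HOME logic of bin/nutch.
# $1 plays the role of NUTCH_JAVA_HOME, $2 the role of JAVA_HOME.
pick_java_home() {
  nutch_java_home=$1
  java_home=$2
  if [ "$nutch_java_home" != "" ]; then
    java_home=$nutch_java_home          # the Nutch-specific value wins
  fi
  if [ "$java_home" = "" ]; then
    echo "Error: JAVA_HOME is not set." >&2
    return 1
  fi
  echo "$java_home"
}

pick_java_home "" /usr/lib/jvm/jdk1.7.0_21               # falls back to JAVA_HOME
pick_java_home /opt/other-jdk /usr/lib/jvm/jdk1.7.0_21   # override is used
pick_java_home "" "" || echo "failed with status $?"
```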
~~~~~~~~~~~~~~~~~~~~~~
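Next are lines 94-103:
# NUTCH_JOB
if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
    local=false
  for f in $NUTCH_HOME/*nutch*.job; do
    NUTCH_JOB=$f;
  done
else
  local=true
fi
Let's look at what this block means.
It decides between deploy mode and local mode by checking whether a *nutch*.job file exists directly under NUTCH_HOME: if one does, local is set to false and NUTCH_JOB to that file; otherwise local is set to true.
I am running from the local directory, where no such file is found, so local is true.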
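One thing worth knowing about the test above: the glob is expanded before [ -f ... ] sees it, so it only behaves as intended when the pattern matches at most one file. The sketch below is a variation, not the script's own code, that loops over the matches instead; detect_mode is a hypothetical name and the directories are created only for the demonstration.

```shell
# Hypothetical helper: prints "deploy <job file>" or "local" for a directory.
detect_mode() {
  job=""
  for f in "$1"/*nutch*.job; do
    # With no match the pattern stays literal and fails the -f test.
    if [ -f "$f" ]; then
      job=$f                  # like the original loop, the last match wins
    fi
  done
  if [ -n "$job" ]; then
    echo "deploy $job"
  else
    echo "local"
  fi
}

demo=$(mktemp -d)
mkdir "$demo/local" "$demo/deploy"
touch "$demo/deploy/apache-nutch-2.2.1.job"   # sample job file name
detect_mode "$demo/local"                     # prints: local
detect_mode "$demo/deploy"                    # prints: deploy <path to the .job file>
rm -rf "$demo"
```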
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Next are lines 104-108:
# cygwin path translation
if $cygwin; then
  NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`
fi
This only applies under Cygwin, so it can be ignored here.
~~~~~~~~~~~~~~~~~~
Next are lines 109-111:
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
In my environment the two values are:
JAVA=/usr/lib/jvm/jdk1.7.0_21/bin/java
JAVA_HEAP_MAX=-Xmx1000m
~~~~~~~~
Next are lines 112-118:
# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi
NUTCH_HEAPSIZE is not set in my environment, so this block has no effect and JAVA_HEAP_MAX keeps its default of -Xmx1000m.
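For completeness, here is what the override produces when the variable is set. This is a minimal sketch; heap_flag is a hypothetical function and 2048 is an arbitrary sample value in MB.

```shell
# Hypothetical function: $1 plays the role of NUTCH_HEAPSIZE.
heap_flag() {
  JAVA_HEAP_MAX=-Xmx1000m                 # the script's default
  if [ "$1" != "" ]; then
    JAVA_HEAP_MAX="-Xmx""$1""m"           # same concatenation as the script
  fi
  echo "$JAVA_HEAP_MAX"
}

heap_flag ""      # prints: -Xmx1000m
heap_flag 2048    # prints: -Xmx2048m
```

In practice this means something like NUTCH_HEAPSIZE=2048 ./bin/nutch COMMAND raises the heap limit for a single run.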
~~~~~~~~
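Next are lines 119-125:
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# so that filenames w/ spaces are handled correctly in loops below
IFS=
This one is simple: CLASSPATH starts with the configuration directory, followed by the JDK's tools.jar. Setting IFS to the empty string turns off word splitting, so that paths containing spaces survive the loops that follow.
In my environment CLASSPATH comes out as:
/usr/local/nutch-2.2.1/runtime/local/conf:/usr/lib/jvm/jdk1.7.0_21/lib/tools.jar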
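The ${NUTCH_CONF_DIR:=...} form does two things at once: it yields the default when the variable is unset or empty, and it also assigns that default back to the variable. A small sketch of that behavior, where show_conf_default is a hypothetical function and /etc/nutch-conf is a made-up directory:

```shell
# Hypothetical function run in a subshell so no variables leak out.
# $1 plays the role of NUTCH_HOME, $2 the role of NUTCH_CONF_DIR.
show_conf_default() (
  NUTCH_HOME=$1
  NUTCH_CONF_DIR=$2
  CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
  echo "CLASSPATH=$CLASSPATH NUTCH_CONF_DIR=$NUTCH_CONF_DIR"
)

# Empty NUTCH_CONF_DIR: both variables end up as $NUTCH_HOME/conf.
show_conf_default /usr/local/nutch-2.2.1/runtime/local ""
# Non-empty NUTCH_CONF_DIR: it is used as is.
show_conf_default /usr/local/nutch-2.2.1/runtime/local /etc/nutch-conf
```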
~~~~~~~~~~~
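Next are lines 126-142:
# add libs to CLASSPATH
if $local; then
  for f in $NUTCH_HOME/lib/*.jar; do
    CLASSPATH=${CLASSPATH}:$f;
  done
  # local runtime
  # add plugins to classpath
  if [ -d "$NUTCH_HOME/plugins" ]; then
     CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
  fi
fi
# cygwin path translation
if $cygwin; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi
In local mode this appends every jar under $NUTCH_HOME/lib to CLASSPATH, one at a time, and, if a plugins directory exists, puts NUTCH_HOME itself at the front.
As a result, CLASSPATH ends up listing a large number of jars.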
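The loop can be exercised against a scratch directory, including a path with a space in it to show why IFS was emptied beforehand. This is a sketch under assumptions: build_classpath is a hypothetical function, tools.jar is left out for brevity, and the directory and jar names are invented.

```shell
# Hypothetical function reproducing the local-mode classpath loop.
# Runs in a subshell so the IFS change does not leak out.
build_classpath() (
  NUTCH_HOME=$1
  CLASSPATH=$NUTCH_HOME/conf
  IFS=                                  # no word splitting below
  for f in $NUTCH_HOME/lib/*.jar; do    # unquoted on purpose, as in the script
    CLASSPATH=${CLASSPATH}:$f;
  done
  if [ -d "$NUTCH_HOME/plugins" ]; then
    CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
  fi
  echo "$CLASSPATH"
)

demo=$(mktemp -d)
home="$demo/nutch home"                 # note the space in the path
mkdir -p "$home/lib" "$home/plugins"
touch "$home/lib/a.jar" "$home/lib/b.jar"
build_classpath "$home"   # home first, then conf, then a.jar and b.jar
rm -rf "$demo"
```

With the default IFS, the unquoted $NUTCH_HOME would be split at the space and the glob would no longer match the jars.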
~~~~~~~~~~~~~~~
Next are lines 143-164:
# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
  if [ -d "${NUTCH_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi
if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
  JAVA_LIBRARY_PATH=`cygpath -p -w "$JAVA_LIBRARY_PATH"`
fi
# restore ordinary behaviour
unset IFS
If $NUTCH_HOME/lib/native exists, the script asks Hadoop's PlatformName class for the platform string and points JAVA_LIBRARY_PATH at the matching subdirectory; finally it restores the default IFS. The easiest way to follow the details is to add a few echo statements and print the variables yourself.
~~~~~~~~~~~~~~~
Lines 165-184:
# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
  NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
  NUTCH_LOGFILE='hadoop.log'
fi
#Fix log path under cygwin
if $cygwin; then
  NUTCH_LOG_DIR=`cygpath -p -w "$NUTCH_LOG_DIR"`
fi
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi
Printing the variables in my environment gives:
NUTCH_LOGFILE=hadoop.log
NUTCH_OPTS=-Dhadoop.log.dir=/usr/local/nutch-2.2.1/runtime/local/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/usr/local/nutch-2.2.1/runtime/local/lib/native/Linux-amd64-64
~~~~~~~~~~~
Lines 185-227:
# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "nutchserver" ] ; then
CLASS=org.apache.nutch.api.NutchServer
elif [ "$COMMAND" = "junit" ] ; then
CLASSPATH=$CLASSPATH:$NUTCH_HOME/test/classes/
CLASS=junit.textui.TestRunner
else
CLASS=$COMMAND
fi
This simply selects the class that corresponds to the command you typed; anything that is not a known command is treated as a class name.
For example, crawl corresponds to:
org.apache.nutch.crawl.Crawler
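The same command-to-class lookup can be written more compactly with a case statement. The sketch below is an alternative formulation, not what bin/nutch does, and covers only a handful of the commands; class_for is a hypothetical name and com.example.MyTool is a made-up class used to show the fallback.

```shell
# Hypothetical lookup using case instead of the script's if/elif chain.
class_for() {
  case "$1" in
    inject)   echo org.apache.nutch.crawl.InjectorJob ;;
    generate) echo org.apache.nutch.crawl.GeneratorJob ;;
    fetch)    echo org.apache.nutch.fetcher.FetcherJob ;;
    parse)    echo org.apache.nutch.parse.ParserJob ;;
    updatedb) echo org.apache.nutch.crawl.DbUpdaterJob ;;
    *)        echo "$1" ;;    # unknown command: treat it as a class name
  esac
}

class_for fetch                # prints: org.apache.nutch.fetcher.FetcherJob
class_for com.example.MyTool   # prints: com.example.MyTool
```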
~~~~~~~~~~~
Lines 228-244:
if $local; then
 # fix for the external Xerces lib issue with SAXParserFactory
 NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl $NUTCH_OPTS"
 EXEC_CALL="$JAVA $JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"
else
 # check that hadoop can be found on the path
 if [ $(which hadoop | wc -l ) -eq 0 ]; then
    echo "Can't find Hadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."
    exit -1;
 fi
 # distributed mode
 EXEC_CALL="hadoop jar $NUTCH_JOB"
fi
Together with the last line of the script,
exec $EXEC_CALL $CLASS "$@"
this amounts to running the following in my environment (local mode, with crawl as the command):
=========================================================
/usr/lib/jvm/jdk1.7.0_21/bin/java
-Xmx1000m
-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
-Dhadoop.log.dir=/usr/local/nutch-2.2.1/runtime/local/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/usr/local/nutch-2.2.1/runtime/local/lib/native/Linux-amd64-64
-classpath
/usr/local/nutch-2.2.1/runtime/local:
/usr/local/nutch-2.2.1/runtime/local/conf:
/usr/lib/jvm/jdk1.7.0_21/lib/tools.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/activation-1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/aopalliance-1.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/apache-nutch-2.2.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/asm-3.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/avro-1.3.3.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-beanutils-1.7.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-beanutils-core-1.8.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-cli-1.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-codec-1.4.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-collections-3.2.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-configuration-1.6.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-digester-1.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-el-1.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-httpclient-3.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-io-2.4.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-lang-2.6.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-logging-1.1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-math-2.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/commons-net-1.4.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/crawler-commons-0.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-api-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-common-utilities-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-rt-bindings-xml-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-rt-core-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-rt-frontend-jaxrs-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-rt-transports-common-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/cxf-rt-transports-http-2.5.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/elasticsearch-0.19.4.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/geronimo-javamail_1.4_spec-1.7.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/geronimo-stax-api_1.0_spec-1.0.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/gora-core-0.3.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/guava-11.0.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/hadoop-core-1.2.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/hamcrest-core-1.3.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/hsqldb-2.2.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/httpclient-4.1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/httpcore-4.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/icu4j-4.0.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jackson-core-asl-1.8.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jackson-jaxrs-1.7.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jackson-mapper-asl-1.8.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jackson-xc-1.7.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jaxb-api-2.2.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jaxb-impl-2.2.3-1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jdom-1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jersey-core-1.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jersey-json-1.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jersey-server-1.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jettison-1.3.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jetty-6.1.26.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jetty-client-6.1.26.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jetty-sslengine-6.1.26.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jetty-util5-6.1.26.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jetty-util-6.1.26.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jline-0.9.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jsr305-1.3.9.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/jsr311-api-1.1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/junit-4.11.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/juniversalchardet-1.0.3.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/log4j-1.2.16.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/lucene-analyzers-3.6.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/lucene-core-3.6.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/lucene-highlighter-3.6.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/lucene-memory-3.6.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/lucene-queries-3.6.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/neethi-3.0.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/org.osgi.core-4.0.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/org.restlet-2.0.5.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/org.restlet.ext.jackson-2.0.5.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/oro-2.0.8.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/paranamer-2.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/paranamer-ant-2.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/paranamer-generator-2.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/qdox-1.10.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/serializer-2.7.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/servlet-api-2.5-20081211.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/servlet-api-2.5-6.1.14.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/slf4j-api-1.6.6.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/slf4j-log4j12-1.6.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/solr-solrj-3.4.0.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-aop-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-asm-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-beans-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-context-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-core-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-expression-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/spring-web-3.0.6.RELEASE.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/stax2-api-3.1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/stax-api-1.0.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/stax-api-1.0-2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/tika-core-1.3.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/woodstox-core-asl-4.1.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/wsdl4j-1.6.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/wstx-asl-3.2.7.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/xercesImpl-2.9.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/xml-apis-1.3.04.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/xmlenc-0.52.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/xmlParserAPIs-2.6.2.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/xmlschema-core-2.0.1.jar:
/usr/local/nutch-2.2.1/runtime/local/lib/zookeeper-3.3.1.jar
org.apache.nutch.crawl.Crawler
followed by whatever arguments remain on the command line.
=========================================================
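One detail of the last line is easy to miss: $EXEC_CALL is expanded unquoted, so the shell splits it back into separate words (which is why the script runs unset IFS before this point), while "$@" passes the user's arguments through unchanged. A minimal sketch, in which echo stands in for the java binary, demo.jar is a made-up classpath entry, and launch is a hypothetical function:

```shell
# Hypothetical stand-in for the final lines of bin/nutch.
launch() {
  EXEC_CALL="echo java -Xmx1000m -classpath demo.jar"
  CLASS=org.apache.nutch.crawl.InjectorJob
  # exec replaces the current process, so run it in a subshell here
  # to keep the calling shell alive.
  ( exec $EXEC_CALL $CLASS "$@" )
}

# Prints the command line that would be executed:
# java -Xmx1000m -classpath demo.jar org.apache.nutch.crawl.InjectorJob urls
launch urls
```

In the real script there is no subshell: exec turns the shell process itself into the JVM (or the hadoop command), so nothing after that line would ever run.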
That completes the analysis.