Below is a reposted script for running a Nutch recrawl (it could still be optimized further, e.g. the argument handling and the backup steps). Let's compare it with an ordinary crawl to see where they differ.
```bash
#!/bin/bash
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=5
adddays=5
topN=15 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
${CATALINA_HOME}/bin/shutdown.sh

if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi

mv $MVARGS crawl/NEWindex crawl/index

${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""
```
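A minimal usage sketch, assuming the script above is saved as `bin/runbot` inside a Nutch installation, a seed list exists under `urls/`, and Tomcat serves the search webapp (the paths here are assumptions, not part of the original post):

```bash
# Assumed install locations -- adjust to your environment.
export NUTCH_HOME=/opt/nutch
export CATALINA_HOME=/opt/apache-tomcat-6.0.10
cd "$NUTCH_HOME"
chmod +x bin/runbot

# Normal run: temporary directories are deleted as the crawl proceeds.
bin/runbot

# 'safe' run: instead of deleting, old data is moved to
# crawl/BACKUPsegments, crawl/BACKUPindexes and crawl/BACKUPindex,
# so a failed crawl can be analyzed and recovered.
bin/runbot safe
```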
What the recrawl changes:
* The `-adddays` argument is passed to generate. Because a URL's fetch time is updated at fetch time to the time of its next scheduled fetch, `-adddays` tells generate to pretend the current time is `now + adddays` days when deciding which URLs are due: any URL whose next fetch time falls within that window is selected for fetching. A negative value therefore shrinks the selection compared to a positive one. In short, URLs about to expire within the next `adddays` days get recrawled (a sketch follows this list).
* An extra mergesegs job. It replaces the old copy of each page with the newly fetched version, and merges the freshly generated segments into a single segment (or several segments of a specified size). In my view this step is optional, and it should be kept or dropped together with the merge job discussed in the NOTE below; both commands are sketched after the NOTE.
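To make the `-adddays` point concrete, here is a hedged sketch of calling generate by hand. The paths and `-topN` value are assumptions, and whether negative values behave exactly as described may depend on the Nutch version:

```bash
# Treat "now" as 5 days in the future: URLs whose next scheduled
# fetch time falls within the coming 5 days are considered due.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 5

# adddays=0: only URLs that are already due right now are selected.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 0

# A negative value moves the reference time backwards, shrinking the
# selection relative to a positive value (per the note above).
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays -5
```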
NOTE:
merge job (index merging): for a whole-web crawl this step should not be executed, since merging everything into a single index defeats the purpose of distributed search.
The script is already being revised to add validation; what it still lacks is deletion. (My own reading: when an expired URL is re-fetched and turns out to be unavailable, it should be purged from the db.)
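For reference, a sketch of the two merge-related commands as they appear in Nutch 1.x-era CLIs; the `-slice` value is an assumption, and the exact options can vary between versions:

```bash
# mergesegs: collapse the per-depth segments into one merged segment,
# keeping only the newest version of each page. An optional -slice
# splits the output into several segments of roughly NNNN URLs each.
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* -slice 50000

# merge: combine several index parts into a single index. For a
# whole-web crawl you would normally skip this and search the
# per-node indexes separately (distributed search).
bin/nutch merge crawl/NEWindex crawl/NEWindexes
```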
* Even though this is a recrawl, the index step is not a true incremental index: it is regenerated from scratch out of three inputs, the previous (merged) segments plus the new linkdb and crawldb. Accordingly, the index output directory (NEWindexes) is not the old index folder either, but a brand-new index (not yet deduplicated, of course), as the sketch below makes concrete.
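These are the relevant steps restated from the script above (not a separate API): crawl/NEWindexes starts empty on every run and is rebuilt from the three inputs.

```bash
# Rebuild the index from the merged segments plus the updated
# crawldb and linkdb; nothing is appended to the old crawl/index.
bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

# Duplicates (e.g. the same URL present in both old and new segments)
# are only removed afterwards, by the dedup step.
bin/nutch dedup crawl/NEWindexes
```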
For comparison, see: linux下nutch的增量抓去(recrawl) -- 注意事项1 (Nutch incremental crawling (recrawl) on Linux -- notes, part 1).