A recrawl (incremental crawl) script for Nutch 1.4

Preface

First, a piece of recent news: Nutch 1.5 has been released! It goes straight to Tika 1.1 and Hadoop 1.0, so there is plenty more to play with.
That said, I just took a look, and even at 1.5 Nutch still does not ship an incremental crawl script by default. The official Nutch wiki has a recrawl script written by Susam Pal ( http://wiki.apache.org/nutch/Crawl ), but it cannot be used as-is, because:
  • it only works when Nutch is run in local mode
  • its search is Lucene-based, so it does not work if you index with Solr
  • even with Lucene indexing, it only works on Nutch 1.2 and earlier; the command arguments changed in Nutch 1.3+
I had previously used the official recrawl script for incremental crawling on Nutch 1.2. Lately I have been working with Nutch 1.4, so based on that script I wrote one for Nutch 1.4 + Solr 3.4. It currently runs successfully on a Hadoop cluster with 1 master and 9 slaves, and I am recording it here for future reference. I googled all over for such a script before and never found one, so I am sharing it now; comments and suggestions for improvement are very welcome ^_^

Main content

Compared with Nutch 1.2 and earlier, quite a few command arguments have changed in Nutch 1.3+; for example, the fetch command no longer accepts the -noParsing option.
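To illustrate the change (the segment path below is just a placeholder):

# Nutch 1.2 and earlier: fetch could be told to skip parsing
bin/nutch fetch crawl/segments/20120601000000 -threads 64 -noParsing

# Nutch 1.3+: -noParsing is gone; fetching and parsing are separate steps
bin/nutch fetch crawl/segments/20120601000000 -threads 64
bin/nutch parse crawl/segments/20120601000000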
First, the functional requirements for this script:
  • support Nutch 1.4 running on a Hadoop cluster, with Solr as the indexing backend
  • minimize the amount of data that gets re-indexed
  • keep the complete crawled data on HDFS
  • provide some degree of fault tolerance
The script is as follows (my shell scripting is quite weak, so it is ugly):

#!/usr/bin/env bash
# recrawl script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/recrawl DATADIR URLDIR SOLRSERVERADDRESS
# Author: Joey (based on Susam Pal's work)


if [ $# != 3 ]
then
  echo "Usage: recrawl DATADIR URLDIR SOLRSERVERADDRESS"
  echo "where "
  echo "		DATADIR is the parent directory where crawled data will be stored "
  echo "		URLDIR is the parent directory of the injected url files "
  echo "		SOLRSERVERADDRESS is the address of solr index server "
  echo "eg: recrawl hdfs://localhost:9000/user/root/crawleddatadir \
  hdfs://localhost:9000/user/root/crawledurldir http://localhost:8983/solr"
  exit 1
fi

# Crawl options
depth=3
threads=64
topN=128


# Temp segments dir in Hadoop DFS
TEMPSEGMENTSDIR="tmpsegmentsdir"

# Hadoop FsShell commands for ls/rm/cp/mv
LSCOMMAND="hadoop fs -ls"
RMCOMMAND="hadoop fs -rmr"
CPCOMMAND="hadoop fs -cp"
MVCOMMAND="hadoop fs -mv"


if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo recrawl: $0 could not find environment variable NUTCH_HOME
  echo recrawl: NUTCH_HOME=$NUTCH_HOME has been set by the script   
else
  echo recrawl: $0 found environment variable NUTCH_HOME=$NUTCH_HOME   
fi


# Infinite outer loop; each iteration is one full recrawl round
for((j=0;j>-1;j++))
do

  # ----- See if it should go on to crawl -----
  # The switch file holds "on" or "off"; "off" stops the recrawl loop
  switch=`head -1 $NUTCH_HOME/bin/recrawlswitcher | awk '{print $1}'`
  if [ "$switch" == "off" ]
  then
    echo "--- Shutting down the recrawl because the recrawl switcher is off ---"
    break
  fi

  echo "--- Beginning at count `expr $j + 1` ---"
  steps=6
  echo "----- Inject (Step 1 of $steps) -----"
  $NUTCH_HOME/bin/nutch inject $1/crawldb $2

  echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
  for((i=0;i<$depth;i++))
  do
    echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
    $NUTCH_HOME/bin/nutch generate $1/crawldb $1/segments -topN $topN
    if [ $? -ne 0 ]
    then
      echo "recrawl: Stopping at depth $depth. No more URLs to fetch."
      break
    fi
    segment=`$LSCOMMAND $1/segments/ | tail -1 | awk '{print $8}'`

    echo "--- fetch into segment:"$segment" ---"
    $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
    if [ $? -ne 0 ]
    then
      echo "recrawl: fetch $segment at depth `expr $i + 1` failed."
      echo "recrawl: Deleting segment $segment."
      $RMCOMMAND $segment
      continue
    fi
    
    echo "--- Beginning parsing ---"
    $NUTCH_HOME/bin/nutch parse $segment
    
    echo "--- Beginning updateddb ---"
    $NUTCH_HOME/bin/nutch updatedb $1/crawldb $segment
    
    echo "--- Beginning copy segment dir to temp segments dir ---"
    $CPCOMMAND $segment $TEMPSEGMENTSDIR/
    
  done

  echo "--- Merge Segments (Step 3 of $steps) ---"
  $NUTCH_HOME/bin/nutch mergesegs $1/MERGEDsegments -dir $1/segments
  
  echo "--- Backup old segments dir, delete backups when this-count crawl finished ---"
  $MVCOMMAND $1/segments $1/segmentsBakcup
  if [ $? -ne 0 ]
  then
    echo "recrawl: Failed to backup current segments, so exit in case of data loss"
    exit 1
  fi
  
  echo "--- Move the MERGEDsegments to segments dir ---"
  $MVCOMMAND $1/MERGEDsegments $1/segments
  if [ $? -ne 0 ]
  then
    echo "recrawl: Failed to move MERGEDsegments to segments, so exit in case of data loss "
    exit 1
  fi
  
  echo "----- Invert Links (Step 4 of $steps) -----"
  $NUTCH_HOME/bin/nutch invertlinks $1/linkdb -dir $1/segments

  echo "----- Index (Step 5 of $steps) -----"
  $NUTCH_HOME/bin/nutch solrindex $3 $1/crawldb -linkdb $1/linkdb -dir $TEMPSEGMENTSDIR

  echo "----- Delete Duplicates (Step 6 of $steps) -----"
  $NUTCH_HOME/bin/nutch solrdedup $3
 
  echo "--- The main recrawl process is done, now gona deelete the temp segments dir ---"
  $RMCOMMAND $TEMPSEGMENTSDIR/*
  
  echo "--- Delete the temp old segments backups ---"
  $RMCOMMAND $1/segmentsBakcup

  echo "recrawl: FINISHED: Crawl `expr $i + 1`-th completed!"

done
echo "All FINISHED with `expr $i + 1` count..."




This script has been tested for a few days and runs fairly reliably, but there is still room for improvement, for example:
  • currently only the generate and fetch stages check the return value; the next step is to check the return value of every Nutch stage and handle errors accordingly
  • currently, when an error occurs the script only exits while making sure the data is not corrupted; it should also use sendmail or a similar tool to send an alert email to the administrator (see the sketch after this list)
  • the runtime parameters could be organized more flexibly: right now depth, topN and threads are defined inside the script, while datadir, urldir and solraddr are passed as arguments, which is a bit messy; they could all just be defined in the script
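
As a starting point for the second item, here is a rough sketch of an alert helper, assuming a mail/mailx command is available on the node running the script; the address is a placeholder:

ADMIN_EMAIL="admin@example.com"   # placeholder address

# print the error, mail it to the admin, then exit
# usage: fail_and_alert "error message"
fail_and_alert()
{
  echo "recrawl: $1"
  echo "$1" | mail -s "recrawl error on `hostname`" $ADMIN_EMAIL
  exit 1
}

# example: the plain exit after a failed segments backup could become
#   $MVCOMMAND $1/segments $1/segmentsBackup || fail_and_alert "Failed to back up segments"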

