First, how do we know that crawl is simply an integration of inject, generate, fetch, parse, and update (the exact meaning and function of each command will be covered in later posts)? Open NUTCH_HOME/runtime/local/bin/crawl.
The main code is pasted below:
# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
  ...
  echo "Generating a new fetchlist"
  generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
  $bin/nutch generate "${generate_args[@]}"
  ...
  echo "Fetching : "
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
  ...
  __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
  ...
  __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
  ...
  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
  ...
  echo "SOLR dedup -> $SOLRURL"
  __bin_nutch solrdedup $commonOptions $SOLRURL
done
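This is why a single invocation of the script drives the whole pipeline. Assuming the usual Nutch 2.x arguments (crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>), a one-shot run would look like the sketch below, where the Solr URL is a placeholder:

./bin/crawl urls/ 6vhao http://localhost:8983/solr/ 2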
Next, let's run these steps manually.
We will stay in the runtime/local/ directory throughout.
1. inject
First, of course, the seed file has to be prepared: write the sites you want to crawl into the urls/url file. I'll use http://www.6vhao.com as the example.
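For example, the seed file can be created like this (a minimal sketch; urls/url is the path passed to inject below):

mkdir -p urls
echo "http://www.6vhao.com/" > urls/url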
During the crawl I don't want it to fetch anything outside 6vhao.com. This can be enforced in conf/regex-urlfilter.txt by replacing the default catch-all rule (+.) at the end of the file with a site-specific rule:
# accept anything else
+^http://www.6vhao.com/
Then run the inject step with the following command:
./bin/nutch inject urls/url -crawlId 6vhao
Using the list command in the hbase shell, we can see that a new table, 6vhao_webpage, has been created.
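A minimal check, assuming the HBase instance backing Nutch runs locally:

hbase shell
list    # the new table 6vhao_webpage should appear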
scan '6vhao_webpage' shows its contents:
ROW                    COLUMN+CELL
 com.6vhao.www:http/   column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
 com.6vhao.www:http/   column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
 com.6vhao.www:http/   column=mk:_injmrk_, timestamp=1446135434505, value=y
 com.6vhao.www:http/   column=mk:dist, timestamp=1446135434505, value=0
 com.6vhao.www:http/   column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
 com.6vhao.www:http/   column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
As you can see, one HBase row has been created, with cells in four column families (f, mk, mtdt, s); their exact meanings will be explained later.
2. generate
The options of ./bin/nutch generate are:
-topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
-noFilter      - do not activate the filter plugin to filter the url, default is true
-noNorm        - do not activate the normalizer plugin to normalize the url, default is true
-adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
-batchId       - the batch id
We specify -crawlId as 6vhao:
./bin/nutch generate -crawlId 6vhao
 com.6vhao.www:http/   column=f:bid, timestamp=1446135900858, value=1446135898-215760616
 com.6vhao.www:http/   column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
 com.6vhao.www:http/   column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
 com.6vhao.www:http/   column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
 com.6vhao.www:http/   column=mk:_injmrk_, timestamp=1446135900858, value=y
 com.6vhao.www:http/   column=mk:dist, timestamp=1446135900858, value=0
 com.6vhao.www:http/   column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
 com.6vhao.www:http/   column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
Comparing with the previous scan, two new columns have appeared: f:bid (the batch id assigned by the generator) and mk:_gnmrk_ (the generate marker).
3. fetch (start crawling)
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
    <batchId>     - crawl identifier returned by Generator, or -all for all generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -threads N    - number of fetching threads per task
    -resume       - resume interrupted job
    -numTasks N   - if N > 0 then use this many reduce tasks for fetching (default: mapred.map.tasks)
./bin/nutch fetch -all -crawlId 6vhao -threads 8
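Alternatively, instead of -all you can fetch a single batch by passing the batch id recorded in f:bid above (a sketch using the id from our generate round):

./bin/nutch fetch 1446135898-215760616 -crawlId 6vhao -threads 8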
There is now a lot of data; essentially the entire page content is stored. Check it in HBase yourself.
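Since a full scan now floods the terminal, it helps to limit it in the HBase shell, e.g.:

scan '6vhao_webpage', {LIMIT => 1}
scan '6vhao_webpage', {COLUMNS => 'f', LIMIT => 1}    # only the fetch (f) column family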
4. parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed

./bin/nutch parse -crawlId 6vhao -all
The parse results can be inspected in HBase.
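For example (assuming the default gora-hbase mapping, where parsed fields such as the page title end up in the p column family):

scan '6vhao_webpage', {COLUMNS => 'p', LIMIT => 1}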
5. updatedb
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
    <batchId>     - crawl identifier returned by Generator, or -all for all generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
./bin/nutch updatedb -all -crawlId 6vhao
The results can be viewed in HBase.
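For a quick statistical overview of the web table there is also readdb (a sketch; in Nutch 2.x this runs WebTableReader, which supports a -stats switch):

./bin/nutch readdb -crawlId 6vhao -stats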
6. Repeat steps 2-5: each round goes one level deeper, so one more round crawls the site to a depth of 2 (see the sketch below).
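A minimal sketch of one more round, mirroring the main loop of bin/crawl (Solr indexing omitted):

./bin/nutch generate -crawlId 6vhao
./bin/nutch fetch -all -crawlId 6vhao -threads 8
./bin/nutch parse -all -crawlId 6vhao
./bin/nutch updatedb -all -crawlId 6vhao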
solrindex will be covered in the next post....