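Let's analyze what happens at each stage, using the results of the earlier crawl. Blue marks fields that are unchanged but worth noting; red marks fields that changed between stages.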
injector: only two seed URLs (the csdn data is not listed here)
http://www.163.com/ Version: 7 #7 is the current Nutch CrawlDatum version
Status: 1 (db_unfetched ) #see CrawlDatum.STATUS_DB_UNFETCHED
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0 #seed URLs start at 1.0
Signature: null #MD5 digest of the page; null because the page has not been fetched yet
Metadata:
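To compare records across stages it helps to load them into a structure first. Below is a minimal sketch that parses one dumped record into a dict. It assumes the layout shown above (URL and `Version:` on the first line, then one `Key: value` pair per line); `parse_record` is a hypothetical helper, not part of Nutch.

```python
def parse_record(text):
    """Parse one dumped CrawlDatum record into a dict of field name -> string value."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # First line looks like "<url> Version: <n>"; split on the "Version:" label
    # because the URL itself contains a colon.
    url, _, version = lines[0].partition("Version:")
    record = {"url": url.strip(), "Version": version.strip()}
    for line in lines[1:]:
        # Split on the first colon only, so time values such as 14:57:19 stay intact.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


sample = """http://www.163.com/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
"""

rec = parse_record(sample)
print(rec["url"], rec["Status"], rec["Fetch time"])
```

With records in this form, a field-by-field diff between the injector, generator, fetcher and updatedb dumps is a simple dict comparison.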
generator: likewise only two URLs
http://www.163.com/ Version: 7
Status: 1 (db_unfetched )
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata : _ngt_: 1309887693964
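The only change at this stage is the `_ngt_` entry in the metadata. Assuming it is the generate-time marker written by the generator as milliseconds since the Unix epoch (that interpretation is an assumption here, as is the helper name `ngt_to_datetime`), it can be turned into a readable time like this:

```python
from datetime import datetime, timedelta, timezone

# The dumps print times in CST (China Standard Time, UTC+8).
CST = timezone(timedelta(hours=8))


def ngt_to_datetime(millis):
    """Convert an epoch-milliseconds value to a timezone-aware datetime in CST."""
    return datetime.fromtimestamp(millis / 1000, tz=CST)


generated_at = ngt_to_datetime(1309887693964)
print(generated_at.strftime("%Y-%m-%d %H:%M:%S %z"))
```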
fetcher:
-------
crawl_fetch:
http://www.163.com/ Version: 7
Status: 33 (fetch_success )
Fetch time: Sat Jul 09 15:14:02 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata : _ngt_: 1309933252318_pst_: success(1), lastModified=0
crawl_parse:
http://www.163.com/ Version: 7
Status: 65 (signature )
Fetch time: Sat Jul 09 15:14:08 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature : 989844cdb45e225db2b2731315cb5342
Metadata :
//other cases
http://www.163.com/rss/ Version: 7
Status: 67 (linked )
Fetch time: Sat Jul 09 15:14:08 CST 2011 //for URLs not yet fetched, the time of parsing is recorded
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score : 0.01
Signature: null
Metadata:
-------
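The numeric status values in these dumps fall into distinct ranges depending on which stage wrote them. The sketch below lists only the codes that actually appear in this post and groups them by range; the range boundaries (up to 0x1f for crawldb statuses, up to 0x3f for fetch statuses, higher values for other markers) are stated as an assumption about `CrawlDatum`, and `status_group` is a hypothetical helper.

```python
# Status codes that appear in the dumps in this post.
SEEN_STATUSES = {
    1: "db_unfetched",
    2: "db_fetched",
    33: "fetch_success",
    65: "signature",
    67: "linked",
}


def status_group(code):
    """Classify a status code by numeric range (assumed boundaries, see lead-in)."""
    if code <= 0x1F:
        return "db"      # written to the crawldb
    if code <= 0x3F:
        return "fetch"   # written by the fetcher into crawl_fetch
    return "other"       # helper markers such as signature and linked


for code, name in sorted(SEEN_STATUSES.items()):
    print(f"{code:3d} (0x{code:02x}) {name:14s} {status_group(code)}")
```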
updatedb (crawldb; as can be seen, this file stores all URLs seen so far, i.e. the global link map):
http://www.163.com/ Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Aug 08 15:14:02 CST 2011 //updated to one month (30 days) after the fetch, so the URL is not due to be fetched again until then
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342 //same as in crawl_parse, i.e. unchanged; this is the MD5 of the whole HTML page
Metadata : _pst_: success(1), lastModified=0
//other cases look the same as in the injector stage, ready for the generator
http://www.163.com/rss/ Version: 7
Status: 1 (db_unfetched )
Fetch time: Tue Jul 12 23:49:27 CST 2011 //for URLs not yet fetched, this is set to the time of the update
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
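The signature above is described as the MD5 of the whole page. As a rough illustration of what such a digest looks like, here is a sketch using Python's `hashlib`; it assumes the digest is taken over the raw content bytes, and `md5_signature` is a hypothetical name, not a Nutch API.

```python
import hashlib


def md5_signature(content: bytes) -> str:
    """Return the MD5 digest of the page content as 32 hex characters."""
    return hashlib.md5(content).hexdigest()


page_v1 = b"<html><body>hello</body></html>"
page_v2 = b"<html><body>hello!</body></html>"

# Identical content gives an identical signature; any change gives a different one.
print(md5_signature(page_v1) == md5_signature(page_v1))
print(md5_signature(page_v1) == md5_signature(page_v2))
```

Because the digest depends on every byte, the matching values in crawl_parse and the crawldb show that the same page content was carried through unchanged.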
** For how already-fetched URLs are kept from being fetched again, see updatedb.
** Only the following modify the data under crawldb/current:
* injector
* generator, when the generate.update.crawldb parameter is true
* updatedb
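The effect of updatedb on already-fetched URLs can be checked with a little date arithmetic. This is a simplified sketch, not Nutch's actual scheduling code (which lives in its fetch schedule classes); it assumes the next fetch time is the last fetch time plus the retry interval, and that a URL is due only once that time has passed. `next_fetch_time` and `is_due` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def next_fetch_time(fetched_at, retry_interval_seconds):
    """Next fetch time = time of the last fetch + retry interval."""
    return fetched_at + timedelta(seconds=retry_interval_seconds)


def is_due(fetch_time, now):
    """A URL is due for fetching once its fetch time is not in the future."""
    return fetch_time <= now


# Values taken from the dumps: fetched Sat Jul 09 15:14:02 2011, interval 2592000 s.
fetched_at = datetime(2011, 7, 9, 15, 14, 2)
next_time = next_fetch_time(fetched_at, 2592000)
print(next_time)  # matches the crawldb fetch time, Mon Aug 08 15:14:02 2011

# At update time (Tue Jul 12 23:49:27 2011) the fetched page is not due,
# while the newly linked URL stamped with the update time is.
update_time = datetime(2011, 7, 12, 23, 49, 27)
print(is_due(next_time, update_time))
print(is_due(update_time, update_time))
```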