Crawl Workflow: Summary

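This summary uses the output of an earlier crawl to examine what happens at each stage. Blue marks fields that are unchanged but worth noting; red marks fields that changed from one stage to the next.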

 

 

injector: only two seed URLs (the csdn record is not listed here)

http://www.163.com/    Version: 7                # 7 is the record format version used by this Nutch release
Status: 1 (db_unfetched)                    # see CrawlDatum.STATUS_DB_UNFETCHED
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0    # seed URLs start with a score of 1.0
Signature: null        # MD5 digest of the page; null because the page has not been fetched yet
Metadata:
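
To compare records across stages it helps to turn a dumped record into key/value pairs. Below is a minimal sketch, assuming the text layout shown above (URL and version on the first line, then one `Key: value` pair per line). The function name `parse_record` is made up for this illustration and is not part of Nutch.

```python
def parse_record(text):
    """Parse one dumped record into a dict (illustrative helper, not a Nutch API)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    # First line looks like "<url>    Version: <n>"; split on the label, not on ":",
    # because the URL itself contains a colon.
    url, _, version = lines[0].partition("Version:")
    record = {"url": url.strip(), "Version": version.strip()}
    for line in lines[1:]:
        # Split on the first colon only, so time values keep their own colons.
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    return record


sample = """\
http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
"""

rec = parse_record(sample)
print(rec["url"], "|", rec["Status"], "|", rec["Score"])
```

With two such dicts in hand, a plain comparison of their values shows which fields a stage changed.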

 

generator: again only two URLs

http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309887693964
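
The only new field at this stage is the `_ngt_` metadata entry. As far as I understand it, this is the generate time stored as milliseconds since the Unix epoch; treat that reading as an assumption and check it against your Nutch version. Under that assumption, the value can be decoded like this (the helper name `ngt_to_datetime` is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# The dumps print times as CST, taken here to mean China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ngt_to_datetime(ngt_millis):
    """Convert an epoch-milliseconds value to a timezone-aware datetime in CST."""
    return (EPOCH + timedelta(milliseconds=ngt_millis)).astimezone(CST)


generated_at = ngt_to_datetime(1309887693964)
print(generated_at.strftime("%Y-%m-%d %H:%M:%S"))
```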

 

 

fetcher:

-------

crawl_fetch:

http://www.163.com/    Version: 7
Status: 33 (fetch_success)
Fetch time: Sat Jul 09 15:14:02 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309933252318 _pst_: success(1), lastModified=0

 

crawl_parse:

http://www.163.com/    Version: 7
Status: 65 (signature)
Fetch time: Sat Jul 09 15:14:08 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342
Metadata:

// other records

http://www.163.com/rss/    Version: 7
Status: 67 (linked)
Fetch time: Sat Jul 09 15:14:08 CST 2011    // for a URL that has not been fetched, the parse time is recorded here
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
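
The status codes seen so far (1, 2, 33, 65, 67) fall into numeric ranges that separate crawldb statuses from fetch statuses and from the segment-only markers. The sketch below reflects my understanding of the `CrawlDatum` constants (`STATUS_DB_MAX = 0x1f`, `STATUS_FETCH_MAX = 0x3f`); it covers only the codes that appear in this post, and the function `status_group` is a hypothetical helper, not a Nutch method.

```python
# Upper bounds of the two ranges, as I understand the CrawlDatum constants.
# Verify against the CrawlDatum source of your Nutch version.
STATUS_DB_MAX = 0x1F
STATUS_FETCH_MAX = 0x3F

# Only the codes that show up in the dumps of this post.
STATUS_NAMES = {
    1: "db_unfetched",
    2: "db_fetched",
    33: "fetch_success",
    65: "signature",
    67: "linked",
}


def status_group(status):
    """Classify a status code by numeric range (illustrative helper)."""
    if status <= STATUS_DB_MAX:
        return "db"
    if status <= STATUS_FETCH_MAX:
        return "fetch"
    return "other"


for code in sorted(STATUS_NAMES):
    print(code, hex(code), STATUS_NAMES[code], status_group(code))
```

Read this way, the dumps make sense: crawldb records carry "db" statuses, crawl_fetch carries a "fetch" status, and crawl_parse carries the remaining markers that updatedb later folds back into the crawldb.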

-------

 

updatedb (crawldb; as the dump shows, this store holds every URL seen so far, i.e. the global link map):

http://www.163.com/    Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Aug 08 15:14:02 CST 2011    // pushed forward by one retry interval (30 days), so the next round will not fetch this URL again
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342   // same as in crawl_parse, i.e. unchanged; this is the MD5 of the whole HTML page
Metadata: _pst_: success(1), lastModified=0
// other records look just as they do after the injector stage, ready for the generator
http://www.163.com/rss/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 12 23:49:27 CST 2011     // for an unfetched URL, the fetch time is set to the time of the update
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
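
The new fetch time of the fetched record can be checked by hand: it is the fetch time from crawl_fetch plus the retry interval. Here is a small sketch of that arithmetic, assuming the default schedule simply adds the interval (which is what the two dumps suggest). The format string and the helper name `next_fetch_time` are my own.

```python
from datetime import datetime, timedelta

# Layout of the timestamps printed in the dumps; "CST" is matched literally.
DUMP_FORMAT = "%a %b %d %H:%M:%S CST %Y"


def next_fetch_time(fetch_time_text, retry_interval_seconds):
    """Add the retry interval to a fetch time taken from a dump (illustrative)."""
    fetched = datetime.strptime(fetch_time_text, DUMP_FORMAT)
    return fetched + timedelta(seconds=retry_interval_seconds)


# Fetch time from crawl_fetch, retry interval of 2592000 seconds (30 days).
due = next_fetch_time("Sat Jul 09 15:14:02 CST 2011", 2592000)
print(due.strftime(DUMP_FORMAT))
```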

 

** For how already-fetched URLs are kept from being fetched again, see updatedb
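
The idea can be sketched as a simple due-date filter: a URL is eligible for the next round only once its fetch time has passed. This is a simplification under my own assumptions; the real generator also applies scoring, URL filters and per-host limits, and `due_urls` is a hypothetical name.

```python
from datetime import datetime


def due_urls(crawldb, now):
    """Return the URLs whose fetch time is not in the future (simplified sketch)."""
    return [url for url, fetch_time in crawldb.items() if fetch_time <= now]


# Fetch times taken from the crawldb dump after updatedb.
crawldb = {
    "http://www.163.com/": datetime(2011, 8, 8, 15, 14, 2),      # db_fetched
    "http://www.163.com/rss/": datetime(2011, 7, 12, 23, 49, 27),  # db_unfetched
}

print(due_urls(crawldb, now=datetime(2011, 7, 13, 0, 0, 0)))
```

Run shortly after the update, only the unfetched URL is due; the already-fetched page becomes due again only after its 30-day interval has elapsed.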

** Only the following modify the data under crawldb/current:

* injector

* generator, only when the generate.update.crawldb parameter is true

* updatedb

 

 
