Nutch crawl, step by step

What exactly happens at each step of a Nutch crawl.

==============Preparation======================
(On Windows, Cygwin is required.)
Check out the code from SVN;
cd into the crawler directory.

==============inject========================== 

$ bin/nutch inject crawl/crawldb urls 
Injector: starting 
Injector: crawlDb: crawl/crawldb 
Injector: urlDir: urls 
Injector: Converting injected urls to crawl db entries. 
Injector: Merging injected urls into crawl db. 
Injector: done 

The crawldb directory is created at this point.

Inspect its contents:
$ bin/nutch readdb crawl/crawldb -stats 
CrawlDb statistics start: crawl/crawldb 
Statistics for CrawlDb: crawl/crawldb 
TOTAL urls: 1 
retry 0: 1 
min score: 1.0 
avg score: 1.0 
max score: 1.0 
status 1 (db_unfetched): 1 
CrawlDb statistics: done 

===============generate========================= 

$ bin/nutch generate crawl/crawldb crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080112224520
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
$ s1=`ls -d crawl/segments/2* | tail -1`

The segments directory is created at this point, but the new segment contains only a crawl_generate subdirectory:
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20080112224520  1          ?              ?            ?        ?

The crawldb contents are unchanged at this point: still 1 unfetched URL.
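
The `ls -d crawl/segments/2* | tail -1` idiom used above works because segment directories are named by timestamp, so the last name in sorted order is the newest segment. A minimal sketch that wraps the idiom in a helper; the function name `latest_segment` is hypothetical and not part of Nutch:

```shell
# latest_segment DIR: print the newest segment directory under DIR.
# Assumption: segments are named by timestamp (e.g. 20080112224520),
# so sorting by name also sorts by creation time.
latest_segment() {
    segments_dir=$1
    # List entries whose names start with a digit, sort by name, keep the last.
    # Errors (e.g. no segments yet) are silenced, giving empty output.
    ls -d "$segments_dir"/[0-9]* 2>/dev/null | sort | tail -1
}
```

Usage would look like `s1=$(latest_segment crawl/segments)`. An empty result means no segment has been generated yet.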

=================fetch============================== 

$ bin/nutch fetch $s1
Fetcher: starting 
Fetcher: segment: crawl/segments/20080112224520 
Fetcher: threads: 10 
fetching http://www.complaints.com/directory/directory.htm 
Fetcher: done 

The segment now has several additional subdirectories.
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20080112224520  1          2008-01-12T22:52:00  2008-01-12T22:52:00  1        1

The crawldb contents are still unchanged at this point: still 1 unfetched URL.

================updatedb============================= 
$ bin/nutch updatedb crawl/crawldb $s1 
CrawlDb update: starting 
CrawlDb update: db: crawl/crawldb 
CrawlDb update: segments: [crawl/segments/20080112224520] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: false 
CrawlDb update: URL filtering: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: done 

Now the crawldb contents have changed:
$ bin/nutch readdb crawl/crawldb -stats 
CrawlDb statistics start: crawl/crawldb 
Statistics for CrawlDb: crawl/crawldb 
TOTAL urls: 97 
retry 0: 97 
min score: 0.01 
avg score: 0.02 
max score: 1.0 
status 1 (db_unfetched): 96 
status 2 (db_fetched): 1 
CrawlDb statistics: done 
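
To compare the crawldb before and after updatedb without reading the whole report, the per-status counts can be pulled out of the statistics text. A minimal sketch, assuming the output has exactly the `status N (name): count` shape shown above; the function name `crawldb_status_count` is hypothetical:

```shell
# crawldb_status_count NAME: read `readdb -stats` output on stdin and print
# the count for the given status name (e.g. db_unfetched); prints 0 if absent.
crawldb_status_count() {
    awk -v name="($1):" '
        # Matching lines look like: status 1 (db_unfetched): 96
        $1 == "status" && $3 == name { count = $4 }
        END { print count + 0 }
    '
}
```

For example, `bin/nutch readdb crawl/crawldb -stats | crawldb_status_count db_unfetched` would print 96 for the statistics above. If a given Nutch version adds prefixes to these lines, the field positions in the awk script would need adjusting.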

==============invertlinks ============================== 
$ bin/nutch invertlinks crawl/linkdb crawl/segments/* 
LinkDb: starting 
LinkDb: linkdb: crawl/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: adding segment: crawl/segments/20080112224520 
LinkDb: done 

The linkdb directory is created at this point.

===============index==================================== 
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/* 
Indexer: starting 
Indexer: linkdb: crawl/linkdb 
Indexer: adding segment: crawl/segments/20080112224520 
Indexing [http://www.complaints.com/directory/directory.htm] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@ba4211 (null)
Optimizing index. 
merging segments _ram_0 (1 docs) into _0 (1 docs) 
Indexer: done 

The indexes directory is created at this point.
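
The commands from inject through index can be strung together into one script. The following is only a sketch: the function names `crawl_round` and `crawl_all`, the `NUTCH` variable, and the `rounds` argument are assumptions for illustration (the walkthrough above runs a single round), and it assumes each nutch command exits with a non-zero status on failure.

```shell
# Assumption: NUTCH names the nutch launcher; defaults to bin/nutch.
NUTCH=${NUTCH:-bin/nutch}

# crawl_round CRAWL_DIR: one generate / fetch / updatedb round.
crawl_round() {
    crawl_dir=$1
    "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
    # The newest segment is the last timestamp-named directory.
    segment=$(ls -d "$crawl_dir"/segments/2* | tail -1)
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
}

# crawl_all CRAWL_DIR URL_DIR ROUNDS: inject, crawl ROUNDS times, then
# build the link database and the index over all segments.
crawl_all() {
    crawl_dir=$1
    url_dir=$2
    rounds=$3
    "$NUTCH" inject "$crawl_dir/crawldb" "$url_dir" || return 1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        crawl_round "$crawl_dir" || return 1
        i=$((i + 1))
    done
    "$NUTCH" invertlinks "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1
    "$NUTCH" index "$crawl_dir/indexes" "$crawl_dir/crawldb" \
        "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1
}
```

Calling `crawl_all crawl urls 1` would correspond to the sequence walked through above.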

================Testing the crawl results==========================
$ bin/nutch org.apache.nutch.searcher.NutchBean complaints 
Total hits: 1 
0 20080112224520/http://www.complaints.com/directory/directory.htm 
Complaints.com - Sitemap by date ?Complaints ... 

References:
[1] Nutch version 0.8.x tutorial
http://lucene.apache.org/nutch/tutorial8.html
[2] Introduction to Nutch, Part 1: Crawling
http://today.java.net/lpt/a/255

[Actually written Jan 13, 2008, 12:10 am]
