readdb interrupting a running fetch job

The Nutch readdb command reports statistics on the URLs currently held in the crawldb:

root@namenode:/# bin/crawler readdb /user/root/crawl/crawldb -stats
CrawlDb statistics start: /user/root/crawl/crawldb
Statistics for CrawlDb: /user/root/crawl/crawldb
TOTAL urls:    26400413
retry 0:    26366653
retry 1:    33760
min score:    0.054
avg score:    0.07116767
max score:    4542.56
status 1 (db_unfetched):    25497960
status 2 (db_fetched):    587216
status 3 (db_gone):    12145
status 4 (db_redir_temp):    95622
status 5 (db_redir_perm):    207469
status 6 (db_notmodified):    1
CrawlDb statistics: done

If this command is used while a fetch job is in progress, the running tasks are forced to start over: before readdb was launched, the map phase had already reached 10%, but once the command ran, every task failed and the job restarted from 0%. Note that readdb -stats is itself submitted as a MapReduce job to the same cluster (it appears below as the "stats" job).
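Given that, it is worth checking what the JobTracker is doing before firing off readdb on a live cluster. A minimal precaution, assuming the stock Hadoop command-line client (the readdb invocation uses this post's bin/crawler wrapper script):

bin/hadoop job -list                                  # inspect the running jobs first
bin/crawler readdb /user/root/crawl/crawldb -stats    # only once no fetch job is active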

Running Jobs

Jobid:               job_201003050945_0010
Priority:            NORMAL
User:                root
Name:                fetch crawl/segments/20100306172317
Map % Complete:      0.62%    (17 total maps, 0 completed)
Reduce % Complete:   0.00%    (7 total reduces, 0 completed)

Error

task_201003050945_0010_m_000000   0.22%
10 threads, 3124 successes, 191 errors, 951 others, 2.0 pages/s, 57 kB/s, 6-Mar-2010 17:54:02
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)

Seen from the web front end, the fetch job was indeed interrupted: the stats job ran to completion first, and then the fetch job was re-run. The IOException above comes from TaskRunner detecting that the child task JVM exited with a nonzero status (255), i.e. the task process died outright rather than failing cleanly.

Completed Jobs

Jobid:               job_201003050945_0011
Priority:            NORMAL
User:                root
Name:                stats /user/root/crawl/crawldb
Map % Complete:      100.00%  (42 total maps, 42 completed)
Reduce % Complete:   100.00%  (7 total reduces, 7 completed)
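If the statistics are needed while a crawl is underway, one possible workaround (a sketch only; the snapshot path and cleanup step are hypothetical, while the fs commands are the standard Hadoop FS shell) is to copy the crawldb and point readdb at the snapshot, so the live database used by the fetch job is never touched:

hadoop fs -cp /user/root/crawl/crawldb /user/root/crawl/crawldb_snapshot
bin/crawler readdb /user/root/crawl/crawldb_snapshot -stats
hadoop fs -rmr /user/root/crawl/crawldb_snapshot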
