A brief walk-through of the Nutch crawl process. The underlying MapReduce machinery is not discussed here. This was written in a hurry, so many parts lack detail; I will refine them when time allows. Work now requires me to analyze the Nutch configuration files, and I will publish that analysis once it is done. Please credit the source when reposting.
A crawl produces five directories in total:
- crawldb: stores the downloaded URLs together with their download dates, used to schedule page re-checks.
- linkdb: stores the link relationships between URLs, derived by analysis after downloading completes.
- segments: stores the fetched pages. The number of subdirectories corresponds to the number of crawl depths; each depth normally gets its own subdirectory, named by timestamp for easy management. Since I crawled only one level here, only the 20090508173137 directory was generated. Each segment subdirectory in turn contains six folders:
  - content: the raw content of each downloaded page.
  - crawl_fetch: the fetch status of each downloaded URL.
  - crawl_generate: the set of URLs scheduled for download.
  - crawl_parse: the outlink data used to update the crawldb.
  - parse_data: the outlinks and metadata parsed from each URL.
  - parse_text: the extracted text of each parsed URL.
- indexes: one independent index per fetch round.
- index: the Lucene-format index produced by merging all the indexes under indexes.
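The timestamped segment names seen above (e.g. 20090508173137) match the yyyyMMddHHmmss pattern that Nutch's Generator applies to the segment's creation time; a minimal sketch (the class name here is mine, not Nutch's):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class SegmentName {
    // Each new segment directory is named with its creation time,
    // formatted as yyyyMMddHHmmss (e.g. 20090508173137).
    public static String segmentName(Date created) {
        return new SimpleDateFormat("yyyyMMddHHmmss").format(created);
    }

    public static void main(String[] args) {
        System.out.println("segments/" + segmentName(new Date()));
    }
}
```

This is why sorting segment directory names lexicographically also sorts them chronologically.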
The main classes involved are these nine:
1. nutch.crawl.Inject — the injector that adds URLs to the crawl database.
2. nutch.crawl.Generator — the generator that produces the fetchlist of URLs to download.
3. nutch.fetcher.Fetcher — the fetcher that downloads the selected pages.
4. nutch.parse.ParseSegment — the parser that extracts content and outlinks from fetched pages.
5. nutch.crawl.CrawlDb — the tool that manages the crawl database.
6. nutch.crawl.LinkDb — the tool that manages the link database.
7. nutch.indexer.Indexer — the indexer that builds the index.
8. nutch.indexer.DeleteDuplicates — removes duplicate documents.
9. nutch.indexer.IndexMerger — merges the index of the current round with the historical index.
Description: initializes the crawldb, reads the URL seed file, and injects its contents into the crawl database.
Inject first locates the urls directory that holds the seed file; if this directory has not been created, Nutch 1.0 reports an error.
It then obtains Hadoop's temporary working folder:
/tmp/hadoop-Administrator/mapred/
The log reads:
2009-05-08 15:41:36,640 INFO Injector - Injector: starting
2009-05-08 15:41:37,031 INFO Injector - Injector: crawlDb: 20090508/crawldb
2009-05-08 15:41:37,781 INFO Injector - Injector: urlDir: urls
Next, some initialization values are set.
JobClient.runJob from the Hadoop package is then called; stepping into JobClient's submitJob method shows how the whole job is submitted. The underlying mechanics belong to another open-source project, Hadoop, and its fairly involved MapReduce architecture, so they are not analyzed here.
Inside submitJob, a job id is obtained first. Running configureCommandLineOptions creates a system folder under the temporary directory above, with a job_local_0001 folder beneath it. writeSplitsFile then produces job.split under job_local_0001, writeXml writes job.xml, and finally jobSubmitClient.submitJob formally submits the whole job. The log follows:
2009-05-08 15:41:36,640 INFO Injector - Injector: starting
2009-05-08 15:41:37,031 INFO Injector - Injector: crawlDb: 20090508/crawldb
2009-05-08 15:41:37,781 INFO Injector - Injector: urlDir: urls
2009-05-08 15:52:41,734 INFO Injector - Injector: Converting injected urls to crawl db entries.
2009-05-08 15:56:22,203 INFO JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2009-05-08 16:08:20,796 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 16:08:20,984 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 16:24:42,593 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 16:38:29,437 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 16:38:29,546 INFO MapTask - numReduceTasks: 1
2009-05-08 16:38:29,562 INFO MapTask - io.sort.mb = 100
2009-05-08 16:38:29,687 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 16:38:29,687 INFO MapTask - record buffer = 262144/327680
2009-05-08 16:38:29,718 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
2009-05-08 16:38:29,921 INFO PluginRepository - Plugin Auto-activation mode: [true]
2009-05-08 16:38:29,921 INFO PluginRepository - Registered Plugins:
2009-05-08 16:38:29,921 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-05-08 16:38:29,921 INFO PluginRepository - Basic Query Filter (query-basic)
2009-05-08 16:38:29,921 INFO PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-05-08 16:38:29,921 INFO PluginRepository - Basic Indexing Filter (index-basic)
2009-05-08 16:38:29,921 INFO PluginRepository - Html Parse Plug-in (parse-html)
2009-05-08 16:38:29,921 INFO PluginRepository - Site Query Filter (query-site)
2009-05-08 16:38:29,921 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-05-08 16:38:29,921 INFO PluginRepository - HTTP Framework (lib-http)
2009-05-08 16:38:29,921 INFO PluginRepository - Text Parse Plug-in (parse-text)
2009-05-08 16:38:29,921 INFO PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-05-08 16:38:29,921 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2009-05-08 16:38:29,921 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2009-05-08 16:38:29,921 INFO PluginRepository - XML Response Writer Plug-in (response-xml)
2009-05-08 16:38:29,921 INFO PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-05-08 16:38:29,921 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-05-08 16:38:29,921 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-05-08 16:38:29,921 INFO PluginRepository - Anchor Indexing Filter (index-anchor)
2009-05-08 16:38:29,921 INFO PluginRepository - JavaScript Parser (parse-js)
2009-05-08 16:38:29,921 INFO PluginRepository - URL Query Filter (query-url)
2009-05-08 16:38:29,921 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-05-08 16:38:29,921 INFO PluginRepository - JSON Response Writer Plug-in (response-json)
2009-05-08 16:38:29,921 INFO PluginRepository - Registered Extension-Points:
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-05-08 16:38:29,921 INFO PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-05-08 16:38:29,921 INFO PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-05-08 16:38:29,968 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 16:38:29,984 WARN RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-05-08 16:38:29,984 INFO MapTask - Starting flush of map output
2009-05-08 16:38:30,203 INFO MapTask - Finished spill 0
2009-05-08 16:38:30,203 INFO TaskRunner - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2009-05-08 16:38:30,218 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/urls/site.txt:0+19
2009-05-08 16:38:30,218 INFO TaskRunner - Task 'attempt_local_0001_m_000000_0' done.
2009-05-08 16:38:30,234 INFO LocalJobRunner -
2009-05-08 16:38:30,250 INFO Merger - Merging 1 sorted segments
2009-05-08 16:38:30,265 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 16:38:30,265 INFO LocalJobRunner -
2009-05-08 16:38:30,390 INFO TaskRunner - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2009-05-08 16:38:30,390 INFO LocalJobRunner -
2009-05-08 16:38:30,390 INFO TaskRunner - Task attempt_local_0001_r_000000_0 is allowed to commit now
2009-05-08 16:38:30,406 INFO FileOutputCommitter - Saved output of task 'attempt_local_0001_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304
2009-05-08 16:38:30,406 INFO LocalJobRunner - reduce > reduce
2009-05-08 16:38:30,406 INFO TaskRunner - Task 'attempt_local_0001_r_000000_0' done.
After execution, the returned running value is:
Job: job_local_0001
file: file:/tmp/hadoop-Administrator/mapred/system/job_local_0001/job.xml
tracking URL: http://localhost:8080/
2009-05-08 16:47:14,093 INFO JobClient - Running job: job_local_0001
2009-05-08 16:49:51,859 INFO JobClient - Job complete: job_local_0001
2009-05-08 16:51:36,062 INFO JobClient - Counters: 11
2009-05-08 16:51:36,062 INFO JobClient - File Systems
2009-05-08 16:51:36,062 INFO JobClient - Local bytes read=51591
2009-05-08 16:51:36,062 INFO JobClient - Local bytes written=104337
2009-05-08 16:51:36,062 INFO JobClient - Map-Reduce Framework
2009-05-08 16:51:36,062 INFO JobClient - Reduce input groups=1
2009-05-08 16:51:36,062 INFO JobClient - Combine output records=0
2009-05-08 16:51:36,062 INFO JobClient - Map input records=1
2009-05-08 16:51:36,062 INFO JobClient - Reduce output records=1
2009-05-08 16:51:36,062 INFO JobClient - Map output bytes=49
2009-05-08 16:51:36,062 INFO JobClient - Map input bytes=19
2009-05-08 16:51:36,062 INFO JobClient - Combine input records=0
2009-05-08 16:51:36,062 INFO JobClient - Map output records=1
2009-05-08 16:51:36,062 INFO JobClient - Reduce input records=1
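The MapTask buffer figures in the log above (data buffer = 79691776/99614720, record buffer = 262144/327680) can be reproduced from Hadoop's defaults of that era: io.sort.mb = 100, io.sort.record.percent = 0.05 (the fraction of the sort buffer reserved for 16-byte record-accounting entries) and io.sort.spill.percent = 0.80 (the soft limit that triggers a spill). A sketch assuming those defaults:

```java
public class SortBufferMath {
    static final int MB = 1 << 20;

    // Bytes reserved for 16-byte record-accounting entries.
    public static long recordBytes(int sortMb, double recordPercent) {
        return (long) (sortMb * (long) MB * recordPercent);
    }

    // Remaining bytes available for key/value data.
    public static long dataCapacity(int sortMb, double recordPercent) {
        return sortMb * (long) MB - recordBytes(sortMb, recordPercent);
    }

    // Number of record-accounting slots (16 bytes each).
    public static long recordCapacity(int sortMb, double recordPercent) {
        return recordBytes(sortMb, recordPercent) / 16;
    }

    public static void main(String[] args) {
        long dataCap = dataCapacity(100, 0.05);   // 99614720
        long recCap = recordCapacity(100, 0.05);  // 327680
        // The log prints "softLimit/capacity" for each buffer:
        System.out.println((long) (dataCap * 0.8) + "/" + dataCap); // 79691776/99614720
        System.out.println((long) (recCap * 0.8) + "/" + recCap);   // 262144/327680
    }
}
```

So "data buffer = 79691776/99614720" simply means: 95 MB of the 100 MB sort buffer holds data, and a spill starts once 80% of it fills.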
This completes the first runJob call.
Summary: to be written.
Next, the crawldb folder is created and the injected urls are merged into it:
JobClient.runJob(mergeJob);
CrawlDb.install(mergeJob, crawlDb);
This step first creates a job_local_0002 directory under the temporary folder mentioned earlier, again containing job.split and job.xml. It then completes the creation of the crawldb, and finally deletes the files under the temp folder.
This concludes the inject phase. The tail of the log:
2009-05-08 17:03:57,250 INFO Injector - Injector: Merging injected urls into crawl db.
2009-05-08 17:10:01,015 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:10:15,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 17:10:16,156 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 17:12:15,296 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:13:40,296 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:13:40,406 INFO MapTask - numReduceTasks: 1
2009-05-08 17:13:40,406 INFO MapTask - io.sort.mb = 100
2009-05-08 17:13:40,515 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 17:13:40,515 INFO MapTask - record buffer = 262144/327680
2009-05-08 17:13:40,546 INFO MapTask - Starting flush of map output
2009-05-08 17:13:40,765 INFO MapTask - Finished spill 0
2009-05-08 17:13:40,765 INFO TaskRunner - Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
2009-05-08 17:13:40,765 INFO LocalJobRunner - file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304/part-00000:0+143
2009-05-08 17:13:40,765 INFO TaskRunner - Task 'attempt_local_0002_m_000000_0' done.
2009-05-08 17:13:40,796 INFO LocalJobRunner -
2009-05-08 17:13:40,796 INFO Merger - Merging 1 sorted segments
2009-05-08 17:13:40,796 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes
2009-05-08 17:13:40,796 INFO LocalJobRunner -
2009-05-08 17:13:40,906 WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2009-05-08 17:13:40,906 INFO CodecPool - Got brand-new compressor
2009-05-08 17:13:40,906 INFO TaskRunner - Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
2009-05-08 17:13:40,906 INFO LocalJobRunner -
2009-05-08 17:13:40,906 INFO TaskRunner - Task attempt_local_0002_r_000000_0 is allowed to commit now
2009-05-08 17:13:40,921 INFO FileOutputCommitter - Saved output of task 'attempt_local_0002_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/1896567745
2009-05-08 17:13:40,921 INFO LocalJobRunner - reduce > reduce
2009-05-08 17:13:40,937 INFO TaskRunner - Task 'attempt_local_0002_r_000000_0' done.
2009-05-08 17:13:46,781 INFO JobClient - Running job: job_local_0002
2009-05-08 17:14:55,125 INFO JobClient - Job complete: job_local_0002
2009-05-08 17:14:59,328 INFO JobClient - Counters: 11
2009-05-08 17:14:59,328 INFO JobClient - File Systems
2009-05-08 17:14:59,328 INFO JobClient - Local bytes read=103875
2009-05-08 17:14:59,328 INFO JobClient - Local bytes written=209385
2009-05-08 17:14:59,328 INFO JobClient - Map-Reduce Framework
2009-05-08 17:14:59,328 INFO JobClient - Reduce input groups=1
2009-05-08 17:14:59,328 INFO JobClient - Combine output records=0
2009-05-08 17:14:59,328 INFO JobClient - Map input records=1
2009-05-08 17:14:59,328 INFO JobClient - Reduce output records=1
2009-05-08 17:14:59,328 INFO JobClient - Map output bytes=49
2009-05-08 17:14:59,328 INFO JobClient - Map input bytes=57
2009-05-08 17:14:59,328 INFO JobClient - Combine input records=0
2009-05-08 17:14:59,328 INFO JobClient - Map output records=1
2009-05-08 17:14:59,328 INFO JobClient - Reduce input records=1
2009-05-08 17:17:30,984 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:20:02,390 INFO Injector - Injector: done
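The CrawlDb.install(mergeJob, crawlDb) call mentioned above amounts to a directory swap: the freshly written output becomes crawldb/current, and the previous current (if any) is kept as crawldb/old. A conceptual sketch using plain java.io.File (Nutch does the equivalent renames through the Hadoop FileSystem API; the class name and error handling here are mine):

```java
import java.io.File;

public class CrawlDbInstallSketch {
    // Promote the job's output directory to crawldb/current,
    // keeping the previous generation as crawldb/old.
    public static void install(File newCrawlDb, File crawlDb) {
        File current = new File(crawlDb, "current");
        File old = new File(crawlDb, "old");
        if (current.exists()) {
            deleteRecursively(old); // drop the stale backup
            if (!current.renameTo(old)) throw new IllegalStateException("rename failed");
        }
        crawlDb.mkdirs();
        if (!newCrawlDb.renameTo(current)) throw new IllegalStateException("rename failed");
        new File(crawlDb, ".locked").delete(); // release the lock file
    }

    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) for (File c : children) deleteRecursively(c);
        f.delete();
    }

    public static void main(String[] args) {
        File root = new File(System.getProperty("java.io.tmpdir"), "crawldb-demo");
        deleteRecursively(root);
        File tmp = new File(root, "tmp-output");
        tmp.mkdirs();
        File crawlDb = new File(root, "crawldb");
        crawlDb.mkdirs();
        install(tmp, crawlDb);
        System.out.println(new File(crawlDb, "current").isDirectory());
    }
}
```

This also explains the numeric temp directory seen in the log (crawldb/1896567745): it is the job output waiting to be renamed to current.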
Description: generates a new segment from the crawl database, then produces the fetchlist of URLs to download from it.
LockUtil.createLockFile(fs, lock, force);
Running the method above first creates a .locked file under the crawldb directory; my guess is that it prevents the crawldb data from being modified concurrently, though I have not verified this.
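Conceptually, LockUtil.createLockFile just creates a .locked marker and refuses to proceed if one is already there, unless force is set. A hedged sketch of that semantics (class name and exception choices are mine; Nutch works against the Hadoop FileSystem, not java.io.File):

```java
import java.io.File;
import java.io.IOException;

public class LockFileSketch {
    // Create a .locked marker so a second job won't touch the same
    // crawldb concurrently. If the lock exists, fail unless force is true.
    public static void createLockFile(File dir, boolean force) {
        File lock = new File(dir, ".locked");
        if (lock.exists() && !force) {
            throw new IllegalStateException("lock file " + lock + " already exists");
        }
        try {
            lock.createNewFile();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "lock-demo");
        dir.mkdirs();
        new File(dir, ".locked").delete();
        createLockFile(dir, false);          // acquires the lock
        try {
            createLockFile(dir, false);      // second attempt fails
        } catch (IllegalStateException expected) {
            System.out.println("already locked");
        }
        createLockFile(dir, true);           // force re-acquires
    }
}
```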
The rest of the process is much the same as above and can be followed by analogy with the previous steps. The log:
2009-05-08 17:37:18,218 INFO Generator - Generator: Selecting best-scoring urls due for fetch.
2009-05-08 17:37:18,625 INFO Generator - Generator: starting
2009-05-08 17:37:18,937 INFO Generator - Generator: segment: 20090508/segments/20090508173137
2009-05-08 17:37:19,468 INFO Generator - Generator: filtering: true
2009-05-08 17:37:22,312 INFO Generator - Generator: topN: 50
2009-05-08 17:37:51,203 INFO Generator - Generator: jobtracker is 'local', generating exactly one partition.
2009-05-08 17:39:57,609 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-08 17:40:05,234 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-08 17:40:05,406 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-08 17:40:05,437 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:40:06,062 INFO FileInputFormat - Total input paths to process : 1
2009-05-08 17:40:06,109 INFO MapTask - numReduceTasks: 1
(plugin loading log omitted)
2009-05-08 17:40:06,312 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 17:40:06,343 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-08 17:40:06,343 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-08 17:40:06,343 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-08 17:40:06,343 INFO MapTask - io.sort.mb = 100
2009-05-08 17:40:06,437 INFO MapTask - data buffer = 79691776/99614720
2009-05-08 17:40:06,437 INFO MapTask - record buffer = 262144/327680
2009-05-08 17:40:06,453 WARN RegexURLNormalizer - can't find rules for scope 'partition', using default
2009-05-08 17:40:06,453 INFO MapTask - Starting flush of map output
2009-05-08 17:40:06,625 INFO MapTask - Finished spill 0
2009-05-08 17:40:06,640 INFO TaskRunner - Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
2009-05-08 17:40:06,640 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143
2009-05-08 17:40:06,640 INFO TaskRunner - Task 'attempt_local_0003_m_000000_0' done.
2009-05-08 17:40:06,656 INFO LocalJobRunner -
2009-05-08 17:40:06,656 INFO Merger - Merging 1 sorted segments
2009-05-08 17:40:06,656 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 78 bytes
2009-05-08 17:40:06,656 INFO LocalJobRunner -
(plugin loading log omitted)
2009-05-08 17:40:06,875 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-08 17:40:06,906 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-08 17:40:06,906 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-08 17:40:06,906 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-08 17:40:06,906 WARN RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2009-05-08 17:40:06,906 INFO TaskRunner - Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
2009-05-08 17:40:06,906 INFO LocalJobRunner -
2009-05-08 17:40:06,906 INFO TaskRunner - Task attempt_local_0003_r_000000_0 is allowed to commit now
2009-05-08 17:40:06,906 INFO FileOutputCommitter - Saved output of task 'attempt_local_0003_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/generate-temp-1241774893937
2009-05-08 17:40:06,921 INFO LocalJobRunner - reduce > reduce
2009-05-08 17:40:06,921 INFO TaskRunner - Task 'attempt_local_0003_r_000000_0' done.
2009-05-08 17:40:21,468 INFO JobClient - Running job: job_local_0003
2009-05-08 17:40:31,671 INFO JobClient - Job complete: job_local_0003
2009-05-08 17:40:34,046 INFO JobClient - Counters: 11
2009-05-08 17:40:34,046 INFO JobClient - File Systems
2009-05-08 17:40:34,046 INFO JobClient - Local bytes read=157400
2009-05-08 17:40:34,046 INFO JobClient - Local bytes written=316982
2009-05-08 17:40:34,046 INFO JobClient - Map-Reduce Framework
2009-05-08 17:40:34,046 INFO JobClient - Reduce input groups=1
2009-05-08 17:40:34,046 INFO JobClient - Combine output records=0
2009-05-08 17:40:34,046 INFO JobClient - Map input records=1
2009-05-08 17:40:34,046 INFO JobClient - Reduce output records=1
2009-05-08 17:40:34,046 INFO JobClient - Map output bytes=74
2009-05-08 17:40:34,046 INFO JobClient - Map input bytes=57
2009-05-08 17:40:34,046 INFO JobClient - Combine input records=0
2009-05-08 17:40:34,046 INFO JobClient - Map output records=1
2009-05-08 17:40:34,046 INFO JobClient - Reduce input records=1
Then submitJob is called again to submit the whole generate job. This creates the segments directory and removes the temporary files, the lock file, and so on; at this point the segment contains only the crawl_generate folder.
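The "Selecting best-scoring urls due for fetch" and "topN: 50" lines in the generate log describe a top-N selection by score. Nutch implements it as a MapReduce sort on score; the same idea can be sketched with a bounded min-heap (class name, score values, and URLs here are illustrative only):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopNSelect {
    // Keep only the topN highest-scoring URLs that are due for fetch.
    public static List<String> topN(Map<String, Float> scored, int n) {
        PriorityQueue<Map.Entry<String, Float>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue()); // min-heap on score
        for (Map.Entry<String, Float> e : scored.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) heap.poll(); // evict the current lowest score
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) out.add(heap.poll().getKey()); // ascending score
        Collections.reverse(out); // best-scoring first
        return out;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://a.example/", 1.0f);
        scores.put("http://b.example/", 0.5f);
        scores.put("http://c.example/", 2.0f);
        System.out.println(topN(scores, 2)); // [http://c.example/, http://a.example/]
    }
}
```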
Description: carries out the actual page downloads.
2009-05-11 09:45:13,984 WARN Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2009-05-11 09:45:34,796 INFO Fetcher - Fetcher: starting
2009-05-11 09:45:35,375 INFO Fetcher - Fetcher: segment: 20090508/segments/20090511094102
2009-05-11 09:49:23,984 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 09:49:58,046 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 09:49:58,234 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 09:49:58,265 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 09:49:58,859 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 09:49:58,906 INFO MapTask - numReduceTasks: 1
2009-05-11 09:49:58,906 INFO MapTask - io.sort.mb = 100
2009-05-11 09:49:59,015 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 09:49:59,015 INFO MapTask - record buffer = 262144/327680
2009-05-11 09:49:59,140 INFO Fetcher - Fetcher: threads: 5
2009-05-11 09:49:59,140 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
2009-05-11 09:49:59,250 INFO Fetcher - QueueFeeder finished: total 1 records.
(plugin loading log omitted)
2009-05-11 09:49:59,312 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 09:49:59,328 INFO Configuration - found resource parse-plugins.xml at file:/D:/work/workspace/nutch_crawl/bin/parse-plugins.xml
2009-05-11 09:49:59,359 INFO Fetcher - fetching http://www.163.com/
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=4
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=3
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=2
2009-05-11 09:49:59,375 INFO Fetcher - -finishing thread FetcherThread, activeThreads=1
2009-05-11 09:49:59,421 INFO Http - http.proxy.host = null
2009-05-11 09:49:59,421 INFO Http - http.proxy.port = 8080
2009-05-11 09:49:59,421 INFO Http - http.timeout = 10000
2009-05-11 09:49:59,421 INFO Http - http.content.limit = 65536
2009-05-11 09:49:59,421 INFO Http - http.agent = nutch/Nutch-1.0 (chinahui; http://www.163.com; [email protected])
2009-05-11 09:49:59,421 INFO Http - protocol.plugin.check.blocking = false
2009-05-11 09:49:59,421 INFO Http - protocol.plugin.check.robots = false
2009-05-11 09:50:00,109 INFO Configuration - found resource tika-mimetypes.xml at file:/D:/work/workspace/nutch_crawl/bin/tika-mimetypes.xml
2009-05-11 09:50:00,156 WARN ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2009-05-11 09:50:00,375 INFO Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2009-05-11 09:50:00,671 INFO SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2009-05-11 09:50:00,687 INFO Fetcher - -finishing thread FetcherThread, activeThreads=0
2009-05-11 09:50:01,375 INFO Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2009-05-11 09:50:01,375 INFO Fetcher - -activeThreads=0
2009-05-11 09:50:01,375 INFO MapTask - Starting flush of map output
2009-05-11 09:50:01,578 INFO MapTask - Finished spill 0
2009-05-11 09:50:01,578 INFO TaskRunner - Task:attempt_local_0005_m_000000_0 is done. And is in the process of commiting
2009-05-11 09:50:01,578 INFO LocalJobRunner - 0 threads, 1 pages, 0 errors, 0.5 pages/s, 256 kb/s,
2009-05-11 09:50:01,578 INFO TaskRunner - Task 'attempt_local_0005_m_000000_0' done.
2009-05-11 09:50:01,593 INFO LocalJobRunner -
2009-05-11 09:50:01,593 INFO Merger - Merging 1 sorted segments
2009-05-11 09:50:01,593 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 72558 bytes
2009-05-11 09:50:01,593 INFO LocalJobRunner -
2009-05-11 09:50:01,671 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,734 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,765 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:01,765 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted)
2009-05-11 09:50:01,921 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 09:50:01,984 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,015 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,062 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,093 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,125 INFO CodecPool - Got brand-new compressor
2009-05-11 09:50:02,140 WARN RegexURLNormalizer - can't find rules for scope 'outlink', using default
2009-05-11 09:50:02,171 INFO TaskRunner - Task:attempt_local_0005_r_000000_0 is done. And is in the process of commiting
2009-05-11 09:50:02,171 INFO LocalJobRunner - reduce > reduce
2009-05-11 09:50:02,187 INFO TaskRunner - Task 'attempt_local_0005_r_000000_0' done.
2009-05-11 09:50:44,062 INFO JobClient - Running job: job_local_0005
2009-05-11 09:51:31,328 INFO JobClient - Job complete: job_local_0005
2009-05-11 09:51:32,984 INFO JobClient - Counters: 11
2009-05-11 09:51:33,000 INFO JobClient - File Systems
2009-05-11 09:51:33,000 INFO JobClient - Local bytes read=336424
2009-05-11 09:51:33,000 INFO JobClient - Local bytes written=700394
2009-05-11 09:51:33,000 INFO JobClient - Map-Reduce Framework
2009-05-11 09:51:33,000 INFO JobClient - Reduce input groups=1
2009-05-11 09:51:33,000 INFO JobClient - Combine output records=0
2009-05-11 09:51:33,000 INFO JobClient - Map input records=1
2009-05-11 09:51:33,000 INFO JobClient - Reduce output records=3
2009-05-11 09:51:33,000 INFO JobClient - Map output bytes=72545
2009-05-11 09:51:33,000 INFO JobClient - Map input bytes=78
2009-05-11 09:51:33,000 INFO JobClient - Combine input records=0
2009-05-11 09:51:33,000 INFO JobClient - Map output records=3
2009-05-11 09:51:33,000 INFO JobClient - Reduce input records=3
2009-05-11 09:51:47,750 INFO Fetcher - Fetcher: done
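The fetch log above shows the shape of this phase: a QueueFeeder loads the fetchlist into a shared queue ("QueueFeeder finished: total 1 records"), and several FetcherThreads drain it ("Fetcher: threads: 5", then four idle threads finishing immediately because there was only one URL). A conceptual producer/consumer sketch of just that structure (real Nutch adds per-host queues, politeness delays, and protocol plugins, none of which appear here):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class FetcherSketch {
    // Fill a shared queue with the fetchlist, then let `threads`
    // workers drain it; each worker exits once the queue is empty.
    public static int fetchAll(List<String> fetchlist, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(fetchlist);
        AtomicInteger pages = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                String url;
                while ((url = queue.poll()) != null) {
                    pages.incrementAndGet(); // a real FetcherThread downloads url here
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return pages.get();
    }

    public static void main(String[] args) {
        // One URL, five threads: four threads find the queue already
        // empty and finish at once, as in the log above.
        System.out.println(fetchAll(Arrays.asList("http://www.163.com/"), 5));
    }
}
```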
Description: parses the content of the downloaded pages.
Description: adds the extracted outlinks to the crawl database.
2009-05-11 10:04:20,890 INFO CrawlDb - CrawlDb update: starting
2009-05-11 10:04:22,500 INFO CrawlDb - CrawlDb update: db: 20090508/crawldb
2009-05-11 10:05:53,593 INFO CrawlDb - CrawlDb update: segments: [20090508/segments/20090511094102]
2009-05-11 10:06:06,031 INFO CrawlDb - CrawlDb update: additions allowed: true
2009-05-11 10:06:07,296 INFO CrawlDb - CrawlDb update: URL normalizing: true
2009-05-11 10:06:09,031 INFO CrawlDb - CrawlDb update: URL filtering: true
2009-05-11 10:07:05,125 INFO CrawlDb - CrawlDb update: Merging segment data into db.
2009-05-11 10:08:11,031 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:09:00,187 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:09:00,375 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:10:03,531 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,125 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,203 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:25,203 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:25,343 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:25,343 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:25,343 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted)
2009-05-11 10:16:25,750 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:25,796 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:25,796 INFO MapTask - Starting flush of map output
2009-05-11 10:16:25,984 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,000 INFO TaskRunner - Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:26,000 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143
2009-05-11 10:16:26,000 INFO TaskRunner - Task 'attempt_local_0006_m_000000_0' done.
2009-05-11 10:16:26,031 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:26,031 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:26,140 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:26,140 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:26,156 INFO CodecPool - Got brand-new decompressor
2009-05-11 10:16:26,171 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted)
2009-05-11 10:16:26,687 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:26,718 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:26,734 INFO MapTask - Starting flush of map output
2009-05-11 10:16:26,750 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,750 INFO TaskRunner - Task:attempt_local_0006_m_000002_0 is done. And is in the process of commiting
2009-05-11 10:16:26,750 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026
2009-05-11 10:16:26,750 INFO TaskRunner - Task 'attempt_local_0006_m_000002_0' done.
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,781 INFO Merger - Merging 3 sorted segments
2009-05-11 10:16:26,781 INFO Merger - Down to the last merge-pass, with 3 segments left of total size: 3706 bytes
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,875 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted)
2009-05-11 10:16:27,031 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-11 10:16:27,046 INFO TaskRunner - Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:27,046 INFO LocalJobRunner -
2009-05-11 10:16:27,046 INFO TaskRunner - Task attempt_local_0006_r_000000_0 is allowed to commit now
2009-05-11 10:16:27,062 INFO FileOutputCommitter - Saved output of task 'attempt_local_0006_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/132216774
2009-05-11 10:16:27,062 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:16:27,062 INFO TaskRunner - Task 'attempt_local_0006_r_000000_0' done.
2009-05-11 10:17:43,984 INFO JobClient - Running job: job_local_0006
2009-05-11 10:18:33,671 INFO JobClient - Job complete: job_local_0006
2009-05-11 10:18:35,906 INFO JobClient - Counters: 11
2009-05-11 10:18:35,906 INFO JobClient - File Systems
2009-05-11 10:18:35,906 INFO JobClient - Local bytes read=936164
2009-05-11 10:18:35,906 INFO JobClient - Local bytes written=1678861
2009-05-11 10:18:35,906 INFO JobClient - Map-Reduce Framework
2009-05-11 10:18:35,906 INFO JobClient - Reduce input groups=57
2009-05-11 10:18:35,906 INFO JobClient - Combine output records=0
2009-05-11 10:18:35,906 INFO JobClient - Map input records=63
2009-05-11 10:18:35,906 INFO JobClient - Reduce output records=57
2009-05-11 10:18:35,906 INFO JobClient - Map output bytes=3574
2009-05-11 10:18:35,906 INFO JobClient - Map input bytes=4079
2009-05-11 10:18:35,906 INFO JobClient - Combine input records=0
2009-05-11 10:18:35,906 INFO JobClient - Map output records=63
2009-05-11 10:18:35,906 INFO JobClient - Reduce input records=63
2009-05-11 10:19:48,078 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:22:51,437 INFO CrawlDb - CrawlDb update: done
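The updatedb reduce step groups records from the old crawldb, crawl_fetch, and crawl_parse by URL and collapses each group to one entry; that is why the counters above show 63 input records but only 57 reduce groups and output records (duplicate URLs merged). A simplified sketch of the merge semantics (status strings and class name are illustrative; the real reducer resolves many more cases, such as retry counts and scores):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class UpdateDbSketch {
    // Merge fetch results and newly discovered outlinks into the crawldb,
    // keyed by URL: fetched pages get their new status, unseen outlinks
    // are added as unfetched, and duplicates collapse to one entry.
    public static Map<String, String> update(Map<String, String> crawlDb,
                                             Map<String, String> fetchStatus,
                                             Set<String> outlinks) {
        Map<String, String> merged = new TreeMap<>(crawlDb);
        merged.putAll(fetchStatus);                  // update status of fetched URLs
        for (String url : outlinks) {
            merged.putIfAbsent(url, "db_unfetched"); // newly discovered links
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("http://www.163.com/", "db_unfetched");
        Map<String, String> fetched = new HashMap<>();
        fetched.put("http://www.163.com/", "db_fetched");
        Set<String> outlinks = new HashSet<>(
            Arrays.asList("http://news.163.com/", "http://www.163.com/"));
        System.out.println(update(db, fetched, outlinks).size()); // 2
    }
}
```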
Description: analyzes the link relationships and generates inverted links.
When this finishes, the linkdb directory has been created.
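Link inversion itself is a simple transformation: the outlinks recorded at parse time (page → pages it points to) are turned into inlinks (page → pages pointing to it), which is what the linkdb stores. A minimal sketch (class name and example URLs are mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertLinksSketch {
    // Invert an outlink map (source -> targets) into an inlink map
    // (target -> sources), the relationship the linkdb holds.
    public static Map<String, Set<String>> invert(Map<String, Set<String>> outlinks) {
        Map<String, Set<String>> inlinks = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : outlinks.entrySet()) {
            for (String target : e.getValue()) {
                inlinks.computeIfAbsent(target, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> out = new HashMap<>();
        out.put("http://www.163.com/",
                new HashSet<>(Arrays.asList("http://news.163.com/", "http://mail.163.com/")));
        System.out.println(invert(out).get("http://news.163.com/")); // [http://www.163.com/]
    }
}
```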
2009-05-11 10:04:20,890 INFO CrawlDb - CrawlDb update: starting
2009-05-11 10:04:22,500 INFO CrawlDb - CrawlDb update: db: 20090508/crawldb
2009-05-11 10:05:53,593 INFO CrawlDb - CrawlDb update: segments: [20090508/segments/20090511094102]
2009-05-11 10:06:06,031 INFO CrawlDb - CrawlDb update: additions allowed: true
2009-05-11 10:06:07,296 INFO CrawlDb - CrawlDb update: URL normalizing: true
2009-05-11 10:06:09,031 INFO CrawlDb - CrawlDb update: URL filtering: true
2009-05-11 10:07:05,125 INFO CrawlDb - CrawlDb update: Merging segment data into db.
2009-05-11 10:08:11,031 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:09:00,187 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:09:00,375 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:10:03,531 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,125 INFO FileInputFormat - Total input paths to process : 3
2009-05-11 10:16:25,203 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:25,203 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:25,343 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:25,343 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:25,343 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
省略插件加载日志…
2009-05-11 10:16:25,750 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:25,796 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:25,796 INFO MapTask - Starting flush of map output
2009-05-11 10:16:25,984 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,000 INFO TaskRunner - Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:26,000 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143
2009-05-11 10:16:26,000 INFO TaskRunner - Task 'attempt_local_0006_m_000000_0' done.
2009-05-11 10:16:26,031 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:26,031 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:26,140 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:26,140 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:26,156 INFO CodecPool - Got brand-new decompressor
2009-05-11 10:16:26,171 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:16:26,343 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:26,359 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:26,359 INFO MapTask - Starting flush of map output
2009-05-11 10:16:26,359 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,375 INFO TaskRunner - Task:attempt_local_0006_m_000001_0 is done. And is in the process of commiting
2009-05-11 10:16:26,375 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_fetch/part-00000/data:0+254
2009-05-11 10:16:26,375 INFO TaskRunner - Task 'attempt_local_0006_m_000001_0' done.
2009-05-11 10:16:26,406 INFO MapTask - numReduceTasks: 1
2009-05-11 10:16:26,406 INFO MapTask - io.sort.mb = 100
2009-05-11 10:16:26,515 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:16:26,515 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:16:26,531 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:16:26,687 INFO Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt
2009-05-11 10:16:26,718 WARN RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2009-05-11 10:16:26,734 INFO MapTask - Starting flush of map output
2009-05-11 10:16:26,750 INFO MapTask - Finished spill 0
2009-05-11 10:16:26,750 INFO TaskRunner - Task:attempt_local_0006_m_000002_0 is done. And is in the process of commiting
2009-05-11 10:16:26,750 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026
2009-05-11 10:16:26,750 INFO TaskRunner - Task 'attempt_local_0006_m_000002_0' done.
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,781 INFO Merger - Merging 3 sorted segments
2009-05-11 10:16:26,781 INFO Merger - Down to the last merge-pass, with 3 segments left of total size: 3706 bytes
2009-05-11 10:16:26,781 INFO LocalJobRunner -
2009-05-11 10:16:26,875 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:16:27,031 INFO FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - defaultInterval=2592000
2009-05-11 10:16:27,031 INFO AbstractFetchSchedule - maxInterval=7776000
2009-05-11 10:16:27,046 INFO TaskRunner - Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:16:27,046 INFO LocalJobRunner -
2009-05-11 10:16:27,046 INFO TaskRunner - Task attempt_local_0006_r_000000_0 is allowed to commit now
2009-05-11 10:16:27,062 INFO FileOutputCommitter - Saved output of task 'attempt_local_0006_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/132216774
2009-05-11 10:16:27,062 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:16:27,062 INFO TaskRunner - Task 'attempt_local_0006_r_000000_0' done.
2009-05-11 10:17:43,984 INFO JobClient - Running job: job_local_0006
2009-05-11 10:18:33,671 INFO JobClient - Job complete: job_local_0006
2009-05-11 10:18:35,906 INFO JobClient - Counters: 11
2009-05-11 10:18:35,906 INFO JobClient - File Systems
2009-05-11 10:18:35,906 INFO JobClient - Local bytes read=936164
2009-05-11 10:18:35,906 INFO JobClient - Local bytes written=1678861
2009-05-11 10:18:35,906 INFO JobClient - Map-Reduce Framework
2009-05-11 10:18:35,906 INFO JobClient - Reduce input groups=57
2009-05-11 10:18:35,906 INFO JobClient - Combine output records=0
2009-05-11 10:18:35,906 INFO JobClient - Map input records=63
2009-05-11 10:18:35,906 INFO JobClient - Reduce output records=57
2009-05-11 10:18:35,906 INFO JobClient - Map output bytes=3574
2009-05-11 10:18:35,906 INFO JobClient - Map input bytes=4079
2009-05-11 10:18:35,906 INFO JobClient - Combine input records=0
2009-05-11 10:18:35,906 INFO JobClient - Map output records=63
2009-05-11 10:18:35,906 INFO JobClient - Reduce input records=63
2009-05-11 10:19:48,078 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:22:51,437 INFO CrawlDb - CrawlDb update: done
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: starting
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: linkdb: 20090508/linkdb
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: URL normalize: true
2009-05-11 10:26:31,250 INFO LinkDb - LinkDb: URL filter: true
2009-05-11 10:26:31,281 INFO LinkDb - LinkDb: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102
2009-05-11 10:26:31,281 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:26:31,296 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:26:31,453 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:26:31,484 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:26:32,078 INFO JobClient - Running job: job_local_0007
2009-05-11 10:26:32,078 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:26:32,125 INFO MapTask - numReduceTasks: 1
2009-05-11 10:26:32,125 INFO MapTask - io.sort.mb = 100
2009-05-11 10:26:32,234 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:26:32,234 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:26:32,250 INFO MapTask - Starting flush of map output
2009-05-11 10:26:32,437 INFO MapTask - Finished spill 0
2009-05-11 10:26:32,453 INFO TaskRunner - Task:attempt_local_0007_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:26:32,453 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382
2009-05-11 10:26:32,453 INFO TaskRunner - Task 'attempt_local_0007_m_000000_0' done.
2009-05-11 10:26:32,468 INFO LocalJobRunner -
2009-05-11 10:26:32,468 INFO Merger - Merging 1 sorted segments
2009-05-11 10:26:32,468 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 3264 bytes
2009-05-11 10:26:32,468 INFO LocalJobRunner -
2009-05-11 10:26:32,562 INFO TaskRunner - Task:attempt_local_0007_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:26:32,562 INFO LocalJobRunner -
2009-05-11 10:26:32,562 INFO TaskRunner - Task attempt_local_0007_r_000000_0 is allowed to commit now
2009-05-11 10:26:32,578 INFO FileOutputCommitter - Saved output of task 'attempt_local_0007_r_000000_0' to file:/D:/work/workspace/nutch_crawl/linkdb-1900012851
2009-05-11 10:26:32,578 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:26:32,578 INFO TaskRunner - Task 'attempt_local_0007_r_000000_0' done.
2009-05-11 10:26:33,078 INFO JobClient - Job complete: job_local_0007
2009-05-11 10:26:33,078 INFO JobClient - Counters: 11
2009-05-11 10:26:33,078 INFO JobClient - File Systems
2009-05-11 10:26:33,078 INFO JobClient - Local bytes read=535968
2009-05-11 10:26:33,078 INFO JobClient - Local bytes written=965231
2009-05-11 10:26:33,078 INFO JobClient - Map-Reduce Framework
2009-05-11 10:26:33,078 INFO JobClient - Reduce input groups=56
2009-05-11 10:26:33,078 INFO JobClient - Combine output records=56
2009-05-11 10:26:33,078 INFO JobClient - Map input records=1
2009-05-11 10:26:33,078 INFO JobClient - Reduce output records=56
2009-05-11 10:26:33,078 INFO JobClient - Map output bytes=3384
2009-05-11 10:26:33,078 INFO JobClient - Map input bytes=1254
2009-05-11 10:26:33,078 INFO JobClient - Combine input records=60
2009-05-11 10:26:33,078 INFO JobClient - Map output records=60
2009-05-11 10:26:33,078 INFO JobClient - Reduce input records=56
2009-05-11 10:26:33,078 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:26:33,125 INFO LinkDb - LinkDb: done
Description: build the index of the page content. When this step completes, the indexes directory is generated.
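The log below lists six input paths (the crawldb, the linkdb, and the segment subdirectories), all keyed by URL; the Indexer's reduce phase joins them into one document per URL before handing it to Lucene. A minimal sketch of that per-URL join, assuming hypothetical field names and leaving Lucene out entirely:

```java
import java.util.*;

public class IndexJoinSketch {
    // Join per-URL data from several sources into one flat "document"
    // per URL, mimicking what the Indexer's reduce does. Field names
    // here are illustrative, not the real Nutch schema.
    public static Map<String, Map<String, String>> join(
            Map<String, String> parseText,   // url -> extracted page text
            Map<String, String> anchors) {   // url -> inbound anchor text (from linkdb)
        Map<String, Map<String, String>> docs = new TreeMap<>();
        for (Map.Entry<String, String> e : parseText.entrySet()) {
            Map<String, String> doc = new HashMap<>();
            doc.put("url", e.getKey());
            doc.put("content", e.getValue());
            doc.put("anchor", anchors.getOrDefault(e.getKey(), ""));
            docs.put(e.getKey(), doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, String> text = Map.of("http://a/", "hello world");
        Map<String, String> anchor = Map.of("http://a/", "home page");
        System.out.println(join(text, anchor).get("http://a/").get("anchor")); // prints home page
    }
}
```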
2009-05-11 10:31:22,250 INFO Indexer - Indexer: starting
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduce: crawldb: 20090508/crawldb
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduce: linkdb: 20090508/linkdb
2009-05-11 10:31:45,078 INFO IndexerMapReduce - IndexerMapReduces: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102
2009-05-11 10:32:30,359 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:32:34,109 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:32:34,296 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:32:34,421 INFO FileInputFormat - Total input paths to process : 6
2009-05-11 10:32:35,078 INFO FileInputFormat - Total input paths to process : 6
2009-05-11 10:32:35,140 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:35,140 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:35,250 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:35,250 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:35,265 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:35,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:35,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:35,953 INFO MapTask - Starting flush of map output
2009-05-11 10:32:35,968 INFO MapTask - Finished spill 0
2009-05-11 10:32:35,968 INFO TaskRunner - Task:attempt_local_0008_m_000001_0 is done. And is in the process of commiting
2009-05-11 10:32:35,968 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026
2009-05-11 10:32:35,968 INFO TaskRunner - Task 'attempt_local_0008_m_000001_0' done.
2009-05-11 10:32:36,000 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,000 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,125 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,125 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,125 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:36,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,281 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,296 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,312 INFO TaskRunner - Task:attempt_local_0008_m_000002_0 is done. And is in the process of commiting
2009-05-11 10:32:36,312 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382
2009-05-11 10:32:36,312 INFO TaskRunner - Task 'attempt_local_0008_m_000002_0' done.
2009-05-11 10:32:36,343 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,343 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,453 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,453 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,453 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:36,609 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,609 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,625 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,625 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,640 INFO TaskRunner - Task:attempt_local_0008_m_000003_0 is done. And is in the process of commiting
2009-05-11 10:32:36,640 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_text/part-00000/data:0+738
2009-05-11 10:32:36,640 INFO TaskRunner - Task 'attempt_local_0008_m_000003_0' done.
2009-05-11 10:32:36,671 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:36,671 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:36,781 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:36,781 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:36,796 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:36,937 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:36,953 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:36,953 INFO MapTask - Starting flush of map output
2009-05-11 10:32:36,968 INFO MapTask - Finished spill 0
2009-05-11 10:32:36,968 INFO TaskRunner - Task:attempt_local_0008_m_000004_0 is done. And is in the process of commiting
2009-05-11 10:32:36,968 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+3772
2009-05-11 10:32:36,968 INFO TaskRunner - Task 'attempt_local_0008_m_000004_0' done.
2009-05-11 10:32:37,000 INFO MapTask - numReduceTasks: 1
2009-05-11 10:32:37,000 INFO MapTask - io.sort.mb = 100
2009-05-11 10:32:37,109 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:32:37,109 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:32:37,125 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:37,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:37,281 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:37,296 INFO MapTask - Starting flush of map output
2009-05-11 10:32:37,296 INFO MapTask - Finished spill 0
2009-05-11 10:32:37,312 INFO TaskRunner - Task:attempt_local_0008_m_000005_0 is done. And is in the process of commiting
2009-05-11 10:32:37,312 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/linkdb/current/part-00000/data:0+4215
2009-05-11 10:32:37,312 INFO TaskRunner - Task 'attempt_local_0008_m_000005_0' done.
2009-05-11 10:32:37,343 INFO LocalJobRunner -
2009-05-11 10:32:37,359 INFO Merger - Merging 6 sorted segments
2009-05-11 10:32:37,359 INFO Merger - Down to the last merge-pass, with 6 segments left of total size: 13876 bytes
2009-05-11 10:32:37,359 INFO LocalJobRunner -
2009-05-11 10:32:37,359 INFO PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins
(plugin loading log omitted…)
2009-05-11 10:32:37,515 INFO IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2009-05-11 10:32:37,515 INFO IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2009-05-11 10:32:37,546 INFO Configuration - found resource common-terms.utf8 at file:/D:/work/workspace/nutch_crawl/bin/common-terms.utf8
2009-05-11 10:32:38,500 INFO TaskRunner - Task:attempt_local_0008_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:32:38,500 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:32:38,500 INFO TaskRunner - Task 'attempt_local_0008_r_000000_0' done.
2009-05-11 10:33:19,703 INFO JobClient - Running job: job_local_0008
2009-05-11 10:33:50,156 INFO JobClient - Job complete: job_local_0008
2009-05-11 10:33:52,562 INFO JobClient - Counters: 11
2009-05-11 10:33:52,562 INFO JobClient - File Systems
2009-05-11 10:33:52,562 INFO JobClient - Local bytes read=2150441
2009-05-11 10:33:52,562 INFO JobClient - Local bytes written=3845733
2009-05-11 10:33:52,562 INFO JobClient - Map-Reduce Framework
2009-05-11 10:33:52,562 INFO JobClient - Reduce input groups=58
2009-05-11 10:33:52,562 INFO JobClient - Combine output records=0
2009-05-11 10:33:52,562 INFO JobClient - Map input records=177
2009-05-11 10:33:52,562 INFO JobClient - Reduce output records=1
2009-05-11 10:33:52,562 INFO JobClient - Map output bytes=13506
2009-05-11 10:33:52,562 INFO JobClient - Map input bytes=13661
2009-05-11 10:33:52,562 INFO JobClient - Combine input records=0
2009-05-11 10:33:52,562 INFO JobClient - Map output records=177
2009-05-11 10:33:52,562 INFO JobClient - Reduce input records=177
2009-05-11 10:33:57,656 INFO Indexer - Indexer: done
Description: remove duplicate documents from the index.
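The idea behind DeleteDuplicates can be sketched as grouping index entries by content hash and keeping the best-scoring one per group (a simplified model; the real tool also deduplicates by URL and runs as a sequence of map/reduce jobs, visible as job_local_0009 through job_local_0011 below):

```java
import java.util.*;

public class DedupSketch {
    // A simplified record for an indexed page: URL, content hash, score.
    record Doc(String url, String hash, float score) {}

    // Keep only the highest-scoring document per content hash.
    public static List<Doc> dedupByHash(List<Doc> docs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc cur = best.get(d.hash());
            if (cur == null || d.score() > cur.score()) best.put(d.hash(), d);
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("http://a/", "h1", 1.0f),
            new Doc("http://a.mirror/", "h1", 0.5f), // same content, lower score
            new Doc("http://b/", "h2", 0.8f));
        System.out.println(dedupByHash(docs).size()); // prints 2
    }
}
```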
2009-05-11 10:38:53,671 INFO DeleteDuplicates - Dedup: starting
2009-05-11 10:39:32,890 INFO DeleteDuplicates - Dedup: adding indexes in: 20090508/indexes
2009-05-11 10:39:57,265 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:40:09,015 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:40:09,218 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:40:51,890 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:42:56,203 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:42:56,265 INFO MapTask - numReduceTasks: 1
2009-05-11 10:42:56,265 INFO MapTask - io.sort.mb = 100
2009-05-11 10:42:56,390 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:42:56,390 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:42:56,515 INFO MapTask - Starting flush of map output
2009-05-11 10:42:56,718 INFO MapTask - Finished spill 0
2009-05-11 10:42:56,718 INFO TaskRunner - Task:attempt_local_0009_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:42:56,718 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000
2009-05-11 10:42:56,718 INFO TaskRunner - Task 'attempt_local_0009_m_000000_0' done.
2009-05-11 10:42:56,734 INFO LocalJobRunner -
2009-05-11 10:42:56,734 INFO Merger - Merging 1 sorted segments
2009-05-11 10:42:56,734 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 141 bytes
2009-05-11 10:42:56,734 INFO LocalJobRunner -
2009-05-11 10:42:56,781 INFO TaskRunner - Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:42:56,781 INFO LocalJobRunner -
2009-05-11 10:42:56,781 INFO TaskRunner - Task attempt_local_0009_r_000000_0 is allowed to commit now
2009-05-11 10:42:56,796 INFO FileOutputCommitter - Saved output of task 'attempt_local_0009_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809
2009-05-11 10:42:56,796 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:42:56,796 INFO TaskRunner - Task 'attempt_local_0009_r_000000_0' done.
2009-05-11 10:43:06,515 INFO JobClient - Running job: job_local_0009
2009-05-11 10:43:14,500 INFO JobClient - Job complete: job_local_0009
2009-05-11 10:43:16,296 INFO JobClient - Counters: 11
2009-05-11 10:43:16,296 INFO JobClient - File Systems
2009-05-11 10:43:16,296 INFO JobClient - Local bytes read=710951
2009-05-11 10:43:16,296 INFO JobClient - Local bytes written=1220879
2009-05-11 10:43:16,296 INFO JobClient - Map-Reduce Framework
2009-05-11 10:43:16,296 INFO JobClient - Reduce input groups=1
2009-05-11 10:43:16,296 INFO JobClient - Combine output records=0
2009-05-11 10:43:16,296 INFO JobClient - Map input records=1
2009-05-11 10:43:16,296 INFO JobClient - Reduce output records=1
2009-05-11 10:43:16,296 INFO JobClient - Map output bytes=137
2009-05-11 10:43:16,296 INFO JobClient - Map input bytes=2147483647
2009-05-11 10:43:16,296 INFO JobClient - Combine input records=0
2009-05-11 10:43:16,296 INFO JobClient - Map output records=1
2009-05-11 10:43:16,296 INFO JobClient - Reduce input records=1
2009-05-11 10:44:37,734 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:44:45,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:44:46,140 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:44:48,781 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:45:46,546 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:45:46,609 INFO MapTask - numReduceTasks: 1
2009-05-11 10:45:46,609 INFO MapTask - io.sort.mb = 100
2009-05-11 10:45:46,718 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:45:46,718 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:45:46,734 INFO MapTask - Starting flush of map output
2009-05-11 10:45:46,953 INFO MapTask - Finished spill 0
2009-05-11 10:45:46,953 INFO TaskRunner - Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:45:46,953 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809/part-00000:0+247
2009-05-11 10:45:46,953 INFO TaskRunner - Task 'attempt_local_0010_m_000000_0' done.
2009-05-11 10:45:46,968 INFO LocalJobRunner -
2009-05-11 10:45:46,968 INFO Merger - Merging 1 sorted segments
2009-05-11 10:45:46,968 INFO Merger - Down to the last merge-pass, with 1 segments left of total size: 137 bytes
2009-05-11 10:45:46,968 INFO LocalJobRunner -
2009-05-11 10:45:47,015 INFO TaskRunner - Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:45:47,015 INFO LocalJobRunner -
2009-05-11 10:45:47,015 INFO TaskRunner - Task attempt_local_0010_r_000000_0 is allowed to commit now
2009-05-11 10:45:47,015 INFO FileOutputCommitter - Saved output of task 'attempt_local_0010_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517
2009-05-11 10:45:47,015 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:45:47,015 INFO TaskRunner - Task 'attempt_local_0010_r_000000_0' done.
2009-05-11 10:45:52,187 INFO JobClient - Running job: job_local_0010
2009-05-11 10:46:03,984 INFO JobClient - Job complete: job_local_0010
2009-05-11 10:46:06,359 INFO JobClient - Counters: 11
2009-05-11 10:46:06,359 INFO JobClient - File Systems
2009-05-11 10:46:06,359 INFO JobClient - Local bytes read=764171
2009-05-11 10:46:06,359 INFO JobClient - Local bytes written=1327019
2009-05-11 10:46:06,359 INFO JobClient - Map-Reduce Framework
2009-05-11 10:46:06,359 INFO JobClient - Reduce input groups=1
2009-05-11 10:46:06,359 INFO JobClient - Combine output records=0
2009-05-11 10:46:06,359 INFO JobClient - Map input records=1
2009-05-11 10:46:06,359 INFO JobClient - Reduce output records=0
2009-05-11 10:46:06,359 INFO JobClient - Map output bytes=133
2009-05-11 10:46:06,359 INFO JobClient - Map input bytes=141
2009-05-11 10:46:06,359 INFO JobClient - Combine input records=0
2009-05-11 10:46:06,359 INFO JobClient - Map output records=1
2009-05-11 10:46:06,359 INFO JobClient - Reduce input records=1
2009-05-11 10:47:19,953 INFO JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2009-05-11 10:47:19,953 WARN JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-05-11 10:47:20,140 WARN JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2009-05-11 10:47:20,156 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:47:20,765 INFO JobClient - Running job: job_local_0011
2009-05-11 10:47:20,765 INFO FileInputFormat - Total input paths to process : 1
2009-05-11 10:47:20,796 INFO MapTask - numReduceTasks: 1
2009-05-11 10:47:20,796 INFO MapTask - io.sort.mb = 100
2009-05-11 10:47:20,921 INFO MapTask - data buffer = 79691776/99614720
2009-05-11 10:47:20,921 INFO MapTask - record buffer = 262144/327680
2009-05-11 10:47:20,937 INFO MapTask - Starting flush of map output
2009-05-11 10:47:21,140 INFO MapTask - Index: (0, 2, 6)
2009-05-11 10:47:21,140 INFO TaskRunner - Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting
2009-05-11 10:47:21,140 INFO LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517/part-00000:0+103
2009-05-11 10:47:21,140 INFO TaskRunner - Task 'attempt_local_0011_m_000000_0' done.
2009-05-11 10:47:21,156 INFO LocalJobRunner -
2009-05-11 10:47:21,156 INFO Merger - Merging 1 sorted segments
2009-05-11 10:47:21,156 INFO Merger - Down to the last merge-pass, with 0 segments left of total size: 0 bytes
2009-05-11 10:47:21,156 INFO LocalJobRunner -
2009-05-11 10:47:21,171 INFO TaskRunner - Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting
2009-05-11 10:47:21,171 INFO LocalJobRunner - reduce > reduce
2009-05-11 10:47:21,171 INFO TaskRunner - Task 'attempt_local_0011_r_000000_0' done.
2009-05-11 10:47:21,765 INFO JobClient - Job complete: job_local_0011
2009-05-11 10:47:21,765 INFO JobClient - Counters: 11
2009-05-11 10:47:21,765 INFO JobClient - File Systems
2009-05-11 10:47:21,765 INFO JobClient - Local bytes read=816128
2009-05-11 10:47:21,765 INFO JobClient - Local bytes written=1430954
2009-05-11 10:47:21,765 INFO JobClient - Map-Reduce Framework
2009-05-11 10:47:21,765 INFO JobClient - Reduce input groups=0
2009-05-11 10:47:21,765 INFO JobClient - Combine output records=0
2009-05-11 10:47:21,765 INFO JobClient - Map input records=0
2009-05-11 10:47:21,765 INFO JobClient - Reduce output records=0
2009-05-11 10:47:21,765 INFO JobClient - Map output bytes=0
2009-05-11 10:47:21,765 INFO JobClient - Map input bytes=0
2009-05-11 10:47:21,765 INFO JobClient - Combine input records=0
2009-05-11 10:47:21,765 INFO JobClient - Map output records=0
2009-05-11 10:47:21,765 INFO JobClient - Reduce input records=0
2009-05-11 10:47:44,031 INFO DeleteDuplicates - Dedup: done
Description: merge the index files.
A temporary folder, 20090511094057, is first created under tmp/hadoop-Administrator/mapred/local/crawl; an index built from the data in indexes is added to the merge-output directory beneath it. The fs.completeLocalOutput method then writes the index from the temporary directory into the newly created index directory.
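This "build in a temporary directory, then publish" pattern can be sketched with java.nio.file (paths and file names here are hypothetical; the real fs.completeLocalOutput copies a finished local working directory to the destination filesystem):

```java
import java.io.IOException;
import java.nio.file.*;

public class CompleteLocalOutputSketch {
    // Write results into a temporary working directory first, then move
    // the whole directory into its final place, so the final directory
    // only ever appears in a complete state.
    public static void publish(Path tmpWork, Path finalDir, String content) throws IOException {
        Files.createDirectories(tmpWork);
        Files.writeString(tmpWork.resolve("merge-output.txt"), content); // hypothetical file
        Files.move(tmpWork, finalDir); // publish: rename temp dir to final dir
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("crawl");
        publish(base.resolve("tmpwork"), base.resolve("index"), "merged index data");
        System.out.println(Files.exists(base.resolve("index").resolve("merge-output.txt")));
    }
}
```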
2009-05-11 10:53:56,156 INFO IndexMerger - merging indexes to: 20090508/index
2009-05-11 10:58:50,906 INFO IndexMerger - Adding file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000
2009-05-11 11:04:36,562 INFO IndexMerger - done merging