


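1.1 Crawl directory analysis


•         crawldb: stores the downloaded URLs together with their download dates, which are used to decide when a page should be checked for updates.

•         linkdb: stores the link relationships between URLs, obtained by analysis after the download completes.

•         segments: stores the fetched pages. The number of subdirectories depends on the crawl depth: typically each level of pages is stored in its own subdirectory, named with a timestamp for easier management. For example, I crawled only one level here, so only the 20090508173137 directory was generated. Each subdirectory in turn contains six subfolders:

➢         content: the content of each downloaded page.

➢         crawl_fetch: the fetch status of each downloaded URL.

➢         crawl_generate: the set of URLs waiting to be fetched.

➢         crawl_parse: the outlink URLs used to update the crawldb.

➢         parse_data: the outlinks and metadata parsed from each URL.

➢         parse_text: the parsed text content of each URL.

•         indexes: stores the independent index produced by each download round.

•         index: a Lucene-format index directory, the complete index obtained by merging all the indexes under indexes.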
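
As a small illustration of the layout above, the following sketch parses a segment directory name into a timestamp and lists the six expected subfolders. It does not call Nutch; the class name SegmentLayout is hypothetical, and the yyyyMMddHHmmss naming pattern is an assumption inferred from the 20090508173137 example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: models the segment layout described above.
final class SegmentLayout {
    // Assumption: segment names follow the yyyyMMddHHmmss pattern, e.g. 20090508173137.
    private static final DateTimeFormatter SEGMENT_NAME =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // The six subfolders found inside each segment directory.
    static final List<String> SUBDIRS = Arrays.asList(
            "content", "crawl_fetch", "crawl_generate",
            "crawl_parse", "parse_data", "parse_text");

    // Converts a segment directory name into the time it encodes.
    static LocalDateTime parseSegmentName(String name) {
        return LocalDateTime.parse(name, SEGMENT_NAME);
    }

    public static void main(String[] args) {
        String segment = "20090508173137";
        System.out.println(segment + " encodes " + parseSegmentName(segment));
        for (String dir : SUBDIRS) {
            System.out.println("  segments/" + segment + "/" + dir);
        }
    }
}
```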

1.2 Crawl process overview


1.  nutch.crawl.Injector

2.  nutch.crawl.Generator

3.  nutch.fetcher.Fetcher

4.  nutch.parse.ParseSegment

5.  nutch.crawl.CrawlDb

6.  nutch.crawl.LinkDb

7.  nutch.indexer.Indexer

8.  nutch.indexer.DeleteDuplicates

9.  nutch.indexer.IndexMerger
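
The order of the nine steps can be sketched as a simple plan. This is only an illustration and does not invoke Nutch: the class name CrawlPlan and the step labels are hypothetical, and the assumption that generate, fetch, parse and update repeat once per crawl level (while the other steps run once) is mine, based on the one-segment-per-level layout described in 1.1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch code: lists the crawl steps in execution order.
final class CrawlPlan {
    // Returns step labels for a crawl of the given depth.
    // Assumption: generate/fetch/parse/update run once per level.
    static List<String> steps(int depth) {
        List<String> plan = new ArrayList<>();
        plan.add("inject");          // seed the crawldb from the URL list
        for (int level = 1; level <= depth; level++) {
            plan.add("generate");    // pick URLs to fetch, create a segment
            plan.add("fetch");       // download the pages
            plan.add("parse");       // extract text and outlinks
            plan.add("update");      // merge results back into the crawldb
        }
        plan.add("invert");          // build the linkdb
        plan.add("index");           // build per-segment indexes
        plan.add("dedup");           // delete duplicate documents
        plan.add("merge");           // merge into the final index
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(steps(1));
    }
}
```

With depth 1 the plan has exactly the nine steps listed above; each extra level adds four more.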


1.3  Crawl process analysis

1.3.1   The inject method






2009-05-08 15:41:36,640 INFO  Injector - Injector: starting

2009-05-08 15:41:37,031 INFO  Injector - Injector: crawlDb: 20090508/crawldb

2009-05-08 15:41:37,781 INFO  Injector - Injector: urlDir: urls

2009-05-08 15:52:41,734 INFO  Injector - Injector: Converting injected urls to crawl db entries.

2009-05-08 15:56:22,203 INFO  JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=

2009-05-08 16:08:20,796 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-08 16:08:20,984 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-08 16:24:42,593 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 16:38:29,437 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 16:38:29,546 INFO  MapTask - numReduceTasks: 1

2009-05-08 16:38:29,562 INFO  MapTask - io.sort.mb = 100

2009-05-08 16:38:29,687 INFO  MapTask - data buffer = 79691776/99614720

2009-05-08 16:38:29,687 INFO  MapTask - record buffer = 262144/327680

2009-05-08 16:38:29,718 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

2009-05-08 16:38:29,921 INFO  PluginRepository - Plugin Auto-activation mode: [true]

2009-05-08 16:38:29,921 INFO  PluginRepository - Registered Plugins:

2009-05-08 16:38:29,921 INFO  PluginRepository - the nutch core extension points (nutch-extensionpoints)

2009-05-08 16:38:29,921 INFO  PluginRepository - Basic Query Filter (query-basic)

2009-05-08 16:38:29,921 INFO  PluginRepository - Basic URL Normalizer (urlnormalizer-basic)

2009-05-08 16:38:29,921 INFO  PluginRepository - Basic Indexing Filter (index-basic)

2009-05-08 16:38:29,921 INFO  PluginRepository - Html Parse Plug-in (parse-html)

2009-05-08 16:38:29,921 INFO  PluginRepository - Site Query Filter (query-site)

2009-05-08 16:38:29,921 INFO  PluginRepository - Basic Summarizer Plug-in (summary-basic)

2009-05-08 16:38:29,921 INFO  PluginRepository - HTTP Framework (lib-http)

2009-05-08 16:38:29,921 INFO  PluginRepository - Text Parse Plug-in (parse-text)

2009-05-08 16:38:29,921 INFO  PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)

2009-05-08 16:38:29,921 INFO  PluginRepository - Regex URL Filter (urlfilter-regex)

2009-05-08 16:38:29,921 INFO  PluginRepository - Http Protocol Plug-in (protocol-http)

2009-05-08 16:38:29,921 INFO  PluginRepository - XML Response Writer Plug-in (response-xml)

2009-05-08 16:38:29,921 INFO  PluginRepository - Regex URL Normalizer (urlnormalizer-regex)

2009-05-08 16:38:29,921 INFO  PluginRepository - OPIC Scoring Plug-in (scoring-opic)

2009-05-08 16:38:29,921 INFO  PluginRepository - CyberNeko HTML Parser (lib-nekohtml)

2009-05-08 16:38:29,921 INFO  PluginRepository - Anchor Indexing Filter (index-anchor)

2009-05-08 16:38:29,921 INFO  PluginRepository - JavaScript Parser (parse-js)

2009-05-08 16:38:29,921 INFO  PluginRepository - URL Query Filter (query-url)

2009-05-08 16:38:29,921 INFO  PluginRepository - Regex URL Filter Framework (lib-regex-filter)

2009-05-08 16:38:29,921 INFO  PluginRepository - JSON Response Writer Plug-in (response-json)

2009-05-08 16:38:29,921 INFO  PluginRepository - Registered Extension-Points:

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)

2009-05-08 16:38:29,921 INFO  PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch URL Normalizer (

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch URL Filter (

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)

2009-05-08 16:38:29,921 INFO  PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)

2009-05-08 16:38:29,921 INFO  PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)

2009-05-08 16:38:29,968 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-08 16:38:29,984 WARN  RegexURLNormalizer - can't find rules for scope 'inject', using default

2009-05-08 16:38:29,984 INFO  MapTask - Starting flush of map output

2009-05-08 16:38:30,203 INFO  MapTask - Finished spill 0

2009-05-08 16:38:30,203 INFO  TaskRunner - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

2009-05-08 16:38:30,218 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/urls/site.txt:0+19

2009-05-08 16:38:30,218 INFO  TaskRunner - Task 'attempt_local_0001_m_000000_0' done.

2009-05-08 16:38:30,234 INFO  LocalJobRunner -

2009-05-08 16:38:30,250 INFO  Merger - Merging 1 sorted segments

2009-05-08 16:38:30,265 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes

2009-05-08 16:38:30,265 INFO  LocalJobRunner -

2009-05-08 16:38:30,390 INFO  TaskRunner - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting

2009-05-08 16:38:30,390 INFO  LocalJobRunner -

2009-05-08 16:38:30,390 INFO  TaskRunner - Task attempt_local_0001_r_000000_0 is allowed to commit now

2009-05-08 16:38:30,406 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0001_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304

2009-05-08 16:38:30,406 INFO  LocalJobRunner - reduce > reduce

2009-05-08 16:38:30,406 INFO  TaskRunner - Task 'attempt_local_0001_r_000000_0' done.



Job: job_local_0001

file: file:/tmp/hadoop-Administrator/mapred/system/job_local_0001/job.xml

tracking URL: http://localhost:8080/

2009-05-08 16:47:14,093 INFO  JobClient - Running job: job_local_0001

2009-05-08 16:49:51,859 INFO  JobClient - Job complete: job_local_0001

2009-05-08 16:51:36,062 INFO  JobClient - Counters: 11

2009-05-08 16:51:36,062 INFO  JobClient -   File Systems

2009-05-08 16:51:36,062 INFO  JobClient -     Local bytes read=51591

2009-05-08 16:51:36,062 INFO  JobClient -     Local bytes written=104337

2009-05-08 16:51:36,062 INFO  JobClient -   Map-Reduce Framework

2009-05-08 16:51:36,062 INFO  JobClient -     Reduce input groups=1

2009-05-08 16:51:36,062 INFO  JobClient -     Combine output records=0

2009-05-08 16:51:36,062 INFO  JobClient -     Map input records=1

2009-05-08 16:51:36,062 INFO  JobClient -     Reduce output records=1

2009-05-08 16:51:36,062 INFO  JobClient -     Map output bytes=49

2009-05-08 16:51:36,062 INFO  JobClient -     Map input bytes=19

2009-05-08 16:51:36,062 INFO  JobClient -     Combine input records=0

2009-05-08 16:51:36,062 INFO  JobClient -     Map output records=1

2009-05-08 16:51:36,062 INFO  JobClient -     Reduce input records=1







CrawlDb.install(mergeJob, crawlDb);



2009-05-08 17:03:57,250 INFO  Injector - Injector: Merging injected urls into crawl db.

2009-05-08 17:10:01,015 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-08 17:10:15,953 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-08 17:10:16,156 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-08 17:12:15,296 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 17:13:40,296 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 17:13:40,406 INFO  MapTask - numReduceTasks: 1

2009-05-08 17:13:40,406 INFO  MapTask - io.sort.mb = 100

2009-05-08 17:13:40,515 INFO  MapTask - data buffer = 79691776/99614720

2009-05-08 17:13:40,515 INFO  MapTask - record buffer = 262144/327680

2009-05-08 17:13:40,546 INFO  MapTask - Starting flush of map output

2009-05-08 17:13:40,765 INFO  MapTask - Finished spill 0

2009-05-08 17:13:40,765 INFO  TaskRunner - Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting

2009-05-08 17:13:40,765 INFO  LocalJobRunner - file:/tmp/hadoop-Administrator/mapred/temp/inject-temp-474192304/part-00000:0+143

2009-05-08 17:13:40,765 INFO  TaskRunner - Task 'attempt_local_0002_m_000000_0' done.

2009-05-08 17:13:40,796 INFO  LocalJobRunner -

2009-05-08 17:13:40,796 INFO  Merger - Merging 1 sorted segments

2009-05-08 17:13:40,796 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 53 bytes

2009-05-08 17:13:40,796 INFO  LocalJobRunner -

2009-05-08 17:13:40,906 WARN  NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2009-05-08 17:13:40,906 INFO  CodecPool - Got brand-new compressor

2009-05-08 17:13:40,906 INFO  TaskRunner - Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting

2009-05-08 17:13:40,906 INFO  LocalJobRunner -

2009-05-08 17:13:40,906 INFO  TaskRunner - Task attempt_local_0002_r_000000_0 is allowed to commit now

2009-05-08 17:13:40,921 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0002_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/1896567745

2009-05-08 17:13:40,921 INFO  LocalJobRunner - reduce > reduce

2009-05-08 17:13:40,937 INFO  TaskRunner - Task 'attempt_local_0002_r_000000_0' done.

2009-05-08 17:13:46,781 INFO  JobClient - Running job: job_local_0002

2009-05-08 17:14:55,125 INFO  JobClient - Job complete: job_local_0002

2009-05-08 17:14:59,328 INFO  JobClient - Counters: 11

2009-05-08 17:14:59,328 INFO  JobClient -   File Systems

2009-05-08 17:14:59,328 INFO  JobClient -     Local bytes read=103875

2009-05-08 17:14:59,328 INFO  JobClient -     Local bytes written=209385

2009-05-08 17:14:59,328 INFO  JobClient -   Map-Reduce Framework

2009-05-08 17:14:59,328 INFO  JobClient -     Reduce input groups=1

2009-05-08 17:14:59,328 INFO  JobClient -     Combine output records=0

2009-05-08 17:14:59,328 INFO  JobClient -     Map input records=1

2009-05-08 17:14:59,328 INFO  JobClient -     Reduce output records=1

2009-05-08 17:14:59,328 INFO  JobClient -     Map output bytes=49

2009-05-08 17:14:59,328 INFO  JobClient -     Map input bytes=57

2009-05-08 17:14:59,328 INFO  JobClient -     Combine input records=0

2009-05-08 17:14:59,328 INFO  JobClient -     Map output records=1

2009-05-08 17:14:59,328 INFO  JobClient -     Reduce input records=1

2009-05-08 17:17:30,984 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-08 17:20:02,390 INFO  Injector - Injector: done
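
In the log above, the merge job writes its output to a numbered directory under crawldb (20090508/crawldb/1896567745), while later jobs read from crawldb/current. The swap in between can be sketched with plain java.nio.file calls. This is not the Nutch implementation, which works on Hadoop's FileSystem: the class name CrawlDbInstallSketch is hypothetical, and keeping the previous copy under "old" is an assumption of this sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not Nutch code: swaps a freshly written db into "current".
final class CrawlDbInstallSketch {
    static void install(Path newDb, Path crawlDb) throws IOException {
        Path current = crawlDb.resolve("current");
        Path old = crawlDb.resolve("old");
        if (Files.exists(current)) {
            deleteRecursively(old);       // drop the backup from the previous round
            Files.move(current, old);     // keep the previous db as "old" (assumption)
        }
        Files.move(newDb, current);       // the new output becomes "current"
    }

    // Deletes a directory tree, children before parents; no-op if it is absent.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Writing to a separate directory first and renaming afterwards means readers never see a half-written crawldb.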

1.3.2   The generate method


LockUtil.createLockFile(fs, lock, force);



2009-05-08 17:37:18,218 INFO  Generator - Generator: Selecting best-scoring urls due for fetch.

2009-05-08 17:37:18,625 INFO  Generator - Generator: starting

2009-05-08 17:37:18,937 INFO  Generator - Generator: segment: 20090508/segments/20090508173137

2009-05-08 17:37:19,468 INFO  Generator - Generator: filtering: true

2009-05-08 17:37:22,312 INFO  Generator - Generator: topN: 50

2009-05-08 17:37:51,203 INFO  Generator - Generator: jobtracker is 'local', generating exactly one partition.

2009-05-08 17:39:57,609 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-08 17:40:05,234 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-08 17:40:05,406 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-08 17:40:05,437 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 17:40:06,062 INFO  FileInputFormat - Total input paths to process : 1

2009-05-08 17:40:06,109 INFO  MapTask - numReduceTasks: 1


2009-05-08 17:40:06,312 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-08 17:40:06,343 INFO  FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

2009-05-08 17:40:06,343 INFO  AbstractFetchSchedule - defaultInterval=2592000

2009-05-08 17:40:06,343 INFO  AbstractFetchSchedule - maxInterval=7776000

2009-05-08 17:40:06,343 INFO  MapTask - io.sort.mb = 100

2009-05-08 17:40:06,437 INFO  MapTask - data buffer = 79691776/99614720

2009-05-08 17:40:06,437 INFO  MapTask - record buffer = 262144/327680

2009-05-08 17:40:06,453 WARN  RegexURLNormalizer - can't find rules for scope 'partition', using default

2009-05-08 17:40:06,453 INFO  MapTask - Starting flush of map output

2009-05-08 17:40:06,625 INFO  MapTask - Finished spill 0

2009-05-08 17:40:06,640 INFO  TaskRunner - Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting

2009-05-08 17:40:06,640 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143

2009-05-08 17:40:06,640 INFO  TaskRunner - Task 'attempt_local_0003_m_000000_0' done.

2009-05-08 17:40:06,656 INFO  LocalJobRunner -

2009-05-08 17:40:06,656 INFO  Merger - Merging 1 sorted segments

2009-05-08 17:40:06,656 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 78 bytes

2009-05-08 17:40:06,656 INFO  LocalJobRunner -


2009-05-08 17:40:06,875 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-08 17:40:06,906 INFO  FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

2009-05-08 17:40:06,906 INFO  AbstractFetchSchedule - defaultInterval=2592000

2009-05-08 17:40:06,906 INFO  AbstractFetchSchedule - maxInterval=7776000

2009-05-08 17:40:06,906 WARN  RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default

2009-05-08 17:40:06,906 INFO  TaskRunner - Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting

2009-05-08 17:40:06,906 INFO  LocalJobRunner -

2009-05-08 17:40:06,906 INFO  TaskRunner - Task attempt_local_0003_r_000000_0 is allowed to commit now

2009-05-08 17:40:06,906 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0003_r_000000_0' to file:/tmp/hadoop-Administrator/mapred/temp/generate-temp-1241774893937

2009-05-08 17:40:06,921 INFO  LocalJobRunner - reduce > reduce

2009-05-08 17:40:06,921 INFO  TaskRunner - Task 'attempt_local_0003_r_000000_0' done.

2009-05-08 17:40:21,468 INFO  JobClient - Running job: job_local_0003

2009-05-08 17:40:31,671 INFO  JobClient - Job complete: job_local_0003

2009-05-08 17:40:34,046 INFO  JobClient - Counters: 11

2009-05-08 17:40:34,046 INFO  JobClient -   File Systems

2009-05-08 17:40:34,046 INFO  JobClient -     Local bytes read=157400

2009-05-08 17:40:34,046 INFO  JobClient -     Local bytes written=316982

2009-05-08 17:40:34,046 INFO  JobClient -   Map-Reduce Framework

2009-05-08 17:40:34,046 INFO  JobClient -     Reduce input groups=1

2009-05-08 17:40:34,046 INFO  JobClient -     Combine output records=0

2009-05-08 17:40:34,046 INFO  JobClient -     Map input records=1

2009-05-08 17:40:34,046 INFO  JobClient -     Reduce output records=1

2009-05-08 17:40:34,046 INFO  JobClient -     Map output bytes=74

2009-05-08 17:40:34,046 INFO  JobClient -     Map input bytes=57

2009-05-08 17:40:34,046 INFO  JobClient -     Combine input records=0

2009-05-08 17:40:34,046 INFO  JobClient -     Map output records=1

2009-05-08 17:40:34,046 INFO  JobClient -     Reduce input records=1
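
The log lines "Selecting best-scoring urls due for fetch" and "topN: 50" suggest a selection that can be sketched as: keep the entries whose fetch time has arrived, sort them by score, and take the first topN. The code below is a simplified in-memory illustration, not the Generator's MapReduce job; the Entry type, its fields and the example URLs are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, not Nutch code: in-memory version of a topN selection.
final class TopNSelectionSketch {
    // Simplified stand-in for a crawldb record.
    static final class Entry {
        final String url;
        final float score;
        final long fetchTimeMs;   // earliest time the URL may be fetched

        Entry(String url, float score, long fetchTimeMs) {
            this.url = url;
            this.score = score;
            this.fetchTimeMs = fetchTimeMs;
        }
    }

    // Returns up to topN URLs that are due at nowMs, highest score first.
    static List<String> select(List<Entry> db, long nowMs, int topN) {
        return db.stream()
                .filter(e -> e.fetchTimeMs <= nowMs)
                .sorted(Comparator.comparingDouble((Entry e) -> e.score).reversed())
                .limit(topN)
                .map(e -> e.url)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> db = Arrays.asList(
                new Entry("http://example.com/a", 1.0f, 0L),
                new Entry("http://example.com/b", 2.5f, 0L),
                new Entry("http://example.com/c", 9.0f, Long.MAX_VALUE)); // not yet due
        System.out.println(select(db, System.currentTimeMillis(), 50));
    }
}
```

If the defaultInterval=2592000 value in the log is in seconds, it corresponds to 30 days, and maxInterval=7776000 to 90 days; a fetched page would then become due again only after that interval.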



1.3.3   The fetch method


2009-05-11 09:45:13,984 WARN  Fetcher - Fetcher: Your '' value should be listed first in 'http.robots.agents' property.

2009-05-11 09:45:34,796 INFO  Fetcher - Fetcher: starting

2009-05-11 09:45:35,375 INFO  Fetcher - Fetcher: segment: 20090508/segments/20090511094102

2009-05-11 09:49:23,984 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 09:49:58,046 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 09:49:58,234 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 09:49:58,265 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 09:49:58,859 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 09:49:58,906 INFO  MapTask - numReduceTasks: 1

2009-05-11 09:49:58,906 INFO  MapTask - io.sort.mb = 100

2009-05-11 09:49:59,015 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 09:49:59,015 INFO  MapTask - record buffer = 262144/327680

2009-05-11 09:49:59,140 INFO  Fetcher - Fetcher: threads: 5

2009-05-11 09:49:59,140 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins

2009-05-11 09:49:59,250 INFO  Fetcher - QueueFeeder finished: total 1 records.


2009-05-11 09:49:59,312 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 09:49:59,328 INFO  Configuration - found resource parse-plugins.xml at file:/D:/work/workspace/nutch_crawl/bin/parse-plugins.xml

2009-05-11 09:49:59,359 INFO  Fetcher - fetching

2009-05-11 09:49:59,375 INFO  Fetcher - -finishing thread FetcherThread, activeThreads=4

2009-05-11 09:49:59,375 INFO  Fetcher - -finishing thread FetcherThread, activeThreads=3

2009-05-11 09:49:59,375 INFO  Fetcher - -finishing thread FetcherThread, activeThreads=2

2009-05-11 09:49:59,375 INFO  Fetcher - -finishing thread FetcherThread, activeThreads=1

2009-05-11 09:49:59,421 INFO  Http - = null

2009-05-11 09:49:59,421 INFO  Http - http.proxy.port = 8080

2009-05-11 09:49:59,421 INFO  Http - http.timeout = 10000

2009-05-11 09:49:59,421 INFO  Http - http.content.limit = 65536

2009-05-11 09:49:59,421 INFO  Http - http.agent = nutch/Nutch-1.0 (chinahui;;

2009-05-11 09:49:59,421 INFO  Http - protocol.plugin.check.blocking = false

2009-05-11 09:49:59,421 INFO  Http - protocol.plugin.check.robots = false

2009-05-11 09:50:00,109 INFO  Configuration - found resource tika-mimetypes.xml at file:/D:/work/workspace/nutch_crawl/bin/tika-mimetypes.xml

2009-05-11 09:50:00,156 WARN  ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml

2009-05-11 09:50:00,375 INFO  Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0

2009-05-11 09:50:00,671 INFO  SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

2009-05-11 09:50:00,687 INFO  Fetcher - -finishing thread FetcherThread, activeThreads=0

2009-05-11 09:50:01,375 INFO  Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0

2009-05-11 09:50:01,375 INFO  Fetcher - -activeThreads=0

2009-05-11 09:50:01,375 INFO  MapTask - Starting flush of map output

2009-05-11 09:50:01,578 INFO  MapTask - Finished spill 0

2009-05-11 09:50:01,578 INFO  TaskRunner - Task:attempt_local_0005_m_000000_0 is done. And is in the process of commiting

2009-05-11 09:50:01,578 INFO  LocalJobRunner - 0 threads, 1 pages, 0 errors, 0.5 pages/s, 256 kb/s,

2009-05-11 09:50:01,578 INFO  TaskRunner - Task 'attempt_local_0005_m_000000_0' done.

2009-05-11 09:50:01,593 INFO  LocalJobRunner -

2009-05-11 09:50:01,593 INFO  Merger - Merging 1 sorted segments

2009-05-11 09:50:01,593 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 72558 bytes

2009-05-11 09:50:01,593 INFO  LocalJobRunner -

2009-05-11 09:50:01,671 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:01,734 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:01,765 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:01,765 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 09:50:01,921 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 09:50:01,984 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:02,015 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:02,062 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:02,093 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:02,125 INFO  CodecPool - Got brand-new compressor

2009-05-11 09:50:02,140 WARN  RegexURLNormalizer - can't find rules for scope 'outlink', using default

2009-05-11 09:50:02,171 INFO  TaskRunner - Task:attempt_local_0005_r_000000_0 is done. And is in the process of commiting

2009-05-11 09:50:02,171 INFO  LocalJobRunner - reduce > reduce

2009-05-11 09:50:02,187 INFO  TaskRunner - Task 'attempt_local_0005_r_000000_0' done.

2009-05-11 09:50:44,062 INFO  JobClient - Running job: job_local_0005

2009-05-11 09:51:31,328 INFO  JobClient - Job complete: job_local_0005

2009-05-11 09:51:32,984 INFO  JobClient - Counters: 11

2009-05-11 09:51:33,000 INFO  JobClient -   File Systems

2009-05-11 09:51:33,000 INFO  JobClient -     Local bytes read=336424

2009-05-11 09:51:33,000 INFO  JobClient -     Local bytes written=700394

2009-05-11 09:51:33,000 INFO  JobClient -   Map-Reduce Framework

2009-05-11 09:51:33,000 INFO  JobClient -     Reduce input groups=1

2009-05-11 09:51:33,000 INFO  JobClient -     Combine output records=0

2009-05-11 09:51:33,000 INFO  JobClient -     Map input records=1

2009-05-11 09:51:33,000 INFO  JobClient -     Reduce output records=3

2009-05-11 09:51:33,000 INFO  JobClient -     Map output bytes=72545

2009-05-11 09:51:33,000 INFO  JobClient -     Map input bytes=78

2009-05-11 09:51:33,000 INFO  JobClient -     Combine input records=0

2009-05-11 09:51:33,000 INFO  JobClient -     Map output records=3

2009-05-11 09:51:33,000 INFO  JobClient -     Reduce input records=3

2009-05-11 09:51:47,750 INFO  Fetcher - Fetcher: done

1.3.4   The parse method


1.3.5   The update method


2009-05-11 10:04:20,890 INFO  CrawlDb - CrawlDb update: starting

2009-05-11 10:04:22,500 INFO  CrawlDb - CrawlDb update: db: 20090508/crawldb

2009-05-11 10:05:53,593 INFO  CrawlDb - CrawlDb update: segments: [20090508/segments/20090511094102]

2009-05-11 10:06:06,031 INFO  CrawlDb - CrawlDb update: additions allowed: true

2009-05-11 10:06:07,296 INFO  CrawlDb - CrawlDb update: URL normalizing: true

2009-05-11 10:06:09,031 INFO  CrawlDb - CrawlDb update: URL filtering: true

2009-05-11 10:07:05,125 INFO  CrawlDb - CrawlDb update: Merging segment data into db.

2009-05-11 10:08:11,031 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:09:00,187 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:09:00,375 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:10:03,531 INFO  FileInputFormat - Total input paths to process : 3

2009-05-11 10:16:25,125 INFO  FileInputFormat - Total input paths to process : 3

2009-05-11 10:16:25,203 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:16:25,203 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:16:25,343 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:16:25,343 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:16:25,343 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:25,750 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 10:16:25,796 WARN  RegexURLNormalizer - can't find rules for scope 'crawldb', using default

2009-05-11 10:16:25,796 INFO  MapTask - Starting flush of map output

2009-05-11 10:16:25,984 INFO  MapTask - Finished spill 0

2009-05-11 10:16:26,000 INFO  TaskRunner - Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:16:26,000 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143

2009-05-11 10:16:26,000 INFO  TaskRunner - Task 'attempt_local_0006_m_000000_0' done.

2009-05-11 10:16:26,031 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:16:26,031 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:16:26,140 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:16:26,140 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:16:26,156 INFO  CodecPool - Got brand-new decompressor

2009-05-11 10:16:26,171 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:26,687 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 10:16:26,718 WARN  RegexURLNormalizer - can't find rules for scope 'crawldb', using default

2009-05-11 10:16:26,734 INFO  MapTask - Starting flush of map output

2009-05-11 10:16:26,750 INFO  MapTask - Finished spill 0

2009-05-11 10:16:26,750 INFO  TaskRunner - Task:attempt_local_0006_m_000002_0 is done. And is in the process of commiting

2009-05-11 10:16:26,750 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026

2009-05-11 10:16:26,750 INFO  TaskRunner - Task 'attempt_local_0006_m_000002_0' done.

2009-05-11 10:16:26,781 INFO  LocalJobRunner -

2009-05-11 10:16:26,781 INFO  Merger - Merging 3 sorted segments

2009-05-11 10:16:26,781 INFO  Merger - Down to the last merge-pass, with 3 segments left of total size: 3706 bytes

2009-05-11 10:16:26,781 INFO  LocalJobRunner -

2009-05-11 10:16:26,875 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:27,031 INFO  FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

2009-05-11 10:16:27,031 INFO  AbstractFetchSchedule - defaultInterval=2592000

2009-05-11 10:16:27,031 INFO  AbstractFetchSchedule - maxInterval=7776000

2009-05-11 10:16:27,046 INFO  TaskRunner - Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:16:27,046 INFO  LocalJobRunner -

2009-05-11 10:16:27,046 INFO  TaskRunner - Task attempt_local_0006_r_000000_0 is allowed to commit now

2009-05-11 10:16:27,062 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0006_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/132216774

2009-05-11 10:16:27,062 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:16:27,062 INFO  TaskRunner - Task 'attempt_local_0006_r_000000_0' done.

2009-05-11 10:17:43,984 INFO  JobClient - Running job: job_local_0006

2009-05-11 10:18:33,671 INFO  JobClient - Job complete: job_local_0006

2009-05-11 10:18:35,906 INFO  JobClient - Counters: 11

2009-05-11 10:18:35,906 INFO  JobClient -   File Systems

2009-05-11 10:18:35,906 INFO  JobClient -     Local bytes read=936164

2009-05-11 10:18:35,906 INFO  JobClient -     Local bytes written=1678861

2009-05-11 10:18:35,906 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce input groups=57

2009-05-11 10:18:35,906 INFO  JobClient -     Combine output records=0

2009-05-11 10:18:35,906 INFO  JobClient -     Map input records=63

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce output records=57

2009-05-11 10:18:35,906 INFO  JobClient -     Map output bytes=3574

2009-05-11 10:18:35,906 INFO  JobClient -     Map input bytes=4079

2009-05-11 10:18:35,906 INFO  JobClient -     Combine input records=0

2009-05-11 10:18:35,906 INFO  JobClient -     Map output records=63

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce input records=63

2009-05-11 10:19:48,078 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:22:51,437 INFO  CrawlDb - CrawlDb update: done

1.3.6   The invert method



2009-05-11 10:04:20,890 INFO  CrawlDb - CrawlDb update: starting

2009-05-11 10:04:22,500 INFO  CrawlDb - CrawlDb update: db: 20090508/crawldb

2009-05-11 10:05:53,593 INFO  CrawlDb - CrawlDb update: segments: [20090508/segments/20090511094102]

2009-05-11 10:06:06,031 INFO  CrawlDb - CrawlDb update: additions allowed: true

2009-05-11 10:06:07,296 INFO  CrawlDb - CrawlDb update: URL normalizing: true

2009-05-11 10:06:09,031 INFO  CrawlDb - CrawlDb update: URL filtering: true

2009-05-11 10:07:05,125 INFO  CrawlDb - CrawlDb update: Merging segment data into db.

2009-05-11 10:08:11,031 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:09:00,187 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:09:00,375 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:10:03,531 INFO  FileInputFormat - Total input paths to process : 3

2009-05-11 10:16:25,125 INFO  FileInputFormat - Total input paths to process : 3

2009-05-11 10:16:25,203 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:16:25,203 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:16:25,343 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:16:25,343 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:16:25,343 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:25,750 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 10:16:25,796 WARN  RegexURLNormalizer - can't find rules for scope 'crawldb', using default

2009-05-11 10:16:25,796 INFO  MapTask - Starting flush of map output

2009-05-11 10:16:25,984 INFO  MapTask - Finished spill 0

2009-05-11 10:16:26,000 INFO  TaskRunner - Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:16:26,000 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+143

2009-05-11 10:16:26,000 INFO  TaskRunner - Task 'attempt_local_0006_m_000000_0' done.

2009-05-11 10:16:26,031 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:16:26,031 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:16:26,140 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:16:26,140 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:16:26,156 INFO  CodecPool - Got brand-new decompressor

2009-05-11 10:16:26,171 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:26,343 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 10:16:26,359 WARN  RegexURLNormalizer - can't find rules for scope 'crawldb', using default

2009-05-11 10:16:26,359 INFO  MapTask - Starting flush of map output

2009-05-11 10:16:26,359 INFO  MapTask - Finished spill 0

2009-05-11 10:16:26,375 INFO  TaskRunner - Task:attempt_local_0006_m_000001_0 is done. And is in the process of commiting

2009-05-11 10:16:26,375 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_fetch/part-00000/data:0+254

2009-05-11 10:16:26,375 INFO  TaskRunner - Task 'attempt_local_0006_m_000001_0' done.

2009-05-11 10:16:26,406 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:16:26,406 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:16:26,515 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:16:26,515 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:16:26,531 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:26,687 INFO  Configuration - found resource crawl-urlfilter.txt at file:/D:/work/workspace/nutch_crawl/bin/crawl-urlfilter.txt

2009-05-11 10:16:26,718 WARN  RegexURLNormalizer - can't find rules for scope 'crawldb', using default

2009-05-11 10:16:26,734 INFO  MapTask - Starting flush of map output

2009-05-11 10:16:26,750 INFO  MapTask - Finished spill 0

2009-05-11 10:16:26,750 INFO  TaskRunner - Task:attempt_local_0006_m_000002_0 is done. And is in the process of commiting

2009-05-11 10:16:26,750 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026

2009-05-11 10:16:26,750 INFO  TaskRunner - Task 'attempt_local_0006_m_000002_0' done.

2009-05-11 10:16:26,781 INFO  LocalJobRunner -

2009-05-11 10:16:26,781 INFO  Merger - Merging 3 sorted segments

2009-05-11 10:16:26,781 INFO  Merger - Down to the last merge-pass, with 3 segments left of total size: 3706 bytes

2009-05-11 10:16:26,781 INFO  LocalJobRunner -

2009-05-11 10:16:26,875 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:16:27,031 INFO  FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule

2009-05-11 10:16:27,031 INFO  AbstractFetchSchedule - defaultInterval=2592000

2009-05-11 10:16:27,031 INFO  AbstractFetchSchedule - maxInterval=7776000

2009-05-11 10:16:27,046 INFO  TaskRunner - Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:16:27,046 INFO  LocalJobRunner -

2009-05-11 10:16:27,046 INFO  TaskRunner - Task attempt_local_0006_r_000000_0 is allowed to commit now

2009-05-11 10:16:27,062 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0006_r_000000_0' to file:/D:/work/workspace/nutch_crawl/20090508/crawldb/132216774

2009-05-11 10:16:27,062 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:16:27,062 INFO  TaskRunner - Task 'attempt_local_0006_r_000000_0' done.

2009-05-11 10:17:43,984 INFO  JobClient - Running job: job_local_0006

2009-05-11 10:18:33,671 INFO  JobClient - Job complete: job_local_0006

2009-05-11 10:18:35,906 INFO  JobClient - Counters: 11

2009-05-11 10:18:35,906 INFO  JobClient -   File Systems

2009-05-11 10:18:35,906 INFO  JobClient -     Local bytes read=936164

2009-05-11 10:18:35,906 INFO  JobClient -     Local bytes written=1678861

2009-05-11 10:18:35,906 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce input groups=57

2009-05-11 10:18:35,906 INFO  JobClient -     Combine output records=0

2009-05-11 10:18:35,906 INFO  JobClient -     Map input records=63

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce output records=57

2009-05-11 10:18:35,906 INFO  JobClient -     Map output bytes=3574

2009-05-11 10:18:35,906 INFO  JobClient -     Map input bytes=4079

2009-05-11 10:18:35,906 INFO  JobClient -     Combine input records=0

2009-05-11 10:18:35,906 INFO  JobClient -     Map output records=63

2009-05-11 10:18:35,906 INFO  JobClient -     Reduce input records=63

2009-05-11 10:19:48,078 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:22:51,437 INFO  CrawlDb - CrawlDb update: done
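
The JobClient counters for job_local_0006 summarise the update step: 63 records entered the map phase and the reduce phase wrote 57 records. When reading logs this long, it can help to pull the counters out programmatically. The following is only a minimal sketch and not part of Nutch; the helper name `parse_counters` and the regular expression are my own assumptions about the log layout shown above.

```python
import re

# Matches counter lines such as:
# 2009-05-11 10:18:35,906 INFO  JobClient -     Map input records=63
# Group headers ("Map-Reduce Framework") and warnings carry no "=<number>" and are skipped.
COUNTER_RE = re.compile(r"JobClient -\s+([A-Za-z][A-Za-z \-]*?)=(\d+)\s*$")


def parse_counters(lines):
    """Return {counter name: int value} for JobClient counter lines."""
    counters = {}
    for line in lines:
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


sample = [
    "2009-05-11 10:18:35,906 INFO  JobClient -   Map-Reduce Framework",
    "2009-05-11 10:18:35,906 INFO  JobClient -     Reduce input groups=57",
    "2009-05-11 10:18:35,906 INFO  JobClient -     Map input records=63",
    "2009-05-11 10:18:35,906 INFO  JobClient -     Reduce output records=57",
]
counters = parse_counters(sample)
print(counters)
```

Feeding a whole log file through `parse_counters` keeps only the last value of each counter, so for a multi-job log the lines would need to be split per job first.
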

2009-05-11 10:26:31,250 INFO  LinkDb - LinkDb: starting

2009-05-11 10:26:31,250 INFO  LinkDb - LinkDb: linkdb: 20090508/linkdb

2009-05-11 10:26:31,250 INFO  LinkDb - LinkDb: URL normalize: true

2009-05-11 10:26:31,250 INFO  LinkDb - LinkDb: URL filter: true

2009-05-11 10:26:31,281 INFO  LinkDb - LinkDb: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102

2009-05-11 10:26:31,281 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:26:31,296 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:26:31,453 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:26:31,484 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:26:32,078 INFO  JobClient - Running job: job_local_0007

2009-05-11 10:26:32,078 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:26:32,125 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:26:32,125 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:26:32,234 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:26:32,234 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:26:32,250 INFO  MapTask - Starting flush of map output

2009-05-11 10:26:32,437 INFO  MapTask - Finished spill 0

2009-05-11 10:26:32,453 INFO  TaskRunner - Task:attempt_local_0007_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:26:32,453 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382

2009-05-11 10:26:32,453 INFO  TaskRunner - Task 'attempt_local_0007_m_000000_0' done.

2009-05-11 10:26:32,468 INFO  LocalJobRunner -

2009-05-11 10:26:32,468 INFO  Merger - Merging 1 sorted segments

2009-05-11 10:26:32,468 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 3264 bytes

2009-05-11 10:26:32,468 INFO  LocalJobRunner -

2009-05-11 10:26:32,562 INFO  TaskRunner - Task:attempt_local_0007_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:26:32,562 INFO  LocalJobRunner -

2009-05-11 10:26:32,562 INFO  TaskRunner - Task attempt_local_0007_r_000000_0 is allowed to commit now

2009-05-11 10:26:32,578 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0007_r_000000_0' to file:/D:/work/workspace/nutch_crawl/linkdb-1900012851

2009-05-11 10:26:32,578 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:26:32,578 INFO  TaskRunner - Task 'attempt_local_0007_r_000000_0' done.

2009-05-11 10:26:33,078 INFO  JobClient - Job complete: job_local_0007

2009-05-11 10:26:33,078 INFO  JobClient - Counters: 11

2009-05-11 10:26:33,078 INFO  JobClient -   File Systems

2009-05-11 10:26:33,078 INFO  JobClient -     Local bytes read=535968

2009-05-11 10:26:33,078 INFO  JobClient -     Local bytes written=965231

2009-05-11 10:26:33,078 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:26:33,078 INFO  JobClient -     Reduce input groups=56

2009-05-11 10:26:33,078 INFO  JobClient -     Combine output records=56

2009-05-11 10:26:33,078 INFO  JobClient -     Map input records=1

2009-05-11 10:26:33,078 INFO  JobClient -     Reduce output records=56

2009-05-11 10:26:33,078 INFO  JobClient -     Map output bytes=3384

2009-05-11 10:26:33,078 INFO  JobClient -     Map input bytes=1254

2009-05-11 10:26:33,078 INFO  JobClient -     Combine input records=60

2009-05-11 10:26:33,078 INFO  JobClient -     Map output records=60

2009-05-11 10:26:33,078 INFO  JobClient -     Reduce input records=56

2009-05-11 10:26:33,078 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:26:33,125 INFO  LinkDb - LinkDb: done
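
The counters of job_local_0007 match the usual picture of link inversion: one parsed page went in (Map input records=1), 60 outlink records came out of the map phase, and these collapsed into 56 distinct target URLs. A rough sketch of that idea is given below. It is not Nutch's implementation; the helper `invert_links`, the sample URLs and the anchor texts are all hypothetical.

```python
from collections import defaultdict


def invert_links(pages):
    """pages: {source_url: [(target_url, anchor_text), ...]}.

    Returns {target_url: [(source_url, anchor_text), ...]}.
    """
    inlinks = defaultdict(list)
    for source, outlinks in pages.items():
        for target, anchor in outlinks:
            # "Map" step: one record per outlink, keyed by the target URL.
            entry = (source, anchor)
            # "Combine/reduce" step: records sharing a key are merged into one list.
            if entry not in inlinks[target]:
                inlinks[target].append(entry)
    return dict(inlinks)


pages = {
    "http://example.com/": [
        ("http://example.com/a", "A"),
        ("http://example.com/b", "B"),
        ("http://example.com/a", "A again"),
    ],
}
linkdb = invert_links(pages)
print(len(linkdb))  # number of distinct target URLs
```

In this toy input three outlinks lead to two distinct targets, the same kind of reduction as 60 map output records becoming 56 reduce output records in the log.
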

1.3.7   The index method




2009-05-11 10:31:22,250 INFO  Indexer - Indexer: starting

2009-05-11 10:31:45,078 INFO  IndexerMapReduce - IndexerMapReduce: crawldb: 20090508/crawldb

2009-05-11 10:31:45,078 INFO  IndexerMapReduce - IndexerMapReduce: linkdb: 20090508/linkdb

2009-05-11 10:31:45,078 INFO  IndexerMapReduce - IndexerMapReduces: adding segment: file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102

2009-05-11 10:32:30,359 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:32:34,109 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:32:34,296 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:32:34,421 INFO  FileInputFormat - Total input paths to process : 6

2009-05-11 10:32:35,078 INFO  FileInputFormat - Total input paths to process : 6

2009-05-11 10:32:35,140 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:32:35,140 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:32:35,250 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:32:35,250 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:32:35,265 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:35,937 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:35,937 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:35,953 INFO  MapTask - Starting flush of map output

2009-05-11 10:32:35,968 INFO  MapTask - Finished spill 0

2009-05-11 10:32:35,968 INFO  TaskRunner - Task:attempt_local_0008_m_000001_0 is done. And is in the process of commiting

2009-05-11 10:32:35,968 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/crawl_parse/part-00000:0+4026

2009-05-11 10:32:35,968 INFO  TaskRunner - Task 'attempt_local_0008_m_000001_0' done.

2009-05-11 10:32:36,000 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:32:36,000 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:32:36,125 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:32:36,125 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:32:36,125 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:36,281 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:36,281 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:36,281 INFO  MapTask - Starting flush of map output

2009-05-11 10:32:36,296 INFO  MapTask - Finished spill 0

2009-05-11 10:32:36,312 INFO  TaskRunner - Task:attempt_local_0008_m_000002_0 is done. And is in the process of commiting

2009-05-11 10:32:36,312 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_data/part-00000/data:0+1382

2009-05-11 10:32:36,312 INFO  TaskRunner - Task 'attempt_local_0008_m_000002_0' done.

2009-05-11 10:32:36,343 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:32:36,343 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:32:36,453 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:32:36,453 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:32:36,453 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:36,609 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:36,609 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:36,625 INFO  MapTask - Starting flush of map output

2009-05-11 10:32:36,625 INFO  MapTask - Finished spill 0

2009-05-11 10:32:36,640 INFO  TaskRunner - Task:attempt_local_0008_m_000003_0 is done. And is in the process of commiting

2009-05-11 10:32:36,640 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/segments/20090511094102/parse_text/part-00000/data:0+738

2009-05-11 10:32:36,640 INFO  TaskRunner - Task 'attempt_local_0008_m_000003_0' done.

2009-05-11 10:32:36,671 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:32:36,671 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:32:36,781 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:32:36,781 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:32:36,796 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:36,937 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:36,953 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:36,953 INFO  MapTask - Starting flush of map output

2009-05-11 10:32:36,968 INFO  MapTask - Finished spill 0

2009-05-11 10:32:36,968 INFO  TaskRunner - Task:attempt_local_0008_m_000004_0 is done. And is in the process of commiting

2009-05-11 10:32:36,968 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/crawldb/current/part-00000/data:0+3772

2009-05-11 10:32:36,968 INFO  TaskRunner - Task 'attempt_local_0008_m_000004_0' done.

2009-05-11 10:32:37,000 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:32:37,000 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:32:37,109 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:32:37,109 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:32:37,125 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:37,281 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:37,281 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:37,296 INFO  MapTask - Starting flush of map output

2009-05-11 10:32:37,296 INFO  MapTask - Finished spill 0

2009-05-11 10:32:37,312 INFO  TaskRunner - Task:attempt_local_0008_m_000005_0 is done. And is in the process of commiting

2009-05-11 10:32:37,312 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/linkdb/current/part-00000/data:0+4215

2009-05-11 10:32:37,312 INFO  TaskRunner - Task 'attempt_local_0008_m_000005_0' done.

2009-05-11 10:32:37,343 INFO  LocalJobRunner -

2009-05-11 10:32:37,359 INFO  Merger - Merging 6 sorted segments

2009-05-11 10:32:37,359 INFO  Merger - Down to the last merge-pass, with 6 segments left of total size: 13876 bytes

2009-05-11 10:32:37,359 INFO  LocalJobRunner -

2009-05-11 10:32:37,359 INFO  PluginRepository - Plugins: looking in: D:/work/workspace/nutch_crawl/bin/plugins


2009-05-11 10:32:37,515 INFO  IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter

2009-05-11 10:32:37,515 INFO  IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter

2009-05-11 10:32:37,546 INFO  Configuration - found resource common-terms.utf8 at file:/D:/work/workspace/nutch_crawl/bin/common-terms.utf8

2009-05-11 10:32:38,500 INFO  TaskRunner - Task:attempt_local_0008_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:32:38,500 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:32:38,500 INFO  TaskRunner - Task 'attempt_local_0008_r_000000_0' done.

2009-05-11 10:33:19,703 INFO  JobClient - Running job: job_local_0008

2009-05-11 10:33:50,156 INFO  JobClient - Job complete: job_local_0008

2009-05-11 10:33:52,562 INFO  JobClient - Counters: 11

2009-05-11 10:33:52,562 INFO  JobClient -   File Systems

2009-05-11 10:33:52,562 INFO  JobClient -     Local bytes read=2150441

2009-05-11 10:33:52,562 INFO  JobClient -     Local bytes written=3845733

2009-05-11 10:33:52,562 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:33:52,562 INFO  JobClient -     Reduce input groups=58

2009-05-11 10:33:52,562 INFO  JobClient -     Combine output records=0

2009-05-11 10:33:52,562 INFO  JobClient -     Map input records=177

2009-05-11 10:33:52,562 INFO  JobClient -     Reduce output records=1

2009-05-11 10:33:52,562 INFO  JobClient -     Map output bytes=13506

2009-05-11 10:33:52,562 INFO  JobClient -     Map input bytes=13661

2009-05-11 10:33:52,562 INFO  JobClient -     Combine input records=0

2009-05-11 10:33:52,562 INFO  JobClient -     Map output records=177

2009-05-11 10:33:52,562 INFO  JobClient -     Reduce input records=177

2009-05-11 10:33:57,656 INFO  Indexer - Indexer: done
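
The log shows the index job reading six input paths (the segment's crawl data, the crawldb and the linkdb), grouping 177 records into 58 URLs, and writing a single document (Reduce output records=1). That is consistent with a join by URL in which only URLs that were actually fetched and parsed are indexed. The sketch below illustrates such a join under that assumption; the function `build_documents`, the `REQUIRED` set and the sample records are hypothetical and much simpler than what Nutch really does.

```python
from collections import defaultdict

# Parts assumed to be mandatory before a URL can become an index document.
REQUIRED = {"crawl_fetch", "crawldb", "parse_data", "parse_text"}


def build_documents(records):
    """records: iterable of (url, source, value) tuples.

    Returns {url: {source: value}} for the URLs that have every required part.
    """
    by_url = defaultdict(dict)
    for url, source, value in records:
        # Counterpart of the shuffle: everything known about a URL is grouped.
        by_url[url][source] = value
    # Counterpart of the reduce: URLs missing a required part are skipped.
    return {url: parts for url, parts in by_url.items()
            if REQUIRED.issubset(parts)}


records = [
    ("http://example.com/", "crawl_fetch", "fetch success"),
    ("http://example.com/", "crawldb", "db entry"),
    ("http://example.com/", "parse_data", "title and outlinks"),
    ("http://example.com/", "parse_text", "page text"),
    ("http://example.com/a", "crawldb", "db entry, not fetched yet"),
    ("http://example.com/a", "linkdb", "inlink from http://example.com/"),
]
docs = build_documents(records)
print(list(docs))
```

Only the fetched page survives the join, which mirrors why a one-level crawl with 58 known URLs yields just one indexed document.
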

1.3.8   The dedup method


2009-05-11 10:38:53,671 INFO  DeleteDuplicates - Dedup: starting

2009-05-11 10:39:32,890 INFO  DeleteDuplicates - Dedup: adding indexes in: 20090508/indexes

2009-05-11 10:39:57,265 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:40:09,015 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:40:09,218 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:40:51,890 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:42:56,203 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:42:56,265 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:42:56,265 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:42:56,390 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:42:56,390 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:42:56,515 INFO  MapTask - Starting flush of map output

2009-05-11 10:42:56,718 INFO  MapTask - Finished spill 0

2009-05-11 10:42:56,718 INFO  TaskRunner - Task:attempt_local_0009_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:42:56,718 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000

2009-05-11 10:42:56,718 INFO  TaskRunner - Task 'attempt_local_0009_m_000000_0' done.

2009-05-11 10:42:56,734 INFO  LocalJobRunner -

2009-05-11 10:42:56,734 INFO  Merger - Merging 1 sorted segments

2009-05-11 10:42:56,734 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 141 bytes

2009-05-11 10:42:56,734 INFO  LocalJobRunner -

2009-05-11 10:42:56,781 INFO  TaskRunner - Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:42:56,781 INFO  LocalJobRunner -

2009-05-11 10:42:56,781 INFO  TaskRunner - Task attempt_local_0009_r_000000_0 is allowed to commit now

2009-05-11 10:42:56,796 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0009_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809

2009-05-11 10:42:56,796 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:42:56,796 INFO  TaskRunner - Task 'attempt_local_0009_r_000000_0' done.

2009-05-11 10:43:06,515 INFO  JobClient - Running job: job_local_0009

2009-05-11 10:43:14,500 INFO  JobClient - Job complete: job_local_0009

2009-05-11 10:43:16,296 INFO  JobClient - Counters: 11

2009-05-11 10:43:16,296 INFO  JobClient -   File Systems

2009-05-11 10:43:16,296 INFO  JobClient -     Local bytes read=710951

2009-05-11 10:43:16,296 INFO  JobClient -     Local bytes written=1220879

2009-05-11 10:43:16,296 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:43:16,296 INFO  JobClient -     Reduce input groups=1

2009-05-11 10:43:16,296 INFO  JobClient -     Combine output records=0

2009-05-11 10:43:16,296 INFO  JobClient -     Map input records=1

2009-05-11 10:43:16,296 INFO  JobClient -     Reduce output records=1

2009-05-11 10:43:16,296 INFO  JobClient -     Map output bytes=137

2009-05-11 10:43:16,296 INFO  JobClient -     Map input bytes=2147483647

2009-05-11 10:43:16,296 INFO  JobClient -     Combine input records=0

2009-05-11 10:43:16,296 INFO  JobClient -     Map output records=1

2009-05-11 10:43:16,296 INFO  JobClient -     Reduce input records=1

2009-05-11 10:44:37,734 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:44:45,953 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:44:46,140 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:44:48,781 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:45:46,546 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:45:46,609 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:45:46,609 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:45:46,718 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:45:46,718 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:45:46,734 INFO  MapTask - Starting flush of map output

2009-05-11 10:45:46,953 INFO  MapTask - Finished spill 0

2009-05-11 10:45:46,953 INFO  TaskRunner - Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:45:46,953 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-urls-1843604809/part-00000:0+247

2009-05-11 10:45:46,953 INFO  TaskRunner - Task 'attempt_local_0010_m_000000_0' done.

2009-05-11 10:45:46,968 INFO  LocalJobRunner -

2009-05-11 10:45:46,968 INFO  Merger - Merging 1 sorted segments

2009-05-11 10:45:46,968 INFO  Merger - Down to the last merge-pass, with 1 segments left of total size: 137 bytes

2009-05-11 10:45:46,968 INFO  LocalJobRunner -

2009-05-11 10:45:47,015 INFO  TaskRunner - Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:45:47,015 INFO  LocalJobRunner -

2009-05-11 10:45:47,015 INFO  TaskRunner - Task attempt_local_0010_r_000000_0 is allowed to commit now

2009-05-11 10:45:47,015 INFO  FileOutputCommitter - Saved output of task 'attempt_local_0010_r_000000_0' to file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517

2009-05-11 10:45:47,015 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:45:47,015 INFO  TaskRunner - Task 'attempt_local_0010_r_000000_0' done.

2009-05-11 10:45:52,187 INFO  JobClient - Running job: job_local_0010

2009-05-11 10:46:03,984 INFO  JobClient - Job complete: job_local_0010

2009-05-11 10:46:06,359 INFO  JobClient - Counters: 11

2009-05-11 10:46:06,359 INFO  JobClient -   File Systems

2009-05-11 10:46:06,359 INFO  JobClient -     Local bytes read=764171

2009-05-11 10:46:06,359 INFO  JobClient -     Local bytes written=1327019

2009-05-11 10:46:06,359 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:46:06,359 INFO  JobClient -     Reduce input groups=1

2009-05-11 10:46:06,359 INFO  JobClient -     Combine output records=0

2009-05-11 10:46:06,359 INFO  JobClient -     Map input records=1

2009-05-11 10:46:06,359 INFO  JobClient -     Reduce output records=0

2009-05-11 10:46:06,359 INFO  JobClient -     Map output bytes=133

2009-05-11 10:46:06,359 INFO  JobClient -     Map input bytes=141

2009-05-11 10:46:06,359 INFO  JobClient -     Combine input records=0

2009-05-11 10:46:06,359 INFO  JobClient -     Map output records=1

2009-05-11 10:46:06,359 INFO  JobClient -     Reduce input records=1

2009-05-11 10:47:19,953 INFO  JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized

2009-05-11 10:47:19,953 WARN  JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

2009-05-11 10:47:20,140 WARN  JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

2009-05-11 10:47:20,156 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:47:20,765 INFO  JobClient - Running job: job_local_0011

2009-05-11 10:47:20,765 INFO  FileInputFormat - Total input paths to process : 1

2009-05-11 10:47:20,796 INFO  MapTask - numReduceTasks: 1

2009-05-11 10:47:20,796 INFO  MapTask - io.sort.mb = 100

2009-05-11 10:47:20,921 INFO  MapTask - data buffer = 79691776/99614720

2009-05-11 10:47:20,921 INFO  MapTask - record buffer = 262144/327680

2009-05-11 10:47:20,937 INFO  MapTask - Starting flush of map output

2009-05-11 10:47:21,140 INFO  MapTask - Index: (0, 2, 6)

2009-05-11 10:47:21,140 INFO  TaskRunner - Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting

2009-05-11 10:47:21,140 INFO  LocalJobRunner - file:/D:/work/workspace/nutch_crawl/dedup-hash-291931517/part-00000:0+103

2009-05-11 10:47:21,140 INFO  TaskRunner - Task 'attempt_local_0011_m_000000_0' done.

2009-05-11 10:47:21,156 INFO  LocalJobRunner -

2009-05-11 10:47:21,156 INFO  Merger - Merging 1 sorted segments

2009-05-11 10:47:21,156 INFO  Merger - Down to the last merge-pass, with 0 segments left of total size: 0 bytes

2009-05-11 10:47:21,156 INFO  LocalJobRunner -

2009-05-11 10:47:21,171 INFO  TaskRunner - Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting

2009-05-11 10:47:21,171 INFO  LocalJobRunner - reduce > reduce

2009-05-11 10:47:21,171 INFO  TaskRunner - Task 'attempt_local_0011_r_000000_0' done.

2009-05-11 10:47:21,765 INFO  JobClient - Job complete: job_local_0011

2009-05-11 10:47:21,765 INFO  JobClient - Counters: 11

2009-05-11 10:47:21,765 INFO  JobClient -   File Systems

2009-05-11 10:47:21,765 INFO  JobClient -     Local bytes read=816128

2009-05-11 10:47:21,765 INFO  JobClient -     Local bytes written=1430954

2009-05-11 10:47:21,765 INFO  JobClient -   Map-Reduce Framework

2009-05-11 10:47:21,765 INFO  JobClient -     Reduce input groups=0

2009-05-11 10:47:21,765 INFO  JobClient -     Combine output records=0

2009-05-11 10:47:21,765 INFO  JobClient -     Map input records=0

2009-05-11 10:47:21,765 INFO  JobClient -     Reduce output records=0

2009-05-11 10:47:21,765 INFO  JobClient -     Map output bytes=0

2009-05-11 10:47:21,765 INFO  JobClient -     Map input bytes=0

2009-05-11 10:47:21,765 INFO  JobClient -     Combine input records=0

2009-05-11 10:47:21,765 INFO  JobClient -     Map output records=0

2009-05-11 10:47:21,765 INFO  JobClient -     Reduce input records=0

2009-05-11 10:47:44,031 INFO  DeleteDuplicates - Dedup: done
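
The dedup log contains three jobs: job_local_0009 writes dedup-urls-1843604809, job_local_0010 reads that output and writes dedup-hash-291931517, and job_local_0011 reads the hash output and receives no records (Map input records=0), so nothing is deleted from this one-document index. A simplified sketch of a two-pass deduplication follows. It assumes that the first pass keeps the newest document per URL and the second keeps the best-scored document per content hash; the `Doc` tuple, the `dedup` function and the sample data are hypothetical, and the real tie-breaking rules may differ.

```python
from collections import namedtuple

Doc = namedtuple("Doc", "doc_id url content_hash fetch_time score")


def dedup(docs):
    """Return the set of doc_ids that should be deleted from the index."""
    deleted = set()

    # Pass 1 (cf. dedup-urls): per URL, keep only the newest document.
    newest = {}
    for doc in docs:
        kept = newest.get(doc.url)
        if kept is None or doc.fetch_time > kept.fetch_time:
            if kept is not None:
                deleted.add(kept.doc_id)
            newest[doc.url] = doc
        else:
            deleted.add(doc.doc_id)

    # Pass 2 (cf. dedup-hash): per content hash, keep the best-scored survivor.
    best = {}
    for doc in newest.values():
        kept = best.get(doc.content_hash)
        if kept is None or doc.score > kept.score:
            if kept is not None:
                deleted.add(kept.doc_id)
            best[doc.content_hash] = doc
        else:
            deleted.add(doc.doc_id)

    return deleted


docs = [
    Doc(0, "http://example.com/", "h1", 100, 1.0),
    Doc(1, "http://example.com/", "h1", 200, 1.0),
    Doc(2, "http://example.com/copy", "h1", 150, 0.5),
]
to_delete = dedup(docs)
print(sorted(to_delete))
```

With a single document, as in the crawl above, `dedup` returns an empty set, which corresponds to the empty third job in the log.
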

1.3.9   The merge method




2009-05-11 10:53:56,156 INFO  IndexMerger - merging indexes to: 20090508/index

2009-05-11 10:58:50,906 INFO  IndexMerger - Adding file:/D:/work/workspace/nutch_crawl/20090508/indexes/part-00000

2009-05-11 11:04:36,562 INFO  IndexMerger - done merging

