> "fetch: fetch a segment's pages"
> Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir
So far we've created a webdb, primed it with URLs, and created a segment that a Fetcher can write to. Now let's look at the Fetcher itself, and try running it to see what comes out.
net.nutch.fetcher.Fetcher relies on several other classes:
FetcherThread, an inner class
net.nutch.parse.ParserFactory
net.nutch.plugin.PluginRepository
and, of course, any "plugin" classes loaded by the PluginRepository
Fetcher.main() reads arguments, instantiates a new Fetcher object, sets options, then calls run(). The Fetcher constructor is similarly simple; it just instantiates all of the input/output streams:
instance variable | class | arguments |
fetchList | ArrayFile.Reader | (dir, "fetchlist") |
fetchWriter | ArrayFile.Writer | (dir, "fetcher", FetcherOutput.class) |
contentWriter | ArrayFile.Writer | (dir, "content", Content.class) |
parseTextWriter | ArrayFile.Writer | (dir, "parse_text", ParseText.class) |
parseDataWriter | ArrayFile.Writer | (dir, "parse_data", ParseData.class) |
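The argument handling that Fetcher.main() performs for the usage line quoted at the top can be sketched roughly as follows. This is a minimal illustration, not Nutch's code: the class FetcherArgsSketch, the Options holder, and the default of one thread are all assumptions made for the example.

```java
public class FetcherArgsSketch {
    /** Hypothetical holder for the parsed options; not a Nutch class. */
    static final class Options {
        String logLevel;       // stays null when -logLevel is absent
        boolean showThreadId;  // set by -showThreadID
        int threads = 1;       // placeholder default (an assumption, not Nutch's value)
        String dir;            // the segment directory (required)
    }

    /** Parses: [-logLevel level] [-showThreadID] [-threads n] dir */
    static Options parse(String[] args) {
        Options opts = new Options();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-logLevel")) {
                opts.logLevel = valueAfter(args, i++);   // i++ skips the consumed value
            } else if (arg.equals("-showThreadID")) {
                opts.showThreadId = true;
            } else if (arg.equals("-threads")) {
                opts.threads = Integer.parseInt(valueAfter(args, i++));
            } else {
                opts.dir = arg;                          // anything else is the segment dir
            }
        }
        if (opts.dir == null) {
            throw new IllegalArgumentException(
                "Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir");
        }
        return opts;
    }

    /** Returns the value following the flag at index i, or fails if it is missing. */
    static String valueAfter(String[] args, int i) {
        if (i + 1 >= args.length) {
            throw new IllegalArgumentException("missing value for " + args[i]);
        }
        return args[i + 1];
    }

    public static void main(String[] args) {
        Options o = parse(new String[] {"-threads", "4", "-showThreadID", "segments/example"});
        System.out.println(o.threads + " threads, dir=" + o.dir);
    }
}
```

After parsing, the real main() would hand the directory to the Fetcher constructor, which opens the reader and writers listed in the table.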
Fetcher.run() instantiates threadCount FetcherThread objects, calls thread.start() on each, sleeps until all threads are gone or a fatal error is logged, then calls close() on the I/O streams.
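The start-then-wait pattern described for Fetcher.run() might look like the sketch below. It uses plain java.lang.Thread objects in place of FetcherThread; the class name FetcherRunSketch, the fatalError flag, and the 10 ms polling interval are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class FetcherRunSketch {
    // Hypothetical stand-in for "a fatal error was logged".
    static final AtomicBoolean fatalError = new AtomicBoolean(false);

    /** Starts threadCount workers, waits for them, and returns how many finished. */
    static int runWorkers(int threadCount) {
        AtomicInteger completed = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            // Each worker stands in for one FetcherThread; here it only counts itself.
            Thread t = new Thread(() -> completed.incrementAndGet(), "fetcher-" + i);
            threads.add(t);
            t.start();
        }
        try {
            // Sleep until all threads are gone or a fatal error has been flagged.
            while (!fatalError.get() && anyAlive(threads)) {
                Thread.sleep(10); // polling interval is an arbitrary choice
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        // The real Fetcher would call close() on its reader and writers here.
        return completed.get();
    }

    static boolean anyAlive(List<Thread> threads) {
        for (Thread t : threads) {
            if (t.isAlive()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("workers finished: " + runWorkers(3));
    }
}
```

Polling with sleep, rather than calling join() on each thread, lets the waiting loop also react to the fatal-error flag, which matches the behaviour described above.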
FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends java.lang.Thread. It has one instance method, run(), and three static methods: handleFetch(), handleNoFetch(), and logError().
FetcherThread.run() instantiates a new FetchListEntry called "fle", then runs the following in an infinite loop:
If a fatal error was logged, break
Get the next entry in the FetchList, break if none remain
Extract url from FetchListEntry
If the FetchListEntry is not tagged "fetch", call this.handleNoFetch() with status=1. This in turn does:
Get MD5Hash.digest() of url
Build a FetcherOutput(fle, hash, status)
Build empty Content, ParseText, and ParseData objects
Call Fetcher.outputPage() with all of these objects
If it is tagged "fetch", call ProtocolFactory and get Protocol and Content objects for this url
Call this.handleFetch(url, fle, content). This in turn does:
Call ParserFactory.getParser(contentType, url) for this content type, where contentType is content.getContentType()
Call parser.getParse(content)
Call Fetcher.outputPage() with a new FetcherOutput, including url MD5, the populated Content object, and a new ParseText
On every 100th pass through loop, write a status message to the log
Catch any exceptions and log as necessary
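The loop above can be sketched in self-contained Java. None of the types here are Nutch's: Entry stands in for FetchListEntry, the output records are plain strings instead of FetcherOutput, Content, ParseText, and ParseData objects, and the page content is a placeholder string rather than a real download. The assumption that the URL is hashed over its UTF-8 bytes is also mine; the digest itself uses the standard java.security.MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FetchLoopSketch {
    /** Hypothetical stand-in for FetchListEntry: a url plus its "fetch" tag. */
    static final class Entry {
        final String url;
        final boolean fetch;
        Entry(String url, boolean fetch) { this.url = url; this.fetch = fetch; }
    }

    static volatile boolean fatalError = false; // stand-in for "a fatal error was logged"

    /** Hex MD5 of a string's UTF-8 bytes (the encoding is an assumption). */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    /** Runs the loop over a fetch list and returns one output record per entry. */
    static List<String> run(List<Entry> fetchList) {
        List<String> output = new ArrayList<>();
        Iterator<Entry> it = fetchList.iterator();
        int passes = 0;
        while (true) {
            if (fatalError) break;            // a fatal error was logged
            if (!it.hasNext()) break;         // no entries remain
            Entry fle = it.next();
            try {
                if (!fle.fetch) {
                    // handleNoFetch: hash the url, emit a record with status 1, empty payload
                    output.add("nofetch:" + md5Hex(fle.url) + ":status=1");
                } else {
                    // handleFetch: a real fetcher would download and parse here
                    String content = "placeholder content for " + fle.url;
                    output.add("fetch:" + md5Hex(fle.url) + ":" + content.length());
                }
            } catch (RuntimeException e) {
                System.err.println("error on " + fle.url + ": " + e);
            }
            if (++passes % 100 == 0) {
                System.err.println("status: " + passes + " entries processed");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Entry> list = Arrays.asList(
                new Entry("http://a.example/", false),
                new Entry("http://b.example/", true));
        for (String record : run(list)) {
            System.out.println(record);
        }
    }
}
```

Every entry produces exactly one output record whether or not it is fetched, which mirrors how both handleNoFetch() and handleFetch() end in a call to Fetcher.outputPage().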
As we can see here, the fetcher relies on Factory classes to choose the code it uses for different content types: ProtocolFactory finds a Protocol instance for a given url, and ParserFactory finds a Parser for a given contentType.
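To make the factory idea concrete, here is a minimal sketch of two lookup tables, one keyed by URL scheme and one by content type. It is an illustration only: the nested Protocol and Parser interfaces, the map-based registries, and the stub implementations are all assumptions, and Nutch's own factories load their implementations through the PluginRepository rather than from hard-coded maps.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class FactorySketch {
    /** Hypothetical stand-ins for Nutch's Protocol and Parser types. */
    interface Protocol { String fetch(String url); }
    interface Parser { String parse(String content); }

    static final Map<String, Protocol> PROTOCOLS = new HashMap<>();
    static final Map<String, Parser> PARSERS = new HashMap<>();

    static {
        // Stub protocol: returns canned HTML instead of touching the network.
        PROTOCOLS.put("http", url -> "<html><body>stub page for " + url + "</body></html>");
        // Stub parsers: strip tags for HTML, pass plain text through unchanged.
        PARSERS.put("text/html", content -> content.replaceAll("<[^>]*>", ""));
        PARSERS.put("text/plain", content -> content);
    }

    /** Finds a Protocol for a url by its scheme (keying on scheme is an assumption). */
    static Protocol getProtocol(String url) {
        String scheme = URI.create(url).getScheme();
        Protocol protocol = PROTOCOLS.get(scheme);
        if (protocol == null) {
            throw new IllegalArgumentException("no protocol for scheme: " + scheme);
        }
        return protocol;
    }

    /** Finds a Parser for a content type. */
    static Parser getParser(String contentType) {
        Parser parser = PARSERS.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("no parser for content type: " + contentType);
        }
        return parser;
    }

    public static void main(String[] args) {
        String url = "http://example.com/";
        String content = getProtocol(url).fetch(url);
        System.out.println(getParser("text/html").parse(content));
    }
}
```

Supporting a new scheme or content type in this sketch means registering one more map entry, which is the same shape of change as adding a new Protocol or Parser class to a crawler.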
It should now be apparent that implementing a custom crawler with Nutch will revolve around creating new Protocol/Parser classes, and updating ProtocolFactory/ParserFactory to load them as needed. Let's examine these classes now.