Dissecting The Nutch Crawler - Command "fetch": net.nutch.fetcher.Fetcher

      Original English source: DissectingTheNutchCrawler
  If you repost this article, please credit: http://blog.csdn.net/pwlazy

Command "fetch": net.nutch.fetcher.Fetcher

> "fetch: fetch a segment's pages"
> Usage: Fetcher [-logLevel level] [-showThreadID] [-threads n] dir

So far we've created a webdb, primed it with URLs, and created a segment that a Fetcher can write to. Now let's look at the Fetcher itself, and try running it to see what comes out.
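Following the usage string, an invocation only needs the segment directory created by the previous step; for example (the segment path and thread count below are illustrative):

> Fetcher -threads 10 segments/20060101120000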

net.nutch.fetcher.Fetcher relies on several other classes:

  • FetcherThread, an inner class

  • net.nutch.parse.ParserFactory

  • net.nutch.plugin.PluginRepository

  • and, of course, any "plugin" classes loaded by the PluginRepository

Fetcher.main() reads arguments, instantiates a new Fetcher object, sets options, then calls run(). The Fetcher constructor is similarly simple; it just instantiates all of the input/output streams:

| instance variable | class            | arguments                             |
|-------------------|------------------|---------------------------------------|
| fetchList         | ArrayFile.Reader | (dir, "fetchlist")                    |
| fetchWriter       | ArrayFile.Writer | (dir, "fetcher", FetcherOutput.class) |
| contentWriter     | ArrayFile.Writer | (dir, "content", Content.class)       |
| parseTextWriter   | ArrayFile.Writer | (dir, "parse_text", ParseText.class)  |
| parseDataWriter   | ArrayFile.Writer | (dir, "parse_data", ParseData.class)  |

Fetcher.run() instantiates 1..threadCount FetcherThread objects (threadCount is passed in from CrawlTool's main() and defaults to 10), calls thread.start() on each, sleeps until all threads are gone or a fatal error is logged, then calls close() on the i/o streams.
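In rough Java terms (anyThreadsAlive() and fatalErrorLogged are placeholders for however the real class tracks its workers and errors):

```java
// Sketch of Fetcher.run(): spawn the workers, wait, then close the streams.
public void run() throws IOException {
  for (int i = 0; i < threadCount; i++) {
    new FetcherThread().start();                // 1..threadCount workers
  }
  while (anyThreadsAlive() && !fatalErrorLogged) {
    try { Thread.sleep(1000); } catch (InterruptedException e) {}
  }
  // all workers finished (or a fatal error was logged): close every stream
  fetchList.close();
  fetchWriter.close();
  contentWriter.close();
  parseTextWriter.close();
  parseDataWriter.close();
}
```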

FetcherThread is an inner class of net.nutch.fetcher.Fetcher that extends java.lang.Thread. It has one instance method, run(), and three static methods: handleFetch(), handleNoFetch(), and logError().

FetcherThread.run() instantiates a new FetchListEntry called "fle", then runs the following in an infinite loop (a code sketch follows the list):

  1. If a fatal error was logged, break

  2. Get the next entry in the FetchList, break if none remain

  3. Extract url from FetchListEntry

  4. If the FetchListEntry is not tagged "fetch", call this.handleNoFetch() with status=1. This in turn does:

    • Get MD5Hash.digest() of url

    • Build a FetcherOutput(fle, hash, status)

    • Build empty Content, ParseText, and ParseData objects

    • Call Fetcher.outputPage() with all of these objects

  5. If it is tagged "fetch", call ProtocolFactory to get Protocol and Content objects for this url

  6. Call this.handleFetch(url, fle, content). This in turn does:

    • Call ParserFactory.getParser(contentType, url) for this content type (contentType comes from content.getContentType())

    • Call parser.getParse(content)

    • Call Fetcher.outputPage() with a new FetcherOutput (built from fle, the url's MD5 hash, and the protocol status), the populated Content object, a new ParseText built from parse.getText(), and parse.getData()

  7. On every 100th pass through the loop, write a status message to the log

  8. Catch any exceptions and log as necessary
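Put together, the loop looks roughly like this (a sketch, not the literal source: fatalErrorLogged, LOG, and the fetchList.next()/getFetch()/getPage() accessors are written from the description above):

```java
// Sketch of FetcherThread.run(); numbered comments match the steps above.
public void run() {
  FetchListEntry fle = new FetchListEntry();
  int pages = 0;
  while (true) {
    if (fatalErrorLogged) break;                      // 1. bail on fatal error
    if (fetchList.next(fle) == null) break;           // 2. fetchlist exhausted
    String url = fle.getPage().getURL().toString();   // 3. extract the url
    try {
      if (!fle.getFetch()) {                          // 4. not tagged "fetch"
        handleNoFetch(fle, 1);                        //    status = 1
      } else {                                        // 5. resolve protocol
        Protocol protocol = ProtocolFactory.getProtocol(url);
        Content content = protocol.getContent(url);
        handleFetch(url, fle, content);               // 6. parse and output
      }
      if (++pages % 100 == 0) {                       // 7. periodic status line
        LOG.info("fetched " + pages + " pages");
      }
    } catch (Throwable t) {                           // 8. log and keep going
      logError(url, t);
    }
  }
}
```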

As we can see here, the fetcher relies on Factory classes to choose the code it uses for different content types: ProtocolFactory finds a Protocol instance for a given url, and ParserFactory finds a Parser for a given contentType.
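Concretely, handleFetch() reduces to a pair of factory lookups plus the output call; the outputPage() arguments below follow the description above, while the surrounding scaffolding is a sketch (the protocolStatus value in particular is an assumption):

```java
// Sketch of FetcherThread.handleFetch(): pick a parser for the content
// type, parse, and hand everything to Fetcher.outputPage().
private static void handleFetch(String url, FetchListEntry fle, Content content)
    throws Exception {
  MD5Hash hash = MD5Hash.digest(url);               // checksum of the url
  int protocolStatus = 1;                           // "fetched OK" (assumption)
  Parser parser = ParserFactory.getParser(content.getContentType(), url);
  Parse parse = parser.getParse(content);
  outputPage(new FetcherOutput(fle, hash, protocolStatus),
             content,
             new ParseText(parse.getText()),
             parse.getData());
}
```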

It should now be apparent that implementing a custom crawler with Nutch will revolve around creating new Protocol/Parser classes, and updating ProtocolFactory/ParserFactory to load them as needed. Let's examine these classes now.
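To make that concrete, here is a hypothetical skeleton of a custom parser; the class name is invented, the ParseData/ParseImpl constructors are approximate, and the plugin.xml registration that lets ParserFactory discover the class is not shown:

```java
import java.util.Properties;

import net.nutch.parse.Outlink;
import net.nutch.parse.Parse;
import net.nutch.parse.ParseData;
import net.nutch.parse.ParseImpl;
import net.nutch.parse.Parser;
import net.nutch.protocol.Content;

// Hypothetical parser that treats the fetched bytes as plain text.
public class PlainTextParser implements Parser {
  public Parse getParse(Content content) {
    // Naive decoding; a real parser would honor the declared charset.
    String text = new String(content.getContent());
    // No title, no outlinks, no extra metadata for this toy parser.
    ParseData data = new ParseData("", new Outlink[0], new Properties());
    return new ParseImpl(text, data);
  }
}
```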
