How Nutch plugins are loaded, and how to write a custom plugin

PluginRepository is the entry point for plugins and holds every loaded plugin. Loading proceeds as follows:

1. Parse the plugin.xml file of every plugin found under plugin.folders:

The main parsing functions are:
  (1) parseExtension(rootElement, pluginDescriptor);
Parses the extension element:
     <extension id="org.apache.nutch.net.urlfilter.urllength"
              name="Nutch URL Length Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="UrlLengthFilter"
                      class="org.apache.nutch.net.urlfilter.urllength.UrlLengthFilter">   
      </implementation>   
   </extension>
The result is added to the PluginDescriptor:
pPluginDescriptor.addExtension(extension);


  (2) parseLibraries(rootElement, pluginDescriptor);
Parses the following library elements:
   <runtime>
     <library name="lib-http.jar">
        <export name="*"/>
     </library>
   </runtime>
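The result is added to the PluginDescriptor, depending on whether the library has an export child:
pDescriptor.addExportedLibRelative(libName);      // the library has an <export> child
pDescriptor.addNotExportedLibRelative(libName);   // the library has no <export> child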


  (3) parseRequires(rootElement, pluginDescriptor);
Parses the requires element:
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
The result is added to the PluginDescriptor:
pDescriptor.addDependency(plugin);   // records the dependency

  (4) parseExtensionPoints(rootElement, pluginDescriptor);
Parses extension-point elements, which mainly appear in the plugin.xml of nutch-extensionpoints:
<extension-point
      id="org.apache.nutch.indexer.field.FieldFilter"
      name="Nutch Field Filter"/>
The result is added to the PluginDescriptor:
pPluginDescriptor.addExtensionPoint(extensionPoint);   
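
To make the four parsing functions more concrete, here is a minimal sketch that reads the same elements from a plugin.xml with the JDK's DOM parser. It is not Nutch's own parser: ManifestSketch and its fields are hypothetical stand-ins for PluginDescriptor, and it only records the attribute values discussed above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for PluginDescriptor: it only collects what was parsed. */
class ManifestSketch {
    final List<String> extensions = new ArrayList<>();      // "point -> class"
    final List<String> exportedLibs = new ArrayList<>();
    final List<String> notExportedLibs = new ArrayList<>();
    final List<String> dependencies = new ArrayList<>();
    final List<String> extensionPoints = new ArrayList<>();

    static ManifestSketch parse(String pluginXml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid plugin.xml", e);
        }
        ManifestSketch m = new ManifestSketch();

        // (1) each <implementation> inside an <extension point="...">
        NodeList exts = root.getElementsByTagName("extension");
        for (int i = 0; i < exts.getLength(); i++) {
            Element ext = (Element) exts.item(i);
            NodeList impls = ext.getElementsByTagName("implementation");
            for (int j = 0; j < impls.getLength(); j++) {
                Element impl = (Element) impls.item(j);
                m.extensions.add(ext.getAttribute("point") + " -> " + impl.getAttribute("class"));
            }
        }
        // (2) <library name="...">, treated as exported only if it has an <export> child
        NodeList libs = root.getElementsByTagName("library");
        for (int i = 0; i < libs.getLength(); i++) {
            Element lib = (Element) libs.item(i);
            boolean exported = lib.getElementsByTagName("export").getLength() > 0;
            (exported ? m.exportedLibs : m.notExportedLibs).add(lib.getAttribute("name"));
        }
        // (3) <import plugin="..."> inside <requires>
        NodeList imports = root.getElementsByTagName("import");
        for (int i = 0; i < imports.getLength(); i++) {
            m.dependencies.add(((Element) imports.item(i)).getAttribute("plugin"));
        }
        // (4) <extension-point id="...">
        NodeList points = root.getElementsByTagName("extension-point");
        for (int i = 0; i < points.getLength(); i++) {
            m.extensionPoints.add(((Element) points.item(i)).getAttribute("id"));
        }
        return m;
    }
}
```

Calling ManifestSketch.parse on the text of a plugin.xml fills the five lists, which correspond to the addExtension, library, addDependency and addExtensionPoint calls described above.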

2. Filter the plugins:
   Plugins are filtered by plugin.includes and plugin.excludes, and their dependencies are then checked for any "missing dependency" or "circular dependency".

3. installExtensionPoints: collect the ExtensionPoints of all plugins;

4. installExtensions: verify that every extension has a matching ExtensionPoint.

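5. The point value of an extension must be defined as an extension-point. In other words, once a plugin's plugin.xml declares an extension, the corresponding extension point must be registered in the plugin.xml of nutch-extensionpoints.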
6. nutch-site.xml must define a value for "plugin.folders", which tells Nutch where the plugins are located.
<property>
          <name>plugin.folders</name>
          <value>plugins</value>
          <description>Directories where nutch plugins are located.  Each element may be a relative or absolute path.  If absolute, it is used as is.  If relative, it is searched for on the classpath.
          </description>
</property>

It must also define "plugin.includes", which determines the plugins to load.
<property>
         <name>plugin.includes</name>
         <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
         <description>Regular expression naming plugin directory names to include.  Any plugin not matching this expression is excluded.
                          In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
         </description>
</property>
7. plugin.folders only indicates the directories in which plugins can be found;
plugin.includes must also match a plugin before that plugin can be obtained from the PluginRepository.
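
Steps 2 to 4 of the loading process can be sketched as follows. This is an illustration under stated assumptions, not the PluginRepository source: PluginFilterSketch and its method names are hypothetical, the include and exclude expressions are assumed to be matched against the whole plugin id, and dependencies are modelled as a plain map from plugin id to the ids it imports.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

class PluginFilterSketch {
    /** Step 2a: keep the ids that match includes and do not match excludes. */
    static List<String> filter(String includes, String excludes, List<String> ids) {
        Pattern inc = Pattern.compile(includes);
        Pattern exc = (excludes == null || excludes.isEmpty()) ? null : Pattern.compile(excludes);
        List<String> kept = new ArrayList<>();
        for (String id : ids) {
            if (!inc.matcher(id).matches()) continue;             // not included
            if (exc != null && exc.matcher(id).matches()) continue; // explicitly excluded
            kept.add(id);
        }
        return kept;
    }

    /** Step 2b: returns null if all dependencies of id resolve, else an error message. */
    static String checkDependencies(String id, Map<String, List<String>> deps) {
        return visit(id, deps, new ArrayDeque<>());
    }

    private static String visit(String id, Map<String, List<String>> deps, Deque<String> path) {
        if (!deps.containsKey(id)) return "missing dependency: " + id;
        if (path.contains(id)) return "circular dependency: " + id;
        path.push(id);
        for (String dependency : deps.get(id)) {
            String error = visit(dependency, deps, path);
            if (error != null) return error;
        }
        path.pop();
        return null;
    }

    /** Steps 3 and 4: list the extension 'point' values that no extension-point declares. */
    static List<String> unknownPoints(Set<String> extensionPointIds, List<String> usedPoints) {
        List<String> unknown = new ArrayList<>();
        for (String point : usedPoints) {
            if (!extensionPointIds.contains(point)) unknown.add(point);
        }
        return unknown;
    }
}
```

With an includes expression such as protocol-http|urlfilter-(regex|urllength), a plugin directory named parse-html would be dropped by filter, which is why a plugin that exists under plugin.folders still cannot be retrieved until plugin.includes matches it.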

 

 

---------------------------------------------------------------Writing a custom plugin--------------------------------------------------------------------------------

1. Define the URLFilter interface, which must declare X_POINT_ID (Nutch already defines the URLFilter extension point, so this step can be skipped):
public interface URLFilter extends Pluggable, Configurable {
  /** The name of the extension point. */
  public final static String X_POINT_ID = URLFilter.class.getName();
}

2. Define UrlLengthFilter:
public class UrlLengthFilter implements URLFilter{
        // TODO: actual implementation
}
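
As one possible way to fill in the TODO, the sketch below rejects URLs longer than a fixed limit. It is written against a stand-in interface so that it compiles without Nutch on the classpath: UrlFilterStandIn mirrors only the filter(String) method of URLFilter (return the URL to accept it, null to reject it), and the maxLength constructor argument is a hypothetical choice, not an existing Nutch setting. A real plugin class would implement URLFilter itself and would also need the setConf and getConf methods inherited from Configurable.

```java
/** Stand-in for org.apache.nutch.net.URLFilter, reduced to its filter method. */
interface UrlFilterStandIn {
    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String urlString);
}

/** Sketch of a length-based filter; the limit is supplied by the caller. */
class UrlLengthFilterSketch implements UrlFilterStandIn {
    private final int maxLength; // hypothetical limit, in characters

    UrlLengthFilterSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public String filter(String urlString) {
        // Reject null input and any URL longer than the limit.
        if (urlString == null || urlString.length() > maxLength) {
            return null;
        }
        return urlString;
    }
}
```

Passing the limit through the constructor keeps the sketch self-contained; in a real plugin the value would more naturally be read from the Configuration object handed to setConf.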

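3. Add a plugin named urlfilter-urllength under the "plugin.folders" directory, with the plugin.xml shown below. The id attribute of the extension element is the package name of the UrlLengthFilter class,
   and the class attribute of the implementation element gives the fully qualified name of the implementing class, which is how the class is located.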

<plugin
   id="urlfilter-urllength"
   name="URL length Filter"
   version="1.0.0"
   provider-name="nutch.org"> 

   <requires>
      <import plugin="nutch-extensionpoints"/>    
   </requires>

   <extension id="org.apache.nutch.net.urlfilter.urllength"
              name="Nutch URL Length Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="UrlLengthFilter"
                      class="org.apache.nutch.net.urlfilter.urllength.UrlLengthFilter">  
      </implementation>  
   </extension>

</plugin>

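4. Make sure the plugin.xml of nutch-extensionpoints contains the extension-point definition below. Its id must equal the point value of the extension in the previous step, which is also the value of URLFilter.X_POINT_ID.
<extension-point
      id="org.apache.nutch.net.URLFilter"
      name="Nutch URL Filter"
/>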

5. Add the plugin's id (urlfilter-urllength) to the "plugin.includes" definition in nutch-site.xml.
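
Step 3 says that the class attribute of the implementation element is used to locate the class. A rough idea of what that means is sketched below with plain reflection. This is an assumption-laden illustration: ExtensionLoaderSketch, Filter and AcceptAll are hypothetical names, and Nutch itself resolves extension classes through a plugin-specific class loader rather than the default one used here.

```java
class ExtensionLoaderSketch {
    /** Hypothetical stand-in for an extension point interface. */
    interface Filter {
        String filter(String url);
    }

    /** Hypothetical implementation that accepts every URL. */
    public static class AcceptAll implements Filter {
        @Override
        public String filter(String url) {
            return url;
        }
    }

    /** Creates an instance from a class name, as read from a 'class' attribute. */
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load extension class " + className, e);
        }
    }
}
```

This also shows why the class value must be the fully qualified name and why the class needs a no-argument constructor: a misspelled name only fails when the extension is first instantiated.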
