Adding a custom field to the search criteria in Nutch

1. Root cause analysis

This section traces the cause of the error org.apache.nutch.searcher.QueryException: Not a known field name:publishUrl.

In the main() method of NutchBean:

final NutchBean bean = new NutchBean(conf);

This creates a NutchBean instance named bean. Its constructor uses LuceneSearchBean as the implementation of searchBean:

searchBean = new LuceneSearchBean(conf, indexDir, indexesDir);

The LuceneSearchBean constructor calls its own init() method to initialize itself. There, searcher is instantiated as an IndexSearcher:

this.searcher = new IndexSearcher(indexDir, this.conf);

The IndexSearcher constructor in turn calls one of its own initialization methods, which instantiates queryFilters:

this.queryFilters = new QueryFilters(conf);

QueryFilters has three fields:

  private QueryFilter[] queryFilters;      // the loaded filters
  private HashSet<String> FIELD_NAMES;     // field names
  private HashSet<String> RAW_FIELD_NAMES; // raw field names

By default, four plugins are loaded: LanguageQueryFilter, DefaultQueryFilter, UrlQueryFilter, and SiteQueryFilter.

FIELD_NAMES contains [site, , lang, DEFAULT, url]

RAW_FIELD_NAMES contains [site, , lang]

The QueryFilters constructor stores FIELD_NAMES and RAW_FIELD_NAMES in the ObjectCache:

FIELD_NAMES.addAll(fieldNames);
FIELD_NAMES.addAll(rawFieldNames);
objectCache.setObject("FIELD_NAMES", FIELD_NAMES);
RAW_FIELD_NAMES.addAll(rawFieldNames);
objectCache.setObject("RAW_FIELD_NAMES", RAW_FIELD_NAMES);
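How the two sets end up with those values can be illustrated with a minimal sketch. It is not the Nutch source: FilterDecl and FieldNameCollector are hypothetical stand-ins, and the per-plugin declarations are assumptions chosen to reproduce the sets listed above (the unnamed empty entry is left out).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the fields one query-filter plugin declares.
class FilterDecl {
    final List<String> fieldNames;
    final List<String> rawFieldNames;

    FilterDecl(List<String> fieldNames, List<String> rawFieldNames) {
        this.fieldNames = fieldNames;
        this.rawFieldNames = rawFieldNames;
    }
}

// Hypothetical stand-in for the bookkeeping done while filters are loaded.
class FieldNameCollector {
    final Set<String> fieldNames = new HashSet<>();    // plays the role of FIELD_NAMES
    final Set<String> rawFieldNames = new HashSet<>(); // plays the role of RAW_FIELD_NAMES

    // Same three addAll calls as quoted above: raw fields also count as known fields.
    void register(FilterDecl decl) {
        fieldNames.addAll(decl.fieldNames);
        fieldNames.addAll(decl.rawFieldNames);
        rawFieldNames.addAll(decl.rawFieldNames);
    }

    // Assumed declarations for the four default plugins.
    static FieldNameCollector withDefaults() {
        FieldNameCollector c = new FieldNameCollector();
        c.register(new FilterDecl(Arrays.asList("DEFAULT"), Collections.<String>emptyList()));
        c.register(new FilterDecl(Arrays.asList("url"), Collections.<String>emptyList()));
        c.register(new FilterDecl(Collections.<String>emptyList(), Arrays.asList("site")));
        c.register(new FilterDecl(Collections.<String>emptyList(), Arrays.asList("lang")));
        return c;
    }

    public static void main(String[] args) {
        FieldNameCollector c = withDefaults();
        System.out.println(c.fieldNames.contains("url"));        // true
        System.out.println(c.fieldNames.contains("publishUrl")); // false

        // A filter that declares the custom fields makes them known.
        c.register(new FilterDecl(Arrays.asList("publishUrl", "publishTitle"),
                                  Collections.<String>emptyList()));
        System.out.println(c.fieldNames.contains("publishUrl")); // true
    }
}
```

The point of the sketch is that a field name becomes known only when some loaded filter declares it, which is why publishUrl is missing with the default plugins.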

 

When execution reaches final Hits hits = bean.search(query);, the call chain is: searchBean.search(query) (a LuceneSearchBean method) -> Hits search(Query query) in IndexSearcher -> this.queryFilters.filter(query) -> BooleanQuery filter(Query input) in QueryFilters.

At this point the value of input is: [处理, output, -url:www, -publishUrl:qqq]

  public BooleanQuery filter(Query input) throws QueryException {
    // first check that all field names are claimed by some plugin
    Clause[] clauses = input.getClauses();
    for (int i = 0; i < clauses.length; i++) {
      Clause c = clauses[i];
      if (!isField(c.getField())) // the custom field publishUrl is not in FIELD_NAMES, so the error is thrown here
        throw new QueryException("Not a known field name:"+c.getField());
    }

    // then run each plugin
    BooleanQuery output = new BooleanQuery();
    for (int i = 0; i < this.queryFilters.length; i++) {
      output = this.queryFilters[i].filter(input, output);
    }
    return output;
  }

  public boolean isField(String name) {
    return FIELD_NAMES.contains(name); // checks whether the field is in FIELD_NAMES
  }
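The failure can be reproduced outside Nutch with a small sketch of the same check. Everything here is a stand-in: UnknownFieldException plays the role of QueryException, FieldCheckSketch is a hypothetical class, and mapping the two unfielded terms to a field called DEFAULT is an assumption based on the DEFAULT entry in FIELD_NAMES.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for org.apache.nutch.searcher.QueryException.
class UnknownFieldException extends Exception {
    UnknownFieldException(String message) {
        super(message);
    }
}

class FieldCheckSketch {
    // Rejects the query as soon as one clause uses a field that is not in the known set.
    static void checkFields(List<String> clauseFields, Set<String> knownFields)
            throws UnknownFieldException {
        for (String field : clauseFields) {
            if (!knownFields.contains(field)) {
                throw new UnknownFieldException("Not a known field name:" + field);
            }
        }
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("site", "lang", "DEFAULT", "url"));
        // Assumed fields of the clauses in [处理, output, -url:www, -publishUrl:qqq].
        List<String> fields = Arrays.asList("DEFAULT", "DEFAULT", "url", "publishUrl");

        try {
            checkFields(fields, known);
        } catch (UnknownFieldException e) {
            System.out.println(e.getMessage()); // Not a known field name:publishUrl
        }

        // Once the custom field is registered, the same query passes the check.
        known.add("publishUrl");
        try {
            checkFields(fields, known);
            System.out.println("all fields known");
        } catch (UnknownFieldException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

So the fix has to happen before the check runs: the custom field must already be in the known set when the query is filtered.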

2. Solution

Nutch already ships a plugin, CustomFieldQueryFilter, that adds custom field names to QueryFilters.

First, edit the custom-fields.xml file as follows:

  <entry key="publishUrl.name">publishUrl</entry>
  <entry key="publishUrl.indexed">yes</entry>
  <entry key="publishUrl.stored">yes</entry>
  <entry key="publishUrl.tokenized">yes</entry>
  <entry key="publishUrl.boost">1.0</entry>
  <entry key="publishUrl.multi">false</entry>

  <entry key="publishTitle.name">publishTitle</entry>
  <entry key="publishTitle.indexed">yes</entry>
  <entry key="publishTitle.stored">yes</entry>
  <entry key="publishTitle.tokenized">yes</entry>
  <entry key="publishTitle.boost">1.0</entry>
  <entry key="publishTitle.multi">false</entry>
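The entry/key layout resembles the Java XML properties format. Assuming the full file is a document of that kind (only the entries are shown here, and this is not confirmed to be how Nutch reads it), a quick way to sanity-check the entries is to load them with java.util.Properties. CustomFieldsSketch and its SAMPLE wrapper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class CustomFieldsSketch {
    // Assumed wrapper around two of the entries shown above.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <entry key=\"publishUrl.name\">publishUrl</entry>\n"
        + "  <entry key=\"publishUrl.indexed\">yes</entry>\n"
        + "</properties>\n";

    // Parses an XML properties document held in a string.
    static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println(props.getProperty("publishUrl.name"));    // publishUrl
        System.out.println(props.getProperty("publishUrl.indexed")); // yes
    }
}
```

A malformed entry or a misspelled key shows up immediately this way, before the index is rebuilt.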

Next, edit plugin/query-custom/plugin.xml. Find the line <parameter name="fields" value="publishUrl,publishTitle" /> and change value to your own field names, separated by commas.

Finally, add query-custom to the plugin.includes property in nutch-default.xml (or override the property in nutch-site.xml).
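Since plugin.includes is a regular expression over plugin names, a typo in it silently leaves the plugin unloaded. The sketch below checks an expression before it goes into the config; it assumes the expression must match the whole plugin id, and the two values are shortened examples, not the real defaults. PluginIncludesSketch is a hypothetical helper.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {
    // True if the plugin id as a whole matches the plugin.includes expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Shortened, illustrative values for plugin.includes.
        String before = "protocol-http|urlfilter-regex|index-basic|query-(basic|site|url)";
        String after = "protocol-http|urlfilter-regex|index-basic|query-(basic|site|url|custom)";

        System.out.println(isIncluded(before, "query-custom")); // false
        System.out.println(isIncluded(after, "query-custom"));  // true
        System.out.println(isIncluded(after, "query-url"));     // true
    }
}
```

Adding custom inside the existing query-(...) group keeps the expression short; a separate |query-custom alternative would work equally well.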
