1. Root cause analysis
The error is org.apache.nutch.searcher.QueryException: Not a known field name:publishUrl.
The call chain that leads to it is as follows.
In the main() method of NutchBean:
    final NutchBean bean = new NutchBean(conf);
This creates a NutchBean. Its constructor uses LuceneSearchBean as the implementation of searchBean:
    searchBean = new LuceneSearchBean(conf, indexDir, indexesDir);
The LuceneSearchBean constructor calls its own init() method, which instantiates searcher as an IndexSearcher:
    this.searcher = new IndexSearcher(indexDir, this.conf);
The IndexSearcher constructor in turn calls an internal initialization method, which instantiates queryFilters:
    this.queryFilters = new QueryFilters(conf);
QueryFilters has three fields:
    private QueryFilter[] queryFilters;        // the loaded filters
    private HashSet<String> FIELD_NAMES;       // field names
    private HashSet<String> RAW_FIELD_NAMES;   // raw field names
By default four plugins are loaded: LanguageQueryFilter, DefaultQueryFilter, UrlQueryFilter and SiteQueryFilter.
The value of FIELD_NAMES is [site, , lang, DEFAULT, url].
The value of RAW_FIELD_NAMES is [site, , lang].
The QueryFilters constructor stores FIELD_NAMES and RAW_FIELD_NAMES in the ObjectCache:
    FIELD_NAMES.addAll(fieldNames);
    FIELD_NAMES.addAll(rawFieldNames);
    objectCache.setObject("FIELD_NAMES", FIELD_NAMES);
    RAW_FIELD_NAMES.addAll(rawFieldNames);
    objectCache.setObject("RAW_FIELD_NAMES", RAW_FIELD_NAMES);
When execution reaches final Hits hits = bean.search(query); the call goes through these steps:
    searchBean.search(query), i.e. the search method of LuceneSearchBean
    -> Hits search(Query query) in IndexSearcher, which calls this.queryFilters.filter(query)
    -> BooleanQuery filter(Query input) in QueryFilters
At this point the value of input is: [处理, output, -url:www, -publishUrl:qqq]
    public BooleanQuery filter(Query input) throws QueryException {
        // first check that all field names are claimed by some plugin
        Clause[] clauses = input.getClauses();
        for (int i = 0; i < clauses.length; i++) {
            Clause c = clauses[i];
            // The custom field publishUrl is not in FIELD_NAMES,
            // so the exception is thrown here.
            if (!isField(c.getField()))
                throw new QueryException("Not a known field name:"+c.getField());
        }

        // then run each plugin
        BooleanQuery output = new BooleanQuery();
        for (int i = 0; i < this.queryFilters.length; i++) {
            output = this.queryFilters[i].filter(input, output);
        }
        return output;
    }

    public boolean isField(String name) {
        return FIELD_NAMES.contains(name); // is this field in FIELD_NAMES?
    }
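The check can be reproduced outside Nutch with a small stand-alone sketch. It does not use the Nutch classes: FieldCheckSketch and UnknownFieldException are hypothetical stand-ins for QueryFilters and QueryException, and the list of clause fields assumes that the two unqualified terms of the query report the field name DEFAULT.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for the field-name check in QueryFilters.filter().
// Not the Nutch implementation; all names here are illustrative.
class FieldCheckSketch {

    // Hypothetical exception standing in for Nutch's QueryException.
    static class UnknownFieldException extends Exception {
        UnknownFieldException(String message) {
            super(message);
        }
    }

    // Plays the role of FIELD_NAMES: every field some plugin has claimed.
    private final Set<String> fieldNames = new HashSet<>();

    FieldCheckSketch(String... knownFields) {
        fieldNames.addAll(Arrays.asList(knownFields));
    }

    boolean isField(String name) {
        return fieldNames.contains(name);
    }

    // Rejects the query as soon as one clause uses an unclaimed field.
    void checkFields(List<String> clauseFields) throws UnknownFieldException {
        for (String field : clauseFields) {
            if (!isField(field)) {
                throw new UnknownFieldException("Not a known field name:" + field);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed field names of the four clauses in the example query.
        List<String> clauseFields = Arrays.asList("DEFAULT", "DEFAULT", "url", "publishUrl");

        // Only the default fields are registered: the check fails.
        FieldCheckSketch defaults = new FieldCheckSketch("site", "lang", "DEFAULT", "url");
        try {
            defaults.checkFields(clauseFields);
            System.out.println("accepted");
        } catch (UnknownFieldException e) {
            System.out.println(e.getMessage());
        }

        // With the custom fields registered as well, the same query passes.
        FieldCheckSketch withCustom = new FieldCheckSketch(
                "site", "lang", "DEFAULT", "url", "publishUrl", "publishTitle");
        try {
            withCustom.checkFields(clauseFields);
            System.out.println("accepted");
        } catch (UnknownFieldException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that the query itself is fine. It is rejected only because no loaded plugin has registered publishUrl as a field name.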
2. Solution
Nutch already ships a plugin, CustomFieldQueryFilter, that adds custom field names to QueryFilters.
First, edit the customs-field.xml file as follows:
    <entry key="publishUrl.name">publishUrl</entry>
    <entry key="publishUrl.indexed">yes</entry>
    <entry key="publishUrl.stored">yes</entry>
    <entry key="publishUrl.tokenized">yes</entry>
    <entry key="publishUrl.boost">1.0</entry>
    <entry key="publishUrl.multi">false</entry>

    <entry key="publishTitle.name">publishTitle</entry>
    <entry key="publishTitle.indexed">yes</entry>
    <entry key="publishTitle.stored">yes</entry>
    <entry key="publishTitle.tokenized">yes</entry>
    <entry key="publishTitle.boost">1.0</entry>
    <entry key="publishTitle.multi">false</entry>
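The <entry key="..."> elements look like the standard Java XML properties format. Assuming that is the case, the entries can be sanity-checked with java.util.Properties before restarting Nutch. This is a sketch only: the class name CustomFieldEntriesSketch, the <properties> wrapper and the DOCTYPE line are assumptions, since the original shows just the entries.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Parses <entry key="..."> elements with java.util.Properties.
// Assumes the file follows the Java XML properties format.
class CustomFieldEntriesSketch {

    // Assumed wrapper around two of the entries shown above.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <entry key=\"publishUrl.name\">publishUrl</entry>\n"
          + "  <entry key=\"publishUrl.indexed\">yes</entry>\n"
          + "</properties>\n";

    // Loads the XML text into a Properties object.
    static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            // Malformed XML ends up here as well.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println(props.getProperty("publishUrl.name"));
        System.out.println(props.getProperty("publishUrl.indexed"));
    }
}
```

A typo in a key (for example publishUrl.indexd) would show up here as a null value for the expected key.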
Next, edit plugin/query-custom/plugin.xml.
Find the line <parameter name="fields" value="publishUrl,publishTitle" /> in it
and change value to your own field names.
Finally, add query-custom to plugin.includes in nutch-default.xml.
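The value of plugin.includes is a regular expression that is matched against plugin ids, so it is worth checking that the edited expression really covers query-custom. The sketch below does that with java.util.regex. The two expressions are shortened, assumed examples rather than the exact contents of your nutch-default.xml, and PluginIncludesSketch is a hypothetical helper, not part of Nutch.

```java
import java.util.regex.Pattern;

// Checks whether a plugin id is covered by a plugin.includes expression.
// The whole id has to match, hence matches() rather than find().
class PluginIncludesSketch {

    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Assumed, shortened value before the change.
        String before = "protocol-http|urlfilter-regex|parse-(text|html)"
                + "|index-basic|query-(basic|site|url)";
        // The same value with query-custom added.
        String after = "protocol-http|urlfilter-regex|parse-(text|html)"
                + "|index-basic|query-(basic|site|url|custom)";

        System.out.println("before: " + isIncluded(before, "query-custom"));
        System.out.println("after:  " + isIncluded(after, "query-custom"));
    }
}
```

If the second check still reports false for your own expression, the plugin will not be loaded and the QueryException will persist.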