From the standpoint of the querying user, a whole series of activities happen offline, which makes the Nutch search system itself relatively simple. In secondary development, the focus is on making appropriate adjustments to Nutch's interface and to the data it displays.
1 Summary Extraction
1.1 Summary Extraction Source Code Analysis
/**
 * Low level api to get the most relevant (formatted) sections of the document.
 * This method has been made public to allow visibility of score information held in TextFragment objects.
 * Thanks to Jason Calabrese for help in redefining the interface.
 * @param tokenStream the token stream produced by analyzing the text
 * @param text the raw text to fragment and highlight
 * @param mergeContiguousFragments whether adjacent fragments should be merged
 * @param maxNumFragments the maximum number of fragments to return
 * @throws IOException
 */
public final TextFragment[] getBestTextFragments(
    TokenStream tokenStream,
    String text,
    boolean mergeContiguousFragments,
    int maxNumFragments)
    throws IOException
{
    ArrayList docFrags = new ArrayList();
    StringBuffer newText = new StringBuffer();
    TextFragment currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
    fragmentScorer.startFragment(currentFrag);
    docFrags.add(currentFrag);
    FragmentQueue fragQueue = new FragmentQueue(maxNumFragments);
    try
    {
        org.apache.lucene.analysis.Token token;
        String tokenText;
        int startOffset;
        int endOffset;
        int lastEndOffset = 0;
        textFragmenter.start(text);
        TokenGroup tokenGroup = new TokenGroup();
        token = tokenStream.next();
        while ((token != null) && (token.startOffset() < maxDocBytesToAnalyze))
        {
            if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct(token)))
            {
                // the current token is distinct from previous tokens -
                // markup the cached token group info
                startOffset = tokenGroup.matchStartOffset;
                endOffset = tokenGroup.matchEndOffset;
                tokenText = text.substring(startOffset, endOffset);
                String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
                // store any whitespace etc from between this and last group
                if (startOffset > lastEndOffset)
                    newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
                newText.append(markedUpText);
                lastEndOffset = Math.max(endOffset, lastEndOffset);
                tokenGroup.clear();
                // check if current token marks the start of a new fragment
                if (textFragmenter.isNewFragment(token))
                {
                    currentFrag.setScore(fragmentScorer.getFragmentScore());
                    // record stats for a new fragment
                    currentFrag.textEndPos = newText.length();
                    currentFrag = new TextFragment(newText, newText.length(), docFrags.size());
                    fragmentScorer.startFragment(currentFrag);
                    docFrags.add(currentFrag);
                }
            }
            tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));
            // if (lastEndOffset > maxDocBytesToAnalyze)
            // {
            //     break;
            // }
            token = tokenStream.next();
        }
        currentFrag.setScore(fragmentScorer.getFragmentScore());
        if (tokenGroup.numTokens > 0)
        {
            // flush the accumulated text (same code as in above loop)
            startOffset = tokenGroup.matchStartOffset;
            endOffset = tokenGroup.matchEndOffset;
            tokenText = text.substring(startOffset, endOffset);
            String markedUpText = formatter.highlightTerm(encoder.encodeText(tokenText), tokenGroup);
            // store any whitespace etc from between this and last group
            if (startOffset > lastEndOffset)
                newText.append(encoder.encodeText(text.substring(lastEndOffset, startOffset)));
            newText.append(markedUpText);
            lastEndOffset = Math.max(lastEndOffset, endOffset);
        }
        // Test what remains of the original text beyond the point where we stopped analyzing
        if (
            // if there is text beyond the last token considered..
            (lastEndOffset < text.length())
            &&
            // and that text is not too large...
            (text.length() < maxDocBytesToAnalyze)
           )
        {
            // append it to the last fragment
            newText.append(encoder.encodeText(text.substring(lastEndOffset)));
        }
        currentFrag.textEndPos = newText.length();
        // sort the most relevant sections of the text
        for (Iterator i = docFrags.iterator(); i.hasNext();)
        {
            currentFrag = (TextFragment) i.next();
            // If you are running with a version of Lucene before 11th Sept 03
            // you do not have PriorityQueue.insert() - so uncomment the code below
            /*
            if (currentFrag.getScore() >= minScore)
            {
                fragQueue.put(currentFrag);
                if (fragQueue.size() > maxNumFragments)
                { // if hit queue overfull
                    fragQueue.pop(); // remove lowest in hit queue
                    minScore = ((TextFragment) fragQueue.top()).getScore(); // reset minScore
                }
            }
            */
            // The above code caused a problem as a result of Christoph Goller's 11th Sept 03
            // fix to PriorityQueue. The correct method to use here is the new "insert" method
            // USE ABOVE CODE IF THIS DOES NOT COMPILE!
            fragQueue.insert(currentFrag);
        }
        // return the most relevant fragments
        TextFragment frag[] = new TextFragment[fragQueue.size()];
        for (int i = frag.length - 1; i >= 0; i--)
        {
            frag[i] = (TextFragment) fragQueue.pop();
        }
        // merge any contiguous fragments to improve readability
        if (mergeContiguousFragments)
        {
            mergeContiguousFragments(frag);
            ArrayList fragTexts = new ArrayList();
            for (int i = 0; i < frag.length; i++)
            {
                if ((frag[i] != null) && (frag[i].getScore() > 0))
                {
                    fragTexts.add(frag[i]);
                }
            }
            frag = (TextFragment[]) fragTexts.toArray(new TextFragment[0]);
        }
        return frag;
    }
    finally
    {
        if (tokenStream != null)
        {
            try
            {
                tokenStream.close();
            }
            catch (Exception e)
            {
            }
        }
    }
}
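To see how this method is driven in practice, here is a minimal usage sketch of the Lucene contrib Highlighter of that era (the one Nutch builds on). The field name "content", the StandardAnalyzer, the <b>...</b> markup, the fragment size of 100, and the limit of 3 fragments are illustrative assumptions, not values taken from the Nutch source:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TextFragment;

public class SummaryDemo {
    // Returns the best-scoring, formatted fragments for one document's text.
    public static TextFragment[] summarize(Query query, String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Wrap matched terms in <b>...</b> and score fragments against the query.
        Highlighter highlighter =
            new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        // Cut the text into roughly 100-character fragments (placeholder size).
        highlighter.setTextFragmenter(new SimpleFragmenter(100));
        TokenStream tokens = analyzer.tokenStream("content", new StringReader(text));
        // true = merge contiguous fragments, 3 = keep at most three fragments.
        return highlighter.getBestTextFragments(tokens, text, true, 3);
    }
}

Note that the source above only filters out zero-score fragments inside the mergeContiguousFragments branch, which is one reason callers commonly pass true for that flag.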
1.2 Changing the Summary Length
The summary length in Nutch's search results can be changed. The change is made through configuration; the file to edit is nutch-site.xml:
<configuration>
  ...
  <property>
    <name>searcher.summary.length</name>
    <value>50</value> <!-- the default is 20 -->
    <description>
      The total number of terms to display in a hit summary.
    </description>
  </property>
  ...
</configuration>
Nutch's default values are defined in nutch-default.xml; to override any of them, simply add the corresponding property to nutch-site.xml.
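For reference, a property like this is read in code roughly as follows. This sketch assumes the Hadoop-style Configuration API that Nutch uses from version 0.8 onward (earlier versions used the NutchConf class instead); the variable names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// NutchConfiguration.create() loads nutch-default.xml first and then
// nutch-site.xml on top of it, which is why site values override defaults.
Configuration conf = NutchConfiguration.create();

// The second argument is the fallback used when neither file sets the key.
int summaryLength = conf.getInt("searcher.summary.length", 20);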
2 Web Page Snapshots
A web page snapshot is the copy of a page kept on the search engine's servers. When Nutch searches by a keyword, it returns the information associated with that keyword, such as title, url, content, and so on. The URL links to the corresponding live page, while the snapshot is the page content that the Nutch crawler originally fetched. So when a snapshot is clicked, we use the ID of the indexed document to retrieve the original page content from the index. The relevant source is in cached.jsp of the search front end; the key code is:
Hit hit = new Hit(Integer.parseInt(request.getParameter("idx")),
request.getParameter("id"));
HitDetails details = bean.getDetails(hit);
….
String content = new String(bean.getContent(details));
There is also the matter of Chinese text in Nutch page snapshots; for Chinese pages it is enough to decode the content as UTF-8.
Modify cached.jsp, changing

else
    content = new String( bean.getContent(details) );

to

else
    content = new String( bean.getContent(details), "utf-8" );

If you also need to adjust how the content is displayed, this page is the place to make that change.
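Hard-coding "utf-8" is fine when all crawled pages share that encoding. A slightly more defensive variant (our sketch, not code from the Nutch source) handles the checked exception that this String constructor declares:

import java.io.UnsupportedEncodingException;

String content;
try {
    // decode the stored snapshot bytes as UTF-8
    content = new String(bean.getContent(details), "utf-8");
} catch (UnsupportedEncodingException e) {
    // "utf-8" is always available on a conforming JVM, but the checked
    // exception must still be handled; fall back to the platform default
    content = new String(bean.getContent(details));
}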
Part two of this article is available at: http://hi.baidu.com/zhumulangma/blog/item/2c0f05f4b55e38e77709d7ce.html