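Two Lucene tools:
Luke can run searches directly against index files and is practically an essential tool when working with Lucene.
Tika makes it easy to extract the text from document files such as Word and PDF, or from web pages, so that the text is ready for Lucene to index.
Tika can be used in two ways: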
①
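import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public String fileToTxt(File f) {
    Parser parser = new AutoDetectParser();
    InputStream is = null;
    try {
        Metadata metadata = new Metadata();
        // placeholder; the original author value is unreadable
        metadata.set(Metadata.AUTHOR, "author-name");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
        is = new FileInputStream(f);
        ContentHandler handler = new BodyContentHandler();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(is, handler, metadata, context);
        // print every metadata entry Tika collected
        for (String name : metadata.names()) {
            System.out.println(name + ":" + metadata.get(name));
        }
        return handler.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (TikaException e) {
        e.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}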
②
public String tikaTool(File f) throws IOException, TikaException {
    // the Tika facade detects the type and parses in one call
    Tika tika = new Tika();
    Metadata metadata = new Metadata();
    metadata.set(Metadata.AUTHOR, "success");
    metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
    String str = tika.parseToString(new FileInputStream(f), metadata);
    for (String name : metadata.names()) {
        System.out.println(name + ":" + metadata.get(name));
    }
    return str;
}
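Highlighting
Highlighting serves the display of search results on web pages: "highlighted" text is simply matched text wrapped in markup so that it can be styled, for example with CSS. Pay attention to how these classes relate to one another: QueryScorer, Fragmenter, Formatter, and Highlighter. Also note that when fetching the fragments that contain the matched keywords, the first argument is the token stream produced by the analyzer.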
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.*;

private String lighterStr(Analyzer a, Query query, String txt, String fieldname)
        throws IOException, InvalidTokenOffsetsException {
    QueryScorer scorer = new QueryScorer(query);                  // scores fragments against the query
    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);     // splits the text into fragments
    Formatter fmt = new SimpleHTMLFormatter("<b>", "</b>");       // wraps each matched term
    Highlighter lighter = new Highlighter(fmt, scorer);
    lighter.setTextFragmenter(fragmenter);
    String str = lighter.getBestFragments(
            a.tokenStream(fieldname, new StringReader(txt)), txt, 3, "......\n");
    // fall back to the original text when nothing matched
    if (str == null || str.isEmpty()) return txt;
    return str;
}
Spell checking
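This has two parts: first build a spell-check index, then use the generated index files to look up terms that are close to the keyword the user entered.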
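The two steps can be sketched in plain Java. This is only an illustrative stand-in, not Lucene's SpellChecker class: the class name SpellSketch, the bigram size, and the in-memory map are assumptions made for the example. It indexes a word list by character n-grams (step one), then gathers the words that share an n-gram with the input and ranks them by Levenshtein edit distance (step two).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a spell-check index; not part of Lucene.
class SpellSketch {
    private static final int GRAM = 2; // assumed n-gram size
    private final Map<String, Set<String>> gramIndex = new HashMap<>();

    // Step 1: build the "spell-check index" (n-gram -> words containing it).
    SpellSketch(List<String> dictionary) {
        for (String word : dictionary) {
            String w = word.toLowerCase(Locale.ROOT);
            for (String g : grams(w)) {
                gramIndex.computeIfAbsent(g, k -> new LinkedHashSet<>()).add(w);
            }
        }
    }

    // Step 2: return up to max indexed words closest to the input.
    List<String> suggest(String input, int max) {
        String in = input.toLowerCase(Locale.ROOT);
        Set<String> candidates = new LinkedHashSet<>();
        for (String g : grams(in)) {
            Set<String> hits = gramIndex.get(g);
            if (hits != null) candidates.addAll(hits);
        }
        List<String> ranked = new ArrayList<>(candidates);
        // closest first; ties broken alphabetically
        ranked.sort((a, b) -> {
            int byDistance = Integer.compare(distance(in, a), distance(in, b));
            return byDistance != 0 ? byDistance : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(max, ranked.size())));
    }

    // All substrings of length GRAM, in order.
    private static List<String> grams(String w) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + GRAM <= w.length(); i++) {
            out.add(w.substring(i, i + GRAM));
        }
        return out;
    }

    // Levenshtein edit distance, computed row by row.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        SpellSketch s = new SpellSketch(Arrays.asList("lucene", "luke", "tika", "index"));
        System.out.println(s.suggest("lucen", 2)); // prints [lucene, luke]
    }
}
```

With Lucene itself, the index would live in a Directory on disk rather than in a map, which is why the generated index files can be reused across runs instead of being rebuilt for every lookup.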