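Our project team recently needed to set up a wiki, and after comparing the options we settled on JSPWiki (there are plenty of comparisons online). Once it was running, we found that it does not search the contents of attachments: if a wiki page has doc, xls or similar files uploaded to it, their contents cannot be found by search. However, jspwiki.properties contains the following settings: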
jspwiki.searchProvider = LuceneSearchProvider
jspwiki.lucene.analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
So JSPWiki also uses Lucene for search. I downloaded its source, looked at the class com.ecyrd.jspwiki.search.LuceneSearchProvider, and found this method:
    protected String getAttachmentContent( Attachment att )
    {
        AttachmentManager mgr = m_engine.getAttachmentManager();
        //FIXME: Add attachment plugin structure

        String filename = att.getFileName();

        if(filename.endsWith(".txt") ||
           filename.endsWith(".xml") ||
           filename.endsWith(".ini") ||
           filename.endsWith(".html"))
        {
            InputStream attStream;

            try
            {
                attStream = mgr.getAttachmentStream( att );

                StringWriter sout = new StringWriter();
                FileUtil.copyContents( new InputStreamReader(attStream), sout );

                attStream.close();
                sout.close();

                return sout.toString();
            }
            catch (ProviderException e)
            {
                log.error("Attachment cannot be loaded", e);
                return null;
            }
            catch (IOException e)
            {
                log.error("Attachment cannot be loaded", e);
                return null;
            }
        }
        ......
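In other words, content search is already supported for plain-text attachments (txt, xml, ini, html). I tried it by uploading a txt file, and its contents did show up in the search results.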
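So I decided to operate on this class and add support for doc and xls. I added the code shown further down (the file format changed between Office 2003 and Office 2007, so the two are handled separately), repackaged the jar (the built jar is attached to this post), restarted JSPWiki and tested it: Word and Excel files can now be found.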
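Note that this class builds the Lucene index at the moment an attachment is uploaded. To test the change you therefore have to upload the files again; files uploaded before the class was modified will still not be searchable.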
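To make the re-upload requirement concrete, here is a toy model of index-time extraction. It is only an illustration under stated assumptions: `UploadTimeIndexDemo`, its methods and the sample file name are hypothetical and have nothing to do with the real JSPWiki or Lucene classes. The point is that text is extracted once, when a file is uploaded, so swapping in a better extractor afterwards does not change what is already in the index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical toy model, not JSPWiki code.
class UploadTimeIndexDemo {
    // word -> names of attachments whose extracted text contained that word
    private final Map<String, Set<String>> index = new HashMap<>();
    // Stand-in for getAttachmentContent: file name -> text, or null if unsupported
    private Function<String, String> extractor;

    UploadTimeIndexDemo(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    void setExtractor(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    // Extraction happens here, once, at upload time.
    void upload(String filename) {
        String text = extractor.apply(filename);
        if (text == null) {
            return; // unsupported type: nothing is added to the index
        }
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new HashSet<>()).add(filename);
            }
        }
    }

    Set<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        // "Old" extractor: only .txt is supported.
        UploadTimeIndexDemo wiki =
            new UploadTimeIndexDemo(name -> name.endsWith(".txt") ? "quarterly budget" : null);
        wiki.upload("report.doc");
        System.out.println(wiki.search("budget")); // nothing indexed for the .doc

        // "Patched" extractor: every file yields text, but the index is unchanged.
        wiki.setExtractor(name -> "quarterly budget");
        System.out.println(wiki.search("budget")); // still empty

        // Only a fresh upload runs the new extractor.
        wiki.upload("report.doc");
        System.out.println(wiki.search("budget")); // now contains report.doc
    }
}
```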
        ......
        // The branches below need Apache POI on the classpath, plus these imports:
        //   org.apache.poi.hwpf.extractor.WordExtractor
        //   org.apache.poi.xwpf.extractor.XWPFWordExtractor
        //   org.apache.poi.xwpf.usermodel.XWPFDocument
        //   org.apache.poi.hssf.usermodel.{HSSFWorkbook, HSSFSheet, HSSFRow}
        //   org.apache.poi.xssf.usermodel.{XSSFWorkbook, XSSFSheet, XSSFRow}
        else if(filename.endsWith(".doc")){
            // Word 97-2003 binary format
            InputStream attStream = null;
            try {
                attStream = mgr.getAttachmentStream(att);
                WordExtractor extractor = new WordExtractor(attStream);
                String s = extractor.getText();
                log.debug("Extracted text: " + s + " from attachment: " + filename);
                return s;
            } catch (Exception e) {
                log.error("Attachment cannot be loaded", e);
                return null;
            } finally {
                if(attStream != null){
                    try {
                        attStream.close();
                    } catch (IOException e) {
                        log.warn("Couldn't close attachment stream for " + filename, e);
                    }
                }
            }
        }
        else if(filename.endsWith(".docx")){
            // Word 2007 (OOXML) format
            InputStream attStream = null;
            try {
                attStream = mgr.getAttachmentStream(att);
                XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(attStream));
                String s = extractor.getText();
                log.debug("Extracted text: " + s + " from attachment: " + filename);
                return s;
            } catch (Exception e) {
                log.error("Attachment cannot be loaded", e);
                return null;
            } finally {
                if(attStream != null){
                    try {
                        attStream.close();
                    } catch (IOException e) {
                        log.warn("Couldn't close attachment stream for " + filename, e);
                    }
                }
            }
        }
        else if(filename.endsWith(".xls")){
            // Excel 97-2003 binary format
            InputStream attStream = null;
            try {
                attStream = mgr.getAttachmentStream(att);
                HSSFWorkbook workbook = new HSSFWorkbook(attStream);
                StringBuffer sb = new StringBuffer();
                for(int i = 0; i < workbook.getNumberOfSheets(); i++) {
                    HSSFSheet sheet = workbook.getSheetAt(i);
                    if(sheet == null){
                        continue;
                    }
                    // getLastRowNum() is a 0-based index, hence "<=". Using
                    // getPhysicalNumberOfRows() as the bound would miss rows
                    // that come after a blank row.
                    for (int j = 0; j <= sheet.getLastRowNum(); j++) {
                        HSSFRow row = sheet.getRow(j);
                        if(row == null){
                            continue;
                        }
                        for (int k = 0; k < row.getLastCellNum(); k++) {
                            // Skip missing cells so the literal "null" is not indexed
                            if(row.getCell(k) == null){
                                continue;
                            }
                            sb.append(row.getCell(k));
                            sb.append(" ");
                        }
                    }
                }
                String s = sb.toString();
                log.debug("Extracted text: " + s + " from attachment: " + filename);
                return s;
            } catch (Exception e) {
                log.error("Attachment cannot be loaded", e);
                return null;
            } finally {
                if(attStream != null){
                    try {
                        attStream.close();
                    } catch (IOException e) {
                        log.warn("Couldn't close attachment stream for " + filename, e);
                    }
                }
            }
        }
        else if(filename.endsWith(".xlsx")){
            // Excel 2007 (OOXML) format
            InputStream attStream = null;
            try {
                attStream = mgr.getAttachmentStream(att);
                XSSFWorkbook workbook = new XSSFWorkbook(attStream);
                StringBuffer sb = new StringBuffer();
                for(int i = 0; i < workbook.getNumberOfSheets(); i++) {
                    XSSFSheet sheet = workbook.getSheetAt(i);
                    if(sheet == null){
                        continue;
                    }
                    // Same row and cell handling as the .xls branch
                    for (int j = 0; j <= sheet.getLastRowNum(); j++) {
                        XSSFRow row = sheet.getRow(j);
                        if(row == null){
                            continue;
                        }
                        for (int k = 0; k < row.getLastCellNum(); k++) {
                            if(row.getCell(k) == null){
                                continue;
                            }
                            sb.append(row.getCell(k));
                            sb.append(" ");
                        }
                    }
                }
                String s = sb.toString();
                log.debug("Extracted text: " + s + " from attachment: " + filename);
                return s;
            } catch (Exception e) {
                log.error("Attachment cannot be loaded", e);
                return null;
            } finally {
                if(attStream != null){
                    try {
                        attStream.close();
                    } catch (IOException e) {
                        log.warn("Couldn't close attachment stream for " + filename, e);
                    }
                }
            }
        }
        ......
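The four branches above repeat the same open/extract/close boilerplate, and `endsWith` is case-sensitive, so a file named `REPORT.DOC` would be skipped. One possible way to tidy this up, in the spirit of the FIXME about an attachment plugin structure, is a small extension-to-extractor registry. The sketch below is an assumption of mine, not JSPWiki code: `AttachmentTextExtractors`, `TextExtractor`, `register` and `readUtf8` are all hypothetical names. It uses only the JDK so that it runs on its own; in the real class each POI call from the post would be registered as one more extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of JSPWiki.
class AttachmentTextExtractors {

    /** Hypothetical interface: turns an attachment stream into plain text. */
    interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as ".txt" (stored lower-case). */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extracted text, or null if the type is unsupported or reading fails. */
    String extract(String filename, InputStream in) {
        // try-with-resources closes the stream on every path, replacing the
        // hand-written finally blocks.
        try (InputStream stream = in) {
            int dot = filename.lastIndexOf('.');
            if (dot < 0) {
                return null;
            }
            String ext = filename.substring(dot).toLowerCase(Locale.ROOT);
            TextExtractor extractor = byExtension.get(ext);
            return extractor == null ? null : extractor.extract(stream);
        } catch (IOException e) {
            return null;
        }
    }

    /** Reads the whole stream as UTF-8 text (stand-in for the plain-text branch). */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        AttachmentTextExtractors extractors = new AttachmentTextExtractors();
        extractors.register(".txt", AttachmentTextExtractors::readUtf8);

        InputStream data = new ByteArrayInputStream("hello wiki".getBytes(StandardCharsets.UTF_8));
        // Upper-case extension still matches because lookup is case-insensitive.
        System.out.println(extractors.extract("README.TXT", data)); // prints "hello wiki"
    }
}
```

With something like this in place, `getAttachmentContent` would shrink to a single lookup, and supporting another file type would mean one more `register` call rather than another copy of the try/finally block.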