A recent project of mine needed Lucene. When the project started, Lucene 3.4 had not been released yet, so I went with the latest version at the time, 3.3.
First, some business background:
The files to be searched are plain TXT files. They grow incrementally every day and are kept forever. Each file holds structured records of the form: Id|name|address|date
Searches are on name, address and date, and the date condition must support ranges spanning multiple days.
The business produces one file per day, and all records in a day's file carry the same date (the data provider guarantees this). A daily file can reach roughly 200 MB.
The data grows every day, a search may span a range of up to three months, and the response time must stay within 2 seconds.
If everything went into a single index, the index would keep growing along with the files and become enormous, and both searching and optimizing it would become painful.
Given these constraints, I settled on the following plan:
1、Partition the index by file date. This requires the data provider's cooperation: every daily file name must contain its date.
2、Build a directory structure such as 2011/10/2011-10-11, so that each day's index is stored in its own date folder.
This has the following benefits:
1) The date no longer has to be handled in the indexed data; the date field can simply be dropped, which shrinks the index.
2) No range query is needed at search time. To search, say, 2011-10-01 through 2011-10-31, I generate the 31 directory paths (2011/10/2011-10-01, 2011/10/2011-10-02, and so on) and search those indexes directly through a multi-directory search (see the sketch after this list).
3) Any day's index can be rebuilt from its source file at any time; because each day is a separate index, rebuilding is very cheap and no dedicated index optimization is needed.
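As an illustration, here is a minimal sketch of how the per-day index directories for a date range could be enumerated. The class name, method name and base path are assumptions made for this example; in the real code this job is handled by SEDateUtil.getDateRange and SEFileUtil.buildFilePath.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

public class IndexPathSketch {
    /**
     * Enumerate the per-day index directories between fromDate and toDate (inclusive),
     * e.g. baseDir/2011/10/2011-10-01, baseDir/2011/10/2011-10-02, ...
     * Dates are expected in yyyy-MM-dd format.
     */
    public static List<String> buildDailyIndexPaths(String baseDir, String fromDate, String toDate)
            throws ParseException {
        SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy-MM-dd");
        SimpleDateFormat dirFormat = new SimpleDateFormat("yyyy/MM/yyyy-MM-dd");
        Calendar current = Calendar.getInstance();
        current.setTime(dayFormat.parse(fromDate));
        Calendar end = Calendar.getInstance();
        end.setTime(dayFormat.parse(toDate));
        List<String> paths = new ArrayList<String>();
        while (!current.after(end)) {
            // one directory per day, following the yyyy/MM/yyyy-MM-dd layout described above
            paths.add(baseDir + "/" + dirFormat.format(current.getTime()));
            current.add(Calendar.DAY_OF_MONTH, 1);
        }
        return paths;
    }
}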
/**
* Use a different analyzer for each field.
*
* @return a PerFieldAnalyzerWrapper that tokenizes the content field with StandardAnalyzer
*         and treats every other field as a single keyword
*/
public PerFieldAnalyzerWrapper kmsAnalyzer() {
Analyzer standardAnalyzer = new StandardAnalyzer(LUCENE_VERSION);
Analyzer kwAnalyzer = new KeywordAnalyzer();
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(kwAnalyzer);
analyzer.addAnalyzer("content", standardAnalyzer);
return analyzer;
}
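In other words, only the content field is actually tokenized; id, keyword, cateId and the other fields fall back to the KeywordAnalyzer and are indexed as a single term, which is what the exact-match TermQuerys in the search code further down rely on.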
//build the IndexWriter
private IndexWriter indexWriter33(String indexPath) throws CorruptIndexException,
LockObtainFailedException, IOException {
File file = new File(indexPath);
LogMergePolicy policy = new LogDocMergePolicy();
// setUseCompoundFile makes Lucene merge each segment's files into a single .cfs file
// when building the index; this reduces the number of index files and has a noticeable
// effect on later search efficiency (true = compound index format)
policy.setUseCompoundFile(true);
//merge factor: once the number of segments on disk reaches this value, they are merged into a larger segment
policy.setMergeFactor(5000);
IndexWriterConfig config = new IndexWriterConfig(LUCENE_VERSION, this.kmsAnalyzer());
config.setOpenMode(OpenMode.CREATE);
config.setMergePolicy(policy);
//maximum number of documents buffered in memory; size it according to available memory, a larger value speeds up indexing
config.setMaxBufferedDocs(200000);
//open the directory where the index will be stored
FSDirectory directory = FSDirectory.open(file);
//the index writer
IndexWriter indexWriter = new IndexWriter(directory, config);
return indexWriter;
}
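Note that OpenMode.CREATE makes the writer overwrite whatever index already exists in the target directory, which is what allows any day's index to be rebuilt from its source file at will, as mentioned above.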
/**
* Build a Document from one line of the source file.
*
* @param lineRecord one record line
* @param fileStatus flag intended to mark the file as malformed (see the note after this method)
* @return the Document, or an empty Document if the record is malformed
*/
private Document buildDocument(String lineRecord, boolean fileStatus) {
Document doc = new Document();
//the fields are separated by the control character \u0005
String[] columns = lineRecord.split(String.valueOf((char) 5));
//a record with 3 columns is an on-site term
if (columns.length == 3) {
Field cateId = new Field("cateId", columns[2], Store.NO, Index.ANALYZED);
doc.add(cateId);
}
//a record with 5 columns is an off-site term
else if (columns.length == 5) {
Field sessionId = new Field("sessionId", columns[2], Store.NO, Index.ANALYZED);
Field countryId = new Field("countryId", columns[3], Store.NO, Index.ANALYZED);
Field urlGourpId = new Field("urlGourpId", columns[4], Store.NO, Index.ANALYZED);
doc.add(sessionId);
doc.add(countryId);
doc.add(urlGourpId);
} else {
logger.error("The file content [" + lineRecord + "] error.");
fileStatus = false;
return new Document();
}
Field id = new Field("id", columns[0], Store.YES, Index.ANALYZED);
Field keyword = new Field("keyword", columns[1], Store.NO, Index.ANALYZED);
//this field is tokenized (it is the one mapped to StandardAnalyzer in kmsAnalyzer)
Field content = new Field("content", columns[1], Store.NO, Index.ANALYZED);
//Field date = new Field("date", columns[2], Store.YES, Index.ANALYZED);
doc.add(id);
doc.add(keyword);
doc.add(content);
//doc.add(date);
return doc;
}
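One caveat with the method above: Java passes primitives by value, so the fileStatus = false assignment inside buildDocument never reaches the fileStatus flag in createIndex. A minimal sketch of one way to signal a bad record instead, assuming the caller checks for null and sets the flag itself; the method name is hypothetical:

//returns null for a malformed record so the caller can skip it and mark the file as bad
private Document buildDocumentOrNull(String lineRecord) {
    String[] columns = lineRecord.split(String.valueOf((char) 5));
    if (columns.length != 3 && columns.length != 5) {
        logger.error("The file content [" + lineRecord + "] error.");
        return null;
    }
    return this.buildDocument(lineRecord, true);
}
//the loop body in createIndex would then become:
// Document doc = this.buildDocumentOrNull(record);
// if (doc == null) { fileStatus = false; continue; }
// writer.addDocument(doc);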
public void createIndex(String srcPath, String desPath) {
//flag indicating the file content is well formed
boolean fileStatus = true;
String path = null;
//collect all *.dat files
List<File> fileList = SEFileUtil.getSrcFiles(SEFileUtil.pathToFile(srcPath),
FILE_SUFFIX_DAT);
// collect all *.lock files
List<File> lockFileList = SEFileUtil.getSrcFiles(SEFileUtil.pathToFile(srcPath),
FILE_SUFFIX_LOCK);
//build the index
label0: for (File file : fileList) {
IndexWriter writer = null;
BufferedReader br = null;
//build the index writer
try {
String prxFileName = file.getName().substring(0, file.getName().indexOf("_"));
//skip files that are still being generated (a matching *.lock file exists)
if (lockFileList != null && !lockFileList.isEmpty()) {
for (File lockFile : lockFileList) {
String preLockFileName = lockFile.getName().substring(0,
lockFile.getName().indexOf("_"));
if (preLockFileName.equalsIgnoreCase(prxFileName)) {
lockFileList.remove(lockFile);
continue label0;
}
}
}
//build the directory path where this day's index will be stored
path = SEFileUtil.buildFilePath(desPath, prxFileName, "yyyyMMdd");
if (logger.isDebugEnabled()) {
logger.debug("The index file path: " + path);
}
writer = this.indexWriter33(SEFileUtil.createDirectory(path));
br = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
String record = null;
//index the file line by line (note: the loop also stops at the first blank line, not only at end of file)
while (StringUtils.isNotBlank(record = br.readLine())) {
writer.addDocument(this.buildDocument(record, fileStatus));
}
writer.optimize();
writer.commit();
} catch (Exception e) {
logger.error("Build index for file [" + file.getPath() + "] failed.", e);
return;
} finally {
this.close(writer, br);
}
//do not delete the source file if parsing failed
if (fileStatus) {
if (StringUtils.isNotBlank(this.getIndexCopyToIP())) {
String[] ipArray = this.getIndexCopyToIP().split("\\|");
for (String ip : ipArray) {
int exitValue = this.copyIndex(ip.trim(), path);
if (0 != exitValue) {
logger.error("^_^ Copy index directory [" + path + "] to [" + ip
+ "] failed.");
}
}
}
//delete the source file
boolean flag = SEFileUtil.deleteFile(file);
if (!flag) {
logger.error("Delete file failed: " + file.getPath());
}
}
}
}
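One design note on the writer.optimize() call above: a three-month query can open up to roughly ninety of these per-day directories at once, so collapsing each day's index into a single segment at build time keeps the number of files the multi-directory search has to open under control.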
Now for the search side. My business requirements for search are fairly simple, so the search code is simple too. The one notable point is that the only value I need from each hit is its ID. Not knowing Lucene well at first, I loaded the entire Document for every hit, which made fetching the results slow; later I switched to a MapFieldSelector, which made retrieval much faster. The usage is shown below.
Below is the multi-directory search.
/**
* Main query method: builds the search conditions and opens the multi-directory index readers.
*/
public List<Long> searchIndex(Map<String, String> paramMap) {
Long startTime = null;
if (logger.isDebugEnabled()) {
startTime = System.currentTimeMillis();
logger.debug("^_^ start search: " + paramMap);
}
List<Long> ids = null;
//read the search parameters
String keyword = paramMap.get("keyword");//keyword
String cateId = paramMap.get("cateId");//category id
String matchFlag = paramMap.get("matchFlag"); //match flag, 0: exact, 1: fuzzy
String cateType = paramMap.get("cateType");//02: posting category, 03: display category
String siteWord = paramMap.get("siteWord");//0: on-site term, 1: off-site term
String sessionId = paramMap.get("sessionId");//source session
String countryId = paramMap.get("countryId");//country
String urlGourpId = paramMap.get("urlGourpId");//url group
String fromDate = paramMap.get("startDate");//start date
String toDate = paramMap.get("endDate");//end date
//resolve the base search directory
String searchPath = this.getSearchPath(siteWord, cateType);
//compute every date within the range
List<String> dateStringList = SEDateUtil.getDateRange(fromDate, toDate);
List<IndexReader> subReadersList = new ArrayList<IndexReader>();
boolean flag = true;
try {
//open an IndexReader for each day in the range
for (int i = 0; i < dateStringList.size(); i++) {
//build the full path of this day's index directory
String fullPath = SEFileUtil.buildFilePath(searchPath, dateStringList.get(i),
"yyyy-MM-dd");
File file = SEFileUtil.pathToFile(fullPath);
if (!file.isDirectory()) {
if (logger.isDebugEnabled()) {
logger.debug("The directory is not exist: " + fullPath);
}
continue;
}
FSDirectory directory = FSDirectory.open(new File(fullPath));
IndexReader subReader = IndexReader.open(directory);
flag = false;
subReadersList.add(subReader);
}
if (flag) {
return null;
}
IndexReader[] subReaders = subReadersList
.toArray(new IndexReader[subReadersList.size()]);
if (logger.isDebugEnabled()) {
logger.debug("Build search directory consume time: "
+ (System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
}
//run the query and collect the result ids
ids = this.getSearchResult(subReaders, matchFlag, keyword, cateId, sessionId,
countryId, urlGourpId);
} catch (Exception e) {
logger.error("Search index failed: " + paramMap, e);
} finally {
//close the per-day readers once the search is done
for (IndexReader subReader : subReadersList) {
try {
subReader.close();
} catch (IOException ioe) {
logger.error("Close index reader failed.", ioe);
}
}
}
if (logger.isDebugEnabled()) {
Long endTime = (System.currentTimeMillis() - startTime);
logger.debug("search end. Consume Time(s): " + endTime);
}
if (null != ids && !ids.isEmpty()) {
//sort the ids in ascending order
Collections.sort(ids, new IndicatorComparator());
}
return ids;
}
/**
* Build the boolean query from the search conditions and run it against the combined readers.
*/
private List<Long> getSearchResult(IndexReader[] subReaders, String matchFlag, String keyword,
String cateId, String sessionId, String countryId,
String urlGourpId) throws ParseException,
CorruptIndexException, Exception {
List<Long> result = null;
PerFieldAnalyzerWrapper analyzer = buildIndexJob.kmsAnalyzer();
IndexReader multiReader = new MultiReader(subReaders);
BooleanQuery query = new BooleanQuery();
//fuzzy match: the keyword is tokenized
if ("1".equals(matchFlag) && StringUtil.isNotBlank(keyword)) {
QueryParser queryParser = new QueryParser(BuildIndexJob.LUCENE_VERSION, "content",
analyzer);
//terms are combined with OR
queryParser.setDefaultOperator(QueryParser.OR_OPERATOR);
query.add(queryParser.parse(QueryParser.escape(keyword.toLowerCase())), Occur.MUST);
}
//exact match on the whole keyword
else if ("0".equals(matchFlag) && StringUtils.isNotBlank(keyword)) {
Query kQuery = new TermQuery(new Term("keyword", keyword.toLowerCase()));
query.add(kQuery, Occur.MUST);
}
//match on cateId
if (StringUtils.isNotBlank(cateId)) {
Query bQuery = new TermQuery(new Term("cateId", cateId));
query.add(bQuery, Occur.MUST);
}
if (StringUtils.isNotBlank(sessionId)) {
Query bQuery = new TermQuery(new Term("sessionId", sessionId));
query.add(bQuery, Occur.MUST);
}
if (StringUtils.isNotBlank(countryId)) {
Query bQuery = new TermQuery(new Term("countryId", countryId));
query.add(bQuery, Occur.MUST);
}
if (StringUtils.isNotBlank(urlGourpId)) {
Query bQuery = new TermQuery(new Term("urlGourpId", urlGourpId));
query.add(bQuery, Occur.MUST);
}
Long startTime = System.currentTimeMillis();
IndexSearcher search = new IndexSearcher(multiReader);
//return at most 200,000 hits; this limit is a business requirement
TopDocs topDocs = search.search(query, 200000);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
if (logger.isDebugEnabled()) {
logger.debug("search result: " + scoreDocs.length);
logger.debug("search consume time: " + (System.currentTimeMillis() - startTime));
startTime = System.currentTimeMillis();
}
if (scoreDocs.length <= 0) {
return null;
}
result = this.getIds(scoreDocs, search);
if (logger.isDebugEnabled()) {
logger.debug("Reader [id] consume time: " + (System.currentTimeMillis() - startTime));
}
return result;
}
/**
* Collect all ID values from the matched documents.
*
* @param scoreDocs the search hits
* @param search the searcher used to load the documents
* @return the list of ids
* @throws CorruptIndexException
* @throws IOException
*/
public List<Long> getIds(ScoreDoc[] scoreDocs, IndexSearcher search)
throws CorruptIndexException, IOException {
List<Long> ids = new ArrayList<Long>(scoreDocs.length);
Map<String, FieldSelectorResult> fieldSelections = new HashMap<String, FieldSelectorResult>(
1);
fieldSelections.put("id", FieldSelectorResult.LOAD);
FieldSelector fieldSelector = new MapFieldSelector(fieldSelections);
//load only the id field of each hit
for (int i = 0; i < scoreDocs.length; i++) {
Document doc = search.doc(scoreDocs[i].doc, fieldSelector);
ids.add(Long.valueOf(doc.getFieldable("id").stringValue()));
}
return ids;
}
The key point in getIds above is that only the id field is loaded when fetching a document; the other fields are not loaded at all. This makes retrieval much faster, and the larger the index, the more the difference shows.
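A further small tweak: since only one field is needed, FieldSelectorResult.LOAD_AND_BREAK could be used instead of LOAD, so Lucene stops reading each document as soon as the id field has been loaded.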
Another observation from performance testing: Windows and Linux behave very differently, mainly because on Windows (with a 32-bit JVM) the default directory implementation uses neither memory mapping nor the NIO implementation. The relevant Lucene source is:
/** Just like {@link #open(File)}, but allows you to
* also specify a custom {@link LockFactory}. */
public static FSDirectory open(File path, LockFactory lockFactory) throws IOException {
if ((Constants.WINDOWS || Constants.SUN_OS || Constants.LINUX)
&& Constants.JRE_IS_64BIT && MMapDirectory.UNMAP_SUPPORTED) {
return new MMapDirectory(path, lockFactory);
} else if (Constants.WINDOWS) {
return new SimpleFSDirectory(path, lockFactory);
} else {
return new NIOFSDirectory(path, lockFactory);
}
}
So the full effect only shows up when you run the tests on 64-bit Linux (or at least on a 64-bit JVM).
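If you want to rule the platform detection out while testing, you can also pick the Directory implementation explicitly. A minimal sketch, assuming the reader is then opened the same way as in searchIndex; the method name is hypothetical:

//force memory-mapped I/O instead of relying on FSDirectory.open()'s platform detection
private IndexReader openWithMMap(String indexPath) throws IOException {
    Directory directory = new MMapDirectory(new File(indexPath));
    // Directory directory = new NIOFSDirectory(new File(indexPath)); // or force the NIO variant
    return IndexReader.open(directory);
}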
If you have any questions, contact me at:
[email protected]
[email protected] (Gmail has been rather slow to open lately; I am not sure why)