With Lucene 5.5, calling IndexWriter's deleteDocuments() method threw no exception, yet the document was not removed from the index. The relevant code:
private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    document.add(new IntField("id", id, Field.Store.YES));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}
public void delete(final int id)
{
    Directory directory = null;
    IndexWriter indexWriter = null;
    try
    {
        directory = FSDirectory.open(Paths.get(INDEX_LOCATION));
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        indexWriter = new IndexWriter(directory, config);
        indexWriter.deleteDocuments(new Term("id", String.valueOf(id)));
    } catch (IOException e)
    {
        e.printStackTrace();
    } finally
    {
        if(indexWriter != null)
            try
            {
                indexWriter.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        if(null != directory)
            try
            {
                directory.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
    }
}
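When adding the Document, every Field uses one of Lucene's preset classes (IntField, TextField, and so on). In my tests, deletion succeeded only when the Term was built on content (a TextField). Why? Comparing the source of IntField and TextField is a good place to start.
The first thing that comes to mind is the difference in index options between these two kinds of Field.
IntField and TextField are both subclasses of Field, and Field uses the member variable FieldType type to describe a field's options, including its storage options and index options.
The members of FieldType are as follows:
private boolean stored; // storage option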
private boolean tokenized = true;
private boolean storeTermVectors;
private boolean storeTermVectorOffsets;
private boolean storeTermVectorPositions;
private boolean storeTermVectorPayloads;
private boolean omitNorms;
// index options
private IndexOptions indexOptions = IndexOptions.NONE;
private NumericType numericType;
private boolean frozen;
private int numericPrecisionStep = NumericUtils.PRECISION_STEP_DEFAULT;
private DocValuesType docValuesType = DocValuesType.NONE;
Next, the definition of IndexOptions:
public enum IndexOptions {
    // NOTE: order is important here; FieldInfo uses this
    // order to merge two conflicting IndexOptions (always
    // "downgrades" by picking the lowest).
    /** Not indexed */
    NONE,
    /**
     * Only documents are indexed: term frequencies and positions are omitted.
     * Phrase and other positional queries on the field will throw an exception, and scoring
     * will behave as if any term in the document appears only once.
     */
    DOCS,
    /**
     * Only documents and term frequencies are indexed: positions are omitted.
     * This enables normal scoring, except Phrase and other positional queries
     * will throw an exception.
     */
    DOCS_AND_FREQS,
    /**
     * Indexes documents, frequencies and positions.
     * This is a typical default for full-text search: full scoring is enabled
     * and positional queries are supported.
     */
    DOCS_AND_FREQS_AND_POSITIONS,
    /**
     * Indexes documents, frequencies, positions and offsets.
     * Character offsets are encoded alongside the positions.
     */
    DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
}
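The NOTE in that enum says the declaration order is used to "downgrade" conflicting options by picking the lowest. A minimal sketch of that idea, using a local copy of the constants rather than Lucene's own classes: ToyIndexOptions and the lowest() helper are names made up for this illustration, and any special handling the real FieldInfo may apply to NONE is ignored here.

```java
/** Local mirror of the IndexOptions constants quoted above, in the same order (not Lucene's class). */
enum ToyIndexOptions {
    NONE,
    DOCS,
    DOCS_AND_FREQS,
    DOCS_AND_FREQS_AND_POSITIONS,
    DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
}

class ToyIndexOptionsMerge {
    /** Hypothetical helper: returns whichever option is declared earlier, i.e. the "lowest". */
    static ToyIndexOptions lowest(ToyIndexOptions a, ToyIndexOptions b) {
        // Enum.compareTo compares declaration order (ordinal values).
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        // A field indexed once with DOCS and once with positions ends up at DOCS.
        System.out.println(lowest(ToyIndexOptions.DOCS,
                ToyIndexOptions.DOCS_AND_FREQS_AND_POSITIONS));
    }
}
```

Because the comparison relies only on declaration order, reordering the constants would silently change the result, which is presumably why the source comment stresses that the order matters.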
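In other words, my first guess was that a Field can be searched and deleted by keyword only if its index option is higher than DOCS. That guess turns out to be too strict, since the final solution in this post works with IndexOptions.DOCS. A more direct cause is that IntField indexes its value in Lucene's numeric term encoding rather than as the string "1", so a Term built from String.valueOf(id) never matches it.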
Now let's look at how IntField and TextField differ.
public IntField(String name, int value, Store stored) {
    super(name, stored == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
    fieldsData = Integer.valueOf(value);
}
Here is the TYPE_STORED static member:
public static final FieldType TYPE_STORED = new FieldType();
static {
    TYPE_STORED.setTokenized(true);
    TYPE_STORED.setOmitNorms(true);
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS);
    TYPE_STORED.setNumericType(FieldType.NumericType.INT);
    TYPE_STORED.setNumericPrecisionStep(NumericUtils.PRECISION_STEP_DEFAULT_32);
    TYPE_STORED.setStored(true);
    TYPE_STORED.freeze();
}
In other words, IntField presets two FieldTypes, TYPE_STORED and TYPE_NOT_STORED, and both use IndexOptions.DOCS for indexOptions.
public TextField(String name, Reader reader) {
    super(name, reader, TYPE_NOT_STORED);
}
Here is the TYPE_NOT_STORED static member of TextField as well:
/** Indexed, tokenized, not stored. */
public static final FieldType TYPE_NOT_STORED = new FieldType();
/** Indexed, tokenized, stored. */
public static final FieldType TYPE_STORED = new FieldType();
static {
    TYPE_NOT_STORED.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    TYPE_NOT_STORED.setTokenized(true);
    TYPE_NOT_STORED.freeze();
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    TYPE_STORED.setTokenized(true);
    TYPE_STORED.setStored(true);
    TYPE_STORED.freeze();
}
The indexOptions of TYPE_NOT_STORED is IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.
The analysis above suggested changing indexOptions. To check whether that was right, I ran the following test, changing the code that adds the Document to:
private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    type.setStored(true);
    // The id field uses a custom FieldType instead of Lucene's preset IntField
    document.add(new Field("id", String.valueOf(id), type));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}
In testing, the Document could now be deleted by its id.
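Note that this test changed two things at once: the index options, and the fact that the id is now indexed as a string instead of through IntField. A toy model may help picture why the second change matters for an exact term match. Everything here is an assumption for illustration: ToyTermIndex is a made-up class, and encodeInt is a stand-in 4-byte encoding, not Lucene's actual numeric term format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy inverted index for one field: term bytes -> ids of the documents containing that term. */
class ToyTermIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Hypothetical stand-in for a numeric term encoding (NOT Lucene's real format). */
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Plain text term, as an untokenized string field would index it. */
    static byte[] encodeString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    void add(byte[] term, int docId) {
        postings.computeIfAbsent(Arrays.toString(term), k -> new HashSet<>()).add(docId);
    }

    /** Removes the documents whose term bytes equal the given bytes exactly; returns how many matched. */
    int deleteByTerm(byte[] term) {
        Set<Integer> removed = postings.remove(Arrays.toString(term));
        return removed == null ? 0 : removed.size();
    }

    public static void main(String[] args) {
        ToyTermIndex numericIds = new ToyTermIndex();
        numericIds.add(encodeInt(1), 100);          // id indexed in an encoded numeric form
        ToyTermIndex stringIds = new ToyTermIndex();
        stringIds.add(encodeString("1"), 100);      // id indexed as the text "1"

        // The delete term is built from String.valueOf(id), as in delete() above.
        System.out.println(numericIds.deleteByTerm(encodeString("1"))); // prints 0
        System.out.println(stringIds.deleteByTerm(encodeString("1")));  // prints 1
    }
}
```

In this model the delete silently matches nothing when the indexed bytes and the query bytes come from different encodings, which mirrors the "no exception, but nothing deleted" symptom described at the start.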
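At this point some readers may have noticed something off. If the example were doing keyword search on content, all of this would be perfectly normal: text that long is usually tokenized and indexed, then searched and deleted by keyword. But id is just a number. Tokenizing it in order to look up or delete by id makes little sense. Besides, the id is now indexed as text; if it is tokenized, then with three ids 1, 10, and 110, a delete intended for the Document with id 1 might, depending on the analyzer, end up removing every Document whose id contains a 1, which would be a disaster. The natural fix is to disable tokenization, which in Lucene 5.5 is controlled by the tokenized property of FieldType. With tokenization disabled, indexOptions only needs to be set to IndexOptions.DOCS.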
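The following solution also deletes the Document successfully, and is the better one. It is close to what the preset StringField already does, so new StringField("id", String.valueOf(id), Field.Store.YES) should behave much the same way: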
private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS);
    type.setTokenized(false);
    type.setStored(true);
    document.add(new Field("id", String.valueOf(id), type));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}
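To make the 1, 10, 110 concern concrete, here is a small sketch that does not use Lucene at all. ToyIdIndex is a made-up class, and its one-term-per-character "tokenizer" is purely hypothetical, chosen to show the worst case; StandardAnalyzer would normally keep a number such as 110 as a single token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy index over an id field: term -> ids of the documents containing that term. */
class ToyIdIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();
    private final boolean tokenized;

    ToyIdIndex(boolean tokenized) {
        this.tokenized = tokenized;
    }

    void addId(String id) {
        if (tokenized) {
            // Hypothetical worst-case tokenizer: one term per character.
            for (char c : id.toCharArray()) {
                index(String.valueOf(c), id);
            }
        } else {
            // Untokenized: the whole value is a single term.
            index(id, id);
        }
    }

    private void index(String term, String id) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
    }

    /** Returns the ids that a delete-by-term on the given term would hit, in sorted order. */
    List<String> matches(String term) {
        return new ArrayList<>(postings.getOrDefault(term, Collections.emptySet()));
    }

    public static void main(String[] args) {
        ToyIdIndex split = new ToyIdIndex(true);
        ToyIdIndex exact = new ToyIdIndex(false);
        for (String id : new String[] {"1", "10", "110"}) {
            split.addId(id);
            exact.addId(id);
        }
        System.out.println(split.matches("1")); // prints [1, 10, 110]
        System.out.println(exact.matches("1")); // prints [1]
    }
}
```

With the whole id kept as one term, a delete on "1" can only ever hit the document whose id is exactly 1, which is the behavior wanted for an identifier.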
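Finally, Lucene's API offers developers a lot of convenience. For example, it provides ready-made classes such as IntField and LongField for fields of the common primitive types. But before using them you need to understand how they work in detail and check that this matches your requirements. Knowing how to call an API without knowing why it behaves as it does leaves you unable to tell where things went wrong. I am just starting out with Lucene, and I write this as a reminder to myself.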