Fixing Documents that cannot be deleted in Lucene

1. The problem

With Lucene 5.5, calling IndexWriter's deleteDocuments() method threw no exception, yet the Document was not removed from the index. The relevant code:

  • Adding a Document
private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    document.add(new IntField("id", id, Field.Store.YES));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}
  • Deleting a Document
public void delete(final int id)
{
    Directory directory = null;
    IndexWriter indexWriter = null;
    try
    {
        directory = FSDirectory.open(Paths.get(INDEX_LOCATION));
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        indexWriter = new IndexWriter(directory, config);
        indexWriter.deleteDocuments(new Term("id", String.valueOf(id)));
    } catch (IOException e)
    {
        e.printStackTrace();
    } finally
    {
        if(indexWriter != null)
            try
            {
                indexWriter.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        if(null != directory)
            try
            {
                directory.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
    }
}

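When adding the Document, every field uses one of Lucene's prebuilt classes such as IntField and TextField. In testing, deletion only succeeded when the Term was built on content (a TextField). Why? Comparing the source of IntField and TextField is a good place to start.

2. The underlying cause

The first thing that comes to mind is a difference in index options between the two field types.
IntField and TextField are both subclasses of Field, and Field holds a member variable FieldType type that describes a field's options, including its storage and index options.
FieldType has the following members:

private boolean stored; // storage option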
private boolean tokenized = true;
private boolean storeTermVectors;
private boolean storeTermVectorOffsets;
private boolean storeTermVectorPositions;
private boolean storeTermVectorPayloads;
private boolean omitNorms;
// index options
private IndexOptions indexOptions = IndexOptions.NONE;
private NumericType numericType;
private boolean frozen;
private int numericPrecisionStep = NumericUtils.PRECISION_STEP_DEFAULT;
private DocValuesType docValuesType = DocValuesType.NONE;

Next, the definition of IndexOptions:

public enum IndexOptions { 
  // NOTE: order is important here; FieldInfo uses this
  // order to merge two conflicting IndexOptions (always
  // "downgrades" by picking the lowest).
  /** Not indexed */
  NONE,
  /**
   * Only documents are indexed: term frequencies and positions are omitted.
   * Phrase and other positional queries on the field will throw an exception, and scoring
   * will behave as if any term in the document appears only once.
   */
  DOCS,
  /**
   * Only documents and term frequencies are indexed: positions are omitted.
   * This enables normal scoring, except Phrase and other positional queries
   * will throw an exception.
   */
  DOCS_AND_FREQS,
  /**
   * Indexes documents, frequencies and positions.
   * This is a typical default for full-text search: full scoring is enabled
   * and positional queries are supported.
   */
  DOCS_AND_FREQS_AND_POSITIONS,
  /**
   * Indexes documents, frequencies, positions and offsets.
   * Character offsets are encoded alongside the positions.
   */
  DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
}

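My first guess from this was that a field can only be looked up and deleted by keyword when its index option ranks above DOCS. (Section 4 shows this guess is too strong: DOCS is enough for an untokenized field.)
Next, compare IntField and TextField.

  • IntField
    The IntField constructor used in the add-Document code is defined as follows: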
public IntField(String name, int value, Store stored) {
    super(name, stored == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
    fieldsData = Integer.valueOf(value);
}

Now look at the static member TYPE_STORED:

public static final FieldType TYPE_STORED = new FieldType();
static {
    TYPE_STORED.setTokenized(true);
    TYPE_STORED.setOmitNorms(true);
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS);
    TYPE_STORED.setNumericType(FieldType.NumericType.INT);
    TYPE_STORED.setNumericPrecisionStep(NumericUtils.PRECISION_STEP_DEFAULT_32);
    TYPE_STORED.setStored(true);
    TYPE_STORED.freeze();
}

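So IntField ships two preset FieldTypes, TYPE_STORED and TYPE_NOT_STORED, and both set indexOptions to IndexOptions.DOCS. Both also set a numericType, which means the value is indexed as prefix-coded numeric terms rather than as the decimal string that new Term("id", String.valueOf(id)) looks for.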

  • TextField
    The TextField constructor used when adding the Document is defined as follows:
public TextField(String name, Reader reader) {
  super(name, reader, TYPE_NOT_STORED);
}

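Likewise, look at the static member TYPE_NOT_STORED:

/** Indexed, tokenized, not stored. */
public static final FieldType TYPE_NOT_STORED = new FieldType();

/** Indexed, tokenized, stored. */
public static final FieldType TYPE_STORED = new FieldType();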

static {
    TYPE_NOT_STORED.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    TYPE_NOT_STORED.setTokenized(true);
    TYPE_NOT_STORED.freeze();
    TYPE_STORED.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    TYPE_STORED.setTokenized(true);
    TYPE_STORED.setStored(true);
    TYPE_STORED.freeze();
}

The indexOptions of TYPE_NOT_STORED is IndexOptions.DOCS_AND_FREQS_AND_POSITIONS.
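Rather than reading the source, the preset types can also be inspected at runtime. The following is a minimal sketch, assuming Lucene 5.5 is on the classpath; the class name PresetFieldTypes and the describe helper are made up for this illustration, while the getters it calls are FieldType's own.

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical helper class, not part of Lucene.
public class PresetFieldTypes {

    /** Formats the settings of a preset FieldType that matter for term lookup. */
    static String describe(String label, FieldType type) {
        return label
                + ": indexOptions=" + type.indexOptions()
                + ", tokenized=" + type.tokenized()
                + ", stored=" + type.stored()
                + ", numericType=" + type.numericType();
    }

    public static void main(String[] args) {
        // The three presets used by addSingleDoc in this post.
        System.out.println(describe("IntField.TYPE_STORED", IntField.TYPE_STORED));
        System.out.println(describe("StringField.TYPE_STORED", StringField.TYPE_STORED));
        System.out.println(describe("TextField.TYPE_NOT_STORED", TextField.TYPE_NOT_STORED));
    }
}
```

Printing the three side by side makes it easy to see which settings actually differ between the numeric field and the two text fields.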

3. A first fix

The analysis above points to changing indexOptions. To check this, I ran the following test, changing the add-Document code to:

private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    type.setStored(true);
    // the id field uses a custom FieldType instead of Lucene's prebuilt IntField
    document.add(new Field("id", String.valueOf(id), type));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}

In testing, the Document could now be deleted by its id.
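If the id needs to stay an IntField, another option (not the approach taken in this post) is to delete with a numeric query instead of a string Term. The sketch below assumes Lucene 5.5 on the classpath and uses an in-memory RAMDirectory so it can be run on its own; the class IntFieldDeleteSketch and its method are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical demo class, not part of Lucene.
public class IntFieldDeleteSketch {

    /** Indexes ids 1..3 as IntField, tries to delete id 1, returns the live document count. */
    static int liveDocsAfterDelete(boolean useNumericQuery) {
        try (Directory dir = new RAMDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                for (int id = 1; id <= 3; id++) {
                    Document doc = new Document();
                    doc.add(new IntField("id", id, Field.Store.YES));
                    writer.addDocument(doc);
                }
                if (useNumericQuery) {
                    // Matches the numeric terms that IntField actually indexed.
                    writer.deleteDocuments(NumericRangeQuery.newIntRange("id", 1, 1, true, true));
                } else {
                    // Same call as the delete() method in section 1.
                    writer.deleteDocuments(new Term("id", "1"));
                }
            } // closing the writer commits the pending delete
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("string Term:   " + liveDocsAfterDelete(false));
        System.out.println("numeric query: " + liveDocsAfterDelete(true));
    }
}
```

Comparing the two counts shows whether each deletion matched anything, which is a quicker check than inspecting the index directory by hand.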

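4. Going further

At this point some readers may sense that something is off. For content, keyword search is perfectly natural: text that long is normally tokenized and indexed, then searched and deleted by keyword. But id is just a number. Tokenizing it merely to look up and delete by id makes little sense. Besides, id is now stored as text, so if it were tokenized and there were three ids 1, 10 and 110, a request to delete the Document with id 1 might, for all I know, end up deleting every Document whose id contains a 1, which would be a disaster. So the natural idea is to turn tokenization off; in Lucene 5.5 this is controlled by the tokenized property of FieldType. With tokenization off, indexOptions only needs to be set to IndexOptions.DOCS.
The following solution also deletes the Document successfully, and is the better one: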

private void addSingleDoc(Path file, final IndexWriter indexWriter, final int id) throws IOException
{
    Document document = new Document();
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS);
    type.setTokenized(false);
    type.setStored(true);

    document.add(new Field("id", String.valueOf(id), type));
    document.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
    document.add(new TextField("content", Files.newBufferedReader(file)));
    document.add(new LongField("modified", Files.getLastModifiedTime(file).toMillis(), Field.Store.YES));
    indexWriter.addDocument(document);
}
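To check the exact-match behaviour with the ids 1, 10 and 110 mentioned above, a small self-contained test can index them with the same FieldType settings and delete id 1. This is only a sketch, assuming Lucene 5.5 on the classpath and an in-memory RAMDirectory; ExactIdDeleteSketch and its method are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical demo class, not part of Lucene.
public class ExactIdDeleteSketch {

    /** Indexes ids 1, 10 and 110 untokenized, deletes id 1, returns the live document count. */
    static int liveDocsAfterDeletingIdOne() {
        // Same settings as the custom FieldType in the solution above.
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS);
        type.setTokenized(false);
        type.setStored(true);
        type.freeze();

        try (Directory dir = new RAMDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                for (int id : new int[] {1, 10, 110}) {
                    Document doc = new Document();
                    doc.add(new Field("id", String.valueOf(id), type));
                    writer.addDocument(doc);
                }
                // The whole value is a single term, so only the exact id "1" matches.
                writer.deleteDocuments(new Term("id", "1"));
            } // closing the writer commits the pending delete
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return reader.numDocs();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("live documents: " + liveDocsAfterDeletingIdOne());
    }
}
```

These settings are close to what the prebuilt StringField already uses (untokenized, IndexOptions.DOCS), so StringField may be worth considering for an id that is handled as text.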

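Finally, Lucene's API offers developers a lot of convenience. For example, it provides ready-made classes such as IntField and LongField for fields of common primitive types. But before using them you need to understand how they work in detail, to be sure they match your requirements. Knowing only how to call an API without knowing why it behaves as it does means you will not even know where things went wrong. I am just starting out with Lucene, and this is a note to myself.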
