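There is one last Field feature left to discuss: truncation. This is the MaxFieldLength setting mentioned earlier in "Basic index operations". Taken literally, it is just a value that sets the maximum length of a Field.
But what exactly does it limit? The number of same-name fields within one document, or across different documents? The number of terms across all same-name fields, or the number of terms in a single field?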
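The answer: the maximum number of terms indexed for one field of one document. It counts running tokens, not distinct terms, and the count starts over for every document, so other documents that use the same field name are unaffected. One caveat about the same-name Field instances indexed the way section 10 (Field with multiple values) describes: as far as I can tell from the Lucene 3.0 indexing code, all instances named "author" inside one document share a single token counter, so the limit applies to them together rather than to each Field instance separately. Never mind, this is hard to put into words...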
Straight to the code:
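// Package names as of Lucene 3.0.
import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
// TestUtil.hitCount is a helper not shown here (it returns the number of hits
// for a query); import it from wherever it lives in your project.

public class IndexTruncationTest extends TestCase {

    private Directory directory;

    public void setUp() throws Exception {
        directory = new RAMDirectory();
        // Index at most 50 terms per field.
        IndexWriter writer = new IndexWriter(directory,
                new WhitespaceAnalyzer(), new MaxFieldLength(50));
        writer.setInfoStream(System.out);

        // A single Field instance holding 100 terms: Test0 ... Test99
        Document doc = new Document();
        StringBuilder term = new StringBuilder();
        for (int i = 0; i < 100; ++i) {
            term.append(" Test").append(i);
        }
        doc.add(new Field("test", term.toString(),
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }

    // The first 50 terms can be found
    public void testUnTruncated() throws Exception {
        IndexSearcher is = new IndexSearcher(directory);
        Query query0 = new TermQuery(new Term("test", "Test0"));
        Query query2 = new TermQuery(new Term("test", "Test2"));
        Query query49 = new TermQuery(new Term("test", "Test49"));
        int i0 = TestUtil.hitCount(is, query0);
        int i2 = TestUtil.hitCount(is, query2);
        int i49 = TestUtil.hitCount(is, query49);
        assertEquals(1, i0);
        assertEquals(1, i2);
        assertEquals(1, i49);
        is.close();
    }

    // The last 50 terms were mercilessly ignored...
    public void testTruncated() throws Exception {
        IndexSearcher is = new IndexSearcher(directory);
        Query query50 = new TermQuery(new Term("test", "Test50"));
        Query query52 = new TermQuery(new Term("test", "Test52"));
        Query query99 = new TermQuery(new Term("test", "Test99"));
        int i50 = TestUtil.hitCount(is, query50);
        int i52 = TestUtil.hitCount(is, query52);
        int i99 = TestUtil.hitCount(is, query99);
        assertEquals(0, i50);
        assertEquals(0, i52);
        assertEquals(0, i99);
        is.close();
    }
}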
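To see the effect without a Lucene dependency, here is a rough plain-Java model of what the test observes for a single Field instance. The class TruncationSketch and its methods indexedTokens and buildValue are hypothetical names made up for this illustration. This is not Lucene's code; it only assumes whitespace tokenization and a hard cut after maxFieldLength tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of truncation for ONE field value; not Lucene's implementation.
class TruncationSketch {

    // Splits the text on whitespace and keeps at most maxFieldLength tokens.
    static List<String> indexedTokens(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input produces a single empty string
            }
            if (kept.size() >= maxFieldLength) {
                break; // everything after the limit is ignored
            }
            kept.add(token);
        }
        return kept;
    }

    // Builds the same value as the test: " Test0 Test1 ... Test(n-1)".
    static String buildValue(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; ++i) {
            sb.append(" Test").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> kept = indexedTokens(buildValue(100), 50);
        System.out.println(kept.size());             // 50
        System.out.println(kept.contains("Test49")); // true
        System.out.println(kept.contains("Test50")); // false
    }
}
```

The model agrees with the two test methods: Test0 through Test49 survive, and Test50 through Test99 never reach the index.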
In this code, the call writer.setInfoStream(PrintStream infoStream) lets you inspect diagnostic information, maxFieldLength included. I passed the standard output stream to it, so my console printed the following:
IFD [Tue Jun 14 16:45:11 CST 2011; main]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@10e3293
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: setInfoStream: dir=org.apache.lucene.store.RAMDirectory@4a63d8 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@1e0ff2f mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@1d6776d mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@13ad085 ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=50 index=
maxFieldLength 50 reached for field test, ignoring following tokens
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: now flush at close
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: flush: now pause all indexing threads
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=true flushDocStores=true numDocs=1 numBufDelTerms=0
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: index before flush
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: DW: flush postings as segment _0 numDocs=1
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: DW: closeDocStore: 2 files to flush to segment _0 numDocs=1
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: DW: oldRAMSize=111616 newFlushedSize=1241 docs/MB=844.944 new/old=1.112%
IW 0 [Tue Jun 14 16:45:11 CST 2011; main]: flushedFiles=[_0.fdt, _0.frq, _0.nrm, _0.fdx, _0.tii, _0.tis, _0.fnm, _0.prx]
IFD [Tue Jun 14 16:45:11 CST 2011; main]: now checkpoint "segments_1" [1 segments ; isCommit = false]
IFD [Tue Jun 14 16:45:11 CST 2011; main]: now checkpoint "segments_1" [1 segments ; isCommit = false]
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.fdt"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.frq"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.nrm"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.fdx"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.tii"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.tis"
IFD [Tue Jun 14 16:45:11 CST 2011; main]: delete "_0.prx"
...
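The line "maxFieldLength 50 reached for field test, ignoring following tokens" is the one that signals truncation. If you capture the info stream into a string instead of printing it (for example by handing setInfoStream a PrintStream that wraps a ByteArrayOutputStream), a small scan like the one below could flag truncated fields. The class TruncationLogScan and the method firstTruncation are hypothetical names, and the pattern assumes the message format is exactly as printed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for the truncation message in captured infoStream text.
class TruncationLogScan {

    // Message format copied from the console output above.
    private static final Pattern REACHED = Pattern.compile(
            "maxFieldLength (\\d+) reached for field (\\S+), ignoring following tokens");

    // Returns "field=limit" for the first truncation message, or null if there is none.
    static String firstTruncation(String log) {
        Matcher m = REACHED.matcher(log);
        return m.find() ? m.group(2) + "=" + m.group(1) : null;
    }

    public static void main(String[] args) {
        String log = "IW 0 [...]: setInfoStream: ... maxFieldLength=50 index=\n"
                + "maxFieldLength 50 reached for field test, ignoring following tokens\n"
                + "IW 0 [...]: now flush at close\n";
        System.out.println(firstTruncation(log)); // test=50
    }
}
```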
Of course, besides passing maxFieldLength to the IndexWriter constructor, you can also set the value with writer.setMaxFieldLength(int). The matching getter is writer.getMaxFieldLength(), which takes no arguments.
In addition, Lucene ships with two built-in limits. MaxFieldLength.UNLIMITED has the value 0x7fffffff, so "unlimited" really means capped at the largest int, Integer.MAX_VALUE. MaxFieldLength.LIMITED has the value 10000.
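A quick check of those two numbers in plain Java. The class FieldLengthLimits and its constants are hypothetical stand-ins that copy the values quoted above; they are not Lucene's own constants.

```java
// Hypothetical stand-ins for the two built-in limits; values copied from the text.
class FieldLengthLimits {

    static final int UNLIMITED = 0x7fffffff;
    static final int LIMITED = 10000;

    // True when a field with this many tokens would lose some of them under the limit.
    static boolean wouldTruncate(int tokenCount, int limit) {
        return tokenCount > limit;
    }

    public static void main(String[] args) {
        System.out.println(UNLIMITED == Integer.MAX_VALUE);  // true
        System.out.println(UNLIMITED);                       // 2147483647
        System.out.println(wouldTruncate(100, LIMITED));     // false
        System.out.println(wouldTruncate(100, 50));          // true
    }
}
```

So the 100-term field from the test would be indexed in full under either built-in limit; it was only cut because the test chose a limit of 50.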