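One of the biggest changes in Lucene 4 is its pluggable codec architecture, which lets you define the index structure yourself: terms, postings lists, stored fields, term vectors, deleted documents, segment info, and field infos.
About codecs:
Lucene 4 already ships with several codec implementations:
Lucene40: the default codec. Lucene40Codec
Lucene3x: read-only. It can read indexes created with 3.x, but it cannot be used to create an index. Lucene3xCodec
SimpleText: stores the index as plain text. Good for learning, not recommended for production use. SimpleTextCodec
Appending: for file systems that are written in append-only fashion, such as HDFS. AppendingCodec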
......
About formats:
A codec is really just a bundle of formats. One codec contains 8 formats in total:
PostingsFormat, DocValuesFormat, StoredFieldsFormat, TermVectorsFormat, FieldInfosFormat, SegmentInfoFormat, NormsFormat, LiveDocsFormat
For example, StoredFieldsFormat handles stored fields and TermVectorsFormat handles term vectors. In Lucene 4 you can supply your own implementation of each format.
Lucene 4 also currently provides several PostingsFormat implementations:
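Memory: loads all terms and postings lists into an in-memory FST. MemoryPostingsFormat
Direct: writes with the default Lucene40PostingsFormat, then loads the terms and postings lists into memory at read time. DirectPostingsFormat
Pulsing: by default, terms that occur in at most one document (docFreq <= 1) have their postings stored inline in the terms dictionary. PulsingPostingsFormat (the concrete class registered under the name "Pulsing40" is Pulsing40PostingsFormat)
BloomFilter: adds a Bloom filter for a specified field on each segment. It implements a "fast-fail" check for whether a segment contains a given key. The best fit is adding a Bloom filter to the primary-key field when the index holds many records and also has many segments. BloomFilteringPostingsFormat has to be layered on top of another PostingsFormat. There is a BloomFilter benchmark here: https://docs.google.com/spreadsheet/ccc?key=0AsKVSn5SGg_wdFNpNTl3R1cxLTluTTcya2hDRnlfdHc#gid=3
Block: compresses the index and also improves search performance. It may become the default PostingsFormat in a future release. Anyone who wants to use it now should note that in this version it is still experimental and backward compatibility of the index format is not guaranteed. Unlike Lucene40, BlockPostingsFormat does not create .frq and .prx files; it writes .doc and .pos files instead.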
....
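The "fast-fail" idea behind the BloomFilter entry above can be illustrated with a small, Lucene-free toy model. This is only a sketch under stated assumptions: the class names (BloomFastFailSketch, SimpleBloom, Segment), the filter size, the number of hash probes, and the hash functions are all hypothetical and are not what BloomFilteringPostingsFormat actually uses. It only shows why a per-segment filter lets a primary-key lookup skip most segments without touching their term dictionaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only; not Lucene's implementation. All names here are hypothetical. */
class BloomFastFailSketch {

    /** Minimal Bloom filter: k probe positions derived from two base hashes. */
    static final class SimpleBloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        /** i-th probe position for the key, always in [0, numBits). */
        private int probe(String key, int i) {
            int h1 = 0x811C9DC5; // FNV-1a style hash over the UTF-8 bytes
            for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
                h1 ^= (b & 0xFF);
                h1 *= 16777619;
            }
            int h2 = key.hashCode() | 1; // second hash, forced odd
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(probe(key, i));
            }
        }

        /** false means "definitely absent"; true means "possibly present". */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(probe(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Stand-in for an index segment: an exact key set guarded by a Bloom filter. */
    static final class Segment {
        private final Set<String> keys = new HashSet<String>();
        private final SimpleBloom bloom = new SimpleBloom(1 << 16, 3);
        int exactLookups = 0; // how often the "expensive" exact lookup really ran

        void add(String key) {
            keys.add(key);
            bloom.add(key);
        }

        boolean contains(String key) {
            if (!bloom.mightContain(key)) {
                return false; // fast fail: skip the exact lookup entirely
            }
            exactLookups++;
            return keys.contains(key);
        }
    }

    /** Returns the index of the first segment holding the key, or -1 if none does. */
    static int findSegment(List<Segment> segments, String key) {
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).contains(key)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<Segment>();
        for (int s = 0; s < 4; s++) {
            Segment seg = new Segment();
            for (int i = 0; i < 1000; i++) {
                seg.add("doc-" + s + "-" + i);
            }
            segments.add(seg);
        }
        System.out.println("found in segment " + findSegment(segments, "doc-3-42"));
        for (int s = 0; s < segments.size(); s++) {
            System.out.println("segment " + s + " exact lookups: " + segments.get(s).exactLookups);
        }
    }
}
```

A Bloom filter never reports a key it has seen as absent, so the shortcut cannot hide a real match. It can report false positives, which is why the exact lookup still runs whenever the filter says "possibly present".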
Test code:
package test;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.appending.AppendingCodec;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.codecs.lucene3x.Lucene3xCodec;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;
import org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * lucene codec
 *
 * @author wuwen
 * @date 2013-01-14 04:54:17 PM
 */
public class LuceneCodecTest {

    /** Returns the codec for the given name, or null if the name is unknown. */
    static Codec getCodec(String codecname) {
        Codec codec = null;
        if ("Lucene40".equals(codecname)) {
            codec = new Lucene40Codec();
        } else if ("Lucene3x".equals(codecname)) {
            // Read-only codec: writing an index with it fails.
            codec = new Lucene3xCodec();
            // throw new UnsupportedOperationException("this codec can only be used for reading");
        } else if ("SimpleText".equals(codecname)) {
            codec = new SimpleTextCodec();
        } else if ("Appending".equals(codecname)) {
            codec = new AppendingCodec();
        } else if ("Pulsing40".equals(codecname)) {
            // The remaining branches keep Lucene40Codec and swap only the postings format.
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Pulsing40");
                }
            };
        } else if ("Memory".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Memory");
                }
            };
        } else if ("BloomFilter".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // The Bloom filter wraps a delegate postings format.
                    return new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());
                }
            };
        } else if ("Direct".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Direct");
                }
            };
        } else if ("Block".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Block");
                }
            };
        }
        return codec;
    }

    public static void main(String[] args) {
        String[] codecs = {"Lucene40", "Lucene3x", "SimpleText", "Appending", "Pulsing40",
                "Memory", "BloomFilter", "Direct", "Block"};
        String basePath = "E:\\lucene\\codec\\";

        // Write the same three documents into one index directory per codec.
        for (String codecname : codecs) {
            String indexPath = basePath + codecname;
            Codec codec = getCodec(codecname);
            Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_40);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
            config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            config.setCodec(codec); // set the codec
            IndexWriter writer = null;
            try {
                Directory luceneDir = FSDirectory.open(new File(indexPath));
                writer = new IndexWriter(luceneDir, config);
                List<Document> list = new ArrayList<Document>();

                Document doc = new Document();
                doc.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc.add(new TextField("Content", "北京时间1月14日04:00(西班牙当地时间13日21:00),2012/13赛季西班牙足球甲级联赛第19轮一场焦点战在纳瓦拉国王球场展开争夺.", Field.Store.YES));
                list.add(doc);

                Document doc1 = new Document();
                doc1.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc1.add(new TextField("Content", "巴萨超皇马18分毁了西甲?媒体惊呼 克鲁伊夫看不下去.", Field.Store.YES));
                list.add(doc1);

                Document doc2 = new Document();
                doc2.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));
                doc2.add(new TextField("Content", "what changes in lucene4.", Field.Store.YES));
                list.add(doc2);

                writer.addDocuments(list);
            } catch (Exception e) {
                // Expected for Lucene3x, which cannot write; the loop then moves on.
                e.printStackTrace();
            } finally {
                if (writer != null) {
                    try {
                        writer.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}
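After running the test, one way to see what each codec actually wrote is to list the file extensions in every per-codec index directory, for instance to check the .frq/.prx versus .doc/.pos difference mentioned for Block. The helper below is a minimal sketch that uses only the JDK. The class name IndexFileExtensions and the method extensionsOf are hypothetical, and the fallback path simply reuses the base directory from the test code. Depending on the writer's compound-file settings, a segment may be packed into compound files, in which case the individual per-format files will not be visible in the listing.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper, not part of Lucene: summarizes file extensions per index directory. */
class IndexFileExtensions {

    /** Returns the sorted set of extensions (without the dot) among the given file names. */
    static SortedSet<String> extensionsOf(Collection<String> fileNames) {
        SortedSet<String> exts = new TreeSet<String>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            // Skip names without an extension, dot-files, and names ending in a dot.
            if (dot > 0 && dot < name.length() - 1) {
                exts.add(name.substring(dot + 1));
            }
        }
        return exts;
    }

    public static void main(String[] args) {
        // Base directory: first argument, else the path used by the test code above.
        File base = new File(args.length > 0 ? args[0] : "E:\\lucene\\codec");
        File[] codecDirs = base.listFiles();
        if (codecDirs == null) {
            System.err.println("Not a directory: " + base);
            return;
        }
        Arrays.sort(codecDirs);
        for (File dir : codecDirs) {
            String[] names = dir.list(); // null when dir is not a directory
            if (names != null) {
                System.out.println(dir.getName() + " -> " + extensionsOf(Arrays.asList(names)));
            }
        }
    }
}
```

Comparing the printed sets across directories gives a quick view of how the choice of codec or postings format changes the on-disk layout.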