What is the relationship between field and term?
[terms are what gets tokenized out of a field]
===============
To keep index files small, Lucene also applies compression to the index. First, keys in the term dictionary are prefix-compressed:
each key is stored as <prefix length, suffix>. For example, if the current term is "阿拉伯语" and the previous term is "阿拉伯", then "阿拉伯语" is stored as <3,语>.
Second, numbers are heavily delta-compressed: only the difference from the previous value is stored (a smaller number takes fewer bytes to encode).
For example, if the current doc number is 16389 (3 bytes uncompressed) and the previous doc number is 16382, only the delta 7 is stored (a single byte).
The Field Data file is compressed with zlib.
Are zlib/zip inferior to PPMd?
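The two tricks above can be sketched in plain Java. This is my own illustration, not Lucene's actual code: the class and method names are made up, and `vIntByteCount` only counts how many bytes a VInt-style encoding (7 payload bits per byte) would use for a value.

```java
public class CompressionSketch {
    // Dictionary keys: store <length of shared prefix, suffix>.
    static String[] compressKey(String prev, String cur) {
        int p = 0;
        while (p < prev.length() && p < cur.length()
                && prev.charAt(p) == cur.charAt(p)) p++;
        return new String[] { Integer.toString(p), cur.substring(p) };
    }

    // Doc numbers: store the delta from the previous value; a
    // VInt-style encoding then needs fewer bytes for small values.
    static int vIntByteCount(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) { value >>>= 7; n++; }
        return n;
    }

    public static void main(String[] args) {
        String[] k = compressKey("阿拉伯", "阿拉伯语");
        System.out.println("<" + k[0] + "," + k[1] + ">");   // prints <3,语>
        System.out.println(vIntByteCount(16389));            // 3 bytes raw
        System.out.println(vIntByteCount(16389 - 16382));    // 1 byte as a delta
    }
}
```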
++++++++++++++++++
Lucene indexes may be composed of multiple sub-indexes, or segments.
Each segment has a separate Fieldable Info file.
A token is a single occurrence of a term; it carries the term text, the start/end offsets, and a type string.
A filter restricts a search to a subset of the index. It is somewhat like the WHERE clause in SQL, but with a difference: it is not part of the regular query;
it preprocesses the data source and hands the result to the query. Note that it does preprocessing, not post-filtering of the query results,
so a filter can be expensive: it may make a single query up to a hundred times slower.
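The "preprocessing" idea can be shown in plain Java (this is a conceptual sketch, not Lucene's `Filter` API): the filter precomputes a bit set over all doc ids before the query runs, and the search then only considers docs whose bit is set.

```java
import java.util.BitSet;

public class FilterSketch {
    // Pretend docTimes[doc] is a per-document timestamp field; the
    // "filter" keeps only docs whose time falls in [lo, hi], over the
    // WHOLE index -- which is why it can cost more than the query itself.
    static BitSet rangeFilter(long[] docTimes, long lo, long hi) {
        BitSet bits = new BitSet(docTimes.length);
        for (int doc = 0; doc < docTimes.length; doc++)
            if (docTimes[doc] >= lo && docTimes[doc] <= hi)
                bits.set(doc);
        return bits;
    }

    public static void main(String[] args) {
        long[] times = {10, 20, 30, 40};
        System.out.println(rangeFilter(times, 15, 35)); // prints {1, 2}
    }
}
```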
+++++++++++++++++
While building the index, only a single thread does the processing.
When this lock file is present, a writer is currently modifying the index (adding or removing documents).
++++++++++++++++++
while (left < right && array[left].fieldInfo.name.compareTo(partition.fieldInfo.name) <= 0) // the "<=" here should be "<"
++left;
++++++++++ Merge +++++++++++
mergeFields is always a two-way merge; no multi-way merge is used.
During a merge, the I/O is done by an IndexReader.
? When does a merge start
MergeFactor
Classes and file types involved:
FieldInfos  .fnm
How is it handled when a single file cannot fit entirely into memory?
+++++++++++++++++++++++
Ways to speed up indexing:
Open a single writer and re-use it for the duration of your indexing session.
Turn off compound file format (but you may run out of file descriptors).
Re-use Document and Field instances
Always add fields in the same order to your Document, when using stored fields or term vectors
>Use autoCommit=false when you open your IndexWriter (this setting is gone in 3.0.1)
>Turn off any features you are not in fact using
>Use a faster analyzer.
>Speed up document construction.
>Don't optimize unless you really need to (for faster searching)
>Index into separate indices then merge.
>setMaxBufferedDocs(int maxBufferedDocs)
Controls how many documents are buffered in memory before a new segment is written; a larger value speeds up indexing. The default is -1 (disabled).
From the javadoc: determines the minimal number of documents required before the buffered in-memory documents are flushed as a new segment. Large values generally give faster indexing.
Disabled by default (the writer flushes by RAM usage).
>We can write the index into a RAMDirectory first and batch-write it into an FSDirectory once it reaches a certain size, reducing the number of disk I/O operations.
>setMergeFactor (100-200 is a good range at ~100k docs)
>setRAMBufferSizeMB (defaults to 16 MB)
From the javadoc: generally, for faster indexing performance it's best to flush by RAM usage instead of document count, and to use as large a RAM buffer as you can.
>Optimize time-range restrictions (when the search must be limited to a given time range)
>Quick tips:
Keep the size of the index small. Eliminate norms and term vectors when not needed. Set the Store flag for a field only if it is a must.
Obvious, but an oft-repeated mistake: create only one instance of Searcher and reuse it.
Keep the index on fast disks. RAM, if you are paranoid.
>Use multiple threads with one IndexWriter
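Several of the tips above map onto writer settings in the 3.0.x API. The following is a sketch pieced together from the javadocs, not compiled here; the index path is hypothetical:

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File("/path/to/index")),   // hypothetical path
    new StandardAnalyzer(Version.LUCENE_30),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setUseCompoundFile(false);  // non-compound format; watch file descriptors
writer.setMergeFactor(100);        // 100-200 per the note above
writer.setRAMBufferSizeMB(64.0);   // flush by RAM usage; default is 16 MB
// ... reuse this single writer for the whole indexing session, then:
writer.close();
```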
+++++++++++++++++++++
dspeak@lm-vm01:~/luowl/soft/lucene-3.0.1/src/demo$ java -cp .:/home/dspeak/luowl/search-node/lib/* org.apache.lucene.demo.IndexFiles ~/luowl/soft/data1.xml
+++++++++++++++++++
Every optimize rebuilds the index.
+++++++++++++++++++++
What do LEVEL_LOG_SPAN and the levels mean?
Whenever extra segments (beyond the merge factor upper bound) are encountered,
all segments within the level are merged.
Segments are grouped by level.
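My reading of the level grouping, sketched in plain Java (the class and method names are made up; LEVEL_LOG_SPAN = 0.75 matches the Lucene source): a segment's level is roughly log base mergeFactor of its size, and segments whose level falls within LEVEL_LOG_SPAN of the largest level in a window are grouped into the same level and merged together.

```java
public class MergeLevelSketch {
    static final double LEVEL_LOG_SPAN = 0.75; // as in LogMergePolicy

    // level ~= log_mergeFactor(size)
    static double level(long sizeInDocs, int mergeFactor) {
        return Math.log(Math.max(1, sizeInDocs)) / Math.log(mergeFactor);
    }

    // True if a segment of size `small` falls on the same level as a
    // segment of size `large`, i.e. within LEVEL_LOG_SPAN below it.
    static boolean sameLevel(long small, long large, int mergeFactor) {
        return level(small, mergeFactor) >= level(large, mergeFactor) - LEVEL_LOG_SPAN;
    }

    public static void main(String[] args) {
        // With mergeFactor=10, a 700-doc segment (level ~2.85) shares a
        // level with a 1000-doc segment (level ~3.0); a 100-doc one
        // (level 2.0) does not.
        System.out.println(sameLevel(700, 1000, 10));  // true
        System.out.println(sameLevel(100, 1000, 10));  // false
    }
}
```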
++++++++++++++
We keep a separate postings hash and other state for each thread, and then merge the postings hashes from all threads when writing the segment.
++++++++++++++++++++++++
/*
This is the current indexing chain:
DocConsumer / DocConsumerPerThread
  --> code: DocFieldProcessor / DocFieldProcessorPerThread
    --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
      --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
        --> code: DocInverter / DocInverterPerThread / DocInverterPerField
          --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
            --> code: TermsHash / TermsHashPerThread / TermsHashPerField
              --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
                --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
                --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
          --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
            --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
        --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
*/
++++++++++++++++++++++++++++++
What is TermVector used for?
[computing term frequencies, finding the positions where a term occurs, and so on]
++++++++++++++++++++
Based on code analysis and testing, even without calling Optimize, close() still flushes; FormatPostingsDocsWriter.addDoc gets called.