Lucene: assorted notes

What is the relationship between a field and a term?

[Terms are produced by tokenizing the text of a field]

===============

To reduce index file size, Lucene also applies compression to the index. First, the keywords in the term dictionary are prefix-compressed:

a keyword is stored as <prefix length, suffix>. For example, if the current term is “阿拉伯语” and the previous term is “阿拉伯”, then “阿拉伯语” is compressed to <3, 语>.

Second, numbers are heavily delta-encoded: only the difference from the previous value is stored (this shortens the number, and therefore the bytes needed to store it).

For example, if the current document number is 16389 (3 bytes uncompressed) and the previous document number is 16382, only 7 is stored after compression (a single byte).
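The two schemes above can be sketched in plain Java. This is a simplified illustration of the idea, not Lucene's actual on-disk encoding; the byte count mirrors Lucene's variable-length VInt format (7 bits of payload per byte):

```java
public class IndexCompressionDemo {
    // Prefix compression: store a term as <shared-prefix length, suffix>.
    static String[] prefixCompress(String prev, String curr) {
        int n = Math.min(prev.length(), curr.length());
        int shared = 0;
        while (shared < n && prev.charAt(shared) == curr.charAt(shared)) shared++;
        return new String[] { String.valueOf(shared), curr.substring(shared) };
    }

    // Delta encoding pays off because a small delta needs fewer bytes
    // when written as a variable-length integer (7 bits per byte).
    static int vIntByteLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) { value >>>= 7; bytes++; }
        return bytes;
    }

    public static void main(String[] args) {
        String[] c = prefixCompress("阿拉伯", "阿拉伯语");
        System.out.println("<" + c[0] + "," + c[1] + ">");  // <3,语>
        System.out.println(vIntByteLength(16389));           // 3 bytes stored raw
        System.out.println(vIntByteLength(16389 - 16382));   // 1 byte as a delta
    }
}
```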

The field data file is compressed as well, using zlib.

Are zlib and zip inferior to PPMd?

++++++++++++++++++

Lucene indexes may be composed of multiple sub-indexes, or segments.

Each segment has a separate Fieldable Info file.

A token is a single occurrence of a term; it contains the term text, the corresponding start and end offsets, and a type string.
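The definition above can be sketched as a tiny plain-Java model. This is a simplified illustration, not Lucene's actual Token/TokenStream API:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenDemo {
    // One occurrence of a term: text, start/end character offsets, and a type.
    static class Token {
        final String text; final int start; final int end; final String type;
        Token(String text, int start, int end, String type) {
            this.text = text; this.start = start; this.end = end; this.type = type;
        }
    }

    // Naive whitespace tokenizer that records offsets as it goes.
    static List<Token> tokenize(String s) {
        List<Token> out = new ArrayList<Token>();
        int i = 0;
        while (i < s.length()) {
            while (i < s.length() && s.charAt(i) == ' ') i++;
            int start = i;
            while (i < s.length() && s.charAt(i) != ' ') i++;
            if (i > start) out.add(new Token(s.substring(start, i), start, i, "word"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("hello hello world")) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ") " + t.type);
        }
        // "hello" occurs twice: one term, but two tokens with different offsets.
    }
}
```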

A filter restricts a search to a subset of the index. Its role is a bit like the WHERE clause in SQL, but with a difference: it is not part of the regular query itself;

it preprocesses the data source and then hands the result to the query. Note that it performs preprocessing rather than filtering the query results,

so using a filter can be very expensive; it may make a single query up to a hundred times slower.

+++++++++++++++++

While building the index, only one thread does the processing.

When this lock file is present, a writer is currently modifying the index (adding or removing documents).

++++++++++++++++++

while (left < right && array[left].fieldInfo.name.compareTo(partition.fieldInfo.name) <= 0) // the "<=" here should be "<"

++left;
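Why that boundary condition matters can be seen in a simplified sketch over plain strings (this mimics the loop shape, it is not Lucene's actual code): with "<=", the left scan walks past every key equal to the pivot, so a range of all-equal keys partitions at the far end, which is quicksort's worst case.

```java
public class PartitionDemo {
    // Simplified Hoare-style partition mimicking the loop in question.
    // With "<=", the left scan advances over keys equal to the pivot.
    static int partition(String[] a, int lo, int hi, String pivot) {
        int left = lo, right = hi;
        while (true) {
            while (left < right && a[left].compareTo(pivot) <= 0) ++left;
            while (right > left && a[right].compareTo(pivot) >= 0) --right;
            if (left >= right) return left;
            String tmp = a[left]; a[left] = a[right]; a[right] = tmp;
        }
    }

    public static void main(String[] args) {
        String[] allEqual = { "x", "x", "x", "x" };
        // The split point lands at the last index: a maximally unbalanced split.
        System.out.println(partition(allEqual, 0, 3, "x"));  // 3
    }
}
```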

++++++++++ Merge +++++++++++

mergeFields always does a two-way merge; it never uses a multi-way merge.

During a merge, the I/O is done by IndexReader.

? When does a merge start?

MergeFactor

Classes and file types involved:

FieldInfos .fnm

How is it handled when a single file cannot be loaded into memory in full?

+++++++++++++++++++++++

Ways to speed up indexing:

Open a single writer and re-use it for the duration of your indexing session.

Turn off compound file format (but you may run out of file descriptors).

Re-use Document and Field instances

Always add fields in the same order to your Document, when using stored fields or term vectors

>Use autoCommit=false when you open your IndexWriter (this setting no longer exists in 3.0.1)

>Turn off any features you are not in fact using

>Use a faster analyzer.

>Speed up document construction.

>Don't optimize unless you really need to (for faster searching)

>Index into separate indices then merge.

>setMaxBufferedDocs(int maxBufferedDocs)

Controls how many documents are buffered in memory before a new segment is written; a larger value speeds up indexing. The default is -1 (disabled).

Determines the minimal number of documents required before the buffered in-memory documents are flushed as a new Segment. Large values generally gives faster indexing.

Disabled by default (writer flushes by RAM usage)

>We can first write the index into a RAMDirectory and, once a certain number of documents has accumulated, write them into the FSDirectory in one batch, reducing the number of disk I/Os.

>setMergeFactor (with ~100k docs, 100-200 is a good value)

>setRAMBufferSizeMB (default: 16 MB)

Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can.

>Optimize time-range restrictions (when you need to search for results within a specified time range)

>Quick tips:

Keep the size of the index small. Eliminate norms and term vectors when not needed. Set the Store flag for a field only if it is a must.

Obvious, but oft-repeated mistake. Create only one instance of Searcher and reuse.

Keep the index on fast disks. RAM, if you are paranoid.

>Use multiple threads with one IndexWriter
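Several of the tips above can be combined into one configuration sketch against the Lucene 3.0.x API. This is a hedged sketch, not a definitive recipe: the index path and the buffer/merge values are illustrative only, and it needs the lucene-core jar on the classpath.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        // One writer, re-used for the whole indexing session.
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/idx")),          // illustrative path
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        writer.setUseCompoundFile(false);   // faster, but uses more file descriptors
        writer.setRAMBufferSizeMB(48.0);    // flush by RAM usage (default is 16 MB)
        writer.setMergeFactor(100);         // fewer, larger merges

        // Re-use Document/Field instances; always add fields in the same order.
        Document doc = new Document();
        Field body = new Field("body", "", Field.Store.NO, Field.Index.ANALYZED);
        doc.add(body);
        for (String text : new String[] { "hello world", "hello lucene" }) {
            body.setValue(text);            // swap in new content, same instances
            writer.addDocument(doc);
        }
        writer.close();
    }
}
```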

+++++++++++++++++++++

dspeak@lm-vm01:~/luowl/soft/lucene-3.0.1/src/demo$ java -cp .:/home/dspeak/luowl/search-node/lib/* org.apache.lucene.demo.IndexFiles ~/luowl/soft/data1.xml

+++++++++++++++++++

Every optimize rewrites the index.

+++++++++++++++++++++

LEVEL_LOG_SPAN: what does "level" mean?

Whenever extra segments (beyond the merge factor upper bound) are encountered,

all segments within the level are merged.

[Segments are grouped by level]
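The grouping can be sketched roughly as follows. This is simplified from LogMergePolicy: a segment's level is log(size)/log(mergeFactor), and the real code uses LEVEL_LOG_SPAN = 0.75 as the tolerance when bucketing segments into the same level; the sizes here are hypothetical doc counts.

```java
public class MergeLevelDemo {
    // Level of a segment: log base mergeFactor of its size. Segments whose
    // levels fall within LEVEL_LOG_SPAN of each other sit in one level and
    // become a merge candidate once more than mergeFactor of them pile up.
    static double level(long size, int mergeFactor) {
        return Math.log(size) / Math.log(mergeFactor);
    }

    public static void main(String[] args) {
        int mergeFactor = 10;
        long[] sizes = { 1000, 800, 9, 8, 7 };
        for (long s : sizes) {
            System.out.printf("size=%d level=%.2f%n", s, level(s, mergeFactor));
        }
        // 1000 and 800 land near level 3; 9, 8, and 7 near level 1:
        // each group is merged with its own level-mates, not across levels.
    }
}
```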

++++++++++++++

We keep a separate Posting hash and other state for each thread and then merge postings hashes from all threads when writing the segment.

++++++++++++++++++++++++

/*

This is the current indexing chain:

DocConsumer / DocConsumerPerThread

--> code: DocFieldProcessor / DocFieldProcessorPerThread

--> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField

--> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField

--> code: DocInverter / DocInverterPerThread / DocInverterPerField

--> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

--> code: TermsHash / TermsHashPerThread / TermsHashPerField

--> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField

--> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField

--> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField

--> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField

--> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField

--> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField

*/

++++++++++++++++++++++++++++++

What is a TermVector used for?

[Computing term frequencies, finding the positions where a term occurs, and so on]

++++++++++++++++++++

From code analysis and testing: even without an Optimize, a flush also happens on close, and FormatPostingsDocsWriter.addDoc gets called.
