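_N.tis stores the term information for this segment. Because Lucene uses an inverted index, the terms produced by analysis are saved in the .tis file. Each term entry records the document frequency (how many documents contain the term), the number of the field the term came from (fieldnum), and related information.
Structure of the .tis file: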
TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
MaxSkipLevels --> UInt32
TermInfos --> <TermInfo> TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
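Most of the per-term fields above are VInts, so it helps to see how one is decoded. The following is only a minimal sketch, assuming Lucene's variable-length rule (7 data bits per byte, low-order group first, high bit set when more bytes follow). The class and method names are made up for this illustration and are not part of Lucene or of the IndexFileInput class used below.

```java
public class VIntDemo {

    /**
     * Decodes `count` consecutive VInts from buf.
     * Each byte carries 7 data bits (low-order group first);
     * a set high bit means another byte follows.
     */
    public static int[] decodeAll(byte[] buf, int count) {
        int[] out = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = buf[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            out[i] = value;
        }
        return out;
    }

    public static void main(String[] args) {
        // 128 needs two bytes (0x80 0x01); 16 fits in one byte (0x10).
        byte[] encoded = {(byte) 0x80, 0x01, 0x10};
        int[] values = decodeAll(encoded, 2);
        System.out.println(values[0]); // 128
        System.out.println(values[1]); // 16
    }
}
```

Small values such as the IndexInterval of 128 or a DocFreq of 1 therefore take only one or two bytes on disk.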
Reading the contents of the .tis file (the documentation does not describe the on-disk details; see the org.apache.lucene.index.TermInfosWriter class for those):
/****************
 *
 *Create Class:ReadTermIndex.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;

public class ReadTerminfo {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
        IndexFileInput input = null;
        try {
            input = new IndexFileInput(indexfile);
            // Header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
            System.out.println("term index version:" + input.readInt());
            long termcount = input.readLong();
            System.out.println("term count:" + termcount);
            System.out.println("term IndexInterval:" + input.readInt());
            int skipinterval = input.readInt();
            System.out.println("term SkipInterval:" + skipinterval);
            System.out.println("term MaxSkipLevels:" + input.readInt());
            // One TermInfo per term: PrefixLength, Suffix, FieldNum, DocFreq, FreqDelta, ProxDelta[, SkipDelta]
            for (long i = 0; i < termcount; i++) {
                System.out.println("*****read term info[" + i + "]******");
                System.out.println("the term share prefixlength is :" + input.readVInt());
                System.out.println("term's own stuffix is:" + input.readString());
                System.out.println("exists this term's field number is:" + input.readVInt());
                int doccount = input.readVInt();
                System.out.println("the doc count contain this term is:" + doccount);
                System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                // SkipDelta is only written when DocFreq is not smaller than SkipInterval
                if (doccount >= skipinterval)
                    System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
            }
        } finally {
            // input stays null if the constructor threw
            if (input != null)
                input.close();
        }
    }
}
Output:
term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's own stuffix is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's own stuffix is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's own stuffix is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's own stuffix is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's own stuffix is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's own stuffix is:??
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's own stuffix is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's own stuffix is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's own stuffix is:g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
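Looking at the output, it seems partly right and partly wrong: why are some of the suffixes garbled? If the comparison were done on Java String characters (charAt followed by an equality check), two different Chinese characters could never share a prefix.
After rereading the documentation and the source code, I found that Lucene computes the common prefix of two consecutive terms on their UTF-8 bytes.
For example, "做" converted to a UTF-8 byte array is:
-27, -127, -102
and "全" converted to a UTF-8 byte array is:
-27, -123, -88
Comparing term '做' with term '全', the value of byte[0] is the same. Since '做' is stored as the first term, its value is written as
[3][-27][-127][-102], which is the ordinary String storage format (a length followed by the bytes). The adjacent term '全' shares byte[0] with it,
so the suffix string of term '全' is stored as [2][-123][-88]. It is still written in String format, but in UTF-8 those two bytes on their own cannot represent a Chinese character, so decoding them alone produces garbage. With pure English (ASCII) text, where every character is a single byte, the garbling does not occur.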
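The byte-level comparison described above can be reproduced without touching an index at all. The sketch below is only an illustration: the class and helper names are hypothetical, and the two characters are written as Unicode escapes so the source compiles regardless of file encoding. It computes the shared prefix of the two terms and then rebuilds the second term from the first term's bytes plus the stored suffix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8PrefixDemo {

    /** Number of leading bytes that a and b have in common. */
    public static int sharedPrefix(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** Rebuilds a term from the previous term's bytes, the shared prefix length and the suffix bytes. */
    public static String rebuild(byte[] previous, int prefixLength, byte[] suffix) {
        // Keep the first prefixLength bytes of the previous term, then overwrite the rest with the suffix.
        byte[] full = Arrays.copyOf(previous, prefixLength + suffix.length);
        System.arraycopy(suffix, 0, full, prefixLength, suffix.length);
        return new String(full, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] prev = "\u505A".getBytes(StandardCharsets.UTF_8); // the character 做
        byte[] cur = "\u5168".getBytes(StandardCharsets.UTF_8);  // the character 全
        int prefix = sharedPrefix(prev, cur);
        byte[] suffix = Arrays.copyOfRange(cur, prefix, cur.length);

        System.out.println(Arrays.toString(prev));   // [-27, -127, -102]
        System.out.println(Arrays.toString(cur));    // [-27, -123, -88]
        System.out.println(prefix);                  // 1
        System.out.println(Arrays.toString(suffix)); // [-123, -88]
        System.out.println(rebuild(prev, prefix, suffix).equals("\u5168")); // true
    }
}
```

Decoding the two suffix bytes on their own fails because they are continuation bytes with no lead byte, which is why the term can only be decoded after the prefix has been put back.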
The modified code:
/****************
 *
 *Create Class:ReadTermIndex.java
 *Author:a276202460
 *Create at:2010-6-7
 */
package com.rich.lucene.io;

public class ReadTerminfo {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
        IndexFileInput input = null;
        try {
            input = new IndexFileInput(indexfile);
            // Header: TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels
            System.out.println("term index version:" + input.readInt());
            long termcount = input.readLong();
            System.out.println("term count:" + termcount);
            System.out.println("term IndexInterval:" + input.readInt());
            int skipinterval = input.readInt();
            System.out.println("term SkipInterval:" + skipinterval);
            System.out.println("term MaxSkipLevels:" + input.readInt());
            int doccount = 0;
            int prefixlength = 0;
            String termvalue = null;
            byte[] lasttermbyte = null; // UTF-8 bytes of the previous term
            int stufflenth;
            for (long i = 0; i < termcount; i++) {
                System.out.println("*****read term info[" + i + "]******");
                prefixlength = input.readVInt();
                System.out.println("the term share prefixlength is :" + prefixlength);
                // Read the suffix as raw bytes (length, then bytes) instead of decoding it as a String
                stufflenth = input.readVInt();
                byte[] stuffbyte = new byte[stufflenth];
                input.readBytes(stuffbyte, 0, stufflenth);
                if (prefixlength == 0) {
                    termvalue = new String(stuffbyte, "UTF-8");
                    lasttermbyte = stuffbyte;
                } else {
                    // Full term = shared prefix bytes of the previous term + this term's suffix bytes
                    byte[] termbyte = new byte[prefixlength + stufflenth];
                    System.arraycopy(lasttermbyte, 0, termbyte, 0, prefixlength);
                    System.arraycopy(stuffbyte, 0, termbyte, prefixlength, stufflenth);
                    termvalue = new String(termbyte, "UTF-8");
                    lasttermbyte = termbyte;
                }
                System.out.println("term's value is:" + termvalue);
                System.out.println("exists this term's field number is:" + input.readVInt());
                doccount = input.readVInt();
                System.out.println("the doc count contain this term is:" + doccount);
                System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                // SkipDelta is only written when DocFreq is not smaller than SkipInterval
                if (doccount >= skipinterval)
                    System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
            }
        } finally {
            // input stays null if the constructor threw
            if (input != null)
                input.close();
        }
    }
}
Output:
term index version:-4
term count:22
term IndexInterval:128
term SkipInterval:16
term MaxSkipLevels:10
*****read term info[0]******
the term share prefixlength is :0
term's value is:做
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:0
the position of this term's TermPositions within the .prx file is:0
*****read term info[1]******
the term share prefixlength is :1
term's value is:全
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[2]******
the term share prefixlength is :1
term's value is:内
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[3]******
the term share prefixlength is :1
term's value is:国
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[4]******
the term share prefixlength is :1
term's value is:大
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[5]******
the term share prefixlength is :1
term's value is:度
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[6]******
the term share prefixlength is :1
term's value is:引
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[7]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[8]******
the term share prefixlength is :1
term's value is:擎
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[9]******
the term share prefixlength is :1
term's value is:最
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[10]******
the term share prefixlength is :0
term's value is:球
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[11]******
the term share prefixlength is :1
term's value is:百
exists this term's field number is:2
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[12]******
the term share prefixlength is :1
term's value is:的
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[13]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:2
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[14]******
the term share prefixlength is :0
term's value is:度
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:3
the position of this term's TermPositions within the .prx file is:3
*****read term info[15]******
the term share prefixlength is :0
term's value is:搜
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[16]******
the term share prefixlength is :1
term's value is:歌
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[17]******
the term share prefixlength is :0
term's value is:百
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[18]******
the term share prefixlength is :1
term's value is:索
exists this term's field number is:0
the doc count contain this term is:2
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[19]******
the term share prefixlength is :0
term's value is:谷
exists this term's field number is:0
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:2
the position of this term's TermPositions within the .prx file is:2
*****read term info[20]******
the term share prefixlength is :0
term's value is:http://www.baidu.com
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
*****read term info[21]******
the term share prefixlength is :11
term's value is:http://www.g.cn
exists this term's field number is:1
the doc count contain this term is:1
the position of this term's TermFreqs within the .frq file is:1
the position of this term's TermPositions within the .prx file is:1
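Tis: stores the detailed information for each term.
Tii: the index file over the term details (a page index into the .tis entries; one index entry is created in the .tii file for every 128 terms).
Both files start with the same header: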
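TIVersion --> UInt32 The format version number of the file.
TermCount --> UInt64 The number of terms stored in the file. In .tis this is the number of all distinct terms in the segment, regardless of which field they came from. In .tii it is also the number of terms stored in that file, which is not all of them: only the last term of each page is recorded (the first entry is an empty term and the last term of the final page is not recorded; one page is 128 (IndexInterval) terms).
IndexInterval --> UInt32 The number of terms per page.
SkipInterval --> UInt32
MaxSkipLevels --> UInt32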
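The meaning of SkipInterval and MaxSkipLevels is tied to how other files are stored. I do not know their exact meaning yet. The only place they touch the .tis layout is SkipDelta, which is present only when DocFreq is not smaller than SkipInterval (16 in this index); apart from that they do not matter for examining the structure of the TIS and TII files. I will come back to these two fields when studying the structure of the frq and prx files.
After the header comes the detailed information for each term: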
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
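PrefixLength: the number of leading bytes (UTF-8) that this term shares with the previous term.
Suffix: the part unique to this term. The documentation calls it a string, which is unambiguous for English; for Chinese it may be an incomplete character. What is actually stored is a length followed by the UTF-8 bytes of the suffix.
FieldNum: the number of the field the term came from.
DocFreq: the number of documents that contain this term.
FreqDelta: the position of this term's data in the frq file, stored as the difference from the previous term's position (what is stored at that position will be covered when looking at the frq file).
ProxDelta, SkipDelta: similar to FreqDelta, these are also position information. Giving a position amounts to keeping a pointer to the data there, which is itself a kind of index.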
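Because FreqDelta is a difference rather than an absolute offset, the printed values only become real .frq positions once they are summed up term by term. The sketch below shows that running sum on the first eight FreqDelta values from the output above; the class and method names are made up for this illustration.

```java
import java.util.Arrays;

public class DeltaDemo {

    /** Turns per-term deltas into absolute file offsets by keeping a running sum. */
    public static long[] toAbsolute(long[] deltas) {
        long[] offsets = new long[deltas.length];
        long pointer = 0; // the first term's delta is taken relative to offset 0
        for (int i = 0; i < deltas.length; i++) {
            pointer += deltas[i];
            offsets[i] = pointer;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // FreqDelta values of term info[0] .. term info[7] in the sample index
        long[] freqDeltas = {0, 1, 1, 1, 1, 2, 1, 3};
        System.out.println(Arrays.toString(toAbsolute(freqDeltas)));
        // prints [0, 1, 2, 3, 4, 6, 7, 10]
    }
}
```

The same accumulation would apply to the ProxDelta column if, as its name suggests, it is stored the same way.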
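Content layout of the two files:
In the figure, when .tis stores its first term, .tii stores an empty term entry.
If .tis holds exactly 128*n terms, the last term of the final page is not recorded in the .tii file. Once the frq, prx and nrm files have been read as well, the whole picture of how Lucene searches and how the index is built should become clear.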
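The paging rule described here can be turned into a small calculation. This is only a sketch of my reading of the description (one .tii entry is written just before every 128th .tis term is stored, starting with an empty entry before the very first term), not code taken from Lucene; the class and method names are hypothetical.

```java
public class TiiCountDemo {

    /**
     * Number of .tii entries, assuming one entry is written just before
     * .tis terms 0, indexInterval, 2 * indexInterval, ... are stored.
     */
    public static long indexEntryCount(long termCount, int indexInterval) {
        return (termCount + indexInterval - 1) / indexInterval;
    }

    /**
     * Ordinal of the .tis term held by .tii entry k: the last term of the
     * preceding page, or -1 for entry 0, which holds the empty term.
     */
    public static long indexedTermOrdinal(long k, int indexInterval) {
        return k * indexInterval - 1;
    }

    public static void main(String[] args) {
        System.out.println(indexEntryCount(22, 128));   // 1: only the empty entry, as in the sample index
        System.out.println(indexEntryCount(128, 128));  // 1: term 127 ends the last page and is not recorded
        System.out.println(indexEntryCount(129, 128));  // 2: the empty entry plus term 127
        System.out.println(indexedTermOrdinal(1, 128)); // 127
    }
}
```

Under this assumption, a lookup would first locate a page through .tii and then scan at most 128 entries in .tis.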
Everything here was written to the blog as I was learning, so corrections and criticism are welcome.