Learning Notes (8): Lucene Index Structure, Part 5 (_N.tis, _N.tii)

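_N.tis stores the term information for this segment. Because Lucene uses an inverted index, the terms produced by tokenization are saved in the .tis file. Each term's entry includes its document frequency (how many docs contain it), the number of the field it occurs in (fieldnum), and so on.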

Structure of the .tis file:

TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos

TIVersion --> UInt32

TermCount --> UInt64

IndexInterval --> UInt32

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

TermInfos --> <TermInfo>^TermCount

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

Suffix --> String

PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
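Most per-term fields in the grammar above are VInts. As a minimal sketch of how such a value is decoded (7 payload bits per byte, low-order group first, high bit set when another byte follows), the following standalone class reads VInts from a byte array. The class name `VIntSketch` and the `int[]` cursor are assumptions made for illustration; they are not part of Lucene or of the reader class used in this post.

```java
public class VIntSketch {

    /**
     * Decodes one VInt starting at buf[pos[0]] and advances pos[0] past it.
     * pos is a one-element array used as a mutable cursor (illustration only).
     */
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;                     // low 7 bits of the first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];                    // high bit was set: read another byte
            value |= (b & 0x7F) << shift;         // next 7 bits go above the previous ones
        }
        return value;
    }

    public static void main(String[] args) {
        // 128 needs two bytes (0x80 0x01); 16 fits in one byte (0x10)
        byte[] encoded = {(byte) 0x80, 0x01, 0x10};
        int[] pos = {0};
        System.out.println(readVInt(encoded, pos)); // 128
        System.out.println(readVInt(encoded, pos)); // 16
    }
}
```

Small values such as the prefix lengths and doc counts in a .tis file therefore cost a single byte each.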

Reading the contents of a .tis file (the documentation does not spell out every detail of how the file is written; see the org.apache.lucene.index.TermInfosWriter class):

    /****************
     *
     * Create Class: ReadTermIndex.java
     * Author: a276202460
     * Create at: 2010-6-7
     */
    package com.rich.lucene.io;

    public class ReadTerminfo {

        /**
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
            // IndexFileInput lives in this same package; it is not a Lucene class
            IndexFileInput input = null;
            try {
                input = new IndexFileInput(indexfile);
                // fixed-size header
                System.out.println("term index version:" + input.readInt());
                long termcount = input.readLong();
                System.out.println("term count:" + termcount);
                System.out.println("term IndexInterval:" + input.readInt());
                int skipinterval = input.readInt();
                System.out.println("term SkipInterval:" + skipinterval);
                System.out.println("term MaxSkipLevels:" + input.readInt());
                // one TermInfo record per term
                for (long i = 0; i < termcount; i++) {
                    System.out.println("*****read term info[" + i + "]******");
                    System.out.println("the term share prefixlength is :" + input.readVInt());
                    // reads the suffix as if it were a complete string on its own
                    System.out.println("term's own stuffix is:" + input.readString());
                    System.out.println("exists this term's field number is:" + input.readVInt());
                    int doccount = input.readVInt();
                    System.out.println("the doc count contain this term is:" + doccount);
                    System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                    System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                    // SkipDelta is only stored when DocFreq >= SkipInterval
                    if (doccount >= skipinterval)
                        System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
                }
            } finally {
                // the constructor may have thrown, leaving input null
                if (input != null) input.close();
            }
        }
    }

Output:

    term index version:-4
    term count:22
    term IndexInterval:128
    term SkipInterval:16
    term MaxSkipLevels:10
    *****read term info[0]******
    the term share prefixlength is :0
    term's own stuffix is:做
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:0
    the position of this term's TermPositions within the .prx file is:0
    *****read term info[1]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[2]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[3]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[4]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[5]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[6]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[7]******
    the term share prefixlength is :0
    term's own stuffix is:搜
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[8]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[9]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[10]******
    the term share prefixlength is :0
    term's own stuffix is:球
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[11]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[12]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[13]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[14]******
    the term share prefixlength is :0
    term's own stuffix is:度
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[15]******
    the term share prefixlength is :0
    term's own stuffix is:搜
    exists this term's field number is:0
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[16]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[17]******
    the term share prefixlength is :0
    term's own stuffix is:百
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[18]******
    the term share prefixlength is :1
    term's own stuffix is:??
    exists this term's field number is:0
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[19]******
    the term share prefixlength is :0
    term's own stuffix is:谷
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[20]******
    the term share prefixlength is :0
    term's own stuffix is:http://www.baidu.com
    exists this term's field number is:1
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[21]******
    the term share prefixlength is :11
    term's own stuffix is:g.cn
    exists this term's field number is:1
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1

 

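The output looks partly right and partly wrong: why is some of it garbled? If prefixes were compared with Java's String.charAt followed by an equality check, two different Chinese characters could never share a prefix.

After rereading the documentation and then the source, I found that Lucene compares the common prefix of two adjacent terms on their UTF-8 bytes.

For example, "做" converted to a UTF-8 byte array is

-27
-127
-102

and "全" converted to a UTF-8 byte array is

-27
-123
-88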

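Comparing term '做' with term '全', the value of byte[0] is the same. Since '做' is stored as the first term, the value saved for it is

[3][-27][-127][-102], which is just the usual String storage format (a length, then the bytes). The adjacent term '全' shares byte[0] with it,

so the suffix string of term '全' is [2][-123][-88]. It is still stored in String format, but in UTF-8 these two bytes cannot represent a Chinese character on their own, which is where the garbling comes from. With pure English text the problem would not show up.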
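The byte-level comparison can be checked without any index file. The sketch below computes the shared UTF-8 prefix of two terms and rebuilds the second term from the previous term's bytes plus its own suffix. The class and method names (`Utf8PrefixSketch`, `sharedPrefix`) are made up for this illustration and are not Lucene APIs; the characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8PrefixSketch {

    /** Number of leading bytes the two arrays have in common. */
    static int sharedPrefix(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] prev = "\u505A".getBytes(StandardCharsets.UTF_8); // 做
        byte[] cur = "\u5168".getBytes(StandardCharsets.UTF_8);  // 全
        System.out.println(Arrays.toString(prev)); // [-27, -127, -102]
        System.out.println(Arrays.toString(cur));  // [-27, -123, -88]

        int prefix = sharedPrefix(prev, cur);      // 1: only the first byte matches
        byte[] suffix = Arrays.copyOfRange(cur, prefix, cur.length);
        System.out.println(prefix + " " + Arrays.toString(suffix)); // 1 [-123, -88]

        // The suffix alone is not valid UTF-8; the full term needs the shared bytes back
        byte[] full = new byte[prefix + suffix.length];
        System.arraycopy(prev, 0, full, 0, prefix);
        System.arraycopy(suffix, 0, full, prefix, suffix.length);
        System.out.println(new String(full, StandardCharsets.UTF_8).equals("\u5168")); // true
    }
}
```

For ASCII-only terms every byte is a whole character, which is why "http://www.baidu.com" and "http://www.g.cn" share a clean 11-byte prefix ("http://www.") and the suffix "g.cn" prints correctly.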

 

The modified code:

    /****************
     *
     * Create Class: ReadTermIndex.java
     * Author: a276202460
     * Create at: 2010-6-7
     */
    package com.rich.lucene.io;

    public class ReadTerminfo {

        /**
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            String indexfile = "D:/lucenetest/indexs/txtindex/index4/_0.tis";
            // IndexFileInput lives in this same package; it is not a Lucene class
            IndexFileInput input = null;
            try {
                input = new IndexFileInput(indexfile);
                // fixed-size header
                System.out.println("term index version:" + input.readInt());
                long termcount = input.readLong();
                System.out.println("term count:" + termcount);
                System.out.println("term IndexInterval:" + input.readInt());
                int skipinterval = input.readInt();
                System.out.println("term SkipInterval:" + skipinterval);
                System.out.println("term MaxSkipLevels:" + input.readInt());
                int doccount = 0;
                int prefixlength = 0;
                String termvalue = null;
                byte[] lasttermbyte = null; // full UTF-8 bytes of the previous term
                int stufflenth;
                for (long i = 0; i < termcount; i++) {
                    System.out.println("*****read term info[" + i + "]******");
                    prefixlength = input.readVInt();
                    System.out.println("the term share prefixlength is :" + prefixlength);
                    // the suffix is a byte length followed by raw UTF-8 bytes
                    stufflenth = input.readVInt();
                    byte[] stuffbyte = new byte[stufflenth];
                    input.readBytes(stuffbyte, 0, stufflenth);
                    if (prefixlength == 0) {
                        termvalue = new String(stuffbyte, "UTF-8");
                        lasttermbyte = stuffbyte;
                    } else {
                        // shared prefix bytes from the previous term + this term's suffix
                        byte[] termbyte = new byte[prefixlength + stufflenth];
                        System.arraycopy(lasttermbyte, 0, termbyte, 0, prefixlength);
                        System.arraycopy(stuffbyte, 0, termbyte, prefixlength, stufflenth);
                        termvalue = new String(termbyte, "UTF-8");
                        lasttermbyte = termbyte;
                    }
                    System.out.println("term's value is:" + termvalue);
                    System.out.println("exists this term's field number is:" + input.readVInt());
                    doccount = input.readVInt();
                    System.out.println("the doc count contain this term is:" + doccount);
                    System.out.println("the position of this term's TermFreqs within the .frq file is:" + input.readVLong());
                    System.out.println("the position of this term's TermPositions within the .prx file is:" + input.readVLong());
                    // SkipDelta is only stored when DocFreq >= SkipInterval
                    if (doccount >= skipinterval)
                        System.out.println("the position of this term's SkipData within the .frq file is:" + input.readVInt());
                }
            } finally {
                // the constructor may have thrown, leaving input null
                if (input != null) input.close();
            }
        }
    }

 

Output:

    term index version:-4
    term count:22
    term IndexInterval:128
    term SkipInterval:16
    term MaxSkipLevels:10
    *****read term info[0]******
    the term share prefixlength is :0
    term's value is:做
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:0
    the position of this term's TermPositions within the .prx file is:0
    *****read term info[1]******
    the term share prefixlength is :1
    term's value is:全
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[2]******
    the term share prefixlength is :1
    term's value is:内
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[3]******
    the term share prefixlength is :1
    term's value is:国
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[4]******
    the term share prefixlength is :1
    term's value is:大
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[5]******
    the term share prefixlength is :1
    term's value is:度
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[6]******
    the term share prefixlength is :1
    term's value is:引
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[7]******
    the term share prefixlength is :0
    term's value is:搜
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[8]******
    the term share prefixlength is :1
    term's value is:擎
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[9]******
    the term share prefixlength is :1
    term's value is:最
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[10]******
    the term share prefixlength is :0
    term's value is:球
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[11]******
    the term share prefixlength is :1
    term's value is:百
    exists this term's field number is:2
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[12]******
    the term share prefixlength is :1
    term's value is:的
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[13]******
    the term share prefixlength is :1
    term's value is:索
    exists this term's field number is:2
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[14]******
    the term share prefixlength is :0
    term's value is:度
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:3
    the position of this term's TermPositions within the .prx file is:3
    *****read term info[15]******
    the term share prefixlength is :0
    term's value is:搜
    exists this term's field number is:0
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[16]******
    the term share prefixlength is :1
    term's value is:歌
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[17]******
    the term share prefixlength is :0
    term's value is:百
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[18]******
    the term share prefixlength is :1
    term's value is:索
    exists this term's field number is:0
    the doc count contain this term is:2
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[19]******
    the term share prefixlength is :0
    term's value is:谷
    exists this term's field number is:0
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:2
    the position of this term's TermPositions within the .prx file is:2
    *****read term info[20]******
    the term share prefixlength is :0
    term's value is:http://www.baidu.com
    exists this term's field number is:1
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1
    *****read term info[21]******
    the term share prefixlength is :11
    term's value is:http://www.g.cn
    exists this term's field number is:1
    the doc count contain this term is:1
    the position of this term's TermFreqs within the .frq file is:1
    the position of this term's TermPositions within the .prx file is:1

 

 

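Tis - stores the detailed information for each term

Tii - index file over the term details (a page index: one index entry is created in the .tii file for every 128 terms)

The header of the two files is the same: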

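TIVersion --> UInt32   format version of the file

TermCount --> UInt64   number of terms stored in the file. In .tis this is the number of all terms in this segment, whichever field they come from. In .tii it is also the number of terms in that file, but that is not all of them: only the last term of each page is recorded (the first page contributes an empty entry and the last page is not recorded; a page holds IndexInterval = 128 terms).

IndexInterval --> UInt32   number of terms per page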

SkipInterval --> UInt32

MaxSkipLevels --> UInt32

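The meaning of SkipInterval and MaxSkipLevels depends on other files. I do not know exactly what they mean yet, but they do not affect reading the structure of the TIS and TII files; I will come back to these fields when studying the structure of the frq and prx files.

After the header comes the detailed information for each term: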

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>

Term --> <PrefixLength, Suffix, FieldNum>

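PrefixLength: the number of leading bytes (UTF-8) that this term shares with the previous term.

Suffix: the term's own trailing part. The documentation calls it a String, which is unambiguous for English text; for Chinese it may be an incomplete character. What is stored is a length followed by the suffix's UTF-8 bytes.

FieldNum: the number of the field the term comes from.

DocFreq: the number of documents in which this term occurs.

FreqDelta: the position of this term's data in the .frq file, stored as the difference from the previous term's position (what exactly is stored at that position has to wait until I look at the .frq file).

ProxDelta, SkipDelta: much the same as FreqDelta, these are position information too. Giving a position amounts to a pointer to the data stored there, so this is an index as well.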
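According to the Lucene file-format documentation, FreqDelta is the difference between this term's position in the .frq file and the previous term's position (zero for the first term), so the values printed by the reader above are deltas, not absolute offsets. A minimal sketch of turning them back into absolute positions, using the FreqDelta values of the first six terms from the output; the class and method names are assumptions for illustration only:

```java
import java.util.Arrays;

public class DeltaSketch {

    /** Running sum: converts per-term deltas into absolute file positions. */
    static long[] toAbsolute(long[] deltas) {
        long[] positions = new long[deltas.length];
        long pos = 0;
        for (int i = 0; i < deltas.length; i++) {
            pos += deltas[i];       // each delta is relative to the previous term
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        // FreqDelta of term info[0] .. [5] in the output above
        long[] freqDeltas = {0, 1, 1, 1, 1, 2};
        System.out.println(Arrays.toString(toAbsolute(freqDeltas))); // [0, 1, 2, 3, 4, 6]
    }
}
```

The same accumulation would apply to ProxDelta for the .prx file.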

 

Content layout of the two files:

[Figure: content layout of the .tis and .tii files]

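In the figure, when .tis stores its first term, .tii stores an empty term entry.

If .tis happens to hold exactly 128*n terms, the last term of the final page is not recorded in the .tii file. Once the frq, prx and nrm files have been read as well, the whole query/retrieval process of Lucene and the structure created at indexing time should be quite clear.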
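The paging rule described here can be put into a small calculation. The sketch below assumes, as described above, that an index entry is written just before term number 0, 128, 256, ... is added to .tis (the entry holding the previous term, which is empty for term 0). The class and method names are hypothetical, and the formula is my reading of that rule rather than something taken from the Lucene source.

```java
public class IndexEntryCountSketch {

    /**
     * Number of .tii entries for a .tis file with termCount terms,
     * assuming one entry is written before every term whose
     * zero-based number is a multiple of indexInterval.
     */
    static long indexEntries(long termCount, int indexInterval) {
        return (termCount + indexInterval - 1) / indexInterval; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(indexEntries(22, 128));  // 1: only the empty first entry
        System.out.println(indexEntries(128, 128)); // 1: term 128 of 128 is not recorded
        System.out.println(indexEntries(129, 128)); // 2: adding term 129 records term 128
    }
}
```

For the 22-term segment read in this post, that would mean the .tii file holds just the single empty entry.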

Everything here is written to the blog as I learn; corrections and criticism are welcome.

 
