Three approaches to indexing: inverted index, suffix array, and signature file.
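Of the three, the inverted index is the one Lucene uses. A minimal sketch of the idea, using only the JDK: each term maps to the sorted set of document IDs that contain it. The class name `TinyInvertedIndex` and the regex-based tokenization are illustrative assumptions, not Lucene code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class TinyInvertedIndex {
    // term -> sorted IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits text on non-word characters and records docId under every term. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the IDs of documents containing term, or an empty set. */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene builds an inverted index");
        index.add(2, "A suffix array is another index");
        System.out.println("index  -> " + index.lookup("index"));
        System.out.println("suffix -> " + index.lookup("suffix"));
    }
}
```

A lookup costs one map access regardless of how many documents were added, which is why this layout suits term queries.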
A snippet that reads a text file and writes it to another file:
BufferedWriter writer = new BufferedWriter(new FileWriter(destFile));
BufferedReader reader = new BufferedReader(new FileReader(readFile));
String line = reader.readLine();
while (line != null) {
    writer.write(line);
    writer.newLine(); // write the line separator
    line = reader.readLine(); // advance to the next line, otherwise the loop never ends
}
reader.close();
writer.close();
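The same copy loop can also be written with try-with-resources, so that both streams are closed even when an exception is thrown midway. This is a sketch under assumptions: the class and method names (`LineCopy`, `copyLines`) are made up for illustration, and UTF-8 is assumed as the file encoding.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class LineCopy {
    /** Copies source to dest line by line and returns the number of lines written. */
    static int copyLines(Path source, Path dest) throws IOException {
        int count = 0;
        // Resources declared here are closed automatically, in reverse order.
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(dest, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // platform line separator
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".txt");
        Path dst = Files.createTempFile("dst", ".txt");
        Files.write(src, Arrays.asList("first", "second"), StandardCharsets.UTF_8);
        System.out.println(copyLines(src, dst) + " lines copied");
    }
}
```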
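The three constants of the Store class:
Store.NO        do not store the field value
Store.YES       store the field value
Store.COMPRESS  store the field value in compressed form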
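The four constants of the Index class:
Index.NO            do not index the field
Index.TOKENIZED     index the field after tokenizing it with the Analyzer
Index.UN_TOKENIZED  index the field value as a single term, without tokenizing
Index.NO_NORMS      index without an Analyzer and without storing norms, so the field takes no part in norm-based scoring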
IndexWriter constructors:
private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir)

public IndexWriter(File path, Analyzer a, boolean create)
        throws IOException {
    this(FSDirectory.getDirectory(path, create), a, create, true);
}

public IndexWriter(Directory d, Analyzer a, boolean create)
        throws IOException {
    this(d, a, create, false);
}
Adding a Document to an IndexWriter:
public void addDocument(Document doc) throws IOException {
    addDocument(doc, analyzer);
}

public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    DocumentWriter dw =
        new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newSegmentName();
    dw.addDocument(segmentName, doc);
    synchronized (this) {
        segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
        maybeMergeSegments();
    }
}
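Note: after calling addDocument, always close the indexer with IndexWriter's close method. Otherwise the index is never finalized, and the directory may still be locked the next time you try to add to the index.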
DocumentWriter constructors:
DocumentWriter(Directory directory, Analyzer analyzer,
               Similarity similarity, int maxFieldLength) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = similarity;
    this.maxFieldLength = maxFieldLength;
}

DocumentWriter(Directory directory, Analyzer analyzer, IndexWriter writer) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = writer.getSimilarity();
    this.maxFieldLength = writer.getMaxFieldLength();
    this.termIndexInterval = writer.getTermIndexInterval();
}
DocumentWriter's addDocument method:
final void addDocument(String segment, Document doc)
        throws IOException {
    // write field names
    fieldInfos = new FieldInfos();
    fieldInfos.add(doc);
    fieldInfos.write(directory, segment + ".fnm");

    // write field values (this produces the .fdx and .fdt files)
    FieldsWriter fieldsWriter =
        new FieldsWriter(directory, segment, fieldInfos);
    try {
        fieldsWriter.addDocument(doc);
    } finally {
        fieldsWriter.close();
    }

    // invert doc into postingTable
    postingTable.clear();                           // reset postingTable, the Hashtable that holds every term
    fieldLengths = new int[fieldInfos.size()];      // init fieldLengths
    fieldPositions = new int[fieldInfos.size()];    // init fieldPositions (final position of each Field once analysis is done)
    fieldOffsets = new int[fieldInfos.size()];      // init fieldOffsets
    fieldBoosts = new float[fieldInfos.size()];     // init fieldBoosts
    Arrays.fill(fieldBoosts, doc.getBoost());
    invertDocument(doc);                            // invert every Field of the Document

    // sort postingTable into an array (terms in sorted order)
    Posting[] postings = sortPostingTable();

    // write postings: term frequencies go to the .frq file, positions to the .prx file
    writePostings(postings, segment);

    // write norms of indexed fields: the scoring data, written to the .f files (one .f<n> file per indexed field)
    writeNorms(segment);
}
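The invert-then-sort steps in the middle of this method can be pictured with a toy version that uses only the JDK. This is a sketch under stated assumptions, not Lucene's implementation: `ToyInverter` and `ToyPosting` are hypothetical names, whitespace splitting stands in for the Analyzer, and a `TreeMap` merges the "build posting table" and "sort" steps into one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ToyInverter {
    /** One term's occurrences within a single document (hypothetical, not Lucene's Posting). */
    static final class ToyPosting {
        final String term;
        final List<Integer> positions = new ArrayList<>();

        ToyPosting(String term) {
            this.term = term;
        }

        int freq() {
            return positions.size(); // frequency is the number of recorded positions
        }
    }

    /** Inverts text into postings ordered by term. */
    static List<ToyPosting> invert(String text) {
        Map<String, ToyPosting> table = new TreeMap<>(); // keeps terms sorted
        int position = 0;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token
            }
            String term = token.toLowerCase(Locale.ROOT);
            ToyPosting posting = table.get(term);
            if (posting == null) {
                posting = new ToyPosting(term);
                table.put(term, posting);
            }
            posting.positions.add(position++);
        }
        return new ArrayList<>(table.values());
    }

    public static void main(String[] args) {
        for (ToyPosting p : invert("to be or not to be")) {
            System.out.println(p.term + " freq=" + p.freq() + " positions=" + p.positions);
        }
    }
}
```

The frequency and position list per term correspond to the kind of information the method above writes to the .frq and .prx files.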
Files in the index directory:
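A segment is a logical concept; each segment holds many Documents. All index files that belong to one segment share the same prefix and differ only in their extension. A segment's name is built from segmentInfos.counter, which is incremented for each new segment: the value is converted to base 36 and prefixed with _. Because every addDocument call creates a new one-document segment, segmentInfos.counter roughly tracks the number of documents added so far, although strictly speaking it counts the segment names that have been handed out.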
A directory, by contrast, contains only one segments file and one deletable file.
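The segment naming rule described above (counter value in base 36, prefixed with an underscore) can be reproduced with the JDK alone. The helper below is a hypothetical illustration of that rule, not Lucene's own method.

```java
final class SegmentNames {
    /** Formats a counter value as "_" followed by its base-36 representation. */
    static String segmentName(int counter) {
        // Character.MAX_RADIX is 36: digits 0-9 followed by letters a-z
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        for (int counter : new int[] {1, 10, 35, 36, 1000}) {
            System.out.println(counter + " -> " + segmentName(counter));
        }
    }
}
```

Base 36 keeps file names short: the counter value 1000 becomes `_rs`, so files such as `_rs.fnm` and `_rs.frq` would all belong to the same segment.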