Three approaches to indexing: inverted index, suffix array, and signature file.
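Of the three, the inverted index is the one Lucene builds. As a rough illustration only (this is not Lucene code), the sketch below maps each term to the ids of the documents that contain it. The class name InvertedIndexSketch, the whitespace tokenizer, and the use of list positions as document ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> increasing list of ids of the documents containing it.
final class InvertedIndexSketch {

    // Builds the index; a document's id is its position in the input list.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Assumed tokenizer: lowercase, then split on whitespace.
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Doc ids arrive in increasing order, so checking the tail avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene builds an index",
                "an index maps terms to documents");
        System.out.println(build(docs).get("index")); // prints [0, 1]
    }
}
```

Looking up a term is then a single map access, which is why this layout suits keyword search.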
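A snippet that copies a text file line by line (readFile is the source, destFile the destination):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;

BufferedWriter writer = new BufferedWriter(new FileWriter(destFile));
BufferedReader reader = new BufferedReader(new FileReader(readFile));
String line = reader.readLine();
while (line != null) {
    writer.write(line);        // write the line that was just read
    writer.newLine();          // write the line separator
    line = reader.readLine();  // read the next line; null at end of file ends the loop
}
reader.close();
writer.close();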
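The three values of the Store class:
Store.NO: the field value is not stored
Store.YES: the field value is stored
Store.COMPRESS: the field value is stored in compressed form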
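The four values of the Index class:
Index.NO: the field is not indexed
Index.TOKENIZED: the field is tokenized, then indexed
Index.UN_TOKENIZED: the field is indexed without tokenizing
Index.NO_NORMS: the field is indexed without an Analyzer and without norms, so it takes no part in norm-based scoring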
IndexWriter constructors:
private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir)

public IndexWriter(File path, Analyzer a, boolean create) throws IOException {
    this(FSDirectory.getDirectory(path, create), a, create, true);
}

public IndexWriter(Directory d, Analyzer a, boolean create) throws IOException {
    this(d, a, create, false);
}
Adding a Document to an IndexWriter:
public void addDocument(Document doc) throws IOException {
    addDocument(doc, analyzer);
}

public void addDocument(Document doc, Analyzer analyzer) throws IOException {
    DocumentWriter dw = new DocumentWriter(ramDirectory, analyzer, this);
    dw.setInfoStream(infoStream);
    String segmentName = newSegmentName();
    dw.addDocument(segmentName, doc);
    synchronized (this) {
        segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
        maybeMergeSegments();
    }
}
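Note: after calling addDocument, always call IndexWriter's close method to close the writer. Otherwise the index is never finalized on disk, and the directory may still be locked the next time you try to add to the index.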
DocumentWriter constructors:
DocumentWriter(Directory directory, Analyzer analyzer, Similarity similarity, int maxFieldLength) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = similarity;
    this.maxFieldLength = maxFieldLength;
}

DocumentWriter(Directory directory, Analyzer analyzer, IndexWriter writer) {
    this.directory = directory;
    this.analyzer = analyzer;
    this.similarity = writer.getSimilarity();
    this.maxFieldLength = writer.getMaxFieldLength();
    this.termIndexInterval = writer.getTermIndexInterval();
}
DocumentWriter's addDocument method:
final void addDocument(String segment, Document doc) throws IOException {
    // write field names
    fieldInfos = new FieldInfos();
    fieldInfos.add(doc);
    fieldInfos.write(directory, segment + ".fnm");

    // write field values: this produces the .fdx and .fdt files
    FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
    try {
        fieldsWriter.addDocument(doc);
    } finally {
        fieldsWriter.close();
    }

    // invert doc into postingTable
    postingTable.clear();                         // clear postingTable: reset the Hashtable that holds every term
    fieldLengths = new int[fieldInfos.size()];    // init fieldLengths
    fieldPositions = new int[fieldInfos.size()];  // init fieldPositions: each Field's final position once analysis is done
    fieldOffsets = new int[fieldInfos.size()];    // init fieldOffsets
    fieldBoosts = new float[fieldInfos.size()];   // init fieldBoosts
    Arrays.fill(fieldBoosts, doc.getBoost());

    invertDocument(doc);  // invert every Field of the Document

    // sort postingTable into an array: sort the terms
    Posting[] postings = sortPostingTable();

    // write postings: write term data to the index, mainly term frequencies and positions into the .frq and .prx files
    writePostings(postings, segment);

    // write norms of indexed fields: write scoring data to the index, mainly into the .f files
    writeNorms(segment);
}
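The invert, sort, and write steps of addDocument can be pictured with a toy version in plain Java. This is a sketch under assumptions, not Lucene's implementation: PostingTableSketch, its nested Posting class, and the pre-tokenized input are made up for illustration, and the write step is replaced by printing to standard output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-in for the postingTable handling in a single document field.
final class PostingTableSketch {

    // One entry per distinct term: the term plus every position where it occurs.
    static final class Posting {
        final String term;
        final List<Integer> positions = new ArrayList<>();

        Posting(String term) {
            this.term = term;
        }

        int freq() {
            return positions.size();
        }
    }

    // Step 1 (invert): record each token's position in a hash table keyed by term.
    static Map<String, Posting> invert(String[] tokens) {
        Map<String, Posting> table = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            table.computeIfAbsent(tokens[pos], Posting::new).positions.add(pos);
        }
        return table;
    }

    // Step 2 (sort): order the postings by term.
    static List<Posting> sort(Map<String, Posting> table) {
        return new ArrayList<>(new TreeMap<>(table).values());
    }

    public static void main(String[] args) {
        String[] tokens = {"to", "be", "or", "not", "to", "be"};
        for (Posting p : sort(invert(tokens))) {
            // Step 3 stand-in: a real writer would emit freq and positions to index files.
            System.out.println(p.term + " freq=" + p.freq() + " positions=" + p.positions);
        }
    }
}
```

Sorting before writing means the terms reach the output in a fixed order, which is what makes a term dictionary searchable without scanning all of it.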
Files in the index directory:
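A segment is a logical concept; each segment holds many Documents. All index files that belong to one segment share the same prefix and differ only in their suffix. A segment name is built from segmentInfos.counter: the counter is advanced by one for every new name, its value is converted to base 36, and _ is prepended. Because each addDocument call requests a new segment name, segmentInfos.counter grows with the number of documents added; it is a running count of the names handed out rather than the document count of any single segment.
An index directory, by contrast, contains exactly one segments file and one deletable file.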
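The base-36 part of the segment naming rule can be tried in isolation. This is a minimal sketch, assuming only that a name is _ followed by the counter value in radix 36; SegmentNameSketch and segmentName are hypothetical names, and how the counter itself is advanced is deliberately left out.

```java
// Hypothetical helper, not Lucene's code: formats a counter value as a segment name.
final class SegmentNameSketch {

    // "_" followed by the counter value in base 36 (Character.MAX_RADIX is 36).
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentName(1));   // prints _1
        System.out.println(segmentName(35));  // prints _z
        System.out.println(segmentName(36));  // prints _10
    }
}
```

Base 36 uses the digits 0-9 followed by a-z, so names stay short even after many segments have been created.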