References:
http://www.cnblogs.com/fly1988happy/archive/2012/04/01/2429000.html
http://blog.csdn.net/v_july_v/article/details/7109500
My data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm
My algorithm library: https://github.com/linyiqun/lyq-algorithms-lib
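Algorithm Introduction
In information retrieval, building an index has always been an effective technique. But when a search engine faces massive amounts of data, scanning through the whole collection to find a match is clearly not a good approach, and this is where the inverted index comes in. Before describing the inverted index, it helps to look at how retrieval works without one. Suppose the raw data is split into documents: document 1 holds some content, document 2 holds some content, and so does document 3. Given a keyword, the obvious way to find the relevant documents is to go through the documents one by one and check whether each contains the keyword; if it does, return that document's index address, otherwise move on to the next document, much like string matching. Clearly this does not scale when the data volume is huge.
That original approach can be thought of as index --> keyword, whereas an inverted index has the form keyword --> index position: given a keyword, the inverted index immediately yields the positions where it occurs. Of course, this is only the end result an inverted index is meant to achieve. There is more than one way to build it, and this article covers two of the better-known ones, the BSBI and SPIMI algorithms.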
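How the Algorithm Works
Let us first walk through a concrete example of the general construction process, setting the implementation details aside for now. Take the following sentences.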
Doc1:Mike spoken English Frequently at home.And he can write English every day.
Doc2:Mike plays football very well.
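First, note that we only need the key information: modifiers and similar words should be dropped, verb tense variations should be reduced to the base form, and pronouns that refer to the same person can also be dropped. The sentences above then simplify to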
Doc1:Mike spoken English home.write English.
Doc2:Mike play football.
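Now build the inverted index. Mike appears in both document 1 and document 2, so Mike:{1, 2}; the remaining words are handled the same way. The end result is a mapping from each word to the positions (documents) where it occurs. With this process understood, we can turn to the two algorithms this article is mainly about: BSBI (blocked sort-based indexing, which builds the index through disk-based external sorting) and SPIMI (single-pass in-memory indexing). Generally speaking, the latter is more commonly used than the former.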
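The construction just described can be sketched in a few lines of Java. This is only an illustration of the term-to-documents mapping, not the BSBI or SPIMI implementation given later; the class name SimpleInvertedIndex, the build method, and the decision to lowercase every term are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class SimpleInvertedIndex {
    // Maps each (lowercased) term to the sorted set of document IDs containing it.
    static Map<String, SortedSet<Integer>> build(Map<Integer, List<String>> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term.toLowerCase(), k -> new TreeSet<>())
                     .add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // The simplified Doc1 and Doc2 from the text, already split into terms.
        Map<Integer, List<String>> docs = new LinkedHashMap<>();
        docs.put(1, Arrays.asList("Mike", "spoken", "English", "home", "write", "English"));
        docs.put(2, Arrays.asList("Mike", "play", "football"));
        System.out.println(build(docs).get("mike")); // prints [1, 2]
    }
}
```

Using a set for each postings list means a term that occurs twice in one document, such as English in Doc1, is recorded only once.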
BSBI
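The main steps of this algorithm are:
1. Map each word in the documents to an ID; hashing can be used for this.
2. Split the documents into parts of equal size.
3. Sort each part by (term ID, document ID) pairs.
4. Merge the sorted parts and finally write the result to disk.
5. Repeat until all document content has gone through this series of operations.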
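The steps above can be sketched as follows. This is a minimal in-memory illustration under several assumptions: in-memory lists stand in for the sorted runs that would be written to disk, term IDs are handed out sequentially from a dictionary rather than by hashing, and the names BsbiSketch, toSortedBlock, and merge are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class BsbiSketch {
    // Orders {termId, docId} pairs by term ID, then by document ID.
    static final Comparator<int[]> BY_TERM_THEN_DOC =
            (a, b) -> a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);

    // Steps 1-3: map terms to IDs and sort one block of {termId, docId} pairs.
    static List<int[]> toSortedBlock(Map<String, Integer> dict, int docId, List<String> terms) {
        List<int[]> block = new ArrayList<>();
        for (String term : terms) {
            Integer id = dict.get(term);
            if (id == null) {
                id = dict.size();          // assumption: sequential IDs instead of hashing
                dict.put(term, id);
            }
            block.add(new int[] {id, docId});
        }
        block.sort(BY_TERM_THEN_DOC);
        return block;
    }

    // Step 4: k-way merge of the sorted blocks into termId -> postings list.
    static Map<Integer, List<Integer>> merge(List<List<int[]>> blocks) {
        // Each heap entry is a cursor {blockIndex, positionInBlock}.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> BY_TERM_THEN_DOC.compare(
                blocks.get(x[0]).get(x[1]), blocks.get(y[0]).get(y[1])));
        for (int b = 0; b < blocks.size(); b++) {
            if (!blocks.get(b).isEmpty()) heap.add(new int[] {b, 0});
        }
        Map<Integer, List<Integer>> postings = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            int[] cur = heap.poll();
            int[] pair = blocks.get(cur[0]).get(cur[1]);
            List<Integer> docs = postings.computeIfAbsent(pair[0], k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != pair[1]) docs.add(pair[1]);
            if (cur[1] + 1 < blocks.get(cur[0]).size()) heap.add(new int[] {cur[0], cur[1] + 1});
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        List<List<int[]>> blocks = new ArrayList<>();
        blocks.add(toSortedBlock(dict, 1, Arrays.asList("mike", "english", "home")));
        blocks.add(toSortedBlock(dict, 2, Arrays.asList("mike", "football")));
        System.out.println(merge(blocks)); // prints {0=[1, 2], 1=[1], 2=[1], 3=[2]}
    }
}
```

Because every block is already sorted, the merge only ever compares the current head of each block, which is what lets the real algorithm keep just a small buffer per block in memory.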
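Here is a schematic diagram:
The algorithm uses read buffers and a write buffer. How large to make them is up to you; I set the sizes in the code implementation later. As for the sorting algorithm, quicksort is generally recommended for its performance, but for convenience the code below uses a simple quadratic sort that I am more familiar with (I describe it as bubble sort, although the wordBufferSort method as written is really a selection sort). This too is a matter of personal choice.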
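If you would rather not hand-write the sort, the buffer can be ordered with the standard library. The sketch below assumes the same buffer layout as the implementation later in this article, a String[][] whose rows are {termCode, docId} with null in unused slots; the class name BufferSortSketch and the method sortBuffer are invented for this example. Unlike the hand-written version, it moves empty slots to the end.

```java
import java.util.Arrays;

class BufferSortSketch {
    // Sorts rows by numeric term code in ascending order; empty slots go last.
    static void sortBuffer(String[][] buffer) {
        Arrays.sort(buffer, (a, b) -> {
            if (a[0] == null) return b[0] == null ? 0 : 1;
            if (b[0] == null) return -1;
            return Long.compare(Long.parseLong(a[0]), Long.parseLong(b[0]));
        });
    }

    public static void main(String[] args) {
        String[][] buffer = {
            {"7327", "0"}, {"1542", "0"}, {null, null}, {"3325", "0"}
        };
        sortBuffer(buffer);
        for (String[] row : buffer) {
            System.out.println(row[0] + " " + row[1]);
        }
    }
}
```

Arrays.sort on an object array is a stable merge sort variant, so rows with equal term codes keep their relative order.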
SPIMI
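Next is SPIMI, the single-pass in-memory indexing algorithm. The most immediate difference from the algorithm above is that it needs no ID conversion: it associates each term with its index entries directly. Another major feature is that it does no sorting of pairs and builds the index directly in order of arrival. The main steps are:
1. Build an independent inverted index for each block.
2. Finally, merge all the independent inverted indexes.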
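The two steps above can be sketched as follows. This is a simplified in-memory illustration, not the SPIMITool class used by the test program later; the names SpimiSketch, invertBlock, and mergeBlocks are invented for this example, and it assumes blocks are processed in ascending document ID order so that the merged postings lists stay sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SpimiSketch {
    // Step 1: build a term -> postings dictionary for one block, with no term IDs
    // and no sorting of (term, doc) pairs.
    static Map<String, List<Integer>> invertBlock(Map<Integer, List<String>> blockDocs) {
        Map<String, List<Integer>> dict = new HashMap<>();
        for (Map.Entry<Integer, List<String>> doc : blockDocs.entrySet()) {
            for (String term : doc.getValue()) {
                List<Integer> postings = dict.computeIfAbsent(term, k -> new ArrayList<>());
                // Skip the doc ID if this document was already recorded for the term.
                if (postings.isEmpty() || !postings.get(postings.size() - 1).equals(doc.getKey())) {
                    postings.add(doc.getKey());
                }
            }
        }
        return dict;
    }

    // Step 2: merge the per-block dictionaries into one index with sorted terms.
    static SortedMap<String, List<Integer>> mergeBlocks(List<Map<String, List<Integer>>> blocks) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> block : blocks) {
            for (Map.Entry<String, List<Integer>> e : block.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> block1 = new LinkedHashMap<>();
        block1.put(1, Arrays.asList("mike", "study", "english", "english"));
        Map<Integer, List<String>> block2 = new LinkedHashMap<>();
        block2.put(2, Arrays.asList("mike", "play", "football"));

        List<Map<String, List<Integer>>> blocks = new ArrayList<>();
        blocks.add(invertBlock(block1));
        blocks.add(invertBlock(block2));
        System.out.println(mergeBlocks(blocks));
        // prints {english=[1], football=[2], mike=[1, 2], play=[2], study=[1]}
    }
}
```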
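For convenience I simplified the implementation of this algorithm so that all construction is done directly in memory; readers should keep this in mind. SPIMI is relatively simple, so no diagram is given here.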
Code Implementation
First, the input documents. I used two identical documents; I really could not think of better test data.
doc1.txt:
- Mike studyed English hardly yesterday
- He got the 100 at the last exam
- He thinks English is very interesting
doc2.txt:
- Mike studyed English hardly yesterday
- He got the 100 at the last exam
- He thinks English is very interesting
Below is the document preprocessing class PreTreatTool.java:
- package InvertedIndex;
-
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileNotFoundException;
- import java.io.FileOutputStream;
- import java.io.FileReader;
- import java.io.IOException;
- import java.io.PrintStream;
- import java.util.ArrayList;
- import java.util.regex.Matcher;
- import java.util.regex.Pattern;
-
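- /**
-  * Document preprocessing tool: drops filter words, adjectives/adverbs and
-  * numbers, strips the "ed" suffix, and writes the remaining words to a file.
-  */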
- public class PreTreatTool {
-
- public static String[] FILTER_WORDS = new String[] { "at", "At", "The",
- "the", "is", "very" };
-
-
- private ArrayList<String> docFilePaths;
-
- private ArrayList<String> effectWordPaths;
-
- public PreTreatTool(ArrayList<String> docFilePaths) {
- this.docFilePaths = docFilePaths;
- }
-
-
-
-
-
-
- public ArrayList<String> getEFWPaths() {
- return this.effectWordPaths;
- }
-
-
-
-
-
-
-
- private ArrayList<String> readDataFile(String filePath) {
- File file = new File(filePath);
- ArrayList<String[]> dataArray = new ArrayList<String[]>();
- ArrayList<String> words = new ArrayList<>();
-
- try {
- BufferedReader in = new BufferedReader(new FileReader(file));
- String str;
- String[] tempArray;
- while ((str = in.readLine()) != null) {
- tempArray = str.split(" ");
- dataArray.add(tempArray);
- }
- in.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
-
-
- for (String[] array : dataArray) {
- for (String word : array) {
- words.add(word);
- }
- }
-
- return words;
- }
-
-
-
-
- public void preTreatWords() {
- String baseOutputPath = "";
- int endPos = 0;
- ArrayList<String> tempWords = null;
- effectWordPaths = new ArrayList<>();
-
- for (String filePath : docFilePaths) {
- tempWords = readDataFile(filePath);
- filterWords(tempWords, true);
-
-
- endPos = filePath.lastIndexOf(".");
- baseOutputPath = filePath.substring(0, endPos);
-
- writeOutOperation(tempWords, baseOutputPath + "-efword.txt");
- effectWordPaths.add(baseOutputPath + "-efword.txt");
- }
- }
-
-
-
-
-
-
-
-
-
-
- private void filterWords(ArrayList<String> words, boolean canRepeated) {
- boolean isFilterWord;
-
- Pattern adjPattern;
-
- Pattern formerPattern;
-
- Pattern numberPattern;
- Matcher adjMatcher;
- Matcher formerMatcher;
- Matcher numberMatcher;
- ArrayList<String> deleteWords = new ArrayList<>();
-
- adjPattern = Pattern.compile(".*(ly$|ful$|ing$)");
- formerPattern = Pattern.compile(".*ed$");
- numberPattern = Pattern.compile("[0-9]+(\\.[0-9]+)?");
-
- String w;
- for (int i = 0; i < words.size(); i++) {
- w = words.get(i);
- isFilterWord = false;
-
- for (String fw : FILTER_WORDS) {
- if (fw.equals(w)) {
- deleteWords.add(w);
- isFilterWord = true;
- break;
- }
- }
-
- if (isFilterWord) {
- continue;
- }
-
- adjMatcher = adjPattern.matcher(w);
- formerMatcher = formerPattern.matcher(w);
- numberMatcher = numberPattern.matcher(w);
-
-
- w = w.toLowerCase();
-
-
- if (adjMatcher.matches() || numberMatcher.matches()) {
- deleteWords.add(w);
- } else if (formerMatcher.matches()) {
-
- w = w.substring(0, w.length() - 2);
- }
-
- words.set(i, w);
- }
-
-
- words.removeAll(deleteWords);
- deleteWords.clear();
-
- String s1;
- String s2;
-
-
- for (int i = 0; i < words.size() - 1; i++) {
- s1 = words.get(i);
-
- for (int j = i + 1; j < words.size(); j++) {
- s2 = words.get(j);
-
-
- if (s1.equals(s2)) {
- deleteWords.add(s1);
- break;
- }
- }
- }
-
-
- words.removeAll(deleteWords);
- words.addAll(deleteWords);
- }
-
-
-
-
-
-
-
-
-
- private void writeOutOperation(ArrayList<String> buffer, String filePath) {
- StringBuilder strBuilder = new StringBuilder();
-
-
- for (String word : buffer) {
- strBuilder.append(word);
- strBuilder.append("\n");
- }
-
- File file = new File(filePath);
- // try-with-resources closes the stream so the output is flushed to disk
- try (PrintStream ps = new PrintStream(new FileOutputStream(file))) {
- ps.print(strBuilder.toString());
- } catch (FileNotFoundException e) {
-
- e.printStackTrace();
- }
- }
-
- }
Document class Document.java:
- package InvertedIndex;
-
- import java.util.ArrayList;
-
-
-
-
-
-
- public class Document {
-
- int docId;
-
- String filePath;
-
- ArrayList<String> effectWords;
-
- public Document(ArrayList<String> effectWords, String filePath){
- this.effectWords = effectWords;
- this.filePath = filePath;
- }
-
- public Document(ArrayList<String> effectWords, String filePath, int docId){
- this(effectWords, filePath);
- this.docId = docId;
- }
- }
BSBI tool class BSBITool.java:
- package InvertedIndex;
-
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileNotFoundException;
- import java.io.FileOutputStream;
- import java.io.FileReader;
- import java.io.IOException;
- import java.io.PrintStream;
- import java.util.ArrayList;
- import java.util.HashMap;
- import java.util.Map;
-
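- /**
-  * BSBI tool: maps words to numeric codes, sorts (code, docId) records in
-  * buffers, and merges the per-document files into one inverted index file.
-  */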
- public class BSBITool {
-
- public static int DOC_ID = 0;
-
-
- private int readBufferSize;
-
- private int writeBufferSize;
-
- private ArrayList<String> effectiveWordFiles;
-
- private String outputFilePath;
-
- private String[][] readBuffer1;
-
- private String[][] readBuffer2;
-
- private String[][] writeBuffer;
-
- private Map<String, String> code2word;
-
- public BSBITool(ArrayList<String> effectiveWordFiles, int readBufferSize,
- int writeBufferSize) {
- this.effectiveWordFiles = effectiveWordFiles;
- this.readBufferSize = readBufferSize;
- this.writeBufferSize = writeBufferSize;
-
- initBuffers();
- }
-
-
-
-
- private void initBuffers() {
- readBuffer1 = new String[readBufferSize][2];
- readBuffer2 = new String[readBufferSize][2];
- writeBuffer = new String[writeBufferSize][2];
- }
-
-
-
-
-
-
-
- private Document readEffectWords(String filePath) {
- long hashcode = 0;
-
- String w;
- Document document;
- code2word = new HashMap<String, String>();
- ArrayList<String> words;
-
- words = readDataFile(filePath);
-
- for (int i = 0; i < words.size(); i++) {
- w = words.get(i);
-
- hashcode = BKDRHash(w);
- hashcode = hashcode % 10000;
-
-
- code2word.put(hashcode + "", w);
- w = hashcode + "";
-
- words.set(i, w);
- }
-
- document = new Document(words, filePath, DOC_ID);
- DOC_ID++;
-
- return document;
- }
-
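- /**
-  * Computes a BKDR-style string hash: multiply the running hash by the
-  * seed (31) and add each character in turn.
-  */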
- private long BKDRHash(String str) {
- int seed = 31;
- long hash = 0;
- int i = 0;
-
- for (i = 0; i < str.length(); i++) {
- hash = (hash * seed) + (str.charAt(i));
- }
-
- return hash;
-
- }
-
-
-
-
- public void outputInvertedFiles() {
- int index = 0;
- String baseFilePath = "";
- outputFilePath = "";
- Document doc;
- ArrayList<String> tempPaths;
- ArrayList<String[]> invertedData1;
- ArrayList<String[]> invertedData2;
-
- tempPaths = new ArrayList<>();
- for (String filePath : effectiveWordFiles) {
- doc = readEffectWords(filePath);
- writeOutFile(doc);
-
- index = doc.filePath.lastIndexOf(".");
- baseFilePath = doc.filePath.substring(0, index);
- writeOutOperation(writeBuffer, baseFilePath + "-temp.txt");
-
- tempPaths.add(baseFilePath + "-temp.txt");
- }
-
- outputFilePath = baseFilePath + "-bsbi-inverted.txt";
-
-
- for (int i = 1; i < tempPaths.size(); i++) {
- if (i == 1) {
- invertedData1 = readInvertedFile(tempPaths.get(0));
- } else {
- invertedData1 = readInvertedFile(outputFilePath);
- }
-
- invertedData2 = readInvertedFile(tempPaths.get(i));
-
- mergeInvertedData(invertedData1, invertedData2, false,
- outputFilePath);
-
- writeOutOperation(writeBuffer, outputFilePath, false);
- }
- }
-
-
-
-
-
-
-
- private void writeOutFile(Document doc) {
-
- boolean ifSort = true;
- int index = 0;
- String baseFilePath;
- String[] temp;
- ArrayList<String> tempWords = new ArrayList<>(doc.effectWords);
- ArrayList<String[]> invertedData1;
- ArrayList<String[]> invertedData2;
-
- invertedData1 = new ArrayList<>();
- invertedData2 = new ArrayList<>();
-
-
- for (int i = 0; i < tempWords.size() / 2; i++) {
- temp = new String[2];
- temp[0] = tempWords.get(i);
- temp[1] = doc.docId + "";
- invertedData1.add(temp);
-
- temp = new String[2];
- temp[0] = tempWords.get(i + tempWords.size() / 2);
- temp[1] = doc.docId + "";
- invertedData2.add(temp);
- }
-
-
- if (tempWords.size() % 2 == 1) {
- temp = new String[2];
- temp[0] = tempWords.get(tempWords.size() - 1);
- temp[1] = doc.docId + "";
- invertedData2.add(temp);
- }
-
- index = doc.filePath.lastIndexOf(".");
- baseFilePath = doc.filePath.substring(0, index);
- mergeInvertedData(invertedData1, invertedData2, ifSort, baseFilePath
- + "-temp.txt");
- }
-
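- /**
-  * Merges the two sorted read buffers into the write buffer in ascending
-  * code order; when both hold the same code, the doc IDs are joined with ":".
-  */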
- private void mergeWordBuffers(String outputPath) {
- int i = 0;
- int j = 0;
- int num1 = 0;
- int num2 = 0;
-
- int writeIndex = 0;
-
- while (readBuffer1[i][0] != null && readBuffer2[j][0] != null) {
- num1 = Integer.parseInt(readBuffer1[i][0]);
- num2 = Integer.parseInt(readBuffer2[j][0]);
-
-
- if (num1 < num2) {
- writeBuffer[writeIndex][0] = num1 + "";
- writeBuffer[writeIndex][1] = readBuffer1[i][1];
-
- i++;
- } else if (num2 < num1) {
- writeBuffer[writeIndex][0] = num2 + "";
- writeBuffer[writeIndex][1] = readBuffer2[j][1];
-
- j++;
- } else if (num1 == num2) {
-
- writeBuffer[writeIndex][0] = num1 + "";
- writeBuffer[writeIndex][1] = readBuffer1[i][1] + ":"
- + readBuffer2[j][1];
-
- i++;
- j++;
- }
-
-
- writeIndex++;
-
-
- if (writeIndex >= writeBufferSize) {
- writeOutOperation(writeBuffer, outputPath);
- writeIndex = 0;
- }
- }
-
- if (readBuffer1[i][0] == null) {
- writeRemainReadBuffer(readBuffer2, j, outputPath);
- }
-
- if (readBuffer2[j][0] == null) {
- writeRemainReadBuffer(readBuffer1, i, outputPath);
- }
- }
-
-
-
-
-
-
-
-
-
- private void writeOutOperation(String[][] buffer, String filePath) {
- String word;
- StringBuilder strBuilder = new StringBuilder();
-
-
- for (String[] array : buffer) {
- if (array[0] == null) {
- continue;
- }
-
- word = array[0];
-
- strBuilder.append(word);
- strBuilder.append(" ");
- strBuilder.append(array[1]);
- strBuilder.append("\n");
- }
-
- File file = new File(filePath);
- // try-with-resources closes the stream so the output is flushed to disk
- try (PrintStream ps = new PrintStream(new FileOutputStream(file))) {
- ps.print(strBuilder.toString());
- } catch (FileNotFoundException e) {
-
- e.printStackTrace();
- }
- }
-
-
-
-
-
-
-
-
-
-
-
- private void writeOutOperation(String[][] buffer, String filePath, boolean isCoded) {
- String word;
- StringBuilder strBuilder = new StringBuilder();
-
-
- for (String[] array : buffer) {
- if (array[0] == null) {
- continue;
- }
-
- if(!isCoded){
- word = code2word.get(array[0]);
- }else{
- word = array[0];
- }
-
- strBuilder.append(word);
- strBuilder.append(" ");
- strBuilder.append(array[1]);
- strBuilder.append("\n");
- }
-
- File file = new File(filePath);
- // try-with-resources closes the stream so the output is flushed to disk
- try (PrintStream ps = new PrintStream(new FileOutputStream(file))) {
- ps.print(strBuilder.toString());
- } catch (FileNotFoundException e) {
-
- e.printStackTrace();
- }
- }
-
-
-
-
-
-
-
-
-
-
-
- private void writeRemainReadBuffer(String[][] remainBuffer,
- int currentReadPos, String outputPath) {
- // check the bound first so a full buffer cannot be indexed past its end
- while (currentReadPos < readBufferSize
- && remainBuffer[currentReadPos][0] != null) {
- removeRBToWB(remainBuffer[currentReadPos]);
-
- currentReadPos++;
-
-
- if (writeBuffer[writeBufferSize - 1][0] != null) {
- writeOutOperation(writeBuffer, outputPath);
- }
- }
-
- }
-
-
-
-
-
-
- private void removeRBToWB(String[] record) {
- int insertIndex = 0;
- int endIndex = 0;
- long num1;
- long num2;
- long code = Long.parseLong(record[0]);
-
-
- if (writeBuffer[0][0] == null) {
- writeBuffer[0] = record;
- return;
- }
-
-
- for (int i = 0; i < writeBufferSize - 1; i++) {
- if (writeBuffer[i][0] == null) {
- endIndex = i;
- break;
- }
-
- num1 = Long.parseLong(writeBuffer[i][0]);
-
- if (writeBuffer[i + 1][0] == null) {
- if (code > num1) {
- endIndex = i + 1;
- insertIndex = i + 1;
- }
- } else {
- num2 = Long.parseLong(writeBuffer[i + 1][0]);
-
- if (code > num1 && code < num2) {
- insertIndex = i + 1;
- }
- }
- }
-
-
- for (int i = endIndex; i > insertIndex; i--) {
- writeBuffer[i] = writeBuffer[i - 1];
- }
- writeBuffer[insertIndex] = record;
- }
-
-
-
-
-
-
-
-
-
-
-
-
-
- private void mergeInvertedData(ArrayList<String[]> invertedData1,
- ArrayList<String[]> invertedData2, boolean ifSort, String outputPath) {
- int rIndex1 = 0;
- int rIndex2 = 0;
-
-
- initBuffers();
-
- while (invertedData1.size() > 0 && invertedData2.size() > 0) {
- readBuffer1[rIndex1][0] = invertedData1.get(0)[0];
- readBuffer1[rIndex1][1] = invertedData1.get(0)[1];
-
- readBuffer2[rIndex2][0] = invertedData2.get(0)[0];
- readBuffer2[rIndex2][1] = invertedData2.get(0)[1];
-
- invertedData1.remove(0);
- invertedData2.remove(0);
- rIndex1++;
- rIndex2++;
-
- if (rIndex1 == readBufferSize) {
- if (ifSort) {
- wordBufferSort(readBuffer1);
- wordBufferSort(readBuffer2);
- }
-
- mergeWordBuffers(outputPath);
- initBuffers();
- }
- }
-
- if (ifSort) {
- wordBufferSort(readBuffer1);
- wordBufferSort(readBuffer2);
- }
-
- mergeWordBuffers(outputPath);
- readBuffer1 = new String[readBufferSize][2];
- readBuffer2 = new String[readBufferSize][2];
-
- if (invertedData1.size() == 0 && invertedData2.size() > 0) {
- readRemainDataToRB(invertedData2, outputPath);
- } else if (invertedData1.size() > 0 && invertedData2.size() == 0) {
- readRemainDataToRB(invertedData1, outputPath);
- }
- }
-
-
-
-
-
-
-
-
-
- private void readRemainDataToRB(ArrayList<String[]> remainData,
- String outputPath) {
- int rIndex = 0;
- while (remainData.size() > 0) {
- readBuffer1[rIndex][0] = remainData.get(0)[0];
- readBuffer1[rIndex][1] = remainData.get(0)[1];
- remainData.remove(0);
-
- rIndex++;
-
-
- if (readBuffer1[readBufferSize - 1][0] != null) {
- wordBufferSort(readBuffer1);
-
- writeRemainReadBuffer(readBuffer1, 0, outputPath);
- initBuffers();
- }
- }
-
- wordBufferSort(readBuffer1);
-
- writeRemainReadBuffer(readBuffer1, 0, outputPath);
-
- }
-
-
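- /**
-  * Sorts a buffer in ascending code order with a selection sort;
-  * empty (null) slots are skipped.
-  */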
- private void wordBufferSort(String[][] buffer) {
- String[] temp;
- int k = 0;
-
- long num1 = 0;
- long num2 = 0;
- for (int i = 0; i < buffer.length - 1; i++) {
-
- if (buffer[i][0] == null) {
- continue;
- }
-
- k = i;
- for (int j = i + 1; j < buffer.length; j++) {
-
- if (buffer[j][0] == null) {
- continue;
- }
-
- num1 = Long.parseLong(buffer[k][0]);
- num2 = Long.parseLong(buffer[j][0]);
-
- if (num2 < num1) {
- k = j;
- }
- }
-
- if (k != i) {
- temp = buffer[k];
- buffer[k] = buffer[i];
- buffer[i] = temp;
- }
- }
- }
-
-
-
-
-
-
-
- private ArrayList<String[]> readInvertedFile(String filePath) {
- File file = new File(filePath);
- ArrayList<String[]> dataArray = new ArrayList<String[]>();
-
- try {
- BufferedReader in = new BufferedReader(new FileReader(file));
- String str;
- String[] tempArray;
- while ((str = in.readLine()) != null) {
- tempArray = str.split(" ");
- dataArray.add(tempArray);
- }
- in.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
-
- return dataArray;
- }
-
-
-
-
-
-
-
- private ArrayList<String> readDataFile(String filePath) {
- File file = new File(filePath);
- ArrayList<String[]> dataArray = new ArrayList<String[]>();
- ArrayList<String> words = new ArrayList<>();
-
- try {
- BufferedReader in = new BufferedReader(new FileReader(file));
- String str;
- String[] tempArray;
- while ((str = in.readLine()) != null) {
- tempArray = str.split(" ");
- dataArray.add(tempArray);
- }
- in.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
-
-
- for (String[] array : dataArray) {
- for (String word : array) {
- if (!word.equals("")) {
- words.add(word);
- }
- }
- }
-
- return words;
- }
- }
SPIMI tool class SPIMITool.java:
Test class Client.java:
- package InvertedIndex;
-
- import java.util.ArrayList;
-
-
-
-
-
-
- public class Client {
- public static void main(String[] args){
-
- int readBufferSize;
- int writeBufferSize;
- String baseFilePath;
- PreTreatTool preTool;
-
- BSBITool bTool;
-
- SPIMITool sTool;
-
- ArrayList<String> efwFilePaths;
- ArrayList<String> docFilePaths;
-
- readBufferSize = 10;
- writeBufferSize = 20;
- baseFilePath = "C:\\Users\\lyq\\Desktop\\icon\\";
- docFilePaths = new ArrayList<>();
- docFilePaths.add(baseFilePath + "doc1.txt");
- docFilePaths.add(baseFilePath + "doc2.txt");
-
-
- preTool = new PreTreatTool(docFilePaths);
- preTool.preTreatWords();
-
-
- efwFilePaths = preTool.getEFWPaths();
- bTool = new BSBITool(efwFilePaths, readBufferSize, writeBufferSize);
- bTool.outputInvertedFiles();
-
- sTool = new SPIMITool(efwFilePaths);
- sTool.createInvertedIndexFile();
- }
- }
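Algorithm output:
To make things more realistic, all output is written to files.
First, the effective-word files produced by the preprocessing class, doc1-efword.txt and doc2-efword.txt: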
- mike
- study
- yesterday
- got
- last
- exam
- thinks
- english
- he
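As you can see, modifiers and similar words have been filtered out.
Below is the intermediate file generated by the BSBI algorithm, in which words are mapped to codes; you probably cannot tell which words these values actually represent: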
- 1426 0
- 1542 0
- 2540 0
- 3056 0
- 3325 0
- 4326 0
- 4897 0
- 6329 0
- 7327 0
And the temporary file for document 2:
- 1426 1
- 1542 1
- 2540 1
- 3056 1
- 3325 1
- 4326 1
- 4897 1
- 6329 1
- 7327 1
Merging the information from these two documents gives the final inverted index file:
- yesterday 0:1
- mike 0:1
- got 0:1
- english 0:1
- he 0:1
- last 0:1
- thinks 0:1
- study 0:1
- exam 0:1
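The order of the lines in this file follows the numeric codes in the temporary files rather than the words themselves. The small check below recomputes a few of those codes with the same recurrence as BKDRHash in BSBITool followed by the modulo 10000 step; the class name TermCodeDemo and the method names are invented for this example.

```java
class TermCodeDemo {
    // Same recurrence as BKDRHash in BSBITool: hash = hash * 31 + character.
    static long bkdrHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }

    // The code stored in the temporary files: the hash reduced modulo 10000.
    static long code(String word) {
        return bkdrHash(word) % 10000;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"mike", "got", "he", "last", "study", "exam"}) {
            System.out.println(w + " -> " + code(w));
        }
        // prints:
        // mike -> 1542
        // got -> 2540
        // he -> 3325
        // last -> 4326
        // study -> 6329
        // exam -> 7327
    }
}
```

Reducing the hash modulo 10000 leaves only 10000 possible codes, so two different words can end up with the same code on a larger vocabulary; since code2word is a plain map keyed by the code, the later word would overwrite the earlier one.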
Likewise, the output of the SPIMI algorithm:
- mike 1:2
- study 1:2
- yesterday 1:2
- got 1:2
- last 1:2
- exam 1:2
- thinks 1:2
- english 1:2
- he 1:2
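Summary
While implementing these algorithms I clearly underestimated their difficulty, especially BSBI. When the read and write buffers are manipulated, many cases have to be handled: when the write buffer is full it must be flushed to disk, and when a read buffer is full its contents must be merge-sorted into the write buffer. There are simply too many conditions to check, and because I had not thought this through early on, the readability of the code suffers. In the end I just made the buffers large in order to get the whole flow working first. So please treat this implementation as an aid to understanding and do not use it in practice. The same goes for SPIMI: the implementation here is meant to help you understand the algorithm, and it still has many shortcomings.
One more point: the document preprocessing only filters words in a token way. Real filtering is far more complex than what I wrote; it has to deal with modifiers, tense variations, adverbs and so on, and sometimes needs semantic mining techniques. I hope the general idea comes across.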