Carrot2 Clustering Algorithm: An Overview

I. Experimental environment: Carrot2
Input data type: a two-dimensional String array
Input values:

 

String[][] documents = new String[][] {
    { "Introduction yourSelf", "上海" }, // 0
    { "KD Nuggets", "中国上海" }, // 1
    { "The Data Mine", "上海" }, // 2
    { "DMG", "上海浦东" }, // 3
    { "Two Crows: Data mining glossary", "中国上海" }, // 4
    { "中国上海浦东",
      "http://www-db.stanford.edu/~ullman/mining/mining.html" }, // 5
    { "中国上海", "http://www.thearling.com/" }, // 6
    { "中国上海浦西",
      "http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex" }, // 7
    { "CCSU - Data Mining", "中国上海浦东" }, // 8
    { "Data Mining: Practical Machine Learning Tools and Techniques",
      "中国上海浦西" }, // 9
    { "Data Mining - Monografias.com", "中国上海" } // 10
};
 

Output (Clusters):

CL-1 中国上海 (4 documents) 
     1: dummy://6 
     2: dummy://10 
     3: dummy://1 
     4: dummy://4 
 
CL-2 Data Mining (4 documents) 
     1: dummy://2 
     2: dummy://10 
     3: dummy://8 
     4: dummy://4 
 
CL-3 中国上海浦西 (2 documents) 
     1: dummy://7 
     2: dummy://9 

 

II. Main classes and methods involved

 

LocalControllerBase.query → LocalProcessBase.query → LocalInputComponentBase.endProcessing() → LingoLocalFilterComponent → MultilingualClusteringContext.cluster() → MultilingualClusteringContext.extractFeatures() → LsiClusteringStrategy.cluster()

 

III. Algorithm trace

Part 1: Feature extraction. Goal: obtain Feature[] singleTerms.
1) Concatenate all documents into a single string using dedicated separators, as follows:

 

introduct yourself .  上海  | KD nugget .  中国上海  | the data mine .  上海  | 
DMG .  上海浦东  | two crow data mine glossari .  中国上海  |  中国上海浦东  .
ullman mine |  中国上海  . |  中国上海浦西  . norman | CCSU data mine .  中国
上海浦东  | data mine practic machin learn tool and    techniqu .  中国上海浦西
| data mine monografias.com .  中国上海 
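The concatenation step can be sketched roughly as follows. This is a simplified illustration, not Carrot2's actual code: each document's fields are joined with " . " and documents with " | ", while the stemming visible above (e.g. introduction → introduct) is omitted:

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatDemo {
    /** Joins each document's fields with " . " and documents with " | ",
     *  mirroring the separator layout shown above (stemming omitted). */
    static String concatenate(String[][] documents) {
        List<String> docs = new ArrayList<>();
        for (String[] doc : documents) {
            docs.add(String.join(" . ", doc));
        }
        return String.join(" | ", docs);
    }

    public static void main(String[] args) {
        String[][] documents = {
            { "Introduction yourSelf", "上海" },
            { "KD Nuggets", "中国上海" }
        };
        System.out.println(concatenate(documents));
        // Introduction yourSelf . 上海 | KD Nuggets . 中国上海
    }
}
```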

 

 

2) Assign each distinct word an integer code. The unique words and their codes:

 

Map<String, Integer> words = new HashMap<>();
{and=23, 中国上海=5, glossari=13, norman=17, techniqu=24,
monografias.com=25, 上海浦东=10, crow=12, nugget=4, tool=22, ccsu=18, kd=3,
ullman=15, the=6, practic=19, dmg=9, machin=20, yourself=1, two=11, data=7,
introduct=0, 中国上海浦东=14, mine=8, learn=21, 上海=2, 中国上海浦西=16}
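The code assignment shown above can be reproduced with an ordinary map that hands out the next integer the first time a token is seen. A minimal sketch (the real component also lowercases and filters punctuation, which is omitted here):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCodes {
    /** Assigns each distinct token an incremental integer code,
     *  in order of first appearance. */
    static Map<String, Integer> assignCodes(String[] tokens) {
        Map<String, Integer> words = new LinkedHashMap<>();
        for (String token : tokens) {
            // words.size() is the next unused code; no-op if already present
            words.putIfAbsent(token, words.size());
        }
        return words;
    }

    public static void main(String[] args) {
        String[] tokens = { "introduct", "yourself", "上海", "kd", "上海" };
        System.out.println(assignCodes(tokens));
        // {introduct=0, yourself=1, 上海=2, kd=3}
    }
}
```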

Wrap the keywords, recording their occurrence counts and positions.
Pseudocode:
    for each word {
        record its position, code, and document id
    }
 
Runtime values of the relevant variables:

 

// 63 tokens in total, of which 26 are distinct; -1 marks the end, and the
// large integers stand for punctuation or the "|" separator
intData[63] = [0, 1, 2147483647, 2, 2147483646, 3, 4, 2147483645, 5,
2147483644, 6, 7, 8, 2147483643, 2, 2147483642, 9, 2147483641, 10,
2147483640, 11, 12, 7, 8, 13, 2147483639, 5, 2147483638, 14, 2147483637,
15, 8, 2147483636, 5, 2147483635, 2147483634, 16, 2147483633,
17, 2147483632, 18, 7, 8, 2147483631, 14, 2147483630, 7, 8, 19, 20, 21, 22,
23, 24, 2147483629, 16, 2147483628, 7, 8, 25, 2147483627, 5, -1]
// character position of each token within the concatenated string
wordPositions[63] = [0, 10, 19, 21, 24, 26, 29, 36, 38, 43, 45, 49, 54, 59, 61,
64, 66, 70, 72, 77, 79, 83, 88, 93, 98, 107, 109, 114, 116, 123, 125, 132, 137,
139, 144, 146, 148, 155, 157, 164, 166, 171, 176, 181, 183, 190, 192, 197,
202, 210, 217, 223, 228, 232, 241, 243, 250, 252, 257, 262, 278, 280, 285]
// document in which each token occurs
documentIndices[62] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 10, 10, 10, 10, 10, 10]

 

 

3) Record the documents in which each word occurs. The rows below are obtained
by traversing intData[63]: for example introduct has code 0 and occurs in
document 0, so row 0 has a 1 in column 0.

 

[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 
0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 
1, 0, 1, 1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 
0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 
0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]] 
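This incidence matrix can be rebuilt from the code/document pairs recorded above: for each token, set the cell at its code's row and its document's column. A sketch using this example's dimensions (26 terms, 11 documents):

```java
public class TermDocMatrix {
    /** termDoc[code][doc] = 1 if the word with that code occurs in doc. */
    static int[][] build(int[] codes, int[] docIndices, int termCount, int docCount) {
        int[][] termDoc = new int[termCount][docCount];
        for (int i = 0; i < codes.length; i++) {
            termDoc[codes[i]][docIndices[i]] = 1;
        }
        return termDoc;
    }

    public static void main(String[] args) {
        // "上海" has code 2 and occurs in documents 0 and 2:
        int[] codes = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 2 };
        int[] docs  = { 0, 0, 0, 1, 1, 1, 2, 2, 2, 2 };
        int[][] m = build(codes, docs, 26, 11);
        System.out.println(java.util.Arrays.toString(m[2]));
        // [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    }
}
```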

 

 

4) Build Feature[] singleTerms = new Feature[termCount]; // one Feature per
word, 26 in total; tf is the number of occurrences across all documents.
Traversing intData[63] yields:
 

[introduct tf=1 tfidf=0.00 ST null, yourself tf=1 tfidf=0.00 ST null,  上海  tf=2 
tfidf=0.00 null, KD tf=1 tfidf=0.00 ST null, nugget tf=1 tfidf=0.00 ST null,  中国
上海  tf=4 tfidf=0.00 ST null, the tf=1 tfidf=0.00 ST null, data tf=5 tfidf=0.00 
ST null, mine tf=6 tfidf=0.00 ST null, DMG tf=1 tfidf=0.00 ST null,  上海浦东 
tf=1 tfidf=0.00 null, two tf=1 tfidf=0.00 ST null, crow tf=1 tfidf=0.00 ST null, 
glossari tf=1 tfidf=0.00 ST null,  中国上海浦东  tf=2 tfidf=0.00 ST null, ullman 
tf=1 tfidf=0.00 null,  中国上海浦西  tf=2 tfidf=0.00 ST null, norman tf=1 
tfidf=0.00 null, CCSU tf=1 tfidf=0.00 ST null, practic tf=1 tfidf=0.00 ST null, 
machin tf=1 tfidf=0.00 ST null, learn tf=1 tfidf=0.00 ST null, tool tf=1 
tfidf=0.00 ST null, and tf=1 tfidf=0.00 ST null, techniqu tf=1 tfidf=0.00 ST null, 
monografias.com tf=1 tfidf=0.00 ST null] 

 

 

Next, compute tfidf and check whether each term is a strong word, boosting its weight if so. The recomputed features:

 

[introduct tf=2 tfidf=4.80 ST null, yourself tf=2 tfidf=4.80 ST null,  上海  tf=2 
tfidf=3.41 null, KD tf=2 tfidf=4.80 ST null, nugget tf=2 tfidf=4.80 ST null,  中国
上海  tf=10 tfidf=10.12 ST null, the tf=2 tfidf=4.80 ST null, data tf=12 
tfidf=9.46 ST null, mine tf=15 tfidf=9.09 ST null, DMG tf=2 tfidf=4.80 ST null, 
上海浦东  tf=1 tfidf=2.40 null, two tf=2 tfidf=4.80 ST null, crow tf=2 
tfidf=4.80 ST null, glossari tf=2 tfidf=4.80 ST null,  中国上海浦东  tf=5 
tfidf=8.52 ST null, ullman tf=1 tfidf=2.40 null,  中国上海浦西  tf=5 tfidf=8.52 
ST null, norman tf=1 tfidf=2.40 null, CCSU tf=2 tfidf=4.80 ST null, practic tf=2 tfidf=4.80 ST null, machin tf=2 tfidf=4.80 ST null, learn tf=2 tfidf=4.80   
ST null, tool tf=2 tfidf=4.80 ST null, and tf=2 tfidf=4.80 ST null, techniqu tf=2 
tfidf=4.80 ST null, monografias.com tf=2 tfidf=4.80 ST null] 
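The tfidf values above are consistent with the classic formula tfidf = tf × ln(N / df), with N = 11 documents and df the number of documents containing the term (cf. the setIdf call in the Feature class below). For example, "上海" (tf = 2, df = 2) gives 2 × ln(11/2) ≈ 3.41, and "中国上海" (tf = 10, df = 4) gives 10 × ln(11/4) ≈ 10.12:

```java
public class TfIdfCheck {
    /** tf-idf with natural log, as in singleTerms[i].setIdf(Math.log(...)). */
    static double tfidf(int tf, int df, int totalDocs) {
        return tf * Math.log((double) totalDocs / df);
    }

    public static void main(String[] args) {
        System.out.printf("上海: %.2f%n", tfidf(2, 2, 11));      // 3.41
        System.out.printf("中国上海: %.2f%n", tfidf(10, 4, 11)); // 10.12
        System.out.printf("data: %.2f%n", tfidf(12, 5, 11));     // 9.46
    }
}
```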

 

 

As a last step for singleTerms, set the language and normalize each term's text, e.g. glossari is normalized to Glossary:

 

[Introduction tf=2 tfidf=4.80 ST en, yourself tf=2 tfidf=4.80 SW ST en,  上海 
tf=2 tfidf=3.41 en, KD tf=2 tfidf=4.80 ST en, Nuggets tf=2 tfidf=4.80 ST en,  中
国上海  tf=10 tfidf=10.12 ST en, the tf=2 tfidf=4.80 SW ST en, Data tf=12 
tfidf=9.46 ST en, Mining tf=15 tfidf=9.09 ST en, DMG tf=2 tfidf=4.80 ST en,  上
海浦东  tf=1 tfidf=2.40 en, two tf=2 tfidf=4.80 SW ST en, Crows tf=2 
tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en,  中国上海浦东  tf=5 
tfidf=8.52 ST en, Ullman tf=1 tfidf=2.40 en,  中国上海浦西  tf=5 tfidf=8.52 ST 
en, Norman tf=1 tfidf=2.40 en, CCSU tf=2 tfidf=4.80 ST en, Practical tf=2 
tfidf=4.80 ST en, Machine tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, 
Tools tf=2 tfidf=4.80 ST en, and tf=2 tfidf=4.80 SW ST en, Techniques tf=2 
tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en] 
 

 

At this point, the Feature fields deserve a proper listing:

 

/** A unique integer code of this feature */
private int code;

/** String representation of this feature */
private String text;

/** ISO code of the language to which Lingo believes this feature belongs */
private String language;

/** The number of occurrences of this feature in all input snippets */
private int tf;

/**
 * Set as: singleTerms[i].setIdf(Math.log(
 *     (double) 11 / number of documents containing this word));
 */
private double idf;

/** Length (in words) of this feature */
private int length;

/** True if this feature is a stop word */
private boolean stopWord;

/** True if this feature is among query words */
private boolean queryWord;

/** True if this feature appeared in some snippet's title */
private boolean strong;

/**
 * Used for phrase features only. An array pointing to features
 * representing the phrase's individual words.
 */
private int[] phraseFeatureIndices;

/**
 * Indices of the documents in which this feature occurs;
 * for the word "上海" this is [0, 2].
 */
private int[] snippetIndices;

/**
 * The termDocument row for this word, e.g. [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 * marking the documents in which it occurs.
 */
private int[] snippetTf;

 

 

Part 2: Clustering
1) Sort the features: non-stop words first, then by tf in descending order:

 

[Mining tf=15 tfidf=9.09 ST en, Data tf=12 tfidf=9.46 ST en,  中国上海  tf=10 
tfidf=10.12 ST en,  中国上海浦东  tf=5 tfidf=8.52 ST en,  中国上海浦西
tf=5 tfidf=8.52 ST en, CCSU tf=2 tfidf=4.80 ST en, Crows tf=2 tfidf=4.80
ST en, DMG tf=2 tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, Introduction 
tf=2 tfidf=4.80 ST en, KD tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, 
Machine tf=2 tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en, 
Nuggets tf=2 tfidf=4.80 ST en, Practical tf=2 tfidf=4.80 ST en, Techniques tf=2 
tfidf=4.80 ST en, Tools tf=2 tfidf=4.80 ST en,  上海  tf=2 tfidf=3.41 en, 
Norman tf=1 tfidf=2.40 en, Ullman tf=1 tfidf=2.40 en,  上海浦东  tf=1 
tfidf=2.40 en, and tf=2 tfidf=4.80 SW ST en, the tf=2 tfidf=4.80 SW ST en, two 
tf=2 tfidf=4.80 SW ST en, yourself tf=2 tfidf=4.80 SW ST en] 
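The ordering above (non-stop words first, higher tf first within each group) can be expressed as a comparator. A sketch, assuming a minimal record with just the two fields the sort needs:

```java
import java.util.Arrays;
import java.util.Comparator;

public class FeatureSort {
    record Term(String text, int tf, boolean stopWord) {}

    /** Sorts in place: stop words last, then tf descending. */
    static Term[] sorted(Term[] terms) {
        Arrays.sort(terms, Comparator
            .comparing(Term::stopWord)                          // false sorts before true
            .thenComparing(Comparator.comparingInt(Term::tf).reversed()));
        return terms;
    }

    public static void main(String[] args) {
        Term[] terms = {
            new Term("and", 2, true),
            new Term("Mining", 15, false),
            new Term("中国上海", 10, false),
            new Term("CCSU", 2, false)
        };
        for (Term t : sorted(terms)) System.out.println(t.text());
        // Mining, 中国上海, CCSU, and
    }
}
```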
 

 

 

2) Of the 26 keywords above, some are stop words. After removing them, a term-document matrix is built over the remaining terms; in this run its dimensions are double[][] tdMatrix = new double[19][11].

 

Matrix construction rule:

tdMatrix[term][snippetIndices[doc]] =
    features[term].getSnippetTf()[snippetIndices[doc]] * features[term].getIdf();

tdMatrix = new Matrix(matrix); // Matrix comes from Jama-1.0.2, a dedicated matrix library (see TfidfTdMatrixBuildingStrategy.buildTdMatrix).
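A sketch of the construction rule quoted above, filling each cell with snippetTf × idf for the retained terms. This is a simplified stand-in for TfidfTdMatrixBuildingStrategy, using plain arrays rather than the Feature class:

```java
public class TdMatrixBuild {
    /** tdMatrix[term][doc] = snippetTf[term][doc] * idf[term],
     *  per the strategy quoted above. */
    static double[][] buildTdMatrix(int[][] snippetTf, double[] idf) {
        int terms = snippetTf.length, docs = snippetTf[0].length;
        double[][] tdMatrix = new double[terms][docs];
        for (int t = 0; t < terms; t++) {
            for (int d = 0; d < docs; d++) {
                tdMatrix[t][d] = snippetTf[t][d] * idf[t];
            }
        }
        return tdMatrix;
    }

    public static void main(String[] args) {
        // Two terms over three documents, with hypothetical idf values:
        int[][] snippetTf = { { 1, 0, 1 }, { 0, 1, 0 } };
        double[] idf = { 1.7, 1.0 };
        double[][] m = buildTdMatrix(snippetTf, idf);
        System.out.println(m[0][0] + " " + m[0][2] + " " + m[1][1]);
        // 1.7 1.7 1.0
    }
}
```

The resulting array is what gets wrapped in Jama's Matrix for the SVD-based steps that follow.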

 

3) From this matrix a good deal of further computation follows; interested readers can consult the source in LsiClusteringStrategy.createClusters(). Even leaving the matrix internals aside, with the data gathered so far we could write our own algorithm to work out which keyword (or set of keywords) occurs in which documents, which is exactly the clustering effect we are after.
