Carrot2 聚类算法概要说明
一、实验环境:Carrort2
输入数据类型:数组
输入值:
String[][] documents = new String[][] { { "Introduction yourSelf", "上海" },// 0 { "KD Nuggets", "中国上海" },// 1 { "The Data Mine", "上海" },// 2 { "DMG", "上海浦东" },// 3 { "Two Crows: Data mining glossary", "中国上海" },// 4 { "中国上海浦东", "http://www-db.stanford.edu/~ullman/mining/mining.html" },// 5 { "中国上海", "http://www.thearling.com/" },// 6 { "中国上海浦西", "http://www.eco.utexas.edu/~norman/BUS.FOR/course.mat/Alex" },// 7 { "CCSU - Data Mining", "中国上海浦东" },// 8 { "Data Mining: Practical Machine Learning Tools and Techniques", "中国上海浦西" },// 19 { "Data Mining - Monografias.com", "中国上海" } // 10
输出结果为:Clusters:
CL-1 中国上海 (4 documents) 1: dummy://6 2: dummy://10 3: dummy://1 4: dummy://4 CL-2 Data Mining (4 documents) 1: dummy://2 2: dummy://10 3: dummy://8 4: dummy://4 CL-3 中国上海浦西 (2 documents) 1: dummy://7 2: dummy://9
二、算法主要处理的类及方法
LocalControllerBase.queryàLocalProcessBase.queryàLocalInputComponentBase.endProcessing()àLingoLocalFilterComponentàMultilingualClusteringContext.cluster()àMultilingualClusteringContext.extractFeatures()àLsiClusteringStrategy.cluster()
三、算法跟踪
第一部分,抽取数据:目的,得到Feature[] singleTerms
1) 将所有的documents用特定的分隔符组成一个字符串,如下:
introduct yourself . 上海 | KD nugget . 中国上海 | the data mine . 上海 | DMG . 上海浦东 | two crow data mine glossari . 中国上海 | 中国上海浦东 . ullman mine | 中国上海 . | 中国上海浦西 . norman | CCSU data mine . 中国 上海浦东 | data mine practic machin learn tool and techniqu . 中国上海浦西 | data mine monografias.com . 中国上海
2) 记录每一个词的code,所有不重复的词,如下:
words = hashMap<String,int>(); {and=23, 中国上海=5, glossari=13, norman=17, techniqu=24, monografias.com=25, 上海浦东=10, crow=12, nugget=4, tool=22, ccsu=18, kd=3 ullman=15, the=6, practic=19, dmg=9, machin=20, yourself=1, two=11, data=7, introduct=0, 中国上海浦东=14, mine=8, learn=21, 上海=2, 中国上海浦西=16}
包装关键词,记录它们出现的次数及位置
伪代码如下:
遍历所有的词{
记录每个词的位置、code、及文档的id
}
以下给出运行时变量值:
//所有的词共有63个,其中不重复的是26个,-1表示结束,大的 整数表示标点符号或者分隔符“|” intData[63] = [0, 1, 2147483647, 2, 2147483646, 3, 4, 2147483645, 5, 2147483644, 6, 7, 8, 2147483643, 2, 2147483642, 9, 2147483641, 10, 2147483640, 11, 12, 7, 8, 13, 2147483639, 5, 2147483638, 14, 2147483637, 15, 8, 2147483636, 5, 2147483635, 2147483634, 16, 2147483633, 17,2147483632, 18, 7, 8, 2147483631, 14, 2147483630, 7, 8, 19, 20, 21, 22, 23, 24, 2147483629, 16, 2147483628, 7, 8, 25, 2147483627, 5, -1] //词在字串中的位置 wordPositions[63] = [0, 10, 19, 21, 24, 26, 29, 36, 38, 43, 45, 49, 54, 59, 61, 64, 66, 70, 72, 77, 79, 83, 88, 93, 98, 107, 109, 114, 116, 123, 125, 132, 137, 139, 144, 146, 148, 155, 157, 164, 166, 171, 176, 181, 183, 190, 192, 197, 202, 210, 217, 223, 228, 232, 241, 243, 250, 252, 257, 262, 278, 280, 285] //这些词分别出现的文档 documentIndices[62] = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10]
3)记录每个词出现的文档:下面的值是遍历intData[63]得出来的,记录每个
词出现的文档,比如introduct=0,第0个词,出现在第0个文档中
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
4)记录Feature[] singleTerms = new Feature[termCount]; //记录每个词的属性,
共26个词,tf表示在所有的文档中出现的次数。遍历所有的intData[63]得到结果。
[introduct tf=1 tfidf=0.00 ST null, yourself tf=1 tfidf=0.00 ST null, 上海 tf=2 tfidf=0.00 null, KD tf=1 tfidf=0.00 ST null, nugget tf=1 tfidf=0.00 ST null, 中国 上海 tf=4 tfidf=0.00 ST null, the tf=1 tfidf=0.00 ST null, data tf=5 tfidf=0.00 ST null, mine tf=6 tfidf=0.00 ST null, DMG tf=1 tfidf=0.00 ST null, 上海浦东 tf=1 tfidf=0.00 null, two tf=1 tfidf=0.00 ST null, crow tf=1 tfidf=0.00 ST null, glossari tf=1 tfidf=0.00 ST null, 中国上海浦东 tf=2 tfidf=0.00 ST null, ullman tf=1 tfidf=0.00 null, 中国上海浦西 tf=2 tfidf=0.00 ST null, norman tf=1 tfidf=0.00 null, CCSU tf=1 tfidf=0.00 ST null, practic tf=1 tfidf=0.00 ST null, machin tf=1 tfidf=0.00 ST null, learn tf=1 tfidf=0.00 ST null, tool tf=1 tfidf=0.00 ST null, and tf=1 tfidf=0.00 ST null, techniqu tf=1 tfidf=0.00 ST null, monografias.com tf=1 tfidf=0.00 ST null]
计算tfidf,判断是否是Strongword,如果是加权重。再次计算结果为:
[introduct tf=2 tfidf=4.80 ST null, yourself tf=2 tfidf=4.80 ST null, 上海 tf=2 tfidf=3.41 null, KD tf=2 tfidf=4.80 ST null, nugget tf=2 tfidf=4.80 ST null, 中国 上海 tf=10 tfidf=10.12 ST null, the tf=2 tfidf=4.80 ST null, data tf=12 tfidf=9.46 ST null, mine tf=15 tfidf=9.09 ST null, DMG tf=2 tfidf=4.80 ST null, 上海浦东 tf=1 tfidf=2.40 null, two tf=2 tfidf=4.80 ST null, crow tf=2 tfidf=4.80 ST null, glossari tf=2 tfidf=4.80 ST null, 中国上海浦东 tf=5 tfidf=8.52 ST null, ullman tf=1 tfidf=2.40 null, 中国上海浦西 tf=5 tfidf=8.52 ST null, norman tf=1 tfidf=2.40 null, CCSU tf=2 tfidf=4.80 ST null, practic tf=2 tfidf=4.80 ST null, machin tf=2 tfidf=4.80 ST null, learn tf=2 tfidf=4.80 ST null, tool tf=2 tfidf=4.80 ST null, and tf=2 tfidf=4.80 ST null, techniqu tf=2 tfidf=4.80 ST null, monografias.com tf=2 tfidf=4.80 ST null]
singleTerms最后一步,设置语言,格式化数据,如把glossari格式化为:glossary
[Introduction tf=2 tfidf=4.80 ST en, yourself tf=2 tfidf=4.80 SW ST en, 上海 tf=2 tfidf=3.41 en, KD tf=2 tfidf=4.80 ST en, Nuggets tf=2 tfidf=4.80 ST en, 中 国上海 tf=10 tfidf=10.12 ST en, the tf=2 tfidf=4.80 SW ST en, Data tf=12 tfidf=9.46 ST en, Mining tf=15 tfidf=9.09 ST en, DMG tf=2 tfidf=4.80 ST en, 上 海浦东 tf=1 tfidf=2.40 en, two tf=2 tfidf=4.80 SW ST en, Crows tf=2 tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, 中国上海浦东 tf=5 tfidf=8.52 ST en, Ullman tf=1 tfidf=2.40 en, 中国上海浦西 tf=5 tfidf=8.52 ST en, Norman tf=1 tfidf=2.40 en, CCSU tf=2 tfidf=4.80 ST en, Practical tf=2 tfidf=4.80 ST en, Machine tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, Tools tf=2 tfidf=4.80 ST en, and tf=2 tfidf=4.80 SW ST en, Techniques tf=2 tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en]
说到这,应该给出Feature的属性及说明了,如下:
/** A unique integer code of this feature */ private int code; /** String representation of this feature */ private String text; /** ISO code of the language to which Lingo believes this feature belongs * private String language; /** The number of occurrences of this feature in all input snippets */ private int tf; /** singleTerms[i].setIdf(Math.log( (double) 11 / 当前的word在所有的文档中出现的次 数); **/ private double idf; /** Length (in words) of this feature */ private int length; /** True if this feature is a stop word */ private boolean stopWord; /** True if this feature is among query words */ private boolean queryWord; /** True if this feature appeared in some snippet's title */ private boolean strong; /** * Used for phrase features only. An array pointing to features * representing the phrase's individual words. */ private int[] phraseFeatureIndices; /** * 记录在哪些文档中出现了,比如“上海”这个词,这个值为:[0, 2] */ private int[] snippetIndices; //即termDocument关于当前word的出现文档的数组,如: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],表示这个word在哪些文档中出现过。 private int[] snippetTf;
第二部分:聚类
1)排序,按非停词,tf倒排
[Mining tf=15 tfidf=9.09 ST en, Data tf=12 tfidf=9.46 ST en, 中国上海 tf=10 tfidf=10.12 ST en, 中国上海浦东 tf=5 tfidf=8.52 ST en, 中国上海 浦西 tf=5 tfidf=8.52 ST en, CCSU tf=2 tfidf=4.80 ST en, Crows tf=2 tfidf=4.80 ST en, DMG tf=2 tfidf=4.80 ST en, Glossary tf=2 tfidf=4.80 ST en, Introduction tf=2 tfidf=4.80 ST en, KD tf=2 tfidf=4.80 ST en, Learning tf=2 tfidf=4.80 ST en, Machine tf=2 tfidf=4.80 ST en, Monografias.com tf=2 tfidf=4.80 ST en, Nuggets tf=2 tfidf=4.80 ST en, Practical tf=2 tfidf=4.80 ST en, Techniques tf=2 tfidf=4.80 ST en, Tools tf=2 tfidf=4.80 ST en, 上海 tf=2 tfidf=3.41 en, Norman tf=1 tfidf=2.40 en, Ullman tf=1 tfidf=2.40 en, 上海浦东 tf=1 tfidf=2.40 en, and tf=2 tfidf=4.80 SW ST en, the tf=2 tfidf=4.80 SW ST en, two tf=2 tfidf=4.80 SW ST en, yourself tf=2 tfidf=4.80 SW ST en]
2)以上26个关键词中,有的是停词,去除停词后生成矩阵
生成非停词的矩阵,这里二维数组的长度为:double[][] tdMatrix = new double [19][11]
矩阵生成策略:tdMatrix[term][snippetIndices[doc]] =
features[term].getSnippetTf()[snippetIndices[doc]] * features[term].getIdf();
tdMatrix = new Matrix(matrix);//其中Matrix是Jama-1.0.2的包,一个专门生
成矩阵策略的jar工具包。 (TfidfTdMatrixBuildingStrategy. buildTdMatrix)
3)根据矩阵,有N多复杂的计算,有兴趣请参阅源码:LsiClusteringStrategy. createClusters();抛开矩阵内部处理不说,有了这些数据,我们也可以编写自已的算法,得出哪些keyword或keywords集合出现在哪些文档中!达到聚类的效果!