I'll write a separate post on how to run clustering with Mahout when I find the time; this post mainly explains the results Mahout produces.
The Mahout version is 0.9. The data was not normalized or standardized; this was only a test run.
The output directory contains several folders: clusteredPoints, clusters-x, clusters-(x+1)-final, and so on, where x is the iteration number. Each iteration's result is stored in clusters-x, the result of the last iteration (x+1) is stored in clusters-(x+1)-final, and clusteredPoints also holds the final clustering result. The last two store different things, though: one stores the clusters, the other the points, as the dumps below show.
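As a rough sketch of what that looks like on disk (the exact listing depends on the run; this one converged at iteration 2, which is why clusters-2-final shows up in the dumps below), the output directory lists something like:
[root@drguo home]# ls /home/guo/Desktop/output
clusteredPoints  clusters-0  clusters-1  clusters-2-final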
PS:
mahout clusterdump parses ClusterWritable and turns it into a readable file; -of selects the output format (TEXT, CSV, etc.), see the option list pasted at the end.
# Final clustering result (cluster name VL-x, center c, radius r, number of points in the cluster n)
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -o /home/guo/Desktop/result
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}
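As a quick sanity check, the cluster sizes (the n= fields) can be pulled straight back out of the text file clusterdump just wrote; a small sketch using the output path from the command above:
[root@drguo home]# grep -o 'n=[0-9]*' /home/guo/Desktop/result
n=7
n=3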
# Final clustering result (key: the cluster the point belongs to; value: weight wt, distance, and the vector, which is a NamedVector rather than a plain one; I'll also write separately about how to generate those)
[root@drguo clusteredPoints]# mahout seqdumper -i file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/points
Input Path: file:/home/guo/Desktop/output/clusteredPoints/part-m-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935 vec: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
Key: 1: Value: wt: 0.6106543697821432 distance: 11.445523142259598 vec: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
Key: 1: Value: wt: 0.6113140078611051 distance: 11.775681155103799 vec: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
Key: 0: Value: wt: 0.7140480784137244 distance: 6.885358615591935 vec: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
Key: 0: Value: wt: 0.7643111018595771 distance: 6.010195419417895 vec: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
Key: 0: Value: wt: 0.7408819961153278 distance: 7.529533687488249 vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
Key: 0: Value: wt: 0.7511412095733683 distance: 7.989789402348321 vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
Key: 0: Value: wt: 0.6648742191066574 distance: 9.264811638337692 vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
Key: 0: Value: wt: 0.53656917576395 distance: 17.373449130609547 vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
Key: 1: Value: wt: 0.5948320024451352 distance: 23.202011407059803 vec: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
Count: 10
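Since every record in this dump is keyed by the cluster the point landed in, counting points per cluster is a one-liner over the text file written above (a sketch against the /home/guo/Desktop/points path from the seqdumper command; the counts should match the n=7 and n=3 from the cluster dump):
[root@drguo clusteredPoints]# grep -o 'Key: [0-9]*' /home/guo/Desktop/points | sort | uniq -c
      7 Key: 0
      3 Key: 1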
# Dump the clusters together with their points
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -p file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/cluster-point
VL-0{n=7 c=[1.714, 2.286, 4.429, 0.857, 7.571] r=[2.185, 2.711, 6.884, 2.100, 5.233]}
Weight : [props - optional]: Point:
0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201601 = [5.000, 6.000, 6.000]
0.7140480784137244 : [distance=6.885358615591935]: 001461E4-86C64780-A0B495C4-D19BA86F__201604 = [5.000, 6.000, 6.000]
0.7643111018595771 : [distance=6.010195419417895]: 001461E4-86C64780-A0B495C4-D19BA86F__201605 = [2.000, 4.000, 4.000]
0.7408819961153278 : [distance=7.529533687488249]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201603 = [6.000, 6.000]
0.7511412095733683 : [distance=7.989789402348321]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201604 = [1.000, 1.000]
0.6648742191066574 : [distance=9.264811638337692]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201605 = [12.000, 12.000]
0.53656917576395 : [distance=17.373449130609547]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201606 = [18.000, 18.000]
VL-1{n=3 c=[0.667, 8.667, 11.333, 5.333, 0.667, 4.333, 1.667, 3.333, 21.667] r=[0.943, 5.437, 5.185, 7.542, 0.943, 6.128, 2.357, 4.714, 9.428]}
Weight : [props - optional]: Point:
0.6106543697821432 : [distance=11.445523142259598]: 001461E4-86C64780-A0B495C4-D19BA86F__201602 = [12.000, 15.000, 15.000]
0.6113140078611051 : [distance=11.775681155103799]: 001461E4-86C64780-A0B495C4-D19BA86F__201603 = [13.000, 15.000, 15.000]
0.5948320024451352 : [distance=23.202011407059803]: 001641C0-75CC4BC2-9E31CF60-C15627D2__201608 = [2.000, 1.000, 4.000, 16.000, 2.000, 13.000, 5.000, 10.000, 35.000]
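The same combined dump can also be written in a machine-friendly format via -of (see the clusterdump option list pasted below); a sketch reusing the paths from the command above, with an arbitrary .csv output path:
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -p file:///home/guo/Desktop/output/clusteredPoints -of CSV -o /home/guo/Desktop/cluster-point.csv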
Finally, here are the option listings for the two commands.
seqdumper
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--substring (-b) substring The number of chars to print out per value
--count (-c) Report the count only
--numItems (-n) numItems Output at most <n> key value pairs
--facets (-fa) Output the counts per key. Note, if there are
a lot of unique keys, this can take up a fair
amount of memory
--quiet (-q) Print only file contents.
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
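For example, to dump only the first few records, or to report just the record count without writing everything out (a sketch against the clusteredPoints path used earlier; the output paths are arbitrary):
[root@drguo home]# mahout seqdumper -i file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/points-head -n 3
[root@drguo home]# mahout seqdumper -i file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/points-count -c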
clusterdump
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--outputFormat (-of) outputFormat The optional output format for the
results. Options: TEXT, CSV, JSON
or GRAPH_ML
--substring (-b) substring The number of chars of the
asFormatString() to print
--numWords (-n) numWords The number of top terms to print
--pointsDir (-p) pointsDir The directory containing points
sequence files mapping input
vectors to their cluster. If
specified, then the program will
output the points associated with
a cluster
--samplePoints (-sp) samplePoints Specifies the maximum number of
points to include _per_ cluster.
The default is to include all
points
--dictionary (-d) dictionary The dictionary file
--dictionaryType (-dt) dictionaryType The dictionary file type
(text|sequencefile)
--evaluate (-e) Run ClusterEvaluator and
CDbwEvaluator over the input. The
output will be appended to the
rest of the output at the end.
--distanceMeasure (-dm) distanceMeasure The classname of the
DistanceMeasure. Default is
SquaredEuclidean
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
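For example, to append the ClusterEvaluator/CDbwEvaluator scores to the dump and pick the distance measure explicitly (a sketch; org.apache.mahout.common.distance.EuclideanDistanceMeasure is the standard Mahout class name, my assumption here since the runs above just used the default):
[root@drguo home]# mahout clusterdump -i file:///home/guo/Desktop/output/clusters-2-final -p file:///home/guo/Desktop/output/clusteredPoints -o /home/guo/Desktop/result-evaluated -e -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure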