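I. itemCF test
Mahout version: 0.10.0
Mahout provides many algorithms, and item-based collaborative filtering (itemCF) is one of the more commonly used. This post records how to use it.
1. Data preparation. I use some behavior data that I collected myself. There is not much of it, but it is enough to produce test results:
The three columns below are user_id, item_id, preference
Store the following data on HDFS. The path I used is /mahout/itemcf/data1/itemdata.data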
0162381440670851711,4,7.0
0162381440670851711,11,4.0
0162381440670851711,32,1.0
0162381440670851711,176,27.0
0162381440670851711,183,11.0
0162381440670851711,184,5.0
0162381440670851711,207,9.0
0162381440670851711,256,3.0
0162381440670851711,258,4.0
0162381440670851711,259,16.0
0162381440670851711,260,8.0
0162381440670851711,261,18.0
0162381440670851711,301,1.0
0162381440670851711,307,1.0
0162381440670851711,477,1.0
0162381440670851711,518,1.0
0162381440670851711,549,3.0
0162381440670851711,570,1.0
0162381440670851711,826,2.0
0357211441096952115,207,1.0
0617721441096186493,184,1.0
0617721441096186493,207,1.0
1205421441071459451,5,1.0
1214361441096861254,207,1.0
1401731441095483081,258,1.0
1401731441095483081,814,4.0
1401731441095483081,826,1.0
1917281441163686119,259,10.0
1917281441163686119,260,1.0
1917281441163686119,261,3.0
1966141441163860798,176,1.0
2294491441095342047,176,1.0
2441031440670827430,4,13.0
2441031440670827430,259,29.0
2441031440670827430,261,14.0
2441031440670827430,460,2.0
2441031440670827430,477,6.0
2441031440670827430,570,1.0
2441031440670827430,577,6.0
2441031440670827430,702,1.0
2441031440670827430,758,2.0
2441031440670827430,809,1.0
2475791441161318569,176,1.0
2987091441068878630,261,1.0
3114261440726814722,549,1.0
3445831441096810087,207,1.0
3846061441096937902,207,1.0
4266911441160164599,176,1.0
4698311441097046150,176,2.0
4698311441097046150,183,2.0
4698311441097046150,184,4.0
4698311441097046150,207,6.0
4946291441097563245,183,1.0
4956331440750398178,159,1.0
4956331440750398178,160,1.0
5307571441160362208,4,1.0
5307571441160362208,176,1.0
5719691441098504387,176,5.0
5719691441098504387,184,1.0
5719691441098504387,207,1.0
5813281441095425044,184,2.0
5813281441095425044,258,1.0
5894601441095265604,184,1.0
5981521441096106535,207,1.0
6292291441096870187,207,1.0
6533651441161410910,176,1.0
6810691441096902907,207,1.0
6836071440729632252,4,3.0
6836071440729632252,49,1.0
6836071440729632252,259,2.0
6836071440729632252,570,1.0
6836071440729632252,577,2.0
6964141441160527746,176,1.0
7495291441096796843,207,1.0
7616681441095305067,183,1.0
7616681441095305067,184,2.0
7616681441095305067,258,2.0
7616681441095305067,261,1.0
7732211441095211112,183,1.0
7732211441095211112,259,2.0
7732211441095211112,260,9.0
7732211441095211112,261,1.0
7732211441095211112,632,6.0
8211761441096060717,176,1.0
8211761441096060717,183,1.0
8305691441168039389,259,3.0
8305691441168039389,260,2.0
8305691441168039389,261,1.0
8375281440837772178,527,1.0
8432311440724457499,290,1.0
8641451441097297246,183,1.0
8641451441097297246,184,1.0
8641451441097297246,207,1.0
8641451441097297246,259,1.0
8641451441097297246,263,1.0
8641451441097297246,838,1.0
8641451441097297246,839,1.0
8641451441097297246,840,1.0
8651081441095283643,176,2.0
8651081441095283643,183,7.0
8753221441095342356,176,1.0
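Before uploading, it can help to sanity-check the file locally. The sketch below is not part of Mahout: parse_pref_line is a hypothetical helper that assumes the three-column, comma-separated layout described above and that IDs have to fit into a signed 64-bit integer (Mahout's recommender code works with Java long IDs).

```python
# Hypothetical local check for "user_id,item_id,preference" lines.
# Assumption: IDs must fit into a signed 64-bit integer (a Java long).
LONG_MIN = -2**63
LONG_MAX = 2**63 - 1


def parse_pref_line(line):
    """Return (user_id, item_id, preference) or raise ValueError."""
    fields = line.strip().split(",")
    if len(fields) != 3:
        raise ValueError("expected 3 comma-separated fields: %r" % line)
    # int() accepts a leading zero in a string, so "0162..." parses as 162...
    user_id, item_id = int(fields[0]), int(fields[1])
    preference = float(fields[2])
    for value in (user_id, item_id):
        if not LONG_MIN <= value <= LONG_MAX:
            raise ValueError("ID does not fit into 64 bits: %d" % value)
    return user_id, item_id, preference


# Two lines taken from the data above.
sample = ["0162381440670851711,4,7.0", "8753221441095342356,176,1.0"]
records = [parse_pref_line(s) for s in sample]
print(records)
```

Running every line of itemdata.data through such a check before the upload catches malformed rows early, which is cheaper than waiting for a Hadoop job to fail.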
2. Run collaborative filtering with the algorithm that ships with Mahout:
The command is:
bin/hadoop jar /home/lin/hadoop/mahout-distribution-0.10.0/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1
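Here -i is followed by the location of the input data, i.e. the directory that holds the test data above;
-o is followed by the output location. Do not create this directory yourself: Mahout creates it automatically and reports an error if it already exists;
--tempDir is where Mahout stores its own intermediate output. Mahout creates this path automatically as well and reports an error if it already exists;
-s selects the similarity measure to use. Choose one according to your needs;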
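For some intuition about what SIMILARITY_LOGLIKELIHOOD measures, here is a rough Python sketch of a log-likelihood ratio (LLR) computed from a 2x2 co-occurrence table for two items A and B. It uses the entropy-based formulation and maps the ratio to a similarity with 1 - 1/(1 + LLR). The function names are my own and the counts in the example are made up, so treat it as an illustration under those assumptions rather than as the code the distributed job runs.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users with A and B, k12: A only, k21: B only, k22: neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values that come from floating-point rounding.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


def llr_similarity(k11, k12, k21, k22):
    """Map the unbounded LLR to a similarity in [0, 1)."""
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


# Made-up counts: the two items often appear together ...
print(llr_similarity(5, 2, 2, 30))
# ... versus counts that look statistically independent.
print(llr_similarity(10, 10, 10, 10))
```

The first call yields a similarity well above zero, while the second is (up to rounding) zero, because the counts give no evidence that the two items are related.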
The detailed help is as follows:
Job-Specific Options:
  --input (-i) input
      Path to job input directory.
  --output (-o) output
      The directory pathname for output.
  --similarityClassname (-s) similarityClassname
      Name of distributed similarity measures class to instantiate, alternatively use one of the predefined similarities ([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE])
  --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem
      try to cap the number of similar items per item to this number (default: 100)
  --maxPrefs (-mppu) maxPrefs
      max number of preferences to consider per user or item, users or items with more preferences will be sampled down (default: 500)
  --minPrefsPerUser (-mp) minPrefsPerUser
      ignore users with less preferences than this (default: 1)
  --booleanData (-b) booleanData
      Treat input as without pref values
  --threshold (-tr) threshold
      discard item pairs with a similarity value below this
  --randomSeed randomSeed
      use this seed for sampling
  --help (-h)
      Print out help
  --tempDir tempDir
      Intermediate output directory
  --startPhase startPhase
      First phase to run
  --endPhase endPhase
      Last phase to run
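3. Run the command above and wait for it to finish. The directory /mahout/itemcf/result1 then contains data like the following, one line per user: the user ID followed by a bracketed list of recommended items in the form item_id:score: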
162381440670851711 [809:13.535571,702:13.535571,460:13.535571,758:13.535571,632:13.182321,577:12.929438,49:11.368558,307:10.562227,32:10.562227,518:10.562227]
617721441096186493 [839:1.0,259:1.0,518:1.0,826:1.0,11:1.0,260:1.0,4:1.0,32:1.0,176:1.0,840:1.0]
1401731441095483081 [11:1.0,570:1.0,518:1.0,307:1.0,260:1.0,259:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
1917281441163686119 [577:7.365086,702:6.5,809:6.5,758:6.5,460:6.5,184:5.9840446,176:5.981493,4:5.577299,570:5.3220325,477:4.9567957]
2441031440670827430 [632:21.5,176:18.084661,183:15.684914,260:14.2175,207:13.510652,11:12.28147,307:12.28147,32:12.28147,518:12.28147,256:12.28147]
4698311441097046150 [263:3.9337947,839:3.9337947,840:3.9337947,838:3.9337947,11:3.4747553,307:3.4747553,32:3.4747553,518:3.4747553,256:3.4747553,301:3.4747553]
5307571441160362208 [826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
5719691441098504387 [4:3.6454906,259:3.6147578,260:2.67091,261:2.6694102,183:2.517088,307:2.2876854,11:2.2876854,32:2.2876854,518:2.2876854,256:2.2876854]
5813281441095425044 [207:1.8607497,259:1.6642486,183:1.5539461,301:1.4806436,11:1.4806436,307:1.4806436,32:1.4806436,518:1.4806436,256:1.4806436,549:1.4099455]
6836071440729632252 [207:2.6088793,176:2.3617313,477:1.9966183,460:1.9945599,758:1.9945599,809:1.9945599,702:1.9945599,11:1.9926376,307:1.9926376,32:1.9926376]
7616681441095305067 [826:1.5790755,207:1.5721571,549:1.535743,301:1.50748,307:1.50748,11:1.50748,32:1.50748,518:1.50748,256:1.50748,839:1.5]
7732211441095211112 [826:3.7059078,549:3.7059078,307:3.3461132,256:3.3461132,518:3.3461132,11:3.3461132,301:3.3461132,32:3.3461132,570:3.1800203,477:3.1795032]
8211761441096060717 [826:1.0,259:1.0,518:1.0,307:1.0,11:1.0,260:1.0,549:1.0,32:1.0,207:1.0,184:1.0]
8305691441168039389 [577:2.2471673,4:2.083036,570:2.0549815,809:2.0,460:2.0,11:2.0,826:2.0,32:2.0,307:2.0,549:2.0]
8641451441097297246 [11:1.0,632:1.0,518:1.0,826:1.0,260:1.0,570:1.0,549:1.0,32:1.0,307:1.0,477:1.0]
8651081441095283643 [184:6.597979,258:6.1955295,260:6.1955295,826:5.5266876,549:5.5266876,477:5.5266876,259:4.662548,261:4.662548,11:4.626224,307:4.626224]
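To post-process these results, each line can be split into a user ID and a list of (item_id, score) pairs. The sketch below assumes exactly the layout shown above (user ID, whitespace, then a bracketed, comma-separated list of item_id:score). parse_recommendation_line is a hypothetical helper name, and the sample line is a shortened copy of one of the result lines. Note that IDs come back as plain numbers, which is why the user 0162381440670851711 from the input shows up as 162381440670851711 in the output.

```python
def parse_recommendation_line(line):
    """Split 'user [item:score,item:score,...]' into (user_id, [(item, score), ...])."""
    # Split on the first run of whitespace (a space or a tab both work).
    user_part, items_part = line.strip().split(None, 1)
    recommendations = []
    for pair in items_part.strip("[]").split(","):
        if not pair:
            continue  # tolerate an empty list "[]"
        item_id, score = pair.split(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_part), recommendations


# Shortened copy of one of the result lines above.
line = "1917281441163686119 [577:7.365086,702:6.5,809:6.5]"
user_id, recs = parse_recommendation_line(line)
print(user_id, recs)
```

From here it is straightforward to load the pairs into a database or to join them back to item metadata.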
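Mahout has another frequently used algorithm, item similarity, whose result is the similarity between items:
mahout itemsimilarity -i /mahout/itemcf/data1 -o /mahout/itemcf/result1 -s SIMILARITY_LOGLIKELIHOOD --tempDir /mahout/itemcf/temp1
Note that this command reuses the output and temp paths of the previous run. As described above, both must not exist when the job starts, so delete them first or point -o and --tempDir at new paths.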
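To see what kind of evidence an item-to-item similarity is built from, the following sketch counts, for two items, how many users interacted with both, with only one of them, or with neither. This 2x2 contingency table is the usual input to count-based measures such as a log-likelihood ratio. The helper name contingency_table is my own, preference values are deliberately ignored (an assumption that matches count-based measures, not necessarily every similarity option), and the pairs are a small subset of the data above.

```python
from collections import defaultdict


def contingency_table(pairs, item_a, item_b):
    """Return (k11, k12, k21, k22) over all users seen in pairs.

    k11: users with both items, k12: only item_a,
    k21: only item_b, k22: neither of the two.
    """
    items_by_user = defaultdict(set)
    for user_id, item_id in pairs:
        items_by_user[user_id].add(item_id)

    k11 = k12 = k21 = k22 = 0
    for items in items_by_user.values():
        has_a, has_b = item_a in items, item_b in items
        if has_a and has_b:
            k11 += 1
        elif has_a:
            k12 += 1
        elif has_b:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22


# (user_id, item_id) pairs: a small subset of the data above.
pairs = [
    (1917281441163686119, 259), (1917281441163686119, 260),
    (1917281441163686119, 261), (2987091441068878630, 261),
    (6836071440729632252, 259), (8305691441168039389, 259),
    (8305691441168039389, 261), (8211761441096060717, 176),
]
print(contingency_table(pairs, 259, 261))
```

On the full data set the same counting is what a distributed job has to do for every item pair, which is why options such as --maxPrefs and --maxSimilaritiesPerItem exist to keep the work bounded.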