Mahout在MapReduce上实现了Item-Based Collaborative Filtering,这里我尝试运行一下。
安装Hadoop
从下载Mahout并解压
准备数据
下载1 Million MovieLens Dataset,解压得到ratings.dat,用
sed ‘s/:[0-9]{1,}):[0-9]{1})::[0-9]{1,}$/,\1,\2/’ ratings.dat
处理成需要的格式。
运行
mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output -n 25
参数:
MAHOUT-JOB: /home/laxe/apple/mahout/mahout-examples-0.11.0-job.jar
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--numRecommendations (-n) numRecommendations Number of recommendations per user.
--usersFile usersFile File of users to recommend for.
--itemsFile itemsFile File of items to recommend for.
--filterFile (-f) filterFile File containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user(optional).
--userItemFile (-uif) userItemFile File containing comma-separated userID,itemID pairs(optional). Used to include only these items into recommendations. Cannot be used together with usersFile or itemsFile.
--booleanData (-b) booleanData Treat input as without prefvalues.
--maxPrefsPerUser (-mxp) maxPrefsPerUser Maximum number of preferences considered per user in final recommendation phase.
--minPrefsPerUser (-mp) minPrefsPerUser Ignore users with less preferences than this in the similarity computation (default: 1).
--maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem Maximum number of similarities considered per item.
--maxPrefsInItemSimilarity (-mpiis) maxPrefsInItemSimilarity Max number of preferences to consider per user or item in the item similarity computation phase, users or items with more preferences will be sampled down(default: 500).
--similarityClassname (-s) similarityClassname Name of distributed similarity measures class to instantiate,
alternatively use one of the predefined similarities([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE])
--threshold (-tr) threshold Discard item pairs with a similarity value below this.
--outputPathForSimilarityMatrix (-opfsm) outputPathForSimilarityMatrix Write the items imilarity matrix to this path(optional).
--randomSeed randomSeed Use this seed for sampling.
--sequencefileOutput Write the output into a Sequence File instead of a text file.
--help (-h) Print out help.
--tempDir tempDir Intermediate output directory.
--startPhase startPhase First phase to run.
--endPhase endPhase Last phase to run specify HDFS directories while running on hadoop; else specify local file system directories.
参考
Introduction to Item-Based Recommendations with Hadoop
mahout分布式:Item-based推荐