Recently I needed to run KMeans clustering with Mahout in a project. Most of the material I found runs Mahout programs via hadoop jar. Our project, however, is a backend API service, i.e. a war application deployed in Resin or Tomcat, so packaging it into a jar and running it with the rather crude hadoop jar approach was not an option.
Hadoop asks us to package our program as a jar because it has to ship our code to each of the distributed nodes. So is there a way to run the job in a distributed fashion without packaging a jar ourselves?
The key point to understand is that what actually needs to be shipped to the worker nodes is Mahout's own computation logic. So if we add the Mahout jars to Hadoop's MapReduce classpath, we achieve exactly that.
Add this configuration:
After adding it, just click Save Changes in the upper-right corner.
You can also make the change by hand in the mapred-site.xml configuration file, as follows:
<property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,/opt/hermes/mahout/lib/*,/opt/hermes/mahout/*</value>
</property>
If you modify the file this way, you need to restart Hadoop for the change to take effect.
Note: /opt/hermes/mahout above is the Mahout installation directory in my Linux environment.
Next, let's look at my KMeans utility class:
package com.cn21.function;

import java.io.IOException;
import java.net.URI;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.utils.clustering.ClusterDumper;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

import com.cn21.common.config.DefaultConfigure;
import com.cn21.function.hadoop.SequenceFilesFromDirectory;
import com.cn21.util.HdfsUtil;
import com.cn21.util.ReadFileInLineTool;
import com.google.common.collect.Lists;
import com.google.common.collect.Sets;
import com.google.gson.Gson;

/**
 * <p>
 * KMeans text clustering
 * <p>
 *
 * @author <a href="mailto:[email protected]">kexm</a>
 * @version
 * @since 2015-11-19
 */
public class KmeansClusterFunction {

    private static final Logger log = Logger.getLogger(KmeansClusterFunction.class);

    private static final String LOCAL_FILE_PREFIX = "file:///";

    /**
     * KMeans text clustering
     * @param inputDir
     * @param hadoopDir
     * @param outputFilePath
     * @return
     * @throws Exception
     */
    public static Set<List<String>> run(String inputDir, String hadoopDir, String outputFilePath) throws Exception {
        log.info("method[run] begin inputDir<" + inputDir + "> hadoopDir<" + hadoopDir + "> outputFilePath<"
                + outputFilePath + ">");
        String seqFileOutput = LOCAL_FILE_PREFIX + outputFilePath + "/SeqFile";
        String seqFileHadoopPath = hadoopDir + "/SeqFile";
        String vectorOutFile = LOCAL_FILE_PREFIX + outputFilePath + "/VectorFile";
        String vectorHadoopFile = hadoopDir + "/VectorFile";
        String hadoopVectorOutFile = hadoopDir + "/VectorFile";
        String dictionaryFileName = hadoopVectorOutFile + "/dictionary.file-0";
        String tdidfFileName = hadoopVectorOutFile + "/tfidf-vectors";
        String initCluster = hadoopDir + "/kmeans-init-clusters";
        String kmeansPath = hadoopDir + "/kmeans";
        String kmeansClusterPoint = kmeansPath + "/clusteredPoints";
        String kmeansDumpPath = outputFilePath + "/kmeans-result";

        /** 1. If the result files already exist, delete them */
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml.mahout");
        conf.addResource("mapred-site.xml.mahout");
        conf.addResource("core-site.xml.mahout");
        conf.addResource("yarn-site.xml.mahout");
        log.info("config init");
        HadoopUtil.delete(conf, new Path(seqFileOutput));
        HadoopUtil.delete(conf, new Path(seqFileHadoopPath));
        HadoopUtil.delete(conf, new Path(vectorOutFile));
        HadoopUtil.delete(conf, new Path(hadoopVectorOutFile));
        HadoopUtil.delete(conf, new Path(initCluster));
        HadoopUtil.delete(conf, new Path(kmeansPath));
        HadoopUtil.delete(conf, new Path(kmeansDumpPath));
        log.info("method[run] deleted.");
        HdfsUtil.createHadoopDir(hadoopDir);
        log.info("method[run] hadoopDir<" + hadoopDir + "> created.");

        /** Step 1: convert the directory of mail files into a SequenceFile */
        log.info("starting dir to seq job");
        log.info("conf<" + conf.get("yarn.resourcemanager.address") + ">");
        String[] dirToSeqArgs = { "--input", inputDir, "--output", seqFileOutput, "--method", "sequential",
                "-c", "UTF-8", "-chunk", "64" };
        ToolRunner.run(new SequenceFilesFromDirectory(), dirToSeqArgs);
        log.info("finished dir to seq job");
        log.info("starting copy vector files to hadoop vectorOutFile<" + vectorOutFile + "> hadoopDir<" + hadoopDir + ">");

        /** Step 2: upload the SequenceFile to Hadoop */
        FileSystem fs = FileSystem.get(new URI(DefaultConfigure.config.hadoopDir), conf);
        fs.copyFromLocalFile(false, true, new Path(seqFileOutput), new Path(hadoopDir));
        log.info("finished copy");

        /** Step 3: convert the SequenceFile into vectors */
        log.info("starting seq To Vector job seqFileHadoopPath<" + seqFileHadoopPath + "> vectorOutFile<"
                + vectorHadoopFile + ">");
        String[] seqToVectorArgs = { "--input", seqFileHadoopPath, "--output", vectorHadoopFile, "-ow",
                "--weight", "tfidf", "--maxDFPercent", "85", "--namedVector",
                "-a", "org.apache.lucene.analysis.core.WhitespaceAnalyzer" };
        ToolRunner.run(conf, new SparseVectorsFromSequenceFiles(), seqToVectorArgs);
        log.info("finished seq to vector job");

        /** Step 4: run KMeans */
        log.info("starting kmeans job tdidfFileName<" + tdidfFileName + "> initCluster<" + initCluster
                + "> kmeansPath<" + kmeansPath + ">");
        String[] kmeansArgs = { "-i", tdidfFileName, "-o", kmeansPath, "-k", "5", "-c", initCluster,
                "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure", "-x", "100", "-ow", "--clustering" };
        ToolRunner.run(conf, new KMeansDriver(), kmeansArgs);
        log.info("finished kmeans job");

        String finalResult = "";
        FileStatus[] files = fs.listStatus(new Path(kmeansPath));
        for (FileStatus file : files) {
            if (file.getPath().getName().contains("-final")) {
                finalResult = file.getPath().toString();
            }
        }

        /** Step 5: parse the clustering result */
        log.info("starting clusterDumper finalResult<" + finalResult + "> dictionaryFileName<" + dictionaryFileName
                + "> kmeansDumpPath<" + kmeansDumpPath + "> " + "kmeansClusterPoint<" + kmeansClusterPoint + ">");
        ClusterDumper dump = new ClusterDumper();
        String[] dumperArgs = { "-i", finalResult, "-d", dictionaryFileName, "-dt", "sequencefile",
                "-o", kmeansDumpPath, "--pointsDir", kmeansClusterPoint, "-n", "20" };
        dump.run(dumperArgs);
        log.info("finished clusterDumper job.");

        /** Step 6: read the topics out of the clustering result */
        Set<List<String>> result = readDump(kmeansDumpPath);
        log.info("result<" + new Gson().toJson(result) + ">");

        /** Clean up the intermediate files produced by the run */
        HadoopUtil.delete(conf, new Path(kmeansDumpPath));
        HadoopUtil.delete(conf, new Path(hadoopDir));
        log.info("method[run] file<" + kmeansDumpPath + "> <" + hadoopDir + "> deleted.");
        return result;
    }

    /**
     * Read the cluster-dump file and extract the keywords of each topic.
     * @param kmeansDumpPath
     * @return
     * @throws IOException
     */
    private static Set<List<String>> readDump(String kmeansDumpPath) throws IOException {
        ReadFileInLineTool rt = new ReadFileInLineTool();
        rt.setPath(kmeansDumpPath);
        Set<List<String>> result = Sets.newHashSet();
        List<String> topic = Lists.newArrayList();
        boolean inTopic = true;
        String line = "";
        while ((line = rt.readLine()) != null) {
            // log.info("line<" + line + ">");
            if (!line.contains("=>")) {
                if (inTopic) {
                    result.add(topic);
                    topic = Lists.newArrayList();
                    inTopic = false;
                }
                continue;
            }
            inTopic = true;
            String term = line.split("=>")[0].trim();
            topic.add(term);
        }
        return result;
    }
}
This class takes three parameters: inputDir is a directory containing multiple text files, each holding one article that has already been tokenized with spaces; hadoopDir is an HDFS directory used for the intermediate files of the computation; outputFilePath is the directory where the final ClusterDumper output is placed. It returns a Set<List<String>> containing N topics, each with M keywords.
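For illustration, here is a minimal sketch of how the class might be called from a service method. The three paths are made-up placeholders (a local corpus directory, an HDFS working directory, and a local output directory), not values from my project:

// Hypothetical paths, for illustration only.
Set<List<String>> topics = KmeansClusterFunction.run(
        "/data/mail-texts",                       // local directory: one whitespace-tokenized article per file
        "hdfs://namenode:8020/tmp/kmeans-work",   // HDFS working directory for intermediate files
        "/data/kmeans-output");                   // output directory for the ClusterDumper result
for (List<String> topic : topics) {
    System.out.println("topic keywords: " + topic);
}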
One key point here is how the Configuration is set up.
Here I chose to have conf load four files from the classpath (the WEB-INF/classes directory in the project), but these files do not keep their original names; each has a ".mahout" suffix appended.
The reason is that SequenceFilesFromDirectory has to run as a local job rather than a distributed one, because the directory of files it merges is a local directory, not a Hadoop one (if it were already on Hadoop there would be no need to convert it to a SequenceFile in the first place). Looking through the SequenceFilesFromDirectory source, when --method sequential is configured it creates a default Configuration as its own configuration, and a Configuration created this way automatically reads hdfs-site.xml and core-site.xml from the classpath. That would make SequenceFilesFromDirectory run as a distributed job even with --method sequential, causing errors. So I changed the configuration file suffixes to make sure SequenceFilesFromDirectory runs a local job, as the sketch below illustrates.
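To make that concrete, here is a small sketch of the two ways a Configuration picks up cluster settings; the ".mahout"-suffixed resource names are the renamed files I keep under WEB-INF/classes:

// Default constructor: Hadoop automatically loads core-site.xml (and, once the HDFS classes are
// initialized, hdfs-site.xml) from the classpath. If those files point at the cluster, even a
// "sequential" tool ends up carrying cluster settings.
Configuration implicitConf = new Configuration();

// Renamed resources: nothing is picked up implicitly, so the default Configuration that
// SequenceFilesFromDirectory builds for itself stays local, while the jobs that should run on the
// cluster load the settings explicitly.
Configuration clusterConf = new Configuration();
clusterConf.addResource("hdfs-site.xml.mahout");
clusterConf.addResource("mapred-site.xml.mahout");
clusterConf.addResource("core-site.xml.mahout");
clusterConf.addResource("yarn-site.xml.mahout");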
Of course, the best approach is to configure the Configuration with conf.set("attr", "value") calls, which avoids this problem altogether.
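A minimal sketch of that approach, with made-up host names and ports standing in for the real cluster addresses:

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode-host:8020");               // hypothetical NameNode address
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.address", "rm-host:8032");            // hypothetical ResourceManager address
conf.set("yarn.resourcemanager.scheduler.address", "rm-host:8030");
// Mirror the mapred-site.xml change above so the MapReduce tasks can see the Mahout jars.
conf.set("mapreduce.application.classpath",
        "$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,/opt/hermes/mahout/lib/*,/opt/hermes/mahout/*");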
At this point, the interface can invoke Mahout's distributed computation without any problem.
Here is a small excerpt of the clustering result for my mail; since I rarely use this mailbox and it contains mostly test mail, the result is not very regular.
Errors encountered while debugging:
1. This one is caused by an improperly configured Configuration.
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:160)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.mahout.math.VectorWritable not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2231)
    at org.apache.hadoop.mapred.JobConf.getOutputValueClass(JobConf.java:1096)
    at org.apache.hadoop.mapred.JobConf.getMapOutputValueClass(JobConf.java:847)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.finalMerge(MergeManagerImpl.java:693)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.close(MergeManagerImpl.java:371)
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:158)
    ... 6 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.mahout.math.VectorWritable not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2199)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2223)
    ... 11 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.mahout.math.VectorWritable not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
    ... 12 more
2. A ClassNotFoundException thrown as early as the map phase; this is because Hadoop's classpath does not include the Mahout jars.
Error: java.lang.ClassNotFoundException: org.apache.lucene.analysis.standard.StandardAnalyzer
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.setup(SequenceFileTokenizerMapper.java:62)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I hope this saves you some detours. The images in this post are fairly large; you can right-click and view the original image to see the details.