Environment: VMware 7 virtual machine running Ubuntu 12.04
1. First download the files you need:
[Note]: version compatibility matters a lot.
JDK, Eclipse, Maven
Hadoop: http://mirror.bjtu.edu.cn/apache/hadoop/common/hadoop-1.0.3/ (I had actually downloaded 0.21.0 first)
Mahout: http://labs.renren.com/apache-mirror/mahout/0.7/
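For reference, a sketch of fetching and unpacking Hadoop and Mahout into the share directory used later in this post (the archive file names are assumptions based on the versions above, and the mirror paths may have moved since):

cd /home/ydp/share
wget http://mirror.bjtu.edu.cn/apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
wget http://labs.renren.com/apache-mirror/mahout/0.7/mahout-distribution-0.7.tar.gz
tar xzf hadoop-1.0.3.tar.gz                    # unpacks to hadoop-1.0.3
tar xzf mahout-distribution-0.7.tar.gz         # unpacks to mahout-distribution-0.7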
2. Install the JDK. The download is an RPM package, so install alien, use alien to convert the RPM into a DEB, and then install it with dpkg (sketched below).
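A sketch of that conversion, assuming the download is the 64-bit JDK 7u7 RPM (the file name is an assumption; use whatever you actually downloaded):

sudo apt-get install alien
sudo alien -d jdk-7u7-linux-x64.rpm            # -d (--to-deb) converts the RPM to a .deb
sudo dpkg -i jdk*.deb                          # install the .deb that alien produced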
3. Unpack Eclipse (I used the Helios release).
4. Unpack Maven and configure the environment variables.
My final /etc/profile (all my files live under the share directory, which is shared with Windows):
export JAVA_HOME=/usr/java/jdk1.7.0_07
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/home/ydp/share/hadoop-1.0.3
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/home/ydp/share/mahout-distribution-0.7
export PATH=$MAHOUT_HOME/bin:$PATH
export MAVEN_HOME=/home/ydp/share/apache-maven-3.0.4
export PATH=$MAVEN_HOME/bin:$PATH
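After saving /etc/profile, reload it in the current shell and sanity-check the tools; a quick check given the versions above:

source /etc/profile
java -version          # expect 1.7.0_07
mvn -v                 # expect Apache Maven 3.0.4
hadoop version         # expect Hadoop 1.0.3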
5. Install Hadoop: unpack the 1.0.3 release and edit these configuration files:
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ydp/tmp</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_07
export HADOOP_HOME_WARN_SUPPRESS=TRUE
6. Install Mahout: just unpack it (I skipped mvn install and downloaded a ready-built binary distribution instead).
7. Configure Hadoop in Eclipse (a shell sketch of these steps follows the list):
copy the plugin jar out of the contrib/eclipse-plugin directory of the Hadoop 0.21.0 distribution,
unpack it (right-click and extract is enough),
copy hadoop-common-0.21.0.jar, hadoop-hdfs-0.21.0.jar, log4j-1.2.15.jar and hadoop-mapred-0.21.0.jar from that Hadoop distribution into the plugin's lib directory,
repackage from inside the unpacked plugin directory: jar cvf hadoop-0.21.0-eclipse-plugin.jar ./* ;
open MANIFEST.MF under META-INF in the jar, overwrite its contents with the unpacked copy, and change the Bundle-ClassPath entry (it should be near the end of the file) to: Bundle-ClassPath: classes/,lib/hadoop-common-0.21.0.jar,lib/hadoop-hdfs-0.21.0.jar,lib/log4j-1.2.15.jar,lib/hadoop-mapred-0.21.0.jar
Save the modified file, copy the new plugin jar into the eclipse/plugins directory, restart Eclipse, open the Map/Reduce view, and set its parameters to match the Hadoop 1.x configuration files above.
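A rough shell sketch of the repackaging above, assuming the 0.21.0 distribution sits under ~/share/hadoop-0.21.0 and Eclipse under ~/share/eclipse (the paths and the plugin jar name are assumptions; adjust them to your layout):

cd /tmp
cp ~/share/hadoop-0.21.0/contrib/eclipse-plugin/hadoop-0.21.0-eclipse-plugin.jar .
unzip hadoop-0.21.0-eclipse-plugin.jar -d plugin                # unpack the plugin
cp ~/share/hadoop-0.21.0/hadoop-common-0.21.0.jar \
   ~/share/hadoop-0.21.0/hadoop-hdfs-0.21.0.jar \
   ~/share/hadoop-0.21.0/hadoop-mapred-0.21.0.jar \
   ~/share/hadoop-0.21.0/lib/log4j-1.2.15.jar plugin/lib/       # add the extra jars
# edit plugin/META-INF/MANIFEST.MF and set Bundle-ClassPath as described above, then repackage
cd plugin && jar cvf ../hadoop-0.21.0-eclipse-plugin.jar ./*
cp /tmp/hadoop-0.21.0-eclipse-plugin.jar ~/share/eclipse/plugins/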
8. Run WordCount in Eclipse:
package org.frame.base.hbase.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  /**
   * TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>.
   *
   * [One map per input file; two files produce two maps.]
   * map() reads the input, splits it on " \t\n\r\f", and emits a word ==> one
   * key/value pair for each token.
   *
   * @param Object      input key type
   * @param Text        input value type
   * @param Text        output key type
   * @param IntWritable output value type
   *
   * Writable tells the Hadoop framework how to serialize and deserialize an object;
   * WritableComparable adds a compareTo method on top of Writable so the framework
   * knows how to sort objects of that type.
   *
   * @author yangchunlong.tw
   */
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  /**
   * IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>.
   *
   * [No matter how many maps there are, there is a single reduce here: the aggregation step.]
   * reduce() loops over all mapped values and sums up the word ==> one key/value pairs.
   *
   * The key is the word set by the mapper (reduce is called once per key),
   * and when the loop finishes the context holds the final result.
   *
   * @author yangchunlong.tw
   */
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // an input path and an output path are required
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);            // main class
    job.setMapperClass(TokenizerMapper.class);     // mapper
    job.setCombinerClass(IntSumReducer.class);     // combiner
    job.setReducerClass(IntSumReducer.class);      // reducer
    job.setOutputKeyClass(Text.class);             // job output key class
    job.setOutputValueClass(IntWritable.class);    // job output value class
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // input path
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);             // wait for completion, then exit
  }
}
Create a new MapReduce directory, set the run arguments to hdfs://localhost:9000/user/name/test hdfs://localhost:9000/user/name/result, and run it (the input path must exist in HDFS first; see the sketch below).
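A minimal way to seed that input path, once HDFS has been formatted and started (step 9 below shows how); "name" here is just the example user directory from the run arguments above, so replace it with your own:

echo "a a a b b" > /tmp/sample.txt
hadoop fs -mkdir /user/name/test               # Hadoop 1.x creates missing parent directories
hadoop fs -put /tmp/sample.txt /user/name/test/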
9. Run WordCount from the command line:
Install SSH and set up passwordless login:
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop namenode -format
start-all.sh
stop-all.sh
jps
echo "a a a b b" >> test
hadoop fs -put test test
hadoop fs -ls
hadoop jar example.jar wordcount test result
hadoop fs -cat result/*
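example.jar above is the jar exported from the Eclipse project. A sketch of building and running it by hand instead, assuming the WordCount class from step 8 is compiled into bin/ with hadoop-core-1.0.3.jar on the compile classpath (with no Main-Class manifest entry, the full class name has to be passed to hadoop jar):

jar cvf example.jar -C bin/ .                  # package the compiled classes
hadoop jar example.jar org.frame.base.hbase.hadoop.WordCount test result
hadoop fs -cat result/*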
10. Run Mahout: mahout --help, mahout kmeans --help
1. Download http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and put it in the $MAHOUT_HOME directory.
2. Start Hadoop: $HADOOP_HOME/bin/start-all.sh
3. Create a test directory testdata in HDFS and import the data into that testdata directory (the directory name can only be testdata):
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put $MAHOUT_HOME/synthetic_control.data testdata
4. Run the k-means algorithm (it takes roughly a minute). The examples jar name must match your Mahout version; the 0.7 distribution ships mahout-examples-0.7-job.jar:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
5. Check the results:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
$ cd $MAHOUT_HOME/examples/output
$ ls
If you see output like the following, the algorithm ran successfully and your installation works:
clusteredPoints clusters-0 clusters-1 clusters-10 clusters-2 clusters-3 clusters-4
clusters-5 clusters-6 clusters-7 clusters-8 clusters-9 data
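Optionally, the clusters can be dumped to a readable text file with Mahout's clusterdump utility. The sketch below assumes the Mahout 0.7 option names (-i, -p, -o); check mahout clusterdump --help if your version differs, and point -i at the last clusters-* directory your run produced:

mahout clusterdump \
  -i output/clusters-10 \
  -p output/clusteredPoints \
  -o $MAHOUT_HOME/examples/clusterdump.txt     # writes the dump to a local text file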
If you run into problems, you can join QQ group 292303980.