Concretely, a HADOOP cluster consists of two clusters: an HDFS cluster and a YARN cluster. The two are logically separate, but are usually deployed on the same physical machines.
HDFS cluster: responsible for storing massive amounts of data; its main roles are NameNode / DataNode.
YARN cluster: responsible for scheduling resources when computing over massive data; its main roles are ResourceManager / NodeManager.
This setup example uses a 5-node cluster, with roles assigned as follows:
Node | Role 1 | Role 2 |
---|---|---|
hdp-node-01 | NameNode | SecondaryNameNode |
hdp-node-02 | ResourceManager | |
hdp-node-03 | DataNode | NodeManager |
hdp-node-04 | DataNode | NodeManager |
hdp-node-05 | DataNode | NodeManager |
This example builds the HADOOP cluster on virtual machine servers, using the following software and versions:
1. VMware 11.0
2. CentOS 6.5 64bit
Virtual machine hostnames (the remaining nodes follow the same naming pattern):
hdp-node-01
hdp-node-02
hdp-node-03
The minimal configuration is as follows:
vi hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/home/hadoop/apps/jdk1.7.0_51
vi core-site.xml
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://hdp-node-01:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/home/hadoop/apps/hadoop-2.6.1/tmp</value>
	</property>
</configuration>
vi hdfs-site.xml
<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/home/hadoop/data/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/home/hadoop/data/data</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>3</value>
	</property>
	<property>
		<name>dfs.secondary.http.address</name>
		<value>hdp-node-01:50090</value>
	</property>
</configuration>
vi mapred-site.xml
<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
</configuration>
vi yarn-site.xml
<configuration>
	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>hdp-node-02</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
</configuration>
vi slaves
hdp-node-03
hdp-node-04
hdp-node-05
Initialize (format) HDFS:
bin/hadoop namenode -format
Start HDFS:
sbin/start-dfs.sh
Start YARN:
sbin/start-yarn.sh
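If the startup scripts succeed, running the JDK tool jps on each node (jps is not part of the original walkthrough) should show Java daemons matching the role table above, roughly:
hdp-node-01: NameNode, SecondaryNameNode
hdp-node-02: ResourceManager
hdp-node-03 / 04 / 05: DataNode, NodeManager
Note that in Hadoop 2.x start-yarn.sh starts the ResourceManager on the machine it is run from, so run it on hdp-node-02.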
[hadoop@hdp-node-01 ~]$ hadoop fs -mkdir -p /wordcount/input
[hadoop@hdp-node-01 ~]$ hadoop fs -put /home/hadoop/somewords.txt /wordcount/input
cd $HADOOP_HOME/share/hadoop/mapreduce/
hadoop jar hadoop-mapreduce-examples-2.6.1.jar wordcount /wordcount/input /wordcount/output
hdfs dfsadmin -report
As the report shows, the cluster has 3 DataNodes available. You can also view HDFS cluster information in the web console by opening http://hdp-node-01:50070/ in a browser.
hadoop fs -ls /
hadoop fs -put ./scala-2.10.6.tgz /
hadoop fs -get /yarn-site.xml
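The same file operations can also be done from Java through the HDFS FileSystem API. The following is a minimal sketch, not part of the original walkthrough; the class name HdfsClientDemo and the file paths are only illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

//Hypothetical helper class for illustration only; run it as the hadoop user so HDFS permissions match
public class HdfsClientDemo {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		//Same NameNode address as configured in core-site.xml
		conf.set("fs.defaultFS", "hdfs://hdp-node-01:9000");
		FileSystem fs = FileSystem.get(conf);
		//Equivalent of: hadoop fs -mkdir -p /wordcount/input
		fs.mkdirs(new Path("/wordcount/input"));
		//Equivalent of: hadoop fs -put /home/hadoop/somewords.txt /wordcount/input
		fs.copyFromLocalFile(new Path("/home/hadoop/somewords.txt"), new Path("/wordcount/input"));
		//Equivalent of: hadoop fs -ls /
		for (FileStatus status : fs.listStatus(new Path("/"))) {
			System.out.println(status.getPath());
		}
		fs.close();
	}
}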
MapReduce is Hadoop's distributed computation programming framework. As long as you follow its programming conventions, you only need to write a small amount of business-logic code to build a powerful program that processes massive data in parallel.
Requirement: from a large set of text files (for example, terabytes in total), count the total number of occurrences of each word.
Map phase: each mapper reads its input split line by line, splits every line into words, and emits a <word, 1> pair for each word.
Reduce phase: the framework groups the pairs by word; for each word, the reducer sums up the 1s and emits <word, total count>.
Define a mapper class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

//First define the four generic types
//KEYIN: LongWritable   VALUEIN: Text
//KEYOUT: Text          VALUEOUT: IntWritable
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	//Lifecycle of the map method: the framework calls it once for every line of input
	//key  : the byte offset of this line within the file
	//value: the content of this line
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		//Convert the line to a String
		String line = value.toString();
		//Split the line into words
		String[] words = line.split(" ");
		//Iterate over the words and emit <word, 1>
		for (String word : words) {
			context.write(new Text(word), new IntWritable(1));
		}
	}
}
Define a reducer class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	//Lifecycle: the framework calls reduce once for every <key, value-group> passed in
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		//Define a counter
		int count = 0;
		//Iterate over all values in this group and accumulate them into count
		for (IntWritable value : values) {
			count += value.get();
		}
		context.write(key, new IntWritable(count));
	}
}
Define a main class that describes the job and submits it
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountRunner {
	//Describe the business-logic information (which class is the mapper, which is the reducer,
	//where the input data lives, where the results go, ...) as a Job object,
	//then submit that Job to the cluster to run
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job wcjob = Job.getInstance(conf);
		//Specify the jar that contains this job
		//wcjob.setJar("/home/hadoop/wordcount.jar");
		wcjob.setJarByClass(WordCountRunner.class);
		wcjob.setMapperClass(WordCountMapper.class);
		wcjob.setReducerClass(WordCountReducer.class);
		//Set the output key and value types of the Mapper
		wcjob.setMapOutputKeyClass(Text.class);
		wcjob.setMapOutputValueClass(IntWritable.class);
		//Set the output key and value types of the Reducer
		wcjob.setOutputKeyClass(Text.class);
		wcjob.setOutputValueClass(IntWritable.class);
		//Specify where the input data lives
		FileInputFormat.setInputPaths(wcjob, "hdfs://hdp-node-01:9000/wordcount/data/big.txt");
		//Specify where the results are stored
		FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://hdp-node-01:9000/wordcount/output/"));
		//Submit this job to the YARN cluster and wait for completion
		boolean res = wcjob.waitForCompletion(true);
		System.exit(res ? 0 : 1);
	}
}
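In general, the driver class together with the mapper and reducer is packaged into a jar (the commented-out setJar line above assumes /home/hadoop/wordcount.jar) and submitted to the cluster with the hadoop jar command, as the run step below shows.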
Create some sample input data:
vi /home/hadoop/test.txt
Hello tom
Hello jim
Hello ketty
Hello world
Ketty tom
Create the input data directory on HDFS:
hadoop fs -mkdir -p /wordcount/input
Upload test.txt to HDFS:
hadoop fs -put /home/hadoop/test.txt /wordcount/input
$ hadoop jar wordcount.jar cn.itcast.bigdata.mrsimple.WordCountDriver /wordcount/input /wordcount/out
$ hadoop fs -cat /wordcount/out/part-r-00000
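For reference, assuming test.txt above is the only file under /wordcount/input, the contents of part-r-00000 would look roughly like this (Text keys sort in byte order, so capitalized words come first):
Hello	4
Ketty	1
jim	1
ketty	1
tom	2
world	1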