In this experiment we will set up a three-node Hadoop cluster.
Experiment environment:
Host operating system: Windows 10
Virtualization software: VMware Workstation
Virtual machine OS 1: Ubuntu 20.04 LTS
Virtual machine OS 2: Ubuntu 20.04 LTS
Virtual machine OS 3: Ubuntu 20.04 LTS
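The three virtual machines are referred to below by the hostnames hadoop1, hadoop2, and hadoop3. This assumes that /etc/hosts on every node maps these hostnames to the VMs' IP addresses; a minimal sketch (the addresses shown are placeholders for your actual VM addresses):
root@hadoop1:~$ vi /etc/hosts
192.168.56.101 hadoop1
192.168.56.102 hadoop2
192.168.56.103 hadoop3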
Perform the following steps on each node:
# Create a new user named hadoop
root@hadoop1:~$ adduser hadoop
# Grant the hadoop user sudo privileges
root@hadoop1:~$ chmod -v u+w /etc/sudoers
root@hadoop1:~$ vi /etc/sudoers
Add the following line below root ALL=(ALL) ALL:
hadoop ALL=(ALL) ALL
root@hadoop1:~$ chmod -v u-w /etc/sudoers
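The passwordless-login steps below assume that root on hadoop1 already has a key pair and an authorized_keys file under /root/.ssh, plus a config file listing the nodes. If that is not the case, a minimal sketch to create them (run as root on hadoop1; the config contents are only an example):
root@hadoop1:~$ ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
root@hadoop1:~$ cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@hadoop1:~$ vi /root/.ssh/config
Example contents:
Host hadoop1 hadoop2 hadoop3
    StrictHostKeyChecking no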
# Configure passwordless SSH authentication
root@hadoop1:~$ su - hadoop
# Copy root's .ssh directory to the hadoop user on every node
hadoop@hadoop1:~$ sudo cp -r /root/.ssh ./
hadoop@hadoop1:~$ sudo chown -R hadoop:hadoop ~/.ssh
hadoop@hadoop1:~$ scp -r ./.ssh hadoop2:~/
hadoop@hadoop1:~$ scp -r ./.ssh hadoop3:~/
# Fix the permissions
hadoop@hadoop1:~$ chmod 600 ~/.ssh/authorized_keys
hadoop@hadoop1:~$ chmod 600 ~/.ssh/config
If you can ssh between the hadoop users on all nodes without being asked for a password, the configuration has succeeded.
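For example, a quick check from hadoop1 (each command should print the remote hostname without prompting for a password):
hadoop@hadoop1:~$ ssh hadoop2 hostname
hadoop@hadoop1:~$ ssh hadoop3 hostname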
Perform the following steps as the hadoop user on each node:
# Install Java
hadoop@hadoop1:~$ wget http://bigdata.cg.lzu.edu.cn/bigdata_software/jdk-8u321-linux-x64.tar.gz
hadoop@hadoop1:~$ tar -zxvf jdk-8u321-linux-x64.tar.gz
hadoop@hadoop1:~$ vi ~/.bashrc
Add the following lines:
export JAVA_HOME=~/jdk1.8.0_321
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop@hadoop1:~$ source ~/.bashrc
hadoop@hadoop1:~$ java -version
# Install Hadoop
hadoop@hadoop1:~$ wget http://bigdata.cg.lzu.edu.cn/bigdata_software/hadoop-3.2.3.tar.gz
hadoop@hadoop1:~$ tar -zxvf hadoop-3.2.3.tar.gz
hadoop@hadoop1:~$ vi ~/.bashrc
Add the following lines:
export HADOOP_HOME=~/hadoop-3.2.3
export HADOOP_MAPRED_HOME=~/hadoop-3.2.3
export HADOOP_YARN_HOME=~/hadoop-3.2.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_SECONDARYNAMENODE_USER=hadoop
export YARN_RESOURCEMANAGER_USER=hadoop
export YARN_NODEMANAGER_USER=hadoop
hadoop@hadoop1:~$ source ~/.bashrc
hadoop@hadoop1:~$ hadoop version
Configure Hadoop on the hadoop1 node:
hadoop@hadoop1:~$ cd ~/hadoop-3.2.3/etc/hadoop/
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi hadoop-env.sh
Add:
export JAVA_HOME=/home/hadoop/jdk1.8.0_321
export HADOOP_HOME=/home/hadoop/hadoop-3.2.3
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi core-site.xml
Add the core-site.xml content shown further below.
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi hdfs-site.xml
Add the hdfs-site.xml content shown further below.
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi mapred-site.xml
Add the mapred-site.xml content shown further below.
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi yarn-site.xml
Add the yarn-site.xml content shown further below.
# List all cluster nodes by hostname or IP address
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ vi workers
Change the contents to:
hadoop1
hadoop2
hadoop3
# Copy the Hadoop directory to the other nodes
hadoop@hadoop1:~/hadoop-3.2.3/etc/hadoop/$ cd ~
hadoop@hadoop1:~$ scp -r ~/hadoop-3.2.3 hadoop2:~/
hadoop@hadoop1:~$ scp -r ~/hadoop-3.2.3 hadoop3:~/
# Format the HDFS file system (run this only once, and only on hadoop1)
hadoop@hadoop1:~$ hdfs namenode -format
core-site.xml mainly specifies the default file system address and port, and the base directory used by HDFS. Contents:
!! Note: hadoop1 here should be the node's IP address.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-3.2.3</value>
  </property>
</configuration>
hdfs-site.xml mainly specifies the NameNode and SecondaryNameNode nodes and their web ports, as well as the replication factor, block size, and permission checking:
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop1:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop1:9868</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>128m</value>
  </property>
</configuration>
mapred-site.xml mainly specifies the MapReduce framework (YARN) along with some environment variables and paths:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.admin.user.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.2.3</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop-3.2.3</value>
  </property>
</configuration>
yarn-site.xml mainly specifies the YARN ResourceManager node and a few other settings:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
Start the cluster services (these only need to be run on the hadoop1 node):
# Start HDFS
hadoop@hadoop1:~$ start-dfs.sh
# Start YARN: the ResourceManager and NodeManager daemons
hadoop@hadoop1:~$ start-yarn.sh
# Start the MapReduce JobHistory Server and the YARN timeline server
hadoop@hadoop1:~$ mapred --daemon start historyserver
hadoop@hadoop1:~$ yarn --daemon start timelineserver
# Check which daemons are running
hadoop@hadoop1:~$ jps
With this configuration, jps on hadoop1 should list roughly NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager, JobHistoryServer, and ApplicationHistoryServer (the timeline server); hadoop2 and hadoop3 should show only DataNode and NodeManager.
The web UIs are also available: NameNode at http://hadoop1:9870/
ResourceManager (YARN) at http://hadoop1:8088/
JobHistory Server at http://hadoop1:19888/
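To confirm that all three DataNodes have registered with the NameNode, one quick check is the HDFS report, e.g.:
hadoop@hadoop1:~$ hdfs dfsadmin -report | grep -i "live datanodes"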
Some common operations on Hadoop:
# Check the health of the file system
hadoop@hadoop1:~$ hdfs fsck /
# Create the user directory in HDFS
hadoop@hadoop1:~$ hdfs dfs -mkdir /user
hadoop@hadoop1:~$ hdfs dfs -mkdir /user/hadoop/
hadoop@hadoop1:~$ hdfs dfs -ls /
# Create an empty file
hadoop@hadoop1:~$ hdfs dfs -touchz /directory/filename
# Show the size of a file
hadoop@hadoop1:~$ hdfs dfs -du -s /directory/filename
# Show the contents of a file
hadoop@hadoop1:~$ hdfs dfs -cat /path/to/file_in_hdfs
# Upload a file from the local file system
hadoop@hadoop1:~$ hdfs dfs -copyFromLocal <localsrc> <hdfs destination>
hadoop@hadoop1:~$ hdfs dfs -put <localsrc> <destination>
# Download a file to the local file system
hadoop@hadoop1:~$ hdfs dfs -copyToLocal <hdfs source> <localdst>
hadoop@hadoop1:~$ hdfs dfs -get <src> <localdst>
# Count the directories, files, and bytes under a path
hadoop@hadoop1:~$ hdfs dfs -count <path>
# Delete a file
hadoop@hadoop1:~$ hdfs dfs -rm <path>
# Copy a file
hadoop@hadoop1:~$ hdfs dfs -cp <src> <dest>
# Move a file
hadoop@hadoop1:~$ hdfs dfs -mv <src> <dest>
# Empty the trash (files/directories removed with -rm)
hadoop@hadoop1:~$ hdfs dfs -expunge
# Delete an empty directory
hadoop@hadoop1:~$ hdfs dfs -rmdir <path>
# Show the usage of a command
hadoop@hadoop1:~$ hdfs dfs -usage <command>
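As a small end-to-end example of the commands above (the file name report.txt is arbitrary):
hadoop@hadoop1:~$ echo "hello hdfs" > report.txt
hadoop@hadoop1:~$ hdfs dfs -put report.txt /user/hadoop/
hadoop@hadoop1:~$ hdfs dfs -cat /user/hadoop/report.txt
hadoop@hadoop1:~$ hdfs dfs -rm /user/hadoop/report.txt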
Shut down the cluster services:
hadoop@hadoop1:~$ stop-yarn.sh
hadoop@hadoop1:~$ stop-dfs.sh
hadoop@hadoop1:~$ mapred --daemon stop historyserver
hadoop@hadoop1:~$ yarn --daemon stop timelineserver
Official guide link: guide
Next, we implement word counting (WordCount) with MapReduce.
First, start the virtual machines and the cluster services, then create the source file in the hadoop user's home directory:
hadoop@hadoop1:~$ vi WordCount.java
Paste in the following source code:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  // Mapper: split each input line into tokens and emit (word, 1) for every token
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job setup: input path is args[0], output path is args[1]
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Then run:
# Compile WordCount.java and package it into a jar (saved in the ~/ directory)
hadoop@hadoop1:~$ hadoop com.sun.tools.javac.Main WordCount.java
hadoop@hadoop1:~$ jar cf wc.jar WordCount*.class
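The hadoop com.sun.tools.javac.Main invocation relies on the HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar export added to ~/.bashrc earlier. As an alternative sketch, the JDK compiler can be called directly against the Hadoop classpath:
hadoop@hadoop1:~$ javac -classpath "$(hadoop classpath)" WordCount.java
hadoop@hadoop1:~$ jar cf wc.jar WordCount*.class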
# Create the wordcount and input directories; the output directory is created automatically while the MapReduce job runs
hadoop@hadoop1:~$ hadoop fs -mkdir /user
hadoop@hadoop1:~$ hadoop fs -mkdir /user/hadoop
hadoop@hadoop1:~$ hadoop fs -mkdir /user/hadoop/wordcount
hadoop@hadoop1:~$ hadoop fs -mkdir /user/hadoop/wordcount/input
# Copy all input files into the input directory (assuming the files are in the user's home directory)
hadoop@hadoop1:~$ hadoop fs -copyFromLocal ~/file01 /user/hadoop/wordcount/input/
hadoop@hadoop1:~$ hadoop fs -copyFromLocal ~/file02 /user/hadoop/wordcount/input/
# Check the input directory
hadoop@hadoop1:~$ hadoop fs -ls /user/hadoop/wordcount/input/
# Run the MapReduce job
hadoop@hadoop1:~$ hadoop jar wc.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output
# View the results
hadoop@hadoop1:~$ hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000
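For example, with the two sample inputs from the official tutorial (file01 containing "Hello World Bye World" and file02 containing "Hello Hadoop Goodbye Hadoop"), the -cat command above would print:
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2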
Official guide link: guide