Installing Hadoop 3.0.3 and JDK 1.8 in pseudo-distributed mode on CentOS 7
useradd hadoop
passwd hadoop
chmod u+w /etc/sudoers
vi /etc/sudoers
Add one of the following lines (the second allows sudo without a password), then restore the permissions with chmod u-w /etc/sudoers:
hadoop ALL=(ALL) ALL
hadoop ALL=(root) NOPASSWD:ALL
su - hadoop
sudo mkdir -p /home/hadoop/java
tar -zxvf hadoop-3.0.3.tar.gz
mv hadoop-3.0.3 /home/hadoop/hadoop3.03
tar -zxvf jdk-8u172-linux-x64.gz
mv jdk1.8.0_172 /home/hadoop/java/jdk1.8
vi /etc/profile
##java
export JAVA_HOME=/home/hadoop/java/jdk1.8
export PATH=$PATH:$JAVA_HOME/bin
##hadoop
export HADOOP_HOME=/home/hadoop/hadoop3.03
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
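Because PATH is searched left to right, prepending $HADOOP_HOME/bin and $HADOOP_HOME/sbin makes Hadoop's binaries take precedence over any same-named system commands. A quick check of the resulting ordering (a sketch; it only simulates the profile additions above):

```shell
# Simulate the profile additions and show the first PATH entries
export HADOOP_HOME=/home/hadoop/hadoop3.03
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
echo "$PATH" | tr ':' '\n' | head -n 2
# prints:
# /home/hadoop/hadoop3.03/bin
# /home/hadoop/hadoop3.03/sbin
```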
source /etc/profile
echo $JAVA_HOME
echo $HADOOP_HOME
In $HADOOP_HOME/etc/hadoop/hadoop-env.sh, set JAVA_HOME explicitly:
export JAVA_HOME=/home/hadoop/java/jdk1.8
hadoop-localhost is this machine's hostname, and the /opt/data/tmp directory must be created beforehand. Edit etc/hadoop/core-site.xml:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-localhost:8020</value>
        <description>HDFS URI: filesystem://namenode-host:port</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/data/tmp</value>
        <description>Local Hadoop temporary directory on the namenode</description>
    </property>
</configuration>
hadoop.tmp.dir is Hadoop's base temporary directory; the NameNode's HDFS metadata, for example, is stored under it by default. If you look through the *-default.xml files you will find many settings derived from ${hadoop.tmp.dir}.
Its default value is /tmp/hadoop-${user.name}, which means the NameNode would keep the HDFS metadata under /tmp. Since the operating system may clear /tmp on reboot, the NameNode metadata could be wiped out, which is a very serious problem, so we change this path.
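For example, the NameNode metadata location in hdfs-default.xml is derived from it (fragment quoted from the Hadoop 3 defaults for illustration):

```xml
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/name</value>
</property>
```

Relocating hadoop.tmp.dir therefore moves the metadata off /tmp in one step.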
sudo mkdir -p /opt/data/tmp
Change the owner of the temporary directory to hadoop:
sudo chown -R hadoop:hadoop /opt/data/tmp
Edit etc/hadoop/hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/data/tmp/dfs/name</value>
        <description>Where the namenode stores the HDFS namespace metadata</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/data/tmp/dfs/data</value>
        <description>Physical location of data blocks on the datanode</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
sudo chown -R hadoop:hadoop /opt/data
hdfs namenode -format
Inspect the directory created by the NameNode format:
$ ll /opt/data/tmp/dfs/name/current
Start the NameNode:
sbin/hadoop-daemon.sh start namenode
Start the DataNode:
sbin/hadoop-daemon.sh start datanode
Start the SecondaryNameNode:
sbin/hadoop-daemon.sh start secondarynamenode
(In Hadoop 3 these scripts are deprecated in favor of bin/hdfs --daemon start namenode, etc., but they still work.)
Use the jps command to check whether the daemons started:
[hadoop@hadoop-localhost hadoop3.03]$ jps
You should see the NameNode, DataNode and SecondaryNameNode processes listed (plus Jps itself); if so, startup succeeded.
Create a directory:
bin/hdfs dfs -mkdir /demo1
Upload a file:
bin/hdfs dfs -put etc/hadoop/core-site.xml /demo1
Read a file on HDFS:
bin/hdfs dfs -cat /demo1/core-site.xml
Download a file from HDFS to the local filesystem:
bin/hdfs dfs -get /demo1/core-site.xml
In Hadoop 2.x the HDFS web UI listens on port 50070:
http://192.168.145.129:50070
In Hadoop 3.x it listens on port 9870:
http://192.168.145.129:9870/dfshealth.html#tab-overview
Edit etc/hadoop/mapred-site.xml (the HADOOP_MAPRED_HOME entries let the MapReduce ApplicationMaster and the map/reduce tasks locate the MapReduce jars under the Hadoop 3 install):
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
    </property>
</configuration>
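An alternative sometimes used instead of the three env properties is to put the MapReduce jars on the job classpath directly (a sketch; the paths assume the install layout used in this article):

```xml
<property>
    <name>mapreduce.application.classpath</name>
    <value>/home/hadoop/hadoop3.03/share/hadoop/mapreduce/*:/home/hadoop/hadoop3.03/share/hadoop/mapreduce/lib/*</value>
</property>
```

Either way, without one of these settings Hadoop 3 jobs typically fail with a "Could not find or load main class ... MRAppMaster" error.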
yarn.nodemanager.aux-services configures YARN's auxiliary shuffle service; mapreduce_shuffle selects MapReduce's default shuffle implementation.
yarn.resourcemanager.hostname specifies which node the ResourceManager runs on.
Edit etc/hadoop/yarn-site.xml:
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-localhost</value>
    </property>
</configuration>
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
Alternatively, start the services with the batch scripts.
Start HDFS and YARN:
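The start-*.sh scripts ssh into each node (here, localhost), so passwordless ssh for the hadoop user is assumed. A common way to set it up, run as the hadoop user (a sketch; the guard skips key generation if a key pair already exists):

```shell
# Create ~/.ssh if needed, generate an unencrypted RSA key pair,
# and authorize it for logins to localhost
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa -q
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Afterwards "ssh localhost" should not prompt for a password.
```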
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/start-all.sh
The YARN web UI listens on port 8088; open http://192.168.145.129:8088/ to view it.
Create the input directory for the test:
bin/hdfs dfs -mkdir -p /wordcountdemo/input
Create a local file /opt/data/wc.input with the following content:
hadoop mapreduce hive
hbase spark storm
sqoop hadoop hive
spark hadoop
Upload wc.input to the /wordcountdemo/input directory on HDFS:
bin/hdfs dfs -put /opt/data/wc.input /wordcountdemo/input
Run the WordCount MapReduce job:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount /wordcountdemo/input /wordcountdemo/output
2018-07-03 19:38:23,956 INFO client.RMProxy: Connecting to ResourceManager at hadoop-localhost/192.168.145.129:8032
2018-07-03 19:38:24,565 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1530615244194_0002
2018-07-03 19:38:24,879 INFO input.FileInputFormat: Total input files to process : 1
2018-07-03 19:38:25,784 INFO mapreduce.JobSubmitter: number of splits:1
2018-07-03 19:38:25,841 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-07-03 19:38:26,314 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530615244194_0002
2018-07-03 19:38:26,315 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-07-03 19:38:26,466 INFO conf.Configuration: resource-types.xml not found
2018-07-03 19:38:26,466 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-07-03 19:38:26,547 INFO impl.YarnClientImpl: Submitted application application_1530615244194_0002
2018-07-03 19:38:26,590 INFO mapreduce.Job: The url to track the job: http://hadoop-localhost:8088/proxy/application_1530615244194_0002/
2018-07-03 19:38:26,590 INFO mapreduce.Job: Running job: job_1530615244194_0002
2018-07-03 19:38:35,985 INFO mapreduce.Job: Job job_1530615244194_0002 running in uber mode : false
2018-07-03 19:38:35,988 INFO mapreduce.Job: map 0% reduce 0%
2018-07-03 19:38:42,310 INFO mapreduce.Job: map 100% reduce 0%
2018-07-03 19:38:47,402 INFO mapreduce.Job: map 100% reduce 100%
2018-07-03 19:38:49,469 INFO mapreduce.Job: Job job_1530615244194_0002 completed successfully
2018-07-03 19:38:49,579 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=94
FILE: Number of bytes written=403931
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=195
HDFS: Number of bytes written=60
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4573
Total time spent by all reduces in occupied slots (ms)=2981
Total time spent by all map tasks (ms)=4573
Total time spent by all reduce tasks (ms)=2981
Total vcore-milliseconds taken by all map tasks=4573
Total vcore-milliseconds taken by all reduce tasks=2981
Total megabyte-milliseconds taken by all map tasks=4682752
Total megabyte-milliseconds taken by all reduce tasks=3052544
Map-Reduce Framework
Map input records=4
Map output records=11
Map output bytes=115
Map output materialized bytes=94
Input split bytes=122
Combine input records=11
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=94
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=171
CPU time spent (ms)=1630
Physical memory (bytes) snapshot=332750848
Virtual memory (bytes) snapshot=5473169408
Total committed heap usage (bytes)=165810176
Peak Map Physical memory (bytes)=214093824
Peak Map Virtual memory (bytes)=2733207552
Peak Reduce Physical memory (bytes)=118657024
Peak Reduce Virtual memory (bytes)=2739961856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=73
File Output Format Counters
Bytes Written=60
[hadoop@hadoop-localhost hadoop3.03]$
View the result:
[hadoop@hadoop-localhost hadoop3.03]$ bin/hdfs dfs -cat /wordcountdemo/output/part-r-00000
hadoop 3
hbase 1
hive 2
mapreduce 1
spark 2
sqoop 1
storm 1
[hadoop@hadoop-localhost hadoop3.03]$
The results come out sorted by key, because MapReduce sorts by key during the shuffle.
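As a sanity check, the same counts can be reproduced locally with coreutils (tr splits the lines into words, sort groups them, uniq -c counts each group):

```shell
# Recompute the word counts from the sample input without Hadoop;
# the output matches the part-r-00000 contents above
printf 'hadoop mapreduce hive\nhbase spark storm\nsqoop hadoop hive\nspark hadoop\n' \
  | tr ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
```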
Stop the daemons individually:
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemon.sh stop nodemanager
Or stop everything with the batch scripts:
sbin/stop-yarn.sh
sbin/stop-dfs.sh
sbin/stop-all.sh
HDFS handles big-data storage: by splitting large files into blocks and storing them across machines, it overcomes the disk-size limit of a single server and solves the problem of one machine being unable to hold a huge file. HDFS is a relatively independent module: it can serve YARN, and it can also serve other components such as HBase.
YARN is a general-purpose resource-coordination and job-scheduling framework, created to solve, among other issues, the overloading of the JobTracker in Hadoop 1.x's MapReduce.
Being a general framework, YARN can run not only MapReduce but also other computation frameworks such as Spark and Storm.
MapReduce is a computation framework: it offers a way of processing data by streaming it through a distributed Map phase and Reduce phase. It is only suited to offline batch processing of big data, not to applications with strict real-time requirements.
----the end----