Configuring Hadoop 2.9 on Ubuntu 18.04 (later switched to 16.04)

Pseudo-distributed mode

Running in pseudo-distributed mode

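1. Exit the ssh localhost session. To run HDFS in pseudo-distributed mode, format the filesystem with $ bin/hdfs namenode -format, which prints the following (many warnings omitted; they may be a result of the earlier use):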

18/12/29 20:05:16 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-ubuntu/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 325 bytes saved in 0 seconds .
18/12/29 20:05:16 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/12/29 20:05:16 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

2. Start the NameNode and DataNode daemons with $ sbin/start-dfs.sh. In my case the daemons were already running, hence the "Stop it first" messages:

localhost: namenode running as process 38637. Stop it first.
localhost: datanode running as process 38804. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 39086. Stop it first.
WARNING: An illegal reflective access operation has occurred
......
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

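Daemon logs are written to $HADOOP_LOG_DIR, which defaults to $HADOOP_HOME/logs. For example, hadoop-ubuntu-namenode-ubuntu.log holds the NameNode log and the matching .out file holds its standard output.
3. The NameNode web interface is available by default at http://localhost:50070/; opening it shows that the node is running normally.
4. Next, create the HDFS directories required to run MapReduce jobs: /user and the user's own directory beneath it. I could not find where these live on the local machine. According to the blog I followed they should be under the default storage path, {hadoop.tmp.dir} = /tmp/hadoop-${user.name}, but I did not find them there. That is expected: HDFS paths exist in the NameNode's namespace, not as ordinary local directories, and bin/hadoop fs -ls / does show that they exist.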

  $ bin/hdfs dfs -mkdir /user
  $ bin/hdfs dfs -mkdir /user/ubuntu
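
The directories created in step 4 exist only in the HDFS namespace kept by the NameNode, so they never appear as ordinary directories on the local disk. What does exist locally is the NameNode and DataNode storage under hadoop.tmp.dir. A minimal sketch for locating it, assuming hadoop.tmp.dir has not been overridden in core-site.xml (the helper name default_tmp_dir is made up for this example):

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the default hadoop.tmp.dir for a given user.
# Assumes hadoop.tmp.dir was not overridden in core-site.xml.
default_tmp_dir() {
  printf '/tmp/hadoop-%s\n' "$1"
}

tmp_dir=$(default_tmp_dir "$(id -un)")

# dfs/name holds NameNode metadata and dfs/data holds DataNode blocks.
for sub in dfs/name dfs/data; do
  if [ -d "$tmp_dir/$sub" ]; then
    echo "found:   $tmp_dir/$sub"
  else
    echo "missing: $tmp_dir/$sub"
  fi
done
```

On the machine used here this looks under /tmp/hadoop-ubuntu, the same directory that appears in the format log. Files under /tmp may be cleared on reboot, which is one possible reason for not finding them.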

5. Copy the input files into HDFS; they are stored under /user/ubuntu (my username is ubuntu).

  $ bin/hdfs dfs -put etc/hadoop input
  $ bin/hadoop fs -ls /user/ubuntu

6. Run the example job, copy the results from HDFS to the local filesystem, and view all the output files.

  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar grep input output 'dfs[a-z.]+'
  $ bin/hdfs dfs -get output output
  $ cat output/output/*

Alternatively, view the files directly on HDFS with $ bin/hdfs dfs -cat output/*:

ubuntu@ubuntu:~/hadoop-2.9.2$ bin/hdfs dfs -cat output/*
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/home/ubuntu/hadoop-2.9.2/share/hadoop/common/lib/hadoop-auth-2.9.2.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
cat: `output/output': No such file or directory
6   dfs.audit.logger
4   dfs.class
3   dfs.logger
3   dfs.server.namenode.
2   dfs.audit.log.maxbackupindex
2   dfs.period
2   dfs.audit.log.maxfilesize
1   dfs.log
1   dfs.file
1   dfs.servers
1   dfsadmin
1   dfsmetrics.log
1   dfs.replication
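
The grep example counts every match of the regular expression in the input files and lists the matches by frequency. A rough local equivalent with standard tools can make the output above easier to read. This is only a sketch: the function name count_matches and the sample files are made up, and it is not how the MapReduce job is implemented.

```shell
#!/usr/bin/env bash
# Count every match of an extended regex across the files in a directory,
# most frequent first (same shape as the job output: count, then match).
count_matches() {
  local dir=$1 pattern=$2
  grep -ohE "$pattern" "$dir"/* | sort | uniq -c | sort -rn
}

# Made-up sample input so the sketch runs without a Hadoop install.
sample_dir=$(mktemp -d)
printf 'dfs.replication=1\ndfs.replication=3\n' > "$sample_dir/a.properties"
printf 'run dfsadmin -report\n' > "$sample_dir/b.txt"

count_matches "$sample_dir" 'dfs[a-z.]+'
```

Pointing count_matches at a local copy of etc/hadoop with the same pattern should give counts comparable to the job output, assuming the files are identical.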

7. Stop the daemons with $ sbin/stop-dfs.sh.

Fully distributed mode

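Hadoop's Java configuration is driven by two types of configuration files:

Read-only default files: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml;
Site-specific files: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.
In addition, the scripts under bin/ can be customized by setting site-specific values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh;
The HDFS daemons are NameNode, SecondaryNameNode and DataNode; the YARN daemons are ResourceManager, NodeManager and WebAppProxy.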

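Configuring the environment

JAVA_HOME must be set correctly on every remote node. Individual daemons can be configured through the following environment variables: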

Daemon	Environment Variable
NameNode	HADOOP_NAMENODE_OPTS
DataNode	HADOOP_DATANODE_OPTS
SecondaryNameNode	HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager	YARN_RESOURCEMANAGER_OPTS
NodeManager	YARN_NODEMANAGER_OPTS
WebAppProxy	YARN_PROXYSERVER_OPTS
Map Reduce Job History Server	HADOOP_JOB_HISTORYSERVER_OPTS

For example, to make the NameNode use ParallelGC, add this line to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"
Other environment variables that may be useful:

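HADOOP_PID_DIR: the directory where the daemons' process ID files are stored
HADOOP_LOG_DIR: the directory where the daemons' log files are stored
HADOOP_HEAPSIZE / YARN_HEAPSIZE: the maximum heap size in MB; the default is 1000.
In most cases you should set HADOOP_PID_DIR and HADOOP_LOG_DIR to directories that only the users running the Hadoop daemons can write to; otherwise there is potential for a symlink attack.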
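
As an illustration, these variables could be set in etc/hadoop/hadoop-env.sh as follows. The two directory paths are hypothetical examples, not Hadoop defaults, and this is only a sketch of the idea:

```shell
# Fragment for etc/hadoop/hadoop-env.sh.
# The two directories are hypothetical; choose ones that only the
# account running the Hadoop daemons can write to.
export HADOOP_PID_DIR=/var/run/hadoop
export HADOOP_LOG_DIR=/var/log/hadoop
# Maximum heap in MB (1000 is the default mentioned above).
export HADOOP_HEAPSIZE=1000
```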

Monitoring the health of NodeManagers

The health-check settings go in etc/hadoop/yarn-site.xml and include:

Parameter	Value	Notes
yarn.nodemanager.health-checker.script.path	Node health script	Script that checks whether the node is healthy
yarn.nodemanager.health-checker.script.opts	Node health script options	Options passed to the script
yarn.nodemanager.health-checker.interval-ms	Node health script interval	Time interval between runs of the script
yarn.nodemanager.health-checker.script.timeout-ms	Node health script timeout interval	Timeout for one run of the script
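
The NodeManager runs the configured script periodically and treats the node as unhealthy when the script prints a line beginning with ERROR to standard output. A minimal sketch of such a script, assuming disk usage on / is the condition of interest; the 90% threshold and the function name report_health are illustrative choices, not Hadoop defaults:

```shell
#!/usr/bin/env bash
# Print an ERROR line when usage exceeds the limit, otherwise an OK line.
report_health() {
  local used=$1 limit=$2
  if [ "$used" -gt "$limit" ]; then
    echo "ERROR disk usage ${used}% exceeds ${limit}%"
  else
    echo "OK disk usage ${used}%"
  fi
}

# Percentage of the root filesystem in use, without the trailing % sign.
used=$(df -P / | awk 'NR==2 {sub(/%/, "", $5); print $5}')
report_health "$used" 90
```

Saved as an executable file, its path would be the value of yarn.nodemanager.health-checker.script.path.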

Slaves file

List the hostname or IP address of every slave node in etc/hadoop/slaves, one per line. The helper scripts described below use this file to run commands on many hosts at once, which requires trusted access (passwordless SSH or some other means) for the accounts that run Hadoop.
Rack Awareness did not seem very useful for my setup: Many Hadoop components are rack-aware and take advantage of the network topology for performance and safety. Hadoop daemons obtain the rack information of the slaves in the cluster by invoking an administrator configured module. See the [Rack Awareness](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/RackAwareness.html) documentation for more specific information.
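
To make the slaves file format concrete, here is a small dry-run sketch that reads such a file and prints the SSH command it would run for each host. The hostnames slave1 and slave2 and the function name list_slaves are placeholders, and this is not necessarily how Hadoop's own helper scripts are written:

```shell
#!/usr/bin/env bash
# Print the hosts in a slaves file, dropping comments and blank lines.
list_slaves() {
  sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "$1"
}

# Hypothetical slaves file; slave1 and slave2 are placeholder hostnames.
slaves_file=$(mktemp)
cat > "$slaves_file" <<'EOF'
# one worker per line
slave1
slave2
EOF

# Dry run: show the SSH command that would be used for each host.
for host in $(list_slaves "$slaves_file"); do
  echo "ssh $host hostname"
done
```

If each printed command runs without a password prompt when executed by hand, the SSH trust is in place.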

Operating Hadoop

To start HDFS, first format a new distributed filesystem (namecluster is the cluster name):

ubuntu@ubuntu:~/hadoop-2.9.2$ bin/hdfs namenode -format namecluster

19/01/01 11:13:38 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
19/01/01 11:13:38 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

Start the NameNode:

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/hadoop-daemon.sh --config conf_user/ --script hdfs start namenode
starting namenode, logging to /home/ubuntu/hadoop-2.9.2/logs/hadoop-ubuntu-namenode-ubuntu.out

Start a DataNode as part of the distributed HDFS setup:

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/hadoop-daemon.sh --config conf_user/ --script hdfs start datanode
starting datanode, logging to /home/ubuntu/hadoop-2.9.2/logs/hadoop-ubuntu-datanode-ubuntu.out

Once the slaves file and passwordless SSH are configured, all HDFS daemons can be started with a single script:

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/ubuntu/hadoop-2.9.2/logs/hadoop-ubuntu-namenode-ubuntu.out
localhost: starting datanode, logging to /home/ubuntu/hadoop-2.9.2/logs/hadoop-ubuntu-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/ubuntu/hadoop-2.9.2/logs/hadoop-ubuntu-secondarynamenode-ubuntu.out

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh  --config conf_user/ start resourcemanager
starting resourcemanager, logging to /home/ubuntu/hadoop-2.9.2/logs/yarn-ubuntu-resourcemanager-ubuntu.out

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh  --config conf_user/ start proxyserver
starting proxyserver, logging to /home/ubuntu/hadoop-2.9.2/logs/yarn-ubuntu-proxyserver-ubuntu.out
2019-01-01 14:14:46,189 INFO  [main] webproxy.WebAppProxyServer (LogAdapter.java:info(51)) - STARTUP_MSG:

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/ubuntu/hadoop-2.9.2/logs/yarn-ubuntu-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /home/ubuntu/hadoop-2.9.2/logs/yarn-ubuntu-nodemanager-ubuntu.out

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/mr-jobhistory-daemon.sh --config conf_user/ start historyserver
starting historyserver, logging to /home/ubuntu/hadoop-2.9.2/logs/mapred-ubuntu-historyserver-ubuntu.out
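
The per-daemon commands in this section follow one pattern, so they can be generated rather than typed. The sketch below only prints them (a dry run); the function name daemon_commands is made up, and conf_user is the configuration directory used in the transcript:

```shell
#!/usr/bin/env bash
# Print, in order, the per-daemon commands used in the transcript.
# $1 is "start" or "stop"; $2 is the configuration directory.
daemon_commands() {
  local action=$1 conf=$2
  echo "sbin/hadoop-daemon.sh --config $conf --script hdfs $action namenode"
  echo "sbin/hadoop-daemon.sh --config $conf --script hdfs $action datanode"
  echo "sbin/yarn-daemon.sh --config $conf $action resourcemanager"
  echo "sbin/yarn-daemon.sh --config $conf $action nodemanager"
  echo "sbin/yarn-daemon.sh --config $conf $action proxyserver"
  echo "sbin/mr-jobhistory-daemon.sh --config $conf $action historyserver"
}

daemon_commands start conf_user
```

Calling daemon_commands stop conf_user prints the matching shutdown commands.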

Shutting down Hadoop

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/hadoop-daemon.sh --config conf_user/ --script hdfs stop namenode
stopping namenode
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/hadoop-daemon.sh --config conf_user/ --script hdfs stop datanode
stopping datanode

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/stop-dfs.sh 
Stopping namenodes on [localhost]
localhost: no namenode to stop
localhost: no datanode to stop
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode

ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh --config conf_user/ stop resourcemanager
stopping resourcemanager
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh --config conf_user/ stop namenodemanager
no namenodemanager to stop
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh --config conf_user/ stop nodemanager
stopping nodemanager
nodemanager did not stop gracefully after 5 seconds: killing with kill -9
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh --config conf_user/ stop nodemanager
no nodemanager to stop
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/stop-yarn.sh 
stopping yarn daemons
no resourcemanager to stop
localhost: no nodemanager to stop
no proxyserver to stop
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/yarn-daemon.sh --config conf_user/ stop proxyserver
no proxyserver to stop
ubuntu@ubuntu:~/hadoop-2.9.2$ sbin/mr-jobhistory-daemon.sh --config conf_user/  stop historyserver
stopping historyserver

A rant I have to get off my chest

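I never got the cluster working on Ubuntu 18.04: the DataNode never showed up on the 8088 page, which was a real headache.
So I switched to 16.04 and followed some tutorial, which produced a ton of problems. The lesson: stick to the steps above, and when your versions differ from a tutorial's, check that first instead of configuring from random sources. It gave me quite a headache.
Everything I configured from that other version failed with "ssh xxx: host or sever not founded", which shows just how painful mixing versions is... Following what I found online, I changed it to
After that the 8088 page worked, but the DataNode still did not show up on the 50070 page. The logs pointed to a directory permission problem, even though both directories were already 777. In the end I deleted the two directories and recreated them (with names different from the ones on the master, so the nodes are not mistaken for the same node; I ran into that as well, haha), and then it finally worked!
By the way, two commands are actually enough: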

hdfs namenode -format namecluster
hadoop-2.9.2/sbin/start-all.sh

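One more thing: if DataNodes are missing or only partly listed on the 50070 page, first delete the corresponding DataNode logs and the files inside the DataNode data directory you configured, then start everything again.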
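
A cautious sketch of that cleanup step, shown on a throwaway directory instead of a real DataNode directory. The function name clean_dir is made up; on a real node the argument would be whatever dfs.datanode.data.dir points to, and everything stored there is lost:

```shell
#!/usr/bin/env bash
# Remove everything inside a directory while keeping the directory itself.
clean_dir() {
  local dir=$1
  # Refuse to run on an empty argument or a path that is not a directory.
  [ -n "$dir" ] && [ -d "$dir" ] || return 1
  find "$dir" -mindepth 1 -delete
}

# Demonstration on a temporary directory standing in for the DataNode
# data directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/current"
touch "$demo_dir/current/VERSION"
clean_dir "$demo_dir"
ls -A "$demo_dir"
```

Stop the daemons before cleaning a real data directory, and double-check the path first, since the deletion cannot be undone.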
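Oh, I just realized I never covered networking. The machines in a cluster have to know about each other: first edit /etc/hostname to give each node a name, then edit /etc/hosts on every machine to list the IP address and hostname of the master and of every other node, so they can find each other on the network.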
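
To show what the /etc/hosts entries could look like, the sketch below writes a sample file in the same format and looks up each node in it. The addresses, the hostnames and the function name lookup_host are all made up for illustration:

```shell
#!/usr/bin/env bash
# Look up the IP address for a hostname in a hosts-format file.
lookup_host() {
  awk -v name="$2" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
  ' "$1"
}

# Sample entries; the addresses and names are made up for illustration.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
EOF

for node in master slave1 slave2; do
  echo "$node -> $(lookup_host "$hosts_file" "$node")"
done
```

The same three lines, with real addresses, would go into /etc/hosts on every machine in the cluster.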
What a headache...