目录
前言
一、配置 SASL 认证证书
二、修改集群配置文件
1.hdfs添加以下配置
2.yarn添加以下配置
3.hive添加以下配置
4.hbase添加以下配置
三、kerberos相关命令
四、快速测试
五、问题解决
1、Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Server asks us to fall back to SIMPLE auth, but this client is configured to only allow secure connections.; Host Details : local host is: "v0106-c0a8003a.4eb06.c.local/192.168.0.58"; destination host is: "v0106-c0a80019.4eb06.c.local":8020;
2、bash: keytools: command not found
3、java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "v0106-c0a8003a.4eb06.c.local/192.168.0.58"; destination host is: "v0106-c0a80019.4eb06.c.local":8020;
4、retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over /192.168.0.92:8020 after 4 fail over attempts.
5、Caused by: GSSException: No valid credentials provided (Mechanism level: Ticket expired (32) - PROCESS_TGS)。Caused by: KrbException: Identifier doesn't match expected value (906)。Failed on local exception: java.io.IOException: Couldn't setup connection for hdfs/[email protected] to /192.168.0.25:8485; Host Details : local host is: "v0106-c0a8005c.4eb06.c.local/192.168.0.92"; destination host is: "v0106-c0a80019.4eb06.c.local":8485;
6、java.net.ConnectException: Call From xxx to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused
7、WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
8、Incompatible clusterID for journal Storage Directory /export/data/hadoop/journal/yinheforyanghang: NameNode has clusterId 'CID-7c4fa72e-afa0-4334-bafd-68af133d4ffe' but storage has clusterId 'CID-52b9fb86-24b4-43a3-96d3-faa584e96d69'
9、There appears to be a gap in the edit log. We expected txid 1, but got txid 2941500.
10、Caused by: ExitCodeException exitCode=24: File /home/hadoop/core/hadoop-2.7.6/etc/hadoop/container-executor.cfg must be owned by root, but is owned by 1000
11、ulimit -a for user hdfs
12、Caused by: MetaException(message:Version information not found in metastore. )
13、Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=492685), where newValue=16385。
14、This node has namespaceId '279103593 and clusterId 'xxx' but the requesting node expected '279103593' and 'xxx'。
15、FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient。
16、WARN hdfs.DFSClient DataStreamer Exception java.lang.IllegalArgumentException: null at javax.security.auth.callback.NameCallback.(NameCallback.java:90)...。
17、Failed to become active master.org.apache.hadoop.security.AccessControlException: Permission denied. user=hbase is not the owner of inode=/hbase。
前一篇记录了如何安装主从kdc,这篇记录下kerberos如何与hadoop集群集成。设定:
hadoop 的配置文件目录为:/export/common/hadoop/conf/
hadoop 的安装目录为:/export/hadoop/
因为启用kerberos 认证后,会使用HTTPS 访问,需要访问证书。随意在一台服务器比如:v0106-c0a8000e.4e462.c.local 上执行:
$ openssl req -newkey rsa:2048 -keyout rsa_private.key -x509 -days 365 -out cert.crt -subj /C=CN/ST=BJ/L=BJ/O=test/OU=dev/CN=jd.com/[email protected]
将生成的 cert.crt rsa_private.key 复制到 每个服务器的hadoop 配置文件目录里
$ scp cert.crt rsa_private.key 192.168.0.25:/export/common/hadoop/conf/(每个节点都要有)
在每个服务器上执行执行中需要输入密码:123456。(每个节点都要有java环境才可以执行)
$ cd /export/common/hadoop/conf/
$ keytool -keystore keystore -alias localhost -validity 9999 -genkey -keyalg RSA -keysize 2048 -dname "CN=v0106-c0a8000e.4e462.c.local, OU=dev, O=test, L=BJ, ST=BJ, C=CN"
$ keytool -keystore truststore -alias CARoot -import -file cert.crt;
$ keytool -certreq -alias localhost -keystore keystore -file cert;
$ openssl x509 -req -CA cert.crt -CAkey rsa_private.key -in cert -out cert_signed -days 9999 -CAcreateserial;
$ keytool -keystore keystore -alias CARoot -import -file cert.crt ;
$ keytool -keystore keystore -alias localhost -import -file cert_signed ;
core-site.xml
hadoop.security.authorization
true
hadoop.security.authentication
kerberos
hdfs-site.xml
dfs.namenode.keytab.file
/export/common/kerberos5/hdfs.keytab
dfs.namenode.kerberos.principal
hdfs/[email protected]
dfs.namenode.kerberos.internal.spnego.principal
HTTP/[email protected]
dfs.web.authentication.kerberos.principal
HTTP/[email protected]
dfs.web.authentication.kerberos.keytab
/export/common/kerberos5/hdfs.keytab
dfs.datanode.keytab.file
/export/common/kerberos5/hdfs.keytab
dfs.datanode.kerberos.principal
hdfs/[email protected]
dfs.http.policy
HTTPS_ONLY
dfs.data.transfer.protection
integrity
dfs.block.access.token.enable
true
dfs.datanode.data.dir.perm
700
dfs.journalnode.keytab.file
/export/common/kerberos5/hdfs.keytab
dfs.journalnode.kerberos.principal
hdfs/[email protected]
dfs.journalnode.kerberos.internal.spnego.principal
HTTP/[email protected]
hadoop_env.sh
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=${JAVA_HOME}/lib -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88"
ssl-server.xml(放在hadoop配置目录下:/export/common/hadoop/conf,赋权hdfs:hadoop)
ssl.server.truststore.location
/export/common/hadoop/conf/truststore
Truststore to be used by NN and DN. Must be specified.
ssl.server.truststore.password
123456
Optional. Default value is "".
ssl.server.truststore.type
jks
Optional. The keystore file format, default value is "jks".
ssl.server.truststore.reload.interval
10000
Truststore reload check interval, in milliseconds.
Default value is 10000 (10 seconds).
ssl.server.keystore.location
/export/common/hadoop/conf/keystore
Keystore to be used by NN and DN. Must be specified.
ssl.server.keystore.password
123456
Must be specified.
ssl.server.keystore.keypassword
123456
Must be specified.
ssl.server.keystore.type
jks
Optional. The keystore file format, default value is "jks".
ssl.server.exclude.cipher.list
TLS_ECDHE_RSA_WITH_RC4_128_SHA,SSL_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA,
SSL_RSA_WITH_DES_CBC_SHA,SSL_DHE_RSA_WITH_DES_CBC_SHA,
SSL_RSA_EXPORT_WITH_RC4_40_MD5,SSL_RSA_EXPORT_WITH_DES40_CBC_SHA,
SSL_RSA_WITH_RC4_128_MD5
Optional. The weak security cipher suites that you want excluded
from SSL communication.
ssl-client.xml(放在hadoop配置目录下:/export/common/hadoop/conf,赋权hdfs:hadoop)
ssl.client.truststore.location
/export/common/hadoop/conf/truststore
Truststore to be used by clients like distcp. Must be
specified.
ssl.client.truststore.password
123456
Optional. Default value is "".
ssl.client.truststore.type
jks
Optional. The keystore file format, default value is "jks".
ssl.client.truststore.reload.interval
10000
Truststore reload check interval, in milliseconds.
Default value is 10000 (10 seconds).
ssl.client.keystore.location
/export/common/hadoop/conf/keystore
Keystore to be used by clients like distcp. Must be
specified.
ssl.client.keystore.password
123456
Optional. Default value is "".
ssl.client.keystore.keypassword
123456
Optional. Default value is "".
ssl.client.keystore.type
jks
Optional. The keystore file format, default value is "jks".
yarn-site.xml
yarn.web-proxy.principal
HTTP/[email protected]
yarn.web-proxy.keytab
/export/common/kerberos5/hdfs.keytab
yarn.resourcemanager.principal
hdfs/[email protected]
yarn.resourcemanager.keytab
/export/common/kerberos5/hdfs.keytab
yarn.nodemanager.principal
hdfs/[email protected]
yarn.nodemanager.keytab
/export/common/kerberos5/hdfs.keytab
yarn.nodemanager.container-executor.class
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
yarn.nodemanager.linux-container-executor.group
hdfs
yarn.timeline-service.http-authentication.type
kerberos
Defines authentication used for the timeline server HTTP endpoint. Supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#
yarn.timeline-service.principal
hdfs/[email protected]
yarn.timeline-service.keytab
/export/common/kerberos5/hdfs.keytab
yarn.timeline-service.http-authentication.kerberos.principal
HTTP/[email protected]
yarn.timeline-service.http-authentication.kerberos.keytab
/export/common/kerberos5/hdfs.keytab
yarn.nodemanager.container-localizer.java.opts
-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49 :88
yarn.nodemanager.health-checker.script.opts
-Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88
mapred-site.xml
mapreduce.map.java.opts
-Xmx1638M -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88
mapreduce.reduce.java.opts
-Xmx3276M -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88
mapreduce.jobhistory.keytab
/export/common/kerberos5/hdfs.keytab
mapreduce.jobhistory.principal
hdfs/[email protected]
mapreduce.jobhistory.webapp.spnego-keytab-file
/export/common/kerberos5/hdfs.keytab
mapreduce.jobhistory.webapp.spnego-principal
HTTP/[email protected]
mapred.child.java.opts
-Xmx1024m -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88
yarn.app.mapreduce.am.command-opts
-Xmx3276m -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=HADOOP.COM -Djava.security.krb5.kdc=192.168.0.49:88
hive-site.xml
hive.server2.authentication
KERBEROS
hive.server2.authentication.kerberos.principal
hdfs/[email protected]
hive.server2.authentication.kerberos.keytab
/export/common/kerberos5/hdfs.keytab
hive.metastore.sasl.enabled
true
hive.metastore.kerberos.keytab.file
/export/common/kerberos5/hdfs.keytab
hive.metastore.kerberos.principal
hdfs/[email protected]
hbase-site.xml
hbase.security.authentication
kerberos
hbase.rpc.engine
org.apache.hadoop.hbase.ipc.SecureRpcEngine
hbase.master.kerberos.principal
hdfs/[email protected]
-->
hbase.master.keytab.file
/export/common/kerberos5/hdfs.keytab
-->
hbase.regionserver.kerberos.principal
hdfs/[email protected]
-->
hbase.regionserver.keytab.file
/export/common/kerberos5/hdfs.keytab
退出授权:kdestroy
主kdc打开kadmin管理:kadmin.local
查看当前系统使用的Kerberos账户:klist
使用keytab获取用户凭证: kinit -kt /export/common/kerberos5/kadm5.keytab admin/[email protected]
查看keytab内容:klist -k -e /export/common/kerberos5/hdfs.keytab
生成keytab文件:kadmin.local -q "xst -k /export/common/kerberos5/hdfs.keytab admin/[email protected]"
延长kerberos认证时长:kinit -R
删除kdc数据库:rm -rf /export/common/kerberos5/principal(这个路径是create时新建的数据库路径)
测试hdfs:切换到hdfs用户,键入命令:hdfs dfs -ls /后,需要认证。再次键入命令“kinit -kt /export/common/kerberos5/hdfs.keytab hdfs/`hostname | awk '{print tolower($0)}'`”后可以查出结果即为与hdfs集成成功。
还要注意一些显著的变化:1、任务可以由提交作业的用户以操作系统账户启动运行,而不一定要由运行节点管理的用户启动。这意味着,可以借助操作系统来隔离正在运行的任务,使它们之间无法相互传送指令,这样的话,诸如任务数据等本地信息的隐私即可通过本地文件系统的安全性而得到保护。(需要将yarn.nodemanager.container-executor.class设为org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor。)
分析:配置好某一个hadoop节点后,切换到hdfs再执行hive命令,报上述错。刚开始排查以为是读的目标机器与发送请求机器的密钥验证不一致导致,需要把全部hadoop节点变为kdc客户端,但是在客户端执行hive命令,仍然报相同错。后来查资料发现原因是代码中编写了登录kerberos的代码,而hdfs没有开kerberos,故还需检查hdfs配置是否生效。
解决:重启了下集群后生效了。
解决: keytool命令不识别,请设置java_home
分析:重启集群后,有hdfs用户访问命令“hdfs dfs -ls /”时,报上述错。刚开始认为没有赋权,故用kinit -kt /export/common/kerberos5/hdfs.keytab hdfs/[email protected]命令赋权,之后还是报同样的错误。然后开查资料进行调试。
调试:先执行如下命令:export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dlog.enable-console=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
export HADOOP_ROOT_LOGGER=DEBUG,console
再执行hadoop操作:hadoop dfs -ls /,在控制台上即可看到相关的debug输出。调试结果仍然是看不出什么。
解决1:由于之前kdc.conf配置master_key_type和supported_enctypes默认使用aes256-cts。后来在网上查资料,发现有说jdk问题,由于出口管制原因,JAVA8使用aes256-cts验证方式需要安装额外的jar包。后来按照提示把JDK8密码扩展无限制权限策略文件下载下来(我的是jdk8)JCE Unlimited Strength Jurisdiction Policy Files for JDK/JRE 8 Download,下载后,覆盖到$JAVA_HOME/jre/lib/security中去,主要是local_policy.jar和 US_export_policy.jar中即可。
解决2:上面的解决方案需要我们替换java包,比较麻烦。所以我们在安装kdc时不用aec256加密,在安装时就修改默认的kdc.conf文件的master_key_type和supported_enctypes中去掉aes256-cts加密。
描述:解决完问题3后,再用命令“hdfs dfs -ls /”,又报上面的错。
分析:hdfs haadmin -getServiceState nn2/nn1,发现两个namenode都是standby状态。zkfc也起来了。(nn1 nn2的名称命名可以去看hdfs-site.xml配置)
解决:1)如果zkfc没有起来,启动zkfc就OK了。进入/export/hadoop/sbin目录下 ./hadoop-daemon.sh start zkfc;
2)手动切换主备状态:hdfs haadmin -transitionToActive nn1 --forcemanual(切换为standby状态命令:hdfs haadmin –transitionToStandby nn2)
分析:1)检查/etc/krb5.conf中"default_tgs_enctypes" ; "default_tkt_enctypes" ; "permitted_enctypes",把/etc/krb5.conf中的arcfour-hmac-md5加密方式去掉,因为在kdc.conf中没有匹配。
2)检查hdfs安全模式:hdfs dfsadmin -safemode get,发现进入了安全模式,然后输入命令:hdfs dfsadmin -safemode leave离开安全模式,重启namenode,又进入安全模式。
解决:后来采用更新krb5版本来测试:
1)先卸载原来的版本yum remove -y libkadm5-1.15.1-18.el7.x86_64;rpm -e --nodeps krb5-libs-1.15.1-18.el7.x86_64
(注:yum有时卸载不了,就用rpm忽略依赖来卸载,其间可用rpm -qa | grep krb5或yum list installed |grep krb5来看安装的依赖包)
2) 安装新版本
rpm -ivh krb5-libs-1.15.1-37.el7_6.x86_64.rpm
rpm -ivh libkadm5-1.15.1-37.el7_6.x86_64.rpm
rpm -ivh krb5-server-1.15.1-37.el7_6.x86_64.rpm(客户端不用安装)
rpm -ivh krb5-workstation-1.15.1-37.el7_6.x86_64.rpm
描述:解决完问题5后,启动namenode时报上述错。
解决:重新格式化namenode,命令如下:hadoop namenode -format,之后分别重启两个namenode,发现还是都为standby状态,报错日志如下问题7
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby。
描述:解决完问题6后,报上面的错。然后执行切换HA namenode active状态命令:hdfs haadmin -transitionToActive --forcemanual nn1后,namenode进程没了,同时报下面错:Incompatible namespaceID for journal Storage Directory /export/data/hadoop/journal/yinheforyanghang: NameNode has nsId 420017674 but storage has nsId 1693249738
解决:vim /export/data/hadoop/namenode/current/VERSION,把420017674改为1693249738。重启namenode发现还是都为standby状态。
描述:解决完问题7后,重启NN,然后还是standby,再执行切换HA namenode active状态命令:hdfs haadmin -transitionToActive --forcemanual nn1后,namenode进程没了,报上面的错。
解决:vim /export/data/hadoop/namenode/current/VERSION,把clusterId的值改成 storage的值。
描述:解决完问题8后,nn1节点为active了,但nn2节点进程自动消失,并且日志报上面的错。这个错误表明该节点namenode元数据发生了损坏。需要恢复元数据以后,才能启动namenode
解决:cd $HADOOP_HOME/bin&&hadoop namenode -recover,一路选择c 进行元数据的恢复。恢复完元数据以后,使用如下命令重新启动namenode节点。
描述:启动nodemanager节点时,报的错
解决:1)hadoop版本2.7.6时,这个问题的原因是 LinuxContainerExecutor 通过container-executor来启动容器,但是出于安全的考虑,要求其所依赖的配置文件container-executor.cfg及其各级父路径owner必须是root用户。命令格式为:chown root 每个层级目录名(云环境目录层次为:/export/server/hadoop-2.7.6/etc/hadoop/container-executor.cfg)
2)设置了之后又出现别的问题,如下:
Caused by: ExitCodeException exitCode=22: Invalid permissions on container-executor binary.
这里确实是权限问题,我们还需要执行以下命令修改$HADOOP_HOME/bin/container-executor权限:
chown root:hadoop $HADOOP_HOME/bin/container-executor
chmod 6050 $HADOOP_HOME/bin/container-executor
然后再启动nodemanager节点,又出现下面问题:
3)Caused by: ExitCodeException exitCode=24: Can't get configured value for yarn.nodemanager.linux-container-executor.group
vim /export/server/hadoop-2.7.6/etc/hadoop/container-executor.cfg
然后再启动nodemanager节点:cd /export/hadoop/sbin && ./yarn-daemon.sh start nodemanager,发现已经成功启动了。
分析:重启datanode时启动失败。查看日志/export/server/hadoop_home/logs/hadoop-hdfs-datanode-v0106-c0a80049.4eb06.c.local.out发现“ulimit -a for user hdfs”,之后查了下/export/server/hadoop_home/logs/hadoop-hdfs-datanode-v0106-c0a80049.4eb06.c.local.log发现“Address already in use;”,说明50070端口被占。
解决:netstat -anlp | grep 50070输入如下:
tcp 0 0 192.168.0.73:50070 192.168.0.73:50070 ESTABLISHED 2931329/python2
kill -9 2931329,之后再重启即可。
描述 :启动metastore命令:nohup hive --service metastore &后,报此错,到 hive-site.xml中查看: hive.metastore.schema.verification=false,配置也没问题。
解决:su - hdfs
vim ~/.bashrc,在里面添加以下环境变量
export HIVE_CONF_DIR=/export/common/hive/conf
export HIVE_HOME=/export/hive
export PATH=".:$HIVE_HOME/bin:$HIVE_HOME/sbin"
描述:运行一段时间后,发现standby的namenode挂了,启动时报如上错误。之后format该namenode节点,再重启发现虽然启来了。但是datanode节点全部挂了,并报错: java.io.IOException: Incompatible clusterIDs in /home/hduser/mydata/hdfs/datanode: namenode clusterID = **CID-8e09ff25-80fb-4834-878b-f23b3deb62d0**; datanode clusterID = **CID-cd85e59a-ed4a-4516-b2ef-67e213cfa2a1,后vim /export/data/hadoop/namenode/current/VERSION,把clusterId的值改成 datanode的值。再次启动namenode时报错:got premature end-of-file at txid 0; expected file to go up to 4,再执行命令:hdfs namenode -bootstrapStandby。
解决:解决了13问题后,又报上面错,之后vim /export/data/hadoop/jounalnode/集群名/current/VERSION下的clusterid改成与namenode的一样
描述:启动hive时,进入hive命令行后,执行hive sq时报上述错。用此命令启动hive --hiveconf hive.root.logger=DEBUG,console,看debug日志:Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)。Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed。
解决:1)先清空或删掉元数据库表再初始化hive元数据库:/export/hive/bin/schematool -dbType mysql -initSchema,这一步后仍然没有解决。(这不是根本原因,可以不用做)
2)后分析日志后发现是连不上metastore,看了hive-site.xml后才发现属性hive.metastore.uris的值是ip:9083,而kerberos在认证时是hostname通过认证的,故把ip改为hostaname名即可成功。具体配置示例如下:
描述:用hdfs用户执行命令“hdfs dfs -put test.txt /tmp”时报上述错。
解决:hdfs-site.xml中加入以下配置:
描述:用hdfs用户执行命令“hdfs dfs -put test.txt /tmp”时报上述错。
解决:hbase-site.xml中加入以下配置: