Additional troubleshooting references
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html (covers most Kerberos problems: tickets, server-side errors)
https://www.bbsmax.com/A/rV574Yk9dP/
1. Problem:
2019-03-22 20:53:01 WARN ScriptBasedMapping:254 - Exception running /etc/hadoop/conf/topology_script.py 10.111.32.186
19/03/22 20:53:01 INFO LineBufferedStream: stdout: java.io.IOException: Cannot run program "/etc/hadoop/conf/topology_script.py" (in directory "/home/jerry"): error=2, 没有那个文件或目录
19/03/22 20:53:01 INFO LineBufferedStream: stdout: Caused by: java.io.IOException: error=2, 没有那个文件或目录
Cause:
(1) The cluster's core-site.xml and other *.xml resource files were copied into the project, and when running from the IDE (IDEA, PyCharm) the topology-script property in core-site.xml was not commented out (see the core-site.xml snippet below).
(2) After the program is packaged and run in yarn-client mode, if the local machine can still read the cluster's resource files, the configuration above is still picked up, the script is executed locally, and the same error is reported.
In my case the job was submitted through Livy in yarn-client mode: the jar ran on the remote cluster, but the error above was reported locally (it did not affect the cluster run). The fix was to switch the run mode to yarn-cluster, so the job runs entirely on the remote cluster and does not depend on local resources.
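The property in question is Hadoop's standard topology-script setting, net.topology.script.file.name; a minimal sketch of the core-site.xml entry to comment out (the script path comes from the error above):
<!-- comment out (or remove) this entry when running locally -->
<property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf/topology_script.py</value>
</property>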
2. Create the proxy-user principals and keytabs that the local host uses to access the Kerberos-secured cluster (the part after the "/" is the current hostname):
addprinc HTTP/[email protected]
modprinc -maxrenewlife "1 week" +allow_renewable HTTP/[email protected]
xst -k http.livy.test.com.keytab HTTP/[email protected]
addprinc dp/[email protected]
modprinc -maxrenewlife "1 week" +allow_renewable dp/[email protected]
xst -k dp.livy.test.com.keytab dp/[email protected]
On the node where HDFS is located, initialize the ticket cache:
kinit -kt dp.livy.test.com.keytab dp/livy.test.com
Test HDFS access; the TGT exception is no longer thrown:
hdfs dfs -ls /
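To double-check, list the keytab entries and the cached TGT:
klist -kt dp.livy.test.com.keytab
klist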
3. Copy the jersey-*.jar files under $HADOOP_HOME/share/lib to the classpath (otherwise YARN throws a ClassNotFoundException).
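A sketch of the copy, assuming the jersey jars sit under $HADOOP_HOME/share/hadoop/yarn/lib (the exact directory varies by Hadoop distribution) and that ./lib is on the application classpath:
cp $HADOOP_HOME/share/hadoop/yarn/lib/jersey-*.jar ./lib/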
4. On the KDC server, create the principal HTTP/_HOSTNAME for the local host (the host from which Livy jobs are submitted).
5. Add the proxy-user settings that allow the local host (10.111.23.70) to access the cluster:
hadoop.proxyuser.dp.groups=*
hadoop.proxyuser.dp.hosts=10.111.23.70
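These are the standard Hadoop proxy-user properties and belong in core-site.xml; the equivalent XML (user dp and host 10.111.23.70 taken from the settings above) looks like:
<property>
    <name>hadoop.proxyuser.dp.hosts</name>
    <value>10.111.23.70</value>
</property>
<property>
    <name>hadoop.proxyuser.dp.groups</name>
    <value>*</value>
</property>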
2019-03-21 12:18:20 ERROR ApplicationMaster:91 - Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:657)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:517)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:347)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:854)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
2019-03-21 12:18:20 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
2019-03-21 12:18:20 INFO ApplicationMaster:54 - Deleting staging directory hdfs://namenode.gai.test.com:8020/user/dp/.sparkStaging/application_1553139449000_0004
2019-03-21 12:18:20 INFO ShutdownHookManager:54 - Shutdown hook called
6. Livy: ERROR RSCClient: Failed to connect to context.
Solution: configure all IP-hostname mappings on the Livy host, on the Kerberos (KDC) host, and on every node of the Hadoop cluster.
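For illustration, the /etc/hosts entries might look like the following (the IP-hostname pairings below are assumptions, substitute your own; the same entries must exist on the Livy host, the KDC, and every cluster node):
10.111.23.70        livy.test.com
<cluster-node-ip>   <cluster-node-hostname>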
7. After Kerberos is configured in Ambari, starting a service fails on one node because the principal has no matching keytab:
resource_management.core.exceptions.ExecutionFailed: Execution of '/usr/bin/kinit -c /var/lib/ambari-agent/tmp/kerberos_service_check_cc_fbb115245f1af7a48db9ac7847a98bb9036a0cfceeae4045ac38351c -kt /etc/security/keytabs/smokeuser.headless.keytab [email protected]' returned 1. kinit: Password incorrect while getting initial credentials
For the general approach see: https://www.cnblogs.com/zhzhang/p/5692249.html
and https://blog.csdn.net/tianbaochao/article/details/78592989 (points out that the encryption types affect how the keytab is encrypted/decrypted; useful background).
Cause: when kinit runs there is no matching keytab, the keytab is stale, the password is wrong, or the principal and keytab do not match.
Solution: delete the principal from the Kerberos database, then recreate the principal and its keytab:
delprinc [email protected]
addprinc [email protected]
Specify the encryption types explicitly:
xst -e aes128-cts-hmac-sha1-96,arcfour-hmac,des-cbc-md5,des3-cbc-sha1,aes256-cts-hmac-sha1-96 -k smokeuser.headless.keytab -q [email protected]
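After regenerating, copy the keytab to the location Ambari expects (an assumption based on the path in the error above) and verify it:
cp smokeuser.headless.keytab /etc/security/keytabs/smokeuser.headless.keytab
kinit -kt /etc/security/keytabs/smokeuser.headless.keytab [email protected]
klist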
8. sudo fails during the Ambari Kerberos installation with the standard lecture followed by an error:
We trust you have received the usual lecture from the local System Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
sudo: no tty present and no askpass program specified
Cause: the ambari user/group cannot run the installation scripts via password-less sudo while Kerberos is being installed.
Solution: use visudo to edit /etc/sudoers so that the ambari user and group can run scripts without a password:
visudo
%ambari ALL=(ALL) NOPASSWD: ALL   # password-less sudo for the ambari group
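The syntax of /etc/sudoers can then be checked with:
visudo -c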
9. Accessing the remote Hadoop cluster through Livy + Kerberos fails because no keytab key of a suitable encryption type can be found:
GSS-API Exception - Cannot find key of appropriate type to decrypt AP REP - AES128
Cause <1>: the keytab was exported and recreated several times, so the kinds and number of its encryption types no longer match those of the other, working keytabs.
Solution: delete the principal and its keytab, determine the encryption types to use by inspecting a working keytab (klist -ke other.keytab), recreate the principal and keytab with those encryption types, then restart the KDC server and restart the Hadoop services in the Ambari cluster.
klist -ke other.keytab
kadmin.local:
delprinc principal
addprinc principal
# Adjust the principal and realm names below to your own (inspired by https://blog.csdn.net/tianbaochao/article/details/78592989)
xst -e aes128-cts-hmac-sha1-96,arcfour-hmac,des-cbc-md5,des3-cbc-sha1,aes256-cts-hmac-sha1-96 -k principal.keytab -q principal@EXAMPLE.COM
Cause <2>: the JCE unlimited-strength policy files are not installed on all machines (the local machine, the Livy host, and the remote cluster nodes), so encryption/decryption fails.
Solution:
Download the JCE policy archive from the Oracle JCE site and extract it into $JAVA_HOME/jre/lib/security:
unzip -o -j -q jce_policy-8.zip -d $JAVA_HOME/jre/lib/security   # avoids soft-link issues
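A quick way to check whether the unlimited-strength policy is active on a given JVM is this minimal Scala sketch (prints 2147483647 with the unlimited policy installed, 128 with the default policy):
import javax.crypto.Cipher
// maximum AES key length allowed by the installed JCE policy files
println(Cipher.getMaxAllowedKeyLength("AES"))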
10. When the local Kerberos API accesses the remote Kerberos-secured cluster, the master Kerberos principal cannot be resolved:
19/03/15 16:43:40 INFO SparkContext: Created broadcast 0 from textFile at TestKerberos.scala:42
Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at Test.TestKerberos$.main(TestKerberos.scala:44)
at Test.TestKerberos.main(TestKerberos.scala)
19/03/15 16:43:41 INFO SparkContext: Invoking stop() from shutdown hook
19/03/15 16:43:41 INFO SparkUI: Stopped Spark web UI at http://10.111.23.70:4040
19/03/15 16:43:41 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/03/15 16:43:41 INFO MemoryStore: MemoryStore cleared
Cause: the configuration files of the remote cluster being accessed (all the *.xml files under /hadoop/etc/hadoop) were not copied onto a suitable classpath of the current project (the directory produced when the code is compiled).
Solution (approach based on https://www.bbsmax.com/A/rV574Yk9dP/):
Copy the remote cluster's configuration files (all *.xml under /hadoop/etc/hadoop) onto the project's classpath. Use the following code to list the project's classpath entries and pick a suitable directory:
import java.net.URLClassLoader

// Print every classpath entry visible to the current thread's class loader
val classLoader = Thread.currentThread.getContextClassLoader
val urlClassLoader = classLoader.asInstanceOf[URLClassLoader]
for (url <- urlClassLoader.getURLs) {
  println("classpath " + url)
}
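Once the *.xml files are on the classpath, a minimal Scala sketch of logging in through the Kerberos API and listing HDFS (the keytab path is a placeholder; the principal is the dp one created earlier in these notes):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// new Configuration() picks up core-site.xml / hdfs-site.xml / yarn-site.xml from the classpath
val conf = new Configuration()
UserGroupInformation.setConfiguration(conf)
// principal and keytab path are examples -- substitute your own
UserGroupInformation.loginUserFromKeytab("dp/[email protected]", "/path/to/dp.livy.test.com.keytab")
val fs = FileSystem.get(conf)
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))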
11. Errors occur when using a generated keytab because of the encryption types chosen when it was created.
Cause: the encryption types (and how many of them) in a manually generated keytab differ from those in the keytabs Ambari generated automatically during installation; they must be kept consistent, for example:
xst -e aes128-cts-hmac-sha1-96,arcfour-hmac,des-cbc-md5,des3-cbc-sha1,aes256-cts-hmac-sha1-96 -k smokeuser.headless.keytab -q [email protected]
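To confirm the encryption types match, compare the freshly exported keytab with the one Ambari generated (assuming the Ambari keytab is still at /etc/security/keytabs/, as in item 7):
klist -ke smokeuser.headless.keytab
klist -ke /etc/security/keytabs/smokeuser.headless.keytab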