I. Preparation
Download the official prebuilt 0.7.1 package from the Zeppelin download page, and download the following files from Maven:
jackson-core-2.6.3.jar
jackson-databind-2.6.3.jar
jackson-annotations-2.6.3.jar
Replace the Jackson jars in $ZEPPELIN_HOME/lib with the three Jackson jars above.
Download and build Livy following its GitHub page:
mvn -X -e -DskipTests -Dspark-2.0 package
II. Deployment
Unpack the packages and configure them as follows.
1) LDAP login support and permission settings
Configure the LDAP authentication server in conf/shiro.ini:
activeDirectoryRealm = org.apache.shiro.realm.activedirectory.ActiveDirectoryRealm
activeDirectoryRealm.systemUsername =
activeDirectoryRealm.systemPassword =
activeDirectoryRealm.searchBase =
activeDirectoryRealm.url =
The same file can also grant admin privileges. With admin rights you can remove particular interpreters, e.g. drop the Spark interpreter and force users onto Livy. Just configure the role under [roles], for example:
[roles]
admin=admin
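To actually enforce the admin role, Zeppelin's stock shiro.ini also carries a [urls] section; a minimal sketch (following the template that ships with Zeppelin, paths illustrative) that restricts the interpreter-settings API to admins:

[urls]
/api/interpreter/** = authc, roles[admin]
/** = authc

With this in place, non-admin users can still run notebooks but cannot open or modify the interpreter settings page.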
2) Proxy user (user impersonation) support
Although per the official docs 0.7.1 already supports user impersonation, it impersonates the logged-in user name as-is. After LDAP login the user name looks like [email protected]; because it contains the special character @, add the following to zeppelin-env.sh to cut off the prefix of the user name for impersonation:
export ZEPPELIN_IMPERSONATE_CMD='echo ${ZEPPELIN_IMPERSONATE_USER} | cut -d \@ -f 1 |xargs -I {} sudo -H -u {} bash -c '
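For illustration, the cut stage splits on the first @ and keeps the prefix, so only that prefix reaches sudo (the user name below is made up):

echo 'alice@corp.example.com' | cut -d @ -f 1
# prints: alice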
You also need to modify bin/interpreter.sh for the shell command above to take effect. Change line 50:
ZEPPELIN_IMPERSONATE_RUN_CMD=$(eval "echo ${ZEPPELIN_IMPERSONATE_CMD} ")
to:
ZEPPELIN_IMPERSONATE_RUN_CMD=$ZEPPELIN_IMPERSONATE_CMD
The reason is that the original echo, stacked on the echo inside the command, causes the command to be executed prematurely and fail; a detailed analysis has been filed in the upstream JIRA.
Besides this change, the logged-in user does not necessarily exist on the Zeppelin server, so add the following to bin/interpreter.sh in the section starting at line 46:
# create a system account for the impersonated user if it does not already exist
if ! id "$ZEPPELIN_IMPERSONATE_USER" >/dev/null 2>&1; then
  sudo useradd -r -s /bin/nologin "$ZEPPELIN_IMPERSONATE_USER"
fi
(The user that starts the Zeppelin server must have sudo privileges.)
3) Other settings
The default submit options for the Spark interpreter can be configured in conf/zeppelin-env.sh, for example:
export SPARK_SUBMIT_OPTIONS="--driver-memory 4096M --num-executors 3 --executor-cores 1 --executor-memory 2G"
Other basics such as Spark, Hadoop, and the default Spark submit mode also need to be set:
export SPARK_HOME=/usr/local/spark-2.0.0-bin-hadoop2.6
export MASTER=yarn-client
export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
Livy's default job-submission settings can be configured in livy.conf; whatever livy.conf cannot express can also be set in $SPARK_HOME/conf/spark-defaults.conf.
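A short sketch of such defaults in spark-defaults.conf (the values are purely illustrative, not recommendations):

spark.executor.memory   2g
spark.executor.cores    1
spark.driver.memory     4g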
4) Fixing permissions
With impersonation enabled, logs are written into the logs directory as the impersonated user, so do the following:
cd $ZEPPELIN_HOME
mkdir logs
chmod -R 777 logs
5) Livy log settings
Livy logs through log4j and by default writes only to the console; to log to a file, add file-appender settings to conf/log4j.properties.
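A minimal sketch of such a file appender, using standard log4j 1.x properties and assuming the stock file already defines a console appender named console; the log path is an assumption:

log4j.rootLogger=INFO, console, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/livy/livy.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n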
6) Using Hive and %livy.sql in Livy
In conf/livy.conf set livy.repl.enableHiveContext = true
Copy hive-site.xml into livy/conf
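As a one-line sketch, assuming $HIVE_HOME and $LIVY_HOME point at the respective installations:

cp $HIVE_HOME/conf/hive-site.xml $LIVY_HOME/conf/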
If you upgrade Spark to a version above 2.0.1, you may hit an error like the following:
java.io.FileNotFoundException: Added file file:/data/livy-hive/livy/conf/hive-site.xml does not exist.
The cause is a bug in Spark's cluster-mode job submission (see SPARK-18160 in the references).
III. Common Problems
1. Hive permission problems
Zeppelin reads and loads every file under conf/, so when a hive-site.xml is present there, the local tmp directory it configures is initialized by Zeppelin as well. Without the necessary permissions, the Spark interpreter fails while executing code with an error like:
java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Permission denied
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(...)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(...)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset.<init>(...)
at org.apache.spark.sql.Dataset.<init>(...)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
... 46 elided
Caused by: java.lang.RuntimeException: java.io.IOException: Permission denied
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:515)
... 70 more
Caused by: java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.hadoop.hive.ql.session.SessionState.createTempFile(SessionState.java:818)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:513)
... 70 more
The cause is that the directory set by hive.exec.scratchdir is not sufficiently permissive; setting it to 777 makes the problem go away. Note that the problem recurs after every Zeppelin restart, so for now the permissions have to be fixed by hand each time.
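A sketch of the manual fix, assuming the scratch directory is /tmp/hive (check hive.exec.scratchdir in hive-site.xml for the actual path):

chmod -R 777 /tmp/hive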
2. Zeppelin fails with --proxy-user
Given the production cluster's current configuration, no user other than the superuser hadoop has --proxy-user rights, so Zeppelin submitting via spark --proxy-user produces errors like:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: heyang.wang is not allowed to impersonate heyang.wang
The solution is to delete --proxy-user ${ZEPPELIN_IMPERSONATE_USER} from line 205 of bin/interpreter.sh; submitting the Spark job directly as the target user achieves the same impersonation.
3. Livy fails with --proxy-user
In the Livy service, all sessions are started by the same user, so --proxy-user is the only available impersonation mechanism. The solution is to grant proxy-user permission specifically to the machine running the Livy server, with the following configuration in Hadoop's core-site.xml:
<property>
  <name>hadoop.proxyuser.super.hosts</name>
  <value>host1,host2</value>
</property>
<property>
  <name>hadoop.proxyuser.super.groups</name>
  <value>group1,group2</value>
</property>
This configuration has the following effect: the superuser super (the $superuser in hadoop.proxyuser.$superuser.hosts) can send impersonation requests from host1 and host2 only, and can impersonate only the users contained in hadoop.proxyuser.super.groups, i.e. the members of group1 and group2 in this example.
In our tests, the NameNode and YARN had to be restarted for these settings to take effect.
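Hadoop also documents admin commands intended to reload the proxy-user configuration without a full restart (see the Superusers reference at the end); they were not verified in this setup:

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration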
The configuration we needed to apply:
<property>
  <name>hadoop.proxyuser.zeppelin-dummy.hosts</name>
  <value>10.204.11.182,10.204.11.183</value>
</property>
<property>
  <name>hadoop.proxyuser.zeppelin-dummy.groups</name>
  <value>*</value>
</property>
If the settings do not take effect, or the user that starts Livy lacks Hadoop user-impersonation rights, errors like the following may appear:
17/06/07 21:49:16 ERROR RSCClient: Failed to connect to context.
java.util.concurrent.TimeoutException: Timed out waiting for context to start.
at com.cloudera.livy.rsc.ContextLauncher.connectTimeout(ContextLauncher.java:133)
at com.cloudera.livy.rsc.ContextLauncher.access$300(ContextLauncher.java:62)
at com.cloudera.livy.rsc.ContextLauncher$2.run(ContextLauncher.java:121)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
17/06/07 21:49:16 INFO RSCClient: Failing pending job 24ab6625-bbf1-4f68-8301-4c7ef3c47857 due to shutdown.
Exception in thread "Thread-34" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.cloudera.livy.util.LineBufferedStream$$anon$1.run(LineBufferedStream.scala:39)
17/06/07 21:49:16 DEBUG InteractiveSession: InteractiveSession 0 session state change from starting to error
17/06/07 21:49:16 INFO InteractiveSession: Stopping InteractiveSession 0...
17/06/07 21:49:16 DEBUG InteractiveSession: InteractiveSession 0 session state change from error to shutting_down
17/06/07 21:49:16 INFO InteractiveSession: Failed to ping RSC driver for session 0. Killing application.
17/06/07 21:50:16 WARN SparkYarnApp: Deleting a session while its YARN application is not found.
17/06/07 21:50:16 ERROR SparkYarnApp: Error whiling refreshing YARN state: java.lang.Exception: spark-submit exited with code 143}.
The log only reports that the Spark job failed, without the underlying reason; starting Livy as a user that does have Hadoop user-impersonation rights resolves the problem.
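As a sketch, assuming a privileged account named livy-admin (a hypothetical name) and the stock launch script under the Livy install directory:

sudo -u livy-admin $LIVY_HOME/bin/livy-server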
4. When editing the Zeppelin Spark interpreter, settings cannot be saved and a contentless red alert box pops up in the upper-right corner.
The cause is that Zeppelin 0.7.1 is compiled with Java 8; if zeppelin-env.sh is configured with Java 7, this incompatibility appears. The fix is to switch to Java 8. Note that Zeppelin 0.7.2 switched back to Java 7, so the newer release may avoid this problem.
5. Starting the Spark interpreter throws a java.lang.NullPointerException, and the server log reports that the Jackson version is too old.
The fix is to replace the old Jackson jars with the newer ones, as described at the beginning of this document.
6. Missing memory unit in Livy Spark resource settings
If a Spark memory setting in the Zeppelin Livy interpreter omits the unit, the YARN application master fails on startup with:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
at org.apache.xerces.dom.DeferredDocumentImpl.synchronizeChildren(Unknown Source)
at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeChildren(Unknown Source)
at org.apache.xerces.dom.ParentNode.hasChildNodes(Unknown Source)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2551)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2444)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2361)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:968)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:987)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1388)
at org.apache.hadoop.security.SecurityUtil.<clinit>(...)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:272)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:311)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(...)
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(...)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:414)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:412)
at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:412)
at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:437)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:747)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
The error is caused by writing a bare number for a memory option such as spark.executor.memory; leaving off the G or M unit produces the failure above.
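For illustration, in the Zeppelin Livy interpreter settings (values purely illustrative):

livy.spark.executor.memory = 2g    (correct: unit included)
livy.spark.executor.memory = 2     (wrong: bare number triggers the failure above)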
7. Livy returns truncated error logs
When Livy is the backend, the log returned for a failed program often shows only the first line of the error message, while the complete error log may look like this:
error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
val testDF = spark.createDataFrame(rdds, schema)
Only the first line is shown because, when the Livy server returns a result, the first line goes into the evalue field and the remainder into the traceback field, and Zeppelin displays only evalue. In the source you can see that the traceback field is defined but never output. A PR has been submitted, so the next Livy release should not have this problem.
8. NullPointerException when obtaining the SparkContext
9. Spark SQL queries against Hive from %livy.spark or %livy.sql return empty results, or the YARN log of the corresponding session shows related errors (see SPARK-18160 in the references).
References:
https://issues.apache.org/jira/browse/SPARK-18160
https://community.hortonworks.com/questions/82644/how-to-disable-spark-interpreter-in-zeppelin.html
https://github.com/cloudera/livy
https://issues.apache.org/jira/browse/ZEPPELIN-2405
https://zeppelin.apache.org/docs/0.7.1/manual/userimpersonation.html
https://zeppelin.apache.org/download.html
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html#Configurations