Configuring Apache Zeppelin with Apache Livy

I. Preparation

Download the officially built 0.7.1 package from the Zeppelin website, and download the following files from Maven:

jackson-core-2.6.3.jar

jackson-databind-2.6.3.jar

jackson-annotations-2.6.3.jar

Use the three Jackson jars above to replace the Jackson jars in $ZEPPELIN_HOME/lib.
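A minimal shell sketch of the swap (the download directory is an assumption; point it at wherever you saved the Maven jars):

# remove the bundled Jackson jars and drop in the 2.6.3 versions
cd $ZEPPELIN_HOME/lib
rm -f jackson-core-*.jar jackson-databind-*.jar jackson-annotations-*.jar
cp ~/downloads/jackson-core-2.6.3.jar ~/downloads/jackson-databind-2.6.3.jar ~/downloads/jackson-annotations-2.6.3.jar .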

Download and build Livy following its GitHub page:

mvn -X -e -DskipTests -Dspark-2.0 package

II. Deployment

Unpack the packages and configure them as follows.

1) LDAP login support and permission settings

Configure the LDAP authentication server settings in conf/shiro.ini:

activeDirectoryRealm= org.apache.shiro.realm.activedirectory.ActiveDirectoryRealm
activeDirectoryRealm.systemUsername =
activeDirectoryRealm.systemPassword =
activeDirectoryRealm.searchBase =
activeDirectoryRealm.url=

 

In the same file you can also grant admin privileges. With admin rights you can remove specific interpreters, for example removing the Spark interpreter to force users onto Livy. This only requires configuring the [roles] section, for example:

[roles]
admin=admin 
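
To back this up, access to the interpreter, configuration, and credential settings can be restricted to the admin role in the [urls] section of the same shiro.ini. The sketch below follows the sample entries shipped with Zeppelin; adapt it to your needs:

[urls]
/api/version = anon
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
/** = authc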

2) Proxy user (user impersonation) support

Although 0.7.1 already supports user impersonation according to the official documentation, it impersonates whatever user name was used to log in. Here users log in as [email protected], which contains the special character @, so add the following to zeppelin-env.sh to cut the impersonated user name down to the prefix before the @:

export ZEPPELIN_IMPERSONATE_CMD='echo ${ZEPPELIN_IMPERSONATE_USER} | cut -d \@ -f 1 |xargs -I {} sudo -H -u {} bash -c '
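
For illustration, the cut in the command above strips the domain part from the login name, so the impersonated system user becomes the plain prefix:

echo '[email protected]' | cut -d @ -f 1
# prints: heyang.wang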

You also need to modify bin/interpreter.sh so that the export above takes effect.

Change line 50 from:

ZEPPELIN_IMPERSONATE_RUN_CMD=$(eval "echo ${ZEPPELIN_IMPERSONATE_CMD} ")

to:

ZEPPELIN_IMPERSONATE_RUN_CMD=$ZEPPELIN_IMPERSONATE_CMD

The reason for the change is that the original echo, stacked on top of the echo inside the command, causes the command to be evaluated too early and fail; a detailed analysis has been filed with the official JIRA.

Besides the change above, because the logged-in user does not necessarily exist on the Zeppelin server, add the following code around line 46 of bin/interpreter.sh:

# create a system account for the impersonated user if it does not already exist
if ! id "$ZEPPELIN_IMPERSONATE_USER" >/dev/null 2>&1; then
  sudo useradd -r -s /bin/nologin "$ZEPPELIN_IMPERSONATE_USER"
fi

(The user that starts the Zeppelin server needs sudo privileges.)

 

3) Other settings

Default submit options for the Spark interpreter can be configured in conf/zeppelin-env.sh, for example:

export SPARK_SUBMIT_OPTIONS="--driver-memory 4096M --num-executors 3 --executor-cores 1 --executor-memory 2G"

Other basic settings such as the Spark and Hadoop locations, as well as the default Spark submission mode, also need to be set, as follows:

export SPARK_HOME=/usr/local/spark-2.0.0-bin-hadoop2.6
export MASTER=yarn-client
export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

Default settings for jobs submitted through Livy can be configured in livy.conf; anything that cannot be set there can also be set in $SPARK_HOME/conf/spark-defaults.conf.
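
A minimal sketch of what those two files might contain (keys and values below are illustrative; check the livy.conf.template shipped with your Livy build for the exact option names):

# conf/livy.conf
livy.server.port = 8998
livy.spark.master = yarn

# $SPARK_HOME/conf/spark-defaults.conf
spark.executor.instances   3
spark.executor.memory      2g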

4) Adjust permissions

Because after impersonation the logs are written to the logs folder under the proxy user's identity, do the following:

cd $ZEPPELIN_HOME

mkdir logs

chmod -R 777 logs

 

5) Livy log settings

Livy uses log4j for logging and by default only writes to the console; add a file appender to conf/log4j.properties to also write logs to a file.
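
A minimal sketch of such a conf/log4j.properties (the log file path and rotation sizes are assumptions; keep whatever console settings your Livy build ships with):

log4j.rootLogger=INFO, console, file

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# write logs to a rolling file in addition to the console
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/livy/livy-server.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n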

 

6) Settings for using Hive and %livy.sql with Livy

Set livy.repl.enableHiveContext = true in conf/livy.conf.

Copy hive-site.xml to livy/conf.
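
As a one-line sketch (the Hive config path and $LIVY_HOME are assumptions; adjust to your installation):

# make the Hive client configuration visible to Livy
cp /etc/hive/conf/hive-site.xml $LIVY_HOME/conf/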

Also upgrade Spark to a version higher than 2.0.1; otherwise you will encounter errors like the following:

java.io.FileNotFoundException: Added file file:/data/livy-hive/livy/conf/hive-site.xml does not exist. 

The cause is a bug when Spark submits jobs in cluster mode (see SPARK-18160 in the references below).

 

III. Common issues

1. Hive permission problem

Because Zeppelin reads and loads every file under conf/, when hive-site.xml is present the local tmp directory configured in it also gets initialized by Zeppelin. Without the proper permissions, the Zeppelin Spark interpreter fails when executing code with an error like the following:

java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: Permission denied
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:171)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:441)
at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:395)
at org.apache.spark.sql.SQLImplicits.rddToDatasetHolder(SQLImplicits.scala:163)
... 46 elided
Caused by: java.lang.RuntimeException: java.io.IOException: Permission denied
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:515)
... 70 more
Caused by: java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createTempFile(File.java:2024)
at org.apache.hadoop.hive.ql.session.SessionState.createTempFile(SessionState.java:818)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:513)
... 70 more

 

The cause is that the directory set by hive.exec.scratchdir does not have open permissions; after changing it to 777 the problem goes away.

This problem recurs every time Zeppelin is restarted, so the permissions have to be fixed manually each time.
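
A minimal sketch of the manual fix, assuming the scratch directory is /tmp/hive (substitute whatever hive.exec.scratchdir points to in your hive-site.xml):

chmod -R 777 /tmp/hive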

 

2. Zeppelin --proxy-user failure

Due to the current settings of the production cluster, no user other than the superuser hadoop has proxy-user permission, so having Zeppelin submit with spark --proxy-user leads to errors like the following:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: heyang.wang is not allowed to impersonate heyang.wang

The fix is to delete --proxy-user ${ZEPPELIN_IMPERSONATE_USER} from line 205 of bin/interpreter.sh; submitting the Spark job as the target proxy user achieves impersonation just as well.

 

3. Livy --proxy-user failure

In the Livy service everything is started by the same user, so --proxy-user is the only way to implement impersonation. The solution is to grant proxy-user permission specifically for the host that runs the Livy server, with the following configuration in Hadoop's core-site.xml:

 

<property>
    <name>hadoop.proxyuser.super.hosts</name>
    <value>host1,host2</value>
</property>
<property>
    <name>hadoop.proxyuser.super.groups</name>
    <value>group1,group2</value>
</property>

The configuration above has the following effect:

The superuser super (named in hadoop.proxyuser.$superuser.hosts) may send impersonation requests only from host1 and host2, and may impersonate only users belonging to the groups listed in hadoop.proxyuser.super.groups, i.e. the members of group1 and group2 in the example above.
Testing showed that the NameNode and YARN have to be restarted for these settings to take effect.
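
For reference, Hadoop also provides refresh commands that can sometimes apply proxy-user changes without a full restart (not verified in this setup):

hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration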

The configuration that needs to be updated here:


<property>
    <name>hadoop.proxyuser.zeppelin-dummy.hosts</name>
    <value>10.204.11.182,10.204.11.183</value>
</property>
<property>
    <name>hadoop.proxyuser.super.groups</name>
    <value>*</value>
</property>

 

If the settings above do not take effect, or the user that starts Livy does not have Hadoop user impersonation permission, an error like the following may appear:

17/06/07 21:49:16 ERROR RSCClient: Failed to connect to context.
java.util.concurrent.TimeoutException: Timed out waiting for context to start.
at com.cloudera.livy.rsc.ContextLauncher.connectTimeout(ContextLauncher.java:133)
at com.cloudera.livy.rsc.ContextLauncher.access$300(ContextLauncher.java:62)
at com.cloudera.livy.rsc.ContextLauncher$2.run(ContextLauncher.java:121)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
17/06/07 21:49:16 INFO RSCClient: Failing pending job 24ab6625-bbf1-4f68-8301-4c7ef3c47857 due to shutdown.
Exception in thread "Thread-34" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.cloudera.livy.util.LineBufferedStream$$anon$1.run(LineBufferedStream.scala:39)
17/06/07 21:49:16 DEBUG InteractiveSession: InteractiveSession 0 session state change from starting to error
17/06/07 21:49:16 INFO InteractiveSession: Stopping InteractiveSession 0...
17/06/07 21:49:16 DEBUG InteractiveSession: InteractiveSession 0 session state change from error to shutting_down
17/06/07 21:49:16 INFO InteractiveSession: Failed to ping RSC driver for session 0. Killing application.
17/06/07 21:50:16 WARN SparkYarnApp: Deleting a session while its YARN application is not found.
17/06/07 21:50:16 ERROR SparkYarnApp: Error whiling refreshing YARN state: java.lang.Exception: spark-submit exited with code 143}.

 

The log only says the Spark job failed without giving the reason, but starting Livy with a user that does have Hadoop user impersonation permission resolves the problem.

 

4. Settings cannot be saved when editing the Zeppelin Spark interpreter, and an empty red alert box appears in the upper-right corner.

The cause is that Zeppelin 0.7.1 is built with Java 8; when zeppelin-env.sh points to Java 7, a compatibility error occurs.

The solution is to switch to Java 8.

However, the newer Zeppelin 0.7.2 has switched back to Java 7, so using the newer Zeppelin may avoid this problem.
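
If you stay on 0.7.1, a minimal zeppelin-env.sh entry pointing Zeppelin at a Java 8 installation could look like this (the path is illustrative):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64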

 

5. Starting the Spark interpreter throws java.lang.NullPointerException, and the backend log reports that the Jackson version is too old.

The solution is to replace the old Jackson jars with the new ones as described at the beginning of this article.

 

6. Missing memory unit when setting Spark resources for Livy

If a Spark resource setting in the Zeppelin Livy interpreter is written without its unit, the YARN application master fails at startup with the following error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
        at org.apache.xerces.dom.DeferredDocumentImpl.synchronizeChildren(Unknown Source)
        at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeChildren(Unknown Source)
        at org.apache.xerces.dom.ParentNode.hasChildNodes(Unknown Source)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2551)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2444)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2361)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:968)
        at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:987)
        at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1388)
        at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:70)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:272)
        at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:311)
        at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:55)
        at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:56)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at java.lang.Class.newInstance(Class.java:442)
        at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:414)
        at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:412)
        at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:412)
        at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:437)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:747)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)

The cause is that a memory option such as spark.executor.memory was defined without its unit; writing just a number with no G or M suffix produces the error above.
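
For example, in the Zeppelin Livy interpreter settings the memory-related properties should carry an explicit unit (the property names below follow the Zeppelin Livy interpreter convention; values are illustrative):

livy.spark.driver.memory    4g
livy.spark.executor.memory  2g     # correct: unit included
# livy.spark.executor.memory  2    # wrong: no unit, triggers the GC overhead error above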

 

7. Livy returns truncated error logs

When Livy is used as the backend, the log returned for a failed program often shows only one line, for example:

:113: error: overloaded method value createDataFrame with alternatives:

But the full error log may look like this:

:56: error: overloaded method value createDataFrame with alternatives:
(data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame
(rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
(rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
(rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.types.StructType)
val testDF = spark.createDataFrame(rdds, schema)

 

Only the first line is displayed because when the Livy server returns a result it puts the first line in the evalue field and the rest in the traceback field, and Zeppelin only shows the evalue field.

In the source code you can see that the traceback field is defined but never actually output.

 

A PR has already been submitted; the next Livy release will not have this problem.

 

8. NullPointerException when obtaining the Spark context

java.lang.IncompatibleClassChangeError: class org.objectweb.asm.tree.ClassNode has interface org.objectweb.asm.ClassVisitor as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)

If an error like the one above appears when using the Zeppelin notebook, a likely cause is that the company virtual machines set a default JAVA_OPTS that caps the JVM heap size, so when Zeppelin launches a Spark job in client mode the resources it requests from YARN exceed the default JAVA_OPTS limit and the launch fails. The solution is to set JAVA_OPTS to empty in zeppelin-env.sh.
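
The corresponding zeppelin-env.sh line is simply (a sketch of the fix described above):

export JAVA_OPTS=""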

 

9. Using Spark SQL in %livy.spark or %livy.sql to query Hive returns empty results, or the YARN log for the corresponding session shows an error like the following:

java.lang.RuntimeException: Stream '/classes/org/apache/spark/sql/catalyst/expressions/Object.class' was not found.

The cause is that the Hive-related settings for Livy were not configured; see 6) in the deployment section above.

References:

https://issues.apache.org/jira/browse/SPARK-18160

https://community.hortonworks.com/questions/82644/how-to-disable-spark-interpreter-in-zeppelin.html

https://github.com/cloudera/livy

https://issues.apache.org/jira/browse/ZEPPELIN-2405

https://zeppelin.apache.org/docs/0.7.1/manual/userimpersonation.html

https://zeppelin.apache.org/download.html

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html#Configurations

