Spark 2.3 introduces several useful new features, such as ORC read/write optimizations, bucket join with SQL, and Continuous Processing. Since our workloads need some of these features, we prepared an upgrade: the patches already carried on the Spark version running in production were merged onto the spark v2.3.2 branch, and the resulting build was deployed to the test environment for verification.
In the test environment, run the following code in spark-shell:
> spark.sql("select count(*) from test_db.test_table where dt='20180924'").show
java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
at org.apache.spark.sql.execution.SparkPlan.org$apache$spark$sql$execution$SparkPlan$$decodeUnsafeRows(SparkPlan.scala:274)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:366)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided
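The message itself is worth a close look before touching the environment: the JVM is not saying the class is missing, it is saying the constructor LZ4BlockInputStream(InputStream, boolean) is missing, and (to my knowledge) that two-argument constructor only exists in lz4-java 1.4.x, while older net.jpountz.lz4:lz4 artifacts expose the same class name without it. A minimal reflection sketch, runnable in the same spark-shell, shows which jar actually supplied the class and whether the expected constructor is visible:
// Paste into the failing spark-shell: purely diagnostic, no Spark APIs involved.
val lz4Cls = Class.forName("net.jpountz.lz4.LZ4BlockInputStream")
println("loaded from: " + lz4Cls.getProtectionDomain.getCodeSource.getLocation)
val hasTwoArgCtor =
  try { lz4Cls.getConstructor(classOf[java.io.InputStream], java.lang.Boolean.TYPE); true }
  catch { case _: NoSuchMethodException => false }
println("LZ4BlockInputStream(InputStream, boolean) present: " + hasTwoArgCtor)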
Analyze with the following commands:
# Get the spark-shell process id ${pid}
$ jps -ml | grep -i spark-shell
# Use jinfo to inspect the process's system properties (including java.class.path)
$ jinfo ${pid} | grep -i "lz4"
# The classpath contains one and only one lz4 jar: /xxx/spark-2.3.2-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar
# Next, check whether the class the JVM cannot resolve is actually inside that jar
$ grep 'net\/jpountz\/lz4\/LZ4BlockInputStream' /xxx/spark-2.3.2-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar
# Output: Binary file /xxx/spark-2.3.2-vip-1.0.0-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar matches; so the required class is indeed present on the classpath
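Incidentally, the same check can be run from inside the spark-shell itself instead of jinfo. A small sketch that reads the java.class.path system property (note this only covers the launch classpath, not jars added later through separate class loaders):
// List every launch-classpath entry whose name mentions lz4;
// the in-process equivalent of `jinfo ${pid} | grep -i lz4`.
sys.props("java.class.path")
  .split(java.io.File.pathSeparator)
  .filter(_.toLowerCase.contains("lz4"))
  .foreach(println)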
The analysis above shows that the class the JVM fails to resolve is indeed present on the process classpath, so why can it still not be resolved?
Could it be that the jar was never actually loaded?
# ${pid} is your JVM process id; use pgrep or jps to find it.
$ /usr/sbin/lsof -p ${pid} | grep lz4
java 100541 hdfs mem REG 8,3 59545 23294926 /tmp/liblz4-java1940655373971688476.so
java 100541 hdfs mem REG 8,3 370119 21442570 /xxx/spark-2.3.2-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar
java 100541 hdfs 171r REG 8,3 370119 21442570 /xxx/spark-2.3.2-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar
The output above confirms that the jar has indeed been loaded by the JVM, and no lz4*.jar with a conflicting version shows up.
So why does resolution still fail?
Since it is now beyond doubt that the JVM cannot resolve net.jpountz.lz4.LZ4BlockInputStream.<init>, the only remaining lead is the other external dependencies loaded by the JVM: a jar whose file name never mentions lz4 can still bundle conflicting lz4 classes, so every loaded jar needs to be checked for the class in question.
$ /usr/sbin/lsof -p ${pid} | grep '\.jar' | awk '{print $9}' | grep -v 'jdk' | xargs grep -w 'net\/jpountz\/lz4\/LZ4BlockInputStream'
# output:
# Binary file /xxx/spark-2.3.2-vip-1.0.0-bin-2.6.0-cdh5.13.1/jars/lz4-java-1.4.0.jar matches
# Binary file /xxx/yyy/spark-plugins-0.1.0-SNAPSHOT-jar-with-dependencies.jar matches
At this point the root cause is basically clear: the spark-plugins module bundles its own lz4 classes, which shadow Spark's lz4-java-1.4.0.jar and trigger the NoSuchMethodError on net.jpountz.lz4.LZ4BlockInputStream.<init>.
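The same question, namely which classpath entries actually ship this class, can also be answered from inside the running JVM. A sketch using ClassLoader.getResources, which returns every location that provides a given resource (more than one URL here means competing copies of the class):
// Enumerate every classpath location that contains
// net/jpountz/lz4/LZ4BlockInputStream.class; the in-process
// counterpart of the lsof + grep pipeline above.
import scala.collection.JavaConverters._
Thread.currentThread.getContextClassLoader
  .getResources("net/jpountz/lz4/LZ4BlockInputStream.class")
  .asScala
  .foreach(println)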
Open the spark-plugins project and inspect its dependency tree; the relevant part looks like this:
......
[INFO] +- org.apache.kafka:kafka-clients:jar:0.9.0.1:compile
[INFO] | +- org.xerial.snappy:snappy-java:jar:1.1.1.7:compile
[INFO] | \- net.jpountz.lz4:lz4:jar:1.2.0:compile
[INFO] \- org.apache.commons:commons-pool2:jar:2.4.1:compile
......
The fix is to modify the kafka-clients dependency in pom.xml and exclude its transitive lz4 dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>${kafka.version}</version>
    <exclusions>
        <exclusion>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
        </exclusion>
    </exclusions>
</dependency>
Rebuild spark-plugins-0.1.0-SNAPSHOT-jar-with-dependencies.jar, deploy it to the test environment, and restart spark-shell.
The test passes; the jar conflict is fixed.
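A quick way to double-check that the exclusion really took effect is to inspect the rebuilt fat jar for bundled net.jpountz.lz4 classes; a sketch (point the path at your own rebuilt artifact):
// Open the rebuilt jar-with-dependencies and list any bundled lz4 classes;
// an empty result means kafka-clients' transitive lz4 was excluded as intended.
import java.util.jar.JarFile
import scala.collection.JavaConverters._
val fatJar = new JarFile("/xxx/yyy/spark-plugins-0.1.0-SNAPSHOT-jar-with-dependencies.jar")
val lz4Entries = fatJar.entries.asScala.map(_.getName).filter(_.startsWith("net/jpountz/lz4/")).toList
fatJar.close()
if (lz4Entries.isEmpty) println("no bundled net.jpountz.lz4 classes") else lz4Entries.foreach(println)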
Today most components of the big-data ecosystem run on the JVM, and jar dependency conflicts come up all the time. Having a quick, repeatable way to analyze and localize this kind of problem keeps day-to-day development moving, hence this write-up. There may well be mistakes in it; corrections and feedback are welcome.
2018.9.29