Spark Development Issue Log

I. Spark on YARN client mode

1. JavaSparkContext not serializable

Fix:
JavaSparkContext is not serializable, and it is not meant to be: it must not be used inside functions that are shipped to remote workers. Mark the JavaSparkContext field static; serialization ignores static fields, i.e. their state is not written out when the enclosing object is serialized. Fields marked transient are likewise excluded from serialization.
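
A minimal Java sketch (class and field names are illustrative): keep the context in a static field of the driver-side class so closures sent to executors never capture it.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.Serializable;
import java.util.Arrays;

public class DriverJob implements Serializable {
    // static: never serialized into task closures, so the non-serializable
    // JavaSparkContext is not shipped to the workers (transient has the same effect)
    private static JavaSparkContext jsc;

    public static void main(String[] args) {
        jsc = new JavaSparkContext(new SparkConf().setAppName("driver-job"));

        long count = jsc.parallelize(Arrays.asList(1, 2, 3))
                        // the lambda only captures serializable values, never jsc
                        .filter(n -> n > 1)
                        .count();
        System.out.println(count);

        jsc.stop();
    }
}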

2. Calling the ES API throws: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out…

This error usually means the ES save API and the DStream element type do not match, so the conversion fails; it can of course also be a conversion problem in the data itself. Pick the API that matches the stream type (a sketch follows the list):

  1. saveToEs expects a DStream whose elements are a Map (either a Scala or a Java one), a JavaBean, or a Scala case class.
  2. saveJsonToEs expects a DStream of JSON strings.
  3. saveToEsWithMeta expects a pair DStream (metadata plus document).
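
A minimal Java sketch, assuming the Java streaming API (JavaEsSparkStreaming) shipped with elasticsearch-hadoop is on the classpath; the node address and the "logs/doc" index/type are illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;

import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class EsStreamingExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("es-streaming")
                .set("es.nodes", "127.0.0.1")
                .set("es.port", "9200");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // saveToEs: elements are Maps (or JavaBeans / Scala case classes)
        Map<String, String> doc = new HashMap<>();
        doc.put("message", "hello");
        Queue<JavaRDD<Map<String, String>>> mapQueue = new LinkedList<>();
        mapQueue.add(jssc.sparkContext().parallelize(Collections.singletonList(doc)));
        JavaDStream<Map<String, String>> mapStream = jssc.queueStream(mapQueue);
        JavaEsSparkStreaming.saveToEs(mapStream, "logs/doc");

        // saveJsonToEs: elements are already-serialized JSON strings
        Queue<JavaRDD<String>> jsonQueue = new LinkedList<>();
        jsonQueue.add(jssc.sparkContext().parallelize(
                Collections.singletonList("{\"message\":\"hello\"}")));
        JavaDStream<String> jsonStream = jssc.queueStream(jsonQueue);
        JavaEsSparkStreaming.saveJsonToEs(jsonStream, "logs/doc");

        jssc.start();
        jssc.awaitTermination();
    }
}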

3. java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

Change the dependency's Maven scope from "provided" to "compile", for example:
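
For instance, for the Spark core dependency (the artifact id and version property are illustrative; use the ones already declared in your pom):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>${spark.version}</version>
  <!-- "provided" keeps Spark classes off the runtime classpath; "compile" bundles them -->
  <scope>compile</scope>
</dependency>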

4. After adding elasticsearch-hadoop 5.2.2, its log4j jar conflicts with Spark's log4j jar. The error is as follows:

Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowError

Fix: add exclusions to the dependency:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>5.2.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>log4j-over-slf4j</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
  </exclusions>
</dependency>

II. Spark on YARN cluster mode

1. Spark SQL fails with the following error
Versions: Spark 2.2.0, CDH 5.10.2

Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsAdmin.getKeyProvider()Lorg/apache/hadoop/crypto/key/KeyProvider;
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:1706)
	... 68 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsAdmin.getKeyProvider()Lorg/apache/hadoop/crypto/key/KeyProvider;
	at org.apache.hadoop.hive.shims.Hadoop23Shims$HdfsEncryptionShim.<init>(Hadoop23Shims.java:1265)
	at org.apache.hadoop.hive.shims.Hadoop23Shims.createHdfsEncryptionShim(Hadoop23Shims.java:1407)
	at org.apache.hadoop.hive.ql.session.SessionState.getHdfsEncryptionShim(SessionState.java:464)
	at org.apache.hadoop.hive.ql.metadata.Hive.needToCopy(Hive.java:2973)
	at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2874)
	at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:3199)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1465)
	at org.apache.hadoop.hive.ql.metadata.Hive$2.call(Hive.java:1685)
	at org.apache.hadoop.hive.ql.metadata.Hive$2.call(Hive.java:1676)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more

Cause:
The Spark SQL and Hive versions are incompatible: Spark SQL 2.2.0 is built against Hive 1.2.1 by default, while CDH 5.10.2 uses Hive 1.1.0.
Fix:
Explicitly declare the Hive dependencies in the project, with ${hive.version} set to your cluster's Hive version, as follows:

<!--hive begin-->
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-metastore</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-service</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-cli</artifactId>
            <version>${hive.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.thrift</groupId>
            <artifactId>libfb303</artifactId>
            <version>0.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.thrift</groupId>
            <artifactId>libthrift</artifactId>
            <version>0.9.0</version>
        </dependency>
        <!--hive end-->

III. Standalone mode

1. Error creating the checkpoint directory on HDFS

Exception while invoking mkdirs of class ClientNamenodeProtocolTranslatorPB over 10.107.99.217/10.107.99.217:9000 after 10 fail over attempts. Trying to fail over after sleeping for 13156ms.
java.net.ConnectException: Call From VM-100-181-centos/127.0.0.1 to 10.107.99.217:9000 failed on 

Cause: the code enables checkpointing, but in cluster mode the checkpoint directory lives on HDFS by default; because the HDFS service was unreachable, creating the directory failed.
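
A minimal Java sketch of setting the checkpoint directory explicitly; the NameNode URI, path, and socket source are illustrative. The point is that the directory passed to checkpoint() must sit on a reachable HDFS (or other fault-tolerant store) when running on a cluster.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("checkpoint-example");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Checkpoints go to fault-tolerant storage (HDFS in cluster mode); the
        // NameNode address/path below is illustrative and must be reachable,
        // otherwise directory creation fails with the ConnectException above.
        jssc.checkpoint("hdfs://10.107.99.217:9000/spark/checkpoints");

        // Illustrative source: text lines from a socket (e.g. `nc -lk 9999`).
        jssc.socketTextStream("localhost", 9999).print();

        jssc.start();
        jssc.awaitTermination();
    }
}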

2. When running Spark in standalone mode, growing data volume sometimes triggers one of the following two errors:

java.lang.OutOfMemoryError: Java heap space

java.lang.OutOfMemoryError: GC overhead limit exceeded

The cause is insufficient driver memory: when no driver memory is specified, the default allocation is 512 MB. Specify it when submitting with spark-submit, e.g. --driver-memory 2g:
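
For example (the master URL, main class, and jar name are placeholders):

spark-submit \
  --master spark://<master-host>:7077 \
  --class com.example.MyApp \
  --driver-memory 2g \
  my-app.jar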

3. Spark to ES: only a single ES node was configured, and failing to reach it caused the following error

Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
        at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:250)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:546)
        at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
        at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
        at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        ... 3 more
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [] failed; server[10.255.10.xxx:9200] returned [503|Service Unavailable:]

Fix: configure multiple ES nodes, e.g. es.nodes = 10.255.10.xxx,10.255.10.xxx,10.107.99.xxx,10.107.103.xxx. A sketch of setting this on the SparkConf follows.
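
A minimal Java sketch; the node list reuses the placeholder addresses from above, and es.port / es.nodes.wan.only are standard elasticsearch-hadoop settings (the latter only for WAN/cloud clusters, as the error message suggests).

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EsNodesExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("es-nodes-example")
                // several nodes, so version discovery does not depend on a single host;
                // the xxx placeholders stand for the real addresses
                .set("es.nodes", "10.255.10.xxx,10.255.10.xxx,10.107.99.xxx,10.107.103.xxx")
                .set("es.port", "9200");
        // For a WAN/cloud cluster the error message also suggests:
        // conf.set("es.nodes.wan.only", "true");

        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... build RDDs/DStreams and write them with JavaEsSpark / JavaEsSparkStreaming ...
        jsc.stop();
    }
}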
