Pitfalls of writing to HBase 2.x from pyspark

Contents

    • Preface
    • Test procedure
    • Problem 1: class StringToImmutableBytesWritableConverter not found
    • Problem 2: method not found: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put
    • References

Preface

Recently I tried to read from and write to HBase 2.1 with pyspark 2.4.3 and ran into a few pitfalls, which I share here.

Test procedure

The Linux environment used has CDH 6 installed, with HBase 2.1 and Spark 2.2.0. A Python 3.5 virtual environment was created with Anaconda, and pyspark 2.4.3 was installed with pip. Start the pyspark shell and run the following Python code:

from pyspark import SparkContext
from pyspark import SparkConf

# Build a SparkConf and restart the SparkContext created by the pyspark shell
spark_conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "0").set("spark.kryoserializer.buffer.max", "2040mb")
sc.stop()
sc = SparkContext(appName='HBaseInputFormat', conf=spark_conf)

# ZooKeeper quorum of the HBase cluster and the target table
host = "10.210.110.24,10.210.110.129,10.210.110.130"
table = 'leo01'

# Converters from the spark-examples jar that turn (str, list[str]) pairs
# into the (ImmutableBytesWritable, Put) pairs expected by TableOutputFormat
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

# Hadoop job configuration for writing through TableOutputFormat
conf = {"hbase.zookeeper.quorum": host, "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
        "mapreduce.output.fileoutputformat.outputdir": "/tmp"}

# Each record is "rowkey,column_family,qualifier,value"
rawData = ['3,course,a100,200', '4,course,chinese,90']
print('Preparing to write data')
# key = rowkey (single-character here, so x[0] works), value = the full [rowkey, cf, qualifier, value] list
sc.parallelize(rawData) \
    .map(lambda x: (x[0], x.split(','))) \
    .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)

Problem 1: class StringToImmutableBytesWritableConverter not found

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
        at org.apache.spark.api.python.Converter$$anonfun$getInstance$1$$anonfun$1.apply(PythonHadoopUtil.scala:46)
        at org.apache.spark.api.python.Converter$$anonfun$getInstance$1$$anonfun$1.apply(PythonHadoopUtil.scala:45)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.api.python.Converter$$anonfun$getInstance$1.apply(PythonHadoopUtil.scala:45)
        at org.apache.spark.api.python.Converter$$anonfun$getInstance$1.apply(PythonHadoopUtil.scala:44)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.api.python.Converter$.getInstance(PythonHadoopUtil.scala:44)
        at org.apache.spark.api.python.PythonRDD$.getKeyValueConverters(PythonRDD.scala:470)
        at org.apache.spark.api.python.PythonRDD$.convertRDD(PythonRDD.scala:483)
        at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:580)
        at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

Solution: download spark-examples_2.11/1.6.0-typesafe-001.jar and add it to the --jars option when starting pyspark.
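
For example, assuming the jar has been downloaded to /opt/jars (an illustrative path, not one from the original setup) and that the file name follows the usual artifactId-version.jar convention, the shell launch would look roughly like this:

pyspark --jars /opt/jars/spark-examples_2.11-1.6.0-typesafe-001.jar

Jars passed via --jars are shipped to the executors as well, so the converter classes become visible on both the driver and the executor side.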

Problem 2: method not found: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put

java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put;
        at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:81)
        at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:77)
        at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
        at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
        at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
        at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
19/08/07 14:46:59 ERROR Utils: Aborting task

Solution: first track down the cause. The StringListToPutConverter class in spark-examples calls the Put.add method from hbase-client, and after the upgrade to HBase 2.x the hbase-client Put API changed: Put.add(byte[], byte[], byte[]) became Put.addColumn(byte[], byte[], byte[]). The fix is therefore to download the latest Spark source code, modify the StringListToPutConverter class, and rebuild spark-examples. The original source of StringListToPutConverter:

class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    put.add(output(1), output(2), output(3))
  }
}

Change it to:

class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    put.addColumn(output(1), output(2), output(3))
  }
}

Then rebuild it with Maven. The command used is: mvn clean install -e -X -pl :spark-examples_2.11_hardfixed. Note that the corresponding artifactId has to be changed in the pom.xml of the spark-examples sub-module.

Two problems come up during the build:

  1. The compiler plugin fails: make sure to build with a Java 1.8 JDK. Building under Java 12 produces compiler-plugin errors; switching to 1.8 got it through.
  2. The style checkers fail: scalastyle and the Maven checkstyle checks. These errors can be worked around with fixes found online; as a last resort, comment out the whole plugin in the root pom (see the note right after this list).
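
Besides editing the pom, these checks can usually also be skipped from the command line by appending -Dscalastyle.skip=true -Dcheckstyle.skip=true to the mvn invocation; I have not verified this against this particular build, so treat it as a possible shortcut rather than a guaranteed fix.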

If everything goes smoothly, the build should succeed and produce a spark-examples jar under the target directory that is compatible with the new API. Replace the original jar with it, and the write finally succeeds. The attached jar is the one I fixed; feel free to try it.
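
As a sanity check, the rows can be read back from the same pyspark shell. The sketch below follows the standard hbase_inputformat.py example that ships with spark-examples; the read-side converter class names and the hbase.mapreduce.inputtable key are taken from that example rather than verified against the rebuilt jar, and host and table are reused from the write snippet above:

# Read-side converters; assumed to live in the same spark-examples jar as the write converters
readKeyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
readValueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
read_conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=readKeyConv,
    valueConverter=readValueConv,
    conf=read_conf)
# Each element is (rowkey, converter output for that row); print them to confirm the two test rows landed
for k, v in hbase_rdd.collect():
    print(k, v)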

References

  1. http://dblab.xmu.edu.cn/blog/1715-2/
  2. https://www.hiwendi.com/detail/11/
  3. https://stackoverflow.com/questions/56001027/real-time-kafka-data-ingestion-into-hbase-via-pyspark-java-lang-nosuchmethoder