Recently I tried to use pyspark 2.4.3 to read from and write to HBase 2.1 and ran into a few pitfalls along the way. This post shares them, together with the workarounds.
The Linux environment runs CDH 6, with HBase 2.1 and Spark 2.2.0 installed. I created a Python 3.5 virtual environment with Anaconda and pip-installed pyspark 2.4.3. Start the pyspark shell and run the following Python code:
from pyspark import SparkConf
from pyspark import SparkContext

# Pin PYTHONHASHSEED on the executors and enlarge the Kryo buffer
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "0").set("spark.kryoserializer.buffer.max", "2040mb")
# The pyspark shell already created a SparkContext; stop it and rebuild it with this conf
sc.stop()
sc = SparkContext(appName='HBaseInputFormat', conf=conf)
host = "10.210.110.24,10.210.110.129,10.210.110.130"
table = 'leo01'
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf={"hbase.zookeeper.quorum": host, "hbase.mapred.outputtable": table,
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
"mapreduce.output.fileoutputformat.outputdir": "/tmp"}
rawData = ['3,course,a100,200','4,course,chinese,90']
print('准备写入数据')
sc.parallelize(rawData).map(lambda x: (x[0],x.split(','))).saveAsNewAPIHadoopDataset(conf=conf,keyConverter=keyConv,valueConverter=valueConv)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.ClassNotFoundException: org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.api.python.Converter$$anonfun$getInstance$1$$anonfun$1.apply(PythonHadoopUtil.scala:46)
at org.apache.spark.api.python.Converter$$anonfun$getInstance$1$$anonfun$1.apply(PythonHadoopUtil.scala:45)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.api.python.Converter$$anonfun$getInstance$1.apply(PythonHadoopUtil.scala:45)
at org.apache.spark.api.python.Converter$$anonfun$getInstance$1.apply(PythonHadoopUtil.scala:44)
at scala.Option.map(Option.scala:146)
at org.apache.spark.api.python.Converter$.getInstance(PythonHadoopUtil.scala:44)
at org.apache.spark.api.python.PythonRDD$.getKeyValueConverters(PythonRDD.scala:470)
at org.apache.spark.api.python.PythonRDD$.convertRDD(PythonRDD.scala:483)
at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:580)
at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Solution: the missing converter classes live in the spark-examples jar, so download spark-examples_2.11/1.6.0-typesafe-001.jar and add it after --jars when starting pyspark.
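For reference, the launch looks roughly like this (the jar path below is illustrative; point it at wherever you saved the downloaded file):

pyspark --jars /path/to/spark-examples_2.11-1.6.0-typesafe-001.jar

With that jar on the classpath the ClassNotFoundException goes away, but rerunning the same write raises a second error: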
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.add([B[B[B)Lorg/apache/hadoop/hbase/client/Put;
at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:81)
at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:77)
at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:181)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:129)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/08/07 14:46:59 ERROR Utils: Aborting task
Solution: first track down the cause. The StringListToPutConverter class in spark-examples (examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala) calls the Put.add method from hbase-client, and after the upgrade to HBase 2 that interface changed: Put.add(byte[], byte[], byte[]) became Put.addColumn(byte[], byte[], byte[]). The fix is therefore to download the latest Spark source, patch StringListToPutConverter, and rebuild spark-examples. The original source of StringListToPutConverter:
class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    put.add(output(1), output(2), output(3))
  }
}
Change it to:
class StringListToPutConverter extends Converter[Any, Put] {
  override def convert(obj: Any): Put = {
    val output = obj.asInstanceOf[java.util.ArrayList[String]].asScala.map(Bytes.toBytes).toArray
    val put = new Put(output(0))
    put.addColumn(output(1), output(2), output(3))
  }
}
Then rebuild the module with Maven; the command I used was: mvn clean install -e -X -pl :spark-examples_2.11_hardfixed
Note that you also need to change the artifactId in the spark-examples subproject's pom.xml accordingly (here, to spark-examples_2.11_hardfixed, so that it matches the -pl argument above).
There were a couple of hiccups during packaging as well; once those are out of the way, the build should succeed and leave a spark-examples jar compatible with the new API under the target directory. Replace the original jar with it, and the write finally succeeds. The jar I patched and rebuilt is attached in case you want to use it directly.
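As a quick sanity check, you can read the table back through the companion converters in the same spark-examples jar. The snippet below is only a sketch under a few assumptions: it reuses the host and table variables from the same shell session, the rebuilt jar is still on --jars, and the ImmutableBytesWritableToStringConverter / HBaseResultToStringConverter classes from the pythonconverters package behave against HBase 2.1 as they do against older clients.

# Read-back sketch: scan the table we just wrote (assumes the same shell session)
readConf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=readConf)
# Each element is (rowkey, newline-separated JSON strings, one per cell)
print(hbase_rdd.collect())

If the two rows written above ('3' and '4') come back, the whole round trip through the rebuilt jar is working end to end.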