Enter the Spark shell:
spark-shell \
--packages org.apache.spark:spark-avro_2.11:2.4.6,org.apache.avro:avro:1.8.2 \
--repositories http://maven.aliyun.com/nexus/content/groups/public \
--jars $HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar,/data/opt/hudi/hudi-spark-bundle_2.11-0.6.0.jar \
--conf spark.driver.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf spark.executor.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Spark program:
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
Because I am using Spark 2.4.6 with Hudi 0.6.0, two compatibility issues come up:
1) java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
This happens because the Avro version shipped with this Spark release is 1.7.7, which does not contain the LogicalType class; that class only appeared in org.apache.avro:avro:1.8.0 and later. So the Avro dependency used by Spark needs to be upgraded.
2) NoSuchMethodError: org.apache.avro.Schema.createUnion
This is most likely because the executors cannot find the required dependency after the task is dispatched to them; running the program in local mode avoids it.
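To see which Avro jar actually wins on the driver classpath, you can inspect the code source of org.apache.avro.Schema from inside the spark-shell. This is a minimal diagnostic sketch (not from the original article); it assumes nothing beyond an Avro class being loadable:

```scala
// Run inside spark-shell. Prints the jar that Schema was loaded from,
// which tells you whether the bundled 1.7.7 jar or the 1.8.2 jar is in use.
println(classOf[org.apache.avro.Schema]
  .getProtectionDomain.getCodeSource.getLocation)

// Referencing LogicalType directly is an even sharper check: it throws
// NoClassDefFoundError when Avro 1.7.x is still on the classpath.
println(classOf[org.apache.avro.LogicalType]
  .getProtectionDomain.getCodeSource.getLocation)
```

If the printed path points at the Spark-bundled avro-1.7.7 jar, the extraClassPath settings are not taking effect.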
Solution: upgrade the Avro version and run spark-shell in local mode
spark-shell \
--master local[2] \
--packages org.apache.spark:spark-avro_2.11:2.4.6,org.apache.avro:avro:1.8.2 \
--repositories http://maven.aliyun.com/nexus/content/groups/public \
--jars $HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar,/data/opt/hudi/hudi-spark-bundle_2.11-0.6.0.jar \
--conf spark.driver.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf spark.executor.extraClassPath=$HOME/.ivy2/jars/org.apache.avro_avro-1.8.2.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Alternatively, you can place org.apache.avro_avro-1.8.2.jar under the Spark lib directory on every node of the cluster (or on HDFS):
spark-shell \
--jars /usr/hdp/3.0.1.0-187/spark2/jars/hudi-spark-bundle_2.11-0.6.0.jar \
--conf spark.driver.extraClassPath=/usr/hdp/3.0.1.0-187/spark2/jars/org.apache.avro_avro-1.8.2.jar \
--conf spark.executor.extraClassPath=/usr/hdp/3.0.1.0-187/spark2/jars/org.apache.avro_avro-1.8.2.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Note: for an application that writes to HDFS:
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "hdfs:///tmp/hudi_trips_cow" // write to HDFS
val dataGen = new DataGenerator
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
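To confirm the write succeeded, you can read the table back with a snapshot query, following the Hudi quickstart pattern for 0.6.x (the path glob matches the default partition layout; the column names come from the quickstart DataGenerator):

```scala
// Snapshot query: load the freshly written COW table and inspect a few columns.
val tripsSnapshotDF = spark.read.
  format("hudi").
  load(basePath + "/*/*/*/*")  // glob over the default region/country/city partitions

tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

// Should return the 10 generated trip records.
spark.sql("select uuid, partitionpath, rider, ts from hudi_trips_snapshot").show(false)
```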
If anything in this article is poorly written or missing, please feel free to point it out. Thanks for reading.