Notes from debugging Hudi Hive sync on the code side

Preface

I wrote a post earlier, hudi-hive-sync, which mentioned that there are two ways to sync a Hudi table to Hive; take a look if you are interested.

The first approach in that post had a small problem. Hudi supports Hive 2.1.1, while our test environment at the time ran Hive 1.2.1, so we simply assumed the error from approach one was caused by the Hive version mismatch. Since the environment could not be changed on a whim, we never looked into it further.

Not long ago our test environment was upgraded and Hive moved to 2.1.1, yet the same code still failed with the same error, so I spent some time digging into Hudi's Hive sync and am writing down what I found.

Code

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
// note: the key generator package moved between Hudi releases; in some 0.5.x builds the class lives directly in org.apache.hudi
import org.apache.hudi.keygen.SimpleKeyGenerator
import org.apache.spark.sql.{SaveMode, SparkSession}

object HiveSyncPartition {
  def main(args: Array[String]): Unit = {
    // build the SparkSession
    val spark = SparkSession
      .builder
      .appName("delta hiveSync")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .master("local[3]")
      .getOrCreate()

    // read the source parquet files
    val upsertData = spark.read.parquet("/tmp/partition/*")

    upsertData.write.format("org.apache.hudi")
      // record key (primary key)
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rowkey")
      // precombine field used to pick the latest version of a record on update
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "lastupdatedttm")
      // partition path field
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "p_create_time")
      // key generator for Hudi record keys
      .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, classOf[SimpleKeyGenerator].getName)
      // whether a record's partition path may change when the record is updated
      .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
      // index type; global bloom is used here
      .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
      // Hive database to sync to; note that the database must already exist, Hudi will not create it
      .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "hid0101_cache_xdcs_pacs_hj")
      // Hive table to sync to
      .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "merge_test13")
      // enable Hive sync
      .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
      // Hive partition field
      .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "deta_pre")
      // HiveServer2 JDBC url
      .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://192.168.0.112:10000")
      // partition value extractor; partitioned and non-partitioned tables need different extractors
      .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
      // shuffle parallelism
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      .option(HoodieWriteConfig.TABLE_NAME, "nuwa_hudi_partition")
      .mode(SaveMode.Append)
      .save("/tmp/test2")
  }
}

After adapting the code above to your own environment it will run, but it will fail with the errors below.

error 1

org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS `hid0101_cache_xdcs_pacs_hj`.`merge_test13`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `age` string, `birthday` string, `create_time` string, `id` string, `isdeleted` string, `lastupdatedttm` string, `name` string, `p_create_time` string, `rowkey` string, `sex` string) PARTITIONED BY (`date_prt` String) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/datalake/hid0101_cache_xdcs_pacs_hj_test/merge_test13'
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:482)
	at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:114)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
	at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at com.clb.utils.HoodieImportHelper$.hudiUpsert(HoodieImportHelper.scala:159)
	at com.clb.HoodieImportHandler$$anonfun$importDataToHudi$1.apply$mcV$sp(HoodieImportHandler.scala:103)
	at scala.util.control.Breaks.breakable(Breaks.scala:38)
	at com.clb.HoodieImportHandler$.importDataToHudi(HoodieImportHandler.scala:65)
	at com.clb.HoodieImportHandlerTest.testImportDataToHudiPartition(HoodieImportHandlerTest.scala:48)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:480)
	... 55 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
	at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:329)
	at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:207)
	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
	at org.apache.hive.service.cli.operation.Operation.run(Operation.java:260)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:504)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:490)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
	at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:295)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:507)
	at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
	at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Cannot find class 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
	at org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:263)
	at org.apache.hadoop.hive.ql.parse.StorageFormat.fillStorageFormat(StorageFormat.java:57)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:11665)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10838)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10948)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10638)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:250)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:603)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1425)
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1398)
	at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:205)
	... 27 more
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.hadoop.HoodieParquetInputFormat
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:261)
	... 37 more
Cause

Hive's classpath does not contain org.apache.hudi.hadoop.HoodieParquetInputFormat, which is Hudi's own input format referenced by the generated CREATE TABLE statement.

Fix
1. Copy hudi-hadoop-mr-0.5.2-incubating.jar into Hive's auxiliary library directory, for example:
   cp /opt/software/hudi-hadoop-mr-0.5.2-incubating.jar $HIVE_HOME/auxlib/
2. Restart Hive.
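
Before re-running the whole job, a quick way to confirm the jar is now visible to HiveServer2 is to issue a minimal DDL that uses the same input format over JDBC. This is only a sketch: the connection details, credentials and the throwaway table name are assumptions, adjust them for your environment.

import java.sql.DriverManager

object CheckHudiInputFormat {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // assumed connection details; point at the same HiveServer2 the sync uses
    val conn = DriverManager.getConnection("jdbc:hive2://192.168.0.112:10000", "hive", "")
    val stmt = conn.createStatement()
    try {
      // this DDL fails with "Cannot find class ..." if hudi-hadoop-mr is still missing on the server side
      stmt.execute(
        """CREATE EXTERNAL TABLE IF NOT EXISTS default.hudi_fmt_check (id string)
          |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
          |STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
          |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'""".stripMargin)
      stmt.execute("DROP TABLE IF EXISTS default.hudi_fmt_check")
      println("HoodieParquetInputFormat is visible to HiveServer2")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}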

error 2

org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL CREATE EXTERNAL TABLE  IF NOT EXISTS `hid0101_cache_xdcs_pacs_hj`.`merge_test13`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `age` string, `birthday` string, `create_time` string, `id` string, `isdeleted` string, `lastupdatedttm` string, `name` string, `p_create_time` string, `rowkey` string, `sex` string) PARTITIONED BY (`date_prt` String) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION '/datalake/hid0101_cache_xdcs_pacs_hj_test/merge_test13'
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:482)
	at org.apache.hudi.hive.HoodieHiveClient.createTable(HoodieHiveClient.java:272)
	at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:146)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:114)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
	at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at com.clb.utils.HoodieImportHelper$.hudiUpsert(HoodieImportHelper.scala:159)
	at com.clb.HoodieImportHandler$$anonfun$importDataToHudi$1.apply$mcV$sp(HoodieImportHandler.scala:103)
	at scala.util.control.Breaks.breakable(Breaks.scala:38)
	at com.clb.HoodieImportHandler$.importDataToHudi(HoodieImportHandler.scala:65)
	at com.clb.HoodieImportHandlerTest.testImportDataToHudiPartition(HoodieImportHandlerTest.scala:48)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: org/apache/hudi/common/table/TableFileSystemView$BaseFileOnlyView
	at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
	at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
	at org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:480)
	... 55 more
Caused by: org.apache.hive.service.cli.HiveSQLException: Error running query: java.lang.NoClassDefFoundError: org/apache/hudi/common/table/TableFileSystemView$BaseFileOnlyView
	at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:239)
	at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
	at org.apache.hive.service.cli.operation.Operation.run(Operation.java:260)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:504)
	at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:490)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
	at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
	at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
	at com.sun.proxy.$Proxy35.executeStatementAsync(Unknown Source)
	at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:295)
	at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:507)
	at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
	at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/hudi/common/table/TableFileSystemView$BaseFileOnlyView
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.ensureClassExists(ParseUtils.java:261)
	at org.apache.hadoop.hive.ql.parse.StorageFormat.fillStorageFormat(StorageFormat.java:57)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:11665)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genResolvedParseTree(SemanticAnalyzer.java:10838)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10948)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10638)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:250)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:603)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1425)
	at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1398)
	at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:205)
	... 27 more
Caused by: java.lang.ClassNotFoundException: org.apache.hudi.common.table.TableFileSystemView$BaseFileOnlyView
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 40 more
Cause

In theory org.apache.hudi.common.table.TableFileSystemView$BaseFileOnlyView really does exist in our application, and the error is only thrown once the metadata sync starts, so my guess is that the Hive server side cannot find this class while it processes the sync DDL.

Fix
1. Copy hudi-common-0.5.2-incubating.jar into Hive's auxiliary library directory, for example:
   cp /opt/software/hudi-common-0.5.2-incubating.jar $HIVE_HOME/auxlib/
2. Restart Hive.

error 3

51714 [main] ERROR org.apache.hudi.hive.HiveSyncTool  - Got runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to 20200918095851
	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:658)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:128)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:229)
	at org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:279)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at com.clb.utils.HoodieImportHelper$.hudiUpsert(HoodieImportHelper.scala:159)
	at com.clb.HoodieImportHandler$$anonfun$importDataToHudi$1.apply$mcV$sp(HoodieImportHandler.scala:103)
	at scala.util.control.Breaks.breakable(Breaks.scala:38)
	at com.clb.HoodieImportHandler$.importDataToHudi(HoodieImportHandler.scala:65)
	at com.clb.HoodieImportHandlerTest.testImportDataToHudiPartition(HoodieImportHandlerTest.scala:48)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: NoSuchObjectException(message:hid0101_cache_xdcs_pacs_hj.merge_test13 table not found)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table_core(HiveMetaStore.java:1808)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1778)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy39.get_table(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1208)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:131)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
	at com.sun.proxy.$Proxy40.getTable(Unknown Source)
	at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:654)
	... 53 more
51775 [main] INFO  org.apache.hadoop.hive.metastore.HiveMetaStore  - 0: Shutting down the object store...
51775 [main] INFO  org.apache.hadoop.hive.metastore.HiveMetaStore.audit  - ugi=Administrator	ip=unknown-ip-addr	cmd=Shutting down the object store...	
51775 [main] INFO  org.apache.hadoop.hive.metastore.HiveMetaStore  - 0: Metastore shutdown complete.

Cause

By this point the Hudi Hive external table has already been created and its partitions added, but at the very end Hudi fails during the "get update last commit time synced" step. To understand why the last commit time needs to be updated at all, first look at what a normal Hudi Hive external table looks like:

CREATE EXTERNAL TABLE `merge_test13`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `age` string, 
  `birthday` string, 
  `create_time` string, 
  `id` string, 
  `isdeleted` string, 
  `lastupdatedttm` string, 
  `name` string, 
  `p_create_time` string, 
  `rowkey` string, 
  `sex` string)
PARTITIONED BY ( 
  `date_prt` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://master:8020/datalake/hid0101_cache_xdcs_pacs_hj_test/merge_test13'
TBLPROPERTIES (
  'last_commit_time_sync'='20200917193300', 
  'transient_lastDdlTime'='1600340594')

Notice the 'last_commit_time_sync'='20200917193300' at the end of the DDL: it records when the Hive external table was last synced, and every Hive sync run updates it.
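
As a small illustration (just a sketch: HiveMetaStoreClient is the standard Hive metastore API, but the database/table names are the ones from this example and the configuration is assumed to come from a hive-site.xml on the classpath), you can read this property back programmatically:

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

object ShowLastCommitTimeSync {
  def main(args: Array[String]): Unit = {
    // connects to the metastore described by hive-site.xml on the classpath
    val metaClient = new HiveMetaStoreClient(new HiveConf())
    val table = metaClient.getTable("hid0101_cache_xdcs_pacs_hj", "merge_test13")
    // prints the timestamp of the last Hive sync recorded in the table properties
    println(table.getParameters.get("last_commit_time_sync"))
    metaClient.close()
  }
}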

The Hive sync itself essentially adds partitions to the Hive external table and records those partitions in the Hive metastore.

In short, a Hudi Hive sync run does the following:

  1. Check whether the table exists

    1. Table does not exist
      1. Create the table
      2. Add the partition information
      3. Update last_commit_time_sync
    2. Table exists
      1. Read last_commit_time_sync
      2. Look in Hudi's metadata for commit files newer than last_commit_time_sync
        1. New commit files exist: update the file info, add the Hive partitions, update last_commit_time_sync
        2. No new commit files: do nothing

From this we can conclude:

If the table exists: do an incremental sync based on last_commit_time_sync.

If the table does not exist: create the table and do a full sync.

So the error above ultimately comes from the step that updates last_commit_time_sync.

Let's look at the source to see how last_commit_time_sync is updated:

1. syncHoodieTable

// org.apache.hudi.hive.HiveSyncTool#syncHoodieTable(java.lang.String, boolean)
private void syncHoodieTable(String tableName, boolean useRealtimeInputFormat) {
  // ...
  // Sync the partitions if needed
  syncPartitions(tableName, writtenPartitionsSince);
  // this is the call we care about
  hoodieHiveClient.updateLastCommitTimeSynced(tableName);
  LOG.info("Sync complete for " + tableName);
}
    
2. updateLastCommitTimeSynced

// org.apache.hudi.hive.HoodieHiveClient#updateLastCommitTimeSynced
void updateLastCommitTimeSynced(String tableName) {
  String lastCommitSynced = activeTimeline.lastInstant().get().getTimestamp();
  try {
    Table table = client.getTable(syncConfig.databaseName, tableName);
    table.putToParameters(HOODIE_LAST_COMMIT_TIME_SYNC, lastCommitSynced);
    // this is the interesting call; keep tracing down from here
    client.alter_table(syncConfig.databaseName, tableName, table);
  } catch (Exception e) {
    throw new HoodieHiveSyncException("Failed to get update last commit time synced to " + lastCommitSynced, e);
  }
}
    
3. get_table_core

// org.apache.hadoop.hive.metastore.HiveMetaStore.HMSHandler#get_table_core
public Table get_table_core(final String dbname, final String name) throws MetaException,
    NoSuchObjectException {
  Table t;
  try {
    // The key part: this call uses the metastore connection to look up the table metadata.
    // The database and table really exist, yet the lookup fails. Why? You have to know how Hive finds its metastore:
    // Hive connects to the metastore using hive-site.xml, just as Hadoop reaches HDFS through hdfs-site.xml.
    // Is there a hive-site.xml on your classpath? There isn't, so Hive falls back to the hive-default.xml bundled in hive-metastore.jar.
    // Since it reads only the default configuration, it obviously cannot see the databases and tables in your environment.
    t = getMS().getTable(dbname, name);
    if (t == null) {
      throw new NoSuchObjectException(dbname + "." + name
          + " table not found");
    }
  } catch (Exception e) {
    if (e instanceof MetaException) {
      throw (MetaException) e;
    } else if (e instanceof NoSuchObjectException) {
      throw (NoSuchObjectException) e;
    } else {
      throw newMetaException(e);
    }
  }
  return t;
}
    
Fix

Copy the hive-site.xml from your environment into the project's resources directory and restart the application; that will most likely fix it.
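
To see which configuration the application actually picks up, a small check like the sketch below prints whether hive-site.xml is on the classpath and which metastore URI the Hive client resolves; an empty URI usually means it will fall back to a local/embedded metastore. HiveConf is the standard Hive config class, the object name here is only for illustration.

import org.apache.hadoop.hive.conf.HiveConf

object CheckHiveSiteOnClasspath {
  def main(args: Array[String]): Unit = {
    // null here means hive-site.xml is not on the classpath and the defaults will be used
    println("hive-site.xml: " + getClass.getClassLoader.getResource("hive-site.xml"))
    // the metastore URI the Hive client resolves; empty means embedded/local metastore
    println("hive.metastore.uris: " + new HiveConf().getVar(HiveConf.ConfVars.METASTOREURIS))
  }
}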

Closing remarks

These are the main pitfalls I ran into with Hudi Hive sync. I have not yet gone through the finer details of the source, for example how the sync interacts with the hoodie metadata, so there may be inaccuracies above. Corrections are welcome!
