Three ways to insert data into Kudu

The Kudu Java API provides three flush modes for inserting data:

1. AUTO_FLUSH_SYNC

2. AUTO_FLUSH_BACKGROUND

3. MANUAL_FLUSH

They are defined in the source as follows:

public interface SessionConfiguration {

  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  enum FlushMode {
    /**
     * Each {@link KuduSession#apply KuduSession.apply()} call will return only after being
     * flushed to the server automatically. No batching will occur.
     *
     * <p>In this mode, the {@link KuduSession#flush} call never has any effect, since each
     * {@link KuduSession#apply KuduSession.apply()} has already flushed the buffer before
     * returning.
     *
     * <p>This is the default flush mode.
     */
    AUTO_FLUSH_SYNC,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will be sent in the background, potentially batched together with other writes from
     * the same session. If there is not sufficient buffer space, then
     * {@link KuduSession#apply KuduSession.apply()} may block for buffer space to be available.
     *
     * <p>Because writes are applied in the background, any errors will be stored
     * in a session-local buffer. Call {@link #countPendingErrors() countPendingErrors()} or
     * {@link #getPendingErrors() getPendingErrors()} to retrieve them.
     *
     * <p>Note: The {@code AUTO_FLUSH_BACKGROUND} mode may result in
     * out-of-order writes to Kudu. This is because in this mode multiple write
     * operations may be sent to the server in parallel.
     * See KUDU-1767 for more
     * information.
     *
     * <p>The {@link KuduSession#flush()} call can be used to block until the buffer is empty.
     */
    AUTO_FLUSH_BACKGROUND,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will not be sent until the user calls {@link KuduSession#flush()}. If the buffer runs past
     * the configured space limit, then {@link KuduSession#apply KuduSession.apply()} will return
     * an error.
     */
    MANUAL_FLUSH
  }

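In brief, the three flush modes mean the following:

1. AUTO_FLUSH_SYNC (the default): KuduSession.apply() returns only after the data has been flushed to the server. No batching happens in this mode, and calling KuduSession.flush() has no effect, because the buffer has already been flushed by the time apply() returns.

2. AUTO_FLUSH_BACKGROUND: KuduSession.apply() returns immediately, but the writes are sent in the background, possibly batched together with other writes from the same session. If there is not enough buffer space, KuduSession.apply() may block until space becomes available. Because the writes are applied in the background, any errors are stored in a session-local buffer; call countPendingErrors() or getPendingErrors() to retrieve them. Note that this mode can produce out-of-order writes, because several write operations may be sent to the server in parallel. This is a known Kudu issue, tracked as KUDU-1767. KuduSession.flush() can be used to block until the buffer is empty.

3. MANUAL_FLUSH: KuduSession.apply() returns immediately, but the writes are not sent until the user calls flush(). If the buffer grows past the configured space limit, KuduSession.apply() returns an error.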
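To make the difference between the three modes concrete, here is a minimal toy model in plain Java. It is not Kudu code: `ToySession`, its `server` and `buffer` lists, and the `bufferLimit` parameter are all made up for illustration. The background thread and flush timer of the real AUTO_FLUSH_BACKGROUND mode are replaced by an inline drain when the buffer fills up, which stands in for "apply() blocks until space is available".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the three flush modes. Hypothetical; the "server" is just a list. */
class ToySession {
    enum FlushMode { AUTO_FLUSH_SYNC, AUTO_FLUSH_BACKGROUND, MANUAL_FLUSH }

    final List<String> server = new ArrayList<>(); // rows the "server" has received
    final List<String> buffer = new ArrayList<>(); // rows still held by the client
    private final FlushMode mode;
    private final int bufferLimit;                 // assumed limit, counted in rows

    ToySession(FlushMode mode, int bufferLimit) {
        this.mode = mode;
        this.bufferLimit = bufferLimit;
    }

    void apply(String row) {
        switch (mode) {
            case AUTO_FLUSH_SYNC:
                // Sent before apply() returns, so nothing is ever buffered.
                server.add(row);
                break;
            case AUTO_FLUSH_BACKGROUND:
                // Simplification: a full buffer is drained inline instead of
                // by a background thread while the caller waits.
                if (buffer.size() >= bufferLimit) {
                    flush();
                }
                buffer.add(row);
                break;
            case MANUAL_FLUSH:
                // A full buffer is an error; only the caller decides when to flush.
                if (buffer.size() >= bufferLimit) {
                    throw new IllegalStateException("buffer full, call flush() first");
                }
                buffer.add(row);
                break;
        }
    }

    /** Moves everything that is buffered to the "server". */
    void flush() {
        server.addAll(buffer);
        buffer.clear();
    }
}

class FlushModeDemo {
    public static void main(String[] args) {
        for (ToySession.FlushMode mode : ToySession.FlushMode.values()) {
            ToySession session = new ToySession(mode, 10);
            session.apply("row-1");
            session.apply("row-2");
            System.out.println(mode + ": on server=" + session.server.size()
                    + ", still buffered=" + session.buffer.size());
        }
    }
}
```

In this model only AUTO_FLUSH_SYNC has rows on the server right after apply(); in the other two modes the rows sit in the client buffer until something flushes them, which is exactly the window in which a crash can lose data.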

In my own use, the third mode has been the most efficient of the three, but it still has a problem:

When I use Flink for real-time processing, data can be lost.

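For example: I use Flink to consume data from Kafka and insert it into Kudu in real time, using MANUAL_FLUSH. I call session.flush() to send the data to the Kudu server once 10 rows have accumulated in the buffer. Now suppose the client has written 9 rows to the buffer, one short of 10, and the job stops because of a power failure or some other incident. Those 9 rows have not been sent to Kudu. When I restart the job (the Flink program has checkpointing enabled), the 9 rows are still missing from the database. I have not yet found a way to handle this and would welcome a discussion.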
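One direction that is often suggested for this kind of loss is to flush the sink's buffer whenever a checkpoint is taken, so that a checkpoint never records source offsets for rows that exist only in client memory. In Flink the hook for this would presumably be `CheckpointedFunction.snapshotState`, calling `session.flush()` there; I have not verified that against the sink in this post. The sketch below is a stdlib-only simulation of the idea, not Flink or Kudu code: `CheckpointReplayDemo`, `run`, and all of its parameters are hypothetical names, and offsets stand in for Kafka records.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay model: a source with offsets, a buffering sink, checkpoints, one crash. */
class CheckpointReplayDemo {

    /**
     * Processes offsets 0..total-1, crashes once when the source reaches
     * crashAfter, restarts from the last checkpointed offset, and returns
     * everything that reached the "database".
     */
    static List<Integer> run(int total, int batchSize, int checkpointEvery,
                             int crashAfter, boolean flushOnCheckpoint) {
        List<Integer> database = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        int checkpointedOffset = 0; // where the source resumes after a restart
        boolean crashed = false;
        int offset = 0;

        while (offset < total) {
            if (!crashed && offset == crashAfter) {
                buffer.clear();              // in-memory rows are gone
                offset = checkpointedOffset; // source rewinds to the checkpoint
                crashed = true;
                continue;
            }
            buffer.add(offset);
            offset++;
            if (buffer.size() >= batchSize) { // count-based flush, as in the sink
                database.addAll(buffer);
                buffer.clear();
            }
            if (offset % checkpointEvery == 0) {
                if (flushOnCheckpoint) {      // the proposed change
                    database.addAll(buffer);
                    buffer.clear();
                }
                checkpointedOffset = offset;
            }
        }
        database.addAll(buffer); // clean shutdown flushes the remainder
        return database;
    }

    public static void main(String[] args) {
        // 20 rows, flush every 10, checkpoint every 9, crash after 9 rows.
        System.out.println(run(20, 10, 9, 9, false).size()); // 11: rows 0..8 are lost
        System.out.println(run(20, 10, 9, 9, true).size());  // 20: nothing is lost
    }
}
```

Flushing on checkpoint gives at-least-once delivery rather than exactly-once: rows flushed after the last checkpoint but before the crash are replayed on restart. With a real table that would probably mean using upserts instead of inserts so that replays are harmless.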

Here is the KuduSink code I wrote at the time:

object KuduSink extends RichSinkFunction[(String, String, String, String, String, String, String, String)] {
  private val logger = LoggerFactory.getLogger(KuduSink.getClass)
  var clint: KuduClient = null
  var session: KuduSession = null
  var table: KuduTable = null
  // Counter of operations applied since the last flush; it determines the batch size
  var OPERATION_BATCH: IntCounter = null
  private lazy val PATH = PATH_Qqwry_dat.stringValue

  /**
    * Business logic.
    *
    * @param value a tuple of (preEvent, preIp, preOs, preOsVersion, preLib, preBrowser, preBrowserVersion, preProject)
    */
  override def invoke(value: (String, String, String, String, String, String, String, String)): Unit = {

  }

  def insertDB(insert: Insert): Unit = {
    // Buffer the operation in the session (in MANUAL_FLUSH mode nothing is sent yet)
    session.apply(insert)
    this.OPERATION_BATCH.add(1)
    val num = this.OPERATION_BATCH.getLocalValue
    // Once 10 operations are buffered, send them to the Kudu server
    if (num > 9) {
      session.flush()
      // Start counting the next batch from zero
      this.OPERATION_BATCH.resetLocal()
    }
  }
  /**
    * Open the Kudu connection.
    *
    * @param parameters
    */
  override def open(parameters: Configuration): Unit = {
    clint = new KuduClient.KuduClientBuilder(KUDU_MASTER.stringValue).build()
    session = clint.newSession()
    table = clint.openTable(KUDU_TABLE_MK_DataDictionary.stringValue)
    val mode = SessionConfiguration.FlushMode.MANUAL_FLUSH
    session.setFlushMode(mode)
    OPERATION_BATCH = new IntCounter()
    //getRuntimeContext().addAccumulator("operationBatch",this.OPERATION_BATCH)
  }

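  /**
    * Close the Kudu connection. KuduSession.close() flushes any operations
    * that are still buffered before it closes the session.
    */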
  override def close(): Unit = {
    if (session != null) {
      session.close()
    }
    if (clint != null) {
      clint.close()
    }
  }
}

The full code is in my personal Git repository: https://github.com/seniscz/stream/blob/master/flink/flinkexample/flinkexample-parent/flink-db/src/main/scala/com/cz/datadictionary/KuduSink.scala
