Flink: Joining HBase Dimension Tables with Async I/O

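I. Why Use Async I/O to Join HBase Dimension Tables

To avoid frequent synchronous lookups against an external dimension table in a streaming job, Flink provides an async I/O operator that issues requests to external systems concurrently, which reduces the throughput and latency cost of network round trips. To cut the number of external calls further, keep recently used dimension rows in a local cache with LRU (least recently used) eviction; a common choice is CacheBuilder from the Guava library.
The overall design: load HBase dimension rows into the cache through async I/O. When joining, look in the cache first; on a miss, query HBase and put the result into the cache.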
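The cache-first lookup described above can be sketched with the JDK alone. This is a minimal sketch, not the article's implementation: a LinkedHashMap in access order stands in for Guava's CacheBuilder, and the class name DimCacheSketch, the loader function, and the loads counter are hypothetical names introduced for illustration. The loader plays the role of the asynchronous HBase get.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Cache-first lookup: try a small LRU cache, fall back to an async loader on a miss. */
final class DimCacheSketch {
    private final Map<String, String> lru;
    // Hypothetical async loader; stands in for the asynchronous HBase get.
    private final Function<String, CompletableFuture<String>> loader;
    int loads = 0; // how many times the loader was called (for illustration)

    DimCacheSketch(final int maxEntries, Function<String, CompletableFuture<String>> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    CompletableFuture<String> lookup(String rowKey) {
        String cached;
        synchronized (lru) {
            cached = lru.get(rowKey);
        }
        if (cached != null) {
            return CompletableFuture.completedFuture(cached); // cache hit
        }
        synchronized (lru) {
            loads++;
        }
        // Cache miss: load asynchronously, then remember the value.
        return loader.apply(rowKey).thenApply(value -> {
            if (value != null) {
                synchronized (lru) {
                    lru.put(rowKey, value);
                }
            }
            return value;
        });
    }

    boolean isCached(String rowKey) {
        synchronized (lru) {
            return lru.containsKey(rowKey); // containsKey does not change LRU order
        }
    }

    public static void main(String[] args) {
        DimCacheSketch cache = new DimCacheSketch(2,
                key -> CompletableFuture.supplyAsync(() -> "dim-" + key));
        System.out.println(cache.lookup("a").join()); // miss: goes to the loader
        System.out.println(cache.lookup("a").join()); // hit: served from the cache
        System.out.println("loader calls: " + cache.loads);
    }
}
```

Two concurrent misses for the same key would both call the loader in this sketch; a production cache would usually deduplicate in-flight loads.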

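1. Advantages

Because only recently used rows are cached, a large dimension table does not exhaust memory. The same operator can also join against several dimension tables at once.

2. Drawbacks

(1) An asynchronous client is required. The synchronous HBase client cannot be used for this, because every call waits for its response, so this article uses the asynchronous client asynchbase. If your store has no asynchronous client, you can simulate asynchronous requests with your own thread pool.
(2) Because of the cache, updates to dimension data become visible with some delay.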
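When no asynchronous client exists, the thread-pool workaround mentioned above might look like the following. This is a sketch under assumptions: blockingLookup is a hypothetical stand-in for a synchronous client call, and the sleep only simulates a network round trip.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Wraps a blocking lookup in a thread pool so callers get a future immediately. */
final class BlockingToAsyncSketch {
    /** Hypothetical blocking lookup, standing in for a synchronous client call. */
    static String blockingLookup(String rowKey) {
        try {
            Thread.sleep(50); // simulated network round trip
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "dim-" + rowKey;
    }

    /** Runs the blocking call on the pool; the calling thread is not blocked. */
    static CompletableFuture<String> lookupAsync(String rowKey, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> blockingLookup(rowKey), pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Both lookups are in flight at the same time.
            CompletableFuture<String> a = lookupAsync("a", pool);
            CompletableFuture<String> b = lookupAsync("b", pool);
            System.out.println(a.join() + ", " + b.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

Inside a Flink async function, the returned future would typically be consumed with thenAccept to complete the ResultFuture. The pool size caps how many lookups run concurrently, so it needs to be sized against the operator's capacity.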

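3. When to Use It

This approach fits best when the dimension data is very large and the job can tolerate some delay in seeing dimension updates, or when the dimension table itself rarely changes.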

II. How It Works

1. Async I/O

[Figure 1: synchronous vs. asynchronous interaction with external storage, from the Flink documentation]
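As the figure above shows, the Flink documentation compares how a stream processor interacts with external storage synchronously (left) and asynchronously (right).
With synchronous interaction, the task sends one request and then blocks until the store returns the result, which clearly hurts throughput. When I first started with Flink, I liked to open the connection to the external store in open() of a RichFunction and then use it directly in map() or filter() to fetch the data I needed; that is a typical synchronous pattern.
With asynchronous interaction, many queries are in flight at once and each result is used as soon as it arrives, so stream processing and lookups effectively run separately. Flink's async I/O operator can also emit results in the original input order (orderedWait) instead of completion order (unorderedWait).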
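The difference between completion order and input order can be shown with plain CompletableFuture objects. This sketch is not Flink's implementation; OrderingSketch and its two methods are hypothetical names, and the futures are completed by hand so the order is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Contrasts emitting results in completion order with emitting them in request order. */
final class OrderingSketch {
    /** Appends each result as soon as its request completes (completion order). */
    static List<String> unordered(List<CompletableFuture<String>> requests) {
        List<String> out = new ArrayList<>();
        for (CompletableFuture<String> f : requests) {
            f.thenAccept(v -> {
                synchronized (out) {
                    out.add(v);
                }
            });
        }
        return out;
    }

    /** Waits for every request and returns the results in request order. */
    static List<String> ordered(List<CompletableFuture<String>> requests) {
        List<String> out = new ArrayList<>();
        for (CompletableFuture<String> f : requests) {
            out.add(f.join());
        }
        return out;
    }

    public static void main(String[] args) {
        CompletableFuture<String> r1 = new CompletableFuture<>();
        CompletableFuture<String> r2 = new CompletableFuture<>();
        CompletableFuture<String> r3 = new CompletableFuture<>();
        List<CompletableFuture<String>> requests = Arrays.asList(r1, r2, r3);

        List<String> asCompleted = unordered(requests);
        // The responses arrive out of order: third, first, second.
        r3.complete("c");
        r1.complete("a");
        r2.complete("b");

        System.out.println(asCompleted);       // prints [c, a, b]
        System.out.println(ordered(requests)); // prints [a, b, c]
    }
}
```

Unordered emission has lower latency because a slow request does not hold back faster ones; ordered emission costs some buffering in exchange for preserving the stream order.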

2. Caching

The cache used here is CacheBuilder from the Guava library.
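A cache for dimension data is usually given a time-to-live so that stale rows are eventually reloaded; this is what bounds the update delay mentioned under the drawbacks. The sketch below illustrates that idea with the JDK only. It is not Guava's CacheBuilder; TtlCacheSketch, getIfFresh, and the injected clock are hypothetical names, and the clock is injected so the behavior can be shown without sleeping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Expire-after-write cache: an entry is served only while it is younger than the TTL. */
final class TtlCacheSketch {
    private static final class Entry {
        final String value;
        final long writtenAtMillis;

        Entry(String value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected clock, in milliseconds

    TtlCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized void put(String key, String value) {
        entries.put(key, new Entry(value, clock.getAsLong()));
    }

    /** Returns the cached value, or null if it is absent or older than the TTL. */
    synchronized String getIfFresh(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() - e.writtenAtMillis >= ttlMillis) {
            entries.remove(key); // expired: the caller must reload from the store
            return null;
        }
        return e.value;
    }

    public static void main(String[] args) {
        long[] now = {0L};
        TtlCacheSketch cache = new TtlCacheSketch(1000L, () -> now[0]);
        cache.put("row1", "v1");
        now[0] = 999L;
        System.out.println(cache.getIfFresh("row1")); // prints v1
        now[0] = 1000L;
        System.out.println(cache.getIfFresh("row1")); // prints null
    }
}
```

With a TTL of N milliseconds, a change in the dimension table can go unseen for up to N milliseconds, so the TTL is the knob that trades lookup load against freshness.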

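III. Source Code Walkthrough

I. CacheBuilder Cache

II. HBase Async Client

HBase async client website
Read the Javadoc carefully; it explains usage in detail.
HBase async client source on Git
The walkthrough below uses v1.8.2.

To use the async client, add this dependency: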

        <dependency>
            <groupId>org.hbase</groupId>
            <artifactId>asynchbase</artifactId>
            <version>1.8.2</version>
        </dependency>

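A complete asynchronous exchange with HBase involves the following pieces.

1. HBaseClient

HBaseClient source location

Only the get methods are used here, so only their source is listed. Both fetch data from HBase.
The difference is that the first sends a single GetRequest, while the second sends a batch of GetRequests and can therefore fetch from several dimension tables in one call.
In production, though, I keep several dimension tables in one HBase table, each in its own column family.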

  /**
   * Retrieves data from HBase.
   * @param request The {@code get} request.
   * @return A deferred list of key-values that matched the get request.
   */
  public Deferred<ArrayList<KeyValue>> get(final GetRequest request) {
    num_gets.increment();
    return sendRpcToRegion(request).addCallbacks(got, Callback.PASSTHROUGH);
  }

  /**
   * Method to issue multiple get requests to HBase in a batch. This can avoid
   * bottlenecks in region clients and improve response time.
   * @param requests A list of one or more GetRequests.
   * @return A deferred grouping of result or exceptions. Note that this API may
   * return a DeferredGroupException if one or more calls failed.
   * @since 1.8
   */
  public Deferred<List<GetResultOrException>> get(final List<GetRequest> requests) {
    return Deferred.groupInOrder(multiGet(requests))
        .addCallback(
            new Callback<List<GetResultOrException>, ArrayList<GetResultOrException>>() {
              public List<GetResultOrException> call(ArrayList<GetResultOrException> results) {
                return results;
              }
            }
        );
  }

Constructor:

  /**
   * Constructor.
   * @param quorum_spec The specification of the quorum, e.g.
   * {@code "host1,host2,host3"}.
   *                    (first argument: the ZooKeeper quorum address)
   * @param base_path The base path under which is the znode for the
   * -ROOT- region.
   *                   (second argument: the parent znode path, not a port)
   */
  public HBaseClient(final String quorum_spec, final String base_path) {
    this(quorum_spec, base_path, defaultChannelFactory(new Config()));
  }

2. GetRequest

GetRequest source location

A GetRequest describes what to fetch from HBase: a row key, and optionally a column family and a column qualifier.
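The constructors are the part to focus on. As they show, you can fetch by row key alone, by row key and column family, or by row key, column family, and qualifier, depending on what the business logic needs: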

  /**
   * Constructor.
   * These byte arrays will NOT be copied.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   */
  public GetRequest(final byte[] table, final byte[] key) {
    super(table, key);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * This byte array will NOT be copied.
   */
  public GetRequest(final String table, final byte[] key) {
    this(table.getBytes(), key);
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   */
  public GetRequest(final String table, final String key) {
    this(table.getBytes(), key.getBytes());
  }

  /**
   * Constructor.
   * These byte arrays will NOT be copied.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @since 1.5
   */
  public GetRequest(final byte[] table,
                    final byte[] key,
                    final byte[] family) {
    super(table, key);
    this.family(family);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @since 1.5
   */
  public GetRequest(final String table,
                    final String key,
                    final String family) {
    this(table, key);
    this.family(family);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * These byte arrays will NOT be copied.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @param qualifier The column qualifier.
   * @since 1.5
   */
  public GetRequest(final byte[] table,
                    final byte[] key,
                    final byte[] family,
                    final byte[] qualifier) {
    super(table, key);
    this.family(family);
    this.qualifier(qualifier);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @param qualifier The column qualifier.
   * @since 1.5
   */
  public GetRequest(final String table,
                    final String key,
                    final String family,
                    final String qualifier) {
    this(table, key);
    this.family(family);
    this.qualifier(qualifier);
    this.bufferable = false; //don't buffer get request
  }

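3. GetResultOrException

GetResultOrException source location

This class holds what one read from HBase returned: either the cells or an exception. Unlike the result of a single get (a list of KeyValue), it is returned only by the batch get, which takes multiple GetRequests:

  public Deferred<List<GetResultOrException>> get(final List<GetRequest> requests)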


  // Returns the cells read for this request
  public ArrayList<KeyValue> getCells() {
    return this.cells;
  }

  // Returns the exception for a GetRequest that could not be completed
  public Exception getException() {
    return this.exception;
  }
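The idea of a batch result that keeps request order and carries either a value or an error per entry can be sketched with the JDK. This is an analogy only, not asynchbase code: BatchResultSketch, ResultOrError, and collectInOrder are hypothetical names, and unlike the real batch get this sketch always folds failures into the list instead of failing the whole group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Batch lookup whose result list keeps request order and holds a value or an error per entry. */
final class BatchResultSketch {
    /** Either a value or the error that replaced it; loosely mirrors GetResultOrException. */
    static final class ResultOrError {
        final String value;
        final Throwable error;

        ResultOrError(String value, Throwable error) {
            this.value = value;
            this.error = error;
        }
    }

    /** Combines the futures into one future; failures become entries, not a failed batch. */
    static CompletableFuture<List<ResultOrError>> collectInOrder(
            List<CompletableFuture<String>> requests) {
        List<CompletableFuture<ResultOrError>> wrapped = new ArrayList<>();
        for (CompletableFuture<String> f : requests) {
            // handle() sees either the value or the exception, never both.
            wrapped.add(f.handle((v, e) -> new ResultOrError(v, e)));
        }
        return CompletableFuture
                .allOf(wrapped.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<ResultOrError> out = new ArrayList<>();
                    for (CompletableFuture<ResultOrError> w : wrapped) {
                        out.add(w.join()); // already complete, so this does not block
                    }
                    return out;
                });
    }

    public static void main(String[] args) {
        CompletableFuture<String> ok = CompletableFuture.completedFuture("v1");
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("region unavailable"));

        for (ResultOrError r : collectInOrder(Arrays.asList(ok, failed)).join()) {
            System.out.println(r.error == null ? "value: " + r.value : "error: " + r.error);
        }
    }
}
```

Callers of such a batch API need to check every entry for an error before reading its value, which is why a null check on the cells is worthwhile when iterating a list of GetResultOrException.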

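4. KeyValue

KeyValue source location

A KeyValue is one data object fetched from HBase. In the words of HBase: The Definitive Guide, a KeyValue instance represents a unique data cell.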
The most commonly used methods:

  /** Returns the row key.  */
  public byte[] key() {
    return key;
  }

  /** Returns the column family.  */
  public byte[] family() {
    return family;
  }

  /** Returns the column qualifier.  */
  public byte[] qualifier() {
    return qualifier;
  }

  /**
   * Returns the timestamp stored in this {@code KeyValue}.
   * @see #TIMESTAMP_NOW
   */
  public long timestamp() {
    return timestamp;
  }

  //public byte type() {
  //  return type;
  //}

  /** Returns the value, the contents of the cell. */
  public byte[] value() {
    return value;
  }

  @Override
  public int compareTo(final KeyValue other) {
    int d;
    if ((d = Bytes.memcmp(key, other.key)) != 0) {
      return d;
    } else if ((d = Bytes.memcmp(family, other.family)) != 0) {
      return d;
    } else if ((d = Bytes.memcmp(qualifier, other.qualifier)) != 0) {
      return d;
    //} else if ((d = Bytes.memcmp(value, other.value)) != 0) {
    //  return d;
    } else if ((d = Long.signum(timestamp - other.timestamp)) != 0) {
      return d;
    } else {
    //  d = type - other.type;
      d = Bytes.memcmp(value, other.value);
    }
    return d;
  }

  public boolean equals(final Object other) {
    if (other == null || !(other instanceof KeyValue)) {
      return false;
    }
    return compareTo((KeyValue) other) == 0;
  }

  public int hashCode() {
    return Arrays.hashCode(key)
      ^ Arrays.hashCode(family)
      ^ Arrays.hashCode(qualifier)
      ^ Arrays.hashCode(value)
      ^ (int) (timestamp ^ (timestamp >>> 32))
      //^ type
      ;
  }

  public String toString() {
    final StringBuilder buf = new StringBuilder(84  // Boilerplate + timestamp
      // the row key is likely to contain non-ascii characters, so
      // let's multiply its length by 2 to avoid re-allocations.
      + key.length * 2 + family.length + qualifier.length + value.length);
    buf.append("KeyValue(key=");
    Bytes.pretty(buf, key);
    buf.append(", family=");
    Bytes.pretty(buf, family);
    buf.append(", qualifier=");
    Bytes.pretty(buf, qualifier);
    buf.append(", value=");
    Bytes.pretty(buf, value);
    buf.append(", timestamp=").append(timestamp);
    //  .append(", type=").append(type);
    buf.append(')');
    return buf.toString();
  }

5. Deferred

Deferred source location

This class lets the HBase async client register callbacks, which are invoked when the requested data comes back.
The callback-registration methods are the part to focus on:

  /**
   * Registers a callback.
   *
   * If the deferred result is already available and isn't an exception, the
   * callback is executed immediately from this thread.
   * If the deferred result is already available and is an exception, the
   * callback is discarded.
   * If the deferred result is not available, this callback is queued and will
   * be invoked from whichever thread gives this deferred its initial result
   * by calling {@link #callback}.
   * @param cb The callback to register.
   * @return {@code this} with an "updated" type.
   */
  public <R> Deferred<R> addCallback(final Callback<R, T> cb) {
    return addCallbacks(cb, Callback.PASSTHROUGH);
  }

  /**
   * Registers a callback and an "errback".
   *
   * If the deferred result is already available, the callback or the errback
   * (depending on the nature of the result) is executed immediately from this
   * thread.
   * @param cb The callback to register.
   * @param eb Th errback to register.
   * @return {@code this} with an "updated" type.
   * @throws CallbackOverflowError if there are too many callbacks in this chain.
   * The maximum number of callbacks allowed in a chain is set by the
   * implementation.  The limit is high enough that you shouldn't have to worry
   * about this exception (which is why it's an {@link Error} actually).  If
   * you hit it, you probably did something wrong.
   */
  @SuppressWarnings("unchecked")
  public <R, R2, E> Deferred<R> addCallbacks(final Callback<R, T> cb,
                                             final Callback<R2, E> eb) {
    if (cb == null) {
      throw new NullPointerException("null callback");
    } else if (eb == null) {
      throw new NullPointerException("null errback");
    }
    // We need to synchronize on `this' first before the CAS, to prevent
    // runCallbacks from switching our state from RUNNING to DONE right
    // before we add another callback.
    synchronized (this) {
      // If we're DONE, switch to RUNNING atomically.
      if (state == DONE) {
        // This "check-then-act" sequence is safe as this is the only code
        // path that transitions from DONE to RUNNING and it's synchronized.
        state = RUNNING;
      } else {
        // We get here if weren't DONE (most common code path)
        //  -or-
        // if we were DONE and another thread raced with us to change the
        // state and we lost the race (uncommon).
        if (callbacks == null) {
          callbacks = new Callback[INIT_CALLBACK_CHAIN_SIZE];
        }
        // Do we need to grow the array?
        else if (last_callback == callbacks.length) {
          final int oldlen = callbacks.length;
          if (oldlen == MAX_CALLBACK_CHAIN_LENGTH * 2) {
            throw new CallbackOverflowError("Too many callbacks in " + this
              + " (size=" + (oldlen / 2) + ") when attempting to add cb="
              + cb + '@' + cb.hashCode() + ", eb=" + eb + '@' + eb.hashCode());
          }
          final int len = Math.min(oldlen * 2, MAX_CALLBACK_CHAIN_LENGTH * 2);
          final Callback[] newcbs = new Callback[len];
          System.arraycopy(callbacks, next_callback,  // Outstanding callbacks.
                           newcbs, 0,  // Move them to the beginning.
                           last_callback - next_callback);  // Number of items.
          last_callback -= next_callback;
          next_callback = 0;
          callbacks = newcbs;
        }
        callbacks[last_callback++] = cb;
        callbacks[last_callback++] = eb;
        return (Deferred<R>) ((Deferred) this);
      }
    }  // end of synchronized block

    if (!doCall(result instanceof Exception ? eb : cb)) {
      // While we were executing the callback, another thread could have
      // added more callbacks.  If doCall returned true, it means we're
      // PAUSED, so we won't reach this point, because the Deferred we're
      // waiting on will call us back later.  But if we're still in state
      // RUNNING, we'll get to here, and we must check to see if any new
      // callbacks were added while we were executing doCall, because if
      // there are, we must execute them immediately, because no one else
      // is going to execute them for us otherwise.
      boolean more;
      synchronized (this) {
        more = callbacks != null && next_callback != last_callback;
      }
      if (more) {
        runCallbacks();  // Will put us back either in DONE or in PAUSED.
      } else {
        state = DONE;
      }
    }
    return (Deferred<R>) ((Object) this);
  }
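For readers more familiar with the JDK, the callback and errback pair has a rough analogue in CompletableFuture. The sketch below is that analogue, not the Deferred API: CallbackChainSketch and enrich are hypothetical names, thenApply plays the role of the callback, and exceptionally plays the role of the errback.

```java
import java.util.concurrent.CompletableFuture;

/** Callback and errback on one chain, expressed with CompletableFuture. */
final class CallbackChainSketch {
    /** The callback transforms a successful result; the errback supplies a fallback on failure. */
    static CompletableFuture<String> enrich(CompletableFuture<String> lookup, String fallback) {
        return lookup
                .thenApply(v -> "joined:" + v)   // callback: runs only on success
                .exceptionally(e -> fallback);   // errback: runs only on failure
    }

    public static void main(String[] args) {
        // Result already available: the callback runs while the chain is being built.
        CompletableFuture<String> ready = CompletableFuture.completedFuture("cityA");
        System.out.println(enrich(ready, "unjoined").join()); // prints joined:cityA

        // Result not yet available: the callbacks are queued until the future completes.
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> chained = enrich(pending, "unjoined");
        pending.completeExceptionally(new IllegalStateException("lookup failed"));
        System.out.println(chained.join()); // prints unjoined
    }
}
```

One difference worth keeping in mind: as the source above shows, addCallbacks returns the same Deferred with an updated type, whereas each CompletableFuture stage is a new object.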

III. Flink Async I/O

Flink: Dual-Stream Join and Dimension Table Join

IV. Example Code

package com.scallion.transform;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.scallion.common.Common;
import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;
import org.apache.commons.lang.StringUtils;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.hbase.async.GetRequest;
import org.hbase.async.GetResultOrException;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;

import java.nio.charset.StandardCharsets;
import java.util.*;

/**
 * created by gaowj.
 * created on 2021-07-16.
 * function: asynchronous dimension-table join function
 */
public class AsyncHBaseDimJoinFunction extends RichAsyncFunction<Object, Object> {
    private transient HBaseClient client; // asynchronous HBase client, created in open()
    private final String rowKeyCol; // name of the bean field that holds the row key
    private final HashMap<String, HashSet<String>> joinTables; // dimension table name -> columns to fetch
    private final HashMap<String, String> colAndResCol; // dimension column name -> field name in the traffic bean

    public AsyncHBaseDimJoinFunction(String rowKeyCol, HashMap<String, HashSet<String>> joinTables, HashMap<String, String> colAndResCol) {
        this.rowKeyCol = rowKeyCol;
        this.joinTables = joinTables;
        this.colAndResCol = colAndResCol;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read the global job parameters
        ParameterTool params = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        // Create the HBase client. The second constructor argument is the parent
        // znode path, not the ZooKeeper port. If ZooKeeper does not listen on its
        // default port, put the port in the quorum spec ("host1:2181,host2:2181").
        // Assumption: the job passes the path as "zookeeper.znode.parent".
        client = new HBaseClient(params.getRequired("hbase.zookeeper.quorum"),
                params.get("zookeeper.znode.parent", "/hbase"));
    }

    @Override
    public void close() throws Exception {
        // Flush pending RPCs and release the client's resources
        if (client != null) {
            client.shutdown().join();
        }
    }

    @Override
    public void asyncInvoke(Object bean, ResultFuture<Object> resultFuture) throws Exception {
        // Traffic-log record
        final JSONObject beanJsonObj = JSON.parseObject(JSON.toJSONString(bean));
        try {
            String rowKey = beanJsonObj.getString(rowKeyCol); // row key value
            ArrayList<GetRequest> getRequests = new ArrayList<>();
            // One GetRequest per (dimension table, column) pair
            for (Map.Entry<String, HashSet<String>> entry : joinTables.entrySet()) {
                String table = entry.getKey();
                for (String col : entry.getValue()) {
                    getRequests.add(new GetRequest(table, rowKey,
                            Common.DIM_HBASE_TABLE_FAMLIY,
                            col));
                }
            }
            Deferred<List<GetResultOrException>> listDeferred = client.get(getRequests);
            listDeferred.addCallbacks(new Callback<Object, List<GetResultOrException>>() {
                @Override
                public Object call(List<GetResultOrException> callBack) throws Exception {
                    if (callBack != null) {
                        for (GetResultOrException result : callBack) {
                            ArrayList<KeyValue> cells = result.getCells();
                            if (cells == null) {
                                continue; // this request produced no cells; keep what we have
                            }
                            for (KeyValue kv : cells) {
                                String qualifier = new String(kv.qualifier(), StandardCharsets.UTF_8); // dimension column name
                                String v = new String(kv.value(), StandardCharsets.UTF_8);
                                String resCol = colAndResCol.get(qualifier); // field name in the traffic bean
                                if (resCol != null && StringUtils.isNotBlank(v)) {
                                    beanJsonObj.put(resCol, v);
                                }
                            }
                        }
                    }
                    // Always complete the future, whether or not anything was joined;
                    // otherwise the record is never emitted and the request times out.
                    resultFuture.complete(Collections.singleton(beanJsonObj));
                    return null;
                }
            }, new Callback<Object, Object>() {
                @Override
                public Object call(Object o) throws Exception {
                    // The lookup failed: emit the original, unjoined record
                    resultFuture.complete(Collections.singleton(beanJsonObj));
                    return null;
                }
            });

        } catch (Exception ex) {
            ex.printStackTrace();
            // Building or sending the request failed: emit the unjoined record
            // so that the future does not stay incomplete.
            resultFuture.complete(Collections.singleton(beanJsonObj));
        }
    }
}

V. References

Dimension table join approaches in Flink
Asynchronous I/O for External Data Access
