Background
In production we frequently need to join real-time streams with Hive dimension tables (partitioned tables) to enrich data. Flink 1.12 was released last week and supports exactly this scenario, so I deployed 1.12, built a production job with it, and shipped it. Partitioned temporal tables saved a lot of development effort, so combined with a few earlier production examples, here is a small write-up.
Topics
- A stream-batch join example without partitioned temporal tables
- Flink 1.12 dependencies and caveats
- Flink 1.12: the convenience of Hive partitioned temporal tables plus Kafka for unified stream-batch processing
- Partitioned temporal table code demo
- Flink SQL development tips
- ORC read bug
A stream-batch join example without partitioned temporal tables
Before partitioned temporal tables existed, in order to periodically join against the latest partition we usually had to write a Kafka DataStream program, do the join inside a map operator and return Tuple2 records, and then convert that stream into a Table to run SQL queries on it.
StreamTableEnvironment blinkStreamTableEnv = StreamTableEnvironment.create(blinkStreamEnv, blinkStreamSettings);
Configuration configuration = blinkStreamTableEnv.getConfig().getConfiguration();
configuration.setString("table.exec.mini-batch.enabled", "true"); // local-global aggregation depends on mini-batch is enabled
configuration.setString("table.exec.mini-batch.allow-latency", "60 s");
configuration.setString("table.exec.mini-batch.size", "5000");
configuration.setString("table.optimizer.agg-phase-strategy", "TWO_PHASE"); // enable two-phase, i.e. local-global aggregation
DataStream<Tuple2<MasterBean, IndexBean>> indexBeanStream = masterDataStream.map(new IndexOrderJoin());
Implementation of the map join: it joins the T-2 dimension data with the real-time data and returns Tuple2<MasterBean, IndexBean> records (IndexBean denotes the Hive dimension record type). Our offline warehouse usually finishes producing data around 3 a.m., and unstable cluster resources sometimes delay that, so T-2 is used rather than T-1.
public class IndexOrderJoin extends RichMapFunction<MasterBean, Tuple2<MasterBean, IndexBean>> {

    // reportDate -> (groupID + shopID -> dimension record)
    private Map<String, Map<String, IndexBean>> map = null;
    private Logger logger;

    @Override
    public void open(Configuration parameters) throws Exception {
        logger = LoggerFactory.getLogger(IndexOrderJoin.class);
        map = new HashMap<>();
    }

    @Override
    public Tuple2<MasterBean, IndexBean> map(MasterBean masterBean) {
        if (map.get(masterBean.getReportDate()) == null) {
            // If the T-2 dimension data is not cached yet, query Hive once and store the result in the map.
            // The map is task-local, so every task holds a complete copy of the dimension data.
            logger.info("initial hive data : {}", masterBean.getReportDate());
            map.put(masterBean.getReportDate(), ScalaHiveUtil.getHiveDayIndex(masterBean.getReportDate()));
        }
        // Join the incoming Kafka record with the cached Hive data and return a Tuple2.
        return new Tuple2<>(masterBean,
                map.get(masterBean.getReportDate()).get(masterBean.getGroupID().concat(masterBean.getShopID())));
    }
}
Create a view on the joined indexBeanStream and run SQL queries against it.
blinkStreamTableEnv.createTemporaryView("index_order_master", indexBeanStream);
blinkStreamTableEnv.sqlUpdate("select group_id, sum(amt) from index_order_master group by group_id");
blinkStreamTableEnv.execute("rt_aggr_master_flink");
As you can see, without Hive partitioned temporal tables even a simple join involves a Kafka DataStream program and a join inside a map operator, and if the partition data is large you also need async I/O to avoid blocking; the amount of code and the maintenance cost both grow.
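For reference, a dimension lookup done with Flink's async I/O looks roughly like the sketch below. Only the RichAsyncFunction / AsyncDataStream wiring is the real Flink API; HiveLookupClient and its queryAsync method are hypothetical placeholders for whatever non-blocking client you would actually use, and IndexBean again stands for the dimension record type.
// Minimal async-I/O sketch. HiveLookupClient / queryAsync are hypothetical placeholders;
// RichAsyncFunction, ResultFuture and AsyncDataStream are the actual Flink classes.
public class AsyncDimLookup extends RichAsyncFunction<MasterBean, Tuple2<MasterBean, IndexBean>> {

    private transient HiveLookupClient client; // hypothetical non-blocking lookup client

    @Override
    public void open(Configuration parameters) {
        client = new HiveLookupClient();
    }

    @Override
    public void asyncInvoke(MasterBean master, ResultFuture<Tuple2<MasterBean, IndexBean>> resultFuture) {
        // queryAsync is assumed to return a CompletableFuture of the dimension record
        client.queryAsync(master.getReportDate(), master.getGroupID().concat(master.getShopID()))
              .thenAccept(dim -> resultFuture.complete(Collections.singleton(new Tuple2<>(master, dim))));
    }

    @Override
    public void timeout(MasterBean master, ResultFuture<Tuple2<MasterBean, IndexBean>> resultFuture) {
        // Drop the record on timeout instead of failing the job; adjust to your own error handling.
        resultFuture.complete(Collections.emptyList());
    }
}

// Wiring it in: ordered results, 30 s timeout, at most 100 in-flight requests.
DataStream<Tuple2<MasterBean, IndexBean>> joined =
        AsyncDataStream.orderedWait(masterDataStream, new AsyncDimLookup(), 30, TimeUnit.SECONDS, 100);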
Flink 1.12 dependencies and caveats
- In 1.12, flink-connector-kafka_2.11 replaces the earlier flink-connector-kafka-0.10[0.11]_2.11; the Kafka version no longer needs to appear in the artifactId.
- In the Kafka DDL the connector name no longer carries a version either:
1.11: 'connector' = 'kafka-0.10'
1.12: 'connector' = 'kafka'
Maven pom.xml excerpt (properties and dependencies):
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <flink.version>1.12.0</flink.version>
    <scala.binary.version>2.11</scala.binary.version>
    <scala.version>2.11.8</scala.version>
    <mysql.version>5.1.39</mysql.version>
    <hive.version>3.1.0</hive.version>       <!-- property name inferred; the original only listed the value -->
    <hadoop.version>2.7.1</hadoop.version>   <!-- property name inferred -->
    <fastjson.version>1.2.22</fastjson.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-core</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-statebackend-rocksdb_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
        <exclusions>
            <exclusion>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka-clients</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>${fastjson.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.21</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-sql-connector-kafka_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-json</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-hive_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
Flink 1.12: the convenience of Hive partitioned temporal tables plus Kafka for unified stream-batch processing
With the 1.12 partitioned temporal table feature, the join between a real-time stream and a Hive partitioned table can be expressed entirely in SQL. Flink automatically watches for the newest Hive partition, loads it into the TaskManagers, and always keeps only that latest snapshot while discarding the previous one, so TaskManager memory does not grow without bound. No hand-written DataStream program is needed for the stream-batch join.
(The figure here is taken from Alibaba engineer 雪尽's slides at FFA 2020.)
Parameter reference
streaming-source.enable
Enables streaming reads of the Hive table.
streaming-source.partition.include
'all' (the default) reads every partition; 'latest' reads only the newest partition and can only be used in a temporal join, where the newest partition serves as the dimension table. It cannot be used to simply read the latest partition's data on its own.
streaming-source.monitor-interval
How often to check for newly created partitions. Do not set it too short; the minimum is 1 hour, because in the current implementation every task queries the metastore, and high-frequency polling can put excessive pressure on it.
streaming-source.partition-order
The partition ordering strategy. There are three options, of which partition-name is the most recommended:
1. partition-name: load the latest partition using the default partition-name order
2. create-time: order by the partition file creation time
3. partition-time: order by the partition time
Notes
Before using a temporal table, set table.dynamic-table-options.enabled to true. This turns on SQL hints, so table options can be supplied inline with /*+ OPTIONS(...) */.
For example:
SELECT *
FROM hive_table
/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.consume-start-offset'='2020-05-20') */;
SQL Client:
set table.dynamic-table-options.enabled=true;
In code:
tblEnv.getConfig().getConfiguration().setString("table.dynamic-table-options.enabled", "true");
Example from the official documentation (a sketch of that pattern follows below).
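The official documentation configures these options once in the Hive table's own TBLPROPERTIES using the Hive dialect, instead of repeating SQL hints in every query. A rough sketch of that pattern; the table and column names here are illustrative and not from this article:
// Sketch only: bake the streaming-source options into the Hive table via TBLPROPERTIES (Hive dialect),
// so queries against it no longer need per-query hints. dim_demo and its columns are illustrative.
tblEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
tblEnv.executeSql(
        "CREATE TABLE dim_demo (" +
        "  id STRING," +
        "  name STRING" +
        ") PARTITIONED BY (dt STRING) TBLPROPERTIES (" +
        "  'streaming-source.enable' = 'true'," +
        "  'streaming-source.partition.include' = 'latest'," +
        "  'streaming-source.monitor-interval' = '1 h'," +
        "  'streaming-source.partition-order' = 'partition-name'" +
        ")");
tblEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);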
Partitioned temporal table code demo
Registering a HiveCatalog
Flink SQL Client
vim conf/sql-client-defaults.yaml
catalogs:
  - name: hive_catalog
    type: hive
    hive-conf-dir: /disk0/soft/hive-conf/   # this directory must contain hive-site.xml
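The same catalog can also be registered programmatically via the HiveCatalog class from flink-connector-hive; a minimal sketch is below (the default database name flink_db is an assumption mirroring the table paths used in this demo):
// Register the catalog from code instead of sql-client-defaults.yaml.
// "flink_db" as the default database is an assumption based on the table paths in this demo.
HiveCatalog hiveCatalog = new HiveCatalog("hive_catalog", "flink_db", "/disk0/soft/hive-conf/");
tblEnv.registerCatalog("hive_catalog", hiveCatalog);
tblEnv.useCatalog("hive_catalog");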
Kafka table DDL
CREATE TABLE hive_catalog.flink_db.kfk_fact_bill_master_12 (
master ROW<...>,      -- nested schema omitted
foodLst ARRAY<...>,   -- nested schema omitted
proctime as PROCTIME() -- PROCTIME is needed for the temporal join with the Hive table
) WITH (
'connector' = 'kafka',
'topic' = 'topic_name',
'format' = 'json',
'properties.bootstrap.servers' = 'host:9092',
'properties.group.id' = 'flinkTestGroup',
'scan.startup.mode' = 'timestamp',
'scan.startup.timestamp-millis' = '1607844694000'
);
Joining the latest Hive partition with the Flink stream
CREATE VIEW IF NOT EXISTS hive_catalog.flink_db.view_fact_bill_master as
SELECT * FROM
(select t1.*, t2.group_id, t2.shop_id, t2.group_name, t2.shop_name, t2.brand_id,
ROW_NUMBER() OVER (PARTITION BY groupID, shopID, orderKey ORDER BY actionTime desc) rn
from hive_catalog.flink_db.kfk_fact_bill_master_12 t1
JOIN hive_catalog.flink_db.dim_extend_shop_info
/*+ OPTIONS('streaming-source.enable'='true', 'streaming-source.partition.include' = 'latest',
'streaming-source.monitor-interval' = '1 h','streaming-source.partition-order' = 'partition-name') */ FOR SYSTEM_TIME AS OF t1.proctime AS t2 -- temporal table
ON t1.groupID = t2.group_id and t1.shopID = t2.shop_id
where groupID in (202042)) t where t.rn = 1;
Sink the results to MySQL
CREATE TABLE hive_catalog.flink_db_sink.rt_aggr_bill_food_unit_rollup_flk (
report_date String,
group_id int,
group_name String,
shop_id int,
shop_name String,
brand_id BIGINT,
brand_name String,
province_name String,
city_name String,
foodcategory_name String,
food_name String,
food_code String,
unit String,
rt_food_unit_cnt double,
rt_food_unit_amt double,
rt_food_unit_real_amt double,
PRIMARY KEY (report_date, group_id, shop_id, brand_id, foodcategory_name, food_name, food_code, unit) NOT ENFORCED) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://host:4400/db_name?autoReconnect=true&useSSL=false',
'table-name' = 'table-name',
'username' = 'username',
'password' = 'password'
);
insert into hive_catalog.flink_db_sink.rt_aggr_bill_food_unit_rollup_flk
select reportDate, group_id, group_name, shop_id, shop_name, brand_id, brand_name, province_name, city_name, foodCategoryName, foodName, foodCode, unit
, SUM(foodNumber) rt_food_unit_cnt
, sum(foodPriceAmount) rt_food_unit_amt
, sum(foodRealAmount) rt_food_unit_real_amt
from hive_catalog.flink_db.view_fact_bill_master
group by reportDate, group_id, group_name, shop_id, shop_name, brand_id, brand_name, province_name, city_name, foodCategoryName, foodName, foodCode, unit;
Flink program: with the HiveCatalog, the program reuses the source and sink tables already created through the SQL Client, so the code only has to care about the business logic (insert into mysql_sink from kafka table join hive) rather than creating source and sink tables. This improves code reuse and readability and reduces maintenance cost.
StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings blinkStreamSettings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inStreamingMode()
        .build();
StreamTableEnvironment tblEnv = StreamTableEnvironment.create(bsEnv, blinkStreamSettings);
FlinkEnvUtils.initTblEnv(tblEnv);
tblEnv.getConfig().getConfiguration().setString("table.dynamic-table-options.enabled", "true");
// create the hive catalog
tblEnv.executeSql("CREATE CATALOG hive_catalog WITH (\n" +
" 'type'='hive',\n" +
" 'hive-conf-dir'='/disk0/soft/hive-conf/'" + //该目录需要包含hive-site.xml文件
")");
tblEnv.createTemporaryFunction("boomFunction", ExplodeArray.class);
tblEnv.executeSql("insert into hive_catalog.flink_db_sink.rt_aggr_bill_food_unit_rollup_flk\n" +
"select reportDate, group_id, group_name, shop_id, shop_name, brand_id, brand_name, province_name, city_name, foodCategoryName, foodName, foodCode, unit\n" +
" , SUM(foodNumber) rt_food_unit_cnt\n" +
" , sum(foodPriceAmount) rt_food_unit_amt\n" +
" , sum(foodRealAmount) rt_food_unit_real_amt\n" +
" from hive_catalog.flink_db.view_fact_bill_master\n" +
" group by reportDate, group_id, group_name, shop_id, shop_name, brand_id, brand_name, province_name, city_name");
ORC read bug
When the partitioned temporal table reads ORC-formatted data, it never returns any rows. I filed a JIRA with the community: https://issues.apache.org/jira/browse/FLINK-20576?filter=-2
Reading parquet and text formats works fine.
Flink SQL development tips
- Use a Hive catalog to persist sources and sinks; this improves reusability and lets the program focus purely on the logic SQL.
- Use Flink views to factor out shared logic and improve reusability.
- Debug SQL in the SQL Client first, and only package and deploy the job once it works, rather than testing by submitting the program to the cluster.