Real-Time ETL Pipeline: Kafka->Flink->Hive

  • Data structures
    • Kafka data structure
    • Hive table structure
    • Flink processing logic and source code
  • Job execution modes
    • DolphinScheduler on YARN
    • yarn-session
  • Scheduled maintenance jobs
    • Merging small Hive partition files
    • Scheduled cleanup of DolphinScheduler disk usage
    • Scheduled Presto memory release

Data structures

Kafka data structure

As covered in the previous two chapters, Event Tracking Design and SDK Source Code and Data Collection and Validation, we use Filebeat to ship container logs to Kafka and use Kafka Eagle to inspect the Kafka data.


After pretty-printing the JSON (e.g. with bejson.com or json.cn), we get the structure below, which shows that the actual log payload lives in the message.request field.

{
    "@timestamp": "2022-01-20T03:10:55.155Z",
    "@metadata": {
        "beat": "filebeat",
        "type": "_doc",
        "version": "7.9.3"
    },
    "ecs": {
        "version": "1.5.0"
    },
    "host": {
        "name": "df6d1b047497"
    },
    "agent": {
        "type": "filebeat",
        "version": "7.9.3",
        "hostname": "df6d1b047497",
        "ephemeral_id": "c711a4c8-904a-4dfe-9696-9b54f9dde4a9",
        "id": "d4671f09-1ec3-4bfd-bb6d-1a08761926f9",
        "name": "df6d1b047497"
    },
    "log": {
        "offset": 196240916,
        "file": {
            "path": "/var/lib/docker/containers/36847a16c7c8e029744475172847cd14dda0dc28d1c33199df9f8c443e2798ee/36847a16c7c8e029744475172847cd14dda0dc28d1c33199df9f8c443e2798ee-json.log"
        }
    },
    "stream": "stderr",
    "message": {
        "msg": "event",
        "request": "{\"agent\":\"python-requests/2.25.1\",\"event\":\"enter_party_group\",\"game_id\":10,\"ip\":\"192.168.90.90\",\"properties\":{\"#data_source\":\"来源3\",\"#os\":\"ios\",\"#vp@compared_with_now\":1,\"#vp@cost_channel\":\"渠道3\",\"#vp@revenue_amount\":\"9293\",\"#zone_offset\":\"694.0263211183825\",\"$city\":\"\",\"$country\":\"美国\",\"$iso\":\"US\",\"$latitude\":38.8868,\"$location_timezone\":\"America/Chicago\",\"$longitude\":-94.8223,\"$province\":\"\",\"channel\":\"渠道1\",\"group_id\":\"536\",\"life_time\":47},\"time\":1642644975070,\"timestamp\":\"1642648255126\",\"timezone\":\"Asia/Shanghai\",\"type\":\"action\",\"uid\":\"uid_44\",\"uid_type\":\"0\"}",
        "type": "hdfs",
        "app": "sdk_event",
        "level": "info",
        "ts": 1.6426482551549864e+09,
        "caller": "log/logger.go:71"
    },
    "input": {
        "type": "docker"
    }
}

The final event payload looks like this:

{
    "agent": "python-requests/2.25.1",
    "event": "enter_party_group",
    "game_id": 10,
    "ip": "192.168.90.90",
    "properties": {
        "#data_source": "来源3",
        "#os": "ios",
        "#vp@compared_with_now": 1,
        "#vp@cost_channel": "渠道3",
        "#vp@revenue_amount": "9293",
        "#zone_offset": "694.0263211183825",
        "$city": "",
        "$country": "美国",
        "$iso": "US",
        "$latitude": 38.8868,
        "$location_timezone": "America/Chicago",
        "$longitude": -94.8223,
        "$province": "",
        "channel": "渠道1",
        "group_id": "536",
        "life_time": 47
    },
    "time": 1642644975070,
    "timestamp": "1642648255126",
    "timezone": "Asia/Shanghai",
    "type": "action",
    "uid": "uid_44",
    "uid_type": "0"
}

Hive table structure

hive
show create table action;    (this is the table name; mine is called action)

The table's input/output format is ORC, which Presto has optimized query support for; if you query with Impala instead, use Parquet.

CREATE TABLE `action`(                             
   `uid` string,                                    
   `uid_type` string,                               
   `agent` string,                                  
   `ip` string,                                     
   `timestamp` timestamp,                           
   `time` timestamp,                                
   `year` string,                                   
   `month` string,                                  
   `week` string,                                   
   `hour` string,                                   
   `minute` string,                                 
   `properties` map<string,string>)
 PARTITIONED BY (                                   
   `game_id` int,                                   
   `timezone` string,                               
   `event` string,                                  
   `day` date)                                      
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'      
 WITH SERDEPROPERTIES (                             
   'colelction.delim'=',',                          
   'field.delim'='\t',                              
   'mapkey.delim'=':',                              
   'serialization.format'='\t')                     
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'  
 OUTPUTFORMAT                                      
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 
 LOCATION                                           
   'hdfs://slaves01:8020/warehouse/tablespace/managed/hive/event.db/action' 
 TBLPROPERTIES (                                   
   'auto-compaction'='true',                       
   'bucketing_version'='2',                        
   'compaction.file-size'='128MB',                 
   'sink.partition-commit.delay'='0s',             
   'sink.partition-commit.policy.kind'='metastore,success-file',  
   'sink.partition-commit.trigger'='process-time',  
   'sink.shuffle-by-partition.enable'='true',       
   'transient_lastDdlTime'='1642571371')

Flink processing logic and source code

From the business data to Hive, the first step is parsing the JSON and splitting the time field into year, day, and the other time-partition fields according to the event's time zone. The code is shown below.
JSON parsing

public class JsonParser {
    public static FileHdfsBean ParseHdfs(String jsonStr){
        FileHdfsBean bean = null;
        try {
            JSONObject jsonObject = JSON.parseObject(jsonStr).getJSONObject("message").getJSONObject("request");
            JSONObject properties = jsonObject.getJSONObject("properties");
            String deviceId = jsonObject.getString("device_no");
            String deviceType = jsonObject.getString("device_type");
            String uid = deviceId == null ? jsonObject.getString("uid") : deviceId;
            String uid_type = deviceId == null ? jsonObject.getString("uid_type") : deviceType;
            String agent = jsonObject.getString("agent");
            String ip = jsonObject.getString("ip");
            String event = jsonObject.getString("event");
            String timestamp = jsonObject.getString("timestamp");
            String time = jsonObject.getString("time");
            int game_id = jsonObject.getIntValue("game_id");
            String timezone = jsonObject.getString("timezone");
            Map<String, String> map = JSONObject.parseObject(properties.toJSONString(), new TypeReference<Map<String, String>>() {
            });
            bean = new FileHdfsBean(event, uid, uid_type, agent, ip, timestamp, time, game_id, timezone);
            bean.setMap2Str(map);
        } catch (Exception e) {
            System.out.println("ParseHdfs Exception:" + e.getMessage());
        }
        return bean;
    }
}

Time splitting

    private void setYearMonthWeekDayHourMin(String timestamp, String time) {
        long t = Long.parseLong(timestamp);
        long t2 = Long.parseLong(time);
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        sdf.setTimeZone(TimeZone.getTimeZone(this.timezone));
        String timestamp2 = sdf.format(t);
        String dateAndTime = sdf.format(t2);
        this.year = dateAndTime.substring(0,4);
        this.month = dateAndTime.substring(0,7);
        this.day = Date.valueOf(dateAndTime.substring(0,10));
        this.week = day +":"+ dateAndTime.substring(11,13) + getWeekOfMonth(dateAndTime.substring(0,10)) ;
        this.hour = dateAndTime.substring(0,13);
        this.minute = dateAndTime.substring(0,16);
        this.timestamp = Timestamp.valueOf(timestamp2);
        this.time = Timestamp.valueOf(dateAndTime);
    }
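
One piece the snippets above rely on but do not show is the FileHdfsBean class. Below is a minimal sketch of what it plausibly looks like, reconstructed from the fields referenced by JsonParser, the time-splitting method, and the Hive schema. setMap2Str and getWeekOfMonth are assumptions about the helpers called above: setMap2Str is assumed to flatten the properties map into a "k1:v1&k2:v2" string, because the sink SQL later rebuilds the map with str_to_map(properties,'&',':'). Combined with the setYearMonthWeekDayHourMin method shown above, this forms the complete POJO.

import java.sql.Date;
import java.sql.Timestamp;
import java.util.Calendar;
import java.util.Map;

public class FileHdfsBean {
    // Column fields; public fields (or getters/setters) plus a no-arg constructor let Flink map the POJO to table columns
    public String uid;
    public String uid_type;
    public String agent;
    public String ip;
    public Timestamp timestamp;
    public Timestamp time;
    public String year;
    public String month;
    public String week;
    public String hour;
    public String minute;
    public String properties; // flattened "k1:v1&k2:v2" string, rebuilt in Hive via str_to_map
    // Partition fields
    public int game_id;
    public String timezone;
    public String event;
    public Date day;

    public FileHdfsBean() {
        // required for Flink's POJO type extraction
    }

    public FileHdfsBean(String event, String uid, String uid_type, String agent, String ip,
                        String timestamp, String time, int game_id, String timezone) {
        this.event = event;
        this.uid = uid;
        this.uid_type = uid_type;
        this.agent = agent;
        this.ip = ip;
        this.game_id = game_id;
        this.timezone = timezone; // must be set before the time fields are derived
        setYearMonthWeekDayHourMin(timestamp, time); // the method shown above
    }

    // Flatten the properties map into a '&'/':' delimited string for the temporary table
    public void setMap2Str(Map<String, String> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append(':').append(e.getValue());
        }
        this.properties = sb.toString();
    }

    // Week-of-month helper used when building the `week` field
    private int getWeekOfMonth(String yyyyMMdd) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(Date.valueOf(yyyyMMdd));
        return cal.get(Calendar.WEEK_OF_MONTH);
    }

    // setYearMonthWeekDayHourMin(...) is the method shown above
}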

sink2hive

    private void sink2hive(StreamTableEnvironment tableEnv, DataStreamSource<String> sourceStream) {
        System.out.println("save2hive...");
        Configuration configuration = tableEnv.getConfig().getConfiguration();
        // true: fall back to the MapReduce reader; false: use Flink's native reader. Set on tableEnv, this applies to all sinks.
        configuration.setString("table.exec.hive.fallback-mapred-reader", "false");
        // Create the Hive catalog
        String name = "kafka2hive";
        String defaultDatabase = "default";
        HiveCatalog hiveCatalog = new HiveCatalog(name, defaultDatabase, HIVE_CONF_DIR);
        System.out.println("Registering Hive catalog...");
        // Register the Hive catalog
        tableEnv.registerCatalog(name, hiveCatalog);
        tableEnv.useCatalog(name);
        System.out.println("Parsing fields into the bean class...");
        // Parse fields and wrap them in the bean class
        SingleOutputStreamOperator<FileHdfsBean> beanStream = sourceStream
                .map(JsonParser::ParseHdfs)
                .filter(Objects::nonNull);
        // Register beanStream as a temporary view
        System.out.println("Creating temporary view...");
        tableEnv.executeSql("drop table if exists tmpTable");
        tableEnv.createTemporaryView("tmpTable", beanStream);
        System.out.println("Creating Hive table...");
        // Switch to the Hive SQL dialect
        tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        // Create the Hive table
        String table = HIVE_DB_NAME + "." + HIVE_TABLE_NAME;
        // tableEnv.executeSql("drop table if exists " + table);
        tableEnv.executeSql("create database IF NOT EXISTS " + HIVE_DB_NAME);
        tableEnv.executeSql("CREATE TABLE IF NOT EXISTS " + table + "\n" +
                "(uid string, uid_type string, agent string, ip string,\n" +
                "`timestamp` timestamp,`time` timestamp,`year` string,`month` string,`week` string,`hour` string, `minute` string,properties map<string,string>)\n" +
                "PARTITIONED BY(game_id int,timezone string,event string,day date)\n" +
                "ROW FORMAT DELIMITED\n" +
                "FIELDS TERMINATED BY '\\t'\n" +
                "COLLECTION ITEMS TERMINATED BY ','\n" +
                "MAP KEYS TERMINATED BY ':'\n" +
                "stored as orc TBLPROPERTIES (\n" +
                "'sink.partition-commit.trigger'='process-time',\n" +
                "'sink.partition-commit.delay'='0s',\n" +
                "'sink.partition-commit.policy.kind'='metastore,success-file',\n" +
                "'sink.shuffle-by-partition.enable'='true',\n" +
                "'auto-compaction'='true',\n" +
                "'compaction.file-size'='128MB'\n" +
                ")"
        );
        // Switch back to the default dialect
        tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        System.out.println("Inserting into the Hive table...");
        // Insert data into the Hive table
        tableEnv.executeSql("INSERT INTO  " + table + "\n" +
                "SELECT uid, uid_type, agent, ip,`timestamp`,`time`,`year`,`month`,`week`,`hour`,`minute`," +
                "str_to_map(properties,'&',':') as properties,game_id ,timezone ,event ,`day` from tmpTable");
    }
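
The post does not show the job entry point that feeds sink2hive. The following is a minimal sketch of how the pieces could be wired together, assuming Flink 1.12/1.13 with the universal Kafka connector and the Hive connector on the classpath. The Kafka topic, brokers, consumer group, checkpoint interval, and HIVE_CONF_DIR path are placeholders, not values from the original setup; the database and table names follow the DDL above, and the sink2hive stub stands for the method shown above.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import java.util.Properties;

public class Kafka2HiveJob {
    // Placeholder constants -- substitute the values for your own environment
    private static final String HIVE_CONF_DIR = "/etc/hive/conf";
    private static final String HIVE_DB_NAME = "event";
    private static final String HIVE_TABLE_NAME = "action";

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is required for the Hive streaming sink to commit partitions
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Kafka consumer reading the raw Filebeat JSON lines (topic/brokers/group are placeholders)
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka01:9092");
        props.setProperty("group.id", "kafka2hive");
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("sdk_event", new SimpleStringSchema(), props);
        consumer.setStartFromGroupOffsets();

        DataStreamSource<String> sourceStream = env.addSource(consumer);
        new Kafka2HiveJob().sink2hive(tableEnv, sourceStream);
        // The INSERT submitted via tableEnv.executeSql() inside sink2hive launches the streaming job,
        // so no separate env.execute() call is made here.
    }

    private void sink2hive(StreamTableEnvironment tableEnv, DataStreamSource<String> sourceStream) {
        // body as shown above
    }
}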

Job execution modes

DolphinScheduler on YARN

Each submission creates a new Flink cluster on YARN, so jobs are fully isolated from one another and easy to manage. The cluster is torn down once the job finishes. (This mode is recommended.)

Create the workflow


View the ApplicationMaster in YARN


View the Flink job details


yarn-session

This mode initializes a Flink cluster in YARN with a fixed amount of resources, and all subsequent jobs are submitted to it. The Flink cluster stays resident in YARN until it is stopped manually.

# Start command
$FLINK_HOME/bin/yarn-session.sh -tm xxx  -s xx ...  (specify the required resources)

Once it starts, a long-running application named "Flink session cluster" appears in YARN; click it to find the ApplicationMaster address.



Submit a job


View jobs and logs


Scheduled maintenance jobs

Merging small Hive partition files

When building a data warehouse on Hive, ETL jobs are usually run with high parallelism to speed them up, and whether the job is MapReduce, Spark, or Flink, this produces many small files when writing to Hive tables. A small file here means one smaller than the configured HDFS block size (128 MB by default). Reading and writing a large number of small files is far slower than reading and writing a few large ones, because the frequent NameNode interactions lengthen its processing queue and GC pauses, adding latency. As small files accumulate, they seriously degrade NameNode performance and limit cluster scalability. (A daily scheduled merge is recommended.)

#!/bin/bash
a='uid,uid_type,agent,ip,`timestamp`,`time`,`year`,`month`,`week`,`hour`,`minute`,properties'
for((i=2;i>1;i--));
do
 if [ $# -lt 2 ]
 then
    date=$(date -d"-$i day" +%Y-%m-%d)
 else
    date=$2
 fi
 table=$1
 echo "$date"
 echo "set hive.merge.mapfiles = true;" >> /root/${table}-${date}.sql
 echo "set hive.merge.mapredfiles = true;" >> /root/${table}-${date}.sql
 echo "set hive.merge.tezfiles = true;" >> /root/${table}-${date}.sql
 echo "set hive.merge.size.per.task = 256000000;" >> /root/${table}-${date}.sql
 echo "set hive.merge.smallfiles.avgsize = 16000000;" >> /root/${table}-${date}.sql
 hadoop fs -ls -R  /warehouse/tablespace/managed/hive/event.db/${table}/ | awk '{print $8}'| awk -F "/${table}/" '{print $2}'| grep day=${date}$ | while read line
 do
   array=(`echo ${line} | tr '/' ' '`)
   for var in ${array[@]}
   do
      arr=(`echo ${var}|tr '=' ' '`)
      case ${arr[0]} in
        "game_id") game_id=${arr[1]}
        ;;
        "timezone") timezone=$(echo "${arr[1]//%2F//}")
        ;;
        "event") event=${arr[1]}
        ;;
        "day") day=${arr[1]}
        ;;
        *) echo 'unexpected partition key!'
        ;;
      esac
   done
   echo "output: game_id=${game_id},timezone=${timezone},event=${event},day=${day}"
   ## First copy the small-file data into a temporary partition
   echo "insert overwrite table event.${table} partition (game_id='${game_id}',timezone='${timezone}',event='${event}',day='1970-01-01') select $a from event.${table} where game_id=${game_id} and timezone='${timezone}' and event='${event}' and day=cast('${day}' as date);" >> /root/${table}-${date}.sql
   ## Drop the original small-file partition
   echo "alter table event.${table} drop if exists partition(game_id='${game_id}',timezone='${timezone}',event='${event}',day='${day}');" >> /root/${table}-${date}.sql
   ## Rename the temporary partition to the original partition name
   echo "alter table event.${table} partition (game_id='${game_id}',timezone='${timezone}',event='${event}',day='1970-01-01') rename to partition(game_id='${game_id}',timezone='${timezone}',event='${event}',day='${day}');" >> /root/${table}-${date}.sql
 done
 ## Run the generated SQL file that merges the small files
 hive -f /root/${table}-${date}.sql > /dev/null 2>&1
 echo "partition merge for ${table} on ${date} finished"
 ## Remove the generated SQL file
 rm -rf /root/${table}-${date}.sql
done

Run locally
bash xx.sh xx   (pass in the name of the table to merge)

Run in DolphinScheduler


Scheduled cleanup of DolphinScheduler disk usage

Scheduled SQL cleanup of workflow instances
As scheduled jobs accumulate, the workflow instance table keeps growing; without periodic cleanup the DolphinScheduler pages can become unresponsive or crash.
The SQL below deletes completed workflow instances, keeping only the last 3 days. (A daily schedule is recommended.)

delete from t_ds_process_instance where state != 1 and date(end_time) < DATE_SUB(CURDATE(), INTERVAL 3 DAY)

Scheduled cleanup of logs and resources
A long-running DolphinScheduler produces a large volume of logs and jar resources from executed tasks; if they are never deleted, they consume a huge amount of disk space and can eventually bring down the whole big-data cluster. (A daily schedule is recommended.)

#!/bin/bash

# Validate IP address format
function isValidIp() {
  local ip=$1
  local ret=1

  if [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
    ip=(${ip//\./ }) # split on '.' into an array for the range checks below
    [[ ${ip[0]} -le 255 && ${ip[1]} -le 255 && ${ip[2]} -le 255 && ${ip[3]} -le 255 ]]
    ret=$?
  fi

  return $ret
}

ips=$1;
dolphin_base_dir=$2;

# Replace the comma separator with spaces
ip_array=(`echo ${ips} | tr ',' ' '`);


# Iterate over the worker addresses, ssh to each one, and clean up old resources and logs
echo "ip count: ${#ip_array[@]}";
for ip in ${ip_array[@]}; do
    isValidIp $ip
    if [ ! $? == 0 ]; then
        echo "Warning: param ${ip} is not a valid ip!"
        continue
    else 
        # Delete old jar resources
        ssh -o "StrictHostKeyChecking no" ${ip} "find ${dolphin_base_dir}/exec/process -mindepth 3 -maxdepth 3 -name '[0-9]*' -mtime +1 -type d | xargs rm -rf;exit";
        # Delete old worker logs
        ssh -o "StrictHostKeyChecking no" ${ip} 'rm -rf /usr/hdp/current/dolphinscheduler/logs/dolphinscheduler-worker.2*.log'
        echo "${ip} has been cleaned up!"
    fi;
done;

Scheduled Presto memory release

If Presto has been running for a long time and its memory is not released but keeps growing, you can trigger a GC on a schedule (e.g. hourly) to reclaim it.

#!/bin/bash

# Validate IP address format
function isValidIp() {
  local ip=$1
  local ret=1

  if [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
    ip=(${ip//\./ }) # split on '.' into an array for the range checks below
    [[ ${ip[0]} -le 255 && ${ip[1]} -le 255 && ${ip[2]} -le 255 && ${ip[3]} -le 255 ]]
    ret=$?
  fi

  return $ret
}

ips=$1;

# Replace the comma separator with spaces
ip_array=(`echo ${ips} | tr ',' ' '`);


# Iterate over the worker addresses, ssh to each one, and manually trigger a Full GC
echo "ip count: ${#ip_array[@]}";
for ip in ${ip_array[@]}; do
    isValidIp $ip
    if [ ! $? == 0 ]; then
        echo "Warning: param ${ip} is not a valid ip!"
        continue
    else 
        ssh -o "StrictHostKeyChecking no" ${ip} 'jmap -histo:live `jps | grep PrestoServer|cut -d " " -f 1` > /dev/null;exit;'
        echo "Full GC triggered on presto worker ${ip}!"
    fi;
done;

Series

Part 1: Automated Deployment with Ambari
Part 2: Event Tracking Design and SDK Source Code
Part 3: Data Collection and Validation
Part 4: Real-Time ETL Pipeline: Kafka->Flink->Hive
Part 5: ETL User Data Processing: Kafka->Spark->Kudu
Part 6: Presto Analytics Model SQL and UDF Functions
