- Data structures
- Kafka data structure
- Hive table schema
- Flink processing logic and source code
- Task run modes
- dolphin on yarn
- yarn-session
- Scheduled optimization tasks
- Hive small-file partition merging
- Scheduled cleanup of DolphinScheduler disk usage
- Scheduled Presto memory release
Data structures
Kafka data structure
As covered in the previous two chapters, Event Tracking Design and SDK Source Code and Data Collection and Validation, we use Filebeat to ship container logs into Kafka and kafka-eagle to browse the Kafka data.
After pretty-printing the JSON (with bejson.com or json.cn, for example), we get the structure below; the actual event payload is the JSON string in the message.request field.
{
  "@timestamp": "2022-01-20T03:10:55.155Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "7.9.3"
  },
  "ecs": {
    "version": "1.5.0"
  },
  "host": {
    "name": "df6d1b047497"
  },
  "agent": {
    "type": "filebeat",
    "version": "7.9.3",
    "hostname": "df6d1b047497",
    "ephemeral_id": "c711a4c8-904a-4dfe-9696-9b54f9dde4a9",
    "id": "d4671f09-1ec3-4bfd-bb6d-1a08761926f9",
    "name": "df6d1b047497"
  },
  "log": {
    "offset": 196240916,
    "file": {
      "path": "/var/lib/docker/containers/36847a16c7c8e029744475172847cd14dda0dc28d1c33199df9f8c443e2798ee/36847a16c7c8e029744475172847cd14dda0dc28d1c33199df9f8c443e2798ee-json.log"
    }
  },
  "stream": "stderr",
  "message": {
    "msg": "event",
    "request": "{\"agent\":\"python-requests/2.25.1\",\"event\":\"enter_party_group\",\"game_id\":10,\"ip\":\"192.168.90.90\",\"properties\":{\"#data_source\":\"来源3\",\"#os\":\"ios\",\"#vp@compared_with_now\":1,\"#vp@cost_channel\":\"渠道3\",\"#vp@revenue_amount\":\"9293\",\"#zone_offset\":\"694.0263211183825\",\"$city\":\"\",\"$country\":\"美国\",\"$iso\":\"US\",\"$latitude\":38.8868,\"$location_timezone\":\"America/Chicago\",\"$longitude\":-94.8223,\"$province\":\"\",\"channel\":\"渠道1\",\"group_id\":\"536\",\"life_time\":47},\"time\":1642644975070,\"timestamp\":\"1642648255126\",\"timezone\":\"Asia/Shanghai\",\"type\":\"action\",\"uid\":\"uid_44\",\"uid_type\":\"0\"}",
    "type": "hdfs",
    "app": "sdk_event",
    "level": "info",
    "ts": 1.6426482551549864e+09,
    "caller": "log/logger.go:71"
  },
  "input": {
    "type": "docker"
  }
}
The final, unwrapped event looks like this:
{
  "agent": "python-requests/2.25.1",
  "event": "enter_party_group",
  "game_id": 10,
  "ip": "192.168.90.90",
  "properties": {
    "#data_source": "来源3",
    "#os": "ios",
    "#vp@compared_with_now": 1,
    "#vp@cost_channel": "渠道3",
    "#vp@revenue_amount": "9293",
    "#zone_offset": "694.0263211183825",
    "$city": "",
    "$country": "美国",
    "$iso": "US",
    "$latitude": 38.8868,
    "$location_timezone": "America/Chicago",
    "$longitude": -94.8223,
    "$province": "",
    "channel": "渠道1",
    "group_id": "536",
    "life_time": 47
  },
  "time": 1642644975070,
  "timestamp": "1642648255126",
  "timezone": "Asia/Shanghai",
  "type": "action",
  "uid": "uid_44",
  "uid_type": "0"
}
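To sanity-check the pipeline end to end, it helps to peek at a few raw Kafka records and unwrap message.request directly on the command line. A minimal sketch with the Kafka console consumer and jq; the topic name and broker address are assumptions, substitute your own:

# Read a few records from the topic and parse the embedded request string
kafka-console-consumer.sh \
  --bootstrap-server 192.168.90.10:9092 \
  --topic sdk_event \
  --max-messages 3 \
  | jq '.message.request | fromjson'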
Hive table schema
In the Hive CLI:
hive
show create table action;   -- "action" is the table name used here
The input/output format is ORC, a format Presto has optimized readers for; if you query with Impala instead, use Parquet.
CREATE TABLE `action`(
  `uid` string,
  `uid_type` string,
  `agent` string,
  `ip` string,
  `timestamp` timestamp,
  `time` timestamp,
  `year` string,
  `month` string,
  `week` string,
  `hour` string,
  `minute` string,
  `properties` map<string,string>)
PARTITIONED BY (
  `game_id` int,
  `timezone` string,
  `event` string,
  `day` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'colelction.delim'=',',
  'field.delim'='\t',
  'mapkey.delim'=':',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://slaves01:8020/warehouse/tablespace/managed/hive/event.db/action'
TBLPROPERTIES (
  'auto-compaction'='true',
  'bucketing_version'='2',
  'compaction.file-size'='128MB',
  'sink.partition-commit.delay'='0s',
  'sink.partition-commit.policy.kind'='metastore,success-file',
  'sink.partition-commit.trigger'='process-time',
  'sink.shuffle-by-partition.enable'='true',
  'transient_lastDdlTime'='1642571371')
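Because the table is stored as ORC and partitioned by game_id/timezone/event/day, queries that filter on the partition columns only scan the matching HDFS directories. A quick sketch with the Presto CLI (the coordinator address is an assumption, and the binary may be installed as presto-cli in your setup):

# Count events per type for one game, one timezone and one day; the WHERE
# clause prunes partitions, so only that day's directories are read.
presto --server slaves01:8080 --catalog hive --schema event \
  --execute "SELECT event, count(*) AS pv
             FROM action
             WHERE game_id = 10
               AND timezone = 'Asia/Shanghai'
               AND day = date '2022-01-20'
             GROUP BY event"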
Flink processing logic and source code
From the business data to Hive, the job first parses the JSON, then splits the time field by the event's timezone into year, month, day and the other time-partition fields. Complete code:
JSON parsing
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.TypeReference;

import java.util.Map;

public class JsonParser {
    public static FileHdfsBean ParseHdfs(String jsonStr) {
        FileHdfsBean bean = null;
        try {
            JSONObject jsonObject = JSON.parseObject(jsonStr).getJSONObject("message").getJSONObject("request");
            JSONObject properties = jsonObject.getJSONObject("properties");
            String deviceId = jsonObject.getString("device_no");
            String deviceType = jsonObject.getString("device_type");
            String uid = deviceId == null ? jsonObject.getString("uid") : deviceId;
            String uid_type = deviceId == null ? jsonObject.getString("uid_type") : deviceType;
            String agent = jsonObject.getString("agent");
            String ip = jsonObject.getString("ip");
            String event = jsonObject.getString("event");
            String timestamp = jsonObject.getString("timestamp");
            String time = jsonObject.getString("time");
            int game_id = jsonObject.getIntValue("game_id");
            String timezone = jsonObject.getString("timezone");
            Map<String, String> map = JSONObject.parseObject(properties.toJSONString(), new TypeReference<Map<String, String>>() {});
            // ... the remaining lines (see the complete code) populate the FileHdfsBean with the fields parsed above
        } catch (Exception e) {
            // malformed records leave bean as null; they are dropped by the downstream non-null filter
        }
        return bean;
    }
}
Time splitting
private void setYearMonthWeekDayHourMin(String timestamp, String time) {
    long t = Long.parseLong(timestamp);   // the event's "timestamp" field (ms)
    long t2 = Long.parseLong(time);       // the event's "time" field (ms)
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
    sdf.setTimeZone(TimeZone.getTimeZone(this.timezone));   // format in the event's own timezone
    String timestamp2 = sdf.format(t);
    String dateAndTime = sdf.format(t2);
    // derive the time-partition fields from the formatted event time
    this.year = dateAndTime.substring(0, 4);
    this.month = dateAndTime.substring(0, 7);
    this.day = Date.valueOf(dateAndTime.substring(0, 10));
    this.week = day + ":" + dateAndTime.substring(11, 13) + getWeekOfMonth(dateAndTime.substring(0, 10));
    this.hour = dateAndTime.substring(0, 13);
    this.minute = dateAndTime.substring(0, 16);
    this.timestamp = Timestamp.valueOf(timestamp2);
    this.time = Timestamp.valueOf(dateAndTime);
}
sink2hive
private void sink2hive(StreamTableEnvironment tableEnv, DataStreamSource<String> sourceStream) {
    System.out.println("save2hive...");
    Configuration configuration = tableEnv.getConfig().getConfiguration();
    // true = fall back to the Hive (MapReduce) reader, false = use Flink's native reader;
    // set on the tableEnv, it takes effect for all sinks
    configuration.setString("table.exec.hive.fallback-mapred-reader", "false");
    // Create the Hive catalog
    String name = "kafka2hive";
    String defaultDatabase = "default";
    HiveCatalog hiveCatalog = new HiveCatalog(name, defaultDatabase, HIVE_CONF_DIR);
    System.out.println("Registering the Hive catalog...");
    // Register the Hive catalog
    tableEnv.registerCatalog(name, hiveCatalog);
    tableEnv.useCatalog(name);
    System.out.println("Parsing fields into the bean...");
    // Parse the JSON into FileHdfsBean objects, dropping records that fail to parse
    SingleOutputStreamOperator<FileHdfsBean> beanStream = sourceStream
            .map(JsonParser::ParseHdfs)
            .filter(Objects::nonNull);
    // Turn the bean stream into a temporary view
    System.out.println("Creating the temporary view...");
    tableEnv.executeSql("drop table if exists tmpTable");
    tableEnv.createTemporaryView("tmpTable", beanStream);
    System.out.println("Creating the Hive table...");
    // Switch to the Hive SQL dialect
    tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
    // Create the Hive table
    String table = HIVE_DB_NAME + "." + HIVE_TABLE_NAME;
    // tableEnv.executeSql("drop table if exists " + table);
    tableEnv.executeSql("create database IF NOT EXISTS " + HIVE_DB_NAME);
    tableEnv.executeSql("CREATE TABLE IF NOT EXISTS " + table + "\n" +
            "(uid string, uid_type string, agent string, ip string,\n" +
            "`timestamp` timestamp,`time` timestamp,`year` string,`month` string,`week` string,`hour` string, `minute` string,properties map<string,string>)\n" +
            "PARTITIONED BY(game_id int,timezone string,event string,day date)\n" +
            "ROW FORMAT DELIMITED\n" +
            "FIELDS TERMINATED BY '\\t'\n" +
            "COLLECTION ITEMS TERMINATED BY ','\n" +
            "MAP KEYS TERMINATED BY ':'\n" +
            "stored as orc TBLPROPERTIES (\n" +
            "'sink.partition-commit.trigger'='process-time',\n" +
            "'sink.partition-commit.delay'='0s',\n" +
            "'sink.partition-commit.policy.kind'='metastore,success-file',\n" +
            "'sink.shuffle-by-partition.enable'='true',\n" +
            "'auto-compaction'='true',\n" +
            "'compaction.file-size'='128MB'\n" +
            ")"
    );
    // Switch back to the default dialect
    tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
    System.out.println("Inserting into the Hive table...");
    // Insert into the Hive table; the bean carries properties as a '&'-joined string,
    // which str_to_map converts into a Hive map
    tableEnv.executeSql("INSERT INTO " + table + "\n" +
            "SELECT uid, uid_type, agent, ip,`timestamp`,`time`,`year`,`month`,`week`,`hour`,`minute`," +
            "str_to_map(properties,'&',':') as properties,game_id ,timezone ,event ,`day` from tmpTable");
}
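The 'metastore,success-file' commit policy means that once a partition is committed the sink both registers it in the metastore and drops a _SUCCESS marker into its directory, so a quick HDFS listing is enough to confirm the job is writing. A sketch; the partition values below are examples, and note that Hive URL-encodes the '/' in the timezone value:

# A committed partition holds ORC part files plus a _SUCCESS marker
hdfs dfs -ls "/warehouse/tablespace/managed/hive/event.db/action/game_id=10/timezone=Asia%2FShanghai/event=enter_party_group/day=2022-01-20"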
Task run modes
dolphin on yarn
Each submission creates a brand-new Flink cluster on YARN, so jobs are independent of each other and easy to manage, and the cluster disappears once the job finishes. (This mode is recommended.)
Create the workflow
Locate the ApplicationMaster in the YARN UI
View the Flink job details
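Under the hood this boils down to an ordinary per-job submission against YARN; roughly the command the DolphinScheduler task executes (the jar path, main class and resource sizes are assumptions, and the exact flags vary a little between Flink versions):

# Submit the ETL job as its own YARN application; the cluster is torn down when the job ends
$FLINK_HOME/bin/flink run \
  -m yarn-cluster \
  -ynm kafka2hive \
  -yjm 1024m -ytm 2048m -ys 2 \
  -c com.example.Kafka2Hive \
  /opt/jobs/kafka2hive-1.0.jar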
yarn-session
This mode initializes a Flink cluster inside YARN with a fixed amount of resources; every later job is submitted to it. The cluster stays resident in YARN until it is stopped manually.
# Start the session (set TaskManager memory, slots and other resources)
$FLINK_HOME/bin/yarn-session.sh -tm xxx -s xx ...
After it starts, a long-running "Flink session cluster" application appears in YARN; click into it to find the ApplicationMaster address.
Submit a job
View jobs and logs
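Jobs are then submitted to the running session with a normal flink run, attached to the session's YARN application id. A sketch; the application id, main class and jar path are placeholders:

# Attach to the long-running session and submit the job there
$FLINK_HOME/bin/flink run \
  -yid application_1642571371000_0001 \
  -c com.example.Kafka2Hive \
  /opt/jobs/kafka2hive-1.0.jar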
Scheduled optimization tasks
Hive small-file partition merging
When building a data warehouse on Hive, ETL jobs usually raise parallelism to speed things up, and whether the job is MapReduce, Spark or Flink, writing into Hive this way produces lots of small files. "Small" here means smaller than the HDFS block size (128 MB by default). Reading and writing many small files is far slower than reading and writing a few large ones, because every file adds round trips to the NameNode, whose request queue and GC pauses grow and introduce latency; as small files accumulate they degrade NameNode performance and limit how far the cluster can scale. (Recommended: merge them on a daily schedule.)
#!/bin/bash
a='uid,uid_type,agent,ip,`timestamp`,`time`,`year`,`month`,`week`,`hour`,`minute`,properties'
# single iteration: with no date argument, merge the partition from two days ago
for ((i = 2; i > 1; i--)); do
    if [ $# -lt 2 ]; then
        date=$(date -d"-$i day" +%Y-%m-%d)
    else
        date=$2
    fi
    table=$1
    echo "$date"
    echo "set hive.merge.mapfiles = true;" >> /root/${table}-${date}.sql
    echo "set hive.merge.mapredfiles = true;" >> /root/${table}-${date}.sql
    echo "set hive.merge.tezfiles = true;" >> /root/${table}-${date}.sql
    echo "set hive.merge.size.per.task = 256000000;" >> /root/${table}-${date}.sql
    echo "set hive.merge.smallfiles.avgsize = 16000000;" >> /root/${table}-${date}.sql
    hadoop fs -ls -R /warehouse/tablespace/managed/hive/event.db/${table}/ | awk '{print $8}' | awk -F "/${table}/" '{print $2}' | grep day=${date}$ | while read line
    do
        array=(`echo ${line} | tr '/' ' '`)
        for var in ${array[@]}
        do
            arr=(`echo ${var} | tr '=' ' '`)
            case ${arr[0]} in
                "game_id") game_id=${arr[1]}
                    ;;
                "timezone") timezone=$(echo "${arr[1]//%2F//}")   # decode %2F back to '/'
                    ;;
                "event") event=${arr[1]}
                    ;;
                "day") day=${arr[1]}
                    ;;
                *) echo "unexpected path segment: ${var}"
                    ;;
            esac
        done
        echo "merging: game_id=${game_id},timezone=${timezone},event=${event},day=${day}"
        ## First rewrite the partition's data into a temporary partition (day=1970-01-01)
        echo "insert overwrite table event.${table} partition (game_id='${game_id}',timezone='${timezone}',event='${event}',day='1970-01-01') select $a from event.${table} where game_id=${game_id} and timezone='${timezone}' and event='${event}' and day=cast('${day}' as date);" >> /root/${table}-${date}.sql
        ## Drop the original small-file partition
        echo "alter table event.${table} drop if exists partition(game_id='${game_id}',timezone='${timezone}',event='${event}',day='${day}');" >> /root/${table}-${date}.sql
        ## Rename the temporary partition back to the original day
        echo "alter table event.${table} partition (game_id='${game_id}',timezone='${timezone}',event='${event}',day='1970-01-01') rename to partition(game_id='${game_id}',timezone='${timezone}',event='${event}',day='${day}');" >> /root/${table}-${date}.sql
    done
    ## Run the generated merge SQL
    hive -f /root/${table}-${date}.sql > /dev/null 2>&1
    echo "partition ${date} of table ${table} has been merged"
    ## Remove the generated SQL file
    rm -rf /root/${table}-${date}.sql
done
Run locally
bash xx.sh xx   # pass in the name of the table to merge
Run via DolphinScheduler
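If you would rather not go through a DolphinScheduler workflow, a plain cron entry gives the same daily schedule. A sketch; the script path is an assumption:

# Merge the action table's small-file partitions every day at 03:00
# (with no date argument the script targets the partition from two days ago)
0 3 * * * bash /root/merge_small_files.sh action >> /var/log/merge_action.log 2>&1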
Scheduled cleanup of DolphinScheduler disk usage
Scheduled SQL cleanup of workflow instances
As scheduled jobs accumulate, the workflow-instance table keeps growing; if it is never cleaned up, the DolphinScheduler pages eventually become unusable.
The SQL below deletes finished workflow instances, keeping only the last 3 days. (Recommended: run it daily.)
delete from t_ds_process_instance where state != 1 and date(end_time) < DATE_SUB(CURDATE(), INTERVAL 3 DAY)
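To run the purge on a schedule, wrap the statement in a small shell task (DolphinScheduler or cron) against the scheduler's metadata database. A sketch assuming a MySQL backend; the host, user, password variable and database name are placeholders:

# Purge finished workflow instances older than 3 days from the metadata DB
mysql -h 192.168.90.20 -u dolphinscheduler -p"${DS_DB_PASSWORD}" dolphinscheduler \
  -e "delete from t_ds_process_instance where state != 1 and date(end_time) < DATE_SUB(CURDATE(), INTERVAL 3 DAY);"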
Scheduled cleanup of logs and resources
A long-running DolphinScheduler produces a large volume of logs and jar/exec artifacts; if they are never deleted they consume an enormous amount of disk space and can eventually take the whole big-data cluster down. (Recommended: clean them up daily.)
#!/bin/bash
# Validate IPv4 address format
function isValidIp() {
    local ip=$1
    local ret=1
    if [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
        ip=(${ip//\./ })   # split on '.' into an array for the range checks below
        [[ ${ip[0]} -le 255 && ${ip[1]} -le 255 && ${ip[2]} -le 255 && ${ip[3]} -le 255 ]]
        ret=$?
    fi
    return $ret
}
ips=$1
dolphin_base_dir=$2
# Replace the comma separators with spaces
ip_array=(`echo ${ips} | tr ',' ' '`)
# Loop over the worker addresses, ssh into each one and delete stale exec dirs and logs
echo "ip count: ${#ip_array[@]}"
for ip in ${ip_array[@]}; do
    isValidIp $ip
    if [ ! $? == 0 ]; then
        echo "Warning: param ${ip} is not a valid ip!"
        continue
    else
        # Delete per-workflow exec directories (uploaded jar resources) older than one day
        ssh -o "StrictHostKeyChecking no" ${ip} "find ${dolphin_base_dir}/exec/process -mindepth 3 -maxdepth 3 -name '[0-9]*' -mtime +1 -type d | xargs rm -rf; exit"
        # Delete rotated worker logs
        ssh -o "StrictHostKeyChecking no" ${ip} 'rm -rf /usr/hdp/current/dolphinscheduler/logs/dolphinscheduler-worker.2*.log'
        echo "${ip} has been cleaned up!"
    fi
done
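The script takes the comma-separated worker IP list and DolphinScheduler's base directory as arguments, and needs passwordless SSH from the host it runs on to every worker. A usage sketch; the IPs, script name and base directory are examples:

# Clean exec artifacts and rotated worker logs on two workers
bash dolphin_clean.sh 192.168.90.11,192.168.90.12 /tmp/dolphinscheduler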
Scheduled Presto memory release
If Presto has been running for a long time and its memory keeps growing without being released, a scheduled full GC on the workers every hour can reclaim it.
#!/bin/bash
# Validate IPv4 address format
function isValidIp() {
    local ip=$1
    local ret=1
    if [[ $ip =~ ^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$ ]]; then
        ip=(${ip//\./ })   # split on '.' into an array for the range checks below
        [[ ${ip[0]} -le 255 && ${ip[1]} -le 255 && ${ip[2]} -le 255 && ${ip[3]} -le 255 ]]
        ret=$?
    fi
    return $ret
}
ips=$1
# Replace the comma separators with spaces
ip_array=(`echo ${ips} | tr ',' ' '`)
# Loop over the worker addresses, ssh into each one and trigger a full GC on the PrestoServer process
echo "ip count: ${#ip_array[@]}"
for ip in ${ip_array[@]}; do
    isValidIp $ip
    if [ ! $? == 0 ]; then
        echo "Warning: param ${ip} is not a valid ip!"
        continue
    else
        # jmap -histo:live forces a full GC before taking the live-object histogram
        ssh -o "StrictHostKeyChecking no" ${ip} 'jmap -histo:live `jps | grep PrestoServer|cut -d " " -f 1` > /dev/null;exit;'
        echo "presto worker of ${ip} has been Full GC'd!"
    fi
done
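Schedule it hourly, either as a DolphinScheduler shell task or a plain cron entry. A sketch; the script path and worker IPs are examples:

# Trigger a full GC on every Presto worker at the top of each hour
0 * * * * bash /root/presto_gc.sh 192.168.90.21,192.168.90.22,192.168.90.23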
Series articles
Part 1: Automated deployment with Ambari
Part 2: Event tracking design and SDK source code
Part 3: Data collection and validation
Part 4: Real-time ETL: Kafka -> Flink -> Hive
Part 5: ETL user-data processing: Kafka -> Spark -> Kudu
Part 6: Presto analytics model SQL and UDF functions