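The real-time warehouse needs no manual synchronization: Flink continuously reads data from Kafka and computes on it.
User behavior data is synchronized from Kafka directly to HDFS by Flume. Because the offline warehouse uses Hive tables partitioned by day, the target path must include a date level. The data flow is shown in the figure below.
According to the plan, this Flume agent sends the data in the Kafka topic topic_log to HDFS and separates the user behavior logs by day, writing each day's data to its own HDFS path. The agent uses KafkaSource, FileChannel, and HDFSSink.
The key configuration is as follows:
Key configuration of the log-consuming Flume agent
2.1.3.1 Create the Flume configuration file
Create kafka_to_hdfs_log.conf in the job directory of Flume on node hadoop104.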
[maxwell@hadoop104 flume]$ cd job
[maxwell@hadoop104 job]$ ls -ltr
total 4
-rw-rw-r--. 1 maxwell maxwell 1178 Mar 27 16:03 kafka_to_hdfs_log.conf
[maxwell@hadoop104 job]$ vim kafka_to_hdfs_log.conf
[maxwell@hadoop104 job]$
2.1.3.2 The configuration file content is as follows
# Define the components
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# Configure the source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.maxwell.gmall.flume.interceptor.TimestampInterceptor$Builder
# Configure the channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_log/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = log
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# Control the output file type
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
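Note: the value of a1.sources.r1.interceptors.i1.type must match the fully qualified name of the Builder class in the IDEA project.
Note: configuration tuning
1) FileChannel tuning
Point dataDirs at multiple paths, each on a different disk, to increase Flume throughput.
Place checkpointDir and backupCheckpointDir on directories of different disks as well, so that if the checkpoint is corrupted, data can be recovered quickly from backupCheckpointDir.
2) HDFS Sink tuning
(1) What is the impact of storing a large number of small files in HDFS?
Metadata: every small file has its own metadata, including file path, file name, owner, group, permissions, and creation time, and all of it is held in NameNode memory. Too many small files therefore consume a large amount of NameNode memory and hurt NameNode performance and service life.
Computation: by default MR starts one Map task per small file, which severely hurts computing performance. It also increases disk seek time.
(2) Handling small files in HDFS
With the official default values of these three parameters, writing to HDFS produces small files: hdfs.rollInterval, hdfs.rollSize, hdfs.rollCount.
With hdfs.rollInterval=3600, hdfs.rollSize=134217728, and hdfs.rollCount=0, the combined effect is as follows (note that the configuration file above sets hdfs.rollInterval = 10, which rolls files far more often):
(1) A new file is rolled when the current file reaches 128 MB
(2) A new file is rolled when the current file has been open for more than 3600 seconds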
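To make the combined effect of the three roll parameters concrete, here is a minimal Python sketch. It is a simplified model, not Flume's implementation: the function names should_roll and max_time_rolled_files_per_day are hypothetical, and it assumes that a value of 0 disables the corresponding threshold.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def should_roll(age_seconds, size_bytes, event_count,
                roll_interval=3600, roll_size=134217728, roll_count=0):
    """Return True when any enabled threshold is reached.

    Assumption: a threshold set to 0 is disabled, so rollCount=0 means
    the number of events never triggers a roll.
    """
    by_time = roll_interval > 0 and age_seconds >= roll_interval
    by_size = roll_size > 0 and size_bytes >= roll_size
    by_count = roll_count > 0 and event_count >= roll_count
    return by_time or by_size or by_count


def max_time_rolled_files_per_day(roll_interval):
    """Upper bound on files per day (per writer) from time-based rolling alone."""
    if roll_interval <= 0:
        return 0
    return SECONDS_PER_DAY // roll_interval


if __name__ == "__main__":
    # A 10-second interval can produce far more files than a 3600-second one.
    print(max_time_rolled_files_per_day(10))
    print(max_time_rolled_files_per_day(3600))
```

Under this model, a short rollInterval is the main source of small files when traffic is low, because the size threshold is rarely reached before the timer fires.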
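(1) The data drift problem
The HDFS sink resolves %Y-%m-%d in the target path from the timestamp header of each event. If that header holds the time Flume handled the event instead of the time the log was generated, a log produced just before midnight can land in the next day's directory. The interceptor below replaces the header with the ts field of the log.
(2) Create the TimestampInterceptor class in the package com.maxwell.gmall.flume.interceptor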
package com.maxwell.gmall.flume.interceptor;
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
public class TimestampInterceptor implements Interceptor {
    @Override
    public void initialize() {
    }
    @Override
    public Event intercept(Event event) {
        // 1. Get the headers and the body of the event
        Map<String, String> headers = event.getHeaders();
        byte[] body = event.getBody();
        String log = new String(body, StandardCharsets.UTF_8);
        // 2. Parse the ts timestamp field from the log (JSON)
        JSONObject jsonObject = JSONObject.parseObject(log);
        String ts = jsonObject.getString("ts");
        // 3. Put ts into the headers: the timestamp header is replaced with the
        //    time the log was generated (this solves the data drift problem)
        headers.put("timestamp", ts);
        return event;
    }
    @Override
    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }
    @Override
    public void close() {
    }
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampInterceptor();
        }
        @Override
        public void configure(Context context) {
        }
    }
}
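The effect of the interceptor on the target directory can be sketched in Python. This is an illustration only, not Flume code: add_timestamp_header and hdfs_path are hypothetical helpers, and the sketch assumes the cluster clock is UTC+8 and that the date escapes in the path are resolved from the timestamp header in local time.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: the cluster runs in UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def add_timestamp_header(headers, body):
    """Mimic the interceptor: copy the log's ts field (milliseconds) into the headers."""
    log = json.loads(body.decode("utf-8"))
    headers["timestamp"] = str(log["ts"])
    return headers


def hdfs_path(headers, template="/origin_data/gmall/log/topic_log/%Y-%m-%d"):
    """Resolve the date escapes of the sink path from the timestamp header."""
    ts_ms = int(headers["timestamp"])
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=LOCAL_TZ)
    return moment.strftime(template)


if __name__ == "__main__":
    # A log generated one second before midnight ...
    log_ts = int(datetime(2020, 6, 14, 23, 59, 59, tzinfo=LOCAL_TZ).timestamp() * 1000)
    body = json.dumps({"ts": log_ts}).encode("utf-8")
    # ... but handled six seconds later, after midnight.
    handling_ts = log_ts + 6000

    print(hdfs_path({"timestamp": str(handling_ts)}))  # directory chosen by handling time
    print(hdfs_path(add_timestamp_header({}, body)))   # directory chosen by log time
```

With the handling time in the header the event drifts into the next day's directory; with the log's own ts it stays in the day it was generated.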
(3) Rebuild the package
(4) Copy the built jar into the /opt/module/flume/lib directory on hadoop104 first.
1) Start the Zookeeper and Kafka clusters
2) Start the log-collecting Flume agents
[maxwell@hadoop102 ~]$ cd bin
[maxwell@hadoop102 bin]$ ls -ltr
total 40
-rwxrwxr-x. 1 maxwell maxwell 565 Oct 7 18:33 xsync
-rwxrwxr-x. 1 maxwell maxwell 1023 Oct 8 10:26 myhadoop.sh
-rwxrwxr-x. 1 maxwell maxwell 122 Oct 8 10:27 jpsall
-rwxrwxrwx. 1 maxwell maxwell 195 Mar 22 17:48 lg.sh
-rwxrwxrwx. 1 maxwell maxwell 130 Mar 22 17:53 xcall
-rwxrwxrwx. 1 maxwell maxwell 1092 Mar 23 11:11 hdp.sh
-rwxrwxrwx. 1 maxwell maxwell 565 Mar 23 14:55 zk.sh
-rwxrwxrwx. 1 maxwell maxwell 442 Mar 23 15:26 kf.sh
-rwxrwxrwx. 1 maxwell maxwell 574 Mar 25 13:29 f1.sh
-rwxrwxrwx. 1 maxwell maxwell 804 Mar 27 13:13 mxw.sh
[maxwell@hadoop102 bin]$ hdp.sh start
=================== 启动 hadoop集群 ===================
--------------- 启动 hdfs ---------------
Starting namenodes on [hadoop102]
Starting datanodes
Starting secondary namenodes [hadoop104]
--------------- 启动 yarn ---------------
Starting resourcemanager
Starting nodemanagers
--------------- 启动 historyserver ---------------
[maxwell@hadoop102 bin]$ f1.sh start
--------启动 hadoop102 采集flume-------
--------启动 hadoop103 采集flume-------
[maxwell@hadoop102 bin]$ jps
13296 NodeManager
9490 Kafka
10738 Maxwell
13427 JobHistoryServer
13668 Jps
12965 DataNode
12827 NameNode
13564 Application
9102 QuorumPeerMain
[maxwell@hadoop102 bin]$
3) Start the log-consuming Flume agent on hadoop104
[maxwell@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_log.conf -Dflume.root.logger=info,console
Info: Sourcing environment configuration script /opt/module/flume/conf/flume-env.sh
Info: Including Hadoop libraries found via (/opt/module/hadoop/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /opt/module/jdk1.8.0_212/bin/java -Xms100m -Xmx2000m -Dcom.sun.management.jmxremote -Dflume.root.logger=info,console -cp '/opt/module/flume/conf:/opt/module/flume/lib/*:/opt/module/hadoop/etc/hadoop:/opt/module/hadoop/share/hadoop/common/lib/*:/opt/module/hadoop/share/hadoop/common/*:/opt/module/hadoop/share/hadoop/hdfs:/opt/module/hadoop/share/hadoop/hdfs/lib/*:/opt/module/hadoop/share/hadoop/hdfs/*:/opt/module/hadoop/share/hadoop/mapreduce/lib/*:/opt/module/hadoop/share/hadoop/mapreduce/*:/opt/module/hadoop/share/hadoop/yarn:/opt/module/hadoop/share/hadoop/yarn/lib/*:/opt/module/hadoop/share/hadoop/yarn/*:/lib/*' -Djava.library.path=:/opt/module/hadoop/lib/native org.apache.flume.node.Application -n a1 -f job/kafka_to_hdfs_log.conf
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/module/flume/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/module/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2023-03-27 17:00:01,955 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider.start(PollingPropertiesFileConfigurationProvider.java:62)] Configuration provider starting
2023-03-27 17:00:01,966 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:138)] Reloading configuration file:job/kafka_to_hdfs_log.conf
4) Generate mock data
[maxwell@hadoop102 bin]$ lg.sh
----------------hadoop102------------------
----------------hadoop103------------------
[maxwell@hadoop102 bin]$
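5) Check whether data appears in HDFS
If the test above passes, create a start/stop script for this Flume agent for convenience.
1) Create the script f2.sh in the /home/maxwell/bin directory on node hadoop102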
[maxwell@hadoop102 bin]$ pwd
/home/maxwell/bin
[maxwell@hadoop102 bin]$ vim f2.sh
[maxwell@hadoop102 bin]$ vim f2.sh
[maxwell@hadoop102 bin]$
[maxwell@hadoop102 bin]$ chmod 777 f2.sh
[maxwell@hadoop102 bin]$
[maxwell@hadoop102 bin]$ cat f2.sh
#!/bin/bash
case $1 in
"start")
    echo " -------- Starting the log-data Flume on hadoop104 -------"
    ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_log.conf >/dev/null 2>&1 &"
;;
"stop")
    echo " -------- Stopping the log-data Flume on hadoop104 -------"
    ssh hadoop104 "ps -ef | grep kafka_to_hdfs_log | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
[maxwell@hadoop102 bin]$
2) Add execute permission to the script
[maxwell@hadoop102 bin]$ chmod 777 f2.sh
3) Start with f2
[maxwell@hadoop102 module]$ f2.sh start
4) Stop with f2
[maxwell@hadoop102 module]$ f2.sh stop
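The stop branch of f2.sh selects process IDs with a ps, grep, grep -v grep, awk pipeline. The Python sketch below mirrors that filtering on a captured ps -ef listing so each step is visible. The function find_pids is a hypothetical helper, and the sample listing is made up for illustration.

```python
def find_pids(ps_output, keyword):
    """Return the PIDs of lines that mention keyword, skipping the grep process itself."""
    pids = []
    for line in ps_output.splitlines():
        if keyword not in line:       # grep kafka_to_hdfs_log
            continue
        if "grep" in line:            # grep -v grep
            continue
        fields = line.split()
        if len(fields) > 1:
            pids.append(int(fields[1]))  # awk '{print $2}': second column is the PID
    return pids


if __name__ == "__main__":
    # Made-up sample of `ps -ef` output.
    sample = (
        "UID PID PPID C STIME TTY TIME CMD\n"
        "maxwell 2345 1 0 17:00 ? 00:00:05 java org.apache.flume.node.Application "
        "-n a1 -f /opt/module/flume/job/kafka_to_hdfs_log.conf\n"
        "maxwell 2400 2399 0 17:01 pts/0 00:00:00 grep kafka_to_hdfs_log\n"
    )
    print(find_pids(sample, "kafka_to_hdfs_log"))
```

The second filter matters: without it, the grep process that is searching for the keyword would match its own command line and be passed to kill as well.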
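Business data is an important data source for the warehouse. It has to be extracted from the business database on a daily schedule and loaded into the data warehouse (DW), where it is then analyzed and aggregated.
To keep the statistics correct, the data in the warehouse must stay in sync with the business database. The offline warehouse usually computes on a daily cycle, so the synchronization cycle is normally one day, that is, one sync per day.
There are two synchronization strategies: full synchronization and incremental synchronization.
Full synchronization copies all data of the business database into the warehouse every day, which keeps the business database and the warehouse consistent.
Incremental synchronization copies only each day's new and changed business data into the warehouse. Tables that are synced incrementally usually need one full synchronization on the first day.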
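Sync strategy | Advantages | Disadvantages
Full synchronization | Simple logic | Inefficient in some cases. For example, if a table is large but only a small share of it changes each day, a daily full sync repeatedly transfers and stores large amounts of identical data.
Incremental synchronization | Efficient; no duplicate data is transferred or stored | Complex logic; each day's new and changed data has to be merged with the existing data before it can be used.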
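The comparison above leads to the following conclusion:
In general, prefer incremental synchronization when a business table is large, and full synchronization when it is small.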
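The extra merge step that incremental synchronization requires can be sketched as follows. This is a simplified illustration, not part of the project: merge_increment is a hypothetical helper, rows are plain dicts keyed by an id field, and deletions are ignored.

```python
def merge_increment(snapshot, changes):
    """Merge one day's new and changed rows into the previous snapshot.

    snapshot: dict mapping id -> row (yesterday's complete state)
    changes:  list of rows that were inserted or updated today
    Returns a new dict and leaves the input snapshot untouched.
    """
    merged = dict(snapshot)
    for row in changes:
        merged[row["id"]] = row  # insert a new row or overwrite a changed one
    return merged


if __name__ == "__main__":
    yesterday = {
        1: {"id": 1, "name": "a"},
        2: {"id": 2, "name": "b"},
    }
    today_changes = [
        {"id": 2, "name": "b2"},  # changed row
        {"id": 3, "name": "c"},   # new row
    ]
    print(merge_increment(yesterday, today_changes))
```

A full sync skips this step entirely because each day's load already is the complete state, which is why its logic is simpler at the cost of moving more data.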
Synchronization strategy of each table
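There are many data synchronization tools, and they fall roughly into two categories.
One category is offline, batch synchronization tools based on SELECT queries, represented by DataX and Sqoop.
The other is real-time, streaming synchronization tools based on the database change log, represented by Maxwell and Canal (for example the MySQL binlog, which records data-changing operations such as insert, update, and delete as they happen).
Full synchronization usually uses query-based offline tools such as DataX and Sqoop.
Incremental synchronization can use either DataX and Sqoop or Maxwell and Canal. The two approaches to incremental synchronization are compared briefly below.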
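Incremental sync approach | DataX/Sqoop | Maxwell/Canal
Requirements on the database | Query based. To fetch new and changed rows with a SELECT, the table must contain fields such as create_time and update_time, which are then used to select the changed data. | The database must record change operations; for example, MySQL must have the binlog enabled.
Intermediate states of the data | Offline batch sync. If a row changes several times in one day, only its last state is captured; intermediate states are lost. | All change operations are captured in real time, so every intermediate state of a changed row is available.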
In this project, full synchronization uses DataX and incremental synchronization uses Maxwell.
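The difference in intermediate states can be shown with a small simulation. It is a sketch under simple assumptions, not how DataX or Maxwell work internally: apply_change and query_based_increment are hypothetical helpers, the table is an in-memory dict, and the change log is a plain list standing in for the binlog.

```python
def apply_change(table, change_log, row):
    """Write a row to the table and append a copy of it to the change log."""
    table[row["id"]] = row
    change_log.append(dict(row))


def query_based_increment(table, since):
    """Query-style pull: select rows whose update_time is at or after `since`."""
    return [row for row in table.values() if row["update_time"] >= since]


if __name__ == "__main__":
    table, change_log = {}, []
    # One order changes three times during the same day.
    apply_change(table, change_log, {"id": 1, "status": "created", "update_time": "2020-06-14 09:00:00"})
    apply_change(table, change_log, {"id": 1, "status": "paid", "update_time": "2020-06-14 12:00:00"})
    apply_change(table, change_log, {"id": 1, "status": "shipped", "update_time": "2020-06-14 18:00:00"})

    # The end-of-day query sees one row in its final state.
    print(query_based_increment(table, "2020-06-14 00:00:00"))
    # The change log kept every intermediate state.
    print(change_log)
```

The query depends on the update_time column being present and maintained, which is the database requirement listed for the query-based approach.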
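2.2.5.1 Deploying the data synchronization tool DataX
For the deployment steps, see the CSDN blog post "关于数据同步工具DataX部署" on the DB架构 blog.
2.2.5.2 Data channel
Data of the full-sync tables is synchronized by DataX from the MySQL business database directly to HDFS. The data flow is shown in the figure below.
2.2.5.3 DataX configuration file
Each full-sync table needs its own DataX JSON configuration file. Taking activity_info as an example, the configuration file is as follows: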
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"id",
"activity_name",
"activity_type",
"activity_desc",
"start_time",
"end_time",
"create_time"
],
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://hadoop102:3306/gmall"
],
"table": [
"activity_info"
]
}
],
"password": "xxxxxxxx",
"splitPk": "",
"username": "root"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name": "id",
"type": "bigint"
},
{
"name": "activity_name",
"type": "string"
},
{
"name": "activity_type",
"type": "string"
},
{
"name": "activity_desc",
"type": "string"
},
{
"name": "start_time",
"type": "string"
},
{
"name": "end_time",
"type": "string"
},
{
"name": "create_time",
"type": "string"
}
],
"compress": "gzip",
"defaultFS": "hdfs://hadoop102:8020",
"fieldDelimiter": "\t",
"fileName": "activity_info",
"fileType": "text",
"path": "${targetdir}",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
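Note: because the target path contains a date level that separates the data of different days, the path parameter is not hard-coded. It is passed in dynamically when the job is submitted, through a parameter named targetdir.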
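The placeholder mechanism can be illustrated with a short Python sketch. It only imitates the substitution with string.Template and assembles the command line used elsewhere in this post; DataX performs the real substitution itself. The helpers render_path and build_datax_command are hypothetical.

```python
from string import Template


def render_path(path_template, targetdir):
    """Replace the ${targetdir} placeholder the way a -Dtargetdir=... parameter would."""
    return Template(path_template).substitute(targetdir=targetdir)


def build_datax_command(config_file, targetdir, datax_home="/opt/module/datax"):
    """Assemble the argument list for one DataX run with a dynamic target directory."""
    return [
        "python",
        datax_home + "/bin/datax.py",
        "-p",
        "-Dtargetdir=" + targetdir,
        config_file,
    ]


if __name__ == "__main__":
    target = "/origin_data/gmall/db/activity_info_full/2020-06-14"
    print(render_path("${targetdir}", target))
    print(" ".join(build_datax_command(
        "/opt/module/datax/job/import/gmall.activity_info.json", target)))
```

Keeping the date out of the JSON file means one configuration per table is enough; only the value passed on the command line changes from day to day.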
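2.2.5.4 DataX configuration file generation script
For convenience, a script that generates the DataX configuration files in bulk is provided here. Its content and usage are as follows.
1) Create the gen_import_config.py script in the ~/bin directory
[maxwell@hadoop102 bin]$ vim ~/bin/gen_import_config.py
The script content is as follows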
# coding=utf-8
import json
import getopt
import os
import sys
import MySQLdb

# MySQL settings; adjust them to your environment
mysql_host = "hadoop102"
mysql_port = "3306"
mysql_user = "root"
mysql_passwd = "xxxxxx"

# HDFS NameNode settings; adjust them to your environment
hdfs_nn_host = "hadoop102"
hdfs_nn_port = "8020"

# Target directory for the generated configuration files; adjust as needed
output_path = "/opt/module/datax/job/import"


def get_connection():
    return MySQLdb.connect(host=mysql_host, port=int(mysql_port), user=mysql_user, passwd=mysql_passwd)


def get_mysql_meta(database, table):
    # Read column names and data types of the table, in column order
    connection = get_connection()
    cursor = connection.cursor()
    sql = "SELECT COLUMN_NAME,DATA_TYPE from information_schema.COLUMNS WHERE TABLE_SCHEMA=%s AND TABLE_NAME=%s ORDER BY ORDINAL_POSITION"
    cursor.execute(sql, [database, table])
    fetchall = cursor.fetchall()
    cursor.close()
    connection.close()
    return fetchall


def get_mysql_columns(database, table):
    # Build a real list so that json.dump can serialize it
    return [x[0] for x in get_mysql_meta(database, table)]


def get_hive_columns(database, table):
    def type_mapping(mysql_type):
        mappings = {
            "bigint": "bigint",
            "int": "bigint",
            "smallint": "bigint",
            "tinyint": "bigint",
            "decimal": "string",
            "double": "double",
            "float": "float",
            "binary": "string",
            "char": "string",
            "varchar": "string",
            "datetime": "string",
            "time": "string",
            "timestamp": "string",
            "date": "string",
            "text": "string"
        }
        return mappings[mysql_type]

    meta = get_mysql_meta(database, table)
    return [{"name": x[0], "type": type_mapping(x[1].lower())} for x in meta]


def generate_json(source_database, source_table):
    job = {
        "job": {
            "setting": {
                "speed": {
                    "channel": 3
                },
                "errorLimit": {
                    "record": 0,
                    "percentage": 0.02
                }
            },
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": mysql_user,
                        "password": mysql_passwd,
                        "column": get_mysql_columns(source_database, source_table),
                        "splitPk": "",
                        "connection": [{
                            "table": [source_table],
                            "jdbcUrl": ["jdbc:mysql://" + mysql_host + ":" + mysql_port + "/" + source_database]
                        }]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://" + hdfs_nn_host + ":" + hdfs_nn_port,
                        "fileType": "text",
                        "path": "${targetdir}",
                        "fileName": source_table,
                        "column": get_hive_columns(source_database, source_table),
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "gzip"
                    }
                }
            }]
        }
    }
    # Write the job to <output_path>/<database>.<table>.json
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    with open(os.path.join(output_path, ".".join([source_database, source_table, "json"])), "w") as f:
        json.dump(job, f)


def main(args):
    source_database = ""
    source_table = ""
    # -d/--sourcedb is the database name, -t/--sourcetbl is the table name
    options, arguments = getopt.getopt(args, 'd:t:', ['sourcedb=', 'sourcetbl='])
    for opt_name, opt_value in options:
        if opt_name in ('-d', '--sourcedb'):
            source_database = opt_value
        if opt_name in ('-t', '--sourcetbl'):
            source_table = opt_value
    generate_json(source_database, source_table)


if __name__ == '__main__':
    main(sys.argv[1:])
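One thing to be aware of in the script above: type_mapping raises a KeyError for any MySQL type that is not in its table (for example mediumint or json). The sketch below shows one possible defensive variant. It is an assumption, not part of the original script: the name map_type and the choice to fall back to string are hypothetical, and a silent fallback can hide a mapping you would prefer to handle explicitly.

```python
MYSQL_TO_HIVE = {
    "bigint": "bigint",
    "int": "bigint",
    "smallint": "bigint",
    "tinyint": "bigint",
    "decimal": "string",
    "double": "double",
    "float": "float",
    "binary": "string",
    "char": "string",
    "varchar": "string",
    "datetime": "string",
    "time": "string",
    "timestamp": "string",
    "date": "string",
    "text": "string",
}


def map_type(mysql_type, default="string"):
    """Map a MySQL data type to a writer column type, falling back to `default`."""
    return MYSQL_TO_HIVE.get(mysql_type.lower(), default)


if __name__ == "__main__":
    for name in ("INT", "varchar", "mediumtext"):
        print(name, "->", map_type(name))
```

If failing fast is preferred, keep the original dictionary lookup and extend the table instead whenever a new type appears.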
Notes:
(1) Install the Python MySQL driver
The script accesses the MySQL database from Python, so the driver has to be installed with the following command:
[maxwell@hadoop102 bin]$ sudo yum install -y MySQL-python
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
epel/x86_64/metalink | 8.4 kB 00:00:00
* base: mirrors.tuna.tsinghua.edu.cn
* epel: mirrors.tuna.tsinghua.edu.cn
* extras: mirrors.tuna.tsinghua.edu.cn
* updates: mirrors.tuna.tsinghua.edu.cn
base | 3.6 kB 00:00:00
epel | 4.7 kB 00:00:00
extras | 2.9 kB 00:00:00
updates | 2.9 kB 00:00:00
(1/2): epel/x86_64/updateinfo | 1.0 MB 00:00:01
(2/2): epel/x86_64/primary_db | 7.0 MB 00:00:01
Resolving Dependencies
--> Running transaction check
---> Package MySQL-python.x86_64 0:1.2.5-1.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
=========================================================================================================================================================================================================================================
Package Arch Version Repository Size
=========================================================================================================================================================================================================================================
Installing:
MySQL-python x86_64 1.2.5-1.el7 base 90 k
Transaction Summary
=========================================================================================================================================================================================================================================
Install 1 Package
Total download size: 90 k
Installed size: 284 k
Downloading packages:
MySQL-python-1.2.5-1.el7.x86_64.rpm | 90 kB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Warning: RPMDB altered outside of yum.
** Found 9 pre-existing rpmdb problem(s), 'yum check' output follows:
icedtea-web-1.7.1-1.el7.x86_64 has missing requires of java-1.8.0-openjdk
icedtea-web-1.7.1-1.el7.x86_64 has missing requires of jpackage-utils
icedtea-web-1.7.1-1.el7.x86_64 has missing requires of jpackage-utils
jline-1.0-8.el7.noarch has missing requires of java >= ('0', '1.5', None)
jline-1.0-8.el7.noarch has missing requires of jpackage-utils
rhino-1.7R5-1.el7.noarch has missing requires of jpackage-utils
rhino-1.7R5-1.el7.noarch has missing requires of jpackage-utils
tagsoup-1.2.1-8.el7.noarch has missing requires of jpackage-utils
tagsoup-1.2.1-8.el7.noarch has missing requires of jpackage-utils >= ('0', '1.6', None)
Installing : MySQL-python-1.2.5-1.el7.x86_64 1/1
Verifying : MySQL-python-1.2.5-1.el7.x86_64 1/1
Installed:
MySQL-python.x86_64 0:1.2.5-1.el7
Complete!
[maxwell@hadoop102 bin]$
(2) Script usage
python gen_import_config.py -d database -t table
[maxwell@hadoop102 bin]$ python gen_import_config.py -d gmall -t base_province
[maxwell@hadoop102 bin]$ cd /opt/module/datax/
[maxwell@hadoop102 datax]$ cd job/
[maxwell@hadoop102 job]$ cd import/
[maxwell@hadoop102 import]$ ls -ltr
total 4
-rw-rw-r--. 1 maxwell maxwell 868 Mar 29 07:49 gmall.base_province.json
[maxwell@hadoop102 import]$ cat gmall.base_province.json
{"job": {"content": [{"writer": {"parameter": {"writeMode": "append", "fieldDelimiter": "\t", "column": [{"type": "bigint", "name": "id"}, {"type": "string", "name": "name"}, {"type": "string", "name": "region_id"}, {"type": "string", "name": "area_code"}, {"type": "string", "name": "iso_code"}, {"type": "string", "name": "iso_3166_2"}], "path": "${targetdir}", "fileType": "text", "defaultFS": "hdfs://hadoop102:8020", "compress": "gzip", "fileName": "base_province"}, "name": "hdfswriter"}, "reader": {"parameter": {"username": "root", "column": ["id", "name", "region_id", "area_code", "iso_code", "iso_3166_2"], "connection": [{"table": ["base_province"], "jdbcUrl": ["jdbc:mysql://hadoop102:3306/gmall"]}], "password": "centos123", "splitPk": ""}, "name": "mysqlreader"}}], "setting": {"speed": {"channel": 3}, "errorLimit": {"record": 0, "percentage": 0.02}}}}
[maxwell@hadoop102 import]$
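Pass the database name with -d and the table name with -t. Running the command above generates the DataX synchronization configuration file for that table.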
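Before running a generated file through DataX, a quick sanity check can confirm that the reader and writer list the same columns in the same order. The sketch below is one possible check, assuming the JSON layout shown in this post; columns_match is a hypothetical helper and is not part of DataX.

```python
import json


def columns_match(config_text):
    """Return True if reader and writer columns have the same names in the same order."""
    job = json.loads(config_text)
    content = job["job"]["content"][0]
    reader_columns = content["reader"]["parameter"]["column"]
    writer_columns = [c["name"] for c in content["writer"]["parameter"]["column"]]
    return reader_columns == writer_columns


if __name__ == "__main__":
    # Trimmed-down sample with only the fields the check needs.
    sample = json.dumps({
        "job": {"content": [{
            "reader": {"parameter": {"column": ["id", "name"]}},
            "writer": {"parameter": {"column": [
                {"name": "id", "type": "bigint"},
                {"name": "name", "type": "string"},
            ]}},
        }]}
    })
    print(columns_match(sample))
```

Since the writer emits delimited text without column names, a mismatch in column order would not fail the job but would shift values into the wrong fields, so checking it up front is cheap insurance.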
[maxwell@hadoop102 datax]$ python bin/datax.py -p"-Dtargetdir=/base_province" job/import/gmall.base_province.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2023-03-29 07:53:49.636 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2023-03-29 07:53:49.642 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.212-b10
jvmInfo: Linux amd64 3.10.0-862.el7.x86_64
cpu num: 2
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2023-03-29 07:53:49.665 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"column":[
"id",
"name",
"region_id",
"area_code",
"iso_code",
"iso_3166_2"
],
"connection":[
{
"jdbcUrl":[
"jdbc:mysql://hadoop102:3306/gmall"
],
"table":[
"base_province"
]
}
],
"password":"*********",
"splitPk":"",
"username":"root"
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"column":[
{
"name":"id",
"type":"bigint"
},
{
"name":"name",
"type":"string"
},
{
"name":"region_id",
"type":"string"
},
{
"name":"area_code",
"type":"string"
},
{
"name":"iso_code",
"type":"string"
},
{
"name":"iso_3166_2",
"type":"string"
}
],
"compress":"gzip",
"defaultFS":"hdfs://hadoop102:8020",
"fieldDelimiter":"\t",
"fileName":"base_province",
"fileType":"text",
"path":"/base_province",
"writeMode":"append"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.02,
"record":0
},
"speed":{
"channel":3
}
}
}
2023-03-29 07:53:49.685 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2023-03-29 07:53:49.687 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2023-03-29 07:53:49.687 [main] INFO JobContainer - DataX jobContainer starts job.
2023-03-29 07:53:49.690 [main] INFO JobContainer - Set jobId = 0
2023-03-29 07:53:50.025 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://hadoop102:3306/gmall?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2023-03-29 07:53:50.044 [job-0] INFO OriginalConfPretreatmentUtil - table:[base_province] has columns:[id,name,region_id,area_code,iso_code,iso_3166_2].
Mar 29, 2023 7:53:50 AM org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-03-29 07:53:51.219 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2023-03-29 07:53:51.220 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2023-03-29 07:53:51.220 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do prepare work .
2023-03-29 07:53:51.334 [job-0] INFO HdfsWriter$Job - 由于您配置了writeMode append, 写入前不做清理工作, [/base_province] 目录下写入相应文件名前缀 [base_province] 的文件
2023-03-29 07:53:51.335 [job-0] INFO JobContainer - jobContainer starts to do split ...
2023-03-29 07:53:51.335 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.
2023-03-29 07:53:51.338 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2023-03-29 07:53:51.339 [job-0] INFO HdfsWriter$Job - begin do split...
2023-03-29 07:53:51.352 [job-0] INFO HdfsWriter$Job - splited write file name:[hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c]
2023-03-29 07:53:51.352 [job-0] INFO HdfsWriter$Job - end do split.
2023-03-29 07:53:51.352 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] splits to [1] tasks.
2023-03-29 07:53:51.369 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2023-03-29 07:53:51.373 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2023-03-29 07:53:51.375 [job-0] INFO JobContainer - Running by standalone Mode.
2023-03-29 07:53:51.384 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2023-03-29 07:53:51.388 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2023-03-29 07:53:51.388 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2023-03-29 07:53:51.411 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2023-03-29 07:53:51.416 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province
] jdbcUrl:[jdbc:mysql://hadoop102:3306/gmall?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2023-03-29 07:53:51.442 [0-0-0-writer] INFO HdfsWriter$Task - begin do write...
2023-03-29 07:53:51.443 [0-0-0-writer] INFO HdfsWriter$Task - write to file : [hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c]
2023-03-29 07:53:51.488 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select id,name,region_id,area_code,iso_code,iso_3166_2 from base_province
] jdbcUrl:[jdbc:mysql://hadoop102:3306/gmall?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2023-03-29 07:53:51.872 [0-0-0-writer] INFO HdfsWriter$Task - end do write
2023-03-29 07:53:51.919 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[521]ms
2023-03-29 07:53:51.919 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2023-03-29 07:54:01.395 [job-0] INFO StandAloneJobContainerCommunicator - Total 34 records, 707 bytes | Speed 70B/s, 3 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2023-03-29 07:54:01.395 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2023-03-29 07:54:01.396 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do post work.
2023-03-29 07:54:01.396 [job-0] INFO HdfsWriter$Job - start rename file [hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c.gz] to file [hdfs://hadoop102:8020/base_province/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c.gz].
2023-03-29 07:54:01.405 [job-0] INFO HdfsWriter$Job - finish rename file [hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c.gz] to file [hdfs://hadoop102:8020/base_province/base_province__c22f5d17_f8f6_4e3c_b0d4_4fce3e181d1c.gz].
2023-03-29 07:54:01.405 [job-0] INFO HdfsWriter$Job - start delete tmp dir [hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea] .
2023-03-29 07:54:01.414 [job-0] INFO HdfsWriter$Job - finish delete tmp dir [hdfs://hadoop102:8020/base_province__0b3e92f7_5b70_4159_8161_a73cf205ecea] .
2023-03-29 07:54:01.415 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work.
2023-03-29 07:54:01.415 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2023-03-29 07:54:01.416 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/module/datax/hook
2023-03-29 07:54:01.520 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.036s | 0.036s | 0.036s
PS Scavenge | 1 | 1 | 1 | 0.051s | 0.051s | 0.051s
2023-03-29 07:54:01.520 [job-0] INFO JobContainer - PerfTrace not enable!
2023-03-29 07:54:01.520 [job-0] INFO StandAloneJobContainerCommunicator - Total 34 records, 707 bytes | Speed 70B/s, 3 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2023-03-29 07:54:01.525 [job-0] INFO JobContainer -
任务启动时刻 : 2023-03-29 07:53:49
任务结束时刻 : 2023-03-29 07:54:01
任务总计耗时 : 11s
任务平均流量 : 70B/s
记录写入速度 : 3rec/s
读出记录总数 : 34
读写失败总数 : 0
[maxwell@hadoop102 datax]$
2) Create the gen_import_config.sh script in the ~/bin directory
[maxwell@hadoop102 bin]$ vim ~/bin/gen_import_config.sh
The script content is as follows
#!/bin/bash
python ~/bin/gen_import_config.py -d gmall -t activity_info
python ~/bin/gen_import_config.py -d gmall -t activity_rule
python ~/bin/gen_import_config.py -d gmall -t base_category1
python ~/bin/gen_import_config.py -d gmall -t base_category2
python ~/bin/gen_import_config.py -d gmall -t base_category3
python ~/bin/gen_import_config.py -d gmall -t base_dic
python ~/bin/gen_import_config.py -d gmall -t base_province
python ~/bin/gen_import_config.py -d gmall -t base_region
python ~/bin/gen_import_config.py -d gmall -t base_trademark
python ~/bin/gen_import_config.py -d gmall -t cart_info
python ~/bin/gen_import_config.py -d gmall -t coupon_info
python ~/bin/gen_import_config.py -d gmall -t sku_attr_value
python ~/bin/gen_import_config.py -d gmall -t sku_info
python ~/bin/gen_import_config.py -d gmall -t sku_sale_attr_value
python ~/bin/gen_import_config.py -d gmall -t spu_info
3) Add execute permission to the gen_import_config.sh script
[maxwell@hadoop102 bin]$ chmod 777 ~/bin/gen_import_config.sh
4) Run the gen_import_config.sh script to generate the configuration files
[maxwell@hadoop102 bin]$ gen_import_config.sh
5) Inspect the generated configuration files
[maxwell@hadoop102 bin]$ ll /opt/module/datax/job/import/
总用量 60
-rw-rw-r-- 1 maxwell maxwell 957 10月 15 22:17 gmall.activity_info.json
-rw-rw-r-- 1 maxwell maxwell 1049 10月 15 22:17 gmall.activity_rule.json
-rw-rw-r-- 1 maxwell maxwell 651 10月 15 22:17 gmall.base_category1.json
-rw-rw-r-- 1 maxwell maxwell 711 10月 15 22:17 gmall.base_category2.json
-rw-rw-r-- 1 maxwell maxwell 711 10月 15 22:17 gmall.base_category3.json
-rw-rw-r-- 1 maxwell maxwell 835 10月 15 22:17 gmall.base_dic.json
-rw-rw-r-- 1 maxwell maxwell 865 10月 15 22:17 gmall.base_province.json
-rw-rw-r-- 1 maxwell maxwell 659 10月 15 22:17 gmall.base_region.json
-rw-rw-r-- 1 maxwell maxwell 709 10月 15 22:17 gmall.base_trademark.json
-rw-rw-r-- 1 maxwell maxwell 1301 10月 15 22:17 gmall.cart_info.json
-rw-rw-r-- 1 maxwell maxwell 1545 10月 15 22:17 gmall.coupon_info.json
-rw-rw-r-- 1 maxwell maxwell 867 10月 15 22:17 gmall.sku_attr_value.json
-rw-rw-r-- 1 maxwell maxwell 1121 10月 15 22:17 gmall.sku_info.json
-rw-rw-r-- 1 maxwell maxwell 985 10月 15 22:17 gmall.sku_sale_attr_value.json
-rw-rw-r-- 1 maxwell maxwell 811 10月 15 22:17 gmall.spu_info.json
[maxwell@hadoop102 bin]$ vim gen_import_config.sh
[maxwell@hadoop102 bin]$ chmod 777 gen_import_config.sh
[maxwell@hadoop102 bin]$ gen_import_config.sh
[maxwell@hadoop102 bin]$ cd /opt/module/datax/
[maxwell@hadoop102 datax]$ cd job/import/
[maxwell@hadoop102 import]$ ls -ltr
total 60
-rw-rw-r--. 1 maxwell maxwell 960 Mar 29 07:58 gmall.activity_info.json
-rw-rw-r--. 1 maxwell maxwell 1052 Mar 29 07:58 gmall.activity_rule.json
-rw-rw-r--. 1 maxwell maxwell 654 Mar 29 07:58 gmall.base_category1.json
-rw-rw-r--. 1 maxwell maxwell 714 Mar 29 07:58 gmall.base_category2.json
-rw-rw-r--. 1 maxwell maxwell 714 Mar 29 07:58 gmall.base_category3.json
-rw-rw-r--. 1 maxwell maxwell 838 Mar 29 07:58 gmall.base_dic.json
-rw-rw-r--. 1 maxwell maxwell 868 Mar 29 07:58 gmall.base_province.json
-rw-rw-r--. 1 maxwell maxwell 662 Mar 29 07:58 gmall.base_region.json
-rw-rw-r--. 1 maxwell maxwell 712 Mar 29 07:58 gmall.base_trademark.json
-rw-rw-r--. 1 maxwell maxwell 1304 Mar 29 07:58 gmall.cart_info.json
-rw-rw-r--. 1 maxwell maxwell 1548 Mar 29 07:58 gmall.coupon_info.json
-rw-rw-r--. 1 maxwell maxwell 870 Mar 29 07:58 gmall.sku_attr_value.json
-rw-rw-r--. 1 maxwell maxwell 1124 Mar 29 07:58 gmall.sku_info.json
-rw-rw-r--. 1 maxwell maxwell 988 Mar 29 07:58 gmall.sku_sale_attr_value.json
-rw-rw-r--. 1 maxwell maxwell 814 Mar 29 07:58 gmall.spu_info.json
[maxwell@hadoop102 import]$
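2.2.5.5 Test the generated DataX configuration file
Taking activity_info as an example, test whether the configuration file generated by the script works.
1) Create the target path
A DataX sync job requires the target path to exist in advance, so it has to be created manually. For the activity_info table the target path is /origin_data/gmall/db/activity_info_full/2020-06-14. The -p flag also creates any missing parent directories.
[maxwell@hadoop102 bin]$ hadoop fs -mkdir -p /origin_data/gmall/db/activity_info_full/2020-06-14
2) Run the DataX sync command
[maxwell@hadoop102 bin]$ python /opt/module/datax/bin/datax.py -p"-Dtargetdir=/origin_data/gmall/db/activity_info_full/2020-06-14" /opt/module/datax/job/import/gmall.activity_info.json
3) Check the sync result
Check whether data appears in the HDFS target path.
2.2.5.6 Full-table data synchronization script
For ease of use and for later job scheduling, write a synchronization script for the full-sync tables.
1) Create mysql_to_hdfs_full.sh in the ~/bin directory
[maxwell@hadoop102 bin]$ vim ~/bin/mysql_to_hdfs_full.sh
The script content is as follows: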
#!/bin/bash
DATAX_HOME=/opt/module/datax
# If a date is passed as the second argument, do_date takes that value; otherwise it defaults to yesterday
if [ -n "$2" ] ;then
    do_date=$2
else
    do_date=`date -d "-1 day" +%F`
fi
# Prepare the target path: create it if it does not exist, empty it if it does, so that the sync job can be re-run safely
handle_targetdir() {
  hadoop fs -test -e $1
  if [[ $? -eq 1 ]]; then
    echo "Path $1 does not exist, creating it......"
    hadoop fs -mkdir -p $1
  else
    echo "Path $1 already exists"
    fs_count=$(hadoop fs -count $1)
    content_size=$(echo $fs_count | awk '{print $3}')
    if [[ $content_size -eq 0 ]]; then
      echo "Path $1 is empty"
    else
      echo "Path $1 is not empty, clearing it......"
      hadoop fs -rm -r -f $1/*
    fi
  fi
}
# Synchronize the data
import_data() {
  datax_config=$1
  target_dir=$2
  handle_targetdir $target_dir
  python $DATAX_HOME/bin/datax.py -p"-Dtargetdir=$target_dir" $datax_config
}
case $1 in
"activity_info")
import_data /opt/module/datax/job/import/gmall.activity_info.json /origin_data/gmall/db/activity_info_full/$do_date
;;
"activity_rule")
import_data /opt/module/datax/job/import/gmall.activity_rule.json /origin_data/gmall/db/activity_rule_full/$do_date
;;
"base_category1")
import_data /opt/module/datax/job/import/gmall.base_category1.json /origin_data/gmall/db/base_category1_full/$do_date
;;
"base_category2")
import_data /opt/module/datax/job/import/gmall.base_category2.json /origin_data/gmall/db/base_category2_full/$do_date
;;
"base_category3")
import_data /opt/module/datax/job/import/gmall.base_category3.json /origin_data/gmall/db/base_category3_full/$do_date
;;
"base_dic")
import_data /opt/module/datax/job/import/gmall.base_dic.json /origin_data/gmall/db/base_dic_full/$do_date
;;
"base_province")
import_data /opt/module/datax/job/import/gmall.base_province.json /origin_data/gmall/db/base_province_full/$do_date
;;
"base_region")
import_data /opt/module/datax/job/import/gmall.base_region.json /origin_data/gmall/db/base_region_full/$do_date
;;
"base_trademark")
import_data /opt/module/datax/job/import/gmall.base_trademark.json /origin_data/gmall/db/base_trademark_full/$do_date
;;
"cart_info")
import_data /opt/module/datax/job/import/gmall.cart_info.json /origin_data/gmall/db/cart_info_full/$do_date
;;
"coupon_info")
import_data /opt/module/datax/job/import/gmall.coupon_info.json /origin_data/gmall/db/coupon_info_full/$do_date
;;
"sku_attr_value")
import_data /opt/module/datax/job/import/gmall.sku_attr_value.json /origin_data/gmall/db/sku_attr_value_full/$do_date
;;
"sku_info")
import_data /opt/module/datax/job/import/gmall.sku_info.json /origin_data/gmall/db/sku_info_full/$do_date
;;
"sku_sale_attr_value")
import_data /opt/module/datax/job/import/gmall.sku_sale_attr_value.json /origin_data/gmall/db/sku_sale_attr_value_full/$do_date
;;
"spu_info")
import_data /opt/module/datax/job/import/gmall.spu_info.json /origin_data/gmall/db/spu_info_full/$do_date
;;
"all")
import_data /opt/module/datax/job/import/gmall.activity_info.json /origin_data/gmall/db/activity_info_full/$do_date
import_data /opt/module/datax/job/import/gmall.activity_rule.json /origin_data/gmall/db/activity_rule_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_category1.json /origin_data/gmall/db/base_category1_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_category2.json /origin_data/gmall/db/base_category2_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_category3.json /origin_data/gmall/db/base_category3_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_dic.json /origin_data/gmall/db/base_dic_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_province.json /origin_data/gmall/db/base_province_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_region.json /origin_data/gmall/db/base_region_full/$do_date
import_data /opt/module/datax/job/import/gmall.base_trademark.json /origin_data/gmall/db/base_trademark_full/$do_date
import_data /opt/module/datax/job/import/gmall.cart_info.json /origin_data/gmall/db/cart_info_full/$do_date
import_data /opt/module/datax/job/import/gmall.coupon_info.json /origin_data/gmall/db/coupon_info_full/$do_date
import_data /opt/module/datax/job/import/gmall.sku_attr_value.json /origin_data/gmall/db/sku_attr_value_full/$do_date
import_data /opt/module/datax/job/import/gmall.sku_info.json /origin_data/gmall/db/sku_info_full/$do_date
import_data /opt/module/datax/job/import/gmall.sku_sale_attr_value.json /origin_data/gmall/db/sku_sale_attr_value_full/$do_date
import_data /opt/module/datax/job/import/gmall.spu_info.json /origin_data/gmall/db/spu_info_full/$do_date
;;
esac
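2) Add execute permission to mysql_to_hdfs_full.sh
[maxwell@hadoop102 bin]$ chmod 777 ~/bin/mysql_to_hdfs_full.sh
3) Test the sync script
[maxwell@hadoop102 bin]$ mysql_to_hdfs_full.sh all 2020-06-14
4) Check the sync result
Check whether the full-table data has appeared under the target paths on HDFS. There are 15 full-sync tables in total.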
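The case block in mysql_to_hdfs_full.sh repeats the same pattern for every table: the DataX config is gmall.<table>.json and the target directory is <table>_full/<date>. One possible way to avoid the repetition is to drive both from a single table list. The sketch below is a dry run that only prints the arguments that would be passed to import_data; the helper names plan_import, resolve_tables and sync_plan are hypothetical, while the table names and paths are copied from the script above.

```shell
# Table list copied from the case branches of mysql_to_hdfs_full.sh.
full_tables="activity_info activity_rule base_category1 base_category2 base_category3 base_dic base_province base_region base_trademark cart_info coupon_info sku_attr_value sku_info sku_sale_attr_value spu_info"

# Print "<datax config> <target dir>" for one table ($1) and one date ($2).
plan_import() {
  echo "/opt/module/datax/job/import/gmall.${1}.json /origin_data/gmall/db/${1}_full/${2}"
}

# Resolve "all" or a single known table name to a list of tables.
resolve_tables() {
  if [ "$1" = "all" ]; then
    echo "$full_tables"
    return 0
  fi
  for t in $full_tables; do
    if [ "$t" = "$1" ]; then
      echo "$t"
      return 0
    fi
  done
  echo "unknown table: $1" >&2
  return 1
}

# Dry run: print one line per table instead of calling import_data.
sync_plan() {
  tables=$(resolve_tables "$1") || return 1
  for t in $tables; do
    plan_import "$t" "$2"
  done
}

sync_plan sku_info 2020-06-14
```

In a real script, plan_import would be replaced by the existing import_data call. Unlike the original case block, an unknown table name produces an error message instead of being ignored silently.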
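2.2.6.1 Data Channel
2.2.6.2 Flume Configuration
1) Flume configuration overview
Flume needs to transfer the data in the Kafka topic topic_db to HDFS, so it uses KafkaSource and HDFSSink, with FileChannel as the channel.
Note that HDFSSink must write the data of different MySQL business tables to different paths, and each path should contain a date level that separates each day's data. The key configuration is as follows:
2) Flume configuration in practice
(1) Create the Flume configuration file
Create kafka_to_hdfs_db.conf in the job directory of Flume on node hadoop104
[maxwell@hadoop104 flume]$ mkdir job
[maxwell@hadoop104 flume]$ vim job/kafka_to_hdfs_db.conf
(2) The configuration file content is as follows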
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.sources.r1.kafka.topics = topic_db
a1.sources.r1.kafka.consumer.group.id = flume
a1.sources.r1.setTopicHeader = true
a1.sources.r1.topicHeader = topic
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.maxwell.gmall.flume.interceptor.TimestampAndTableNameInterceptor$Builder
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior2
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior2/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/db/%{tableName}_inc/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = db
a1.sinks.k1.hdfs.round = false
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
## Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
(3) Write the Flume interceptor
① Create a new Maven project and add the following configuration to pom.xml
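<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.62</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>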
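② Create the TimestampAndTableNameInterceptor class in the com.maxwell.gmall.flume.interceptor package (the package must match the interceptor type configured in kafka_to_hdfs_db.conf)
package com.maxwell.gmall.flume.interceptor;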
import com.alibaba.fastjson.JSONObject;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
public class TimestampAndTableNameInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public Event intercept(Event event) {
Map<String, String> headers = event.getHeaders();
String log = new String(event.getBody(), StandardCharsets.UTF_8);
JSONObject jsonObject = JSONObject.parseObject(log);
Long ts = jsonObject.getLong("ts");
// The ts field in Maxwell's output is in seconds, while the Flume HDFSSink expects milliseconds
String timeMills = String.valueOf(ts * 1000);
String tableName = jsonObject.getString("table");
headers.put("timestamp", timeMills);
headers.put("tableName", tableName);
return event;
}
@Override
public List<Event> intercept(List<Event> events) {
for (Event event : events) {
intercept(event);
}
return events;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder {
@Override
public Interceptor build() {
return new TimestampAndTableNameInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
③ Repackage the project
④ Put the packaged jar into the /opt/module/flume/lib directory on hadoop104
[maxwell@hadoop104 lib]$ ls | grep interceptor
flume-interceptor-1.0-SNAPSHOT-jar-with-dependencies.jar
3) Test the channel
(1) Start the Zookeeper and Kafka clusters
(2) Start Flume on hadoop104
[maxwell@hadoop104 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/kafka_to_hdfs_db.conf -Dflume.root.logger=info,console
(3) Generate mock data
[maxwell@hadoop102 bin]$ cd /opt/module/db_log/
[maxwell@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-11-14.jar
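(4) Check whether data appears under the target paths on HDFS
If incremental-table data has appeared under the target paths on HDFS, the data channel is working end to end.
(5) Note on the date in the target path
On closer inspection, the date in the target path is not the business date of the mock data but the current date. This is because the ts field in the JSON emitted by Maxwell records when the data change occurred, and the interceptor builds the path date from that field. In a real environment, the business date and the change date of the data would be the same.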
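To make the link between Maxwell's ts field and the HDFS directory concrete, the sketch below reproduces the two steps by hand: the interceptor multiplies ts (seconds) by 1000 and stores it in the timestamp header, and the HDFS sink expands %{tableName} and %Y-%m-%d in hdfs.path from the event headers. The function name resolve_hdfs_path and the sample values are hypothetical, GNU date is assumed, and the time zone defaults to Asia/Shanghai as an assumption about the cluster's local time.

```shell
# Print the directory the sink would use for one Maxwell record.
# $1 = table name, $2 = Maxwell ts in seconds, $3 = time zone (assumed default: Asia/Shanghai)
resolve_hdfs_path() {
  # Step 1: what the interceptor writes into the "timestamp" header (milliseconds).
  ts_ms=$(( $2 * 1000 ))
  # Step 2: what %Y-%m-%d expands to for that header value.
  day=$(TZ="${3:-Asia/Shanghai}" date -d "@$(( ts_ms / 1000 ))" +%Y-%m-%d)
  echo "/origin_data/gmall/db/${1}_inc/${day}"
}

# Hypothetical record: table order_info, changed at 2020-06-14 13:00 Beijing time.
resolve_hdfs_path order_info 1592110800
```

The call prints /origin_data/gmall/db/order_info_inc/2020-06-14. Because ts is the time of the change, data generated today always lands in today's directory, whatever business date the mock data carries.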
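4) Write a Flume start/stop script
For convenience, write a start/stop script for this Flume agent
(1) Create the script f3.sh in the /home/maxwell/bin directory on node hadoop102
[maxwell@hadoop102 bin]$ vim f3.sh
Fill in the script with the following content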
#!/bin/bash
case $1 in
"start")
echo " --------Starting business-data Flume on hadoop104-------"
ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent -n a1 -c /opt/module/flume/conf -f /opt/module/flume/job/kafka_to_hdfs_db.conf >/dev/null 2>&1 &"
;;
"stop")
echo " --------Stopping business-data Flume on hadoop104-------"
ssh hadoop104 "ps -ef | grep kafka_to_hdfs_db | grep -v grep |awk '{print \$2}' | xargs -n1 kill"
;;
esac
(2) Add execute permission to the script
[maxwell@hadoop102 bin]$ chmod 777 f3.sh
(3) Start Flume with f3.sh
[maxwell@hadoop102 module]$ f3.sh start
(4) Stop Flume with f3.sh
[maxwell@hadoop102 module]$ f3.sh stop
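The stop branch of f3.sh finds the agent by the name of its configuration file: grep kafka_to_hdfs_db keeps the matching processes, grep -v grep drops the grep process itself, and awk prints the second column of ps -ef, which is the PID. In the script, $2 is written as \$2 because the whole pipeline sits inside a double-quoted string passed to ssh, and the escape stops the local shell from expanding it. The sketch below runs the same filter chain over made-up ps -ef output so that the effect is visible without a running agent; sample_ps and find_flume_pids are hypothetical names.

```shell
# Hypothetical, trimmed `ps -ef` output used only for illustration.
sample_ps='maxwell   4321     1  2 10:00 ?      00:00:05 java org.apache.flume.node.Application -n a1 -f /opt/module/flume/job/kafka_to_hdfs_db.conf
maxwell   4400  4399  0 10:05 pts/0  00:00:00 grep kafka_to_hdfs_db
maxwell   5000     1  1 10:00 ?      00:00:03 java org.apache.flume.node.Application -n a1 -f /opt/module/flume/job/kafka_to_hdfs_log.conf'

# Same filters as the stop branch, but reading from a string and printing PIDs instead of killing them.
find_flume_pids() {
  echo "$1" | grep kafka_to_hdfs_db | grep -v grep | awk '{print $2}'
}

find_flume_pids "$sample_ps"
```

Only 4321 is printed: the grep process and the log-consumer agent are both filtered out, so stopping the business-data Flume leaves the agent started from kafka_to_hdfs_log.conf running.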
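2.2.6.3 Maxwell Configuration
1) The Maxwell timestamp issue
To simulate a real environment, the Maxwell source code has been modified here to add a mock_date parameter, which sets the date of the ts timestamp in the JSON that Maxwell emits. The following steps test it.
① Edit the Maxwell configuration file config.properties and add the mock_date parameter, as follows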
log_level=info
producer=kafka
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
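# Kafka topic configuration
kafka_topic=topic_db
# Note: this parameter exists only in the teaching build of Maxwell; restart Maxwell after changing it for the change to take effect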
mock_date=2020-06-14
# mysql login info
host=hadoop102
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai
Note: this parameter is for learning purposes only; restart Maxwell after changing it for the change to take effect.
② Restart Maxwell
[maxwell@hadoop102 bin]$ mxw.sh restart
③ Regenerate the mock data
[maxwell@hadoop102 bin]$ cd /opt/module/db_log/
[maxwell@hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-11-14.jar
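④ Check whether the date in the HDFS target path is now correct
2.2.6.4 First-Day Full Sync of Incremental Tables
Incremental tables normally need one full sync on the first day, followed by an incremental sync on each later day. The first-day full sync can be done with Maxwell's bootstrap feature. For convenience, the following script performs the first-day full sync of the incremental tables.
1) Create mysql_to_kafka_inc_init.sh in the ~/bin directory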
[maxwell@hadoop102 bin]$ vim mysql_to_kafka_inc_init.sh
The script content is as follows
#!/bin/bash
# This script initializes all incremental tables and only needs to be run once
MAXWELL_HOME=/opt/module/maxwell
import_data() {
$MAXWELL_HOME/bin/maxwell-bootstrap --database gmall --table $1 --config $MAXWELL_HOME/config.properties
}
case $1 in
"cart_info")
import_data cart_info
;;
"comment_info")
import_data comment_info
;;
"coupon_use")
import_data coupon_use
;;
"favor_info")
import_data favor_info
;;
"order_detail")
import_data order_detail
;;
"order_detail_activity")
import_data order_detail_activity
;;
"order_detail_coupon")
import_data order_detail_coupon
;;
"order_info")
import_data order_info
;;
"order_refund_info")
import_data order_refund_info
;;
"order_status_log")
import_data order_status_log
;;
"payment_info")
import_data payment_info
;;
"refund_payment")
import_data refund_payment
;;
"user_info")
import_data user_info
;;
"all")
import_data cart_info
import_data comment_info
import_data coupon_use
import_data favor_info
import_data order_detail
import_data order_detail_activity
import_data order_detail_coupon
import_data order_info
import_data order_refund_info
import_data order_status_log
import_data payment_info
import_data refund_payment
import_data user_info
;;
esac
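2) Add execute permission to mysql_to_kafka_inc_init.sh
[maxwell@hadoop102 bin]$ chmod 777 ~/bin/mysql_to_kafka_inc_init.sh
3) Test the sync script
(1) Clean up historical data
To make the result easier to inspect, delete the incremental-table data previously synced to HDFS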
[maxwell@hadoop102 ~]$ hadoop fs -ls /origin_data/gmall/db | grep _inc | awk '{print $8}' | xargs hadoop fs -rm -r -f
(2) Run the sync script
[maxwell@hadoop102 bin]$ mysql_to_kafka_inc_init.sh all
4) Check the sync result
Check whether the incremental-table data has reappeared on HDFS
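When inspecting the bootstrapped data, it helps to know that stock Maxwell marks each bootstrap run with three record types: one bootstrap-start record, one bootstrap-insert record per existing row, and one bootstrap-complete record. Only the bootstrap-insert records carry row data. The sketch below counts them in a few made-up records; sample_records and count_bootstrap_rows are hypothetical, and the field values are illustrative rather than real output.

```shell
# Hypothetical Maxwell bootstrap output for one table (values are made up).
sample_records='{"database":"gmall","table":"user_info","type":"bootstrap-start","ts":1592110800,"data":{}}
{"database":"gmall","table":"user_info","type":"bootstrap-insert","ts":1592110800,"data":{"id":1}}
{"database":"gmall","table":"user_info","type":"bootstrap-insert","ts":1592110800,"data":{"id":2}}
{"database":"gmall","table":"user_info","type":"bootstrap-complete","ts":1592110800,"data":{}}'

# Count the records that actually carry a table row.
count_bootstrap_rows() {
  echo "$1" | grep -c '"type":"bootstrap-insert"'
}

count_bootstrap_rows "$sample_records"
```

For the sample above the count is 2. The start and complete markers pass through the same Flume channel and land in the same HDFS directory as the rows, so later processing steps need to filter on the type field.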
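1) Create the cluster start/stop script cluster.sh in the ~/bin directory
[maxwell@hadoop102 bin]$ vim cluster.sh
Fill in the script with the following content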
#!/bin/bash
case $1 in
"start"){
echo ================== Starting cluster ==================
# Start the Zookeeper cluster
zk.sh start
# Start the Hadoop cluster
hdp.sh start
# Start the Kafka cluster
kf.sh start
# Start the log collection Flume
f1.sh start
# Start the log consumer Flume
f2.sh start
# Start the business-data consumer Flume
f3.sh start
# Start Maxwell
mxw.sh start
};;
"stop"){
echo ================== Stopping cluster ==================
# Stop Maxwell
mxw.sh stop
# Stop the business-data consumer Flume
f3.sh stop
# Stop the log consumer Flume
f2.sh stop
# Stop the log collection Flume
f1.sh stop
# Stop the Kafka cluster
kf.sh stop
# Stop the Hadoop cluster
hdp.sh stop
# Stop the Zookeeper cluster
zk.sh stop
};;
esac
2) Add execute permission to the script
[maxwell@hadoop102 bin]$ chmod 777 cluster.sh
3) Start the cluster with cluster.sh
[maxwell@hadoop102 module]$ cluster.sh start
4) Stop the cluster with cluster.sh
[maxwell@hadoop102 module]$ cluster.sh stop
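cluster.sh stops the services in exactly the reverse of the order in which it starts them, so that each consumer shuts down before the service it depends on. A minimal sketch of that rule, assuming the same seven helper scripts, keeps a single ordered list and walks it backwards for stop. It is a dry run that only prints the commands; the function name plan is hypothetical.

```shell
# Start order copied from cluster.sh; stop order is derived from it.
services="zk.sh hdp.sh kf.sh f1.sh f2.sh f3.sh mxw.sh"

# Print the commands for "start" or "stop" without running them.
plan() {
  case $1 in
    start)
      for s in $services; do echo "$s start"; done
      ;;
    stop)
      # Build the reversed list by prepending each service in turn.
      reversed=""
      for s in $services; do reversed="$s $reversed"; done
      for s in $reversed; do echo "$s stop"; done
      ;;
    *)
      echo "usage: plan start|stop" >&2
      return 1
      ;;
  esac
}

plan stop
```

With one list as the source of truth, adding a new service only requires one change, and the stop order cannot drift out of sync with the start order.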