尚硅谷大数据项目《在线教育之采集系统》笔记004

视频地址:尚硅谷大数据项目《在线教育之采集系统》_哔哩哔哩_bilibili

目录

P047

P048

P049

P050

P051

P052

P053

P054

P055

P056


P047

/opt/module/datax/job/base_province.json

尚硅谷大数据项目《在线教育之采集系统》笔记004_第1张图片

尚硅谷大数据项目《在线教育之采集系统》笔记004_第2张图片

[atguigu@node001 ~]$ hadoop fs -mkdir /base_province/2022-02-22
[atguigu@node001 ~]$ cd /opt/module/datax/
[atguigu@node001 datax]$ python bin/datax.py -p"-Ddt=2022-02-22" job/base_province.json

P048

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "defaultFS": "hdfs://node001:8020",
                        "path": "/base_province",
                        "column": [
                            "*"
                        ],
                        "fileType": "text",
                        "compress": "gzip",
                        "encoding": "UTF-8",
                        "nullFormat": "\\N",
                        "fieldDelimiter": "\t",
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "connection": [
                            {
                                "table": [
                                    "test_province"
                                ],
                                "jdbcUrl": "jdbc:mysql://node001:3306/edu?useUnicode=true&characterEncoding=utf-8"
                            }
                        ],
                        "column": [
                            "id",
                            "name",
                            "region_id",
                            "area_code",
                            "iso_code",
                            "iso_3166_2"
                        ],
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}
DROP TABLE IF EXISTS `test_province`;

CREATE TABLE `test_province`  (
  `id` BIGINT(20) NOT NULL,
  `name` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `region_id` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `area_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_code` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `iso_3166_2` VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE = INNODB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = DYNAMIC;

P049

MysqlReader插件文档:https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md

尚硅谷大数据项目《在线教育之采集系统》笔记004_第3张图片

尚硅谷大数据项目《在线教育之采集系统》笔记004_第4张图片 尚硅谷大数据项目《在线教育之采集系统》笔记004_第5张图片

并行度  task数量
2        11
3        16
4        21
n        n*5+1

P050

HFDS Writer并未提供nullFormat参数:也就是用户并不能自定义null值写到HFDS文件中的存储格式。默认情况下,HFDS Writer会将null值存储为空字符串(''),而Hive默认的null值存储格式为\N。所以后期将DataX同步的文件导入Hive表就会出现问题。

解决该问题的方案有两个:

  1. 一是修改DataX HDFS Writer的源码,增加自定义null值存储格式的逻辑,可参考记Datax3.0解决MySQL抽数到HDFSNULL变为空字符的问题_datax nullformat_谭正强的博客-CSDN博客。
  2. 二是在Hive中建表时指定null值存储格式为空字符串(''),例如:
DROP TABLE IF EXISTS base_province;

CREATE EXTERNAL TABLE base_province
(
    `id`         STRING COMMENT '编号',
    `name`       STRING COMMENT '省份名称',
    `region_id`  STRING COMMENT '地区ID',
    `area_code`  STRING COMMENT '地区编码',
    `iso_code`   STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
    `iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用'
) COMMENT '省份表'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    NULL DEFINED AS ''
    LOCATION '/base_province/';

尚硅谷大数据项目《在线教育之采集系统》笔记004_第6张图片

P051

第5章 DataX优化

P052

Maxwell 是由美国Zendesk公司开源,用Java编写的MySQL变更数据抓取软件。它会实时监控Mysql数据库的数据变更操作(包括insert、update、delete),并将变更数据以 JSON 格式发送给 Kafka、Kinesi等流数据处理平台。官网地址:Maxwell's Daemon

P053

尚硅谷大数据项目《在线教育之采集系统》笔记004_第7张图片

P054

[mysqld]

#数据库id
server-id = 1

##启动binlog,该参数的值会作为binlog的文件名
log-bin=mysql-bin

#binlog类型,maxwell要求为row类型
binlog_format=row

#启用binlog的数据库,需根据实际情况作出修改
binlog-do-db=edu

P055

[atguigu@node001 ~]$ mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.29 MySQL Community Server (GPL)

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show master status;
Empty set (0.00 sec)

mysql> ^DBye
[atguigu@node001 ~]$ mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.29-log MySQL Community Server (GPL)

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |      154 | edu          |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

mysql> CREATE DATABASE maxwell;
Query OK, 1 row affected (0.01 sec)

mysql> set global validate_password_policy=0;
ERROR 1193 (HY000): Unknown system variable 'validate_password_policy'
mysql> CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
Query OK, 0 rows affected (0.02 sec)

mysql> GRANT ALL ON maxwell.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)

mysql> quit
Bye
[atguigu@node001 ~]$ 

P056

  1. node001:启动zookeeper、kafka、maxwell。
  2. node002:[atguigu@node002 ~]$ kafka-console-consumer.sh --bootstrap-server node001:9092 --topic maxwell

[atguigu@node001 maxwell]$ cd /opt/module/maxwell/
[atguigu@node001 maxwell]$ ll
总用量 4
drwxrwxr-x 4 atguigu atguigu 4096 8月   9 16:00 maxwell-1.29.2
[atguigu@node001 maxwell]$ vim /etc/my.cnf
[atguigu@node001 maxwell]$  
[atguigu@node001 maxwell]$ sudo vim /etc/my.cnf
[atguigu@node001 maxwell]$ sudo systemctl restart mysqld
[atguigu@node001 maxwell]$ 
[atguigu@node001 maxwell]$ cd /opt/module/maxwell/maxwell-1.29.2/
[atguigu@node001 maxwell-1.29.2]$ cp config.properties.example config.properties
[atguigu@node001 maxwell-1.29.2]$ bin/maxwell --config config.properties --daemon
Redirecting STDOUT to /opt/module/maxwell/maxwell-1.29.2/bin/../logs/MaxwellDaemon.out
Using kafka version: 1.0.0
[atguigu@node001 maxwell-1.29.2]$ jps
5600 Maxwell
5631 Jps
[atguigu@node001 maxwell-1.29.2]$ zk.sh start
---------- zookeeper node001 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
---------- zookeeper node002 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
---------- zookeeper node003 启动 ----------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/zookeeper-3.5.7/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[atguigu@node001 maxwell-1.29.2]$ kf.sh start
--------------- node001 Kafka 启动 ---------------
--------------- node002 Kafka 启动 ---------------
--------------- node003 Kafka 启动 ---------------
[atguigu@node001 maxwell-1.29.2]$ myhadoop.sh start
 ================ 启动 hadoop集群 ================
 ---------------- 启动 hdfs ----------------
Starting namenodes on [node001]
Starting datanodes
Starting secondary namenodes [node003]
 --------------- 启动 yarn ---------------
Starting resourcemanager
Starting nodemanagers
 --------------- 启动 historyserver ---------------
[atguigu@node001 maxwell-1.29.2]$ jpsall 
================ node001 ================
5600 Maxwell
7314 Jps
7059 NodeManager
6483 NameNode
6647 DataNode
7276 JobHistoryServer
5742 QuorumPeerMain
================ node002 ================
4583 NodeManager
4921 Jps
4461 ResourceManager
4254 DataNode
3727 QuorumPeerMain
================ node003 ================
4240 DataNode
3703 QuorumPeerMain
4344 SecondaryNameNode
4474 NodeManager
4090 Kafka
4606 Jps
[atguigu@node001 maxwell-1.29.2]$ kf.sh stop
--------------- node001 Kafka 停止 ---------------
No kafka server to stop
--------------- node002 Kafka 停止 ---------------
No kafka server to stop
--------------- node003 Kafka 停止 ---------------
[atguigu@node001 maxwell-1.29.2]$ kf.sh start
--------------- node001 Kafka 启动 ---------------
--------------- node002 Kafka 启动 ---------------
--------------- node003 Kafka 启动 ---------------
[atguigu@node001 maxwell-1.29.2]$ jpsall 
================ node001 ================
5600 Maxwell
7937 Kafka
7059 NodeManager
6483 NameNode
8004 Jps
6647 DataNode
7276 JobHistoryServer
5742 QuorumPeerMain
================ node002 ================
5457 Jps
4583 NodeManager
5402 Kafka
4461 ResourceManager
4254 DataNode
3727 QuorumPeerMain
================ node003 ================
4240 DataNode
3703 QuorumPeerMain
4344 SecondaryNameNode
4474 NodeManager
5195 Jps
5102 Kafka
[atguigu@node001 maxwell-1.29.2]$ mock.sh
[atguigu@node001 maxwell-1.29.2]$ 
[atguigu@node002 ~]$ kafka-console-consumer.sh --bootstrap-server node001:9092 --topic maxwell

你可能感兴趣的:(#,大数据数仓,大数据,数仓,maxwel,mysql,linux,zookeeper,kafka)