Broker Load is an asynchronous import method; the data sources it supports depend on the data sources supported by the Broker process.
After the user submits an import job, the FE generates the corresponding plan and, based on the current number of BEs and the size of the files, distributes the plan across multiple BEs, each of which imports a portion of the data.
While executing, each BE pulls data from the Broker, transforms it, and imports it into the system. Once all BEs have finished, the FE makes the final decision on whether the import succeeded.
                 +
                 | 1. user create broker load
                 v
            +----+----+
            |         |
            |   FE    |
            |         |
            +----+----+
                 |
                 | 2. BE etl and load the data
    +--------------------------+
    |            |             |
+---v---+     +--v----+    +---v---+
|       |     |       |    |       |
|  BE   |     |  BE   |    |  BE   |
|       |     |       |    |       |
+---+-^-+     +---+-^-+    +--+-^--+
    | |           | |         | |
    | |           | |         | | 3. pull data from broker
+---v-+-+     +---v-+-+    +--v-+--+
|       |     |       |    |       |
|Broker |     |Broker |    |Broker |
|       |     |       |    |       |
+---+-^-+     +---+-^-+    +---+-^-+
    | |           | |          | |
+---v-+-----------v-+----------v-+-+
|       HDFS/BOS/AFS cluster       |
|                                  |
+----------------------------------+
Download Hadoop 3.3:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
Set up passwordless SSH login to localhost:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
Running ssh localhost triggers a host-key security prompt; answer yes.
In a Docker environment, the sshd service has to be started manually:
/usr/sbin/sshd
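Before moving on to HDFS, it helps to confirm that passwordless login actually works. A minimal, non-interactive check (a sketch; BatchMode makes ssh fail instead of prompting for a password):

```shell
# Non-interactive check for passwordless SSH to localhost.
# BatchMode=yes forces a failure instead of a password prompt;
# the result depends on the environment this runs in.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  SSH_STATUS="ok"
else
  SSH_STATUS="not configured"
fi
echo "passwordless SSH: $SSH_STATUS"
```

If this prints "not configured", re-run the ssh-keygen/authorized_keys steps above before starting HDFS.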
core-site.xml configuration:
[root@17a5da45700b hadoop]# cat core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
</configuration>
hdfs-site.xml configuration:
[root@17a5da45700b hadoop]# cat hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
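Malformed XML in these files is a common reason the NameNode fails to start, so a quick well-formedness check can save a debugging round. A sketch using python3's standard XML parser (the demo file path is illustrative):

```shell
# Check that a Hadoop config file is well-formed XML before starting HDFS.
check_xml() {
  python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" "$1" 2>/dev/null \
    && echo "$1: OK" || echo "$1: malformed XML"
}

# Demo on a throwaway file mirroring hdfs-site.xml's structure.
cat > /tmp/hdfs-site-demo.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
check_xml /tmp/hdfs-site-demo.xml
```

Run the same check against the real core-site.xml and hdfs-site.xml under $HADOOP_HOME/etc/hadoop.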
hadoop-env.sh configuration: add the following settings to hadoop-env.sh.
export JAVA_HOME=/data1/jdk1.8.0_331
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Format the HDFS filesystem:
bin/hdfs namenode -format
Start HDFS:
sbin/start-dfs.sh
Create the HDFS directories:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/root
List the HDFS directory to confirm the HDFS service is working:
[root@17a5da45700b hadoop-3.3.3]# bin/hdfs dfs -ls /user
Found 2 items
drwxr-xr-x - root supergroup 0 2022-06-21 03:00 /user/hive
drwxr-xr-x - root supergroup 0 2022-06-15 09:38 /user/root
Configure the Hadoop environment variables:
export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Install Doris following the official guide: https://doris.apache.org/zh-CN/docs/get-starting/get-starting.html#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2
Doris and Hadoop have a few port conflicts, so the default Doris ports need to be changed.
vim be/conf/be.conf
Change webserver_port = 8040 to webserver_port = 18040.
vim fe/conf/fe.conf
Change http_port = 8030 to http_port = 18030.
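The two port edits can also be scripted rather than made by hand. A sketch using GNU sed (the in-place -i flag behaves differently on BSD/macOS; throwaway demo files stand in for be/conf/be.conf and fe/conf/fe.conf):

```shell
# Rewrite "key = old_number" to "key = new_number" in a config file (GNU sed).
fix_port() {  # usage: fix_port <file> <key> <new_port>
  sed -i "s/^$2[[:space:]]*=[[:space:]]*[0-9]*/$2 = $3/" "$1"
}

# Demo on throwaway copies of the two conflicting lines.
echo "webserver_port = 8040" > /tmp/be.conf
echo "http_port = 8030"      > /tmp/fe.conf
fix_port /tmp/be.conf webserver_port 18040
fix_port /tmp/fe.conf http_port 18030
cat /tmp/be.conf /tmp/fe.conf
```

Restart BE and FE after changing the ports so the new values take effect.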
mysql -h 127.0.0.1 -P9030 -uroot
create database test;
use test;
CREATE TABLE `test` (
`id` varchar(32) NULL DEFAULT "",
`user_name` varchar(32) NULL DEFAULT "",
`member_list` DECIMAL(10,3)
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
The test data file stream_load_data.csv contains the following rows:
5,sim5,1.500
6,sim6,1.006
7,sim7,1.070
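Rows whose field count does not match the column list are filtered during the load and counted against max_filter_ratio, so a quick sanity check on the CSV can prevent a failed job. A sketch (the /tmp file path is illustrative):

```shell
# Verify every row of the CSV has exactly 3 comma-separated fields,
# matching the (id, user_name, member_list) column list used below.
printf '5,sim5,1.500\n6,sim6,1.006\n7,sim7,1.070\n' > /tmp/stream_load_data.csv
awk -F',' 'NF != 3 { bad++ } END { print bad+0, "malformed rows" }' /tmp/stream_load_data.csv
```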
# Create the HDFS directory
bin/hdfs dfs -mkdir /user/root/doris_test
# Put the local file stream_load_data.csv into the doris_test directory on HDFS
bin/hdfs dfs -put /data1/hadoop-3.3.0/stream_load_data.csv /user/root/doris_test
Import the data via Broker Load with the following statement:
use test;
LOAD LABEL test.label_20220404
(
DATA INFILE("hdfs://127.0.0.1:9000/user/root/doris_test/stream_load_data.csv")
INTO TABLE `test`
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(id,user_name,member_list)
)
with HDFS (
"fs.defaultFS"="hdfs://127.0.0.1:9000",
"hadoop.username"="root"
)
PROPERTIES
(
"timeout"="1200",
"max_filter_ratio"="0.1"
);
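Because Broker Load runs asynchronously, the statement returns immediately and the job's progress has to be polled. A sketch of the polling query (assumes the mysql client and the label used above):

```shell
# Build the SHOW LOAD query for polling the asynchronous job.
# LABEL matches the LOAD LABEL in the statement above.
LABEL="label_20220404"
SQL="SHOW LOAD WHERE LABEL = '$LABEL' ORDER BY CreateTime DESC LIMIT 1"
echo "$SQL"
# Against a live cluster you would repeat something like:
#   mysql -h 127.0.0.1 -P9030 -uroot -e "$SQL" test
# until the State column reaches FINISHED or CANCELLED.
```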
Note: you can check the status of the import job with show load:
show load;
         JobId: 10041
         Label: label_20220404
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: BROKER
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=3
      TaskInfo: cluster:N/A; timeout(s):1200; max_filter_ratio:0.1
      ErrorMsg: NULL
    CreateTime: 2022-10-21 00:33:34
  EtlStartTime: 2022-10-21 00:33:38
 EtlFinishTime: 2022-10-21 00:33:38
 LoadStartTime: 2022-10-21 00:33:38
LoadFinishTime: 2022-10-21 00:33:38
           URL: NULL
    JobDetails: {"Unfinished backends":{"a32767db5a4249e8-96d523ac04909465":[]},"ScannedRows":3,"TaskNumber":1,"LoadBytes":96,"All backends":{"a32767db5a4249e8-96d523ac04909465":[10003]},"FileNumber":1,"FileSize":39}
 TransactionId: 5
  ErrorTablets: {}
6 rows in set (0.01 sec)
Query the table to verify the import result:
mysql> select * from test;
+------+-----------+-------------+
| id   | user_name | member_list |
+------+-----------+-------------+
| 5    | sim5      | 1.500       |
| 6    | sim6      | 1.006       |
| 7    | sim7      | 1.070       |
+------+-----------+-------------+
For details, see: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD