Getting Started with Doris Broker Load

How Broker Load Works

Broker Load is an asynchronous load method; the data sources it supports depend on the data sources supported by the Broker process.

After the user submits a load job, the FE generates the corresponding plan and, based on the current number of BEs and the size of the files, distributes the plan across multiple BEs, each of which loads a portion of the data.

During execution each BE pulls data from the Broker, transforms it, and writes it into the system. Only after all BEs have finished does the FE make the final decision on whether the load succeeded.

                 +
                 | 1. user create broker load
                 v
            +----+----+
            |         |
            |   FE    |
            |         |
            +----+----+
                 |
                 | 2. BE etl and load the data
    +--------------------------+
    |            |             |
+---v---+     +--v----+    +---v---+
|       |     |       |    |       |
|  BE   |     |  BE   |    |   BE  |
|       |     |       |    |       |
+---+-^-+     +---+-^-+    +--+-^--+
    | |           | |         | |
    | |           | |         | | 3. pull data from broker
+---v-+-+     +---v-+-+    +--v-+--+
|       |     |       |    |       |
|Broker |     |Broker |    |Broker |
|       |     |       |    |       |
+---+-^-+     +---+-^-+    +---+-^-+
    | |           | |          | |
+---v-+-----------v-+----------v-+-+
|       HDFS/BOS/AFS cluster       |
|                                  |
+----------------------------------+

Loading HDFS Data with Broker Load

1. Setting up HDFS
  1. Download Hadoop 3.3

    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
    
  2. Passwordless SSH login to localhost

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
      ssh localhost
    

    Running ssh localhost triggers a host-key confirmation prompt the first time; answer yes.

    In a Docker environment, the sshd service needs to be started manually:

     /usr/sbin/sshd
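
    If SSH connections are still refused, confirm that sshd is actually running (a quick check; the default port 22 is assumed):

     # the sshd process should show up; if not, start it as above
     pgrep -x sshd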
    
  3. Configure core-site.xml

    [root@17a5da45700b hadoop]# cat core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
        <!-- allow the root user to proxy requests from any host and group -->
        <property>
            <name>hadoop.proxyuser.root.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.groups</name>
            <value>*</value>
        </property>
    </configuration>

  4. Configure hdfs-site.xml

    [root@17a5da45700b hadoop]# cat hdfs-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <!-- single-node setup, so a replication factor of 1 is enough -->
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

  5. Configure hadoop-env.sh: add the following lines to hadoop-env.sh.

    export JAVA_HOME=/data1/jdk1.8.0_331
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
    
  6. Format the HDFS filesystem

    bin/hdfs namenode -format
    
  7. Start HDFS

    sbin/start-dfs.sh
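
    To confirm the HDFS daemons came up, list the Java processes with jps (shipped with the JDK):

     # expect NameNode, DataNode and SecondaryNameNode in the output
     jps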
    
  8. Create HDFS directories

     bin/hdfs dfs -mkdir /user
     bin/hdfs dfs -mkdir /user/root	
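
    The two commands can also be collapsed into one: the -p flag creates missing parent directories.

      bin/hdfs dfs -mkdir -p /user/root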
    
  9. List the HDFS directory to make sure the HDFS service is working properly.

    [root@17a5da45700b hadoop-3.3.3]#  bin/hdfs dfs -ls /user
    Found 2 items
    drwxr-xr-x   - root supergroup          0 2022-06-21 03:00 /user/hive
    drwxr-xr-x   - root supergroup          0 2022-06-15 09:38 /user/root
    
  10. Configure the Hadoop environment variables.

    export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
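
    Reload the shell profile and sanity-check the installation (this assumes the exports above went into ~/.bashrc; any shell profile works):

     # the Hadoop CLI should now resolve and report version 3.3.3
     source ~/.bashrc
     hadoop version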
    
2. Installing Doris
  1. Install Doris following the official docs: https://doris.apache.org/zh-CN/docs/get-starting/get-starting.html#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2

  2. Doris and Hadoop have some port conflicts, so Doris's default ports need to be changed.

    vim be/conf/be.conf
    

    Change webserver_port = 8040 to webserver_port = 18040.

    vim fe/conf/fe.conf 
    

    Change http_port = 8030 to http_port = 18030, then restart FE and BE as shown below.
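
    For the new ports to take effect, restart FE and BE and check that they respond. A minimal sketch, assuming the default Doris directory layout:

     # restart FE and BE so the new ports are picked up
     fe/bin/stop_fe.sh && fe/bin/start_fe.sh --daemon
     be/bin/stop_be.sh && be/bin/start_be.sh --daemon
     # both should print an HTTP status code (e.g. 200)
     curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18030
     curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:18040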

Creating the Doris Database and Table
mysql -h 127.0.0.1 -P9030  -uroot
create database test;
use test;
CREATE TABLE `test` (
  `id` varchar(32) NULL DEFAULT "",
  `user_name` varchar(32) NULL DEFAULT "",
  `member_list` DECIMAL(10,3)
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
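
To confirm the table was created with the expected schema, describe it (here via mysql -e; the same statement works from the interactive session):

mysql -h 127.0.0.1 -P9030 -uroot test -e 'DESC test;'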


Preparing Data on HDFS
  1. Create stream_load_data.csv and add the following rows:
5,sim5,1.500
6,sim6,1.006
7,sim7,1.070
  2. Upload the CSV to HDFS
# create the HDFS directory
bin/hdfs dfs -mkdir /user/root/doris_test
# put the local file stream_load_data.csv into the doris_test directory on HDFS
bin/hdfs dfs -put /data1/hadoop-3.3.0/stream_load_data.csv /user/root/doris_test
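
Before starting the load, it is worth confirming that the file actually landed on HDFS:

# list the target directory and print the file contents
bin/hdfs dfs -ls /user/root/doris_test
bin/hdfs dfs -cat /user/root/doris_test/stream_load_data.csv
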
Running the Broker Load Job

Run the following statement to load the data via Broker Load:

use test;
   LOAD LABEL test.label_20220404
        (
            DATA INFILE("hdfs://127.0.0.1:9000/user/root/doris_test/stream_load_data.csv")
            INTO TABLE `test`
            COLUMNS TERMINATED BY ","
            FORMAT AS "csv"          
            (id,user_name,member_list)
        ) 
        with HDFS (
            "fs.defaultFS"="hdfs://127.0.0.1:9000",
            "hadoop.username"="root"
        )
        PROPERTIES
        (
            "timeout"="1200",
            "max_filter_ratio"="0.1"
        );

Notes:

  • INTO TABLE test: the data is loaded into the Doris table test
  • COLUMNS TERMINATED BY ",": fields are separated by commas
  • FORMAT AS "csv": the file format is csv
  • (id,user_name,member_list): the columns to load

The status of the load job can be checked with show load:
         JobId: 10041
         Label: label_20220404
         State: FINISHED
      Progress: ETL:100%; LOAD:100%
          Type: BROKER
       EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=3
      TaskInfo: cluster:N/A; timeout(s):1200; max_filter_ratio:0.1
      ErrorMsg: NULL
    CreateTime: 2022-10-21 00:33:34
  EtlStartTime: 2022-10-21 00:33:38
 EtlFinishTime: 2022-10-21 00:33:38
 LoadStartTime: 2022-10-21 00:33:38
LoadFinishTime: 2022-10-21 00:33:38
           URL: NULL
    JobDetails: {"Unfinished backends":{"a32767db5a4249e8-96d523ac04909465":[]},"ScannedRows":3,"TaskNumber":1,"LoadBytes":96,"All backends":{"a32767db5a4249e8-96d523ac04909465":[10003]},"FileNumber":1,"FileSize":39}
 TransactionId: 5
  ErrorTablets: {}
6 rows in set (0.01 sec)
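
If a job stays in the PENDING or LOADING state and needs to be aborted, it can be cancelled by its label. A minimal sketch using the label from this example:

mysql -h 127.0.0.1 -P9030 -uroot -e 'CANCEL LOAD FROM test WHERE LABEL = "label_20220404";'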

Query the table to check the load result:

mysql> select * from test;
+------+-----------+-------------+
| id   | user_name | member_list |
+------+-----------+-------------+
| 5    | sim5      |       1.500 |
| 6    | sim6      |       1.006 |
| 7    | sim7      |       1.070 |
+------+-----------+-------------+

For details, see: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD

Finally, a plug for my books:
1. 《图解Spark 大数据快速分析实战(异步图书出品)》 https://item.jd.com/13613302.html
2. 《Offer来了:Java面试核心知识点精讲(第2版)(博文视点出品)》https://item.jd.com/13200939.html
3. 《Offer来了:Java面试核心知识点精讲(原理篇)(博文视点出品)》https://item.jd.com/12737278.html
4. 《Offer来了:Java面试核心知识点精讲(框架篇)(博文视点出品)》 https://item.jd.com/12868220.htm
