Doris + Iceberg + Spark Federated Query Hands-On

Environment:

Software   Version
Hadoop 3.3
Hive 3.1.3
Doris 1.1-pr2
Spark 3.1.2
Iceberg 0.13.2

Hadoop HDFS Installation

  1. Download Hadoop 3.3

    wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
    
  2. Passwordless SSH login to localhost

      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
      ssh localhost
    

    Run ssh localhost; when the security prompt appears, answer yes.

    In a Docker environment, the sshd service has to be started manually.

     /usr/sbin/sshd
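
    If sshd refuses to start in a fresh container, the host keys are often missing; generating them first (an extra step not in the original article, and dependent on your image) usually resolves it:

      # generate any missing default host keys, then launch sshd
      ssh-keygen -A
      /usr/sbin/sshd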
    
  3. core-site.xml configuration

    [root@17a5da45700b hadoop]# cat core-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.groups</name>
            <value>*</value>
        </property>
    </configuration>
    
  4. hdfs-site.xml configuration

    [root@17a5da45700b hadoop]# cat hdfs-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    
  5. hadoop-env.sh configuration: add the following to hadoop-env.sh.

    export JAVA_HOME=/opt/software/jdk8
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root
    export YARN_RESOURCEMANAGER_USER=root
    export YARN_NODEMANAGER_USER=root
    
  6. Format the HDFS filesystem

    bin/hdfs namenode -format
    
  7. Start HDFS

    sbin/start-dfs.sh
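
    Once start-dfs.sh returns, jps is a quick way to confirm the HDFS daemons are running (a minimal check; the exact process list depends on your setup):

      jps
      # expect NameNode, DataNode and SecondaryNameNode among the listed processes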
    
  8. Create HDFS directories

     bin/hdfs dfs -mkdir /user
     bin/hdfs dfs -mkdir /user/root	
    
  9. List the HDFS directories to confirm the HDFS service is healthy.

    [root@17a5da45700b hadoop-3.3.3]#  bin/hdfs dfs -ls /user
    Found 2 items
    drwxr-xr-x   - root supergroup          0 2022-06-21 03:00 /user/hive
    drwxr-xr-x   - root supergroup          0 2022-06-15 09:38 /user/root
    
  10. Configure the Hadoop environment variables.

    export HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
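
    These exports usually go into ~/.bashrc or /etc/profile so that new shells pick them up (which file you use is your choice); after reloading, hadoop version is a quick confirmation that the variables are in effect:

      source ~/.bashrc
      hadoop version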
    

Hive Installation

  1. Download and install Hive

    wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
    tar -xzvf apache-hive-3.1.3-bin.tar.gz
    cd apache-hive-3.1.3-bin
    
    
  2. Configure the Hive environment variables

    export HIVE_HOME=/opt/software/hive/apache-hive-3.1.3-bin
    export PATH=$HIVE_HOME/bin:$PATH
    
  3. Create the Hive directories on HDFS and grant write permission.

    $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
    $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
    $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
    $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
    
  4. Add the HADOOP_HOME setting to hive-env.sh

    # Create hive-env.sh from the template
    mv hive-env.sh.template hive-env.sh
    # Add the HADOOP_HOME setting
    HADOOP_HOME=/opt/software/hadoop/hadoop-3.3.3
    
  5. hive-site.xml configuration: create a new hive-site.xml file

    vim hive-site.xml
    [root@17a5da45700b conf]# cat hive-site.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>root</value>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>root</value>
        </property>
        <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>/user/hive/warehouse</value>
            <description>location of default database for the warehouse</description>
        </property>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://127.0.0.1:9083</value>
            <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
        </property>
        <property>
            <name>javax.jdo.PersistenceManagerFactoryClass</name>
            <value>org.datanucleus.api.jdo.JDOPersistenceManagerFactory</value>
        </property>
        <property>
            <name>hive.metastore.schema.verification</name>
            <value>false</value>
        </property>
        <property>
            <name>datanucleus.schema.autoCreateAll</name>
            <value>true</value>
        </property>
    </configuration>
    
  6. Create the hive database in MySQL; it is used to store the Hive metadata.

    mysql --host=localhost --user=root --password=root
    create database hive;
    use hive;
  7. Add the MySQL connector jar to apache-hive-3.1.3-bin/lib
    cd /opt/software/hive/apache-hive-3.1.3-bin/lib
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar

Note: the mysql-connector-java-5.1.49.jar version has to match the major version of MySQL; a connector newer than the MySQL server is likely to throw errors.

My MySQL version here is 5.7.38:

mysql> show variables like '%version%'
    -> ;
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| innodb_version          | 5.7.38                       |
| protocol_version        | 10                           |
| slave_type_conversions  |                              |
| tls_version             | TLSv1,TLSv1.1,TLSv1.2        |
| version                 | 5.7.38                       |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86_64                       |
| version_compile_os      | Linux                        |
+-------------------------+------------------------------+
8 rows in set (0.00 sec)

mysql> 
  8. Initialize the Hive metastore schema.
    schematool -initSchema -dbType mysql
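
    To verify the initialization, schematool's -info option reports the schema version now recorded in the metastore database:

      schematool -dbType mysql -info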
  9. Check the Hive metadata tables in MySQL
mysql> show tables;
+-------------------------------+
| Tables_in_hive                |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+
74 rows in set (0.00 sec)


  10. Start HiveServer2 and the Metastore

    hive --service hiveserver2  >/dev/null 2>/dev/null &
    hive --service metastore  1>/dev/null  2>&1  &
    
  11. Check the processes

    [root@17a5da45700b lib ~]# jobs
    [1]-  Running                 hive --service hiveserver2 > /dev/null 2> /dev/null &
    [2]+  Running                 hive --service metastore > /dev/null 2>&1 &
    
    
  12. Check the metastore port:

    [root@17a5da45700b lib]# netstat -ano |grep 9083
    tcp        0      0 0.0.0.0:9083            0.0.0.0:*               LISTEN      off (0.00/0/0)
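
    As an optional smoke test (assuming HiveServer2 is on its default port 10000), beeline can confirm that HiveServer2 answers queries:

      beeline -u jdbc:hive2://localhost:10000 -n root -e "show databases;"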
    

Doris Installation

  1. Install Doris following the official docs: https://doris.apache.org/zh-CN/docs/get-starting/get-starting.html#%E5%8D%95%E6%9C%BA%E9%83%A8%E7%BD%B2

  2. Doris and Hadoop have some port conflicts, so the Doris defaults need to be changed.

    vim be/conf/be.conf
    

    Change webserver_port = 8040 to webserver_port = 18040

    vim fe/conf/fe.conf 
    

    Change http_port = 8030 to http_port = 18030
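
    After FE and BE are started, a quick check over the MySQL protocol (assuming the default query port 9030, which is also used later in this article) confirms the cluster is reachable:

      mysql -h 127.0.0.1 -P 9030 -uroot -e "SHOW FRONTENDS\G"
      mysql -h 127.0.0.1 -P 9030 -uroot -e "SHOW BACKENDS\G"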

Spark Installation

  1. Download Spark:

    wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    tar -xzvf spark-3.1.2-bin-hadoop3.2.tgz
    
  2. Configure the Spark environment variables

    export SPARK_HOME=/opt/software/spark-3.1.2-bin-hadoop3.2
    export PATH=$PATH:$SPARK_HOME/bin
    
  3. Copy hive-site.xml into the Spark conf directory

    cp /opt/software/hive/apache-hive-3.1.3-bin/conf/hive-site.xml /opt/software/spark-3.1.2-bin-hadoop3.2/conf
    
  4. Add the MySQL driver, Hive and Iceberg dependencies to Spark

    cd /opt/software/spark-3.1.2-bin-hadoop3.2/jars/
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
    wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.2/iceberg-spark-runtime-3.1_2.12-0.13.2.jar
    wget https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar
    

    Note: after downloading, the files may end up with names like the one below; use mv to rename them to the correct jar file names:

    remotecontent?filepath=org%2Fapache%2Ficeberg%2Ficeberg-spark-runtime-3.1_2.12%2F0.13.2%2Ficeberg-spark-runtime-3.1_2.12-0.13.2.jar
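
    For example (the exact downloaded file name may differ slightly depending on how wget saved it):

      mv 'remotecontent?filepath=org%2Fapache%2Ficeberg%2Ficeberg-spark-runtime-3.1_2.12%2F0.13.2%2Ficeberg-spark-runtime-3.1_2.12-0.13.2.jar' iceberg-spark-runtime-3.1_2.12-0.13.2.jar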
    

Creating Databases and Tables with Iceberg on Spark

  1. Start spark-sql

    spark-sql --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
        --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
        --conf spark.sql.catalog.spark_catalog.type=hive \
        --conf spark.sql.catalog.hive_prod=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_prod.type=hive \
        --conf spark.sql.catalog.hive_prod.uri=thrift://127.0.0.1:9083
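
    If you would rather not repeat the --conf flags on every launch, the same catalog settings can be placed in $SPARK_HOME/conf/spark-defaults.conf (a sketch of the equivalent entries, not part of the original article):

      spark.sql.extensions                  org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
      spark.sql.catalog.spark_catalog       org.apache.iceberg.spark.SparkSessionCatalog
      spark.sql.catalog.spark_catalog.type  hive
      spark.sql.catalog.hive_prod           org.apache.iceberg.spark.SparkCatalog
      spark.sql.catalog.hive_prod.type      hive
      spark.sql.catalog.hive_prod.uri       thrift://127.0.0.1:9083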
  2. Enable Iceberg Hive support
    SET iceberg.engine.hive.enabled=true;
  3. Create the database and table:
    ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-hive-runtime-0.13.2.jar;
    ADD JAR /opt/software/hive/apache-hive-3.1.3-bin/lib/iceberg-spark-runtime-3.1_2.12-0.13.2.jar;
    create database hive_prod.testdb;
    use hive_prod.testdb;
    CREATE TABLE hive_prod.testdb.test_table (id bigint, data string) USING iceberg;
  4. Insert data:
    INSERT INTO hive_prod.testdb.test_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
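
    To confirm the write from the Spark side (a minimal check; the snapshots metadata table is a standard Iceberg feature in this version):

      SELECT * FROM hive_prod.testdb.test_table;
      -- inspect the table's Iceberg snapshots via the metadata table
      SELECT snapshot_id, operation FROM hive_prod.testdb.test_table.snapshots;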

Creating Iceberg External Tables in Doris and Reading Iceberg Data

  1. Log in to Doris
    mysql -h 127.0.0.1 -P 9030 -uroot
  2. Create the database, mapped to the Iceberg database:
    CREATE DATABASE test_iceberg_db
    PROPERTIES (
        "iceberg.database" = "testdb",
        "iceberg.hive.metastore.uris" = "thrift://127.0.0.1:9083",
        "iceberg.catalog.type" = "HIVE_CATALOG"
    );
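
    If you only need one table rather than a whole-database mapping, the Doris documentation for this version line also describes a per-table form roughly like the sketch below (not taken from the original article; check the syntax against your exact release):

      CREATE EXTERNAL TABLE doris_test_table
      ENGINE = ICEBERG
      PROPERTIES (
          "iceberg.database" = "testdb",
          "iceberg.table" = "test_table",
          "iceberg.hive.metastore.uris" = "thrift://127.0.0.1:9083",
          "iceberg.catalog.type" = "HIVE_CATALOG"
      );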
  3. Query the data from Doris.

    use test_iceberg_db
    mysql> show tables;
    +----------------------------+
    | Tables_in_test_iceberg_db |
    +----------------------------+
    | test_table                 |
    +----------------------------+
    1 row in set (0.00 sec)
    
    mysql> select * from test_table;
    +------+------+
    | id   | data |
    +------+------+
    |    1 | a    |
    |    2 | b    |
    |    3 | c    |
    +------+------+
    3 rows in set (0.40 sec)
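
    If new tables are added to the Iceberg database afterwards, this Doris version line also offers REFRESH statements to resync the mounted database (hedged: verify the exact statement against the docs of your release):

      REFRESH DATABASE test_iceberg_db;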
    
    
