Trafodion Bulk Load 初试

如在文章《Trafodion 数据加载介绍》中提到,Trafodion的Bulk Load与Tricke Load不同,Bulk Load主要是针对大数据量的且一般是批量装载的方式。Bulk Load通常使用LOAD语句实现。

下面介绍几种Bulk Load方式,

  • 从Trafodion表加载数据
  • 从HDFS文件加载数据(Hive外部表)
  • 从Hive表加载数据
  • 从外部数据库加载数据(Mysql to Hive)


从Trafodion表加载数据

(1)创建两个表结构相同的测试表

SQL>showddl test_tbl;


CREATE TABLE TRAFODION.SEABASE.TEST_TBL
  (
    ID                               INT NO DEFAULT NOT NULL NOT DROPPABLE
      SERIALIZED
  , NAME                             VARCHAR(10) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  )
;

--- SQL operation complete.

SQL>showddl test_tbl2;


CREATE TABLE TRAFODION.SEABASE.TEST_TBL2
  (
    ID                               INT NO DEFAULT NOT NULL NOT DROPPABLE
      SERIALIZED
  , NAME                             VARCHAR(10) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  )
;

--- SQL operation complete.

(2)表1有几条数据,表2没有数据

SQL>select * from test_tbl;

ID          NAME
----------- ----------
          1 LIU
          2 zhang
          3 LIU
          4 wu
          5 feng

--- 5 row(s) selected.

SQL>select * from test_tbl2;

--- 0 row(s) selected.

(3)从表1加载数据到表2

SQL>load into test_tbl2 select * from test_tbl;

UTIL_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------
Task: LOAD             Status: Started    Object: TRAFODION.SEABASE.TEST_TBL2
Task:  CLEANUP         Status: Started    Object: TRAFODION.SEABASE.TEST_TBL2
Task:  CLEANUP         Status: Ended      Object: TRAFODION.SEABASE.TEST_TBL2
Task:  PREPARATION     Status: Started    Object: TRAFODION.SEABASE.TEST_TBL2
       Rows Processed: 5
Task:  PREPARATION     Status: Ended      ET: 00:00:00.263
Task:  COMPLETION      Status: Started    Object: TRAFODION.SEABASE.TEST_TBL2
Task:  COMPLETION      Status: Ended      ET: 00:00:00.531

--- SQL operation complete.

SQL>select * from test_tbl2;

ID          NAME
----------- ----------
          1 LIU
          2 zhang
          3 LIU
          4 wu
          5 feng

--- 5 row(s) selected.


从HDFS文件加载数据(Hive 外部表)

(1)创建测试文件并利用scp或者其他方式上传到Hadoop Cluster上,内容如下

[centos@cent-1 tmp]$ cat test_table
Lucy,20,F
Jason,22,M
Stephen,21,M
Lily,19,F
Vincent,25,M

(2)将测试文件导入HDFS中

[hdfs@cent-1 ~]$ hadoop fs -mkdir -p /hive/test/test_tbl
[hdfs@cent-1 ~]$ hadoop fs -copyFromLocal /tmp/test_table /hive/test/test_tbl
[hdfs@cent-1 ~]$ hadoop fs -ls /hive/test/test_tbl
Found 1 items
-rw-r--r--   3 hdfs supergroup         57 2016-09-28 02:31 /hive/test/test_tbl/test_table

(3)在Trafodion中创建测试表

SQL>create table test_table ( name varchar(10), age int, sex varchar(1));

--- SQL operation complete.

SQL>showddl test_table;


CREATE TABLE TRAFODION.SEABASE.TEST_TABLE
  (
    NAME                             VARCHAR(10) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  , AGE                              INT DEFAULT NULL SERIALIZED
  , SEX                              VARCHAR(1) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  )
;

--- SQL operation complete.


(4)在Hive中创建Hive外部表,指向前面对应的HDFS文件

[hdfs@cent-1 ~]$ hive

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hive-common-1.1.0-cdh5.4.8.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> create external table test_tbl (name string,age  int,sex  string) row format delimited fields terminated by ',' location '/hive/test/test_tbl';
OK
Time taken: 0.085 seconds
hive> show tables;
OK
test_tbl
Time taken: 0.364 seconds, Fetched: 1 row(s)

(5)加载数据到Trafodion表

SQL>load into test_table
+>select * from hive.hive.test_tbl;

UTIL_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------
Task: LOAD             Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  CLEANUP         Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  CLEANUP         Status: Ended      Object: TRAFODION.SEABASE.TEST_TABLE
Task:  PREPARATION     Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
       Rows Processed: 5
Task:  PREPARATION     Status: Ended      ET: 00:00:01.272
Task:  COMPLETION      Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  COMPLETION      Status: Ended      ET: 00:00:00.254

--- SQL operation complete.

SQL>select * from test_table;

NAME       AGE         SEX
---------- ----------- ----
Lucy                20 F
Jason               22 M
Stephen             21 M
Lily                19 F
Vincent             25 M

--- 5 row(s) selected.


从Hive表加载数据

(1)在Hive中创建测试表并插入几条数据

hive> describe hive_tbl;
OK
name                    string
age                     int
sex                     string
Time taken: 0.107 seconds, Fetched: 3 row(s)
hive> select * from hive_tbl;
OK
Lucy    20      F
Jason   22      M
Stephen 21      M
Time taken: 0.093 seconds, Fetched: 3 row(s)

(2)Trafodion中的表继续沿用上述测试表,先清空数据

SQL>purgedata test_table;

--- SQL operation complete.

SQL>showddl test_table;


CREATE TABLE TRAFODION.SEABASE.TEST_TABLE
  (
    NAME                             VARCHAR(10) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  , AGE                              INT DEFAULT NULL SERIALIZED
  , SEX                              VARCHAR(1) CHARACTER SET ISO88591 COLLATE
      DEFAULT DEFAULT NULL SERIALIZED
  )
;

--- SQL operation complete.

(3)从Hive加载数据到Trafodion

(注:在做从hive到trafodion数据加载的时候有几个参数可以配置,用来提升加载性能,如HIVE_MAX_STRING_LENGTH、ALLOW_INCOMPATIBLE_ASSIGNMENT

   典型的用法如下:

    CQD HIVE_MAX_STRING_LENGTH '1000'; --用于当列的最大长度为1KB时

    CQD ALLOW_INCOMPATIBLE_ASSIGNMENT 'ON'; --用于当有时间相关的列的时候

SQL>load into test_table select * from hive.hive.hive_tbl;

UTIL_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------
Task: LOAD             Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  CLEANUP         Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  CLEANUP         Status: Ended      Object: TRAFODION.SEABASE.TEST_TABLE
Task:  PREPARATION     Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
       Rows Processed: 3
Task:  PREPARATION     Status: Ended      ET: 00:00:00.390
Task:  COMPLETION      Status: Started    Object: TRAFODION.SEABASE.TEST_TABLE
Task:  COMPLETION      Status: Ended      ET: 00:00:00.447

--- SQL operation complete.

SQL>select * from test_table;

NAME       AGE         SEX
---------- ----------- ----
Lucy                20 F
Jason               22 M
Stephen             21 M

--- 3 row(s) selected.


从外部数据库加载数据(Mysql to Hive)

(1)在集群上安装并启动Sqoop,Sqoop安装步骤请参照其官方文档 点击打开链接


(2)下载mysql connector Jar包,并放到/var/lib/sqoop目录下


(3)在某台机器上安装Mysql,安装Mysql步骤此处不作详细说明


(4)在mysql数据中创建测试表,并插入几条数据

mysql> use test;
Database changed
mysql> create table test_tbl(name varchar(10), age int, sex char(1));
Query OK, 0 rows affected (0.00 sec)

mysql> select * from test;
ERROR 1146 (42S02): Table 'test.test' doesn't exist
mysql> select * from test_tbl;
Empty set (0.00 sec)

mysql> insert into test_tbl values('Wang wei', 30, 'M');
Query OK, 1 row affected (0.00 sec)

mysql> insert into test_tbl values('Li ping', 29, 'F');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql>
mysql> select * from test_tbl;
+----------+------+------+
| name     | age  | sex  |
+----------+------+------+
| Wang wei |   30 | M    |
| Li ping  |   29 | F    |
+----------+------+------+
2 rows in set (0.00 sec)

(5)使用sqoop命令从mysql往hive导入数据

[hdfs@cent-2 ~]$  sqoop import --connect jdbc:mysql://localhost:3306/test --driver com.mysql.jdbc.Driver --username centos --table test_tbl --split-by name --hive-import --create-hive-table --hive-table test_tabl_hive
Warning: /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
...
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/jars/hive-common-1.1.0-cdh5.4.8.jar!/hive-log4j.properties
OK
Time taken: 3.052 seconds
Loading data to table default.test_tabl_hive
Table default.test_tabl_hive stats: [numFiles=4, totalSize=27]
OK
Time taken: 0.748 seconds

(6)从Trafodion或Hive Shell中查看数据是否导入成功

SQL>select * from hive.hive.test_tabl_hive;

NAME                                                                                                                             AGE         SEX                                                          
-------------------------------------------------------------------------------------------------------------------------------- ----------- --------------------------------------------------------------------------------------------------------------------------------
Li ping                                                                                                                                   29 F                                                            
Wang wei                                                                                                                                  30 M                                                            

--- 2 row(s) selected.






你可能感兴趣的:(Trafodion Bulk Load 初试)