Importing Data from HIVE into Greenplum

1. Start the gpfdist service (key parameters below):

/usr/local/greenplum-db/bin/gpfdist -d /home/gpadmin/data -p 8787 -l /home/gpadmin/data/interdir/gplog/gpfdist_8787.log

 

-d: directory that gpfdist serves external-table data files from

-p: port to listen on

-l: log file path

 

2. Verify that gpfdist is running, using ps or the jobs command; sample output:

ps -ef|grep gpfdist 或者 jobs

[2]+  Running                 gpfdist -d /export/gpdata/gpfdist/ -p 8001 -l /home/gpadmin/gpAdminLogs/gpfdist.log &  (wd: /export/gpdata/gpfdist)

 

==========================================================================

3. Create an external table in Greenplum (the path in LOCATION is relative to the directory passed to gpfdist -d):

 

create external table customer(name varchar(32), age int) location ('gpfdist://192.168.129.108:8787/interdir/dic_data/customer.txt') format 'text' (DELIMITER ',');

 

[gpadmin@bd129108 dic_data]$ cat customer.txt

wz,29

zhangsan,50

wangwu,19
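Once the file is in place, the external table can be queried directly; note that no data is stored in Greenplum, so each SELECT streams the file through gpfdist at query time:

```sql
-- returns the three rows from customer.txt
select * from customer;
```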

 

 

4. Load the data into an internal table:

create table customer_inter(name varchar(32), age int);

insert into customer_inter(name, age)  select name, age from customer;

 

5. Drop the temporary external table
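Step 5 can be done as follows (using the external table name from step 3); this removes only the table definition in Greenplum, leaving the source file under the gpfdist directory untouched:

```sql
-- drops the external-table definition; customer.txt is not deleted
drop external table customer;
```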

==========================================================================

 

II. Accessing Hive Data from Greenplum via an External Table

Create a table in Hive and load data into it:

create table gp_test(id int,name string) row format delimited fields terminated by '\001' stored as textfile;

load data local inpath '/tmp/zyl/gp_test.txt' into table gp_test;

 

 

 

Create an external table in GP that reads the data file on HDFS through the gphdfs protocol:

create external table gp_test (id int,name text) location ('gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test') format 'TEXT' (DELIMITER '\001');

 

Querying the table reports an error:

tydic=# select * from gp_test;

ERROR:  external table gphdfs protocol command ended with error. SLF4J: Class path contains multiple SLF4J bindings.  (seg4 slice1 192.168.129.120:40000 pid=6030)

DETAIL: 

 

SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/avro-tools-1.7.6-cdh5.8.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/pig-0.12.0-cdh5.8.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerB

Command: execute:source $GPHOME/lib//hadoop/hadoop_env.sh;java $GP_JAVA_OPT -classpath $CLASSPATH com.emc.greenplum.gpdb.hdfsconnector.HDFSReader $GP_SEGMENT_ID $GP_SEGMENT_COUNT TEXT cdh4.1-gnet-1.2.0.0 'gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test/gp_test.txt' '000000002300044000000002500044' 'id,name,'

External table gp_test, file gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test/gp_test.txt

 

Fix: in the Hadoop client configuration file, a host that is not part of the cluster had been added, which caused the HDFS read to fail; removing that host from the configuration resolved the error.

 

 

Create an internal table:

create table gp_test_inter(id int,name varchar(32));

insert into gp_test_inter(id, name)  select id, name from gp_test;
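As an alternative to the two-step CREATE plus INSERT above, a standard Greenplum CREATE TABLE AS can build and load the internal table in one statement (the DISTRIBUTED BY key chosen here is an assumption; pick a column suited to your data):

```sql
-- one-step build-and-load; cast matches the varchar(32) column above
create table gp_test_inter as
    select id, name::varchar(32) as name
    from gp_test
    distributed by (id);
```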

 

 

Testing the internal table against the external table:

Queries on the internal table are noticeably faster than on the external table, since an external-table scan re-reads the source data over gphdfs on every query.
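A minimal sketch for comparing the two in psql (\timing is a psql meta-command; actual timings depend on your cluster):

```sql
\timing on
select count(*) from gp_test;        -- external: pulls data from HDFS via gphdfs
select count(*) from gp_test_inter;  -- internal: scans local segment storage
```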

 

Reference:

http://blog.csdn.net/jiangshouzhuang/article/details/51721884?locationNum=7&fps=1
