1. Start the gpfdist service; the relevant parameters are explained below:
/usr/local/greenplum-db/bin/gpfdist -d /home/gpadmin/data -p 8787 -l /home/gpadmin/data/interdir/gplog/gpfdist_8787.log
-d: directory where the external-table data files are stored
-p: listening port
-l: log file
2. Verify the gpfdist service, for example with the jobs command; the result looks like this:
ps -ef | grep gpfdist    (or: jobs)
[2]+ Running gpfdist -d /export/gpdata/gpfdist/ -p 8001 -l /home/gpadmin/gpAdminLogs/gpfdist.log & (wd: /export/gpdata/gpfdist)
==========================================================================
3. Create the external table in Greenplum:
create external table customer(name varchar(32), age int) location ('gpfdist://192.168.129.108:8787/interdir/dic_data/customer.txt') format 'text' (DELIMITER ',');
[gpadmin@bd129108 dic_data]$ cat customer.txt
wz,29
zhangsan,50
wangwu,19
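If the source file might contain malformed rows, a single-row reject limit keeps one bad line from aborting the whole query. A minimal sketch, assuming the same gpfdist location as above; the table name customer_tolerant and the limit of 10 rows are only illustrative:
create external table customer_tolerant(name varchar(32), age int)
location ('gpfdist://192.168.129.108:8787/interdir/dic_data/customer.txt')
format 'text' (DELIMITER ',')
segment reject limit 10 rows;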
4. Load the data into an internal table:
create table customer_inter(name varchar(32), age int);
insert into customer_inter(name, age) select name, age from customer;
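An equivalent one-step alternative is CREATE TABLE AS. A sketch; the table name customer_inter_ctas and the distribution key name are assumptions, not part of the original setup:
create table customer_inter_ctas as
select name, age from customer
distributed by (name);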
5. Drop the temporary external table once the data has been loaded into the internal table.
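A minimal sketch for the customer external table created in step 3:
drop external table customer;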
==========================================================================
II. Accessing Hive data from Greenplum through an external table
Create a table in Hive and load data into it:
create table gp_test(id int,name string) row format delimited fields terminated by '\001' stored as textfile;
load data local inpath '/tmp/zyl/gp_test.txt' into table gp_test;
Create an external table in Greenplum and read the data file on HDFS through the gphdfs protocol:
create external table gp_test (id int,name text) location ('gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test') format 'TEXT' (DELIMITER '\001');
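Before querying, it can help to confirm the gphdfs-related server settings from psql. A sketch, assuming Greenplum 4.x/5.x where gphdfs is available; gp_hadoop_home and gp_hadoop_target_version are the usual gphdfs configuration parameters, and the expected values depend on the Hadoop distribution (here a CDH cluster):
show gp_hadoop_home;
show gp_hadoop_target_version;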
Querying the table reports an error:
tydic=# select * from gp_test;
ERROR: external table gphdfs protocol command ended with error. SLF4J: Class path contains multiple SLF4J bindings. (seg4 slice1 192.168.129.120:40000 pid=6030)
DETAIL:
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/avro-tools-1.7.6-cdh5.8.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/pig-0.12.0-cdh5.8.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerB
Command: execute:source $GPHOME/lib//hadoop/hadoop_env.sh;java $GP_JAVA_OPT -classpath $CLASSPATH com.emc.greenplum.gpdb.hdfsconnector.HDFSReader $GP_SEGMENT_ID $GP_SEGMENT_COUNT TEXT cdh4.1-gnet-1.2.0.0 'gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test/gp_test.txt' '000000002300044000000002500044' 'id,name,'
External table gp_test, file gphdfs://192.168.129.106:8020/user/hive/warehouse/gp_test/gp_test.txt
Fix: the problem was in the Hadoop configuration file; a machine that is not part of the cluster had been added to the configuration, which broke reading from HDFS. Removing the extraneous host from the configuration resolved the error.
Create an internal table:
create table gp_test_inter(id int,name varchar(32));
insert into gp_test_inter(id, name) select id, name from gp_test;
Compare the internal table with the external table:
Queries against the internal table are noticeably faster than against the external table.
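One simple way to compare them is to enable timing in psql and run the same aggregate against both tables; a minimal sketch using the gp_test tables above:
\timing on
select count(*) from gp_test;        -- external: data read from HDFS via gphdfs at query time
select count(*) from gp_test_inter;  -- internal: data stored in Greenplum segments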
Reference:
http://blog.csdn.net/jiangshouzhuang/article/details/51721884?locationNum=7&fps=1