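If the data is already stored on HDFS and is structured, you can point the table at it directly. The most common case is structured logs or computed results that grow by one increment per day. This kind of data needs almost no maintenance afterwards; the backend job just has to run normally every day. In that case, simply specify the path with location when creating the table.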
create external table rpt_search_flow_experiment
(date_num String,flowId string,pvlist int,uvlist int,pvdetail int,uvdetail int, uvall int,ord int,
pvctr1 string comment "pvdetail/pvlist",
pvctr2 string comment "ord/pvdetail",
pvctr string comment "ord/pvlist",
uvctr1 string comment "uvdetail/uvlist",
uvctr2 string comment "ord/uvdetail",
uvctr string comment "ord/uvlist",
gross int comment "gross profit",
net int comment "net profit",
gross_div_uv float comment "gross profit / total uv",
net_div_uv float comment "net profit / total uv")
partitioned by (day string)
row format delimited fields terminated by '\t'
location '/xxx/xxx/rpt_search_flow_experiment';
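In this situation an external table is generally the better choice, because dropping an internal (managed) table deletes its data as well. That is tolerable for computed result sets, which are usually not very large. For log files at today's data volumes, well, you know how that goes... So to be on the safe side, create an external table.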
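One caveat worth spelling out for a partitioned table like this one: Hive only sees partitions that are registered in the metastore, so a directory that the daily job writes under the table location is not queryable until its partition has been added. The following is a minimal sketch, not the author's own procedure. It assumes the job writes each day into a subdirectory named day=<value> under the table location; that directory layout is an assumption, and the elided /xxx/xxx path is kept exactly as in the DDL above.

```sql
-- Register one day's directory explicitly.
-- Assumption: the daily job wrote its files to .../day=20160101.
alter table rpt_search_flow_experiment
add if not exists partition (day = '20160101')
location '/xxx/xxx/rpt_search_flow_experiment/day=20160101';

-- Alternative: let Hive scan the table location and register every
-- day=<value> directory it finds there.
msck repair table rpt_search_flow_experiment;

-- Check which partitions the metastore now knows about.
show partitions rpt_search_flow_experiment;
```

The explicit alter table form works for any directory name. msck repair table relies on the day=<value> naming convention.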
hive> load data local inpath '/home/webopa/lei.wang/incubator/new_pv_uv/files/zzz'
> into table rpt_search_flow_experiment partition(day = '20160101');
Copying data from file:/home/webopa/lei.wang/incubator/new_pv_uv/files/zzz
Copying file: file:/home/webopa/lei.wang/incubator/new_pv_uv/files/zzz
Loading data to table rpt.rpt_search_flow_experiment partition (day=20160101)
OK
Time taken: 0.663 seconds
Loading from HDFS is similar to loading from a local file: just drop the local keyword. The full session is not repeated here.
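For reference, the HDFS variant might look like the sketch below. The source path and the partition value are hypothetical and chosen only for illustration.

```sql
-- Without the local keyword the path is resolved on HDFS.
-- '/tmp/new_pv_uv/zzz' is a hypothetical HDFS path, not one from the original session.
load data inpath '/tmp/new_pv_uv/zzz'
into table rpt_search_flow_experiment partition (day = '20160103');
```

One difference to keep in mind: a local file is copied into the partition directory, while an HDFS source file is moved there, so it no longer exists at its original path after the load.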
hive> insert overwrite table rpt.rpt_search_flow_experiment partition (day="20160102")
> select date_num, flowid, pvlist, uvlist, pvdetail, uvdetail, uvall, ord, pvctr1, pvctr2, pvctr, uvctr1, uvctr2, uvctr, gross, net, gross_div_uv, net_div_uv from rpt.rpt_search_flow_experiment where day = "20160101";
...
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://mycluster/tmp/hive-webopa/hive_2016-04-27_14-43-59_493_1688413412217862660-1/-ext-10000
Loading data to table rpt.rpt_search_flow_experiment partition (day=20160102)
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.69 sec HDFS Read: 2606 HDFS Write: 2361 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
Time taken: 26.5 seconds
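Note that because the table is partitioned, you cannot simply write select * here. select * also returns the partition column day, so the query produces 19 columns while the target partition expects the 18 non-partition columns, and Hive reports a column mismatch error.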
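If listing all 18 columns by hand is too tedious, Hive's regex column specification is one possible shortcut. This is a sketch under assumptions: it relies on Hive 0.13 or later, where backquoted regex columns are only recognised after the setting shown, and it has not been run against the cluster in this post.

```sql
-- Fails with a column mismatch: select * yields 18 data columns plus day.
-- insert overwrite table rpt.rpt_search_flow_experiment partition (day = '20160102')
-- select * from rpt.rpt_search_flow_experiment where day = '20160101';

-- Assumption: Hive 0.13+; treat backquoted names as regular expressions.
set hive.support.quoted.identifiers=none;

-- `(day)?+.+` matches every column name except day itself,
-- so the select returns exactly the 18 non-partition columns.
insert overwrite table rpt.rpt_search_flow_experiment partition (day = '20160102')
select `(day)?+.+`
from rpt.rpt_search_flow_experiment
where day = '20160101';
```

The explicit column list used above remains the more readable and version-independent choice. The regex form mainly saves typing on wide tables.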