Exporting the dataset:
1. Export the data for 2015-09-09 from the tvlog_tcl table in the tvlog database
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tvlog_tcl_2015_09_09'
SELECT *
FROM tvlog.tvlog_tcl
WHERE year = 2015 and month = 9 and day = 9;
2. The exported data looks like this:
f03ec5b8ed9f4d209864a39ab97b8711-1安徽卫视广西antv171.38.38.1512015-09-092015-09-09 19:24:252015-09-09 19:26:25CCTV-1综合CCTV-1综合40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599
f03ec5b8ed9f4d209864a39ab97b8711-1CCTV-1综合广西cctv1171.38.38.1512015-09-092015-09-09 19:26:252015-09-09 19:30:25安徽卫视安徽卫视40-8B-F6-A4-A1-49e4d0a284ce53e4ab810847adc33cb5a00d806370113522354201599
...
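The exported data appears to have no delimiters. The fields are in fact separated by Hive's default delimiters ^A and ^B, which are non-printing control characters and are not displayed, so the file is awkward to use.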
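One way to make the directory export usable is to convert the ^A separators into tabs after the fact. The sketch below is only an illustration: it assumes the fields are separated by ^A (byte 0x01) alone, that no field contains a literal tab, and that the part files follow Hive's usual 000000_0 naming. The function name ctrl_a_to_tab is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: replace Hive's default field delimiter ^A (octal 001)
# with a tab character. Reads stdin, writes stdout.
ctrl_a_to_tab() {
  tr '\001' '\t'
}

# Self-contained demonstration on a made-up three-field record.
printf 'userid\001channelname\001region\n' | ctrl_a_to_tab

# Against the real export it might be used like this
# (the part-file name 000000_0 is an assumption):
# ctrl_a_to_tab < /tmp/tvlog_tcl_2015_09_09/000000_0 > /tmp/tvlog_tcl_2015_09_09.tsv
```

Hive 0.11 and later should also accept a ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' clause directly on INSERT OVERWRITE LOCAL DIRECTORY, which avoids the post-processing step.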
3. Export the data by redirecting the output of the hive command-line client to a file
hive -e "SELECT * FROM tvlog.tvlog_tcl WHERE year = 2015 AND month = 9 AND day = 9;" > "/tmp/tvlog_tcl_2015_09_09.txt"
4. The exported data looks like this:
NULL cc4b7bc02a1c47bcb5f47f23c6e4a45b -1 贵州卫视 中国 5a7d01661b5d9c64293860531374312b 103.244.252.71 2015-09-09 2015-09-09 00:39:03 2015-09-09 00:51:03 安徽卫视 广东卫视 5C-36-B8-40-EA-91 248dca07ce4070d56b59a56dff1fb8d3e0125654 406355278 2015 9 9
NULL 21ea96ff7f31439b8434baf2b6953db9 -1 深圳卫视 四川 20831bb807a45638cfaf81df1122024d 222.215.124.45 2015-09-09 2015-09-09 00:01:02 2015-09-09 00:23:02 浙江卫视 40-8B-F6-6B-1B-52 40
...
The fields are separated by tab characters.
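Before loading a tab-separated export into an 18-column table, it can help to check that every line really has the expected number of fields. The following is a minimal sketch, not part of the original workflow; the function name field_count_histogram is hypothetical, and the demonstration runs on made-up data rather than the real export.

```shell
#!/bin/sh
# Hypothetical helper: print "<fields-per-line> <number-of-lines>" for a
# tab-separated file, sorted by field count.
field_count_histogram() {
  awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Self-contained demonstration: two 4-field lines and one 3-field line.
demo_file=$(mktemp)
printf '1\t2\t3\t4\n1\t2\t3\t4\na\tb\tc\n' > "$demo_file"
field_count_histogram "$demo_file"
rm -f "$demo_file"

# Against the real export one would expect a single line starting with 18:
# field_count_histogram /tmp/tvlog_tcl_2015_09_09.txt
```

Any line whose field count differs from 18 would be loaded with NULLs in the missing columns, so a histogram with more than one row is worth investigating before running LOAD DATA.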
Creating the tables:
Create a table stored as textfile with every column typed as STRING
DROP TABLE IF EXISTS test.tvlog_tcl_textfile_string;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_string (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt STRING,
starttime STRING,
endtime STRING,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum STRING,
year STRING,
month STRING,
day STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_string;
Loading data to table test.tvlog_tcl_textfile_string
Table test.tvlog_tcl_textfile_string stats: [numFiles=1, totalSize=549291193]
OK
Time taken: 2.4 seconds
Create a table stored as textfile with column types that match the data
DROP TABLE IF EXISTS test.tvlog_tcl_textfile_other;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_textfile_other (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt DATE,
starttime TIMESTAMP,
endtime TIMESTAMP,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum INT,
year INT,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/tvlog_tcl_2015_09_09.txt' INTO TABLE test.tvlog_tcl_textfile_other;
Loading data to table test.tvlog_tcl_textfile_other
Table test.tvlog_tcl_textfile_other stats: [numFiles=1, totalSize=549291193]
OK
Time taken: 2.36 seconds
Create a table stored as ORC with every column typed as STRING
DROP TABLE IF EXISTS test.tvlog_tcl_orc_string;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_string (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt STRING,
starttime STRING,
endtime STRING,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum STRING,
year STRING,
month STRING,
day STRING
)
STORED AS ORC;
INSERT INTO TABLE test.tvlog_tcl_orc_string
SELECT * FROM test.tvlog_tcl_textfile_string;
Loading data to table test.tvlog_tcl_orc_string
Table test.tvlog_tcl_orc_string stats: [numFiles=3, numRows=2223869, totalSize=87336289, rawDataSize=3863401633]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Cumulative CPU: 54.55 sec HDFS Read: 549326336 HDFS Write: 87336567 SUCCESS
Total MapReduce CPU Time Spent: 54 seconds 550 msec
OK
Time taken: 36.028 seconds
Create a table stored as ORC with column types that match the data
DROP TABLE IF EXISTS test.tvlog_tcl_orc_other;
CREATE TABLE IF NOT EXISTS test.tvlog_tcl_orc_other (
id STRING,
userid STRING,
channelid STRING,
channelname STRING,
region STRING,
channelcode STRING,
ip STRING,
dt DATE,
starttime TIMESTAMP,
endtime TIMESTAMP,
fromchannel STRING,
tochannel STRING,
mac STRING,
deviceid STRING,
dnum INT,
year INT,
month INT,
day INT
)
STORED AS ORC;
INSERT INTO TABLE test.tvlog_tcl_orc_other
SELECT * FROM test.tvlog_tcl_textfile_other;
Loading data to table test.tvlog_tcl_orc_other
Table test.tvlog_tcl_orc_other stats: [numFiles=3, numRows=2223869, totalSize=84204372, rawDataSize=2755419207]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 Cumulative CPU: 53.6 sec HDFS Read: 549326196 HDFS Write: 84204647 SUCCESS
Total MapReduce CPU Time Spent: 53 seconds 600 msec
OK
Time taken: 33.834 seconds
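In summary:
1. With textfile storage, the column types have no effect on table size. LOAD DATA copies the file into the table directory unchanged and the declared types are only applied when the data is read, so both textfile tables report totalSize=549291193.
2. With a compressed format (ORC), the table whose column types match the data is smaller than the all-STRING table: 84204372 bytes versus 87336289 bytes, about 3.6% smaller.
3. For these 2223869 rows, the ORC-to-textfile size ratio is 84204372 / 549291193 ≈ 0.1533, so the typed ORC table takes about 15.3% of the textfile size.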
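The ratios can be recomputed from the totalSize values in the logs with a small shell function. This is just a convenience sketch; the function name size_ratio is made up, and the byte counts are copied from the table stats reported earlier.

```shell
#!/bin/sh
# Hypothetical helper: print numerator / denominator rounded to 4 decimals.
size_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", a / b }'
}

textfile_bytes=549291193    # totalSize of both textfile tables
orc_string_bytes=87336289   # totalSize of test.tvlog_tcl_orc_string
orc_typed_bytes=84204372    # totalSize of test.tvlog_tcl_orc_other

echo "ORC (typed)  / textfile:     $(size_ratio "$orc_typed_bytes" "$textfile_bytes")"
echo "ORC (string) / textfile:     $(size_ratio "$orc_string_bytes" "$textfile_bytes")"
echo "ORC (typed)  / ORC (string): $(size_ratio "$orc_typed_bytes" "$orc_string_bytes")"
```

Both ORC tables come out at roughly 15 to 16% of the textfile size, which suggests that most of the saving comes from the storage format itself and only a small part from the choice of column types.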