Test case: verify, separately, whether a TXT file and a gzip file can be computed in parallel (i.e., are splittable)?
1) Prepare a TXT file and a gzip file, each larger than one HDFS block (128 MB by default), and put them on HDFS.
2) Create a Hive table over each kind of data file, run the same HiveQL against both, and compare the number of mappers each job launches.
3) Draw a conclusion.
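Before sizing the test files, it is worth confirming the cluster's actual block size; 128 MB is only the common default. A quick check:
hdfs getconf -confKey dfs.blocksize    # prints the block size in bytes, e.g. 134217728 = 128 MB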
TXT file test:
To compress a TXT file to gzip while keeping the original, redirect gzip's stdout: gzip -c input.txt > input.txt.gz (with -c, gzip writes to stdout instead of replacing its input, so the original TXT survives).
cp test.txt test2.txt    # first make a copy of the file
cat test2.txt >> test.txt    # append the copy to grow the file past 128 MB (repeat several times, or use the loop below)
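Rather than appending by hand and re-checking, a small loop can grow the file until it crosses 128 MB (a sketch; du -m reports the size in MiB):
while [ $(du -m test.txt | cut -f1) -lt 130 ]; do cat test2.txt >> test.txt; done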
du -sh *    # check the file size on local disk
hdfs dfs -copyFromLocal test.txt /tmp/niuniu/test.txt    # upload to HDFS
hdfs dfs -du -h /tmp/niuniu/test.txt    # check the file size on HDFS
create table practice_score (
stdno string,
courseNo string,
score int,
opDate string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';    -- create the table
load data inpath '/tmp/niuniu/test.txt' into table practice_score;    -- load the HDFS TXT file into the table
(What LOAD DATA actually does is move the file into the table's location directory; check the path with show create table practice_score. The first batch of data must go in via LOAD DATA so that the table is registered as having data; after that, more data can be added by putting files directly into the location through HDFS, even if the file format differs.)
hdfs dfs -put test.txt <the table's location path (the part after apps)>
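For example (the warehouse path below is hypothetical; read the real one from the LOCATION clause in the SHOW CREATE TABLE output):
show create table practice_score;    -- note the LOCATION clause
hdfs dfs -put test.txt /apps/hive/warehouse/practice_score/    # hypothetical location path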
select stdno,count(1)
from practice_score
group by stdno;    -- a SELECT that triggers an MR job
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4
2018-11-28 00:07:15,242 Stage-1 map = 0%, reduce = 0%
2018-11-28 00:07:26,875 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.03 sec
2018-11-28 00:07:34,245 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 13.62 sec
2018-11-28 00:07:36,347 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.1 sec
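Two mappers is exactly what splittable input should produce here: the file is bigger than one 128 MB block, so it yields two input splits. On splittable data the mapper count can even be raised by shrinking the split size (a sketch using the standard Hadoop property; the value is illustrative):
set mapreduce.input.fileinputformat.split.maxsize=67108864;    -- cap splits at 64 MB, giving more mappers on this file
select stdno,count(1) from practice_score group by stdno;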
CSV file test:
cp test.csv test2.csv    # first make a copy of the file
cat test2.csv >> test.csv    # append the copy to grow the file past 128 MB (repeat several times)
du -sh *    # check the file size on local disk
hdfs dfs -copyFromLocal test.csv /tmp/niuniu/test.csv    # upload to HDFS
hdfs dfs -du -h /tmp/niuniu/test.csv    # check the file size on HDFS
create table practice_score2 (
stdno string,
courseNo string,
score int,
opDate string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';    -- create the table
load data inpath '/tmp/niuniu/test.csv' into table practice_score2;    -- load the HDFS CSV file into the table
select stdno,count(1)
from practice_score2
group by stdno;    -- a SELECT that triggers an MR job
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3
2018-11-28 10:33:33,800 Stage-1 map = 0%, reduce = 0%
2018-11-28 10:33:45,500 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.59 sec
2018-11-28 10:33:46,544 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.62 sec
2018-11-28 10:33:52,837 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 10.12 sec
2018-11-28 10:33:53,877 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.65 sec
2018-11-28 10:33:57,021 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.29 sec
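The mapper count is 2 again, as expected for a splittable file. The reducer counts (4 vs 3) differ only because Hive estimates them from input size; the knob behind that estimate can be inspected in the Hive CLI (a real Hive property; its value is cluster-dependent):
set hive.exec.reducers.bytes.per.reducer;    -- Hive launches roughly input_size / this_value reducers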
gzip file test:
gzip test.csv    # compress in place, producing test.csv.gz
gzip test2.csv
cat test2.csv.gz >> test.csv.gz    # append one gzip file to the other
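Concatenated gzip members form a single valid gzip stream, so the merged file decompresses as one; a quick sanity check before uploading:
gzip -t test.csv.gz    # exits 0 if the stream is intact
zcat test.csv.gz | head -3    # peek at the first few decompressed rows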
du -h test.csv.gz
hdfs dfs -copyFromLocal test.csv.gz /tmp/niuniu/test.csv.gz
hdfs dfs -ls /tmp/niuniu/
hdfs dfs -du -h /tmp/niuniu/test.csv.gz    # check the file size on HDFS
create table practice_score3 (
stdno string,
courseNo string,
score int,
opDate string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';    -- create the table (the data is the same CSV, so keep the ',' delimiter and the same schema as practice_score2)
load data inpath '/tmp/niuniu/test.csv.gz' into table practice_score3;    -- no special DDL is needed for gzip; Hive infers the codec from the .gz extension
select stdno,count(1)
from practice_score3
group by stdno;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2018-12-16 19:34:08,478 Stage-1 map = 0%, reduce = 0%
2018-12-16 19:34:28,306 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 20.36 sec
2018-12-16 19:34:46,987 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 39.52 sec
2018-12-16 19:35:07,703 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 61.67 sec
[... map progress advances about 1% every ~20 seconds; 62 near-identical lines omitted ...]
2018-12-16 19:57:07,375 Stage-1 map = 66%, reduce = 0%, Cumulative CPU 1441.0 sec
2018-12-16 19:57:28,046 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 1463.07 sec
2018-12-16 19:57:31,150 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1466.62 sec
2018-12-16 19:57:38,401 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 1468.17 sec
2018-12-16 19:57:40,490 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 1469.75 sec
2018-12-16 19:57:47,709 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1471.64 sec
MapReduce Total cumulative CPU time: 24 minutes 31 seconds 640 msec
Ended Job = job_1543589162810_0714
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 1471.72 sec HDFS Read: 180142367 HDFS Write: 436 SUCCESS
Total MapReduce CPU Time Spent: 24 minutes 31 seconds 720 msec
OK
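As a cross-check, the split-size trick that raised the mapper count on the plain-text table should change nothing here, because a gzip stream cannot be split (a sketch; same property as before):
set mapreduce.input.fileinputformat.split.maxsize=67108864;    -- 64 MB splits
select stdno,count(1) from practice_score3 group by stdno;    -- still runs with a single mapper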
Comparative analysis: plain text and CSV files are splittable, so the input is divided across multiple mappers (2 here) and the map phase finishes in seconds. A gzip file is not splittable, so the entire file is handed to a single mapper no matter how many blocks it spans, and the map phase takes far longer (over 23 minutes here versus about 10 seconds). For data that needs to be processed in parallel, avoid whole-file gzip compression and use a splittable format instead.