1. Problem background:
For a partitioned table with tens of thousands of partitions, Spark SQL crashes as soon as the query runs, while Hive does not. How should this be handled?
The SQL being executed:
ga10.coin_gain_lost is a partitioned table with tens of thousands of partitions.
The date column is the first-level partition key.
Caused by: org.apache.thrift.transport.TTransportException: Frame size (47350517) larger than max length (16384000)!
at org.apache.spark.sql.hive.client.HiveTable.getAllPartitions(ClientInterface.scala:74)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions(ThriftHiveMetastore.java:1979)
Initial diagnosis: Spark pulls back the metadata of every partition of this table (HiveTable.getAllPartitions), so the Thrift response from the metastore (about 47 MB here) exceeds the maximum frame size (about 16 MB) and the call fails.
Additional note: the same SQL runs fine in Hive.
Submit command / resources: spark-sql --num-executors 6 --driver-memory 20g --executor-memory 18g --master yarn
Looking at the Spark UI, no job is generated and there is no stage information.
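Two mitigations are commonly suggested for this failure mode; both are sketches and should be verified against the Hive/Spark versions in use. Raising the metastore's maximum Thrift message size lets the full partition list fit in one response, and enabling Spark's metastore-side partition pruning (available from Spark 1.5, off by default in 1.6) makes Spark ask the metastore only for the partitions matched by the partition filter instead of all of them:

# hive-site.xml on the metastore host -- assumed property, present in Hive 0.13+:
#   hive.metastore.server.max.message.size = 104857600   (100 MB, above the 47 MB frame in the error)

# spark-sql submit, pruning partitions in the metastore via the pushed-down partition filter:
spark-sql --master yarn \
  --num-executors 6 --driver-memory 20g --executor-memory 18g \
  --conf spark.sql.hive.metastorePartitionPruning=true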
2. Reproducing the problem
A Spark test that reproduces the problem by creating a large number of partitions.
* Spark test environment:
- Huawei RH2285 server (8 cores / 16 threads, 48 GB RAM)
- Windows + VMware Workstation + Ubuntu Linux
- 9 virtual machines (1 master, 8 workers), Hadoop 2.6.0, Spark 1.6.0, Scala 2.10
* Generating the mock data
- Generate hivepartitiontest.txt containing 3 rows of data
root@master:/usr/local/IMF_testdata# cat hivepartitiontest.txt
001,zhangsan
002,lisi
003,wangwu
- In Hive, create the partitioned table
create table partition_test
(member_id string,
name string
)
partitioned by (
stat_date string,
province string)
row format delimited fields terminated by ',';
- Table definition
hive> show create table partition_test;
OK
CREATE TABLE `partition_test`(
`member_id` string,
`name` string)
PARTITIONED BY (
`stat_date` string,
`province` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://master:9000/user/hive/warehouse/partition_test'
TBLPROPERTIES (
'transient_lastDdlTime'='1460257115')
Time taken: 20.963 seconds, Fetched: 16 row(s)
hive>
- Load the local file hivepartitiontest.txt into the Hive table partition_test
hive> LOAD DATA LOCAL INPATH '/usr/local/IMF_testdata/hivepartitiontest.txt' INTO TABLE partition_test PARTITION(stat_date='20160401', province='3');
Loading data to table default.partition_test partition (stat_date=20160401, province=3)
Partition default.partition_test{stat_date=20160401, province=3} stats: [numFiles=1, totalSize=33]
OK
Time taken: 2.08 seconds
- Generate the partition-creation script addpartitions.sh
#!/bin/bash
# brace expansion ({1..100}) needs bash, not plain /bin/sh
alias hive='/home/hadoop2/hive/bin/hive'
DATE_STR=`date +%Y%m%d`
# emit one ALTER TABLE ... ADD PARTITION statement per (day, province): 100 days x 1000 provinces
for i in {1..100};
do
DAYS_AGO=`date +%Y%m%d -d "$i days ago"`
for num in {1..1000};
do
echo alter table partition_test add partition\(stat_date="'"${DAYS_AGO}"'",province="'"$num"'"\)\;
done;
done;
$ chmod u+x addpartitions.sh
$ ./addpartitions.sh > partitions10w
$ hive -f partitions10w
The partitions were generated in batches (roughly 30k, 50k, and 100k at a time) by adjusting the outer loop range, e.g. for i in {1..30}; for i in {30..53}; for i in {53..101}; to ramp the load up step by step. The machine was rebooted once in the middle, so some directories were not created; in total 100531 partition directories were generated (a lower-overhead variant of the generator is sketched after the count below).
root@master:~# hadoop fs -count /user/hive/warehouse/partition_test
100531    1    33    /user/hive/warehouse/partition_test
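For reference, Hive also accepts multiple PARTITION specs in a single ALTER TABLE ... ADD PARTITION statement, so the generated script can be collapsed to one statement per day, reducing the number of statements the metastore has to process. A minimal sketch of that variant (not the script used in this test):

#!/bin/bash
# one ALTER TABLE per day, each adding all 1000 province partitions in a single statement
for i in {1..100}; do
  DAYS_AGO=`date +%Y%m%d -d "$i days ago"`
  echo -n "alter table partition_test add"
  for num in {1..1000}; do
    echo -n " partition(stat_date='${DAYS_AGO}',province='${num}')"
  done
  echo ";"
done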
* Spark simulation run
Start the Hive metastore service
root@master:/usr/local/spark-1.6.0-bin-hadoop2.6/bin# hive --service metastore &
[1] 5089
Start spark-sql
root@master:/usr/local/spark-1.6.0-bin-hadoop2.6/bin# spark-sql --master spark://192.168.189.1:7077
* Result: with the Hive table at ~100k partitions, the spark-sql query runs to completion
spark-sql> select * from partition_test where province ='3' and stat_date in ('20160401','20160328','20160405');
16/04/10 17:38:44 INFO parse.ParseDriver: Parsing command: select * from partition_test where province ='3' and stat_date in ('20160401','20160328','20160405')
16/04/10 17:38:44 INFO parse.ParseDriver: Parse Completed
......
16/04/10 17:41:13 INFO scheduler.DAGScheduler: Job 2 finished: processCmd at CliDriver.java:376, took 2.420353 s
001 zhangsan 20160401 3
002 lisi 20160401 3
003 wangwu 20160401 3
16/04/10 17:41:13 INFO CliDriver: Time taken: 149.108 seconds, Fetched 3 row(s)
spark-sql> 16/04/10 17:41:14 INFO scheduler.StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@482626f0
* The Spark history server shows 2 tasks for this job
* Result of the same query in Hive
hive> select * from partition_test where province ='3' and stat_date in ('20160401','20160328','20160405');
OK
001 zhangsan 20160401 3
002 lisi 20160401 3
003 wangwu 20160401 3
Time taken: 44.004 seconds, Fetched: 3 row(s)
hive>
====================================================================
Converting the Hive table's file format to Parquet and re-running
====================================================================
* In Hive, set the table's file format to Parquet
hive> ALTER TABLE partition_test SET FILEFORMAT parquet;
OK
Time taken: 2.882 seconds
hive> show create table partition_test;
OK
CREATE TABLE `partition_test`(
`member_id` string,
`name` string)
PARTITIONED BY (
`stat_date` string,
`province` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://master:9000/user/hive/warehouse/partition_test'
TBLPROPERTIES (
'last_modified_by'='root',
'last_modified_time'='1460286402',
'transient_lastDdlTime'='1460286402')
Time taken: 0.262 seconds, Fetched: 18 row(s)
hive>
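Worth noting: ALTER TABLE ... SET FILEFORMAT only changes the table-level metadata. Partitions that already exist keep the input format and SerDe they were created with, and the data files are not rewritten, so the loaded partition still holds the original text file. If existing partitions should genuinely become Parquet, their partition metadata would have to be updated and the data rewritten as well; a hypothetical per-partition example (illustrative values, not something done in this test):

# align one existing partition's stored format with the new table-level format
hive -e "ALTER TABLE partition_test PARTITION (stat_date='20160401', province='3') SET FILEFORMAT parquet;"
# the file inside that partition is still plain text, so the data itself would also need to be
# rewritten (e.g. INSERT OVERWRITE from a staging copy) before it is really Parquet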
* Run the SQL in Hive again
hive> select * from partition_test where province ='3' and stat_date in ('20160401','20160328','20160405');
OK
001 zhangsan 20160401 3
002 lisi 20160401 3
003 wangwu 20160401 3
Time taken: 61.539 seconds, Fetched: 3 row(s)
hive>
* Run the query in Spark SQL
select * from partition_test where province='3' and stat_date in ('20160401','20160328','20160405');
It fails with an error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2.0 (TID 20, worker3): java.io.IOException: Could not read footer: java.lang.RuntimeException: hdfs://master:9000/user/hive/warehouse/partition_test/stat_date=20160401/province=3/hivepartitiontest.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [103, 119, 117, 10]
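The mismatch comes from how the two engines read the table: Hive uses the input format and SerDe recorded on each partition, so the old partition is still read as text, while Spark SQL sees a Parquet table and converts the scan to its native Parquet data source, which then chokes on the text file's non-Parquet footer. A hedged workaround (behaviour should be confirmed on the Spark build in use) is to disable that conversion so Spark also reads through the Hive SerDe path:

# read the table through Hive SerDes (per-partition format) instead of Spark's native Parquet reader
spark-sql --master spark://192.168.189.1:7077 \
  --conf spark.sql.hive.convertMetastoreParquet=false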
* Spark SQL fails when the queried partitions include the one holding data, and it still fails when querying partitions that hold no data
spark-sql> select * from partition_test where province='2' and stat_date in ('20160402','20160322','20160406');
16/04/10 19:20:39 INFO parse.ParseDriver: Parsing command: select * from partition_test where province ='2' and stat_date in ('20160402','20160322','20160406')
16/04/10 19:20:39 INFO parse.ParseDriver: Parse Completed
* Drop the partition that contains the txt file
hive> alter table partition_test drop partition(stat_date='20160401', province='3');
Dropped the partition stat_date=20160401/province=3
OK
Time taken: 41.553 seconds
hive>
* Run Spark SQL once more: it now works (the Hive table format is Parquet, and the only non-Parquet data file has been dropped)
spark-sql> select * from partition_test where province='100' and stat_date in ('20160302','20160312','20160306');
16/04/10 19:40:48 INFO parse.ParseDriver: Parsing command: select * from partition_test where province ='100' and stat_date in ('20160302','20160312','20160306')
16/04/10 19:40:48 INFO parse.ParseDriver: Parse Completed
16/04/10 19:44:42 INFO scheduler.StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@50b216d7
16/04/10 19:44:42 INFO scheduler.StatsReportListener: task runtime:(count: 8, mean: 180.750000, stdev: 27.316433, max: 206.000000, min: 124.000000)
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 124.0 ms 124.0 ms 124.0 ms 169.0 ms 196.0 ms 204.0 ms 206.0 ms 206.0 ms 206.0 ms
16/04/10 19:44:42 INFO scheduler.StatsReportListener: task result size:(count: 8, mean: 912.000000, stdev: 0.000000, max: 912.000000, min: 912.000000)
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 912.0 B 912.0 B 912.0 B 912.0 B 912.0 B 912.0 B 912.0 B 912.0 B 912.0 B
16/04/10 19:44:42 INFO scheduler.StatsReportListener: executor (non-fetch) time pct: (count: 8, mean: 3.594028, stdev: 2.948552, max: 7.258065, min: 0.000000)
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 0 % 0 % 0 % 1 % 5 % 7 % 7 % 7 % 7 %
16/04/10 19:44:42 INFO scheduler.StatsReportListener: other time pct: (count: 8, mean: 96.405972, stdev: 2.948552, max: 100.000000, min: 92.741935)
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
16/04/10 19:44:42 INFO scheduler.StatsReportListener: 93 % 93 % 93 % 94 % 98 % 99 % 100 % 100 % 100 %
16/04/10 19:44:44 INFO datasources.DataSourceStrategy: Selected 3 partitions out of 100427, pruned 99.99701275553386% partitions.
16/04/10 19:44:44 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 212.8 KB, free 460.4 KB)
16/04/10 19:44:44 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 19.5 KB, free 479.9 KB)
16/04/10 19:44:44 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.189.1:41077 (size: 19.5 KB, free: 517.3 MB)
16/04/10 19:44:44 INFO spark.SparkContext: Created broadcast 10 from processCmd at CliDriver.java:376
16/04/10 19:46:02 INFO spark.ContextCleaner: Cleaned accumulator 6
16/04/10 19:46:03 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker6:33406 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker2:46221 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker8:52611 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker3:34777 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker7:51829 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker1:44626 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker4:52414 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on worker5:39102 in memory (size: 20.8 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_9_piece0 on 192.168.189.1:41077 in memory (size: 20.8 KB, free: 517.3 MB)
16/04/10 19:46:04 INFO spark.ContextCleaner: Cleaned accumulator 9
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker6:33406 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker2:46221 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker3:34777 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker8:52611 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker4:52414 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker1:44626 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker5:39102 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on worker7:51829 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.189.1:41077 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO spark.ContextCleaner: Cleaned accumulator 8
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on 192.168.189.1:41077 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker6:33406 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker7:51829 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker8:52611 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker1:44626 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker3:34777 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker2:46221 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker5:39102 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:04 INFO storage.BlockManagerInfo: Removed broadcast_6_piece0 on worker4:52414 in memory (size: 20.9 KB, free: 517.4 MB)
16/04/10 19:46:06 WARN hdfs.DFSClient: Slow ReadProcessor read fields took 78391ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [192.168.189.4:50010, 192.168.189.7:50010, 192.168.189.9:50010]
16/04/10 19:46:06 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-367257699-192.168.189.1-1454825792055:blk_1073741921_1107
java.io.IOException: Bad response ERROR for block BP-367257699-192.168.189.1-1454825792055:blk_1073741921_1107 from datanode 192.168.189.7:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
16/04/10 19:46:06 WARN hdfs.DFSClient: Error Recovery for block BP-367257699-192.168.189.1-1454825792055:blk_1073741921_1107 in pipeline 192.168.189.4:50010, 192.168.189.7:50010, 192.168.189.9:50010: bad datanode 192.168.189.7:50010
16/04/10 19:46:09 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
16/04/10 19:46:09 INFO parquet.ParquetRelation: Reading Parquet file(s) from
16/04/10 19:46:09 INFO parquet.ParquetRelation: Reading Parquet file(s) from
16/04/10 19:46:09 INFO parquet.ParquetRelation: Reading Parquet file(s) from
16/04/10 19:46:09 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/04/10 19:46:09 INFO scheduler.DAGScheduler: Job 9 finished: processCmd at CliDriver.java:376, took 0.000036 s
16/04/10 19:46:09 INFO CliDriver: Time taken: 322.634 seconds
spark-sql>
>