Original SQL:
insert overwrite table in_yuncheng_tbshelf partition (pt) select userid, bookid, bookname, createts, rpid, addts, updatets, isdel, rcid, category_type, wapbookmarks, addmarkts, readingchapterid, readpercentage, readingts, substring(addts,0,10) as pt from search_product.yuncheng_tbshelf where pt>='2012-09-01'
Error:
[Fatal Error] Operator FS_3 (id=3): Number of dynamic partitions exceeded hive.exec.max.dynamic.partitions.pernode.. Killing the job.
hive.exec.max.dynamic.partitions.pernode (default 100): the maximum number of dynamic partitions each mapper or reducer node is allowed to create; if a single task exceeds this limit, the job fails with the error above.
hive.exec.max.dynamic.partitions (default 1000): the maximum total number of dynamic partitions a single DML statement is allowed to create.
hive.exec.max.created.files (default 100000): the maximum number of HDFS files that all mappers and reducers of a MapReduce job are allowed to create.
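As a side note, the current value of any of these settings can be checked from the Hive CLI by issuing set with just the property name, for example:
set hive.exec.max.dynamic.partitions.pernode;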
When the source table is large, the rows processed by a single map task may be spread across many different values of the partition column. A simple example: suppose the table below needs 3 map tasks:
1
1
1
2
2
2
3
3
3
If the data is distributed like this, each map task only needs to create one partition:
         |1
map1 --> |1
         |1
         |2
map2 --> |2
         |2
         |3
map3 --> |3
         |3
But if the data is distributed like this, the first map task alone already has to create 3 partitions (and so does every other map task):
         |1
map1 --> |2
         |3
         |1
map2 --> |2
         |3
         |1
map3 --> |2
         |3
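Before picking a fix, it helps to know how many partitions the statement will actually create. A quick diagnostic sketch, using the same partition expression as the INSERT: the count below is the total that hive.exec.max.dynamic.partitions is compared against, and without any clustering each map task may end up touching close to that many partitions on its own.
select count(distinct substring(addts,0,10)) from search_product.yuncheng_tbshelf where pt >= '2012-09-01';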
To keep rows that share the same partition-column value in the same task as much as possible, so that each task creates as few new partition directories as possible, use distribute by to route rows with the same partition value to the same reducer:
insert overwrite table in_yuncheng_tbshelf partition (pt) select userid, bookid, bookname, createts, rpid, addts, updatets, isdel, rcid, category_type, wapbookmarks, addmarkts, readingchapterid, readpercentage, readingts, substring(addts,0,10) as pt from search_product.yuncheng_tbshelf where pt>='2012-09-01' distribute by substring(addts,0,10)
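Once the job succeeds, the newly created daily partitions can be verified with a quick sanity check:
show partitions in_yuncheng_tbshelf;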
Alternatively, simply raising the value of hive.exec.max.dynamic.partitions.pernode might also work, but I haven't tried it yet.
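If you do want to try that route, a minimal sketch of the session-level settings to issue before the INSERT (the values below are illustrative assumptions, not tested recommendations; pick them based on how many distinct pt values the statement produces):
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.dynamic.partitions=5000;
set hive.exec.max.created.files=150000;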