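0 Symptom:
One metric in a business table in our warehouse is an average, but for historical reasons it was defined as int (the CREATE TABLE statement declares the column as int). The table is also a partitioned table stored as PARQUET.
Experiment 1:
First create a backup table b of the original table a, using the quick partition-table copy method described in an earlier post. Then experiment on b: change the column to double, pick a partition of b at random, and query it from both Hue and the Hive command line to see whether an error is raised.
1.1 Copy the data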
Create the backup table and run the experiment on it:
CREATE TABLE dm_teach_school_subject_count_day_bak (
  id string comment 'mysql结果表中的自增主键,hive里留空串',
  period_type int comment '1:日,2:周,3:月,4:学期内(日)',
  province_id int comment '省份ID',
  province_name string comment '省份名称',
  city_id int comment '地市ID',
  city_name string comment '地市名称',
  county_id int comment '区县ID',
  county_name string comment '区县名称',
  school_id int comment '学校ID',
  school_name string comment '学校名称',
  subject_id string comment '科目ID',
  subject_name string comment '科目名称',
  teacher_count int comment '教师人数',
  courseware_user_count int comment '创建课件人数',
  courseware_user_count_rate float comment '创建课件人数比',
  not_courseware_user_count int comment '未创建课件人数',
  never_courseware_user_count int comment '从未创建课件人数',
  never_courseware_user_count_rate float comment '从未创建课件人数比',
  courseware_count int comment '创建课件数',
  avg_courseware_count int comment '平均创建课件数',
  created_time string comment '记录插入时间'
)
COMMENT '授课-按学校科目统计表日表'
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'
STORED AS PARQUET;
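The earlier post with the quick copy method is not reproduced here, so the following is only a sketch of one common way to fill the backup table without rewriting the data: copy the partition directories on HDFS and let Hive rediscover them. Both warehouse paths are assumptions (they presume managed tables in the default database under the default warehouse location) and need to be replaced with the real table locations.

```sql
-- Assumed locations: replace both paths with the LOCATION values reported by
-- DESCRIBE FORMATTED for the source table and the backup table.
dfs -cp /user/hive/warehouse/dm_teach_school_subject_count_day/* /user/hive/warehouse/dm_teach_school_subject_count_day_bak/;

-- Register the copied day=... directories as partitions in the metastore.
MSCK REPAIR TABLE dm_teach_school_subject_count_day_bak;

-- Confirm that the partitions are now visible.
SHOW PARTITIONS dm_teach_school_subject_count_day_bak;
```

The `dfs` command is available in the Hive command line; whether it is allowed from Hue depends on how the cluster is configured.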
1.2 Change avg_courseware_count in the backup table dm_teach_school_subject_count_day_bak to double
alter table dm_teach_school_subject_count_day_bak change avg_courseware_count avg_courseware_count double COMMENT '平均创建课件数' cascade;
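Because the statement uses cascade, the new type should be written to the metadata of every existing partition as well as to the table itself. A quick way to check this, sketched here with the same partition that is queried in the next step, is to describe both levels:

```sql
-- Table-level schema: avg_courseware_count should be listed as double.
DESCRIBE dm_teach_school_subject_count_day_bak;

-- Partition-level schema: with cascade, this partition should also report double.
DESCRIBE dm_teach_school_subject_count_day_bak PARTITION (day = '2017-04-04');
```

Note that this only inspects metastore metadata. The Parquet files themselves are not rewritten by ALTER TABLE and still store the column as an integer.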
1.3 Query one randomly chosen day from dm_teach_school_subject_count_day_bak to check whether the data reads correctly
select * from dm_teach_school_subject_count_day_bak where day = '2017-04-04' and avg_courseware_count > 0 ;
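Result: the query returned rows normally in the Hive command line, but running it in Hue raised an error.
Experiment 2:
First create a backup table c of the original table a, again using the quick partition-table copy method from the earlier post. Then experiment on c: change the column in c to string, pick a partition of c at random, and query it from both Hue and the Hive command line to see whether an error is raised.
The commands are essentially the same as in the steps above, except that avg_courseware_count is changed to string. The outcome:
Both Hue and the Hive command line raised an error: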
Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveJavaObject(ParquetStringInspector.java:77)
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getDouble(PrimitiveObjectInspectorUtils.java:743)
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$DoubleConverter.convert(PrimitiveObjectInspectorConverter.java:238)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPGreaterThan.evaluate(GenericUDFOPGreaterThan.java:111)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:106)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497)
    ... 9 more
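3 Final approach:
Create a new table d whose columns have the correct types, then copy the partition data of table a into d. Finally, rename a to a_bak and rename d to a, so that d becomes the new business table a.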
CREATE TABLE dm_teach_school_subject_count_day_bak2 (
  id string comment 'mysql结果表中的自增主键,hive里留空串',
  period_type int comment '1:日,2:周,3:月,4:学期内(日)',
  province_id int comment '省份ID',
  province_name string comment '省份名称',
  city_id int comment '地市ID',
  city_name string comment '地市名称',
  county_id int comment '区县ID',
  county_name string comment '区县名称',
  school_id int comment '学校ID',
  school_name string comment '学校名称',
  subject_id string comment '科目ID',
  subject_name string comment '科目名称',
  teacher_count int comment '教师人数',
  courseware_user_count int comment '创建课件人数',
  courseware_user_count_rate float comment '创建课件人数比',
  not_courseware_user_count int comment '未创建课件人数',
  never_courseware_user_count int comment '从未创建课件人数',
  never_courseware_user_count_rate float comment '从未创建课件人数比',
  courseware_count int comment '创建课件数',
  avg_courseware_count double comment '平均创建课件数',  -- the correct type is specified here
  online_user_count int comment '在线授课人数',
  online_user_count_rate float comment '在线授课人数比',
  not_online_user_count int comment '未在线授课人数',
  never_online_user_count int comment '从未在线授课人数',
  never_online_user_count_rate float comment '从未在线授课人数比',
  online_duration int comment '在线授课时长(秒)',
  avg_online_duration int comment '平均在线授课时长(秒)',
  created_time string comment '记录插入时间'
)
COMMENT '授课-按学校科目统计表日表'
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n'
STORED AS PARQUET;
Copy the data from the original table into the new table:
-- This setting is required for the dynamic-partition insert below
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dm_teach_school_subject_count_day_bak2 partition(day)
select
  id, period_type,
  province_id, province_name,
  city_id, city_name,
  county_id, county_name,
  school_id, school_name,
  subject_id, subject_name,
  teacher_count,
  courseware_user_count, courseware_user_count_rate,
  not_courseware_user_count,
  never_courseware_user_count, never_courseware_user_count_rate,
  courseware_count, avg_courseware_count,
  online_user_count, online_user_count_rate,
  not_online_user_count,
  never_online_user_count, never_online_user_count_rate,
  online_duration, avg_online_duration,
  created_time,
  day
from dm_teach_school_subject_count_day
distribute by day;
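Before relying on the new table, it may be worth confirming that every partition was copied in full. The query below is a sketch of such a check and is not part of the original procedure: it compares per-day row counts between the two tables and lists only the days that differ.

```sql
-- Per-partition row counts of the original table (o) and the new table (n).
-- An empty result means every day has the same number of rows in both tables.
SELECT o.day,
       o.cnt AS original_rows,
       n.cnt AS copied_rows
FROM (
  SELECT day, COUNT(*) AS cnt
  FROM dm_teach_school_subject_count_day
  GROUP BY day
) o
LEFT JOIN (
  SELECT day, COUNT(*) AS cnt
  FROM dm_teach_school_subject_count_day_bak2
  GROUP BY day
) n
ON o.day = n.day
WHERE n.cnt IS NULL OR o.cnt <> n.cnt;
```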
Run the following in both Hue and the Hive command line:
select * from dm_teach_school_subject_count_day_bak2 where day = '2017-04-04' and avg_courseware_count > 0 ;
Both returned results normally.
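The last part of the final approach, swapping the table names, is not shown above. A minimal sketch could look like the following. The name dm_teach_school_subject_count_day_old is a hypothetical choice for the retired table, used here because dm_teach_school_subject_count_day_bak is already taken by the table from experiment 1.

```sql
-- Retire the original table under a new name (hypothetical name, see above).
ALTER TABLE dm_teach_school_subject_count_day
  RENAME TO dm_teach_school_subject_count_day_old;

-- Promote the corrected table to the original business table name.
ALTER TABLE dm_teach_school_subject_count_day_bak2
  RENAME TO dm_teach_school_subject_count_day;
```

Jobs that write to or read from the business table should be paused between the two statements, since the table name does not exist for a short moment.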