本文的横表存储的是特征数据,表的数据结构是经常会变化的,比如添加特征操作。
特征表设计为横表操作不灵活,别人在引用特征时也将受到这种变化影响,增加特征后也可能带来数据异常。此外,特征数量庞大之后,选用特征也会及其不便。
所以我们考虑将关联得到的横表,转化为纵表进行存储。
代码操作的数据库为:Oracle
值得注意的是横表转纵表后,为了统一数据类型,一般统一存储为string类型。
with tmp1 as (
select dealer_id,count(subdealer_id) as dlr_subdealers_cnt
from dms_biz.dt_bb_subdealers
where sdf_status =50720005
group by dealer_id --状态为正常
),
temp2 as(
select company_id, count(dealer_id) as dlr_stations_cnt
from dms_biz.dt_bb_service_stations
where biz_status = 10160005 --营业状态正常
group by company_id
),
-- 以下为基于上面两个临时表以及其他主表,跑出来的横表结果
horizon_feature as (
select
cast(d1.dealer_id as varchar(255)) as dealer_id,
cast(d1.dealer_code as varchar(255)) as dealer_code,
cast(d1.company_id as varchar(255)) as company_id,
cast(d2.dict_value as varchar(255)) as dlr_category_desc,
cast(d1.is_new_energy as varchar(255)) as dlr_is_new_energy_cde,
cast(d1.is_used_car as varchar(255)) aS dlr_is_used_car_cde,
cast(r1.region_name as varchar(255)) as dlr_province_desc,
cast(r2.region_name as varchar(255)) as dlr_city_desc,
cast(r3.region_name as varchar(255)) as dlr_county_desc,
cast(tmp_subdealers.dlr_subdealers_cnt as varchar(255)) as subdealers_cnt,
cast(tmp_service_stations.dlr_stations_cnt as varchar(255)) as stations_cnt,
cast(tmp_dlr_cstm.dlr_customers_1m_acc as varchar(255)) as customers_1m_acc,
cast(tmp_dlr_cstm.dlr_customers_3m_avg as varchar(255)) as customers_3m_avg,
cast(tmp_dlr_cstm.dlr_customers_6m_avg as varchar(255)) as customers_6m_avg
from d1
left join d2 on d1.dealer_category=d2.dict_key
left join r1 on d1.province = r1.region_id and r1.is_inactive = 0
left join r2 on d1.city = r2.region_id and r2.is_inactive = 0
left join r3 on d1.county = r3.region_id and r3.is_inactive = 0
left join tmp1 on d1.dealer_id = tmp1.dealer_id
left join tmp2 on d1.company_id = tmp2.company_id
where d1.biz_status=100005 --状态为正常
)
--以下为横表horizon_feature转纵表代码
select dealer_id,dealer_code,company_id,tag_key,tag_value,sysdate
from horizon_feature
unpivot (
tag_value for tag_key in (
dlr_category_desc,
dlr_is_new_energy_cde,
dlr_is_used_car_cde,
dlr_province_desc,
dlr_city_desc,
dlr_county_desc,
dlr_subdealers_cnt,
dlr_stations_cnt,
dlr_customers_1m_acc,
dlr_customers_3m_avg,
dlr_customers_6m_avg
)
)
with as子查询
在生成横表之前,通过with as得到了2个子查询的结果,这个结果可以直接在加工横表时使用。horizon_feature这个横表,如果不落地的话,依然可以写在with as中。
这介个with as 表面上看是并列的关系,其实tmp1与tmp2已嵌套在horizon_feature中使用,本质是嵌套的关系。
cast 转换数据类型
值得注意的是,在加工横表时,不同字段数据类型不同,有字符串、日期或整型等,但横表转成纵表时,为了统一字段格式,需要将所有字段框定为字符串格式。不进行格式处理的话,代码肯定会报错的。
unpivot 实现横表转纵表
在oracle中可以使用pivot与unpivot进行表结构的纵横转换。如果对用法不清楚,可以单独检索pivot与unpivot的用法。
这里要解释一下,dealer_id, dealer_code, company_id是用于关联查找使用,其中dealer_id是主键。所以在加工成纵表时,这几个字段都进行保留。而tag_key, tag_value是纵表里新增加的字段名,你可以随意换成其他叫法。
接下来指明了tag_value是怎么来的,tag_key是指括号里的所有字段,这些字段来自横表。而tag_value就是横表中这些字段下的值。
select dealer_id,dealer_code,company_id,tag_key,tag_value,sysdate
from horizon_feature
unpivot (
tag_value for tag_key in (
dlr_category_desc,
dlr_is_new_energy_cde,
dlr_is_used_car_cde,
【解决方法】:
在oracle中,nvl()函数可以实现对空值的判断,若原值非空,返回原值;若原值为空,可以替换为指定的数值。
在旋转操作中使用nvl()是无效的,需要在生成横表的时候进行数据操作。
以下代码,region_name若为null,则返回0。
nvl(cast(r2.region_name as varchar(255)),0) as city_desc_dlr,
欢迎关注微信公众号“数据分析师手记”,一起成长进步!