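When users register on a rebate platform, they fill in their name, email, phone number, and similar details, and the system assigns each registration an app_id. During the trial run it turned out that the same person would register several times with different emails and phone numbers. The business system initially treated these as different users, which inflated rebate payouts, raised customer acquisition cost, and lowered operational quality.
To close this loophole, the business now requires that registrations sharing any one value among phone number and email be treated as the same user.
Following graph theory, treat each registration app_id as a vertex, and the phone number and email as attributes of that vertex. Use matching attribute values to find the direct links between app_ids, then recursively resolve the full relationship chain across all registered app_ids.
To take advantage of SQL's set-based processing, first reshape the data into a vertex/attribute-value layout so the later steps can be handled uniformly. The sample data is shown in the figure below: there are 5 registrant IDs, APP1…APP5, each with the attributes name, email, and phone number. Where two attribute values are equal, a red double-headed arrow connects them.
-- Unpivot the data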
/*
app_id proValue proType
APP1 张三 name
APP2 张三 name
APP3 张三帅 name
APP4 张三2 name
APP5 张三 name
APP10 李四 name
APP11 李四1 name
*/
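The same vertex/attribute layout can also be produced in a single pass over the table. The following is a minimal sketch, assuming SQL Server 2008 or later and the tb_regist_info table from the full code listing. It uses CROSS APPLY with a VALUES constructor in place of three UNION ALL branches; the alias v is only illustrative.

```sql
-- Sketch: unpivot each registration into (app_id, proValue, proType) rows.
-- Assumes tb_regist_info(app_id, name, phone, email) as created in the full listing.
select t.app_id, v.proValue, v.proType
from tb_regist_info t
cross apply (values
    (t.name,  'name'),   -- one output row per attribute of the vertex
    (t.phone, 'phone'),
    (t.email, 'email')
) as v(proValue, proType)
order by v.proType, t.app_id;
```

Either form yields three rows per registration; the UNION ALL version used in the rest of the article reads the table three times, while this one reads it once.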
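If two attribute values are equal, the registrations are taken to belong to the same person. Group by attribute value (and attribute type) and take the smallest app_id in each group as the group's parent ID, app_id_parent. This step turns matching attribute values into links between vertices.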
;with cte1 as (
select app_id,name as proValue,'name' as proType from tb_regist_info
union all
select app_id,phone as proValue,'phone' as proType from tb_regist_info
union all
select app_id,email as proValue,'email' as proType from tb_regist_info
)
,cte_app_id_uid as (
select app_id,proValue,proType,min(app_id) over(partition by proValue, proType) as app_id_parent
from cte1
)select * from cte_app_id_uid order by proValue
/*
app_id proValue proType app_id_parent
APP2 18666777777 phone APP2
APP2 [email protected] email APP2
APP2 张三 name APP1
APP3 张三帅 name APP3
APP3 [email protected] email APP2
*/
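Window partitions put all NULLs into one group, so registrations with a missing phone or email would be linked to each other by the step above. The following is a hedged sketch of one way to guard against that, assuming NULL and blank values should never count as a match. cte_clean is a hypothetical name that does not appear in the original code.

```sql
-- Sketch: drop NULL/blank attribute values before computing the group parent.
;with cte1 as (
    select app_id, name as proValue, 'name' as proType from tb_regist_info
    union all
    select app_id, phone, 'phone' from tb_regist_info
    union all
    select app_id, email, 'email' from tb_regist_info
)
,cte_clean as (   -- hypothetical helper CTE, not part of the article's code
    select app_id, proValue, proType
    from cte1
    where proValue is not null
      and ltrim(rtrim(proValue)) <> ''
)
select app_id, proValue, proType,
       min(app_id) over(partition by proValue, proType) as app_id_parent
from cte_clean
order by proValue;
```

Whether other placeholder values (for example a dummy phone number) need the same treatment depends on the actual registration data.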
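A link between vertices is in effect a parent-child relationship. First make sure each child maps to exactly one parent, i.e. child : parent = N : 1. Also filter out the rows in which a vertex is its own parent.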
select app_id,min(app_id_parent) as app_id_parent
from (select distinct app_id, app_id_parent from cte_app_id_uid ) a
where app_id<>app_id_parent
group by app_id
/*
app_id app_id_parent
APP11 APP10
APP12 APP10
APP13 APP10
APP2 APP1
APP3 APP2
APP4 APP3
APP5 APP1
*/
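At this point all the groundwork is done and the parent-child structure is fixed, so the recursion can be handled with a common table expression (CTE) in MSSQL. The recursion yields every ancestor reachable from each child vertex; finally, grouping by child vertex and taking the smallest parent ID satisfies the business requirement.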
select app_id,app_id_parent , 1 as rn from cte_mapping2
union all
select a.app_id,b.app_id_parent, a.rn+1 as rn
from cte_recrusive a join cte_mapping2 b on a.app_id_parent=b.app_id
/*
app_id app_id_parent rn
APP2 APP1 1
APP5 APP1 1
APP4 APP1 3
APP3 APP1 2
APP11 APP10 1
APP12 APP10 1
APP13 APP10 1
APP3 APP2 1
APP4 APP2 2
APP4 APP3 1
*/
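The final step described above (group by child vertex, keep the smallest parent ID) is not spelled out in the article's code. The following is a minimal sketch of how it might look, assuming the same CTE chain as the full listing. The column name app_id_root and the left join back to tb_regist_info (so that root vertices such as APP1 map to themselves) are additions made for this illustration.

```sql
-- Sketch: collapse the recursive result to one root ID per registration.
;with cte1 as (
    select app_id, name as proValue, 'name' as proType from tb_regist_info
    union all
    select app_id, phone, 'phone' from tb_regist_info
    union all
    select app_id, email, 'email' from tb_regist_info
)
,cte_app_id_uid as (
    select app_id, min(app_id) over(partition by proValue, proType) as app_id_parent
    from cte1
)
,cte_mapping2 as (
    select app_id, min(app_id_parent) as app_id_parent
    from cte_app_id_uid
    where app_id <> app_id_parent
    group by app_id
)
,cte_recrusive as (
    select app_id, app_id_parent, 1 as rn from cte_mapping2
    union all
    select a.app_id, b.app_id_parent, a.rn + 1
    from cte_recrusive a
    join cte_mapping2 b on a.app_id_parent = b.app_id
)
select t.app_id,
       coalesce(r.app_id_root, t.app_id) as app_id_root   -- roots have no parent row
from tb_regist_info t
left join (
    select app_id, min(app_id_parent) as app_id_root      -- smallest ancestor = root
    from cte_recrusive
    group by app_id
) r on r.app_id = t.app_id
order by app_id_root, t.app_id
option (maxrecursion 0);   -- lift the default limit of 100 recursion levels
```

Because every parent is strictly smaller than its child, each chain is strictly decreasing and cannot loop, which is why lifting the recursion limit is safe here. Judging from the recursion output shown above, APP2 through APP5 should resolve to APP1, and APP11 through APP13 to APP10.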
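The key here is to think in graph terms and to realize that the raw row data should be unpivoted into the standard vertex/attribute format. With that step in place, the problem is already half solved.
Full code: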
--create table tb_regist_info(app_id varchar(5), name varchar(30), phone varchar(20) , email varchar(50))
--insert into tb_regist_info(app_id, name, email, phone)
--values('APP1', '张三', '[email protected]','18611111111')
-- ,('APP2', '张三', '[email protected]','18666777777')
-- ,('APP3', '张三帅', '[email protected]','18666666666')
-- ,('APP4', '张三2', '[email protected]','18666666666')
-- ,('APP5', '张三', '[email protected]','18611111111')
-- ,('APP10', '李四', '[email protected]','19866666666')
-- ,('APP11', '李四1', '[email protected]','19611111111')
-- ,('APP12', '李四2', '[email protected]','19866666666')
-- ,('APP13', '李四', '[email protected]','19677777777')
;with cte1 as (
select app_id,name as proValue,'name' as proType from tb_regist_info
union all
select app_id,phone as proValue,'phone' as proType from tb_regist_info
union all
select app_id,email as proValue,'email' as proType from tb_regist_info
)
,cte_app_id_uid as (
select app_id,proValue,proType,min(app_id) over(partition by proValue, proType) as app_id_parent
from cte1
)
,cte_mapping2 as (
select app_id,min(app_id_parent) as app_id_parent
from (select distinct app_id, app_id_parent from cte_app_id_uid ) a
where app_id<>app_id_parent
group by app_id
)
,cte_recrusive as(
select app_id,app_id_parent , 1 as rn from cte_mapping2
union all
select a.app_id,b.app_id_parent, a.rn+1 as rn
from cte_recrusive a join cte_mapping2 b on a.app_id_parent=b.app_id
)
select * from cte_recrusive
order by app_id_parent
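
One case worth checking is a registration that bridges two groups whose smallest IDs differ. cte_mapping2 keeps only the smaller of the two parents, so the link to the other group is dropped: if A3 shares a phone with A1 and an email with A2, A3 maps to A1 while A2 stays unlinked. The following is a sketch of an alternative, not the article's method: an iterative label propagation in which every vertex repeatedly adopts the smallest label among the vertices it shares a value with. The temp tables #attr and #label and the column root_id are hypothetical names, and SQL Server 2008 or later is assumed.

```sql
-- Sketch: connected components by iterative minimum-label propagation.
if object_id('tempdb..#attr')  is not null drop table #attr;
if object_id('tempdb..#label') is not null drop table #label;

-- Vertex/attribute rows, skipping NULL and blank values.
select t.app_id, v.proValue, v.proType
into #attr
from tb_regist_info t
cross apply (values (t.name, 'name'), (t.phone, 'phone'), (t.email, 'email')) as v(proValue, proType)
where v.proValue is not null and v.proValue <> '';

-- Every vertex starts as its own root.
select app_id, app_id as root_id
into #label
from tb_regist_info;

declare @changed int = 1;
while @changed > 0
begin
    -- Adopt the smallest label seen among vertices sharing any attribute value.
    update l
    set l.root_id = m.min_root
    from #label l
    join (
        select a.app_id, min(l2.root_id) as min_root
        from #attr a
        join #attr b   on a.proValue = b.proValue and a.proType = b.proType
        join #label l2 on l2.app_id = b.app_id
        group by a.app_id
    ) m on m.app_id = l.app_id
    where m.min_root < l.root_id;

    set @changed = @@rowcount;   -- stop once no label moved
end;

select app_id, root_id from #label order by root_id, app_id;
```

Labels only ever decrease, so the loop terminates; the number of passes grows with the longest chain in the data. The self-join on #attr can be expensive when one value is shared by many registrations, so this is meant as a correctness reference rather than a tuned implementation.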