Implementing Alibaba's Swing recommendation algorithm in SQL

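Swing was first applied in Alibaba's Fliggy travel product. User click data there is sparse and scattered, so it needs more data (a longer time window) than e-commerce (Taobao) to give good results. Swing is now widely used at Alibaba and has proven to be a very effective recall method in several of its businesses. It assumes that the user-item-user structure is more stable than the single-edge structure of itemCF, which helps with the low precision and poor generalization of collaborative-filtering recall. It can be seen as an improvement on collaborative filtering.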

Algorithm overview

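The idea behind Swing is fairly simple, and it is one of the recall algorithms Alibaba adopted early on. As of this writing I have not found a public paper that introduces and explains it (maybe it is too simple for Alibaba to bother, haha), so the description below is pieced together from various sources online. If anything is wrong, corrections are welcome.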

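"Swing" refers to a playground swing. For example, if user u1 and user u2 have both bought the same item, the three of them form a graph that looks like a swing. If, besides item1, u1 and u2 have also both bought item2, then the two items are considered similar to some degree.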

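In other words, similarity between items is passed along through users. To measure the similarity of item1 and item2, look at every pair of users u1 and u2 who bought both items: the fewer items the two users have bought in common overall, the more that pair adds to the similarity of item1 and item2.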
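A toy illustration of this idea, as a minimal sketch. The user names, item names and the `pair_weight` / `swing_sim` helpers are made up for this example; the weight 1 / (0.5 + overlap) is the one used by the code later in this post.

```python
from itertools import permutations

# Hypothetical toy data: user -> set of purchased items.
user_items = {
    "u1": {"item1", "item2"},
    "u2": {"item1", "item2"},
    "u3": {"item1", "item2", "item3", "item4", "item5"},
    "u4": {"item1", "item2", "item3", "item4", "item5"},
}

ALPHA = 0.5  # smoothing constant, same value as the code later in this post


def pair_weight(u, v):
    """Contribution of the user pair (u, v): shrinks as their overlap grows."""
    return 1.0 / (ALPHA + len(user_items[u] & user_items[v]))


def swing_sim(i, j):
    """Sum the pair weights over all ordered user pairs that bought both i and j."""
    common = [u for u, items in user_items.items() if i in items and j in items]
    return sum(pair_weight(u, v) for u, v in permutations(common, 2))


# u1 and u2 share only 2 items, u3 and u4 share 5, so the first pair counts more.
print(pair_weight("u1", "u2"), pair_weight("u3", "u4"))
print(swing_sim("item1", "item2"), swing_sim("item3", "item4"))
```

Both pairs bought item1 and item2, but the pair with the narrower shared history is treated as stronger evidence that the two items are related.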

  • The Swing expression is as follows:
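Consistent with the SQL and Python in this post, which both sum 1 / (0.5 + overlap) over user pairs, the score can be written as below. The symbols U_i (users who interacted with item i), I_u (items user u interacted with) and α (smoothing constant, 0.5 in the code) are notation chosen here for illustration, not taken from an official source.

```latex
\mathrm{sim}(i, j) \;=\; \sum_{u \in U_i \cap U_j} \;\; \sum_{\substack{v \in U_i \cap U_j \\ v \neq u}} \frac{1}{\alpha + \lvert I_u \cap I_v \rvert}
```

The double sum runs over ordered pairs, so each unordered pair {u, v} is counted twice, which matches what the code computes.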

SQL implementation

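  • Count the number of items shared by every pair of users
sql("""
    -- Assumes tab holds one distinct row per (u_i, pid) interaction.
    select a.u_i as u1, b.u_i as u2, count(a.pid) as u1_u2_icn
    from tab as a
    join tab as b on a.pid = b.pid
    where a.u_i <> b.u_i
    group by a.u_i, b.u_i
""").createOrReplaceTempView("u1_u2_icn")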
  • Compute item-to-item similarity with the Swing formula
sql("""
    with pairs as (
        -- every ordered user pair (u1, u2) together with each item both interacted with
        select a.u_i as u1, b.u_i as u2, a.pid as pid
        from tab as a
        join tab as b on a.pid = b.pid
        where a.u_i <> b.u_i
    )
    select tb.i1, tb.i2, sum(1 / (0.5 + icn.u1_u2_icn)) as sim
    from (
        -- user pairs that share both i1 and i2
        select p1.u1, p1.u2, p1.pid as i1, p2.pid as i2
        from pairs as p1
        join pairs as p2 on p1.u1 = p2.u1 and p1.u2 = p2.u2
        where p1.pid <> p2.pid
    ) as tb
    join u1_u2_icn as icn on tb.u1 = icn.u1 and tb.u2 = icn.u2
    group by tb.i1, tb.i2
""").createOrReplaceTempView("sim")
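To sanity-check the two steps on a tiny example, the sketch below runs equivalent SQL in an in-memory SQLite database using Python's standard `sqlite3` module. The toy rows and the choice of SQLite are assumptions made for illustration; the column names `u_i` and `pid` and the view name `u1_u2_icn` follow the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tab (u_i text, pid text)")
# Hypothetical toy interactions: three users share A and B, only u3 has C.
conn.executemany("insert into tab values (?, ?)", [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "B"), ("u3", "C"),
])

# Step 1: number of shared items for every ordered user pair.
conn.execute("""
    create temp view u1_u2_icn as
    select a.u_i as u1, b.u_i as u2, count(a.pid) as u1_u2_icn
    from tab as a join tab as b on a.pid = b.pid
    where a.u_i <> b.u_i
    group by a.u_i, b.u_i
""")

# Step 2: for each item pair, sum 1 / (0.5 + overlap) over user pairs sharing both.
rows = conn.execute("""
    with pairs as (
        select a.u_i as u1, b.u_i as u2, a.pid as pid
        from tab as a join tab as b on a.pid = b.pid
        where a.u_i <> b.u_i
    )
    select p1.pid as i1, p2.pid as i2, sum(1.0 / (0.5 + icn.u1_u2_icn)) as sim
    from pairs as p1
    join pairs as p2 on p1.u1 = p2.u1 and p1.u2 = p2.u2 and p1.pid <> p2.pid
    join u1_u2_icn as icn on p1.u1 = icn.u1 and p1.u2 = icn.u2
    group by p1.pid, p2.pid
    order by i1, i2
""").fetchall()

sim = {(i1, i2): s for i1, i2, s in rows}
print(sim)
```

Item C never appears in the result because only one user interacted with it, so no user pair connects it to another item.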

Results

Swing in Python

import io

import pandas as pd
import requests

#49.234.60.213
# ClickHouse query: one row per user, with a comma-separated list of the detail-page item ids they visited
sql = "select u1,  arrayStringConcat(groupArray( pid ), ',')  pids from(SELECT distinct u_i u1 , path(url) ph, splitByChar('/', ph)[-1] pid from app.scene_tracker where   startsWith(ph, '/h5l/detail') and length(u_i)>5  and day>'2021-12-20' order by  u_i ) group by u1 having length( groupArray(pid) )>1 format CSV "
# Send the SQL in the POST body so it does not have to be URL-encoded
rep = requests.post('http://xxx:8123/', params={'user': '', 'password': ''}, data=sql.encode('utf-8'))

df = pd.read_csv(io.StringIO(rep.text), names=['uid','pids'])
df
dic_i = {}  # item id -> set of users who interacted with it
dic_u = {}  # user id -> set of item ids the user interacted with
pid_set = set()
for inx, row in df.iterrows():
    pids = set(row['pids'].replace("'", '').split(','))
    uid = row['uid']
    dic_u[uid] = pids
    for pid in pids:
        # skip empty or one-character ids
        if len(pid) < 2:
            continue
        pid_set.add(pid)
        dic_i.setdefault(pid, set()).add(uid)
dic_s = {}  # item id -> list of (other item id, swing score)
for i in dic_i:
    for j in dic_i:
        if i == j:
            continue
        # users who interacted with both i and j
        ij_user = dic_i[i] & dic_i[j]
        # sum over ordered user pairs (u, v); pairs with fewer shared items weigh more
        score = sum(1 / (0.5 + len(dic_u[u] & dic_u[v])) for u in ij_user for v in ij_user if u != v)
        dic_s.setdefault(i, []).append((j, score))

# keep the 10 most similar items for each item
sims = {k: sorted(v, key=lambda x: x[1], reverse=True)[:10] for k, v in dic_s.items()}

sims['2077110']
---
[('2065324', 22.17142857142856),
 ('2165703', 11.771428571428578),
 ('2159728', 11.10649350649351),
 ('2048707', 4.135064935064935),
 ('2159887', 3.9064935064935056),
 ('2155404', 2.4),
 ('2158141', 2.1714285714285713),
 ('2131052', 1.7350649350649348),
 ('2158117', 0.8),
 ('2168110', 0.8)]

Addendum

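An algorithm similar in spirit to Swing is LLR (log-likelihood ratio) similarity. The core idea of both is that the more often two items co-occur in users' interactions, and the less often each co-occurs with other items, the more similar the two items are considered to be. An implementation follows: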

 
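with tb as (
    -- number of users per item (assumes tab holds distinct (u_i, pid) rows)
    select pid, count(1) as cn from tab group by pid
),
total as (
    -- total number of users
    select count(distinct u_i) as N from tab
)
select i1, i2, case when r + c > m then 2 * (r + c - m) else 0 end as sim
from (
    -- r, c, m: row, column and matrix entropy terms of the 2x2 table.
    -- log1p(x) = ln(1 + x) stands in for ln(x) here, which avoids ln(0);
    -- the textbook form uses x * ln(x) with 0 * ln(0) taken as 0.
    select i1, i2,
        (k11+k12+k21+k22)*log1p(k11+k12+k21+k22) - (k11+k12)*log1p(k11+k12) - (k21+k22)*log1p(k21+k22) as r,
        (k11+k12+k21+k22)*log1p(k11+k12+k21+k22) - (k11+k21)*log1p(k11+k21) - (k12+k22)*log1p(k12+k22) as c,
        (k11+k12+k21+k22)*log1p(k11+k12+k21+k22) - k11*log1p(k11) - k12*log1p(k12) - k21*log1p(k21) - k22*log1p(k22) as m
    from (
        -- 2x2 contingency table: k11 = users with both items, k12 = i1 only, k21 = i2 only, k22 = neither
        select b.i1, b.i2, b.k11,
            tb3.cn - b.k11 as k12,
            tb4.cn - b.k11 as k21,
            total.N - tb3.cn - tb4.cn + b.k11 as k22
        from (
            select tb1.pid as i1, tb2.pid as i2, count(1) as k11
            from tab as tb1
            join tab as tb2 on tb1.u_i = tb2.u_i
            where tb1.pid <> tb2.pid
            group by tb1.pid, tb2.pid
        ) as b
        cross join total
        left join tb as tb3 on b.i1 = tb3.pid
        left join tb as tb4 on b.i2 = tb4.pid
    ) as d
) as e
order by i1, sim desc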
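For checking individual values by hand, here is a minimal Python sketch of the same row / column / matrix entropy computation, using the textbook x * ln(x) term rather than log1p. The function names `x_log_x`, `entropy` and `llr` are chosen for this example, and the counts k11, k12, k21, k22 are assumed to be the four cells of the 2x2 table (both items, first only, second only, neither).

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term: N*ln(N) minus the sum of k*ln(k) over the counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 co-occurrence table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point rounding.
    if row + col < mat:
        return 0.0
    return 2.0 * (row + col - mat)


# Two items that always appear together score high; independent items score ~0.
print(llr(10, 0, 0, 10))
print(llr(10, 10, 10, 10))
```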
