Hivesql/Python求相同老师学生连续上多节课的数据

表结构:

表结构

方式一:

HIVESQL:

1.首先按照学生和上课时间排序(升序),并新增一列row值当作参考列

SELECT uid,`datetime`,teacher,id,ROW_NUMBER( ) OVER(order by `uid`,`datetime`) as rank from class_table结果如下:

2.利用lag(a,b,c)开窗函数,共三个参数,a为选中的滞后列,b为滞后的行数,c为空时默认值(可以不指定)

暂定连续上4节课

SELECT uid, `datetime`, teacher, `rank`

, lag(`rank`, 3) OVER (PARTITION BY uid, teacher ORDER BY `datetime`) AS `rank2`

FROM (

SELECT uid, `datetime`, teacher, ROW_NUMBER() OVER (ORDER BY `uid`, `datetime`) AS rank

FROM class_table

) tmp

结果如下:

3.通过判断列rank和rank2的关系即可找出相同老师学生连续上多节课的数据

SELECT *

FROM (

SELECT uid, `datetime`, teacher, `rank`

, lag(`rank`, 3) OVER (PARTITION BY uid, teacher ORDER BY `datetime`) AS `rank2`

FROM (

SELECT uid, `datetime`, teacher, ROW_NUMBER() OVER (ORDER BY `uid`, `datetime`) AS rank

FROM class_table

) tmp

) tmp2

WHERE `rank` = `rank2` + 3

调节滞后量即可查询其他连续数量


方式二:

原理都是一样的,pandas需要额外加一列自增列

data.sort_values(by=['uid','datetime'],inplace=True)

data['row'] = range(len(data))

data['rank'] = data.sort_values('datetime').groupby(['uid','teacher'])['row'].shift(3)

data[data['row']==data['rank']+3]

希望能帮助到大家



以上

你可能感兴趣的:(Hivesql/Python求相同老师学生连续上多节课的数据)