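First remove the video-on-demand records from the raw data, then remove the records that have no program information. Replace the original START_TIME with START_TIME_C to reduce the surge in records caused by program changes. The ordering still uses START_TIME, though, because after grouping by newkey each newkey corresponds to one TV viewing trajectory,
and the goal is to sort each viewer's trajectory by time.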
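As a minimal sketch of that ordering step, the query below numbers each viewer's records by START_TIME within its newkey. The sample rows, the channel name CCTV-1 and the CTE name sample are made up for illustration; only the column names follow the tables used later in this note.

```sql
-- Hypothetical sample rows; only the column names come from the real tables.
WITH sample ("CUST_ID", "PATCH_CODE", "CHANNEL_NAME", "START_TIME") AS (
    VALUES
        ('1001', 'A', '大连-1', TIMESTAMP '2018-03-05 18:00:30'),
        ('1001', 'A', 'CCTV-1', TIMESTAMP '2018-03-05 17:40:00'),
        ('1002', 'B', '大连-1', TIMESTAMP '2018-03-05 18:00:10')
)
SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
       "CHANNEL_NAME",
       "START_TIME",
       -- ranks restarts at 1 for every newkey and follows viewing order
       ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                          ORDER BY "START_TIME") AS ranks
FROM sample
ORDER BY newkey, ranks;
```

For viewer 1001-A the CCTV-1 row gets ranks 1 and the 大连-1 row gets ranks 2, while viewer 1002-B has a single row with ranks 1.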
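However, the following case showed up (I do not know why):
a viewer never changed channels, yet two records were produced.
In that case no channel-switching path exists, so in the final summary we can exclude these records where 新闻锋线 jumps to itself in place.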
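To see what such a pair of records looks like, here is a small sketch that flags a record whose channel equals the channel of the viewer's previous record. It is not the query used in this note: it relies on LAG(), a window function the main query does not use, and the sample rows, the channel CCTV-1 and the program name SomeProgram are all hypothetical.

```sql
-- Hypothetical sample rows: viewer 1001-A has two consecutive records on the same channel.
WITH sample (newkey, "CHANNEL_NAME", "PROGRAM_NAME", "START_TIME") AS (
    VALUES
        ('1001-A', '大连-1', '新闻锋线',    TIMESTAMP '2018-03-05 18:00:05'),
        ('1001-A', '大连-1', '新闻锋线',    TIMESTAMP '2018-03-05 18:00:40'),
        ('1002-B', 'CCTV-1', 'SomeProgram', TIMESTAMP '2018-03-05 17:50:00'),
        ('1002-B', '大连-1', '新闻锋线',    TIMESTAMP '2018-03-05 18:00:20')
)
SELECT newkey,
       "CHANNEL_NAME",
       "START_TIME",
       -- channel of the same viewer's previous record (NULL for the first record)
       LAG("CHANNEL_NAME") OVER w AS prev_channel,
       -- TRUE when the viewer "switched" to the channel they were already on
       COALESCE("CHANNEL_NAME" = LAG("CHANNEL_NAME") OVER w, FALSE) AS in_place
FROM sample
WINDOW w AS (PARTITION BY newkey ORDER BY "START_TIME")
ORDER BY newkey, "START_TIME";
```

Only the second record of viewer 1001-A is flagged as in_place. The main query in this note does not flag rows this way; it appears to deal with the same situation by filtering 大连-1 and 大连-1HD out of the summary with NOT IN.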
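Exclude channels whose inflow or outflow is less than 3, using FULL OUTER JOIN and COALESCE(). I learned these the day before yesterday, so this is a chance to practice.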
SELECT COALESCE(a."CHANNEL_NAME", b."CHANNEL_NAME") AS CHANNEL_NAME,
       COALESCE(a.comein, 0) AS comein,
       COALESCE(b.goaway, 0) AS goaway
-- Inflow: the channel watched just before a viewer's first 新闻锋线 record
FROM (WITH a AS (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
                        "CHANNEL_NAME", "PROGRAM_NAME", "START_TIME", "END_TIME",
                        "DURATION", "START_TIME_C", "END_TIME_C",
                        -- position of each record in the viewer's trajectory
                        ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                                           ORDER BY "START_TIME") AS ranks
                 FROM (SELECT * FROM "3月5日18-1930"
                       UNION ALL
                       SELECT * FROM "3月5日12-17") osum
                 -- keep only viewers with a 新闻锋线 record starting between 18:00 and 18:01
                 WHERE CONCAT("CUST_ID", '-', "PATCH_CODE") IN
                       (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE")
                        FROM "3月5日18-1930"
                        WHERE "PROGRAM_NAME" = '新闻锋线'
                          AND "START_TIME_C" BETWEEN '2018/3/5 18:00:00' AND '2018/3/5 18:01:00'))
      SELECT "CHANNEL_NAME", COUNT(*) AS comein
      FROM (SELECT newkeys, "CHANNEL_NAME", "PROGRAM_NAME"
            -- ranks of the record before the first 新闻锋线 record;
            -- ranks - 1 <> 0 drops viewers whose first record is already 新闻锋线
            FROM (SELECT newkey AS newkeys, MIN(ranks - 1) AS lagranknum
                  FROM a
                  WHERE "PROGRAM_NAME" = '新闻锋线' AND ranks - 1 <> 0
                  GROUP BY newkey) AS b
            INNER JOIN a
                ON b.newkeys = a.newkey
            WHERE lagranknum = ranks) nb
      GROUP BY "CHANNEL_NAME"
      -- drop in-place jumps on the program's own channel and flows smaller than 3
      HAVING "CHANNEL_NAME" NOT IN ('大连-1', '大连-1HD')
         AND COUNT(*) >= 3
      ORDER BY COUNT(*) DESC) a
FULL OUTER JOIN
-- Outflow: the channel watched just after a viewer's last 新闻锋线 record
     (WITH a AS (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
                        "CHANNEL_NAME", "PROGRAM_NAME", "START_TIME", "END_TIME",
                        "DURATION", "START_TIME_C", "END_TIME_C",
                        ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                                           ORDER BY "START_TIME") AS ranks
                 FROM (SELECT * FROM "3月5日18-1930"
                       UNION ALL
                       SELECT * FROM "3月5日12-17") osum
                 -- keep only viewers with a 新闻锋线 record ending between 18:00 and 18:01
                 WHERE CONCAT("CUST_ID", '-', "PATCH_CODE") IN
                       (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE")
                        FROM (SELECT * FROM "3月5日18-1930"
                              UNION ALL
                              SELECT * FROM "3月5日12-17") ojbk
                        WHERE "PROGRAM_NAME" = '新闻锋线'
                          AND "END_TIME_C" BETWEEN '2018/3/5 18:00:00' AND '2018/3/5 18:01:00'))
      SELECT "CHANNEL_NAME", COUNT(*) AS goaway
      FROM (SELECT newkeys, "CHANNEL_NAME", "PROGRAM_NAME"
            -- ranks of the record after the last 新闻锋线 record
            FROM (SELECT newkey AS newkeys, MAX(ranks + 1) AS leadranknum
                  FROM a
                  WHERE "PROGRAM_NAME" = '新闻锋线'
                  GROUP BY newkey) AS b
            INNER JOIN a
                ON b.newkeys = a.newkey
            WHERE leadranknum = ranks) nb
      GROUP BY "CHANNEL_NAME"
      HAVING "CHANNEL_NAME" NOT IN ('大连-1', '大连-1HD')
         AND COUNT(*) >= 3
      ORDER BY COUNT(*) DESC) b
ON a."CHANNEL_NAME" = b."CHANNEL_NAME"
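Idea: for viewers flowing into the program, I look at the channel they watched just before it. A very common situation here is that watching TV is a subjective, casual activity and viewers flip back and forth between channels, so within the fixed time window I only count the record immediately preceding a viewer's first record of 新闻锋线 on 大连-1. That is why I use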
MIN(ranks - 1) as lagranknum WHERE "PROGRAM_NAME"='新闻锋线' AND ranks-1 <> 0
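to mark the ranks of the previous record, while the WHERE clause fixes the program and excludes users who were already watching 新闻锋线 when they turned the TV on (for them ranks - 1 is 0, so there is no previous record). The original table a is then inner-joined with the table of newkeys and lagranknum to form a new table, and the outer query uses WHERE to keep the rows whose ranks equals lagranknum. The resulting table has the columns
newkeys | "CHANNEL_NAME" | "PROGRAM_NAME"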
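To make the join concrete, here is a small sketch of the same MIN(ranks - 1) lookup on a hand-written table a. The rows, the channels CCTV-1 and CCTV-5 and the program name SomeProgram are hypothetical; the ranks column is typed in directly instead of being computed with ROW_NUMBER().

```sql
-- Hypothetical trajectories: 1001-A watches 新闻锋线 twice, 1002-B starts on it.
WITH a (newkey, "CHANNEL_NAME", "PROGRAM_NAME", ranks) AS (
    VALUES
        ('1001-A', 'CCTV-1', 'SomeProgram', 1),
        ('1001-A', '大连-1', '新闻锋线',    2),
        ('1001-A', 'CCTV-5', 'SomeProgram', 3),
        ('1001-A', '大连-1', '新闻锋线',    4),
        ('1002-B', '大连-1', '新闻锋线',    1)
),
b AS (
    -- ranks of the record just before each viewer's first 新闻锋线 record
    SELECT newkey AS newkeys, MIN(ranks - 1) AS lagranknum
    FROM a
    WHERE "PROGRAM_NAME" = '新闻锋线' AND ranks - 1 <> 0
    GROUP BY newkey
)
SELECT b.newkeys, a."CHANNEL_NAME", a."PROGRAM_NAME"
FROM b
INNER JOIN a
    ON b.newkeys = a.newkey
WHERE b.lagranknum = a.ranks;
```

Viewer 1001-A has 新闻锋线 at ranks 2 and 4, so lagranknum is 1 and the query returns the CCTV-1 row. Viewer 1002-B drops out because the only 新闻锋线 record has ranks 1 and therefore no previous record.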
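Another outer query then groups by channel name and counts the rows. Outflow works the same way, using MAX(ranks + 1). To get inflow and outflow in a single query and simplify the steps, I merged the two queries. This is where FULL OUTER JOIN and COALESCE() come in: FULL OUTER JOIN joins the two queries on channel name, as shown in the figure.
COALESCE() returns its first argument that is not NULL.
I think its use here is fine; the result is as follows.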
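A minimal sketch of how the two pieces fit together, using two tiny hand-written result sets in place of the real inflow and outflow subqueries. The CTE names inflow and outflow, the channel names and all counts are made up for illustration.

```sql
-- Hypothetical per-channel counts standing in for the two subqueries.
WITH inflow ("CHANNEL_NAME", comein) AS (
    VALUES ('CCTV-1', 12), ('CCTV-5', 4)
),
outflow ("CHANNEL_NAME", goaway) AS (
    VALUES ('CCTV-1', 7), ('CCTV-13', 5)
)
SELECT COALESCE(i."CHANNEL_NAME", o."CHANNEL_NAME") AS "CHANNEL_NAME",
       COALESCE(i.comein, 0) AS comein,   -- channel missing from inflow -> 0
       COALESCE(o.goaway, 0) AS goaway    -- channel missing from outflow -> 0
FROM inflow i
FULL OUTER JOIN outflow o
    ON i."CHANNEL_NAME" = o."CHANNEL_NAME"
ORDER BY 1;
```

CCTV-1 appears on both sides and keeps both counts (12 and 7). CCTV-5 exists only in inflow, so its goaway becomes 0, and CCTV-13 exists only in outflow, so its comein becomes 0. Without COALESCE() those cells, and the channel name taken from the missing side, would be NULL.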
comein is the number of viewers who switched from this channel to 新闻锋线, and goaway is the number of viewers who switched from 新闻锋线 to this channel.