Sources and destinations of per-minute viewer inflow and outflow during a program slot (practicing FULL OUTER JOIN and COALESCE())

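First remove the video-on-demand records from the raw data, then drop the records that have no program information. START_TIME_C replaces the original START_TIME in the time filter, to reduce the surge of records produced by program changes. Sorting still uses START_TIME, though, because after grouping by newkey each newkey corresponds to one TV viewing trajectory.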

[Figure 1]

The point of the sort is to put each person's viewing records in time order.
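As a rough illustration of this ordering step, the sketch below numbers each viewer's records with ROW_NUMBER(), using the same newkey and ranks names as the full query in this post. The rows in the VALUES list, including the channel names CCTV-1 and CCTV-5, are made-up sample data, and PostgreSQL syntax is assumed.

```sql
-- Hypothetical sample rows; only the column names come from the post.
WITH osum ("CUST_ID", "PATCH_CODE", "CHANNEL_NAME", "START_TIME") AS (
    VALUES ('1001', 'A', 'CCTV-1', TIMESTAMP '2018-03-05 17:40:00'),
           ('1001', 'A', '大连-1', TIMESTAMP '2018-03-05 18:00:30'),
           ('1001', 'A', 'CCTV-5', TIMESTAMP '2018-03-05 18:20:00'),
           ('1002', 'B', '大连-1', TIMESTAMP '2018-03-05 17:55:00')
)
SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
       "CHANNEL_NAME",
       "START_TIME",
       -- ranks restarts at 1 for every newkey and follows START_TIME
       ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                          ORDER BY "START_TIME") AS ranks
FROM osum
ORDER BY newkey, ranks;
```

With these rows, viewer 1001-A gets ranks 1 to 3 in start-time order and viewer 1002-B gets a single record with ranks 1, so "the previous record" of any row is simply the row with ranks - 1 under the same newkey.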



However, the following situation showed up, and I do not know why:




A viewer who never changed channels still ended up with two records.

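In that case no channel-switching path is produced, so in the final aggregation we can exclude these records where 新闻锋线 appears to jump to itself.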
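One way to inspect such in-place jumps before excluding them is to compare each record's channel with the previous one. The sketch below uses LAG() for that, which is a different technique from the query in this post (there the jumps are dropped by filtering on the channel name in HAVING). The table a and its rows are hypothetical sample data, and 'other' stands in for any other program.

```sql
-- Hypothetical, already-ranked sample rows.
WITH a (newkey, "CHANNEL_NAME", "PROGRAM_NAME", ranks) AS (
    VALUES ('1001-A', '大连-1', '新闻锋线', 1),
           ('1001-A', '大连-1', '新闻锋线', 2),  -- same channel: no real switch
           ('1002-B', 'CCTV-1', 'other',    1),
           ('1002-B', '大连-1', '新闻锋线', 2)   -- genuine switch into the program
)
SELECT newkey,
       ranks,
       prev_channel,
       "CHANNEL_NAME",
       (prev_channel = "CHANNEL_NAME") AS in_place  -- true when the channel did not change
FROM (SELECT a.*,
             LAG("CHANNEL_NAME") OVER (PARTITION BY newkey ORDER BY ranks) AS prev_channel
      FROM a) t
WHERE prev_channel IS NOT NULL   -- the first record of a viewer has no predecessor
ORDER BY newkey, ranks;
```

Note that a plain equality test would not treat a move between 大连-1 and 大连-1HD as in-place, whereas the query in this post excludes both names.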

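Channels whose inflow or outflow is below 3 are excluded, and the two results are combined with FULL OUTER JOIN and COALESCE(). I learned these the day before yesterday, so this is a chance to practice them.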

SELECT COALESCE(a."CHANNEL_NAME", b."CHANNEL_NAME") AS CHANNEL_NAME,
       COALESCE(a.comein, 0) AS comein,
       COALESCE(b.goaway, 0) AS goaway
-- Inflow: the channel each viewer watched just before 新闻锋线
FROM (WITH a AS (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
                        "CHANNEL_NAME", "PROGRAM_NAME", "START_TIME", "END_TIME",
                        "DURATION", "START_TIME_C", "END_TIME_C",
                        ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                                           ORDER BY "START_TIME") AS ranks
                 FROM (SELECT * FROM "3月5日18-1930"
                       UNION ALL
                       SELECT * FROM "3月5日12-17") osum
                 -- keep only viewers with a 新闻锋线 record starting in this minute
                 WHERE CONCAT("CUST_ID", '-', "PATCH_CODE") IN
                       (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE")
                        FROM "3月5日18-1930"
                        WHERE "PROGRAM_NAME" = '新闻锋线'
                          AND "START_TIME_C" BETWEEN '2018/3/5 18:00:00' AND '2018/3/5 18:01:00'))
      SELECT "CHANNEL_NAME", COUNT(*) AS comein
      FROM (SELECT newkeys, "CHANNEL_NAME", "PROGRAM_NAME"
            FROM (SELECT newkey AS newkeys, MIN(ranks - 1) AS lagranknum
                  FROM a
                  WHERE "PROGRAM_NAME" = '新闻锋线' AND ranks - 1 <> 0
                  GROUP BY newkey
                  ORDER BY newkey) AS b
            INNER JOIN a
            ON b.newkeys = a.newkey
            WHERE lagranknum = ranks) nb
      GROUP BY "CHANNEL_NAME"
      HAVING "CHANNEL_NAME" NOT IN ('大连-1', '大连-1HD')  -- drop in-place jumps
         AND COUNT(*) >= 3                                -- drop flows below 3
      ORDER BY COUNT(*) DESC) a
FULL OUTER JOIN
-- Outflow: the channel each viewer watched just after 新闻锋线
     (WITH a AS (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE") AS newkey,
                        "CHANNEL_NAME", "PROGRAM_NAME", "START_TIME", "END_TIME",
                        "DURATION", "START_TIME_C", "END_TIME_C",
                        ROW_NUMBER() OVER (PARTITION BY CONCAT("CUST_ID", '-', "PATCH_CODE")
                                           ORDER BY "START_TIME") AS ranks
                 FROM (SELECT * FROM "3月5日18-1930"
                       UNION ALL
                       SELECT * FROM "3月5日12-17") osum
                 -- keep only viewers with a 新闻锋线 record ending in this minute
                 WHERE CONCAT("CUST_ID", '-', "PATCH_CODE") IN
                       (SELECT CONCAT("CUST_ID", '-', "PATCH_CODE")
                        FROM (SELECT * FROM "3月5日18-1930"
                              UNION ALL
                              SELECT * FROM "3月5日12-17") ojbk
                        WHERE "PROGRAM_NAME" = '新闻锋线'
                          AND "END_TIME_C" BETWEEN '2018/3/5 18:00:00' AND '2018/3/5 18:01:00'))
      SELECT "CHANNEL_NAME", COUNT(*) AS goaway
      FROM (SELECT newkeys, "CHANNEL_NAME", "PROGRAM_NAME"
            FROM (SELECT newkey AS newkeys, MAX(ranks + 1) AS leadranknum
                  FROM a
                  WHERE "PROGRAM_NAME" = '新闻锋线'
                  GROUP BY newkey
                  ORDER BY newkey) AS b
            INNER JOIN a
            ON b.newkeys = a.newkey
            WHERE leadranknum = ranks) nb
      GROUP BY "CHANNEL_NAME"
      HAVING "CHANNEL_NAME" NOT IN ('大连-1', '大连-1HD')  -- drop in-place jumps
         AND COUNT(*) >= 3                                -- drop flows below 3
      ORDER BY COUNT(*) DESC) b
ON a."CHANNEL_NAME" = b."CHANNEL_NAME"

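Approach: for viewers flowing into the program, I care about the channel they were watching just before. A very common complication is that watching TV is a subjective, casual activity, and viewers flip back and forth between channels. So within the fixed time window I only count the record immediately before each viewer's first record of watching 新闻锋线 on 大连-1. That is why I use

MIN(ranks - 1) AS lagranknum
WHERE "PROGRAM_NAME" = '新闻锋线' AND ranks - 1 <> 0

to mark the ranks of the previous record, while the WHERE clause pins the program and excludes users who were watching 新闻锋线 from the moment they switched on (ranks = 1, so there is no previous record). The resulting table of newkey and lagranknum is then inner-joined with the original table, and a WHERE on the joined rows keeps only those whose ranks equals lagranknum. The table returned at this point has the columns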

newkeys || "CHANNEL_NAME" || "PROGRAM_NAME"
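To make the steps above concrete, here is the inflow core of the query run against a tiny hand-made table a. The rows are hypothetical sample data (with 'other' standing in for any other program); the query body itself mirrors the one in this post.

```sql
-- Hypothetical, already-ranked viewing records.
WITH a (newkey, "CHANNEL_NAME", "PROGRAM_NAME", ranks) AS (
    VALUES ('1001-A', 'CCTV-1', 'other',    1),
           ('1001-A', '大连-1', '新闻锋线', 2),
           ('1001-A', 'CCTV-5', 'other',    3),
           ('1001-A', '大连-1', '新闻锋线', 4),  -- came back later; not the first visit
           ('1002-B', '大连-1', '新闻锋线', 1),  -- tuned in directly: no previous record
           ('1002-B', 'CCTV-5', 'other',    2)
)
SELECT b.newkeys, a."CHANNEL_NAME", a."PROGRAM_NAME"
FROM (SELECT newkey AS newkeys, MIN(ranks - 1) AS lagranknum
      FROM a
      WHERE "PROGRAM_NAME" = '新闻锋线' AND ranks - 1 <> 0
      GROUP BY newkey) AS b
INNER JOIN a ON b.newkeys = a.newkey
WHERE b.lagranknum = a.ranks;
```

For 1001-A the 新闻锋线 records have ranks 2 and 4, so lagranknum is 1 and the join returns the CCTV-1 row. Viewer 1002-B is dropped by ranks - 1 <> 0 and contributes no row. Swapping in MAX(ranks + 1) without the ranks - 1 <> 0 condition gives the outflow counterpart.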

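One more outer query then groups by channel name and counts the rows. Outflow works the same way, using MAX(ranks + 1). To get inflow and outflow in a single query and simplify the steps, the two queries are merged. This is where FULL OUTER JOIN and COALESCE() come in: FULL OUTER JOIN joins the two queries on channel name, as shown in the figure.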

[Figure 2]

COALESCE() returns the first of its arguments that is not NULL.
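A minimal sketch of how the two pieces work together, assuming PostgreSQL: the inflow and outflow tables and their counts below are invented stand-ins for the two aggregated subqueries.

```sql
-- Hypothetical per-channel counts standing in for the two subqueries.
WITH inflow ("CHANNEL_NAME", comein) AS (
    VALUES ('CCTV-1', 12), ('CCTV-5', 4)
),
outflow ("CHANNEL_NAME", goaway) AS (
    VALUES ('CCTV-5', 7), ('CCTV-13', 3)
)
SELECT COALESCE(i."CHANNEL_NAME", o."CHANNEL_NAME") AS channel_name,  -- whichever side has the name
       COALESCE(i.comein, 0) AS comein,   -- 0 when the channel only appears in outflow
       COALESCE(o.goaway, 0) AS goaway    -- 0 when the channel only appears in inflow
FROM inflow i
FULL OUTER JOIN outflow o ON i."CHANNEL_NAME" = o."CHANNEL_NAME";
```

The join keeps all three channels: CCTV-5 matches on both sides (4 and 7), CCTV-1 exists only in inflow (12 and 0), and CCTV-13 exists only in outflow (0 and 3). Without COALESCE() the unmatched side would show NULL for both the channel name and the count.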


I think this use of it is fine. The result is as follows:

[Figure 3]

comein is the number of viewers who switched from this channel to 新闻锋线, and goaway is the number of viewers who switched from 新闻锋线 to this channel.

