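Pitfalls of Event-Tracking Data: The Pain of Missing Parameters

Today I want to vent about one of the pains data analysts run into when they rely on event-tracking data: missing parameters.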

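Business background:

A learning app offers a series of spoken-language courses, and the main thing a user does in each study session is record audio.

We need to analyze the average completion rate of lessons, as well as the share of sessions that exit without a single recording. The lower the latter, the better.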
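To make the two metrics concrete, here is a minimal Python sketch. It assumes each study session has already been reduced to a record with a `finished` flag and a `record_num` count. Both field names are hypothetical, and building such per-session records from raw events is exactly the hard part this post is about.

```python
from dataclasses import dataclass


@dataclass
class Session:
    finished: bool   # hypothetical field: the session reached the report page
    record_num: int  # hypothetical field: recordings made during the session


def finish_rate(sessions):
    """Share of sessions that reached the report page."""
    return sum(s.finished for s in sessions) / len(sessions)


def bounce_rate(sessions):
    """Share of sessions that made no recording at all."""
    return sum(s.record_num == 0 for s in sessions) / len(sessions)


# Four made-up sessions: two finished, one without any recording.
sessions = [
    Session(finished=True, record_num=5),
    Session(finished=False, record_num=0),
    Session(finished=True, record_num=3),
    Session(finished=False, record_num=1),
]
print(finish_rate(sessions))  # 0.5
print(bounce_rate(sessions))  # 0.25
```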

Data structure:

| page_name   | user_id | action                             | data_date | time      | lesson_id |
| ----------- | ------- | ---------------------------------- | --------- | --------- | --------- |
| lesson_page | user_id | page_view (a study session starts) | Date      | timestamp | lesson_id |
| lesson_page | user_id | click_record (a recording)         | Date      | timestamp | lesson_id |
| report_page | user_id | page_view (a study session ends)   | Date      | timestamp | lesson_id |

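The missing-parameter problem:

1. A user can study the same lesson several times in one day, so user_id + lesson_id alone cannot tie together a single "session start, recording, session end" sequence.

2. The tracking events could have carried a unique per-session id in params, but that id is missing.

3. When a user re-enters a lesson they have already studied, they land on the report page (report_page) of the previous session, and the only event logged is that page's page_view.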
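A small Python sketch of problem 1, using made-up events: one user studies the same lesson twice in a day, and grouping by `(user_id, lesson_id)` lumps both attempts into one bucket. The second grouping shows how a per-session id would separate them. The `session_id` field is hypothetical; it stands for the parameter the real tracking lacks.

```python
from collections import defaultdict

# Made-up events: user "u1" studies lesson "L1" twice on the same day.
# `session_id` is hypothetical: the real events do not carry it.
events = [
    {"user_id": "u1", "lesson_id": "L1", "page_name": "lesson_page",
     "action": "page_view", "time": "09:00:00", "session_id": "s1"},
    {"user_id": "u1", "lesson_id": "L1", "page_name": "lesson_page",
     "action": "click_record", "time": "09:02:00", "session_id": "s1"},
    {"user_id": "u1", "lesson_id": "L1", "page_name": "report_page",
     "action": "page_view", "time": "09:10:00", "session_id": "s1"},
    {"user_id": "u1", "lesson_id": "L1", "page_name": "lesson_page",
     "action": "page_view", "time": "20:00:00", "session_id": "s2"},
    {"user_id": "u1", "lesson_id": "L1", "page_name": "lesson_page",
     "action": "click_record", "time": "20:03:00", "session_id": "s2"},
]


def group_events(events, key_fields):
    """Bucket events by the given fields."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in key_fields)].append(e)
    return dict(groups)


by_lesson = group_events(events, ["user_id", "lesson_id"])
by_session = group_events(events, ["user_id", "lesson_id", "session_id"])

print(len(by_lesson))   # 1: both attempts are indistinguishable
print(len(by_session))  # 2: one bucket per study session
```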

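Solution:

1. For each user_id + lesson_id + report time + data_date, find the session start time on that day.

2. For each user_id + lesson_id + start_time, compute the corresponding session end time.

3. Count the recordings that fall within each user_id + lesson_id + start_time + report_time window.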
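The three steps can be sketched in plain Python on a tiny made-up event list. This is an illustration under simplifying assumptions rather than a line-by-line port of the query: a report is attributed to the earliest same-day start of the same user and lesson within the preceding 30 minutes, and an unfinished session is closed at the same user's next start of that lesson. All names here (`Event`, `build_sessions`, `MAX_GAP`) are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    user_id: str
    lesson_id: str
    page_name: str
    action: str
    time: datetime


MAX_GAP = timedelta(seconds=1800)  # assumed 30-minute attribution window


def build_sessions(events):
    """Return one dict per lesson_page page_view, i.e. per study session."""
    def pick(page, action):
        return sorted((e for e in events if (e.page_name, e.action) == (page, action)),
                      key=lambda e: e.time)

    starts = pick("lesson_page", "page_view")
    reports = pick("report_page", "page_view")
    records = pick("lesson_page", "click_record")

    def same(a, b):
        return (a.user_id, a.lesson_id) == (b.user_id, b.lesson_id)

    # Step 1: attribute each report to the earliest same-day start before it.
    report_of = {}  # start event -> earliest attributed report time
    for f in reports:
        candidates = [s for s in starts
                      if same(s, f) and s.time.date() == f.time.date()
                      and timedelta(seconds=1) <= f.time - s.time <= MAX_GAP]
        if candidates:
            s = candidates[0]  # starts are sorted, so this is the earliest
            report_of[s] = min(report_of.get(s, f.time), f.time)

    sessions = []
    for i, s in enumerate(starts):
        # Step 2: the session ends at its report, else at the next start
        # of the same lesson, else it stays open-ended.
        report_time = report_of.get(s)
        next_start = next((n.time for n in starts[i + 1:] if same(n, s)), None)
        end = report_time or next_start or datetime.max
        # Step 3: count recordings inside [start, end].
        record_num = sum(1 for r in records if same(r, s) and s.time <= r.time <= end)
        sessions.append({"user_id": s.user_id, "lesson_id": s.lesson_id,
                         "start_time": s.time, "report_time": report_time,
                         "record_num": record_num})
    return sessions


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


events = [
    Event("u1", "L1", "lesson_page", "page_view", at(9, 0)),
    Event("u1", "L1", "lesson_page", "click_record", at(9, 2)),
    Event("u1", "L1", "lesson_page", "click_record", at(9, 5)),
    Event("u1", "L1", "report_page", "page_view", at(9, 10)),
    Event("u1", "L1", "lesson_page", "page_view", at(20, 0)),   # exits without recording
    Event("u1", "L1", "report_page", "page_view", at(21, 0)),   # re-entry shows the old report
]

for row in build_sessions(events):
    print(row["start_time"], row["report_time"], row["record_num"])
```

In this sample the 21:00 report view has no start within the preceding 30 minutes, so it is not attributed to any session, and the 20:00 session is counted as unfinished with zero recordings.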

SQL code:

```sql
-- TABLE is a placeholder for the event table described above.
SELECT lesson_id,
       count(user_id) AS num_user, -- one row per study session, so this counts sessions
       1.0 * count(report_time) / count(user_id) AS finish_rate,
       1.0 * count(if(record_num = 0, user_id)) / count(user_id) AS bounce_rate
FROM
  (SELECT t3.user_id,
          t3.lesson_id,
          t3.start_time,
          t3.st_rn,
          t3.report_time,
          t3.report_time_1,
          count(r.user_id) AS record_num
   FROM
     (SELECT t2.*,
             -- handles the start, start, report case
             coalesce(t2.report_time, s1.time, '2999-12-31 23:59:59') AS report_time_1
      FROM
        (SELECT s.user_id,
                s.lesson_id,
                s.start_time,
                s.st_rn,
                -- handles the start, report, report case
                min(t1.report_time) AS report_time
         FROM
           (SELECT user_id,
                   lesson_id,
                   time AS start_time, -- the raw column is `time`; alias it for the joins below
                   row_number() OVER (PARTITION BY user_id
                                      ORDER BY time) AS st_rn
            FROM TABLE
            WHERE page_name = 'lesson_page'
              AND action = 'page_view') s
         LEFT JOIN
           (SELECT f.user_id,
                   f.lesson_id,
                   f.time AS report_time,
                   min(s.time) AS start_time
            FROM
              (SELECT *
               FROM TABLE
               WHERE page_name = 'report_page'
                 AND action = 'page_view') f
            JOIN
              (SELECT *
               FROM TABLE
               WHERE page_name = 'lesson_page'
                 AND action = 'page_view') s ON f.user_id = s.user_id
            AND f.lesson_id = s.lesson_id
            AND f.data_date = s.data_date
            -- the report must come 1 to 1800 seconds after the start
            AND date_diff('second', s.time, f.time) BETWEEN 1 AND 1800
            GROUP BY 1,
                     2,
                     3) t1 ON s.user_id = t1.user_id
         AND s.lesson_id = t1.lesson_id
         AND s.start_time = t1.start_time
         GROUP BY 1,
                  2,
                  3,
                  4) t2
      LEFT JOIN
        (SELECT *,
                row_number() OVER (PARTITION BY user_id
                                   ORDER BY time) AS st_rn
         FROM TABLE
         WHERE page_name = 'lesson_page'
           AND action = 'page_view') s1 ON t2.user_id = s1.user_id
      AND t2.lesson_id = s1.lesson_id
      AND t2.st_rn = s1.st_rn - 1) t3
   LEFT JOIN
     (SELECT *
      FROM TABLE
      WHERE page_name = 'lesson_page'
        AND action = 'click_record') r ON t3.user_id = r.user_id
   AND t3.lesson_id = r.lesson_id
   AND r.time BETWEEN t3.start_time AND t3.report_time_1
   -- all six non-aggregated columns must be grouped
   GROUP BY 1,
            2,
            3,
            4,
            5,
            6) tt
GROUP BY 1;
```
