Data source: https://www.jianshu.com/p/4f0a10ea170e
This is a user behavior analysis that works through the questions posed in that post.
The original post uses SQL Server; here I use MySQL.
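Questions
1 - Count the number of users who placed orders in each month
2 - Compute the repurchase rate (回购率) and the repeat-purchase rate (复购率) of users in March
3 - Check whether purchase frequency differs between men and women
4 - For users with more than one purchase, compute the interval between the first and the last purchase
5 - Check whether spending differs between age groups
6 - Check the 80/20 rule: how much of total spending comes from the top 20% of users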
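1. Count the number of users who placed orders in each month
First attempt:
-- Problems with this attempt: SUBSTR(paidTime,6,2) also picks up the '/' that follows
-- a single-digit month, and COUNT(*) counts orders rather than distinct users.
SELECT Concat(SUBSTR(paidTime,1,4),'-',SUBSTR(paidTime,6,2)) AS MONTH,
COUNT(*) AS XDRS
FROM orderinfo
WHERE isPaid='已支付'
GROUP BY MONTH;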
Second attempt:
SELECT Concat(SUBSTR(paidTime,1,4),'-0',SUBSTR(paidTime,6,1)) AS MONTH,
COUNT(DISTINCT(userID)) AS XDRS
FROM orderinfo
WHERE isPaid='已支付'
GROUP BY MONTH;
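[Summary]
The source data only covers March, April and May, so I took a shortcut and hard-coded a '0' in front of the single-digit month when concatenating. This breaks as soon as the data contains October, November or December, so the code needs another revision.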
Third attempt:
SELECT Concat(SUBSTR(paidTime,1,4),'-',IF(SUBSTR(paidTime,7,1)='/',Concat('0',SUBSTR(paidTime,6,1)),SUBSTR(paidTime,6,2)))AS MONTH,
COUNT(DISTINCT(userID)) AS XDRS
FROM orderinfo
WHERE isPaid='已支付'
GROUP BY MONTH;
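[Summary]
IF works the same way as in Excel: IF(condition, value_if_true, value_if_false). If the 7th character of paidTime is '/', the month has one digit, so concatenate '0' with the 6th character; otherwise take 2 characters starting at position 6.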
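The position-based IF above can also be written without counting characters. The following is a minimal sketch, assuming paidTime is text of the form 'YYYY/M/D ...' with '/' separators (which the queries above imply); the alias order_month is made up for illustration. SUBSTRING_INDEX splits on '/', and LPAD pads the month to two digits, so one-digit and two-digit months are handled the same way.

```sql
-- Assumption: paidTime looks like '2016/3/1 ...' (year/month/day separated by '/').
SELECT CONCAT(
         SUBSTRING_INDEX(paidTime, '/', 1),                     -- year, e.g. '2016'
         '-',
         LPAD(SUBSTRING_INDEX(SUBSTRING_INDEX(paidTime, '/', 2), '/', -1), 2, '0')  -- '3' -> '03', '10' -> '10'
       ) AS order_month,                                        -- hypothetical alias
       COUNT(DISTINCT userID) AS XDRS
FROM orderinfo
WHERE isPaid = '已支付'
GROUP BY order_month;
```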
2. Compute the repurchase rate and the repeat-purchase rate of users in March
First attempt:
SELECT (COUNT(*)/(SELECT COUNT(*) FROM orderinfo WHERE SUBSTR(paidTime,6,1)='3' AND isPaid='已支付'))
AS FGL FROM
(SELECT userID
FROM orderinfo
WHERE SUBSTR(paidTime,6,1)='3'
AND isPaid='已支付'
GROUP BY userID HAVING COUNT(orderID)>1) AS a;
Second attempt, after deduplicating users:
SELECT (COUNT(*)/(SELECT COUNT(DISTINCT(userID)) FROM orderinfo WHERE SUBSTR(paidTime,6,1)='3' AND isPaid='已支付'))
AS '复购率' FROM
(SELECT DISTINCT(userID)
FROM orderinfo
WHERE SUBSTR(paidTime,6,1)='3'
AND isPaid='已支付'
GROUP BY userID HAVING COUNT(orderID)>1) AS a;
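Approach: use INNER JOIN to match the userIDs that purchased in both March and April, count these returning users, and divide by the total number of users who purchased in March.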
SELECT (COUNT(t.userID)/(SELECT COUNT(DISTINCT(userID)) FROM orderinfo WHERE SUBSTR(paidTime,6,1)='3' AND isPaid='已支付'))
AS '回购率' FROM
(SELECT DISTINCT(userID) FROM orderinfo WHERE SUBSTR(paidTime,6,1)='3' AND isPaid='已支付') AS t
INNER JOIN
(SELECT DISTINCT(userID) FROM orderinfo WHERE SUBSTR(paidTime,6,1)='4' AND isPaid='已支付') AS f
ON t.userID=f.userID;
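Both rates can also be computed in a single pass by first counting each user's paid orders per month. This is only a sketch under the same assumptions as the queries above (paidTime is 'YYYY/M/D ...' text); the names per_user, m3, m4, repeat_rate and repurchase_rate are made up for illustration. It relies on MySQL evaluating a comparison to 1 or 0, so SUM over a comparison counts the rows where it holds.

```sql
SELECT SUM(m3 > 1) / SUM(m3 > 0)            AS repeat_rate,      -- more than one paid order in March / March buyers
       SUM(m3 > 0 AND m4 > 0) / SUM(m3 > 0) AS repurchase_rate   -- bought in March and again in April / March buyers
FROM (
  SELECT userID,
         -- month taken as the text between the first and second '/'
         SUM(SUBSTRING_INDEX(SUBSTRING_INDEX(paidTime, '/', 2), '/', -1) = '3') AS m3,
         SUM(SUBSTRING_INDEX(SUBSTRING_INDEX(paidTime, '/', 2), '/', -1) = '4') AS m4
  FROM orderinfo
  WHERE isPaid = '已支付'
  GROUP BY userID
) AS per_user;
```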
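3. Check whether purchase frequency differs between men and women
Approach: join the orderinfo table with the userinfo table, divide the total number of purchases by women by the number of women who purchased to get the average number of purchases per person, and compare it with the same figure for men.
First attempt: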
SELECT ((SELECT COUNT(*) FROM t WHERE t.sex='女')/(SELECT COUNT(DISTINCT(userID)) FROM t WHERE t.sex='女'))
AS '女性购买频次',
((SELECT COUNT(*) FROM t WHERE t.sex='男')/(SELECT COUNT(DISTINCT(userID)) FROM t WHERE t.sex='男'))
AS '男性购买频次'
FROM
(SELECT o.orderID, o.userID, u.sex
FROM orderinfo AS o INNER JOIN userinfo AS u
ON o.userID=u.userID
WHERE o.isPaid='已支付')AS t;
Error:
Error Code: 1146. Table 'order.t' doesn't exist
(I have not worked out the cause yet.)
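A likely explanation: t is the alias of a derived table, and that alias is only visible to the outer query that contains it. Each scalar subquery in the SELECT list has its own FROM clause, so `FROM t` there is looked up as a real table in the current database (`order`), which does not exist. One way to keep the first attempt's shape, assuming MySQL 8.0 or later, is a common table expression, which can be referenced by name anywhere in the statement. The aliases female_frequency and male_frequency are made up for this sketch.

```sql
-- Requires MySQL 8.0+ (WITH is not available in 5.7).
WITH t AS (
  SELECT o.orderID, o.userID, u.sex
  FROM orderinfo AS o
  INNER JOIN userinfo AS u ON o.userID = u.userID
  WHERE o.isPaid = '已支付'
)
SELECT
  (SELECT COUNT(*) FROM t WHERE sex = '女')
    / (SELECT COUNT(DISTINCT userID) FROM t WHERE sex = '女') AS female_frequency,
  (SELECT COUNT(*) FROM t WHERE sex = '男')
    / (SELECT COUNT(DISTINCT userID) FROM t WHERE sex = '男') AS male_frequency;
```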
Improved second attempt:
SELECT t.sex,COUNT(t.orderID) AS'消费总次数',COUNT(DISTINCT(t.userID)) AS '消费总人数',
(COUNT(t.orderID)/COUNT(DISTINCT(t.userID))) AS '消费频次'
FROM
(SELECT o.orderID, o.userID, u.sex
FROM orderinfo AS o INNER JOIN userinfo AS u
ON o.userID=u.userID
WHERE o.isPaid='已支付' AND u.sex !='')AS t
GROUP BY t.sex;
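4. For users with more than one purchase, compute the interval between the first and the last purchase
Approach: use MAX and MIN to compute the time difference, GROUP BY userID to group, and HAVING COUNT to keep only users with more than one purchase; alternatively, use HAVING MAX(time) != MIN(time) to identify users with multiple purchases.
-- Incorrect code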
SELECT DISTINCT userID, (MAX(paidTime)-MIN(paidTime)) AS '时间差'
FROM orderinfo
WHERE isPaid='已支付'
GROUP BY userID HAVING MAX(paidTime)!=MIN(paidTime);
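[Cause of the error]
paidTime is a varchar column, so MAX minus MIN comes out as 0 (in a numeric context MySQL presumably converts each string to its leading number, the year, so both sides are equal).
I then tried STR_TO_DATE(string, format) and DATE_FORMAT(string, format) to convert it to a date/time value, but both attempts failed.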
select date_format(paidTime,'%Y-%m-%d %H:%i:%s') FROM orderinfo;
Computing the difference with TIMEDIFF does not work either, again because the column is varchar.
SELECT TIMEDIFF(MAX(paidTime),MIN(paidTime)) FROM orderinfo GROUP BY userID;
SELECT userID,COUNT(orderID),
MAX(SUBSTR(paidTime,1,(LOCATE(' ',paidTime)-1))) AS max_t,
MIN(SUBSTR(paidTime,1,(LOCATE(' ',paidTime)-1))) AS min_t,
DATEDIFF(MAX(SUBSTR(paidTime,1,(LOCATE(' ',paidTime)-1))),
MIN(SUBSTR(paidTime,1,(LOCATE(' ',paidTime)-1))))AS '时间差'
FROM orderinfo
WHERE isPaid='已支付'
AND paidTime is not null
GROUP BY userID HAVING COUNT(orderID)>1;
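Note:
After some experimenting, LOCATE can find the position of the space in paidTime and SUBSTR can then extract the date part. DATEDIFF accepts the extracted dates, but there is a problem: some of the differences are negative, meaning the "maximum" date returned by MAX is earlier than the "minimum" date returned by MIN. My guess is that MAX and MIN are still not working on dates but on text, returning the first and last values in text sort order.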
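That guess can be checked directly. The sketch below uses two made-up literals in the 'YYYY/M/D' form assumed throughout; it compares them once as text and once as real dates.

```sql
-- As text, comparison goes character by character, so '9' beats '1' and the 9th sorts after the 10th.
SELECT '2016/3/9' > '2016/3/10' AS text_compare,                 -- 1
       STR_TO_DATE('2016/3/9',  '%Y/%c/%e')
         > STR_TO_DATE('2016/3/10', '%Y/%c/%e') AS date_compare; -- 0
```

MAX and MIN on a varchar column follow the same text ordering, which is how a negative DATEDIFF can appear.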
Idea 1: apply STR_TO_DATE or DATE_FORMAT again to the text extracted by SUBSTR and see whether the date conversion works this time.
MAX(STR_TO_DATE(SUBSTR(paidTime,1,(LOCATE(' ',paidTime)-1)),'%Y-%m-%d')) AS max_t,
Output:
Even after applying STR_TO_DATE to the extracted paidTime text, the result still does not show up as a date. Failed.
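One possible reason for the failure, assuming the stored text uses '/' separators as the earlier SUBSTR logic implies: the literal characters in the STR_TO_DATE format string have to match the input, and '%Y-%m-%d' expects '-'. A small sketch with a made-up literal:

```sql
SELECT STR_TO_DATE('2016/3/1', '%Y-%m-%d') AS dash_format,   -- NULL: '-' in the format does not match '/'
       STR_TO_DATE('2016/3/1', '%Y/%c/%e') AS slash_format;  -- 2016-03-01
```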
Idea 2: first sort the orderinfo table by userID and paidTime and treat the result as a new table, then use MAX and MIN to pick the largest and smallest paidTime rows.
SELECT orderID,userID,paidTime FROM orderinfo WHERE isPaid='已支付'
AND paidTime is not null
ORDER BY userID, paidTime;
Note:
This does not work either: the table sorted by paidTime is ordered by the digits as text, not chronologically. Failed.
[Unresolved, TBC...]
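A possible way forward is to convert the text to a DATE inside MAX and MIN, so that the aggregates compare dates rather than strings. This is a sketch rather than a verified solution: it assumes paidTime is 'YYYY/M/D' followed by a space and the time, and the aliases paid_orders, first_paid, last_paid and days_between are made up.

```sql
SELECT userID,
       COUNT(orderID) AS paid_orders,
       -- SUBSTRING_INDEX(paidTime, ' ', 1) keeps the part before the first space (the date)
       MIN(STR_TO_DATE(SUBSTRING_INDEX(paidTime, ' ', 1), '%Y/%c/%e')) AS first_paid,
       MAX(STR_TO_DATE(SUBSTRING_INDEX(paidTime, ' ', 1), '%Y/%c/%e')) AS last_paid,
       DATEDIFF(
         MAX(STR_TO_DATE(SUBSTRING_INDEX(paidTime, ' ', 1), '%Y/%c/%e')),
         MIN(STR_TO_DATE(SUBSTRING_INDEX(paidTime, ' ', 1), '%Y/%c/%e'))
       ) AS days_between
FROM orderinfo
WHERE isPaid = '已支付'
  AND paidTime IS NOT NULL
GROUP BY userID
HAVING COUNT(orderID) > 1;
```

If the real data uses a different layout, only the format string '%Y/%c/%e' should need to change.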
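5. Check whether spending differs between age groups
Approach: in the userinfo table, compute age as the current year minus the birth year and label the age groups; then join orderinfo with userinfo and, for each age group, compute total spending divided by the number of users in the group.
[Step-by-step]
Step 1: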
SELECT DISTINCT (YEAR(CURDATE())-SUBSTR(birth,1,4))as age FROM userinfo
WHERE SUBSTR(birth,1,4)>1900
ORDER BY age desc ;
Both `is not null` and `!=''` failed to exclude the empty birth values, so I filter on birth year > 1900 instead. Surely nobody is over 120 years old O(∩_∩)O~
Step 2:
SELECT userId,(CASE
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =0 THEN '0-9'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =1 THEN '10-19'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =2 THEN '20-29'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =3 THEN '30-39'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =4 THEN '40-49'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =5 THEN '50-59'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =6 THEN '60-69'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =7 THEN '70-79'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =8 THEN '80-89'
ELSE '90-99'
END) AS age
FROM userinfo
WHERE SUBSTR(birth,1,4)>1900;
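The ten WHEN branches all follow one pattern, so the label can also be computed. A minimal sketch, assuming birth starts with a four-digit year as the query above does; unlike the CASE version, ages of 100 and above get their own label ('100-109') instead of falling into '90-99'.

```sql
SELECT userID,
       CONCAT(
         ((YEAR(CURDATE()) - SUBSTR(birth, 1, 4)) DIV 10) * 10,      -- lower bound, e.g. 20
         '-',
         ((YEAR(CURDATE()) - SUBSTR(birth, 1, 4)) DIV 10) * 10 + 9   -- upper bound, e.g. 29
       ) AS age
FROM userinfo
WHERE SUBSTR(birth, 1, 4) > 1900;
```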
Final step: join the userinfo and orderinfo tables, group by age, and compute the average price:
SELECT u.age,avg(o.price) FROM orderinfo AS o INNER JOIN
(SELECT userId,(CASE
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =0 THEN '0-9'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =1 THEN '10-19'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =2 THEN '20-29'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =3 THEN '30-39'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =4 THEN '40-49'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =5 THEN '50-59'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =6 THEN '60-69'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =7 THEN '70-79'
WHEN (YEAR(CURDATE())-SUBSTR(birth,1,4)) DIV 10 =8 THEN '80-89'
ELSE '90-99'
END) AS age
FROM userinfo
WHERE SUBSTR(birth,1,4)>1900) AS u
ON o.userID=u.userId
WHERE o.isPaid='已支付'
GROUP BY age
ORDER BY avg(o.price) desc;
Output:
The results differ slightly from the original post, by about ±5. Possibly a difference in data cleaning?
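One thing worth checking, offered only as a possibility: the approach described at the start of this question divides total spending by the number of users, while AVG(o.price) averages over orders. The sketch below shows both figures side by side. The aliases age_decade, spend_per_user and spend_per_order are made up, and age_decade is the plain decade index (2 means 20-29) rather than a text label.

```sql
SELECT u.age_decade,
       SUM(o.price) / COUNT(DISTINCT o.userID) AS spend_per_user,   -- total spend / number of buyers
       AVG(o.price)                            AS spend_per_order   -- what the query above computes
FROM orderinfo AS o
INNER JOIN (
  SELECT userID,
         (YEAR(CURDATE()) - SUBSTR(birth, 1, 4)) DIV 10 AS age_decade
  FROM userinfo
  WHERE SUBSTR(birth, 1, 4) > 1900
) AS u ON o.userID = u.userID
WHERE o.isPaid = '已支付'
GROUP BY u.age_decade
ORDER BY spend_per_user DESC;
```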
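6. Check the 80/20 rule: how much of total spending comes from the top 20% of users
Approach:
Sort users by their summed spending in descending order, take the top 20% of users, and divide their total spending by the total spending of all users.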
SELECT SUM(t.total) AS'前20%总消费额',
(SELECT SUM(price) FROM orderinfo WHERE isPaid='已支付') AS '全员总消费额',
SUM(t.total)/(SELECT SUM(price) FROM orderinfo WHERE isPaid='已支付')AS '前20%消费额占比'
FROM
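(SELECT userID,sum(price) AS total FROM orderinfo
WHERE isPaid='已支付'
GROUP BY userID
-- NOTE: LIMIT 0,20 keeps the top 20 users, not the top 20% of users;
-- the row count would need to be 20% of the number of paying users.
ORDER BY total desc LIMIT 0,20)AS t;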
[Open questions, TBC]
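Because LIMIT does not accept a computed expression, one way to take a true top 20% is to rank the users with window functions. This is a sketch assuming MySQL 8.0 or later; the names per_user, ranked, rn, n_users, top20_total, all_total and top20_share are made up for illustration.

```sql
-- Requires MySQL 8.0+ (CTEs and window functions).
WITH per_user AS (
  SELECT userID, SUM(price) AS total
  FROM orderinfo
  WHERE isPaid = '已支付'
  GROUP BY userID
),
ranked AS (
  SELECT total,
         ROW_NUMBER() OVER (ORDER BY total DESC) AS rn,       -- 1 = biggest spender
         COUNT(*)     OVER ()                    AS n_users   -- number of paying users
  FROM per_user
)
SELECT SUM(CASE WHEN rn <= CEIL(n_users * 0.2) THEN total ELSE 0 END)              AS top20_total,
       SUM(total)                                                                  AS all_total,
       SUM(CASE WHEN rn <= CEIL(n_users * 0.2) THEN total ELSE 0 END) / SUM(total) AS top20_share
FROM ranked;
```

CEIL rounds the cutoff up, so with very few users the "top 20%" still contains at least one user.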