一、数据来源及字段解释
本数据库来源于:https://www.mysqltutorial.org/mysql-sample-database.aspx
如上图所示这其实是mysql样本数据库,整个数据库包含了各种所需表,表结构清晰,主键外键清晰明了,本文主要讲解工作中会常用到的sql查询,至于查询结果出来后的具体分析,你可以从行业角度分析数据,来判断数据高低,也可以结合同比与环比跟自身进行比较。
二、数据清洗
- 缺失值查找
我们先来看一看orderdetails这张表里有没有缺失数据。
为了不一遍又一遍输入select count(xxx) from where xxx is null,我们直接创建一个存储过程,来完成这个查询空值个数的操作。
CREATE PROCEDURE `check_column_null_num`(IN schema_name VARCHAR(100), IN table_name2 VARCHAR(100))
BEGIN
-- 数据表 schema_name 中的列名称
DECLARE temp_column VARCHAR(100);
-- 创建结束标志变量
DECLARE done INT DEFAULT false;
-- 定义游标来操作每一个 COLUMN_NAME
DECLARE cursor_column CURSOR FOR
SELECT COLUMN_NAME FROM information_schema.COLUMNS WHERE table_schema = schema_name AND table_name = table_name2;
-- 指定游标循环结束时的返回值
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = true;
-- 打开游标
OPEN cursor_column;
read_loop:LOOP
FETCH cursor_column INTO temp_column;
-- 判断游标的循环是否结束
IF done THEN
LEAVE read_loop;
END IF;
-- 这里需要设置具体的 SQL 语句 temp_query
SET @temp_query=CONCAT('SELECT COUNT(*) as ', temp_column, '_null_num FROM ', table_name2, ' WHERE ', temp_column, ' IS NULL');
-- 执行 SQL 语句
PREPARE stmt FROM @temp_query;
EXECUTE stmt;
END LOOP;
-- 关闭游标
CLOSE cursor_column;
END
call check_column_null_num('classicmodels','orderdetails');
orderdetails这张表有5个字段,所以一共有5个结果,然后每一个结果都显示没有空值。
- 重复值删除
select count(*) from (select distinct * from orderdetails) as a ;
select count(*) from orderdetails ;
最终结果都是2996,所以没有重复行,如果有重复行,我们可以把唯一行输入到一张新表,如果可以的话把旧表删除,如下所示。
create table tmptable like orderdetails;
select distinct * into tmptable from orderdetails;
drop table orderdetails;
rename table tmptable to orderdetails;
- 异常值处理
你可以先肉眼观察一下表格,看看有没有显而易见的错误,如果没有那么就根据对业务的理解设定筛选条件,比如购买价格和购买数量不能为0。
select * from orderdetails where priceeach = 0 or quantityOrdered = 0;
三、问题提出与分析
- 2003年卖得数量最多的十款产品,及其购买人数和购买金额?
Select productCode,count(DISTINCT customerNumber),sum(quantityOrdered*priceEach),sum(quantityOrdered)
from orderdetails join orders using(orderNumber)
where year(orderdate)=2003 group by productCode order by sum(quantityOrdered) desc limit 10;
- 2004年新客老客对应人数、订单数、客单价、客件数分别是多少?
select a.顾客分类,count(DISTINCT a.customerNumber) as 人数,
count(a.orderNumber) as 订单数,
sum(a.priceEach*a.quantityOrdered)/count(DISTINCT a.customerNumber) 客单价,
sum(a.quantityOrdered)/count(DISTINCT a.customerNumber) 客件数
from
(select orderNumber, customerNumber,quantityOrdered,priceEach,case
when customernumber in (select distinct customerNumber from orders where year(orderDate)<2004 and customerNumber is not NULL order by customerNumber) then '老客'
when customernumber not in (select distinct customerNumber from orders where year(orderDate)<2004 and customerNumber is not NULL order by customerNumber) then '新客'
end 顾客分类
from orders join orderdetails using(ordernumber) where year(orderDate)=2004) a
group by 顾客分类 ;
在这里新客的定义是首次购买在这个时间段内。
- 2004年客户平均复购周期及复购率是多少?
SELECT sum(DATEDIFF(orderDate,pre_oder_date))/sum(if (row_num=2 ,1,0)) 平均复购周期,
sum(if (row_num=2 ,1,0))/sum(if (pre_oder_date is null ,1,0)) 复购率
FROM (SELECT customerNumber,orderDate,ROW_NUMBER() over(partition by customerNumber) row_num ,lag(orderDate,1) over(partition by customerNumber) pre_oder_date
FROM orders WHERE year(orderDate)=2004) as a ;
- 2004年首购用户在首购一个月内复购人数是多少?
SELECT sum(if(datediff(orderDate,pre_orderdate)<=30,1,0)) 人数 FROM
(select customerNumber,orderDate, lag(orderDate,1) over(partition by customerNumber order by orderDate ) pre_orderdate,row_number() over (partition by customerNumber order by orderDate) row_num
from orders
where customerNumber not in (select DISTINCT customerNumber from orders where year(orderDate)<2004) and year(orderDate)=2004
order by customerNumber) as a
where row_num=2;
- 2004年至少连续三个月进行购买的买家有谁?
select customerNumber,tep_column,count(tep_column) 连续次数 FROM
(SELECT *,month(orderDate)-rnk as tep_column FROM
(SELECT *,row_number() over(PARTITION by customerNumber) as rnk FROM orders where year(orderDate)=2004) t1) t2
group by customerNumber,tep_column HAVING count(tep_column)>=3;
这里
- 根据RFM模型对用户进行分类
在这里我们假定现在是2004年12月,最近一次消费就用12减去最近一次消费月份,频率和金额就是04年所消费的次数和金额。第一步我们先把R/F/M分别算出来。
SELECT a.customerNumber,Recent,Money,Frequency
FROM
(SELECT customerNumber, 12-month(recent_orderdate) Recent FROM
(SELECT customerNumber,first_value(orderDate) over(partition by customerNumber order by orderDate desc ) recent_orderdate ,row_number() over(partition by customerNumber) rnk FROM orders WHERE year(orderDate)=2004 ) as t1
WHERE rnk=1) as a join
(SELECT customerNumber,sum(quantityOrdered*priceEach) as Money,count(customerNumber) as Frequency FROM orders join orderdetails USING(orderNumber)
GROUP BY customerNumber) as b on a.customerNumber=b.customerNumber;
整个计算方法分为3步,第一步算出对应的Renct:
第二步算出对应的Money和Frequency:
第三步就是根据"customerNumber"字段进行内连接,组成一张表。
获取RFM表后,我们就需要根据业务理解来进行分类,比如我们的商品是消耗品平均要3个月才会用完,那么我们就根据统计学的经验法则把Recnt 3±3std的用户描述成目标客户,向这些客户发送推送或短信加上优惠券,让他们进行复购。
在这里我们进行个简单的分类,将Recent在2-4之间的为目标用户,0-2的为已购买用户,5-12为流失用户,金额在10万及以上的为大额用户,在10万以下的为小额用户,而对于频率我们不做分类。
SELECT *,CASE
WHEN 2>=Recent and Money<=100000 THEN '小额已购买用户'
When 2>=Recent and Money>100000 THEN '大额已购买用户'
When 2100000 THEN '大额目标用户'
When 4100000 THEN '大额目标用户'
END category
FROM
(SELECT a.customerNumber,Recent,Money,Frequency
FROM
(SELECT customerNumber, 12-month(recent_orderdate) Recent FROM
(SELECT customerNumber,first_value(orderDate) over(partition by customerNumber order by orderDate desc ) recent_orderdate ,row_number() over(partition by customerNumber) rnk FROM orders WHERE year(orderDate)=2004 ) as t1
WHERE rnk=1) as a join
(SELECT customerNumber,sum(quantityOrdered*priceEach) as Money,count(customerNumber) as Frequency FROM orders join orderdetails USING(orderNumber)
GROUP BY customerNumber) as b on a.customerNumber=b.customerNumber) t2;
除了根据RFM进行分类外,你还可以建立一个评分机制,进行加权求和,在这里就不多赘述。
以上就是用SQL做数据分析时常见的业务需求与操作,数据分析简而言之就是将上万条甚至几千万条数据提炼成复购率等信息,再根据信息进行决策判断,最重要的是要结合业务。