如何用SQL做数据分析

一、数据来源及字段解释

本数据库来源于:https://www.mysqltutorial.org/mysql-sample-database.aspx

image.png

如上图所示这其实是mysql样本数据库,整个数据库包含了各种所需表,表结构清晰,主键外键清晰明了,本文主要讲解工作中会常用到的sql查询,至于查询结果出来后的具体分析,你可以从行业角度分析数据,来判断数据高低,也可以结合同比与环比跟自身进行比较。

二、数据清洗

  1. 缺失值查找
    我们先来看一看orderdetails这张表里有没有缺失数据。
    为了不一遍又一遍输入select count(xxx) from where xxx is null,我们直接创建一个存储过程,来完成这个查询空值个数的操作。
CREATE PROCEDURE `check_column_null_num`(IN schema_name VARCHAR(100), IN table_name2 VARCHAR(100))
BEGIN
-- 数据表 schema_name 中的列名称
DECLARE temp_column VARCHAR(100); 
-- 创建结束标志变量  
DECLARE done INT DEFAULT false;
-- 定义游标来操作每一个 COLUMN_NAME
DECLARE cursor_column CURSOR FOR
SELECT COLUMN_NAME FROM information_schema.COLUMNS WHERE table_schema = schema_name AND table_name = table_name2;
-- 指定游标循环结束时的返回值  
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = true;  
-- 打开游标
OPEN cursor_column;
read_loop:LOOP
           FETCH cursor_column INTO temp_column;
           -- 判断游标的循环是否结束 
           IF done THEN 
                    LEAVE read_loop;
           END IF;
           -- 这里需要设置具体的 SQL 语句 temp_query
           SET @temp_query=CONCAT('SELECT COUNT(*) as ', temp_column, '_null_num FROM ', table_name2, ' WHERE ', temp_column, ' IS NULL');
           -- 执行 SQL 语句
           PREPARE stmt FROM @temp_query;           
           EXECUTE stmt;
END LOOP;
-- 关闭游标
CLOSE cursor_column;
END

call check_column_null_num('classicmodels','orderdetails');
image.png

orderdetails这张表有5个字段,所以一共有5个结果,然后每一个结果都显示没有空值。

  1. 重复值删除
select count(*)  from (select  distinct * from orderdetails) as a ;
select count(*)  from orderdetails ;
image.png

最终结果都是2996,所以没有重复行,如果有重复行,我们可以把唯一行输入到一张新表,如果可以的话把旧表删除,如下所示。

create table tmptable like orderdetails;
select distinct * into  tmptable  from orderdetails;
drop table orderdetails;
rename table tmptable to orderdetails;
  1. 异常值处理

你可以先肉眼观察一下表格,看看有没有显而易见的错误,如果没有那么就根据对业务的理解设定筛选条件,比如购买价格和购买数量不能为0。

select * from orderdetails where priceeach = 0 or quantityOrdered = 0;
image.png

三、问题提出与分析

  1. 2003年卖得数量最多的十款产品,及其购买人数和购买金额?
Select productCode,count(DISTINCT customerNumber),sum(quantityOrdered*priceEach),sum(quantityOrdered) 
from orderdetails join orders using(orderNumber) 
where year(orderdate)=2003 group by productCode order by sum(quantityOrdered) desc limit 10;
image.png
  1. 2004年新客老客对应人数、订单数、客单价、客件数分别是多少?
select a.顾客分类,count(DISTINCT a.customerNumber) as  人数,
count(a.orderNumber) as 订单数,
sum(a.priceEach*a.quantityOrdered)/count(DISTINCT a.customerNumber) 客单价,
sum(a.quantityOrdered)/count(DISTINCT a.customerNumber) 客件数   
from 
(select orderNumber, customerNumber,quantityOrdered,priceEach,case 
when customernumber in  (select distinct  customerNumber from orders  where  year(orderDate)<2004 and customerNumber is not NULL order by customerNumber) then '老客' 
when customernumber not  in  (select distinct  customerNumber from orders  where  year(orderDate)<2004 and customerNumber is not NULL order by customerNumber) then '新客' 
end 顾客分类  
from orders join orderdetails using(ordernumber) where year(orderDate)=2004) a   
group by 顾客分类 ;
image.png

在这里新客的定义是首次购买在这个时间段内。

  1. 2004年客户平均复购周期及复购率是多少?
SELECT sum(DATEDIFF(orderDate,pre_oder_date))/sum(if (row_num=2 ,1,0)) 平均复购周期,
sum(if (row_num=2 ,1,0))/sum(if (pre_oder_date is null ,1,0)) 复购率
FROM  (SELECT   customerNumber,orderDate,ROW_NUMBER() over(partition by customerNumber) row_num ,lag(orderDate,1) over(partition by customerNumber) pre_oder_date 
FROM orders WHERE year(orderDate)=2004) as a ;
image.png
  1. 2004年首购用户在首购一个月内复购人数是多少?
SELECT sum(if(datediff(orderDate,pre_orderdate)<=30,1,0)) 人数 FROM
(select customerNumber,orderDate, lag(orderDate,1) over(partition by customerNumber order by orderDate ) pre_orderdate,row_number() over (partition by customerNumber order by orderDate) row_num 
from orders 
where customerNumber  not in    (select DISTINCT customerNumber from orders where year(orderDate)<2004) and  year(orderDate)=2004  
order by customerNumber) as a 
where row_num=2;

image.png
  1. 2004年至少连续三个月进行购买的买家有谁?
select customerNumber,tep_column,count(tep_column) 连续次数    FROM
(SELECT *,month(orderDate)-rnk as  tep_column FROM 
(SELECT *,row_number() over(PARTITION by customerNumber) as rnk FROM orders where year(orderDate)=2004) t1) t2
group by customerNumber,tep_column HAVING count(tep_column)>=3;

image.png

这里

  1. 根据RFM模型对用户进行分类

在这里我们假定现在是2004年12月,最近一次消费就用12减去最近一次消费月份,频率和金额就是04年所消费的次数和金额。第一步我们先把R/F/M分别算出来。

SELECT a.customerNumber,Recent,Money,Frequency 
FROM 
(SELECT customerNumber, 12-month(recent_orderdate) Recent  FROM
(SELECT customerNumber,first_value(orderDate) over(partition by customerNumber order by orderDate desc ) recent_orderdate ,row_number()  over(partition by customerNumber)  rnk FROM orders WHERE year(orderDate)=2004 ) as t1  
WHERE rnk=1) as a  join 
(SELECT customerNumber,sum(quantityOrdered*priceEach) as Money,count(customerNumber) as Frequency FROM orders join orderdetails USING(orderNumber)
GROUP BY customerNumber) as b  on a.customerNumber=b.customerNumber;

image.png

整个计算方法分为3步,第一步算出对应的Renct:


image.png

第二步算出对应的Money和Frequency:


image.png

第三步就是根据"customerNumber"字段进行内连接,组成一张表。

获取RFM表后,我们就需要根据业务理解来进行分类,比如我们的商品是消耗品平均要3个月才会用完,那么我们就根据统计学的经验法则把Recnt 3±3std的用户描述成目标客户,向这些客户发送推送或短信加上优惠券,让他们进行复购。

在这里我们进行个简单的分类,将Recent在2-4之间的为目标用户,0-2的为已购买用户,5-12为流失用户,金额在10万及以上的为大额用户,在10万以下的为小额用户,而对于频率我们不做分类。


SELECT *,CASE 
    WHEN 2>=Recent and  Money<=100000 THEN  '小额已购买用户'
    When 2>=Recent and  Money>100000 THEN  '大额已购买用户'
    When 2100000 THEN  '大额目标用户'
    When 4100000 THEN  '大额目标用户'
END   category
FROM 
(SELECT a.customerNumber,Recent,Money,Frequency 
FROM 
(SELECT customerNumber, 12-month(recent_orderdate) Recent  FROM
(SELECT customerNumber,first_value(orderDate) over(partition by customerNumber order by orderDate desc ) recent_orderdate ,row_number()  over(partition by customerNumber)  rnk FROM orders WHERE year(orderDate)=2004 ) as t1  
WHERE rnk=1) as a  join 
(SELECT customerNumber,sum(quantityOrdered*priceEach) as Money,count(customerNumber) as Frequency FROM orders join orderdetails USING(orderNumber)
GROUP BY customerNumber) as b  on a.customerNumber=b.customerNumber) t2;
image.png

除了根据RFM进行分类外,你还可以建立一个评分机制,进行加权求和,在这里就不多赘述。

以上就是用SQL做数据分析时常见的业务需求与操作,数据分析简而言之就是将上万条甚至几千万条数据提炼成复购率等信息,再根据信息进行决策判断,最重要的是要结合业务。

你可能感兴趣的:(如何用SQL做数据分析)