上一次的内容中,我们已经把确定目标和数据获取的部分完成了,接下来我们来讲一下如何进行数据处理和数据展示吧。
第三部分:数据处理
数据处理是数据分析的核心部分,我也给它打了50%的权重。之前的数据获取部分,我们拿到了1615条原始的课程数据。但是,这个数据中可能有无法使用和重复等情况,而且我们还要根据之后展示数据的要求进行数据提取。可见,我们的数据处理又可以分成两步:数据清洗和数据提取。
一、数据清洗
我们还是使用python来对数据进行清洗。在之前的数据获取中我们拿到了三张表:总课程表、课程详情表和教师表。
我们先来看一下总课程表:
可以看出,我们之前为了爬取课程详情在课程表中增加了lessonurl列,在清洗时可以去掉。同时根据年级(grade)我们还可以增加一列学段(小学、初中、高中),可以为之后的展示提供更多的角度。
代码如下
#导入模块
import numpy as np
import pandas as pd
#打印出前5行,以确保数据运行正常
lessonDf.head()
#选择子集
lessonDf = lessonDf[['grade','channelid','lessonid','lessonname']]
#通过条件判断筛选
#构建查询条件
querySer = lessonDf['lessonid'] != 'groups/'
#根据查询条件筛选
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape
#通过条件判断筛选
#构建查询条件
querySer = lessonDf['lessonid'] != 'groups/'
#根据查询条件筛选
lessonDf = lessonDf.loc[querySer, : ]
lessonDf.shape
#给channelid赋值
for i in range(0, 1389):
channel = lessonDf.iloc[i,1]
if channel == 1:
lessonDf.iloc[i,1]='语文'
elif channel == 2:
lessonDf.iloc[i,1]='数学'
elif channel == 3:
lessonDf.iloc[i,1]='英语'
elif channel == 201:
lessonDf.iloc[i,1]='编程'
elif channel == 4:
lessonDf.iloc[i,1]='物理'
elif channel == 5:
lessonDf.iloc[i,1]='化学'
elif channel == 6:
lessonDf.iloc[i,1]='生物'
elif channel == 7:
lessonDf.iloc[i,1]='历史'
elif channel == 8:
lessonDf.iloc[i,1]='地理'
elif channel == 9:
lessonDf.iloc[i,1]='政治'
elif channel == 14:
lessonDf.iloc[i,1]='道德与法制'
else :
pass
#创建studyphase列
lessonDf['studyphase'] = lessonDf['grade']
#给studyphase赋值
for i in range(0, 1389):
phase = lessonDf.iloc[i,4]
if 0< phase<=6:
lessonDf.iloc[i,4]='小学'
elif 6< phase <=9:
lessonDf.iloc[i,4]='初中'
elif phase > 9:
lessonDf.iloc[i,4]='高中'
else :
pass
#去重
lessonDf = lessonDf.drop_duplicates()
#导出数据
lessonDf.to_excel('lesson.xls')
清洗结果如下:
同样的,我们也对课程详细表和教师表进行去除脏数据、去重等操作,导出teacher和lessondetail两张表:
二、数据提取
处理好数据之后,下一步就是根据实际的数据展示需求来提取数据了。参考下面的脑图我们可以更好的理解应当提取哪些数据:
这幅脑图和第一部分爬取数据时的脑图相比更加清晰,每个子主题都可以对应一个提数需求。这里我将上面数据清洗得到的三张表导入Sequel Pro,使用SQL语句来提取所需的数据。
1、学科、学段、年级下课程数排名
(1)不同学科的课程数量
select channelid '学科类型', count(distinct lessonid) '课程数量' from lesson
group by channelid;
(2)不同学段的课程数量
select studyphase '学段', count(distinct lessonid) '课程数量' from lesson
group by studyphase;
(3)不同年级的课程数量
select grade '年级', count(distinct lessonid) '课程数量' from lesson
group by grade;
2、学科、学段、年级下课程报名人数排名
(1)不同年级、学科下报名人数
select grade '年级', channelid '学科', sum(signup_number) '报名人数' from lessondetail d
left join
(select distinct grade, channelid, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by grade, channelid
(2)不同学段下报名人数
select studyphase '学段', sum(signup_number) '报名人数' from lessondetail d
left join
(select distinct studyphase, lessonid
from lesson) t
on d.lesson_id = t.lessonid
group by studyphase
(3)不同年级下课程报名人数
select grade '年级', sum(signup_number) '报名人数' from lessondetail d
left join
(select distinct grade, lessonid from lesson) t
on d.lesson_id = t.lessonid
group by grade
3、学科、学段、年级下老师数量排名
老师总数量
select count(distinct teacher_id) from teacher;
(1)不同学科的老师数量
select channelid '学科', count(teacher_id) '教师数'
from (select distinct channelid, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by channelid
(2)不同学段的老师数量
select studyphase '学段', count(teacher_id) '教师数'
from (select distinct studyphase, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by studyphase
(3)不同年级的老师数量
select grade '年级' , count(teacher_id) '教师数'
from (select distinct grade, teacher_id
from lesson l
join teacher t
on l.lessonid = t.lesson_id) t
group by grade
(4)课程配师的分布情况
select teacher_num '课程配备教师个数', count(t.lesson_id) '课程数量' from(
select lesson_id, count(1) as teacher_num from teacher
group by lesson_id) t
group by teacher_num
(5)老师上的课程数分布
select teacher_id, ct '上课数' from(
select teacher_id , count(1) ct from teacher
group by teacher_id) t
order by t.ct desc
4、学科、学段、年级下营收情况排名
(1)不同学科的营收
select channelid '学科' , sum(profit) '营收' from
(#对学科下的课程id去重
select channelid, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by channelid, lesson_id) t
group by channelid
(2)不同学段的营收
select studyphase '学段' , sum(profit) '营收' from
(#对学段下的课程id去重
select studyphase, lesson_id, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by studyphase, lesson_id) t
group by studyphase
(3)不同年级+学科的营收
select grade '年级' ,channelid '学科', sum(profit) '营收' from
(#对年级下的课程id去重
select grade, lesson_id, channelid, count(lesson_id), price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
group by grade, lesson_id, channelid) t
group by grade, channelid
(4)老师对营收的贡献排名
select teacher_id '教师id' , studyphase '学段' , channelid '学科' , sum(signup_num) '学生数',
count(1) '代课数' , sum(profit) '营收' from
(select lesson_id, signup_num, profit from
(#对多老师的课程营收分配成相应价格
select l.lesson_id, count(l.lesson_id), sum(signup_number)/count(l.lesson_id) signup_num,
price*signup_number/count(l.lesson_id) profit
from lessondetail l
join teacher t
on l.lesson_id = t.lesson_id
group by l.lesson_id) t1) t2
join teacher t
join lesson l
on t.lesson_id = t2.lesson_id
and t2.lesson_id = l.lessonid
group by teacher_id, studyphase, channelid
(5)课程对营收的贡献排名
select lesson_id, sum(price*signup_number) profit
from lessondetail
group by lesson_id
以上SQL语句基本上满足了大部分的提数需求。此外,还有一张表可以用来计算相关性矩阵:
合并表
select d.lesson_id, price, signup_number,
studyphase, grade, channelid,
price*signup_number profit
from lessondetail d
join lesson l
on d.lesson_id = l.lessonid
第四部分、数据展示
经过以上步骤的数据提取,我们可以根据需求来展示数据了。一个比较经典的图是这样的:
考虑到我们需要展示的数据主要目的为比较和构成,因此我选择条形图、柱状体、饼图等基本图形来进行展示。
由于篇幅有限,这里只展示一下相关性矩阵:
可见,课程的营收情况主要和报名人数以及年级成正相关。其他具体数据展示我会通过一个具体的报告进行呈现,敬请期待。
小结
本篇文章主要讨论了对猿辅导网站获取的数据进行清洗和提数的过程。在清洗数据时利用python简化流程,在提数过程中先使用Xmind确定提数需求,再使用SQL语句完成提数。