Computer Science Capstone Project: Analysis and Mining of Netizens' Social Network Data

1. Reading the data table

First, we read the raw data and inspect the basic properties of each field.

gradyear gender age friends basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
2006 M 18.98 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 F 18.801 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 2 1 0 0 0 6 4 0 1 0 0 0 0 0 0 0 0
2006 M 18.335 69 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2006 F 18.875 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 (missing) 18.995 10 0 0 0 0 0 0 0 0 0 0 0 1 0 0 5 1 1 0 3 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 1 1
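
The article does not name the tool or the file it reads from, so the following is only a minimal sketch of this step using Python's standard library. `SAMPLE_CSV` is a hypothetical excerpt (the first four columns of the rows above, written as comma-separated text), and `read_table` is an illustrative helper, not the author's code.

```python
import csv
import io

# Hypothetical excerpt: the first four columns of the rows shown above.
# The real file name and delimiter are not given in the article.
SAMPLE_CSV = """gradyear,gender,age,friends
2006,M,18.98,7
2006,F,18.801,0
2006,M,18.335,69
2006,F,18.875,0
2006,,18.995,10
"""


def read_table(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_table(SAMPLE_CSV)

# For each field, count how many values are missing (empty strings).
missing = {name: sum(1 for r in rows if r[name] == "") for name in rows[0]}
print(len(rows), "rows; missing per field:", missing)
```

Counting empty cells per field is a quick way to see which columns need the missing-value handling described in the next sections.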

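2. Filling missing age values

A missing value is an attribute value that is absent for one or more records in a dataset. Because most machine learning models cannot handle missing values, they must be filled in or removed before modeling. For the continuous variable age, we fill missing entries with the column mean; the result is shown in the table below.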

gradyear gender age friends basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
2006 M 18.98 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 F 18.801 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 2 1 0 0 0 6 4 0 1 0 0 0 0 0 0 0 0
2006 M 18.335 69 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2006 F 18.875 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 (missing) 18.995 10 0 0 0 0 0 0 0 0 0 0 0 1 0 0 5 1 1 0 3 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 1 1
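
Mean imputation can be sketched in a few lines. This is an illustration under assumptions: `fill_with_mean` is a hypothetical helper, missing values are represented as `None`, and the last entry of `ages` is an invented missing value, since none of the five rows shown above actually lacks an age.

```python
from statistics import fmean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


# Ages from the rows above, plus one hypothetical missing entry (None).
ages = [18.98, 18.801, 18.335, 18.875, 18.995, None]
filled = fill_with_mean(ages)
print(filled)
```

The mean is computed from the observed values only, so the filled entries do not change the column mean.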

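3. Filling missing gender values

For the categorical variable gender, we fill missing entries with the placeholder label "Unknown" (未知); the result is shown in the table below.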

gradyear gender age friends basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
2006 M 18.98 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 F 18.801 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 2 1 0 0 0 6 4 0 1 0 0 0 0 0 0 0 0
2006 M 18.335 69 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2006 F 18.875 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 Unknown 18.995 10 0 0 0 0 0 0 0 0 0 0 0 1 0 0 5 1 1 0 3 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 1 1
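
Filling a categorical column with a constant label is simpler still. The sketch below assumes missing values arrive as empty strings or `None`; `fill_missing` is a hypothetical helper, and the label "Unknown" stands in for the 未知 label used in the original data.

```python
def fill_missing(values, placeholder="Unknown"):
    """Replace None or empty-string entries with a placeholder label."""
    return [placeholder if v in (None, "") else v for v in values]


# Gender values from the five rows above; the last one is missing.
genders = ["M", "F", "M", "F", ""]
filled_genders = fill_missing(genders)
print(filled_genders)
```

Using a dedicated label keeps these rows in the dataset and lets a later model treat "gender not reported" as its own category.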

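5. Histogram before outlier handling

An outlier, also called an extreme value, is a sample whose value deviates markedly from the rest of the data. Because some machine learning models, such as linear regression, are sensitive to outliers, handling them helps make the model more robust.

Next, we use a histogram to examine the distribution of the friends column.

[Figure: histogram of the friends column before outlier handling]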
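
The plotting tool behind the figure is not named in the article. As a rough stand-in, the sketch below bins values into fixed-width intervals and prints a text histogram. `bin_counts` is a hypothetical helper and the `friends` list is invented sample data, not the article's dataset.

```python
from collections import Counter


def bin_counts(values, width):
    """Count values per half-open bin [k*width, (k+1)*width)."""
    counts = Counter((int(v) // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical friend counts with one extreme value in the tail.
friends = [7, 0, 69, 0, 10, 3, 25, 44, 12, 830]
width = 10
for start, n in bin_counts(friends, width).items():
    print(f"{start:>4}-{start + width - 1:<4} {'#' * n}")
```

A single bin far to the right of the others is the kind of pattern that motivates the outlier handling in the next step.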

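6. Outlier handling

Using the data filtering component, we remove the data points greater than $Q_3 + 1.5 \times \mathrm{IQR}$, where $Q_3$ is the third quartile and $\mathrm{IQR} = Q_3 - Q_1$ is the interquartile range. The result is shown in the table below.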

gradyear gender age friends basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
2006 M 18.98 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 F 18.801 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 2 1 0 0 0 6 4 0 1 0 0 0 0 0 0 0 0
2006 M 18.335 69 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2006 F 18.875 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 Unknown 18.995 10 0 0 0 0 0 0 0 0 0 0 0 1 0 0 5 1 1 0 3 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 1 1
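
The upper-fence rule can be sketched as follows. The quartile convention used by the article's filtering component is not stated, so the "inclusive" method of `statistics.quantiles` is an assumption here; `upper_fence` is a hypothetical helper and `friends` is invented sample data.

```python
from statistics import quantiles


def upper_fence(values):
    """Return Q3 + 1.5 * IQR, with quartiles from the inclusive method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


# Hypothetical friend counts with one extreme value.
friends = [7, 0, 69, 0, 10, 3, 25, 44, 12, 830]
fence = upper_fence(friends)

# Keep only the points at or below the fence.
kept = [v for v in friends if v <= fence]
print(fence, kept)
```

Only the upper fence is applied, matching the text above; a count such as friends cannot be negative, so a lower fence would rarely remove anything.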

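7. Z-Score standardization

Data standardization is a preprocessing step that rescales the data. It may be needed when we want to remove the effect of differing units, help a model converge, or satisfy a model's assumptions.

This case study covers two commonly used methods, Z-Score standardization and Min-Max normalization. Below, we apply Z-Score standardization to the friends column so that the transformed data has a mean of 0 and a standard deviation of 1.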

gradyear gender age friends basketball football soccer softball volleyball swimming cheerleading baseball tennis sports cute sex sexy hot kissed dance band marching music rock god church jesus bible hair dress blonde mall shopping clothes hollister abercrombie die death drunk drugs
2006 M 18.98 -0.720678 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 F 18.801 -0.99873 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 2 1 0 0 0 6 4 0 1 0 0 0 0 0 0 0 0
2006 M 18.335 1.742069 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2006 F 18.875 -0.99873 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2006 Unknown 18.995 -0.601512 0 0 0 0 0 0 0 0 0 0 0 1 0 0 5 1 1 0 3 0 1 0 0 0 1 0 0 0 2 0 0 0 0 0 1 1

Feature Mean Std
friends 25.143165 25.175147
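
Z-Score standardization maps each value $x$ to $z = (x - \mu) / \sigma$. As a quick check, the sketch below applies that formula to the raw friends values from the earlier tables, using the mean and standard deviation reported in the summary table. `z_score` is a hypothetical helper; whether the tool computed a population or sample standard deviation is not stated, so the reported value is used as given.

```python
def z_score(x, mean, std):
    """Standardize x given the column mean and standard deviation."""
    return (x - mean) / std


# Mean and standard deviation of friends, as reported in the summary table.
FRIENDS_MEAN = 25.143165
FRIENDS_STD = 25.175147

# Raw friends values from the five rows shown before standardization.
raw = [7, 0, 69, 0, 10]
scaled = [round(z_score(x, FRIENDS_MEAN, FRIENDS_STD), 6) for x in raw]
print(scaled)
```

Rounded to six decimals, these values should match the friends column of the standardized table.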

8. Histogram after outlier handling

[Images 4 and 5: histograms after outlier handling]
