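Predicting Cinema Hall Attendance
Exercise requirements:
1. Read the data set from the given file. (Data set path: data/data72160/1_film.csv)
2. Draw a scatter plot of hall attendance (filmnum) against hall area (filmsize).
3. Draw the scatter-plot matrix of the hall attendance data set.
4. Select the feature variables and the response variable, and split the data.
5. Train a linear regression model.
6. Predict on the test set using the fitted parameters.
7. Plot the actual and predicted values of the response variable on the test set for comparison.
8. Evaluate the predictions.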
from jupyterthemes import jtplot
jtplot.style(theme='monokai')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Read the data set from the given file
data = pd.read_csv('../datasets/1_film.csv')
data.head()
|   | filmnum | filmsize | ratio | quality |
|---|---------|----------|-------|---------|
| 0 | 45 | 106 | 17 | 6 |
| 1 | 44 | 99 | 15 | 18 |
| 2 | 61 | 149 | 27 | 10 |
| 3 | 41 | 97 | 27 | 16 |
| 4 | 54 | 148 | 30 | 8 |
Draw a scatter plot of hall attendance (filmnum) against hall area (filmsize)
x = data["filmsize"]
y = data["filmnum"]
plt.figure(figsize=(16,10))
plt.scatter(x,y)
plt.xlabel('filmsize')
plt.ylabel('filmnum')
plt.show()
Draw the scatter-plot matrix of the hall attendance data set
import seaborn as sns
sns.pairplot(data=data,hue='filmnum',vars=['filmsize', 'quality','ratio'],height=4)
Select the feature variables and the response variable, and split the data
from sklearn.linear_model import LinearRegression
result = {}
plt.figure(figsize=(3*10,6))
# Fit one single-feature model per candidate column and record its R2.
for i,j in enumerate(data.columns[1:]):
    train_y = data["filmnum"]
    train_x = data.loc[:,j].values.reshape(-1, 1)
    linear_model = LinearRegression()
    linear_model.fit(train_x,train_y)
    # R2 on the same data the model was fit on
    score = linear_model.score(train_x,train_y)
    axes = plt.subplot(1,3,i+1)
    plt.scatter(train_x,train_y,color="blue")
    axes.set_title("the result of " + j + " is " + str(score))
    result[j] = score
    plt.xlabel(j)
    plt.ylabel('filmnum')
import operator
result = sorted(result.items(),key=operator.itemgetter(1),reverse=True)
resMax = result[0]
print("The greatest impact on filmnum is",resMax[0])
print("corresponding coefficient of determination is",resMax[1])
The greatest impact on filmnum is filmsize
corresponding coefficient of determination is 0.7805367573872601
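The ranking above compares single-feature fits by their R2 on the same data they were fit on. For a one-feature least-squares line with an intercept, that score equals the squared Pearson correlation between the feature and the response, so the ranking amounts to ordering the columns by absolute correlation with filmnum. The sketch below checks this identity with NumPy only. It uses just the five rows shown by `data.head()`, so its numbers are illustrative and will not match the full-data score printed above.

```python
import numpy as np

# The five rows printed by data.head(); a tiny illustrative sample only.
filmsize = np.array([106, 99, 149, 97, 148], dtype=float)
filmnum = np.array([45, 44, 61, 41, 54], dtype=float)

# Least-squares line: filmnum ~ slope * filmsize + intercept.
slope, intercept = np.polyfit(filmsize, filmnum, deg=1)
fitted = slope * filmsize + intercept

# Coefficient of determination from its definition.
ss_res = np.sum((filmnum - fitted) ** 2)
ss_tot = np.sum((filmnum - filmnum.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Pearson correlation between feature and response.
pearson_r = np.corrcoef(filmsize, filmnum)[0, 1]

print(r2, pearson_r ** 2)  # the two values agree up to rounding
```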
from sklearn.model_selection import train_test_split
y = data["filmnum"].values.reshape(-1, 1)
x = data["filmsize"].values.reshape(-1, 1)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.5)
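Because `train_test_split` shuffles the rows, the split above changes on every run unless a seed is given. The following is a minimal sketch of a reproducible split, not the notebook's own setting: `random_state=0` and `test_size=0.4` are arbitrary choices for illustration, and the arrays hold only the five rows shown by `data.head()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The five rows printed by data.head(); used only to keep the sketch self-contained.
demo_x = np.array([106, 99, 149, 97, 148]).reshape(-1, 1)  # filmsize
demo_y = np.array([45, 44, 61, 41, 54]).reshape(-1, 1)     # filmnum

# random_state=0 is an arbitrary seed; any fixed integer makes the split repeatable.
split_a = train_test_split(demo_x, demo_y, test_size=0.4, random_state=0)
split_b = train_test_split(demo_x, demo_y, test_size=0.4, random_state=0)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)

# Same seed, same arrays: both calls return identical partitions.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```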
Train a linear regression model
linear_model = LinearRegression()
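# Fit on the training split. train_x/train_y are leftovers from the per-feature loop (the quality
# column); fitting on them is the likely cause of the negative R2 in the recorded output.
linear_model.fit(x_train,y_train)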
LinearRegression()
Predict on the test set using the fitted parameters
y_hat = linear_model.predict(x_test)
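The prediction step can also be reproduced by hand from the fitted parameters, which makes explicit what `predict` computes for a one-feature model. The sketch below is only illustrative: it fits on the five rows shown by `data.head()` rather than on the training split, and the names `demo_x`, `demo_y`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The five rows printed by data.head(); a tiny illustrative sample only.
demo_x = np.array([106, 99, 149, 97, 148], dtype=float).reshape(-1, 1)  # filmsize
demo_y = np.array([45, 44, 61, 41, 54], dtype=float)                    # filmnum

demo_model = LinearRegression().fit(demo_x, demo_y)

# For a single feature, predict() is slope * x + intercept.
slope = demo_model.coef_[0]
intercept = demo_model.intercept_
manual_pred = slope * demo_x.ravel() + intercept

# Compare the hand-computed values with the library's predictions.
matches = np.allclose(manual_pred, demo_model.predict(demo_x))
print(slope, intercept, matches)
```

When the response is passed as a column of shape (n, 1), as in the split above, `coef_` is two-dimensional and `intercept_` is a length-1 array, so the indexing would need to be adjusted accordingly.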
Plot the actual and predicted values of the response variable on the test set
plt.figure(figsize=(10,6))
t = np.arange(len(x_test))
plt.plot(t, y_test, 'r', linewidth=2, label='y_test')
plt.plot(t, y_hat, 'b', linewidth=2, label='y_hat')
plt.legend()  # show the 'y_test' / 'y_hat' labels set above
plt.show()
Evaluate the predictions
from sklearn import metrics
from sklearn.metrics import r2_score
print("r2:", linear_model.score(x_test, y_test))
print("r2_score:", r2_score(y_test, y_hat))
print("MAE:", metrics.mean_absolute_error(y_test, y_hat))
print("MSE:", metrics.mean_squared_error(y_test, y_hat))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_hat)))
r2: -39.09824185600666
r2_score: -39.09824185600666
MAE: 88.92401507023722
MSE: 8222.95827775469
RMSE: 90.68052865833265
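An R2 of about -39 looks surprising, but it follows from the definition R2 = 1 - SS_res / SS_tot: the score is negative whenever the predictions are worse than always predicting the mean of the test responses. The sketch below computes the four metrics directly from their definitions with NumPy. The helper name `regression_metrics` and both prediction vectors are made up for illustration; only the actual values come from `data.head()`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE computed directly from their definitions."""
    # ravel() avoids accidental (n, 1) - (n,) broadcasting into an (n, n) matrix.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "MAE": mae, "MSE": mse, "RMSE": np.sqrt(mse)}

# filmnum values from data.head(); both prediction vectors are hypothetical.
actual = [45, 44, 61, 41, 54]
good_pred = [46, 43, 60, 42, 55]       # each off by one
bad_pred = [140, 130, 150, 135, 145]   # far from every actual value

good = regression_metrics(actual, good_pred)
bad = regression_metrics(actual, bad_pred)
print(good)
print(bad)
```

Predictions that sit far from every actual value, as in `bad_pred`, give a large SS_res relative to SS_tot and therefore a strongly negative R2, which is the same pattern as the recorded output above.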