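Problem description
Three islands each host a temperature observer: A, B, and C. Suppose the temperature Y on C's island and the temperatures X_A and X_B on A's and B's islands satisfy a functional relationship
Y = f(X_A, X_B),
but the explicit form of f is unknown. C wants to find f, so he has hired a data scientist to help and has provided temperature data from all three islands for the past 200 days.
X_A and X_B are called independent variables or features; Y is called the dependent variable or response.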
Data: 'complete_train_samples.csv', 'test_samples.csv'
Write Python code that uses complete_train_samples.csv to build and train a linear regression model and a KNN model, using XA and XB to predict Y.
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
# 1. Load the data
train_data = pd.read_csv('complete_train_samples.csv')
test_data = pd.read_csv('test_samples.csv')
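# 2. Data exploration and preprocessing
# (Exploration and preprocessing code can be added here, e.g. viewing the first few rows, handling missing values and outliers.)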
# 3. Separate the features from the target
X = train_data[['XA', 'XB']]
y = train_data['Y']
# 4. Split into training and held-out sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5. Build and train the linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
# 6. Build and train the KNN model
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train, y_train)
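# 7. Predict and evaluate the models on the training data
# Note: X_test and y_test from step 4 are not used here. Training-set scores are
# optimistic, especially for KNN, where each training point is its own nearest neighbor.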
linear_train_predictions = linear_model.predict(X_train)
knn_train_predictions = knn_model.predict(X_train)
# Compute the R^2 scores
linear_r2 = r2_score(y_train, linear_train_predictions)
knn_r2 = r2_score(y_train, knn_train_predictions)
# 8. Plot the model predictions against the true values
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(y_train, linear_train_predictions, label=f'Linear Regression, R^2={linear_r2:.2f}')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.legend()
plt.subplot(1, 2, 2)
plt.scatter(y_train, knn_train_predictions, label=f'KNN, R^2={knn_r2:.2f}', color='red')
plt.xlabel('True Values')
plt.ylabel('Predictions')
plt.legend()
plt.tight_layout()
plt.show()
# 9. Use the models to predict on the test data
test_X = test_data[['XA', 'XB']]
linear_test_predictions = linear_model.predict(test_X)
knn_test_predictions = knn_model.predict(test_X)
# 10. Save the test predictions to 'test_prediction.csv'
test_predictions_df = pd.DataFrame({
    'XA': test_data['XA'],
    'XB': test_data['XB'],
    'Prediction_Linear': linear_test_predictions,
    'Prediction_KNN': knn_test_predictions,
})
test_predictions_df.to_csv('test_prediction.csv', index=False)
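
The script above scores both models only on the rows they were trained on, so the held-out split from step 4 goes unused. The sketch below shows one way to compare training and held-out R^2 side by side. It is illustrative only: the data frame `demo` is a synthetic stand-in for complete_train_samples.csv, and the linear relationship used to generate Y is a made-up assumption, since the real f is unknown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for complete_train_samples.csv (assumption for
# illustration only): 200 rows with columns XA, XB, Y.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "XA": rng.uniform(10, 30, size=200),
    "XB": rng.uniform(10, 30, size=200),
})
# Hypothetical relationship plus noise; the real f is unknown.
demo["Y"] = 0.6 * demo["XA"] + 0.3 * demo["XB"] + rng.normal(0, 0.5, size=200)

X = demo[["XA", "XB"]]
y = demo["Y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training split, then score it on both splits.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }

# One row per model, one column per split.
print(pd.DataFrame(scores).T)
```

With the real data, replacing `demo` with the frame read from complete_train_samples.csv should be enough. A large gap between the training and held-out scores would suggest overfitting, which is one reason to try other values of n_neighbors for the KNN model.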