DHU Data Science and Technology: Assignment 8

I. Short-answer questions (2 questions, 100 points)

  1. (Short-answer question, 50 points)
    The Energy Efficiency dataset (ENB2012_data.xlsx, ENB2012.names) records the heating and cooling energy loads of different buildings. It contains 768 records, 8 feature attributes, and two target values; see ENB2012.names for details.

1) Train a linear regression model on the full dataset to predict the heating load, and report the model's performance as RMSE and R²;

2) Split the dataset into a training set and a test set, train a linear regression model on the training set, and analyze the model's performance on the training set and the test set.
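For reference, with $y_i$ the actual and $\hat{y}_i$ the predicted heating load over $n$ samples, the two metrics are

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.$$

Note that metrics.mean_squared_error returns the MSE, so the square root has to be taken explicitly to report RMSE.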

import time
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import model_selection
from sklearn import metrics

# Load the data: columns 0-7 hold the eight feature attributes X1-X8; column 8 is Y1 (heating load) and column 9 is Y2 (cooling load)
data = pd.read_excel('C:\\python\\ENB2012_data.xlsx')
X = data.iloc[:,0:8].values.astype(float)
y = data.iloc[:,8].values.astype(float)   # Y1, the heating load to be predicted
# Split the dataset into training and test sets
X_train,X_test,y_train,y_test = model_selection.train_test_split(X,y,test_size=0.25,random_state=int(time.time()))

# (1) Train on the full dataset and evaluate on the same data
learningall = LinearRegression()
learningall.fit(X,y)

r2_all = learningall.score(X,y)
print('R2 on the full dataset: {:.2f}'.format(r2_all))
y_all_pred = learningall.predict(X)
rmse_all = np.sqrt(metrics.mean_squared_error(y,y_all_pred))
print('RMSE on the full dataset: {:.2f}'.format(rmse_all))

# (2) Train on the training set only, then compare performance on the training and test sets
learning = LinearRegression()
learning.fit(X_train,y_train)

# R2 on the training and test sets
train_score = learning.score(X_train,y_train)
test_score = learning.score(X_test,y_test)
print('R2: training set {:.2f}, test set {:.2f}'.format(train_score,test_score))
# RMSE on the training and test sets (square root of the MSE)
y_train_pred = learning.predict(X_train)
y_test_pred = learning.predict(X_test)
train_err = np.sqrt(metrics.mean_squared_error(y_train,y_train_pred))
test_err = np.sqrt(metrics.mean_squared_error(y_test,y_test_pred))
print('RMSE: training set {:.2f}, test set {:.2f}'.format(train_err,test_err))
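
Because the split uses a time-based random seed, the test-set figures above change from run to run. A minimal sketch of a more stable estimate, assuming the same X and y arrays and a scikit-learn version that provides the 'neg_root_mean_squared_error' scorer, is 5-fold cross-validation:

# Sketch: 5-fold cross-validated RMSE and R2 for linear regression
# (assumes the X and y arrays built above)
cv_rmse = -model_selection.cross_val_score(LinearRegression(), X, y, cv=5,
                                           scoring='neg_root_mean_squared_error')
cv_r2 = model_selection.cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print('CV RMSE: {:.2f} +/- {:.2f}'.format(cv_rmse.mean(), cv_rmse.std()))
print('CV R2:   {:.2f} +/- {:.2f}'.format(cv_r2.mean(), cv_r2.std()))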
  2. (Short-answer question, 50 points)
    Based on the bankpep dataset, split it into training and test sets and build classification models.

1) Build a classification model on the training set using a decision tree and record its performance on the test set;

2) Study on your own how to train classification models with naive Bayes and support vector machines, build each model on the training set, and record its performance on the test set;

3) Analyze and compare the performance differences among the three methods.
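
scikit-learn estimators require numeric inputs, so the yes/no and categorical columns must be encoded first. As a compact alternative to the row-wise .loc assignments used in the solution below (column names assumed from the standard bankpep file, and data assumed to be the DataFrame read there), the binary columns can be mapped through a dictionary:

# Sketch: dictionary-based encoding of the binary bankpep columns
# (assumes data is the DataFrame loaded in the solution code below)
yes_no = {'YES': 1, 'NO': 0}
for col in ['married','car','save_act','current_act','mortgage','pep']:
    data[col] = data[col].map(yes_no)
data['sex'] = data['sex'].map({'FEMALE': 0, 'MALE': 1})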
import time
import pandas as pd
from sklearn import model_selection
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from pandas import DataFrame
import matplotlib.pyplot as plt
import matplotlib
zhfont1 = matplotlib.font_manager.FontProperties(fname="C:\\python\\SourceHanSansSC-Bold.otf")

data = pd.read_csv('C:\\python\\bankpep.csv',index_col=0,header=0)

# Encode the binary yes/no columns as 1/0
seq1 = ['married','car','save_act','current_act','mortgage','pep']
for feature in seq1:
    data.loc[data[feature]=='YES',feature] = 1
    data.loc[data[feature]=='NO',feature] = 0
# Encode sex and region as integer codes
data.loc[data['sex']=='FEMALE','sex'] = 0
data.loc[data['sex']=='MALE','sex'] = 1
data.loc[data['region']=='INNER_CITY','region'] = 1
data.loc[data['region']=='RURAL','region'] = 2
data.loc[data['region']=='TOWN','region'] = 3
data.loc[data['region']=='SUBURBAN','region'] = 4
# Features are the first 10 columns (age ... mortgage); the target is the last column, pep
X1 = data.iloc[:,0:10].values.astype(float)
y1 = data.iloc[:,10].values.astype(int)

# (1) Decision tree classifier
X_train1,X_test1,y_train1,y_test1 = model_selection.train_test_split(X1,y1,test_size=0.25,random_state=int(time.time()))
learning = tree.DecisionTreeClassifier()
learning = learning.fit(X_train1,y_train1)
d1 = learning.score(X_test1,y_test1)
print('Decision tree accuracy: {:.2f}'.format(d1))

# (2) Naive Bayes (Gaussian) classifier, trained on the same split
learning = GaussianNB()
learning.fit(X_train1, y_train1)
d2 = learning.score(X_test1,y_test1)
print('Naive Bayes accuracy: {:.2f}'.format(d2))

# Support vector machine: one-hot encode the multi-valued columns (region, children)
dumm_reg = pd.get_dummies(data['region'],prefix='region')
dumm_child = pd.get_dummies(data['children'],prefix='children')
df1 = data.drop(['region','children'],axis=1)
df2 = df1.join([dumm_reg,dumm_child],how='outer')
X3 = df2.drop(['pep'],axis=1).values.astype(float)
y3 = df2['pep'].values.astype(int)
X_train3,X_test3,y_train3,y_test3 = model_selection.train_test_split(X3,y3,test_size=0.25,random_state=int(time.time()))
# RBF-kernel SVM; with unscaled features and C=0.001 the penalty on errors is very weak,
# so this model tends to underfit (a scaled variant is sketched after the script)
learning = svm.SVC(kernel='rbf',gamma=0.7,C=0.001)
learning.fit(X_train3,y_train3)
d3 = learning.score(X_test3,y_test3)
print('SVM accuracy: {:.2f}'.format(d3))

# (3) Compare the three classifiers' test-set accuracy in a bar chart
scores = DataFrame([d1,d2,d3],columns=['score'],index=['Decision tree','Naive Bayes','SVM'])
scores.plot(kind='bar',title='Accuracy on the test set',rot=0)
plt.xticks(fontproperties=zhfont1)
plt.show()
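
The SVM above scores poorly largely because its inputs are on very different scales and C=0.001 applies almost no penalty to misclassification, which skews the comparison in part 3. A minimal sketch of a fairer comparison, assuming the X1/y1 and X3/y3 arrays built above, that standardizes the SVM's inputs and cross-validates all three models:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: 5-fold cross-validated accuracy for the three classifiers.
# X1/y1 (integer-coded) feed the tree and naive Bayes; X3/y3 (one-hot) feed the SVM.
models = [
    ('Decision tree', tree.DecisionTreeClassifier(), X1, y1),
    ('Naive Bayes', GaussianNB(), X1, y1),
    # standardize features before the RBF SVM and keep scikit-learn's default C and gamma
    ('SVM (scaled)', make_pipeline(StandardScaler(), svm.SVC(kernel='rbf')), X3, y3),
]
for name, model, Xm, ym in models:
    cv_acc = model_selection.cross_val_score(model, Xm, ym, cv=5)
    print('{}: mean accuracy {:.2f} +/- {:.2f}'.format(name, cv_acc.mean(), cv_acc.std()))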
