This was a small freelance job from an overseas client; overall there is nothing technically difficult about it.
In this homework, you will develop a model to predict whether a given car gets high or low gas mileage, based on the Auto data set.
import numpy as np
import pandas as pd
%matplotlib inline
# Load the data set
test=pd.read_csv('Auto.csv')
# Show the first 5 rows
test.head()
|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
# Display data set info
test.info()
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg 397 non-null float64
cylinders 397 non-null int64
displacement 397 non-null float64
horsepower 397 non-null object
weight 397 non-null int64
acceleration 397 non-null float64
year 397 non-null int64
origin 397 non-null int64
name 397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
# horsepower stores missing values as '?'; replace them with NaN and drop those rows
test.replace('?',np.nan,inplace = True)
test.dropna(inplace=True)
# With the '?' entries gone, horsepower can be cast to int
test['horsepower']=test['horsepower'].astype('int')
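As an alternative (a minimal sketch, assuming the same Auto.csv file and a hypothetical DataFrame name auto), pandas can treat '?' as missing already at load time via the na_values argument of read_csv, which avoids the separate replace step:
# Hypothetical alternative: mark '?' as NaN while reading, then drop and cast in one pass
auto = pd.read_csv('Auto.csv', na_values='?')
auto = auto.dropna()
auto['horsepower'] = auto['horsepower'].astype(int)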
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. (10 points)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
# Compute the median of mpg
mpg_median = test['mpg'].median()
# Label each row: 1 if mpg is above the median, 0 otherwise
def above_median(x):
    if x > mpg_median:
        return 1
    else:
        return 0
test['mpg01'] = test['mpg'].apply(above_median)
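A more compact equivalent (a sketch, not the original submission) is to compare the column against its median directly and check that the median split gives roughly balanced classes:
# Vectorised alternative: boolean comparison cast to int
test['mpg01'] = (test['mpg'] > test['mpg'].median()).astype(int)
# The median split should give roughly equal class counts
print(test['mpg01'].value_counts())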
# Check the correlation of each numeric feature with mpg01 (drop the non-numeric name column first)
test.drop(columns=['name']).corr()
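To see at a glance which features are most associated with mpg01, the correlations can be ranked by absolute value (a small sketch; column names follow the data set shown above):
# Rank features by the strength of their linear correlation with mpg01
corr_with_target = test.drop(columns=['name']).corr()['mpg01'].abs().sort_values(ascending=False)
print(corr_with_target)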
import seaborn as sns
# Pairwise scatterplots coloured by mpg01, with KDE curves on the diagonal
g = sns.pairplot(test, hue='mpg01', palette='seismic', diag_kind='kde', diag_kws=dict(shade=True))
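Since the assignment also suggests boxplots, a quick sketch like the following (assuming the same test DataFrame; the choice of weight, horsepower and displacement is just illustrative) shows how individual features separate the two mpg01 classes:
import matplotlib.pyplot as plt

# Boxplots of a few candidate predictors, split by mpg01
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ['weight', 'horsepower', 'displacement']):
    sns.boxplot(x='mpg01', y=col, data=test, ax=ax)
    ax.set_title(col)
plt.show()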
(c) Split the data into a training set and a test set.
from sklearn.model_selection import train_test_split
# Use train_test_split to hold out 20% of the data for testing (80% for training)
x=test.drop(['mpg01','mpg','name'],axis=1)
y=test['mpg01']
# random_state fixes the split so the test errors reported below are reproducible
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
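As a quick sanity check (a sketch, not part of the original submission), the split sizes and class balance can be inspected; train_test_split also accepts stratify=y if you want both sets to preserve the mpg01 class ratio:
# Verify the 80/20 split and that both classes appear in train and test
print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))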
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)
test.info()
# Import the LDA classifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Use the feature most associated with mpg01 in (b)
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(X_train1, y_train)
acc = lda.score(X_test1, y_test)  # score() returns the classification accuracy
print('LDA test accuracy:', acc, '| test error:', 1 - acc)
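For a closer look at where the LDA model errs (a sketch using sklearn.metrics, not required by the assignment), the confusion matrix on the test set can be printed:
from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = low mpg, 1 = high mpg), columns are predicted classes
print(confusion_matrix(y_test, lda.predict(X_test1)))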
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train1, y_train)
acc = qda.score(X_test1, y_test)  # score() returns the classification accuracy
print('QDA test accuracy:', acc, '| test error:', 1 - acc)
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# For logistic regression, use the two features most associated with mpg01
numerical=['weight','cylinders']
X_train1=X_train[numerical]
X_test1=X_test[numerical]
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train1,y_train)
from sklearn.metrics import classification_report
print('----------------Train Set----------------------')
print(classification_report(y_train, lr.predict(X_train1)))
print('----------------Test Set----------------------')
print(classification_report(y_test, lr.predict(X_test1)))
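classification_report shows precision, recall and F1 rather than the test error itself, so a short addition (a sketch using accuracy_score) makes the requested number explicit:
from sklearn.metrics import accuracy_score

# Test error of the logistic regression model = 1 - accuracy on the held-out set
lr_test_error = 1 - accuracy_score(y_test, lr.predict(X_test1))
print('Logistic regression test error:', lr_test_error)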
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
from sklearn.neighbors import KNeighborsClassifier
# Candidate values of K
neighbors = range(1, 30)
# Use the feature most associated with mpg01 in (b)
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
knn_acc = []
# Fit a KNeighborsClassifier for every K in neighbors, record the test-set
# accuracy for each K, and print the full list of accuracies at the end.
for i in neighbors:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train1, y_train)
    knn_acc.append(model.score(X_test1, y_test))
print(knn_acc)
import matplotlib.pyplot as plt
# Test accuracy (not AUC) as a function of K
plt.plot(neighbors, knn_acc, label='test accuracy')
plt.legend()
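To answer which K performs best, the maximum of knn_acc can be read off directly (a small sketch; ties are broken by taking the first best K):
# K with the highest test accuracy, and the corresponding test error
best_idx = int(np.argmax(knn_acc))
best_k = list(neighbors)[best_idx]
print('Best K:', best_k, '| test error:', 1 - knn_acc[best_idx])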