This blog records (or rather, backs up) my homework and scores from the 2019 Data Science Technology and Applications MOOC.
Course link: 数据科学技术与应用
Since this was a while ago, only the overall grade of some assignments survives, not a score for each individual program.
1. Read several students' names from the keyboard and store them in a list of strings;
then input a name and check whether it is already in the list.
a = input("Enter student names separated by commas, press Enter to finish: ")
a = a.split(",")  # turn the input into the required list of name strings
# run the lookup in a loop until the user types "end"
while True:
    check = input("Name to look up (type end to finish): ")
    if check == "end":
        break
    if check in a:
        print(check + " is in the list")
    else:
        print(check + " is not in the list")
2. Write a Python program that records several students' names and heights in a dictionary;
then input any student's name and display every student taller than that one.
dic = {"a": 160, "b": 170, "c": 180, "d": 190}  # dictionary: key = name, value = height
check = input("Enter a student's name: ")
for key in dic.keys():
    if dic[key] > dic[check]:
        print("Name: " + key + ", height: " + str(dic[key]))  # print every student taller than the target
Score: 78/100
1. The four supermarkets 大润发, 沃尔玛, 联华 and 农工商 all sell five fruits: apples, bananas, oranges, kiwis and mangoes. Use NumPy's ndarray to:
1) create two one-dimensional arrays holding the supermarket names and the fruit names;
2) create one 4x5 two-dimensional array of fruit prices, generated as random numbers in the range 4 to 10;
3) select 大润发's apples and 联华's bananas and raise their prices by 1 yuan;
4) 农工商 holds a sale: cut every fruit price by 2 yuan;
5) compute the average price of apples and of mangoes across the four supermarkets;
6) find the name (not the index) of the supermarket with the most expensive oranges.
import numpy as np
shop = np.array(["大润发","沃尔玛","联华","农工商"])
fruit = np.array(["苹果","香蕉","桔子","猕猴桃","芒果"])
price = np.random.randint(4, 11, size=(4, 5))  # randint's upper bound is exclusive, so 11 gives prices in [4, 10]
print(price)
price[[0, 2], [0, 1]] += 1  # raise 大润发's apples and 联华's bananas by 1 yuan, in place
price[3, :] -= 2            # 农工商 sale: every fruit 2 yuan cheaper
print(price)
print(price[:, fruit == "苹果"].mean())  # average apple price across the four shops
print(price[:, fruit == "芒果"].mean())  # average mango price
print(shop[price[:, fruit == "桔子"].argmax()])  # name (not index) of the shop with the priciest oranges
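Because the prices are random, every run prints different numbers. Seeding the generator makes a run reproducible; a sketch using the newer Generator API (NumPy 1.17+, the seed value 0 is arbitrary):

rng = np.random.default_rng(0)            # fixed seed, reproducible runs
price = rng.integers(4, 11, size=(4, 5))  # same inclusive [4, 10] range as above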
2. Based on the random-walk example, use ndarray and the random-number functions to simulate an object walking randomly in 3D space.
1) Create a 3x10 two-dimensional array recording the object's displacement along each axis at every step.
The displacement along each axis follows the standard normal distribution (mean 0, variance 1);
rows 0, 1 and 2 correspond to the x, y and z axes.
2) Compute the object's 3D position after every step.
3) Compute the object's distance from the origin after every step.
4) Find the farthest distance the object reaches along the z axis.
(Hint: take abs() of the z position after every step, then take the maximum.)
5) Find the smallest distance from the origin the object reaches in 3D space.
import numpy as np
rndwlk = np.random.normal(0, 1, size=(3, 10))  # step displacements along x, y, z
print(rndwlk)
position = rndwlk.cumsum(axis=1)  # position after each step
print(position)
dists = np.sqrt(position[0]**2 + position[1]**2 + position[2]**2)  # distance from the origin after each step
print(dists)
# the sign of the farthest z position is unknown, so take the absolute value first
zmax = np.abs(position[2]).max()
print(zmax)
print(dists.min())  # closest approach to the origin
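The distance line can also be written with np.linalg.norm, which is equivalent to the explicit sqrt-of-squares above and generalizes to any number of axes:

dists = np.linalg.norm(position, axis=0)  # Euclidean distance from the origin after each step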
1. Create and access a DataFrame object.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
form = DataFrame(data, index=['a', 'b', 'c'], columns=['one', 'two', 'three'])
print(form.loc[:, ['two', 'three']])  # select the columns labelled 'two' and 'three'
print(form.iloc[[0, 2], :])  # select rows 0 and 2
print(form.iloc[:, [0, 2]])  # select columns 0 and 2
mask = form.iloc[:, 1] > 2
data1 = form.loc[mask, :].copy()  # rows whose column-1 value exceeds 2; copy to avoid SettingWithCopyWarning
data1.loc[:, 'four'] = 10  # add a column labelled 'four' with every value set to 10
print(data1)
mask = data1.values > 9
data1[mask] = 8  # change every value greater than 9 to 8
print(data1)
data1.drop([data1.index.values[0], data1.index.values[1]], axis=0, inplace=True)  # drop rows 0 and 1
print(data1)
2. Helen has been using an online dating site to find suitable dates, and stores her data in the file datingTestSet.xls.
1) Read the valid data from the file into a DataFrame, skipping all explanatory text lines;
2) set the column index to ['flymiles', 'videogame', 'icecream', 'type'];
3) display the first 5 records read;
4) display all records whose 'type' is 'largeDoses';
5) change every average weekly video-game time above 10 to 10;
6) save the modified DataFrame back to a file, keeping the row and column indexes.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('C:\\Users\\14401\\Downloads\\datingTestSet.csv', header=None, names=['flymiles', 'videogame', 'icecream', 'type'], skiprows=2)
print(df[:5])  # first 5 records
mask = df['type'] == 'largeDoses'
print(df.loc[mask, :])  # all records whose 'type' is 'largeDoses'
mask = df['videogame'] > 10
df.loc[mask, 'videogame'] = 10  # cap video-game time at 10
print(df)
df.to_csv('C:\\Users\\14401\\Downloads\\datingTestSet.csv', mode='w')  # save back to file; row and column indexes are written by default
Score: 95/100
1. Data cleaning and filling
1) Read the data from the "Group1" sheet of studentsInfo.xlsx;
2) set every value in the '案例教学' column to NaN;
3) drop every row with 3 or more missing values;
4) drop every column whose values are all NaN;
5) fill the NaN values in the '体重' and '成绩' columns with the column means;
6) fill the NaN values in the '年龄' column with the previous row's value;
7) fill the NaN values in the '月生活费' column with the median.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
st1 = pd.read_excel('C:\\Users\\14401\\Downloads\\studentsInfo.xlsx', "Group1", index_col=0)
st1['案例教学'] = np.nan  # set the whole '案例教学' column to NaN
print(st1)
st2 = st1.dropna(thresh=7)  # drop rows with 3 or more missing values (thresh is the number of non-NaN values required)
print(st2)
st1.dropna(how='all', axis=1, inplace=True)  # drop columns that are entirely NaN
print(st1)
st1.fillna({'体重': st1['体重'].mean(), '成绩': st1['成绩'].mean()}, inplace=True)  # fill '体重' and '成绩' with their column means
print(st1)
st1['年龄'].fillna(method='ffill', inplace=True)  # fill '年龄' with the previous row's value
print(st1)
st1.fillna({'月生活费': st1['月生活费'].median()}, inplace=True)  # fill '月生活费' with the median
print(st1)
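A quick sanity check I could have added: counting the NaNs left per column confirms the filling worked.

print(st1.isnull().sum())  # every filled column should now report 0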
2. Data merging and sorting
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
st = pd.read_excel('C:\\Users\\14401\\Downloads\\studentsInfo.xlsx', "Group3", index_col=None)
data1 = st.loc[:, ['序号', '性别', '年龄']]
print(data1)
data2 = st.loc[:, ['序号', '身高', '体重', '成绩']]
print(data2)
st1 = pd.merge(data1, data2, how='inner')  # inner join on the shared '序号' column
print(st1)
st2 = st.sort_values(by='月生活费', ascending=True)  # sort by monthly living cost, ascending
print(st2)
st['身高排名'] = st['身高'].rank(ascending=True, method='min')  # rank by height; ties share the lowest rank
st3 = st.sort_values(by='身高排名', ascending=False)
print(st3)
3. Based on a department's lab teaching schedule, complete the following analysis (steps #1 to #10 are marked in the code):
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
data = pd.read_excel('C:\\Users\\14401\\Downloads\\DataScience.xls')  # 1
a = len(data.index)
b = data.columns.values
c = len(b)
print('The data has {} feature attributes: {}; there are {} records in total.'.format(c, b, a))  # 2
dt1 = data.loc[data.isnull().any(axis=1)]  # 3: rows containing at least one NaN
dt1.to_csv('C:\\Users\\14401\\Downloads\\pre.csv', mode='w', header=True, index=True, encoding="utf_8_sig")
data.dropna(how='all', axis=0, inplace=True)
# rows that are entirely NaN can simply be dropped; a partially missing row could instead
# copy the previous row or be filled by hand (it may or may not match the row above)
dt2 = data.loc[:, ['课程名称', '实验项目名称', '实验类型', '二级实验室']]  # 4
gp1 = data.groupby(['课程名称']).aggregate({'实验课时数': np.sum})  # 5: lab hours per course
print(gp1)
gp2 = data.groupby(['周次']).aggregate({'实验课时数': np.sum})  # 6: lab hours per week
print(gp2)
dt3 = pd.crosstab(data['课程名称'], data['实验类型'])  # 7: course-by-type contingency table
print(dt3)
dt4 = data.loc[:, ['班级', '周次', '星期', '课程名称', '实验地点门牌号']]  # 8
print(dt4)
gp3 = data.groupby(['二级实验室名称']).aggregate({'实验课时数': np.sum})  # 9: lab hours per lab room
print(gp3)
dt5 = data.loc[:, ['二级实验室名称', '实验类型']]  # 10
print(dt5.drop_duplicates().dropna())
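Step 7's crosstab can equivalently be written as a pivot_table that counts rows, which reads consistently with the groupby calls above (same columns, my variant):

dt3b = data.pivot_table(index='课程名称', columns='实验类型', aggfunc='size', fill_value=0)
print(dt3b)  # same counts as pd.crosstab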
Score: 97/100
1. China's per-capita disposable income from 2012 to 2017 was [1.47, 1.62, 1.78, 1.94, 2.38, 2.60] (unit: ten thousand yuan). Draw the following charts as required.
1) Following examples 4-1 and 4-3, draw a line chart of per-capita disposable income (as in Figure 4-6):
mark the data points with small squares, use a black dashed line, annotate the highest point, title the chart "Income chart",
set the axis titles, and save the figure as a JPG file.
2) Following example 4-2, use several subplots to draw a line chart, a box plot and a bar chart of the income data (as in Figure 4-7).
import matplotlib.pyplot as plt
from pandas import DataFrame
income = [1.47, 1.62, 1.78, 1.94, 2.38, 2.60]
data = DataFrame({'Income': income}, index=['2012', '2013', '2014', '2015', '2016', '2017'])
data.plot(color='black', label='Income', linestyle='dashed', marker='s')  # black dashed line, square markers
plt.title('2012~2017 Income Chart')
plt.xlabel('Year')
plt.ylabel('Income(RMB Ten Thousand)')
plt.xlim(0, 5)
plt.ylim(0.0, 3.0)
plt.xticks(range(0, 6), ('2012', '2013', '2014', '2015', '2016', '2017'))
plt.grid()
plt.legend(loc='upper right')
plt.annotate('Largest!', color='red', xy=(5, 2.60), xytext=(3, 2.50), arrowprops=dict(arrowstyle='->', color='red'))  # annotate the highest point
#plt.savefig('F:\\1.1.jpg')
plt.show()
# subplot layout: line chart top-left, box plot top-right, bar chart across the bottom
fig = plt.figure(figsize=(8, 6))
plt.subplots_adjust(wspace=0.6, hspace=0.4)
ax1 = fig.add_subplot(2, 2, 1)
plt.xticks([0, 2, 4])
plt.title('line chart')
plt.xlabel('Year')
plt.ylabel('Income')
ax1.plot(data)
ax2 = fig.add_subplot(2, 1, 2)
data.plot(kind='bar', title='bar chart', ax=ax2, fontsize='small')
plt.ylabel('Income')
ax3 = fig.add_subplot(2, 2, 2)
data.plot(kind='box', fontsize='small', title='box-whisker plot', xticks=[], ax=ax3)
plt.ylabel('Income')
plt.xlabel('2012~2017', fontsize=12)
plt.show()
2. The file high-speed rail.csv holds data on high-speed rail in various countries, formatted as in Table 4-6.
Plot and analyse the data:
1) a bar chart comparing each country's operating mileage, with China annotated as "Longest" (as in Figure 4-21);
2) a stacked bar chart of each country's current and planned mileage (as in Figure 4-22);
3) a pie chart of each country's share of operating mileage, with the China wedge exploded from the centre (as in Figure 4-23);
4) a scatter map of current mileage, with marker size proportional to mileage.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
#from mpl_toolkits.basemap import Basemap
f = pd.read_csv('C:\\Users\\14401\\Downloads\\high-speed rail.csv', header='infer', index_col='Country')
plt.figure()
f1 = f['Operation']
f1.plot(kind='bar', title='Operation Mileage', use_index=True, rot=45)
plt.ylabel('Mileage')
plt.xlabel('Country')
plt.annotate('Longest!', color='red', xy=(0, 20000), xytext=(1, 20000), arrowprops=dict(arrowstyle='->', color='red'))  # China is the first bar
plt.show()
plt.figure()
f2 = f[['Operation', 'Under-construction', 'Planning']]
f2.plot(kind='barh', title='Global trends of high-speed rail', stacked=True)
plt.ylabel('Country')
plt.xlabel('Mileage(km)')
plt.show()
plt.figure()
f1.plot(kind='pie', subplots=True, startangle=60, title='Operation Mileage', explode=[0.1, 0, 0, 0, 0, 0], autopct='%1.1f%%')  # explode the first (China) wedge
plt.ylabel('')
plt.show()
# the scatter map needs Basemap, which is left commented out because it is hard to install
#plt.figure()
#map1 = Basemap(projection='cyl',llcrnrlon=0,urcrnrlon=150,llcrnrlat=0,urcrnrlat=55)
#map1.drawcountries()
#map1.drawcoastlines()
#lat=np.array(f["Latitude"])
#lon=np.array(f['Longitude'])
#map1.scatter(lon,lat,s=(f['Operation']/np.sum(f['Operation']))*10000)
#plt.show()
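Since Basemap is awkward to install, a plain longitude/latitude scatter is a rough substitute for step 4 (assuming the CSV has the Longitude and Latitude columns the commented-out code expects):

plt.figure()
plt.scatter(f['Longitude'], f['Latitude'], s=(f['Operation'] / f['Operation'].sum()) * 10000)  # marker size = share of total mileage
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Operation mileage by location')
plt.show()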
3. The file bankpep.csv holds basic information about bank customers, formatted as in Table 4-7.
Explore the customer data with the following plots:
1) a histogram and density plot of customer age (Figure 4-24);
2) a scatter plot of age against income (Figure 4-25);
3) a scatter matrix of age, income and number of children, with histograms on the diagonal (Figure 4-26);
4) a bar chart of mean income by region, with the standard deviation shown as error bars (Figure 4-27);
5) subplots of three pie charts: accounts by sex, car owners by sex, and accounts by number of children (Figure 4-28);
6) box plots of income by sex (Figure 4-29).
The tables and figures are in the files provided with exercise 1.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
f = pd.read_csv('C:\\Users\\14401\\Downloads\\bankpep.csv', index_col='id', header='infer')
plt.figure()
f['age'].plot(kind='hist', bins=10, title='Customer age', xticks=[0, 20, 40, 60, 80], xlim=[-20, 100], density=True)  # density=True replaces the removed normed=True
f['age'].plot(kind='kde', title='Customer age')  # density curve on the same axes
plt.xlabel('age')
plt.show()
plt.figure()
f.plot(kind='scatter', x='age', y='income', title='Customer Income', grid=True, xlim=[0, 80], ylim=[0, 70000], yticks=[10000, 20000, 30000, 40000, 50000, 60000], label='(age,income)')
plt.ylabel('Income')
plt.xlabel('age')
plt.legend(loc='upper left')
plt.show()
plt.figure()
data = f[['age', 'income', 'children']]
pd.plotting.scatter_matrix(data, diagonal='hist', color='m')  # histograms on the diagonal
plt.show()
plt.figure()
mean = f.groupby(['region']).aggregate({'income': np.mean})
std = f.groupby(['region']).aggregate({'income': np.std})
plt.subplots_adjust(wspace=0.6)
mean.plot(kind='bar', yerr=std, title='Customer Income', color='r', rot=45)  # error bars show the standard deviation
fig = plt.figure(figsize=(8, 6))
ax1 = fig.add_subplot(2, 2, 1)
f1 = f["sex"].value_counts()
f1.plot(kind='pie', startangle=45, title='Customer Sex', autopct='%1.1f%%', ax=ax1)
ax2 = fig.add_subplot(2, 2, 2)
f21 = f.loc[f['car'] == 'YES', :]
f2 = f21["sex"].value_counts()
f2.plot(kind='pie', startangle=45, title='Customer Car Sex', autopct='%1.1f%%', ax=ax2)
ax3 = fig.add_subplot(2, 2, 3)
f3 = f["children"].value_counts()
f3.plot(kind='pie', startangle=45, title='Customer Children', autopct='%1.1f%%', ax=ax3)
plt.show()
plt.figure()
f4 = f[['sex', 'income']]
f4.boxplot(by='sex', fontsize='small', figsize=(7, 6))  # income distribution for each sex
#plt.title('boxplot grouped by sex')
plt.show()
1. The Energy Efficiency dataset (ENB2012_data.xlsx, ENB2012.names) records the heating and cooling energy consumption of different buildings.
It has 768 records, 8 feature attributes and two target values; see ENB2012.names for details.
1) Train a linear regression model on the full dataset to predict heating energy consumption, and compute the model's RMSE and R2;
2) split the data into training and test sets, train a linear regression model on the training set, and analyse the model's performance on both sets.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import model_selection
from sklearn import metrics
data = pd.read_excel('C:\\Users\\14401\\Downloads\\ENB2012_data.xlsx', index_col=None)
# regress the heating load on the 8 feature attributes and measure the error
X = data.iloc[:, 0:8].values.astype(float)
y = data.iloc[:, 8].values.astype(float)  # column 8 is Y1, the heating load (column 9 is the cooling load)
linreg = LinearRegression()
linreg.fit(X, y)  # train on the full dataset
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.35, random_state=1)
linregTr = LinearRegression()  # a second model, trained on the training set only
linregTr.fit(X_train, y_train)
# performance of the model trained on the full dataset
predict_score1 = linreg.score(X_test, y_test)
print('The R2 of the model trained on all data is: {:.2f}'.format(predict_score1))
y_test_pred1 = linreg.predict(X_test)
test_err1 = np.sqrt(metrics.mean_squared_error(y_test, y_test_pred1))  # RMSE is the square root of the MSE
print('The RMSE of the model trained on all data: {:.2f}'.format(test_err1))
# performance of the model trained on the training set
y_train_pred = linregTr.predict(X_train)
y_test_pred = linregTr.predict(X_test)
train_score = linregTr.score(X_train, y_train)
test_score = linregTr.score(X_test, y_test)
print('The R2 on train is {:.2f}, on test is {:.2f}'.format(train_score, test_score))
train_err = np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))
test_err = np.sqrt(metrics.mean_squared_error(y_test, y_test_pred))
print('The RMSE on train is {:.2f}, on test is {:.2f}'.format(train_err, test_err))
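A single train/test split gives a noisy estimate; averaging R2 over k folds is steadier. A sketch with cv=5 (my addition, not in the submission):

scores = model_selection.cross_val_score(LinearRegression(), X, y, scoring='r2', cv=5)
print('Mean R2 over 5 folds: {:.2f}'.format(scores.mean()))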
1. On the bankpep dataset, split training and test sets and build classification models.
1) Build a decision-tree classifier and record its performance on the test set;
2) learn on your own how to build naive Bayes and support vector machine classifiers, and record their performance on the test set;
3) train classifiers with a gradient boosting machine and with XGBoost, and compare them with the results of steps 1 and 2.
4) Submit the source code and an analysis report describing the dataset used, its features and size,
every method tried and the results obtained, and finish by plotting a comparison of the methods' performance.
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier
from xgboost.sklearn import XGBClassifier
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
filename = 'C:\\Users\\14401\\Downloads\\bankpep.csv'
data1 = pd.read_csv(filename, index_col=0, header=0)
# encode the yes/no columns, sex and region as integers
seq1 = ['married', 'car', 'save_act', 'current_act', 'mortgage', 'pep']
for feature in seq1:
    data1.loc[data1[feature] == 'YES', feature] = 1
    data1.loc[data1[feature] == 'NO', feature] = 0
data1.loc[data1['sex'] == 'FEMALE', 'sex'] = 0
data1.loc[data1['sex'] == 'MALE', 'sex'] = 1
data1.loc[data1['region'] == 'INNER_CITY', 'region'] = 1
data1.loc[data1['region'] == 'RURAL', 'region'] = 2
data1.loc[data1['region'] == 'TOWN', 'region'] = 3
data1.loc[data1['region'] == 'SUBURBAN', 'region'] = 4
X1 = data1.iloc[:, 0:10].values.astype(float)  # all 10 feature columns (0:9 would silently drop 'mortgage')
y1 = data1.iloc[:, 10].values.astype(int)  # 'pep' is the label
X_train1, X_test1, y_train1, y_test1 = model_selection.train_test_split(X1, y1, test_size=0.3, random_state=1)
clf1 = tree.DecisionTreeClassifier()
clf1 = clf1.fit(X_train1, y_train1)
print('Decision tree accuracy on the test set: {:.2f}'.format(clf1.score(X_test1, y_test1)))
d1 = clf1.score(X_test1, y_test1)
model = GaussianNB()
model.fit(X_train1, y_train1)
print("Naive Bayes accuracy on the test set: {:.2f}".format(model.score(X_test1, y_test1)))
d2 = model.score(X_test1, y_test1)
# for the SVM, one-hot encode region and children instead of using ordinal codes
data2 = pd.read_csv(filename, index_col='id')
seq = ['married', 'car', 'save_act', 'current_act', 'mortgage', 'pep']
for feature in seq:
    data2.loc[data2[feature] == 'YES', feature] = 1
    data2.loc[data2[feature] == 'NO', feature] = 0
data2.loc[data2['sex'] == 'FEMALE', 'sex'] = 1
data2.loc[data2['sex'] == 'MALE', 'sex'] = 0
dumm_reg = pd.get_dummies(data2['region'], prefix='region')
dumm_child = pd.get_dummies(data2['children'], prefix='children')
df1 = data2.drop(['region', 'children'], axis=1)
df2 = df1.join([dumm_reg, dumm_child], how='outer')
X3 = df2.drop(['pep'], axis=1).values.astype(float)
y3 = df2['pep'].values.astype(int)
X_train3, X_test3, y_train3, y_test3 = model_selection.train_test_split(X3, y3, test_size=0.3, random_state=1)
clf3 = svm.SVC(kernel='rbf', gamma=0.7, C=0.001)
clf3.fit(X_train3, y_train3)
print("SVM accuracy on the test set: {:.2f}".format(clf3.score(X_test3, y_test3)))
d3 = clf3.score(X_test3, y_test3)
X_train4, X_test4, y_train4, y_test4 = model_selection.train_test_split(X3, y3, test_size=0.3, random_state=1)
clf4 = GradientBoostingClassifier()
clf4.fit(X_train4, y_train4)
print('Gradient boosting machine accuracy on the test set: {:.2f}'.format(clf4.score(X_test4, y_test4)))
d4 = clf4.score(X_test4, y_test4)
clf5 = XGBClassifier(silent=0, max_depth=6, gamma=0, subsample=1, colsample_bytree=1)
clf5.fit(X_train4, y_train4)
print('XGBoost accuracy on the test set: {:.2f}'.format(clf5.score(X_test4, y_test4)))
d5 = clf5.score(X_test4, y_test4)
plt.figure()
data = [d1, d2, d3, d4, d5]
f = DataFrame(data, columns=['score'], index=['tree', 'Naive Bayes', 'svm', 'gbrt', 'xgboost'])
f.plot(kind='bar', title='accuracy on the test set', use_index=True, rot=45)
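Accuracy alone hides how each class fares; sklearn's classification_report adds per-class precision and recall, shown here for the decision tree as an example (my addition):

from sklearn.metrics import classification_report
print(classification_report(y_test1, clf1.predict(X_test1)))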
Score: 94/100
1. The wine dataset (wine.data) collects chemical measurements of wines from different French regions. Build decision-tree,
SVM and neural-network classifiers and compare their performance on this dataset.
import pandas as pd
import numpy as np
from pandas import DataFrame
data = pd.read_csv('C:\\Users\\14401\\Downloads\\wine.data', header=None)  # the file has no header row; without header=None the first record is lost
data.columns = [str(i) for i in range(14)]
X = data.iloc[:, 1:14].values.astype(float)  # the 13 chemical features
y = data.iloc[:, 0].values  # column 0 is the class label
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.35, random_state=1)
from sklearn import tree
clf1 = tree.DecisionTreeClassifier()
clf1.fit(X_train, y_train)
s1 = clf1.score(X_test, y_test)
print('Decision tree accuracy on the test set: {:.2f}'.format(s1))
from sklearn import svm
clf2 = svm.SVC(kernel='linear', gamma=0.6, C=100)
clf2.fit(X_train, y_train)
s2 = clf2.score(X_test, y_test)
print('SVM accuracy on the test set: {:.2f}'.format(s2))
from sklearn.neural_network import MLPClassifier
clf3 = MLPClassifier(solver='lbfgs', activation='identity', alpha=1e-5, hidden_layer_sizes=(9, 9), random_state=1)
clf3.fit(X_train, y_train)
s3 = clf3.score(X_test, y_test)
print('Neural network accuracy on the test set: {:.2f}'.format(s3))
import matplotlib.pyplot as plt
plt.figure()
data = [s1, s2, s3]
f = DataFrame(data, columns=['score'], index=['tree', 'svm', 'MLP'])
f.plot(kind='bar', title='accuracy on the test set', use_index=True, rot=45)
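The wine features span very different numeric ranges, which hurts the SVM and the MLP in particular; standardizing the features first usually helps both. A sketch, not part of the graded answer:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)  # fit the scaler on the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
clf2s = svm.SVC(kernel='linear', gamma=0.6, C=100).fit(X_train_s, y_train)
print('SVM on standardized features: {:.2f}'.format(clf2s.score(X_test_s, y_test)))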
2. Build a deep neural-network model with Keras, train it as a classifier on the bankpep dataset, and compare its training
time and performance with XGBoost, SVM and naive Bayes. (The solution below uses scikit-learn's MLPClassifier in place of a Keras network.)
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from xgboost.sklearn import XGBClassifier
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
import datetime
filename = 'C:\\Users\\14401\\Downloads\\bankpep.csv'
data1 = pd.read_csv(filename, index_col=0, header=0)
# encode the yes/no columns, sex and region as integers (same preprocessing as before)
seq1 = ['married', 'car', 'save_act', 'current_act', 'mortgage', 'pep']
for feature in seq1:
    data1.loc[data1[feature] == 'YES', feature] = 1
    data1.loc[data1[feature] == 'NO', feature] = 0
data1.loc[data1['sex'] == 'FEMALE', 'sex'] = 0
data1.loc[data1['sex'] == 'MALE', 'sex'] = 1
data1.loc[data1['region'] == 'INNER_CITY', 'region'] = 1
data1.loc[data1['region'] == 'RURAL', 'region'] = 2
data1.loc[data1['region'] == 'TOWN', 'region'] = 3
data1.loc[data1['region'] == 'SUBURBAN', 'region'] = 4
X1 = data1.iloc[:, 0:10].values.astype(float)  # all 10 feature columns
y1 = data1.iloc[:, 10].values.astype(int)  # 'pep' is the label
X_train1, X_test1, y_train1, y_test1 = model_selection.train_test_split(X1, y1, test_size=0.3, random_state=1)
start1 = datetime.datetime.now()
clf1 = MLPClassifier(solver='lbfgs', activation='identity', alpha=1e-5, hidden_layer_sizes=(9, 9), random_state=1)
clf1.fit(X_train1, y_train1)
end1 = datetime.datetime.now()
d1 = clf1.score(X_test1, y_test1)
s1 = end1 - start1
print("Neural network training time:", s1)
print('Neural network accuracy on the test set: {:.2f}'.format(d1))
start2 = datetime.datetime.now()
clf2 = GaussianNB()
clf2.fit(X_train1, y_train1)
end2 = datetime.datetime.now()
d2 = clf2.score(X_test1, y_test1)
s2 = end2 - start2
print("Naive Bayes training time:", s2)
print("Naive Bayes accuracy on the test set: {:.2f}".format(d2))
# for the SVM and XGBoost, one-hot encode region and children
data2 = pd.read_csv(filename, index_col='id')
seq = ['married', 'car', 'save_act', 'current_act', 'mortgage', 'pep']
for feature in seq:
    data2.loc[data2[feature] == 'YES', feature] = 1
    data2.loc[data2[feature] == 'NO', feature] = 0
data2.loc[data2['sex'] == 'FEMALE', 'sex'] = 1
data2.loc[data2['sex'] == 'MALE', 'sex'] = 0
dumm_reg = pd.get_dummies(data2['region'], prefix='region')
dumm_child = pd.get_dummies(data2['children'], prefix='children')
df1 = data2.drop(['region', 'children'], axis=1)
df2 = df1.join([dumm_reg, dumm_child], how='outer')
X3 = df2.drop(['pep'], axis=1).values.astype(float)
y3 = df2['pep'].values.astype(int)
X_train3, X_test3, y_train3, y_test3 = model_selection.train_test_split(X3, y3, test_size=0.3, random_state=1)
start3 = datetime.datetime.now()
clf3 = svm.SVC(kernel='rbf', gamma=0.7, C=0.001)
clf3.fit(X_train3, y_train3)
end3 = datetime.datetime.now()
d3 = clf3.score(X_test3, y_test3)
s3 = end3 - start3
print("SVM training time:", s3)
print("SVM accuracy on the test set: {:.2f}".format(d3))
start4 = datetime.datetime.now()
clf4 = XGBClassifier(silent=0, max_depth=6, gamma=0, subsample=1, colsample_bytree=1)
clf4.fit(X_train3, y_train3)
end4 = datetime.datetime.now()
d4 = clf4.score(X_test3, y_test3)
s4 = end4 - start4
print("XGBoost training time:", s4)
print('XGBoost accuracy on the test set: {:.2f}'.format(d4))
plt.figure()
data = [d1, d2, d3, d4]
f = DataFrame(data, columns=['score'], index=['MLP', 'Naive Bayes', 'svm', 'xgboost'])
f.plot(kind='bar', title='accuracy on the test set', use_index=True, rot=45)
plt.figure()
data = [t.total_seconds() for t in (s1, s2, s3, s4)]  # convert timedeltas to seconds so they can be plotted
f = DataFrame(data, columns=['time'], index=['MLP', 'Naive Bayes', 'svm', 'xgboost'])
f.plot(kind='bar', title='training time (seconds)', use_index=True, rot=45)
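For timing short fits, time.perf_counter is a lighter alternative to datetime arithmetic and returns seconds directly; a sketch that re-times the naive Bayes fit:

import time
t0 = time.perf_counter()
clf2.fit(X_train1, y_train1)  # refit purely to illustrate the timing pattern
print('fit took {:.4f} s'.format(time.perf_counter() - t0))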