sklearn_SVM: SVC real-world case study: weather prediction (study notes from the Caicai video course)

SVC real-world case study: weather prediction

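    • 1. Import libraries and data, explore the features
    • 2. Split into training and test sets, explore the labels first
    • 3. Explore the features
      • 3.1 Descriptive statistics and outliers
      • 3.2 Handling a difficult feature: date
      • 3.3 Handling a difficult feature: location
      • 3.4 Handling categorical variables: missing values
      • 3.5 Handling categorical variables: encoding
      • 3.6 Handling continuous variables: filling missing values
      • 3.7 Handling continuous variables: scaling
    • 4. Modeling and model evaluation
    • 5. Model tuning
      • 5.1 Maximizing recall
      • 5.2 Maximizing accuracy
      • 5.3 Balancing precision and recall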

1. Import libraries and data, explore the features

# Weather-prediction dataset: 5000 rows, 21 features
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
weather = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_weatherAUS5000.csv",index_col=0)
weather.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
0 2015-03-24 Adelaide 12.3 19.3 0.0 5.0 NaN S 39.0 S ... 19.0 59.0 47.0 1022.2 1021.4 NaN NaN 15.1 17.7 No
1 2011-07-12 Adelaide 7.9 11.4 0.0 1.0 0.5 N 20.0 NNE ... 7.0 70.0 59.0 1028.7 1025.7 NaN NaN 8.4 11.3 No
2 2010-02-08 Adelaide 24.0 38.1 0.0 23.4 13.0 SE 39.0 NNE ... 19.0 36.0 24.0 1018.0 1016.0 NaN NaN 32.4 37.4 No
3 2016-09-19 Adelaide 6.7 16.4 0.4 NaN NaN N 31.0 N ... 15.0 65.0 40.0 1014.4 1010.0 NaN NaN 11.2 15.9 No
4 2014-03-05 Adelaide 16.7 24.8 0.0 6.6 11.7 S 37.0 S ... 24.0 61.0 48.0 1019.3 1018.9 NaN NaN 20.8 23.7 No

5 rows × 22 columns

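# Separate the feature matrix X from the label Y
X = weather.iloc[:,:-1]
Y = weather.iloc[:,-1]
# Jupyter shortcut to split a cell: Ctrl+Shift+-

# Jupyter shortcut to merge cells: Shift+M
X.shape  # 5000 randomly sampled rows
(5000, 21)
# Inspect the data types
X.info()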

Int64Index: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           5000 non-null   object 
 1   Location       5000 non-null   object 
 2   MinTemp        4979 non-null   float64
 3   MaxTemp        4987 non-null   float64
 4   Rainfall       4950 non-null   float64
 5   Evaporation    2841 non-null   float64
 6   Sunshine       2571 non-null   float64
 7   WindGustDir    4669 non-null   object 
 8   WindGustSpeed  4669 non-null   float64
 9   WindDir9am     4651 non-null   object 
 10  WindDir3pm     4887 non-null   object 
 11  WindSpeed9am   4949 non-null   float64
 12  WindSpeed3pm   4919 non-null   float64
 13  Humidity9am    4936 non-null   float64
 14  Humidity3pm    4880 non-null   float64
 15  Pressure9am    4506 non-null   float64
 16  Pressure3pm    4504 non-null   float64
 17  Cloud9am       3111 non-null   float64
 18  Cloud3pm       3012 non-null   float64
 19  Temp9am        4967 non-null   float64
 20  Temp3pm        4912 non-null   float64
dtypes: float64(16), object(5)
memory usage: 859.4+ KB
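# Explore the missing values
X.isnull().mean()  # fraction of missing values per column: isnull().sum() (the number of True values) / X.shape[0]
# Different columns will need different imputation strategies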
Date             0.0000
Location         0.0000
MinTemp          0.0042
MaxTemp          0.0026
Rainfall         0.0100
Evaporation      0.4318
Sunshine         0.4858
WindGustDir      0.0662
WindGustSpeed    0.0662
WindDir9am       0.0698
WindDir3pm       0.0226
WindSpeed9am     0.0102
WindSpeed3pm     0.0162
Humidity9am      0.0128
Humidity3pm      0.0240
Pressure9am      0.0988
Pressure3pm      0.0992
Cloud9am         0.3778
Cloud3pm         0.3976
Temp9am          0.0066
Temp3pm          0.0176
dtype: float64
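The equivalence noted in the comment above can be checked on a small frame. This is a minimal sketch with made-up values; the 0.3 threshold used to group the columns is an arbitrary assumption for illustration, not something taken from the course.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the weather features
toy = pd.DataFrame({
    "MinTemp": [12.3, np.nan, 24.0, 6.7],
    "Sunshine": [np.nan, 0.5, np.nan, np.nan],
    "Location": ["Adelaide", "Adelaide", "Albany", "Sale"],
})

# The mean of the boolean mask is the fraction of missing values per column
ratio_mean = toy.isnull().mean()
# The same quantity computed explicitly: count of True divided by the row count
ratio_manual = toy.isnull().sum() / toy.shape[0]

# Group columns by how much is missing (the 0.3 cut-off is an arbitrary example)
mostly_complete = ratio_mean[ratio_mean < 0.3].index.tolist()
heavily_missing = ratio_mean[ratio_mean >= 0.3].index.tolist()
print(ratio_mean)
```

Grouping columns this way is one possible starting point for choosing a different imputation strategy per group.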
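# Jupyter shortcut to add a new cell above: Esc, A, Enter
# Jupyter shortcut to add a new cell below: Esc, B, Enter
# Jupyter shortcut to delete a cell: Esc, D, D (or Esc, X to cut it)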
Y.shape
(5000,)
Y.isnull().sum()  # when summing, True counts as 1 and False as 0
0
# Explore the label classes by extracting the distinct values in the label
np.unique(Y)  # the label is binary
array(['No', 'Yes'], dtype=object)

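2. Split into training and test sets, explore the labels first

# Split into training and test sets
# Splitting first keeps the test set from influencing how the model is trained
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.3,random_state=420)  # random sampling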
Xtrain.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
1809 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 17.0 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN
4176 2016-12-10 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 7.0 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6
110 2010-04-18 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 17.0 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8
3582 2009-11-26 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 11.0 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5
2162 2014-04-25 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4

5 rows × 21 columns

# Reset the index
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
# The index now equals the row position
Xtrain.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
0 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 17.0 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN
1 2016-12-10 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 7.0 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6
2 2010-04-18 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 17.0 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8
3 2009-11-26 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 11.0 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5
4 2014-04-25 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4

5 rows × 21 columns
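An alternative to assigning a new range to `.index` is `reset_index(drop=True)`, which returns new objects instead of modifying the originals. The sketch below uses a hypothetical toy frame (`X_toy`, `y_toy`) rather than the weather data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features, a binary label
X_toy = pd.DataFrame({"a": range(10), "b": range(10, 20)})
y_toy = pd.Series([0, 1] * 5, name="label")

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=420)

# After the split the original row labels are shuffled;
# reset_index(drop=True) gives each piece a fresh 0..n-1 index
Xtr, Xte, ytr, yte = [obj.reset_index(drop=True) for obj in (Xtr, Xte, ytr, yte)]
print(Xtr.index.tolist(), Xte.index.tolist())
```

Either approach keeps features and labels aligned, because both pieces of each split are renumbered in the same row order.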

Ytrain.head()
0     No
1     No
2     No
3    Yes
4     No
Name: RainTomorrow, dtype: object
# Is there a class-imbalance problem?
Ytrain.value_counts()
No     2704
Yes     796
Name: RainTomorrow, dtype: int64
Ytest.value_counts()
No     1157
Yes     343
Name: RainTomorrow, dtype: int64
# There is a mild class imbalance
Ytrain.value_counts().iloc[0]/Ytrain.value_counts().iloc[1]  # positional access: count of "No" divided by count of "Yes"
3.3969849246231156
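The ratio above can also be read by label rather than by position, and `value_counts(normalize=True)` gives the class shares directly. A small sketch that rebuilds a Series with the same class counts as the training labels shown above (2704 "No", 796 "Yes"); the Series itself is synthetic.

```python
import pandas as pd

# Synthetic label Series with the class counts reported for Ytrain
y = pd.Series(["No"] * 2704 + ["Yes"] * 796, name="RainTomorrow")

counts = y.value_counts()
ratio = counts["No"] / counts["Yes"]        # majority-to-minority ratio
share = y.value_counts(normalize=True)      # fraction of each class
print(round(ratio, 4), share.round(4).to_dict())
```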
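# Encode the labels
from sklearn.preprocessing import LabelEncoder  # meant for labels; covered in chapter 3 of the course
encorder = LabelEncoder().fit(Ytrain)  # accepts 1-D input; most other preprocessing classes do not
# After fitting, encorder knows there are two classes, Yes and No: Yes becomes 1 and No becomes 0
# Fit on the training set only, then transform the training and test sets separately
Ytrain = pd.DataFrame(encorder.transform(Ytrain))
Ytest = pd.DataFrame(encorder.transform(Ytest))

# If the test set contains a label class that never appeared in the training set, the encoder has to be refitted
# For example, the test set contains Yes, No and Unknown
# while the training set contains only Yes and No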
Ytrain
0
0 0
1 0
2 0
3 1
4 0
... ...
3495 0
3496 1
3497 0
3498 0
3499 0

3500 rows × 1 columns

Ytest.head()
0
0 0
1 0
2 1
3 0
4 0
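The point about unseen label classes can be illustrated on toy labels. This sketch assumes a hypothetical third class "Unknown" in the test labels, as in the comment above; `LabelEncoder.transform` raises a `ValueError` in that case.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical): the training set only contains "No" and "Yes"
y_train = pd.Series(["No", "Yes", "No", "No"])
y_test_ok = pd.Series(["Yes", "No"])
y_test_bad = pd.Series(["Yes", "Unknown"])

enc = LabelEncoder().fit(y_train)
classes = list(enc.classes_)                     # classes are stored in sorted order
encoded_test = enc.transform(y_test_ok).tolist()

# Transforming a label that was never seen during fit raises ValueError
try:
    enc.transform(y_test_bad)
    unseen_raised = False
except ValueError:
    unseen_raised = True

print(classes, encoded_test, unseen_raised)
```

Because the classes are sorted, "No" maps to 0 and "Yes" maps to 1, which matches the encoded output shown above.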

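3. Explore the features

3.1 Descriptive statistics and outliers

Ytrain.to_csv("path/where/you/want/to/save/filename.csv")
# Once the processing above is confirmed to be correct, save the data so that a later mistake does not force re-running all the code
# Descriptive statistics
Xtrain.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T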
count mean std min 1% 5% 10% 25% 50% 75% 90% 99% max
MinTemp 3486.0 12.225645 6.396243 -6.5 -1.715 1.800 4.1 7.7 12.0 16.7 20.9 25.900 29.0
MaxTemp 3489.0 23.245543 7.201839 -3.7 8.888 12.840 14.5 18.0 22.5 28.4 33.0 40.400 46.4
Rainfall 3467.0 2.487049 7.949686 0.0 0.000 0.000 0.0 0.0 0.0 0.8 6.6 41.272 115.8
Evaporation 1983.0 5.619163 4.383098 0.0 0.400 0.800 1.4 2.6 4.8 7.4 10.2 20.600 56.0
Sunshine 1790.0 7.508659 3.805841 0.0 0.000 0.345 1.4 4.6 8.3 10.6 12.0 13.300 13.9
WindGustSpeed 3263.0 39.858413 13.219607 9.0 15.000 20.000 24.0 31.0 39.0 48.0 57.0 76.000 117.0
WindSpeed9am 3466.0 14.046163 8.670472 0.0 0.000 0.000 4.0 7.0 13.0 19.0 26.0 37.000 65.0
WindSpeed3pm 3437.0 18.553390 8.611818 0.0 2.000 6.000 7.0 13.0 19.0 24.0 30.0 43.000 65.0
Humidity9am 3459.0 69.069095 18.787698 2.0 18.000 35.000 45.0 57.0 70.0 83.0 94.0 100.000 100.0
Humidity3pm 3408.0 51.651995 20.697872 2.0 9.000 17.000 23.0 37.0 52.0 66.0 79.0 98.000 100.0
Pressure9am 3154.0 1017.622067 7.065236 985.1 1000.506 1006.100 1008.9 1012.8 1017.6 1022.3 1027.0 1033.247 1038.1
Pressure3pm 3154.0 1015.227077 7.032531 980.2 998.000 1004.000 1006.5 1010.3 1015.2 1020.0 1024.4 1030.800 1036.0
Cloud9am 2171.0 4.491939 2.858781 0.0 0.000 0.000 1.0 1.0 5.0 7.0 8.0 8.000 8.0
Cloud3pm 2095.0 4.603819 2.655765 0.0 0.000 0.000 1.0 2.0 5.0 7.0 8.0 8.000 8.0
Temp9am 3481.0 16.989859 6.537552 -5.2 2.400 7.000 9.0 12.2 16.6 21.6 26.0 31.000 38.0
Temp3pm 3431.0 21.719003 7.031199 -4.1 7.460 11.500 13.3 16.6 21.0 26.6 31.4 38.600 45.9
Xtest.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
count mean std min 1% 5% 10% 25% 50% 75% 90% 99% max
MinTemp 1493.0 11.916812 6.375377 -8.5 -2.024 1.600 3.70 7.3 11.8 16.5 20.48 25.316 28.3
MaxTemp 1498.0 22.906809 6.986043 -0.8 9.134 13.000 14.50 17.8 22.4 27.8 32.60 38.303 45.1
Rainfall 1483.0 2.241807 7.988822 0.0 0.000 0.000 0.00 0.0 0.0 0.8 5.20 35.372 108.2
Evaporation 858.0 5.657809 4.105762 0.0 0.400 1.000 1.60 2.8 4.8 7.6 10.40 19.458 38.8
Sunshine 781.0 7.677465 3.862294 0.0 0.000 0.300 1.50 4.7 8.6 10.7 12.20 13.400 13.9
WindGustSpeed 1406.0 40.044097 14.027052 9.0 15.000 20.000 24.00 30.0 39.0 48.0 57.00 78.000 122.0
WindSpeed9am 1483.0 13.986514 9.124337 0.0 0.000 0.000 4.00 7.0 13.0 20.0 26.00 39.360 72.0
WindSpeed3pm 1482.0 18.601215 8.850446 0.0 2.000 6.000 7.00 13.0 19.0 24.0 31.00 43.000 56.0
Humidity9am 1477.0 68.688558 18.876448 4.0 20.000 36.000 44.00 57.0 69.0 82.0 95.00 100.000 100.0
Humidity3pm 1472.0 51.431386 20.459957 2.0 8.710 18.000 23.00 37.0 52.0 66.0 78.00 96.290 100.0
Pressure9am 1352.0 1017.763536 6.910275 988.5 1000.900 1006.255 1008.61 1013.2 1017.8 1022.3 1026.50 1033.449 1038.2
Pressure3pm 1350.0 1015.397926 6.916976 986.2 999.198 1003.900 1006.49 1010.9 1015.4 1020.0 1024.20 1031.151 1036.9
Cloud9am 940.0 4.494681 2.870468 0.0 0.000 0.000 1.00 1.0 5.0 7.0 8.00 8.000 8.0
Cloud3pm 917.0 4.403490 2.731969 0.0 0.000 0.000 1.00 2.0 5.0 7.0 8.00 8.000 8.0
Temp9am 1486.0 16.751817 6.339816 -5.3 2.370 6.725 9.00 12.1 16.5 21.3 25.45 30.200 35.1
Temp3pm 1481.0 21.483660 6.770567 -1.2 8.540 11.800 13.30 16.5 20.9 26.2 30.90 37.400 42.9
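# If there are outliers, first check how often they occur
# If an outlier appears only once, it is most likely an input error and can simply be deleted
# If an outlier appears many times, talk to the people who own the data; outliers caused by human error are not worth keeping
# If outliers make up around 10% of the data, the dataset may not be usable as it stands
# The fallback is to replace outliers with a value that is neither abnormal nor distorting, for example 0, or to treat them as missing and fill them the way missing values are filled (for example with the mean)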
Xtrain.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
0 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 17.0 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN
1 2016-12-10 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 7.0 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6
2 2010-04-18 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 17.0 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8
3 2009-11-26 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 11.0 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5
4 2014-04-25 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4

5 rows × 21 columns

type(Xtrain.iloc[0,0])  # the dates are stored as strings
str
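The advice above starts from counting how often outliers occur. The notes do not say how an outlier is defined, so the sketch below assumes the common 1.5 × IQR rule purely for illustration, applied to a made-up temperature column containing one obvious entry error.

```python
import pandas as pd

# Hypothetical temperature readings; 99.0 plays the role of an input error
temps = pd.Series([12.3, 7.9, 24.0, 6.7, 16.7, 17.5, 9.5, 13.0, 13.9, 99.0])

# Assumed rule: values further than 1.5 * IQR from the quartiles are outliers
q1, q3 = temps.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (temps < lower) | (temps > upper)
n_outliers = int(is_outlier.sum())
share = n_outliers / len(temps)     # compare against the rough 10% guideline
print(n_outliers, share)
```

A single flagged value, as here, would fall under the "appears only once" case and could simply be dropped.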

3.2 Handling a difficult feature: date

Xtrainc = Xtrain.copy()
Xtrainc.sort_values(by="Location")
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
2796 2015-03-24 Adelaide 12.3 19.3 0.0 5.0 NaN S 39.0 S ... 13.0 19.0 59.0 47.0 1022.2 1021.4 NaN NaN 15.1 17.7
2975 2012-08-17 Adelaide 7.8 13.2 17.6 0.8 NaN SW 61.0 SW ... 20.0 28.0 76.0 47.0 1012.5 1014.7 NaN NaN 8.3 12.5
775 2013-03-16 Adelaide 17.4 23.8 NaN NaN 9.7 SSE 46.0 S ... 9.0 19.0 63.0 57.0 1019.9 1020.5 NaN NaN 19.1 20.7
861 2011-07-12 Adelaide 7.9 11.4 0.0 1.0 0.5 N 20.0 NNE ... 7.0 7.0 70.0 59.0 1028.7 1025.7 NaN NaN 8.4 11.3
2906 2015-08-24 Adelaide 9.2 14.3 0.0 NaN NaN SE 48.0 SE ... 17.0 19.0 64.0 42.0 1024.7 1024.1 NaN NaN 9.9 13.4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2223 2009-05-08 Woomera 9.2 20.6 0.0 5.2 10.4 ESE 37.0 SE ... 19.0 19.0 64.0 34.0 1030.5 1026.9 0.0 1.0 13.7 20.1
1984 2014-05-26 Woomera 15.5 23.6 0.0 24.0 NaN NNW 43.0 NNE ... 9.0 26.0 49.0 37.0 1014.2 1010.3 7.0 7.0 18.0 21.5
1592 2012-01-10 Woomera 16.8 26.7 0.0 10.0 5.3 SW 46.0 S ... 20.0 22.0 52.0 33.0 1019.1 1016.8 4.0 6.0 18.3 24.9
2824 2015-11-03 Woomera 16.2 28.5 7.8 4.2 4.5 WSW 80.0 NE ... 26.0 50.0 76.0 53.0 1009.6 1006.8 6.0 7.0 20.5 26.2
1005 2010-05-14 Woomera 3.9 19.3 0.0 5.8 10.5 NE 33.0 ENE ... 15.0 13.0 43.0 19.0 1020.2 1016.4 1.0 1.0 11.5 18.5

3500 rows × 21 columns

# Decide what type of data the date is
# Check whether dates repeat
# No repeats: continuous
# Repeats: discrete data with too many categories (beyond 5000)
Xtrain.iloc[:,0].value_counts()
2015-10-12    6
2014-05-16    6
2015-07-03    6
2009-03-30    5
2016-09-07    5
             ..
2010-06-14    1
2013-12-01    1
2009-01-18    1
2014-11-24    1
2014-04-04    1
Name: Date, Length: 2141, dtype: int64
# Data recorded at different locations on the same date
Xtrain.loc[Xtrain.iloc[:,0] == "2015-08-24",:]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
0 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 17.0 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN
2906 2015-08-24 Adelaide 9.2 14.3 0.0 NaN NaN SE 48.0 SE ... 17.0 19.0 64.0 42.0 1024.7 1024.1 NaN NaN 9.9 13.4

2 rows × 21 columns

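# First, the date is not unique: dates repeat
# Second, after splitting into training and test sets the dates are no longer continuous but scattered
# Does a particular day of a particular year tend to be rainy, or tend to be dry?
# It is not the date that determines whether it rains; sunshine hours, humidity, temperature and similar factors on that day matter far more
# Looking at the date alone, it does not seem to have a direct effect on the prediction
# If we treat it as a continuous variable, the algorithm will see it as a series of numbers from roughly 1 to 3000 and will not realize these are dates
Xtrain.iloc[:,0].value_counts().count()
# If we treat it as a categorical variable, there are too many classes (2141). Converted to numbers it would simply be treated as continuous, and turned into dummy variables it would blow up the feature dimension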
2141
Xtrain["Rainfall"].head(20)
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      0.0
6      0.0
7      0.2
8      0.0
9      0.2
10     1.0
11     0.0
12     0.2
13     0.0
14     0.0
15     3.0
16     0.2
17     0.0
18    35.2
19     0.0
Name: Rainfall, dtype: float64
Xtrain["Rainfall"].isnull().sum()
# Rows with Rainfall below 1 are treated as no rain
# Missing Rainfall values are carried over as missing
33
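# Read Rainfall, compare it with 1, and assign the result to the new RainToday column
Xtrain.loc[Xtrain.loc[:,"Rainfall"] >= 1,"RainToday"] = "Yes"
Xtrain.loc[Xtrain.loc[:,"Rainfall"] < 1,"RainToday"] = "No"
# NaN never compares equal to anything, including itself, so use isnull() rather than == np.nan
Xtrain.loc[Xtrain.loc[:,"Rainfall"].isnull(),"RainToday"] = np.nan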
Xtrain.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 2016-12-10 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 2010-04-18 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 2009-11-26 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 2014-04-25 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

Xtrain.loc[:,"RainToday"].value_counts()
No     2642
Yes     825
Name: RainToday, dtype: int64
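The counts above add up to 3467, so the 33 rows with missing Rainfall stay missing in RainToday. The sketch below, on a hypothetical four-value Series, shows why a comparison with `np.nan` never selects anything and how `isnull()` finds the missing rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical Rainfall values, one of them missing
rain = pd.Series([0.0, 2.5, np.nan, 0.8], name="Rainfall")

eq_mask = rain == np.nan      # all False: NaN is not equal to anything
null_mask = rain.isnull()     # True exactly where the value is missing

# Build the flag in one pass, then restore NaN where Rainfall was missing
rain_today = pd.Series(np.where(rain >= 1, "Yes", "No"), index=rain.index, dtype=object)
rain_today[null_mask] = np.nan
print(eq_mask.tolist(), null_mask.tolist(), rain_today.tolist())
```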
Xtest.loc[Xtest["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtest.loc[Xtest["Rainfall"] < 1,"RainToday"] = "No"
# As on the training set, use isnull() because == np.nan is never True
Xtest.loc[Xtest["Rainfall"].isnull(),"RainToday"] = np.nan
Xtrain.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 2015-08-24 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 2016-12-10 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 2010-04-18 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 2009-11-26 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 2014-04-25 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

Xtest.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 2016-01-23 NorahHead 22.0 27.8 25.2 NaN NaN SSW 57.0 S ... 37.0 91.0 86.0 1006.6 1008.1 NaN NaN 26.2 23.1 Yes
1 2009-03-05 MountGambier 12.0 18.6 2.2 3.0 7.8 SW 52.0 SW ... 28.0 88.0 62.0 1020.2 1019.9 8.0 7.0 14.8 17.5 Yes
2 2010-03-05 MountGinini 9.1 13.3 NaN NaN NaN NE 41.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2013-10-26 Wollongong 13.1 20.3 0.0 NaN NaN SW 33.0 W ... 24.0 40.0 51.0 1021.3 1019.5 NaN NaN 16.8 19.6 No
4 2016-11-28 Sale 12.2 20.0 0.4 NaN NaN E 33.0 SW ... 19.0 92.0 69.0 1015.6 1013.2 8.0 4.0 13.6 19.0 No

5 rows × 22 columns

Xtrain.loc[0,"Date"].split("-")  # split the string on "-"
['2015', '08', '24']
int(Xtrain.loc[0,"Date"].split("-")[1])  # extract the month
8
Xtrain["Date"] = Xtrain.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
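# apply runs a function over a column of the DataFrame
# lambda x is an anonymous function: for every row in this column, execute the expression after the colon
# It behaves like a loop and is generally faster than iterating over the rows explicitly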
Xtrain.loc[:,"Date"].value_counts()
3     334
5     324
7     316
6     302
9     302
1     300
11    299
10    282
4     265
2     264
12    259
8     253
Name: Date, dtype: int64
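# After the replacement, the column name needs to change
# rename is used less often; it can change individual column names
# Usually we assign df.columns = some_list to change all column names at once
# but rename lets us change just one column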
Xtrain = Xtrain.rename(columns={"Date":"Month"})
Xtrain.head()
Month Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

Xtest["Date"] = Xtest.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
Xtest = Xtest.rename(columns={"Date":"Month"})
Xtest.head()
Month Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 1 NorahHead 22.0 27.8 25.2 NaN NaN SSW 57.0 S ... 37.0 91.0 86.0 1006.6 1008.1 NaN NaN 26.2 23.1 Yes
1 3 MountGambier 12.0 18.6 2.2 3.0 7.8 SW 52.0 SW ... 28.0 88.0 62.0 1020.2 1019.9 8.0 7.0 14.8 17.5 Yes
2 3 MountGinini 9.1 13.3 NaN NaN NaN NE 41.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Wollongong 13.1 20.3 0.0 NaN NaN SW 33.0 W ... 24.0 40.0 51.0 1021.3 1019.5 NaN NaN 16.8 19.6 No
4 11 Sale 12.2 20.0 0.4 NaN NaN E 33.0 SW ... 19.0 92.0 69.0 1015.6 1013.2 8.0 4.0 13.6 19.0 No

5 rows × 22 columns
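The month can also be extracted by parsing the strings as dates. This is a sketch of that alternative next to the `split`-based approach used above, on three dates taken from the training rows shown earlier; it assumes every date follows the `YYYY-MM-DD` pattern.

```python
import pandas as pd

dates = pd.Series(["2015-08-24", "2016-12-10", "2010-04-18"], name="Date")

# Approach used in the notes: split the string and take the middle part
month_split = dates.apply(lambda x: int(x.split("-")[1]))

# Alternative: parse to datetime and read the month through the .dt accessor
month_dt = pd.to_datetime(dates, format="%Y-%m-%d").dt.month

print(month_split.tolist(), month_dt.tolist())
```

Parsing has the side benefit of failing loudly on malformed dates, while string splitting is simpler when the format is known to be clean.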

3.3 Handling a difficult feature: location

# Now handle the next difficult feature: location
Xtrain.loc[:,"Location"].value_counts()
Bendigo             94
Sydney              92
SalmonGums          92
Canberra            87
Ballarat            87
Darwin              86
Cairns              84
Wollongong          82
Albury              80
Townsville          80
Newcastle           78
Adelaide            77
BadgerysCreek       77
Dartmoor            76
Moree               76
Launceston          76
CoffsHarbour        76
Witchcliffe         76
WaggaWagga          75
NorahHead           74
Mildura             72
MelbourneAirport    72
SydneyAirport       72
Cobar               71
Richmond            71
PerthAirport        71
Hobart              71
Perth               70
Walpole             69
PearceRAAF          69
NorfolkIsland       68
Nuriootpa           68
MountGambier        68
Woomera             67
Albany              67
GoldCoast           66
Watsonia            66
Penrith             65
MountGinini         64
Brisbane            63
AliceSprings        63
Williamtown         63
Tuggeranong         62
Sale                62
Portland            60
Katherine           53
Melbourne           52
Uluru               46
Nhil                44
Name: Location, dtype: int64
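# Count how many distinct locations there are
Xtrain.loc[:,"Location"].value_counts().count()
# According to the course, a categorical variable with more than 25 categories ends up being treated by the algorithm as a continuous variable rather than as classes
# So convert the cities into climates to get a usable categorical variable
# The 49 locations are mapped onto 7 climate types
# Each weather station takes the climate of its nearest city, so we load city latitudes and longitudes to find the city closest to each station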
49
cityll = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_cityll.csv",index_col=0)
city_climate = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_Cityclimate.csv")
cityll.head()  # latitude and longitude of each city; these are the cities on the map produced by the Australian Bureau of Statistics
City Latitude Longitude Latitudedir Longitudedir
0 Adelaide 34.9285° 138.6007° S, E
1 Albany 35.0275° 117.8840° S, E
2 Albury 36.0737° 146.9135° S, E
3 Wodonga 36.1241° 146.8818° S, E
4 AliceSprings 23.6980° 133.8807° S, E
float(cityll.loc[0,"Latitude"][:-1])
34.9285
cityll.loc[:,"Latitudedir"].value_counts()
S,    100
Name: Latitudedir, dtype: int64
city_climate.head()  # climate of each city, as compiled by the Australian Bureau of Statistics
City Climate
0 Adelaide Warm temperate
1 Albany Mild temperate
2 Albury Hot dry summer, cool winter
3 Wodonga Hot dry summer, cool winter
4 AliceSprings Hot dry summer, warm winter
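# Strip the degree symbol
cityll["Latitudenum"] = cityll["Latitude"].apply(lambda x:float(x[:-1]))
cityll["Longitudenum"] = cityll["Longitude"].apply(lambda x:float(x[:-1]))
# All directions turn out to be identical: south latitude and east longitude, because Australia lies in the southern and eastern hemispheres
# So the direction columns can be dropped
citylld = cityll.iloc[:,[0,5,6]]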
citylld
City Latitudenum Longitudenum
0 Adelaide 34.9285 138.6007
1 Albany 35.0275 117.8840
2 Albury 36.0737 146.9135
3 Wodonga 36.1241 146.8818
4 AliceSprings 23.6980 133.8807
... ... ... ...
95 Wollongong 34.4278 150.8931
96 Wyndham 15.4825 128.1228
97 Yalgoo 28.3445 116.6851
98 Yulara 25.2335 130.9849
99 Uluru 25.3444 131.0369

100 rows × 3 columns
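The degree symbol can also be removed with pandas' vectorized string methods instead of `apply` with a lambda. A minimal sketch on three latitude strings copied from the table above:

```python
import pandas as pd

lat = pd.Series(["34.9285°", "35.0275°", "23.6980°"], name="Latitude")

# Approach used in the notes: drop the last character, then convert
lat_apply = lat.apply(lambda x: float(x[:-1]))

# Alternative: strip the trailing degree sign explicitly, then cast the column
lat_str = lat.str.rstrip("°").astype(float)

print(lat_apply.tolist(), lat_str.tolist())
```

Stripping the named character is slightly more defensive than slicing, since a value without the symbol would not lose its last digit.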

# Add the climate column from city_climate to citylld
citylld = citylld.copy()  # take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
citylld["climate"] = city_climate.iloc[:,-1]
citylld.head()
City Latitudenum Longitudenum climate
0 Adelaide 34.9285 138.6007 Warm temperate
1 Albany 35.0275 117.8840 Mild temperate
2 Albury 36.0737 146.9135 Hot dry summer, cool winter
3 Wodonga 36.1241 146.8818 Hot dry summer, cool winter
4 AliceSprings 23.6980 133.8807 Hot dry summer, warm winter
citylld.loc[:,"climate"].value_counts()
Hot dry summer, cool winter          24
Warm temperate                       18
Hot dry summer, warm winter          18
High humidity summer, warm winter    17
Mild temperate                        9
Cool temperate                        9
Warm humid summer, mild winter        5
Name: climate, dtype: int64
samplecity = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_samplecity.csv",index_col=0)
samplecity.head()
City Latitude Longitude Latitudedir Longitudedir
0 Canberra 35.2809° 149.1300° S, E
1 Sydney 33.8688° 151.2093° S, E
2 Perth 31.9505° 115.8605° S, E
3 Darwin 12.4634° 130.8456° S, E
4 Hobart 42.8821° 147.3272° S, E
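# Apply the same processing to samplecity: strip the degree symbol and drop the direction columns
samplecity["Latitudenum"] = samplecity["Latitude"].apply(lambda x:float(x[:-1]))
samplecity["Longitudenum"] = samplecity["Longitude"].apply(lambda x:float(x[:-1]))
samplecityd = samplecity.iloc[:,[0,5,6]].copy()  # explicit copy, so later column assignments do not raise SettingWithCopyWarning
samplecityd.head()  # here City is the name of the weather station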
City Latitudenum Longitudenum
0 Canberra 35.2809 149.1300
1 Sydney 33.8688 151.2093
2 Perth 31.9505 115.8605
3 Darwin 12.4634 130.8456
4 Hobart 42.8821 147.3272
# First convert the angles (latitude and longitude) from degrees to radians
from math import radians, sin, cos, acos
citylld.loc[:,"slat"] = citylld.iloc[:,1].apply(lambda x : radians(x))
citylld.loc[:,"slon"] = citylld.iloc[:,2].apply(lambda x : radians(x))
samplecityd.loc[:,"elat"] = samplecityd.iloc[:,1].apply(lambda x : radians(x))
samplecityd.loc[:,"elon"] = samplecityd.iloc[:,2].apply(lambda x : radians(x))
import sys
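for i in range(samplecityd.shape[0]):
    slat = citylld.loc[:,"slat"]
    slon = citylld.loc[:,"slon"]
    elat = samplecityd.loc[i,"elat"]
    elon = samplecityd.loc[i,"elon"]
    # Great-circle distance via the spherical law of cosines; 6371.01 is the Earth's mean radius in km
    dist = 6371.01 * np.arccos(np.sin(slat)*np.sin(elat) + 
                          np.cos(slat)*np.cos(elat)*np.cos(slon.values - elon))
    city_index = np.argsort(dist)[0]  # argsort returns the indices that would sort dist; the first one is the nearest city
    # After each calculation, take the nearest city and write it and its climate into samplecityd
    samplecityd.loc[i,"closest_city"] = citylld.loc[city_index,"City"]
    samplecityd.loc[i,"climate"] = citylld.loc[city_index,"climate"]
# Look at the final result and check that the city matches are broadly correct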
samplecityd.head(300)
City Latitudenum Longitudenum elat elon closest_city climate
0 Canberra 35.2809 149.1300 0.615768 2.602810 Canberra Cool temperate
1 Sydney 33.8688 151.2093 0.591122 2.639100 Sydney Warm temperate
2 Perth 31.9505 115.8605 0.557641 2.022147 Perth Warm temperate
3 Darwin 12.4634 130.8456 0.217527 2.283687 Darwin High humidity summer, warm winter
4 Hobart 42.8821 147.3272 0.748434 2.571345 Hobart Cool temperate
5 Brisbane 27.4698 153.0251 0.479438 2.670792 Brisbane Warm humid summer, mild winter
6 Adelaide 34.9285 138.6007 0.609617 2.419039 Adelaide Warm temperate
7 Bendigo 36.7570 144.2794 0.641531 2.518151 Ballarat Cool temperate
8 Townsville 19.2590 146.8169 0.336133 2.562438 Townsville High humidity summer, warm winter
9 AliceSprings 23.6980 133.8807 0.413608 2.336659 AliceSprings Hot dry summer, warm winter
10 MountGambier 37.8284 140.7804 0.660230 2.457082 KingstonSE Mild temperate
11 Launceston 41.4332 147.1441 0.723146 2.568149 Launceston Cool temperate
12 Ballarat 37.5622 143.8503 0.655584 2.510661 Ballarat Cool temperate
13 Albany 35.0275 117.8840 0.611345 2.057464 Albany Mild temperate
14 Albury 36.0737 146.9135 0.629605 2.564124 Albury Hot dry summer, cool winter
15 PerthAirport 31.9440 115.9680 0.557528 2.024023 PerthAirport Warm temperate
16 MelbourneAirport 37.6697 144.8488 0.657460 2.528088 MelbourneAirport Mild temperate
17 Mildura 34.2080 142.1246 0.597042 2.480542 Mildura Hot dry summer, cool winter
18 SydneyAirport 33.9399 151.1753 0.592363 2.638507 SydneyAirport Warm temperate
19 Nuriootpa 34.4666 138.9917 0.601556 2.425863 Adelaide Warm temperate
20 Sale 38.1026 147.0730 0.665016 2.566908 LakesEntrance Mild temperate
21 Watsonia 37.7080 145.0830 0.658129 2.532176 Ivanhoe Hot dry summer, cool winter
22 Tuggeranong 35.4244 149.0888 0.618272 2.602090 Canberra Cool temperate
23 Portland 38.3609 141.6041 0.669524 2.471458 MountGambier Mild temperate
24 Woomera 31.1656 136.8193 0.543942 2.387947 LeighCreek Warm temperate
25 Cairns 16.9186 145.7781 0.295285 2.544308 Cairns High humidity summer, warm winter
26 Cobar 31.4949 145.8402 0.549690 2.545392 Bourke Hot dry summer, cool winter
27 Wollongong 34.4278 150.8931 0.600878 2.633581 Wollongong Warm temperate
28 GoldCoast 28.0167 153.4000 0.488984 2.677335 Southport Cool temperate
29 WaggaWagga 35.1082 147.3598 0.612754 2.571914 Albury Hot dry summer, cool winter
30 NorfolkIsland 29.0408 167.9547 0.506858 2.931363 LordHoweIsland Warm temperate
31 Penrith 33.7500 150.7000 0.589049 2.630211 SydneyAirport Warm temperate
32 SalmonGums 32.9879 121.6422 0.575747 2.123057 Norseman Hot dry summer, cool winter
33 Newcastle 32.9283 151.7817 0.574707 2.649090 Newcastle Warm temperate
34 CoffsHarbour 30.2986 153.1094 0.528810 2.672263 CoffsHarbour Warm humid summer, mild winter
35 Witchcliffe 34.0082 115.1155 0.593555 2.009144 MargaretRiver Warm temperate
36 Richmond 37.8230 144.9980 0.660136 2.530693 Melbourne Mild temperate
37 Dartmoor 37.9144 141.2730 0.661731 2.465679 MountGambier Mild temperate
38 NorahHead 33.2833 151.5667 0.580903 2.645338 Swansea Cool temperate
39 BadgerysCreek 33.8829 150.7609 0.591368 2.631274 SydneyAirport Warm temperate
40 MountGinini 35.5294 148.7723 0.620105 2.596566 Canberra Cool temperate
41 Moree 29.4658 149.8339 0.514275 2.615095 Goondiwindi Hot dry summer, warm winter
42 Walpole 34.9551 116.7696 0.610082 2.038014 Albany Mild temperate
43 PearceRAAF 31.6666 116.0257 0.552686 2.025030 PerthAirport Warm temperate
44 Williamtown 32.8150 151.8428 0.572730 2.650157 Newcastle Warm temperate
45 Melbourne 37.8136 144.9631 0.659972 2.530083 Melbourne Mild temperate
46 Nhil 36.3328 141.6503 0.634127 2.472264 Horsham Mild temperate
47 Katherine 14.4521 132.2715 0.252237 2.308573 Katherine High humidity summer, warm winter
48 Uluru 25.3444 131.0369 0.442343 2.287025 Uluru Hot dry summer, warm winter
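The nearest-city search can also be written as a small function. The sketch below swaps in the haversine formula, which is not what the notes use; it is chosen here because it stays well defined when two points coincide, whereas the arccos form can receive an argument slightly above 1 through rounding. The function name `haversine_km` and the three-city subset are assumptions for illustration; the coordinates are copied from the tables above.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.01):
    """Great-circle distance in km; inputs in degrees, arrays are broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # clip guards against tiny rounding errors pushing a outside [0, 1]
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Three reference cities from the cityll table. All latitudes are south and all
# longitudes east, so unsigned values give the same distances (as in the notes).
city_names = np.array(["Adelaide", "Albany", "AliceSprings"])
city_lat = np.array([34.9285, 35.0275, 23.6980])
city_lon = np.array([138.6007, 117.8840, 133.8807])

# Walpole station coordinates from the samplecity table
d = haversine_km(34.9551, 116.7696, city_lat, city_lon)
closest = city_names[np.argmin(d)]
print(closest)
```

On this subset the Walpole station is matched to Albany, in line with the row for Walpole in the table above.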
# Look at the distribution of climates
samplecityd["climate"].value_counts()
Warm temperate                       15
Mild temperate                       10
Cool temperate                        9
Hot dry summer, cool winter           6
High humidity summer, warm winter     4
Hot dry summer, warm winter           3
Warm humid summer, mild winter        2
Name: climate, dtype: int64
# Once verified, take the climate matched to each sample weather station and save it
locafinal = samplecityd.iloc[:,[0,-1]].copy()
locafinal.columns = ["Location","Climate"]
locafinal.head()
Location Climate
0 Canberra Cool temperate
1 Sydney Warm temperate
2 Perth Warm temperate
3 Darwin High humidity summer, warm winter
4 Hobart Cool temperate
# Set Location as the index of locafinal so that the climate can later be matched to the Location column of the training set
locafinal = locafinal.set_index(keys="Location")
locafinal
Climate
Location
Canberra Cool temperate
Sydney Warm temperate
Perth Warm temperate
Darwin High humidity summer, warm winter
Hobart Cool temperate
Brisbane Warm humid summer, mild winter
Adelaide Warm temperate
Bendigo Cool temperate
Townsville High humidity summer, warm winter
AliceSprings Hot dry summer, warm winter
MountGambier Mild temperate
Launceston Cool temperate
Ballarat Cool temperate
Albany Mild temperate
Albury Hot dry summer, cool winter
PerthAirport Warm temperate
MelbourneAirport Mild temperate
Mildura Hot dry summer, cool winter
SydneyAirport Warm temperate
Nuriootpa Warm temperate
Sale Mild temperate
Watsonia Hot dry summer, cool winter
Tuggeranong Cool temperate
Portland Mild temperate
Woomera Warm temperate
Cairns High humidity summer, warm winter
Cobar Hot dry summer, cool winter
Wollongong Warm temperate
GoldCoast Cool temperate
WaggaWagga Hot dry summer, cool winter
NorfolkIsland Warm temperate
Penrith Warm temperate
SalmonGums Hot dry summer, cool winter
Newcastle Warm temperate
CoffsHarbour Warm humid summer, mild winter
Witchcliffe Warm temperate
Richmond Mild temperate
Dartmoor Mild temperate
NorahHead Cool temperate
BadgerysCreek Warm temperate
MountGinini Cool temperate
Moree Hot dry summer, warm winter
Walpole Mild temperate
PearceRAAF Warm temperate
Williamtown Warm temperate
Melbourne Mild temperate
Nhil Mild temperate
Katherine High humidity summer, warm winter
Uluru Hot dry summer, warm winter
locafinal.to_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\samplelocation.csv")
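# Next, replace the original weather-station names with climates
# map takes a dictionary-like lookup (here a Series indexed by location) and maps each Location to its climate
# Do you still remember what the training set looks like?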
Xtrain.head()
Month Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 Katherine 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 12 Tuggeranong 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 4 Albany 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 11 Sale 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 4 Mildura 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

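# Replace the contents of Location, making sure the climate strings contain no commas and no surrounding spaces
# The re module is used to remove the commas
# re.sub(pattern to replace, replacement, string to operate on)  # removes the commas
# x.strip() removes leading and trailing whitespace
# map is what replaces each location with its climate
import re
# Weather-station names are replaced by the climate of the matching city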
Xtrain["Location"] = Xtrain["Location"].map(locafinal.iloc[:,0])
Xtrain.head()
Month Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 High humidity summer, warm winter 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 12 Cool temperate 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 4 Mild temperate 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 11 Mild temperate 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 4 Hot dry summer, cool winter 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

#Strip the commas and surrounding spaces from the climate strings
Xtrain["Location"] = Xtrain["Location"].apply(lambda x:re.sub(",","",x.strip()))
Xtest["Location"] = Xtest["Location"].map(locafinal.iloc[:,0]).apply(lambda x:re.sub(",","",x.strip()))
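#re.sub(pattern to replace, replacement, string to operate on)
#x.strip() removes leading and trailing whitespace
#Now that the contents have changed, rename the column from "Location" to "Climate"
#Note: once this runs there is no "Location" column any more, so take care when indexing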
Xtrain = Xtrain.rename(columns={"Location":"Climate"})
Xtest = Xtest.rename(columns={"Location":"Climate"})
Xtrain.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 NaN 27.5 NaN No
1 12 Cool temperate 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 NaN NaN 14.6 23.6 No
2 4 Mild temperate 13.0 22.6 0.0 3.8 10.4 NaN NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 11 Mild temperate 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

Xtest.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 27.8 25.2 NaN NaN SSW 57.0 S ... 37.0 91.0 86.0 1006.6 1008.1 NaN NaN 26.2 23.1 Yes
1 3 Mild temperate 12.0 18.6 2.2 3.0 7.8 SW 52.0 SW ... 28.0 88.0 62.0 1020.2 1019.9 8.0 7.0 14.8 17.5 Yes
2 3 Cool temperate 9.1 13.3 NaN NaN NaN NE 41.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Warm temperate 13.1 20.3 0.0 NaN NaN SW 33.0 W ... 24.0 40.0 51.0 1021.3 1019.5 NaN NaN 16.8 19.6 No
4 11 Mild temperate 12.2 20.0 0.4 NaN NaN E 33.0 SW ... 19.0 92.0 69.0 1015.6 1013.2 8.0 4.0 13.6 19.0 No

5 rows × 22 columns
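The mapping and cleaning steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual lookup table: `climate_by_station` and `stations` are hypothetical stand-ins for `locafinal.iloc[:,0]` and `Xtrain["Location"]`.

```python
import re
import pandas as pd

# Hypothetical lookup table: the index holds station names, the values hold climates
climate_by_station = pd.Series(
    {"Katherine": " High humidity summer, warm winter ", "Albany": "Mild temperate"}
)
# Hypothetical Location column
stations = pd.Series(["Katherine", "Albany", "Katherine"])

# Series.map looks every station name up in the index of the lookup Series
climate = stations.map(climate_by_station)
# Strip surrounding whitespace first, then delete the commas
climate = climate.apply(lambda s: re.sub(",", "", s.strip()))
print(climate.tolist())
```

One caveat: a station that is missing from the lookup table maps to NaN, and calling `.strip()` on NaN raises an `AttributeError`, so it is worth checking `isnull().sum()` after the `map` step.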

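3.4 Handling categorical variables: missing values

#Check the proportion of missing values in each column
#In practice, missing values are usually filled with the mean or the mode
#Model-based imputation is avoided here because it is slow and hard to interpret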
Xtrain.isnull().mean()
Month            0.000000
Climate          0.000000
MinTemp          0.004000
MaxTemp          0.003143
Rainfall         0.009429
Evaporation      0.433429
Sunshine         0.488571
WindGustDir      0.067714
WindGustSpeed    0.067714
WindDir9am       0.067429
WindDir3pm       0.024286
WindSpeed9am     0.009714
WindSpeed3pm     0.018000
Humidity9am      0.011714
Humidity3pm      0.026286
Pressure9am      0.098857
Pressure3pm      0.098857
Cloud9am         0.379714
Cloud3pm         0.401429
Temp9am          0.005429
Temp3pm          0.019714
RainToday        0.009429
dtype: float64
Xtrain.info()

RangeIndex: 3500 entries, 0 to 3499
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Month          3500 non-null   int64  
 1   Climate        3500 non-null   object 
 2   MinTemp        3486 non-null   float64
 3   MaxTemp        3489 non-null   float64
 4   Rainfall       3467 non-null   float64
 5   Evaporation    1983 non-null   float64
 6   Sunshine       1790 non-null   float64
 7   WindGustDir    3263 non-null   object 
 8   WindGustSpeed  3263 non-null   float64
 9   WindDir9am     3264 non-null   object 
 10  WindDir3pm     3415 non-null   object 
 11  WindSpeed9am   3466 non-null   float64
 12  WindSpeed3pm   3437 non-null   float64
 13  Humidity9am    3459 non-null   float64
 14  Humidity3pm    3408 non-null   float64
 15  Pressure9am    3154 non-null   float64
 16  Pressure3pm    3154 non-null   float64
 17  Cloud9am       2171 non-null   float64
 18  Cloud3pm       2095 non-null   float64
 19  Temp9am        3481 non-null   float64
 20  Temp3pm        3431 non-null   float64
 21  RainToday      3467 non-null   object 
dtypes: float64(16), int64(1), object(5)
memory usage: 601.7+ KB
Xtrain.dtypes == "object"
Month            False
Climate           True
MinTemp          False
MaxTemp          False
Rainfall         False
Evaporation      False
Sunshine         False
WindGustDir       True
WindGustSpeed    False
WindDir9am        True
WindDir3pm        True
WindSpeed9am     False
WindSpeed3pm     False
Humidity9am      False
Humidity3pm      False
Pressure9am      False
Pressure3pm      False
Cloud9am         False
Cloud3pm         False
Temp9am          False
Temp3pm          False
RainToday         True
dtype: bool
#First, find out which features are categorical
cate = Xtrain.columns[Xtrain.dtypes == "object"].tolist()
cate
['Climate', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
#Besides the "object" features, cloud cover is stored as numbers but is categorical in nature
cloud = ["Cloud9am","Cloud3pm"]
cate = cate + cloud
cate
['Climate',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
 'Cloud9am',
 'Cloud3pm']
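#Categorical features are imputed with the mode
from sklearn.impute import SimpleImputer #available from scikit-learn 0.20 (install via conda or pip)

si = SimpleImputer(missing_values=np.nan,strategy="most_frequent")#fills NaN with the most frequent value
#Note: the imputer is fitted on the training set only, i.e. it learns the training-set modes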
si.fit(Xtrain.loc[:,cate])
SimpleImputer(strategy='most_frequent')
#Then use the training-set modes to fill both the training set and the test set
Xtrain.loc[:,cate] = si.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = si.transform(Xtest.loc[:,cate])
Xtrain.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 High humidity summer warm winter 17.5 36.0 0.0 8.8 NaN ESE 26.0 NNW ... 15.0 57.0 NaN 1016.8 1012.2 0.0 7.0 27.5 NaN No
1 12 Cool temperate 9.5 25.0 0.0 NaN NaN NNW 33.0 NE ... 17.0 59.0 31.0 1020.4 1017.5 7.0 7.0 14.6 23.6 No
2 4 Mild temperate 13.0 22.6 0.0 3.8 10.4 W NaN NE ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 No
3 11 Mild temperate 13.9 29.8 0.0 5.8 5.1 S 37.0 N ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 No
4 4 Hot dry summer cool winter 6.0 23.5 0.0 2.8 8.6 NNE 24.0 E ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 No

5 rows × 22 columns

Xtest.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 1 Cool temperate 22.0 27.8 25.2 NaN NaN SSW 57.0 S ... 37.0 91.0 86.0 1006.6 1008.1 7.0 7.0 26.2 23.1 Yes
1 3 Mild temperate 12.0 18.6 2.2 3.0 7.8 SW 52.0 SW ... 28.0 88.0 62.0 1020.2 1019.9 8.0 7.0 14.8 17.5 Yes
2 3 Cool temperate 9.1 13.3 NaN NaN NaN NE 41.0 N ... NaN NaN NaN NaN NaN 7.0 7.0 NaN NaN No
3 10 Warm temperate 13.1 20.3 0.0 NaN NaN SW 33.0 W ... 24.0 40.0 51.0 1021.3 1019.5 7.0 7.0 16.8 19.6 No
4 11 Mild temperate 12.2 20.0 0.4 NaN NaN E 33.0 SW ... 19.0 92.0 69.0 1015.6 1013.2 8.0 4.0 13.6 19.0 No

5 rows × 22 columns

#Check whether any categorical feature still has missing values
Xtrain.loc[:,cate].isnull().mean()
Climate        0.0
WindGustDir    0.0
WindDir9am     0.0
WindDir3pm     0.0
RainToday      0.0
Cloud9am       0.0
Cloud3pm       0.0
dtype: float64
Xtest.loc[:,cate].isnull().mean()
Climate        0.0
WindGustDir    0.0
WindDir9am     0.0
WindDir3pm     0.0
RainToday      0.0
Cloud9am       0.0
Cloud3pm       0.0
dtype: float64
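The point of fitting on the training set only is that the test set is filled with modes it did not contribute to. A minimal sketch on made-up data (the two toy frames below are assumptions, not rows from the weather data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for Xtrain.loc[:, cate] and Xtest.loc[:, cate]
train = pd.DataFrame({"WindGustDir": ["W", "W", "S", np.nan],
                      "Cloud9am": [7.0, 7.0, np.nan, 1.0]})
test = pd.DataFrame({"WindGustDir": [np.nan, "N"],
                     "Cloud9am": [np.nan, 8.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train)                       # learns one mode per column from train only
print(imputer.statistics_)               # the learned training-set modes

# The test set is filled with the training-set modes, not its own
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

The missing wind direction in `test` becomes "W" and the missing cloud cover becomes 7.0, both taken from `train`.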

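3.5 Handling categorical variables: encoding

# Handling categorical variables: encoding them as numbers
#Impute missing values first, then encode
#Every categorical variable is encoded as numbers, one number per category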

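from sklearn.preprocessing import OrdinalEncoder #expects 2-D input
oe = OrdinalEncoder()
#Fit on the training set
oe = oe.fit(Xtrain.loc[:,cate])
#Use the categories learned from the training set to encode both feature matrices
#If transforming the test set raises an error, the test set may contain outliers or bad values,
#or a category that never appeared in the training set; in that case the model needs to be adjusted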
Xtrain.loc[:,cate] = oe.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = oe.transform(Xtest.loc[:,cate])
cate
Xtrain.loc[:,cate].head()
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 1.0 2.0 6.0 0.0 0.0 0.0 7.0
1 0.0 6.0 4.0 6.0 0.0 7.0 7.0
2 4.0 13.0 4.0 0.0 0.0 1.0 3.0
3 4.0 8.0 3.0 8.0 0.0 6.0 6.0
4 2.0 5.0 0.0 6.0 0.0 2.0 4.0
Xtest.loc[:,cate].head()
Climate WindGustDir WindDir9am WindDir3pm RainToday Cloud9am Cloud3pm
0 0.0 11.0 8.0 11.0 1.0 7.0 7.0
1 4.0 12.0 12.0 8.0 1.0 8.0 7.0
2 0.0 4.0 3.0 9.0 0.0 7.0 7.0
3 6.0 12.0 13.0 9.0 0.0 7.0 7.0
4 4.0 0.0 12.0 0.0 0.0 8.0 4.0
Xtrain.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8 1.0 17.5 36.0 0.0 8.8 NaN 2.0 26.0 6.0 ... 15.0 57.0 NaN 1016.8 1012.2 0.0 7.0 27.5 NaN 0.0
1 12 0.0 9.5 25.0 0.0 NaN NaN 6.0 33.0 4.0 ... 17.0 59.0 31.0 1020.4 1017.5 7.0 7.0 14.6 23.6 0.0
2 4 4.0 13.0 22.6 0.0 3.8 10.4 13.0 NaN 4.0 ... 31.0 79.0 68.0 1020.3 1015.7 1.0 3.0 17.5 20.8 0.0
3 11 4.0 13.9 29.8 0.0 5.8 5.1 8.0 37.0 3.0 ... 28.0 82.0 44.0 1012.5 1005.9 6.0 6.0 18.5 27.5 0.0
4 4 2.0 6.0 23.5 0.0 2.8 8.6 5.0 24.0 0.0 ... 15.0 58.0 35.0 1019.8 1014.1 2.0 4.0 12.4 22.4 0.0

5 rows × 22 columns
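The comment about errors on the test set can be made concrete. The sketch below uses made-up data to show the default behaviour (a `ValueError` on an unseen category) and one possible workaround, the `handle_unknown="use_encoded_value"` option. That option is not used in the original notebook and assumes scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "ENE" appears in the test set but never in the training set
train = pd.DataFrame({"WindGustDir": ["N", "S", "W"]})
test = pd.DataFrame({"WindGustDir": ["S", "ENE"]})

# Default encoder: an unseen category raises ValueError at transform time
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print("default encoder raised:", raised)

# Alternative (an assumption, not the notebook's choice): map unseen categories to -1
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)
encoded = tolerant.transform(test)
print(encoded.ravel())
```

Whether -1 is a sensible code for an unseen category depends on the downstream model; for an SVM it still imposes an artificial ordering, just like the other ordinal codes.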

3.6 Handling continuous variables: imputing missing values

col = Xtrain.columns.tolist()
col
['Month',
 'Climate',
 'MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustDir',
 'WindGustSpeed',
 'WindDir9am',
 'WindDir3pm',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm',
 'RainToday']
cate
['Climate',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
 'Cloud9am',
 'Cloud3pm']
#Take the list of all columns and drop the categorical ones from it
for i in cate:
    col.remove(i)
col
['Month',
 'MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Temp9am',
 'Temp3pm']
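The same list can be built without mutating `col` in a loop, which also makes the cell safe to re-run (calling `remove` a second time would raise a `ValueError`). A small sketch with a shortened, hypothetical column list:

```python
# Shortened stand-ins for Xtrain.columns.tolist() and cate
columns = ["Month", "Climate", "MinTemp", "WindGustDir", "Cloud9am", "Temp3pm"]
cate = ["Climate", "WindGustDir", "Cloud9am"]

# Keep every column that is not categorical, preserving the original order
col = [c for c in columns if c not in cate]
print(col)
```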
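#Instantiate the imputer; strategy="mean" fills with the column mean
impmean = SimpleImputer(missing_values=np.nan,strategy = "mean")
#Fit on the training set
impmean = impmean.fit(Xtrain.loc[:,col])
#Apply mean imputation to both the training set and the test set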
Xtrain.loc[:,col] = impmean.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = impmean.transform(Xtest.loc[:,col])
Xtrain.isnull().mean()
Month            0.0
Climate          0.0
MinTemp          0.0
MaxTemp          0.0
Rainfall         0.0
Evaporation      0.0
Sunshine         0.0
WindGustDir      0.0
WindGustSpeed    0.0
WindDir9am       0.0
WindDir3pm       0.0
WindSpeed9am     0.0
WindSpeed3pm     0.0
Humidity9am      0.0
Humidity3pm      0.0
Pressure9am      0.0
Pressure3pm      0.0
Cloud9am         0.0
Cloud3pm         0.0
Temp9am          0.0
Temp3pm          0.0
RainToday        0.0
dtype: float64
Xtest.isnull().mean()
Month            0.0
Climate          0.0
MinTemp          0.0
MaxTemp          0.0
Rainfall         0.0
Evaporation      0.0
Sunshine         0.0
WindGustDir      0.0
WindGustSpeed    0.0
WindDir9am       0.0
WindDir3pm       0.0
WindSpeed9am     0.0
WindSpeed3pm     0.0
Humidity9am      0.0
Humidity3pm      0.0
Pressure9am      0.0
Pressure3pm      0.0
Cloud9am         0.0
Cloud3pm         0.0
Temp9am          0.0
Temp3pm          0.0
RainToday        0.0
dtype: float64

3.7 Handling continuous variables: feature scaling

# Scale the continuous features

col.remove("Month")
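#Month has no missing values, so it was left out of the categorical imputation and was unaffected by the continuous imputation
#But we do not want to scale it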
col
['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Temp9am',
 'Temp3pm']
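from sklearn.preprocessing import StandardScaler #rescales each feature to mean 0 and variance 1
#Standardization does not change the shape of the distribution; it does not make the data normal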
ss = StandardScaler()
ss = ss.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = ss.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = ss.transform(Xtest.loc[:,col])
Xtrain.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 8.0 1.0 0.826375 1.774044 -0.314379 0.964367 0.000000 2.0 -1.085893e+00 6.0 ... -0.416443 -0.646283 0.000000 -0.122589 -0.453507 0.0 7.0 1.612270 0.000000 0.0
1 12.0 0.0 -0.427048 0.244031 -0.314379 0.000000 0.000000 6.0 -5.373993e-01 4.0 ... -0.182051 -0.539186 -1.011310 0.414254 0.340522 7.0 7.0 -0.366608 0.270238 0.0
2 4.0 4.0 0.121324 -0.089790 -0.314379 -0.551534 1.062619 13.0 -1.113509e-15 4.0 ... 1.458692 0.531786 0.800547 0.399342 0.070852 1.0 3.0 0.078256 -0.132031 0.0
3 11.0 4.0 0.262334 0.911673 -0.314379 0.054826 -0.885225 8.0 -2.239744e-01 3.0 ... 1.107105 0.692432 -0.374711 -0.763819 -1.397352 6.0 6.0 0.231658 0.830540 0.0
4 4.0 2.0 -0.975421 0.035393 -0.314379 -0.854715 0.401087 5.0 -1.242605e+00 0.0 ... -0.416443 -0.592734 -0.815433 0.324780 -0.168855 2.0 4.0 -0.704091 0.097837 0.0

5 rows × 22 columns

Xtest.head()
Month Climate MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
0 1.0 0.0 1.531425 0.633489 2.871067 0.000000 0.000000 11.0 1.343150 8.0 ... 2.161868e+00 1.174369 1.681991 -1.643646 -1.067755 7.0 7.0 1.412848 0.198404 1.0
1 3.0 4.0 -0.035354 -0.646158 -0.036285 -0.794079 0.107073 12.0 0.951369 12.0 ... 1.107105e+00 1.013723 0.506733 0.384430 0.700082 8.0 7.0 -0.335927 -0.606132 1.0
2 3.0 0.0 -0.489720 -1.383346 0.000000 0.000000 0.000000 4.0 0.089450 3.0 ... -4.163637e-16 0.000000 0.000000 0.000000 0.000000 7.0 7.0 0.000000 0.000000 0.0
3 10.0 6.0 0.136992 -0.409702 -0.314379 0.000000 0.000000 12.0 -0.537399 13.0 ... 6.383207e-01 -1.556609 -0.031928 0.548465 0.640155 7.0 7.0 -0.029125 -0.304431 0.0
4 11.0 4.0 -0.004018 -0.451429 -0.263817 0.000000 0.000000 0.0 -0.537399 12.0 ... 5.234093e-02 1.227917 0.849516 -0.301537 -0.303690 8.0 4.0 -0.520009 -0.390632 0.0

5 rows × 22 columns
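The many `0.000000` entries (and values such as `-1.113509e-15`) in the scaled tables are most likely the cells that were mean-imputed in 3.6: a value equal to the training mean becomes 0 after standardization with that same mean. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One toy feature with a missing value in both train and test
train = np.array([[1.0], [2.0], [np.nan], [6.0]])
test = np.array([[np.nan], [6.0]])

imputer = SimpleImputer(strategy="mean").fit(train)        # training mean is 3.0
scaler = StandardScaler().fit(imputer.transform(train))    # fitted on imputed train

train_scaled = scaler.transform(imputer.transform(train))
test_scaled = scaler.transform(imputer.transform(test))

print(train_scaled.ravel())   # the imputed row sits at (numerically) zero
print(test_scaled.ravel())    # so does the imputed test row
```

This also means a heavily imputed feature such as Sunshine or Evaporation carries a large spike at 0 after scaling.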

#Feature engineering is done; move on to modelling and evaluation

Ytrain.head()
0
0 0
1 0
2 0
3 1
4 0

4. Modelling and model evaluation

from time import time #to keep track of how long the model takes to run
import datetime
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
Ytrain = Ytrain.iloc[:,0].ravel()
Ytest = Ytest.iloc[:,0].ravel()
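#The model is of course the support vector classifier SVC; first loop over the kernels to choose one
#We want to watch accuracy, recall and AUC together
times = time() #SVM is computationally expensive, so keep an eye on the running time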

for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000  #kernel cache size in MB; a larger cache uses more memory and can speed up training
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)  #predictions on the test set
    score = clf.score(Xtest,Ytest)  #score returns accuracy
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))#the second argument of roc_auc_score is a confidence score
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.844000, recall is 0.469388', auc is 0.869029
00:03:751689
poly 's testing accuracy 0.840667, recall is 0.457726', auc is 0.868157
00:04:253937
rbf 's testing accuracy 0.813333, recall is 0.306122', auc is 0.814873
00:05:900434
sigmoid 's testing accuracy 0.655333, recall is 0.154519', auc is 0.437308
00:06:403792
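Two remarks on the timing lines. Because `times` is set once before the loop, the printed values are cumulative rather than per kernel. Also, `datetime.datetime.fromtimestamp` interprets the elapsed seconds as a local date near the epoch, so the minutes field depends on the local time zone. A hedged alternative is sketched below; `format_elapsed` is a hypothetical helper, not part of the original notebook.

```python
from time import perf_counter

def format_elapsed(seconds):
    """Format a duration in seconds as MM:SS:ffffff, independent of time zone."""
    total_us = int(round(seconds * 1_000_000))       # whole microseconds
    minutes, rem = divmod(total_us, 60_000_000)      # 60 s worth of microseconds
    secs, micros = divmod(rem, 1_000_000)
    return f"{minutes:02d}:{secs:02d}:{micros:06d}"

start = perf_counter()
sum(range(100_000))                                  # stand-in for model fitting
print(format_elapsed(perf_counter() - start))
```

Resetting `start` inside the loop would give one timing per kernel, as the later cells in 5.2 do.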

5. Tuning the model

5.1 Maximizing recall

# Maximize recall
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = "balanced"
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.796000, recall is 0.775510', auc is 0.870065
00:04:266080
poly 's testing accuracy 0.793333, recall is 0.763848', auc is 0.871448
00:04:972567
rbf 's testing accuracy 0.803333, recall is 0.600583', auc is 0.819713
00:06:837272
sigmoid 's testing accuracy 0.562000, recall is 0.282799', auc is 0.437119
00:08:094214
times = time()
clf = SVC(kernel = "linear"
          ,gamma="auto"
          ,cache_size = 5000
          ,class_weight = {1:10}
#Note: {1:10} gives class 1 a weight of 10; class 0 implicitly keeps the default weight of 1
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f, recall is %f', auc is %f" %(score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

#Most minority-class cases are now caught (high recall), but many majority-class samples are misclassified as the minority class
testing accuracy 0.636667, recall is 0.912536', auc is 0.866360
00:07:806017
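For reference, `class_weight="balanced"` sets each class weight to `n_samples / (n_classes * count_of_that_class)`, and SVC multiplies `C` by that weight for the class. The sketch below reproduces the formula. The label counts are an assumption for illustration: they are the test-set counts shown in 5.2, since the training-set counts do not appear in this section.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 1157 samples of class 0 and 343 of class 1
y = np.array([0] * 1157 + [1] * 343)
classes = np.array([0, 1])

# What class_weight="balanced" computes internally
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same thing by hand: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

With these counts the minority class ends up with a weight roughly 3.4 times that of the majority class, far milder than the `{1:10}` setting tried above.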

5.2 Maximizing accuracy

# Maximize accuracy

valuec = pd.Series(Ytest).value_counts()
valuec# the minority class has label 1, the majority class has label 0
0    1157
1     343
dtype: int64
valuec[0]/valuec.sum()
0.7713333333333333
#Check the model's specificity
from sklearn.metrics import confusion_matrix as CM
clf = SVC(kernel = "linear"
          ,gamma="auto"
          ,cache_size = 5000
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
# get the predictions
# build the confusion matrix
cm = CM(Ytest,result,labels=(1,0))
cm
array([[ 161,  182],
       [  52, 1105]], dtype=int64)
specificity = cm[1,1]/cm[1,:].sum()
specificity #almost all majority-class samples (0) are classified correctly, along with a fair number of minority-class samples (1)
0.9550561797752809
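The other metrics quoted in this walkthrough can be read off the same confusion matrix. With `labels=(1,0)` the first row and column refer to class 1, so the layout is `[[TP, FN], [FP, TN]]`. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix from the cell above, with labels=(1, 0)
cm = np.array([[161, 182],
               [52, 1105]])
tp, fn = cm[0]   # true class 1: predicted 1, predicted 0
fp, tn = cm[1]   # true class 0: predicted 1, predicted 0

recall = tp / (tp + fn)            # share of rainy days that were caught
specificity = tn / (tn + fp)       # share of dry days that were caught
precision = tp / (tp + fp)         # share of rain predictions that were right
accuracy = (tp + tn) / cm.sum()
baseline = (tn + fp) / cm.sum()    # accuracy of always predicting class 0

print(f"recall={recall:.6f} specificity={specificity:.6f} "
      f"precision={precision:.6f} accuracy={accuracy:.6f} baseline={baseline:.6f}")
```

The recall and accuracy agree with the linear-kernel line printed in section 4, and the baseline equals the `valuec[0]/valuec.sum()` value computed earlier, so the model beats the always-predict-0 baseline by about seven percentage points.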
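# Majority-class accuracy is already high; try nudging the class weight slightly so that minority-class accuracy, and with it overall accuracy, goes up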
irange = np.linspace(0.01,0.05,10)
irange
array([0.01      , 0.01444444, 0.01888889, 0.02333333, 0.02777778,
       0.03222222, 0.03666667, 0.04111111, 0.04555556, 0.05      ])
for i in irange:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.010000 testing accuracy 0.844667, recall is 0.475219', auc is 0.869157
00:03:688484
under ratio 1:1.014444 testing accuracy 0.844667, recall is 0.478134', auc is 0.869185
00:03:856429
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869200
00:03:745157
under ratio 1:1.023333 testing accuracy 0.845333, recall is 0.481050', auc is 0.869175
00:03:598711
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869394
00:03:667787
under ratio 1:1.032222 testing accuracy 0.844000, recall is 0.481050', auc is 0.869528
00:03:641163
under ratio 1:1.036667 testing accuracy 0.844000, recall is 0.481050', auc is 0.869659
00:03:895604
under ratio 1:1.041111 testing accuracy 0.844667, recall is 0.483965', auc is 0.869629
00:03:787082
under ratio 1:1.045556 testing accuracy 0.844667, recall is 0.483965', auc is 0.869712
00:03:729805
under ratio 1:1.050000 testing accuracy 0.845333, recall is 0.486880', auc is 0.869863
00:03:800089
irange_ = np.linspace(0.018889,0.027778,10)
for i in irange_:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869213
00:03:654617
under ratio 1:1.019877 testing accuracy 0.844000, recall is 0.478134', auc is 0.869228
00:03:753644
under ratio 1:1.020864 testing accuracy 0.844000, recall is 0.478134', auc is 0.869218
00:03:743298
under ratio 1:1.021852 testing accuracy 0.844667, recall is 0.478134', auc is 0.869188
00:03:557083
under ratio 1:1.022840 testing accuracy 0.844667, recall is 0.478134', auc is 0.869220
00:03:805152
under ratio 1:1.023827 testing accuracy 0.844667, recall is 0.481050', auc is 0.869188
00:03:774551
under ratio 1:1.024815 testing accuracy 0.844667, recall is 0.481050', auc is 0.869231
00:03:644071
under ratio 1:1.025803 testing accuracy 0.844667, recall is 0.481050', auc is 0.869238
00:03:772898
under ratio 1:1.026790 testing accuracy 0.844000, recall is 0.481050', auc is 0.869314
00:03:660354
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869326
00:03:715478
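# No accuracy above 84.53% appears; adjusting the class weight can no longer produce a qualitative jump in accuracy
# The only way left to improve accuracy is to switch to a different model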
from sklearn.linear_model import LogisticRegression as LR
logclf = LR(solver="liblinear").fit(Xtrain, Ytrain)
logclf.score(Xtest,Ytest)
0.8486666666666667
C_range = np.linspace(5,10,10)
for C in C_range:
    logclf = LR(solver="liblinear",C=C).fit(Xtrain, Ytrain)
    print(C,logclf.score(Xtest,Ytest))
5.0 0.8493333333333334
5.555555555555555 0.8493333333333334
6.111111111111111 0.8486666666666667
6.666666666666667 0.8493333333333334
7.222222222222222 0.8493333333333334
7.777777777777778 0.8493333333333334
8.333333333333334 0.8493333333333334
8.88888888888889 0.8493333333333334
9.444444444444445 0.8493333333333334
10.0 0.8493333333333334

5.3 Balancing accuracy and recall

# Accuracy still shows no qualitative change; ensemble methods might get there
# Aim for a balance between accuracy and recall
times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
          ,class_weight = "balanced"
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f,recall is %f', auc is %f" % (score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.794000,recall is 0.772595', auc is 0.870143
00:11:026793
from sklearn.metrics import roc_curve as ROC
import matplotlib.pyplot as plt
FPR, Recall, thresholds = ROC(Ytest,clf.decision_function(Xtest),pos_label=1)#the positive class is label 1 (the minority class)
# roc_curve takes predicted probabilities or confidence scores as input
area = roc_auc_score(Ytest,clf.decision_function(Xtest))
area
0.8701426983930995
plt.figure()
plt.plot(FPR, Recall, color='red',
         label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve of the linear-kernel SVC on the test set (x-axis: False Positive Rate, y-axis: Recall, area = 0.87)]

# Use the ROC curve to find the best threshold
maxindex = (Recall - FPR).tolist().index(max(Recall - FPR))#index of the largest value of Recall - FPR
thresholds[maxindex]
-0.09027758680662012
from sklearn.metrics import accuracy_score as AC
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
          ,class_weight = "balanced"
         ).fit(Xtrain, Ytrain)
prob = pd.DataFrame(clf.decision_function(Xtest))# confidence scores
prob.head()
0
0 2.186028
1 0.373602
2 -0.019583
3 -1.134845
4 -0.237963
prob.loc[prob.iloc[:,0] >= thresholds[maxindex],"y_pred"]=1
prob.loc[prob.iloc[:,0] < thresholds[maxindex],"y_pred"]=0
prob.loc[:,"y_pred"].isnull().sum()
0
times = time()
score = AC(Ytest,prob.loc[:,"y_pred"].values)
recall = recall_score(Ytest, prob.loc[:,"y_pred"])
print("testing accuracy %f,recall is %f" % (score,recall))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
# Even at the best threshold, accuracy and recall show no qualitative change
testing accuracy 0.790000,recall is 0.804665
00:00:002001
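The threshold search above picks the point on the ROC curve where `Recall - FPR` is largest (often called Youden's J statistic) and then re-labels the samples by comparing their decision scores with that threshold. A compact sketch on made-up scores, using `np.argmax` in place of the list lookup; `y_true` and `scores` are toy values, not outputs of the fitted SVC:

```python
import numpy as np
from sklearn.metrics import roc_curve, accuracy_score, recall_score

# Toy labels and decision scores, sorted from most to least confident
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([2.2, 1.0, 0.4, 0.1, -0.2, -0.5, -0.9, -1.5, -2.0, -2.5])

fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label=1)

# Index of the largest gap between recall (TPR) and false positive rate
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

# Predict class 1 whenever the score reaches the chosen threshold
y_pred = (scores >= best_threshold).astype(int)

print("threshold:", best_threshold)
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```

Note that a threshold chosen on the test set and then evaluated on the same test set gives an optimistic estimate; choosing it on a validation split would be the stricter procedure.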

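This suggests that tuning has reached the limit of this model; to improve performance further, the only option is to switch to a different algorithm.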
