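SVC Case Study: Weather Prediction
- 1. Import libraries and data, explore the features
- 2. Split into training and test sets, explore the labels first
- 3. Explore the features
  - 3.1 Descriptive statistics and outliers
  - 3.2 Handling difficult features: date
  - 3.3 Handling difficult features: location
  - 3.4 Handling categorical variables: missing values
  - 3.5 Handling categorical variables: encoding
  - 3.6 Handling continuous variables: imputing missing values
  - 3.7 Handling continuous variables: scaling
- 4. Modeling and model evaluation
- 5. Model tuning
  - 5.1 Pursuing the highest recall
  - 5.2 Pursuing the highest accuracy
  - 5.3 Balancing precision and recall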
1. Import libraries and data, explore the features
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
weather = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_weatherAUS5000.csv",index_col=0)
weather.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-03-24 | Adelaide | 12.3 | 19.3 | 0.0 | 5.0 | NaN | S | 39.0 | S | ... | 19.0 | 59.0 | 47.0 | 1022.2 | 1021.4 | NaN | NaN | 15.1 | 17.7 | No |
| 1 | 2011-07-12 | Adelaide | 7.9 | 11.4 | 0.0 | 1.0 | 0.5 | N | 20.0 | NNE | ... | 7.0 | 70.0 | 59.0 | 1028.7 | 1025.7 | NaN | NaN | 8.4 | 11.3 | No |
| 2 | 2010-02-08 | Adelaide | 24.0 | 38.1 | 0.0 | 23.4 | 13.0 | SE | 39.0 | NNE | ... | 19.0 | 36.0 | 24.0 | 1018.0 | 1016.0 | NaN | NaN | 32.4 | 37.4 | No |
| 3 | 2016-09-19 | Adelaide | 6.7 | 16.4 | 0.4 | NaN | NaN | N | 31.0 | N | ... | 15.0 | 65.0 | 40.0 | 1014.4 | 1010.0 | NaN | NaN | 11.2 | 15.9 | No |
| 4 | 2014-03-05 | Adelaide | 16.7 | 24.8 | 0.0 | 6.6 | 11.7 | S | 37.0 | S | ... | 24.0 | 61.0 | 48.0 | 1019.3 | 1018.9 | NaN | NaN | 20.8 | 23.7 | No |
5 rows × 22 columns
X = weather.iloc[:,:-1]
Y = weather.iloc[:,-1]
X.shape
(5000, 21)
X.info()
Int64Index: 5000 entries, 0 to 4999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5000 non-null object
1 Location 5000 non-null object
2 MinTemp 4979 non-null float64
3 MaxTemp 4987 non-null float64
4 Rainfall 4950 non-null float64
5 Evaporation 2841 non-null float64
6 Sunshine 2571 non-null float64
7 WindGustDir 4669 non-null object
8 WindGustSpeed 4669 non-null float64
9 WindDir9am 4651 non-null object
10 WindDir3pm 4887 non-null object
11 WindSpeed9am 4949 non-null float64
12 WindSpeed3pm 4919 non-null float64
13 Humidity9am 4936 non-null float64
14 Humidity3pm 4880 non-null float64
15 Pressure9am 4506 non-null float64
16 Pressure3pm 4504 non-null float64
17 Cloud9am 3111 non-null float64
18 Cloud3pm 3012 non-null float64
19 Temp9am 4967 non-null float64
20 Temp3pm 4912 non-null float64
dtypes: float64(16), object(5)
memory usage: 859.4+ KB
X.isnull().mean()
Date 0.0000
Location 0.0000
MinTemp 0.0042
MaxTemp 0.0026
Rainfall 0.0100
Evaporation 0.4318
Sunshine 0.4858
WindGustDir 0.0662
WindGustSpeed 0.0662
WindDir9am 0.0698
WindDir3pm 0.0226
WindSpeed9am 0.0102
WindSpeed3pm 0.0162
Humidity9am 0.0128
Humidity3pm 0.0240
Pressure9am 0.0988
Pressure3pm 0.0992
Cloud9am 0.3778
Cloud3pm 0.3976
Temp9am 0.0066
Temp3pm 0.0176
dtype: float64
Y.shape
(5000,)
Y.isnull().sum()
0
np.unique(Y)
array(['No', 'Yes'], dtype=object)
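`X.isnull().mean()` works because the mean of a True/False column is the share of True values, so it reports the fraction of missing entries per feature. The sketch below illustrates this on a hypothetical two-column toy frame (the values are made up for illustration and are not the weather data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather features
toy = pd.DataFrame({
    "MinTemp": [12.3, np.nan, 24.0, 6.7],
    "Sunshine": [np.nan, 0.5, np.nan, np.nan],
})

# Mean of the boolean mask = fraction of missing values per column
missing_ratio = toy.isnull().mean()

# The same ratio, derived from the non-null counts that info() reports
check = 1 - toy.count() / len(toy)

# Columns with the most missing values first
print(missing_ratio.sort_values(ascending=False))
```

Sorting the ratios makes it easy to spot heavily incomplete features such as Sunshine or Evaporation.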
2. Split into training and test sets, explore the labels first
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.3,random_state=420)
Xtrain.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1809 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 17.0 | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN |
| 4176 | 2016-12-10 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 7.0 | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 |
| 110 | 2010-04-18 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 17.0 | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 |
| 3582 | 2009-11-26 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 11.0 | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 |
| 2162 | 2014-04-25 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 |
5 rows × 21 columns
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
Xtrain.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 17.0 | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN |
| 1 | 2016-12-10 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 7.0 | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 |
| 2 | 2010-04-18 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 17.0 | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 |
| 3 | 2009-11-26 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 11.0 | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 |
| 4 | 2014-04-25 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 |
5 rows × 21 columns
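After a shuffled split the four pieces keep their original row labels, which is why the index is overwritten with a fresh 0..n-1 range. A minimal sketch of the same idea using `reset_index(drop=True)`, on hypothetical toy data (the column name `feature` and the label series are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: feature value and label are identical on purpose,
# so we can verify that rows stay aligned after re-indexing
X = pd.DataFrame({"feature": range(10)})
y = pd.Series(range(10), name="label")

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=420)
shuffled_index = list(Xtr.index)  # original labels, in shuffled order

# Replace the shuffled labels with 0..n-1 in every piece
Xtr, Xte, ytr, yte = [part.reset_index(drop=True) for part in (Xtr, Xte, ytr, yte)]

print(shuffled_index, "->", list(Xtr.index))
```

Unlike assigning to `.index` in a loop, `reset_index` returns new objects, so the originals are left unmodified.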
Ytrain.head()
0 No
1 No
2 No
3 Yes
4 No
Name: RainTomorrow, dtype: object
Ytrain.value_counts()
No 2704
Yes 796
Name: RainTomorrow, dtype: int64
Ytest.value_counts()
No 1157
Yes 343
Name: RainTomorrow, dtype: int64
Ytrain.value_counts()["No"]/Ytrain.value_counts()["Yes"]
3.3969849246231156
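The ratio above shows a clear class imbalance: there are roughly 3.4 "No" days for every "Yes" day. The following sketch rebuilds a label series from the counts printed above (2704 and 796) and computes both the ratio and the class shares; the series itself is synthetic and only the counts come from the output:

```python
import pandas as pd

# Synthetic label series rebuilt from the training counts shown above
labels = pd.Series(["No"] * 2704 + ["Yes"] * 796, name="RainTomorrow")

counts = labels.value_counts()
ratio = counts["No"] / counts["Yes"]          # majority-to-minority ratio
share = labels.value_counts(normalize=True)   # class proportions

print(ratio)
print(share)
```

Indexing `value_counts()` by label rather than by position keeps the result correct even if the class order changes.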
from sklearn.preprocessing import LabelEncoder
encorder = LabelEncoder().fit(Ytrain)
Ytrain = pd.DataFrame(encorder.transform(Ytrain))
Ytest = pd.DataFrame(encorder.transform(Ytest))
Ytrain
| | 0 |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 3495 | 0 |
| 3496 | 1 |
| 3497 | 0 |
| 3498 | 0 |
| 3499 | 0 |
3500 rows × 1 columns
Ytest.head()
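The encoder is fitted on the training labels only and then reused for the test labels, so both sets share the same mapping. A small sketch on hypothetical toy labels shows which integer each class receives and how to map back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, mirroring the first rows of Ytrain
y_train = np.array(["No", "No", "No", "Yes", "No"])
y_test = np.array(["Yes", "No"])

enc = LabelEncoder().fit(y_train)      # learn the mapping from training labels only
y_train_enc = enc.transform(y_train)
y_test_enc = enc.transform(y_test)     # reuse the same mapping on the test labels

print(enc.classes_)                    # classes in sorted order; position = encoded value
print(enc.inverse_transform(y_train_enc))
```

Because classes are sorted alphabetically, "No" becomes 0 and "Yes" becomes 1, so the minority class (rain) is the positive class.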
3. Explore the features
3.1 Descriptive statistics and outliers
Ytrain.to_csv("path/where/you/want/to/save/filename.csv")
Xtrain.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
| | count | mean | std | min | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MinTemp | 3486.0 | 12.225645 | 6.396243 | -6.5 | -1.715 | 1.800 | 4.1 | 7.7 | 12.0 | 16.7 | 20.9 | 25.900 | 29.0 |
| MaxTemp | 3489.0 | 23.245543 | 7.201839 | -3.7 | 8.888 | 12.840 | 14.5 | 18.0 | 22.5 | 28.4 | 33.0 | 40.400 | 46.4 |
| Rainfall | 3467.0 | 2.487049 | 7.949686 | 0.0 | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.8 | 6.6 | 41.272 | 115.8 |
| Evaporation | 1983.0 | 5.619163 | 4.383098 | 0.0 | 0.400 | 0.800 | 1.4 | 2.6 | 4.8 | 7.4 | 10.2 | 20.600 | 56.0 |
| Sunshine | 1790.0 | 7.508659 | 3.805841 | 0.0 | 0.000 | 0.345 | 1.4 | 4.6 | 8.3 | 10.6 | 12.0 | 13.300 | 13.9 |
| WindGustSpeed | 3263.0 | 39.858413 | 13.219607 | 9.0 | 15.000 | 20.000 | 24.0 | 31.0 | 39.0 | 48.0 | 57.0 | 76.000 | 117.0 |
| WindSpeed9am | 3466.0 | 14.046163 | 8.670472 | 0.0 | 0.000 | 0.000 | 4.0 | 7.0 | 13.0 | 19.0 | 26.0 | 37.000 | 65.0 |
| WindSpeed3pm | 3437.0 | 18.553390 | 8.611818 | 0.0 | 2.000 | 6.000 | 7.0 | 13.0 | 19.0 | 24.0 | 30.0 | 43.000 | 65.0 |
| Humidity9am | 3459.0 | 69.069095 | 18.787698 | 2.0 | 18.000 | 35.000 | 45.0 | 57.0 | 70.0 | 83.0 | 94.0 | 100.000 | 100.0 |
| Humidity3pm | 3408.0 | 51.651995 | 20.697872 | 2.0 | 9.000 | 17.000 | 23.0 | 37.0 | 52.0 | 66.0 | 79.0 | 98.000 | 100.0 |
| Pressure9am | 3154.0 | 1017.622067 | 7.065236 | 985.1 | 1000.506 | 1006.100 | 1008.9 | 1012.8 | 1017.6 | 1022.3 | 1027.0 | 1033.247 | 1038.1 |
| Pressure3pm | 3154.0 | 1015.227077 | 7.032531 | 980.2 | 998.000 | 1004.000 | 1006.5 | 1010.3 | 1015.2 | 1020.0 | 1024.4 | 1030.800 | 1036.0 |
| Cloud9am | 2171.0 | 4.491939 | 2.858781 | 0.0 | 0.000 | 0.000 | 1.0 | 1.0 | 5.0 | 7.0 | 8.0 | 8.000 | 8.0 |
| Cloud3pm | 2095.0 | 4.603819 | 2.655765 | 0.0 | 0.000 | 0.000 | 1.0 | 2.0 | 5.0 | 7.0 | 8.0 | 8.000 | 8.0 |
| Temp9am | 3481.0 | 16.989859 | 6.537552 | -5.2 | 2.400 | 7.000 | 9.0 | 12.2 | 16.6 | 21.6 | 26.0 | 31.000 | 38.0 |
| Temp3pm | 3431.0 | 21.719003 | 7.031199 | -4.1 | 7.460 | 11.500 | 13.3 | 16.6 | 21.0 | 26.6 | 31.4 | 38.600 | 45.9 |
Xtest.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T
| | count | mean | std | min | 1% | 5% | 10% | 25% | 50% | 75% | 90% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MinTemp | 1493.0 | 11.916812 | 6.375377 | -8.5 | -2.024 | 1.600 | 3.70 | 7.3 | 11.8 | 16.5 | 20.48 | 25.316 | 28.3 |
| MaxTemp | 1498.0 | 22.906809 | 6.986043 | -0.8 | 9.134 | 13.000 | 14.50 | 17.8 | 22.4 | 27.8 | 32.60 | 38.303 | 45.1 |
| Rainfall | 1483.0 | 2.241807 | 7.988822 | 0.0 | 0.000 | 0.000 | 0.00 | 0.0 | 0.0 | 0.8 | 5.20 | 35.372 | 108.2 |
| Evaporation | 858.0 | 5.657809 | 4.105762 | 0.0 | 0.400 | 1.000 | 1.60 | 2.8 | 4.8 | 7.6 | 10.40 | 19.458 | 38.8 |
| Sunshine | 781.0 | 7.677465 | 3.862294 | 0.0 | 0.000 | 0.300 | 1.50 | 4.7 | 8.6 | 10.7 | 12.20 | 13.400 | 13.9 |
| WindGustSpeed | 1406.0 | 40.044097 | 14.027052 | 9.0 | 15.000 | 20.000 | 24.00 | 30.0 | 39.0 | 48.0 | 57.00 | 78.000 | 122.0 |
| WindSpeed9am | 1483.0 | 13.986514 | 9.124337 | 0.0 | 0.000 | 0.000 | 4.00 | 7.0 | 13.0 | 20.0 | 26.00 | 39.360 | 72.0 |
| WindSpeed3pm | 1482.0 | 18.601215 | 8.850446 | 0.0 | 2.000 | 6.000 | 7.00 | 13.0 | 19.0 | 24.0 | 31.00 | 43.000 | 56.0 |
| Humidity9am | 1477.0 | 68.688558 | 18.876448 | 4.0 | 20.000 | 36.000 | 44.00 | 57.0 | 69.0 | 82.0 | 95.00 | 100.000 | 100.0 |
| Humidity3pm | 1472.0 | 51.431386 | 20.459957 | 2.0 | 8.710 | 18.000 | 23.00 | 37.0 | 52.0 | 66.0 | 78.00 | 96.290 | 100.0 |
| Pressure9am | 1352.0 | 1017.763536 | 6.910275 | 988.5 | 1000.900 | 1006.255 | 1008.61 | 1013.2 | 1017.8 | 1022.3 | 1026.50 | 1033.449 | 1038.2 |
| Pressure3pm | 1350.0 | 1015.397926 | 6.916976 | 986.2 | 999.198 | 1003.900 | 1006.49 | 1010.9 | 1015.4 | 1020.0 | 1024.20 | 1031.151 | 1036.9 |
| Cloud9am | 940.0 | 4.494681 | 2.870468 | 0.0 | 0.000 | 0.000 | 1.00 | 1.0 | 5.0 | 7.0 | 8.00 | 8.000 | 8.0 |
| Cloud3pm | 917.0 | 4.403490 | 2.731969 | 0.0 | 0.000 | 0.000 | 1.00 | 2.0 | 5.0 | 7.0 | 8.00 | 8.000 | 8.0 |
| Temp9am | 1486.0 | 16.751817 | 6.339816 | -5.3 | 2.370 | 6.725 | 9.00 | 12.1 | 16.5 | 21.3 | 25.45 | 30.200 | 35.1 |
| Temp3pm | 1481.0 | 21.483660 | 6.770567 | -1.2 | 8.540 | 11.800 | 13.30 | 16.5 | 20.9 | 26.2 | 30.90 | 37.400 | 42.9 |
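One way to read the percentile tables is to compare each feature's maximum with its 99th percentile: a maximum far above the 99% value hints at a long tail or possible outliers, while a ratio near 1 suggests a bounded scale. The sketch below is only a rough heuristic on hypothetical toy data, not a rule taken from the original analysis; the column `max_over_p99` is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical toy data: a long-tailed column and a bounded 0-8 column
toy = pd.DataFrame({
    "Rainfall": [0.0] * 95 + [1.0, 2.0, 3.0, 4.0, 115.8],
    "Cloud9am": [float(i % 9) for i in range(100)],
})

desc = toy.describe([0.01, 0.99]).T

# Ratio of the maximum to the 99th percentile (illustrative heuristic only)
desc["max_over_p99"] = desc["max"] / desc["99%"]

print(desc[["1%", "99%", "max", "max_over_p99"]])
```

A large ratio on its own does not prove an error; heavy rainfall days are real observations, so domain knowledge is still needed before removing anything.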
Xtrain.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 17.0 | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN |
| 1 | 2016-12-10 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 7.0 | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 |
| 2 | 2010-04-18 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 17.0 | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 |
| 3 | 2009-11-26 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 11.0 | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 |
| 4 | 2014-04-25 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 |
5 rows × 21 columns
type(Xtrain.iloc[0,0])
str
3.2 Handling difficult features: date
Xtrainc = Xtrain.copy()
Xtrainc.sort_values(by="Location")
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2796 | 2015-03-24 | Adelaide | 12.3 | 19.3 | 0.0 | 5.0 | NaN | S | 39.0 | S | ... | 13.0 | 19.0 | 59.0 | 47.0 | 1022.2 | 1021.4 | NaN | NaN | 15.1 | 17.7 |
| 2975 | 2012-08-17 | Adelaide | 7.8 | 13.2 | 17.6 | 0.8 | NaN | SW | 61.0 | SW | ... | 20.0 | 28.0 | 76.0 | 47.0 | 1012.5 | 1014.7 | NaN | NaN | 8.3 | 12.5 |
| 775 | 2013-03-16 | Adelaide | 17.4 | 23.8 | NaN | NaN | 9.7 | SSE | 46.0 | S | ... | 9.0 | 19.0 | 63.0 | 57.0 | 1019.9 | 1020.5 | NaN | NaN | 19.1 | 20.7 |
| 861 | 2011-07-12 | Adelaide | 7.9 | 11.4 | 0.0 | 1.0 | 0.5 | N | 20.0 | NNE | ... | 7.0 | 7.0 | 70.0 | 59.0 | 1028.7 | 1025.7 | NaN | NaN | 8.4 | 11.3 |
| 2906 | 2015-08-24 | Adelaide | 9.2 | 14.3 | 0.0 | NaN | NaN | SE | 48.0 | SE | ... | 17.0 | 19.0 | 64.0 | 42.0 | 1024.7 | 1024.1 | NaN | NaN | 9.9 | 13.4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2223 | 2009-05-08 | Woomera | 9.2 | 20.6 | 0.0 | 5.2 | 10.4 | ESE | 37.0 | SE | ... | 19.0 | 19.0 | 64.0 | 34.0 | 1030.5 | 1026.9 | 0.0 | 1.0 | 13.7 | 20.1 |
| 1984 | 2014-05-26 | Woomera | 15.5 | 23.6 | 0.0 | 24.0 | NaN | NNW | 43.0 | NNE | ... | 9.0 | 26.0 | 49.0 | 37.0 | 1014.2 | 1010.3 | 7.0 | 7.0 | 18.0 | 21.5 |
| 1592 | 2012-01-10 | Woomera | 16.8 | 26.7 | 0.0 | 10.0 | 5.3 | SW | 46.0 | S | ... | 20.0 | 22.0 | 52.0 | 33.0 | 1019.1 | 1016.8 | 4.0 | 6.0 | 18.3 | 24.9 |
| 2824 | 2015-11-03 | Woomera | 16.2 | 28.5 | 7.8 | 4.2 | 4.5 | WSW | 80.0 | NE | ... | 26.0 | 50.0 | 76.0 | 53.0 | 1009.6 | 1006.8 | 6.0 | 7.0 | 20.5 | 26.2 |
| 1005 | 2010-05-14 | Woomera | 3.9 | 19.3 | 0.0 | 5.8 | 10.5 | NE | 33.0 | ENE | ... | 15.0 | 13.0 | 43.0 | 19.0 | 1020.2 | 1016.4 | 1.0 | 1.0 | 11.5 | 18.5 |
3500 rows × 21 columns
Xtrain.iloc[:,0].value_counts()
2015-10-12 6
2014-05-16 6
2015-07-03 6
2009-03-30 5
2016-09-07 5
..
2010-06-14 1
2013-12-01 1
2009-01-18 1
2014-11-24 1
2014-04-04 1
Name: Date, Length: 2141, dtype: int64
Xtrain.loc[Xtrain.iloc[:,0] == "2015-08-24",:]
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 17.0 | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN |
| 2906 | 2015-08-24 | Adelaide | 9.2 | 14.3 | 0.0 | NaN | NaN | SE | 48.0 | SE | ... | 17.0 | 19.0 | 64.0 | 42.0 | 1024.7 | 1024.1 | NaN | NaN | 9.9 | 13.4 |
2 rows × 21 columns
Xtrain.iloc[:,0].value_counts().count()
2141
Xtrain["Rainfall"].head(20)
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.2
8 0.0
9 0.2
10 1.0
11 0.0
12 0.2
13 0.0
14 0.0
15 3.0
16 0.2
17 0.0
18 35.2
19 0.0
Name: Rainfall, dtype: float64
Xtrain["Rainfall"].isnull().sum()
33
Xtrain.loc[Xtrain.loc[:,"Rainfall"] >= 1,"RainToday"] = "Yes"
Xtrain.loc[Xtrain.loc[:,"Rainfall"] < 1,"RainToday"] = "No"
# `== np.nan` is always False (NaN never equals itself), so test for missing values with isnull()
Xtrain.loc[Xtrain.loc[:,"Rainfall"].isnull(),"RainToday"] = np.nan
Xtrain.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 2016-12-10 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 2010-04-18 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 2009-11-26 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 2014-04-25 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtrain.loc[:,"RainToday"].value_counts()
No 2642
Yes 825
Name: RainToday, dtype: int64
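The counts above sum to 3467, so the 33 rows with missing Rainfall have no RainToday value. Note that a comparison with `np.nan` using `==` never matches, because NaN is not equal to itself; those rows remain NaN simply because neither threshold assignment touches them. A minimal sketch on hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rainfall values, one of them missing
rain = pd.Series([0.0, 0.2, 35.2, np.nan], name="Rainfall")

eq_mask = rain == np.nan      # always False, even for the NaN entry
null_mask = rain.isnull()     # correctly flags the missing entry

# Threshold at 1 mm, then restore NaN where Rainfall itself is missing
rain_today = pd.Series(np.where(rain >= 1, "Yes", "No"), index=rain.index, dtype=object)
rain_today[null_mask] = np.nan

print(eq_mask.sum(), null_mask.sum())
print(rain_today)
```

Without the explicit `isnull()` step, `np.where` would label the missing row "No", since any comparison with NaN evaluates to False.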
Xtest.loc[Xtest["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtest.loc[Xtest["Rainfall"] < 1,"RainToday"] = "No"
# `== np.nan` is always False (NaN never equals itself), so test for missing values with isnull()
Xtest.loc[Xtest["Rainfall"].isnull(),"RainToday"] = np.nan
Xtrain.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-08-24 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 2016-12-10 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 2010-04-18 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 2009-11-26 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 2014-04-25 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtest.head()
| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-23 | NorahHead | 22.0 | 27.8 | 25.2 | NaN | NaN | SSW | 57.0 | S | ... | 37.0 | 91.0 | 86.0 | 1006.6 | 1008.1 | NaN | NaN | 26.2 | 23.1 | Yes |
| 1 | 2009-03-05 | MountGambier | 12.0 | 18.6 | 2.2 | 3.0 | 7.8 | SW | 52.0 | SW | ... | 28.0 | 88.0 | 62.0 | 1020.2 | 1019.9 | 8.0 | 7.0 | 14.8 | 17.5 | Yes |
| 2 | 2010-03-05 | MountGinini | 9.1 | 13.3 | NaN | NaN | NaN | NE | 41.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2013-10-26 | Wollongong | 13.1 | 20.3 | 0.0 | NaN | NaN | SW | 33.0 | W | ... | 24.0 | 40.0 | 51.0 | 1021.3 | 1019.5 | NaN | NaN | 16.8 | 19.6 | No |
| 4 | 2016-11-28 | Sale | 12.2 | 20.0 | 0.4 | NaN | NaN | E | 33.0 | SW | ... | 19.0 | 92.0 | 69.0 | 1015.6 | 1013.2 | 8.0 | 4.0 | 13.6 | 19.0 | No |
5 rows × 22 columns
Xtrain.loc[0,"Date"].split("-")
['2015', '08', '24']
int(Xtrain.loc[0,"Date"].split("-")[1])
8
Xtrain["Date"] = Xtrain.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
Xtrain.loc[:,"Date"].value_counts()
3 334
5 324
7 316
6 302
9 302
1 300
11 299
10 282
4 265
2 264
12 259
8 253
Name: Date, dtype: int64
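The month is extracted above by splitting the date string on "-" and taking the second piece. As a hedged alternative, the same result can be obtained by parsing the strings with `pd.to_datetime` and reading `.dt.month`; the sketch below compares both approaches on three dates taken from the rows shown earlier:

```python
import pandas as pd

# Three date strings as they appear in the Date column
dates = pd.Series(["2015-08-24", "2016-12-10", "2010-04-18"], name="Date")

# Approach used in this walkthrough: split the string and keep the month part
month_split = dates.apply(lambda x: int(x.split("-")[1]))

# Alternative: parse to datetime first, then read the month attribute
month_dt = pd.to_datetime(dates, format="%Y-%m-%d").dt.month

print(month_split.tolist(), month_dt.tolist())
```

Parsing has the side benefit of raising an error on malformed dates, whereas string splitting would silently return whatever sits in the second position.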
Xtrain = Xtrain.rename(columns={"Date":"Month"})
Xtrain.head()
| | Month | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 12 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 4 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 11 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 4 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtest["Date"] = Xtest.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
Xtest = Xtest.rename(columns={"Date":"Month"})
Xtest.head()
| | Month | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NorahHead | 22.0 | 27.8 | 25.2 | NaN | NaN | SSW | 57.0 | S | ... | 37.0 | 91.0 | 86.0 | 1006.6 | 1008.1 | NaN | NaN | 26.2 | 23.1 | Yes |
| 1 | 3 | MountGambier | 12.0 | 18.6 | 2.2 | 3.0 | 7.8 | SW | 52.0 | SW | ... | 28.0 | 88.0 | 62.0 | 1020.2 | 1019.9 | 8.0 | 7.0 | 14.8 | 17.5 | Yes |
| 2 | 3 | MountGinini | 9.1 | 13.3 | NaN | NaN | NaN | NE | 41.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 10 | Wollongong | 13.1 | 20.3 | 0.0 | NaN | NaN | SW | 33.0 | W | ... | 24.0 | 40.0 | 51.0 | 1021.3 | 1019.5 | NaN | NaN | 16.8 | 19.6 | No |
| 4 | 11 | Sale | 12.2 | 20.0 | 0.4 | NaN | NaN | E | 33.0 | SW | ... | 19.0 | 92.0 | 69.0 | 1015.6 | 1013.2 | 8.0 | 4.0 | 13.6 | 19.0 | No |
5 rows × 22 columns
3.3 Handling difficult features: location
Xtrain.loc[:,"Location"].value_counts()
Bendigo 94
Sydney 92
SalmonGums 92
Canberra 87
Ballarat 87
Darwin 86
Cairns 84
Wollongong 82
Albury 80
Townsville 80
Newcastle 78
Adelaide 77
BadgerysCreek 77
Dartmoor 76
Moree 76
Launceston 76
CoffsHarbour 76
Witchcliffe 76
WaggaWagga 75
NorahHead 74
Mildura 72
MelbourneAirport 72
SydneyAirport 72
Cobar 71
Richmond 71
PerthAirport 71
Hobart 71
Perth 70
Walpole 69
PearceRAAF 69
NorfolkIsland 68
Nuriootpa 68
MountGambier 68
Woomera 67
Albany 67
GoldCoast 66
Watsonia 66
Penrith 65
MountGinini 64
Brisbane 63
AliceSprings 63
Williamtown 63
Tuggeranong 62
Sale 62
Portland 60
Katherine 53
Melbourne 52
Uluru 46
Nhil 44
Name: Location, dtype: int64
Xtrain.loc[:,"Location"].value_counts().count()
49
cityll = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_cityll.csv",index_col=0)
city_climate = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_Cityclimate.csv")
cityll.head()
| | City | Latitude | Longitude | Latitudedir | Longitudedir |
|---|---|---|---|---|---|
| 0 | Adelaide | 34.9285° | 138.6007° | S, | E |
| 1 | Albany | 35.0275° | 117.8840° | S, | E |
| 2 | Albury | 36.0737° | 146.9135° | S, | E |
| 3 | Wodonga | 36.1241° | 146.8818° | S, | E |
| 4 | AliceSprings | 23.6980° | 133.8807° | S, | E |
float(cityll.loc[0,"Latitude"][:-1])
34.9285
cityll.loc[:,"Latitudedir"].value_counts()
S, 100
Name: Latitudedir, dtype: int64
city_climate.head()
| | City | Climate |
|---|---|---|
| 0 | Adelaide | Warm temperate |
| 1 | Albany | Mild temperate |
| 2 | Albury | Hot dry summer, cool winter |
| 3 | Wodonga | Hot dry summer, cool winter |
| 4 | AliceSprings | Hot dry summer, warm winter |
cityll["Latitudenum"] = cityll["Latitude"].apply(lambda x:float(x[:-1]))
cityll["Longitudenum"] = cityll["Longitude"].apply(lambda x:float(x[:-1]))
citylld = cityll.iloc[:,[0,5,6]]
citylld
| | City | Latitudenum | Longitudenum |
|---|---|---|---|
| 0 | Adelaide | 34.9285 | 138.6007 |
| 1 | Albany | 35.0275 | 117.8840 |
| 2 | Albury | 36.0737 | 146.9135 |
| 3 | Wodonga | 36.1241 | 146.8818 |
| 4 | AliceSprings | 23.6980 | 133.8807 |
| ... | ... | ... | ... |
| 95 | Wollongong | 34.4278 | 150.8931 |
| 96 | Wyndham | 15.4825 | 128.1228 |
| 97 | Yalgoo | 28.3445 | 116.6851 |
| 98 | Yulara | 25.2335 | 130.9849 |
| 99 | Uluru | 25.3444 | 131.0369 |
100 rows × 3 columns
citylld = citylld.copy()  # work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
citylld["climate"] = city_climate.iloc[:,-1]
citylld.head()
| | City | Latitudenum | Longitudenum | climate |
|---|---|---|---|---|
| 0 | Adelaide | 34.9285 | 138.6007 | Warm temperate |
| 1 | Albany | 35.0275 | 117.8840 | Mild temperate |
| 2 | Albury | 36.0737 | 146.9135 | Hot dry summer, cool winter |
| 3 | Wodonga | 36.1241 | 146.8818 | Hot dry summer, cool winter |
| 4 | AliceSprings | 23.6980 | 133.8807 | Hot dry summer, warm winter |
citylld.loc[:,"climate"].value_counts()
Hot dry summer, cool winter 24
Warm temperate 18
Hot dry summer, warm winter 18
High humidity summer, warm winter 17
Mild temperate 9
Cool temperate 9
Warm humid summer, mild winter 5
Name: climate, dtype: int64
samplecity = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_samplecity.csv",index_col=0)
samplecity.head()
| | City | Latitude | Longitude | Latitudedir | Longitudedir |
|---|---|---|---|---|---|
| 0 | Canberra | 35.2809° | 149.1300° | S, | E |
| 1 | Sydney | 33.8688° | 151.2093° | S, | E |
| 2 | Perth | 31.9505° | 115.8605° | S, | E |
| 3 | Darwin | 12.4634° | 130.8456° | S, | E |
| 4 | Hobart | 42.8821° | 147.3272° | S, | E |
samplecity["Latitudenum"] = samplecity["Latitude"].apply(lambda x:float(x[:-1]))
samplecity["Longitudenum"] = samplecity["Longitude"].apply(lambda x:float(x[:-1]))
samplecityd = samplecity.iloc[:,[0,5,6]].copy()  # explicit copy, so later column assignments do not trigger SettingWithCopyWarning
samplecityd.head()
| | City | Latitudenum | Longitudenum |
|---|---|---|---|
| 0 | Canberra | 35.2809 | 149.1300 |
| 1 | Sydney | 33.8688 | 151.2093 |
| 2 | Perth | 31.9505 | 115.8605 |
| 3 | Darwin | 12.4634 | 130.8456 |
| 4 | Hobart | 42.8821 | 147.3272 |
from math import radians, sin, cos, acos
citylld.loc[:,"slat"] = citylld.iloc[:,1].apply(lambda x : radians(x))
citylld.loc[:,"slon"] = citylld.iloc[:,2].apply(lambda x : radians(x))
samplecityd.loc[:,"elat"] = samplecityd.iloc[:,1].apply(lambda x : radians(x))
samplecityd.loc[:,"elon"] = samplecityd.iloc[:,2].apply(lambda x : radians(x))
import sys
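for i in range(samplecityd.shape[0]):
    # radians of all reference cities (vectors) and of the current weather station (scalars)
    slat = citylld.loc[:,"slat"]
    slon = citylld.loc[:,"slon"]
    elat = samplecityd.loc[i,"elat"]
    elon = samplecityd.loc[i,"elon"]
    # great-circle distance in km (spherical law of cosines, Earth radius 6371.01 km)
    dist = 6371.01 * np.arccos(np.sin(slat)*np.sin(elat) +
                               np.cos(slat)*np.cos(elat)*np.cos(slon.values - elon))
    # position of the nearest reference city
    city_index = np.argsort(dist.values)[0]
    samplecityd.loc[i,"closest_city"] = citylld.loc[city_index,"City"]
    samplecityd.loc[i,"climate"] = citylld.loc[city_index,"climate"]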
samplecityd.head(300)
| | City | Latitudenum | Longitudenum | elat | elon | closest_city | climate |
|---|---|---|---|---|---|---|---|
| 0 | Canberra | 35.2809 | 149.1300 | 0.615768 | 2.602810 | Canberra | Cool temperate |
| 1 | Sydney | 33.8688 | 151.2093 | 0.591122 | 2.639100 | Sydney | Warm temperate |
| 2 | Perth | 31.9505 | 115.8605 | 0.557641 | 2.022147 | Perth | Warm temperate |
| 3 | Darwin | 12.4634 | 130.8456 | 0.217527 | 2.283687 | Darwin | High humidity summer, warm winter |
| 4 | Hobart | 42.8821 | 147.3272 | 0.748434 | 2.571345 | Hobart | Cool temperate |
| 5 | Brisbane | 27.4698 | 153.0251 | 0.479438 | 2.670792 | Brisbane | Warm humid summer, mild winter |
| 6 | Adelaide | 34.9285 | 138.6007 | 0.609617 | 2.419039 | Adelaide | Warm temperate |
| 7 | Bendigo | 36.7570 | 144.2794 | 0.641531 | 2.518151 | Ballarat | Cool temperate |
| 8 | Townsville | 19.2590 | 146.8169 | 0.336133 | 2.562438 | Townsville | High humidity summer, warm winter |
| 9 | AliceSprings | 23.6980 | 133.8807 | 0.413608 | 2.336659 | AliceSprings | Hot dry summer, warm winter |
| 10 | MountGambier | 37.8284 | 140.7804 | 0.660230 | 2.457082 | KingstonSE | Mild temperate |
| 11 | Launceston | 41.4332 | 147.1441 | 0.723146 | 2.568149 | Launceston | Cool temperate |
| 12 | Ballarat | 37.5622 | 143.8503 | 0.655584 | 2.510661 | Ballarat | Cool temperate |
| 13 | Albany | 35.0275 | 117.8840 | 0.611345 | 2.057464 | Albany | Mild temperate |
| 14 | Albury | 36.0737 | 146.9135 | 0.629605 | 2.564124 | Albury | Hot dry summer, cool winter |
| 15 | PerthAirport | 31.9440 | 115.9680 | 0.557528 | 2.024023 | PerthAirport | Warm temperate |
| 16 | MelbourneAirport | 37.6697 | 144.8488 | 0.657460 | 2.528088 | MelbourneAirport | Mild temperate |
| 17 | Mildura | 34.2080 | 142.1246 | 0.597042 | 2.480542 | Mildura | Hot dry summer, cool winter |
| 18 | SydneyAirport | 33.9399 | 151.1753 | 0.592363 | 2.638507 | SydneyAirport | Warm temperate |
| 19 | Nuriootpa | 34.4666 | 138.9917 | 0.601556 | 2.425863 | Adelaide | Warm temperate |
| 20 | Sale | 38.1026 | 147.0730 | 0.665016 | 2.566908 | LakesEntrance | Mild temperate |
| 21 | Watsonia | 37.7080 | 145.0830 | 0.658129 | 2.532176 | Ivanhoe | Hot dry summer, cool winter |
| 22 | Tuggeranong | 35.4244 | 149.0888 | 0.618272 | 2.602090 | Canberra | Cool temperate |
| 23 | Portland | 38.3609 | 141.6041 | 0.669524 | 2.471458 | MountGambier | Mild temperate |
| 24 | Woomera | 31.1656 | 136.8193 | 0.543942 | 2.387947 | LeighCreek | Warm temperate |
| 25 | Cairns | 16.9186 | 145.7781 | 0.295285 | 2.544308 | Cairns | High humidity summer, warm winter |
| 26 | Cobar | 31.4949 | 145.8402 | 0.549690 | 2.545392 | Bourke | Hot dry summer, cool winter |
| 27 | Wollongong | 34.4278 | 150.8931 | 0.600878 | 2.633581 | Wollongong | Warm temperate |
| 28 | GoldCoast | 28.0167 | 153.4000 | 0.488984 | 2.677335 | Southport | Cool temperate |
| 29 | WaggaWagga | 35.1082 | 147.3598 | 0.612754 | 2.571914 | Albury | Hot dry summer, cool winter |
| 30 | NorfolkIsland | 29.0408 | 167.9547 | 0.506858 | 2.931363 | LordHoweIsland | Warm temperate |
| 31 | Penrith | 33.7500 | 150.7000 | 0.589049 | 2.630211 | SydneyAirport | Warm temperate |
| 32 | SalmonGums | 32.9879 | 121.6422 | 0.575747 | 2.123057 | Norseman | Hot dry summer, cool winter |
| 33 | Newcastle | 32.9283 | 151.7817 | 0.574707 | 2.649090 | Newcastle | Warm temperate |
| 34 | CoffsHarbour | 30.2986 | 153.1094 | 0.528810 | 2.672263 | CoffsHarbour | Warm humid summer, mild winter |
| 35 | Witchcliffe | 34.0082 | 115.1155 | 0.593555 | 2.009144 | MargaretRiver | Warm temperate |
| 36 | Richmond | 37.8230 | 144.9980 | 0.660136 | 2.530693 | Melbourne | Mild temperate |
| 37 | Dartmoor | 37.9144 | 141.2730 | 0.661731 | 2.465679 | MountGambier | Mild temperate |
| 38 | NorahHead | 33.2833 | 151.5667 | 0.580903 | 2.645338 | Swansea | Cool temperate |
| 39 | BadgerysCreek | 33.8829 | 150.7609 | 0.591368 | 2.631274 | SydneyAirport | Warm temperate |
| 40 | MountGinini | 35.5294 | 148.7723 | 0.620105 | 2.596566 | Canberra | Cool temperate |
| 41 | Moree | 29.4658 | 149.8339 | 0.514275 | 2.615095 | Goondiwindi | Hot dry summer, warm winter |
| 42 | Walpole | 34.9551 | 116.7696 | 0.610082 | 2.038014 | Albany | Mild temperate |
| 43 | PearceRAAF | 31.6666 | 116.0257 | 0.552686 | 2.025030 | PerthAirport | Warm temperate |
| 44 | Williamtown | 32.8150 | 151.8428 | 0.572730 | 2.650157 | Newcastle | Warm temperate |
| 45 | Melbourne | 37.8136 | 144.9631 | 0.659972 | 2.530083 | Melbourne | Mild temperate |
| 46 | Nhil | 36.3328 | 141.6503 | 0.634127 | 2.472264 | Horsham | Mild temperate |
| 47 | Katherine | 14.4521 | 132.2715 | 0.252237 | 2.308573 | Katherine | High humidity summer, warm winter |
| 48 | Uluru | 25.3444 | 131.0369 | 0.442343 | 2.287025 | Uluru | Hot dry summer, warm winter |
samplecityd["climate"].value_counts()
Warm temperate 15
Mild temperate 10
Cool temperate 9
Hot dry summer, cool winter 6
High humidity summer, warm winter 4
Hot dry summer, warm winter 3
Warm humid summer, mild winter 2
Name: climate, dtype: int64
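The nearest-city lookup above relies on the spherical law of cosines. The sketch below isolates that formula in a small helper and applies it to one station; the function name `great_circle_km` and the three-city reference list are assumptions introduced for illustration, and the `np.clip` guard against rounding errors just outside [-1, 1] is an addition that the loop above does not use. Coordinates are taken from the table above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.01  # same radius as used in the loop above

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines.

    Inputs are in degrees and may be scalars or NumPy arrays.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (np.sin(lat1) * np.sin(lat2)
                 + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    # Clip to the valid arccos domain to absorb floating-point rounding
    return EARTH_RADIUS_KM * np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Hypothetical mini reference list (a subset of the cities above)
ref_names = np.array(["Canberra", "Sydney", "Perth"])
ref_lat = np.array([35.2809, 33.8688, 31.9505])
ref_lon = np.array([149.1300, 151.2093, 115.8605])

# Nearest reference city for Tuggeranong (35.4244, 149.0888)
dist = great_circle_km(ref_lat, ref_lon, 35.4244, 149.0888)
closest = ref_names[np.argmin(dist)]
print(closest, dist.round(1))
```

The latitudes are stored without their southern sign; since every city lies in the same hemisphere, dropping the sign consistently does not change the distances.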
locafinal = samplecityd.iloc[:,[0,-1]]
locafinal.head()
| | Location | Climate |
|---|---|---|
| 0 | Canberra | Cool temperate |
| 1 | Sydney | Warm temperate |
| 2 | Perth | Warm temperate |
| 3 | Darwin | High humidity summer, warm winter |
| 4 | Hobart | Cool temperate |
locafinal.columns = ["Location","Climate"]
locafinal = locafinal.set_index(keys="Location")
locafinal
| Location | Climate |
|---|---|
| Canberra | Cool temperate |
| Sydney | Warm temperate |
| Perth | Warm temperate |
| Darwin | High humidity summer, warm winter |
| Hobart | Cool temperate |
| Brisbane | Warm humid summer, mild winter |
| Adelaide | Warm temperate |
| Bendigo | Cool temperate |
| Townsville | High humidity summer, warm winter |
| AliceSprings | Hot dry summer, warm winter |
| MountGambier | Mild temperate |
| Launceston | Cool temperate |
| Ballarat | Cool temperate |
| Albany | Mild temperate |
| Albury | Hot dry summer, cool winter |
| PerthAirport | Warm temperate |
| MelbourneAirport | Mild temperate |
| Mildura | Hot dry summer, cool winter |
| SydneyAirport | Warm temperate |
| Nuriootpa | Warm temperate |
| Sale | Mild temperate |
| Watsonia | Hot dry summer, cool winter |
| Tuggeranong | Cool temperate |
| Portland | Mild temperate |
| Woomera | Warm temperate |
| Cairns | High humidity summer, warm winter |
| Cobar | Hot dry summer, cool winter |
| Wollongong | Warm temperate |
| GoldCoast | Cool temperate |
| WaggaWagga | Hot dry summer, cool winter |
| NorfolkIsland | Warm temperate |
| Penrith | Warm temperate |
| SalmonGums | Hot dry summer, cool winter |
| Newcastle | Warm temperate |
| CoffsHarbour | Warm humid summer, mild winter |
| Witchcliffe | Warm temperate |
| Richmond | Mild temperate |
| Dartmoor | Mild temperate |
| NorahHead | Cool temperate |
| BadgerysCreek | Warm temperate |
| MountGinini | Cool temperate |
| Moree | Hot dry summer, warm winter |
| Walpole | Mild temperate |
| PearceRAAF | Warm temperate |
| Williamtown | Warm temperate |
| Melbourne | Mild temperate |
| Nhil | Mild temperate |
| Katherine | High humidity summer, warm winter |
| Uluru | Hot dry summer, warm winter |
locafinal.to_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\samplelocation.csv")
Xtrain.head()
| | Month | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8 | Katherine | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 12 | Tuggeranong | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 4 | Albany | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 11 | Sale | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 4 | Mildura | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
import re
Xtrain["Location"] = Xtrain["Location"].map(locafinal.iloc[:,0])
Xtrain.head()
| | Month | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8 | High humidity summer, warm winter | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 12 | Cool temperate | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 4 | Mild temperate | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 11 | Mild temperate | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 4 | Hot dry summer, cool winter | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtrain["Location"] = Xtrain["Location"].apply(lambda x:re.sub(",","",x.strip()))
Xtest["Location"] = Xtest["Location"].map(locafinal.iloc[:,0]).apply(lambda x:re.sub(",","",x.strip()))
Xtrain = Xtrain.rename(columns={"Location":"Climate"})
Xtest = Xtest.rename(columns={"Location":"Climate"})
Xtrain.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8 | High humidity summer warm winter | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | NaN | 27.5 | NaN | No |
| 1 | 12 | Cool temperate | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | NaN | NaN | 14.6 | 23.6 | No |
| 2 | 4 | Mild temperate | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | NaN | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 11 | Mild temperate | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 4 | Hot dry summer cool winter | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtest.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 1 | Cool temperate | 22.0 | 27.8 | 25.2 | NaN | NaN | SSW | 57.0 | S | ... | 37.0 | 91.0 | 86.0 | 1006.6 | 1008.1 | NaN | NaN | 26.2 | 23.1 | Yes |
| 1 | 3 | Mild temperate | 12.0 | 18.6 | 2.2 | 3.0 | 7.8 | SW | 52.0 | SW | ... | 28.0 | 88.0 | 62.0 | 1020.2 | 1019.9 | 8.0 | 7.0 | 14.8 | 17.5 | Yes |
| 2 | 3 | Cool temperate | 9.1 | 13.3 | NaN | NaN | NaN | NE | 41.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 10 | Warm temperate | 13.1 | 20.3 | 0.0 | NaN | NaN | SW | 33.0 | W | ... | 24.0 | 40.0 | 51.0 | 1021.3 | 1019.5 | NaN | NaN | 16.8 | 19.6 | No |
| 4 | 11 | Mild temperate | 12.2 | 20.0 | 0.4 | NaN | NaN | E | 33.0 | SW | ... | 19.0 | 92.0 | 69.0 | 1015.6 | 1013.2 | 8.0 | 4.0 | 13.6 | 19.0 | No |
5 rows × 22 columns
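The `map` followed by `x.strip()` above assumes that every `Location` value appears in `locafinal`; an unmapped location would come back as `NaN`, and calling `.strip()` on it would likely raise an error. The following is a minimal sketch of a more defensive variant on toy data. The lookup `climate_by_location`, the placeholder label `"Unknown"`, and the location `"Atlantis"` are assumptions made for illustration and do not come from the original data.

```python
import pandas as pd

# Hypothetical lookup standing in for `locafinal` (index: Location, value: Climate).
climate_by_location = pd.Series(
    {"Canberra": "Cool temperate", "Darwin": "High humidity summer, warm winter"},
    name="Climate",
)

# Toy feature column; "Atlantis" is deliberately absent from the lookup.
locations = pd.Series(["Canberra", "Darwin", "Atlantis"], name="Location")

# Series.map yields NaN for keys that are missing from the lookup.
climate = locations.map(climate_by_location)
unmapped = locations[climate.isna()].unique().tolist()

# Fill the gaps with a placeholder, then strip commas and surrounding whitespace.
climate = climate.fillna("Unknown").str.replace(",", "", regex=False).str.strip()

print(unmapped)
print(climate.tolist())
```

Checking `unmapped` before filling makes it visible when the lookup table is incomplete, rather than silently assigning a placeholder.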
3.4 Handling categorical variables: missing values
Xtrain.isnull().mean()
Month 0.000000
Climate 0.000000
MinTemp 0.004000
MaxTemp 0.003143
Rainfall 0.009429
Evaporation 0.433429
Sunshine 0.488571
WindGustDir 0.067714
WindGustSpeed 0.067714
WindDir9am 0.067429
WindDir3pm 0.024286
WindSpeed9am 0.009714
WindSpeed3pm 0.018000
Humidity9am 0.011714
Humidity3pm 0.026286
Pressure9am 0.098857
Pressure3pm 0.098857
Cloud9am 0.379714
Cloud3pm 0.401429
Temp9am 0.005429
Temp3pm 0.019714
RainToday 0.009429
dtype: float64
Xtrain.info()
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month 3500 non-null int64
1 Climate 3500 non-null object
2 MinTemp 3486 non-null float64
3 MaxTemp 3489 non-null float64
4 Rainfall 3467 non-null float64
5 Evaporation 1983 non-null float64
6 Sunshine 1790 non-null float64
7 WindGustDir 3263 non-null object
8 WindGustSpeed 3263 non-null float64
9 WindDir9am 3264 non-null object
10 WindDir3pm 3415 non-null object
11 WindSpeed9am 3466 non-null float64
12 WindSpeed3pm 3437 non-null float64
13 Humidity9am 3459 non-null float64
14 Humidity3pm 3408 non-null float64
15 Pressure9am 3154 non-null float64
16 Pressure3pm 3154 non-null float64
17 Cloud9am 2171 non-null float64
18 Cloud3pm 2095 non-null float64
19 Temp9am 3481 non-null float64
20 Temp3pm 3431 non-null float64
21 RainToday 3467 non-null object
dtypes: float64(16), int64(1), object(5)
memory usage: 601.7+ KB
Xtrain.dtypes == "object"
Month False
Climate True
MinTemp False
MaxTemp False
Rainfall False
Evaporation False
Sunshine False
WindGustDir True
WindGustSpeed False
WindDir9am True
WindDir3pm True
WindSpeed9am False
WindSpeed3pm False
Humidity9am False
Humidity3pm False
Pressure9am False
Pressure3pm False
Cloud9am False
Cloud3pm False
Temp9am False
Temp3pm False
RainToday True
dtype: bool
cate = Xtrain.columns[Xtrain.dtypes == "object"].tolist()
cate
['Climate', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']
cloud = ["Cloud9am","Cloud3pm"]
cate = cate + cloud
cate
['Climate',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'Cloud9am',
'Cloud3pm']
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan,strategy="most_frequent")
si.fit(Xtrain.loc[:,cate])
SimpleImputer(strategy='most_frequent')
Xtrain.loc[:,cate] = si.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = si.transform(Xtest.loc[:,cate])
Xtrain.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8 | High humidity summer warm winter | 17.5 | 36.0 | 0.0 | 8.8 | NaN | ESE | 26.0 | NNW | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | 7.0 | 27.5 | NaN | No |
| 1 | 12 | Cool temperate | 9.5 | 25.0 | 0.0 | NaN | NaN | NNW | 33.0 | NE | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | 7.0 | 7.0 | 14.6 | 23.6 | No |
| 2 | 4 | Mild temperate | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | W | NaN | NE | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | No |
| 3 | 11 | Mild temperate | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | S | 37.0 | N | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | No |
| 4 | 4 | Hot dry summer cool winter | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | NNE | 24.0 | E | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | No |
5 rows × 22 columns
Xtest.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 1 | Cool temperate | 22.0 | 27.8 | 25.2 | NaN | NaN | SSW | 57.0 | S | ... | 37.0 | 91.0 | 86.0 | 1006.6 | 1008.1 | 7.0 | 7.0 | 26.2 | 23.1 | Yes |
| 1 | 3 | Mild temperate | 12.0 | 18.6 | 2.2 | 3.0 | 7.8 | SW | 52.0 | SW | ... | 28.0 | 88.0 | 62.0 | 1020.2 | 1019.9 | 8.0 | 7.0 | 14.8 | 17.5 | Yes |
| 2 | 3 | Cool temperate | 9.1 | 13.3 | NaN | NaN | NaN | NE | 41.0 | N | ... | NaN | NaN | NaN | NaN | NaN | 7.0 | 7.0 | NaN | NaN | No |
| 3 | 10 | Warm temperate | 13.1 | 20.3 | 0.0 | NaN | NaN | SW | 33.0 | W | ... | 24.0 | 40.0 | 51.0 | 1021.3 | 1019.5 | 7.0 | 7.0 | 16.8 | 19.6 | No |
| 4 | 11 | Mild temperate | 12.2 | 20.0 | 0.4 | NaN | NaN | E | 33.0 | SW | ... | 19.0 | 92.0 | 69.0 | 1015.6 | 1013.2 | 8.0 | 4.0 | 13.6 | 19.0 | No |
5 rows × 22 columns
Xtrain.loc[:,cate].isnull().mean()
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64
Xtest.loc[:,cate].isnull().mean()
Climate 0.0
WindGustDir 0.0
WindDir9am 0.0
WindDir3pm 0.0
RainToday 0.0
Cloud9am 0.0
Cloud3pm 0.0
dtype: float64
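The imputer above is fitted on `Xtrain` only and then applied to both `Xtrain` and `Xtest`, so the test rows are filled with the most frequent value seen in training. The sketch below illustrates that behaviour on a tiny made-up column; the toy values are assumptions and are unrelated to the weather data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data: "N" is the mode of the training column, "S" is the mode of the test column.
train = pd.DataFrame({"WindGustDir": ["N", "N", "S", np.nan]}, dtype=object)
test = pd.DataFrame({"WindGustDir": ["S", np.nan, "S"]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train)                      # the fill value is learned from train only
filled_test = imputer.transform(test)   # test gaps receive the training mode

print(imputer.statistics_)
print(filled_test.ravel())
```

The missing test value is filled with `"N"` even though `"S"` dominates the test column, which is the intended effect: no statistic is computed from the test set.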
3.5 Handling categorical variables: encoding
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe = oe.fit(Xtrain.loc[:,cate])
Xtrain.loc[:,cate] = oe.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = oe.transform(Xtest.loc[:,cate])
cate
Xtrain.loc[:,cate].head()
| | Climate | WindGustDir | WindDir9am | WindDir3pm | RainToday | Cloud9am | Cloud3pm |
| 0 | 1.0 | 2.0 | 6.0 | 0.0 | 0.0 | 0.0 | 7.0 |
| 1 | 0.0 | 6.0 | 4.0 | 6.0 | 0.0 | 7.0 | 7.0 |
| 2 | 4.0 | 13.0 | 4.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 3 | 4.0 | 8.0 | 3.0 | 8.0 | 0.0 | 6.0 | 6.0 |
| 4 | 2.0 | 5.0 | 0.0 | 6.0 | 0.0 | 2.0 | 4.0 |
Xtest.loc[:,cate].head()
| | Climate | WindGustDir | WindDir9am | WindDir3pm | RainToday | Cloud9am | Cloud3pm |
| 0 | 0.0 | 11.0 | 8.0 | 11.0 | 1.0 | 7.0 | 7.0 |
| 1 | 4.0 | 12.0 | 12.0 | 8.0 | 1.0 | 8.0 | 7.0 |
| 2 | 0.0 | 4.0 | 3.0 | 9.0 | 0.0 | 7.0 | 7.0 |
| 3 | 6.0 | 12.0 | 13.0 | 9.0 | 0.0 | 7.0 | 7.0 |
| 4 | 4.0 | 0.0 | 12.0 | 0.0 | 0.0 | 8.0 | 4.0 |
Xtrain.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8 | 1.0 | 17.5 | 36.0 | 0.0 | 8.8 | NaN | 2.0 | 26.0 | 6.0 | ... | 15.0 | 57.0 | NaN | 1016.8 | 1012.2 | 0.0 | 7.0 | 27.5 | NaN | 0.0 |
| 1 | 12 | 0.0 | 9.5 | 25.0 | 0.0 | NaN | NaN | 6.0 | 33.0 | 4.0 | ... | 17.0 | 59.0 | 31.0 | 1020.4 | 1017.5 | 7.0 | 7.0 | 14.6 | 23.6 | 0.0 |
| 2 | 4 | 4.0 | 13.0 | 22.6 | 0.0 | 3.8 | 10.4 | 13.0 | NaN | 4.0 | ... | 31.0 | 79.0 | 68.0 | 1020.3 | 1015.7 | 1.0 | 3.0 | 17.5 | 20.8 | 0.0 |
| 3 | 11 | 4.0 | 13.9 | 29.8 | 0.0 | 5.8 | 5.1 | 8.0 | 37.0 | 3.0 | ... | 28.0 | 82.0 | 44.0 | 1012.5 | 1005.9 | 6.0 | 6.0 | 18.5 | 27.5 | 0.0 |
| 4 | 4 | 2.0 | 6.0 | 23.5 | 0.0 | 2.8 | 8.6 | 5.0 | 24.0 | 0.0 | ... | 15.0 | 58.0 | 35.0 | 1019.8 | 1014.1 | 2.0 | 4.0 | 12.4 | 22.4 | 0.0 |
5 rows × 22 columns
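Because the encoder is fitted on `Xtrain` only, a category that first appears in `Xtest` would make a default `OrdinalEncoder` raise an error at transform time. The sketch below shows one way to guard against that, assuming scikit-learn 0.24 or later, where `handle_unknown="use_encoded_value"` is available. The toy wind directions and the sentinel `-1` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([["N"], ["S"], ["E"]], dtype=object)
test = np.array([["S"], ["NW"]], dtype=object)  # "NW" never appears in train

# Unseen categories are mapped to the sentinel -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded = encoder.transform(test)

print(encoder.categories_[0])  # categories learned from train, in sorted order
print(encoded.ravel())
```

Note that ordinal codes impose an order on the categories; for a kernel SVM this is a modelling choice rather than a neutral encoding.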
3.6 Handling continuous variables: filling missing values
col = Xtrain.columns.tolist()
col
['Month',
'Climate',
'MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustDir',
'WindGustSpeed',
'WindDir9am',
'WindDir3pm',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Cloud9am',
'Cloud3pm',
'Temp9am',
'Temp3pm',
'RainToday']
cate
['Climate',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'Cloud9am',
'Cloud3pm']
for i in cate:
    col.remove(i)
col
['Month',
'MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']
impmean = SimpleImputer(missing_values=np.nan,strategy = "mean")
impmean = impmean.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = impmean.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = impmean.transform(Xtest.loc[:,col])
Xtrain.isnull().mean()
Month 0.0
Climate 0.0
MinTemp 0.0
MaxTemp 0.0
Rainfall 0.0
Evaporation 0.0
Sunshine 0.0
WindGustDir 0.0
WindGustSpeed 0.0
WindDir9am 0.0
WindDir3pm 0.0
WindSpeed9am 0.0
WindSpeed3pm 0.0
Humidity9am 0.0
Humidity3pm 0.0
Pressure9am 0.0
Pressure3pm 0.0
Cloud9am 0.0
Cloud3pm 0.0
Temp9am 0.0
Temp3pm 0.0
RainToday 0.0
dtype: float64
Xtest.isnull().mean()
Month 0.0
Climate 0.0
MinTemp 0.0
MaxTemp 0.0
Rainfall 0.0
Evaporation 0.0
Sunshine 0.0
WindGustDir 0.0
WindGustSpeed 0.0
WindDir9am 0.0
WindDir3pm 0.0
WindSpeed9am 0.0
WindSpeed3pm 0.0
Humidity9am 0.0
Humidity3pm 0.0
Pressure9am 0.0
Pressure3pm 0.0
Cloud9am 0.0
Cloud3pm 0.0
Temp9am 0.0
Temp3pm 0.0
RainToday 0.0
dtype: float64
3.7 Handling continuous variables: standardization
col.remove("Month")
col
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Temp9am',
'Temp3pm']
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss = ss.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = ss.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = ss.transform(Xtest.loc[:,col])
Xtrain.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 8.0 | 1.0 | 0.826375 | 1.774044 | -0.314379 | 0.964367 | 0.000000 | 2.0 | -1.085893e+00 | 6.0 | ... | -0.416443 | -0.646283 | 0.000000 | -0.122589 | -0.453507 | 0.0 | 7.0 | 1.612270 | 0.000000 | 0.0 |
| 1 | 12.0 | 0.0 | -0.427048 | 0.244031 | -0.314379 | 0.000000 | 0.000000 | 6.0 | -5.373993e-01 | 4.0 | ... | -0.182051 | -0.539186 | -1.011310 | 0.414254 | 0.340522 | 7.0 | 7.0 | -0.366608 | 0.270238 | 0.0 |
| 2 | 4.0 | 4.0 | 0.121324 | -0.089790 | -0.314379 | -0.551534 | 1.062619 | 13.0 | -1.113509e-15 | 4.0 | ... | 1.458692 | 0.531786 | 0.800547 | 0.399342 | 0.070852 | 1.0 | 3.0 | 0.078256 | -0.132031 | 0.0 |
| 3 | 11.0 | 4.0 | 0.262334 | 0.911673 | -0.314379 | 0.054826 | -0.885225 | 8.0 | -2.239744e-01 | 3.0 | ... | 1.107105 | 0.692432 | -0.374711 | -0.763819 | -1.397352 | 6.0 | 6.0 | 0.231658 | 0.830540 | 0.0 |
| 4 | 4.0 | 2.0 | -0.975421 | 0.035393 | -0.314379 | -0.854715 | 0.401087 | 5.0 | -1.242605e+00 | 0.0 | ... | -0.416443 | -0.592734 | -0.815433 | 0.324780 | -0.168855 | 2.0 | 4.0 | -0.704091 | 0.097837 | 0.0 |
5 rows × 22 columns
Xtest.head()
| | Month | Climate | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
| 0 | 1.0 | 0.0 | 1.531425 | 0.633489 | 2.871067 | 0.000000 | 0.000000 | 11.0 | 1.343150 | 8.0 | ... | 2.161868e+00 | 1.174369 | 1.681991 | -1.643646 | -1.067755 | 7.0 | 7.0 | 1.412848 | 0.198404 | 1.0 |
| 1 | 3.0 | 4.0 | -0.035354 | -0.646158 | -0.036285 | -0.794079 | 0.107073 | 12.0 | 0.951369 | 12.0 | ... | 1.107105e+00 | 1.013723 | 0.506733 | 0.384430 | 0.700082 | 8.0 | 7.0 | -0.335927 | -0.606132 | 1.0 |
| 2 | 3.0 | 0.0 | -0.489720 | -1.383346 | 0.000000 | 0.000000 | 0.000000 | 4.0 | 0.089450 | 3.0 | ... | -4.163637e-16 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.0 | 7.0 | 0.000000 | 0.000000 | 0.0 |
| 3 | 10.0 | 6.0 | 0.136992 | -0.409702 | -0.314379 | 0.000000 | 0.000000 | 12.0 | -0.537399 | 13.0 | ... | 6.383207e-01 | -1.556609 | -0.031928 | 0.548465 | 0.640155 | 7.0 | 7.0 | -0.029125 | -0.304431 | 0.0 |
| 4 | 11.0 | 4.0 | -0.004018 | -0.451429 | -0.263817 | 0.000000 | 0.000000 | 0.0 | -0.537399 | 12.0 | ... | 5.234093e-02 | 1.227917 | 0.849516 | -0.301537 | -0.303690 | 8.0 | 4.0 | -0.520009 | -0.390632 | 0.0 |
5 rows × 22 columns
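Sections 3.4 to 3.7 each fit a transformer on the training set and apply it to both sets by hand. The same steps can also be bundled so that they are always fitted together on training data only. The sketch below is one possible arrangement using `ColumnTransformer` and `Pipeline`, not the approach taken in this walkthrough; the two-column toy frames are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames with one categorical and one continuous column.
train = pd.DataFrame({
    "WindGustDir": pd.Series(["N", "S", np.nan, "N"], dtype=object),
    "MinTemp": [10.0, np.nan, 14.0, 12.0],
})
test = pd.DataFrame({
    "WindGustDir": pd.Series([np.nan, "S"], dtype=object),
    "MinTemp": [np.nan, 12.0],
})

# Categorical branch: fill with the mode, then ordinal-encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder()),
])
# Continuous branch: fill with the mean, then standardize.
continuous = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["WindGustDir"]),
    ("num", continuous, ["MinTemp"]),
])

train_ready = preprocess.fit_transform(train)  # all statistics come from train
test_ready = preprocess.transform(test)
print(test_ready)
```

A mean-imputed value becomes exactly 0 after standardization, which is consistent with the `0.000000` entries in the scaled tables above wherever the original value was `NaN`.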
Ytrain.head()
4. Modeling and model evaluation
from time import time
import datetime
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score
Ytrain = Ytrain.iloc[:,0].ravel()
Ytest = Ytest.iloc[:,0].ravel()
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.844000, recall is 0.469388', auc is 0.869029
00:03:751689
poly 's testing accuracy 0.840667, recall is 0.457726', auc is 0.868157
00:04:253937
rbf 's testing accuracy 0.813333, recall is 0.306122', auc is 0.814873
00:05:900434
sigmoid 's testing accuracy 0.655333, recall is 0.154519', auc is 0.437308
00:06:403792
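The loop above repeats the same fit, score, and timing steps for each kernel. The sketch below wraps those steps in a helper and measures elapsed time with `time.perf_counter`. It runs on synthetic data from `make_classification` rather than the weather data, and the helper name `evaluate_kernel` is hypothetical.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the weather data (about 20% positives).
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

def evaluate_kernel(kernel):
    """Fit an SVC with the given kernel and return its test metrics and run time."""
    start = perf_counter()
    clf = SVC(kernel=kernel, gamma="auto", degree=1).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {
        "kernel": kernel,
        "accuracy": clf.score(X_te, y_te),
        "recall": recall_score(y_te, pred),
        "auc": roc_auc_score(y_te, clf.decision_function(X_te)),
        "seconds": perf_counter() - start,
    }

results = [evaluate_kernel(k) for k in ["linear", "poly", "rbf", "sigmoid"]]
for row in results:
    print(row)
```

Here each kernel is timed separately, whereas the timestamps printed above are cumulative because `times` is set once before the loop.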
5. Model tuning
5.1 Maximizing recall
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = "balanced"
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
linear 's testing accuracy 0.796000, recall is 0.775510', auc is 0.870065
00:04:266080
poly 's testing accuracy 0.793333, recall is 0.763848', auc is 0.871448
00:04:972567
rbf 's testing accuracy 0.803333, recall is 0.600583', auc is 0.819713
00:06:837272
sigmoid 's testing accuracy 0.562000, recall is 0.282799', auc is 0.437119
00:08:094214
times = time()
clf = SVC(kernel = "linear"
,gamma="auto"
,cache_size = 5000
,class_weight = {1:10}
).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f, recall is %f', auc is %f" %(score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.636667, recall is 0.912536', auc is 0.866360
00:07:806017
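Both `class_weight="balanced"` and an explicit dictionary such as `{1:10}` reweight the classes during training. The sketch below shows how the `"balanced"` weights are derived, using scikit-learn's `compute_class_weight` and the formula `n_samples / (n_classes * count_per_class)`. The 8-to-2 label vector is a toy assumption, not the weather labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 8 negatives, 2 positives.
y = np.array([0] * 8 + [1] * 2)
classes = np.unique(y)

# Weights scikit-learn would use for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same quantity computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so misclassifying it costs more during training. This is consistent with the recall gains seen above, which come at the expense of accuracy.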
5.2 Maximizing accuracy
valuec = pd.Series(Ytest).value_counts()
valuec
0 1157
1 343
dtype: int64
valuec[0]/valuec.sum()
0.7713333333333333
from sklearn.metrics import confusion_matrix as CM
clf = SVC(kernel = "linear"
,gamma="auto"
,cache_size = 5000
).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
cm = CM(Ytest,result,labels=(1,0))
cm
array([[ 161, 182],
[ 52, 1105]], dtype=int64)
specificity = cm[1,1]/cm[1,:].sum()
specificity
0.9550561797752809
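With `labels=(1,0)`, the first row and column of the confusion matrix refer to the positive class, so the second row holds the actual negatives and `cm[1,1]/cm[1,:].sum()` is the specificity. As a worked check, the sketch below recomputes the main metrics from the matrix printed above; the variable names `tp`, `fn`, `fp`, and `tn` are introduced here for illustration.

```python
import numpy as np

# Confusion matrix from the output above, with rows and columns ordered as (1, 0).
cm = np.array([[161, 182],
               [52, 1105]])

tp, fn = cm[0]  # actual positives: predicted 1, predicted 0
fp, tn = cm[1]  # actual negatives: predicted 1, predicted 0

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)        # sensitivity, true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)   # true negative rate

print(accuracy, recall, precision, specificity)
```

The accuracy (0.844) and recall (about 0.469) agree with the values reported for the linear kernel in section 4.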
irange = np.linspace(0.01,0.05,10)
irange
array([0.01 , 0.01444444, 0.01888889, 0.02333333, 0.02777778,
0.03222222, 0.03666667, 0.04111111, 0.04555556, 0.05 ])
for i in irange:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.010000 testing accuracy 0.844667, recall is 0.475219', auc is 0.869157
00:03:688484
under ratio 1:1.014444 testing accuracy 0.844667, recall is 0.478134', auc is 0.869185
00:03:856429
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869200
00:03:745157
under ratio 1:1.023333 testing accuracy 0.845333, recall is 0.481050', auc is 0.869175
00:03:598711
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869394
00:03:667787
under ratio 1:1.032222 testing accuracy 0.844000, recall is 0.481050', auc is 0.869528
00:03:641163
under ratio 1:1.036667 testing accuracy 0.844000, recall is 0.481050', auc is 0.869659
00:03:895604
under ratio 1:1.041111 testing accuracy 0.844667, recall is 0.483965', auc is 0.869629
00:03:787082
under ratio 1:1.045556 testing accuracy 0.844667, recall is 0.483965', auc is 0.869712
00:03:729805
under ratio 1:1.050000 testing accuracy 0.845333, recall is 0.486880', auc is 0.869863
00:03:800089
irange_ = np.linspace(0.018889,0.027778,10)
for i in irange_:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869213
00:03:654617
under ratio 1:1.019877 testing accuracy 0.844000, recall is 0.478134', auc is 0.869228
00:03:753644
under ratio 1:1.020864 testing accuracy 0.844000, recall is 0.478134', auc is 0.869218
00:03:743298
under ratio 1:1.021852 testing accuracy 0.844667, recall is 0.478134', auc is 0.869188
00:03:557083
under ratio 1:1.022840 testing accuracy 0.844667, recall is 0.478134', auc is 0.869220
00:03:805152
under ratio 1:1.023827 testing accuracy 0.844667, recall is 0.481050', auc is 0.869188
00:03:774551
under ratio 1:1.024815 testing accuracy 0.844667, recall is 0.481050', auc is 0.869231
00:03:644071
under ratio 1:1.025803 testing accuracy 0.844667, recall is 0.481050', auc is 0.869238
00:03:772898
under ratio 1:1.026790 testing accuracy 0.844000, recall is 0.481050', auc is 0.869314
00:03:660354
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869326
00:03:715478
from sklearn.linear_model import LogisticRegression as LR
logclf = LR(solver="liblinear").fit(Xtrain, Ytrain)
logclf.score(Xtest,Ytest)
0.8486666666666667
C_range = np.linspace(5,10,10)
for C in C_range:
    logclf = LR(solver="liblinear",C=C).fit(Xtrain, Ytrain)
    print(C,logclf.score(Xtest,Ytest))
5.0 0.8493333333333334
5.555555555555555 0.8493333333333334
6.111111111111111 0.8486666666666667
6.666666666666667 0.8493333333333334
7.222222222222222 0.8493333333333334
7.777777777777778 0.8493333333333334
8.333333333333334 0.8493333333333334
8.88888888888889 0.8493333333333334
9.444444444444445 0.8493333333333334
10.0 0.8493333333333334
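The sweeps above pick `class_weight` and `C` by comparing scores on the test set, so the reported test accuracy may be optimistic. A common alternative is to select hyperparameters by cross-validation on the training data. The sketch below does this with `cross_val_score` on synthetic data; the data set and the 5-fold setting are assumptions, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data, with about 23% positives.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.77, 0.23], random_state=0)

C_range = np.linspace(5, 10, 10)
# Mean 5-fold cross-validated accuracy for each candidate C.
cv_means = [
    cross_val_score(
        LogisticRegression(solver="liblinear", C=C), X, y, cv=5, scoring="accuracy"
    ).mean()
    for C in C_range
]

best_C = C_range[int(np.argmax(cv_means))]
print(best_C, max(cv_means))
```

With this arrangement the test set is used only once, for the final evaluation of the chosen setting.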
5.3 Balancing precision and recall
times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
,class_weight = "balanced"
).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f,recall is %f', auc is %f" % (score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.794000,recall is 0.772595', auc is 0.870143
00:11:026793
from sklearn.metrics import roc_curve as ROC
import matplotlib.pyplot as plt
FPR, Recall, thresholds = ROC(Ytest,clf.decision_function(Xtest),pos_label=1)
area = roc_auc_score(Ytest,clf.decision_function(Xtest))
area
0.8701426983930995
plt.figure()
plt.plot(FPR, Recall, color='red',
label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

maxindex = (Recall - FPR).tolist().index(max(Recall - FPR))
thresholds[maxindex]
-0.09027758680662012
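The threshold above is the point on the ROC curve where `Recall - FPR` is largest, a quantity often called Youden's J statistic. The sketch below shows the same selection on toy scores, using `np.argmax`, which like `.tolist().index(max(...))` returns the first position of the maximum. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and decision scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label=1)

# Index of the ROC point with the largest gap between recall and FPR.
best = int(np.argmax(tpr - fpr))
best_threshold = thresholds[best]

# Predict positive whenever the score reaches the chosen threshold.
y_pred = (scores >= best_threshold).astype(int)

print(best_threshold)
print(y_pred)
```

For these toy scores the selected threshold is 0.7, which recovers three of the four positives without any false positives.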
from sklearn.metrics import accuracy_score as AC
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
,class_weight = "balanced"
).fit(Xtrain, Ytrain)
prob = pd.DataFrame(clf.decision_function(Xtest))
prob.head()
| | 0 |
| 0 | 2.186028 |
| 1 | 0.373602 |
| 2 | -0.019583 |
| 3 | -1.134845 |
| 4 | -0.237963 |
prob.loc[prob.iloc[:,0] >= thresholds[maxindex],"y_pred"]=1
prob.loc[prob.iloc[:,0] < thresholds[maxindex],"y_pred"]=0
prob.loc[:,"y_pred"].isnull().sum()
0
times = time()
score = AC(Ytest,prob.loc[:,"y_pred"].values)
recall = recall_score(Ytest, prob.loc[:,"y_pred"])
print("testing accuracy %f,recall is %f" % (score,recall))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
testing accuracy 0.790000,recall is 0.804665
00:00:002001
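Adjusting the threshold changes accuracy and recall only slightly (0.790 and 0.805, against 0.794 and 0.773 before), which suggests that tuning has reached the limit of this model. To improve performance further, the remaining option is to switch to a different algorithm.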