chenburong2021

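sklearn_SVM: SVC real-world case: weather prediction (study notes on the Caicai video course)

SVC real-world case: weather prediction

- 1. Import libraries and data, explore the features
- 2. Split into training and test sets, explore the label first
- 3. Explore the features
  - 3.1 Descriptive statistics and outliers
  - 3.2 Handling difficult features: date
  - 3.3 Handling difficult features: location
  - 3.4 Handling categorical variables: missing values
  - 3.5 Handling categorical variables: encoding
  - 3.6 Handling continuous variables: imputing missing values
  - 3.7 Handling continuous variables: scaling
- 4. Modeling and model evaluation
- 5. Tuning the model
  - 5.1 Maximizing recall
  - 5.2 Maximizing accuracy
  - 5.3 Balancing precision and recall

1. Import libraries and data, explore the features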

# Weather prediction dataset with 5000 rows and 21 features
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

weather = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_weatherAUS5000.csv",index_col=0)

weather.head()

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainTomorrow
0	2015-03-24	Adelaide	12.3	19.3	0.0	5.0	NaN	S	39.0	S	...	19.0	59.0	47.0	1022.2	1021.4	NaN	NaN	15.1	17.7	No
1	2011-07-12	Adelaide	7.9	11.4	0.0	1.0	0.5	N	20.0	NNE	...	7.0	70.0	59.0	1028.7	1025.7	NaN	NaN	8.4	11.3	No
2	2010-02-08	Adelaide	24.0	38.1	0.0	23.4	13.0	SE	39.0	NNE	...	19.0	36.0	24.0	1018.0	1016.0	NaN	NaN	32.4	37.4	No
3	2016-09-19	Adelaide	6.7	16.4	0.4	NaN	NaN	N	31.0	N	...	15.0	65.0	40.0	1014.4	1010.0	NaN	NaN	11.2	15.9	No
4	2014-03-05	Adelaide	16.7	24.8	0.0	6.6	11.7	S	37.0	S	...	24.0	61.0	48.0	1019.3	1018.9	NaN	NaN	20.8	23.7	No

5 rows × 22 columns

# Separate the feature matrix X from the label Y
X = weather.iloc[:,:-1]
Y = weather.iloc[:,-1]

# Shortcut to split a cell: Ctrl+Shift+-

# Shortcut to merge cells: Shift+M

X.shape  # 5000 randomly sampled rows

(5000, 21)

# Explore the data types
X.info()


Int64Index: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           5000 non-null   object 
 1   Location       5000 non-null   object 
 2   MinTemp        4979 non-null   float64
 3   MaxTemp        4987 non-null   float64
 4   Rainfall       4950 non-null   float64
 5   Evaporation    2841 non-null   float64
 6   Sunshine       2571 non-null   float64
 7   WindGustDir    4669 non-null   object 
 8   WindGustSpeed  4669 non-null   float64
 9   WindDir9am     4651 non-null   object 
 10  WindDir3pm     4887 non-null   object 
 11  WindSpeed9am   4949 non-null   float64
 12  WindSpeed3pm   4919 non-null   float64
 13  Humidity9am    4936 non-null   float64
 14  Humidity3pm    4880 non-null   float64
 15  Pressure9am    4506 non-null   float64
 16  Pressure3pm    4504 non-null   float64
 17  Cloud9am       3111 non-null   float64
 18  Cloud3pm       3012 non-null   float64
 19  Temp9am        4967 non-null   float64
 20  Temp3pm        4912 non-null   float64
dtypes: float64(16), object(5)
memory usage: 859.4+ KB

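# Explore missing values
X.isnull().mean()  # proportion of missing values per column, i.e. isnull().sum() (the number of True values) / X.shape[0]
# Different features will need different imputation strategies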

Date             0.0000
Location         0.0000
MinTemp          0.0042
MaxTemp          0.0026
Rainfall         0.0100
Evaporation      0.4318
Sunshine         0.4858
WindGustDir      0.0662
WindGustSpeed    0.0662
WindDir9am       0.0698
WindDir3pm       0.0226
WindSpeed9am     0.0102
WindSpeed3pm     0.0162
Humidity9am      0.0128
Humidity3pm      0.0240
Pressure9am      0.0988
Pressure3pm      0.0992
Cloud9am         0.3778
Cloud3pm         0.3976
Temp9am          0.0066
Temp3pm          0.0176
dtype: float64
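
The equivalence mentioned in the comment on `isnull().mean()` can be checked on a small frame. This is only a minimal sketch: the toy DataFrame `demo` and its columns are made up for illustration and are not part of the weather data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
})

# isnull() gives a boolean frame; when averaged, True counts as 1 and False as 0
ratio_mean = demo.isnull().mean()

# The same ratio computed by hand: number of missing values / number of rows
ratio_manual = demo.isnull().sum() / demo.shape[0]

print(ratio_mean)
```

Columns with a high ratio (such as Sunshine or Evaporation in the output above) are the ones where the choice of imputation strategy matters most.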

# Add a new cell above: Esc, A, Enter

# Add a new cell below: Esc, B, Enter

# Delete a cell: Esc, D, D or Esc, X

Y.shape

(5000,)

Y.isnull().sum()  # when summing, True counts as 1 and False as 0

# Explore the label classes by extracting the distinct values
np.unique(Y)  # the label is binary

array(['No', 'Yes'], dtype=object)

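2. Split into training and test sets, explore the label first

# Split into training and test sets
# so that the test set cannot influence model training
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y,test_size=0.3,random_state=420)  # random sampling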

Xtrain.head()

	Date	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
1809	2015-08-24	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	17.0	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN
4176	2016-12-10	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	7.0	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6
110	2010-04-18	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	17.0	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8
3582	2009-11-26	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	11.0	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5
2162	2014-04-25	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4

5 rows × 21 columns

# Restore the index
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
# so that the index runs from 0 to the number of rows minus 1
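
The loop above overwrites each index in place. As a sketch of an equivalent alternative, `reset_index(drop=True)` returns a new object with a fresh 0..n-1 index; the small frame `part` below is made up to mimic a shuffled subset.

```python
import pandas as pd

# Hypothetical subset whose index is shuffled, as after train_test_split
part = pd.DataFrame({"x": [10, 20, 30]}, index=[1809, 4176, 110])

# Same effect as part.index = range(part.shape[0]), but returns a new object
# drop=True discards the old index instead of keeping it as a column
part_reset = part.reset_index(drop=True)

print(part_reset.index.tolist())
```

Either form works here; the in-place loop is shorter when several objects need the same treatment.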

Xtrain.head()

	Date	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
0	2015-08-24	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	17.0	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN
1	2016-12-10	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	7.0	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6
2	2010-04-18	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	17.0	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8
3	2009-11-26	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	11.0	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5
4	2014-04-25	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4

5 rows × 21 columns

Ytrain.head()

0     No
1     No
2     No
3    Yes
4     No
Name: RainTomorrow, dtype: object

# Is there a class imbalance problem?
Ytrain.value_counts()

No     2704
Yes     796
Name: RainTomorrow, dtype: int64

Ytest.value_counts()

No     1157
Yes     343
Name: RainTomorrow, dtype: int64

# There is a mild class imbalance

Ytrain.value_counts()["No"]/Ytrain.value_counts()["Yes"]

3.3969849246231156

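# Encode the label
from sklearn.preprocessing import LabelEncoder  # meant for labels; covered in chapter 3
encorder = LabelEncoder().fit(Ytrain)  # accepts one-dimensional input, unlike most other preprocessing classes
# After fitting, encorder knows there are two classes, Yes and No: Yes is 1 and No is 0

# Fit on the training set, then transform the training set and the test set separately
Ytrain = pd.DataFrame(encorder.transform(Ytrain))
Ytest = pd.DataFrame(encorder.transform(Ytest))

# If the test set contains a label class that never appeared in the training set, the encoder has to be refit
# For example, the test set contains YES, NO and UNKNOWN
# while the training set only contains YES and NO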
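
The unseen-label situation described above can be reproduced on toy labels. This is a minimal sketch; the label lists and the name `enc` are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on training labels only (hypothetical toy labels)
enc = LabelEncoder().fit(["No", "Yes", "No"])

# Classes are stored in sorted order, so "No" maps to 0 and "Yes" to 1
print(enc.classes_)
codes = enc.transform(["Yes", "No"])

# A label that was never seen during fit makes transform raise ValueError
try:
    enc.transform(["Unknown"])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(unseen_ok)
```

When this error appears on a test set, the options are to refit the encoder on labels that include the new class, or to decide how the unexpected class should be handled before encoding.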

Ytrain

	0
0	0
1	0
2	0
3	1
4	0
...	...
3495	0
3496	1
3497	0
3498	0
3499	0

3500 rows × 1 columns

Ytest.head()

	0
0	0
1	0
2	1
3	0
4	0

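3. Explore the features

3.1 Descriptive statistics and outliers

Ytrain.to_csv("path_where_you_want_to_save/file_name.csv")
# Once you are sure the processing above is correct, save the data so that a later mistake does not force you to rerun the code

# Descriptive statistics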
Xtrain.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T

	count	mean	std	min	1%	5%	10%	25%	50%	75%	90%	99%	max
MinTemp	3486.0	12.225645	6.396243	-6.5	-1.715	1.800	4.1	7.7	12.0	16.7	20.9	25.900	29.0
MaxTemp	3489.0	23.245543	7.201839	-3.7	8.888	12.840	14.5	18.0	22.5	28.4	33.0	40.400	46.4
Rainfall	3467.0	2.487049	7.949686	0.0	0.000	0.000	0.0	0.0	0.0	0.8	6.6	41.272	115.8
Evaporation	1983.0	5.619163	4.383098	0.0	0.400	0.800	1.4	2.6	4.8	7.4	10.2	20.600	56.0
Sunshine	1790.0	7.508659	3.805841	0.0	0.000	0.345	1.4	4.6	8.3	10.6	12.0	13.300	13.9
WindGustSpeed	3263.0	39.858413	13.219607	9.0	15.000	20.000	24.0	31.0	39.0	48.0	57.0	76.000	117.0
WindSpeed9am	3466.0	14.046163	8.670472	0.0	0.000	0.000	4.0	7.0	13.0	19.0	26.0	37.000	65.0
WindSpeed3pm	3437.0	18.553390	8.611818	0.0	2.000	6.000	7.0	13.0	19.0	24.0	30.0	43.000	65.0
Humidity9am	3459.0	69.069095	18.787698	2.0	18.000	35.000	45.0	57.0	70.0	83.0	94.0	100.000	100.0
Humidity3pm	3408.0	51.651995	20.697872	2.0	9.000	17.000	23.0	37.0	52.0	66.0	79.0	98.000	100.0
Pressure9am	3154.0	1017.622067	7.065236	985.1	1000.506	1006.100	1008.9	1012.8	1017.6	1022.3	1027.0	1033.247	1038.1
Pressure3pm	3154.0	1015.227077	7.032531	980.2	998.000	1004.000	1006.5	1010.3	1015.2	1020.0	1024.4	1030.800	1036.0
Cloud9am	2171.0	4.491939	2.858781	0.0	0.000	0.000	1.0	1.0	5.0	7.0	8.0	8.000	8.0
Cloud3pm	2095.0	4.603819	2.655765	0.0	0.000	0.000	1.0	2.0	5.0	7.0	8.0	8.000	8.0
Temp9am	3481.0	16.989859	6.537552	-5.2	2.400	7.000	9.0	12.2	16.6	21.6	26.0	31.000	38.0
Temp3pm	3431.0	21.719003	7.031199	-4.1	7.460	11.500	13.3	16.6	21.0	26.6	31.4	38.600	45.9

Xtest.describe([0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.99]).T

	count	mean	std	min	1%	5%	10%	25%	50%	75%	90%	99%	max
MinTemp	1493.0	11.916812	6.375377	-8.5	-2.024	1.600	3.70	7.3	11.8	16.5	20.48	25.316	28.3
MaxTemp	1498.0	22.906809	6.986043	-0.8	9.134	13.000	14.50	17.8	22.4	27.8	32.60	38.303	45.1
Rainfall	1483.0	2.241807	7.988822	0.0	0.000	0.000	0.00	0.0	0.0	0.8	5.20	35.372	108.2
Evaporation	858.0	5.657809	4.105762	0.0	0.400	1.000	1.60	2.8	4.8	7.6	10.40	19.458	38.8
Sunshine	781.0	7.677465	3.862294	0.0	0.000	0.300	1.50	4.7	8.6	10.7	12.20	13.400	13.9
WindGustSpeed	1406.0	40.044097	14.027052	9.0	15.000	20.000	24.00	30.0	39.0	48.0	57.00	78.000	122.0
WindSpeed9am	1483.0	13.986514	9.124337	0.0	0.000	0.000	4.00	7.0	13.0	20.0	26.00	39.360	72.0
WindSpeed3pm	1482.0	18.601215	8.850446	0.0	2.000	6.000	7.00	13.0	19.0	24.0	31.00	43.000	56.0
Humidity9am	1477.0	68.688558	18.876448	4.0	20.000	36.000	44.00	57.0	69.0	82.0	95.00	100.000	100.0
Humidity3pm	1472.0	51.431386	20.459957	2.0	8.710	18.000	23.00	37.0	52.0	66.0	78.00	96.290	100.0
Pressure9am	1352.0	1017.763536	6.910275	988.5	1000.900	1006.255	1008.61	1013.2	1017.8	1022.3	1026.50	1033.449	1038.2
Pressure3pm	1350.0	1015.397926	6.916976	986.2	999.198	1003.900	1006.49	1010.9	1015.4	1020.0	1024.20	1031.151	1036.9
Cloud9am	940.0	4.494681	2.870468	0.0	0.000	0.000	1.00	1.0	5.0	7.0	8.00	8.000	8.0
Cloud3pm	917.0	4.403490	2.731969	0.0	0.000	0.000	1.00	2.0	5.0	7.0	8.00	8.000	8.0
Temp9am	1486.0	16.751817	6.339816	-5.3	2.370	6.725	9.00	12.1	16.5	21.3	25.45	30.200	35.1
Temp3pm	1481.0	21.483660	6.770567	-1.2	8.540	11.800	13.30	16.5	20.9	26.2	30.90	37.400	42.9

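# If there are outliers, first look at how often they occur
# If an outlier occurs only once, it is most likely a data-entry error and can simply be deleted
# If an outlier occurs many times, talk to the domain experts; outliers caused by human error are not worth keeping
# If outliers make up around 10% of the data, the dataset may not be usable as it is
# In that case, replace them with a value that is not anomalous and does not distort the data, such as 0, or treat them as missing and fill them with the mean or another imputed value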
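
One way to act on these notes is to count how often suspicious values occur and then treat them as missing. The sketch below uses the common 1.5 x IQR rule, which is an assumption made here (the notes do not name a detection rule); the Series `s` is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one implausible value
s = pd.Series([12.0, 13.5, 11.8, 12.9, 13.1, 250.0, 12.4])

# Flag values outside the 1.5 * IQR fences (assumed rule, not from the notes)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Step 1: how frequent are the outliers?
n_outliers = int(is_outlier.sum())

# Step 2: treat outliers as missing, then fill with the mean of the remaining values
cleaned = s.mask(is_outlier)              # flagged values become NaN
cleaned = cleaned.fillna(cleaned.mean())  # mean() ignores NaN by default

print(n_outliers)
```

On real data, the fill value should be computed on the training set only and then reused for the test set.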

Xtrain.head()

	Date	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
0	2015-08-24	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	17.0	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN
1	2016-12-10	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	7.0	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6
2	2010-04-18	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	17.0	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8
3	2009-11-26	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	11.0	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5
4	2014-04-25	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4

5 rows × 21 columns

type(Xtrain.iloc[0,0])  # a string

str

3.2 Handling difficult features: date

Xtrainc = Xtrain.copy()

Xtrainc.sort_values(by="Location")

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
2796	2015-03-24	Adelaide	12.3	19.3	0.0	5.0	NaN	S	39.0	S	...	13.0	19.0	59.0	47.0	1022.2	1021.4	NaN	NaN	15.1	17.7
2975	2012-08-17	Adelaide	7.8	13.2	17.6	0.8	NaN	SW	61.0	SW	...	20.0	28.0	76.0	47.0	1012.5	1014.7	NaN	NaN	8.3	12.5
775	2013-03-16	Adelaide	17.4	23.8	NaN	NaN	9.7	SSE	46.0	S	...	9.0	19.0	63.0	57.0	1019.9	1020.5	NaN	NaN	19.1	20.7
861	2011-07-12	Adelaide	7.9	11.4	0.0	1.0	0.5	N	20.0	NNE	...	7.0	7.0	70.0	59.0	1028.7	1025.7	NaN	NaN	8.4	11.3
2906	2015-08-24	Adelaide	9.2	14.3	0.0	NaN	NaN	SE	48.0	SE	...	17.0	19.0	64.0	42.0	1024.7	1024.1	NaN	NaN	9.9	13.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2223	2009-05-08	Woomera	9.2	20.6	0.0	5.2	10.4	ESE	37.0	SE	...	19.0	19.0	64.0	34.0	1030.5	1026.9	0.0	1.0	13.7	20.1
1984	2014-05-26	Woomera	15.5	23.6	0.0	24.0	NaN	NNW	43.0	NNE	...	9.0	26.0	49.0	37.0	1014.2	1010.3	7.0	7.0	18.0	21.5
1592	2012-01-10	Woomera	16.8	26.7	0.0	10.0	5.3	SW	46.0	S	...	20.0	22.0	52.0	33.0	1019.1	1016.8	4.0	6.0	18.3	24.9
2824	2015-11-03	Woomera	16.2	28.5	7.8	4.2	4.5	WSW	80.0	NE	...	26.0	50.0	76.0	53.0	1009.6	1006.8	6.0	7.0	20.5	26.2
1005	2010-05-14	Woomera	3.9	19.3	0.0	5.8	10.5	NE	33.0	ENE	...	15.0	13.0	43.0	19.0	1020.2	1016.4	1.0	1.0	11.5	18.5

3500 rows × 21 columns

# Decide what kind of variable the date is
# Check whether dates repeat
# No repeats: continuous
# Repeats: discrete, but with too many categories

Xtrain.iloc[:,0].value_counts()

2015-10-12    6
2014-05-16    6
2015-07-03    6
2009-03-30    5
2016-09-07    5
             ..
2010-06-14    1
2013-12-01    1
2009-01-18    1
2014-11-24    1
2014-04-04    1
Name: Date, Length: 2141, dtype: int64

# Records from different locations on the same date

Xtrain.loc[Xtrain.iloc[:,0] == "2015-08-24",:]

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed9am	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
0	2015-08-24	Katherine	17.5	36.0	0.0	8.8	NaN	ESE	26.0	NNW	...	17.0	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN
2906	2015-08-24	Adelaide	9.2	14.3	0.0	NaN	NaN	SE	48.0	SE	...	17.0	19.0	64.0	42.0	1024.7	1024.1	NaN	NaN	9.9	13.4

2 rows × 21 columns

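# First, dates are not unique; they repeat
# Second, after the train/test split the dates are no longer consecutive but scattered
# Is a particular day of a particular year more likely, or less likely, to see rain?
# It is not the date that decides whether it rains; factors such as sunshine hours, humidity and temperature on that day matter far more
# Looking at the date alone, it seems to have no direct influence on our prediction
# If we treat it as a continuous variable, the algorithm will see a series of numbers from roughly 1 to 3000 and will not realize they are dates

Xtrain.iloc[:,0].value_counts().count()
# If we treat it as a categorical variable, there are too many classes (2141); converted to integers it would simply be treated as continuous, and as dummy variables it would blow up the feature dimension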
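
The dimension blow-up can be seen on a handful of rows. This is a minimal sketch with a made-up Series `dates`; it only illustrates that dummy encoding creates one column per distinct date, while a coarser unit such as the month is capped at 12 columns.

```python
import pandas as pd

# Hypothetical sample of date strings; one date repeats
dates = pd.Series(["2015-08-24", "2016-12-10", "2010-04-18", "2015-08-24"], name="Date")

n_unique = dates.nunique()            # number of distinct dates
date_dummies = pd.get_dummies(dates)  # one column per distinct date

# Keeping only the month caps the number of categories at 12
months = dates.str.split("-").str[1].astype(int)
month_dummies = pd.get_dummies(months)

print(date_dummies.shape, month_dummies.shape)
```

With 2141 distinct dates in the training set, the first encoding would add 2141 columns to a frame that has only 3500 rows.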

Xtrain["Rainfall"].head(20)

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      0.0
6      0.0
7      0.2
8      0.0
9      0.2
10     1.0
11     0.0
12     0.2
13     0.0
14     0.0
15     3.0
16     0.2
17     0.0
18    35.2
19     0.0
Name: Rainfall, dtype: float64

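Xtrain["Rainfall"].isnull().sum()
# For rows with missing Rainfall we could assume it did not rain,
# but here the missing value is carried over into RainToday

Xtrain.loc[Xtrain.loc[:,"Rainfall"] >= 1,"RainToday"] = "Yes"  # compare Rainfall against the threshold 1 and fill the new RainToday column accordingly
Xtrain.loc[Xtrain.loc[:,"Rainfall"] < 1,"RainToday"] = "No"
Xtrain.loc[Xtrain.loc[:,"Rainfall"].isnull(),"RainToday"] = np.nan  # a comparison with np.nan is always False, so test with isnull()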
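
A note on selecting missing values: a mask built with `== np.nan` never selects anything, because NaN does not compare equal to any value, itself included. The rows still end up as NaN in RainToday only because a newly created column defaults to NaN wherever nothing was assigned. A minimal sketch on a made-up Series `rain`:

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall values, one of them missing
rain = pd.Series([0.0, 2.5, np.nan, 0.4], name="Rainfall")

print((rain == np.nan).sum())  # 0: NaN never compares equal to anything
print(rain.isnull().sum())     # 1: isnull() finds the missing entry

# Build the flag so that missing rainfall stays missing
# (without the last line, NaN >= 1 is False and the row would be labelled "No")
rain_today = pd.Series(np.where(rain >= 1, "Yes", "No"), index=rain.index, dtype=object)
rain_today[rain.isnull()] = np.nan

print(rain_today.tolist())
```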

Xtrain.head()

	Date	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	2015-08-24	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	2016-12-10	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	2010-04-18	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	2009-11-26	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	2014-04-25	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

Xtrain.loc[:,"RainToday"].value_counts()

No     2642
Yes     825
Name: RainToday, dtype: int64

Xtest.loc[Xtest["Rainfall"] >= 1,"RainToday"] = "Yes"
Xtest.loc[Xtest["Rainfall"] < 1,"RainToday"] = "No"
Xtest.loc[Xtest["Rainfall"].isnull(),"RainToday"] = np.nan  # a comparison with np.nan is always False, so test with isnull()

Xtrain.head()

	Date	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	2015-08-24	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	2016-12-10	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	2010-04-18	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	2009-11-26	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	2014-04-25	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

Xtest.head()

	Date	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	2016-01-23	NorahHead	22.0	27.8	25.2	NaN	NaN	SSW	57.0	S	...	37.0	91.0	86.0	1006.6	1008.1	NaN	NaN	26.2	23.1	Yes
1	2009-03-05	MountGambier	12.0	18.6	2.2	3.0	7.8	SW	52.0	SW	...	28.0	88.0	62.0	1020.2	1019.9	8.0	7.0	14.8	17.5	Yes
2	2010-03-05	MountGinini	9.1	13.3	NaN	NaN	NaN	NE	41.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2013-10-26	Wollongong	13.1	20.3	0.0	NaN	NaN	SW	33.0	W	...	24.0	40.0	51.0	1021.3	1019.5	NaN	NaN	16.8	19.6	No
4	2016-11-28	Sale	12.2	20.0	0.4	NaN	NaN	E	33.0	SW	...	19.0	92.0	69.0	1015.6	1013.2	8.0	4.0	13.6	19.0	No

5 rows × 22 columns

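Xtrain.loc[0,"Date"].split("-")  # split the string on "-"

['2015', '08', '24']

int(Xtrain.loc[0,"Date"].split("-")[1])  # extract the month

Xtrain["Date"] = Xtrain.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
# apply runs a function over every element of a DataFrame column
# lambda x is an anonymous function: for each row of this column, execute the expression after the colon
# It works like a loop and is usually faster than an explicit Python loop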
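
As an alternative to splitting strings, the month can also be read through pandas' datetime accessor. This is a sketch on a made-up Series `dates`, assuming every value follows the `YYYY-MM-DD` format seen in the data.

```python
import pandas as pd

# Hypothetical sample in the same format as the Date column
dates = pd.Series(["2015-08-24", "2016-12-10", "2010-04-18"])

# String-splitting approach, as in the notes
month_split = dates.apply(lambda x: int(x.split("-")[1]))

# Vectorized alternative: parse once, then read the month
month_dt = pd.to_datetime(dates, format="%Y-%m-%d").dt.month

print(month_split.tolist(), month_dt.tolist())
```

Parsing to datetime also makes other parts (year, day of week) available if they turn out to be useful later.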

Xtrain.loc[:,"Date"].value_counts()

3     334
5     324
7     316
6     302
9     302
1     300
11    299
10    282
4     265
2     264
12    259
8     253
Name: Date, dtype: int64

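# After the replacement, the column name needs to change as well
# rename is used less often; it can change a single column name
# Usually we write df.columns = some_list to change all column names at once
# but rename lets us change just one column
Xtrain = Xtrain.rename(columns={"Date":"Month"})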

Xtrain.head()

	Month	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	8	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	12	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	4	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	11	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	4	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

Xtest["Date"] = Xtest.loc[:,"Date"].apply(lambda x:int(x.split("-")[1]))
Xtest = Xtest.rename(columns={"Date":"Month"})

Xtest.head()

	Month	Location	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	1	NorahHead	22.0	27.8	25.2	NaN	NaN	SSW	57.0	S	...	37.0	91.0	86.0	1006.6	1008.1	NaN	NaN	26.2	23.1	Yes
1	3	MountGambier	12.0	18.6	2.2	3.0	7.8	SW	52.0	SW	...	28.0	88.0	62.0	1020.2	1019.9	8.0	7.0	14.8	17.5	Yes
2	3	MountGinini	9.1	13.3	NaN	NaN	NaN	NE	41.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	10	Wollongong	13.1	20.3	0.0	NaN	NaN	SW	33.0	W	...	24.0	40.0	51.0	1021.3	1019.5	NaN	NaN	16.8	19.6	No
4	11	Sale	12.2	20.0	0.4	NaN	NaN	E	33.0	SW	...	19.0	92.0	69.0	1015.6	1013.2	8.0	4.0	13.6	19.0	No

5 rows × 22 columns

3.3 Handling difficult features: location

# Now handle the difficult feature: location
Xtrain.loc[:,"Location"].value_counts()

Bendigo             94
Sydney              92
SalmonGums          92
Canberra            87
Ballarat            87
Darwin              86
Cairns              84
Wollongong          82
Albury              80
Townsville          80
Newcastle           78
Adelaide            77
BadgerysCreek       77
Dartmoor            76
Moree               76
Launceston          76
CoffsHarbour        76
Witchcliffe         76
WaggaWagga          75
NorahHead           74
Mildura             72
MelbourneAirport    72
SydneyAirport       72
Cobar               71
Richmond            71
PerthAirport        71
Hobart              71
Perth               70
Walpole             69
PearceRAAF          69
NorfolkIsland       68
Nuriootpa           68
MountGambier        68
Woomera             67
Albany              67
GoldCoast           66
Watsonia            66
Penrith             65
MountGinini         64
Brisbane            63
AliceSprings        63
Williamtown         63
Tuggeranong         62
Sale                62
Portland            60
Katherine           53
Melbourne           52
Uluru               46
Nhil                44
Name: Location, dtype: int64

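# Still on the difficult feature: location; count the distinct values
Xtrain.loc[:,"Location"].value_counts().count()
# A categorical variable with more than about 25 classes tends to be treated by the algorithm as continuous rather than as classes
# So convert the cities into climates to obtain a usable categorical variable
# This maps the 49 locations onto 7 climate types
# Each weather station takes the climate of its nearest city, so we load city coordinates to find the city closest to each station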

cityll = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_cityll.csv",index_col=0)
city_climate = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_Cityclimate.csv")

cityll.head()  # latitude and longitude of each city; these are the cities on the map produced by the Australian Bureau of Statistics

	City	Latitude	Longitude	Latitudedir	Longitudedir
0	Adelaide	34.9285°	138.6007°	S,	E
1	Albany	35.0275°	117.8840°	S,	E
2	Albury	36.0737°	146.9135°	S,	E
3	Wodonga	36.1241°	146.8818°	S,	E
4	AliceSprings	23.6980°	133.8807°	S,	E

float(cityll.loc[0,"Latitude"][:-1])

34.9285

cityll.loc[:,"Latitudedir"].value_counts()

S,    100
Name: Latitudedir, dtype: int64

city_climate.head()  # the climate of each city, as compiled by the Australian Bureau of Statistics

	City	Climate
0	Adelaide	Warm temperate
1	Albany	Mild temperate
2	Albury	Hot dry summer, cool winter
3	Wodonga	Hot dry summer, cool winter
4	AliceSprings	Hot dry summer, warm winter

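# Strip the degree sign
cityll["Latitudenum"] = cityll["Latitude"].apply(lambda x:float(x[:-1]))
cityll["Longitudenum"] = cityll["Longitude"].apply(lambda x:float(x[:-1]))

# All directions are identical: south latitude and east longitude, because Australia lies in the southern and eastern hemispheres
# so the direction columns can be dropped
citylld = cityll.iloc[:,[0,5,6]]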

citylld

	City	Latitudenum	Longitudenum
0	Adelaide	34.9285	138.6007
1	Albany	35.0275	117.8840
2	Albury	36.0737	146.9135
3	Wodonga	36.1241	146.8818
4	AliceSprings	23.6980	133.8807
...	...	...	...
95	Wollongong	34.4278	150.8931
96	Wyndham	15.4825	128.1228
97	Yalgoo	28.3445	116.6851
98	Yulara	25.2335	130.9849
99	Uluru	25.3444	131.0369

100 rows × 3 columns

# Add the climate column from city_climate to citylld
# Work on an explicit copy so that the assignment does not target a slice of cityll
citylld = citylld.copy()
citylld["climate"] = city_climate.iloc[:,-1]

citylld.head()

	City	Latitudenum	Longitudenum	climate
0	Adelaide	34.9285	138.6007	Warm temperate
1	Albany	35.0275	117.8840	Mild temperate
2	Albury	36.0737	146.9135	Hot dry summer, cool winter
3	Wodonga	36.1241	146.8818	Hot dry summer, cool winter
4	AliceSprings	23.6980	133.8807	Hot dry summer, warm winter

citylld.loc[:,"climate"].value_counts()

Hot dry summer, cool winter          24
Warm temperate                       18
Hot dry summer, warm winter          18
High humidity summer, warm winter    17
Mild temperate                        9
Cool temperate                        9
Warm humid summer, mild winter        5
Name: climate, dtype: int64

samplecity = pd.read_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\day08_samplecity.csv",index_col=0)

samplecity.head()

	City	Latitude	Longitude	Latitudedir	Longitudedir
0	Canberra	35.2809°	149.1300°	S,	E
1	Sydney	33.8688°	151.2093°	S,	E
2	Perth	31.9505°	115.8605°	S,	E
3	Darwin	12.4634°	130.8456°	S,	E
4	Hobart	42.8821°	147.3272°	S,	E

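# Apply the same processing to samplecity: strip the degree sign from latitude and longitude and drop the direction columns
samplecity["Latitudenum"] = samplecity["Latitude"].apply(lambda x:float(x[:-1]))
samplecity["Longitudenum"] = samplecity["Longitude"].apply(lambda x:float(x[:-1]))
samplecityd = samplecity.iloc[:,[0,5,6]].copy()  # copy so that later column assignments do not act on a slice

samplecityd.head()  # here "City" is the name of the weather station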

	City	Latitudenum	Longitudenum
0	Canberra	35.2809	149.1300
1	Sydney	33.8688	151.2093
2	Perth	31.9505	115.8605
3	Darwin	12.4634	130.8456
4	Hobart	42.8821	147.3272

# First use radians to convert the coordinates from degrees to radians
from math import radians, sin, cos, acos
citylld.loc[:,"slat"] = citylld.iloc[:,1].apply(lambda x : radians(x))

citylld.loc[:,"slon"] = citylld.iloc[:,2].apply(lambda x : radians(x))
samplecityd.loc[:,"elat"] = samplecityd.iloc[:,1].apply(lambda x : radians(x))
samplecityd.loc[:,"elon"] = samplecityd.iloc[:,2].apply(lambda x : radians(x))

import sys
for i in range(samplecityd.shape[0]):
    slat = citylld.loc[:,"slat"]
    slon = citylld.loc[:,"slon"]
    elat = samplecityd.loc[i,"elat"]
    elon = samplecityd.loc[i,"elon"]
    dist = 6371.01 * np.arccos(np.sin(slat)*np.sin(elat) + 
                          np.cos(slat)*np.cos(elat)*np.cos(slon.values - elon))
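    city_index = np.argsort(dist)[0]  # argsort returns the positions that would sort dist; the first one is the nearest city
    # After each computation, take the nearest city and write it and its climate into samplecityd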
    samplecityd.loc[i,"closest_city"] = citylld.loc[city_index,"City"]
    samplecityd.loc[i,"climate"] = citylld.loc[city_index,"climate"]
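
The distance used in the loop is the spherical law of cosines. The sketch below repackages it as a standalone function; the name `great_circle_km` is hypothetical, and the `np.clip` guard is an addition not present in the loop (rounding can push the cosine slightly outside [-1, 1], in which case `arccos` returns NaN). The coordinates are copied from the tables shown above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.01  # same radius as in the loop

def great_circle_km(lat1, lon1, lat2, lon2):
    """Spherical law of cosines; inputs in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (np.sin(lat1) * np.sin(lat2)
                 + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    # Guard against values such as 1.0000000000000002 caused by rounding
    return EARTH_RADIUS_KM * np.arccos(np.clip(cos_angle, -1.0, 1.0))

# Reference cities (latitude, longitude as listed in cityll)
names = np.array(["Adelaide", "Albany", "Albury"])
lats = np.array([34.9285, 35.0275, 36.0737])
lons = np.array([138.6007, 117.8840, 146.9135])

# Station to match: Nuriootpa (34.4666, 138.9917 in samplecity)
dist = great_circle_km(lats, lons, 34.4666, 138.9917)
closest = names[np.argmin(dist)]

print(closest)
```

Because all latitudes in the data are southern and stored as positive numbers, the sign is the same for every point and the distances are unaffected.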

# Inspect the final result and check that the city matching is broadly correct
samplecityd.head(300)

	City	Latitudenum	Longitudenum	elat	elon	closest_city	climate
0	Canberra	35.2809	149.1300	0.615768	2.602810	Canberra	Cool temperate
1	Sydney	33.8688	151.2093	0.591122	2.639100	Sydney	Warm temperate
2	Perth	31.9505	115.8605	0.557641	2.022147	Perth	Warm temperate
3	Darwin	12.4634	130.8456	0.217527	2.283687	Darwin	High humidity summer, warm winter
4	Hobart	42.8821	147.3272	0.748434	2.571345	Hobart	Cool temperate
5	Brisbane	27.4698	153.0251	0.479438	2.670792	Brisbane	Warm humid summer, mild winter
6	Adelaide	34.9285	138.6007	0.609617	2.419039	Adelaide	Warm temperate
7	Bendigo	36.7570	144.2794	0.641531	2.518151	Ballarat	Cool temperate
8	Townsville	19.2590	146.8169	0.336133	2.562438	Townsville	High humidity summer, warm winter
9	AliceSprings	23.6980	133.8807	0.413608	2.336659	AliceSprings	Hot dry summer, warm winter
10	MountGambier	37.8284	140.7804	0.660230	2.457082	KingstonSE	Mild temperate
11	Launceston	41.4332	147.1441	0.723146	2.568149	Launceston	Cool temperate
12	Ballarat	37.5622	143.8503	0.655584	2.510661	Ballarat	Cool temperate
13	Albany	35.0275	117.8840	0.611345	2.057464	Albany	Mild temperate
14	Albury	36.0737	146.9135	0.629605	2.564124	Albury	Hot dry summer, cool winter
15	PerthAirport	31.9440	115.9680	0.557528	2.024023	PerthAirport	Warm temperate
16	MelbourneAirport	37.6697	144.8488	0.657460	2.528088	MelbourneAirport	Mild temperate
17	Mildura	34.2080	142.1246	0.597042	2.480542	Mildura	Hot dry summer, cool winter
18	SydneyAirport	33.9399	151.1753	0.592363	2.638507	SydneyAirport	Warm temperate
19	Nuriootpa	34.4666	138.9917	0.601556	2.425863	Adelaide	Warm temperate
20	Sale	38.1026	147.0730	0.665016	2.566908	LakesEntrance	Mild temperate
21	Watsonia	37.7080	145.0830	0.658129	2.532176	Ivanhoe	Hot dry summer, cool winter
22	Tuggeranong	35.4244	149.0888	0.618272	2.602090	Canberra	Cool temperate
23	Portland	38.3609	141.6041	0.669524	2.471458	MountGambier	Mild temperate
24	Woomera	31.1656	136.8193	0.543942	2.387947	LeighCreek	Warm temperate
25	Cairns	16.9186	145.7781	0.295285	2.544308	Cairns	High humidity summer, warm winter
26	Cobar	31.4949	145.8402	0.549690	2.545392	Bourke	Hot dry summer, cool winter
27	Wollongong	34.4278	150.8931	0.600878	2.633581	Wollongong	Warm temperate
28	GoldCoast	28.0167	153.4000	0.488984	2.677335	Southport	Cool temperate
29	WaggaWagga	35.1082	147.3598	0.612754	2.571914	Albury	Hot dry summer, cool winter
30	NorfolkIsland	29.0408	167.9547	0.506858	2.931363	LordHoweIsland	Warm temperate
31	Penrith	33.7500	150.7000	0.589049	2.630211	SydneyAirport	Warm temperate
32	SalmonGums	32.9879	121.6422	0.575747	2.123057	Norseman	Hot dry summer, cool winter
33	Newcastle	32.9283	151.7817	0.574707	2.649090	Newcastle	Warm temperate
34	CoffsHarbour	30.2986	153.1094	0.528810	2.672263	CoffsHarbour	Warm humid summer, mild winter
35	Witchcliffe	34.0082	115.1155	0.593555	2.009144	MargaretRiver	Warm temperate
36	Richmond	37.8230	144.9980	0.660136	2.530693	Melbourne	Mild temperate
37	Dartmoor	37.9144	141.2730	0.661731	2.465679	MountGambier	Mild temperate
38	NorahHead	33.2833	151.5667	0.580903	2.645338	Swansea	Cool temperate
39	BadgerysCreek	33.8829	150.7609	0.591368	2.631274	SydneyAirport	Warm temperate
40	MountGinini	35.5294	148.7723	0.620105	2.596566	Canberra	Cool temperate
41	Moree	29.4658	149.8339	0.514275	2.615095	Goondiwindi	Hot dry summer, warm winter
42	Walpole	34.9551	116.7696	0.610082	2.038014	Albany	Mild temperate
43	PearceRAAF	31.6666	116.0257	0.552686	2.025030	PerthAirport	Warm temperate
44	Williamtown	32.8150	151.8428	0.572730	2.650157	Newcastle	Warm temperate
45	Melbourne	37.8136	144.9631	0.659972	2.530083	Melbourne	Mild temperate
46	Nhil	36.3328	141.6503	0.634127	2.472264	Horsham	Mild temperate
47	Katherine	14.4521	132.2715	0.252237	2.308573	Katherine	High humidity summer, warm winter
48	Uluru	25.3444	131.0369	0.442343	2.287025	Uluru	Hot dry summer, warm winter

# Check the distribution of climates
samplecityd["climate"].value_counts()

Warm temperate                       15
Mild temperate                       10
Cool temperate                        9
Hot dry summer, cool winter           6
High humidity summer, warm winter     4
Hot dry summer, warm winter           3
Warm humid summer, mild winter        2
Name: climate, dtype: int64

# Once verified, take the climate of the city matched to each sample weather station and save it
locafinal = samplecityd.iloc[:,[0,-1]]

locafinal.head()

	Location	Climate
0	Canberra	Cool temperate
1	Sydney	Warm temperate
2	Perth	Warm temperate
3	Darwin	High humidity summer, warm winter
4	Hobart	Cool temperate

locafinal.columns = ["Location","Climate"]

# Location is set as the index of locafinal so that the climate can later be matched to the Location column of the training set
locafinal = locafinal.set_index(keys="Location")

locafinal

	Climate
Location
Canberra	Cool temperate
Sydney	Warm temperate
Perth	Warm temperate
Darwin	High humidity summer, warm winter
Hobart	Cool temperate
Brisbane	Warm humid summer, mild winter
Adelaide	Warm temperate
Bendigo	Cool temperate
Townsville	High humidity summer, warm winter
AliceSprings	Hot dry summer, warm winter
MountGambier	Mild temperate
Launceston	Cool temperate
Ballarat	Cool temperate
Albany	Mild temperate
Albury	Hot dry summer, cool winter
PerthAirport	Warm temperate
MelbourneAirport	Mild temperate
Mildura	Hot dry summer, cool winter
SydneyAirport	Warm temperate
Nuriootpa	Warm temperate
Sale	Mild temperate
Watsonia	Hot dry summer, cool winter
Tuggeranong	Cool temperate
Portland	Mild temperate
Woomera	Warm temperate
Cairns	High humidity summer, warm winter
Cobar	Hot dry summer, cool winter
Wollongong	Warm temperate
GoldCoast	Cool temperate
WaggaWagga	Hot dry summer, cool winter
NorfolkIsland	Warm temperate
Penrith	Warm temperate
SalmonGums	Hot dry summer, cool winter
Newcastle	Warm temperate
CoffsHarbour	Warm humid summer, mild winter
Witchcliffe	Warm temperate
Richmond	Mild temperate
Dartmoor	Mild temperate
NorahHead	Cool temperate
BadgerysCreek	Warm temperate
MountGinini	Cool temperate
Moree	Hot dry summer, warm winter
Walpole	Mild temperate
PearceRAAF	Warm temperate
Williamtown	Warm temperate
Melbourne	Mild temperate
Nhil	Mild temperate
Katherine	High humidity summer, warm winter
Uluru	Hot dry summer, warm winter

locafinal.to_csv(r"C:\Users\chen'bu'rong\Desktop\class_file\day08_08SVM.case2\samplelocation.csv")
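# Next, replace the original station names with the climates
# map takes a lookup (a dict or an indexed Series) and maps each location to its climate through the index

# Remember what the training set looks like?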
Xtrain.head()

	Month	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	8	Katherine	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	12	Tuggeranong	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	4	Albany	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	11	Sale	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	4	Mildura	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

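# Replace the content of Location, making sure the climate strings contain no commas and no surrounding spaces
# The re module is used to remove the commas
# re.sub(pattern to replace, replacement, string to operate on)  # removes the commas
# x.strip() removes surrounding whitespace
# map does the actual replacement of location by climate
import re

# Replace each station name with the climate of its matched city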
Xtrain["Location"] = Xtrain["Location"].map(locafinal.iloc[:,0])
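
How `map` behaves with an indexed Series can be seen on a tiny example. This is a minimal sketch: `lookup` and `stations` are made up, and "Atlantis" is a deliberately unmatched name.

```python
import pandas as pd

# Hypothetical lookup table indexed by station name
lookup = pd.Series(
    ["Cool temperate", "Warm temperate"],
    index=["Canberra", "Sydney"],
    name="Climate",
)

stations = pd.Series(["Sydney", "Canberra", "Sydney", "Atlantis"])

# map() looks each value up in the index of `lookup`; unmatched names become NaN
climates = stations.map(lookup)

print(climates.tolist())
```

A quick `isnull().sum()` after mapping is a cheap way to confirm that every station name found a match.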

Xtrain.head()

	Month	Location	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	8	High humidity summer, warm winter	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	12	Cool temperate	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	4	Mild temperate	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	11	Mild temperate	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	4	Hot dry summer, cool winter	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

# Remove the commas and surrounding spaces from the climate strings
Xtrain["Location"] = Xtrain["Location"].apply(lambda x:re.sub(",","",x.strip()))

Xtest["Location"] = Xtest["Location"].map(locafinal.iloc[:,0]).apply(lambda x:re.sub(",","",x.strip()))
# re.sub(pattern to replace, replacement, whole string to operate on)
# x.strip() removes leading and trailing whitespace
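The mapping and cleaning steps above can be tried on a toy example. This is a minimal sketch, not the notebook's data: `climate_lookup` and `stations` are made-up stand-ins for `locafinal.iloc[:,0]` and `Xtrain["Location"]`. The `.str`-based variant at the end is an alternative added here, not part of the original notebook; it leaves NaN in place instead of raising if a station is missing from the lookup table.

```python
import re

import pandas as pd

# Hypothetical lookup table: index = station name, value = climate text
climate_lookup = pd.Series({
    "Katherine": " High humidity summer, warm winter",
    "Albany": "Mild temperate ",
})
stations = pd.Series(["Albany", "Katherine", "Albany"])

# Series.map with a Series argument looks each value up in that Series' index
mapped = stations.map(climate_lookup)

# Same cleaning as in the notebook: strip outer spaces, then drop commas
cleaned = mapped.apply(lambda x: re.sub(",", "", x.strip()))

# Alternative (not in the original): vectorized string methods skip NaN
cleaned_str = mapped.str.strip().str.replace(",", "", regex=False)
```

With `apply` and `x.strip()`, a station absent from the lookup table would map to NaN and raise an `AttributeError`, so it is worth checking `mapped.isnull().sum()` before cleaning.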

# Now that the feature holds climates, rename the column from "Location" to "Climate"
# Note: once this runs there is no "Location" column any more, so take care when indexing by name
Xtrain = Xtrain.rename(columns={"Location":"Climate"})
Xtest = Xtest.rename(columns={"Location":"Climate"})

Xtrain.head()

	Month	Climate	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	8	High humidity summer warm winter	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	NaN	27.5	NaN	No
1	12	Cool temperate	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	NaN	NaN	14.6	23.6	No
2	4	Mild temperate	13.0	22.6	3.8	10.4	NaN	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	11	Mild temperate	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	4	Hot dry summer cool winter	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

Xtest.head()

	Month	Climate	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	1	Cool temperate	22.0	27.8	25.2	NaN	NaN	SSW	57.0	S	...	37.0	91.0	86.0	1006.6	1008.1	NaN	NaN	26.2	23.1	Yes
1	3	Mild temperate	12.0	18.6	2.2	3.0	7.8	SW	52.0	SW	...	28.0	88.0	62.0	1020.2	1019.9	8.0	7.0	14.8	17.5	Yes
2	3	Cool temperate	9.1	13.3	NaN	NaN	NaN	NE	41.0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	10	Warm temperate	13.1	20.3	0.0	NaN	NaN	SW	33.0	W	...	24.0	40.0	51.0	1021.3	1019.5	NaN	NaN	16.8	19.6	No
4	11	Mild temperate	12.2	20.0	0.4	NaN	NaN	E	33.0	SW	...	19.0	92.0	69.0	1015.6	1013.2	8.0	4.0	13.6	19.0	No

5 rows × 22 columns

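3.4 Handling categorical variables: missing values

# Check the fraction of missing values in each column
# In practice, missing values are usually filled with the mean or the mode
# Model-based imputation is avoided because it is slow and hard to interpret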
Xtrain.isnull().mean()

Month            0.000000
Climate          0.000000
MinTemp          0.004000
MaxTemp          0.003143
Rainfall         0.009429
Evaporation      0.433429
Sunshine         0.488571
WindGustDir      0.067714
WindGustSpeed    0.067714
WindDir9am       0.067429
WindDir3pm       0.024286
WindSpeed9am     0.009714
WindSpeed3pm     0.018000
Humidity9am      0.011714
Humidity3pm      0.026286
Pressure9am      0.098857
Pressure3pm      0.098857
Cloud9am         0.379714
Cloud3pm         0.401429
Temp9am          0.005429
Temp3pm          0.019714
RainToday        0.009429
dtype: float64
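`isnull()` returns a boolean frame, and the mean of booleans is the fraction of `True` values, so `isnull().mean()` gives the missing-value rate per column. A toy sketch with made-up data (the frame `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one string column
toy = pd.DataFrame({
    "Sunshine": [7.8, np.nan, 10.4, np.nan],
    "WindGustDir": ["S", "N", None, "SE"],
})

# Fraction of missing entries per column
missing_rate = toy.isnull().mean()
```

Here two of four `Sunshine` values and one of four `WindGustDir` values are missing, which is the same reading as the 0.43 to 0.49 rates reported for `Evaporation` and `Sunshine` above.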

Xtrain.info()


RangeIndex: 3500 entries, 0 to 3499
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Month          3500 non-null   int64  
 1   Climate        3500 non-null   object 
 2   MinTemp        3486 non-null   float64
 3   MaxTemp        3489 non-null   float64
 4   Rainfall       3467 non-null   float64
 5   Evaporation    1983 non-null   float64
 6   Sunshine       1790 non-null   float64
 7   WindGustDir    3263 non-null   object 
 8   WindGustSpeed  3263 non-null   float64
 9   WindDir9am     3264 non-null   object 
 10  WindDir3pm     3415 non-null   object 
 11  WindSpeed9am   3466 non-null   float64
 12  WindSpeed3pm   3437 non-null   float64
 13  Humidity9am    3459 non-null   float64
 14  Humidity3pm    3408 non-null   float64
 15  Pressure9am    3154 non-null   float64
 16  Pressure3pm    3154 non-null   float64
 17  Cloud9am       2171 non-null   float64
 18  Cloud3pm       2095 non-null   float64
 19  Temp9am        3481 non-null   float64
 20  Temp3pm        3431 non-null   float64
 21  RainToday      3467 non-null   object 
dtypes: float64(16), int64(1), object(5)
memory usage: 601.7+ KB

Xtrain.dtypes == "object"

Month            False
Climate           True
MinTemp          False
MaxTemp          False
Rainfall         False
Evaporation      False
Sunshine         False
WindGustDir       True
WindGustSpeed    False
WindDir9am        True
WindDir3pm        True
WindSpeed9am     False
WindSpeed3pm     False
Humidity9am      False
Humidity3pm      False
Pressure9am      False
Pressure3pm      False
Cloud9am         False
Cloud3pm         False
Temp9am          False
Temp3pm          False
RainToday         True
dtype: bool

# First, find out which features are categorical
cate = Xtrain.columns[Xtrain.dtypes == "object"].tolist()

cate

['Climate', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

# Besides the "object" features, cloud cover is stored as numbers but is categorical in nature
cloud = ["Cloud9am","Cloud3pm"]
cate = cate + cloud

cate

['Climate',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
 'Cloud9am',
 'Cloud3pm']

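# For categorical features, impute with the mode (most frequent value)
from sklearn.impute import SimpleImputer  # needs scikit-learn >= 0.20 (upgrade via conda or pip)

si = SimpleImputer(missing_values=np.nan,strategy="most_frequent")  # fill missing entries with the mode
# Note: the imputer is fitted on the training set only; fitting simply computes the training-set modes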
si.fit(Xtrain.loc[:,cate])

SimpleImputer(strategy='most_frequent')


# Then fill both the training set and the test set with the training-set modes
Xtrain.loc[:,cate] = si.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = si.transform(Xtest.loc[:,cate])
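The fit-on-train, transform-both pattern can be checked on a small example. This is a sketch under assumptions: the frames `train` and `test` and their values are made up, and the column names only echo the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical data; Cloud is numeric but treated as categorical
train = pd.DataFrame({"WindDir": ["N", "N", "S", np.nan],
                      "Cloud": [7.0, 7.0, np.nan, 1.0]})
test = pd.DataFrame({"WindDir": [np.nan, "S"],
                     "Cloud": [np.nan, 3.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(train)  # learns one mode per column from the training set only

# The same training-set modes fill the gaps in both sets
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
```

The learned modes are stored in `imputer.statistics_`. Reusing them on the test set keeps information about the test distribution out of the preprocessing.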

Xtrain.head()

	Month	Climate	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	8	High humidity summer warm winter	17.5	36.0	8.8	NaN	ESE	26.0	NNW	...	15.0	57.0	NaN	1016.8	1012.2	0.0	7.0	27.5	NaN	No
1	12	Cool temperate	9.5	25.0	NaN	NaN	NNW	33.0	NE	...	17.0	59.0	31.0	1020.4	1017.5	7.0	7.0	14.6	23.6	No
2	4	Mild temperate	13.0	22.6	3.8	10.4	W	NaN	NE	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8	No
3	11	Mild temperate	13.9	29.8	5.8	5.1	S	37.0	N	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5	No
4	4	Hot dry summer cool winter	6.0	23.5	2.8	8.6	NNE	24.0	E	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4	No

5 rows × 22 columns

Xtest.head()

	Month	Climate	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	1	Cool temperate	22.0	27.8	25.2	NaN	NaN	SSW	57.0	S	...	37.0	91.0	86.0	1006.6	1008.1	7.0	7.0	26.2	23.1	Yes
1	3	Mild temperate	12.0	18.6	2.2	3.0	7.8	SW	52.0	SW	...	28.0	88.0	62.0	1020.2	1019.9	8.0	7.0	14.8	17.5	Yes
2	3	Cool temperate	9.1	13.3	NaN	NaN	NaN	NE	41.0	N	...	NaN	NaN	NaN	NaN	NaN	7.0	7.0	NaN	NaN	No
3	10	Warm temperate	13.1	20.3	0.0	NaN	NaN	SW	33.0	W	...	24.0	40.0	51.0	1021.3	1019.5	7.0	7.0	16.8	19.6	No
4	11	Mild temperate	12.2	20.0	0.4	NaN	NaN	E	33.0	SW	...	19.0	92.0	69.0	1015.6	1013.2	8.0	4.0	13.6	19.0	No

5 rows × 22 columns

# Check whether the categorical features still contain missing values
Xtrain.loc[:,cate].isnull().mean()

Climate        0.0
WindGustDir    0.0
WindDir9am     0.0
WindDir3pm     0.0
RainToday      0.0
Cloud9am       0.0
Cloud3pm       0.0
dtype: float64

Xtest.loc[:,cate].isnull().mean()

Climate        0.0
WindGustDir    0.0
WindDir9am     0.0
WindDir3pm     0.0
RainToday      0.0
Cloud9am       0.0
Cloud3pm       0.0
dtype: float64

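3.5 Handling categorical variables: encoding

# Encode the categorical variables
# Impute missing values first, then encode
# Every categorical variable is encoded as numbers, one number per category

from sklearn.preprocessing import OrdinalEncoder  # accepts only 2-D input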
oe = OrdinalEncoder()

# Fit on the training set
oe = oe.fit(Xtrain.loc[:,cate])

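# Use the encoding learned from the training set to encode both the training and test feature matrices
# If transforming the test matrix raises an error here, the test set may contain outliers or bad values, or a category never seen in the training set
# In that case the preprocessing or the train/test split has to be revisited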
Xtrain.loc[:,cate] = oe.transform(Xtrain.loc[:,cate])
Xtest.loc[:,cate] = oe.transform(Xtest.loc[:,cate])
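The failure mode described in the comments, a test-set category the encoder never saw, can be reproduced on toy data. This is a minimal sketch with made-up frames. The `handle_unknown="use_encoded_value"` option shown as a workaround is not used in the original notebook and needs scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"WindDir": ["N", "S", "E", "S"]})
test = pd.DataFrame({"WindDir": ["S", "W"]})  # "W" never appears in train

# Default behaviour: unseen categories raise ValueError at transform time
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

# Workaround (assumption, not in the original): map unseen categories to -1
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value",
                          unknown_value=-1).fit(train)
codes = tolerant.transform(test)
```

Categories are numbered in sorted order, so `E`, `N`, `S` become 0, 1, 2. Whether a sentinel such as -1 is acceptable depends on the downstream model, so treat it as one option rather than a recommendation.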

cate

Xtrain.loc[:,cate].head()

	Climate	WindGustDir	WindDir9am	WindDir3pm	Cloud9am	Cloud3pm
0	1.0	2.0	6.0	0.0	0.0	7.0
1	0.0	6.0	4.0	6.0	7.0	7.0
2	4.0	13.0	4.0	0.0	1.0	3.0
3	4.0	8.0	3.0	8.0	6.0	6.0
4	2.0	5.0	0.0	6.0	2.0	4.0

Xtest.loc[:,cate].head()

	Climate	WindGustDir	WindDir9am	WindDir3pm	RainToday	Cloud9am	Cloud3pm
0	0.0	11.0	8.0	11.0	1.0	7.0	7.0
1	4.0	12.0	12.0	8.0	1.0	8.0	7.0
2	0.0	4.0	3.0	9.0	0.0	7.0	7.0
3	6.0	12.0	13.0	9.0	0.0	7.0	7.0
4	4.0	0.0	12.0	0.0	0.0	8.0	4.0

Xtrain.head()

	Month	Climate	MinTemp	MaxTemp	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
0	8	1.0	17.5	36.0	8.8	NaN	2.0	26.0	6.0	...	15.0	57.0	NaN	1016.8	1012.2	0.0	7.0	27.5	NaN
1	12	0.0	9.5	25.0	NaN	NaN	6.0	33.0	4.0	...	17.0	59.0	31.0	1020.4	1017.5	7.0	7.0	14.6	23.6
2	4	4.0	13.0	22.6	3.8	10.4	13.0	NaN	4.0	...	31.0	79.0	68.0	1020.3	1015.7	1.0	3.0	17.5	20.8
3	11	4.0	13.9	29.8	5.8	5.1	8.0	37.0	3.0	...	28.0	82.0	44.0	1012.5	1005.9	6.0	6.0	18.5	27.5
4	4	2.0	6.0	23.5	2.8	8.6	5.0	24.0	0.0	...	15.0	58.0	35.0	1019.8	1014.1	2.0	4.0	12.4	22.4

5 rows × 22 columns

3.6 Handling continuous variables: filling missing values

col = Xtrain.columns.tolist()

col

['Month',
 'Climate',
 'MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustDir',
 'WindGustSpeed',
 'WindDir9am',
 'WindDir3pm',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Cloud9am',
 'Cloud3pm',
 'Temp9am',
 'Temp3pm',
 'RainToday']

cate

['Climate',
 'WindGustDir',
 'WindDir9am',
 'WindDir3pm',
 'RainToday',
 'Cloud9am',
 'Cloud3pm']

# Remove the categorical columns from the list so that only the continuous ones remain
for i in cate:
    col.remove(i)

col

['Month',
 'MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Temp9am',
 'Temp3pm']

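# Instantiate the imputer; strategy="mean" fills with the column mean
impmean = SimpleImputer(missing_values=np.nan,strategy = "mean")
# Fit on the training set
impmean = impmean.fit(Xtrain.loc[:,col])
# Apply the mean imputation to both the training and test sets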
Xtrain.loc[:,col] = impmean.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = impmean.transform(Xtest.loc[:,col])

Xtrain.isnull().mean()

Month            0.0
Climate          0.0
MinTemp          0.0
MaxTemp          0.0
Rainfall         0.0
Evaporation      0.0
Sunshine         0.0
WindGustDir      0.0
WindGustSpeed    0.0
WindDir9am       0.0
WindDir3pm       0.0
WindSpeed9am     0.0
WindSpeed3pm     0.0
Humidity9am      0.0
Humidity3pm      0.0
Pressure9am      0.0
Pressure3pm      0.0
Cloud9am         0.0
Cloud3pm         0.0
Temp9am          0.0
Temp3pm          0.0
RainToday        0.0
dtype: float64

Xtest.isnull().mean()

Month            0.0
Climate          0.0
MinTemp          0.0
MaxTemp          0.0
Rainfall         0.0
Evaporation      0.0
Sunshine         0.0
WindGustDir      0.0
WindGustSpeed    0.0
WindDir9am       0.0
WindDir3pm       0.0
WindSpeed9am     0.0
WindSpeed3pm     0.0
Humidity9am      0.0
Humidity3pm      0.0
Pressure9am      0.0
Pressure3pm      0.0
Cloud9am         0.0
Cloud3pm         0.0
Temp9am          0.0
Temp3pm          0.0
RainToday        0.0
dtype: float64

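3.7 Handling continuous variables: feature scaling

# Standardize the continuous features

col.remove("Month")
# Month has no missing values, so it was not part of the categorical imputation, and passing it through the mean imputer changed nothing
# But we do not want to standardize it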

col

['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Pressure3pm',
 'Temp9am',
 'Temp3pm']

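from sklearn.preprocessing import StandardScaler  # rescales each feature to mean 0 and variance 1
# Standardization does not change the shape of the distribution; it does not make the data normal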

ss = StandardScaler()
ss = ss.fit(Xtrain.loc[:,col])
Xtrain.loc[:,col] = ss.transform(Xtrain.loc[:,col])
Xtest.loc[:,col] = ss.transform(Xtest.loc[:,col])
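Mean imputation followed by standardization has a visible side effect: every imputed entry ends up at (or within rounding error of) 0, because it sits exactly at the training mean. A small sketch with made-up one-column arrays illustrates this; the names `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [np.nan], [6.0]])
test = np.array([[np.nan], [3.0]])

# Fit the imputer on the training data only (training mean = 3.0)
imp = SimpleImputer(strategy="mean").fit(train)
train_f, test_f = imp.transform(train), imp.transform(test)

# Fit the scaler on the imputed training data only
scaler = StandardScaler().fit(train_f)
train_s, test_s = scaler.transform(train_f), scaler.transform(test_f)
```

This explains the `0.000000` and `-1.113509e-15` entries in the standardized `Xtrain.head()` output: they mark cells that were NaN before imputation.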

Xtrain.head()

	Month	Climate	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm
0	8.0	1.0	0.826375	1.774044	-0.314379	0.964367	0.000000	2.0	-1.085893e+00	6.0	...	-0.416443	-0.646283	0.000000	-0.122589	-0.453507	0.0	7.0	1.612270	0.000000
1	12.0	0.0	-0.427048	0.244031	-0.314379	0.000000	0.000000	6.0	-5.373993e-01	4.0	...	-0.182051	-0.539186	-1.011310	0.414254	0.340522	7.0	7.0	-0.366608	0.270238
2	4.0	4.0	0.121324	-0.089790	-0.314379	-0.551534	1.062619	13.0	-1.113509e-15	4.0	...	1.458692	0.531786	0.800547	0.399342	0.070852	1.0	3.0	0.078256	-0.132031
3	11.0	4.0	0.262334	0.911673	-0.314379	0.054826	-0.885225	8.0	-2.239744e-01	3.0	...	1.107105	0.692432	-0.374711	-0.763819	-1.397352	6.0	6.0	0.231658	0.830540
4	4.0	2.0	-0.975421	0.035393	-0.314379	-0.854715	0.401087	5.0	-1.242605e+00	0.0	...	-0.416443	-0.592734	-0.815433	0.324780	-0.168855	2.0	4.0	-0.704091	0.097837

5 rows × 22 columns

Xtest.head()

	Month	Climate	MinTemp	MaxTemp	Rainfall	Evaporation	Sunshine	WindGustDir	WindGustSpeed	WindDir9am	...	WindSpeed3pm	Humidity9am	Humidity3pm	Pressure9am	Pressure3pm	Cloud9am	Cloud3pm	Temp9am	Temp3pm	RainToday
0	1.0	0.0	1.531425	0.633489	2.871067	0.000000	0.000000	11.0	1.343150	8.0	...	2.161868e+00	1.174369	1.681991	-1.643646	-1.067755	7.0	7.0	1.412848	0.198404	1.0
1	3.0	4.0	-0.035354	-0.646158	-0.036285	-0.794079	0.107073	12.0	0.951369	12.0	...	1.107105e+00	1.013723	0.506733	0.384430	0.700082	8.0	7.0	-0.335927	-0.606132	1.0
2	3.0	0.0	-0.489720	-1.383346	0.000000	0.000000	0.000000	4.0	0.089450	3.0	...	-4.163637e-16	0.000000	0.000000	0.000000	0.000000	7.0	7.0	0.000000	0.000000	0.0
3	10.0	6.0	0.136992	-0.409702	-0.314379	0.000000	0.000000	12.0	-0.537399	13.0	...	6.383207e-01	-1.556609	-0.031928	0.548465	0.640155	7.0	7.0	-0.029125	-0.304431	0.0
4	11.0	4.0	-0.004018	-0.451429	-0.263817	0.000000	0.000000	0.0	-0.537399	12.0	...	5.234093e-02	1.227917	0.849516	-0.301537	-0.303690	8.0	4.0	-0.520009	-0.390632	0.0

5 rows × 22 columns

# Feature engineering is done; move on to modeling and model evaluation

Ytrain.head()

	0
0	0
1	0
2	0
3	1
4	0

4. Modeling and model evaluation

from time import time  # used to keep track of how long the models take to run
import datetime
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, recall_score

Ytrain = Ytrain.iloc[:,0].to_numpy()  # 1-D label array (Series.ravel is deprecated in recent pandas)
Ytest = Ytest.iloc[:,0].to_numpy()

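# The model is, naturally, the support vector classifier SVC; start by comparing kernels to choose one
# We want to look at accuracy, recall and AUC together
times = time()  # SVMs are computationally heavy, so keep an eye on the running time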

for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000  # kernel cache in MB: a larger value uses more memory and tends to run faster
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)  # predicted labels
    score = clf.score(Xtest,Ytest)  # score() returns accuracy
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))  # second argument: confidence scores (signed distance to the decision boundary)
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

linear 's testing accuracy 0.844000, recall is 0.469388', auc is 0.869029
00:03:751689
poly 's testing accuracy 0.840667, recall is 0.457726', auc is 0.868157
00:04:253937
rbf 's testing accuracy 0.813333, recall is 0.306122', auc is 0.814873
00:05:900434
sigmoid 's testing accuracy 0.655333, recall is 0.154519', auc is 0.437308
00:06:403792
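The three metrics come from two different model outputs: accuracy and recall use the hard labels from `predict`, while AUC uses the continuous scores from `decision_function`. The sketch below shows this on synthetic data. `make_classification` stands in for the weather data, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the weather data (assumption)
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = SVC(kernel="linear").fit(X_tr, y_tr)
pred = clf.predict(X_te)               # hard 0/1 labels
scores = clf.decision_function(X_te)   # signed distance to the hyperplane

accuracy = clf.score(X_te, y_te)       # fraction of correct labels
recall = recall_score(y_te, pred)      # share of class-1 samples found
auc = roc_auc_score(y_te, scores)      # ranking quality of the scores
```

For a binary SVC the predicted label is 1 exactly when the decision score is positive, so the default classifier corresponds to a threshold of 0 on these scores.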

5. Model tuning

5.1 Maximizing recall

# Maximizing recall
times = time()
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
              ,gamma="auto"
              ,degree = 1
              ,cache_size = 5000
              ,class_weight = "balanced"
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("%s 's testing accuracy %f, recall is %f', auc is %f" % (kernel,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

linear 's testing accuracy 0.796000, recall is 0.775510', auc is 0.870065
00:04:266080
poly 's testing accuracy 0.793333, recall is 0.763848', auc is 0.871448
00:04:972567
rbf 's testing accuracy 0.803333, recall is 0.600583', auc is 0.819713
00:06:837272
sigmoid 's testing accuracy 0.562000, recall is 0.282799', auc is 0.437119
00:08:094214
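`class_weight="balanced"` weights each class inversely to its frequency; scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`. The sketch below checks that on a made-up label vector `y` (an assumption; the notebook's training label counts are not shown in this section).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 samples of class 0, 2 samples of class 1
y = np.array([0] * 8 + [1] * 2)

# Weights scikit-learn would use for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# The documented formula, computed by hand
manual = len(y) / (2 * np.bincount(y))
```

The minority class gets the larger weight (2.5 against 0.625 here), so errors on it cost more during training. That matches the jump in recall in the output above.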

times = time()
clf = SVC(kernel = "linear"
          ,gamma="auto"
          ,cache_size = 5000
          ,class_weight = {1:10}
# Note: {1:10} sets the weight of class 1 to 10; class 0 keeps its implicit weight of 1, so the ratio is 1:10
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f, recall is %f', auc is %f" %(score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

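# The model now catches most minority-class samples (high recall), but it also misclassifies many majority-class samples as minority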

testing accuracy 0.636667, recall is 0.912536', auc is 0.866360
00:07:806017

5.2 Maximizing accuracy

# Maximizing accuracy

valuec = pd.Series(Ytest).value_counts()

valuec  # the minority class is labeled 1, the majority class 0

0    1157
1     343
dtype: int64

valuec[0]/valuec.sum()  # share of the majority class, i.e. the accuracy of always predicting 0

0.7713333333333333

# Check the model's specificity
from sklearn.metrics import confusion_matrix as CM

clf = SVC(kernel = "linear"
          ,gamma="auto"
          ,cache_size = 5000
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
# predicted labels

# Build the confusion matrix
cm = CM(Ytest,result,labels=(1,0))

cm

array([[ 161,  182],
       [  52, 1105]], dtype=int64)

specificity = cm[1,1]/cm[1,:].sum()

specificity  # almost all majority-class (0) samples are classified correctly, and a fair share of minority-class (1) samples are too

0.9550561797752809
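With `labels=(1,0)` the confusion matrix puts the true class in rows and the predicted class in columns, class 1 first. The metrics reported so far can be recomputed from the four cells. The sketch below uses the matrix printed above; `precision` is an extra metric added here for illustration and is not computed in the original notebook.

```python
import numpy as np

# Confusion matrix from the output above: rows = true (1, 0), cols = predicted (1, 0)
cm = np.array([[161, 182],
               [52, 1105]])

tp, fn = cm[0]   # true class 1: predicted 1, predicted 0
fp, tn = cm[1]   # true class 0: predicted 1, predicted 0

recall = tp / (tp + fn)            # share of class 1 that was found
specificity = tn / (tn + fp)       # share of class 0 that was found
accuracy = (tp + tn) / cm.sum()    # share of all samples predicted correctly
precision = tp / (tp + fp)         # share of predicted 1s that are truly 1
```

Recall and accuracy computed this way agree with the `linear` row of the kernel comparison (0.469388 and 0.844000), which is a quick check that the matrix is being read correctly.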

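# Accuracy on the majority class is already high; try nudging the class weight slightly so that accuracy on the minority class rises and lifts the overall accuracy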

irange = np.linspace(0.01,0.05,10)

irange

array([0.01      , 0.01444444, 0.01888889, 0.02333333, 0.02777778,
       0.03222222, 0.03666667, 0.04111111, 0.04555556, 0.05      ])

for i in irange:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

under ratio 1:1.010000 testing accuracy 0.844667, recall is 0.475219', auc is 0.869157
00:03:688484
under ratio 1:1.014444 testing accuracy 0.844667, recall is 0.478134', auc is 0.869185
00:03:856429
under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869200
00:03:745157
under ratio 1:1.023333 testing accuracy 0.845333, recall is 0.481050', auc is 0.869175
00:03:598711
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869394
00:03:667787
under ratio 1:1.032222 testing accuracy 0.844000, recall is 0.481050', auc is 0.869528
00:03:641163
under ratio 1:1.036667 testing accuracy 0.844000, recall is 0.481050', auc is 0.869659
00:03:895604
under ratio 1:1.041111 testing accuracy 0.844667, recall is 0.483965', auc is 0.869629
00:03:787082
under ratio 1:1.045556 testing accuracy 0.844667, recall is 0.483965', auc is 0.869712
00:03:729805
under ratio 1:1.050000 testing accuracy 0.845333, recall is 0.486880', auc is 0.869863
00:03:800089

irange_ = np.linspace(0.018889,0.027778,10)

for i in irange_:
    times = time()
    clf = SVC(kernel = "linear"
              ,gamma="auto"
              ,cache_size = 5000
              ,class_weight = {1:1+i}
             ).fit(Xtrain, Ytrain)
    result = clf.predict(Xtest)
    score = clf.score(Xtest,Ytest)
    recall = recall_score(Ytest, result)
    auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
    print("under ratio 1:%f testing accuracy %f, recall is %f', auc is %f" %(1+i,score,recall,auc))
    print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

under ratio 1:1.018889 testing accuracy 0.844667, recall is 0.478134', auc is 0.869213
00:03:654617
under ratio 1:1.019877 testing accuracy 0.844000, recall is 0.478134', auc is 0.869228
00:03:753644
under ratio 1:1.020864 testing accuracy 0.844000, recall is 0.478134', auc is 0.869218
00:03:743298
under ratio 1:1.021852 testing accuracy 0.844667, recall is 0.478134', auc is 0.869188
00:03:557083
under ratio 1:1.022840 testing accuracy 0.844667, recall is 0.478134', auc is 0.869220
00:03:805152
under ratio 1:1.023827 testing accuracy 0.844667, recall is 0.481050', auc is 0.869188
00:03:774551
under ratio 1:1.024815 testing accuracy 0.844667, recall is 0.481050', auc is 0.869231
00:03:644071
under ratio 1:1.025803 testing accuracy 0.844667, recall is 0.481050', auc is 0.869238
00:03:772898
under ratio 1:1.026790 testing accuracy 0.844000, recall is 0.481050', auc is 0.869314
00:03:660354
under ratio 1:1.027778 testing accuracy 0.844000, recall is 0.481050', auc is 0.869326
00:03:715478

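# No accuracy above 84.53% appeared; adjusting the class weights cannot produce a qualitative jump in accuracy
# The remaining option is to switch to a different model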

from sklearn.linear_model import LogisticRegression as LR

logclf = LR(solver="liblinear").fit(Xtrain, Ytrain)
logclf.score(Xtest,Ytest)

0.8486666666666667

C_range = np.linspace(5,10,10)

for C in C_range:
    logclf = LR(solver="liblinear",C=C).fit(Xtrain, Ytrain)
    print(C,logclf.score(Xtest,Ytest))

5.0 0.8493333333333334
5.555555555555555 0.8493333333333334
6.111111111111111 0.8486666666666667
6.666666666666667 0.8493333333333334
7.222222222222222 0.8493333333333334
7.777777777777778 0.8493333333333334
8.333333333333334 0.8493333333333334
8.88888888888889 0.8493333333333334
9.444444444444445 0.8493333333333334
10.0 0.8493333333333334

5.3 Balancing accuracy and recall

# Accuracy still shows no qualitative improvement; an ensemble method might achieve it

# Balancing accuracy and recall
times = time()
clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
          ,class_weight = "balanced"
         ).fit(Xtrain, Ytrain)
result = clf.predict(Xtest)
score = clf.score(Xtest,Ytest)
recall = recall_score(Ytest, result)
auc = roc_auc_score(Ytest,clf.decision_function(Xtest))
print("testing accuracy %f,recall is %f', auc is %f" % (score,recall,auc))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))

testing accuracy 0.794000,recall is 0.772595', auc is 0.870143
00:11:026793

from sklearn.metrics import roc_curve as ROC
import matplotlib.pyplot as plt

FPR, Recall, thresholds = ROC(Ytest,clf.decision_function(Xtest),pos_label=1)  # the positive label is 1 (the minority class)
# roc_curve takes predicted probabilities or confidence scores as input

area = roc_auc_score(Ytest,clf.decision_function(Xtest))

area

0.8701426983930995

plt.figure()
plt.plot(FPR, Recall, color='red',
         label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# Use the ROC curve to find the best threshold
maxindex = (Recall - FPR).tolist().index(max(Recall - FPR))  # index of the largest Recall - FPR gap

thresholds[maxindex]

-0.09027758680662012
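Choosing the threshold that maximizes Recall - FPR is the criterion known as Youden's J statistic. A minimal sketch on made-up labels and scores follows; `y_true` and `scores` are hypothetical, and `np.argmax` replaces the list-based index lookup but returns the same first maximum.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and decision scores (4 positives, 5 negatives)
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0])
scores = np.array([1.9, 1.1, 0.6, 0.2, -0.3, -0.5, -1.0, -1.4, -2.0])

fpr, tpr, thresholds = roc_curve(y_true, scores, pos_label=1)

# Index of the largest gap between recall (TPR) and FPR
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

# Predict 1 whenever the score reaches the chosen threshold
y_pred = (scores >= best_threshold).astype(int)
```

In this toy case the chosen threshold is -0.3: it accepts one false positive in exchange for catching the last positive. A negative threshold, as in the notebook's -0.0903, means the boundary moves toward the majority class.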

from sklearn.metrics import accuracy_score as AC

clf = SVC(kernel = "linear",C=3.1663157894736838,cache_size = 5000
          ,class_weight = "balanced"
         ).fit(Xtrain, Ytrain)

prob = pd.DataFrame(clf.decision_function(Xtest))  # confidence scores

prob.head()

	0
0	2.186028
1	0.373602
2	-0.019583
3	-1.134845
4	-0.237963

prob.loc[prob.iloc[:,0] >= thresholds[maxindex],"y_pred"]=1
prob.loc[prob.iloc[:,0] < thresholds[maxindex],"y_pred"]=0

prob.loc[:,"y_pred"].isnull().sum()

times = time()
score = AC(Ytest,prob.loc[:,"y_pred"].values)
recall = recall_score(Ytest, prob.loc[:,"y_pred"])
print("testing accuracy %f,recall is %f" % (score,recall))
print(datetime.datetime.fromtimestamp(time()-times).strftime("%M:%S:%f"))
# Even at the best threshold, accuracy and recall show no qualitative improvement

testing accuracy 0.790000,recall is 0.804665
00:00:002001

This suggests that tuning has reached the limit of this model; improving the results further requires a different algorithm.

J2EE监听器和过滤器基础白糖_ J2EE
Servlet程序由Servlet，Filter和Listener组成，其中监听器用来监听Servlet容器上下文。监听器通常分三类：基于Servlet上下文的ServletContex监听，基于会话的HttpSession监听和基于请求的ServletRequest监听。 ServletContex监听器 ServletContex又叫application
博弈AngularJS讲义(16) - 提供者 boyitech js AngularJS api Angular Provider
Angular框架提供了强大的依赖注入机制，这一切都是有注入器(injector)完成. 注入器会自动实例化服务组件和符合Angular API规则的特殊对象，例如控制器，指令，过滤器动画等。那注入器怎么知道如何去创建这些特殊的对象呢？ Angular提供了5种方式让注入器创建对象，其中最基础的方式就是提供者(provider), 其余四种方式(Value, Fac
java-写一函数f(a,b)，它带有两个字符串参数并返回一串字符，该字符串只包含在两个串中都有的并按照在a中的顺序。 bylijinnan java
public class CommonSubSequence { /** * 题目：写一函数f(a,b)，它带有两个字符串参数并返回一串字符，该字符串只包含在两个串中都有的并按照在a中的顺序。 * 写一个版本算法复杂度O(N^2)和一个O(N) 。 * * O(N^2)：对于a中的每个字符，遍历b中的每个字符，如果相同，则拷贝到新字符串中。 * O(
sqlserver 2000 无法验证产品密钥 Chen.H sql windows SQL Server Microsoft
在 Service Pack 4 (SP 4), 是运行 Microsoft Windows Server 2003、 Microsoft Windows Storage Server 2003 或 Microsoft Windows 2000 服务器上您尝试安装 Microsoft SQL Server 2000 通过卷许可协议 (VLA) 媒体。这样做, 收到以下错误信息CD KEY的 SQ
[新概念武器]气象战争 comsci
气象战争的发动者必须是拥有发射深空航天器能力的国家或者组织.... 原因如下: 地球上的气候变化和大气层中的云层涡旋场有密切的关系,而维持一个在大气层某个层次
oracle 中 rollup、cube、grouping 使用详解 daizj oracle grouping rollup cube
oracle 中 rollup、cube、grouping 使用详解 -- 使用oracle 样例表演示转自namesliu -- 使用oracle 的样列库，演示 rollup, cube, grouping 的用法与使用场景 --- ROLLUP ，为了理解分组的成员数量，我增加了分组的计数 COUNT(SAL)
技术资料汇总分享 Dead_knight 技术资料汇总分享
本人汇总的技术资料，分享出来，希望对大家有用。 http://pan.baidu.com/s/1jGr56uE 资料主要包含： Workflow->工作流相关理论、框架(OSWorkflow、JBPM、Activiti、fireflow...) Security->java安全相关资料(SSL、SSO、SpringSecurity、Shiro、JAAS...) Ser
初一下学期难记忆单词背诵第一课 dcj3sjt126com english word
could 能够 minute 分钟 Tuesday 星期二 February 二月 eighteenth 第十八 listen 听 careful 小心的，仔细的 short 短的 heavy 重的 empty 空的 certainly 当然 carry 携带；搬运 tape 磁带 basket 蓝子 bottle 瓶 juice 汁，果汁 head 头；头部
截取视图的图片, 然后分享出去 dcj3sjt126com OS Objective-C
OS 7 has a new method that allows you to draw a view hierarchy into the current graphics context. This can be used to get an UIImage very fast. I implemented a category method on UIView to get the vi
MySql重置密码 fanxiaolong MySql重置密码
方法一: 在my.ini的[mysqld]字段加入： skip-grant-tables 重启mysql服务，这时的mysql不需要密码即可登录数据库然后进入mysql mysql>use mysql; mysql>更新 user set password=password('新密码') WHERE User='root'; mysq
Ehcache（03）——Ehcache中储存缓存的方式 234390216 ehcache MemoryStore DiskStore 存储驱除策略
Ehcache中储存缓存的方式目录 1 堆内存（MemoryStore） 1.1 指定可用内存 1.2 驱除策略 1.3 元素过期 2 &nbs
spring mvc中的@propertysource jackyrong spring mvc
在spring mvc中，在配置文件中的东西，可以在java代码中通过注解进行读取了： @PropertySource 在spring 3.1中开始引入比如有配置文件 config.properties mongodb.url=1.2.3.4 mongodb.db=hello 则代码中 @PropertySource(&
重学单例模式 lanqiu17 单例 Singleton 模式
最近在重新学习设计模式，感觉对模式理解更加深刻。觉得有必要记下来。第一个学的就是单例模式，单例模式估计是最好理解的模式了。它的作用就是防止外部创建实例，保证只有一个实例。单例模式的常用实现方式有两种，就人们熟知的饱汉式与饥汉式，具体就不多说了。这里说下其他的实现方式静态内部类方式: package test.pattern.singleton.statics; publ
.NET开源核心运行时，且行且珍惜 netcome java .net 开源
背景 2014年11月12日，ASP.NET之父、微软云计算与企业级产品工程部执行副总裁Scott Guthrie，在Connect全球开发者在线会议上宣布，微软将开源全部.NET核心运行时，并将.NET 扩展为可在 Linux 和 Mac OS 平台上运行。.NET核心运行时将基于MIT开源许可协议发布，其中将包括执行.NET代码所需的一切项目——CLR、JIT编译器、垃圾收集器（GC）和核心
使用oscahe缓存技术减少与数据库的频繁交互 Everyday都不同 Web 高并发 oscahe缓存
此前一直不知道缓存的具体实现，只知道是把数据存储在内存中，以便下次直接从内存中读取。对于缓存的使用也没有概念，觉得缓存技术是一个比较”神秘陌生“的领域。但最近要用到缓存技术，发现还是很有必要一探究竟的。缓存技术使用背景：一般来说，对于web项目，如果我们要什么数据直接jdbc查库好了，但是在遇到高并发的情形下，不可能每一次都是去查数据库，因为这样在高并发的情形下显得不太合理——
Spring+Mybatis 手动控制事务 toknowme mybatis
@Override public boolean testDelete(String jobCode) throws Exception { boolean flag = false; &nbs
菜鸟级的android程序员面试时候需要掌握的知识点 xp9802 android
熟悉Android开发架构和API调用掌握APP适应不同型号手机屏幕开发技巧熟悉Android下的数据存储熟练Android Debug Bridge Tool 熟练Eclipse/ADT及相关工具熟悉Android框架原理及Activity生命周期熟练进行Android UI布局熟练使用SQLite数据库；熟悉Android下网络通信机制，S