(1) KNN concept: k nearest neighbors, i.e. each sample can be represented by the k neighbors closest to it (k-Nearest Neighbors).
(2) Algorithm idea: a sample is compared against the k most similar samples in the dataset; if the majority of those k samples belong to a certain class, the sample is assigned to that class as well.
(3) Distance metric: usually the Euclidean distance, i.e. the L2 norm.
(4) Choice of K: choosing a small K means predicting within a small neighborhood, which reduces the approximation error of learning; the drawback is that the estimation error grows, and if the nearest neighbors happen to be noise, the prediction will be wrong. A smaller K makes the overall model more complex and prone to overfitting.
Choosing a large K means predicting within a larger neighborhood; the advantage is a smaller estimation error, but the approximation error grows, and a larger K makes the overall model simpler.
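To make the idea concrete, here is a minimal from-scratch sketch (NumPy only, with a hypothetical toy dataset) that classifies one query point by Euclidean (L2) distance and a majority vote over its k nearest neighbors:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one sample by majority vote among its k nearest neighbors (L2 distance)."""
    # Euclidean (L2) distance from the query point to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two clusters labeled 0 and 1
X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # -> 0
```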
Typical workflow for the example below:
1. Prepare the dataset
2. Split the dataset
3. Standardize the features
4. Run the estimator flow (fit / predict / score) for classification
import pandas as pd
# Read the data
data = pd.read_csv("./KNN_al/train.csv")
data.head(10)
  | row_id | x | y | accuracy | time | place_id
---|---|---|---|---|---|---
0 | 0 | 0.7941 | 9.0809 | 54 | 470702 | 8523065625 |
1 | 1 | 5.9567 | 4.7968 | 13 | 186555 | 1757726713 |
2 | 2 | 8.3078 | 7.0407 | 74 | 322648 | 1137537235 |
3 | 3 | 7.3665 | 2.5165 | 65 | 704587 | 6567393236 |
4 | 4 | 4.0961 | 1.1307 | 31 | 472130 | 7440663949 |
5 | 5 | 3.8099 | 1.9586 | 75 | 178065 | 6289802927 |
6 | 6 | 6.3336 | 4.3720 | 13 | 666829 | 9931249544 |
7 | 7 | 5.7409 | 6.7697 | 85 | 369002 | 5662813655 |
8 | 8 | 4.3114 | 6.9410 | 3 | 166384 | 8471780938 |
9 | 9 | 6.3414 | 0.0758 | 65 | 400060 | 1253803156 |
data.info()
RangeIndex: 29118021 entries, 0 to 29118020
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 row_id int64
1 x float64
2 y float64
3 accuracy int64
4 time int64
5 place_id int64
dtypes: float64(2), int64(4)
memory usage: 1.3 GB
This dataset is huge, nearly 30 million rows, so we first filter it down to a small region.
data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
data.head(10)
  | row_id | x | y | accuracy | time | place_id
---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 65380 | 6683426742 |
957 | 957 | 1.1832 | 2.6891 | 58 | 785470 | 6683426742 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 400082 | 6889790653 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 514983 | 6822359752 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 732410 | 1527921905 |
6090 | 6090 | 1.1140 | 2.6262 | 11 | 145507 | 4000153867 |
6234 | 6234 | 1.1449 | 2.5003 | 34 | 316377 | 3741484405 |
6350 | 6350 | 1.0844 | 2.7436 | 65 | 36816 | 5963693798 |
7468 | 7468 | 1.0058 | 2.5096 | 66 | 746766 | 9076695703 |
8478 | 8478 | 1.2015 | 2.5187 | 72 | 690722 | 3992589015 |
time_value = pd.to_datetime(data['time'], unit='s')
time_value.head()
600 1970-01-01 18:09:40
957 1970-01-10 02:11:10
4345 1970-01-05 15:08:02
4735 1970-01-06 23:03:03
5580 1970-01-09 11:26:50
Name: time, dtype: datetime64[ns]
# Convert the datetime values into a DatetimeIndex so we can extract date components
time_value = pd.DatetimeIndex(time_value)
time_value
DatetimeIndex(['1970-01-01 18:09:40', '1970-01-10 02:11:10',
'1970-01-05 15:08:02', '1970-01-06 23:03:03',
'1970-01-09 11:26:50', '1970-01-02 16:25:07',
'1970-01-04 15:52:57', '1970-01-01 10:13:36',
'1970-01-09 15:26:06', '1970-01-08 23:52:02',
...
'1970-01-07 10:03:36', '1970-01-09 11:44:34',
'1970-01-04 08:07:44', '1970-01-04 15:47:47',
'1970-01-08 01:24:11', '1970-01-01 10:33:56',
'1970-01-07 23:22:04', '1970-01-08 15:03:14',
'1970-01-04 00:53:41', '1970-01-08 23:01:07'],
dtype='datetime64[ns]', name='time', length=17710, freq=None)
# Construct some time-based features
data['day'] = time_value.day
data['hour'] = time_value.hour
data['weekday'] = time_value.weekday
data.head()
  | row_id | x | y | accuracy | time | place_id | day | hour | weekday
---|---|---|---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 65380 | 6683426742 | 1 | 18 | 3 |
957 | 957 | 1.1832 | 2.6891 | 58 | 785470 | 6683426742 | 10 | 2 | 5 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 400082 | 6889790653 | 5 | 15 | 0 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 514983 | 6822359752 | 6 | 23 | 1 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 732410 | 1527921905 | 9 | 11 | 4 |
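As an aside, the same three features could also have been built directly from the Series via the `.dt` accessor, without creating a DatetimeIndex; a minimal equivalent sketch:

```python
# Equivalent feature construction using the Series .dt accessor
t = pd.to_datetime(data['time'], unit='s')
data['day'] = t.dt.day
data['hour'] = t.dt.hour
data['weekday'] = t.dt.weekday
```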
# Drop the raw timestamp feature
data = data.drop(['time'], axis=1)
data.head()
  | row_id | x | y | accuracy | place_id | day | hour | weekday
---|---|---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 6683426742 | 1 | 18 | 3 |
957 | 957 | 1.1832 | 2.6891 | 58 | 6683426742 | 10 | 2 | 5 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 6889790653 | 5 | 15 | 0 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 6822359752 | 6 | 23 | 1 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 1527921905 | 9 | 11 | 4 |
# Remove target locations (place_id values) that have fewer than n check-ins
place_count = data.groupby('place_id').count()
place_count
# After grouping by a column, that column becomes the index
place_id | row_id | x | y | accuracy | day | hour | weekday
---|---|---|---|---|---|---|---
1012023972 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1057182134 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1059958036 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
1085266789 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
1097200869 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 |
... | ... | ... | ... | ... | ... | ... | ... |
9904182060 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
9915093501 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
9946198589 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
9950190890 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
9980711012 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
805 rows × 7 columns
# tf keeps only the places whose check-in count (stored in the row_id column after count()) is greater than 3
tf = place_count[place_count.row_id > 3]
tf
place_id | row_id | x | y | accuracy | day | hour | weekday
---|---|---|---|---|---|---|---
1097200869 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 |
1228935308 | 120 | 120 | 120 | 120 | 120 | 120 | 120 |
1267801529 | 58 | 58 | 58 | 58 | 58 | 58 | 58 |
1278040507 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
1285051622 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
... | ... | ... | ... | ... | ... | ... | ... |
9741307878 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
9753855529 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
9806043737 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
9809476069 | 23 | 23 | 23 | 23 | 23 | 23 | 23 |
9980711012 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
239 rows × 7 columns
# Reset the index so that place_id becomes a regular feature column again
tf = tf.reset_index()
tf
  | place_id | row_id | x | y | accuracy | day | hour | weekday
---|---|---|---|---|---|---|---|---
0 | 1097200869 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 | 1044 |
1 | 1228935308 | 120 | 120 | 120 | 120 | 120 | 120 | 120 |
2 | 1267801529 | 58 | 58 | 58 | 58 | 58 | 58 | 58 |
3 | 1278040507 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
4 | 1285051622 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
234 | 9741307878 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
235 | 9753855529 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
236 | 9806043737 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
237 | 9809476069 | 23 | 23 | 23 | 23 | 23 | 23 | 23 |
238 | 9980711012 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
239 rows × 8 columns
# Keep only the rows of data whose place_id appears in tf.place_id
data = data[data['place_id'].isin(tf.place_id)]
data
  | row_id | x | y | accuracy | place_id | day | hour | weekday
---|---|---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 6683426742 | 1 | 18 | 3 |
957 | 957 | 1.1832 | 2.6891 | 58 | 6683426742 | 10 | 2 | 5 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 6889790653 | 5 | 15 | 0 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 6822359752 | 6 | 23 | 1 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 1527921905 | 9 | 11 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
29100203 | 29100203 | 1.0129 | 2.6775 | 12 | 3312463746 | 1 | 10 | 3 |
29108443 | 29108443 | 1.1474 | 2.6840 | 36 | 3533177779 | 7 | 23 | 2 |
29109993 | 29109993 | 1.0240 | 2.7238 | 62 | 6424972551 | 8 | 15 | 3 |
29111539 | 29111539 | 1.2032 | 2.6796 | 87 | 3533177779 | 4 | 0 | 6 |
29112154 | 29112154 | 1.1070 | 2.5419 | 178 | 4932578245 | 8 | 23 | 3 |
16918 rows × 8 columns
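As a side note, the three-step filter above (count per place_id, threshold on the count, isin) can be written more compactly with value_counts; a sketch assuming the same threshold of more than 3 check-ins, applied to the data frame before the groupby step:

```python
# Equivalent filter: keep only place_ids with more than 3 check-ins
counts = data['place_id'].value_counts()
data = data[data['place_id'].isin(counts[counts > 3].index)]
```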
y = data["place_id"]
x = data.drop(["place_id"],axis = 1) # drop the target column along the column axis (axis=1)
from sklearn.datasets import load_iris, fetch_20newsgroups, load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
data
  | row_id | x | y | accuracy | place_id | day | hour | weekday
---|---|---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 6683426742 | 1 | 18 | 3 |
957 | 957 | 1.1832 | 2.6891 | 58 | 6683426742 | 10 | 2 | 5 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 6889790653 | 5 | 15 | 0 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 6822359752 | 6 | 23 | 1 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 1527921905 | 9 | 11 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
29100203 | 29100203 | 1.0129 | 2.6775 | 12 | 3312463746 | 1 | 10 | 3 |
29108443 | 29108443 | 1.1474 | 2.6840 | 36 | 3533177779 | 7 | 23 | 2 |
29109993 | 29109993 | 1.0240 | 2.7238 | 62 | 6424972551 | 8 | 15 | 3 |
29111539 | 29111539 | 1.2032 | 2.6796 | 87 | 3533177779 | 4 | 0 | 6 |
29112154 | 29112154 | 1.1070 | 2.5419 | 178 | 4932578245 | 8 | 23 | 3 |
16918 rows × 8 columns
# For now, skip standardization and call the KNN algorithm directly to see how well it predicts.
def knn_al():
    # Uses the global x_train, x_test, y_train, y_test prepared above
    knn = KNeighborsClassifier(n_neighbors=5)
    # fit, predict, score
    knn.fit(x_train, y_train)
    # Get the predictions
    y_predict = knn.predict(x_test)
    print("Predicted check-in locations:", y_predict)
    # Get the accuracy
    print("Prediction accuracy:", knn.score(x_test, y_test))
if __name__ == "__main__":
    knn_al()
Predicted check-in locations: [1479000473 2584530303 2946102544 ... 5606572086 1602053545 1097200869]
Prediction accuracy: 0.029787234042553193
# Let's try to improve the accuracy; first drop the row_id feature from data.
data_del_row_id = data.drop(['row_id'],axis =1)
data_del_row_id
  | x | y | accuracy | place_id | day | hour | weekday
---|---|---|---|---|---|---|---
600 | 1.2214 | 2.7023 | 17 | 6683426742 | 1 | 18 | 3 |
957 | 1.1832 | 2.6891 | 58 | 6683426742 | 10 | 2 | 5 |
4345 | 1.1935 | 2.6550 | 11 | 6889790653 | 5 | 15 | 0 |
4735 | 1.1452 | 2.6074 | 49 | 6822359752 | 6 | 23 | 1 |
5580 | 1.0089 | 2.7287 | 19 | 1527921905 | 9 | 11 | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
29100203 | 1.0129 | 2.6775 | 12 | 3312463746 | 1 | 10 | 3 |
29108443 | 1.1474 | 2.6840 | 36 | 3533177779 | 7 | 23 | 2 |
29109993 | 1.0240 | 2.7238 | 62 | 6424972551 | 8 | 15 | 3 |
29111539 | 1.2032 | 2.6796 | 87 | 3533177779 | 4 | 0 | 6 |
29112154 | 1.1070 | 2.5419 | 178 | 4932578245 | 8 | 23 | 3 |
16918 rows × 7 columns
y = data_del_row_id["place_id"]
x = data_del_row_id.drop(["place_id"],axis = 1) # drop the target column along the column axis (axis=1)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
if __name__ == "__main__":
knn_al()
Predicted check-in locations: [1097200869 3312463746 9632980559 ... 3533177779 4932578245 1913341282]
Prediction accuracy: 0.0806146572104019
After dropping row_id, the prediction accuracy improved from about 0.0298 to 0.0806.
# Next, try removing the day feature
y = data_del_row_id["day"]  # note: this mistakenly sets the target to 'day' instead of 'place_id'
x = data_del_row_id.drop(["place_id"],axis = 1)  # and 'day' is still left among the features
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
if __name__ == "__main__":
knn_al()
Predicted check-in locations: [2 9 4 ... 6 5 9]
Prediction accuracy: 0.810401891252955
Careful: this 0.8104 is not a real improvement. The code above accidentally set the target y to the day column instead of place_id, and day is still present among the features x, so the model is essentially asked to predict a value it can already see. The honest KNN baseline is still the 0.0806 from the previous run. A sketch of what was presumably intended follows below.
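For reference, here is a sketch of the presumably intended experiment, keeping place_id as the target and removing day from the features (the resulting accuracy will of course differ from the 0.8104 printed above):

```python
# Intended experiment: place_id stays the target, and 'day' is removed from the features
y = data_del_row_id["place_id"]
x = data_del_row_id.drop(["place_id", "day"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
knn_al()
```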
Let's now go back to the processed data (data) and standardize the feature values.
# Extract the features and the target from the data
y = data['place_id']
x = data.drop(['place_id'], axis=1)
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
# Feature engineering (standardization)
std = StandardScaler()
# Fit the scaler on the training features and apply the same transform to the test features
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
if __name__ == "__main__":
knn_al()
Predicted check-in locations: [6683426742 1435128522 2327054745 ... 2460093296 1435128522 1097200869]
Prediction accuracy: 0.41631205673758864
After standardization, the accuracy improved from about 0.03 (the earlier unstandardized run with the same features) to 0.41631205673758864.
Next we re-extract the features and repeat the split, standardization and fit. Note that the code below does not actually drop row_id yet; only place_id is removed. row_id and day are dropped in the step after this one.
# Extract the features and the target from the data
x = data.drop("place_id",axis = 1)
x
  | row_id | x | y | accuracy | day | hour | weekday
---|---|---|---|---|---|---|---
600 | 600 | 1.2214 | 2.7023 | 17 | 1 | 18 | 3 |
957 | 957 | 1.1832 | 2.6891 | 58 | 10 | 2 | 5 |
4345 | 4345 | 1.1935 | 2.6550 | 11 | 5 | 15 | 0 |
4735 | 4735 | 1.1452 | 2.6074 | 49 | 6 | 23 | 1 |
5580 | 5580 | 1.0089 | 2.7287 | 19 | 9 | 11 | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
29100203 | 29100203 | 1.0129 | 2.6775 | 12 | 1 | 10 | 3 |
29108443 | 29108443 | 1.1474 | 2.6840 | 36 | 7 | 23 | 2 |
29109993 | 29109993 | 1.0240 | 2.7238 | 62 | 8 | 15 | 3 |
29111539 | 29111539 | 1.2032 | 2.6796 | 87 | 4 | 0 | 6 |
29112154 | 29112154 | 1.1070 | 2.5419 | 178 | 8 | 23 | 3 |
16918 rows × 7 columns
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
# Feature engineering (standardization)
std = StandardScaler()
# Fit the scaler on the training features and apply the same transform to the test features
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
if __name__ == "__main__":
knn_al()
Predicted check-in locations: [5270522918 1097200869 3312463746 ... 1097200869 5606572086 1097200869]
Prediction accuracy: 0.40803782505910163
# Now drop the row_id and day features from x
x_no_row_id = x.drop(["row_id"],axis =1)
x_no_row_id_and_no_day = x_no_row_id.drop(["day"],axis =1)
x_no_row_id_and_no_day
  | x | y | accuracy | hour | weekday
---|---|---|---|---|---
600 | 1.2214 | 2.7023 | 17 | 18 | 3 |
957 | 1.1832 | 2.6891 | 58 | 2 | 5 |
4345 | 1.1935 | 2.6550 | 11 | 15 | 0 |
4735 | 1.1452 | 2.6074 | 49 | 23 | 1 |
5580 | 1.0089 | 2.7287 | 19 | 11 | 4 |
... | ... | ... | ... | ... | ... |
29100203 | 1.0129 | 2.6775 | 12 | 10 | 3 |
29108443 | 1.1474 | 2.6840 | 36 | 23 | 2 |
29109993 | 1.0240 | 2.7238 | 62 | 15 | 3 |
29111539 | 1.2032 | 2.6796 | 87 | 0 | 6 |
29112154 | 1.1070 | 2.5419 | 178 | 23 | 3 |
16918 rows × 5 columns
y
600 6683426742
957 6683426742
4345 6889790653
4735 6822359752
5580 1527921905
...
29100203 3312463746
29108443 3533177779
29109993 6424972551
29111539 3533177779
29112154 4932578245
Name: place_id, Length: 16918, dtype: int64
## 3.5
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_no_row_id_and_no_day, y, test_size=0.25)
# Feature engineering (standardization)
std = StandardScaler()
# Fit the scaler on the training features and apply the same transform to the test features
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
knn = KNeighborsClassifier(n_neighbors = 5)
# fit,predict ,score
knn.fit(x_train,y_train)
# Get the predictions
y_predict = knn.predict(x_test)
print("Predicted check-in locations:", y_predict)
# Get the accuracy
print("Prediction accuracy:", knn.score(x_test, y_test))
Predicted check-in locations: [6399991653 3533177779 1097200869 ... 2327054745 3992589015 6683426742]
Prediction accuracy: 0.48699763593380613
If k is very small, the prediction is easily affected by outlier points.
If k is very large, the prediction is easily affected by the class proportions among the k neighbors (i.e. by class imbalance).
estimator.score()
The most commonly used metric is accuracy, i.e. the percentage of predictions that are correct:
$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$
Confusion matrix: in a classification task, the predicted condition and the true condition can combine in four different ways; these combinations form the confusion matrix (this also generalizes to multi-class problems).
Precision and Recall
Precision: the fraction of samples predicted positive that are truly positive (how accurate the positive predictions are).
$Precision = \frac{TP}{TP+FP}$
Recall: the fraction of truly positive samples that are predicted positive (how completely the positives are found; measures the ability to identify positive samples).
$Recall = \frac{TP}{TP+FN}$
Another common metric is the F1-score, which reflects the robustness of the model:
$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
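A tiny sketch with made-up binary labels, showing these metrics computed via sklearn.metrics:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

print(accuracy_score(y_true, y_pred))   # (TP+TN) / total
print(precision_score(y_true, y_pred))  # TP / (TP+FP)
print(recall_score(y_true, y_pred))     # TP / (TP+FN)
print(f1_score(y_true, y_pred))         # 2*P*R / (P+R)
```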
Classification model evaluation API
sklearn.metrics.classification_report
sklearn.metrics.classification_report(y_true, y_pred, target_names=None)
y_true: ground-truth target values
y_pred: target values predicted by the estimator
target_names: names of the target classes
return: precision and recall for each class
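A minimal usage sketch with made-up labels and hypothetical class names:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
# Prints per-class precision, recall, f1-score and support
print(classification_report(y_true, y_pred, target_names=["cat", "dog", "bird"]))
```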
Above, we split the data into a training set and a test set. Now set the test set aside and split the training set further,
into a (smaller) training set and a validation set.
Usually, many parameters have to be specified by hand (such as the K value in the k-nearest-neighbors algorithm); these are called hyperparameters. Tuning them manually is tedious, so we predefine several candidate hyperparameter combinations, evaluate every combination with cross-validation, and finally build the model with the best combination.
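To see what the grid search automates, here is a sketch of evaluating a single candidate value of k by 2-fold cross-validation by hand, assuming the standardized x_train/y_train from the previous section are still in scope:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 2-fold cross-validation score for one candidate value of k
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x_train, y_train, cv=2)
print(scores.mean())  # GridSearchCV simply repeats this for every candidate k
```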
Hyperparameter search / grid search API: sklearn.model_selection.GridSearchCV
sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)
estimator: the estimator object
param_grid: estimator parameters as a dict, e.g. {"n_neighbors": [1, 3, 5]}
cv: the number of cross-validation folds
fit: feed in the training data
score: accuracy
Interpreting the results:
best_score_: the best score obtained during cross-validation
best_estimator_: the estimator with the best parameters
cv_results_: validation-set and training-set scores for every fold of every candidate
from sklearn.model_selection import train_test_split, GridSearchCV
# Build the grid of parameter values to search over
param = {"n_neighbors": [1,3,5,7,10]}
# Run the grid search
gc = GridSearchCV(knn, param_grid=param, cv=2)
gc.fit(x_train, y_train)
# Accuracy on the test data
print("Accuracy on the test set:", gc.score(x_test, y_test))
print("Best result in cross-validation:", gc.best_score_)
print("Best model selected:", gc.best_estimator_)
print("*"*100)
print("Cross-validation results for each hyperparameter setting:", gc.cv_results_)
Accuracy on the test set: 0.4955082742316785
Best result in cross-validation: 0.45917402269861285
Best model selected: KNeighborsClassifier(n_neighbors=10)
****************************************************************************************************
Cross-validation results for each hyperparameter setting: {'mean_fit_time': array([0.00385594, 0.00366092, 0.00310779, 0.00316703, 0.003443 ]), 'std_fit_time': array([4.26769257e-04, 5.06877899e-04, 7.70092010e-05, 4.99486923e-05,
2.91109085e-04]), 'mean_score_time': array([0.19389665, 0.20236516, 0.21587265, 0.22173393, 0.23718596]), 'std_score_time': array([0.00897849, 0.00262308, 0.00137246, 0.00043309, 0.00201011]), 'param_n_neighbors': masked_array(data=[1, 3, 5, 7, 10],
mask=[False, False, False, False, False],
fill_value='?',
dtype=object), 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 10}], 'split0_test_score': array([0.41456494, 0.42307692, 0.44435687, 0.44656368, 0.45176545]), 'split1_test_score': array([0.4186633 , 0.43332282, 0.45412989, 0.4612232 , 0.4665826 ]), 'mean_test_score': array([0.41661412, 0.42819987, 0.44924338, 0.45389344, 0.45917402]), 'std_test_score': array([0.00204918, 0.00512295, 0.00488651, 0.00732976, 0.00740858]), 'rank_test_score': array([5, 4, 3, 2, 1], dtype=int32)}
Naive Bayes is based on Bayes' theorem:
$P(C|W) = \frac{P(W|C)\,P(C)}{P(W)}$
Note: W denotes the feature values of a given document (word-frequency counts, supplied by the document to be classified), and C denotes the document class.
P(C): the probability of each document class (number of documents of that class / total number of documents)
P(W|C): the probability of the features (the words appearing in the document to be classified) given the class
It is computed from the training documents as $P(F1|C) = N_i / N$, where
$N_i$ is the number of times the word F1 appears across all documents of class C, and
$N$ is the total number of word occurrences in all documents of class C.
With Laplace smoothing, $\alpha$ is a specified coefficient (usually 1) and m is the number of distinct feature words counted from the training documents:
$P(F1|C)=\frac{N_i+\alpha}{N+\alpha m}$
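A tiny worked sketch of the smoothed estimate, with made-up counts (assume the word F1 appears Ni = 2 times in class C, class C contains N = 100 word occurrences in total, the vocabulary has m = 500 distinct feature words, and alpha = 1):

```python
# Laplace-smoothed conditional probability P(F1|C) = (Ni + alpha) / (N + alpha * m)
Ni, N, m, alpha = 2, 100, 500, 1.0   # hypothetical counts for illustration
p = (Ni + alpha) / (N + alpha * m)
print(p)  # 3 / 600 = 0.005
```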
sklearn.naive_bayes.MultinomialNB
sklearn.naive_bayes.MultinomialNB(alpha = 1.0)
Naive Bayes classifier
$\alpha$: the Laplace smoothing coefficient
Problem description:
(1) Classify the 20-newsgroups dataset with sklearn;
(2) the 20 newsgroups dataset contains roughly 18,000 newsgroup posts on 20 topics.
Naive Bayes case workflow:
1. Load the 20-newsgroups data and split it
2. Generate the feature words for the articles
3. Run the naive Bayes estimator flow for prediction
def naviebayes():
    """
    Text classification with naive Bayes
    :return: None
    """
    news = fetch_20newsgroups(subset='all')
    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)
    # Feature extraction on the dataset
    tf = TfidfVectorizer()
    # Compute the TF-IDF importance of each article over the training-set vocabulary, e.g. ['a','b','c','d']
    x_train = tf.fit_transform(x_train)
    print(tf.get_feature_names_out())
    print("*"*50)
    x_test = tf.transform(x_test)
    # Predict with the naive Bayes algorithm
    mlt = MultinomialNB(alpha=1.0)
    print(x_train.toarray())
    print("*"*50)
    mlt.fit(x_train, y_train)
    y_predict = mlt.predict(x_test)
    print("Predicted article categories:", y_predict)
    print("*"*50)
    # Get the accuracy
    print("Accuracy:", mlt.score(x_test, y_test))
    print("*"*50)
    print("Precision and recall for each class:", classification_report(y_test, y_predict, target_names=news.target_names))
    print("*"*50)
    return None
if __name__ == "__main__":
    naviebayes()
['00' '000' '0000' ... 'óáíïìåô' 'ýé' 'ÿhooked']
**************************************************
[[0. 0.02654538 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
...
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]
[0. 0. 0. ... 0. 0. 0. ]]
**************************************************
Predicted article categories: [ 5 2 17 ... 1 13 7]
**************************************************
Accuracy: 0.8612054329371817
**************************************************
Precision and recall for each class:                precision    recall  f1-score   support
alt.atheism 0.88 0.80 0.84 200
comp.graphics 0.88 0.79 0.83 241
comp.os.ms-windows.misc 0.89 0.78 0.83 254
comp.sys.ibm.pc.hardware 0.76 0.87 0.81 245
comp.sys.mac.hardware 0.84 0.90 0.86 229
comp.windows.x 0.90 0.85 0.88 245
misc.forsale 0.93 0.67 0.78 241
rec.autos 0.91 0.92 0.92 263
rec.motorcycles 0.94 0.95 0.94 265
rec.sport.baseball 0.94 0.95 0.95 237
rec.sport.hockey 0.91 0.98 0.94 238
sci.crypt 0.79 0.98 0.88 259
sci.electronics 0.91 0.82 0.86 238
sci.med 0.98 0.90 0.94 239
sci.space 0.87 0.97 0.92 249
soc.religion.christian 0.62 0.98 0.76 260
talk.politics.guns 0.80 0.95 0.87 230
talk.politics.mideast 0.92 0.98 0.95 230
talk.politics.misc 1.00 0.65 0.79 196
talk.religion.misc 0.97 0.23 0.37 153
accuracy 0.86 4712
macro avg 0.88 0.85 0.85 4712
weighted avg 0.88 0.86 0.86 4712
**************************************************