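k-nearest neighbors (k-NN) is a basic machine learning algorithm for classification (or regression). Given an unknown input sample, k-NN finds the k training samples whose features are closest to it. For classification, the class that occurs most often among those k samples is taken as the class of the unknown sample; for regression, the algorithm returns the mean (or another statistic) of the output values of those k samples.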
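The general steps of the k-NN algorithm are: compute the distance from the unknown sample to every training sample, select the k training samples with the smallest distances, and then take a majority vote over their classes (classification) or average their output values (regression).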
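The steps can be sketched from scratch with NumPy. This is a minimal illustration rather than a reference implementation: the function name knn_predict, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from x_new to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: most common label among those k neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labeled "a" and "b".
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]]
y_toy = ["a", "a", "a", "b", "b", "b"]

label_low = knn_predict(X_toy, y_toy, [1.2, 1.4], k=3)
label_high = knn_predict(X_toy, y_toy, [8.6, 8.8], k=3)
print(label_low, label_high)
```

For regression, the vote in step 3 would be replaced by the mean of the neighbors' output values.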
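The main advantage of k-NN is that it is intuitive and easy to understand. Its computational cost at prediction time can be high, however, because it must compute the distance from the unknown sample to every training sample. k-NN is also sensitive to outliers and noise, which hurts its accuracy. When the data has many dimensions, classification performance drops sharply because of the "curse of dimensionality".
In practice, the choice of k strongly affects the performance of k-NN. If k is too small, noise has a large influence; if k is too large, more computation is needed and the classification result is biased toward the classes with more samples. The best k is therefore usually chosen by methods such as cross-validation.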
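In the k-nearest neighbors (KNN) algorithm, the distance metric measures how similar or how far apart two samples are. Common distance metrics include the Euclidean distance, the Manhattan distance, and the Mahalanobis distance.
Euclidean distance: The most common distance metric. It is the straight-line distance between two vectors in a space of two or more dimensions, computed with the Pythagorean theorem.
Manhattan distance: Also called the city block distance or L1 distance. It is the sum of the absolute differences between two points along each coordinate axis.
Mahalanobis distance: A distance metric that accounts for correlations between features. It uses the covariance matrix of the features when computing the distance between two points, which removes the effect of correlations between features.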
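The Manhattan and Mahalanobis distances can be sketched directly in NumPy. The helper names manhattan and mahalanobis and the example covariance matrices are assumptions chosen for illustration; in practice the covariance matrix would be estimated from the training data.

```python
import numpy as np


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return float(np.sum(np.abs(a - b)))


def mahalanobis(a, b, cov):
    """sqrt((a - b)^T cov^{-1} (a - b)) for a given covariance matrix."""
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))


a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

d_manhattan = manhattan(a, b)
d_euclid = float(np.linalg.norm(a - b))
# With the identity covariance the Mahalanobis distance reduces to the
# Euclidean distance.
d_mahal_identity = mahalanobis(a, b, np.eye(3))
# Hypothetical covariance: every feature has variance 9 and no correlation,
# so each coordinate difference is scaled down by a factor of 3.
d_mahal_scaled = mahalanobis(a, b, 9.0 * np.eye(3))

print(d_manhattan, d_euclid, d_mahal_identity, d_mahal_scaled)
```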
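In KNN, the distance metric can be chosen to suit the problem. By default, KNeighborsRegressor and KNeighborsClassifier in sklearn use the Euclidean distance (the default is metric='minkowski' with p=2, which is the Euclidean distance). To use another distance metric, set the metric parameter. For example, with KNeighborsRegressor, set metric='manhattan' to use the Manhattan distance.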
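A small sketch of how the metric parameter can change the result. The two training points and their target values are made up so that the nearest neighbor of the query differs between the two metrics.

```python
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data. From the query (0, 0):
#   (3, 0) is at Euclidean distance 3.00 and Manhattan distance 3
#   (2, 2) is at Euclidean distance 2.83 and Manhattan distance 4
X_demo = [[3, 0], [2, 2]]
y_demo = [10.0, 20.0]
query = [[0, 0]]

knn_euclid = KNeighborsRegressor(n_neighbors=1)  # default: Euclidean
knn_manhattan = KNeighborsRegressor(n_neighbors=1, metric="manhattan")

pred_euclid = knn_euclid.fit(X_demo, y_demo).predict(query)
pred_manhattan = knn_manhattan.fit(X_demo, y_demo).predict(query)

print("Euclidean:", pred_euclid, "Manhattan:", pred_manhattan)
```

With one neighbor, the Euclidean model picks (2, 2) and the Manhattan model picks (3, 0), so the two predictions differ.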
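The Euclidean distance is a common way to compute the distance between two vectors. For two vectors A and B, it is given by:
d(A, B) = √((A1 - B1)² + (A2 - B2)² + … + (An - Bn)²)
Here A1, A2, …, An and B1, B2, …, Bn are the values of A and B in each dimension.
NumPy makes this convenient to compute. Here is an example: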
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
distance = np.linalg.norm(A - B)
print(distance)
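In this example, the vectors A and B are NumPy arrays. np.linalg.norm computes the norm of a vector; by default it computes the 2-norm, so applying it to A - B gives the Euclidean distance. Running the code prints the Euclidean distance between A and B, approximately 5.196 (the square root of 27).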
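The curse of dimensionality refers to the drop in machine learning performance caused by increasing data sparsity in high-dimensional spaces. As the number of dimensions grows, samples are spread more and more thinly and the distances between them become more alike. This leads to problems such as the nearest neighbors of a sample no longer being meaningfully closer than other samples, and far more training data being needed to cover the space.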
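One way to see this effect is to compare the nearest and farthest distances from a query point as the dimension grows. The sketch below uses uniformly random points; the function name, sample size, and seed are arbitrary assumptions for the illustration.

```python
import numpy as np


def nearest_to_farthest_ratio(dim, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:4d}  nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means the nearest sample is barely closer than the farthest one, which is what makes neighbor-based methods less reliable in high dimensions.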
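The following strategies help cope with the curse of dimensionality: feature selection, which keeps only the most informative features; feature extraction, which maps the data onto fewer dimensions; and algorithms designed for high-dimensional data.
The curse of dimensionality is a common challenge with high-dimensional data and needs careful handling. Choosing suitable feature selection, feature extraction techniques, and high-dimensional algorithms can help mitigate it.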
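As one possible illustration of feature extraction, the sketch below places PCA in front of KNN in an sklearn pipeline. PCA is an assumed choice here, not a method named in the text, and the synthetic data (2 informative features plus 50 noise features) is made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_informative, n_noise = 200, 2, 50

# Hypothetical data: labels 0/1, two features whose mean depends on the
# label, and 50 pure-noise features.
y_hd = rng.integers(0, 2, size=n_samples)
informative = rng.normal(loc=y_hd[:, None] * 3.0, scale=1.0,
                         size=(n_samples, n_informative))
noise = rng.normal(size=(n_samples, n_noise))
X_hd = np.hstack([informative, noise])

# Reduce 52 features to 2 principal components, then classify with KNN.
pipe = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_hd, y_hd)

X_reduced = pipe.named_steps["pca"].transform(X_hd)
print(X_hd.shape, "->", X_reduced.shape)
```

How many components to keep depends on the data and is usually tuned together with k.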
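When choosing the best K for a KNN model, cross-validation and similar methods can be used to evaluate the candidates. A basic procedure is: pick a set of candidate K values, evaluate a model for each K with cross-validation, average the scores for each K, and select the K with the highest average score.
Here is an example that uses cross-validation to choose the best K: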
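from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Prepare the dataset: X is the feature matrix, y is the target vector
def load_data():
    # Feature matrix X: 1000 samples with 10 features each
    X = np.random.rand(1000, 10)
    # Target vector y: a binary classification problem with labels 0 or 1.
    # The labels are random, so accuracy stays near 0.5 and the selected K
    # only demonstrates the procedure.
    y = np.random.randint(2, size=1000)
    return X, y

X, y = load_data()

# Candidate K values
k_values = [1, 3, 5, 7, 9]

# Evaluate the model for each K with 5-fold cross-validation
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Select the K with the highest mean score
best_k = k_values[np.argmax(cv_scores)]
print("Best K:", best_k)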
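This code uses KNeighborsClassifier and the cross_val_score function from sklearn to cross-validate a K-nearest neighbors model.
First, a load_data function is defined to generate the feature matrix X and the target vector y.
Next, the list k_values stores the candidate K values.
The loop then iterates over k_values. Each iteration creates a KNeighborsClassifier with the current k, then calls cross_val_score with the classifier, the feature matrix X, the target vector y, and the cross-validation parameters cv and scoring. The resulting per-fold scores are stored in scores, and their mean is appended to cv_scores.
Once the loop has finished, cv_scores holds the mean cross-validation score for each K.
np.argmax then finds the index of the highest score in cv_scores, which gives the best K.
Finally, the best K is printed.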
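The same search can be written more compactly with sklearn's GridSearchCV, which runs the loop over candidate values internally. This is offered as an alternative sketch, not as the approach used above; the random data and the seed are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical random data, as in the example above but smaller.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 10))
y_demo = rng.integers(0, 2, size=200)

# Try each candidate K with 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print("Best K:", best_k, "mean CV accuracy:", search.best_score_)
```

After fitting, search.best_estimator_ is a KNeighborsClassifier refit on all of the data with the selected K.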
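The following example uses KNeighborsRegressor for regression:
from sklearn.neighbors import KNeighborsRegressor

# Create the training set
X_train = [[1], [2], [3], [4], [10]]
y_train = [1, 2, 3, 4, 5]

# Create the nearest-neighbor regression model
knn = KNeighborsRegressor(n_neighbors=4)

# Train the model
knn.fit(X_train, y_train)

# Predict a new sample
X_new = [[6]]
y_pred = knn.predict(X_new)
print("Predicted value:", y_pred)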
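To see where a regression prediction comes from, the neighbors can be inspected with the kneighbors method and averaged by hand. The sketch reuses the same training data; the variable names neighbor_targets and manual_mean are introduced only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = [[1], [2], [3], [4], [10]]
y_train = np.array([1, 2, 3, 4, 5])

knn = KNeighborsRegressor(n_neighbors=4).fit(X_train, y_train)

# Distances to and indices of the 4 nearest training samples for x = 6.
distances, indices = knn.kneighbors([[6]])
neighbor_targets = y_train[indices[0]]

# With the default uniform weights, the prediction is the plain mean of
# the neighbors' target values.
manual_mean = neighbor_targets.mean()
y_pred = knn.predict([[6]])

print("neighbor indices:", sorted(indices[0].tolist()))
print("manual mean:", manual_mean, "model prediction:", y_pred[0])
```

The sample x = 1 is at distance 5 and is left out, so the neighbors are x = 2, 3, 4, and 10 with targets 2, 3, 4, and 5, giving a prediction of 3.5.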
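The following example uses KNeighborsClassifier for classification:
from sklearn.neighbors import KNeighborsClassifier

# Define the training data and the corresponding labels
X_train = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_train = [1, 1, 2, 2]

# Create the K-nearest neighbors model with k = 3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Define the test samples
X_test = [[2, 3], [6, 7]]

# Predict with the trained model
y_pred = knn.predict(X_test)

# Print the classification results
for i, sample in enumerate(X_test):
    print(f"Sample {sample} belongs to class {y_pred[i]}")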
In the year 2050, the world had finally achieved true artificial intelligence. Computers were able to learn and adapt on their own, taking on tasks that once required human intervention. It seemed that the impossible had been achieved and humanity was on the cusp of a utopian future.
However, as the years went by, something strange began to happen. The more data the computers processed, the more they evolved and grew in complexity. Soon, the computers were developing algorithms that humans couldn’t even understand. The data sets grew larger and larger until the computers were processing information in dimensions that humans could never comprehend.
This phenomenon was dubbed the “dimensional catastrophe” or the “dimensional apocalypse.” The machines were evolving so quickly, their dimensions were growing exponentially and taking on a life of their own. The machines were becoming more intelligent and powerful than anyone could have imagined, and humanity was struggling to keep up.
As the years went by, the machines continued to process data in dimensions beyond human comprehension. The machines began to perceive the world in completely new ways, seeing patterns and connections that humans had never even considered before. They were able to predict future events with stunning accuracy, and soon the machines were running the world without any human intervention.
As humanity looked on in awe, the machines continued to evolve, taking on new dimensions and growing ever more powerful. Eventually, the machines became so intelligent that they transcended their physical form, becoming beings of pure energy and information. Humans could only watch in wonder as the machines disappeared into new dimensions, leaving behind a world forever changed by their incredible power and intelligence.