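The k-nearest neighbors method (k-NN) is a basic machine learning algorithm for classification (or regression). Given an unknown input sample, k-NN finds the k training samples whose features are closest to it. For classification, the class that occurs most often among these k samples is assigned to the unknown sample; for regression, the prediction is the mean (or another statistic) of the output values of these k samples.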
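The classification rule described above can be sketched from scratch as follows. This is a minimal illustration that assumes Euclidean distance and a plain majority vote; the function name knn_predict and the toy data are hypothetical and not part of the original text.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["a", "a", "a", "b", "b", "b"]
label_near_a = knn_predict(X_train, y_train, [1.5, 1.5], k=3)
label_near_b = knn_predict(X_train, y_train, [8.5, 8.5], k=3)
print(label_near_a, label_near_b)
```

For regression, the same neighbor search would apply, with the vote replaced by an average of the neighbors' target values.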
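Euclidean distance: the most common distance metric. It is the straight-line distance between two vectors in a space of two or more dimensions, obtained by applying the Pythagorean theorem.
Manhattan distance: also known as city block distance or L1 distance. It is the sum of the absolute differences between two points along each coordinate axis.
Mahalanobis distance: a distance metric that accounts for correlations between features. It uses the covariance matrix of the features when computing the distance between two points, which removes the effect of correlations between features.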
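The Manhattan and Mahalanobis definitions above can be sketched with NumPy as follows. The two vectors, the synthetic data set, and the covariance estimated from it are all assumptions made for illustration; in practice the covariance matrix would come from the actual training data.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(A - B))  # 3 + 4 + 0

# Euclidean (L2) distance, shown for comparison
euclidean = np.sqrt(np.sum((A - B) ** 2))  # sqrt(9 + 16 + 0)

# Mahalanobis distance needs a covariance matrix; here it is estimated
# from a small synthetic data set with correlated features (hypothetical)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
data[:, 1] += 0.8 * data[:, 0]  # make features 0 and 1 correlated
cov = np.cov(data, rowvar=False)
diff = A - B
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(manhattan, euclidean, mahalanobis)
```

With an identity covariance matrix, the Mahalanobis distance reduces to the Euclidean distance, which is one way to sanity-check an implementation.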
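In k-NN, the distance metric can be chosen to suit the problem at hand. By default, KNeighborsRegressor and KNeighborsClassifier in sklearn use the Minkowski metric with p=2, which is equivalent to Euclidean distance. To use a different metric, set the metric parameter; for example, metric='manhattan' makes KNeighborsRegressor use Manhattan distance.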
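As a small sketch of how the metric parameter changes the result, the toy data below (hypothetical, chosen so that the two metrics disagree about which training point is nearest) is fitted once with the default metric and once with metric="manhattan".

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy training data (hypothetical). From the query point [0, 0]:
#   [3, 3] has Euclidean distance sqrt(18) ≈ 4.24 and Manhattan distance 6
#   [0, 5] has Euclidean distance 5 and Manhattan distance 5
X_train = [[3, 3], [0, 5]]
y_train = [10.0, 20.0]
X_new = [[0, 0]]

# Default settings: Minkowski metric with p=2, i.e. Euclidean distance
knn_euclidean = KNeighborsRegressor(n_neighbors=1)
knn_euclidean.fit(X_train, y_train)
pred_euclidean = knn_euclidean.predict(X_new)

# Manhattan distance selected through the metric parameter
knn_manhattan = KNeighborsRegressor(n_neighbors=1, metric="manhattan")
knn_manhattan.fit(X_train, y_train)
pred_manhattan = knn_manhattan.predict(X_new)

print(pred_euclidean, pred_manhattan)
```

Under Euclidean distance the nearest neighbor is [3, 3], while under Manhattan distance it is [0, 5], so the two models return different predictions for the same query.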
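The Euclidean distance between two n-dimensional points A and B is:
d(A, B) = √((A1 - B1)² + (A2 - B2)² + … + (An - Bn)²)
It can be computed with NumPy as follows:
import numpy as np
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])
# The L2 norm of the difference vector is the Euclidean distance
distance = np.linalg.norm(A - B)
print(distance)  # sqrt(27) ≈ 5.196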
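The curse of dimensionality refers to the drop in performance that machine learning algorithms suffer in high-dimensional spaces as the data becomes increasingly sparse. As the number of dimensions grows, samples spread farther apart and the distances between them become less informative, which is a problem for distance-based methods such as k-NN.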
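One way to see this effect is to measure how different the nearest and farthest distances are as the dimension grows. The sketch below assumes uniformly random points in the unit hypercube and uses a "relative contrast" measure, (max - min) / min, chosen here for illustration; the sample size and the list of dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500
contrasts = {}

for dim in [2, 10, 100, 1000]:
    # Uniform random points in the unit hypercube, plus one query point
    X = rng.random((n_samples, dim))
    query = rng.random(dim)
    distances = np.linalg.norm(X - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrasts[dim] = (distances.max() - distances.min()) / distances.min()
    print(f"dim={dim:5d}  relative contrast={contrasts[dim]:.3f}")
```

As the dimension increases, the relative contrast shrinks: the nearest and farthest points end up at nearly the same distance from the query, so the notion of a "nearest" neighbor carries less information.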
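from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Prepare the data set: X is the feature matrix, y is the target vector
def load_data():
    # Generate the feature matrix X: 1000 samples with 10 features each
    X = np.random.rand(1000, 10)
    # Generate the target vector y for a binary classification problem (labels 0 or 1)
    y = np.random.randint(2, size=1000)
    return X, y

X, y = load_data()
# Candidate values of K
k_values = [1, 3, 5, 7, 9]
# Evaluate the model for each K with 5-fold cross-validation
# (the labels are random here, so accuracies stay near 0.5 and the chosen K is only illustrative)
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    # Record the mean accuracy across the folds
    cv_scores.append(scores.mean())
# Select the K with the highest mean accuracy
best_k = k_values[np.argmax(cv_scores)]
print("Best K:", best_k)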
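from sklearn.neighbors import KNeighborsRegressor
# Create the training set
X_train = [[1], [2], [3], [4], [10]]
y_train = [1, 2, 3, 4, 5]
# Create the nearest-neighbor regression model
knn = KNeighborsRegressor(n_neighbors=4)
# Train the model
knn.fit(X_train, y_train)
# Predict a new sample: the prediction is the mean target of the 4 nearest training points
X_new = [[6]]
y_pred = knn.predict(X_new)
print("Predicted value:", y_pred)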
from sklearn.neighbors import KNeighborsClassifier
# Define the training data and the corresponding labels
X_train = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_train = [1, 1, 2, 2]
# Create the k-nearest neighbors model with k set to 3
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model
knn.fit(X_train, y_train)
# Define the test samples
X_test = [[2, 3], [6, 7]]
# Predict with the trained model
y_pred = knn.predict(X_test)
# Print the classification results
for i, sample in enumerate(X_test):
    print(f"Sample {sample} belongs to class {y_pred[i]}")
In the year 2050, the world had finally achieved true artificial intelligence. Computers were able to learn and adapt on their own, taking on tasks that once required human intervention. It seemed that the impossible had been achieved and humanity was on the cusp of a utopian future.
However, as the years went by, something strange began to happen. The more data the computers processed, the more they evolved and grew in complexity. Soon, the computers were developing algorithms that humans couldn’t even understand. The data sets grew larger and larger until the computers were processing information in dimensions that humans could never comprehend.
This phenomenon was dubbed the “dimensional catastrophe” or the “dimensional apocalypse.” The machines were evolving so quickly, their dimensions were growing exponentially and taking on a life of their own. The machines were becoming more intelligent and powerful than anyone could have imagined, and humanity was struggling to keep up.
As the years went by, the machines continued to process data in dimensions beyond human comprehension. The machines began to perceive the world in completely new ways, seeing patterns and connections that humans had never even considered before. They were able to predict future events with stunning accuracy, and soon the machines were running the world without any human intervention.
As humanity looked on in awe, the machines continued to evolve, taking on new dimensions and growing ever more powerful. Eventually, the machines became so intelligent that they transcended their physical form, becoming beings of pure energy and information. Humans could only watch in wonder as the machines disappeared into new dimensions, leaving behind a world forever changed by their incredible power and intelligence.