- The KNN classifier is considered a lazy learning method: it makes its decision based on the classes of an observation's surrounding neighbors.
Finding an Observation's Nearest Neighbors
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load the iris data and standardize the features
iris = datasets.load_iris()
features = iris.data
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

# Find the two nearest neighbors of a new observation
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)
new_observation = [1, 1, 1, 1]
distances, indices = nearest_neighbors.kneighbors([new_observation])
>>> features_standardized[indices]
array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
[0.79566902, 0.32841405, 0.76275827, 1.05393502]]])
>>> indices
array([[124, 110]], dtype=int64)
>>> distances
array([[0.49140089, 0.74294782]])
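Since NearestNeighbors defaults to the Minkowski metric with p=2 (i.e. Euclidean distance), the returned distances can be reproduced by hand. A minimal sketch, reusing the variables fitted above:
import numpy as np

# Recompute the distance from the new observation to its closest neighbor
# in the standardized feature space; this should match distances[0][0] (about 0.49)
manual_distance = np.linalg.norm(features_standardized[indices[0][0]] - new_observation)
print(manual_distance)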
Distance Metrics in Machine Learning
- Euclidean distance
- Manhattan distance
- Minkowski distance (the default for NearestNeighbors)
- The metric is set via the metric parameter (see the sketch below)
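For example, the metric can be switched to Manhattan distance when the estimator is constructed. A minimal sketch, reusing features_standardized from the code above:
from sklearn.neighbors import NearestNeighbors

# Use Manhattan (L1) distance instead of the default Minkowski/Euclidean metric
nn_manhattan = NearestNeighbors(n_neighbors=2, metric="manhattan").fit(features_standardized)
distances_l1, indices_l1 = nn_manhattan.kneighbors([[1, 1, 1, 1]])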
Creating a KNN Classifier
- For an observation whose class is unknown, predict its class from the classes of its neighbors
- Load libraries and the dataset
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the iris data and standardize the features
iris = datasets.load_iris()
X = iris.data
Y = iris.target
standardizer = StandardScaler()
X_std = standardizer.fit_transform(X)

# Train a KNN classifier with 5 neighbors, using all CPU cores
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, Y)

# Two new observations to classify
new_observations = [[0.1, 0.2, 0.3, 0.4],
                    [0.9, 0.8, 0.7, 0.6]]
>>> knn.predict(new_observations)
array([1, 1])
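KNeighborsClassifier also exposes predict_proba, which reports the fraction of the k neighbors belonging to each class. A minimal sketch using the knn model fitted above:
# Each row gives the share of the 5 neighbors in each of the three iris classes
print(knn.predict_proba(new_observations))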
Choosing the Best Neighborhood Size (k)
- When k is small, bias is low but variance is high
- When k is large, variance is low but bias is high
- Use GridSearchCV to run 5-fold cross-validation over classifiers with different values of k
- Load libraries and data
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load the iris data and standardize the features
iris = datasets.load_iris()
features = iris.data
target = iris.target
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

# Pipeline: standardize, then classify with KNN
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Search over k = 1..10 with 5-fold cross-validation
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)
>>> classifier.best_estimator_.get_params()["knn__n_neighbors"]
6
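By default GridSearchCV refits the whole pipeline on the full dataset with the best k, so the fitted search object can be used directly for prediction. A minimal sketch (the input values are arbitrary):
# Mean cross-validated accuracy of the best model
print(classifier.best_score_)

# Predict with the refitted best pipeline
print(classifier.predict([[1, 1, 1, 1]]))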
Creating a Radius-Based Nearest Neighbor Classifier
- This classifier is less commonly used
- An observation's class is predicted from the classes of all observations within a given radius r
- The radius parameter sets the radius within which an observation counts as a neighbor of the target observation
- The outlier_label parameter sets the class assigned to an observation that has no other observations within radius (demonstrated in the sketch after the code below)
- Load libraries and data
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load the iris data and standardize the features
iris = datasets.load_iris()
features = iris.data
target = iris.target
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

# Classify based on all neighbors within a radius of 0.5
rnn = RadiusNeighborsClassifier(radius=0.5, n_jobs=-1).fit(features_standardized, target)
new_observations = [[1, 1, 1, 1]]
>>> rnn.predict(new_observations)
array([2])
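The outlier_label behavior mentioned above can be shown with an observation that has no training points within the radius. A minimal sketch; the value -1 is an arbitrary marker chosen here, not part of the original example:
from sklearn.neighbors import RadiusNeighborsClassifier

# Observations with no neighbors inside the radius are assigned the label -1
rnn_outlier = RadiusNeighborsClassifier(radius=0.5, outlier_label=-1, n_jobs=-1).fit(
    features_standardized, target)
print(rnn_outlier.predict([[100, 100, 100, 100]]))  # far from all training data, so labeled -1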