ML: KNN Notes

Using a Jupyter notebook

%matplotlib qt
import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
  1. Read the txt data; the last column is the label
data = []
labels = []
with open('data\\datingTestSet.txt') as f:
    for line in f:
        tokens = line.strip().split('\t')
        data.append([float(tk) for tk in tokens[:-1]])
        labels.append(tokens[-1])

data[1:10]
np.unique(labels)
array(['didntLike', 'largeDoses', 'smallDoses'],
dtype='|S10')

  2. Convert the string labels to numeric labels
x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)
y[labels=='didntLike'] = 1
y[labels=='smallDoses'] = 2
y[labels=='largeDoses'] = 3
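
The same mapping can also be written with a lookup dictionary, which keeps the label-to-number pairs in one place. This is only a sketch: the name `label_to_id` and the sample `labels` array are made up for illustration, and the numbers follow the assignments used above.

```python
import numpy as np

# Hypothetical sample of string labels, as read from the last column.
labels = np.array(['largeDoses', 'smallDoses', 'didntLike', 'didntLike'])

# Same assignments as in the notes: didntLike -> 1, smallDoses -> 2, largeDoses -> 3.
label_to_id = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}

# Look up each label; an unknown label raises KeyError instead of silently staying 0.
y = np.array([label_to_id[name] for name in labels], dtype=float)
print(y)  # [3. 2. 1. 1.]
```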
  3. Before normalization
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x,y)
print(model)
expected = y
predicted = model.predict(x)
print(metrics.classification_report(expected, predicted, target_names=['didntLike', 'smallDoses', 'largeDoses']))
print(metrics.confusion_matrix(expected, predicted))

Result:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')
             precision    recall  f1-score   support

  didntLike       0.89      0.85      0.87       342
 smallDoses       0.93      0.98      0.96       331
 largeDoses       0.82      0.83      0.82       327

avg / total       0.88      0.88      0.88      1000

[[289   0  53]
 [  1 325   5]
 [ 33  24 270]]
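
One likely reason unscaled features hurt KNN is that Euclidean distance is dominated by whichever feature has the largest numeric range. The sketch below uses made-up points (not rows from datingTestSet.txt) to show the effect; the per-feature ranges `lo` and `hi` are assumptions as well.

```python
import numpy as np

# Three made-up samples with two features on very different scales.
query = np.array([40000.0, 1.0])
a = np.array([40500.0, 1.0])  # differs from query only in the large-scale feature
b = np.array([40000.0, 9.0])  # differs from query only in the small-scale feature

# Raw Euclidean distances: the large-scale feature dominates, so b looks closer.
print(np.linalg.norm(query - a), np.linalg.norm(query - b))  # 500.0 8.0

# Assumed per-feature ranges used for min-max scaling.
lo = np.array([0.0, 0.0])
hi = np.array([100000.0, 10.0])

def scale(v):
    """Min-max scale a sample to [0, 1] using the assumed ranges."""
    return (v - lo) / (hi - lo)

# After scaling, a is the closer point and b the more distant one.
d_a = np.linalg.norm(scale(query) - scale(a))
d_b = np.linalg.norm(scale(query) - scale(b))
print(d_a < d_b)  # True
```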

  4. Normalize the data to the [0, 1] range
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(x)
X_train_minmax
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ..., 
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
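
With its default settings, MinMaxScaler rescales each column as (x - min) / (max - min). A minimal NumPy sketch of that formula on a made-up matrix (the values are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features.
x_demo = np.array([[10.0, 0.5],
                   [20.0, 1.5],
                   [40.0, 1.0]])

col_min = x_demo.min(axis=0)
col_max = x_demo.max(axis=0)

# Column-wise min-max scaling to [0, 1]; assumes no column is constant.
x_scaled = (x_demo - col_min) / (col_max - col_min)
print(x_scaled)
```

Each column of `x_scaled` now has minimum 0 and maximum 1, regardless of its original units.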
  5. Split into training and test data
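# sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
# Split into training and test data (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)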
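
Note that the scaler in the normalization step is fit on all rows before any split, so the test rows influence the min and max. A common alternative is to fit the scaler on the training split only, for example with a pipeline. The sketch below is hedged: the data is randomly generated rather than read from datingTestSet.txt, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 3 features on different scales, 3 classes.
rng = np.random.RandomState(0)
x_demo = rng.rand(300, 3) * np.array([50000.0, 20.0, 2.0])
y_demo = rng.randint(1, 4, size=300)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0)

# The pipeline fits MinMaxScaler on xd_train only, then reuses it for xd_test.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(xd_train, yd_train)
accuracy = pipe.score(xd_test, yd_test)
print(accuracy)
```

Because the labels here are random, the printed accuracy is not meaningful; the point is only where the scaler gets fit.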
  6. Results after normalization
    n_neighbors = 3: the K in K-nearest neighbors is set to 3
x_train, x_test, y_train, y_test = train_test_split(X_train_minmax, y, test_size = 0.2)  
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train,y_train)
print(model)
expected = y_test
predicted = model.predict(x_test)
print(metrics.classification_report(expected, predicted, target_names=['didntLike', 'smallDoses', 'largeDoses']))
print(metrics.confusion_matrix(expected, predicted))

Result:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')
             precision    recall  f1-score   support

  didntLike       0.97      1.00      0.99        68
 smallDoses       0.93      1.00      0.96        51
 largeDoses       1.00      0.93      0.96        81

avg / total       0.97      0.97      0.97       200

[[68  0  0]
 [ 0 51  0]
 [ 2  4 75]]
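
To make the role of n_neighbors concrete, here is a minimal NumPy sketch of K-nearest-neighbor classification by majority vote. It is an illustration only, not how KNeighborsClassifier is implemented internally; `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    dists = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the neighbors (ties go to the smallest label).
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy training set: two clusters with labels 1 and 2.
toy_x = np.array([[0.1, 0.1], [0.2, 0.1], [0.1, 0.3],
                  [0.9, 0.8], [0.8, 0.9]])
toy_y = np.array([1, 1, 1, 2, 2])

print(knn_predict(toy_x, toy_y, np.array([0.15, 0.2])))   # 1
print(knn_predict(toy_x, toy_y, np.array([0.85, 0.85])))  # 2
```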

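Summary:
The results before and after normalization differ considerably: the average F1 score rises from 0.88 to 0.97. Note that the unnormalized model was evaluated on the same data it was trained on, while the normalized model was evaluated on a held-out 20% test split, so the two runs are not measured on the same data.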
