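With some free time on my hands, I decided to implement a simple KNN classifier by hand and use it to classify images, as a way to shore up my fundamentals.
First, the KNN procedure (as described here, this is the k = 1 case, i.e. nearest neighbor):
1) Compute the distance between the test sample and every training sample;
2) Sort the distances in increasing order;
3) Select the point with the smallest distance;
4) Find the position (index) of that nearest point;
5) Return the class at that position as the predicted class of the test sample.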
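The five steps above can be sketched in a few lines of NumPy. This is only an illustration on toy data: the function name nearest_neighbor_label and the sample arrays are made up for the example and are not part of the code used later in this post.

```python
import numpy as np

def nearest_neighbor_label(train_x, train_y, query):
    """Return the label of the training row closest to `query` (k = 1)."""
    # Step 1: squared Euclidean distance from the query to every training row
    dists = np.sum((train_x - query) ** 2, axis=1)
    # Step 2: indices of the training rows, ordered by increasing distance
    order = np.argsort(dists)
    # Steps 3-4: the first index is the position of the nearest point
    nearest = order[0]
    # Step 5: the label stored at that position is the prediction
    return train_y[nearest]

# Toy data (hypothetical): three 2-D points with labels 0, 1, 2
toy_x = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]])
toy_y = np.array([0, 1, 2])
print(nearest_neighbor_label(toy_x, toy_y, np.array([4.0, 6.0])))
```

Using the squared distance instead of the true distance does not change which point is nearest, which is why the square root can be skipped.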
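Dataset: the Sort_1000pics dataset. It contains 1000 images in 10 classes: people, beaches, buildings, trucks, dinosaurs, elephants, flowers, horses, mountains, and food, with 100 images per class (the dataset can be downloaded online).
Put the downloaded images under the "./photo" directory, with one subfolder per class named 0 through 9, which is the layout the code reads from. Anaconda3 is used as the development environment here.
First, let's try my own code: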
import datetime
starttime = datetime.datetime.now()

import os

import cv2
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X = []
Y = []
for i in range(0, 10):
    # Walk the class folder and read every image in it
    for f in os.listdir("./photo/%s" % i):
        # Load the image (OpenCV returns BGR; no grayscale conversion is done)
        Images = cv2.imread("./photo/%s/%s" % (i, f))
        image = cv2.resize(Images, (256, 256), interpolation=cv2.INTER_CUBIC)
        # Joint 2-D histogram over channels 0 and 1 (B and G)
        hist = cv2.calcHist([image], [0, 1], None, [256, 256], [0.0, 255.0, 0.0, 255.0])
        X.append((hist / 255).flatten())
        Y.append(i)
X = np.array(X)
Y = np.array(Y)

# Split into training and test sets: 30% is held out for testing.
# random_state=1 fixes the random seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)


class KNN:
    def __init__(self, train_data, train_label, test_data):
        self.train_data = train_data
        self.train_label = train_label
        self.test_data = test_data

    def knnclassify(self):
        num_train = self.train_data.shape[0]
        num_test = self.test_data.shape[0]
        labels = []
        for i in range(num_test):
            # Squared Euclidean distance from test sample i to every training sample
            y = []
            for j in range(num_train):
                dis = np.sum(np.square(self.train_data[j] - self.test_data[i]))
                y.append(dis)
            # Take the label of the single nearest training sample
            labels.append(self.train_label[y.index(min(y))])
        labels = np.array(labels)
        return labels


knn = KNN(X_train, y_train, X_test)
predictions_labels = knn.knnclassify()
print(confusion_matrix(y_test, predictions_labels))
print(classification_report(y_test, predictions_labels))
endtime = datetime.datetime.now()
print(endtime - starttime)
The output is:
[[28  0  0  1  0  2  0  0  0  0]
 [ 3 11  0  0  0  9  0  2  4  2]
 [ 5  2  9  1  0  5  0  0  2  2]
 [ 0  0  0 14  0  1  1  0  1 12]
 [ 0  1  0  0 31  0  0  0  0  0]
 [ 3  0  0  1  0 27  0  2  1  0]
 [ 5  0  0  0  0  0 22  0  0  3]
 [ 0  0  0  0  0  1  0 25  0  0]
 [ 7  3  0  1  0  7  0  2  4  7]
 [ 3  0  0  2  0  3  1  1  0 20]]
             precision    recall  f1-score   support
          0       0.52      0.90      0.66        31
          1       0.65      0.35      0.46        31
          2       1.00      0.35      0.51        26
          3       0.70      0.48      0.57        29
          4       1.00      0.97      0.98        32
          5       0.49      0.79      0.61        34
          6       0.92      0.73      0.81        30
          7       0.78      0.96      0.86        26
          8       0.33      0.13      0.19        31
          9       0.43      0.67      0.53        30
avg / total       0.67      0.64      0.62       300
0:00:33.881616
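The output shows the confusion matrix, the precision, recall, and F1 score for each class, and the elapsed time. The feature used here is a color histogram (built from the B and G values of the three color channels). Reading and preprocessing the images is much the same in every version, so next let's see how the library implementation of KNN performs.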
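Before moving on, the joint B/G histogram used as the feature can be pictured with a small NumPy sketch. This is an illustration only, not the code used in this post: the helper name bg_histogram and the 2x2 test image are made up, np.histogram2d stands in for cv2.calcHist, and the range is taken as [0, 256] so that each 8-bit value gets its own bin, which differs slightly from the [0, 255] ranges passed to calcHist in the post.

```python
import numpy as np

def bg_histogram(image, bins=256):
    """Joint histogram of channel 0 (B) and channel 1 (G) of a BGR uint8 image.

    hist[b, g] counts the pixels whose blue value is b and green value is g.
    """
    b = image[:, :, 0].ravel()
    g = image[:, :, 1].ravel()
    hist, _, _ = np.histogram2d(b, g, bins=bins, range=[[0, 256], [0, 256]])
    return hist

# Hypothetical 2x2 image in (B, G, R) order
tiny = np.array([[[0, 0, 9], [0, 0, 7]],
                 [[255, 10, 0], [3, 4, 5]]], dtype=np.uint8)
hist = bg_histogram(tiny)
# Same scaling and flattening as in the post: one long feature vector per image
feature = (hist / 255).flatten()
print(hist.shape, feature.shape)
```

Because the histogram only counts color pairs and ignores where the pixels sit, two images with similar overall coloring end up close together even if their content differs, which may explain some of the confusions between classes.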
import datetime
starttime = datetime.datetime.now()

import os

import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X = []
Y = []
for i in range(0, 10):
    # Walk the class folder and read every image in it
    for f in os.listdir("./photo/%s" % i):
        # Load the image (OpenCV returns BGR; no grayscale conversion is done)
        Images = cv2.imread("./photo/%s/%s" % (i, f))
        image = cv2.resize(Images, (256, 256), interpolation=cv2.INTER_CUBIC)
        # Joint 2-D histogram over channels 0 and 1 (B and G)
        hist = cv2.calcHist([image], [0, 1], None, [256, 256], [0.0, 255.0, 0.0, 255.0])
        X.append((hist / 255).flatten())
        Y.append(i)
X = np.array(X)
Y = np.array(Y)

# Split into training and test sets: 30% is held out for testing.
# random_state=1 fixes the random seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Library KNN classifier, voting among the 10 nearest neighbors
clf0 = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
predictions0 = clf0.predict(X_test)
print(confusion_matrix(y_test, predictions0))
print(classification_report(y_test, predictions0))
endtime = datetime.datetime.now()
print(endtime - starttime)
The output is:
[[28  0  0  0  0  3  0  0  0  0]
 [ 7  1  0  0  0 18  0  0  2  3]
 [10  1  3  1  0  6  1  1  1  2]
 [ 2  0  0 18  0  2  1  0  2  4]
 [ 0  0  0  1 30  0  0  0  1  0]
 [ 5  0  0  0  0 24  0  4  1  0]
 [ 5  0  0  0  0  0 21  3  0  1]
 [ 0  0  0  0  0  1  0 25  0  0]
 [10  1  0  3  0  8  0  1  1  7]
 [ 5  0  0  2  0  1  1  2  0 19]]
             precision    recall  f1-score   support
          0       0.39      0.90      0.54        31
          1       0.33      0.03      0.06        31
          2       1.00      0.12      0.21        26
          3       0.72      0.62      0.67        29
          4       1.00      0.94      0.97        32
          5       0.38      0.71      0.49        34
          6       0.88      0.70      0.78        30
          7       0.69      0.96      0.81        26
          8       0.12      0.03      0.05        31
          9       0.53      0.63      0.58        30
avg / total       0.59      0.57      0.51       300
0:00:36.135252
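As a cross-check on this report, the per-class precision and recall can be recomputed directly from the confusion matrix printed with it. The sketch below copies that matrix and uses only plain Python; the function name per_class_metrics is made up for the illustration. Rows are true classes and columns are predicted classes, which is the layout sklearn's confusion_matrix uses.

```python
def per_class_metrics(cm):
    """Return (precision, recall) lists for a square confusion matrix.

    cm[t][p] is the number of samples of true class t predicted as class p.
    """
    n = len(cm)
    precision, recall = [], []
    for c in range(n):
        predicted_as_c = sum(cm[t][c] for t in range(n))  # column sum
        truly_c = sum(cm[c])                              # row sum (the "support")
        precision.append(cm[c][c] / predicted_as_c if predicted_as_c else 0.0)
        recall.append(cm[c][c] / truly_c if truly_c else 0.0)
    return precision, recall

# Confusion matrix of the KNeighborsClassifier run, copied from the output above
cm = [
    [28, 0, 0, 0, 0, 3, 0, 0, 0, 0],
    [7, 1, 0, 0, 0, 18, 0, 0, 2, 3],
    [10, 1, 3, 1, 0, 6, 1, 1, 1, 2],
    [2, 0, 0, 18, 0, 2, 1, 0, 2, 4],
    [0, 0, 0, 1, 30, 0, 0, 0, 1, 0],
    [5, 0, 0, 0, 0, 24, 0, 4, 1, 0],
    [5, 0, 0, 0, 0, 0, 21, 3, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 25, 0, 0],
    [10, 1, 0, 3, 0, 8, 0, 1, 1, 7],
    [5, 0, 0, 2, 0, 1, 1, 2, 0, 19],
]
precision, recall = per_class_metrics(cm)
# Overall accuracy: correctly classified samples (the diagonal) over all samples
accuracy = sum(cm[c][c] for c in range(len(cm))) / sum(map(sum, cm))
for c in range(len(cm)):
    print(c, round(precision[c], 2), round(recall[c], 2))
print("accuracy", round(accuracy, 2))
```

For class 0, 72 test images were predicted as class 0 but only 28 of them truly belong to it, which gives the 0.39 precision in the report; the class attracts many false positives even though its recall is high.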
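As you can see, precision has dropped and the run time has gone up slightly. Note that this run votes among 10 neighbors (n_neighbors=10), whereas the hand-written version only uses the single nearest neighbor, so the two are not configured identically. Now let's look at a third version of KNN.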
# (the first half of the code is the same as before)
# ......
# ......

# KNN classification function
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    # step 1: compute the distance to every training sample
    # np.tile(A, reps) builds an array by repeating A reps times;
    # here it copies newInput into numSamples rows to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # element-wise difference
    squaredDiff = diff ** 2                              # square the differences
    squaredDist = np.sum(squaredDiff, axis=1)            # sum along each row
    distance = squaredDist ** 0.5                        # square root gives the distance

    # step 2: sort the distances
    # argsort() returns the indices that would sort the array
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # maps each label to its number of votes
    for i in range(k):
        # step 3: pick the k nearest neighbors
        voteLabel = labels[sortedDistIndices[i]]
        # step 4: count how often each class occurs among them;
        # get() returns 0 when voteLabel is not yet in the dictionary
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    # step 5: return the label with the most votes
    # (on a tie, the label that entered the dictionary first wins)
    return max(classCount, key=classCount.get)


predictions = []
for i in range(X_test.shape[0]):
    predicted_label = kNNClassify(X_test[i], X_train, y_train, 10)
    predictions.append(predicted_label)

# Print the confusion matrix of the predictions
print(confusion_matrix(y_test, predictions))
# Print precision, recall, and F1
print(classification_report(y_test, predictions))
endtime = datetime.datetime.now()
print(endtime - starttime)
The output is:
[[28  0  0  0  0  2  0  0  1  0]
 [ 6  1  0  0  0 18  0  0  3  3]
 [10  1  3  1  0  5  1  1  2  2]
 [ 2  0  0 16  0  2  1  0  2  6]
 [ 0  0  0  1 30  0  0  0  1  0]
 [ 5  0  0  0  0 24  0  4  1  0]
 [ 5  0  0  0  0  0 21  2  0  2]
 [ 0  0  0  0  0  1  0 25  0  0]
 [10  0  0  2  0  7  0  1  2  9]
 [ 5  0  0  1  0  1  1  1  0 21]]
             precision    recall  f1-score   support
          0       0.39      0.90      0.55        31
          1       0.50      0.03      0.06        31
          2       1.00      0.12      0.21        26
          3       0.76      0.55      0.64        29
          4       1.00      0.94      0.97        32
          5       0.40      0.71      0.51        34
          6       0.88      0.70      0.78        30
          7       0.74      0.96      0.83        26
          8       0.17      0.06      0.09        31
          9       0.49      0.70      0.58        30
avg / total       0.62      0.57      0.52       300
0:01:01.444997
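The third version handles one test sample at a time and rebuilds a tiled copy of it for every call. One possible alternative, sketched below, computes all test-to-train squared distances in a single matrix expression using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. This is an untimed illustration on toy data, not a measured improvement: the function name knn_predict and the sample arrays are made up, labels are assumed to be non-negative integers (as with the 0 to 9 classes here), k must not exceed the number of training rows, and ties are broken in favor of the smallest label, which can differ from the dictionary-based vote above.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=10):
    """Majority vote among the k nearest training rows, for every test row at once."""
    # Squared Euclidean distances, shape (num_test, num_train)
    d2 = (np.sum(test_x ** 2, axis=1)[:, None]
          + np.sum(train_x ** 2, axis=1)[None, :]
          - 2.0 * test_x @ train_x.T)
    # Indices of the k smallest distances in each row (not sorted among themselves)
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # Labels of those neighbors, shape (num_test, k)
    votes = train_y[nearest]
    # Most frequent label per row; bincount needs non-negative integer labels
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data (hypothetical): two well-separated clusters labeled 0 and 1
toy_train_x = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
toy_train_y = np.array([0, 0, 0, 1, 1, 1])
toy_test_x = np.array([[0.5, 0.5], [10.5, 10.5]])
print(knn_predict(toy_train_x, toy_train_y, toy_test_x, k=3))
```

The square root is dropped because it does not change the ordering of the distances, so the same neighbors are selected.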