ML — KNN (k-Nearest Neighbors, using the iris dataset as an example)

KNN notes

Rough steps of the algorithm (using the prediction of a sample y's class as an example); a compact sketch follows the list:

1) Compute the distance between sample y and every training sample (the samples whose labels are known);

2) Find the k training samples closest to y;

3) Take the class that appears most often among these k samples as the predicted class of y.
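
Before the full script, here is a minimal sketch of these three steps for a single query sample, using numpy and collections.Counter. The names knn_predict, X_train, y_train and query are placeholders of my own, not from the script below (X_train is an (n, d) float array, y_train the matching label array):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    # 1) distance from the query point to every training sample
    dists = np.linalg.norm(X_train - query, axis=1)
    # 2) indices of the k nearest training samples
    nearest = np.argsort(dists)[:k]
    # 3) majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]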

 

The iris dataset consists of iris flower samples; the first four columns are the features and the last column is the class label, as shown in the figure below.

[Figure 1: preview of the iris data (four feature columns and a label column)]

Data link: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
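
For a quick look at the raw file, something like the following should work once it has been downloaded from the link above (the local filename iris.data and the column names are my own assumptions):

import pandas as pd

# the raw file has no header row: four numeric feature columns plus the class label
iris = pd.read_csv('iris.data', header=None,
                   names=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'label'])
print(iris.head())                    # first few rows
print(iris['label'].value_counts())   # 50 samples per class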

Somewhat lazily, I just split the data into a training set and a test set by hand; a programmatic split is sketched below for anyone who prefers that.
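
A shuffled split along the same lines might look like this; it is only a sketch, and the 80/20 ratio is my own choice. The output filenames match the paths used in the script below:

import pandas as pd

iris = pd.read_csv('iris.data', header=None)
iris = iris.sample(frac=1, random_state=0)   # shuffle the rows
split = int(len(iris) * 0.8)                 # 80% train / 20% test
iris[:split].to_csv('iris_train.txt', header=False, index=False)
iris[split:].to_csv('iris_test.txt', header=False, index=False)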

The code is as follows:

# -*- coding: utf-8 -*-

import operator
import pandas as pd
import numpy as np

# Load the training and test data and convert them to numpy arrays
def get_data(train_path, test_path):
    train_data = pd.read_table(train_path, sep=',', header=None)
    test_data = pd.read_table(test_path, sep=',', header=None)
    train_data = np.array(train_data)
    test_data = np.array(test_data)

    return train_data, test_data

# Compute the Euclidean distance between the feature parts (first four columns) of two samples
def get_distance(x, y):
    x_feat = x[:4].astype(float)
    y_feat = y[:4].astype(float)
    distance = np.sqrt(np.sum(np.square(x_feat - y_feat)))

    return distance


# Find the k training samples closest to y
def get_neighbors(train_data, y, k):
    distance_xy = []  # (training sample, distance to y) pairs
    for x in train_data:
        distance_xy.append((x, get_distance(x, y)))

    distance_xy.sort(key=operator.itemgetter(1))   # sort by distance, ascending
    neighbors_k = []   # the k training samples with the smallest distances
    for i in range(k):
        neighbors_k.append(list(distance_xy[i][0]))

    return neighbors_k

# Vote among the k nearest samples: the label with the most votes wins
def vote_for_y(neighbors_k):
    label_numbel = {}  # label -> number of votes

    for x in neighbors_k:
        if x[-1] in label_numbel:
            label_numbel[x[-1]] += 1
        else:
            label_numbel[x[-1]] = 1
    # Sort by vote count and return the label with the most votes
    label_numbel = sorted(label_numbel.items(), key=operator.itemgetter(1), reverse=True)
    predict_label = label_numbel[0][0]

    return predict_label


# Predict a label for every sample in the test set
def predict(train_data, test_data, k):
    predictions = []  # predicted labels, one per test sample
    for y in test_data:
        neighbors_k = get_neighbors(train_data, y, k)  # the k nearest training samples to y
        predict_y = vote_for_y(neighbors_k)  # vote to decide y's label
        predictions.append([predict_y])

    return np.array(predictions)


# Compare the predicted labels with the true labels and return the accuracy
def get_accuracy(test_data, pre_label):
    # Collect the ground-truth labels from the test set
    labels = []
    for y in test_data:
        labels.append([y[-1]])

    labels = np.array(labels)
    # Fraction of correct predictions
    accuracy = np.mean(np.asarray(pre_label == labels, dtype=int))

    return accuracy


if __name__ == "__main__":
    train_path = r'iris_train.txt'
    test_path = r'iris_test.txt'
    train_data, test_data = get_data(train_path, test_path)  # load the data

    # To find a good k, try every k with k_min <= k <= k_max
    k_max = 13  # upper bound for k
    k_min = 3   # lower bound for k
    accuracy_dict = {}  # k -> accuracy

    for k in range(k_min, k_max + 1):
        pre_label = predict(train_data, test_data, k)  # predicted labels for the test set
        accuracy = get_accuracy(test_data, pre_label)  # accuracy for this k
        accuracy_dict[k] = accuracy

    # Sort by accuracy and report the k with the highest accuracy
    accuracy_dict = sorted(accuracy_dict.items(), key=operator.itemgetter(1), reverse=True)
    print('k =', accuracy_dict[0][0], ' --- accuracy:', accuracy_dict[0][1])
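
As a sanity check (not part of the script above), the same search over k can be reproduced with scikit-learn's KNeighborsClassifier, assuming scikit-learn is installed and train_data/test_data were loaded with get_data as above:

from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = train_data[:, :4].astype(float), train_data[:, -1]
X_test, y_test = test_data[:, :4].astype(float), test_data[:, -1]

for k in range(3, 14):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))   # accuracy on the held-out test set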

 

One last note: the Songji hotpot in Shunde really is delicious.
