Watermelon Book (Machine Learning) Exercise Solutions: chapter11_11.1 Relief Feature Selection Algorithm

Exercise: implement the Relief algorithm in code and examine its results on watermelon dataset 3.0.
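For reference, the quantity the solution accumulates for each attribute can be written as below. This is a restatement in textbook-style notation rather than part of the original post: $x_{i,\mathrm{nh}}$ and $x_{i,\mathrm{nm}}$ denote the near-hit and near-miss of sample $x_i$, and the superscript $j$ picks out the $j$-th attribute. The textbook assumes continuous attributes are normalized to $[0, 1]$.

```latex
\delta^{j} = \sum_{i} \Bigl( -\,\mathrm{diff}\bigl(x_i^{j},\, x_{i,\mathrm{nh}}^{j}\bigr)^{2}
                             + \mathrm{diff}\bigl(x_i^{j},\, x_{i,\mathrm{nm}}^{j}\bigr)^{2} \Bigr),
\qquad
\mathrm{diff}(a, b) =
\begin{cases}
  0 \text{ if } a = b, \text{ else } 1 & \text{(discrete attribute)} \\
  |a - b|                              & \text{(continuous attribute)}
\end{cases}
```

A larger $\delta^{j}$ means attribute $j$ tends to agree with same-class neighbors and disagree with other-class neighbors, so it is ranked higher.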

# -*- coding: gbk -*-
""" Author: Victoria Date: 201.11.14 16:00 """
import numpy as np
import operator
import xlrd

def diff(x, y, discrete=True):
    """Per-feature difference used by Relief.

    Input:
        x: value
        y: value
        discrete: if True, 0/1 mismatch; otherwise absolute difference
    """
    if discrete:
        return 0 if x == y else 1
    return abs(x - y)

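def nearest(X, y, x, x_label):
    """Find the near-hit and near-miss of x (binary classification).

    Input:
        X: list, instances
        y: list, labels
        x: instance
        x_label: label for instance x
    Return:
        near_hit: list, instance (None if no same-class neighbor is found)
        near_miss: list, instance (None if no other-class sample exists)
    """
    near_hit, near_miss = None, None
    near_hit_dist = float("inf")
    near_miss_dist = float("inf")
    for i in range(len(X)):
        # Squared Euclidean distance on the raw values, so discrete features
        # contribute through whatever numeric codes the sheet stores.
        dist = np.sum((np.array(x) - np.array(X[i]))**2)
        if y[i] == x_label:
            # dist != 0 skips x itself (and any exact duplicate of x).
            if dist < near_hit_dist and dist != 0:
                near_hit = X[i]
                near_hit_dist = dist
        else:
            if dist < near_miss_dist:
                near_miss = X[i]
                near_miss_dist = dist
    return near_hit, near_miss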

def relief(X, y, k, discrete):
    """Rank features by their Relief statistic.

    Input:
        X: list
        y: list
        k: selecting features number
        discrete: list, whether the corresponding feature is discrete or not
    Return:
        indices of the k features with the largest statistic
    """
    N = len(X)
    d = len(X[0])
    # The neighbors do not depend on the feature, so find them once per sample.
    neighbors = [nearest(X, y, X[j], y[j]) for j in range(N)]
    delta = []
    for i in range(d):
        delta_i = 0
        for j in range(N):
            near_hit, near_miss = neighbors[j]
            delta_i += -diff(X[j][i], near_hit[i], discrete[i])**2 + diff(X[j][i], near_miss[i], discrete[i])**2
        delta.append(delta_i)
    print(delta)
    features_and_delta = zip(range(d), delta)
    features_and_delta = sorted(features_and_delta, key=operator.itemgetter(1), reverse=True)
    print(features_and_delta)
    select_features_index = [features_and_delta[i][0] for i in range(k)]
    return select_features_index

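if __name__=="__main__":
    # Note: xlrd 2.0 and later read only .xls files, so opening this .xlsx needs xlrd < 2.0.
    workbook = xlrd.open_workbook("../../数据/3.0.xlsx")
    sheet = workbook.sheet_by_name("Sheet1")
    # Each of the 17 columns is one sample: every row but the last is a feature,
    # and the last row (row 8) holds the label.
    X = []
    for i in range(17):
        X.append(sheet.col_values(i)[:-1])
    y = sheet.row_values(8)

    # The first six features are discrete; the last two (density, sugar content) are continuous.
    relief(X, y, 1, [1, 1, 1, 1, 1, 1, 0, 0])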
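As a cross-check that does not need the spreadsheet, here is a minimal NumPy sketch of the same idea in Python 3. It is not the code above: the name `relief_scores` and the toy data are made up for illustration, the neighbor search uses the per-feature diff (0/1 for discrete, absolute difference for continuous) instead of squared Euclidean distance on raw codes, and it excludes only the sample itself rather than every zero-distance neighbor. It also assumes every class has at least two samples.

```python
import numpy as np

def relief_scores(X, y, discrete):
    """Return one Relief statistic per feature (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    discrete = np.asarray(discrete, dtype=bool)
    n, d = X.shape
    scores = np.zeros(d)
    for j in range(n):
        # Per-feature diff between sample j and every sample:
        # 0/1 mismatch for discrete features, |a - b| for continuous ones.
        gap = np.abs(X - X[j])
        gap[:, discrete] = (gap[:, discrete] != 0).astype(float)
        dist = (gap ** 2).sum(axis=1)
        dist[j] = np.inf  # a sample is never its own near-hit
        same = (y == y[j])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        scores += gap[miss] ** 2 - gap[hit] ** 2
    return scores

# Toy data (made up): feature 0 is discrete, feature 1 is continuous in [0, 1].
X_toy = [[0, 0.1], [0, 0.2], [1, 0.8], [1, 0.9]]
y_toy = [0, 0, 1, 1]
scores = relief_scores(X_toy, y_toy, [True, False])
print(scores)
```

On the toy data both features separate the classes, and the discrete one gets the larger statistic because its diff to every near-miss is the full 1. Because the neighbor search differs from `nearest`, this sketch is not guaranteed to reproduce the numbers the original script prints on watermelon dataset 3.0.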

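Running relief on watermelon dataset 3.0 gives the following statistics for the features [color, root, knock sound, texture, navel, touch, density, sugar content]: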

[-4, 6, -1, 8, 2, -3, -0.40418499999999996, 0.26350200000000007]

Relief therefore ranks texture as the best feature (statistic 8), followed by root (statistic 6).
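To make the ranking explicit, the reported statistics can be paired with their feature names and sorted. The snippet below is a small illustration in Python 3; the values are copied from the output above, and the English feature names are translations of the original labels.

```python
names = ["color", "root", "knock sound", "texture",
         "navel", "touch", "density", "sugar content"]
delta = [-4, 6, -1, 8, 2, -3, -0.40418499999999996, 0.26350200000000007]

# Sort (name, statistic) pairs from the most to the least relevant feature.
ranking = sorted(zip(names, delta), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name:14s} {value:+.4f}")
```

The features with negative statistics (color, touch, knock sound, density) are the ones whose values, on balance, differ more from the near-hit than from the near-miss.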
