Machine Learning (Zhou Zhihua), Exercise 11.1: Reference Answer

Write a program implementing the Relief algorithm and examine its results on the watermelon dataset 3.0.

# coding: utf-8
import numpy as np

input_path = "西瓜数据集3.csv"
with open(input_path, encoding='utf-8') as f:
	filedata = [line.strip('\n').split(',') for line in f]
filedata = [[float(i) if '.' in i else i for i in row] for row in filedata]  # parse decimal strings to float
filedata = filedata[1:]  # drop the header row
X = [row[1:-1] for row in filedata]  # attributes (drop the ID column and the label)
Y = [row[-1] for row in filedata]  # class labels
weight = np.zeros(len(X[0]))  # one Relief weight per attribute

# Min-max normalise the two continuous attributes (last two columns)
for row in X:
	row[-2] = (row[-2] - 0.243) / (0.774 - 0.243)  # density: min 0.243, max 0.774
	row[-1] = (row[-1] - 0.042) / (0.46 - 0.042)   # sugar content: min 0.042, max 0.46
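The min/max constants above are hardcoded from watermelon 3.0. A more general sketch (a hypothetical helper, not part of the original) computes them from the data, assuming the continuous attributes sit in the last columns:

```python
def minmax_normalise(X, n_continuous=2):
	# Min-max scale the last n_continuous columns of X in place
	for j in range(1, n_continuous + 1):
		col = [row[-j] for row in X]
		lo, hi = min(col), max(col)
		for row in X:
			row[-j] = (row[-j] - lo) / (hi - lo)
	return X
```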

def cal_dis(a, b):
	# Distance between two samples: |difference| for continuous attributes,
	# 0/1 mismatch for discrete ones
	ret = 0
	for ai, bi in zip(a, b):
		if isinstance(ai, float):
			ret += np.abs(ai - bi)
		else:
			ret += 0 if ai == bi else 1
	return ret
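A quick sanity check of the mixed-attribute distance (the function is reproduced so the snippet runs on its own):

```python
import numpy as np

def cal_dis(a, b):
	# |difference| for continuous attributes, 0/1 mismatch for discrete ones
	ret = 0
	for ai, bi in zip(a, b):
		if isinstance(ai, float):
			ret += np.abs(ai - bi)
		else:
			ret += 0 if ai == bi else 1
	return ret

# match (0) + mismatch (1) + |0.5 - 0.7| = 1.2, up to floating-point rounding
d = cal_dis(['青绿', '蜷缩', 0.5], ['青绿', '稍蜷', 0.7])
print(d)
```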


def find_near(sample_id):
	# Find the nearest same-class sample (near-hit) and the nearest
	# different-class sample (near-miss) of the given sample
	near_hit_id = -1
	near_miss_id = -1
	near_hit_dis = np.inf
	near_miss_dis = np.inf
	sample_feat = X[sample_id]
	sample_label = Y[sample_id]
	for i in range(len(Y)):
		if i == sample_id:
			continue
		pair_feat = X[i]
		pair_label = Y[i]
		dis = cal_dis(sample_feat, pair_feat)
		if pair_label == sample_label and dis < near_hit_dis:	
			near_hit_dis = dis
			near_hit_id = i 
		if pair_label != sample_label and dis < near_miss_dis:
			near_miss_dis = dis
			near_miss_id = i 
	return near_hit_id, near_miss_id


for i in range(len(Y)):
	near_hit_id, near_miss_id = find_near(i)
	print(i, near_hit_id, near_miss_id)
	for j in range(len(X[0])):
		# Relief update: reward attributes that separate the near-miss,
		# penalise attributes that separate the near-hit
		weight[j] += cal_dis([X[i][j]], [X[near_miss_id][j]]) - cal_dis([X[i][j]], [X[near_hit_id][j]])

print(weight)
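For reference, the per-attribute update in the loop mirrors the Relief relevance statistic; as I recall, the book writes it with a squared diff for continuous attributes, whereas the code above uses the absolute difference:

$$\delta^j = \sum_i \Big( -\operatorname{diff}\big(x_i^j,\, x_{i,\mathrm{nh}}^j\big)^2 + \operatorname{diff}\big(x_i^j,\, x_{i,\mathrm{nm}}^j\big)^2 \Big)$$

where $x_{i,\mathrm{nh}}$ and $x_{i,\mathrm{nm}}$ are the near-hit and near-miss of sample $x_i$, and $\operatorname{diff}$ is the per-attribute difference ($|a-b|$ for normalised continuous attributes, 0/1 for discrete ones).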



Output:

色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率
[-7.          4.         -1.          9.          7.         -4.         -0.3747646   3.57416268]


Sorting by the feature importance that Relief produces gives the ranking [纹理 (texture), 脐部 (navel), 根蒂 (root), 含糖率 (sugar content), ...].

This still makes reasonable sense.
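The ranking can be reproduced from the weights printed above with a small sketch:

```python
import numpy as np

# Attribute names (header above) and Relief weights from the run above
names = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '密度', '含糖率']
weight = np.array([-7., 4., -1., 9., 7., -4., -0.3747646, 3.57416268])

# Sort attribute indices by decreasing Relief weight
order = np.argsort(weight)[::-1]
ranking = [names[i] for i in order]
print(ranking)
# → ['纹理', '脐部', '根蒂', '含糖率', '密度', '敲声', '触感', '色泽']
```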

Note, however, that the algorithm carries an implicit assumption: samples of the same class have similar attribute values, while samples of different classes have dissimilar ones. That assumption frequently does not hold for every attribute!

Moreover, it relies on nearest-neighbour computation. First, why nearest neighbours at all? Is that well founded? Personally, I feel class centroids would serve better.
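The centroid idea could be sketched as follows. This is a hypothetical variant, not the book's algorithm: each class gets a "centre" (mean of continuous attributes, mode of discrete ones), which a centroid-based weight update would compare against instead of the near-hit/near-miss:

```python
def class_centre(rows):
	# "Centre" of a set of samples: mean for continuous attributes,
	# most frequent value for discrete ones
	centre = []
	for j in range(len(rows[0])):
		col = [r[j] for r in rows]
		if isinstance(col[0], float):
			centre.append(sum(col) / len(col))
		else:
			centre.append(max(set(col), key=col.count))
	return centre

print(class_centre([['青绿', 0.6], ['青绿', 0.4], ['乌黑', 0.5]]))
```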

Second, the algorithm itself is meant for dimensionality reduction, i.e. for coping with the curse of dimensionality, yet computing distances under the curse of dimensionality is precisely what is problematic. Doesn't that put the cart before the horse?

Anyway, as a programming exercise to deepen one's understanding of the algorithm it is fine, but in practice I would not be inclined to use it.
