A Multi-Label Classification Algorithm Based on ML-DecisionTree

     An earlier post introduced the ML-KNN multi-label classification algorithm; this one covers another algorithm, ML-DT. The idea is simple: it borrows the decision-tree strategy of selecting features by information gain to build a classifier. In the multi-label setting, the information gain measures a feature's discriminative power over all labels jointly.

     The algorithm works roughly as follows. Compute the information gain IG of every feature, pick the feature with the largest IG to split the samples into left and right subsets, and recurse until a stopping condition is met (for example, the subset at a node contains no more than 100 samples). For an unknown sample, follow one path from the root down to a leaf, compute for each label the probability of being 0 or 1 among the leaf's sample subset, and assign the sample every label whose probability exceeds 0.5.

      Here is the pseudocode first; the details of the algorithm are explained against it below:

      [Figure 1: ML-DT pseudocode]

1. Line 5 of the pseudocode computes the information gain of feature l at each candidate split point, and the split with the largest gain is chosen to branch the sample set.



where MLEnt denotes the multi-label entropy, and T⁻ and T⁺ denote the left and right subsets produced by splitting on feature l at that split point.
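Written out, the quantities above are (a reconstruction, since the original formula images are missing, kept consistent with the calc_entropy implementation below; T is the sample set, q the number of labels, and p_j the fraction of samples in T whose label j equals 1):

```latex
\mathrm{MLEnt}(T) = \sum_{j=1}^{q} \Big[ -p_j \log_2 p_j \;-\; (1-p_j)\log_2 (1-p_j) \Big]

\mathrm{IG}(T, l, v) = \mathrm{MLEnt}(T) \;-\; \frac{|T^-|}{|T|}\,\mathrm{MLEnt}(T^-) \;-\; \frac{|T^+|}{|T|}\,\mathrm{MLEnt}(T^+)
```

Here T⁻ and T⁺ are the samples whose value of feature l is ≤ v and > v, respectively.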

2. Lines 6 and 7 recurse on the resulting left and right subsets, again picking the feature with the largest information gain at each node, until the stopping condition on line 2 is met.

3. For an unknown sample, traverse one path from the root down to a leaf, compute for each label the probability of being 0 or 1 among the leaf's sample subset, and assign the sample every label whose probability exceeds 0.5.
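The leaf-level decision rule in step 3 can be sketched on a toy leaf (the label matrix here is made up for illustration):

```python
import numpy as np

# Hypothetical leaf: rows are the training samples that reached it,
# columns are the binary labels (values made up for illustration).
leaf_labels = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Probability that each label is 1 = column mean; predict 1 when it is
# at least 0.5 (matching the tie-breaking of the implementation below).
proba_one = leaf_labels.mean(axis=0)
pred = (proba_one >= 0.5).astype(int)
print(pred)  # [1 0 1]
```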



A Python implementation follows:

import numpy as np
import pandas as pd

def mldt(train, id, label_columns, path='', paths=[], min_records=30):
	# stopping condition met: end the recursion
	if train.shape[0] <= min_records:
		paths.append(path)
		return
	# information gain of every candidate (feature, split point) pair
	entropy_all = calc_entropy(train, label_columns)
	IG = {}
	features = [col for col in train.columns if col not in label_columns+[id]]
	for feature in features:
		# drop the largest value: splitting there leaves an empty right subset
		for split_val in sorted(set(train[feature].values))[:-1]:
			left = train[train[feature]<=split_val]
			right = train[train[feature]>split_val]
			# ignore splits that do not separate any label
			changed_labels = 0
			for label in label_columns:
				if set(left[label].values)!=set(right[label].values):
					changed_labels += 1
			if changed_labels>0:
				entropy_split_val = left.shape[0]*1.0/train.shape[0]*calc_entropy(left, label_columns)
				entropy_split_val += right.shape[0]*1.0/train.shape[0]*calc_entropy(right, label_columns)
				IG[feature+'_'+str(split_val)] = entropy_all-entropy_split_val
				
	# no valid split found (stopping condition): end the recursion
	if len(IG)==0:
		paths.append(path)
		return
	
	# pick the (feature, split point) pair with the largest information gain
	max_feature_val = max(IG.items(), key=lambda item: item[1])[0]
	# rsplit from the right so feature names containing '_' still parse
	feature, split_str = max_feature_val.rsplit('_', 1)
	split_val = float(split_str)
	#print(feature, split_val)
	
	# split into left/right subsets on that feature and recurse on each
	left = train[train[feature]<=split_val]
	right = train[train[feature]>split_val]
	mldt(left, id, label_columns, path+'/'+max_feature_val+'_left', paths, min_records)
	mldt(right, id, label_columns, path+'/'+max_feature_val+'_right', paths, min_records)
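Each leaf is identified by a path string such as '/f1_0.5_left/f2_3.0_right' (feature names here are hypothetical): every segment encodes one split as feature_splitval_direction. A quick sketch of how such a string decodes back into split conditions, which is what predict below relies on:

```python
# A hypothetical path string in the format mldt builds
path = '/f1_0.5_left/f2_3.0_right'

conditions = []
for branch in path[1:].split('/'):  # drop the leading '/' first
    # rsplit from the right so feature names containing '_' still parse
    feature, split_val, direction = branch.rsplit('_', 2)
    conditions.append((feature, float(split_val), direction))

print(conditions)  # [('f1', 0.5, 'left'), ('f2', 3.0, 'right')]
```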

#compute the multi-label entropy of a sample set (sum of per-label binary entropies)
def calc_entropy(df, label_columns):
	entropy = 0
	for label in label_columns:
		# fraction of samples whose value for this label is 1
		proba = df[df[label]==1].shape[0]*1.0/df.shape[0]
		# handle the degenerate cases separately to avoid log2(0)
		if proba==0:
			entropy += -(1-proba)*np.log2(1-proba)
		elif proba==1:
			entropy += -proba*np.log2(proba)
		else:
			entropy += -(proba*np.log2(proba)+(1-proba)*np.log2(1-proba))
	return entropy
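As a sanity check on calc_entropy, the multi-label entropy of a toy sample set can be computed by hand (the column names here are made up):

```python
import numpy as np
import pandas as pd

# Tiny hand-made sample set with two binary labels (names are illustrative)
df = pd.DataFrame({'label1': [1, 1, 0, 0], 'label2': [1, 1, 1, 1]})

# Multi-label entropy = sum of per-label binary entropies
entropy = 0.0
for label in ['label1', 'label2']:
    p = df[label].mean()
    if 0 < p < 1:  # constant labels contribute zero entropy
        entropy += -(p*np.log2(p) + (1-p)*np.log2(1-p))

# label1 is a fair coin (1 bit), label2 is constant (0 bits)
print(entropy)  # 1.0
```

This agrees with calling calc_entropy(df, ['label1', 'label2']) on the same frame.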

#predict the test set
def predict(train, test, id, label_columns):
	# the paths produced by mldt are assumed to have been saved to paths.txt, one per line
	df_path = pd.read_table('paths.txt', header=None)
	paths = df_path[0].values
	new_paths = []
	leaf_samples = {}
	for path in paths:
		path = path[1:]  # drop the leading '/'
		branches = path.split('/')
		new_paths.append(branches)
		
		# filter the training set down to this leaf's sample subset
		tmp_train = train
		for branch in branches:
			# each branch is encoded as feature_splitval_direction
			feature, split_str, direction = branch.rsplit('_', 2)
			split_val = float(split_str)
			if direction=='left':
				tmp_train = tmp_train[tmp_train[feature]<=split_val]
			else:
				tmp_train = tmp_train[tmp_train[feature]>split_val]
		leaf_samples[path] = tmp_train
	
	test_ids = test[id].values
	test_labels = test[label_columns]
	tmp_test = test.drop(label_columns+[id], axis=1)
	
	correct_rec = 0
	for i in range(tmp_test.shape[0]):
		record = tmp_test.iloc[i]
		# scan the saved tree paths for the one this record matches
		match_path = ''
		for path in new_paths:
			matched = True
			for branch in path:
				feature, split_str, direction = branch.rsplit('_', 2)
				split_val = float(split_str)
				# the record must fall on the same side of every split
				if direction=='left' and record[feature]>split_val:
					matched = False
					break
				if direction=='right' and record[feature]<=split_val:
					matched = False
					break
			if matched:
				match_path = path
				break
		
		# compare each label's probability of 0 vs 1 in the matched leaf's subset to produce the final labels
		leaves = leaf_samples['/'.join(match_path)]
		correct_col = 0
		for label in label_columns:
			if leaves.shape[0]>0:
				proba_zero = leaves[leaves[label]==0].shape[0]*1.0/leaves.shape[0]
				proba_one = leaves[leaves[label]==1].shape[0]*1.0/leaves.shape[0]
				if proba_zero>proba_one:
					pred_label = 0
				else:
					pred_label = 1
			else:
				pred_label = -1
			if pred_label==test_labels[label].values[i]:
				correct_col += 1
		if correct_col==len(label_columns):
			correct_rec += 1
			
	print('Exact-match accuracy on the test set (all labels correct):', correct_rec*1.0/test.shape[0])
