1. Preface
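Now that the pandemic has eased, I am back working at the office. Work keeps me busy and my main focus has been studying mathematics, so it has been a long time since my last update. I recently implemented the decision tree from the "Watermelon Book" (《西瓜书》) and am posting it here to share. Watermelon dataset 2.0 is as follows: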
['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', '好瓜'],
['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],
['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '好瓜'],
['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', '好瓜'],
['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', '好瓜'],
['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜'],
['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', '坏瓜'],
['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', '坏瓜'],
['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', '坏瓜'],
['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', '坏瓜'],
['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', '坏瓜'],
['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', '坏瓜'],
['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', '坏瓜'],
['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜']
2. Reading and Storing the Sample Data
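To make the data easy to work with, each sample is stored as a dictionary. The keys are the sample's features, such as 纹理 (texture) and 敲声 (knocking sound), and each value is that feature's value for the sample, such as 清晰 (clear) or 沉闷 (dull).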
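To make the storage layout concrete, here is a minimal sketch of one sample held as a dictionary. The key names are the ones used in the code of this post; the constant `FEATURE_NAMES` is a hypothetical helper introduced only for this illustration.

```python
# Feature names in column order; the last column is the class label.
# FEATURE_NAMES is a hypothetical helper, not part of the original code.
FEATURE_NAMES = ["色泽", "根蒂", "敲声", "纹理", "脐眼", "触感", "标签"]

# First row of watermelon dataset 2.0.
row = ["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "好瓜"]

# Pair each feature name with its value, then add the 1-based sample number.
sample = dict(zip(FEATURE_NAMES, row))
sample["编号"] = 1

print(sample["纹理"])  # 清晰
print(sample["标签"])  # 好瓜
```

With this layout a feature is looked up by name rather than by column position, which keeps later code readable.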
def read_data(filename):
    """
    Function: read the watermelon dataset
    Input:  filename: name of the dataset file
    Output: dataset: list of watermelon samples; each element is a dict
            holding the attributes of one watermelon
    """
    text_list = []
    # encoding="utf-8" is an assumption about how the data file was saved
    with open(filename, "r", encoding="utf-8") as f:
        # Iterating over the file stops automatically after the last line
        for line in f:
            # Remove the trailing newline of each line
            line = line.strip("\n")
            # Remove leading and trailing spaces
            line = line.strip(" ")
            # Remove the leading "[" and the trailing "," and "]"
            line = line.strip("[")
            line = line.strip(",")
            line = line.strip("]")
            if line != "":
                text_list.append(line)
    # Build the data list: each watermelon is one dict, and the dicts form a list
    dataset = []
    for i, text_line in enumerate(text_list):
        # Split the line string into a list of fields
        split_data_text = text_line.split(",")
        # Create one dict per watermelon and store its data in it
        dic_example = {}
        dic_example["编号"] = i + 1
        # Remove the quotes and spaces around each feature value
        dic_example["色泽"] = split_data_text[0].replace("'", "").strip()
        dic_example["根蒂"] = split_data_text[1].replace("'", "").strip()
        dic_example["敲声"] = split_data_text[2].replace("'", "").strip()
        dic_example["纹理"] = split_data_text[3].replace("'", "").strip()
        dic_example["脐眼"] = split_data_text[4].replace("'", "").strip()
        dic_example["触感"] = split_data_text[5].replace("'", "").strip()
        dic_example["标签"] = split_data_text[6].replace("'", "").strip()
        # Append the watermelon dict to the list
        dataset.append(dic_example)
    return dataset
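As an alternative to stripping brackets and quotes by hand, each row can also be parsed as a Python list literal. The following is only a sketch under stated assumptions: every non-blank line is one bracketed row with straight quotes, optionally followed by a comma. The names `parse_rows`, `FEATURE_NAMES`, and `sample_lines` are hypothetical and not part of the original code.

```python
import ast

# Column names in file order; they match the keys used by read_data.
# FEATURE_NAMES and parse_rows are hypothetical names for this sketch.
FEATURE_NAMES = ["色泽", "根蒂", "敲声", "纹理", "脐眼", "触感", "标签"]


def parse_rows(lines):
    """Parse bracketed rows such as "['青绿', ..., '好瓜']," into a list of dicts."""
    dataset = []
    for line in lines:
        # Drop surrounding whitespace and the trailing comma that separates rows.
        text = line.strip().rstrip(",")
        if not text:
            continue  # skip blank lines
        # literal_eval evaluates the list literal without running arbitrary code.
        values = ast.literal_eval(text)
        if len(values) != len(FEATURE_NAMES):
            raise ValueError(
                f"expected {len(FEATURE_NAMES)} fields, got {len(values)}: {text}"
            )
        # Number the samples from 1, counting only non-blank rows.
        example = {"编号": len(dataset) + 1}
        example.update(zip(FEATURE_NAMES, values))
        dataset.append(example)
    return dataset


# Two rows of the dataset, with a blank line in between to show it is skipped.
sample_lines = [
    "['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', '好瓜'],",
    "",
    "['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', '坏瓜']",
]
parsed = parse_rows(sample_lines)
print(parsed[1]["敲声"])  # 沉闷
```

The length check makes a malformed row fail loudly instead of raising an `IndexError` later. Passing an open file object instead of `sample_lines` would work the same way, since the function only iterates over lines.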
# Build the dataset
filename = "西瓜数据集2.0.txt"
dataset = read_data(filename)
The data is read in as follows:
[{
'编号': 1,
'色泽': '青绿',
'根蒂': '蜷缩',
'敲声': '浊响',
'纹理': '清晰',
'脐眼': '凹陷',
'触感': '硬滑',
'标签': '好瓜'},
{
'编号': 2,
'色泽': '乌黑',
'根蒂': '蜷缩',
'敲声': '沉闷',
'纹理