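Many of the worked examples and exercises in Zhou Zhihua's book "Machine Learning" (《机器学习》) use "watermelon dataset 3.0" and "watermelon dataset 3.0a". The two differ in their attributes: watermelon dataset 3.0 has six discrete attributes alongside two continuous ones, whereas watermelon dataset 3.0a keeps only the two continuous attributes (density and sugar content). The code that generates both datasets is given below; running it produces the NumPy data files watermelon_3.0.npz and watermelon_3.0a.npz: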
write_dataset_watermelon3.py
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 27 21:24:11 2018
Write 'Machine Learning, Zhihua Zhou' P84 watermelon_3.0 dataset to
'watermelon_3.0.npz'
@author: weiyx15
"""
'''
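[x]
color (色泽): dark (乌黑)-0, green (青绿)-1, light (浅白)-2
root (根蒂): curly (蜷缩)-0, slightly curly (稍蜷)-1, straight (硬挺)-2
sound (敲声): muffled (浊响)-0, dull (沉闷)-1, crisp (清脆)-2
texture (纹理): clear (清晰)-0, slightly blurry (稍糊)-1, blurry (模糊)-2
umbilicus (脐部): hollow (凹陷)-0, slightly hollow (稍凹)-1, flat (平坦)-2
surface (触感): hard (硬滑)-0, soft (软粘)-1
density (密度): <numeric value>
sugar content (含糖率): <numeric value>
[y]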
good melon (好瓜): yes-1, no-0
'''
import numpy as np
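xn_discrete = 6                  # number of discrete attributes
xn_continuous = 2                # number of continuous attributes
yn = 2                           # number of classes
x_discrete = [3, 3, 3, 3, 3, 2]  # number of distinct values of each discrete attribute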
x = np.array([[1, 0, 0, 0, 0, 0, .697, .46],
              [0, 0, 1, 0, 0, 0, .774, .376],
              [0, 0, 0, 0, 0, 0, .634, .264],
              [1, 0, 1, 0, 0, 0, .608, .318],
              [2, 0, 0, 0, 0, 0, .556, .215],
              [1, 1, 0, 0, 1, 1, .403, .237],
              [0, 1, 0, 1, 1, 1, .481, .149],
              [0, 1, 0, 0, 1, 0, .437, .211],
              [0, 1, 1, 1, 1, 0, .666, .091],
              [1, 2, 2, 0, 2, 1, .243, .267],
              [2, 2, 2, 2, 2, 0, .245, .057],
              [2, 0, 0, 2, 2, 1, .343, .099],
              [1, 1, 0, 1, 0, 0, .639, .161],
              [2, 1, 1, 1, 0, 0, .657, .198],
              [0, 1, 0, 0, 1, 1, .36, .37],
              [2, 0, 0, 2, 2, 0, .593, .042],
              [1, 0, 1, 1, 1, 0, .719, .103]])
y = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0])
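# Arrays passed positionally are stored under the default keys arr_0 ... arr_5,
# in the order given here.
np.savez('watermelon_3.0.npz', xn_discrete, xn_continuous, yn, x_discrete, x, y)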
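Because the script above passes its arrays to np.savez positionally, they come back under the generic keys arr_0 to arr_5. The following is a minimal sketch of how the file might be read back with descriptive names; the helper load_watermelon3 and the tuple FIELD_NAMES are hypothetical names introduced here for illustration and are not part of the original script.

```python
import numpy as np

# Hypothetical descriptive names for the six positional arrays, listed in the
# same order in which they were passed to np.savez in the script above.
FIELD_NAMES = ("xn_discrete", "xn_continuous", "yn", "x_discrete", "x", "y")


def load_watermelon3(path="watermelon_3.0.npz"):
    """Return the saved arrays as a dict keyed by descriptive names."""
    with np.load(path) as data:
        # np.savez stores positional arguments as arr_0, arr_1, ...
        return {name: data[f"arr_{i}"] for i, name in enumerate(FIELD_NAMES)}
```

After the generating script has been run, load_watermelon3()["x"] should give the 17 x 8 attribute matrix. Scalars such as xn_discrete come back as zero-dimensional arrays, so wrap them in int() if a plain Python integer is needed.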
write_dataset_watermelon3a.py
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 20 20:19:18 2018
Write 'Machine Learning, Zhihua Zhou' P89 watermelon_3.0a dataset to
'watermelon_3.0a.npz'
@author: weiyx15
"""
import numpy as np
x = np.array([[.697, .46], [.774, .376], [.634, .264], [.608, .318],
              [.556, .215], [.403, .237], [.481, .149], [.437, .211],
              [.666, .091], [.243, .267], [.245, .057], [.343, .099],
              [.639, .161], [.657, .198], [.36, .37], [.593, .042],
              [.719, .103]])
y = np.array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0])
# x and y are passed positionally, so they are stored as arr_0 and arr_1.
np.savez('watermelon_3.0a.npz', x, y)
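As a possible variation on the scripts above, np.savez also accepts keyword arguments, in which case each array is stored under the keyword name instead of arr_0, arr_1, and so on. The sketch below shows a round trip on two samples taken from the dataset (the first and the last); the helpers save_named and load_named, the temporary file name, and the *_demo variables are assumptions made for this illustration only.

```python
import os
import tempfile

import numpy as np


def save_named(path, x, y):
    """Save features and labels under the explicit keys 'x' and 'y'."""
    np.savez(path, x=x, y=y)


def load_named(path):
    """Load the two arrays written by save_named."""
    with np.load(path) as data:
        return data["x"], data["y"]


# Round trip using the first and last samples of watermelon dataset 3.0a.
x_demo = np.array([[.697, .46], [.719, .103]])
y_demo = np.array([1, 0])

# Hypothetical file name inside a temporary directory, so nothing is overwritten.
demo_path = os.path.join(tempfile.mkdtemp(), "watermelon_3.0a_named.npz")
save_named(demo_path, x_demo, y_demo)

x_back, y_back = load_named(demo_path)
print(x_back.shape, y_back.tolist())  # (2, 2) [1, 0]
```

Named keys make the loading code independent of the order in which the arrays were saved, at the cost of producing a file whose keys differ from those written by the original scripts.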