产生非线性数据集,比如用以测试核机制的性能;
核方法最终的使命是:unfold the half-moons
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=200, shuffle=True, random_state=123)
plt.scatter(X[y==0, 0], X[y==0, 1], color='r', marker='^', alpha=.4)
plt.scatter(X[y==1, 0], X[y==1, 1], color='r', marker='o', alpha=.4)
plt.show()
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, noise=.1, factor=.2, random_state=123)
plt.scatter(X[y==0, 0], X[y==0, 1], color='r', marker='^', alpha=.4)
plt.scatter(X[y==1, 0], X[y==1, 1], color='b', marker='o', alpha=.4)
plt.show()
from sklearn import datasets
>>> iris = datasets.load_iris()
>>> dir(iris)
>>> iris.data.shape
(150, 4) # 训练样本
>>> iris.target.shape
(150,) # 一维的训练样本
>>> print(iris.DESCR)
# DESCR成员:数据集的介绍
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm - sepal width in cm - petal length in cm - petal width in cm - class: - Iris-Setosa - Iris-Versicolour - Iris-Virginica :Summary Statistics:
which contains 569 samples of malignant(恶性的) and benign(良性的) tumor cells.
The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnoisi (M=malignant, B=benign), respectively.
The columns 3-32 contains 30 real-value features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
'breast-cancer-wisconsin/wdbc.data', header=None)
X, y = df.values[:, 2:], df.values[:, 1]