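The Iris dataset is a commonly used dataset for classification experiments, compiled by Fisher in 1936. Also known as the iris flower dataset, it is a dataset for multivariate analysis.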
The Iris dataset contains 150 rows of data: 50 samples from each of three related iris species, namely Iris setosa, Iris versicolor, and Iris virginica.
From left to right: Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor (by Dlanglois, CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).
Introduction to the scikit-learn dataset API
The code is as follows:
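from sklearn.datasets import load_iris, fetch_20newsgroups
# Load the datasets
iris = load_iris()  # load_*: small dataset bundled with scikit-learn
# news = fetch_20newsgroups()  # fetch_*: large dataset, downloaded on first use
# print(news)
# print("Return value for the Iris dataset:\n", iris)
# The return value is a Bunch, which inherits from dict
print("Iris feature values:\n", iris["data"])
print("Iris target values:\n", iris.target)
print("Iris feature names:\n", iris.feature_names)
print("Iris target names:\n", iris.target_names)
print("Iris dataset description:\n", iris.DESCR)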
------------------------------------------------------------
Output:
Iris feature values:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
... # omitted, 150 rows in total
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
Iris target values:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Iris feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Iris target names:
['setosa' 'versicolor' 'virginica']
Iris dataset description:
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
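Because the return value is a dictionary-like Bunch, each field can be read either as a key or as an attribute. The following is a minimal sketch of that equivalence, together with the `return_X_y=True` shortcut of `load_iris`. The variable names `same_object`, `X`, and `y` are illustrative and do not come from the code above.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both dict-style and attribute-style access to the same object.
same_object = iris["data"] is iris.data
print(same_object)               # True

# Basic structure: 150 samples, 4 features, 3 classes of 50 samples each.
print(iris.data.shape)           # (150, 4)
print(np.bincount(iris.target))  # [50 50 50]

# return_X_y=True skips the Bunch and returns (features, targets) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)          # (150, 4) (150,)
```

The tuple form is convenient when only the arrays are needed; the Bunch form keeps the feature names, target names, and description alongside the data.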
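Plotting the data shows how the different classes are separated by the features. Ideally, the label classes would be perfectly separated by one or more feature pairs. In the real world, this ideal case rarely occurs.
seaborn.lmplot() is a very useful function: when drawing a two-dimensional scatter plot it also fits a regression line automatically (this can be switched off with fit_reg=False).
It is used as follows: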
from sklearn.datasets import load_iris
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

mpl.rcParams["font.sans-serif"] = ["SimHei"]  # font with CJK glyphs; only needed for Chinese labels
mpl.rcParams["axes.unicode_minus"] = False  # render the minus sign correctly with that font
# Load the dataset
iris = load_iris()
# For visualization, store the data as a DataFrame
iris_data = pd.DataFrame(data=iris.data, columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
iris_data['target'] = iris.target  # add the target values as a new column
print(iris_data)

def plot_iris(data, col1, col2):
    # Scatter plot of col1 against col2, colored by class
    sns.lmplot(x=col1, y=col2, data=data, hue="target", fit_reg=False)  # fit_reg: whether to fit a regression line
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title('Iris species distribution')
    plt.show()

plot_iris(iris_data, 'Sepal_Width', 'Petal_Length')
------------------------------------------------------------------------------
Output:
Sepal_Length Sepal_Width Petal_Length Petal_Width target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
[150 rows x 5 columns]
The resulting plot is shown below:
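The separation visible in such a plot can also be checked numerically. The sketch below is one possible way to do it, not the method used above: it assumes scikit-learn 0.23 or later (for `as_frame=True`) and introduces the illustrative names `df`, `ranges`, `setosa_max`, and `others_min`. It compares the per-species range of petal length.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four feature columns plus an integer "target" column

# Map the integer labels 0/1/2 to species names for readability.
label_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(label_to_name)

# Per-species range of petal length: non-overlapping ranges mean this
# feature alone is enough to separate those classes.
ranges = df.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(ranges)

setosa_max = ranges.loc["setosa", "max"]
others_min = ranges.drop(index="setosa")["min"].min()
print(setosa_max < others_min)  # True
```

This matches the note in the dataset description that one class is linearly separable from the other two, while versicolor and virginica overlap.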
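In machine learning, a dataset is usually split into two parts: a training set, used to fit the model, and a test set, used to evaluate it.
Split ratio: the code in this section uses test_size=0.2, i.e. 80% of the samples for training and 20% for testing.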
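To make the ratio concrete, here is a small sketch of how a fractional test size translates into sample counts. The helper `split_sizes` is hypothetical (it is not a scikit-learn function), and it assumes that the test count is rounded up and the remainder goes to training, which is how `train_test_split` appears to behave when only `test_size` is given.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (illustrative helper)."""
    n_test = math.ceil(test_size * n_samples)  # assumed: test count is rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    return n_train, n_test

print(split_sizes(150, 0.2))   # (120, 30)
print(split_sizes(150, 0.25))  # (112, 38)
```

For the 150 Iris samples, a 20% test share therefore yields 120 training rows and 30 test rows.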
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()  # load the Iris dataset
# Split the Iris dataset
# x_train: feature values of the training set
# x_test: feature values of the test set
# y_train: target values of the training set
# y_test: target values of the test set
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=22)
print('Training set features x_train:\n', x_train)
print('Test set features x_test:\n', x_test)
print('Training set targets y_train:\n', y_train)
print('Test set targets y_test:\n', y_test)
print("Shape of training set features x_train:", x_train.shape)
print("Shape of test set features x_test:", x_test.shape)
print("Shape of training set targets y_train:", y_train.shape)
print("Shape of test set targets y_test:", y_test.shape)
# Random seed: the same random_state gives the same split, a different one gives a different split
print('-------------------Comparing random_state values----------------------')
x_train1, x_test1, y_train1, y_test1 = train_test_split(iris.data, iris.target, test_size=0.2, random_state=6)
x_train2, x_test2, y_train2, y_test2 = train_test_split(iris.data, iris.target, test_size=0.2, random_state=6)
print("Shape of training set features x_train1:", x_train1.shape)
print("Shape of test set features x_test1:", x_test1.shape)
print('-----------------------------------------')
print('Test set targets y_test:\n', y_test)
print('Test set targets y_test1:\n', y_test1)
print('Test set targets y_test2:\n', y_test2)
--------------------------------------------------------------------------
Output:
Training set features x_train:
[[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
... # omitted, 120 rows in total
[6.4 2.8 5.6 2.2]
[7.7 3.8 6.7 2.2]]
Test set features x_test:
[[5.4 3.7 1.5 0.2]
... # omitted, 30 rows in total
[6.2 2.8 4.8 1.8]]
Training set targets y_train:
[0 0 1 1 1 0 0 0 2 2 1 1 0 0 1 1 2 2 0 1 1 2 0 0 0 0 0 0 2 1 1 2 0 0 0 0 1
0 1 1 1 1 1 0 1 2 0 2 1 2 1 1 1 0 0 2 1 0 1 1 2 2 0 2 0 2 0 1 0 2 1 2 1 2
0 1 1 1 1 2 0 0 2 1 1 0 1 0 2 2 2 2 0 2 2 0 1 1 0 2 0 1 0 2 0 2 2 0 2 0 1
0 0 2 1 2 2 0 2 2]
Test set targets y_test:
[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 2 1 0 2 0 1 2 0 2 2 2 2]
Shape of training set features x_train: (120, 4)
Shape of test set features x_test: (30, 4)
Shape of training set targets y_train: (120,)
Shape of test set targets y_test: (30,)
-------------------Comparing random_state values----------------------
Shape of training set features x_train1: (120, 4)
Shape of test set features x_test1: (30, 4)
-----------------------------------------
Test set targets y_test:
[0 2 1 2 1 1 1 2 1 0 2 1 2 2 0 2 1 1 2 1 0 2 0 1 2 0 2 2 2 2]
Test set targets y_test1:
[0 2 0 0 2 1 2 0 2 1 2 1 2 2 1 2 2 1 1 0 0 2 0 0 1 1 1 2 0 1]
Test set targets y_test2:
[0 2 0 0 2 1 2 0 2 1 2 1 2 2 1 2 2 1 1 0 0 2 0 0 1 1 1 2 0 1]
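The output shows that y_test1 and y_test2 (both with random_state=6) are identical, while y_test (random_state=22) differs. It also shows that a plain random split does not preserve class balance: the random_state=22 test set contains 6, 10, and 14 samples of classes 0, 1, and 2. The sketch below checks the reproducibility programmatically and shows the `stratify` parameter of `train_test_split` as one way to keep the classes balanced. The variable names are illustrative, and stratification is an addition that the code above does not use.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed: the two calls produce an identical split.
_, _, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=6)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=6)
print(np.array_equal(y_test_a, y_test_b))  # True

# stratify=y keeps the class proportions of y in both subsets.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=6, stratify=y
)
print(np.bincount(y_train_s))  # [40 40 40]
print(np.bincount(y_test_s))   # [10 10 10]
```

Stratifying matters more on small or imbalanced datasets, where an unlucky split can leave a class under-represented in the test set.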