Scikit-learn (sklearn) is a widely used third-party machine learning library. It wraps the common machine learning methods, including regression, dimensionality reduction, classification, and clustering. As a machine learning library, sklearn also ships with a rich set of built-in datasets; this article introduces those datasets and how to load them.
Install sklearn with pip:
pip install scikit-learn
View all datasets:
import sklearn.datasets as sds
print(dir(sds))
'''Output
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups',
'clear_data_home', 'dump_svmlight_file',
'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions',
'get_data_home',
'load_boston', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine',
'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_friedman2', 'make_friedman3', 'make_gaussian_quantiles', 'make_hastie_10_2', 'make_low_rank_matrix', 'make_moons', 'make_multilabel_classification', 'make_regression', 'make_s_curve', 'make_sparse_coded_signal', 'make_sparse_spd_matrix', 'make_sparse_uncorrelated', 'make_spd_matrix', 'make_swiss_roll']
'''
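If you only care about one kind of helper, the dir() output above can be grouped by prefix; this is plain list filtering, nothing sklearn-specific:
names = [n for n in dir(sds) if not n.startswith('_')]
loaders = [n for n in names if n.startswith('load_')]      # packaged datasets
fetchers = [n for n in names if n.startswith('fetch_')]    # downloadable datasets
generators = [n for n in names if n.startswith('make_')]   # synthetic data generators
print(loaders)
print(fetchers)
print(generators)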
The datasets in sklearn fall into several categories:
Packaged small datasets shipped with the library: sklearn.datasets.load_<name>
Datasets downloaded on demand: sklearn.datasets.fetch_<name>
Computer-generated (synthetic) datasets: sklearn.datasets.make_<name> (a short sketch follows the annotated list below)
Datasets in svmlight/libsvm format: sklearn.datasets.load_svmlight_file(...)
Datasets downloaded from mldata.org: sklearn.datasets.fetch_mldata(...) (note that fetch_mldata has been removed from recent sklearn versions; fetch_openml is its replacement)
# Some of these names may not actually be datasets, but that does not matter: once you pick
# the dataset you want, just check that it appears in sklearn. Commonly used datasets are annotated below.
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups',
'clear_data_home',
'dump_svmlight_file',
'fetch_20newsgroups',
'fetch_20newsgroups_vectorized',
'fetch_california_housing',
'fetch_covtype',
'fetch_kddcup99',
'fetch_lfw_pairs',
'fetch_lfw_people',
'fetch_olivetti_faces',
'fetch_openml',
'fetch_rcv1',
'fetch_species_distributions',
'get_data_home',
'load_boston',
'load_breast_cancer',  breast cancer dataset, load_breast_cancer(): a simple, classic dataset for binary classification
'load_diabetes',  diabetes dataset, load_diabetes(): a classic regression dataset; note that each of its 10 features has already been processed to zero mean and normalized variance
'load_digits',  handwritten digits dataset, load_digits(): a dataset for classification or dimensionality-reduction tasks
'load_files',
'load_iris',  iris dataset, load_iris(): a dataset for classification tasks
'load_linnerud',
'load_sample_image',
'load_sample_images',
'load_svmlight_file',
'load_svmlight_files',
'load_wine',
'make_biclusters',
'make_blobs',
'make_checkerboard',
'make_circles',
'make_classification',
'make_friedman1',
'make_friedman2',
'make_friedman3',
'make_gaussian_quantiles',
'make_hastie_10_2',
'make_low_rank_matrix',
'make_moons',
'make_multilabel_classification',
'make_regression',
'make_s_curve',
'make_sparse_coded_signal',
'make_sparse_spd_matrix',
'make_sparse_uncorrelated',
'make_spd_matrix',
'make_swiss_roll']
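The make_* generators produce synthetic data on the fly, which is handy for quick experiments. A minimal sketch using make_classification (the parameter values here are only illustrative):
# generate a small synthetic binary-classification problem
X, y = sds.make_classification(n_samples=200, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0)
print(X.shape)  # (200, 4)
print(y.shape)  # (200,)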
Take the iris dataset as an example:
help(sds.load_iris)
'''
Help on function load_iris in module sklearn.datasets._base:
load_iris(return_X_y=False)
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification
dataset.
================= ==============
Classes 3
Samples per class 50
Samples total 150
Dimensionality 4
Features real, positive
================= ==============
Read more in the :ref:`User Guide `.
Parameters
----------
return_X_y : boolean, default=False.
If True, returns ``(data, target)`` instead of a Bunch object. See
below for more information about the `data` and `target` object.
.. versionadded:: 0.18
Returns
-------
data : Bunch
Dictionary-like object, the interesting attributes are:
'data', the data to learn, 'target', the classification labels,
'target_names', the meaning of the labels, 'feature_names', the
meaning of the features, 'DESCR', the full description of
the dataset, 'filename', the physical location of
iris csv dataset (added in version `0.20`).
(data, target) : tuple if ``return_X_y`` is True
.. versionadded:: 0.18
Notes
-----
.. versionchanged:: 0.20
Fixed two wrong data points according to Fisher's paper.
The new version is the same as in R, but not as in the UCI
Machine Learning Repository.
Examples
--------
Let's say you are interested in the samples 10, 25, and 50, and want to
know their class name.
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> data.target[[10, 25, 50]]
array([0, 0, 1])
>>> list(data.target_names)
['setosa', 'versicolor', 'virginica']
'''
The number of classes, the dataset size, and how to load the dataset are all spelled out in the help text.
Loading the data:
# the official way
from sklearn.datasets import load_iris
data = load_iris()
# an alternative way, via the module alias
import sklearn.datasets as sds
data = sds.load_iris()
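As the help text notes, passing return_X_y=True returns the (data, target) tuple directly instead of a Bunch object:
# get the feature matrix and labels directly
X, y = sds.load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)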
Data preview: the help text shows that the returned data is a Bunch whose key attributes include data and target, so check the size of each:
print(data.data.shape)
print(data.target.shape)
'''
(150, 4)
(150,)
'''
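The Bunch also exposes the other attributes listed in the help text, such as feature_names and target_names, which make a quick preview easy:
print(data.feature_names)   # names of the 4 features
print(data.target_names)    # ['setosa' 'versicolor' 'virginica']
print(data.data[:3])        # first three samples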
With that, you can move on to preprocessing and model building.
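For instance, a minimal end-to-end sketch (the 80/20 split and the choice of LogisticRegression are only illustrative, not part of the original walkthrough):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=200)  # raise max_iter so the solver converges on iris
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))        # accuracy on the held-out split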