sklearn: Datasets

    • 1. Installation
    • 2. Importing `sklearn` and listing all datasets
    • 3. Loading a dataset

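Scikit-learn (sklearn) is a widely used third-party machine learning package for Python. It wraps common machine learning methods, including regression, dimensionality reduction, classification, and clustering. As a machine learning library, sklearn also ships with a rich collection of datasets; this post introduces those datasets and how to load them.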

1. Installation

pip install scikit-learn
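
A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal check; the version printed depends on your environment.

```python
import sklearn

# The package is installed as "scikit-learn" but imported as "sklearn".
print(sklearn.__version__)
```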

2. Importing sklearn and listing all datasets

import sklearn.datasets as sds
print(dir(sds))
'''Output
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups', 
'clear_data_home', 'dump_svmlight_file', 
'fetch_20newsgroups', 'fetch_20newsgroups_vectorized', 'fetch_california_housing', 'fetch_covtype', 'fetch_kddcup99', 'fetch_lfw_pairs', 'fetch_lfw_people', 'fetch_olivetti_faces', 'fetch_openml', 'fetch_rcv1', 'fetch_species_distributions', 
'get_data_home', 
'load_boston', 'load_breast_cancer', 'load_diabetes', 'load_digits', 'load_files', 'load_iris', 'load_linnerud', 'load_sample_image', 'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 
'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles', 'make_classification', 'make_friedman1', 'make_friedman2', 'make_friedman3', 'make_gaussian_quantiles', 'make_hastie_10_2', 'make_low_rank_matrix', 'make_moons', 'make_multilabel_classification', 'make_regression', 'make_s_curve', 'make_sparse_coded_signal', 'make_sparse_spd_matrix', 'make_sparse_uncorrelated', 'make_spd_matrix', 'make_swiss_roll']
'''
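
The raw `dir()` output mixes private modules with the public functions. As a rough sketch, the public names can be grouped by prefix instead; the names `public_names` and `groups` are only illustrative, and the counts depend on the installed scikit-learn version.

```python
import sklearn.datasets as sds

# __all__ holds only the public names, so the dunder and underscore-prefixed
# entries that dir() returns are left out.
public_names = sorted(sds.__all__)

# Bucket the public names by their naming prefix.
groups = {"load_": [], "fetch_": [], "make_": []}
for name in public_names:
    for prefix in groups:
        if name.startswith(prefix):
            groups[prefix].append(name)

for prefix, names in groups.items():
    print(prefix, len(names))
```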

Datasets in sklearn fall into several categories:

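Small packaged datasets (Packaged Dataset): sklearn.datasets.load_<name>
Datasets downloaded on demand (Downloaded Dataset): sklearn.datasets.fetch_<name>
Computer-generated datasets (Generated Dataset): sklearn.datasets.make_<name>
Datasets in svmlight/libsvm format: sklearn.datasets.load_svmlight_file(...)
Datasets downloaded from mldata.org: sklearn.datasets.fetch_mldata(...) (removed from recent scikit-learn releases; the listing above contains fetch_openml instead)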
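
The categories above can be sketched with one call each. This is a minimal illustration, not the only way to use these functions: the choice of `load_wine` and `make_blobs`, the parameter values, and the in-memory buffer are assumptions made for the example. The `fetch_<name>` functions are left out because they need network access.

```python
from io import BytesIO

from sklearn.datasets import (
    dump_svmlight_file,
    load_svmlight_file,
    load_wine,
    make_blobs,
)

# Packaged dataset: ships with scikit-learn, no download needed.
wine = load_wine()
print(wine.data.shape)

# Generated dataset: 100 synthetic 2-D points drawn around 3 centers.
X, y = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)

# svmlight/libsvm format: write to an in-memory buffer, then read it back.
buffer = BytesIO()
dump_svmlight_file(X, y, buffer)
buffer.seek(0)
X_sparse, y_loaded = load_svmlight_file(buffer, n_features=2)
print(X_sparse.shape, y_loaded.shape)
```

Note that `load_svmlight_file` returns the feature matrix as a SciPy sparse matrix rather than a dense NumPy array.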
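# Some of the names here may not be datasets, but that does no harm: once you
# have picked a dataset, print the list to check whether sklearn provides it.
# A few commonly used datasets are annotated below.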
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_base', '_california_housing', '_covtype', '_kddcup99', '_lfw', '_olivetti_faces', '_openml', '_rcv1', '_samples_generator', '_species_distributions', '_svmlight_format_fast', '_svmlight_format_io', '_twenty_newsgroups',
 'clear_data_home', 
 'dump_svmlight_file', 
 'fetch_20newsgroups', 
 'fetch_20newsgroups_vectorized', 
 'fetch_california_housing', 
 'fetch_covtype', 
 'fetch_kddcup99', 
 'fetch_lfw_pairs', 
 'fetch_lfw_people', 
 'fetch_olivetti_faces', 
 'fetch_openml', 
 'fetch_rcv1', 
 'fetch_species_distributions', 
 'get_data_home', 
 'load_boston', 
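 'load_breast_cancer',  # breast cancer dataset, load_breast_cancer(): a simple, classic dataset for binary classification
 'load_diabetes',  # diabetes dataset, load_diabetes(): a classic dataset for regression; note that each of its 10 features has already been mean-centered and scaled
 'load_digits',  # handwritten digits dataset, load_digits(): used for classification or dimensionality reduction
 'load_files', 
 'load_iris',  # iris dataset, load_iris(): used for classification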
 'load_linnerud', 
 'load_sample_image', 
 'load_sample_images', 
 'load_svmlight_file', 
 'load_svmlight_files', 
 'load_wine', 
 'make_biclusters', 
 'make_blobs', 
 'make_checkerboard', 
 'make_circles', 
 'make_classification', 
 'make_friedman1', 
 'make_friedman2', 
 'make_friedman3', 
 'make_gaussian_quantiles', 
 'make_hastie_10_2', 
 'make_low_rank_matrix', 
 'make_moons', 
 'make_multilabel_classification', 
 'make_regression', 
 'make_s_curve', 
 'make_sparse_coded_signal', 
 'make_sparse_spd_matrix', 
 'make_sparse_uncorrelated', 
 'make_spd_matrix', 
 'make_swiss_roll']
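
The annotations above can be checked directly against the data. The following is a small sketch using NumPy; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes, load_digits

# Breast cancer: the distinct target values show it is a binary task.
cancer = load_breast_cancer()
cancer_classes = np.unique(cancer.target)
print(cancer_classes)

# Diabetes: every feature column should average to (numerically) zero.
diabetes = load_diabetes()
column_means = diabetes.data.mean(axis=0)
print(np.allclose(column_means, 0.0))

# Digits: number of samples, features, and distinct classes.
digits = load_digits()
digit_classes = np.unique(digits.target)
print(digits.data.shape, len(digit_classes))
```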

3. Loading a dataset

Take the iris dataset as an example.

print(help(sds.load_iris))  # help() prints the text itself and returns None, hence the trailing "None"
'''
Help on function load_iris in module sklearn.datasets._base:

load_iris(return_X_y=False)
    Load and return the iris dataset (classification).
    
    The iris dataset is a classic and very easy multi-class classification
    dataset.
    
    =================   ==============
    Classes                          3
    Samples per class               50
    Samples total                  150
    Dimensionality                   4
    Features            real, positive
    =================   ==============
    
    Read more in the :ref:`User Guide <iris_dataset>`.
    
    Parameters
    ----------
    return_X_y : boolean, default=False.
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are:
        'data', the data to learn, 'target', the classification labels,
        'target_names', the meaning of the labels, 'feature_names', the
        meaning of the features, 'DESCR', the full description of
        the dataset, 'filename', the physical location of
        iris csv dataset (added in version `0.20`).
    
    (data, target) : tuple if ``return_X_y`` is True
    
        .. versionadded:: 0.18
    
    Notes
    -----
        .. versionchanged:: 0.20
            Fixed two wrong data points according to Fisher's paper.
            The new version is the same as in R, but not as in the UCI
            Machine Learning Repository.
    
    Examples
    --------
    Let's say you are interested in the samples 10, 25, and 50, and want to
    know their class name.
    
    >>> from sklearn.datasets import load_iris
    >>> data = load_iris()
    >>> data.target[[10, 25, 50]]
    array([0, 0, 1])
    >>> list(data.target_names)
    ['setosa', 'versicolor', 'virginica']

None
'''
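
The `return_X_y` parameter described in the help text can be tried as follows. This is a minimal sketch; the names `X` and `y` are just a common convention.

```python
from sklearn.datasets import load_iris

# With return_X_y=True the loader returns a (data, target) tuple
# instead of a Bunch object.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```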

The help text documents the number of classes, the size of the dataset, and how to load it.
Loading:

# The way shown in the official docs
from sklearn.datasets import load_iris
data = load_iris()
# An equivalent alternative
import sklearn.datasets as sds
data = sds.load_iris()
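
The object returned by `load_iris()` is a dictionary-like Bunch, so its contents can be inspected by key or by attribute. A short sketch of that inspection:

```python
from sklearn.datasets import load_iris

data = load_iris()

# List the available fields of the Bunch.
print(list(data.keys()))

# Names of the four features and the three classes.
print(data.feature_names)
print(list(data.target_names))

# Key access and attribute access return the same underlying array.
same_object = data["data"] is data.data
print(same_object)
```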

Previewing the data: the help text shows that the returned object has, among others, the attributes data and target, so check the shape of each.

print(data.data.shape)
print(data.target.shape)
'''
(150, 4)
(150,)
'''
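
Beyond the shapes, it can help to look at a few rows and at how many samples each class has. A minimal sketch using NumPy, with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# First three samples (rows) with their four feature values.
print(data.data[:3])

# Number of samples per class, paired with the class names.
counts = np.bincount(data.target)
for name, count in zip(data.target_names, counts):
    print(name, count)
```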

From here the data can be preprocessed and used for modeling and analysis…
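
As one possible next step, the sketch below splits the iris data, standardizes the features, and fits a classifier. The choice of `StandardScaler`, `LogisticRegression`, the 25% test split, and `random_state=0` are assumptions made for illustration, not something the original post prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing (scaling) and the model are chained in one pipeline, so the
# scaler is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```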
