Common Machine Learning Datasets for Download (Free)

  • Dataset downloads
  • Getting datasets from the sklearn library
    • Example (California housing dataset)

Most of the datasets used in machine learning are hosted on the UCI repository; this page is a note so they are easier to find.

UCI website (old version): https://archive.ics.uci.edu/ml/index.php

UCI website (new version): https://archive-beta.ics.uci.edu/


If UCI doesn't have what you need, there are other places:
Kaggle datasets: https://www.kaggle.com/datasets (signing in is a bit of a hassle)

Tianchi (Alibaba Cloud's big-data competition platform): https://tianchi.aliyun.com (worth a try if your English is not great)

PaddlePaddle AI Studio datasets: https://aistudio.baidu.com/aistudio/datasetoverview (Baidu's open AI platform; pick "Development Platform", then "Datasets" in the second column)

Dataset downloads

The download links below all point to the old UCI site.

Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris

Wine dataset: https://archive.ics.uci.edu/ml/datasets/Wine

Boston housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

Lenses (contact lens) dataset: https://archive.ics.uci.edu/ml/datasets/lenses

Horse colic dataset: http://archive.ics.uci.edu/ml/datasets/Horse+Colic

Portuguese bank marketing dataset: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

1984 U.S. Congressional voting records dataset: http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

Mushroom dataset (identifying features of poisonous mushrooms): https://archive.ics.uci.edu/ml/datasets/mushroom

A few more are datasets on Kaggle (you can't download them without signing in, and signing in is a hassle):
San Francisco crime classification: https://www.kaggle.com/c/sf-crime
Titanic survival prediction: https://www.kaggle.com/c/titanic/data
Handwritten digit recognition: https://www.kaggle.com/c/digit-recognizer/data

Getting datasets from the sklearn library

Something I only discovered later on: some of these datasets are built into sklearn, and a single function call fetches them, which saves a lot of trouble. There don't seem to be many of them, though. What the loader returns is a dict-like Bunch object (it prints much like JSON). The code below demonstrates the wine dataset.

  • wine: the returned Bunch object (dict-like, JSON-looking when printed)
  • wine.data: the feature data
  • wine.feature_names: the name of each feature column
  • wine.target: the class each sample belongs to
  • wine.target_names: the names of the classes
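
A minimal sketch for poking at these attributes (the values in the comments follow from the dataset description printed further below; only scikit-learn is assumed to be installed):

from sklearn.datasets import load_wine

wine = load_wine()

# wine is a sklearn Bunch: a dict that also supports attribute access.
print(wine.data.shape)         # (178, 13) -- 178 samples, 13 numeric features
print(wine.target.shape)       # (178,)    -- one class label per sample
print(wine.feature_names[:3])  # ['alcohol', 'malic_acid', 'ash']
print(wine.target_names)       # ['class_0' 'class_1' 'class_2']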

If you concatenate wine.data and wine.target into a DataFrame, you get [178 rows x 14 columns]: columns 0-12 are the features and column 13 is the label. wine.feature_names plus one extra name for the label column (the original note uses '种类', i.e. "class") can serve as its column names; a sketch of this is shown right after the printed output below.

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so it is left out of the import below.
from sklearn.datasets import load_wine, load_iris, load_breast_cancer
import pprint

wine = load_wine()                    # wine recognition dataset
iris = load_iris()                    # iris dataset
breast_cancer = load_breast_cancer()  # breast cancer Wisconsin dataset

pprint.pprint(wine)                   # dump the whole Bunch

'''
Printed output:
"D:\Programming Software\Python3.9.1\python.exe" "D:/Program Space/Python/sklearn_machinelearning/src/Test/main.py"
{'DESCR': '.. _wine_dataset:\n'
          '\n'
          'Wine recognition dataset\n'
          '------------------------\n'
          '\n'
          '**Data Set Characteristics:**\n'
          '\n'
          '    :Number of Instances: 178 (50 in each of three classes)\n'
          '    :Number of Attributes: 13 numeric, predictive attributes and '
          'the class\n'
          '    :Attribute Information:\n'
          ' \t\t- Alcohol\n'
          ' \t\t- Malic acid\n'
          ' \t\t- Ash\n'
          '\t\t- Alcalinity of ash  \n'
          ' \t\t- Magnesium\n'
          '\t\t- Total phenols\n'
          ' \t\t- Flavanoids\n'
          ' \t\t- Nonflavanoid phenols\n'
          ' \t\t- Proanthocyanins\n'
          '\t\t- Color intensity\n'
          ' \t\t- Hue\n'
          ' \t\t- OD280/OD315 of diluted wines\n'
          ' \t\t- Proline\n'
          '\n'
          '    - class:\n'
          '            - class_0\n'
          '            - class_1\n'
          '            - class_2\n'
          '\t\t\n'
          '    :Summary Statistics:\n'
          '    \n'
          '    ============================= ==== ===== ======= =====\n'
          '                                   Min   Max   Mean     SD\n'
          '    ============================= ==== ===== ======= =====\n'
          '    Alcohol:                      11.0  14.8    13.0   0.8\n'
          '    Malic Acid:                   0.74  5.80    2.34  1.12\n'
          '    Ash:                          1.36  3.23    2.36  0.27\n'
          '    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n'
          '    Magnesium:                    70.0 162.0    99.7  14.3\n'
          '    Total Phenols:                0.98  3.88    2.29  0.63\n'
          '    Flavanoids:                   0.34  5.08    2.03  1.00\n'
          '    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n'
          '    Proanthocyanins:              0.41  3.58    1.59  0.57\n'
          '    Colour Intensity:              1.3  13.0     5.1   2.3\n'
          '    Hue:                          0.48  1.71    0.96  0.23\n'
          '    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n'
          '    Proline:                       278  1680     746   315\n'
          '    ============================= ==== ===== ======= =====\n'
          '\n'
          '    :Missing Attribute Values: None\n'
          '    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n'
          '    :Creator: R.A. Fisher\n'
          '    :Donor: Michael Marshall (MARSHALL%[email protected])\n'
          '    :Date: July, 1988\n'
          '\n'
          'This is a copy of UCI ML Wine recognition datasets.\n'
          'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n'
          '\n'
          'The data is the results of a chemical analysis of wines grown in '
          'the same\n'
          'region in Italy by three different cultivators. There are thirteen '
          'different\n'
          'measurements taken for different constituents found in the three '
          'types of\n'
          'wine.\n'
          '\n'
          'Original Owners: \n'
          '\n'
          'Forina, M. et al, PARVUS - \n'
          'An Extendible Package for Data Exploration, Classification and '
          'Correlation. \n'
          'Institute of Pharmaceutical and Food Analysis and Technologies,\n'
          'Via Brigata Salerno, 16147 Genoa, Italy.\n'
          '\n'
          'Citation:\n'
          '\n'
          'Lichman, M. (2013). UCI Machine Learning Repository\n'
          '[https://archive.ics.uci.edu/ml]. Irvine, CA: University of '
          'California,\n'
          'School of Information and Computer Science. \n'
          '\n'
          '.. topic:: References\n'
          '\n'
          '  (1) S. Aeberhard, D. Coomans and O. de Vel, \n'
          '  Comparison of Classifiers in High Dimensional Settings, \n'
          '  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. '
          'of  \n'
          '  Mathematics and Statistics, James Cook University of North '
          'Queensland. \n'
          '  (Also submitted to Technometrics). \n'
          '\n'
          '  The data was used with many others for comparing various \n'
          '  classifiers. The classes are separable, though only RDA \n'
          '  has achieved 100% correct classification. \n'
          '  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed '
          'data)) \n'
          '  (All results using the leave-one-out technique) \n'
          '\n'
          '  (2) S. Aeberhard, D. Coomans and O. de Vel, \n'
          '  "THE CLASSIFICATION PERFORMANCE OF RDA" \n'
          '  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. '
          'of \n'
          '  Mathematics and Statistics, James Cook University of North '
          'Queensland. \n'
          '  (Also submitted to Journal of Chemometrics).\n',
 'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]]),
 'feature_names': ['alcohol',
                   'malic_acid',
                   'ash',
                   'alcalinity_of_ash',
                   'magnesium',
                   'total_phenols',
                   'flavanoids',
                   'nonflavanoid_phenols',
                   'proanthocyanins',
                   'color_intensity',
                   'hue',
                   'od280/od315_of_diluted_wines',
                   'proline'],
 'frame': None,
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7')}
'''
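
As described above, wine.data and wine.target can be stitched into a single table. A minimal sketch (the 'class' column name here is just a placeholder choice for the label column):

import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Features as columns, plus one extra column for the class label.
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['class'] = wine.target   # column 13: the label (0, 1 or 2)

print(df.shape)             # (178, 14) -- 13 feature columns + 1 label column
print(df.head())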

Example (California housing dataset)

from sklearn.datasets import fetch_california_housing as fch  # California housing dataset
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

housevalue = fch()  # downloads the data on first use and caches it locally

# Put the data into a DataFrame for easier inspection
X = pd.DataFrame(housevalue.data, columns=housevalue.feature_names)
y = pd.DataFrame(housevalue.target, columns=housevalue.target_names)
df = pd.concat([X, y], axis=1)

print(df)
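
A small follow-up note: recent scikit-learn versions (0.23 and later) also accept as_frame=True in these loaders, which hands back pandas objects directly, so the manual concat above isn't strictly needed. A minimal sketch:

from sklearn.datasets import fetch_california_housing

# With as_frame=True, .frame is a ready-made DataFrame (features + target column).
housing = fetch_california_housing(as_frame=True)
df = housing.frame

print(df.shape)
print(df.head())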
