Most of the datasets used in machine learning are on UCI, so I'm keeping this note to make them easier to find.
UCI repository (old site): https://archive.ics.uci.edu/ml/index.php
UCI repository (new beta site): https://archive-beta.ics.uci.edu/
For anything UCI doesn't have, there are other places to look:
Kaggle datasets: https://www.kaggle.com/datasets (signing in is a bit of a hassle)
Tianchi, Alibaba Cloud's big-data competition platform: https://tianchi.aliyun.com (worth trying if you prefer a Chinese-language site)
PaddlePaddle AI Studio datasets: https://aistudio.baidu.com/aistudio/datasetoverview (Baidu's AI open platform: open the development platform section, then pick Datasets in the second column)
The download links below all point to the old UCI site; a quick way to read one of these files directly with pandas is sketched right after the list.
Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris
Wine dataset: https://archive.ics.uci.edu/ml/datasets/Wine
Boston housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Lenses (contact lens) dataset: https://archive.ics.uci.edu/ml/datasets/lenses
Horse Colic dataset: http://archive.ics.uci.edu/ml/datasets/Horse+Colic
Portuguese bank marketing dataset: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
1984 US Congressional Voting Records dataset: http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
Mushroom dataset (features that help identify poisonous mushrooms): https://archive.ics.uci.edu/ml/datasets/mushroom
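Most of these pages link to a raw data file that pandas can read straight over HTTP. A minimal sketch, assuming the Iris data file is still served at the usual machine-learning-databases/iris/iris.data path on the old site (check the dataset page for the exact file name and column order):

import pandas as pd

# Iris has no header row; column names follow the attribute list on the dataset page
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_df = pd.read_csv(url, header=None, names=cols)
print(iris_df.head())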
A few more are Kaggle datasets (you can't download them without signing in, and signing in is a hassle):
San Francisco crime classification: https://www.kaggle.com/c/sf-crime
Titanic survival prediction: https://www.kaggle.com/c/titanic/data
Handwritten digit recognition: https://www.kaggle.com/c/digit-recognizer/data
Something I only discovered later: some of these datasets ship with sklearn and can be fetched with a single function call, which saves a lot of trouble, although only a handful are included. What comes back is a Bunch object, a dict-like container (it looks a bit like JSON when printed); the code below uses the wine dataset as the example.
If wine.data and wine.target are concatenated into a DataFrame, it comes out as [178 rows x 14 columns]: columns 0-12 are the 13 features and column 13 is the label. wine.feature_names plus a label column name such as '种类' (class) can serve as the column names; a sketch of this follows the printed output below.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions use fetch_california_housing (shown further below) instead
from sklearn.datasets import load_boston, load_wine, load_iris, load_breast_cancer
import pprint

boston = load_boston()
wine = load_wine()
iris = load_iris()
breast_cancer = load_breast_cancer()
pprint.pprint(wine)  # every loader returns a Bunch (dict-like) object
'''
Output:
"D:\Programming Software\Python3.9.1\python.exe" "D:/Program Space/Python/sklearn_machinelearning/src/Test/main.py"
{'DESCR': '.. _wine_dataset:\n'
'\n'
'Wine recognition dataset\n'
'------------------------\n'
'\n'
'**Data Set Characteristics:**\n'
'\n'
' :Number of Instances: 178 (50 in each of three classes)\n'
' :Number of Attributes: 13 numeric, predictive attributes and '
'the class\n'
' :Attribute Information:\n'
' \t\t- Alcohol\n'
' \t\t- Malic acid\n'
' \t\t- Ash\n'
'\t\t- Alcalinity of ash \n'
' \t\t- Magnesium\n'
'\t\t- Total phenols\n'
' \t\t- Flavanoids\n'
' \t\t- Nonflavanoid phenols\n'
' \t\t- Proanthocyanins\n'
'\t\t- Color intensity\n'
' \t\t- Hue\n'
' \t\t- OD280/OD315 of diluted wines\n'
' \t\t- Proline\n'
'\n'
' - class:\n'
' - class_0\n'
' - class_1\n'
' - class_2\n'
'\t\t\n'
' :Summary Statistics:\n'
' \n'
' ============================= ==== ===== ======= =====\n'
' Min Max Mean SD\n'
' ============================= ==== ===== ======= =====\n'
' Alcohol: 11.0 14.8 13.0 0.8\n'
' Malic Acid: 0.74 5.80 2.34 1.12\n'
' Ash: 1.36 3.23 2.36 0.27\n'
' Alcalinity of Ash: 10.6 30.0 19.5 3.3\n'
' Magnesium: 70.0 162.0 99.7 14.3\n'
' Total Phenols: 0.98 3.88 2.29 0.63\n'
' Flavanoids: 0.34 5.08 2.03 1.00\n'
' Nonflavanoid Phenols: 0.13 0.66 0.36 0.12\n'
' Proanthocyanins: 0.41 3.58 1.59 0.57\n'
' Colour Intensity: 1.3 13.0 5.1 2.3\n'
' Hue: 0.48 1.71 0.96 0.23\n'
' OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71\n'
' Proline: 278 1680 746 315\n'
' ============================= ==== ===== ======= =====\n'
'\n'
' :Missing Attribute Values: None\n'
' :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n'
' :Creator: R.A. Fisher\n'
' :Donor: Michael Marshall (MARSHALL%[email protected])\n'
' :Date: July, 1988\n'
'\n'
'This is a copy of UCI ML Wine recognition datasets.\n'
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n'
'\n'
'The data is the results of a chemical analysis of wines grown in '
'the same\n'
'region in Italy by three different cultivators. There are thirteen '
'different\n'
'measurements taken for different constituents found in the three '
'types of\n'
'wine.\n'
'\n'
'Original Owners: \n'
'\n'
'Forina, M. et al, PARVUS - \n'
'An Extendible Package for Data Exploration, Classification and '
'Correlation. \n'
'Institute of Pharmaceutical and Food Analysis and Technologies,\n'
'Via Brigata Salerno, 16147 Genoa, Italy.\n'
'\n'
'Citation:\n'
'\n'
'Lichman, M. (2013). UCI Machine Learning Repository\n'
'[https://archive.ics.uci.edu/ml]. Irvine, CA: University of '
'California,\n'
'School of Information and Computer Science. \n'
'\n'
'.. topic:: References\n'
'\n'
' (1) S. Aeberhard, D. Coomans and O. de Vel, \n'
' Comparison of Classifiers in High Dimensional Settings, \n'
' Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. '
'of \n'
' Mathematics and Statistics, James Cook University of North '
'Queensland. \n'
' (Also submitted to Technometrics). \n'
'\n'
' The data was used with many others for comparing various \n'
' classifiers. The classes are separable, though only RDA \n'
' has achieved 100% correct classification. \n'
' (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed '
'data)) \n'
' (All results using the leave-one-out technique) \n'
'\n'
' (2) S. Aeberhard, D. Coomans and O. de Vel, \n'
' "THE CLASSIFICATION PERFORMANCE OF RDA" \n'
' Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. '
'of \n'
' Mathematics and Statistics, James Cook University of North '
'Queensland. \n'
' (Also submitted to Journal of Chemometrics).\n',
'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
1.065e+03],
[1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
1.050e+03],
[1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
1.185e+03],
...,
[1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
8.350e+02],
[1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
8.400e+02],
[1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
5.600e+02]]),
'feature_names': ['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline'],
'frame': None,
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7')}
'''
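Here is a minimal sketch of the DataFrame concatenation described earlier, reusing the wine object loaded above; the label column name '种类' is just the illustrative choice from this note:

import pandas as pd

# 13 feature columns plus one label column -> [178 rows x 14 columns]
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['种类'] = wine.target
print(wine_df.shape)  # (178, 14)

The California housing example below builds its DataFrame the same way, this time with pd.concat.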
from sklearn.datasets import fetch_california_housing as fch  # California housing dataset
import pandas as pd

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

housevalue = fch()  # downloaded and cached locally on the first call
# Put the data into a DataFrame so it is easier to inspect
X = pd.DataFrame(housevalue.data, columns=housevalue.feature_names)
y = pd.DataFrame(housevalue.target, columns=housevalue.target_names)
df = pd.concat([X, y], axis=1)
print(df)
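The 'frame': None entry in the printed wine Bunch above points at a shortcut: scikit-learn 0.23 and later accept as_frame=True, in which case the loader assembles the DataFrame for you (features plus a 'target' column). A small sketch of that alternative:

from sklearn.datasets import load_wine

wine_bunch = load_wine(as_frame=True)
df = wine_bunch.frame  # features plus a 'target' column in one DataFrame
print(df.shape)        # (178, 14)
print(df.columns.tolist())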