Classic Open-Source Datasets for Machine Learning

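0x00 Preface

Data is king: with the same machine learning algorithm, data of different quality trains models that perform differently. This article shares several classic open-source datasets from the data science field.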

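The article has three parts:

A detailed look at the most commonly used classic datasets
How to inspect a dataset cleanly with Python
Where to find other open-source datasets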

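0x01 Classic Datasets

I. Overview

The table below lists some of the most commonly used datasets that the author has collected. Between them they cover essentially the whole machine learning workflow, and they appear frequently in the official examples of sklearn, Spark ML, and TensorFlow.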

Dataset	Description	Records	Use	Download
Iris	Iris flower dataset	150	Classification and clustering	http://archive.ics.uci.edu/ml/datasets/Iris
Adult	US census data	48842	Classification and clustering	http://archive.ics.uci.edu/ml/datasets/Adult
Wine	Wine data	178	Classification and clustering	http://archive.ics.uci.edu/ml/datasets/Wine
20 Newsgroups	Newsgroup posts	19997	Text classification and clustering	http://qwone.com/~jason/20Newsgroups/
MovieLens	Movie ratings	26000000	Recommender systems	https://grouplens.org/datasets/movielens/
MNIST	Handwritten digit images	70000	Handwritten digit recognition	http://yann.lecun.com/exdb/mnist/

II. Iris

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

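Iris, also known as the iris flower dataset, is a multivariate dataset. It was introduced in the mid-1930s by the distinguished statistician R. A. Fisher and is widely regarded as the best-known dataset in data mining. It covers three species (Iris setosa, Iris versicolor, and Iris virginica) with 50 samples each. Every sample has four attributes, all measured in cm: sepal length, sepal width, petal length, and petal width.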
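As a quick check of the class balance described above, the following sketch counts samples per species. It assumes the copy of Iris bundled with scikit-learn, which may differ in minor details from the UCI file linked in the table.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target holds integer class labels; iris.target_names maps them to species names.
labels, counts = np.unique(iris.target, return_counts=True)
class_counts = {str(iris.target_names[i]): int(n) for i, n in zip(labels, counts)}

print(iris.data.shape)  # (number of samples, number of attributes)
print(class_counts)     # samples per species
```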

III. Adult

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

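The data was extracted from the 1994 US Census database and can be used to predict whether a resident's income exceeds $50K per year. The class variable is whether annual income exceeds $50K; the attributes include important information such as age, work class, education, occupation, and race. Notably, 8 of the 14 attributes are categorical and the other 6 are continuous.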
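A minimal sketch of how the attributes might be split into numeric and categorical columns with pandas. The column names follow the UCI description file, and the three rows in SAMPLE are made-up records written in the adult.data layout, not real census entries; with the real file you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Column names as listed in the UCI description; "income" is the class label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Made-up rows in the adult.data layout (no header, ", " as separator).
SAMPLE = """\
41, Private, 120000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K
35, State-gov, 98000, HS-grad, 9, Divorced, Adm-clerical, Unmarried, Black, Male, 0, 0, 38, United-States, <=50K
52, Self-emp-inc, 210000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5000, 0, 50, United-States, >50K
"""

adult = pd.read_csv(
    io.StringIO(SAMPLE), header=None, names=COLUMNS, skipinitialspace=True
)

# Separate the attributes from the label, then split the attributes by dtype.
features = adult.drop(columns="income")
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in features.columns if c not in numeric_cols]

# Binary target: True when income is above 50K.
y = adult["income"].eq(">50K")

print(len(numeric_cols), "numeric:", numeric_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```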

IV. Wine

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.

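This dataset contains 178 records of wines from three different cultivars. The 13 attributes are 13 chemical constituents of the wine, so the cultivar can be inferred from the chemical analysis. Notably, all of the attributes are continuous.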
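The figures above can be checked with a short sketch. It assumes the copy of the Wine data bundled with scikit-learn (load_wine) rather than the UCI download.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Records x chemical attributes.
print(wine_df.shape)

# Number of samples per cultivar (labels are 0, 1, 2).
cultivar_counts = pd.Series(wine.target).value_counts().sort_index()
print(cultivar_counts)

# Every attribute column is stored as a floating-point (continuous) value.
all_continuous = all(pd.api.types.is_float_dtype(t) for t in wine_df.dtypes)
print(all_continuous)
```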

V. 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

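The dataset contains roughly 20,000 newsgroup documents spread almost evenly across 20 newsgroups. It is a classic, widely used benchmark for applying machine learning to text, in particular text classification and text clustering.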
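To give a feel for the text classification task, here is a minimal sketch using TF-IDF features and a naive Bayes classifier. The documents are made-up stand-ins so the example runs offline; only the two label strings are real 20 Newsgroups group names. To work with the real corpus, sklearn.datasets.fetch_20newsgroups downloads it on first use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training documents, two per newsgroup.
train_docs = [
    "the shuttle launch reached orbit",
    "nasa plans a new moon orbit mission",
    "the team scored a goal in overtime",
    "the goalie saved the puck in the game",
]
train_labels = [
    "sci.space",
    "sci.space",
    "rec.sport.hockey",
    "rec.sport.hockey",
]

# TF-IDF turns each document into a weighted bag-of-words vector;
# multinomial naive Bayes then learns per-class word distributions.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predictions = model.predict(["orbit mission launch", "goal puck game"])
print(list(predictions))
```

A pipeline is used so that the vectorizer fitted on the training text is reused unchanged at prediction time.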

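VI. MovieLens

MovieLens is a dataset of movie ratings. It contains ratings that users gave to movies on the MovieLens site, together with identifiers that link each movie to IMDb and The Movie Database (TMDb). It can be used for recommender systems.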
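Recommender experiments usually start by turning the rating records into a user-by-movie matrix. The sketch below does this with pandas on a handful of made-up ratings; the userId, movieId, and rating column names are assumed to match the ratings file in the MovieLens download.

```python
import pandas as pd

# Made-up ratings in the assumed userId / movieId / rating layout.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Rows are users, columns are movies; NaN marks a movie the user has not rated.
user_item = ratings.pivot(index="userId", columns="movieId", values="rating")

# Mean rating per movie, ignoring the missing entries.
movie_means = user_item.mean(axis=0)

print(user_item)
print(movie_means)
```

For the full dataset the matrix is very sparse, so a sparse representation would be a better fit than a dense DataFrame.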

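VII. MNIST

MNIST is a dataset for handwritten digit recognition. It contains 60,000 training examples and 10,000 test examples, and each image is 28×28 pixels. The digits have already been size-normalized and placed in fixed-size images, so most of the preprocessing is done. Many mainstream machine learning tools (including sklearn) use it in their introductory tutorials and examples.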
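Because every image has the same 28×28 size, feeding the data to a classical model typically needs only flattening and scaling. The sketch below shows that step on a random array standing in for a batch of MNIST images; it is an assumption made to keep the example offline, not the real data.

```python
import numpy as np

# Random stand-in for a batch of MNIST-style images: 8-bit grayscale, 28 x 28 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional vector and scale pixels to [0, 1].
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

print(X.shape)
```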

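0x02 Data Exploration

The best way to understand the details of a dataset is not to read its documentation but to look at its distribution and characteristics yourself.

Understanding the data

Here we take the Iris dataset as an example and describe it with pandas in Python. To get the data we use the API that sklearn provides directly rather than downloading it ourselves.

1. Fetching and describing the data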

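import pandas as pd
from sklearn.datasets import load_iris

# Load the bundled Iris data and wrap the feature matrix in a DataFrame.
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Print the index range, columns, non-null counts, dtypes, and memory usage.
df.info()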

# Output of info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB

2. Sample rows

df.head()

num	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

3. Summary statistics

describe reports summary statistics for each column of the dataset, such as the count and the mean.

df.describe()

type	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.054000	3.758667	1.198667
std	0.828066	0.433594	1.764420	0.763161
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

This is only a simple demonstration; for a deeper look, see the detailed API documentation on the official website.
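One natural next step is to compare the classes rather than the dataset as a whole. The sketch below adds a species column (a name chosen here for illustration) and computes per-species means with groupby; it rebuilds the DataFrame so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Map the integer labels to species names so the groups are readable.
df["species"] = data.target_names[data.target]

# Mean of every measurement within each species.
species_means = df.groupby("species").mean()
print(species_means)
```

The petal measurements differ far more between species than the sepal measurements do, which is why they are often used to separate the classes.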

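0x03 Others

I. UCI datasets

The UCI repository includes a large number of datasets for supervised and unsupervised learning, a little over 400 in total. Many of them are used repeatedly by other data tools, for example Iris, Wine, Adult, Car Evaluation, and Forest Fires.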

URL: http://archive.ics.uci.edu/ml/

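II. sklearn datasets

sklearn already ships with many datasets; datasets.load_iris(), used earlier, loads one of them. If you are interested, the official API reference lists them, and it covers most of the commonly used datasets.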

URL: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
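As a small illustration of the bundled loaders, the sketch below asks load_iris for pandas objects directly. It assumes scikit-learn 0.23 or later, where the as_frame parameter is available.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
bunch = load_iris(as_frame=True)

# bunch.frame holds the four feature columns plus a "target" column.
frame = bunch.frame
print(frame.head())
```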


Author: 木东居士

Homepage: http://www.mdjs.info

This article may be reposted, provided the original source and author are credited with a hyperlink.
