Where to find datasets for hands-on machine learning

Good machine learning research starts with an exceptional dataset. There is no need to spend your evening crafting your own set of data in MySQL or, god forbid, Excel. Almost anything, from COVID-19 stats to Harry Potter spells (made it myself!), exists in the form of a database. You just need to find it.

Let me help you — in this post, you will learn where to find datasets for machine learning research.

Top general ML dataset aggregators

Dataset aggregators collect thousands of databases for various purposes.

1. Kaggle

Kaggle, which enthusiasts update every day, has one of the largest dataset libraries online.

Kaggle is a community-driven machine learning platform. It contains plenty of tutorials that cover hundreds of different real-life ML problems. It is true that quality may vary. However, all the data is completely free. You can also upload your own dataset there.

2. Google Dataset Search

Dataset Search is a reliable source of information for your research. It is convenient to filter or sort datasets by:

relevance,
file format,
license type,
theme,
time of last update.

The datasets here are published by organizations such as the World Health Organization, Statista, and Harvard.

3. Registry of Open Data on AWS

In the Registry of Open Data on AWS, anyone can share a dataset or find the one they need. You can analyze the data you find with the help of Amazon data analytics tools. Among the dataset providers, you will find Facebook Data for Good, NASA Space Act Agreement, and Space Telescope Science Institute.

4. Microsoft Azure Public Datasets

Azure Public Datasets offers regularly updated databases for app developers and researchers. They contain U.S. Government data, other statistical and scientific data, and online service information that Microsoft collects about its users.

Moreover, Azure offers a collection of tools that help you create cloud databases of your own, migrate your SQL workloads to Azure while maintaining complete SQL Server compatibility, and build data-driven mobile and web applications.

5. r/datasets

In the datasets subreddit, anyone can publish their open-source databases. You can go there, find a cool dataset, and try to do something nice with it.

6. UCI Machine Learning Repository

UCI offers 507 datasets that cover bank marketing, car evaluation, lung cancer diagnosis, and many other subjects. You can sort the databases by:

default task,
data type,
area of application,
subject.

7. CMU Libraries

Carnegie Mellon University has its own collection of public datasets that you can use for your own research. There you will find insightful databases about American culture, music, and history that other aggregators don’t provide.

8. Awesome Public Datasets on GitHub

This is a great open-source collection of the best datasets available online, divided by industry. I mention some of the libraries you can find there later in this post.

Best public datasets for machine learning and data science

Domain-specific databases for real machine learning enthusiasts.

Exploratory analysis

Before you change the world with your ML research, it can be fun just to practice. Here are some datasets that you can use for exploratory analysis. This is the practice of studying the data by trying to find patterns and anomalies and using this information to build ML models.

Million Song Dataset can be used for exploratory analysis and building recommender systems. The full database is 280 GB, but for test research you can also download a smaller subset of just 10,000 songs, which is around 2 GB.
Game of Thrones dataset by Myles O’Neill on Kaggle will interest you if you’re a fan of George R.R. Martin’s A Song of Ice and Fire book series. It explores the deaths and battles of this fantasy world.
LEGO Database by Rachael Tatman describes all the official LEGO parts and sets, their colors, and inventories.
UFO Sightings by the National UFO Reporting Center contains reports of unidentified flying object sightings over the last century.
World University Rankings by Myles O’Neill covers the world’s top universities and provides information about their rank for quality of education, alumni employment, influence, and other factors.
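
The exploratory analysis described above (looking for patterns and anomalies before building a model) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the `durations` values are made up and do not come from any of the listed datasets, and the helper names `summarize` and `iqr_outliers` are hypothetical. The outlier check uses Tukey's interquartile-range rule, which is one common choice among many.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical song durations in seconds; the last value is an obvious anomaly.
durations = [201, 187, 245, 230, 198, 215, 222, 209, 1800]

print(summarize(durations))
print(iqr_outliers(durations))  # → [1800]
```

On a real dataset you would run the same kind of check per column and then decide whether a flagged value is an error or a genuinely interesting case.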

Deep learning

Deep learning is based on using artificial neural networks to solve tasks. Rather than writing an algorithm for the task, the programmer uses representation learning and allows the machine to make predictions by itself.

Image processing and object recognition for computer vision

Google’s Open Images Dataset is very diverse and contains complex samples with several objects per image. It provides object bounding boxes, object segmentation, and labels to help you navigate more than 9 million images.
VisualData is an aggregator of computer vision datasets where you can find medical datasets for machine learning, image datasets, and other cool machine learning data samples for business, educational, and other types of ML research.
xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.
If you are looking for a quality large-scale deep learning dataset, take a look at Kinetics-700. It has video clips of different human-object and human-human interactions divided into classes.
ImageNet is a set of images for deep computer vision with more than 1000 different classes, organized according to the WordNet hierarchy.
Visual QA contains open-ended questions about 265,016 images. It can be used to better understand computer vision modeling and language processing.
The MNIST database is a collection of samples for handwritten digit recognition. It contains a training set of 60,000 examples and a test set of 10,000. On the website, you will also find a table that compares the effectiveness of different types of classifiers applied to this dataset. Even a beginner can use MNIST to train their deep learning model.
CIFAR-10 is a collection of images for training deep learning computer vision algorithms. The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images in each class. If this is not enough, try the CIFAR-100 dataset.
COCO is a regularly updated database for object segmentation and recognition in context, sponsored by Microsoft, Facebook, and Mighty AI.
Labeled Faces in the Wild is a dataset for training and testing face recognition models.
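
To make the MNIST entry above a little more concrete, here is a minimal sketch of reading the IDX binary layout in which the MNIST files are distributed, using only the standard library. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the sample bytes are synthetic stand-ins rather than real MNIST data. The sketch assumes the file has already been decompressed (the published files are gzipped, so in practice you would read them through `gzip.open`).

```python
import struct


def parse_idx_images(raw: bytes):
    """Parse IDX3 image data into a list of images (each a list of pixel rows)."""
    # Header: four big-endian unsigned 32-bit ints.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:  # 0x00000803 marks an unsigned-byte, 3-dimensional IDX file
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header dimensions")
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


def parse_idx_labels(raw: bytes):
    """Parse IDX1 label data into a list of integer labels."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:  # 0x00000801 marks an unsigned-byte, 1-dimensional IDX file
        raise ValueError(f"unexpected magic number {magic}")
    return list(raw[8:8 + count])


# Synthetic stand-in: two 2x2 "images" with labels 7 and 3 (not real MNIST data).
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(images[0], labels)  # → [[0, 255], [128, 64]] [7, 3]
```

Most deep learning frameworks ship their own MNIST loaders, so a hand-written parser like this is mainly useful for understanding what those loaders do.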

Natural language processing, text-to-speech, and speech generation

Making robots and voice interfaces is impossible without speech corpora. Use these datasets to build your solutions.

Audio

VoxCeleb is an audio collection that you can use for deep learning tasks such as real-time natural language processing, voice recognition, and speech generation.
On LibriSpeech, you will find about 1000 hours of 16kHz read English speech derived from audiobooks.
Free Spoken Digit Dataset consists of spoken digit recordings at 8kHz that are precisely trimmed, with near minimal silence at the beginnings and ends. The dataset is open-source.
Common Voice is an initiative by Mozilla that contains hundreds of thousands of recordings of human voices. Every visitor to the Common Voice website can contribute to the open human speech database by recording their own voice.
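
Before downloading a speech corpus, it helps to estimate how much raw audio the figures above imply. The sketch below does that arithmetic for the roughly 1000 hours at 16kHz mentioned for LibriSpeech. The 16-bit mono PCM format is an assumption made for illustration, not something stated above, and the helper name `pcm_size_bytes` is hypothetical. Compressed downloads will be smaller than this estimate.

```python
def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Estimate the size of uncompressed PCM audio.

    Defaults assume 16-bit mono audio, which is an assumption for this sketch.
    """
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


size = pcm_size_bytes(hours=1000, sample_rate_hz=16_000)
print(f"{size / 1e9:.1f} GB")  # → 115.2 GB
```

The same function with `sample_rate_hz=8_000` gives a rough figure for 8kHz recordings such as those in the Free Spoken Digit Dataset.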

Check out the post by Christopher Dossman on Medium for more audio datasets of different kinds (it even has an Arabic corpus!).

Text

WordNet is a lexical database in which words of different parts of speech are grouped into sets of synonyms. Such a structure makes it a fantastic tool for natural language processing and linguistic research.
20 Newsgroups is a dataset that consists of 18,000+ text documents from 20 different newsgroups including sports, technology, art, entertainment, etc.
Sentiment140 is a dataset of tweets that can be used for sentiment analysis or TTS.
On IMDB Reviews, you will find 50,000+ raw and preprocessed movie reviews for sentiment analysis with deep learning.
Yelp Reviews contains user reviews, business information, and images that you can use for personal and academic purposes.
The Wikipedia Corpus is a huge set of written English texts with more than 4.5 million articles.
If you are looking for a segmented text corpus where samples are grouped by the age of the writers, use The Blog Authorship Corpus. It contains posts by around 20,000 bloggers collected from blogger.com in 2004.

Other video and audio databases for deep learning

YouTube 8M has more than 6 million videos, human-verified labels, and about 2.6 billion audio and visual features.
AudioSet by Google offers millions of labeled 10-second sound clips selected from YouTube videos.
On FSB, you will find a multitude of sound samples ranging from human and animal sounds to music and mechanical noise.
Free Music Archive is a dataset for music analysis.

Industry-specific datasets

It’s impossible to cover every area where ML can be successfully applied. But I’ve collected some examples below to give you some ideas.

MIMIC-III is an open-source anonymized dataset of health data from more than 40,000 critical care patients. Among the covered parameters are demographics, vital signs, laboratory tests, and medication intake.
Google-Landmarks can be applied to landmark recognition and retrieval.
To understand the stock market, it can be very useful to build AI software. EOD Stock Prices stores historical data about end-of-day stock prices, dividends, and splits for US stocks.
The Boston Housing Dataset contains data about housing in the Boston, Massachusetts area.
Restaurants Health Score in San Francisco, developed by the local Health Department, provides interesting material for researchers interested in public health and the restaurant business.
For information about home prices and rents by size, type, and tier in the USA, visit the Zillow Real Estate Research website.
The World Bank Global Education Statistics Dataset contains data about 4,000+ internationally comparable indicators for education access and progress.
Quandl is the resource to go to if you are looking for financial and economic datasets for investment professionals.

There are so many datasets that the opportunities for ML research are truly endless. Explore Kaggle, Google Dataset Search, and other resources from the list to find what intrigues you.

Translated from: https://medium.com/swlh/where-to-find-awesome-machine-learning-datasets-6bb909a3f350