List of datasets for machine-learning research

Face recognition

In computer vision, face images have been used extensively to develop face recognition and face detection systems, among many other applications.

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Face Recognition Technology (FERET) 11338 images of 1199 individuals in different positions and at different times. None. 11,338 Images Classification, face recognition 2003 [6][7] United States Department of Defense
CMU Pose, Illumination, and Expression (PIE) 41,368 color images of 68 people in 13 different poses. Images labeled with expressions. 41,368 Images, text Classification, face recognition 2000 [8][9] R. Gross et al.
SCFace Color images of faces at various angles. Location of facial features extracted. Coordinates of features given. 4,160 Images, text Classification, face recognition 2011 [10][11] M. Grgic et al.
YouTube Faces DB Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. Identity of those appearing in videos and descriptors. 3,425 videos Video, text Video classification, face recognition 2011 [12][13] L. Wolf et al.
300 videos in-the-Wild 114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame. None 114 videos, 218,000 frames. Video, annotation file. Facial landmark tracking. 2015 [14] Shen, Jie et al.
Grammatical Facial Expressions Dataset Grammatical Facial Expressions from Brazilian Sign Language. Microsoft Kinect features extracted. 27,965 Text Facial gesture recognition 2014 [15] F. Freitas et al.
CMU Face Images Dataset Images of faces. Each person is photographed multiple times to capture different expressions. Labels and features. 640 Images, Text Face recognition 1999 [16][17] T. Mitchell
Yale Face Database Faces of 15 individuals in 11 different expressions. Labels of expressions. 165 Images Face recognition 1997 [18][19] J. Yang et al.
Cohn-Kanade AU-Coded Expression Database Large database of images with labels for expressions. Tracking of certain facial features. 500+ sequences Images, text Facial expression analysis 2000 [20][21] T. Kanade et al.
FaceScrub Images of public figures scrubbed from image searching. Name and m/f annotation. 107,818 Images, text Face recognition 2014 [22][23] H. Ng et al.
BioID Face Database Images of faces with eye positions marked. Manually set eye positions. 1521 Images, text Face recognition 2001 [24][25] BioID
Skin Segmentation Dataset Randomly sampled color values from face images. B, G, R, values extracted. 245,057 Text Segmentation, classification 2012 [26][27] R. Bhatt.
Bosphorus 3D Face image database. 34 action units and 6 expressions labeled; 24 facial landmarks labeled. 4652 Images, text Face recognition, classification 2008 [28][29] A Savran et al.
UOY 3D-Face neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. labeling. 5250 Images, text Face recognition, classification 2004 [30][31] University of York
CASIA Expressions: Anger, smile, laugh, surprise, closed eyes. None. 4624 Images, text Face recognition, classification 2007 [32][33] Institute of Automation, Chinese Academy of Sciences
CASIA Expressions: Anger Disgust Fear Happiness Sadness Surprise None. 480 Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second Face recognition, classification 2011 [34] Zhao, G. et al.
BU-3DFE neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. None. 2500 Images, text Facial expression recognition, classification 2006 [35] Binghamton University
Face Recognition Grand Challenge Dataset Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. None. 4007 Images, text Face recognition, classification 2004 [36][37] National Institute of Standards and Technology
Gavabdb Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. None. 549 Images, text Face recognition, classification 2008 [38][39] King Juan Carlos University
3D-RMA Up to 100 subjects, expressions mostly neutral. Several poses as well. None. 9971 Images, text Face recognition, classification 2004 [40][41] Royal Military Academy (Belgium)

Action recognition

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Human Motion DataBase (HMDB51) 51 action categories, each containing at least 101 clips, extracted from a range of sources. None. 6,766 video clips video clips Action classification 2011 [42] H. Kuehne et al.
TV Human Interaction Dataset Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. None. 6,766 video clips video clips Action prediction 2013 [43] Patron-Perez, A. et al.
UT Interaction People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip. None. 120 video clips video clips Action prediction 2009 [44] Ryoo, M. S. et al.
UT Kinect 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. None. 200 video clips with depth information at 15 frames per second video clips with depth information Action classification 2012 [45] Xia, L. et al.
SBU Interact Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. None. Around 300 interactions video clips with depth information Action classification 2012 [46] Yun, K. et al.
Berkeley Multimodal Human Action Database (MHAD) Recordings of a single person performing 12 actions MoCap pre-processing 660 action samples 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones Action classification 2013 [47] Ofli, F. et al.
UCF 101 Dataset Self described as “a dataset of 101 human actions classes from videos in the wild.” Dataset is large with over 27 hours of video. Actions classified and labeled. 13,000 Video, images, text Classification, action detection 2012 [48][49] K. Soomro et al.
THUMOS Dataset Large video dataset for action classification. Actions classified and labeled. 45M frames of video Video, images, text Classification, action detection 2013 [50][51] Y. Jiang et al.
Activitynet Large video dataset for activity recognition and detection. Actions classified and labeled. 10,024 Video, images, text Classification, action detection 2015 [52] Heilbron et al.
MSP-AVATAR Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. Actions classified and labeled. 74 sessions Motion-captured video, audio Classification, action detection 2015 [53] Sadoughi, N. et al.
LILiR Twotalk Corpus Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. Actions classified and labeled. 527 Video Action detection 2011 [54] Sheerman-Chase et al.

Object detection & recognition

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
DAVIS: Densely Annotated VIdeo Segmentation 150 video sequences containing 10459 frames with a total of 376 objects annotated. Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation. 10,459 Frames annotated Video object segmentation 2017 [55] Pont-Tuset, J. et al.
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses. 49,000 RGB-D images, 3D object models 6D object pose estimation, object detection 2017 [56] T. Hodan et al.
Berkeley 3-D Object Dataset 849 images taken in 75 different scenes. About 50 different object classes are labeled. Object bounding boxes and labeling. 849 labeled images, text Object recognition 2014 [57][58] A. Janoch et al.
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. Each image segmented by five different subjects on average. 500 Segmented images Contour detection and hierarchical image segmentation 2011 [59] University of California, Berkeley
Microsoft Common Objects in Context (COCO) complex everyday scenes of common objects in their natural context. Object highlighting, labeling, and classification into 91 object types. 2,500,000 Labeled images, text Object recognition 2015 [60][61] T. Lin et al.
SUN Database Very large scene and object recognition database. Places and objects are labeled. Objects are segmented. 131,067 Images, text Object recognition, scene recognition 2014 [62][63] J. Xiao et al.
ImageNet Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge Labeled objects, bounding boxes, descriptive words, SIFT features 14,197,122 Images, text Object recognition, scene recognition 2014 [64][65] J. Deng et al.
TV News Channel Commercial Detection Dataset TV commercials and news broadcasts. Audio and video features extracted from still images. 129,685 Text Clustering, classification 2015 [66][67] P. Guha et al.
Statlog (Image Segmentation) Dataset The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. Many features calculated. 2310 Text Classification 1990 [68] University of Massachusetts
Caltech 101 Pictures of objects. Detailed object outlines marked. 9146 Images Classification, object recognition. 2003 [69][70] F. Li et al.
Caltech-256 Large dataset of images for object classification. Images categorized and hand-sorted. 30,607 Images, Text Classification, object detection 2007 [71][72] G. Griffin et al.
SIFT10M Dataset SIFT features of Caltech-256 dataset. Extensive SIFT feature extraction. 11,164,866 Text Classification, object detection 2016 [73] X. Fu et al.
LabelMe Annotated pictures of scenes. Objects outlined. 187,240 Images, text Classification, object detection 2005 [74] MIT Computer Science and Artificial Intelligence Laboratory
Cityscapes Dataset Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. Pixel-level segmentation and labeling 25,000 Images, text Classification, object detection 2016 [75] Daimler AG et al.
PASCAL VOC Dataset Large number of images for classification tasks. Labeling, bounding box included 500,000 Images, text Classification, object detection 2010 [76][77] M. Everingham et al.
CIFAR-10 Dataset Many small, low-resolution, images of 10 classes of objects. Classes labelled, training set splits created. 60,000 Images Classification 2009 [65][78] A. Krizhevsky et al.
CIFAR-100 Dataset Like CIFAR-10, above, but 100 classes of objects are given. Classes labelled, training set splits created. 60,000 Images Classification 2009 [65][78] A. Krizhevsky et al.
German Traffic Sign Detection Benchmark Dataset Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. Signs manually labeled 900 Images Classification 2013 [79][80] S Houben et al.
KITTI Vision Benchmark Dataset Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. Many benchmarks extracted from data. >100 GB of data Images, text Classification, object detection 2012 [81][82] A Geiger et al.

Handwriting and character recognition

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Artificial Characters Dataset Artificially generated data describing the structure of 10 capital English letters. Coordinates of lines drawn given as integers. Various other features. 6000 Text Handwriting recognition, classification 1992 [83] H. Guvenir et al.
Letter Dataset Upper case printed letters. 17 features are extracted from all images. 20,000 Text OCR, classification 1991 [84][85] D. Slate et al.
Character Trajectories Dataset Labeled samples of pen tip trajectories for people writing simple characters. 3-dimensional pen tip velocity trajectory matrix for each sample 2858 Text Handwriting recognition, classification 2008 [86][87] B. Williams
Chars74K Dataset Character recognition in natural images of symbols used in both English and Kannada 74,107 Character recognition, handwriting recognition, OCR, classification 2009 [88] T. de Campos
UJI Pen Characters Dataset Isolated handwritten characters Coordinates of pen position as characters were written given. 11,640 Text Handwriting recognition, classification 2009 [89][90] F. Prat et al.
Gisette Dataset Handwriting samples from the often-confused 4 and 9 characters. Features extracted from images, split into train/test, handwriting images size-normalized. 13,500 Images, text Handwriting recognition, classification 2003 [91] Yann LeCun et al.
MNIST Database Database of handwritten digits. Hand-labeled. 60,000 Images, text Classification 1998 [92][93] National Institute of Standards and Technology
Optical Recognition of Handwritten Digits Dataset Normalized bitmaps of handwritten data. Size normalized and mapped to bitmaps. 5620 Images, text Handwriting recognition, classification 1998 [94] E. Alpaydin et al.
Pen-Based Recognition of Handwritten Digits Dataset Handwritten digits on electronic pen-tablet. Feature vectors extracted to be uniformly spaced. 10,992 Images, text Handwriting recognition, classification 1998 [95][96] E. Alpaydin et al.
Semeion Handwritten Digit Dataset Handwritten digits from 80 people. All handwritten digits have been normalized for size and mapped to the same grid. 1593 Images, text Handwriting recognition, classification 2008 [97] T. Srl
HASYv2 Handwritten mathematical symbols All symbols are centered and of size 32px x 32px. 168233 Images, text Classification 2017 [98] Martin Thoma

Aerial images

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Aerial Image Segmentation Dataset 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. Images manually segmented. 80 Images Aerial Classification, object detection 2013 [99][100] J. Yuan et al.
KIT AIS Data Set Multiple labeled training and evaluation datasets of aerial images of crowds. Images manually labeled to show paths of individuals through crowds. ~ 150 Images with paths People tracking, aerial tracking 2012 [101][102] M. Butenuth et al.
Wilt Dataset Remote sensing data of diseased trees and other land cover. Various features extracted. 4899 Images Classification, aerial object detection 2014 [103][104] B. Johnson
Forest Type Mapping Dataset Satellite imagery of forests in Japan. Image wavelength bands extracted. 326 Text Classification 2015 [105][106] B. Johnson
Overhead Imagery Research Data Set Annotated overhead imagery. Images with multiple objects. Over 30 annotations and over 60 statistics that describe the target within the context of the image. 1000 Images, text Classification 2009 [107][108] F. Tanner et al.

Other images

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
MPII Cooking Activities Dataset Videos and images of various cooking activities. Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. 881,755 frames Labeled video, images, text Classification 2012 [109][110] M. Rohrbach et al.
Stanford Dogs Dataset Images of 120 breeds of dogs from around the world. Train/test splits and ImageNet annotations provided. 20,580 Images, text Fine-grain classification 2011 [111][112] A. Khosla et al.
The Oxford-IIIT Pet Dataset 37 categories of pets with roughly 200 images of each. Breed labeled, tight bounding box, foreground-background segmentation. ~ 7,400 Images, text Classification, object detection 2012 [112][113] O. Parkhi et al.
Corel Image Features Data Set Database of images with features extracted. Many features including color histogram, co-occurrence texture, and color moments. 68,040 Text Classification, object detection 1999 [114][115] M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset. Transcoding times for various different videos and video properties. Video features given. 168,286 Text Regression 2015 [116] T. Deneke et al.
Microsoft Sequential Image Narrative Dataset (SIND) Dataset for sequential vision-to-language Descriptive caption and storytelling given for each photo, and photos are arranged in sequences 81,743 Images, text Visual storytelling 2016 [117] Microsoft Research
Caltech-UCSD Birds-200-2011 Dataset Large dataset of images of birds. Part locations for birds, bounding boxes, 312 binary attributes given 11,788 Images, text Classification 2011 [118][119] C. Wah et al.
YouTube-8M Large and diverse labeled video dataset YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities 8 million Video, text Video classification 2016 [120][121] S. Abu-El-Haija et al.
YFCC100M Large and diverse labeled image and video dataset Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) 100 million Video, Image, Text Video and Image classification 2016 [122][123] B. Thomee et al.
Discrete LIRIS-ACCEDE Short videos annotated for valence and arousal. Valence and arousal labels. 9800 Video Video emotion elicitation detection 2015 [124] Y. Baveye et al.
Continuous LIRIS-ACCEDE Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. Valence and arousal labels. 30 Video Video emotion elicitation detection 2015 [125] Y. Baveye et al.
MediaEval LIRIS-ACCEDE Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. Violence, valence and arousal labels. 10900 Video Video emotion elicitation detection 2015 [126] Y. Baveye et al.

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Amazon reviews US product reviews from Amazon.com. None. ~ 82M Text Classification, sentiment analysis 2015 [127] McAuley et al.
OpinRank Review Dataset Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. None. 42,230 / ~259,000 respectively Text Sentiment analysis, clustering 2011 [128][129] K. Ganesan et al.
MovieLens 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. None. ~ 22M Text Regression, clustering, classification 2016 [130] GroupLens Research
Yahoo! Music User Ratings of Musical Artists Over 10M ratings of artists by Yahoo users. None described. ~ 10M Text Clustering, regression 2004 [131][132] Yahoo!
Car Evaluation Data Set Car properties and their overall acceptability. Six categorical features given. 1728 Text Classification 1997 [133][134] M. Bohanec
YouTube Comedy Slam Preference Dataset User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. Video metadata given. 1,138,562 Text Classification 2012 [135][136] Google
Skytrax User Reviews Dataset User reviews of airlines, airports, seats, and lounges from Skytrax. Ratings are fine-grain and include many aspects of airport experience. 41396 Text Classification, regression 2015 [137] Q. Nguyen
Teaching Assistant Evaluation Dataset Teaching assistant reviews. Features of each instance such as class, class size, and instructor are given. 151 Text Classification 1997 [138][139] W. Loh et al.

News articles

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
NYSK Dataset English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. Filtered and presented in XML format. 10,421 XML, text Sentiment analysis, topic extraction 2013 [140] Dermouche, M. et al.
The Reuters Corpus Volume 1 Large corpus of Reuters news stories in English. Fine-grain categorization and topic codes. 810,000 Text Classification, clustering, summarization 2002 [141] Reuters
The Reuters Corpus Volume 2 Large corpus of Reuters news stories in multiple languages. Fine-grain categorization and topic codes. 487,000 Text Classification, clustering, summarization 2005 [142] Reuters
Thomson Reuters Text Research Collection Large corpus of news stories. Details not described. 1,800,370 Text Classification, clustering, summarization 2009 [143] T. Rose et al.
Saudi Newspapers Corpus 31,030 Arabic newspaper articles. Metadata extracted. 31,030 JSON Summarization, clustering 2015 [144] M. Alhagri

Messages

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Enron Email Dataset Emails from employees at Enron organized into folders. Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. ~ 500,000 Text Network analysis, sentiment analysis 2004 (2015) [145][146] Klimt, B. and Y. Yang
Ling-Spam Dataset Corpus containing both legitimate and spam emails. Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. Text Classification 2000 [147][148] Androutsopoulos, J. et al.
SMS Spam Collection Dataset Collected SMS spam messages. None. 5574 Text Classification 2011 [149][150] T. Almeida et al.
Twenty Newsgroups Dataset Messages from 20 different newsgroups. None. 20,000 Text Natural language processing 1999 [151] T. Mitchell et al.
Spambase Dataset Spam emails. Many text features extracted. 4601 Text Spam detection, classification 1999 [152] M. Hopkins et al.

Twitter and tweets

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Sentiment140 Tweet data from 2009 including original text, time stamp, user and sentiment. Classified using distant supervision from presence of emoticon in tweet. 1,578,627 Tweets, comma-separated values Sentiment analysis 2009 [153][154] A. Go et al.
ASU Twitter Dataset Twitter network data, not actual tweets. Shows connections between a large number of users. None. 11,316,811 users, 85,331,846 connections Text Clustering, graph analysis 2009 [155][156] R. Zafarani et al.
SNAP Social Circles: Twitter Database Large twitter network data. Node features, circles, and ego networks. 1,768,149 Text Clustering, graph analysis 2012 [157][158] J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis Arabic tweets. Samples hand-labeled as positive or negative. 2000 Text Classification 2014 [159][160] N. Abdulla
Buzz in Social Media Dataset Data from Twitter and Tom’s Hardware. This dataset focuses on specific buzz topics being discussed on those sites. Data is windowed so that the user can attempt to predict the events leading up to social media buzz. 140,000 Text Regression, Classification 2013 [161][162] F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT) This dataset focuses on whether tweets have (almost) same meaning/information or not. Manually labeled. tokenization, part-of-speech and named entity tagging 18,762 Text Regression, Classification 2015 [163][164] Xu et al.
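
The Sentiment140 entry above describes labels obtained by distant supervision from the presence of emoticons. A minimal sketch of that general idea follows. The emoticon lists and the function names `label_by_emoticon` and `strip_emoticons` are illustrative assumptions, not the lists or code used by the dataset's authors.

```python
from typing import Optional

# Illustrative emoticon lists (assumed, not taken from the dataset's authors).
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")


def label_by_emoticon(tweet: str) -> Optional[str]:
    """Return a noisy sentiment label, or None if the tweet is ambiguous."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no emoticon at all or conflicting emoticons: skip the tweet.
        return None
    return "positive" if has_pos else "negative"


def strip_emoticons(tweet: str) -> str:
    """Remove the emoticons so a classifier cannot simply memorize them."""
    for e in POSITIVE + NEGATIVE:
        tweet = tweet.replace(e, "")
    # Collapse any whitespace left behind by the removal.
    return " ".join(tweet.split())
```

Labels produced this way are noisy by construction, so a sketch like this is usually paired with a smaller hand-labeled set for evaluation.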

Other text

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Legal Case Reports Federal Court of Australia cases from 2006–2009. None. 4,000 Text Summarization, citation analysis 2012 [165][166] F. Galgani et al.
Blogger Authorship Corpus Blog entries of 19,320 people from blogger.com. Blogger self-provided gender, age, industry, and astrological sign. 681,288 Text Sentiment analysis, summarization, classification 2006 [167][168] J. Schler et al.
Social Structure of Facebook Networks Large dataset of the social structure of Facebook. None. 100 colleges covered Text Network analysis, clustering 2012 [169][170] A. Traud et al.
Dataset for the Machine Comprehension of Text Stories and associated questions for testing comprehension of text. None. 660 Text Natural language processing, machine comprehension 2013 [171][172] M. Richardson et al.
The Penn Treebank Project Naturally occurring text annotated for linguistic structure. Text is parsed into semantic trees. ~ 1M words Text Natural language processing, summarization 1995 [173][174] M. Marcus et al.
DEXTER Dataset Task given is to determine, from features given, which articles are about corporate acquisitions. Features extracted include word stems. Distractor features included. 2600 Text Classification 2008 [175] Reuters
Google Books N-grams N-grams from a very large corpus of books None. 2.2 TB of text Text Classification, clustering, regression 2011 [176][177] Google
Personae Corpus Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. In addition to normal texts, syntactically annotated texts are given. 145 Text Classification, regression 2008 [178][179] K. Luyckx et al.
CNAE-9 Dataset Categorization task for free text descriptions of Brazilian companies. Word frequency has been extracted. 1080 Text Classification 2012 [180][181] P. Ciarelli et al.
Sentiment Labeled Sentences Dataset 3000 sentiment labeled sentences. Sentiment of each sentence has been hand labeled as positive or negative. 3000 Text Classification, sentiment analysis 2015 [182][183] D. Kotzias
BlogFeedback Dataset Dataset to predict the number of comments a post will receive based on features of that post. Many features of each post extracted. 60,021 Text Regression 2014 [184][185] K. Buza
Stanford Natural Language Inference (SNLI) Corpus Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. Entailment class labels, syntactic parsing by the Stanford PCFG parser 570,000 Text Natural language inference/recognizing textual entailment 2015 [186] S. Bowman et al.

Sound data

Datasets of sounds and sound features.

Speech

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Zero Resource Speech Challenge 2015 Spontaneous speech (English), Read speech (Xitsonga). raw wav English: 5h, 12 speakers; Xitsonga: 2h30; 24 speakers sound Unsupervised discovery of speech features/subword units/word units 2015 [187][188] www.zerospeech.com/2015 Versteegh et al.
Parkinson Speech Dataset Multiple recordings of people with and without Parkinson’s Disease. Voice features extracted, disease scored by physician using unified Parkinson’s disease rating scale 1,040 Text Classification, regression 2013 [189][190] B. E. Sakar et al.
Spoken Arabic Digits Spoken Arabic digits from 44 male and 44 female. Time-series of mel-frequency cepstrum coefficients. 8,800 Text Classification 2010 [191][192] M. Bedda et al.
ISOLET Dataset Spoken letter names. Features extracted from sounds. 7797 Text Classification 1994 [193][194] R. Cole et al.
Japanese Vowels Dataset Nine male speakers uttered two Japanese vowels successively. Applied 12-degree linear prediction analysis to it to obtain a discrete-time series with 12 cepstrum coefficients. 640 Text Classification 1999 [195][196] M. Kudo et al.
Parkinson’s Telemonitoring Dataset Multiple recordings of people with and without Parkinson’s Disease. Sound features extracted. 5875 Text Classification 2009 [197][198] A. Tsanas et al.
TIMIT Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. Speech is lexically and phonemically transcribed. 6300 Text Speech recognition, classification. 1986 [199][200] J. Garofolo et al.
Arabic Speech Corpus A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level Speech is orthographically and phonetically transcribed with stress marks. ~1900 Text, WAV Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education. 2016 [201] N. Halabi

Music

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Geographical Original of Music Data Set Audio features of music samples from different locations. Audio features extracted using MARSYAS software. 1,059 Text Geographical classification, clustering 2014 [202][203] F. Zhou et al.
Million Song Dataset Audio features from one million different songs. Audio features extracted. 1M Text Classification, clustering 2011 [204][205] T. Bertin-Mahieux et al.
Free Music Archive Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. Raw audio and audio features. 106,574 Text, MP3 Classification, recommendation 2017 [206] M. Defferrard et al.
Bach Choral Harmony Dataset Bach chorale chords. Audio features extracted. 5665 Text Classification 2014 [207][208] D. Radicioni et al.

Other sounds

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
UrbanSound Labeled sound recordings of sounds like air conditioners, car horns and children playing. Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. 1,059 Sound (WAV) Classification 2014 [209][210] J. Salamon et al.

Signal data

Datasets containing electrical signal information that requires signal processing for further analysis.

Electrical

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Witty Worm Dataset Dataset detailing the spread of the Witty worm and the infected computers. Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. 55,909 IP addresses Text Classification 2004 [211][212] Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset Cleaned vital signals from human patients which can be used to estimate blood pressure. 125 Hz vital signs have been cleaned. 12,000 Text Classification, regression 2015 [213][214] M. Kachuee et al.
Gas Sensor Array Drift Dataset Measurements from 16 chemical sensors utilized in simulations for drift compensation. Extensive number of features given. 13,910 Text Classification 2012 [215][216] A. Vergara
Servo Dataset Data covering the nonlinear relationships observed in a servo-amplifier circuit. Levels of various components as a function of other components are given. 167 Text Regression 1993 [217][218] K. Ullrich
UJIIndoorLoc-Mag Dataset Indoor localization database to test indoor positioning systems. Data is magnetic field based. Train and test splits given. 40,000 Text Classification, regression, clustering 2015 [219][220] D. Rambla et al.
Sensorless Drive Diagnosis Dataset Electrical signals from motors with defective components. Statistical features extracted. 58,508 Text Classification 2015 [221][222] M. Bator

Motion-tracking

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) People performing five standard actions while wearing motion trackers. None. 165,632 Text Classification 2013 [223][224] Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset Features extracted from video of people doing various gestures. Features extracted aim at studying gesture phase segmentation. 9900 Text Classification, clustering 2014 [225][226] R. Madeo et al.
Vicon Physical Action Data Set Dataset 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. Many parameters recorded by 3D tracker. 3000 Text Classification 2011 [227][228] T. Theodoridis
Daily and Sports Activities Dataset Motor sensor data for 19 daily and sports activities. Many sensors given, no preprocessing done on signals. 9120 Text Classification 2013 [229][230] B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. Actions performed are labeled, all signals preprocessed for noise. 10,299 Text Classification 2012 [231][232] J. Reyes-Ortiz et al.
Australian Sign Language Signs Australian sign language signs captured by motion-tracking gloves. None. 2565 Text Classification 2002 [233][234] M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units Five variations of the biceps curl exercise monitored with IMUs. Some statistics calculated from raw data. 39,242 Text Classification 2013 [235][236] W. Ugulino et al.
sEMG for Basic Hand movements Dataset Two databases of surface electromyographic signals of 6 hand movements. None. 3000 Text Classification 2014 [237][238] C. Sapsanis et al.
REALDISP Activity Recognition Dataset Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. None. 1419 Text Classification 2014 [238][239] O. Banos et al.
Heterogeneity Activity Recognition Dataset Data from multiple different smart devices for humans performing various activities. None. 43,930,257 Text Classification, clustering 2015 [240][241] A. Stisen et al.
Indoor User Movement Prediction from RSS Data Temporal wireless network data that can be used to track the movement of people in an office. None. 13,197 Text Classification 2016 [242][243] D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. None. 3,850,505 Text Classification 2012 [244] A. Reiss
OPPORTUNITY Activity Recognition Dataset Human Activity Recognition from wearable, object, and ambient sensors is a dataset devised to benchmark human activity recognition algorithms. None. 2551 Text Classification 2012 [245][246] D. Roggen et al.
Real World Activity Recognition Dataset Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. None. 3,150,000 (per sensor) Text Classification 2016 [247] T. Sztyler et al.

Other signals

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Wine Dataset Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. 13 properties of each wine are given. 178 Text Classification, regression 1991 [248][249] M. Forina et al.
Combined Cycle Power Plant Data Set Data from various sensors within a power plant running for 6 years. None 9568 Text Regression 2014 [250][251] P. Tufekci et al.

Physical data

Datasets from physical systems.

High-energy physics

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
HIGGS Dataset Monte Carlo simulations of particle accelerator collisions. 28 features of each collision are given. 11M Text Classification 2014 [252][253][254] D. Whiteson
HEPMASS Dataset Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. 28 features of each collision are given. 10,500,000 Text Classification 2016 [253][254][255] D. Whiteson

Systems

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Yacht Hydrodynamics Dataset Yacht performance based on dimensions. Six features are given for each yacht. 308 Text Regression 2013 [256][257] R. Lopez
Robot Execution Failures Dataset 5 data sets that center around robotic failure to execute common tasks. Integer valued features such as torque and other sensor measurements. 463 Text Classification 1999 [258] L. Seabra et al.
Pittsburgh Bridges Dataset Design description is given in terms of several properties of various bridges. Various bridge features are given. 108 Text Classification 1990 [259][260] Y. Reich et al.
Automobile Dataset Data about automobiles, their insurance risk, and their normalized losses. Car features extracted. 205 Text Regression 1987 [261][262] J. Schlimmer et al.
Auto MPG Dataset MPG data for cars. Eight features of each car given. 398 Text Regression 1993 [263] Carnegie Mellon University
Energy Efficiency Dataset Heating and cooling requirements given as a function of building parameters. Building parameters given. 768 Text Classification, regression 2012 [264][265] A. Xifara et al.
Airfoil Self-Noise Dataset A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. Data about frequency, angle of attack, etc., are given. 1503 Text Regression 2014 [266] R. Lopez
Challenger USA Space Shuttle O-Ring Dataset Attempt to predict O-ring problems given past Challenger data. Several features of each flight, such as launch temperature, are given. 23 Text Regression 1993 [267][268] D. Draper et al.
Statlog (Shuttle) Dataset NASA space shuttle datasets. Nine features given. 58,000 Text Classification 2002 [269] NASA

Astronomy

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Volcanoes on Venus – JARtool experiment Dataset Venus images returned by the Magellan spacecraft. Images are labeled by humans. not given Images Classification 1991 [270][271] M. Burl
MAGIC Gamma Telescope Dataset Monte Carlo generated high-energy gamma particle events. Numerous features extracted from the simulations. 19,020 Text Classification 2007 [271][272] R. Bock
Solar Flare Dataset Measurements of the number of certain types of solar flare events occurring in a 24-hour period. Many solar flare-specific features are given. 1389 Text Regression, classification 1989 [273] G. Bradshaw

Earth science

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Volcanoes of the World Volcanic eruption data for all known volcanic events on earth. Details such as region, subregion, tectonic setting, dominant rock type are given. 1535 Text Regression, classification 2013 [274] E. Venzke et al.
Seismic-bumps Dataset Seismic activities from a coal mine. Seismic activity was classified as hazardous or not. 2584 Text Classification 2013 [275][276] M. Sikora et al.

Other physical

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Concrete Compressive Strength Dataset Dataset of concrete properties and compressive strength. Nine features are given for each sample. 1030 Text Regression 2007 [277][278] I. Yeh
Concrete Slump Test Dataset Concrete slump flow given in terms of properties. Features of concrete given such as fly ash, water, etc. 103 Text Regression 2009 [279][280] I. Yeh
Musk Dataset Predict if a molecule, given the features, will be a musk or a non-musk. 168 features given for each molecule. 6598 Text Classification 1994 [281] Arris Pharmaceutical Corp.
Steel Plates Faults Dataset Steel plates of 7 different types. 27 features given for each sample. 1941 Text Classification 2010 [282] Semeion Research Center

Biological data

Datasets from biological systems.

Human

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
EEG Database Study to examine EEG correlates of genetic predisposition to alcoholism. Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. 122 Text Classification 1999 [283][284] H. Begleiter
P300 Interface Dataset Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. Split into four sessions for each subject. MATLAB code given. 1,224 Text Classification 2008 [285][286] U. Hoffman et al.
Heart Disease Data Set Attributes of patients with and without heart disease. 75 attributes given for each patient with some missing values. 303 Text Classification 1988 [287][288] A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset Dataset of features of breast masses. Diagnosis by a physician is given. 10 features for each sample are given. 569 Text Classification 1995 [289][290] W. Wolberg et al.
National Survey on Drug Use and Health Large scale survey on health and drug use in the United States. None. 55,268 Text Classification, regression 2012 [291] United States Department of Health and Human Services
Lung Cancer Dataset Lung cancer dataset without attribute definitions. 56 features are given for each case. 32 Text Classification 1992 [292][293] Z. Hong et al.
Arrhythmia Dataset Data for a group of patients, of which some have cardiac arrhythmia. 276 features for each instance. 452 Text Classification 1998 [294][295] H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset 9 years of readmission data across 130 US hospitals for patients with diabetes. Many features of each readmission are given. 100,000 Text Classification, clustering 2014 [296][297] J. Clore et al.
Diabetic Retinopathy Debrecen Dataset Features extracted from images of eyes with and without diabetic retinopathy. Features extracted and conditions diagnosed. 1151 Text Classification 2014 [298][299] B. Antal et al.
Liver Disorders Dataset Data for people with liver disorders. Seven biological features given for each patient. 345 Text Classification 1990 [300][301] Bupa Medical Research Ltd.
Thyroid Disease Dataset 10 databases of thyroid disease patient data. None. 7200 Text Classification 1987 [302][303] R. Quinlan
Mesothelioma Dataset Mesothelioma patient data. Large number of features, including asbestos exposure, are given. 324 Text Classification 2016 [304][305] A. Tanrikulu et al.
KEGG Metabolic Reaction Network (Undirected) Dataset Network of metabolic pathways. A reaction network and a relation network are given. Detailed features for each network node and pathway are given. 65,554 Text Classification, clustering, regression 2011 [306] M. Naeem et al.

Animal

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Abalone Dataset Physical measurements of Abalone. Weather patterns and location are also given. None. 4177 Text Regression 1995 [307] Marine Research Laboratories – Taroona
Zoo Dataset Artificial dataset covering 7 classes of animals. Animals are classed into 7 categories and features are given for each. 101 Text Classification 1990 [308] R. Forsyth
Demospongiae Dataset Data about marine sponges. 503 sponges in the Demosponge class are described by various features. 503 Text Classification 2010 [309] E. Armengol et al.
Splice-junction Gene Sequences Dataset Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. None. 3190 Text Classification 1992 [293] G. Towell et al.
Mice Protein Expression Dataset Expression levels of 77 proteins measured in the cerebral cortex of mice. None. 1080 Text Classification, clustering 2015 [310][311] C. Higuera et al.

Plant

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Forest Fires Dataset Forest fires and their properties. 13 features of each fire are extracted. 517 Text Regression 2008 [312][313] P. Cortez et al.
Iris Dataset Three types of iris plants are described by 4 different attributes. None. 150 Text Classification 1936 [314][315] R. Fisher
Plant Species Leaves Dataset Sixteen samples of leaf each of one-hundred plant species. Shape descriptor, fine-scale margin, and texture histograms are given. 1600 Text Classification 2012 [316][317] J. Cope et al.
Mushroom Dataset Mushroom attributes and classification. Many properties of each mushroom are given. 8124 Text Classification 1987 [318] J. Schlimmer
Soybean Dataset Database of diseased soybean plants. 35 features for each plant are given. Plants are classified into 19 categories. 307 Text Classification 1988 [319] R. Michalski et al.
Seeds Dataset Measurements of geometrical properties of kernels belonging to three different varieties of wheat. None. 210 Text Classification, clustering 2012 [320][321] Charytanowicz et al.
Covertype Dataset Data for predicting forest cover type strictly from cartographic variables. Many geographical features given. 581,012 Text Classification 1998 [322][323] J. Blackard et al.
Abscisic Acid Signaling Network Dataset Data for a plant signaling network. Goal is to determine set of rules that governs the network. None. 300 Text Causal-discovery 2008 [324] J. Jenkens et al.
Folio Dataset 20 photos of leaves for each of 32 species. None. 637 Images, text Classification, clustering 2015 [325][326] T. Munisami et al.
Oxford Flower Dataset 17 category dataset of flowers. Train/test splits, labeled images. 1360 Images, text Classification 2006 [113][327] M-E Nilsback et al.

Microbe

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Ecoli Dataset Protein localization sites. Various features of the protein localization sites are given. 336 Text Classification 1996 [328][329] K. Nakai et al.
MicroMass Dataset Identification of microorganisms from mass-spectrometry data. Various mass spectrometer features. 931 Text Classification 2013 [330][331] P. Mahe et al.
Yeast Dataset Predictions of cellular localization sites of proteins. Eight features given per instance. 1484 Text Classification 1996 [332][333] K. Nakai et al.

Drug Discovery

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Tox21 Dataset Prediction of outcome of biological assays. Chemical descriptors of molecules are given. 12707 Text Classification 2016 [334] A. Mayr et al.

Anomaly data

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Numenta Anomaly Benchmark (NAB) Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. ? 50+ files Comma separated values Anomaly detection 2016 (continually updated) [335] Numenta

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
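To make the row-and-column layout concrete, here is a minimal sketch that splits a small comma-separated table into attribute rows and a target column, the usual first step before regression or classification. The table contents, the column names, and the helper `load_table` are hypothetical and do not come from any dataset listed here.

```python
import csv
import io

# Hypothetical toy table: each row is one observation, each column one attribute.
RAW = """height,width,label
5.1,3.5,a
6.2,2.9,b
4.9,3.0,a
"""


def load_table(text, target_column):
    """Split CSV text into numeric attribute rows and a list of target values."""
    reader = csv.DictReader(io.StringIO(text))
    features, targets = [], []
    for row in reader:
        # Remove the target first so only attribute columns remain in the row.
        targets.append(row.pop(target_column))
        features.append([float(value) for value in row.values()])
    return features, targets


X, y = load_table(RAW, "label")
print(len(X), "observations,", len(X[0]), "attributes each")
```

Keeping the attributes and the target in separate structures mirrors how most learning libraries expect their input, though the exact container type varies by library.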

Financial

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Dow Jones Index Weekly data of stocks from the first and second quarters of 2011. Calculated values such as percentage change and lags are included. 750 Comma separated values Classification, regression, time series 2014 [336][337] M. Brown et al.
Statlog (Australian Credit Approval) Credit card applications either accepted or rejected and attributes about the application. Attribute names are removed as well as identifying information. Factors have been relabeled. 690 Comma separated values Classification 1987 [338][339] R. Quinlan
eBay auction data Auction data from various eBay.com objects over auctions of various lengths. Contains all bids, bidderID, bid times, and opening prices. ~ 550 Text Regression, classification 2012 [340][341] G. Shmueli et al.
Statlog (German Credit Data) Binary credit classification into “good” or “bad” with many features. Various financial features of each person are given. 690 Text Classification 1994 [342] H. Hofmann
Bank Marketing Dataset Data from a large marketing campaign carried out by a large bank. Many attributes of the clients contacted are given. Whether the client subscribed to the bank is also given. 45,211 Text Classification 2012 [343][344] S. Moro et al.
Istanbul Stock Exchange Dataset Several stock indexes tracked for almost two years. None. 536 Text Classification, regression 2013 [345][346] O. Akbilgic
Default of Credit Card Clients Credit default data for Taiwanese creditors. Various features about each account are given. 30,000 Text Classification 2016 [347][348] I. Yeh

Weather

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Cloud DataSet Data about 1024 different clouds. Image features extracted. 1024 Text Classification, clustering 1989 [349] P. Collard
El Nino Dataset Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. 12 weather attributes are measured at each buoy. 178080 Text Regression 1999 [350] Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. None. 2921 Text Regression 2015 [351] D. Lucas
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory Continuous air samples in Hawaii, USA. 44 years of records. None. 44 years Text Regression 2001 [352] Mauna Loa Observatory
Ionosphere Dataset Radar data from the ionosphere. Task is to classify into good and bad radar returns. Many radar features given. 351 Text Classification 1989 [303][353] Johns Hopkins University
Ozone Level Detection Dataset Two ground ozone level datasets. Many features given, including weather conditions at time of measurement. 2536 Text Classification 2008 [354][355] K. Zhang et al.

Census

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Adult Dataset Census data from 1994 containing demographic features of adults and their income. Cleaned and anonymized. 48,842 Comma separated values Classification 1996 [356] United States Census Bureau
Census-Income (KDD) Weighted census data from the 1994 and 1995 Current Population Surveys. Split into training and test sets. 299,285 Comma separated values Classification 2000 [357][358] United States Census Bureau
IPUMS Census Database Census data from the Los Angeles and Long Beach areas. None 256,932 Text Classification, regression 1999 [359] IPUMS
US Census Data 1990 Partial data from 1990 US census. Results randomized and useful attributes selected. 2,458,285 Text Classification, regression 1990 [360] United States Census Bureau

Transit

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Bike Sharing Dataset Hourly and daily count of rental bikes in a large city. Many features, including weather, length of trip, etc., are given. 17,389 Text Regression 2013 [361][362] H. Fanaee-T
New York City Taxi Trip Data Trip data for yellow and green taxis in New York City. Gives pick up and drop off locations, fares, and other details of trips. 6 years Text Classification, clustering 2015 [363] New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD Trajectories of all taxis in a large city. Many features given, including start and stop points. 1,710,671 Text Clustering, causal-discovery 2015 [364][365] M. Ferreira et al.

Internet

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Webpages from Common Crawl 2012 Large collection of webpages and how they are connected via hyperlinks. None. 3.5B Text Clustering, classification 2013 [366] V. Granville
Internet Advertisements Dataset Dataset for predicting if a given image is an advertisement or not. Features encode geometry of ads and phrases occurring in the URL. 3279 Text Classification 1998 [367][368] N. Kushmerick
Internet Usage Dataset General demographics of internet users. None. 10,104 Text Classification, clustering 1999 [369] D. Cook
URL Dataset 120 days of URL data from a large conference. Many features of each URL are given. 2,396,130 Text Classification 2009 [370][371] J. Ma
Phishing Websites Dataset Dataset of phishing websites. Many features of each site are given. 2456 Text Classification 2015 [372] R. Mustafa et al.
Online Retail Dataset Online transactions for a UK online retailer. Details of each transaction given. 541,909 Text Classification, clustering 2015 [373] D. Chen
Freebase Simple Topic Dump Freebase is an online effort to structure all human knowledge. Topics from Freebase have been extracted. large Text Classification, clustering 2011 [374][375] Freebase
Farm Ads Dataset The text of farm ads from websites. Binary approval or disapproval by content owners is given. SVMlight sparse vectors of text words in ads calculated. 4143 Text Classification 2011 [376][377] C. Masterharm et al.

Games

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Poker Hand Dataset 5 card hands from a standard 52 card deck. Attributes of each hand are given, including the Poker hands formed by the cards it contains. 1,025,010 Text Regression, classification 2007 [378] R. Cattral
Connect-4 Dataset Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. None. 67,557 Text Classification 1995 [379] J. Tromp
Chess (King-Rook vs. King) Dataset Endgame Database for White King and Rook against Black King. None. 28,056 Text Classification 1994 [380][381] M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset King+Rook versus King+Pawn on a7. None. 3196 Text Classification 1989 [382] R. Holte
Tic-Tac-Toe Endgame Dataset Binary classification for win conditions in tic-tac-toe. None. 958 Text Classification 1991 [383] D. Aha

Other multivariate

Dataset Name Brief description Preprocessing Instances Format Default Task Created (updated) Reference Creator
Housing Data Set Median home values of Boston with associated home and neighborhood attributes. None. 506 Text Regression 1993 [384] D. Harrison et al.
The Getty Vocabularies Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. None. large Text Classification 2015 [385] Getty Center
Yahoo! Front Page Today Module User Click Log User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. Conjoint analysis with a bilinear model. 45,811,883 user visits Text Regression, clustering 2009 [386][387] Chu et al.
British Oceanographic Data Centre Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. Various. 22K variables, many instances Text Regression, clustering 2015 [388] British Oceanographic Data Centre
Congressional Voting Records Dataset Voting data for all USA representatives on 16 issues. Beyond the raw voting data, various other features are provided. 435 Text Classification 1987 [389] J. Schlimmer
Entree Chicago Recommendation Dataset Record of user interactions with Entree Chicago recommendation system. Details of each user's usage of the app are recorded. 50,672 Text Regression, recommendation 2000 [390] R. Burke
Insurance Company Benchmark (COIL 2000) Information on customers of an insurance company. Many features of each customer and the services they use. 9,000 Text Regression, classification 2000 [391][392] P. van der Putten
Nursery Dataset Data from applicants to nursery schools. Data about applicant’s family and various other factors included. 12,960 Text Classification 1997 [393][394] V. Rajkovic et al.
University Dataset Data describing attributes of a large number of universities. None. 285 Text Clustering, classification 1988 [395] S. Sounders et al.
Blood Transfusion Service Center Dataset Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc. None. 748 Text Classification 2008 [396][397] I. Yeh
Record Linkage Comparison Patterns Dataset Large dataset of records. Task is to link relevant records together. Blocking procedure applied to select only certain record pairs. 5,749,132 Text Classification 2011 [398][399] University of Mainz
Nomao Dataset Nomao collects data about places from many different sources. Task is to detect items that describe the same place. Duplicates labeled. 34,465 Text Classification 2012 [400][401] Nomao Labs
Movie Dataset Data for 10,000 movies. Several features for each movie are given. 10,000 Text Clustering, classification 1999 [402] G. Wiederhold
Open University Learning Analytics Dataset Information about students and their interactions with a virtual learning environment. None. ~ 30,000 Text Classification, clustering, regression 2015 [403][404] J. Kuzilek et al.


Ref:

  • List of datasets for machine learning research - Wikipedia

