JNingWei

List of datasets for machine learning research

Face recognition

In computer vision, face images are used extensively to develop face recognition and face detection systems, among many other applications.

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Face Recognition Technology (FERET)	11,338 images of 1,199 individuals in different positions and at different times.	None.	11,338	Images	Classification, face recognition	2003	^[6]^[7]	United States Department of Defense
CMU Pose, Illumination, and Expression (PIE)	41,368 color images of 68 people in 13 different poses.	Images labeled with expressions.	41,368	Images, text	Classification, face recognition	2000	^[8]^[9]	R. Gross et al.
SCFace	Color images of faces at various angles.	Location of facial features extracted. Coordinates of features given.	4,160	Images, text	Classification, face recognition	2011	^[10]^[11]	M. Grgic et al.
YouTube Faces DB	Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames.	Identity of those appearing in videos and descriptors.	3,425 videos	Video, text	Video classification, face recognition	2011	^[12]^[13]	L. Wolf et al.
300 videos in-the-Wild	114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame.	None	114 videos, 218,000 frames.	Video, annotation file.	Facial landmark tracking.	2015	^[14]	Shen, Jie et al.
Grammatical Facial Expressions Dataset	Grammatical Facial Expressions from Brazilian Sign Language.	Microsoft Kinect features extracted.	27,965	Text	Facial gesture recognition	2014	^[15]	F. Freitas et al.
CMU Face Images Dataset	Images of faces. Each person is photographed multiple times to capture different expressions.	Labels and features.	640	Images, Text	Face recognition	1999	^[16]^[17]	T. Mitchell
Yale Face Database	Faces of 15 individuals in 11 different expressions.	Labels of expressions.	165	Images	Face recognition	1997	^[18]^[19]	J. Yang et al.
Cohn-Kanade AU-Coded Expression Database	Large database of images with labels for expressions.	Tracking of certain facial features.	500+ sequences	Images, text	Facial expression analysis	2000	^[20]^[21]	T. Kanade et al.
FaceScrub	Images of public figures scrubbed from image searching.	Name and m/f annotation.	107,818	Images, text	Face recognition	2014	^[22]^[23]	H. Ng et al.
BioID Face Database	Images of faces with eye positions marked.	Manually set eye positions.	1521	Images, text	Face recognition	2001	^[24]^[25]	BioID
Skin Segmentation Dataset	Randomly sampled color values from face images.	B, G, R values extracted.	245,057	Text	Segmentation, classification	2012	^[26]^[27]	R. Bhatt
Bosphorus	3D face image database.	34 action units and 6 expressions labeled; 24 facial landmarks labeled.	4652	Images, text	Face recognition, classification	2008	^[28]^[29]	A. Savran et al.
UOY 3D-Face	Neutral face and 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised.	Labeling.	5250	Images, text	Face recognition, classification	2004	^[30]^[31]	University of York
CASIA	Expressions: Anger, smile, laugh, surprise, closed eyes.	None.	4624	Images, text	Face recognition, classification	2007	^[32]^[33]	Institute of Automation, Chinese Academy of Sciences
CASIA	Expressions: Anger Disgust Fear Happiness Sadness Surprise	None.	480	Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second	Face recognition, classification	2011	^[34]	Zhao, G. et al.
BU-3DFE	Neutral face and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted.	None.	2500	Images, text	Facial expression recognition, classification	2006	^[35]	Binghamton University
Face Recognition Grand Challenge Dataset	Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data.	None.	4007	Images, text	Face recognition, classification	2004	^[36]^[37]	National Institute of Standards and Technology
Gavabdb	Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images.	None.	549	Images, text	Face recognition, classification	2008	^[38]^[39]	King Juan Carlos University
3D-RMA	Up to 100 subjects, expressions mostly neutral. Several poses as well.	None.	9971	Images, text	Face recognition, classification	2004	^[40]^[41]	Royal Military Academy (Belgium)

Action recognition

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Human Motion DataBase (HMDB51)	51 action categories, each containing at least 101 clips, extracted from a range of sources.	None.	6,766 video clips	video clips	Action classification	2011	^[42]	H. Kuehne et al.
TV Human Interaction Dataset	Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none.	None.	6,766 video clips	video clips	Action prediction	2013	^[43]	Patron-Perez, A. et al.
UT Interaction	People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip.	None.	120 video clips	video clips	Action prediction	2009	^[44]	Ryoo, M. S. et al.
UT Kinect	10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting.	None.	200 video clips with depth information at 15 frames per second	video clips with depth information	Action classification	2012	^[45]	Xia, L. et al.
SBU Interact	Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting.	None.	Around 300 interactions	video clips with depth information	Action classification	2012	^[46]	Yun, K. et al.
Berkeley Multimodal Human Action Database (MHAD)	Recordings of a single person performing 12 actions.	MoCap pre-processing	660 action samples	8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones	Action classification	2013	^[47]	Ofli, F. et al.
UCF 101 Dataset	Self-described as “a dataset of 101 human actions classes from videos in the wild.” The dataset is large, with over 27 hours of video.	Actions classified and labeled.	13,000	Video, images, text	Classification, action detection	2012	^[48]^[49]	K. Soomro et al.
THUMOS Dataset	Large video dataset for action classification.	Actions classified and labeled.	45M frames of video	Video, images, text	Classification, action detection	2013	^[50]^[51]	Y. Jiang et al.
ActivityNet	Large video dataset for activity recognition and detection.	Actions classified and labeled.	10,024	Video, images, text	Classification, action detection	2015	^[52]	Heilbron et al.
MSP-AVATAR	Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns.	Actions classified and labeled.	74 sessions	Motion-captured video, audio	Classification, action detection	2015	^[53]	Sadoughi, N. et al.
LILiR Twotalk Corpus	Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding.	Actions classified and labeled.	527	Video	Action detection	2011	^[54]	Sheerman-Chase et al.

Object detection & recognition

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
DAVIS: Densely Annotated VIdeo Segmentation	150 video sequences containing 10459 frames with a total of 376 objects annotated.	Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation.	10,459	Frames annotated	Video object segmentation	2017	^[55]	Pont-Tuset, J. et al.
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects	30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object.	6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses.	49,000	RGB-D images, 3D object models	6D object pose estimation, object detection	2017	^[56]	T. Hodan et al.
Berkeley 3-D Object Dataset	849 images taken in 75 different scenes. About 50 different object classes are labeled.	Object bounding boxes and labeling.	849	labeled images, text	Object recognition	2014	^[57]^[58]	A. Janoch et al.
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500)	500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300.	Each image segmented by five different subjects on average.	500	Segmented images	Contour detection and hierarchical image segmentation	2011	^[59]	University of California, Berkeley
Microsoft Common Objects in Context (COCO)	Complex everyday scenes of common objects in their natural context.	Object highlighting, labeling, and classification into 91 object types.	2,500,000	Labeled images, text	Object recognition	2015	^[60]^[61]	T. Lin et al.
SUN Database	Very large scene and object recognition database.	Places and objects are labeled. Objects are segmented.	131,067	Images, text	Object recognition, scene recognition	2014	^[62]^[63]	J. Xiao et al.
ImageNet	Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge	Labeled objects, bounding boxes, descriptive words, SIFT features	14,197,122	Images, text	Object recognition, scene recognition	2014	^[64]^[65]	J. Deng et al.
TV News Channel Commercial Detection Dataset	TV commercials and news broadcasts.	Audio and video features extracted from still images.	129,685	Text	Clustering, classification	2015	^[66]^[67]	P. Guha et al.
Statlog (Image Segmentation) Dataset	The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel.	Many features calculated.	2310	Text	Classification	1990	^[68]	University of Massachusetts
Caltech 101	Pictures of objects.	Detailed object outlines marked.	9146	Images	Classification, object recognition.	2003	^[69]^[70]	F. Li et al.
Caltech-256	Large dataset of images for object classification.	Images categorized and hand-sorted.	30,607	Images, Text	Classification, object detection	2007	^[71]^[72]	G. Griffin et al.
SIFT10M Dataset	SIFT features of Caltech-256 dataset.	Extensive SIFT feature extraction.	11,164,866	Text	Classification, object detection	2016	^[73]	X. Fu et al.
LabelMe	Annotated pictures of scenes.	Objects outlined.	187,240	Images, text	Classification, object detection	2005	^[74]	MIT Computer Science and Artificial Intelligence Laboratory
Cityscapes Dataset	Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included.	Pixel-level segmentation and labeling	25,000	Images, text	Classification, object detection	2016	^[75]	Daimler AG et al.
PASCAL VOC Dataset	Large number of images for classification tasks.	Labeling, bounding box included	500,000	Images, text	Classification, object detection	2010	^[76]^[77]	M. Everingham et al.
CIFAR-10 Dataset	Many small, low-resolution images of 10 classes of objects.	Classes labelled, training set splits created.	60,000	Images	Classification	2009	^[65]^[78]	A. Krizhevsky et al.
CIFAR-100 Dataset	Like CIFAR-10, above, but 100 classes of objects are given.	Classes labelled, training set splits created.	60,000	Images	Classification	2009	^[65]^[78]	A. Krizhevsky et al.
German Traffic Sign Detection Benchmark Dataset	Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries.	Signs manually labeled.	900	Images	Classification	2013	^[79]^[80]	S. Houben et al.
KITTI Vision Benchmark Dataset	Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners.	Many benchmarks extracted from data.	>100 GB of data	Images, text	Classification, object detection	2012	^[81]^[82]	A. Geiger et al.

Handwriting and character recognition

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Artificial Characters Dataset	Artificially generated data describing the structure of 10 capital English letters.	Coordinates of lines drawn given as integers. Various other features.	6000	Text	Handwriting recognition, classification	1992	^[83]	H. Guvenir et al.
Letter Dataset	Upper case printed letters.	17 features are extracted from all images.	20,000	Text	OCR, classification	1991	^[84]^[85]	D. Slate et al.
Character Trajectories Dataset	Labeled samples of pen tip trajectories for people writing simple characters.	3-dimensional pen tip velocity trajectory matrix for each sample	2858	Text	Handwriting recognition, classification	2008	^[86]^[87]	B. Williams
Chars74K Dataset	Character recognition in natural images of symbols used in both English and Kannada		74,107		Character recognition, handwriting recognition, OCR, classification	2009	^[88]	T. de Campos
UJI Pen Characters Dataset	Isolated handwritten characters	Coordinates of pen position as characters were written given.	11,640	Text	Handwriting recognition, classification	2009	^[89]^[90]	F. Prat et al.
Gisette Dataset	Handwriting samples from the often-confused 4 and 9 characters.	Features extracted from images, split into train/test, handwriting images size-normalized.	13,500	Images, text	Handwriting recognition, classification	2003	^[91]	Yann LeCun et al.
MNIST Database	Database of handwritten digits.	Hand-labeled.	60,000	Images, text	Classification	1998	^[92]^[93]	National Institute of Standards and Technology
Optical Recognition of Handwritten Digits Dataset	Normalized bitmaps of handwritten data.	Size normalized and mapped to bitmaps.	5620	Images, text	Handwriting recognition, classification	1998	^[94]	E. Alpaydin et al.
Pen-Based Recognition of Handwritten Digits Dataset	Handwritten digits on electronic pen-tablet.	Feature vectors extracted to be uniformly spaced.	10,992	Images, text	Handwriting recognition, classification	1998	^[95]^[96]	E. Alpaydin et al.
Semeion Handwritten Digit Dataset	Handwritten digits from 80 people.	All handwritten digits have been normalized for size and mapped to the same grid.	1593	Images, text	Handwriting recognition, classification	2008	^[97]	T. Srl
HASYv2	Handwritten mathematical symbols	All symbols are centered and of size 32px x 32px.	168233	Images, text	Classification	2017	^[98]	Martin Thoma

Aerial images

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Aerial Image Segmentation Dataset	80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0.	Images manually segmented.	80	Images	Aerial Classification, object detection	2013	^[99]^[100]	J. Yuan et al.
KIT AIS Data Set	Multiple labeled training and evaluation datasets of aerial images of crowds.	Images manually labeled to show paths of individuals through crowds.	~ 150	Images with paths	People tracking, aerial tracking	2012	^[101]^[102]	M. Butenuth et al.
Wilt Dataset	Remote sensing data of diseased trees and other land cover.	Various features extracted.	4899	Images	Classification, aerial object detection	2014	^[103]^[104]	B. Johnson
Forest Type Mapping Dataset	Satellite imagery of forests in Japan.	Image wavelength bands extracted.	326	Text	Classification	2015	^[105]^[106]	B. Johnson
Overhead Imagery Research Data Set	Annotated overhead imagery. Images with multiple objects.	Over 30 annotations and over 60 statistics that describe the target within the context of the image.	1000	Images, text	Classification	2009	^[107]^[108]	F. Tanner et al.

Other images

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
MPII Cooking Activities Dataset	Videos and images of various cooking activities.	Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling.	881,755 frames	Labeled video, images, text	Classification	2012	^[109]^[110]	M. Rohrbach et al.
Stanford Dogs Dataset	Images of 120 breeds of dogs from around the world.	Train/test splits and ImageNet annotations provided.	20,580	Images, text	Fine-grain classification	2011	^[111]^[112]	A. Khosla et al.
The Oxford-IIIT Pet Dataset	37 categories of pets with roughly 200 images of each.	Breed labeled, tight bounding box, foreground-background segmentation.	~ 7,400	Images, text	Classification, object detection	2012	^[112]^[113]	O. Parkhi et al.
Corel Image Features Data Set	Database of images with features extracted.	Many features including color histogram, co-occurrence texture, and color moments.	68,040	Text	Classification, object detection	1999	^[114]^[115]	M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset	Transcoding times for various videos and video properties.	Video features given.	168,286	Text	Regression	2015	^[116]	T. Deneke et al.
Microsoft Sequential Image Narrative Dataset (SIND)	Dataset for sequential vision-to-language	Descriptive caption and storytelling given for each photo, and photos are arranged in sequences	81,743	Images, text	Visual storytelling	2016	^[117]	Microsoft Research
Caltech-UCSD Birds-200-2011 Dataset	Large dataset of images of birds.	Part locations for birds, bounding boxes, 312 binary attributes given	11,788	Images, text	Classification	2011	^[118]^[119]	C. Wah et al.
YouTube-8M	Large and diverse labeled video dataset	YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities	8 million	Video, text	Video classification	2016	^[120]^[121]	S. Abu-El-Haija et al.
YFCC100M	Large and diverse labeled image and video dataset	Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags)	100 million	Video, Image, Text	Video and Image classification	2016	^[122]^[123]	B. Thomee et al.
Discrete LIRIS-ACCEDE	Short videos annotated for valence and arousal.	Valence and arousal labels.	9800	Video	Video emotion elicitation detection	2015	^[124]	Y. Baveye et al.
Continuous LIRIS-ACCEDE	Long videos annotated for valence and arousal while also collecting Galvanic Skin Response.	Valence and arousal labels.	30	Video	Video emotion elicitation detection	2015	^[125]	Y. Baveye et al.
MediaEval LIRIS-ACCEDE	Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films.	Violence, valence and arousal labels.	10900	Video	Video emotion elicitation detection	2015	^[126]	Y. Baveye et al.

Text data

Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.

Reviews

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Amazon reviews	US product reviews from Amazon.com.	None.	~ 82M	Text	Classification, sentiment analysis	2015	^[127]	McAuley et al.
OpinRank Review Dataset	Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively.	None.	42,230 / ~259,000 respectively	Text	Sentiment analysis, clustering	2011	^[128]^[129]	K. Ganesan et al.
MovieLens	22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.	None.	~ 22M	Text	Regression, clustering, classification	2016	^[130]	GroupLens Research
Yahoo! Music User Ratings of Musical Artists	Over 10M ratings of artists by Yahoo users.	None described.	~ 10M	Text	Clustering, regression	2004	^[131]^[132]	Yahoo!
Car Evaluation Data Set	Car properties and their overall acceptability.	Six categorical features given.	1728	Text	Classification	1997	^[133]^[134]	M. Bohanec
YouTube Comedy Slam Preference Dataset	User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.	Video metadata given.	1,138,562	Text	Classification	2012	^[135]^[136]	Google
Skytrax User Reviews Dataset	User reviews of airlines, airports, seats, and lounges from Skytrax.	Ratings are fine-grain and include many aspects of airport experience.	41396	Text	Classification, regression	2015	^[137]	Q. Nguyen
Teaching Assistant Evaluation Dataset	Teaching assistant reviews.	Features of each instance such as class, class size, and instructor are given.	151	Text	Classification	1997	^[138]^[139]	W. Loh et al.

News articles

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
NYSK Dataset	English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.	Filtered and presented in XML format.	10,421	XML, text	Sentiment analysis, topic extraction	2013	^[140]	Dermouche, M. et al.
The Reuters Corpus Volume 1	Large corpus of Reuters news stories in English.	Fine-grain categorization and topic codes.	810,000	Text	Classification, clustering, summarization	2002	^[141]	Reuters
The Reuters Corpus Volume 2	Large corpus of Reuters news stories in multiple languages.	Fine-grain categorization and topic codes.	487,000	Text	Classification, clustering, summarization	2005	^[142]	Reuters
Thomson Reuters Text Research Collection	Large corpus of news stories.	Details not described.	1,800,370	Text	Classification, clustering, summarization	2009	^[143]	T. Rose et al.
Saudi Newspapers Corpus	31,030 Arabic newspaper articles.	Metadata extracted.	31,030	JSON	Summarization, clustering	2015	^[144]	M. Alhagri

Messages

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Enron Email Dataset	Emails from employees at Enron organized into folders.	Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com.	~ 500,000	Text	Network analysis, sentiment analysis	2004 (2015)	^[145]^[146]	Klimt, B. and Y. Yang
Ling-Spam Dataset	Corpus containing both legitimate and spam emails.	Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled.		Text	Classification	2000	^[147]^[148]	Androutsopoulos, J. et al.
SMS Spam Collection Dataset	Collected SMS spam messages.	None.	5574	Text	Classification	2011	^[149]^[150]	T. Almeida et al.
Twenty Newsgroups Dataset	Messages from 20 different newsgroups.	None.	20,000	Text	Natural language processing	1999	^[151]	T. Mitchell et al.
Spambase Dataset	Spam emails.	Many text features extracted.	4601	Text	Spam detection, classification	1999	^[152]	M. Hopkins et al.

Twitter and tweets

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Sentiment140	Tweet data from 2009 including original text, time stamp, user and sentiment.	Classified using distant supervision from the presence of emoticons in the tweet.	1,578,627	Tweets, comma-separated values	Sentiment analysis	2009	^[153]^[154]	A. Go et al.
ASU Twitter Dataset	Twitter network data, not actual tweets. Shows connections between a large number of users.	None.	11,316,811 users, 85,331,846 connections	Text	Clustering, graph analysis	2009	^[155]^[156]	R. Zafarani et al.
SNAP Social Circles: Twitter Database	Large twitter network data.	Node features, circles, and ego networks.	1,768,149	Text	Clustering, graph analysis	2012	^[157]^[158]	J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis	Arabic tweets.	Samples hand-labeled as positive or negative.	2000	Text	Classification	2014	^[159]^[160]	N. Abdulla
Buzz in Social Media Dataset	Data from Twitter and Tom’s Hardware. This dataset focuses on specific buzz topics being discussed on those sites.	Data is windowed so that the user can attempt to predict the events leading up to social media buzz.	140,000	Text	Regression, Classification	2013	^[161]^[162]	F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT)	This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled.	Tokenization, part-of-speech and named entity tagging.	18,762	Text	Regression, Classification	2015	^[163]^[164]	Xu et al.
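
The Sentiment140 row above mentions labeling by distant supervision from emoticons. The idea can be sketched roughly as follows. This is only an illustration: the emoticon lists, the function name `distant_label`, and the step that strips emoticons from the text are assumptions of this sketch, not details taken from the dataset's documentation.

```python
# Hypothetical emoticon lists; the lists actually used for Sentiment140 may differ.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(", "=(")


def distant_label(tweet):
    """Return (label, cleaned_text), or None when no usable signal is present.

    A tweet is labeled only if it contains emoticons of exactly one polarity.
    Tweets with no emoticon, or with both polarities, are skipped.
    """
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoticon at all, or conflicting emoticons
    label = "positive" if has_pos else "negative"
    # Assumption: remove the emoticons so a classifier cannot simply memorize them.
    cleaned = tweet
    for emoticon in POSITIVE + NEGATIVE:
        cleaned = cleaned.replace(emoticon, "")
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
    return label, cleaned


if __name__ == "__main__":
    for text in ["great day :)", "missed the bus :(", "no emoticon here"]:
        print(text, "->", distant_label(text))
```

Labels produced this way are noisy, which is the usual trade-off of distant supervision: no manual annotation is needed, but the emoticon is only a proxy for the sentiment of the text.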

Other text

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Legal Case Reports	Federal Court of Australia cases from 2006–2009.	None.	4,000	Text	Summarization, citation analysis	2012	^[165]^[166]	F. Galgani et al.
Blogger Authorship Corpus	Blog entries of 19,320 people from blogger.com.	Blogger self-provided gender, age, industry, and astrological sign.	681,288	Text	Sentiment analysis, summarization, classification	2006	^[167]^[168]	J. Schler et al.
Social Structure of Facebook Networks	Large dataset of the social structure of Facebook.	None.	100 colleges covered	Text	Network analysis, clustering	2012	^[169]^[170]	A. Traud et al.
Dataset for the Machine Comprehension of Text	Stories and associated questions for testing comprehension of text.	None.	660	Text	Natural language processing, machine comprehension	2013	^[171]^[172]	M. Richardson et al.
The Penn Treebank Project	Naturally occurring text annotated for linguistic structure.	Text is parsed into syntactic trees.	~ 1M words	Text	Natural language processing, summarization	1995	^[173]^[174]	M. Marcus et al.
DEXTER Dataset	Task given is to determine, from features given, which articles are about corporate acquisitions.	Features extracted include word stems. Distractor features included.	2600	Text	Classification	2008	^[175]	Reuters
Google Books N-grams	N-grams from a very large corpus of books	None.	2.2 TB of text	Text	Classification, clustering, regression	2011	^[176]^[177]	Google
Personae Corpus	Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.	In addition to normal texts, syntactically annotated texts are given.	145	Text	Classification, regression	2008	^[178]^[179]	K. Luyckx et al.
CNAE-9 Dataset	Categorization task for free text descriptions of Brazilian companies.	Word frequency has been extracted.	1080	Text	Classification	2012	^[180]^[181]	P. Ciarelli et al.
Sentiment Labeled Sentences Dataset	3000 sentiment labeled sentences.	Sentiment of each sentence has been hand labeled as positive or negative.	3000	Text	Classification, sentiment analysis	2015	^[182]^[183]	D. Kotzias
BlogFeedback Dataset	Dataset to predict the number of comments a post will receive based on features of that post.	Many features of each post extracted.	60,021	Text	Regression	2014	^[184]^[185]	K. Buza
Stanford Natural Language Inference (SNLI) Corpus	Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs.	Entailment class labels, syntactic parsing by the Stanford PCFG parser	570,000	Text	Natural language inference/recognizing textual entailment	2015	^[186]	S. Bowman et al.

Sound data

Datasets of sounds and sound features.

Speech

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Zero Resource Speech Challenge 2015	Spontaneous speech (English), read speech (Xitsonga).	Raw WAV.	English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers	Sound	Unsupervised discovery of speech features/subword units/word units	2015	^[187]^[188] www.zerospeech.com/2015	Versteegh et al.
Parkinson Speech Dataset	Multiple recordings of people with and without Parkinson’s Disease.	Voice features extracted, disease scored by physician using unified Parkinson’s disease rating scale	1,040	Text	Classification, regression	2013	^[189]^[190]	B. E. Sakar et al.
Spoken Arabic Digits	Spoken Arabic digits from 44 male and 44 female speakers.	Time-series of mel-frequency cepstrum coefficients.	8,800	Text	Classification	2010	^[191]^[192]	M. Bedda et al.
ISOLET Dataset	Spoken letter names.	Features extracted from sounds.	7797	Text	Classification	1994	^[193]^[194]	R. Cole et al.
Japanese Vowels Dataset	Nine male speakers uttered two Japanese vowels successively.	12-degree linear prediction analysis applied to each utterance to obtain a discrete-time series with 12 cepstrum coefficients.	640	Text	Classification	1999	^[195]^[196]	M. Kudo et al.
Parkinson’s Telemonitoring Dataset	Multiple recordings of people with and without Parkinson’s Disease.	Sound features extracted.	5875	Text	Classification	2009	^[197]^[198]	A. Tsanas et al.
TIMIT	Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences.	Speech is lexically and phonemically transcribed.	6300	Text	Speech recognition, classification.	1986	^[199]^[200]	J. Garofolo et al.
Arabic Speech Corpus	A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level	Speech is orthographically and phonetically transcribed with stress marks.	~1900	Text, WAV	Speech Synthesis, Speech Recognition, Corpus Alignment, Speech Therapy, Education.	2016	^[201]	N. Halabi

Music

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Geographical Original of Music Data Set	Audio features of music samples from different locations.	Audio features extracted using MARSYAS software.	1,059	Text	Geographical classification, clustering	2014	^[202]^[203]	F. Zhou et al.
Million Song Dataset	Audio features from one million different songs.	Audio features extracted.	1M	Text	Classification, clustering	2011	^[204]^[205]	T. Bertin-Mahieux et al.
Free Music Archive	Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text.	Raw audio and audio features.	106,574	Text, MP3	Classification, recommendation	2017	^[206]	M. Defferrard et al.
Bach Choral Harmony Dataset	Bach chorale chords.	Audio features extracted.	5665	Text	Classification	2014	^[207]^[208]	D. Radicioni et al.

Other sounds

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
UrbanSound	Labeled sound recordings of sounds like air conditioners, car horns and children playing.	Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file.	1,059	Sound (WAV)	Classification	2014	^[209]^[210]	J. Salamon et al.

Signal data

Datasets containing electrical signal information that requires signal processing for further analysis.

Electrical

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Witty Worm Dataset	Dataset detailing the spread of the Witty worm and the infected computers.	Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers.	55,909 IP addresses	Text	Classification	2004	^[211]^[212]	Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset	Cleaned vital signals from human patients which can be used to estimate blood pressure.	125 Hz vital signs have been cleaned.	12,000	Text	Classification, regression	2015	^[213]^[214]	M. Kachuee et al.
Gas Sensor Array Drift Dataset	Measurements from 16 chemical sensors utilized in simulations for drift compensation.	Extensive number of features given.	13,910	Text	Classification	2012	^[215]^[216]	A. Vergara
Servo Dataset	Data covering the nonlinear relationships observed in a servo-amplifier circuit.	Levels of various components as a function of other components are given.	167	Text	Regression	1993	^[217]^[218]	K. Ullrich
UJIIndoorLoc-Mag Dataset	Indoor localization database to test indoor positioning systems. Data is magnetic field based.	Train and test splits given.	40,000	Text	Classification, regression, clustering	2015	^[219]^[220]	D. Rambla et al.
Sensorless Drive Diagnosis Dataset	Electrical signals from motors with defective components.	Statistical features extracted.	58,508	Text	Classification	2015	^[221]^[222]	M. Bator

Motion-tracking

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio)	People performing five standard actions while wearing motion trackers.	None.	165,632	Text	Classification	2013	^[223]^[224]	Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset	Features extracted from video of people doing various gestures.	Features extracted aim at studying gesture phase segmentation.	9900	Text	Classification, clustering	2014	^[225]^[226]	R. Madeo et al.
Vicon Physical Action Data Set Dataset	10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker.	Many parameters recorded by 3D tracker.	3000	Text	Classification	2011	^[227]^[228]	T. Theodoridis
Daily and Sports Activities Dataset	Motion sensor data for 19 daily and sports activities.	Many sensors given, no preprocessing done on signals.	9120	Text	Classification	2013	^[229]^[230]	B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset	Gyroscope and accelerometer data from people wearing smartphones and performing normal actions.	Actions performed are labeled, all signals preprocessed for noise.	10,299	Text	Classification	2012	^[231]^[232]	J. Reyes-Ortiz et al.
Australian Sign Language Signs	Australian sign language signs captured by motion-tracking gloves.	None.	2565	Text	Classification	2002	^[233]^[234]	M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units	Five variations of the biceps curl exercise monitored with IMUs.	Some statistics calculated from raw data.	39,242	Text	Classification	2013	^[235]^[236]	W. Ugulino et al.
sEMG for Basic Hand movements Dataset	Two databases of surface electromyographic signals of 6 hand movements.	None.	3000	Text	Classification	2014	^[237]^[238]	C. Sapsanis et al.
REALDISP Activity Recognition Dataset	Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition.	None.	1419	Text	Classification	2014	^[238]^[239]	O. Banos et al.
Heterogeneity Activity Recognition Dataset	Data from multiple different smart devices for humans performing various activities.	None.	43,930,257	Text	Classification, clustering	2015	^[240]^[241]	A. Stisen et al.
Indoor User Movement Prediction from RSS Data	Temporal wireless network data that can be used to track the movement of people in an office.	None.	13,197	Text	Classification	2016	^[242]^[243]	D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset	18 different types of physical activities performed by 9 subjects wearing 3 IMUs.	None.	3,850,505	Text	Classification	2012	^[244]	A. Reiss
OPPORTUNITY Activity Recognition Dataset	Recordings from wearable, object, and ambient sensors, devised to benchmark human activity recognition algorithms.	None.	2551	Text	Classification	2012	^[245]^[246]	D. Roggen et al.
Real World Activity Recognition Dataset	Human Activity Recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors.	None.	3,150,000 (per sensor)	Text	Classification	2016	^[247]	T. Sztyler et al.

Other signals

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Wine Dataset	Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.	13 properties of each wine are given	178	Text	Classification, regression	1991	^[248]^[249]	M. Forina et al.
Combined Cycle Power Plant Data Set	Data from various sensors within a power plant running for 6 years.	None	9568	Text	Regression	2014	^[250]^[251]	P. Tufekci et al.

Physical data

Datasets from physical systems.

High-energy physics

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
HIGGS Dataset	Monte Carlo simulations of particle accelerator collisions.	28 features of each collision are given.	11M	Text	Classification	2014	^[252]^[253]^[254]	D. Whiteson
HEPMASS Dataset	Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise.	28 features of each collision are given.	10,500,000	Text	Classification	2016	^[253]^[254]^[255]	D. Whiteson

Systems

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Yacht Hydrodynamics Dataset	Yacht performance based on dimensions.	Six features are given for each yacht.	308	Text	Regression	2013	^[256]^[257]	R. Lopez
Robot Execution Failures Dataset	5 data sets that center around robotic failure to execute common tasks.	Integer valued features such as torque and other sensor measurements.	463	Text	Classification	1999	^[258]	L. Seabra et al.
Pittsburgh Bridges Dataset	Design description is given in terms of several properties of various bridges.	Various bridge features are given.	108	Text	Classification	1990	^[259]^[260]	Y. Reich et al.
Automobile Dataset	Data about automobiles, their insurance risk, and their normalized losses.	Car features extracted.	205	Text	Regression	1987	^[261]^[262]	J. Schlimmer et al.
Auto MPG Dataset	MPG data for cars.	Eight features of each car given.	398	Text	Regression	1993	^[263]	Carnegie Mellon University
Energy Efficiency Dataset	Heating and cooling requirements given as a function of building parameters.	Building parameters given.	768	Text	Classification, regression	2012	^[264]^[265]	A. Xifara et al.
Airfoil Self-Noise Dataset	A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections.	Data about frequency, angle of attack, etc., are given.	1503	Text	Regression	2014	^[266]	R. Lopez
Challenger USA Space Shuttle O-Ring Dataset	Attempt to predict O-ring problems given past Challenger data.	Several features of each flight, such as launch temperature, are given.	23	Text	Regression	1993	^[267]^[268]	D. Draper et al.
Statlog (Shuttle) Dataset	NASA space shuttle datasets.	Nine features given.	58,000	Text	Classification	2002	^[269]	NASA

Astronomy

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Volcanoes on Venus – JARtool experiment Dataset	Venus images returned by the Magellan spacecraft.	Images are labeled by humans.	not given	Images	Classification	1991	^[270]^[271]	M. Burl
MAGIC Gamma Telescope Dataset	Monte Carlo generated high-energy gamma particle events.	Numerous features extracted from the simulations.	19,020	Text	Classification	2007	^[271]^[272]	R. Bock
Solar Flare Dataset	Measurements of the number of certain types of solar flare events occurring in a 24-hour period.	Many solar flare-specific features are given.	1389	Text	Regression, classification	1989	^[273]	G. Bradshaw

Earth science

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Volcanoes of the World	Volcanic eruption data for all known volcanic events on earth.	Details such as region, subregion, tectonic setting, dominant rock type are given.	1535	Text	Regression, classification	2013	^[274]	E. Venzke et al.
Seismic-bumps Dataset	Seismic activities from a coal mine.	Seismic activity was classified as hazardous or not.	2584	Text	Classification	2013	^[275]^[276]	M. Sikora et al.

Other physical

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Concrete Compressive Strength Dataset	Dataset of concrete properties and compressive strength.	Nine features are given for each sample.	1030	Text	Regression	2007	^[277]^[278]	I. Yeh
Concrete Slump Test Dataset	Concrete slump flow given in terms of properties.	Features of concrete given such as fly ash, water, etc.	103	Text	Regression	2009	^[279]^[280]	I. Yeh
Musk Dataset	Predict if a molecule, given the features, will be a musk or a non-musk.	168 features given for each molecule.	6598	Text	Classification	1994	^[281]	Arris Pharmaceutical Corp.
Steel Plates Faults Dataset	Steel plates of 7 different types.	27 features given for each sample.	1941	Text	Classification	2010	^[282]	Semeion Research Center

Biological data

Datasets from biological systems.

Human

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
EEG Database	Study to examine EEG correlates of genetic predisposition to alcoholism.	Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second.	122	Text	Classification	1999	^[283]^[284]	H. Begleiter
P300 Interface Dataset	Data from nine subjects collected using P300-based brain-computer interface for disabled subjects.	Split into four sessions for each subject. MATLAB code given.	1,224	Text	Classification	2008	^[285]^[286]	U. Hoffmann et al.
Heart Disease Data Set	Attributes of patients with and without heart disease.	75 attributes given for each patient with some missing values.	303	Text	Classification	1988	^[287]^[288]	A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset	Dataset of features of breast masses. Diagnosis by physician is given.	10 features for each sample are given.	569	Text	Classification	1995	^[289]^[290]	W. Wolberg et al.
National Survey on Drug Use and Health	Large scale survey on health and drug use in the United States.	None.	55,268	Text	Classification, regression	2012	^[291]	United States Department of Health and Human Services
Lung Cancer Dataset	Lung cancer dataset without attribute definitions.	56 features are given for each case.	32	Text	Classification	1992	^[292]^[293]	Z. Hong et al.
Arrhythmia Dataset	Data for a group of patients, of which some have cardiac arrhythmia.	276 features for each instance.	452	Text	Classification	1998	^[294]^[295]	H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset	9 years of readmission data across 130 US hospitals for patients with diabetes.	Many features of each readmission are given.	100,000	Text	Classification, clustering	2014	^[296]^[297]	J. Clore et al.
Diabetic Retinopathy Debrecen Dataset	Features extracted from images of eyes with and without diabetic retinopathy.	Features extracted and conditions diagnosed.	1151	Text	Classification	2014	^[298]^[299]	B. Antal et al.
Liver Disorders Dataset	Data for people with liver disorders.	Seven biological features given for each patient.	345	Text	Classification	1990	^[300]^[301]	Bupa Medical Research Ltd.
Thyroid Disease Dataset	10 databases of thyroid disease patient data.	None.	7200	Text	Classification	1987	^[302]^[303]	R. Quinlan
Mesothelioma Dataset	Mesothelioma patient data.	Large number of features, including asbestos exposure, are given.	324	Text	Classification	2016	^[304]^[305]	A. Tanrikulu et al.
KEGG Metabolic Reaction Network (Undirected) Dataset	Network of metabolic pathways. A reaction network and a relation network are given.	Detailed features for each network node and pathway are given.	65,554	Text	Classification, clustering, regression	2011	^[306]	M. Naeem et al.

Animal

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Abalone Dataset	Physical measurements of abalone, used to predict age. Weather patterns and location are not included.	None.	4177	Text	Regression	1995	^[307]	Marine Research Laboratories – Taroona
Zoo Dataset	Artificial dataset covering 7 classes of animals.	Animals are classed into 7 categories and features are given for each.	101	Text	Classification	1990	^[308]	R. Forsyth
Demospongiae Dataset	Data about marine sponges.	503 sponges in the Demosponge class are described by various features.	503	Text	Classification	2010	^[309]	E. Armengol et al.
Splice-junction Gene Sequences Dataset	Primate splice-junction gene sequences (DNA) with associated imperfect domain theory.	None.	3190	Text	Classification	1992	^[293]	G. Towell et al.
Mice Protein Expression Dataset	Expression levels of 77 proteins measured in the cerebral cortex of mice.	None.	1080	Text	Classification, Clustering	2015	^[310]^[311]	C. Higuera et al.

Plant

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Forest Fires Dataset	Forest fires and their properties.	13 features of each fire are extracted.	517	Text	Regression	2008	^[312]^[313]	P. Cortez et al.
Iris Dataset	Three types of iris plants are described by 4 different attributes.	None.	150	Text	Classification	1936	^[314]^[315]	R. Fisher
Plant Species Leaves Dataset	Sixteen leaf samples for each of one hundred plant species.	Shape descriptor, fine-scale margin, and texture histograms are given.	1600	Text	Classification	2012	^[316]^[317]	J. Cope et al.
Mushroom Dataset	Mushroom attributes and classification.	Many properties of each mushroom are given.	8124	Text	Classification	1987	^[318]	J. Schlimmer
Soybean Dataset	Database of diseased soybean plants.	35 features for each plant are given. Plants are classified into 19 categories.	307	Text	Classification	1988	^[319]	R. Michalski et al.
Seeds Dataset	Measurements of geometrical properties of kernels belonging to three different varieties of wheat.	None.	210	Text	Classification, clustering	2012	^[320]^[321]	Charytanowicz et al.
Covertype Dataset	Data for predicting forest cover type strictly from cartographic variables.	Many geographical features given.	581,012	Text	Classification	1998	^[322]^[323]	J. Blackard et al.
Abscisic Acid Signaling Network Dataset	Data for a plant signaling network. Goal is to determine set of rules that governs the network.	None.	300	Text	Causal-discovery	2008	^[324]	J. Jenkens et al.
Folio Dataset	20 photos of leaves for each of 32 species.	None.	637	Images, text	Classification, clustering	2015	^[325]^[326]	T. Munisami et al.
Oxford Flower Dataset	17 category dataset of flowers.	Train/test splits and labeled images.	1360	Images, text	Classification	2006	^[113]^[327]	M-E Nilsback et al.

Microbe

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Ecoli Dataset	Protein localization sites.	Various features of the protein localization sites are given.	336	Text	Classification	1996	^[328]^[329]	K. Nakai et al.
MicroMass Dataset	Identification of microorganisms from mass-spectrometry data.	Various mass spectrometer features.	931	Text	Classification	2013	^[330]^[331]	P. Mahe et al.
Yeast Dataset	Prediction of cellular localization sites of proteins.	Eight features given per instance.	1484	Text	Classification	1996	^[332]^[333]	K. Nakai et al.

Drug Discovery

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Tox21 Dataset	Prediction of outcome of biological assays.	Chemical descriptors of molecules are given.	12707	Text	Classification	2016	^[334]	A. Mayr et al.

Anomaly data

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Numenta Anomaly Benchmark (NAB)	Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted.	?	50+ files	Comma separated values	Anomaly detection	2016 (continually updated)	^[335]	Numenta
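
Ordered, timestamped, single-valued metrics of this kind lend themselves to simple baseline detectors. The sketch below flags points that deviate strongly from a trailing window; it is a minimal illustration, not the benchmark's own scoring method. The column names `timestamp` and `value`, the helper name `flag_anomalies`, and the sample readings are assumptions made for the example.

```python
import csv
import io
import statistics


def flag_anomalies(csv_text, window=5, threshold=3.0):
    """Return timestamps whose value lies more than `threshold` sample
    standard deviations from the mean of the preceding `window` values.

    Assumes a header row with `timestamp` and `value` columns
    (an assumption for this sketch).
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(r["value"]) for r in rows]
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        spread = statistics.stdev(history)
        # Skip flat windows, where the z-score is undefined.
        if spread > 0 and abs(values[i] - mean) > threshold * spread:
            flagged.append(rows[i]["timestamp"])
    return flagged


# Made-up readings with one obvious spike.
sample = (
    "timestamp,value\n"
    "2016-01-01 00:00:00,10\n"
    "2016-01-01 00:05:00,11\n"
    "2016-01-01 00:10:00,10\n"
    "2016-01-01 00:15:00,11\n"
    "2016-01-01 00:20:00,10\n"
    "2016-01-01 00:25:00,50\n"
    "2016-01-01 00:30:00,10\n"
)
print(flag_anomalies(sample))
```

Note that the spike inflates the spread of the windows that follow it, so a trailing z-score tends to be less sensitive immediately after an anomaly.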

Multivariate data

Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
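
The row-and-column layout described above can be sketched as follows: each row becomes one observation, every column except the chosen target becomes a feature, and the target column supplies the label for classification or regression. The helper name `split_features_target`, the column names, and the sample values are made up for illustration and do not come from any listed dataset.

```python
import csv
import io


def split_features_target(csv_text, target_column):
    """Split comma-separated rows into feature vectors X and targets y.

    All non-target columns are parsed as floats; the target is kept as
    a string. (Hypothetical helper for illustration.)
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [n for n in reader.fieldnames if n != target_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[n]) for n in feature_names])
        y.append(row[target_column])
    return feature_names, X, y


# Made-up rows: two numeric attributes and one class label.
sample = (
    "length,width,label\n"
    "5.0,1.5,a\n"
    "6.5,4.5,b\n"
)
names, X, y = split_features_target(sample, "label")
print(names, X, y)
```

For a regression task the same split applies; only the target values would be converted to numbers instead of being kept as class labels.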

Financial

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Dow Jones Index	Weekly data of stocks from the first and second quarters of 2011.	Calculated values included, such as percentage change and lags.	750	Comma separated values	Classification, regression, time series	2014	^[336]^[337]	M. Brown et al.
Statlog (Australian Credit Approval)	Credit card applications either accepted or rejected and attributes about the application.	Attribute names are removed as well as identifying information. Factors have been relabeled.	690	Comma separated values	Classification	1987	^[338]^[339]	R. Quinlan
eBay auction data	Auction data for various eBay.com items over auctions of various lengths.	Contains all bids, bidderID, bid times, and opening prices.	~ 550	Text	Regression, classification	2012	^[340]^[341]	G. Shmueli et al.
Statlog (German Credit Data)	Binary credit classification into “good” or “bad” with many features.	Various financial features of each person are given.	1,000	Text	Classification	1994	^[342]	H. Hofmann
Bank Marketing Dataset	Data from a large marketing campaign carried out by a large bank.	Many attributes of the clients contacted are given. Whether the client subscribed to a term deposit is also given.	45,211	Text	Classification	2012	^[343]^[344]	S. Moro et al.
Istanbul Stock Exchange Dataset	Several stock indexes tracked for almost two years.	None.	536	Text	Classification, regression	2013	^[345]^[346]	O. Akbilgic
Default of Credit Card Clients	Credit default data for Taiwanese creditors.	Various features about each account are given.	30,000	Text	Classification	2016	^[347]^[348]	I. Yeh

Weather

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Cloud DataSet	Data about 1024 different clouds.	Image features extracted.	1024	Text	Classification, clustering	1989	^[349]	P. Collard
El Nino Dataset	Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.	12 weather attributes are measured at each buoy.	178080	Text	Regression	1999	^[350]	Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset	Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather.	None.	2921	Text	Regression	2015	^[351]	D. Lucas
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory	Continuous air samples in Hawaii, USA. 44 years of records.	None.	44 years	Text	Regression	2001	^[352]	Mauna Loa Observatory
Ionosphere Dataset	Radar data from the ionosphere. Task is to classify into good and bad radar returns.	Many radar features given.	351	Text	Classification	1989	^[303]^[353]	Johns Hopkins University
Ozone Level Detection Dataset	Two ground ozone level datasets.	Many features given, including weather conditions at time of measurement.	2536	Text	Classification	2008	^[354]^[355]	K. Zhang et al.

Census

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Adult Dataset	Census data from 1994 containing demographic features of adults and their income.	Cleaned and anonymized.	48,842	Comma separated values	Classification	1996	^[356]	United States Census Bureau
Census-Income (KDD)	Weighted census data from the 1994 and 1995 Current Population Surveys.	Split into training and test sets.	299,285	Comma separated values	Classification	2000	^[357]^[358]	United States Census Bureau
IPUMS Census Database	Census data from the Los Angeles and Long Beach areas.	None	256,932	Text	Classification, regression	1999	^[359]	IPUMS
US Census Data 1990	Partial data from 1990 US census.	Results randomized and useful attributes selected.	2,458,285	Text	Classification, regression	1990	^[360]	United States Census Bureau

Transit

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Bike Sharing Dataset	Hourly and daily count of rental bikes in a large city.	Many features, including weather, length of trip, etc., are given.	17,389	Text	Regression	2013	^[361]^[362]	H. Fanaee-T
New York City Taxi Trip Data	Trip data for yellow and green taxis in New York City.	Gives pick up and drop off locations, fares, and other details of trips.	6 years	Text	Classification, clustering	2015	^[363]	New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD	Trajectories of all taxis in a large city.	Many features given, including start and stop points.	1,710,671	Text	Clustering, causal-discovery	2015	^[364]^[365]	M. Ferreira et al.

Internet

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Webpages from Common Crawl 2012	Large collection of webpages and how they are connected via hyperlinks.	None.	3.5B	Text	Clustering, classification	2013	^[366]	V. Granville
Internet Advertisements Dataset	Dataset for predicting if a given image is an advertisement or not.	Features encode geometry of ads and phrases occurring in the URL.	3279	Text	Classification	1998	^[367]^[368]	N. Kushmerick
Internet Usage Dataset	General demographics of internet users.	None.	10,104	Text	Classification, clustering	1999	^[369]	D. Cook
URL Dataset	120 days of URL data from a large conference.	Many features of each URL are given.	2,396,130	Text	Classification	2009	^[370]^[371]	J. Ma
Phishing Websites Dataset	Dataset of phishing websites.	Many features of each site are given.	2456	Text	Classification	2015	^[372]	R. Mustafa et al.
Online Retail Dataset	Online transactions for a UK online retailer.	Details of each transaction given.	541,909	Text	Classification, clustering	2015	^[373]	D. Chen
Freebase Simple Topic Dump	Freebase is an online effort to structure all human knowledge.	Topics from Freebase have been extracted.	large	Text	Classification, clustering	2011	^[374]^[375]	Freebase
Farm Ads Dataset	The text of farm ads from websites. Binary approval or disapproval by content owners is given.	SVMlight sparse vectors of text words in ads calculated.	4143	Text	Classification	2011	^[376]^[377]	C. Masterharm et al.

Games

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Poker Hand Dataset	5 card hands from a standard 52 card deck.	Attributes of each hand are given, including the Poker hands formed by the cards it contains.	1,025,010	Text	Regression, classification	2007	^[378]	R. Cattral
Connect-4 Dataset	Contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet, and in which the next move is not forced.	None.	67,557	Text	Classification	1995	^[379]	J. Tromp
Chess (King-Rook vs. King) Dataset	Endgame Database for White King and Rook against Black King.	None.	28,056	Text	Classification	1994	^[380]^[381]	M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset	King+Rook versus King+Pawn on a7.	None.	3196	Text	Classification	1989	^[382]	R. Holte
Tic-Tac-Toe Endgame Dataset	Binary classification for win conditions in tic-tac-toe.	None.	958	Text	Classification	1991	^[383]	D. Aha

Other multivariate

Dataset Name	Brief description	Preprocessing	Instances	Format	Default Task	Created (updated)	Reference	Creator
Housing Data Set	Median home values of Boston with associated home and neighborhood attributes.	None.	506	Text	Regression	1993	^[384]	D. Harrison et al.
The Getty Vocabularies	Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials.	None.	large	Text	Classification	2015	^[385]	Getty Center
Yahoo! Front Page Today Module User Click Log	User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page.	Conjoint analysis with a bilinear model.	45,811,883 user visits	Text	Regression, clustering	2009	^[386]^[387]	Chu et al.
British Oceanographic Data Centre	Biological, chemical, physical and geophysical data for oceans. 22K variables tracked.	Various.	22K variables, many instances	Text	Regression, clustering	2015	^[388]	British Oceanographic Data Centre
Congressional Voting Records Dataset	Voting data for all USA representatives on 16 issues.	Beyond the raw voting data, various other features are provided.	435	Text	Classification	1987	^[389]	J. Schlimmer
Entree Chicago Recommendation Dataset	Record of user interactions with Entree Chicago recommendation system.	Each user's usage of the app is recorded in detail.	50,672	Text	Regression, recommendation	2000	^[390]	R. Burke
Insurance Company Benchmark (COIL 2000)	Information on customers of an insurance company.	Many features of each customer and the services they use.	9,000	Text	Regression, classification	2000	^[391]^[392]	P. van der Putten
Nursery Dataset	Data from applicants to nursery schools.	Data about applicant’s family and various other factors included.	12,960	Text	Classification	1997	^[393]^[394]	V. Rajkovic et al.
University Dataset	Data describing attributes of a large number of universities.	None.	285	Text	Clustering, classification	1988	^[395]	S. Sounders et al.
Blood Transfusion Service Center Dataset	Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc.	None.	748	Text	Classification	2008	^[396]^[397]	I. Yeh
Record Linkage Comparison Patterns Dataset	Large dataset of records. Task is to link relevant records together.	Blocking procedure applied to select only certain record pairs.	5,749,132	Text	Classification	2011	^[398]^[399]	University of Mainz
Nomao Dataset	Nomao collects data about places from many different sources. Task is to detect items that describe the same place.	Duplicates labeled.	34,465	Text	Classification	2012	^[400]^[401]	Nomao Labs
Movie Dataset	Data for 10,000 movies.	Several features for each movie are given.	10,000	Text	Clustering, classification	1999	^[402]	G. Wiederhold
Open University Learning Analytics Dataset	Information about students and their interactions with a virtual learning environment.	None.	~ 30,000	Text	Classification, clustering, regression	2015	^[403]^[404]	J. Kuzilek et al.

Ref:

List of datasets for machine learning research - Wikipedia

你可能感兴趣的:(深度学习)

从LLM出发：由浅入深探索AI开发的全流程与简单实践（全文3w字）码事漫谈 AI 人工智能
文章目录第一部分：AI开发的背景与历史1.1人工智能的起源与发展1.2神经网络与深度学习的崛起1.3Transformer架构与LLM的兴起1.4当前AI开发的现状与趋势第二部分：AI开发的核心技术2.1机器学习：AI的基础2.1.1机器学习的类型2.1.2机器学习的流程2.2深度学习：机器学习的进阶2.2.1神经网络基础2.2.2深度学习的关键架构2.3Transformer架构：现代LLM的核
java实现卷积神经网络CNN（附带源码） Katie。 Java 实战项目 java
Java实现卷积神经网络（CNN）项目详解目录项目概述1.1项目背景与意义1.2什么是卷积神经网络（CNN）1.3卷积神经网络的应用场景相关知识与理论基础2.1神经网络与深度学习概述2.2卷积操作与卷积层原理2.3激活函数与池化层2.4全连接层与损失函数2.5前向传播、反向传播与梯度下降项目需求与分析3.1项目目标3.2功能需求分析3.3性能与扩展性要求3.4异常处理与鲁棒性考虑系统设计与实现思路
从0到1构建AI深度学习视频分析系统--基于YOLO 目标检测的动作序列检查系统：（2）消息队列与消息中间件 shiter 人工智能系统解决方案与技术架构人工智能深度学习音视频
文章大纲原始视频队列Python内存视频缓存优化方案（4GB以内）一、核心参数设计二、内存管理实现三、性能优化策略四、内存占用验证五、高级优化技巧六、部署建议检测结果队列YOLO检测结果队列技术方案一、技术选型矩阵二、核心实现代码三、性能优化策略四、可视化方案对比五、部署建议逻辑判定队列时间片图论时间序列大模型引入参考文献原始视频队列想要在单机内存中缓存1-5分钟的视频片段，python技术栈的话
从零开始大模型开发与微调：PyCharm的下载与安装 AI天才研究院 AI大模型企业级应用开发实战 AI大模型应用入门实战与进阶 DeepSeek R1 &大数据AI人工智能大模型计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
从零开始大模型开发与微调：PyCharm的下载与安装1.背景介绍随着人工智能和深度学习技术的不断发展,大型语言模型(LargeLanguageModels,LLMs)已经成为当前最引人注目的研究热点之一。LLMs能够在各种自然语言处理任务上展现出惊人的性能,例如机器翻译、文本生成、问答系统等。PyTorch和TensorFlow等深度学习框架为训练和微调大型语言模型提供了强大的支持。PyCharm
遗传算法与深度学习实战（2）——生命模拟及其应用盼小辉丶遗传算法与深度学习实战深度学习人工智能遗传算法
遗传算法与深度学习实战（2）——生命模拟及其应用0.前言1.康威生命游戏1.1康威生命游戏的规则1.2实现康威生命游戏1.3空间生命和智能体模拟2.实现生命模拟3.生命模拟应用小结系列链接0.前言生命模拟是进化计算的一个特定子集，模拟了自然界中所观察到的自然过程，例如粒子或鸟群的聚集方式。生命模拟只是用来探索和优化问题的模拟形式之一，还有很多其他形式的模拟，可以更好地建模各种过程，但它们都源于康威
PyTorch从入门到精通：探索深度学习新境界 lmtealily 深度学习 pytorch 人工智能
引言PyTorch作为当前最受欢迎的深度学习框架之一，凭借其动态计算图的独特设计和与Python生态的无缝集成，正重塑着人工智能开发的新范式1。从NVIDIA的研究实践到Meta的产业应用，PyTorch的价值已渗透至学术研究、工业部署的每个角落。本文将带领您从张量操作基础开始，逐步探索GPU加速、动态图机制、框架生态集成等高级主题，最终实现理论与实战的双重突破。一、PyTorch核心基础构建1.
【Python】已解决：pip安装第三方模块（库）与PyCharm中不同步的问题（PyCharm添加本地python解释器）屿小夏 python pip pycharm
个人简介：某不知名博主，致力于全栈领域的优质博客分享|用最优质的内容带来最舒适的阅读体验！文末获取免费IT学习资料！文末获取更多信息精彩专栏推荐订阅收藏专栏系列直达链接相关介绍书籍分享点我跳转书籍作为获取知识的重要途径，对于IT从业者来说更是不可或缺的资源。不定期更新IT图书，并在评论区抽取随机粉丝，书籍免费包邮到家AI前沿点我跳转探讨人工智能技术领域的最新发展和创新，涵盖机器学习、深度学习、自然
YOLOv5+UI界面在车辆检测中的应用与实现深度学习&目标检测实战项目 YOLOv5实战项目 YOLO ui 分类数据挖掘目标跟踪人工智能
1.引言随着智能交通系统（ITS）的快速发展，车辆检测已成为计算机视觉领域的重要研究方向。车辆检测技术广泛应用于交通流量监控、车辆违章抓拍、无人驾驶等场景中。近年来，深度学习技术的突破，特别是卷积神经网络（CNN）的崛起，使得目标检测技术取得了显著进展。其中，YOLO（YouOnlyLookOnce）系列模型以其高效的实时检测能力和出色的性能成为车辆检测领域的首选方法之一。在本文中，我们将基于YO
DeepSeek：技术教育领域的AI变革者——从理论到实践的全面解析量子纠缠BUG DeepSeek DeepSeek部署 AI 人工智能 python
一、技术教育为何需要DeepSeek？在数字化转型的浪潮下，技术教育面临着知识更新快、实践门槛高、个性化需求强三大核心挑战。传统的教学模式难以满足开发者快速掌握前沿技术、构建复杂系统能力的需求。DeepSeek作为国产开源大模型的代表，凭借其推理能力、多模态支持与低成本部署的特性，正在为技术教育带来突破性解决方案。二、DeepSeek赋能技术教育的核心技术优势1.推理能力驱动深度学习思维链（CoT
【人工智能基础2】机器学习、深度学习总结 roman_日积跬步-终至千里人工智能习题人工智能机器学习深度学习
文章目录一、人工智能关键技术二、机器学习基础1.监督、无监督、半监督学习2.损失函数：四种损失函数3.泛化与交叉验证4.过拟合与欠拟合5.正则化6.支持向量机三、深度学习基础1、概念与原理2、学习方式3、多层神经网络训练方法一、人工智能关键技术领域基础原理与逻辑机器学习机器学习基于数据，研究从观测数据出发寻找规律，利用这些规律对未来数据进行预测。基于学习模式，机器学习可以分为监督、无监督、强化学习
一文搞懂 AI Agent 与 AI 大模型的区别 a小胡哦人工智能 Manus Ai agent
在人工智能蓬勃发展的当下，新术语和新技术层出不穷。AIAgent和AI大模型便是其中的“明星”，但不少人对它们的区别感到困惑。今天，我们就以Manus这类AIAgent为例，深入剖析AIAgent与一般AI大模型的不同之处。Manus：Manus定义与核心能力AI大模型AI大模型是基于深度学习架构，通过海量数据训练得到的复杂模型，像GPT-4、文心一言等。它们具备强大的知识储备和语言理解生成能力，
清华大学《DeepSeek赋能家庭教育》深度解析：AI如何重塑现代家庭教育模式硅基打工人 AI 人工智能经验分享大数据开源语言模型
引言：家庭教育的困境与AI的破局在数字化与智能化浪潮下，家庭教育面临多重挑战：家长教育能力不足、教育资源分配不均、亲子沟通效率低下、个性化需求难以满足等。清华大学发布的《DeepSeek赋能家庭教育》系列报告（共56页）提出了一种基于人工智能的解决方案，通过深度学习平台DeepSeek，为家庭教育注入科技动能。本文将从技术原理、核心功能、应用场景、伦理安全及未来展望等多维度展开分析。一、DeepS
Spring深度学习 — 关于 Spring 搬运Gong Spring spring
前言作为一名Java程序猿，相信对Spring都不陌生，那么我们经常使用的Spring的发展史大家都了解过吗？它是如何来的？又是如何一步一步成长到了现在这种不可替代的重要地位？下面将对Spring进行一个整体认知和学习，对后面的深度学习起到铺垫作用。本文意在对知识点的温顾，如文中有写的不对的地方，还望不吝指教。一、Spring的发展史相信经历过不使用框架开发Web项目的70后、80后都会高如此感触
Python--读取mat文件一头大学牲程序--编程记录 python 开发语言深度学习机器学习
最近在进行学习深度学习过程中，遇到了以MATLAB的.mat格式存储的数据，需要用python读取出来处理，于是就找到了以下比较方便的三种python读取mat文件的方法：使用hdf5库来读取mat文件1.使用scipy.io来读取1.5知识小插曲2.使用hdf5来读取3.使用mat73来读取1.使用scipy.io来读取-如果你的matlab的版本比较旧，保存的.mat格式为‘-v7.3’以前的
AI笔记——语音识别 Yuki-^_^ 人工智能 AI 人工智能笔记语音识别
摘要：语音识别（AutomaticSpeechRecognition,ASR）是人工智能领域的一项重要技术，它将人类的语音信号转换成文字。随着科技的发展，语音识别已经成为现代生活和工作中不可或缺的一部分。本文旨在介绍语音识别的基本原理、关键技术、应用场景以及未来发展趋势。一、历史与发展语音识别技术的历史可以追溯到20世纪50年代，那时的技术基于规则和模板。随着计算能力的提升和深度学习方法的出现，语
Manus（一种AI代理或自动化工具）与DeepSeek（一种强大的语言模型或AI能力）结合使用任务自动化和智能决策 zzlyx99 人工智能自动化语言模型
一、Manus与DeepSeek差异十分好奇DeepSeek和Manus究竟谁更厉害些，DeepSeek是知识型大脑，Manus则是全能型执行者。即DeepSeek专注于语言处理、知识整合与专业文本生成。其核心优势在于海量参数支持的深度学习和知识推理能力，例如撰写论文、润色法律合同、解答专业问题等。Manus则更强调从规划到交付的闭环能力。它通过工具链调用（如浏览器、代码编辑器）自主执行复杂任务，