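R ships with many built-in datasets for learning and experimentation. Below, for each algorithm I use regularly, I pick one dataset purely for practice; of course, a single dataset can often serve several algorithms. For more datasets, see: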
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
1. attitude
Used for linear regression.
From a survey of the clerical employees of a large financial organization, the data are aggregated from the questionnaires of the approximately 35 employees for each of 30 (randomly selected) departments. The numbers give the percent proportion of favourable responses to seven questions in each department.
A data frame with 30 observations on 7 variables. The first column gives the short names from the reference, the second the variable names in the data frame:
Y | rating | numeric | Overall rating |
X[1] | complaints | numeric | Handling of employee complaints |
X[2] | privileges | numeric | Does not allow special privileges |
X[3] | learning | numeric | Opportunity to learn |
X[4] | raises | numeric | Raises based on performance |
X[5] | critical | numeric | Too critical |
X[6] | advance | numeric | Advancement |
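A minimal sketch of the linear regression use mentioned above, assuming the goal is to regress rating on the other six columns. The object names fit and fit_small, and the choice of complaints and learning for the reduced model, are illustrative assumptions rather than part of the dataset documentation.

```r
# Full model: rating explained by all six other columns of attitude
fit <- lm(rating ~ ., data = attitude)
summary(fit)

# Reduced model with two predictors (an arbitrary choice for illustration)
fit_small <- lm(rating ~ complaints + learning, data = attitude)

# F-test comparing the reduced model against the full one
anova(fit_small, fit)
```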
2. infert
Generalized linear model, binomial family (logistic regression)
This is a matched case-control study dating from before the availability of conditional logistic regression.
1. | Education | 0 = 0-5 years; 1 = 6-11 years; 2 = 12+ years |
2. | age | age in years of case |
3. | parity | count |
4. | number of prior induced abortions | 0 = 0; 1 = 1; 2 = 2 or more |
5. | case status | 1 = case; 0 = control |
6. | number of prior spontaneous abortions | 0 = 0; 1 = 1; 2 = 2 or more |
7. | matched set number | 1-83 |
8. | stratum number | 1-63 |
Example:
model1 <- glm(case ~ spontaneous + induced, data = infert, family = binomial())
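As a rough sketch of how the fitted logistic model might be inspected, the block below refits the same model and converts the coefficients to odds ratios. The names odds_ratios and fitted_prob are illustrative, and this plain glm ignores the matched design of the study.

```r
# Same model as in the example: case status vs. prior abortions
model1 <- glm(case ~ spontaneous + induced, data = infert, family = binomial())
summary(model1)

# Exponentiated coefficients are odds ratios on the original scale
odds_ratios <- exp(coef(model1))
odds_ratios

# Fitted probability of being a case for each observation
fitted_prob <- predict(model1, type = "response")
head(fitted_prob)
```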
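3. Generalized linear regression, Poisson family
I could not find a test set among R's built-in datasets that is meant purely for Poisson regression, so the following data can be used instead: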
http://www.ats.ucla.edu/stat/data/poisson_sim.csv
p <- read.csv("http://www.ats.ucla.edu/stat/data/poisson_sim.csv")
m1 <- glm(num_awards ~ prog + math, family = "poisson", data = p)
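If the CSV is not reachable, the same kind of fit can be tried on simulated counts. This is only a sketch: the sample size, the true coefficients 0.5 and 0.3, and the names sim and m_sim are assumptions made for illustration and have nothing to do with the poisson_sim data.

```r
set.seed(1)

# Simulate counts whose log-mean is linear in x (assumed true model)
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
sim <- data.frame(x = x, y = y)

# Poisson regression with the default log link
m_sim <- glm(y ~ x, family = poisson(), data = sim)

# Estimates should land close to the values used in the simulation
coef(m_sim)
```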
4. BOD: (intrinsically) nonlinear regression
The BOD data frame has 6 rows and 2 columns giving the biochemical oxygen demand versus time in an evaluation of water quality.
This data frame contains the following columns:
Time | A numeric vector giving the time of the measurement (days). |
demand | A numeric vector giving the biochemical oxygen demand (mg/l). |
fm1 <- nls(demand ~ A*(1-exp(-exp(lrc)*Time)), data = BOD, start = c(A = 20, lrc = log(.35)))
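A short sketch of what can be done with the fitted nls object, assuming the same model and starting values as above. Here A is the asymptote and exp(lrc) the rate constant; the names rate and pred, and the prediction at Time = 8, are illustrative choices.

```r
# Asymptotic growth through the origin, with the log rate constant as parameter
fm1 <- nls(demand ~ A * (1 - exp(-exp(lrc) * Time)),
           data = BOD, start = c(A = 20, lrc = log(0.35)))

# Estimated asymptote A and log rate constant lrc
coef(fm1)

# Back-transform lrc to the rate constant itself
rate <- exp(coef(fm1)[["lrc"]])

# Predicted demand at a time point not in the data (day 8)
pred <- predict(fm1, newdata = data.frame(Time = 8))
pred
```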
5. cats from MASS: used for the CART algorithm; of course, almost any dataset whose target variable is categorical can be used for CART
> library(MASS)
> data(cats)
> names(cats)
[1] "Sex" "Bwt" "Hwt"
> head(cats)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
> library(rpart)
> cats_rpart_model <- rpart(Sex ~ ., data = cats)
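One way to check the tree, sketched under the assumption that both MASS and rpart are installed. The names pred_sex, confusion and accuracy are illustrative. Predictions are made on the training data itself, so the resulting accuracy is optimistic.

```r
library(MASS)   # provides the cats data
library(rpart)  # CART implementation

# Classification tree: Sex predicted from body weight and heart weight
cats_rpart_model <- rpart(Sex ~ ., data = cats)

# Predicted class for every row of the training data
pred_sex <- predict(cats_rpart_model, newdata = cats, type = "class")

# Confusion matrix and (in-sample) accuracy
confusion <- table(actual = cats$Sex, predicted = pred_sex)
confusion
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```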
6. USArrests: used for principal component analysis (princomp) and factor analysis (factanal)
This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
USArrests
A data frame with 50 observations on 4 variables.
[,1] | Murder | numeric | Murder arrests (per 100,000) |
[,2] | Assault | numeric | Assault arrests (per 100,000) |
[,3] | UrbanPop | numeric | Percent urban population |
[,4] | Rape | numeric | Rape arrests (per 100,000) |
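A minimal sketch of the principal component analysis mentioned in the heading; factor analysis is not covered here. Using cor = TRUE is an assumption made because the four variables are on very different scales, and the names pca and prop_var are illustrative.

```r
# PCA on the correlation matrix, so each variable is standardised first
pca <- princomp(USArrests, cor = TRUE)
summary(pca)

# Share of total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var

# Loadings show how each original variable contributes to the components
loadings(pca)

# Scores of the first few states on the first two components
head(pca$scores[, 1:2])
```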