This is a final review of Introduction of Data Analysis with R. T h i s i s t h e w a y ! ! ! ^{^{This~is~the~way!!!}} This is the way!!!
command | explanation |
---|---|
Inf | infinity, e.g. 1/0=Inf |
NaN | not a number, e.g. 0/0=NaN |
NULL | empty |
basic command | explanation |
rm(list=ls()) , rm(a) |
remove all objects, or a single object in workspace |
class() #creat type , mode(a) #storing type > typeof( ) #for single value |
show the variable type, e.g. c<-factor( data.frame(1,2,3)) class=“factor”; typeof=“integer”; mode=“numeric” |
length(a) |
show the length of the character |
%% , %/% , %*% , %in% , ! , t() |
modulus, integer divide, matrix multiply, is a in b?, not, transpose |
options(digits=x) |
control number of displayed digits |
source('xx.r', echo=T) |
load History file, echo=T show command and output/ echo=F only show output |
Sys.setlocale("LC_ALL","English") |
set system language |
setwd('xx') , dir() |
set path, show files in current path |
print(xxx) , cat(numeric, character...,'hello world \n') |
show variable value |
set.seed() |
Important type | generation command | ralated functions |
---|---|---|
integer/ complex/ double | b<-1 ,a<-c(1,2,3) |
mode(a) , length(a) , a<2 (generate a sequence of logic vector), as.numeric() (where char converted into NA), ceiling , floor , trunc (ignore decimal), round(x, digits=#) (round(3.5)=4, round(2.5)=2) |
sequence | x1:x_n , c(x1,..,xn) , seq(x1,xn,step) |
x<-c(1,2,3,4) (x[4] is 4, x[-4] is 1,2,3), cumsum , cumprod (cumulative product), range , rep (a repeating sequence of number) |
random sample | sample(c(..),size=, replace=T/F, prob=c(...)) |
used with set.seed() |
matrix (every entry should be of same type) | matrix(sequence, nrow=, ncol=, byrow=T/F) |
dim() , nrow() , ncol() , m[row,col] (indexing), rbind() (row bind), cbind() , x%*%x : inter product, x%o%x : outer product, diag(seq) : create diagonal matrix or retrieve diagonal entry in a matrix, solve(A,b) : find solution, solve(A) : find inverse |
every entry of a list can be anything: w<-list(a,b,c)
related functions | explanation |
---|---|
names(w) or names(w)<-c('x1,'x2,'x3') |
show name of w’s index, or give w entry name |
w[[1]] or w$x1 |
indexing |
unlist() |
|
max(x) |
mx<-value, for (xxx) mx<-max(mx, value) : nice comparison method |
entry can be of different type: integer/numeric/factor: d<-data.frame()
, d<-read.csv('xx.csv')
related functions | explanation |
---|---|
names(d) or names(d)<-c('xxx') |
|
head(d, n=4L) , tail(d) |
display first(last) 4 lines(10 lines as default) |
d[row,col] or d$xx or d[d$xx=='xxx',] |
indexing or select data |
dim() , nrow() , ncol() |
basic info of dataframe |
mean(x, na.rm=T) , colMeans(x, na.rm=T) , rowMeans() |
total mean or column mean, romove NA is false as default |
var(x, na.rm=T) , sd(x, na.rm=T) |
sample variance, sample sd |
median(x, na.rm=T) |
|
apply(x, MARGIN=,FUN=) |
margin=1 by row, 2 by col; function can be user-defined |
d[order(d$xx),] |
sort by row xx, order return a sequence |
d[complete.cases(d),] or na.omit(d) |
select non-NA complete cases, not work on factor row/col |
merge(target, origin, by='Column name') |
combine data with unique key in by='' |
by(just data part in dataset, classification standard in dataset, function) |
apply function on data using classification, and generate a summary table |
time series data | |
date<-as.Date(d$Date,"$d/$m/$Y") |
change Date in dataframe d to date object in special fromat |
wkday<-weekdays(date) or wkday<- ordered(weekdays(date), c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) |
find the corresponding weekday, the later code is ordered, instead of order |
unique(wkday) |
display the unique levels in wkday |
ted<-as.ts(d$ClosePrice) |
change ClosePrice in d to time series data |
lag(ted) |
shift the time series 1 unit to the left and return the next value of the time series. |
by(ted, wkday, mean) |
just a summary |
write file | |
write.csv(d, file='xxx.scv', row.names=F |
row.names=F suppress write number index in the first column |
variable type | generation command | related functions |
---|---|---|
logical | TRUE FALSE |
& | (vec and/or), && || (control and/or, len=1) |
charcater | z <- c('a','b',"c") |
combine above types: c(num,logic)->num,c(num,char)->char |
factor |
related functions | explanation |
---|---|
par(mfrow=c(row, col)) |
define row × \times ×col multiple graph, fill-in by row; need to be reset next time |
par(mar=c(x1,x2,x3,x4)) |
define marginal width of graph, bottom=x1, left=x2, top=x3, right=x4 |
plot(x,y,main='title', type='',xlab='',ylab'', ylim=c(y1,y2)) , plot(x, y, pch=21, bg=c('red', 'blue')[binary]) |
type: “p” for points, “l” for lines, “b” for both, “c” for the lines part alone of “b”, “o” for both ‘overplotted’, “h” for ‘histogram’ like (or ‘high-density’) vertical lines, “s” for stair steps, “S” for other steps, see ‘Details’ below, “n” for no plotting.//pch represent dot type, bg means use red for binary=1 and blue for binary=2 |
plot(d$"classification standard column"~d$"data column") , plot(d) |
in plot(d) , d is a dataframe, R will plot it by applying plot(d$"classification standard column"~d$"data column" multiple times |
abline(a=, b=, h=, v=) |
a is intercept, b is slope, h is horizontal line, v is vertical line |
lines(c(x0, x1), c(y0, y1)) , lines(x_seq, y_seq), lty=) |
draw line from ( x 0 , y 0 ) (x_0,y_0) (x0,y0) to ( x 1 , y 1 ) (x_1,y_1) (x1,y1), or just a line through multiple points. lty : line style |
curve(y, x_start, x_end) |
draw curve ⇔ \Leftrightarrow ⇔ plot(x,y, type='l') |
barplot(a table variable, beside=T, horiz=F, legend.text=c('xx'), main='xx', args.legend=list(horiz=T, bty='n', cex=0.6), ylim=c(0.2)) |
log-transformation is frequently used. If beside=F , stack bar will be used. horiz=T will transpose the barplot |
boxplot() | |
hist(x, freq=F, main='...') |
freq=F produce histogram with density instead of frequency |
qqnorm(x, main='..') , qqline(x) |
QQ-plot |
name <- function(x,y,..., OR no input) {
...a bunch of cumputation...
output
}
name <- function(...){
arg <- list(...)
...}
or define own operator:
"%+-%" <- function(miu,criticalRegion){c(miu-criticalRegion,miu+criticalRegion)}
input
: if a variable has no input value, it will be automatically assigned NA
, so we can use if
to test. if input
is ...
, then numbers of input is undefined.
output
can be a single variable (not like in python, we have to write return output
) or c(x1=var1, x2=var2,...)
or outp<-list(var); names(outp)<-c('x')
or list(x1=var1, x2=var2,...)
as multiple output in one list.
fix(name)
can modify function ⇔ \Leftrightarrow ⇔ edit(file='name')
& source(‘name’) to make effect
proc.time()
: show current time
for (i in 1:len) {
a[i] <- xxx
}
i <- 1
while (i <= Len){
a[i] <- xxx
i <- I+1
}
repeat {
if (i>Len) break
a[i] <- xxx
i<- I+1
}
if (condition){
} else if (condition){
} else if (condition){
}
or
ifelse(condition, expression 1 given T, expression 2 given F)
ifelse(condition, ifelse(condition, ...), ifelse(condition, ...))
good for classification into several interval
swith(expression, 'expr I'=one type of computation, 'expr II'=another, ...)
terminates current loop and move to next
terminates current loop and return expression( or func name in recursion)
terminates current loop and give a warning message
get user’s input as a char.
simulation: set.seed()
⇒ \Rightarrow ⇒ sample()
distribution: d, p, q, r
+distribution_name(parameters)
, show density, cdf, quantiles, random number of certain distribution. E.g., dnorm(0,0,1)
, rnorm(#numbers,0,1)
code | Distribution | code | Distribution | code | Distribution |
---|---|---|---|---|---|
beta(x, α \alpha α, β \beta β) | beta | binom(x, size, prob) | binomial | cauchy(x, location, scale) | Cauchy |
chisq(x, df) | chi-squared | exp(x, rate) | exponential | f(x, df1, df2) | F |
gamma(x, shape, sclae) | gamma | geom(x, prob) | geometric | hyper(x, m, n, k) | hypergeometric |
lnorm(x, meanlog, sdlog) | log-normal | logis(x, location, scale) | logistic/ multinomial | nbinom(x, size, prob) | negative binomial |
norm(x, mean, sd) | normal | pois(x, λ \lambda λ) | Poisson | t(x, df) | Student’s t |
unif(x, min, max) | uniform | weibull | Weibull | wilcox(x, m, n) | Wilcoxon |
1-pXXX(...)
: find P ( X > x ) P(X>x) P(X>x)
one sample t test: t.test(x, mu=#, alt='less'/'greater'/two.sided')
. two.sided
is default.
two sample t test: t.test(x, y, alt='', var.eq=F/T, paired=F/T)
. var.eq=T
is two sample t test, F
is Welch’s t test(default). paired=T
is paired t test, F
is default.
one-way table
cut()
: slice x sequence into several part and specify each part with labels
result <- cut(x_seq, breaks=c(n1,n2,...), lables=c('...', '...',...))
table(result)
table(d$Name)
: summary of a factor/character column
two-way table
table(d$Name, result)
: Cartesian product of two one-way tables
prob.table(table(d$Name, result), margin=)
: create frequency table. margin=1
means P ( r o w ) = 1 P(row)=1 P(row)=1 (referred to row dimension), 2
means P ( c o l u m n ) = 1 P(column)=1 P(column)=1. Probability table can also be created using apply()
, rowMeans()
, colMeans()
.
chi-square goodness of fit test: chisq.test(x, p=)
: p=sequence
is the assumed distribution.
contingenty table on independence: chisq.test(x,y)
: y is considered when x is a factor
reg1 <- lm(y~x); summary(reg1)
plot(x,y);abline(reg1) #use reg1$coef
lm
is linear regression function. In output, y-reg1$fit
=reg1$resid
. And multiple R-squared measures is cor(x,y,use='complete')
.
Check 4 assumptions of linear regression model and ways to check data: linearity(residual vs fitted value, residual vs index), normality(QQ-plot), independence(residual[i] vs residual[i-1]), constant variance(residual vs fitted value, residual vs index).
# produce 4 residual plots
# input residual vector and fitted values
#
residplot<-function(resid,fit) {
par(mfrow=c(2,2)) # define 2x2 multiframe grahpic
n<-length(resid) # get no of points
plot(fit,resid); abline(h=0) # plot e(i) vs fit(i), add x-axis to plot
plot(1:n,resid); abline(h=0) # plot e(i) vs i, add x-axis to plot
plot(resid[1:(n-1)],resid[2:n]); abline(h=0) # plot e(i) vs e(i-1), add x-axis to plot
qqnorm(resid); qqline(resid) # QQ-normal plot of e(i)
par(mfrow=c(1,1)) # reset multiframe graphic
}
If residual converges when n becomes larger, the randomness is false, then it’s not a linear form.
Then H 0 : α = 0 H_0: \alpha=0 H0:α=0, H 0 : β = 0 H_0:\beta=0 H0:β=0 testing on y i = α + β × x i + ϵ i y_i=\alpha+\beta \times x_i+\epsilon_i yi=α+β×xi+ϵi, where ϵ ∼ N ( 0 , σ 2 ) \epsilon \sim N(0, \sigma^2) ϵ∼N(0,σ2):
We have test statistics: , T = a ^ σ ^ × ( 1 n + x ˉ 2 S x x ) T=\frac{\hat{a}}{\hat{\sigma} \times \sqrt(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})} T=σ^×(n1+Sxxxˉ2)a^, T = b ^ σ ^ S x x T=\frac{\hat{b}}{\frac{\hat{\sigma}}{\sqrt{S_{xx}}}} T=Sxxσ^b^, where S x x = ∑ i = 1 n ( x i − x ˉ ) S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x}) Sxx=∑i=1n(xi−xˉ)
For [ y 1 . . . y n ] n × 1 \begin{bmatrix} y_1\\.\\.\\.\\y_n \end{bmatrix}_{n\times 1} ⎣⎢⎢⎢⎢⎡y1...yn⎦⎥⎥⎥⎥⎤n×1 = = = [ 1 x 11 . . . x 1 p . . . . . . . . . 1 x n 1 . . . x n p ] n × ( p + 1 ) \begin{bmatrix} 1 & x_{11} & ... & x_{1p}\\. & . & & . \\. & . & & . \\. & . & & . \\1 & x_{n1} & ... & x_{np} \end{bmatrix}_{n\times (p+1)} ⎣⎢⎢⎢⎢⎡1...1x11...xn1......x1p...xnp⎦⎥⎥⎥⎥⎤n×(p+1) × \times × [ β 0 β 1 . . β p ] ( p + 1 ) × 1 \begin{bmatrix} \beta_0\\\beta_1\\.\\.\\\beta_p \end{bmatrix}_{(p+1)\times 1} ⎣⎢⎢⎢⎢⎡β0β1..βp⎦⎥⎥⎥⎥⎤(p+1)×1 + + + [ ϵ 1 . . . ϵ n ] n × 1 \begin{bmatrix} \epsilon_1\\ . \\. \\. \\ \epsilon_{n} \end{bmatrix}_{n\times 1} ⎣⎢⎢⎢⎢⎡ϵ1...ϵn⎦⎥⎥⎥⎥⎤n×1, where ϵ ∼ N ( 0 , σ 2 ) \epsilon\sim N(0, \sigma^2) ϵ∼N(0,σ2):
We want to find Least Square Estimates of β \beta β so that Sum of Squared Error = ∑ i = 1 n ϵ 2 = ∑ i = 1 n ( y − X × β ) 2 =\sum_{i=1}^{n}{\epsilon^2}=\sum_{i=1}^{n}(y-X \times\beta)^2 =∑i=1nϵ2=∑i=1n(y−X×β)2 is minimized.
By minimizing SSE= ϵ ′ ϵ \epsilon'\epsilon ϵ′ϵ, we can find β ^ = ( X ′ X ) − 1 X ′ Y \hat{\beta}=(X'X)^{-1}X'Y β^=(X′X)−1X′Y with fitted value y ^ = X b \hat{y}=Xb y^=Xb and σ ^ 2 = ( y − X b ) ′ ( y − X b ) n − p − 1 \hat{\sigma}^2=\frac{(y-Xb)'(y-Xb)}{n-p-1} σ^2=n−p−1(y−Xb)′(y−Xb)
Code: reg<- lm(y~x1+x2+x3+..., data=d)
, x1,x2,… are column names in d.
Choose model: step(reg)
, then insignificant β i \beta_i βi will be dropped.
Y Y Y is binary variable, then π i = P ( Y i = 1 ∣ x i ) \pi_i=P(Y_i=1|x_i) πi=P(Yi=1∣xi) is probability of success. Then logistic model is: l n [ π i 1 − π i ] = β 0 + β 1 x i 1 + . . . + β p x i p ln[\frac{\pi_i}{1-\pi_i}]=\beta_0+\beta_1 x_{i1}+...+\beta_{p}x_{ip} ln[1−πiπi]=β0+β1xi1+...+βpxip. Assume LHS is linear on x i x_i xi, then π i = e x i ′ β 1 + e x i ′ β = 1 1 + e − x i ′ β ∈ [ 0 , 1 ] \pi_i=\frac{e^{x_{i}'\beta}}{1+e^{x_{i}'\beta}}=\frac{1}{1+e^{-x_{i}'\beta}}\in[0,1] πi=1+exi′βexi′β=1+e−xi′β1∈[0,1](logit transformation).
The likelihood function is L ( β ) = ∏ i = 1 n [ π i y i ( 1 − π i ) ( 1 − y i ) ] L(\beta)=\prod_{i=1}^{n}[\pi_{i}^{y_i}(1-\pi_{i})^{(1-y_i)}] L(β)=∏i=1n[πiyi(1−πi)(1−yi)], then do log-transformation and find MLE.
Code: reg2 <- glm(y~x1+x2+...+xn, data=d, binomial)
, binomial
is needed for logistic regression, where y is binomial variable. Also use step()
to drop insignificant regressor.
prediction:
b <- reg2$coef
X <- as.matrix(cbind(1,x1,x2,...))
c <- X%*%b
ps <- exp(c)/(1+exp(c)) # or directly use ps<-reg2$fit
pred <- ifelse(ps>0.5, "A", "B")
table(pred, oringinal list of "A"&"B") #show classification table
nls(y~f(x;a1,a2,a3,...), start=c(a1=#, a2=#, ...)
: nonlinear least square with parameters a 1 , a 2 , a 3 , . . . a_1, a_2, a_3,... a1,a2,a3,... and start value of all parameters are required to be used in searching the model.
nlsout <- summary(nls)
: to get coefficients, use nlsout$coefficients
, or use y-nlsout$res
to get fitted value.
If unclear form of y, then try log-transformation: log(y) or log(x). lm(log(y)~x
may fit a good nonlinear form.