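How statistics summarizes the characteristics of data
Distribution (discrete, continuous)
Description and correlation
Single variable: description
Central tendency, dispersion
Two variables: correlation
Joint variation (covariance, correlation coefficient)
Visual exploration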
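Data distribution
A distribution describes how probability is spread over the possible values
Which outcomes (values) are possible?
What is the probability of each outcome, or of a range of outcomes?
Visualization
Probability density plot
Cumulative distribution plot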
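The two plots named above can be sketched from data as follows. This is a minimal illustration, assuming the built-in airquality data set and its Temp column (which has no missing values); the cut points 70 and 80 are arbitrary choices made for the example.

```r
# Daily maximum temperatures from the built-in airquality data set
x <- airquality$Temp

# Kernel density estimate: a smoothed picture of where the values concentrate
d <- density(x)
plot(d, main = "Density of Temp")

# Empirical cumulative distribution function: Fn(t) = share of values <= t
Fn <- ecdf(x)
plot(Fn, main = "ECDF of Temp")

# Probability of a range, read off the ECDF: P(70 < Temp <= 80)
p_range <- Fn(80) - Fn(70)
p_range
```

The density plot answers "which values are likely", while the cumulative plot answers "what share of the data lies at or below a given value".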
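Common distributions
Discrete variables: binomial distribution, Poisson distribution
Continuous variables: uniform distribution, normal distribution, exponential distribution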
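For the distributions listed above, base R provides d/p/q/r function families (density or mass, cumulative probability, quantile, random draw). The sketch below is only an illustration; all parameter values are arbitrary assumptions chosen to keep the arithmetic easy to check.

```r
# Discrete: P(X = 3) for X ~ Binomial(size = 10, prob = 0.5), i.e. choose(10, 3) / 2^10
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Discrete: P(X <= 2) for X ~ Poisson(lambda = 3)
p_pois <- ppois(2, lambda = 3)

# Continuous: P(X <= 2.5) for X ~ Uniform(0, 10), i.e. 2.5 / 10
p_unif <- punif(2.5, min = 0, max = 10)

# Continuous: P(-1.96 < Z <= 1.96) for the standard normal, about 0.95
p_norm <- pnorm(1.96) - pnorm(-1.96)

# Continuous: P(X <= 1) for X ~ Exponential(rate = 2), i.e. 1 - exp(-2)
p_exp <- pexp(1, rate = 2)

c(binomial = p_binom, poisson = p_pois, uniform = p_unif,
  normal = p_norm, exponential = p_exp)
```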
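Knowing the full distribution is the ideal outcome of an analysis, but it is easier said than done. Often the distribution cannot be obtained, and in that case the data are characterized by numerical summaries instead.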
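Description: central tendency
Central tendency is the degree to which a set of data clusters around a central value; it indicates where the center of the data lies.
Categorical variables
Mode
Numeric variables
Mean (trimmed mean, arithmetic mean, weighted mean, geometric mean)
When the mean is not useful
It is a reasonable summary only for unimodal distributions
It is distorted by extreme values
Some quantities cannot simply be summed and divided (growth rates, for example, call for the geometric mean)
Median (left-skewed, right-skewed)
Quantiles
The median and quantiles are often a better summary than the mean
Historically the mean was preferred for performance reasons (the mean is O(n); quantiles usually require a sort, which is O(n log n))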
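The points above about extreme values, the mode, and quantities that cannot simply be summed can be checked with a few lines of R. The numbers below are hypothetical values invented for the illustration, not data from these notes.

```r
# Hypothetical sample with one extreme value
incomes <- c(30, 32, 35, 36, 38, 40, 41, 43, 45, 1000)

m_mean   <- mean(incomes)              # 134: pulled far up by the single extreme value
m_trim   <- mean(incomes, trim = 0.1)  # 38.75: drops the lowest and highest 10% first
m_median <- median(incomes)            # 39: unaffected by how extreme the maximum is

# Mode of a categorical variable: the most frequent level
colours  <- c("red", "blue", "red", "green", "red")
mode_val <- names(which.max(table(colours)))  # "red"

# Growth factors: +10% followed by -10% is a net loss, not "no change"
growth  <- c(1.10, 0.90)
g_arith <- mean(growth)            # 1.00, misleading
g_geo   <- exp(mean(log(growth)))  # sqrt(0.99), slightly below 1
```

The geometric mean is computed here with the same `exp(mean(log(x)))` idiom used later in the transcript.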
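Description: dispersion
Central tendency reflects how strongly the values concentrate around their center. But how much do the values differ from one another?
Dispersion reflects how far the values lie from their central value.
A measure of central tendency picks a representative for the data, but how well it represents them depends on the dispersion. The more spread out the data, the weaker the representative; the more concentrated the data, the stronger it is.
Statistics
Range
Variance, standard deviation
z-score
Empirical rule
Chebyshev's theorem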
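The statistics listed above can be computed and compared in a short sketch. It assumes the built-in airquality data set; the variable names are made up for the example. The empirical rule (about 68%, 95%, 99.7% within 1, 2, 3 standard deviations) only holds approximately and only for roughly normal data, whereas Chebyshev's bound 1 - 1/k^2 holds for any distribution.

```r
x <- airquality$Temp  # no missing values in this column

# Range: largest minus smallest value
r <- diff(range(x))

# z-scores: distance from the mean in units of the standard deviation
z <- (x - mean(x)) / sd(x)

# Observed share of values within k standard deviations, for k = 1, 2, 3
k <- 1:3
share_within <- sapply(k, function(kk) mean(abs(z) <= kk))

# Chebyshev's lower bound (trivial for k = 1) and the empirical-rule reference values
chebyshev <- 1 - 1 / k^2
empirical <- c(0.68, 0.95, 0.997)

rbind(observed = share_within, chebyshev = chebyshev, empirical = empirical)
```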
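Correlation
Covariance
cov
Range of the covariance: unbounded, and dependent on the units of the variables
Correlation coefficient
cor
The covariance standardized by the two standard deviations; range [-1, +1]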
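The relationship between the two measures can be verified directly. This is a small sketch assuming the built-in airquality data set, whose Temp column is recorded in degrees Fahrenheit; the conversion to Celsius is added here only to show the effect of changing units.

```r
wind   <- airquality$Wind
temp_f <- airquality$Temp

# Covariance depends on the units of both variables
c_f <- cov(wind, temp_f)

# Correlation is the covariance divided by both standard deviations
r_f      <- cor(wind, temp_f)
r_manual <- c_f / (sd(wind) * sd(temp_f))

# Changing units (Fahrenheit to Celsius) rescales the covariance by 5/9 ...
temp_c <- (temp_f - 32) * 5 / 9
c_c    <- cov(wind, temp_c)

# ... but leaves the correlation unchanged
r_c <- cor(wind, temp_c)

c(cov_F = c_f, cov_C = c_c, cor_F = r_f, cor_C = r_c)
```

This is why the correlation coefficient, not the covariance, is used to compare the strength of association across pairs of variables.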
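Describing a single variable
Numeric variables
mean (mean), weighted.mean (weighted mean), median (median), quantile (quantiles)
var (variance), sd (standard deviation), min, max
summary, fivenum
sum, length, prod
Categorical variables
table
prop.table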
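As a brief illustration of the two functions for categorical variables, the sketch below counts and normalizes levels. It assumes the built-in airquality and mtcars data sets; the object names are invented for the example.

```r
# One-way table: number of observations per month
counts <- table(airquality$Month)

# The same table as proportions that sum to 1
shares <- prop.table(counts)

# Two-way table: cylinders by gears
two_way <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# margin = 1 normalizes within each row, so every row sums to 1
row_shares <- prop.table(two_way, margin = 1)

counts
round(shares, 3)
round(row_shares, 3)
```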
Grouped statistics
split, sapply, lapply, tapply
aggregate
by
Correlation between two variables
cov
cor
use: complete.obs, pairwise.complete.obs
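Visualization
Single variable
Scatter (index) plot: plot
Box plot: boxplot
Bar chart: barplot
Histogram: hist; density plot
Violin plot (vioplot::vioplot)
QQ plot (qqnorm, qqline, car::qqPlot)
Two variables
Scatter plot: plot, jitter, smoothScatter, sunflowerplot
Scatter plot matrix: pairs, plot, car::scatterplotMatrix
Correlation: corrgram::corrgram
Grouped:
lattice package
xyplot: scatter plot
bwplot: box plot
histogram: histogram
densityplot: density plot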
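The four lattice functions listed above share one formula interface, where `|` conditions on a grouping variable. The following is a minimal sketch, assuming the built-in airquality data set with Month converted to a factor; the object names are made up for the example.

```r
library(lattice)

# Treat Month as a grouping factor rather than a number
aq <- transform(airquality, Month = factor(Month))

# One scatter plot panel per month
p_xy <- xyplot(Temp ~ Wind | Month, data = aq)

# One box per month
p_bw <- bwplot(Temp ~ Month, data = aq)

# One histogram panel per month
p_hist <- histogram(~ Temp | Month, data = aq)

# Overlaid density curves, one per month, with a legend
p_dens <- densityplot(~ Temp, groups = Month, data = aq, auto.key = TRUE)

# lattice functions return "trellis" objects; printing one draws it
print(p_xy)
```

Conditioning with `|` produces separate panels, while `groups =` overlays the groups in a single panel, so the choice depends on whether the groups should be compared side by side or on a common scale.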
Code examples:
> str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> summary(airquality) # …………
> mean(airquality$Ozone,na.rm=T) # na.rm=T drops missing values; without it the result is NA
[1] 42.12931
> mean(airquality$Temp,na.rm=T,trim = .01) # trim = .01 gives a trimmed mean
[1] 77.90066
> temp100 = rnorm(100,30,1)
> w = 1:100
> (wtm = weighted.mean(temp100,w,na.rm=T))
[1] 30.03303
> (tm = mean(temp100,na.rm = T))
[1] 30.03004
> x = c(.045,.021,.255,.019)
> (xm = mean(x))
[1] 0.085
> (xg = exp(mean(log(x))))
[1] 0.04625742
> (tmid = median(temp100,na.rm = T))
[1] 30.01199
> quantile(airquality$Temp,na.rm = T)
0% 25% 50% 75% 100%
56 72 79 85 97
> quantile(airquality$Temp,probs = c(0,0.1,0.9,1))
0% 10% 90% 100%
56.0 64.2 90.0 97.0
> (tv = var(temp100))
[1] 0.9677992
> (ts = sd(temp100))
[1] 0.9837679
> ts^2
[1] 0.9677992
> fivenum(temp100) # Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input data.
[1] 27.36138 29.47040 30.01199 30.69225 32.77827
> summary(airquality)
> cov(airquality[,-5:-6],use = 'pairwise.complete.obs') # use = 'pairwise.complete.obs': pairwise deletion
Ozone Solar.R Wind Temp
Ozone 1088.20052 1056.58346 -70.93853 218.52121
Solar.R 1056.58346 8110.51941 -17.94597 229.15975
Wind -70.93853 -17.94597 12.41154 -15.27214
Temp 218.52121 229.15975 -15.27214 89.59133
> cov(airquality[,-5:-6],use = 'complete.obs') # use = 'complete.obs': listwise deletion (rows with any NA are dropped)
Ozone Solar.R Wind Temp
Ozone 1107.29009 1056.5835 -72.51124 221.52072
Solar.R 1056.58346 8308.7422 -41.24480 255.46765
Wind -72.51124 -41.2448 12.65732 -16.85717
Temp 221.52072 255.4676 -16.85717 90.82031
> cor(airquality[,-5:-6],use = 'pairwise.complete.obs')
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.34834169 -0.60154653 0.6983603
Solar.R 0.3483417 1.00000000 -0.05679167 0.2758403
Wind -0.6015465 -0.05679167 1.00000000 -0.4579879
Temp 0.6983603 0.27584027 -0.45798788 1.0000000
> cor(airquality[,-5:-6],use = 'complete.obs')
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
> apply(airquality[,c(-5,-6)],2,FUN = mean,na.rm = T)
Ozone Solar.R Wind Temp
42.129310 185.931507 9.957516 77.882353
> # Grouped statistics
> split(airquality[,-5:-6],airquality$Month) # group by month; returns a list
> sapply(split(airquality$Temp,airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1))
# lapply(split(airquality$Temp,airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1))
> sapply(split(airquality$Temp,list(airquality$Month,airquality$Day)),FUN = quantile,probs = c(0,0.1,0.9,1))
# lapply(split(airquality$Temp,list(airquality$Month,airquality$Day)),FUN = quantile,probs = c(0,0.1,0.9,1))
> tapply(airquality$Temp,airquality$Month,FUN = quantile,probs = c(0,0.1,0.9,1)) # simpler to use than sapply and lapply
> aggregate(airquality[,-5:-6],by = list(airquality$Month),FUN = mean,na.rm = T)
Group.1 Ozone Solar.R Wind Temp
1 5 23.61538 181.2963 11.622581 65.54839
2 6 29.44444 190.1667 10.266667 79.10000
3 7 59.11538 216.4839 8.941935 83.90323
4 8 59.96154 171.8571 8.793548 83.96774
5 9 31.44828 167.4333 10.180000 76.90000
> aggregate(airquality,by = list(airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1),na.rm = T)
> aggregate(cbind(Ozone,Solar.R)~Month,data = airquality,FUN = quantile,probs = c(0,0.1,0.9,1))
> attach(airquality)
> cor(Ozone,Wind,use = 'pairwise.complete.obs')
[1] -0.6015465
> by(cbind(Ozone,Wind),Month,function(m) cor(m[,1],m[,2],use = 'pairwise.complete.obs'))
INDICES: 5
[1] -0.3742975
------------------------------------------------------------------
INDICES: 6
[1] 0.3572546
------------------------------------------------------------------
INDICES: 7
[1] -0.6673491
------------------------------------------------------------------
INDICES: 8
[1] -0.7085496
------------------------------------------------------------------
INDICES: 9
[1] -0.6104514
> detach(airquality)
> as.vector(by(airquality,airquality$Month,function(m) cor(m[,1],m[,3],use = 'pairwise.complete.obs')))
[1] -0.3742975 0.3572546 -0.6673491 -0.7085496 -0.6104514
Visualization examples:
> x = rnorm(100,100,5)
> plot(x)
> abline(h = 100)
> plot(airquality$Ozone)
> boxplot(airquality$Ozone)
> hist(airquality$Ozone)
> hist(airquality$Ozone,breaks = seq(0,180,5),prob = T)
> lines(density(airquality$Ozone,na.rm = T),col = 3,lwd = 4)
> par(mfrow = c(1,2)) # lay out the canvas as one row, two columns
> hist(airquality$Ozone,breaks = seq(0,180,5),prob = T)
> lines(density(airquality$Ozone,na.rm = T),col = 3, lwd = 4)
> library(car)
> qqnorm(airquality$Ozone)
> par(mfrow = c(1,1))
> install.packages('vioplot')
> library(vioplot)
> par(mfrow = c(1,3))
> attach(airquality) # attach the data set to the search path
> plot(density(Ozone,na.rm = T))
> abline(v = mean(Ozone,na.rm = T),col = 'red',lwd = 2) # v draws a vertical line, h a horizontal one
> abline(v = median(Ozone,na.rm = T),col = 'green',lwd = 2)
> boxplot(Ozone)
> abline(h = mean(Ozone,na.rm = T),col = 'red',lwd = 2)
> vioplot(na.omit(airquality$Ozone))
> abline(h = mean(Ozone,na.rm = T),col = 'red',lwd = 2)
> detach(airquality) # detach again so the later attach() does not stack a second copy
> par(mfrow = c(1,1))
> table(airquality$Month)
5 6 7 8 9
31 30 31 31 30
> barplot(table(airquality$Month))
> summary(airquality)
> # Scatter plot
> attach(airquality)
> plot(Wind,Temp)
> detach(airquality)
> # Add regression lines
> attach(airquality)
> plot(Wind,Temp)
> # 1. Simple linear regression
> alm = lm(Temp~Wind)
> abline(alm$coefficients)
> # 2. Locally weighted regression (loess)
> alowess = loess(Temp~Wind)
>
> ord = order(Wind)
> lines(Wind[ord],alowess$fitted[ord],lwd = 1,col = 2,lty = 1)
> detach(airquality)
> # Handling overlapping data points
> x = rbinom(1000,10,0.1)
> y = rbinom(1000,10,0.1)
>
> par(mfrow = c(1,4))
> plot(x,y)
> sunflowerplot(x,y,col = 'red',seg.col = 'blue')
> plot(jitter(x),jitter(y))
> smoothScatter(x,y)
> par(mfrow = c(1,1))
> # Scatter plot matrix
> pairs(airquality[,-5:-6])
> plot(airquality[,-5:-6])
> scatterplotMatrix(airquality[,-5:-6],lty.smooth = 2,spread = F) # from the car package
> # Correlation coefficients
> install.packages("corrgram")
> library(corrgram)
> corrgram(airquality[,-5:-6],lower.panel = panel.conf,upper.panel = panel.pie,text.panel = panel.txt)
> corrgram(airquality[,-5:-6],lower.panel = panel.conf,upper.panel = panel.pts,diag.panel = panel.minmax)
> library(vcd)
> vcd::mosaic(~cyl+gear,data = mtcars,shade = T,legend = T)