R Study Notes: Data Cleaning 2

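How statistics summarizes the characteristics of data
Distributions (discrete, continuous)
Description and correlation
Single variable: description
Central tendency, dispersion
Two variables: correlation
Tendency to vary together (covariance, correlation coefficient)
Visual exploration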

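Data distribution
A distribution is a description of probability
What are the possible outcomes (values)?
What is the probability of each outcome, or of a range of outcomes?
Visual presentation
Probability density plot
Cumulative distribution plot
Common distributions
Discrete variables: binomial distribution, Poisson distribution
Continuous variables: uniform distribution, normal distribution, exponential distribution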
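
As a rough sketch of the two views listed above, base R exposes each common distribution through a d/p/q/r family of functions (dbinom/pbinom, dnorm/pnorm, and so on). The parameter values below (10 trials, success probability 0.3, a standard normal) are arbitrary choices for illustration only.

```r
# Discrete case: binomial with 10 trials and success probability 0.3 (illustrative values)
k   <- 0:10
pmf <- dbinom(k, size = 10, prob = 0.3)   # P(X = k) for each possible outcome
cdf <- pbinom(k, size = 10, prob = 0.3)   # P(X <= k), the cumulative distribution

# Continuous case: standard normal
x    <- seq(-4, 4, by = 0.1)
dens <- dnorm(x)                          # probability density
cum  <- pnorm(x)                          # cumulative probability

par(mfrow = c(1, 2))
plot(x, dens, type = "l", main = "Density")
plot(x, cum,  type = "l", main = "Cumulative distribution")
par(mfrow = c(1, 1))
```

The probabilities of a discrete distribution sum to 1, and the cumulative curve is the running total of those probabilities.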

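Knowing the full distribution is the ideal outcome of a study, but it is easier said than done. Often the distribution cannot be obtained, and in that case another way of capturing the character of the data is used: numerical summaries.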

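Description: central tendency
Central tendency is the degree to which a set of data clusters around a central value; it indicates where the center of the data lies
Categorical variables
Mode
Numeric variables
Mean (trimmed mean, arithmetic mean, weighted mean, geometric mean)
Where the mean falls short
It is only representative for unimodal distributions
It is sensitive to extreme values
Simple summation is not always appropriate
Median (left-skewed, right-skewed)
Quantiles
The median and quantiles are often more informative than the mean
The mean was favored historically for performance reasons (the mean is O(n); quantiles obtained by sorting are O(n log n))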
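
To make the effect of extreme values concrete, here is a small sketch. The salary figures and colour labels are made-up values, not data from these notes.

```r
# Hypothetical salaries (in thousands); the last value is an extreme outlier
salary <- c(30, 32, 35, 36, 38, 40, 41, 45, 48, 500)

mean(salary)               # pulled upward by the single extreme value
median(salary)             # stays near the bulk of the data
mean(salary, trim = 0.1)   # trimmed mean: drops 10% of observations from each end first

# Mode of a categorical variable: the most frequent level
colour <- c("red", "blue", "red", "green", "red", "blue")
names(which.max(table(colour)))
```

Base R has no built-in function for the statistical mode, which is why the last line goes through table().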

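Description: dispersion
Central tendency reflects how closely the values gather around their center. But how much do the values differ from one another?
Dispersion reflects how far the values lie from their center
A measure of central tendency picks a representative for a set of data, but how well it represents the data depends on the dispersion. The more dispersed the data, the weaker the representative; the more concentrated the data, the stronger it is.

Statistics
    Range
    Variance, standard deviation
    z-score
Empirical rule
Chebyshev's theorem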
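
The sketch below relates the z-score, the empirical rule and Chebyshev's theorem on simulated data. The seed, sample size, mean and standard deviation are arbitrary assumptions chosen for the illustration.

```r
set.seed(1)                              # arbitrary seed so the run is reproducible
x <- rnorm(10000, mean = 50, sd = 10)    # simulated, roughly normal data

z <- (x - mean(x)) / sd(x)               # z-scores: distance from the mean in standard deviations

# Share of observations within k standard deviations of the mean, for k = 1, 2, 3
within <- sapply(1:3, function(k) mean(abs(z) <= k))

# Chebyshev's lower bound 1 - 1/k^2 holds for any distribution (informative only for k > 1)
chebyshev <- 1 - 1 / (1:3)^2

rbind(observed = within, chebyshev_bound = chebyshev)
```

For data close to normal the observed shares should land near the empirical rule (about 68%, 95% and 99.7%), while Chebyshev's bound is much looser because it makes no assumption about the shape of the distribution.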

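Correlation
Covariance
cov
Range of the covariance: unbounded, and it depends on the units of the variables

Correlation coefficient
    cor
    The covariance standardized by the two standard deviations; range [-1, +1]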
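
As a quick check of the relationship between the two, the sketch below rebuilds the correlation coefficient from the covariance using the Wind and Temp columns of the built-in airquality data set. The conversion factor 1.609 (mph to km/h) is only there to show the effect of a change of units.

```r
x <- airquality$Wind
y <- airquality$Temp                 # neither column has missing values

cov_xy <- cov(x, y)
cor_xy <- cov_xy / (sd(x) * sd(y))   # standardize by both standard deviations

all.equal(cor_xy, cor(x, y))         # same value as the built-in cor()

# Changing units rescales the covariance but leaves the correlation unchanged
cov(x * 1.609, y)                    # Wind converted from mph to km/h (approximate factor)
cor(x * 1.609, y)
```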

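Describing a single variable
Numeric variables
mean (mean), weighted.mean (weighted mean), median (median), quantile (quantiles)
var (variance), sd (standard deviation), min, max
summary, fivenum
sum, length, prod
Categorical variables
table
prop.table
Grouped statistics
split, sapply, lapply, tapply
aggregate
by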
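
A short sketch of table and prop.table on the built-in airquality data set. The 80 °F cut-off for a "hot" day is an arbitrary threshold introduced only for this example.

```r
# Treat Month as a categorical variable and flag hot days (80 is an arbitrary cut-off)
hot    <- airquality$Temp > 80
counts <- table(Month = airquality$Month, Hot = hot)

counts                          # frequency of each Month x Hot combination
prop.table(counts)              # share of all days
prop.table(counts, margin = 1)  # share within each month (each row sums to 1)
```

With margin = 1 the proportions are computed row by row; margin = 2 would do the same column by column.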

Correlation between two variables
cov
cor
use: complete.obs, pairwise.complete.obs

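Visualization
Single variable
Scatter plot: plot
Box plot: boxplot
Bar chart: barplot
Histogram: hist; density plot: density
Violin plot (vioplot::vioplot)
Q-Q plot (qqnorm, qqline, car::qqPlot)

Two variables
    Scatter plot: plot, jitter, smoothScatter, sunflowerplot
    Scatter plot matrix: pairs, plot, car::scatterplotMatrix
    Correlation: corrgram::corrgram

Grouped:
    lattice package
        xyplot: scatter plot
        bwplot: box plot
        histogram: histogram
        densityplot: density plot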
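
A minimal sketch of grouped plots with lattice, again on airquality. Converting Month to a factor and the object names (aq, p_box, p_dens, p_xy) are choices made for this example.

```r
library(lattice)   # a recommended package shipped with standard R installations

# Convert Month to a factor so lattice treats it as a grouping variable
aq <- transform(airquality, Month = factor(Month))

p_box  <- bwplot(Temp ~ Month, data = aq)          # one box per month
p_dens <- densityplot(~ Temp | Month, data = aq)   # one density panel per month
p_xy   <- xyplot(Temp ~ Wind | Month, data = aq)   # scatter plot conditioned on month

print(p_box)   # lattice objects are drawn when printed
```

The formula `y ~ x | g` reads as "y against x, one panel per level of g", which is what makes lattice convenient for grouped views.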

Code examples:
    > str(airquality)
    'data.frame':	153 obs. of  6 variables:
     $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
     $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
     $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
     $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
     $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
     $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
    > summary(airquality)           # …………
    > mean(airquality$Ozone,na.rm=T)            # na.rm=T drops missing values; otherwise the result is NA
    [1] 42.12931
    > mean(airquality$Temp,na.rm=T,trim = .01)      # trim = .01 gives the trimmed mean
    [1] 77.90066

    > temp100 = rnorm(100,30,1)
    > w = 1:100
    > (wtm = weighted.mean(temp100,w,na.rm=T))
    [1] 30.03303
    > (tm = mean(temp100,na.rm = T))
    [1] 30.03004
    > x = c(.045,.021,.255,.019)
    > (xm = mean(x))
    [1] 0.085
    > (xg = exp(mean(log(x))))
    [1] 0.04625742
    > (tmid = median(temp100,na.rm = T))
    [1] 30.01199
    > quantile(airquality$Temp,na.rm = T)
      0%  25%  50%  75% 100%
      56   72   79   85   97
    > quantile(airquality$Temp,probs = c(0,0.1,0.9,1))
      0%  10%  90% 100%
    56.0 64.2 90.0 97.0
    > (tv = var(temp100))
    [1] 0.9677992
    > (ts = sd(temp100))
    [1] 0.9837679
    > ts^2
    [1] 0.9677992
    > fivenum(temp100)          # Returns Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum) for the input data.
    [1] 27.36138 29.47040 30.01199 30.69225 32.77827
    > summary(airquality)

    > cov(airquality[,-5:-6],use = 'pairwise.complete.obs')     # use = 'pairwise.complete.obs': pairwise deletion
                 Ozone    Solar.R      Wind      Temp
    Ozone   1088.20052 1056.58346 -70.93853 218.52121
    Solar.R 1056.58346 8110.51941 -17.94597 229.15975
    Wind     -70.93853  -17.94597  12.41154 -15.27214
    Temp     218.52121  229.15975 -15.27214  89.59133
    > cov(airquality[,-5:-6],use = 'complete.obs')              # use = 'complete.obs': listwise deletion (drop incomplete rows)
                 Ozone   Solar.R      Wind      Temp
    Ozone   1107.29009 1056.5835 -72.51124 221.52072
    Solar.R 1056.58346 8308.7422 -41.24480 255.46765
    Wind     -72.51124  -41.2448  12.65732 -16.85717
    Temp     221.52072  255.4676 -16.85717  90.82031

    > cor(airquality[,-5:-6],use = 'pairwise.complete.obs')
                 Ozone     Solar.R        Wind       Temp
    Ozone    1.0000000  0.34834169 -0.60154653  0.6983603
    Solar.R  0.3483417  1.00000000 -0.05679167  0.2758403
    Wind    -0.6015465 -0.05679167  1.00000000 -0.4579879
    Temp     0.6983603  0.27584027 -0.45798788  1.0000000
    > cor(airquality[,-5:-6],use = 'complete.obs')
                 Ozone    Solar.R       Wind       Temp
    Ozone    1.0000000  0.3483417 -0.6124966  0.6985414
    Solar.R  0.3483417  1.0000000 -0.1271835  0.2940876
    Wind    -0.6124966 -0.1271835  1.0000000 -0.4971897
    Temp     0.6985414  0.2940876 -0.4971897  1.0000000

    > apply(airquality[,c(-5,-6)],2,FUN = mean,na.rm = T)
         Ozone    Solar.R       Wind       Temp
     42.129310 185.931507   9.957516  77.882353
    > # Grouped statistics
    > split(airquality[,-5:-6],airquality$Month)        # group by month; returns a list
    > sapply(split(airquality$Temp,airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1))
    # lapply(split(airquality$Temp,airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1))
    > sapply(split(airquality$Temp,list(airquality$Month,airquality$Day)),FUN = quantile,probs = c(0,0.1,0.9,1))
    # lapply(split(airquality$Temp,list(airquality$Month,airquality$Day)),FUN = quantile,probs = c(0,0.1,0.9,1))

    > tapply(airquality$Temp,airquality$Month,FUN = quantile,probs = c(0,0.1,0.9,1))    # simpler to use than sapply and lapply

    > aggregate(airquality[,-5:-6],by = list(airquality$Month),FUN = mean,na.rm = T)
      Group.1    Ozone  Solar.R      Wind     Temp
    1       5 23.61538 181.2963 11.622581 65.54839
    2       6 29.44444 190.1667 10.266667 79.10000
    3       7 59.11538 216.4839  8.941935 83.90323
    4       8 59.96154 171.8571  8.793548 83.96774
    5       9 31.44828 167.4333 10.180000 76.90000

    > aggregate(airquality,by = list(airquality$Month),FUN = quantile,probs = c(0,0.1,0.9,1),na.rm = T)

    > aggregate(cbind(Ozone,Solar.R)~Month,data = airquality,FUN = quantile,probs = c(0,0.1,0.9,1))

    > attach(airquality)
    > cor(Ozone,Wind,use = 'pairwise.complete.obs')
    [1] -0.6015465
    > by(cbind(Ozone,Wind),Month,function(m) cor(m[,1],m[,2],use = 'pairwise.complete.obs'))
    INDICES: 5
    [1] -0.3742975
    ------------------------------------------------------------------
    INDICES: 6
    [1] 0.3572546
    ------------------------------------------------------------------
    INDICES: 7
    [1] -0.6673491
    ------------------------------------------------------------------
    INDICES: 8
    [1] -0.7085496
    ------------------------------------------------------------------
    INDICES: 9
    [1] -0.6104514
    > detach(airquality)

    > as.vector(by(airquality,airquality$Month,function(m) cor(m[,1],m[,3],use = 'pairwise.complete.obs')))
    [1] -0.3742975  0.3572546 -0.6673491 -0.7085496 -0.6104514

Visualization demo:
> x = rnorm(100,100,5)
> plot(x)
> abline(h = 100)

> plot(airquality$Ozone)
> boxplot(airquality$Ozone)

> hist(airquality$Ozone)
> hist(airquality$Ozone,breaks = seq(0,180,5),prob = T)
> lines(density(airquality$Ozone,na.rm = T),col = 3,lwd = 4)

> par(mfrow = c(1,2))       # lay out the canvas as one row, two columns
> hist(airquality$Ozone,breaks = seq(0,180,5),prob = T)
> lines(density(airquality$Ozone,na.rm = T),col = 3, lwd = 4)
> library(car)
> qqnorm(airquality$Ozone)
> qqline(airquality$Ozone)      # reference line for the Q-Q plot

> par(mfrow = c(1,1))

> install.packages('vioplot')
> library(vioplot)
> par(mfrow = c(1,3))
> attach(airquality)            # attach the data set
> plot(density(Ozone,na.rm = T))
> abline(v = mean(Ozone,na.rm = T),col = 'red',lwd = 2)          # v draws a vertical line, h a horizontal line
> abline(v = median(Ozone,na.rm = T),col = 'green',lwd = 2)
> boxplot(Ozone)
> abline(h = mean(Ozone,na.rm = T),col = 'red',lwd = 2)
> vioplot(na.omit(airquality$Ozone))
> abline(h = mean(Ozone,na.rm = T),col = 'red',lwd = 2)
> par(mfrow = c(1,1))
> detach(airquality)            # detach before the data set is attached again below

> table(airquality$Month)

 5  6  7  8  9
31 30 31 31 30
> barplot(table(airquality$Month))

> summary(airquality)
> # Scatter plot
> attach(airquality)
> plot(Wind,Temp)
> detach(airquality)

> # Add regression lines
> attach(airquality)
> plot(Wind,Temp)
> # 1. Simple linear regression
> alm = lm(Temp~Wind)
> abline(alm$coefficients)

> # 2. Locally weighted regression (loess)
> alowess = loess(Temp~Wind)
>
> ord = order(Wind)
> lines(Wind[ord],alowess$fitted[ord],lwd = 1,col = 2,lty = 1)
> detach(airquality)

> # Handling overlapping data points
> x = rbinom(1000,10,0.1)
> y = rbinom(1000,10,0.1)
>
> par(mfrow = c(1,4))
> plot(x,y)
> sunflowerplot(x,y,col = 'red',seg.col = 'blue')
> plot(jitter(x),jitter(y))
> smoothScatter(x,y)
> par(mfrow = c(1,1))

> # Scatter plot matrix
> pairs(airquality[,-5:-6])
> plot(airquality[,-5:-6])

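> scatterplotMatrix(airquality[,-5:-6],smooth = list(lty.smooth = 2,spread = F))    # car >= 3.0 takes smoother options as a list; older versions used lty.smooth = 2,spread = F directly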

> # Correlation coefficients
> install.packages("corrgram")
> library(corrgram)
> corrgram(airquality[,-5:-6],lower.panel = panel.conf,upper.panel = panel.pie,text.panel = panel.txt)
> corrgram(airquality[,-5:-6],lower.panel = panel.conf,upper.panel = panel.pts,diag.panel = panel.minmax)

> library(vcd)
> vcd::mosaic(~cyl+gear,data = mtcars,shade = T,legend = T)
