R语言处理离群值(outlier)

今天在处理土壤酶活数据时发现8个技术重复中存在离群值,于是想通过R将这些离群值删除。原始数据如下:

sample_id r1 r2 r3 r4 r5 r6 r7 r8
BT1 16.388 4.088 5.138 4.873 5.686 4.115 4.626 4.461
BT2 7.525 3.808 3.917 4.102 4.166 4.104 3.373 5.581
BT3 11.897 3.928 4.505 4.465 4.901 4.089 3.979 4.996
BS1 -0.851 -0.895 -0.892 -0.734 -0.855 -0.889 -0.971 -0.892
BS2 -0.678 -0.735 -0.919 -0.941 -0.902 -0.859 -0.804 -0.907
BS3 -0.761 -0.811 -0.901 -0.833 -0.874 -0.87 -0.883 -0.895
10T1 2.748 2.748 2.989 2.292 2.678 3.037 4.328 2.207
10T2 3.2 4.7 2.719 4.17 3.282 3.932 4.118 3.654
10T3 2.959 3.705 2.84 3.215 2.965 3.467 4.202 2.915
10S1 0.154 0.723 1.961 0.267 0.052 2.049 -0.053 0.184
10S2 -0.008 0.232 0.318 0.252 1.179 -0.056 0.415 0.163
10S3 0.072 0.475 1.134 0.258 0.612 0.992 0.018 0.173
4T1 6.281 9.823 8.209 5.994 5.735 5.356 4.573 6.228
4T2 7.042 6.291 6.272 5.743 4.587 6.156 3.504 5.678
4T3 6.628 8.017 7.204 5.839 5.135 5.727 4.019 5.923
4S1 -0.705 -0.661 -0.081 -0.334 1.331 -0.566 -0.306 1.138
4S2 -0.76 -0.5 -0.69 -0.716 -0.774 -0.604 -0.802 -0.743
4S3 -0.729 -0.577 -0.384 -0.522 0.278 -0.582 -0.511 0.196

一眼就能看出BT1和BT3样品的第一个数据明显偏离了整体的均值,那么如何处理呢?

## 定义一个将数据离群值替换为NA的函数
fun.outlier <- function(x,time.iqr=1.5) {
  outlier.low <- quantile(x,probs=c(0.25))-IQR(x)*time.iqr
  outlier.high <- quantile(x,probs=c(0.75))+IQR(x)*time.iqr
  x[which(x>outlier.high | x% plyr::summarise(    BT1=fun.outlier(BT1),
                                                      BT2=fun.outlier(BT2),
                                                      BT3=fun.outlier(BT3),
                                                      BS1=fun.outlier(BS1),
                                                      BS2=fun.outlier(BS2),
                                                      BS3=fun.outlier(BS3),
                                                      10T1=fun.outlier(10T1),
                                                      10T2=fun.outlier(10T2),
                                                      10T3=fun.outlier(10T3),
                                                      10S1=fun.outlier(10S1),
                                                      10S2=fun.outlier(10S2),
                                                      10S3=fun.outlier(10S3),
                                                      4T1=fun.outlier(4T1),
                                                      4T2=fun.outlier(4T2),
                                                      4T3=fun.outlier(4T3),
                                                      4S1=fun.outlier(4S1),
                                                      4S2=fun.outlier(4S2),
                                                      4S3=fun.outlier(4S3)
                                                      )

## 求替换后的各组的平均值
enzyme_m <- sapply(outlier.enzyme, mean, na.rm = TRUE)
enzy_act_o <- as.data.frame(enzyme_m)
head(enzy_act_o)
        enzyme_m
# BT1  4.712429
# BT2  3.911667
# BT3  4.409000
# BS1 -0.879000
# BS2 -0.843125
# BS3 -0.853500

## 将结果保存为对应csv文件
write.csv(enzy_act_o, file = "enzyme_xyloside_mean.csv")

参考资料:
https://blog.csdn.net/qq_43872984/article/details/105479445

你可能感兴趣的:(R语言处理离群值(outlier))