Linux统计某一列中特定字符串的个数

问题描述:想统计一下vcf文件中FILTER这一列中,是PASS的个数
我们知道,vcf文件包括的是前面很长的注释,以及后面的正文部分。注释都是以#开头的,所以我的思路是,

  1. 把正文提取出来;
  2. 然后提取其中的FILTER这一列;
  3. 最后把这一列的PASS提取出来,最后是统计个数。
  • 因为所有的注释行都是以#开头的,所以用grep -v提取非注释行,就是正文,还有一个就是“^”的用法,这个正则表达式,也是linux 中一个比较难的部分。主要是需要记住的部分很多。可以参考http://www.runoob.com/regexp/regexp-metachar.html 里面总结了各种字符。
[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep -v '^#' |head
chrUn_JTFH01001680v1_decoy      528     .       G       A       .       clustered_events;contamination;t_lod    DP=336;ECNT=14;NLOD=3.01;N_ART_LOD=-1.041e+00;POP_AF=2.500e-06;P_CONTAM=0.908;P_GERMLINE=-9.168e+01;TLOD=4.25  GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB     0/0:10,0:8.505e-05:6,0:4,0:35,0:252,0:0:0:0|1:528_G_A   0/1:319,3:0.015:147,2:172,1:35,36:234,234:40:26:0|1:528_G_A:0.010,0.00,9.317e-03:8.225e-03,2.072e-03,0.990
chrUn_JTFH01001680v1_decoy      532     .       T       A       .       clustered_events;contamination;t_lod    DP=322;ECNT=14;NLOD=2.71;N_ART_LOD=-7.404e-01;POP_AF=2.500e-06;P_CONTAM=0.898;P_GERMLINE=-8.888e+01;TLOD=4.60  GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:PGT:PID:SA_MAP_AF:SA_POST_PROB     0/0:9,0:4.700e-05:5,0:4,0:34,0:254,0:0:0:0|1:528_G_A    0/1:310,3:9.632e-03:145,2:165,1:33,35:236,234:40:22:0|1:528_G_A:0.010,0.00,9.585e-03:8.677e-03,2.116e-03,0.989
chrUn_JTFH01001862v1_decoy      32      .       C       A       .       clustered_events        DP=477;ECNT=4;NLOD=3.59;N_ART_LOD=-8.451e-01;POP_AF=2.500e-06;P_CONTAM=2.127e-03;P_GERMLINE=-1.145e+02;TLOD=9.79       GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/0:12,0:4.012e-03:7,0:5,0:36,0:204,0:0:0       0/1:426,7:0.051:183,5:243,2:34,35:199,176:40:9:0.010,0.020,0.016:1.705e-03,0.013,0.985
chrUn_JTFH01001862v1_decoy      36      .       A       T       .       clustered_events;panel_of_normals       DP=534;ECNT=4;IN_PON;NLOD=3.61;N_ART_LOD=-1.114e+00;POP_AF=2.500e-06;P_CONTAM=0.00;P_GERMLINE=-1.061e+02;TLOD=38.63    GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB     0/0:12,0:1.442e-04:7,0:5,0:38,0:204,0:0:0       0/1:457,22:0.083:204,8:253,14:35,33:195,191:40:22:0.030,0.040,0.046:4.988e-03,6.120e-03,0.989

  • 查看FILTER 这一列是第几行,可以把注释信息的最后一列提出来,就可以知道正文都包含哪些列了。数了一下发现是第7列
[qiany@gm112-2 T2B]$ less -S 57_T2B_somatic.vcf.gz |grep  '^#' |tail -1
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  57_B    57_T
  • 然后可以用awk把第七列提取出来,其实想的是直接按照FILTER提取,但是找了一圈没有找到。
[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |head
contamination;mapping_quality;t_lod
clustered_events;contamination;t_lod
clustered_events;contamination;t_lod
clustered_events;contamination;t_lod
artifact_in_normal;contamination;t_lod
artifact_in_normal;str_contraction
contamination;str_contraction;t_lod
base_quality;clustered_events;t_lod
clustered_events
clustered_events
  • 用grep -o提取这一列中是PASS的行,参考的是https://www.cnblogs.com/ywl925/p/3947778.html
[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |head
PASS
PASS
PASS
PASS
PASS
PASS
PASS
PASS
PASS
PASS
[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o "PASS" |wc -l
2136

有一个问题是grep 和grep -o有什么区别呢?官方讲的是grep -o是-o, --only-matching show only the part of a line matching PATTERN,也就是说只会把字符串部分输出来。但是如果是grep的话,会把含有字符串的这一行都输出来。以下是例子

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep  "clu" |head
clustered_events;contamination;t_lod
clustered_events;contamination;t_lod
clustered_events;contamination;t_lod
base_quality;clustered_events;t_lod
clustered_events
clustered_events
base_quality;clustered_events;multiallelic;panel_of_normals
clustered_events
base_quality;clustered_events;contamination;read_position
clustered_events;contamination

并且的话grep -o会把字符串单独输出到一行。

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o  "clu" |head
clu
clu
clu
clu
clu
clu
clu
clu
clu
clu

下面的例子提现了两种方式输出的行数不一样

[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep  "_" |wc -l
54211
[qiany@gm112-2 T2B]$ less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep -o  "_" |wc -l
107908

但是我的目的是统计这一列是PASS的情况,也就是说这一列中除了PASS再也没有其他的情况。所以应该是用grep,然后以PASS开始,以PASS结束

[qiany@gm112-2 T2B]$  less -S 61_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l
2136

我要统计很多个文件的话,就可以写成shell脚本了。

for i in {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}
do 
echo -n "${i}_T2B_somatic.vcf.gz " #不换行的方式1
echo -e "the number is \c"  #不换行的方式2
less -S ${i}_T2B_somatic.vcf.gz |grep -v "^#" |awk '{print $7}' |grep "^PASS$" |wc -l
done
[qiany@gm112-2 T2B]$ sh countpsaanum.sh
10_T2B_somatic.vcf.gz the number is 470
20_T2B_somatic.vcf.gz the number is 294
23_T2B_somatic.vcf.gz the number is 467
24_T2B_somatic.vcf.gz the number is 487
29_T2B_somatic.vcf.gz the number is 306
3_T2B_somatic.vcf.gz the number is 258
35_T2B_somatic.vcf.gz the number is 249
39_T2B_somatic.vcf.gz the number is 427
44_T2B_somatic.vcf.gz the number is 366
5_T2B_somatic.vcf.gz the number is 729
6_T2B_somatic.vcf.gz the number is 432
61_T2B_somatic.vcf.gz the number is 2136
65_T2B_somatic.vcf.gz the number is 308
8_T2B_somatic.vcf.gz the number is 1776
52_T2B_somatic.vcf.gz the number is 371
55_T2B_somatic.vcf.gz the number is 447
57_T2B_somatic.vcf.gz the number is 281
58_T2B_somatic.vcf.gz the number is 253
59_T2B_somatic.vcf.gz the number is 262

这里用了echo输出不换行的两种方式,参考的是https://www.cnblogs.com/qingtouniao/p/6667874.html。

但是个人还是比较习惯Python的方式。

#!/usr/bin/env python
import re
for i in  {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:
    num = 0
    infile = open( "10_T2B_somatic.vcf.gz", "r")
    for line in infile:
        if re.match("^#", line) is None:
            if line.split()[6] == "PASS":
                num = num + 1
    print("10_T2B_somatic.vcf.gz is" + num)
    infile.close()
[qiany@gm112-2 T2B]$ python counttest.py
Traceback (most recent call last):
  File "counttest.py", line 6, in 
    for line in infile:
  File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

出现这个报错,我搜索了一下好像是因为python 不能直接对gz压缩文件进行处理,https://www.cnblogs.com/chyu/p/4558782.html
于是在网上找到了一个代码参考的是https://blog.csdn.net/zhangchilei/article/details/46313315?utm_source=blogxgwz7
思路就是定义一个函数,来读这个gz,但是又不解压这个gz

import re
import os
import os.path
import gzip
def read_gz_file(path):
    if os.path.exists(path):
        with gzip.open(path, 'r') as infile:
            for line in infile:
                yield line
    else:
        print('the path [{}] is not exist!'.format(path))
num = 0
con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")
if getattr(con, '__iter__', None):
    for line in con:
        if re.match("^#", line) is None:
            if line.split()[6] == "PASS":
                num = num + 1
    print("10_T2B_somatic.vcf.gz is " + str(num))  
[qiany@gm112-2 T2B]$ python2 counttest.py
10_T2B_somatic.vcf.gz is 470

但是还是有问题,当我用Python2的时候,是能够运行的,但是用python3却提示错误。

[qiany@gm112-2 T2B]$ python3 counttest.py
Traceback (most recent call last):
  File "counttest.py", line 17, in 
    if re.match("^#", line) is None:
  File "/thinker/globe/udata/big/software/anaconda3/lib/python3.6/re.py", line 172, in match
    return _compile(pattern, flags).match(string)
TypeError: cannot use a string pattern on a bytes-like object

查了一下原因说是py3的urlopen返回的不是string是bytes所以需要在正则表达式之前加一个编码

#coding=utf-8
import re
import os
import os.path
import gzip
def read_gz_file(path):
    if os.path.exists(path):
        with gzip.open(path, 'r') as infile:
            for line in infile:
                yield line
    else:
        print('the path [{}] is not exist!'.format(path))
num = 0
con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/10_T2B_somatic.vcf.gz")
if getattr(con, '__iter__', None):
    for line in con:
        line = line.decode('utf-8')  #这里py3的urlopen返回的不是string是bytes,所以需要加入decode('utf-8') 编码
        if re.match("^#", line) is None:
            if line.split()[6] == "PASS":
                num = num + 1
    print("10_T2B_somatic.vcf.gz is " + str(num))       
[qiany@gm112-2 T2B]$ python3 counttest.py
10_T2B_somatic.vcf.gz is 470

再把原来的循环加上

#coding=utf-8
import re
import os
import os.path
import gzip
def read_gz_file(path):
    if os.path.exists(path):
        with gzip.open(path, 'r') as infile:
            for line in infile:
                yield line
    else:
        print('the path [{}] is not exist!'.format(path))
for i in  {10,20,23,24,29,3,35,39,44,5,6,61,65,8,52,55,57,58,59}:
    num = 0
    con = read_gz_file("/thinker/globe/udata/big/qiany/gliomas/wes/gatkvcf/mutect2/T2B/" + str(i) + "_T2B_somatic.vcf.gz")
    if getattr(con, '__iter__', None):
        for line in con:
            line = line.decode('utf-8')
            if re.match("^#", line) is None:
                if line.split()[6] == "PASS":
                    num = num + 1
    print(str(i) + "_T2B_somatic.vcf.gz is " + str(num)) 

运行一下结果

[qiany@gm112-2 T2B]$ python3 counttest.py
65_T2B_somatic.vcf.gz is 308
3_T2B_somatic.vcf.gz is 258
35_T2B_somatic.vcf.gz is 249
5_T2B_somatic.vcf.gz is 729
6_T2B_somatic.vcf.gz is 432
39_T2B_somatic.vcf.gz is 427
8_T2B_somatic.vcf.gz is 1776
10_T2B_somatic.vcf.gz is 470
44_T2B_somatic.vcf.gz is 366
61_T2B_somatic.vcf.gz is 2136
20_T2B_somatic.vcf.gz is 294
52_T2B_somatic.vcf.gz is 371
55_T2B_somatic.vcf.gz is 447
23_T2B_somatic.vcf.gz is 467
24_T2B_somatic.vcf.gz is 487
57_T2B_somatic.vcf.gz is 281
58_T2B_somatic.vcf.gz is 253
59_T2B_somatic.vcf.gz is 262
29_T2B_somatic.vcf.gz is 306

这个结果与shell是一致的。

https://www.e-learn.cn/content/python/870620
yield解释 https://blog.csdn.net/mieleizhi0522/article/details/82142856
参考:
https://www.cnblogs.com/ywl925/p/3947778.html
https://www.cnblogs.com/qingtouniao/p/6667874.html

你可能感兴趣的:(Linux统计某一列中特定字符串的个数)