CMP分析流程

reads 1 和 reads 2 去掉adapter,程序一样,且from SJ

  perl read1_2_filter_adapter.pl $s.1.clean.fq.gz $s.1.adapter.fq.gz

reads2 去掉前6bp和12bp后,还要剩余20bp;reads1去掉前12bp和后6bp,还要剩余20bp。

  perl rm_firstx_leny.pl $s.1.adapter.fq.gz $s.2.filter.fq.gz 6 20 12
  • 想了一下,去掉前6bp后12bp或者前12bp后6bp的策略是合理的,因为基因组motif显示cgcgcgg中后5bpmapping更好一些,所以去掉前4+2bp,由于片段很短,基本上reads1和reads2是重合的,所以有必要前后都要去掉一定的碱基。但是对于组织样本,是否有必要去掉一定的碱基呢,如果组织片段大,那么就没有必要了吧。对组织的测序为什么不一开始就用RRBS呢?哦,对,因为RRBS和MCTA-Seq覆盖位点有差异。我觉得应该衡量一下切胶片段大小对结果的影响,同一个组织切不同长短,上机测序,看看结果有没有异同,如果相关性很好,那么就没有必要纠结长短了;如果切长一些更好,那么组织的去掉前xbp后xbp的策略可能需要修改

去掉reads中含有3个或以上CpH的序列

  perl ch3deleate.pl $2.2.filter.fq.gz $s.2.nonCG.fq.gz
  • perl很简单,reads匹配三个或三个以上CA/CT/CC就删掉
  • 其实这步没有将CH去完,只去掉序列中的CH;noch_arra去掉了比对到reference的CH

bismark比对

noch_arra

STEP6_cgcgma_pcr

**(1)在xx范围内含有至少2个CG;(2)reads1 barcode区域的质量值控制;(3)barcode一样的reads仅保留一个 **

(1)在xx范围内含有至少2个CG

  • 目的:筛选在-3~+3bp(含3bp)(mapping位点为0)中至少有2个CG的序列
  • 输入文件:$s.temp2
  • 结果文件:$s.5bp.cgcgmat.gz
perl extractCGx2.pl $s.temp2 hg19.fa $s.5bp.cgcgmat.gz
$samtools view $s.2.nonCG.fq.gz_bismark.sort.bam > $s.5temp2

(2)reads1 barcode区域的质量值控制

  • 目的:barcode区域中每个碱基的质量值大于等于17
  • 测序得到第一链如下,H至少有5个,所以将前5个H作为barcode,作为barcode的5个碱基质量值要大于等于17

TTTCCCTACACGACGCTCTTCCGATCTHHHHHHHHCGCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHHHCGHCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHHCGHHCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHCGHHHCH

perl qc5bp.pl $s.clean.1.fq.gz 5 $s.5qc.gz

(3)、barcode一样的reads仅保留一个

  • 目的:如果read1没有比对上基因组,则只考虑reads1的barcode以及reads2的起始位置,一致,则认为是同一条reads。若read1和read2的比对信息都有,则考虑reads1的barcode、reads1的起始位置以及reads2的起始位置,一致,则认为是同一条reads
  • 输入文件:$s.5temp1
  • 输出文件:$s.5bp.cgcgmat.rmd.gz
perl rmduppcrv2.pl \$s.5qc.gz \$s.5bp.cgcgmat.gz 5 \$s.5temp1 \$s.5bp.cgcgmat.rmd.gz

用$s.5bp.cgcgmat.gz得到noch_arra

重复1.5、noch_arra 过程,输入文件为$s.5bp.cgcgmat.gz

step10 combine

perl combine.pl Pcrc1_m Pcrc2_m Pcrc3_m Pcrc1_Pcrc3 6

CMP部分文件说明

生成文件说明

s07.Pcgibed:$s.5bp.cgcgmat.rmd.gz s07.Tcgibed:$s.5bp.cgcgmat.gz T- s07.Tcgibed/$s.cgcgmat P- s07.Pcgibed/$s.cgcgmat.qc uPnorm- s07.Pcgibed/$s.cgcgmat.qc*(qc-dCGI-$s/rmd-dCGI-$s)
uP-xxx:对于血浆样本,在MePM基础上乘以 qc-dCGI-$s/rmd-dCGI-$s

P和T的差别在于P算的UMI,T算的MePM

cal文件内容说明

fmg9_m.clean:
clean reads 条数
fmg9_m.qc5.clean:
对clean reads再次做前5bp的qc后的reads
fmg9_m.filter:去掉前6bp后12bp后剩余的reads数
fmg9_m.rmfilter:去掉含有3个及以上nonCG的reads
fmg9_m.unique_mapping:bismark mapping到基因组
fmg9_m.cgcgmat:在-3~+3bp(含3bp)(mapping位点为0)中至少有2个CG
fmg9_m.cgcgmat.qc:在fmg9_m.cgcgmat的基础上,reads的前5bp做过qc
fmg9_m.cgcgmat.qc.rmd:用UMI去掉PCR重复序列
P-CGI-fmg9_m:用UMI去掉PCR重复后落在CGI中的序列
T-CGI-fmg9_m:不用UMI去掉PCR重复后落在CGI中的序列
qc-dCGI-fmg9_m:qc前5bp,落在dCGI区域中的reads(dCGI有3024个,分别是什么
呢?)
qc-dCGI2-fmg9_m:qc前5bp,落在dCGI2区域中的reads(dCGI2有9513个,分别是
什么呢?)
rmd-dCGI-fmg9_m:qc前5bp,去掉UMI,落在dCGI中的reads
rmd-dCGI2-fmg9_m:qc前5bp,去掉UMI,落在dCGI2中的reads
filter2_nonCGfmg9_m:1-("total methylated C in CHG"+"total methylated
C in CHH" )/("total methylated C in CHG"+"total methylated C in
CHH" +"total C to T conversions in CHG context"+"Total C to T
conversions in CHH context"))完成filter而未去掉含3个及以上nonCG的第二端序列然后bismark比对结果
filter1_nonCGfmg9_m :同上,第一端序列

CMP改进流程

单位点C和C+T衡量甲基化程度

不光用MePM衡量甲基化程度,还用测到reads含有的甲基化位点的C/C+T来衡量。

一个小问题:是否应该同时考虑reads1和reads2的信息,为了解决这个问题,
应该计算reads1和reads2重叠区域是否很多,如果基本上重复,那么reads1和
reads2的信息是一致的,仅需要考虑一条reads即可,如果reads1和reads2重叠
区域少,那么应该同时考虑两条reads的情况。这样计算有点麻烦,因为不能分别
把reads1和reads2的C加起来,C+T加起来,然后C/C+T,原因是这样会导致重叠
区域权重增大,应该是上述的C-重叠区域C,上述C+T-重叠区域C+T,然后C/C+T
才是真的甲基化程度。所以我觉得考虑一条reads足以。


计算单个cgcgcgg

正链序列(起始点)落在正链cgcgcgg上,负链序列(起始点)落在负链cgcgcgg上

sh CGIcgcgcggv2.sh Pn1 Pn

问:为什么要对单碱基数据也做normalise?
答:文老师发现一个肝癌数据中C特别高,但是癌症程度并不算太高,而是由于测序深度太深造成的。那么如果只关注C的绝对值,测序越深,C的绝对值就会越高。如果测饱和了(每个阳性位点都测到了),C的绝对值不会因为测序而升高(去掉PCR duplicate后),没有测饱和的时候,用绝对值计算是要受到测序深度影响的。另外,两个病人释放不同量的ccfDNA,而其中癌症相关的都是一条,因为取血量一样,都是5ml,那么癌症相关DNA浓度一样,但是测序得到的结果(同样测序深度)就不一样了,解决办法:饱和程度。测饱和可以解决以上两个问题。

问:为什么不用CGI-qc/CGI-rmd作为duplication rate,既然T和upnor的方法本质上是一样的,upnor的优势是什么?
答:upnor的duplication rate是一样的,而去重的时候,不可能每个位点去掉重复的比例一致,只要是乘以一个固定的duplication rate,T带来的随机性就被去掉了,至于能否用CGI-qc/rmd-CGI作为duplication rate,也要筛选那些低拷贝的地方吧,不筛选得到的duplication rate,高拷贝的地方占权重会大。


1. read1_2_filter_adapter.pl

  1 #! /usr/bin/perl -w
  2 use strict;
  3
  4 open IN,"zcat $ARGV[0] |" or die $!;
  5 open OUT,"|gzip > $ARGV[1]" or die $!;
  6
  7 while()
  8 {
  9         chomp;
 10         my $line1=$_;
 11         chomp(my $line2=);
 12         #;
 13         chomp(my $line3=);
 14         chomp(my $line4=);
 15         if($line2 =~ /^(\w+)AGATCGGAAGAGC/)
 16         {
 17                 #$line2= s/AGATCGGAAGAGCAC//;
 18                 #my $len = length($1)
 19                 if(length($1)>2)
 20                 {
 21                         my $len = length($1);
 22                         my $new_line2=substr($line2,0,$len);
 23                         #my $line2_length= length $line2;
 24                         my $new_line4=substr($line4,0,$len);
 25                         print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 26                 }
 27         }
 28         elsif($line2 =~ /^AGATCGGAAGAGCAC/)
 29         {
 30                 next;
 31         }
 32         #elsif($line2 =~ /^(\w+)AGATCGGAAGAGCA$/)
 33         #{
 34         #       #$line2= s/AGATCGGAAGAGCA$//;
 35         #       my $len = length($1);
 36         #       my $new_line2=substr($line2,0,$len);
 37         #       my $new_line1=substr($line1,0,$len);
 38         ##      print OUT "$new_line1\n$new_line2\n\n";
 39         #}
 40         elsif($line2 =~ /^(\w+)AGATCGGAAGAGC$/)
 41         {
 42                 #$line2= s/AGATCGGAAGAGC$//;
 43                 my $len = length($1);
 44                 my $new_line2=substr($line2,0,$len);
 45                 my $new_line4=substr($line4,0,$len);
 46         #       print OUT "$new_line1\n$new_line2\n\n";
 47                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 48         }
 49         elsif($line2 =~ /^(\w+)AGATCGGAAGAG$/)
 50         {
 51                 #$line2= s/AGATCGGAAGAG$//;
 52                 my $len = length($1);
 53                 my $new_line2=substr($line2,0,$len);
 54                 my $new_line4=substr($line4,0,$len);
 55                 #print OUT "$new_line1\n$new_line2\n\n";
 56                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 57         }
 58         elsif($line2 =~ /^(\w+)AGATCGGAAGA$/)
 59         {
 60                 #$line2= s/AGATCGGAAGA$//;
 61                 my $len = length($1);
 62                 my $new_line2=substr($line2,0,$len);
 63                 my $new_line4=substr($line4,0,$len);
 64                 #print OUT "$new_line1\n$new_line2\n\n";
 65                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 66         }
 67         elsif($line2 =~ /^(\w+)AGATCGGAAG$/)
 68         {
 69                 #$line2= s/AGATCGGAAG$//;
 70                 my $len = length($1);
 71                 my $new_line2=substr($line2,0,$len);
 72                 my $new_line4=substr($line4,0,$len);
 73 #               print OUT "$new_line1\n$new_line2\n\n";
 74                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 75         }
 76         elsif($line2 =~ /^(\w+)AGATCGGAA$/)
 77         {
 78                 #$line2= s/AGATCGGAA$//;
 79                 my $len = length($1);
 80                 my $new_line2=substr($line2,0,$len);
 81                 my $new_line4=substr($line4,0,$len);
 82                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 83 #               print OUT "$new_line1\n$new_line2\n\n";
 84         }
 85         elsif($line2 =~ /^(\w+)AGATCGGA$/)
 86         {
 87                 #$line2= s/AGATCGGA$//;
 88                 #my $line2_length= length $line2;
 89                 my $len = length($1);
 90                 my $new_line2=substr($line2,0,$len);
 91                 my $new_line4=substr($line4,0,$len);
 92                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
 93 #               print OUT "$new_line1\n$new_line2\n\n";
 94         }
 95         elsif($line2 =~ /^(\w+)AGATCGG$/)
 96         {
 97                 #$line2= s/AGATCGG$//;
 98                 #my $line2_length= length $line2;
 99                 my $len = length($1);
100                 my $new_line2=substr($line2,0,$len);
101                 my $new_line4=substr($line4,0,$len);
102 #               print OUT "$new_line1\n$new_line2\n\n";
103                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
104         }
105         elsif($line2 =~ /^(\w+)AGATCG$/)
106         {
107                 #$line2= s/AGATCG$//;
108                 #my $line2_length= length $line2;
109                 my $len = length($1);
110                 my $new_line2=substr($line2,0,$len);
111                 my $new_line4=substr($line4,0,$len);
112                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
113 #               print OUT "$new_line1\n$new_line2\n\n";
114         }
115         elsif($line2 =~ /^(\w+)AGATC$/)
116         {
117                 #$line2= s/AGATC$//;
118                 #my $line2_length= length $line2;
119                 my $len = length($1);
120                 my $new_line2=substr($line2,0,$len);
121                 my $new_line4=substr($line4,0,$len);
122         #       print OUT "$new_line1\n$new_line2\n\n";
123                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
124         }
125         elsif($line2 =~ /^(\w+)AGAT$/)
126
127         {
128                  #$line2= s/AGAT$//;
129                 #my $line2_length= length $line2;
130                 my $len = length($1);
131                 my $new_line2=substr($line2,0,$len);
132                 my $new_line4=substr($line4,0,$len);
133 #               print OUT "$new_line1\n$new_line2\n\n";
134                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
135         }
136         elsif($line2 =~ /^(\w+)AGA$/)
137         {
138                 #$line2= s/AGA$//;
139                 #my $line2_length= length $line2;
140                 my $len = length($1);
141                 my $new_line2=substr($line2,0,$len);
142                 my $new_line4=substr($line4,0,$len);
143 #               print OUT "$new_line1\n$new_line2\n\n";
144                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
145         }
146         elsif($line2 =~ /^(\w+)AG$/)    #以AG结尾则去掉AG
147         {
148                 #$line2= s/AG$//;
149                 #my $line2_length= length $line2;
150                 my $len = length($1);
151                 my $new_line2=substr($line2,0,$len);
152                 my $new_line4=substr($line4,0,$len);
153                 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
154 #               print OUT "$new_line1\n$new_line2\n\n";
155         }
156         else
157         {
158                 print OUT "$line1\n$line2\n$line3\n$line4\n";
159         }
160 }
161 close IN;
162 close OUT;


2.rm_firstx_leny.pl

#! /usrt/bin/perl -w
use strict;

# remove forward xbp, backward zbp, remain more than ybp
my $usage = "perl $0   x y z";
die $usage unless @ARGV==5;
open IN,"zcat $ARGV[0] |" or die $!;
open OUT,"| gzip > $ARGV[1]" or die $!;
while()
{
        chomp;
        my $line1=$_;
        chomp(my $line2=);
        chomp(my $line3=);
        chomp(my $line4=);
        my $len_2=length($line2);
        if($len_2 >=$ARGV[2]+$ARGV[4]+$ARGV[3])  #大于或等于6+12+20(38)
        {
                my $second=substr($line2,$ARGV[2],$len_2-$ARGV[2]-$ARGV[4]);
                my $line4_1=substr($line4,$ARGV[2],$len_2-$ARGV[2]-$ARGV[4]);
                my $len=length($second);
                #my $third=substr($second,0,$len_2-24);
                #my $line4_2=substr($line4_1,0,$len_2-24);
                print OUT "$line1\n$second\n$line3\n$line4_1\n";
        }
        else
        {
                next;
        }
}
close IN;
close OUT;

3.ch3deleate.pl

  1 #!/usr/bin/perl
  2 use strict;
  3 use warnings;
  4
  5 my ($line1,$line2,$line3,$line4,$count);
  6 open IN,"zcat $ARGV[0]|" or die $!;
  7 open OUT,"| gzip > $ARGV[1]" or die $!;
  8 while ()
  9 {
 10         $count=0;
 11         chomp;
 12         $line1 = $_;
 13         chomp($line2=);
 14         chomp($line3=);
 15         chomp($line4=);
 16         while ($line2 =~m/CA/g)  #从建库方法来看,第二链不会有TG这种情况
 17         {
 18                 $count++;
 19         }
 20         while ($line2 =~m/CT/g)
 21         {
 22                 $count++;
 23         }
 24         while ($line2 =~m/CC/g)
 25         {
 26                 $count++;
 27         }
 28         next if ($count >= 3);
 29         print OUT "$line1\n$line2\n$line3\n$line4\n";
 30 }
 31 close IN;

4.s05.noch_arrange

109 out=\$dir/\${f}_\${b}trim/s05.noch_arrange/\$s
110 for i in {1..22} X Y
111 do
112  c=chr\$i
113  ff=\$f
114  bb=\$b
115  sbam2=\$dir/\${ff}_\${bb}trim/s04.align/\$s/\$s.2.nonCG.fq.gz_bismark.sort.bam
116  log=\$dir/\${ff}_\${bb}trim/s05.noch_arrange/\$s/\$c.log
117  outout=\$dir/\${ff}_\${bb}trim/s05.noch_arrange/\$s
118
119 touch \$shell/\$c.ing... && \\
120 \$samtools view -h \$sbam2 \$c\\
121 |perl -ne 'print \$_ and next if /^@/;my \$met = \$1 if/XM:Z:(\S+)/;next if (\$met =~tr/HX/HX/)>=3;print \    $_'\\
122 |perl \$dirperl/samStat.pl /dev/stdin \$log \\
123 |\$samtools view -uSb /dev/stdin \\
124 |\$samtools mpileup -f \$database/hg19/DNA_fa/\$c.fa /dev/stdin \\
125 |perl \$dirperl/covStrBed.pl /dev/stdin \$outout 1 \\
126 |perl \$dirperl/covStrBed.pl /dev/stdin \$outout 3 \\
127 |perl \$dirperl/modCytSplt.pl \$database/hg19/dbSNP/human_build_137/vcf_by_chromosome/CHB/processed_data/\    $c.gz \$database/hg19/DNA_fa/\$c.fa  \\
128 /dev/stdin \$outout/\$c. >/dev/null && \\
129 mv \$shell/\$c.ing... \$shell/\$c.ok
130 done
131
132 for i in {1..22} X Y
133 do
134 zcat \$out/chr\$i.CG.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$5    \"\\t\"\$6}' - > \$out/chr\$i.CG
135 done
136 cat \$out/chr*.CG > \$out/all.CG
137
138 for i in {1..22} X Y
139 do
140 zcat \$out/chr\$i.CHG.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$    5\"\\t\"\$6}' - > \$out/chr\$i.CHG
141 zcat \$out/chr\$i.CHH.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$    5\"\\t\"\$6}' - > \$out/chr\$i.CHH
142 perl \$dirperl/sorcharacter.pl \$out/chr\$i.CHG > \$out/chr\$i.CHG.sort
143 perl \$dirperl/sorcharacter.pl \$out/chr\$i.CHH > \$out/chr\$i.CHH.sort
144 done
145 cat \$out/chr*.CHG.sort \$out/chr*.CHH.sort > \$out/all.nonCG
146 rm \$out/chr*
147 rm \$dir/\${f}_\${b}trim/s05.noch_arrange/\${s}chr*.cov*.bed
148 echo s05done

5.extractCGx2.pl

  1 #!/usr/bin/perl
  2 use strict;
  3 use warnings;
  4
  5 my $usage = "perl $0   
  6 :bam file;reads_name flag chr1 pos 255 74M * 0 0 TTTTTTTTTTTTTTT ############## ...(not imp    ortant)
  7 :fasta(hg19)
  8 :bam file;reads_name flag chr1 pos 255 74M * 0 0 TTTTTTTTTTTTTTT ############## ...(not imp    ortant)(delete some reads)";
  9 die $usage unless @ARGV==3;
 10 my ($id,$seq,%hash2,$len,$end1,$str1,$i);
 11 open OUT,"|gzip >  $ARGV[2]" or die $!;
 12 open IN1,"<$ARGV[1]" or die $!;
 13 $/=">";
 14 while ()
 15 {
 16         if ($_=~/(chr.*?)\s(.*)/ms)
 17         {
 18                 $id = $1;
 19                 $seq = $2;
 20                 $seq =~s/\s//g;
 21                 $hash2{$id} = $seq;
 22         }
 23 }
 24 close IN1;
 25
 26 $/="\n";
 27 open IN2,"<$ARGV[0]" or die $!;
 28 while ()
 29 {
 30         $i=0;
 31         chomp;
 32         my @a = split;
 33         if ($a[1] == 16 )
 34         {
 35                 $len = length($a[10]);
 36                 $end1 = $a[3]+$len+2;
 37                 $str1 = $end1-7;
 38                 while (substr($hash2{$a[2]},$str1,7)=~/GC/ig)
 39                 {
 40                         $i++;
 41                 }
 42         }
 43         if ($a[1]==0)
 44         {
 45                 $str1 = $a[3]-1-3;
 46                 while (substr($hash2{$a[2]},$str1,7)=~/CG/ig)
 47                 {
 48                         $i++;
 49                 }
 50         }
 51         if ($i>=2)
 52         {
 53                 print OUT "$_\n";
 54         }
 55 }
 56 close IN2;

6.qc5bp.pl

  1 #!/usr/bin/perl
  2 use strict;
  3 use warnings;
  4
  5 my $usage = "perl $0   
  6 :clean_data
  7 :5 (or 10)
  8 :clean_data_5bpqc,if 5bp or 10bp phred score >= 20,retain";
  9 die $usage unless @ARGV==3;
 10
 11 my ($asc,$i,$j);
 12 open IN1,"zcat $ARGV[0]|" or die $!;
 13 open OUT,"|gzip > $ARGV[2]" or die $!;
 14 while ()
 15 {
 16         chomp;
 17         my $line1 = $_;
 18         chomp(my $line2=);
 19         chomp(my $line3=);
 20         chomp(my $line4=);
 21         if ($line1=~/@(.*?)\s/)
 22         {
 23                 $j = 0;
 24                 for ($i=1;$i<=$ARGV[1];$i++)
 25                 {
 26                         my $s = substr($line4,$i,1);
 27                         if (ord($s)>=50)  #按照phred33系统,50-33=17,大于等于17哦
 28                         {
 29                                 $j++;
 30                         }
 31                 }
 32                 if ($j == 5 )
 33                 {
 34                         print OUT "$line1\n$line2\n$line3\n$line4\n";
 35                 }
 36         }
 37 }
 38 close IN1;

7.rmduppcrv2.pl

    1 #!/usr/bin/perl
  2 use strict;
  3 use warnings;
  4
  5 my $usage = "perl $0     
  6 :clean_data
  7 :bam2
  8 :5bp or 10bp;
  9 ;bam1
 10 :bam.txt";
 11 die $usage unless @ARGV==5;
 12 my (%bam,%rmdup,%seq,%pos,%pos2,$i,$end);#$end is the right-most coordinate of bam reads
 13 open IN1,"zcat $ARGV[0] |" or die $!;
 14 open OUT,"|gzip > $ARGV[4]" or die $!;
 15 while ()  # for barcode quality
 16 {
 17         chomp;
 18         my $line1 = $_;
 19         chomp(my $line2=);
 20         chomp(my $line3=);
 21         chomp(my $line4=);
 22         if ($line1=~/@(.*?)\s/)
 23         {
 24                 my $s = substr($line2,0,$ARGV[2]); #extract barcode in reads1
 25                 my $key = $1;# read name
 26                 $seq{$key} = $s; # barcode <- read_name1
 27         }
 28 }
 29 close IN1;
 30 open IN2,"zcat $ARGV[1] |" or die $!;
 31 while ()
 32 {
 33         chomp;
 34         my @a = split;
 35         if ($_=~/(.*?)_/)
 36         {
 37                 my $key = $1;# read_name2
 38                 $end=$a[3]+length($a[9]);# end position
 39                 if ($a[1]==0)
 40                 {
 41                         $pos{$key} = "$a[2]\t$a[3]"; #chr&start <- read_name2
 42                 }
 43                 if ($a[1]==16)
 44                 {
 45                         $pos{$key} = "$a[2]\t$end"; #chr&end <- read_name2
 46                 }
 47         }
 48 }
 49 close IN2;
 50 open IN4,"<$ARGV[3]" or die $!;
 51 while ()
 52 {
 53         chomp;
 54         my @a = split;
 55         if ($_=~/(.*?)_/)
 56         {
 57                 my $key = $1;# read_name1
 58                 $end=$a[3]+length($a[9]);# end position
 59                 if ($a[1]==0)
 60                 {
 61                         $pos2{$key} = "$a[2]\t$a[3]"; #chr&str <- read_name1
 62                 }
 63                 if ($a[1]==16)
 64                 {
 65                         $pos2{$key} = "$a[2]\t$end"; #chr&end <- read_name1
 66                 }
 67         }
 68 }
 69
 70 foreach my $key (keys %pos) #number of keys of %seq is more than that of %pos, for bamfile only contains t    he reads that mapped on the genome;as it is sorted by %pos's key, if there is a seqid that can only be fou    nd in %pos, but not in %seq, this id will be printed in final results, and there will be an error reported    . But this situation won't happen, for %seq contains id in %pos.
 71 {
 72         next if (not exists $seq{$key}); # 5qc clean data.Discard reads with low quality
 73         if (not exists $pos2{$key})
 74         {
 75                 $i=0;
 76                 my $item = "$seq{$key}\t$pos{$key}\t$i";  #barcode,read2_chromosome,read2_start_coordinate    ,0 。如果read1没有比对信息,则只考虑reads1的barcode以及reads2的起始位置,一致,则认为是同一条reads
 77                 $rmdup{$item} = $key; #read name <- barcode,read2_chromosome,read2_start_coordinate
 78         }
 79         else
 80         {
 81                 my $item = "$seq{$key}\t$pos{$key}\t$pos2{$key}"; #barcode,read2_chromosome,read2_start_co    ordinate,read1_chromosome,read1_start_coordinate 。若read1和read2的比对信息都有,则考虑reads1的barcode、reads1的起始位置以及reads2的起始位置,一致,则认为是同一条reads
 82                 $rmdup{$item} = $key; #read2name <- barcode&chr&str;key is xbp seq and position of the seq    id without pcr duplication, value is the seqid without pcr duplication;
 83         }
 84 }
 85 undef %seq;
 86 undef %pos;
 87 undef %pos2;
 88
 89 open IN2,"zcat $ARGV[1]|" or die $!; # read bam file again, print reads without pcr duplication in bam.txt     format;
 90 while ()
 91 {
 92         chomp;
 93         my @a = split;
 94         if ($_=~/(.*?)_/)
 95         {
 96                 my $item = $1;
 97                 $bam{$item} = $_; # put bamfile into a hash, key is the seqid, value is the whole informat    ion from bam file;
 98         }
 99 }
100 foreach my $id (values %rmdup)
101 {
102         print OUT "$bam{$id}\n";
103 }
104 close IN2;

8、CGIcgcgcggv2.sh

  1 export PATH=/data/bin/:$PATH
  2 i=ss
  3 s=$1 #Pnlc1
  4 f=6
  5 b=12
  6 type=$2 #gac
  7
  8 shell=/WPSnew/liuxiaomeng/shell/cmp_pipeline/CGIcgcgcgg/shell2/$type
  9 out=/WPSnew/liuxiaomeng/project/cmp/analysis/7.additional_pipeline/2.CGIcgcg    cgg/singlesample/$s
 10 txt=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/${f}_${b}trim    /s06.cgcgma_pcr/$s.5bp.cgcgmat.rmd.gz
 11 perlsup=/WPSnew/liuxiaomeng/perl/sorcharacter.pl
 12 #bedtemp=$out/temp.$s.${b}bp.cgcgmat.rm.bed
 13 Pbedres=$out/P_$s.${b}bp.CGI.bed
 14 Ppositivebedres=$out/temppositiveP_$s.${b}bp.CGI.bed
 15 Pnegativebedres=$out/tempnegativeP_$s.${b}bp.CGI.bed
 16 intersectBed=/WPSnew/liuxiaomeng/software/bedtools-2.17.0/bin/intersectBed
 17 CGIcgcgcgg=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcgg.bed
 18 CGIcgcgcggpositive=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcggpositive.    bed
 19 CGIcgcgcggnegative=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcggnegative.    bed
 20 bedtemppositive=$out/temppositive.$s.${b}bp.cgcgmat.rm.bed
 21 bedtempnegative=$out/tempnegative.$s.${b}bp.cgcgmat.rm.bed
 22
 23 mkdir -p $out $shell
 24 echo "zcat $txt|awk '{OFS=\"\t\";if(\$2==0) print \$3,\$4-1,\$4,\$1,\$5,\"+\    "}' > $bedtemppositive && \
 25 zcat $txt|awk '{OFS=\"\t\";if(\$2==16)print \$3,\$4+length(\$10)-2,\$4+lengt    h(\$10)-1,\$1,\$5,\"-\"}' > $bedtempnegative && \
 26 $intersectBed -a $CGIcgcgcggpositive -b $bedtemppositive -c > $Ppositivebedr    es && \
 27 $intersectBed -a $CGIcgcgcggnegative -b $bedtempnegative -c > $Pnegativebedr    es && \
 28 cat $Ppositivebedres $Pnegativebedres|perl $perlsup - > $Pbedres && \
 29 rm $bedtemppositive $bedtempnegative $Ppositivebedres $Pnegativebedres
 30
 31
 32 txt=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/${f}_${b}trim    /s06.cgcgma_pcr/$s.5bp.cgcgmat.gz
 33 Tbedres=$out/T_$s.${b}bp.CGI.bed
 34 Tpositivebedres=$out/temppositiveT_$s.${b}bp.CGI.bed
 35 Tnegativebedres=$out/tempnegativeT_$s.${b}bp.CGI.bed
 36
 37 zcat \$txt|awk '{OFS=\"\t\";if(\$2==0) print \$3,\$4-1,\$4,\$1,\$5,\"+\"}' >     $bedtemppositive && \
 38 zcat \$txt|awk '{OFS=\"\t\";if(\$2==16)print \$3,\$4+length(\$10)-2,\$4+leng    th(\$10)-1,\$1,\$5,\"-\"}' > $bedtempnegative && \
 39 $intersectBed -a $CGIcgcgcggpositive -b $bedtemppositive -c > \$Tpositivebed    res && \
 40 $intersectBed -a $CGIcgcgcggnegative -b $bedtempnegative -c > \$Tnegativebed    res && \
 41 cat \$Tpositivebedres \$Tnegativebedres|perl $perlsup - > \$Tbedres && \
 42 rm $bedtemppositive $bedtempnegative \$Tpositivebedres \$Tnegativebedres &&     \
 43 echo "done $s.$b"
 44 ############STEP7_Pcgi_distri.sh STEP7_Tcgi_distri.sh END##############
 45
 46 ############STEP8_Pcalreads.sh##########################################
 47 reads=$out/$s.${f}_${b}reads.txt
 48 awk -v var=$s 'BEGIN{sum=0}{sum+=\$4}END{printf \"P-CGIcgcgcgg-\"var\"\\t\"s    um\"\\t\"}' $Pbedres >> \$reads
 49 awk -v var=$s 'BEGIN{sum=0}{sum+=\$4}END{printf \"T-CGIcgcgcgg-\"var\"\\t\"s    um\"\\t\"}' \$Tbedres >> \$reads
 50 ################################STEP8_Pcalreads.sh END######################    ###############
 51
 52 ############STEP9_Mepm.sh##############################################
 53 perl=/WPSnew/liuxiaomeng/perl/cmp/Mepm.pl
 54 perl2=/WPSnew/liuxiaomeng/perl/cmp/upMepm.pl
 55 TMepm=$out/T_$s.${b}bp.Mepm.bed
 56 PMepm=$out/P_$s.${b}bp.Mepm.bed
 57 uPMepm=$out/uPnor_$s.${b}bp.Mepm.bed
 58 in3=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/cal/$s.${f}_$    {b}reads.txt
 59
 60 awk '{print \$12}' \$in3|perl \$perl \$Tbedres - > \$TMepm
 61 awk '{print \$14}' \$in3|perl \$perl $Pbedres - > \$PMepm
 62 perl \$perl2 $Pbedres \$in3 > \$uPMepm
 63 ###########STEP9_Mepm.sh END###########################################" > $    shell/$s.sh
 64 qsub -o $shell/$s.sh.o -e $shell/$s.sh.e -cwd -l vf=1g -V $shell/$s.sh $s $t    ype

9.combine.pl

#!/usr/bin/perl
use strict;
use warnings;
no strict 'refs';

my $usage = "perl $0 IN1 IN2 IN3...  OUT
IN1:chr10   100028204       100028508       2
IN2:chr10   100028204       100028508       2
...
INn:6(or 8)
OUT1:chr10   100028204       100028508       2  3       5...";
die $usage unless @ARGV>2;
my ($f,$i,$j,%cgi,$in);
my $dir="/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline";
open IN0,"<$dir/$ARGV[-1]_12trim/s07.Tcgibed/$ARGV[0].12bp.CGI.bed" or die $!;
$f=@ARGV-1;
open OUT1,">$dir/$ARGV[-1]_12trim/s08.Tcgibednorm/T_$ARGV[$f-1].cgireads.txt" or die $!;
open OUT2,">$dir/$ARGV[-1]_12trim/s08.Tcgibednorm/T_$ARGV[$f-1].cgiMePM.txt" or die $!;
open OUT3,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/P_$ARGV[$f-1].cgireads.txt" or die $!;
open OUT4,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/P_$ARGV[$f-1].cgiMePM.txt" or die $!;
open OUT5,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/uPnor_$ARGV[$f-1].cgiMePM.txt" or die $!;

while ()
{
        chomp;
        $i++;
        my @a = split;
        $cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
        $j=0;
        $in = "IN".$i;
        open $in, "<$dir/$ARGV[-1]_12trim/s07.Tcgibed/$ARGV[$i].12bp.CGI.bed" or die $!;
        while (<$in>)
        {
                $j++;
                chomp;
                my @a = split;
                $cgi{$j}=$cgi{$j}."$a[3]\t";
        }
}

foreach my $key (sort {$a<=>$b} keys %cgi)
{
        print OUT1 "$cgi{$key}\n";
}
###########################################
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/T_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while ()
{
        chomp;
        $i++;
        my @a = split;
        $cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
        $j=0;
        $in = "IN".$i;
        open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/T_$ARGV[$i].12bp.Mepm.bed" or die $!;
        while (<$in>)
        {
                $j++;
                chomp;
                my @a = split;
                $cgi{$j}=$cgi{$j}."$a[3]\t";
        }
}

foreach my $key (sort {$a<=>$b} keys %cgi)
{
        print OUT2 "$cgi{$key}\n";
}

############################################
open IN0,"<$dir/$ARGV[-1]_12trim/s07.Pcgibed/$ARGV[0].12bp.CGI.bed" or die $!;
$f=@ARGV-1;
$i=0;
while ()
{
        chomp;
        $i++;
        my @a = split;
        $cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
        $j=0;
        $in = "IN".$i;
        open $in, "<$dir/$ARGV[-1]_12trim/s07.Pcgibed/$ARGV[$i].12bp.CGI.bed" or die $!;
        while (<$in>)
        {
                $j++;
                chomp;
                my @a = split;
                $cgi{$j}=$cgi{$j}."$a[3]\t";
        }
}

foreach my $key (sort {$a<=>$b} keys %cgi)
{
        print OUT3 "$cgi{$key}\n";
}

###########################
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/P_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while ()
{
        chomp;
        $i++;
        my @a = split;
        $cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
        $j=0;
        $in = "IN".$i;
        open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/P_$ARGV[$i].12bp.Mepm.bed" or die $!;
        while (<$in>)
        {
                $j++;
                chomp;
                my @a = split;
                $cgi{$j}=$cgi{$j}."$a[3]\t";
        }
}

foreach my $key (sort {$a<=>$b} keys %cgi)
{
        print OUT4 "$cgi{$key}\n";
}
##########
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/uPnor_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while ()
{
        chomp;
        $i++;
        my @a = split;
        $cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
        $j=0;
        $in = "IN".$i;
        open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/uPnor_$ARGV[$i].12bp.Mepm.bed" or die $!;
        while (<$in>)
        {
                $j++;
                chomp;
                my @a = split;
                $cgi{$j}=$cgi{$j}."$a[3]\t";
        }
}

foreach my $key (sort {$a<=>$b} keys %cgi)
{
        print OUT5 "$cgi{$key}\n";
}

你可能感兴趣的:(CMP分析流程)