Saturday Exam (1)

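Bioinformatics Exercises
I. data/newBGIseq500_1.fq and data/newBGIseq500_2.fq contain PE101 sequencing data of genomic DNA from a eukaryote, generated on the BGIseq500 platform, with an insert size of 450 bp; the genome size is known to be about 6 Mb.
1) How many PE read pairs does this sequencing run contain? In theory, can at least 99% of the genome reach a coverage of at least 40X? Briefly write out the reasoning, the calculation, and the result; if you use R or a similar tool for the numerical calculation, include the code.
Code: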

wc -l data/newBGIseq500_1.fq | awk '{print $1/4}'
# Result: 1599999
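The same count can be sketched in Python. This is a minimal illustration, assuming every FASTQ record occupies exactly four lines (no wrapped sequences); the helper name `count_fastq_records` and the two tiny in-memory files are hypothetical stand-ins for the real R1 and R2 files.

```python
import io

def count_fastq_records(handle):
    """Count reads in a FASTQ stream, assuming exactly four lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count %d is not a multiple of 4" % n_lines)
    return n_lines // 4

# Hypothetical two-read files standing in for newBGIseq500_1.fq and newBGIseq500_2.fq
demo_r1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n")
demo_r2 = io.StringIO("@r1/2\nACGA\n+\nIIII\n@r2/2\nTTGC\n+\nIIII\n")

n1 = count_fastq_records(demo_r1)
n2 = count_fastq_records(demo_r2)
# For paired-end data both files should contain the same number of reads
print(n1, n2, n1 == n2)
```

For the real data, passing `open("data/newBGIseq500_1.fq")` instead of the in-memory example would give the number of read pairs, provided both files report the same count.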

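2.

Reference answer 1

# Compute the total number of bases from the previous result
# (200 bases per read pair are used here; strictly, PE101 gives 202 bases per pair)
base <- 1599999*200
# Compute the mean sequencing depth
dep <- base/6000000
# The genome is long, so the probability p that a given read covers a particular position is << 1, while the number of reads is very large; the depth at each base therefore approximately follows a Poisson distribution
# R's ppois() is the cumulative Poisson distribution function. To get the probability that a base is covered at least 40X, first compute the cumulative probability of 0-39X coverage, ppois(39, dep), then subtract it from 1:
1-ppois(39, dep)
[1] 0.975145
# Therefore, in theory, this amount of data cannot bring at least 99% of the genome to 40X coverage or more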
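As a cross-check of the R calculation, the Poisson tail can be summed by hand. The sketch below uses only the standard library; `poisson_cdf` is a hypothetical helper written for this illustration, and the inputs (1599999 pairs, 200 bases per pair, 6 Mb genome) are the values used above.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)      # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i        # P(X = i) derived from P(X = i - 1)
        total += term
    return total

read_pairs = 1599999
bases_per_pair = 200           # value used in the R code above
genome_size = 6000000

depth = read_pairs * bases_per_pair / genome_size
frac_40x = 1 - poisson_cdf(39, depth)   # fraction of bases expected at >= 40X
print(depth, frac_40x)
```

The printed fraction should agree with the `1-ppois(39, dep)` value above up to rounding, and it stays below the 0.99 target.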

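Solution approach:
1. Every four lines of a FASTQ file describe one read; 1.fq and 2.fq are paired
2. Compute the total number of bases from PE101 and the number of read pairs, then compare it with the number of bases needed for 99% of the 6 Mb genome to reach 40X or more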
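The comparison in step 2 can be made concrete by asking what mean depth would be needed. The following is a rough sketch under the same Poisson assumption; `poisson_cdf` and `min_depth_for` are hypothetical helpers, the 0.1X search step is an arbitrary choice, and 200 bases per pair is carried over from the R code.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def min_depth_for(k_min, target, step=0.1):
    """Smallest mean depth (on a grid of `step`) with P(X >= k_min) >= target."""
    lam = float(k_min)
    while 1 - poisson_cdf(k_min - 1, lam) < target:
        lam += step
    return lam

needed_depth = min_depth_for(40, 0.99)
# Read pairs needed for a 6 Mb genome at 200 bases per pair (assumed values)
needed_pairs = math.ceil(needed_depth * 6000000 / 200)
print(needed_depth, needed_pairs)
```

Under these assumptions the required number of pairs comes out larger than the 1599999 pairs available, which is consistent with the conclusion that the 99% target is not met.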

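2) Download and install SOAPdenovo, run a de novo assembly of this data with the -K parameter set to 35, and plot the cumulative length curve of the assembled sequences ordered from longest to shortest;

Download URL:
https://sourceforge.net/projects/soapdenovo2/files/latest/download
Installation:
make

Configuration file: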

#maximal read length 
max_rd_len=100 
[LIB] 
#average insert size
avg_ins=450
#if sequence needs to be reversed 
reverse_seq=0 
#in which part(s) the reads are used 
asm_flags=3 
#use only first 100 bps of each read 
rd_len_cutoff=100 
#in which order the reads are used while scaffolding 
rank=1 
#cutoff of pair number for a reliable connection (at least 3 for short insert size) 
pair_num_cutoff=3 
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size) 
map_len=32 
#a pair of fastq file, read 1 file should always be followed by read 2 file 
q1=/home/stu27/data/newBGIseq500_1.fq 
q2=/home/stu27/data/newBGIseq500_2.fq 

Run:

../SOAPdenovo_master/SOAPdenovo-63mer all -s ../SOAPdenovo_master/config_file -K 35 -R -o ../data/denovo_graph_prefix 1>../data/denovo.log 2>../data/denovo.err

Extract sequence lengths (Python script):

#!/usr/bin/python

# Map each scaffold ID to its total sequence length
fasta = {}
with open("../data/denovo_graph_prefix.scafSeq") as scafSeq:
        for line in scafSeq:
                if line.startswith(">"):
                        seq_id = line.strip().lstrip(">")
                        fasta[seq_id] = 0
                else:
                        fasta[seq_id] += len(line.strip())

# Write ID, length, and cumulative length, longest sequence first
with open("./length.txt", "w") as f_out:
        accumu_len = 0
        for key in sorted(fasta, key = fasta.__getitem__, reverse = True):
                length = fasta[key]
                accumu_len += length
                f_out.write("%s\t%s\t%s\n"% (key, length, accumu_len))

Plot the cumulative length curve in R:

pdf("length.pdf") 
lens <- read.table("length.txt") 
plot(lens$V3, type="l",ylab='Total length',xlab = 'Seq num')
dev.off()
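For reference, the y-values of this curve are just running sums of the lengths sorted from longest to shortest. A small sketch with made-up lengths (the list `demo_lengths` is hypothetical, not taken from the assembly):

```python
from itertools import accumulate

def cumulative_lengths(lengths):
    """Sort lengths from longest to shortest and return their running totals."""
    ordered = sorted(lengths, reverse=True)
    return list(accumulate(ordered))

demo_lengths = [4, 2, 6]            # hypothetical scaffold lengths
curve = cumulative_lengths(demo_lengths)
print(curve)                        # running totals of 6, 4, 2
```

The last value of the curve equals the total assembly length, and the curve flattens as shorter sequences are added.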

Compute N50:

#!/usr/bin/python

# length.txt is already sorted from longest to shortest
length_list = []
total_seq_len = 0
temp_len = 0
N50 = 0
with open("./length.txt") as fasta:
        for line in fasta:
                length = int(line.split("\t")[1])
                length_list.append(length)
                total_seq_len += length

# N50: length of the sequence at which the running total first reaches half the total length
N50_pos = total_seq_len / 2
for value in length_list:
        temp_len += value
        if temp_len >= N50_pos:
                N50 = value
                break
print("The length of N50 is: " + str(N50))
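The same N50 logic can be wrapped in a function that does not depend on `length.txt` being pre-sorted. This is a minimal sketch; the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0

# Total is 20, half is 10; 6 + 5 = 11 is the first running total >= 10
demo_n50 = n50([2, 3, 4, 5, 6])
print(demo_n50)
```

Sorting inside the function makes the result independent of input order, at the cost of an extra sort when the input is already ordered.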

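II. The file data/chr17.vcf.gz in the exam reference directory contains the variant set for chromosome 17 of a trio; the reference sequence is hg38.
1) Write a script or choose a suitable tool to summarize the distribution of QUAL values of the variant sites in the VCF, and plot it.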

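#!/usr/bin/bash

# Extract the QUAL column (field 6), skipping header lines
# (assumes data/chr17.vcf.gz has already been decompressed to ../data/chr17.vcf)
awk '{if(/^#/){next};print $6}' ../data/chr17.vcf > qual_value.txt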

../vcftools_0.1.13/bin/vcftools --vcf ../data/chr17.vcf --chr chr17 --from-bp 7661779 --to-bp 7687538 --recode --out TP53.txt

#counting HO and HET:
python viriant_number.py

Plot in R:

pdf("qual.pdf")
qual <- read.table('qual_value.txt')
hist(qual$V1,main = "Qual Hist")
dev.off()
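If a table of counts is wanted alongside the histogram, the QUAL values can also be binned directly. The sketch below is one possible approach: `qual_histogram`, the bin width of 100, and the example VCF lines are all hypothetical, and records whose QUAL is "." are skipped.

```python
from collections import Counter

def qual_histogram(vcf_lines, bin_width=100):
    """Count variant records per QUAL bin; keys are the lower edges of the bins."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue                      # skip meta and header lines
        qual = line.rstrip("\n").split("\t")[5]
        if qual == ".":
            continue                      # missing QUAL
        counts[int(float(qual) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Hypothetical records, not taken from chr17.vcf
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t100\t.\tA\tG\t35.5\tPASS\t.",
    "chr17\t200\t.\tC\tT\t120\tPASS\t.",
    "chr17\t300\t.\tG\tA\t.\tPASS\t.",
    "chr17\t400\t.\tT\tC\t199.9\tPASS\t.",
]
hist = qual_histogram(demo_vcf)
print(hist)
```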

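2) Choose a suitable tool or method to extract and output this family's variants in the TP53 gene, and report the number of variant sites and the status of each sample (numbers of homozygous and heterozygous sites).
TP53 coordinates found through Ensembl: 7,661,779-7,687,538
For the vcftools command, see the code above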
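The region extraction amounts to keeping records on chr17 whose POS lies within the TP53 coordinates. A plain-Python sketch of that filter follows; `in_region` is a hypothetical helper, the example records are made up, and treating both bounds as inclusive is an assumption of this sketch.

```python
def in_region(vcf_line, chrom, start, end):
    """True if a VCF data line lies on `chrom` with start <= POS <= end."""
    fields = vcf_line.split("\t")
    return fields[0] == chrom and start <= int(fields[1]) <= end

TP53 = ("chr17", 7661779, 7687538)   # coordinates quoted above

# Hypothetical records around the region boundaries
demo_records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t7661778\t.\tA\tG\t50\tPASS\t.",
    "chr17\t7661779\t.\tA\tG\t50\tPASS\t.",
    "chr17\t7670000\t.\tC\tT\t50\tPASS\t.",
    "chr17\t7687538\t.\tG\tA\t50\tPASS\t.",
    "chr17\t7687539\t.\tG\tA\t50\tPASS\t.",
    "chr16\t7670000\t.\tT\tC\t50\tPASS\t.",
]
kept = [rec for rec in demo_records
        if not rec.startswith("#") and in_region(rec, *TP53)]
print(len(kept))
```

The number of kept records is the number of variant sites in the region.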
Extract variant information with Python:

#!/usr/bin/python
import re

sample = {}       # per-sample counts of homozygous (HO) and heterozygous (HET) genotypes
sample_name = []
with open("./TP53.txt.recode.vcf") as vcf_file:
        for line in vcf_file:
                if line.startswith("##"):
                        continue
                elif line.startswith("#CHROM"):
                        # Sample names start at column 10 of the header line
                        sample_name = line.strip().split("\t")[9:]
                        for name in sample_name:
                                sample[name] = {"HO": 0, "HET": 0}
                else:
                        sample_INFO = line.strip().split("\t")[9:]
                        for k in range(len(sample_INFO)):
                                # GT is the first colon-separated field; alleles are separated by "/" or "|"
                                alleles = re.split("[/|]", sample_INFO[k].split(":")[0])
                                if "." in alleles:
                                        continue  # skip missing genotypes
                                if len(set(alleles)) == 1:
                                        # Equal alleles (including 0/0) are counted as homozygous
                                        sample[sample_name[k]]["HO"] += 1
                                else:
                                        sample[sample_name[k]]["HET"] += 1
for name, info in sample.items():
        print(name + "  " + "HO_NUM: " + str(info["HO"]) + "  " + "HET_NUM: " + str(info["HET"]) + "\n")

Method 2:

 grep -v '#' TP53.txt.recode.vcf | cut -f10 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'
 grep -v '#' TP53.txt.recode.vcf | cut -f11 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'
 grep -v '#' TP53.txt.recode.vcf | cut -f12 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'
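Both methods treat any genotype with two equal alleles as homozygous, so 0/0 (homozygous reference) is counted together with 1/1. If those cases need to be separated, a finer classification could look like the sketch below; `classify_gt`, the category names, and the example sample fields are hypothetical.

```python
import re
from collections import Counter

def classify_gt(sample_field):
    """Classify one VCF sample field by its GT value (first colon-separated entry)."""
    alleles = re.split(r"[/|]", sample_field.split(":")[0])
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

# Hypothetical sample fields for one individual across five sites
demo_fields = ["0/1:35", "1|1:40", "0/0:22", "./.", "1/2:18"]
summary = Counter(classify_gt(field) for field in demo_fields)
print(dict(summary))
```

Splitting on both "/" and "|" lets the same function handle unphased and phased genotypes.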
