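Bioinformatics Exercises
I. data/newBGIseq500_1.fq and data/newBGIseq500_2.fq contain PE101 sequencing data of genomic DNA from a eukaryote, generated on the BGIseq500 platform with an insert size of 450 bp; the genome size is known to be about 6 Mb.
1) How many read pairs does this sequencing run contain? In theory, can at least 40X coverage be reached over more than 99% of the genome? Briefly write out your reasoning and calculations together with the results; if you use R or a similar tool for the numerical work, include the code.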
Code:
wc -l data/newBGIseq500_1.fq | awk '{print $1/4}'
# Result: 1599999
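The same count can be cross-checked in Python, which also makes it easy to confirm that both files hold the same number of reads. This is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per read; the function names and the demo files are made up for illustration and are not part of the original answer.

```python
import os
import tempfile

def count_fastq_reads(path):
    """Return the number of reads in an uncompressed FASTQ file (4 lines per read)."""
    n_lines = 0
    with open(path) as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("%s: line count %d is not a multiple of 4" % (path, n_lines))
    return n_lines // 4

def count_read_pairs(path_1, path_2):
    """Return the number of read pairs, checking that both files agree."""
    n1 = count_fastq_reads(path_1)
    n2 = count_fastq_reads(path_2)
    if n1 != n2:
        raise ValueError("read counts differ: %d vs %d" % (n1, n2))
    return n1

# Tiny made-up demo (two reads per file), not the real data
demo = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
tmp_dir = tempfile.mkdtemp()
demo_1 = os.path.join(tmp_dir, "demo_1.fq")
demo_2 = os.path.join(tmp_dir, "demo_2.fq")
for p in (demo_1, demo_2):
    with open(p, "w") as out:
        out.write(demo)
print(count_read_pairs(demo_1, demo_2))  # prints 2
```

On the real data the call would be count_read_pairs("data/newBGIseq500_1.fq", "data/newBGIseq500_2.fq").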
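2.
Reference 1
# Total number of sequenced bases, from the read-pair count above
# (200 = bases per read pair as used in this answer, i.e. 2 x 100 bp; nominal PE101 would give 202)
base <- 1599999*200
# Average sequencing depth
dep <- base/6000000
# The genome is long, so the probability p that any given read covers a particular base is p << 1, while the number of reads is very large; the depth at each base therefore follows a Poisson distribution with mean dep.
# In R, ppois() is the cumulative Poisson distribution function. To get the probability that a base is covered at least 40X, first compute the cumulative probability of 0-39X coverage, ppois(39, dep), then subtract it from 1:
1-ppois(39, dep)
[1] 0.975145
# In theory, therefore, this amount of data cannot bring more than 99% of the genome to at least 40X coverage.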
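For readers without R, the Poisson tail can be recomputed with the Python standard library alone. This is a sketch under the same assumptions as the R code (1599999 read pairs, 200 bases per pair, a 6 Mb genome); poisson_cdf is a hand-written helper, not a library function, and the result should agree with the R value of about 0.975.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P(X = i) derived from P(X = i - 1)
        total += term
    return total

read_pairs = 1599999       # from the wc -l result
bases_per_pair = 200       # value used in the R code
genome_size = 6000000      # about 6 Mb, as given in the question

depth = read_pairs * bases_per_pair / float(genome_size)
frac_40x = 1.0 - poisson_cdf(39, depth)
print("mean depth: %.2f" % depth)
print("fraction of bases at >= 40X: %.4f" % frac_40x)
```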
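Approach:
1. Every four lines of a FASTQ file describe one read, and 1.fq and 2.fq are paired, so the line count of one file divided by 4 gives the number of read pairs.
2. Use the PE101 read length and the number of read pairs to get the total number of bases, then compare it with the amount of data needed to reach at least 40X over 99% of the 6 Mb genome.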
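The comparison in step 2 can be made concrete by asking what mean depth the Poisson model needs before 99% of bases reach 40X. The following is a rough sketch, assuming the same Poisson model and the 200 bases per pair used in the answer; the 0.1X search grid and both helper names are arbitrary choices made for illustration.

```python
import math

def frac_at_least(min_cov, lam):
    """P(X >= min_cov) for X ~ Poisson(lam)."""
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, min_cov):    # accumulate P(X = 1) .. P(X = min_cov - 1)
        term *= lam / i
        cdf += term
    return 1.0 - cdf

def min_mean_depth(min_cov, target_frac, step=0.1):
    """Smallest mean depth on a grid of `step` whose Poisson tail reaches target_frac."""
    lam = float(min_cov)
    while frac_at_least(min_cov, lam) < target_frac:
        lam += step
    return lam

genome_size = 6000000
needed_depth = min_mean_depth(40, 0.99)
needed_bases = needed_depth * genome_size
available_bases = 1599999 * 200
print("mean depth needed: about %.1fX" % needed_depth)
print("bases needed: %.0f, available: %d" % (needed_bases, available_bases))
```

Under this model the required mean depth is noticeably higher than 40X, which is why simply comparing the total data with 40 x 0.99 x 6 Mb is too optimistic.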
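2) Download and install SOAPdenovo, assemble the data de novo with the -K parameter set to 35, and plot the cumulative length curve of the assembled sequences ordered from longest to shortest.
Download URL:
https://sourceforge.net/projects/soapdenovo2/files/latest/download
Installation:
make
Configuration file: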
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=450
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 100 bps of each read
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/home/stu27/data/newBGIseq500_1.fq
q2=/home/stu27/data/newBGIseq500_2.fq
Run:
../SOAPdenovo_master/SOAPdenovo-63mer all -s ../SOAPdenovo_master/config_file -K 35 -R -o ../data/denovo_graph_prefix 1>../data/denovo.log 2>../data/denovo.err
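Extract the sequence lengths with a Python script:
#!/usr/bin/python
# Sum the length of every sequence in the scaffold FASTA file
fasta = {}
with open("../data/denovo_graph_prefix.scafSeq") as scaf_seq:
    for line in scaf_seq:
        if line.startswith(">"):
            # Header line: keep only the first word so the output stays three-column
            seq_id = line[1:].split()[0]
            fasta[seq_id] = 0
        else:
            fasta[seq_id] += len(line.strip())
# Write name, length and cumulative length, longest sequence first
with open("./length.txt", "w") as f_out:
    accumu_len = 0
    for key in sorted(fasta, key=fasta.__getitem__, reverse=True):
        length = fasta[key]
        accumu_len += length
        f_out.write("%s\t%s\t%s\n" % (key, length, accumu_len))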
Plot the cumulative length curve with R:
pdf("length.pdf")
lens <- read.table("length.txt")
plot(lens$V3, type="l",ylab='Total length',xlab = 'Seq num')
dev.off()
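Compute N50:
#!/usr/bin/python
# Compute N50 from length.txt (sequences already sorted from longest to shortest)
total_seq_len = 0
length_list = []
with open("./length.txt") as fasta:
    for line in fasta:
        seq_len = int(line.split("\t")[1])
        length_list.append(seq_len)
        total_seq_len += seq_len
# N50 is the length of the sequence at which the running total first reaches half of the total length
N50_pos = total_seq_len / 2
temp_len = 0
N50 = 0
for value in length_list:
    temp_len += value
    if temp_len >= N50_pos:
        N50 = value
        break
print("The length of N50 is: " + str(N50))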
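To make the N50 definition easier to check by hand, here is a small self-contained sketch on made-up lengths. The function n50 and the toy list are illustrative only; the function sorts its input itself, so it does not rely on length.txt being pre-sorted.

```python
def n50(lengths):
    """Return N50: the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input

# Made-up lengths for illustration: total 290, half 145
toy = [80, 70, 50, 40, 30, 20]
print(n50(toy))  # running totals 80, 150 -> prints 70
```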
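II. The file data/chr17.vcf.gz in the exam reference directory contains the variant set for chromosome 17 of a trio family; the reference sequence is hg38.
1) Write a script or choose a suitable tool to compute the distribution of QUAL values of the variant sites in the VCF, and plot it.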
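#!/usr/bin/bash
# Assumes data/chr17.vcf.gz has already been decompressed to ../data/chr17.vcf
# Extract the QUAL column (column 6), skipping header lines
awk '{if(/^#/){next};print $6}' ../data/chr17.vcf > qual_value.txt
# Extract the TP53 region; vcftools writes the result to TP53.txt.recode.vcf
../vcftools_0.1.13/bin/vcftools --vcf ../data/chr17.vcf --chr chr17 --from-bp 7661779 --to-bp 7687538 --recode --out TP53.txt
# Count homozygous (HO) and heterozygous (HET) sites:
python viriant_number.py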
Plot with R:
pdf("qual.pdf")
# Read the QUAL values written by the awk command (qual_value.txt)
qual <- read.table('qual_value.txt')
hist(qual$V1,main = "Qual Hist")
dev.off()
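Since the exercise supplies the file as chr17.vcf.gz, the QUAL column can also be read straight from the compressed file and summarised as a text histogram. This is a sketch only, assuming Python 3, a tab-separated VCF with QUAL in column 6, and "." for a missing value; the function names, the bin width of 50 and the three-record demo file are all invented for illustration.

```python
import gzip
import os
import tempfile

def read_quals(path):
    """Return QUAL values (column 6) of all data lines in a .vcf or .vcf.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    quals = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            qual = line.rstrip("\n").split("\t")[5]
            if qual != ".":          # "." marks a missing QUAL
                quals.append(float(qual))
    return quals

def bin_counts(values, width):
    """Count values per bin of the given width; keys are bin lower bounds."""
    counts = {}
    for v in values:
        low = int(v // width) * width
        counts[low] = counts.get(low, 0) + 1
    return counts

# Made-up three-record VCF for illustration
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr17\t100\t.\tA\tG\t35.5\tPASS\t.\n"
    "chr17\t200\t.\tC\tT\t120\tPASS\t.\n"
    "chr17\t300\t.\tG\tA\t.\tPASS\t.\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(demo_vcf)

quals = read_quals(demo_path)
for low, n in sorted(bin_counts(quals, 50).items()):
    print("%d-%d\t%d" % (low, low + 50, n))
```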
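2) Choose a suitable tool or method to extract and output the variants of this family within the TP53 gene, and report the number of variant sites and the status of each sample (numbers of homozygous and heterozygous sites).
TP53 coordinates looked up in Ensembl: 7,661,779-7,687,538
For the vcftools command, see the code above.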
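If vcftools is not available, the same region can be selected with a few lines of Python. This is a minimal sketch, assuming a tab-separated VCF, a chromosome named chr17 as in the vcftools command, and a region treated as inclusive at both ends (an assumption made here); it looks only at POS and ignores how far an indel extends. The function names and demo records are made up.

```python
def in_region(vcf_line, chrom, start, end):
    """True if a VCF data line lies on `chrom` with start <= POS <= end (1-based)."""
    fields = vcf_line.split("\t", 2)
    return fields[0] == chrom and start <= int(fields[1]) <= end

def filter_region(lines, chrom, start, end):
    """Keep header lines and the data lines that fall inside the region."""
    kept = []
    for line in lines:
        if line.startswith("#") or in_region(line, chrom, start, end):
            kept.append(line)
    return kept

# TP53 coordinates quoted in the answer (hg38)
TP53 = ("chr17", 7661779, 7687538)

# Made-up records for illustration
demo = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t7661778\t.\tA\tG\t50\tPASS\t.",
    "chr17\t7661779\t.\tC\tT\t50\tPASS\t.",
    "chr17\t7687538\t.\tG\tA\t50\tPASS\t.",
    "chr17\t7687539\t.\tT\tC\t50\tPASS\t.",
]
for line in filter_region(demo, *TP53):
    print(line)
```

The number of variant sites in the region is then the number of kept lines that do not start with "#".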
Extract the variant information with Python:
#!/usr/bin/python
# Count homozygous (HO) and heterozygous (HET) genotype calls per sample
import re
counts = {}
sample_names = []
with open("./TP53.txt.recode.vcf") as vcf_file:
    for line in vcf_file:
        if line.startswith("##"):
            continue
        elif line.startswith("#CHROM"):
            # Sample names start at column 10 of the header line
            sample_names = line.strip().split("\t")[9:]
            for name in sample_names:
                counts[name] = {"HO": 0, "HET": 0}
        else:
            sample_info = line.strip().split("\t")[9:]
            for k in range(len(sample_info)):
                # GT is the first colon-separated field; alleles are separated by "/" or "|"
                alleles = re.split(r"[/|]", sample_info[k].split(":")[0])
                if "." in alleles:
                    continue  # skip missing genotypes
                if len(set(alleles)) == 1:
                    # Note: homozygous-reference calls (0/0) are counted as HO as well
                    counts[sample_names[k]]["HO"] += 1
                else:
                    counts[sample_names[k]]["HET"] += 1
for name, info in counts.items():
    print(name + " HO_NUM: " + str(info["HO"]) + " HET_NUM: " + str(info["HET"]) + "\n")
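Because a trio VCF usually contains sites where one member is homozygous for the reference allele, it may be useful to count those separately from homozygous-variant calls. The sketch below is one possible refinement and not the original answer's method; it assumes GT is the first colon-separated field of each sample column, and the sample names and records are invented for the demo.

```python
import re

def classify_gt(sample_field):
    """Classify one sample column as 'missing', 'hom_ref', 'hom_alt' or 'het'."""
    gt = sample_field.split(":")[0]
    alleles = re.split(r"[/|]", gt)   # unphased "/" or phased "|"
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"

def count_genotypes(sample_names, data_lines):
    """Tally genotype classes per sample from tab-separated VCF data lines."""
    counts = {name: {"hom_ref": 0, "hom_alt": 0, "het": 0, "missing": 0}
              for name in sample_names}
    for line in data_lines:
        fields = line.rstrip("\n").split("\t")[9:]
        for name, field in zip(sample_names, fields):
            counts[name][classify_gt(field)] += 1
    return counts

# Hypothetical sample names and made-up records for illustration
names = ["father", "mother", "child"]
demo_lines = [
    "chr17\t7661779\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/1:30\t0/0:28\t0/1:31",
    "chr17\t7670000\t.\tG\tA\t60\tPASS\t.\tGT:DP\t1/1:25\t0|1:27\t1|1:33",
    "chr17\t7680000\t.\tA\tC\t40\tPASS\t.\tGT:DP\t./.:0\t1/2:22\t0/1:29",
]
for name, c in sorted(count_genotypes(names, demo_lines).items()):
    print(name, c)
```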
Method 2:
# One command per sample column (10, 11, 12) of the vcftools output TP53.txt.recode.vcf
grep -v '^#' TP53.txt.recode.vcf | cut -f10 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'
grep -v '^#' TP53.txt.recode.vcf | cut -f11 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'
grep -v '^#' TP53.txt.recode.vcf | cut -f12 | awk -F ":" '{print $1}' | awk -F "/" 'BEGIN{HO =0;HET = 0}{if($1==$2){HO++}else{HET++}}END{print "HO:", HO, "HET:", HET}'