2021-04-24 分析Reads比对基因组后5和3端的序列情况

目的:为了分析基因转录后5‘端和3’端的修饰或添加序列的情况。采用的斑马鱼线粒体基因组为标准序列

  1. 将去掉接头的序列文件fq.gz 变为fasta文件格式
    2.由于fasta文件过大,可以切割为多个文件
  2. 下载blast软件,构建标准数据库
    4.对多个文件进行本地blast(构建的标准数据库),生成对应的多个excel表
    5.运行python程序对多个excel表进行数据处理分析,并添加5和3端序列

1.将去掉接头的序列文件fq.gz 变为fasta文件格式

import time
start = time.time()
import glob
import gzip
path = r'K:\test_clean_data\MCK_FKDL210000063-1a_1.clean.fq.gz'
for i in glob.glob(path):
    with gzip.open(i,'rt') as fp:
        output_fasta = open("%s.fa"%i[:-9],'w')
        i = 0
        for line in fp:
            i += 1
            if i % 4 == 1:
                line_new = line[1:]
                output_fasta.write('>'+line_new)
            elif i % 4 == 2:
                output_fasta.write(line)
        output_fasta.close()
end = time.time()
print("used %s s" % str(end - start))
image.png

2. fasta文件切割为多个文件

2.1 将一个文件按每多少行切割成若干个文件

首先判断文件多少行数

wc -l MCK_FKDL210000063-1a_1.cl.fa

6521488 MCK_FKDL210000063-1a_1.cl.fa

MCK_FKDL210000063-1a_1.cl.fa 该文件有600多万行,因此选择1000000为一个文件

$ split -l 1000000 MCK_FKDL210000063-1a_1.cl.fa ./SPLIT/SPLIT.fa -d -a 4

1000000 按1000000行大小分割,可根据需要需要修改
MCK_FKDL210000063-1a_1.cl.fa 需要切割的大文件
./SPLIT/SPLIT.fa 保存文件的位置以及文件名
-d 后缀位数字 如果不加则是字母形式的后缀
-a -4 后缀数字为4位数

批量操作, ./SPLIT/SPLIT.fa 后面不能接文件,应改为文件夹
ls
*.fa  ## 没有结果
ls *.fa  ### 输出路径下所有的文件数
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/ -d -a 4;done
###在单个文件夹下生成
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/ -d -a 4;done
###在不同文件夹下生成
ls *.fa|while read id ; do mkdir ./SPLIT/$id ;done
###先自动生成子文件夹

ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/ -d -a 4;done
ls *.fa|while read id ; do split -l 1000000 $id ./SPLIT/$id/$id -d -a 4;done
##用这个更好
ls *.fa|while read id; do blastn -task blastn-short -query $id -out 
./zebrafish_mito_blastn_excels/$id.xls -db ./zebrafish_mito_blast_databases/zebra_mito_blast -outfmt 7 -evalue 1e-5; done

2.2将一个文件按每多少kb切割成若干个文件

$ split -b 300k hibernate.log bbb

2.3 合并 -- cat命令

$ cat part_* > merge_file.txt

例子

split -l 1000000 MCK_FKDL210000063-1a_1.cl.fa ./SPLIT/SPLIT.fa -d -a 4


图片1.png

3.下载blast软件,构建标准数据库

3.1.本地安装blast软件

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

mkdir BLAST
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-x64-linux.tar.gz
tar  -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
cd ~ 
vi ~/.bashrc
小写i进入编辑模式,或者按两次i。
按ESC,保存退出:wq ,不保存退出:q!

#写入PATH
source ~/.bashrc
echo $PATH

3.2. 构建数据库

首先去ncbi下载斑马鱼线粒体基因组fasta文件,命名为zebra_mito.fasta

$ makeblastdb -in zebra_mito.fasta -dbtype nucl -parse_seqids -out zebra_mito

makeblastdb -in input_file -dbtype molecule_type -title database_title -parse_seqids -out database_name -logfile File_Name
-in 后接输入文件,你要格式化的fasta序列
-dbtype 后接序列类型,nucl为核酸,prot为蛋白
-parse_seqids 推荐加上,
-out 后接数据库名
-logfile 日志文件,如果没有默认输出到屏幕
执行上述代码后将生成后缀为nin,nhr,nsq,nsi,nsd,nog等9个文件,至此,自定义搜索文件生成完成

图片1.png

4.对多个分割文件进行本地blast(构建的标准数据库),生成对应的多个excel表

4.1 blastn 使用库文件进行搜索了,搜索代码如下:

blastn -task blastn-short -query SPLIT.fa0000 -out seq0000.xls -db ./mito_dna_data/zebra_mito  -outfmt 7 -evalue 1e-5 

也可以是其他代码

$ blastn -query part.fasta -out seq.blast -db cox  -outfmt 0
blastn -query part.fasta -out seq.blast -db cox  -outfmt 7
$ blastp -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8
$ blastn -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8
$ blastx -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -num_descriptions 10 -num_threads 8

-query: 输入文件路径及文件名
-out:输出文件路径及文件名
-db:格式化了的数据库路径及数据库名
-outfmt:输出文件格式,总共有12种格式,6是tabular格式对应之前BLAST的m8格式
-evalue:设置输出结果的e-value值, 越少越严格
-num_alignments 显示比对数Default = 250
-num_descriptions:单行描述的最大数目 default=50
-num_threads:线程
重点是-outfmt 6,也就是之前版本的m 8格式 不同数字代表输出格式不同
-outfmt
alignment view options: #####常用0,6,7
0 = Pairwise, 简单数据输出
1 = Query-anchored showing identities, 可视化图形
2 = Query-anchored no identities, 图形化比配,与2差不多
3 = Flat query-anchored showing identities, 可视化图形
4 = Flat query-anchored no identities, 可视化图形
5 = BLAST XML, 输出为XML,不建议用
6 = Tabular, 输出简单明了
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1), 不懂
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
17 = Sequence Alignment/Map (SAM),
18 = Organism Report

6的输出结果
图片1.png

结果中从左到右每一列的意义分别是:

[00] Query id
[01] Subject id
[02] % identity
[03] alignment length
[04] mismatches
[05] gap openings
[06] q. start
[07] q. end
[08] s. start
[09] s. end
[10] e-value
[11] bit score

4.2 极短序列比对

$ blastn -task blastn-short -query part.fasta -out seq.blast -db cox -outfmt 7 -evalue 1e-5
对于特别短的序列例如20bp(<30)左右的序列,在查找的时候我们需要添加三个参数,如下代表所示:

还是输出为txt格式,读取速度比excel快,后面的程序进行更换

-task blastn-short
 -word_size 4 \
 -evalue 1 \

-task参数指定为短序列的匹配算法
-word_size该参数只能指定4以上的值,这里写了最小的4,指定用于匹配算法的最短序列
-evalue 该值越小,得到的结果将更加严苛,这里设定为1即可。

$ blastn -help 可以获取帮助

4.3 批量完成程序

为了方便后续对excel的操作,生成格式改为.xlsx,,###好像还是不行,只能改为xls

for i in `seq 0 6`;do blastn -task blastn-short -query SPLIT.fa000${i} -out seq000${i}.xls -db ./mito_dna_data/zebra_mito  -outfmt 7 -evalue 1e-5;done
图片1.png

图片2.png

同时也可以生成可视化文件作为对比,选择1 作为输出格式

for i in `seq 0 6`;do blastn -task blastn-short -query SPLIT.fa000${i} -out seq000${i}.blast -db ./mito_dna_data/zebra_mito  -outfmt 1 -evalue 1e-5;done
图片1.png
图片3.png
图片1.png

5. 对多个excel表进行python分析

首先需要将xls格式的表通过另存为改为xlsx格式,否则读取报错

变为xlsx格式

import os
import os.path
import win32com.client as win32

## 根目录
rootdir = u'K:\\test_clean_data\seq3'
# 三个参数:父目录;所有文件夹名(不含路径);所有文件名
for parent, dirnames, filenames in os.walk(rootdir):
    for fn in filenames:
        filedir = os.path.join(parent, fn)
        print(filedir)

        excel = win32.gencache.EnsureDispatch('Excel.Application')
        wb = excel.Workbooks.Open(filedir)
        # xlsx: FileFormat=51
        # xls:  FileFormat=56,
        # 后缀名的大小写不通配,需按实际修改:xls,或XLS
        wb.SaveAs(filedir.replace('xls', 'xlsx'), FileFormat=51)  
        wb.Close()                                 
        excel.Application.Quit()

重新生成xlsx文件。接着对xlsx文件进行处理

from xlrd import open_workbook

import requests
import re
import json
import time
import os
import xlwt
import xlwt
import xlrd
path = r'K:\test_clean_data\seq'

path1 = r'K:\test_clean_data\MCK_FKDL210000063-1a_1.cl.fa'

seq_id=[]
seq=[]
with open(path1, 'rb') as f:

    for line in f:
        Line = line.decode('utf-8', 'replace')
        if Line.startswith('>'):
           Line1= Line.split('1:N')[0]
           Line2=Line1[1:].strip()
           seq_id.append(Line2)
           #print(Line2)
        else:
            seq.append(Line)
#seq_dict=dict(zip(seq_id,seq))

try:
 for root, dirs, files in os.walk(path):
    style = xlwt.XFStyle()
    font = xlwt.Font()
    font.name = 'SimSun'
    style.font = font
    w = xlwt.Workbook(encoding='utf-8')
    # 添加个sheet
    ws = w.add_sheet('sheet 1', cell_overwrite_ok=True)
    # 当前写入表格到第 row行

    ws.write(0, 0, 'ID')
    ws.write(0, 1, 'refer_ID0')
    ws.write(0, 2, 'identity')
    ws.write(0, 3, 'length')
    ws.write(0, 4, 'mismatch')
    ws.write(0, 5, 'gaps')
    ws.write(0, 6, 'q. start')
    ws.write(0, 7, 'q. end')
    ws.write(0, 8, 's. start')
    ws.write(0, 9, 's. end')
    ws.write(0, 10, 'evalue')
    ws.write(0, 11, 'score')
    ws.write(0, 12, '5-seq')
    ws.write(0, 13, '3-seq')



###把数据汇入一个表中
    for file in files:####可以获取所有文件的路径
        path = os.path.join(root, file)####单个文件的绝对路径
        print(path)
        workbook = xlrd.open_workbook(path)
        sheet2 = workbook.sheet_by_index(0)
        print(sheet2 )
        cols_0 = sheet2.col_values(0)
        cols_1 = sheet2.col_values(1)
        cols_2=sheet2.col_values(2)
        cols_3 = sheet2.col_values(3)
        cols_4 = sheet2.col_values(4)
        cols_5 = sheet2.col_values(5)
        cols_8 = sheet2.col_values(8)
        cols_9 = sheet2.col_values(9)
        cols_10 = sheet2.col_values(10)
        cols_11 = sheet2.col_values(11)
        cols_start=sheet2.col_values(6)
        cols_end = sheet2.col_values(7)
        row = 1
        i=0

        for col in  cols_1:

           if col == 'NC_002333.2':
               #print(col)
               #ws.write(row,i,col)
               print(i+1)
               if cols_0[i].strip() in seq_id:
                   ws.write(row, 0, cols_0[i])
                   ws.write(row, 1, cols_1[i])
                   ws.write(row, 2, cols_2[i])
                   ws.write(row, 3, cols_3[i])
                   ws.write(row, 4, cols_4[i])
                   ws.write(row, 5, cols_5[i])
                   ws.write(row, 6, cols_start[i])
                   ws.write(row, 7, cols_end[i])
                   ws.write(row, 8, cols_8[i])
                   ws.write(row, 9, cols_9[i])
                   ws.write(row, 10, cols_10[i])
                   ws.write(row, 11, cols_11[i])


                   index=seq_id.index(cols_0[i])

                   start=int(cols_start[i])

                   print(start)
                   end=int(cols_end[i])
                   print(end)

                   tagart_seq=seq[index]
                   short_seq=tagart_seq[start-1:end]
                   short_seq_5=tagart_seq.split(short_seq)[0]
                   short_seq_3 = tagart_seq.split(short_seq)[1]

                   #print(index)
                   print(seq[index])
                   print(short_seq_5)
                   print(len(tagart_seq))
                   print(len(short_seq))
                   print(len(short_seq_5))
                   print(short_seq_3)
                   print(len(short_seq_3))
                   ws.write(row, 12, short_seq_5)
                   ws.write(row, 13, short_seq_3)
                   row=row+1


                   #if cols_0[i] in total_seq[m]:
                     #print(i+1)
                     #print(total_seq[m+1])
                     #print(total_seq[m])
                     ##ws.write(row,i,cols_0[i+1])

           i=i+1
        w.save(file+'zebra_mito.xls')
except BaseException as e:  ##设置避免出差错,运行不了
    print('Error:', e)

生成的结果如下


image.png

后续就可以按照位置排序,把polya或polyt挑出来

##################################################################################################################

如果还想分析基因之间的相交情况

from xlrd import open_workbook

import requests
import re
import json
import time
import os
import xlwt
import xlwt
import xlrd3

path_excel = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\zebrafish_mito_blastn_excels'
path2 = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\zebrafish_mito_gene_id\zebrafish_mito_gene_id.xlsx'
path_fasta = r'I:\MTO1_mito_rna_seq\MTO1_whole_adult_mito_mRNA_seq\clean_data_fq_to_fa'
file_names = []
for root, dirs, files in os.walk(path_excel):
    file_names.append(dirs)

print(file_names[0])

for name in file_names[0]:
    path_xls = os.path.join(path_excel,name)

    path_fa = os.path.join(path_fasta,name)

    ###########################################################
    print(path_fa)
    print(path_xls)
    # 获取基因对应的序列区间。先构建列表
    mito_plus_gene_position = []
    mito_minus_gene_position = []
    mito_plus_gene = []
    mito_minus_gene_direction = []
    workbook = xlrd3.open_workbook(path2)
    sheet1 = workbook.sheet_by_index(0)

    col_1 = sheet1.col_values(0)
    col_2 = sheet1.col_values(1)
    col_3 = sheet1.col_values(2)
    col_4 = sheet1.col_values(3)

    for x in col_2:
        mito_plus_gene_position.append(int(x))
    for x in col_1:
        mito_plus_gene.append(x)
    for x in col_3:
        if x is not None:
            mito_minus_gene_position.append(int(x))

    for x in col_4:
        if x is not None:
            mito_minus_gene_direction.append(x)
    print(mito_plus_gene)
    print(mito_plus_gene_position)
    print(mito_minus_gene_position)
    print(mito_minus_gene_direction)


    #################################################
    # 构建测序文件所有测序Reads的集合,同时建立相对应的ID全集合
    seq_id = []
    seq = []
    path_total=[]
    for root, dirs, files in os.walk(path_fa):
      for file in files:  ####可以获取所有文件的路径

        path_11 = os.path.join(root, file)  ####单个文件的绝对路径#######
        print(path_11)
        path_total.append(path_11)

        with open(path_11, 'rb') as f:
         for line in f:
            #print(line)
            Line = line.decode('utf-8', 'replace')
            #print(Line)
            if Line.startswith('>'):
                Line1 = Line.split('1:N')[0]
                Line2 = Line1[1:].strip()
                seq_id.append(Line2)
                #print(Line2)
            else:
                seq.append(Line)
        f.close()
    print(len(seq_id))
    #print(seq[1:100])
    # seq_dict=dict(zip(seq_id,seq))
    ######################################################################################################################
    # 遍历所有路径下的Excel表,
    if len(seq_id)>0:
     try:
        for root, dirs, files in os.walk(path_xls):
            style = xlwt.XFStyle()
            font = xlwt.Font()
            font.name = 'SimSun'
            style.font = font
            w = xlwt.Workbook(encoding='utf-8')
            # 添加个sheet
            ws = w.add_sheet('sheet 1', cell_overwrite_ok=True)
            # 当前写入表格到第 row行

            ws.write(0, 0, 'ID')
            ws.write(0, 1, 'refer_ID0')
            ws.write(0, 2, 'identity')
            ws.write(0, 3, 'length')
            ws.write(0, 4, 'mismatch')
            ws.write(0, 5, 'gaps')
            ws.write(0, 6, 'q. start')
            ws.write(0, 7, 'q. end')
            ws.write(0, 8, 's. start')
            ws.write(0, 9, 's. end')
            ws.write(0, 10, 'evalue')
            ws.write(0, 11, 'score')
            ws.write(0, 12, '5-seq')
            ws.write(0, 13, '3-seq')

            ###把数据汇入一个表中

            #print(files)
            for file in files:  ####可以获取所有文件的路径
                row = 1
                file_name=file[:-4]
                print(file)
                print(file_name)

                path = os.path.join(root, file)  ####单个文件的绝对路径
                print(path)
                with open(path, 'r') as file_to_read:
                    for line in file_to_read.readlines():
                        if 'MT' in line:
                            b = line.split()
                            if int(b[3]) <= 146:             ##########针对750万reads数,设置147大小仍然有点大,会有几个文件写入excel表写不进去
                                cols_0 = b[0].strip()
                                cols_1 = b[1].strip()
                                cols_2 = b[2].strip()
                                cols_3 = b[3].strip()
                                cols_4 = b[4].strip()
                                cols_5 = b[5].strip()
                                cols_8 = b[8].strip()
                                cols_9 = b[9].strip()
                                cols_10 = b[10].strip()
                                cols_11 = b[11].strip()
                                cols_start = b[6].strip()
                                cols_end = b[7].strip()
                                if cols_0 in seq_id:
                                    ws.write(row, 0, cols_0)
                                    ws.write(row, 1, cols_1)
                                    ws.write(row, 2, cols_2)
                                    ws.write(row, 3, cols_3)
                                    ws.write(row, 4, cols_4)
                                    ws.write(row, 5, cols_5)
                                    ws.write(row, 6, cols_start)
                                    ws.write(row, 7, cols_end)
                                    ws.write(row, 8, cols_8)
                                    ws.write(row, 9, cols_9)
                                    ws.write(row, 10, cols_10)
                                    ws.write(row, 11, cols_11)
                                    index = seq_id.index(cols_0)
                                    # print(index)
                                    tagart_seq = seq[index]
                                    start = int(cols_start)
                                    end = int(cols_end)
                                    short_seq = tagart_seq[start - 1:end]  ########匹配到的序列, 进行截取
                                    short_seq_5 = tagart_seq.split(short_seq)[0]
                                    short_seq_3 = tagart_seq.split(short_seq)[1]
                                    ws.write(row, 12, short_seq_5)
                                    ws.write(row, 13, short_seq_3)
                                    #########################################################################
                                    q_start = int(cols_8)

                                    q_end = int(cols_9)
                                    #print(q_start, q_end)
                                    if q_start > q_end:

                                        for pos in range(len(mito_plus_gene_position)):
                                            # print(len(mito_plus_gene_position))

                                            if mito_plus_gene_position[pos] <= q_start <= mito_minus_gene_position[pos]:
                                                # print(pos)
                                                gene_name_1 = mito_plus_gene[pos]
                                                #print(gene_name_1)
                                                ws.write(row, 14, gene_name_1)
                                                break

                                        for pos in range(len(mito_plus_gene_position)):
                                            if mito_plus_gene_position[pos] <= q_end <= mito_minus_gene_position[pos]:
                                                gene_name_2 = mito_plus_gene[pos]
                                                ws.write(row, 15, gene_name_2)
                                                ws.write(row, 16, mito_minus_gene_direction[pos])
                                                break

                                    if q_start < q_end:
                                        for pos in range(len(mito_plus_gene_position)):
                                            # print(len(mito_plus_gene_position))

                                            if mito_plus_gene_position[pos] < q_start < mito_minus_gene_position[pos]:
                                                # print(pos)
                                                gene_name_1 = mito_plus_gene[pos]
                                                # print(gene_name_1)
                                                ws.write(row, 14, gene_name_1)
                                                break

                                        for pos in range(len(mito_plus_gene_position)):
                                            if mito_plus_gene_position[pos] < q_end < mito_minus_gene_position[pos]:
                                                gene_name_2 = mito_plus_gene[pos]
                                                ws.write(row, 15, gene_name_2)
                                                ws.write(row, 16, mito_minus_gene_direction[pos])
                                                break
                                    row = row + 1
                print(row)

                w.save(file_name + 'zebra_mito.xls')
     except BaseException as e:  ##设置避免出差错,运行不了
          print('Error:', e)
    else:
        continue









你可能感兴趣的:(2021-04-24 分析Reads比对基因组后5和3端的序列情况)