构建蛋白质数据库:
makeblastdb -in spinach.fa -parse_seqids -hash_index -out spinach_DB -dbtype prot
构建核酸数据库:
makeblastdb -in spinach.fa -parse_seqids -hash_index -out spinach_DB -dbtype nucl #一般带上-parse_seqids -hash_index
blastp
-query 后接查询序列的文件名称;
-db 后接格式化好的数据库名称;
-out 后接要输出的文件名称及格式
blastp -query spinach_1.fa -db spinach_DB -out spinach_blast
个人在使用blastn的过程中总结了一些自认为常用的参数,总结如下:
blastn -query input_file -db database_name -out output_file -evalue evalue -max_target_seqs num_sequences -num_threads int_value -outfmt format format_string
blastn -query input_file -db database_name -out output_file -evalue evalue -max_target_seqs num_sequences -num_threads int_value -outfmt format "7 qacc sacc evalue length pident"
blastn -task blastn-short -query input_file -out output_file -db database_name -outfmt 6(or 7) -evalue 0.01 -num_threads 15 ##序列比对
-db: 指定blast搜索用的数据库,详见上篇文章
-query:用来查询的输入序列,fasta格式
-out:输出结果文件
-evalue: 设置e值cutoff
-max_target_seqs:设置最多的目标序列匹配数(以前我都用-b 5 -v 5,理解不对请指教)
-num_threads:指定多少个cpu运行任务(依赖于你的系统,同于以前的-a参数)
-outfmt format "7 qacc sacc evalue length pident" :这个是新BLAST+中最拉风的功能了,直接控制输出格式,不用再用parser啦, 7表示带注释行的tab格式的输出,可以自定义要输出哪些内容,用空格分格跟在7的后面,并把所有的输出控制用双引号括起来,其中qacc查询序列的acc,sacc表示目标序列的acc,evalue即是e值,length即是匹配的长度,pident即是序列相同的百分比,其他可用的特征(红色字体)如下:
结果文件简写 含义
qaccver 查询的AC版本(与此类似的还有qseqid,qgi,qacc,与序列命名有关)
saccver 目标的AC版本(于此类似的还有sseqid,sallseqid,sgi,sacc,sallacc,也是序列命名相关)
pident 完全匹配百分比 (响应的nident则是匹配数)
length 联配长度(另外slen表示查询序列总长度,qlen表示目标序列总长度)
mismatch 错配数目
gapopen gap的数目
qstart 查询序列起始
qend 查询序列结束
sstart 目标序列起始
send 目标序列结束
evalue 期望值
bitscore Bit得分
score 原始得分
AC: accession
*** Formatting options
-outfmt
alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1
10 = Comma-separated values
Options 6, 7, and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
sallacc means All subject accessions
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
When not provided, the default value is:
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'