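I. Introduction
BLAST stands for Basic Local Alignment Search Tool, that is, a search tool built on a local alignment algorithm.
BLAST works in two stages. First, the target sequences are built into a database (called the database; each sequence in it is called a subject). Then the sequences to be looked up (called queries) are searched against the database: each query is compared with every subject by pairwise alignment, which yields the full set of alignment results.
BLAST is an integrated package. By calling different alignment modules, it supports five kinds of sequence comparison: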
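blastp: protein query against a protein database; compares protein sequences directly.
blastx: nucleotide query against a protein database; the nucleotide query is first translated into protein (six possible translations, one per reading frame) and then compared with the protein database.
blastn: nucleotide query against a nucleotide database; compares nucleotide sequences directly.
tblastn: protein query against a nucleotide database; the nucleotide sequences in the database are translated into protein before the comparison.
tblastx: nucleotide query against a nucleotide database at the protein level; both the database and the query are translated into protein, and the protein sequences are then compared.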
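The mapping above, from query type and database type to program, can be sketched as a small shell function. The name `choose_blast_program` and its arguments are hypothetical helpers made up for this illustration; they are not part of BLAST+.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BLAST+): pick a BLAST program name from
# the query type, the database type, and whether a nucleotide-vs-nucleotide
# search should be carried out at the protein level.
# Usage: choose_blast_program <nucl|prot> <nucl|prot> [translated]
choose_blast_program() {
    local query_type=$1 db_type=$2 mode=${3:-direct}
    case "$query_type:$db_type:$mode" in
        prot:prot:*)          echo blastp ;;
        nucl:prot:*)          echo blastx ;;
        prot:nucl:*)          echo tblastn ;;
        nucl:nucl:translated) echo tblastx ;;
        nucl:nucl:*)          echo blastn ;;
        *) echo "unknown combination: $query_type vs $db_type" >&2; return 1 ;;
    esac
}

choose_blast_program nucl prot              # prints: blastx
choose_blast_program nucl nucl translated   # prints: tblastx
```

Only the nucleotide-vs-nucleotide case needs the third argument, because it is the only pairing served by two programs (blastn and tblastx).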
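II. Download and Installation
Download address: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Download on Linux (wget needs the full file name, not just the directory; LATEST only holds the current release, so adjust the version to the one listed there):
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.7.1+-x64-linux.tar.gz
Then extract the archive:
tar zxvf ncbi-blast-2.7.1+-x64-linux.tar.gz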
Move it to a folder you commonly use and rename it to blast (the archive unpacks into a directory named ncbi-blast-2.7.1+, without the platform suffix):
mv ncbi-blast-2.7.1+ blast
Notes: 1. Check that the downloaded build is the one for Linux.
2. tar needs its options (zxvf) when extracting.
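Because the archive name and the unpacked directory name differ, it can help to derive one from the other instead of typing both. The sketch below assumes the naming pattern seen in this tutorial (`ncbi-blast-<version>+-x64-linux.tar.gz` unpacking into `ncbi-blast-<version>+`); `extracted_dir_name` is a hypothetical helper, not a BLAST+ tool.

```shell
#!/usr/bin/env bash
# Assumption: BLAST+ Linux tarballs are named
#   ncbi-blast-<version>+-x64-linux.tar.gz
# and unpack into a directory named ncbi-blast-<version>+.
# extracted_dir_name is a hypothetical helper that derives that name.
extracted_dir_name() {
    local tarball
    tarball=$(basename "$1")
    # Strip the platform suffix and the archive extension.
    echo "${tarball%-x64-linux.tar.gz}"
}

extracted_dir_name ncbi-blast-2.7.1+-x64-linux.tar.gz   # prints: ncbi-blast-2.7.1+
```

With this helper, the extract-and-rename step could be written as `tar zxvf "$t" && mv "$(extracted_dir_name "$t")" blast`, where `$t` holds the tarball name.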
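Make the programs usable from any path (replace the path below with the location of your own blast/bin directory):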
echo "export PATH=/db/home/shenwei/local/app/blast/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
Go into the bin directory and run ls; file names shown in green are executable and ready to run.
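Besides looking at the colors in ls, the PATH setup can be checked from the shell itself. A minimal sketch follows; `missing_tools` is a hypothetical helper that relies only on the standard `command -v` builtin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every given tool that cannot be
# found on PATH, one per line. Prints nothing when all tools are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the BLAST+ programs used in this tutorial.
missing=$(missing_tools makeblastdb blastn blastx)
if [ -z "$missing" ]; then
    echo "BLAST+ is on PATH"
else
    echo "not found on PATH:" $missing
fi
```

If a program is reported as missing right after editing ~/.bashrc, opening a new shell or running `source ~/.bashrc` again is the first thing to try.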
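Running the programs
Parameter settings (the help output below is from blastx; run blastn -help to list the options for blastn)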
$ blastx -help
USAGE
blastx [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
[-soft_masking soft_masking] [-matrix matrix_name]
[-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
[-strand strand] [-parse_deflines] [-query_gencode int_value]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-comp_based_stats compo] [-use_sw_tback] [-version]
DESCRIPTION
Translated Query-Protein Subject BLAST 2.7.1+
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore all other parameters
-help
Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
-version
Print version number; ignore other arguments
*** Input query options
-query
Input file name
Default = `-'
-query_loc
Location on the query sequence in 1-based offsets (Format: start-stop)
-strand
Query strand(s) to search against database/subject
Default = `both'
-query_gencode
Genetic code to use to translate query (see user manual for details)
Default = `1'
*** General search options
-task
Task to execute
Default = `blastx'
-db
BLAST database name
* Incompatible with: subject, subject_loc
-out
Output file name
Default = `-'
-evalue
Expectation value (E) threshold for saving hits
Default = `10'
-word_size <Integer, >=2>
Word size for wordfinder algorithm
-gapopen
Cost to open a gap
-gapextend
Cost to extend a gap
-max_intron_length <Integer, >=0>
Length of the largest intron allowed in a translated nucleotide sequence
when linking multiple distinct alignments
Default = `0'
-matrix
Scoring matrix name (normally BLOSUM62)
-threshold <Real, >=0>
Minimum word score such that the word is added to the BLAST lookup table
-comp_based_stats
Use composition-based statistics:
D or d: default (equivalent to 2 )
0 or F or f: No composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t : Composition-based score adjustment as in Bioinformatics
21:902-911,
2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911,
2005, unconditionally
Default = `2'
*** BLAST-2-Sequences options
-subject
Subject sequence(s) to search
* Incompatible with: db, gilist, seqidlist, negative_gilist,
negative_seqidlist, db_soft_mask, db_hard_mask
-subject_loc
Location on the subject sequence in 1-based offsets (Format: start-stop)
* Incompatible with: db, gilist, seqidlist, negative_gilist,
negative_seqidlist, db_soft_mask, db_hard_mask, remote
*** Formatting options
-outfmt
alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
18 = Organism Report
Options 6, 7 and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
When not provided, the default value is:
'qaccver saccver pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'
-show_gis
Show NCBI GIs in deflines?
-num_descriptions <Integer, >=0>
Number of database sequences to show one-line descriptions for
Not applicable for outfmt > 4
Default = `500'
* Incompatible with: max_target_seqs
-num_alignments <Integer, >=0>
Number of database sequences to show alignments for
Default = `250'
* Incompatible with: max_target_seqs
-line_length <Integer, >=1>
Line length for formatting alignments
Not applicable for outfmt > 4
Default = `60'
-html
Produce HTML output?
*** Query filtering options
-seg
Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
'no' to disable)
Default = `12 2.2 2.5'
-soft_masking
Apply filtering locations as soft masks
Default = `false'
-lcase_masking
Use lower case filtering in query and subject sequence(s)?
*** Restrict search or results
-gilist
Restrict search of database to list of GI's
* Incompatible with: negative_gilist, seqidlist, negative_seqidlist,
remote, subject, subject_loc
-seqidlist
Restrict search of database to list of SeqId's
* Incompatible with: gilist, negative_gilist, negative_seqidlist, remote,
subject, subject_loc
-negative_gilist
Restrict search of database to everything except the listed GIs
* Incompatible with: gilist, seqidlist, remote, subject, subject_loc
-negative_seqidlist
Restrict search of database to everything except the listed SeqIDs
* Incompatible with: gilist, seqidlist, remote, subject, subject_loc
-entrez_query
Restrict search with the given Entrez query
* Requires: remote
-db_soft_mask
Filtering algorithm ID to apply to the BLAST database as soft masking
* Incompatible with: db_hard_mask, subject, subject_loc
-db_hard_mask
Filtering algorithm ID to apply to the BLAST database as hard masking
* Incompatible with: db_soft_mask, subject, subject_loc
-qcov_hsp_perc
Percent query coverage per hsp
-max_hsps <Integer, >=1>
Set maximum number of HSPs per subject sequence to save for each query
-culling_limit <Integer, >=0>
If the query range of a hit is enveloped by that of at least this many
higher-scoring hits, delete the hit
* Incompatible with: best_hit_overhang, best_hit_score_edge
-best_hit_overhang <Real, (>0 and <0.5)>
Best Hit algorithm overhang value (recommended value: 0.1)
* Incompatible with: culling_limit
-best_hit_score_edge <Real, (>0 and <0.5)>
Best Hit algorithm score edge value (recommended value: 0.1)
* Incompatible with: culling_limit
-max_target_seqs <Integer, >=1>
Maximum number of aligned sequences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
*** Statistical options
-dbsize
Effective length of the database
-searchsp <Int8, >=0>
Effective length of the search space
-sum_stats
Use sum statistics
*** Search strategy options
-import_search_strategy
Search strategy to use
* Incompatible with: export_search_strategy
-export_search_strategy
File name to record the search strategy used
* Incompatible with: import_search_strategy
*** Extension options
-xdrop_ungap
X-dropoff value (in bits) for ungapped extensions
-xdrop_gap
X-dropoff value (in bits) for preliminary gapped extensions
-xdrop_gap_final
X-dropoff value (in bits) for final gapped alignment
-window_size <Integer, >=0>
Multiple hits window size, use 0 to specify 1-hit algorithm
-ungapped
Perform ungapped alignment only?
*** Miscellaneous options
-parse_deflines
Should the query and subject defline(s) be parsed?
-num_threads <Integer, (>=1 and =<24)>
Number of threads (CPUs) to use in the BLAST search
Default = `1'
* Incompatible with: remote
-remote
Execute search remotely?
* Incompatible with: gilist, seqidlist, negative_gilist,
negative_seqidlist, subject_loc, num_threads
-use_sw_tback
Compute locally optimal Smith-Waterman alignments?
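Setting the E-value
For short query sequences, raise the E-value threshold somewhat, otherwise the target sequence may not be found; conversely, for long query sequences the threshold can be lowered.
Whether the search is at the DNA level or the protein level, an E-value threshold of 1 is usually adequate.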
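An E-value cutoff can also be applied after the search, on tabular output (-outfmt 6), where the default 'std' layout puts the E-value in column 11. The sketch below is an assumption-laden illustration: `filter_by_evalue` and `example_hits` are hypothetical helpers, and the two hit rows are made up, not real BLAST results.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the hits whose E-value (column 11 of the
# default tabular layout, -outfmt 6) is at or below a cutoff.
# Usage: filter_by_evalue <max_evalue> < hits.tsv
filter_by_evalue() {
    # "+ 0" forces a numeric comparison so that values such as 2e-90
    # are not compared as strings.
    awk -F'\t' -v max="$1" '($11 + 0) <= (max + 0)'
}

# Made-up example rows in the 12-column "std" layout:
# qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore
example_hits() {
    printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t11\t210\t2e-90\t350\n'
    printf 'q1\ts2\t80.0\t30\t6\t0\t5\t34\t40\t69\t0.5\t28.3\n'
}

example_hits | filter_by_evalue 1      # keeps both rows
example_hits | filter_by_evalue 1e-5   # keeps only the q1/s1 row
```

Searching with a permissive threshold and filtering afterwards keeps the weaker hits available for inspection without rerunning the search.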
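Choosing the word size (the length of the initial match)
BLAST splits the query sequence into short words of the given word size and uses them to search the database. The smaller the value, the more hits the search returns, but the more false positives it contains and the heavier the load on the server. So if a search does not return the hits you expect, you can try lowering the word size.
A BLAST run has two steps: first, build a database from the target sequences; second, run the BLAST alignment.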
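To see why a smaller word size produces more candidate matches, the words of a short sequence can be listed directly. This is only an illustration of splitting a query into overlapping words; `list_words` is a hypothetical helper and does not reproduce how BLAST builds or scores its lookup table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every overlapping word of length <word_size>
# in a sequence, one per line. Illustration only; this is not BLAST's
# internal lookup table.
# Usage: list_words <sequence> <word_size>
list_words() {
    awk -v seq="$1" -v w="$2" 'BEGIN {
        for (i = 1; i + w - 1 <= length(seq); i++)
            print substr(seq, i, w)
    }'
}

list_words ACGTACGT 4            # ACGT CGTA GTAC TACG ACGT, one per line
list_words ACGTACGT 4 | wc -l    # 5 words of length 4
list_words ACGTACGT 7 | wc -l    # 2 words of length 7
```

Shorter words are both more numerous and more likely to match a database sequence by chance, which is consistent with the trade-off described above.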
# Build the local database
./makeblastdb -in /home/liuqian/biosoft/blast/bin/new_immuno_VDJ.fasta -dbtype nucl
# Search the database
./blastn -query temp_test_2.fasta -db /home/liuqian/biosoft/blast/bin/new_immuno_VDJ.fasta -out blast_test_result_4.txt -word_size 20
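The two steps can be tried end to end on toy data before pointing them at real files. In the sketch below every file name and sequence is made up for illustration, the options (-in, -dbtype, -out, -query, -db, -outfmt, -evalue, -word_size) are the ones listed in the help output or used above, and the search is skipped when BLAST+ is not on PATH.

```shell
#!/usr/bin/env bash
# Sketch: build a tiny nucleotide database and search it.
# All file names and sequences below are made up for illustration.
workdir=$(mktemp -d)

# Two made-up subject sequences for the database.
cat > "$workdir/subjects.fasta" <<'EOF'
>subject_1
ACGTTGCATGCCGATAGCTAGCTAGGCTAACGTAGCTAGCTAGGATCCGATCGATCGTAGCTAGCATCG
>subject_2
TTGACCGTAGGCTAGCTTAGGCTAAGCTTAGGCTAGGATTCGATCGGATCGATTAGCGCGATATAGCTA
EOF

# One made-up query: a fragment copied from subject_1.
cat > "$workdir/query.fasta" <<'EOF'
>query_1
GCCGATAGCTAGCTAGGCTAACGTAGCTAGCTAGGATCCGATCG
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastn >/dev/null 2>&1; then
    # Step 1: build the database from the subject sequences.
    makeblastdb -in "$workdir/subjects.fasta" -dbtype nucl -out "$workdir/subjects_db"
    # Step 2: search the database and write tabular output (-outfmt 6).
    blastn -query "$workdir/query.fasta" -db "$workdir/subjects_db" \
           -out "$workdir/hits.tsv" -outfmt 6 -evalue 1 -word_size 11
    cat "$workdir/hits.tsv"
else
    echo "BLAST+ not found on PATH; skipping the search" >&2
fi
```

Passing -out to makeblastdb keeps the database files under a separate name from the input FASTA; without it, as in the commands above, the database takes the name of the input file.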
III. Summary
The complete set of commands:
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.2.30+-x64-linux.tar.gz
tar -zxvf ncbi-blast-2.2.30+-x64-linux.tar.gz
echo "export PATH=/db/home/shenwei/local/app/blast/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
mv ncbi-blast-2.2.30+ ~/local/app/ # move the extracted directory
cd ~/local/app/ # go to the local install path
mv ncbi-blast-2.2.30+ blast # rename the directory
# Build the local database (makeblastdb is found via PATH)
makeblastdb -in /home/liuqian/biosoft/blast/bin/new_immuno_VDJ.fasta -dbtype nucl
# Search the database
blastn -query temp_test_2.fasta -db /home/liuqian/biosoft/blast/bin/new_immuno_VDJ.fasta -out blast_test_result_4.txt -word_size 20