BLAST Chinese Manual (6)

Appendices

Created: June 23, 2008; Updated: March 14, 2021.

Conversion from C toolkit applications

The functionality offered by the BLAST+ applications has been organized by program type. The following graph depicts a correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications:

[Figure: correspondence between the NCBI C Toolkit BLAST command line applications and the BLAST+ applications]

The easiest way to get started using the BLAST+ command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using

[Image not recovered: example command lines comparing a C Toolkit BLAST invocation with its legacy_blast.pl form]

The purpose of the legacy_blast.pl PERL script is to help users make the transition from the C Toolkit BLAST command line applications to the BLAST+ applications. This script produces its own documentation by invoking it without any arguments.

The legacy_blast.pl script supports two modes of operation, one in which the C Toolkit BLAST command line invocation is converted and executed on behalf of the user and another which solely displays the BLAST+ application equivalent to what was provided, without executing the command.

The first mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and optionally providing the --path argument after the command line to convert if the installation path for the BLAST+ applications differs from the default (available by invoking the script without arguments). See example in the first section of the Quick start.

The second mode of operation is achieved by specifying the C Toolkit BLAST command line application invocation and appending the --print_only command line option as follows:

./legacy_blast.pl megablast -i query.fsa -d nt -o mb.out --print_only
/opt/ncbi/blast/bin/blastn -query query.fsa -db "nt" -out mb.out
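
The two modes described above can be sketched with a small shell helper that only assembles and prints a legacy_blast.pl command line, so it runs without BLAST+ installed. The function name `legacy_cmd` and its argument layout are illustrative assumptions; only the `--path` and `--print_only` options come from the text.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) a legacy_blast.pl command line.
# $1 = mode: "run" (convert and execute) or "print" (show the equivalent only)
# $2 = BLAST+ installation directory, or "" to rely on the script's default
# remaining arguments = the original C Toolkit invocation
legacy_cmd() {
    mode=$1
    path=$2
    shift 2
    cmd="./legacy_blast.pl $*"
    if [ -n "$path" ]; then
        cmd="$cmd --path $path"
    fi
    if [ "$mode" = "print" ]; then
        cmd="$cmd --print_only"
    fi
    printf '%s\n' "$cmd"
}

legacy_cmd print "" megablast -i query.fsa -d nt -o mb.out
legacy_cmd run /opt/ncbi/blast/bin megablast -i query.fsa -d nt -o mb.out
```

The first call reproduces the `--print_only` invocation shown above; the second shows where `--path` would go when the BLAST+ installation directory differs from the default.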

Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:

Exit Code Meaning
0 Success
1 Error in query sequence(s) or BLAST options
2 Error in BLAST database
3 Error in BLAST engine
4 Out of memory
5 Network error connecting to NCBI to fetch sequence data
6 Error creating output files
255 Unknown error

In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).
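
In a wrapper script, the exit codes in the table can be turned into readable messages. The following is a minimal sketch; the function name `blast_exit_meaning` and the file names in the commented usage are assumptions made for illustration.

```shell
#!/bin/sh
# Map the exit code of a BLAST+ search application to the meaning in the table.
blast_exit_meaning() {
    case "$1" in
        0)   echo "Success" ;;
        1)   echo "Error in query sequence(s) or BLAST options" ;;
        2)   echo "Error in BLAST database" ;;
        3)   echo "Error in BLAST engine" ;;
        4)   echo "Out of memory" ;;
        5)   echo "Network error connecting to NCBI to fetch sequence data" ;;
        6)   echo "Error creating output files" ;;
        255) echo "Unknown error" ;;
        *)   echo "Exit code not listed in the table" ;;
    esac
}

# Typical use after a search (commented out so the sketch runs without BLAST+):
# blastn -query query.fsa -db nt -out results.txt
# blast_exit_meaning "$?"
blast_exit_meaning 2
```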

Options for the command-line applications

This appendix consists of several tables that list option names, types, default values, and a short description of the option. These tables were first published as an appendix to an article in BMC Bioinformatics (BLAST+: architecture and applications). They have been updated for this manual.

Table C1: Options common to all BLAST+ search applications. An option of type “flag” takes no argument, but if present is true. Some options are valid only for a local search (“remote” option not used), others are valid only for a remote search (“remote” option used).

option type default value description and notes
db string none BLAST database name.
query string stdin Query file name.
query_loc string none Location on the query sequence (Format: start-stop).
out string stdout Output file name.
evalue real 10.0 Expect value (E) for saving hits.
subject string none File with subject sequence(s) to search.
subject_loc string none Location on the subject sequence (Format: start-stop).
show_gis flag N/A Show NCBI GIs in report.
num_descriptions integer 500 Show one-line descriptions for this number of database sequences.
num_alignments integer 250 Show alignments for this number of database sequences.
max_target_seqs integer 500 Number of aligned sequences to keep. Use with report formats that do not have separate definition line and alignment sections such as tabular (all outfmt > 4). Not compatible with num_descriptions or num_alignments. Ties are broken by order of sequences in the database.
max_hsps integer none Maximum number of HSPs (alignments) to keep for any single query-subject pair. The HSPs shown will be the best as judged by expect value. This number should be an integer that is one or greater. If this option is not set, BLAST shows all HSPs meeting the expect value criteria. Setting it to one will show only the best HSP for every query-subject pair.
html flag N/A Produce HTML output.
gilist string none Restrict search of database to GIs listed in this file. Local searches only.
negative_gilist string none Restrict search of database to everything except the GIs listed in this file. Local searches only.
entrez_query string none Restrict search with the given Entrez query. Remote searches only.
culling_limit integer none Delete a hit that is enveloped by at least this many higher-scoring hits.
best_hit_overhang real none Best Hit algorithm overhang value (recommended value: 0.1).
best_hit_score_edge real none Best Hit algorithm score edge value (recommended value: 0.1).
dbsize integer none Effective size of the database.
searchsp integer none Effective length of the search space.
import_search_strategy string none Search strategy file to read.
export_search_strategy string none Record search strategy to this file.
parse_deflines flag N/A Parse query and subject bar delimited sequence identifiers (e.g., gi|129295).
num_threads integer 1 Number of threads (CPUs) to use in blast search.
remote flag N/A Execute search on NCBI servers?
outfmt string 0 alignment view options:
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = XML Blast output
6 = tabular
7 = tabular with comment lines
8 = Text ASN.1
9 = Binary ASN.1
10 = Comma-separated values
11 = BLAST archive format (ASN.1)
12 = Seqalign (JSON)
13 = Multiple-file BLAST JSON
14 = Multiple-file BLAST XML2
15 = Single-file BLAST JSON
16 = Single-file BLAST XML2
17 = Sequence Alignment/Map (SAM)
18 = Organism Report
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ‘;’
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
sallacc means All subject accessions
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a ‘/’
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxids means unique Subject Taxonomy ID(s), separated by a ‘;’ (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ‘;’
scomnames means unique Subject Common Name(s), separated by a ‘;’
blastnames means unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a ‘<>’
sstrand means Subject Strand
qcovs means Query Coverage Per Subject (for all HSPs)
qcovhsp means Query Coverage Per HSP
qcovus is a measure of Query Coverage that counts a position in a subject sequence for this measure only once. The second time the position is aligned to the query is not counted towards this measure.
When not provided, the default value is:
‘qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore’, which is equivalent to the keyword ‘std’.
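
Because the default tabular layout has a fixed column order, its output is easy to post-process with standard tools. The sketch below filters tabular hits with awk; the three sample rows are invented for illustration (they are not real BLAST results), and the function name `filter_hits` and the thresholds are assumptions. Output of this shape could come from a command such as `blastn -query query.fsa -db nt -outfmt 6 -out hits.tsv`.

```shell
#!/bin/sh
# Invented sample of default tabular output (-outfmt 6). Columns are:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
sample=$(printf 'q1\ts1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\nq1\ts2\t80.00\t150\t28\t2\t5\t154\t40\t187\t2e-20\t95.3\nq2\ts3\t99.00\t100\t1\t0\t1\t100\t1\t100\t3e-45\t180\n')

# Keep hits with pident >= 90 (column 3) and evalue <= 1e-30 (column 11),
# then print query ID, subject ID, percent identity, and bit score.
filter_hits() {
    awk -F '\t' '$3 + 0 >= 90 && $11 + 0 <= 1e-30 { print $1, $2, $3, $12 }'
}

printf '%s\n' "$sample" | filter_hits
```

If a custom specifier list is used (for example `-outfmt "6 qseqid sseqid evalue"`), the column numbers in the awk program have to be adjusted to match the order of the specifiers.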

blastn: compares a nucleotide sequence against a nucleotide database, directly assessing homology at the nucleotide level.

Table C2: blastn application options. The blastn application searches a nucleotide query against nucleotide subject sequences or a nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. Four different tasks are supported: 1.) “megablast”, for very similar sequences (e.g., sequencing errors), 2.) “dc-megablast”, typically used for inter-species comparisons, 3.) “blastn”, the traditional program used for inter-species comparisons, 4.) “blastn-short”, optimized for sequences less than 30 nucleotides.

option task(s) type default value description and notes
word_size megablast integer 28 Length of initial exact match.
word_size dc-megablast integer 11 Number of matching nucleotides in initial match. dc-megablast allows non-consecutive letters to match.
word_size blastn integer 11 Length of initial exact match.
word_size blastn-short integer 7 Length of initial exact match.
gapopen megablast integer 0 Cost to open a gap. See appendix “BLASTN reward/penalty values”.
gapextend megablast integer none Cost to extend a gap. This default is a function of reward/penalty value. See appendix “BLASTN reward/penalty values”.
gapopen blastn, blastn-short, dc-megablast integer 5 Cost to open a gap. See appendix “BLASTN reward/penalty values”.
gapextend blastn, blastn-short, dc-megablast integer 2 Cost to extend a gap. See appendix “BLASTN reward/penalty values”.
reward megablast integer 1 Reward for a nucleotide match.
penalty megablast integer -2 Penalty for a nucleotide mismatch.
reward blastn, dc-megablast integer 2 Reward for a nucleotide match.
penalty blastn, dc-megablast integer -3 Penalty for a nucleotide mismatch.
reward blastn-short integer 1 Reward for a nucleotide match.
penalty blastn-short integer -3 Penalty for a nucleotide mismatch.
strand all string both Query strand(s) to search against database/subject. Choice of both, minus, or plus.
dust blastn-short string 20 64 1 Filter query sequence with dust.
filtering_db all string none Mask query using the sequences in this database.
window_masker_taxid all integer none Enable WindowMasker filtering using a Taxonomic ID.
window_masker_db all string none Enable WindowMasker filtering using this file.
soft_masking all boolean true Apply filtering locations as soft masks (i.e., only for finding initial matches).
lcase_masking all flag N/A Use lower case filtering in query and subject sequence(s).
db_soft_mask all integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
db_hard_mask all integer none Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search).
perc_identity all integer 0 Percent identity cutoff.
template_type dc-megablast string coding Discontiguous MegaBLAST template type. Allowed values are coding, optimal and coding_and_optimal.
template_length dc-megablast integer 18 Discontiguous MegaBLAST template length.
use_index megablast boolean false Use MegaBLAST database index. Indices may be created with the makembindex application.
index_name megablast string none MegaBLAST database index name.
xdrop_ungap all real 20 Heuristic value (in bits) for ungapped extensions.
xdrop_gap all real 30 Heuristic value (in bits) for preliminary gapped extensions.
xdrop_gap_final all real 100 Heuristic value (in bits) for final gapped alignment.
no_greedy megablast flag N/A Use non-greedy dynamic programming extension.
min_raw_gapped_score all integer none Minimum raw gapped score to keep an alignment in the preliminary gapped and trace-back stages. Normally set based upon expect value.
ungapped all flag N/A Perform ungapped alignment.
window_size dc-megablast integer 40 Multiple hits window size, use 0 to specify 1-hit algorithm.
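
Since several defaults depend on the task, it can help to look them up before overriding options on the command line. The sketch below simply encodes the word_size, reward, and penalty defaults from Table C2; the function name `blastn_defaults` is a hypothetical helper, not part of BLAST+.

```shell
#!/bin/sh
# Print the default word_size, reward, and penalty for a blastn task,
# using the values listed in Table C2.
blastn_defaults() {
    case "$1" in
        megablast)    echo "word_size=28 reward=1 penalty=-2" ;;
        dc-megablast) echo "word_size=11 reward=2 penalty=-3" ;;
        blastn)       echo "word_size=11 reward=2 penalty=-3" ;;
        blastn-short) echo "word_size=7 reward=1 penalty=-3" ;;
        *)            echo "unknown task: $1" >&2; return 1 ;;
    esac
}

for task in megablast dc-megablast blastn blastn-short; do
    printf '%s: %s\n' "$task" "$(blastn_defaults "$task")"
done
```

In an actual search the task is selected with the blastn `-task` option (for example `blastn -task blastn-short -query query.fsa -db nt`), and any value from the table can then be overridden explicitly.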

blastp: compares a protein sequence against a protein database, directly assessing homology at the protein level.

Table C3: blastp application options. The blastp application searches a protein sequence against protein subject sequences or a protein database. An option of type “flag” takes no arguments, but if present the argument is true. Three different tasks are supported: 1.) “blastp”, for standard protein-protein comparisons, 2.) “blastp-short”, optimized for query sequences shorter than 30 residues, and 3.) “blastp-fast”, a faster version that uses a larger word-size per https://www.ncbi.nlm.nih.gov/pubmed/17921491. This table reflects the 2.2.27 BLAST+ release.

option task type default value description and notes
word_size blastp integer 3 Word size of initial match. Valid word sizes are 2-7.
word_size blastp-short integer 2 Word size of initial match.
word_size blastp-fast integer 6 Word size of initial match.
gapopen blastp integer 11 Cost to open a gap.
gapextend blastp integer 1 Cost to extend a gap.
gapopen blastp-short integer 9 Cost to open a gap.
gapextend blastp-short integer 1 Cost to extend a gap.
matrix blastp string BLOSUM62 Scoring matrix name.
matrix blastp-short string PAM30 Scoring matrix name.
threshold blastp integer 11 Minimum score to add a word to the BLAST lookup table.
threshold blastp-short integer 16 Minimum score to add a word to the BLAST lookup table.
threshold blastp-fast integer 21 Minimum score to add a word to the BLAST lookup table.
comp_based_stats blastp, blastp-fast string 2 Use composition-based statistics:
D or d: default (equivalent to 2)
0 or F or f: no composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
comp_based_stats blastp-short string 0 Use composition-based statistics:
D or d: default (equivalent to 2)
0 or F or f: no composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
seg all string no Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable).
soft_masking blastp boolean false Apply filtering locations as soft masks (i.e., only for finding initial matches).
lcase_masking all flag N/A Use lower case filtering in query and subject sequence(s).
db_soft_mask all integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
db_hard_mask all integer none Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search).
xdrop_gap_final all real 25 Heuristic value (in bits) for final gapped alignment.
window_size blastp, blastp-fast integer 40 Multiple hits window size, use 0 to specify 1-hit algorithm.
window_size blastp-short integer 5 Multiple hits window size, use 0 to specify 1-hit algorithm.
use_sw_tback all flag N/A Compute locally optimal Smith-Waterman alignments?

blastx: compares a nucleotide sequence against a protein database; the nucleotide sequence is first translated into protein and then compared against the protein database.

Table C4: blastx application options. The blastx application translates a nucleotide query and searches it against protein subject sequences or a protein database. Two different tasks are supported: 1.) “blastx” for standard translated nucleotide-protein comparison and 2.) “blastx-fast”, a faster version that uses a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

option task type default value description and notes
word_size blastx integer 3 Word size for initial match. Valid word sizes are 2-7.
word_size blastx-fast integer 6 Word size for initial match.
gapopen all integer 11 Cost to open a gap.
gapextend all integer 1 Cost to extend a gap.
matrix all string BLOSUM62 Scoring matrix name.
threshold blastx integer 12 Minimum score to add a word to the BLAST lookup table.
threshold blastx-fast integer 21 Minimum score to add a word to the BLAST lookup table.
seg all string 12 2.2 2.5 Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable).
soft_masking all boolean false Apply filtering locations as soft masks (i.e., only for finding initial matches).
lcase_masking all flag N/A Use lower case filtering in query and subject sequence(s).
db_soft_mask all integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
db_hard_mask all integer none Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search).
xdrop_gap_final all real 25 Heuristic value (in bits) for final gapped alignment.
window_size all integer 40 Multiple hits window size, use 0 to specify 1-hit algorithm.
strand all string both Query strand(s) to search against database/subject. Choice of both, minus, or plus.
query_genetic_code all integer 1 Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
max_intron_length all integer 0 Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
comp_based_stats all integer 2 Use composition-based statistics for blastx:
D or d: default (equivalent to 2)
0 or F or f: no composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
Default = 2

tblastn: compares a protein sequence against a nucleotide database; the nucleotide database is first translated into protein, and the protein query is then compared against the translated database.

Table C5: tblastn application options. The tblastn application searches a protein query against nucleotide subject sequences or a nucleotide database translated at search time. Two different tasks are supported: 1.) “tblastn” for a standard protein-translated nucleotide comparison and 2.) “tblastn-fast” for a faster version with a larger word-size based on https://www.ncbi.nlm.nih.gov/pubmed/17921491.

option task type default value description and notes
word_size tblastn integer 3 Word size for initial match. Valid word sizes are 2-7.
word_size tblastn-fast integer 6 Word size for initial match.
gapopen all integer 11 Cost to open a gap.
gapextend all integer 1 Cost to extend a gap.
matrix all string BLOSUM62 Scoring matrix name.
threshold tblastn integer 13 Minimum score to add a word to the BLAST lookup table.
threshold tblastn-fast integer 21 Minimum score to add a word to the BLAST lookup table.
seg all string 12 2.2 2.5 Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable).
soft_masking all boolean false Apply filtering locations as soft masks (i.e., only for finding initial matches).
lcase_masking all flag N/A Use lower case filtering in query and subject sequence(s).
db_soft_mask all integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
db_hard_mask all integer none Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search).
xdrop_gap_final all real 25 Heuristic value (in bits) for final gapped alignment.
window_size all integer 40 Multiple hits window size, use 0 to specify 1-hit algorithm.
db_gen_code all integer 1 Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
max_intron_length all integer 0 Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
comp_based_stats all string 2 Use composition-based statistics for tblastn:
D or d: default (equivalent to 2)
0 or F or f: no composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
Default = 2

tblastx: compares a nucleotide query against a nucleotide database at the protein level.

Table C6: tblastx application options. The tblastx application searches a translated nucleotide query against translated nucleotide subject sequences or a translated nucleotide database. An option of type “flag” takes no arguments, but if present the argument is true. This table reflects the 2.2.27 BLAST+ release. Only ungapped searches are supported for tblastx.

option type default value description and notes
word_size integer 3 Word size for initial match.
matrix string BLOSUM62 Scoring matrix name.
threshold integer 13 Minimum word score to add the word to the BLAST lookup table.
seg string 12 2.2 2.5 Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable).
soft_masking boolean false Apply filtering locations as soft masks (i.e., only for finding initial matches).
lcase_masking flag N/A Use lower case filtering in query and subject sequence(s).
db_soft_mask integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
db_hard_mask integer none Filtering algorithm ID to apply to the BLAST database as hard mask (i.e., sequence is masked for all phases of search).
strand string both Query strand(s) to search against database subject sequences. Choice of both, minus, or plus.
query_genetic_code integer 1 Genetic code to translate query, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
db_gen_code integer 1 Genetic code to translate subject sequences, see ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt
max_intron_length integer 0 Length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).

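CDD (Conserved Domain Database)
Overview: CDD is a database of conserved protein domains that collects a large amount of conserved-domain sequence information and protein sequence information. A protein's conserved domains reflect its function to some extent. Through the CD-Search service, the conserved domains contained in a protein sequence can be retrieved and used to analyze and predict the protein's function. https://zhuanlan.zhihu.com/p/460178458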

Table C7: rpsblast application options. The rpsblast application searches a protein query against the Conserved Domain Database (CDD), which is a set of protein profiles. Many of the common options such as matrix or word threshold are set when the CDD is built and cannot be changed by the rpsblast application. A search-ready CDD can be downloaded from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/

option type default value description and notes
window_size integer 40 Multiple hits window size, use 0 to specify 1-hit algorithm.
xdrop_ungap real 15 Heuristic value (in bits) for ungapped extensions
xdrop_gap real 25 Heuristic value (in bits) for preliminary gapped extensions.
xdrop_gap_final real 40 Heuristic value (in bits) for final gapped alignment.
seg string 12 2.2 2.5 Filter query sequence with SEG (Format: ‘yes’, ‘window locut hicut’, or ‘no’ to disable).
soft_masking boolean false Apply filtering locations as soft masks (i.e., only for finding initial matches).
db_soft_mask integer none Filtering algorithm ID to apply to the BLAST database as soft mask (i.e., only for finding initial matches).
mt_mode integer 0 Set to 1 if a large number of queries are to be searched and you wish to use multiple threads, as specified by the num_threads argument.
comp_based_stats integer 2 Use composition-based statistics for rpsblast:
D or d: default (equivalent to 2)
0 or F or f: no composition-based statistics
1: Composition-based statistics as in NAR 29:2994-3005, 2001
2 or T or t: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties
3: Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
Default = 2

Table C8: makeblastdb application options. This application builds a BLAST database. An option of type “flag” takes no arguments, but if present the argument is true. Starting with the 2.10.0 release, makeblastdb produces version 5 databases by default, which use LMDB. LMDB requires virtual memory (at least 600 GB, but 800 GB is recommended) to build an index. If makeblastdb cannot access enough virtual memory, it will produce a message containing the string “mdb_env_open”. Virtual memory is just that (virtual) and doesn’t depend on the hardware in your system. In general, we recommend that BLAST users simply set the virtual memory to unlimited. The alternative is to use an environment variable (BLASTDB_LMDB_MAP_SIZE) to lower the required virtual memory, but this runs the risk of LMDB not being able to finish indexing the database. For a smaller database (tens of millions of letters) it may be possible to use a value of 100 million.

option type default value description and notes
in string stdin Input file/database name
input_type string fasta Input file type, it may be any of the following:
fasta: for FASTA file(s)
blastdb: for BLAST database(s)
asn1_txt: for Seq-entries in text ASN.1 format
asn1_bin: for Seq-entries in binary ASN.1 format
dbtype string prot Molecule type of input, values can be nucl or prot.
title string none Title for BLAST database. If not set, the input file name will be used.
parse_seqids flag N/A Parse bar delimited sequence identifiers (e.g., gi
hash_index flag N/A Create index of sequence hash values.
mask_data string none Comma-separated list of input files containing masking data as produced by NCBI masking applications (e.g. dustmasker, segmasker, windowmasker).
out string input file name Name of BLAST database to be created. Input file name is used if none provided. This field is required if input consists of multiple files.
max_file_size string 1GB Maximum file size to use for BLAST database. 4GB is the maximum supported by the database structure.
blastdb_version integer 5 Version 5 (taxonomy aware) is the default starting with the 2.10.0 release. Value must be 4 or 5.
taxid integer none Taxonomy ID to assign to all sequences.
taxid_map string none File with two columns mapping sequence ID to the taxonomy ID. The first column is the sequence ID represented as one of:
1.fasta with accessions (e.g., emb|X17276.1|)
2.fasta with GI (e.g., gi|4)
3.GI as a bare number (e.g., 4)
4.A local ID. The local ID must be prefixed with “lcl” (e.g., lcl|4).
The second column should be the NCBI taxonomy ID (e.g., 9606 for human).
metadata_output_prefix string none Path prefix for “files” field in BLASTDB metadata file
logfile string none Program log file (default is stderr).
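As a rough illustration of how the options in Table C8 fit together, the following Python sketch assembles a makeblastdb invocation and an environment that lowers the LMDB map size. The file names (proteins.fasta, proteins_db) and the helper names are hypothetical, and the text above does not state the units of BLASTDB_LMDB_MAP_SIZE, so the value 100 million is simply passed through as written. The command is executed only if makeblastdb is installed and the input file exists; otherwise it is printed.

```python
import os
import shutil
import subprocess


def makeblastdb_command(fasta, dbtype, out, title=None, parse_seqids=False):
    """Build a makeblastdb argument list using options listed in Table C8."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", out]
    if title is not None:
        cmd += ["-title", title]
    if parse_seqids:
        cmd.append("-parse_seqids")  # a flag: present means true, no argument
    return cmd


def lmdb_environment(map_size=None):
    """Copy the current environment, optionally setting the LMDB map size."""
    env = dict(os.environ)
    if map_size is not None:
        env["BLASTDB_LMDB_MAP_SIZE"] = str(map_size)
    return env


# Hypothetical input and output names, for illustration only.
cmd = makeblastdb_command("proteins.fasta", "prot", "proteins_db", parse_seqids=True)
env = lmdb_environment(map_size=100_000_000)  # "100 million", as in the caption

if shutil.which("makeblastdb") and os.path.exists("proteins.fasta"):
    subprocess.run(cmd, env=env, check=True)
else:
    print(" ".join(cmd))
```

Leaving map_size as None keeps the environment untouched, which matches the general recommendation above to set virtual memory to unlimited instead of lowering the map size.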

Table C9: Makeprofiledb application options. This application builds an RPS-BLAST database. An option of type “flag” takes no arguments, but if present the argument is true. COBALT (a multiple sequence alignment program) and DELTA-BLAST both use RPS-BLAST searches as part of their processing but use specialized versions of the database. This application can build databases for COBALT, DELTA-BLAST, and a standard RPS-BLAST search. The “dbtype” option (see entry in table) determines which flavor of the database is built.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| in | string | stdin | Input file that contains a list of scoremat files (delimited by space, tab, or newline) |
| binary | flag | N/A | The scoremat files are binary ASN.1 |
| title | string | none | Title for RPS-BLAST database. If not set, the input file name will be used. |
| threshold | real | 9.82 | Threshold for RPSBLAST lookup table. |
| out | string | input file name | Name of BLAST database to be created. Input file name is used if none provided. |
| max_file_size | string | 1GB | Maximum file size to use for BLAST database. |
| dbtype | string | rps | Specifies use for RPSBLAST db. One of rps, cobalt, or delta. |
| index | flag | N/A | Creates index files. |
| gapopen | integer | none | Cost to open a gap. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| gapextend | integer | none | Cost to extend a gap by one residue. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| scale | real | 100 | PSSM scale factor. |
| matrix | string | BLOSUM62 | Matrix to use in constructing PSSM. One of BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, PAM250, PAM30 or PAM70. Used only if scoremat files do not contain PSSM scores, otherwise ignored. |
| obsr_threshold | real | 6 | Exclude domains with maximum number of independent observations below this value (for use in DELTA-BLAST searches). |
| exclude_invalid | real | true | Exclude domains that do not pass validation test (for use in DELTA-BLAST searches). |
| logfile | string | none | Program log file (default is stderr). |

Table C10: Blastdbcmd application options. This application reads a BLAST database and produces reports.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| db | string | nr | BLAST database name. |
| dbtype | string | guess | Molecule type stored in BLAST database, one of nucl, prot, or guess. |
| entry | string | none | Comma-delimited search string(s) of sequence identifiers, e.g., 555, AC147927, ‘gnl\|dbname\|tag’, or ‘all’ to select all sequences in the database. |
| entry_batch | string | none | Input file for batch processing. The format requires one entry per line; each line should begin with the sequence ID followed by any of the following optional specifiers (in any order): range (format: ‘from-to’, inclusive in 1-offsets), strand (‘plus’ or ‘minus’), or masking algorithm ID (integer value representing the available masking algorithm). Omitting the ending range (e.g., ‘10-’) is supported, but there should not be any spaces around the ‘-’. |
| pig | integer | none | PIG (protein identity group) to retrieve. |
| info | flag | N/A | Print BLAST database information. |
| range | string | none | Range of sequence to extract (Format: start-stop). |
| strand | string | plus | Strand of nucleotide sequence to extract. Choice of plus or minus. |
| mask_sequence_with | string | none | Produce lower-case masked FASTA using the algorithm IDs specified. |
| out | string | stdout | Output file name. |
| outfmt | string | %f | Output format, where the available format specifiers are: %f (sequence in FASTA format); %s (sequence data, without defline); %a (accession); %g (gi); %o (ordinal id, OID); %t (sequence title); %l (sequence length); %T (taxid); %L (common taxonomic name); %S (scientific name); %P (PIG); %mX (sequence masking data, where X is an optional comma-separated list of integers to specify the algorithm ID(s) to display, or all masks if absent or invalid specification). Masking data will be displayed as a series of ‘N-M’ values separated by ‘;’ or the word ‘none’ if none are available. For every format except ‘%f’, each line of output will correspond to a sequence. |
| target_only | flag | N/A | Definition line should contain target GI only. |
| get_dups | flag | N/A | Retrieve duplicate accessions. |
| line_length | integer | 80 | Line length for output. |
| ctrl_a | flag | N/A | Use Ctrl-A as the non-redundant definition line separator. |
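The entry_batch format described in Table C10 can be sketched in Python as below. This is only an illustration: the helper name entry_batch_line, the file name batch.txt, and the choice of the nt database are hypothetical, and separating the specifiers with single spaces is an assumption, since the table only says they follow the sequence ID.

```python
def entry_batch_line(seq_id, start=None, stop=None, strand=None, mask_algo_id=None):
    """Format one line of a blastdbcmd -entry_batch file."""
    parts = [seq_id]
    if start is not None:
        # Ranges are 1-based and inclusive; "10-" leaves the end open.
        # No spaces are allowed around the "-".
        parts.append(f"{start}-{stop}" if stop is not None else f"{start}-")
    if strand is not None:
        if strand not in ("plus", "minus"):
            raise ValueError("strand must be 'plus' or 'minus'")
        parts.append(strand)
    if mask_algo_id is not None:
        parts.append(str(int(mask_algo_id)))
    return " ".join(parts)  # assumption: specifiers are space-separated


lines = [
    entry_batch_line("AC147927", start=10, stop=50, strand="minus"),
    entry_batch_line("AC147927", start=10),
]

# A matching invocation that prints accession, length and taxid per sequence.
cmd = ["blastdbcmd", "-db", "nt", "-entry_batch", "batch.txt", "-outfmt", "%a %l %T"]

print("\n".join(lines))
print(" ".join(cmd))
```

Because the output format here is not %f, each line of output would correspond to one sequence, as noted in the table.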

Table C11: Makembindex application options. The indexed databases created by makembindex are used by production MegaBLAST software and by a new srsearch utility designed to quickly search for nearly exact matches (up to one mismatch) of short queries against a genomic database. When a FASTA formatted file is used as the input, then masking by lower case letters is incorporated in the index. Makembindex can currently build two types of indices, called “old style” and “new style” indexing. The NCBI offers full support for the new style and has deprecated the old style. A MegaBLAST search with a new style index requires that both the index and the corresponding BLAST database be present. The index structure is described in PMID:18567917. Please cite this paper in any publication that uses makembindex.

| option | type | default value | description and notes |
| --- | --- | --- | --- |
| input | string | stdin | Input file name or BLAST database name, depending on the value of the iformat parameter. For FASTA formatted input, this parameter is optional and defaults to the program’s standard input stream. |
| output | string | none | The resulting index name. The index itself can consist of multiple files, called volumes, with names ending in .00.idx, .01.idx,… This option should not be used with new style indices. |
| iformat | string | fasta | The input format selector. Possible values are ‘fasta’ and ‘blastdb’. |
| old_style_index | boolean | false | The old_style_index is no longer supported. If set to ‘false’ the new style index is created. New style indices require a BLAST database as input (use -iformat blastdb), which can be downloaded from the NCBI FTP site or created with makeblastdb. The option -output is ignored for a new style index. New style indices are always created at the same location as the corresponding BLAST database. |
| db_mask | integer | None | Exclude masked regions of BLAST db from the index. Use makeblastdb to discover the algorithm ID to be used as input for this argument. |
| legacy | boolean | true | This is a compatibility feature to support current production MegaBLAST. If true, then -stride, -nmer, and -ws_hint are ignored. The legacy format must be used for BLAST. |
| nmer | integer | 12 | N-mer size to use. Ignored if -legacy is specified. |
| ws_hint | integer | 28 | This is an optimization hint for makembindex that indicates an expected minimum match size in searches that use the index. If n is the value of the -nmer parameter and s is the value of the -stride parameter, then the value of -ws_hint must be at least n + s - 1. |
| stride | integer | 5 | makembindex will index every stride-th N-mer of the database. |
| volsize | integer | 1536 | Target index volume size in megabytes. |
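The constraint between -ws_hint, -nmer and -stride in Table C11 is simple arithmetic, and a small Python check can make it concrete. The function names below are hypothetical; only the rule ws_hint >= n + s - 1 and the default values come from the table.

```python
def min_ws_hint(nmer=12, stride=5):
    """Smallest -ws_hint allowed for a given -nmer (n) and -stride (s)."""
    return nmer + stride - 1


def validate_index_params(nmer=12, stride=5, ws_hint=28):
    """Raise ValueError if ws_hint is below n + s - 1, otherwise return True."""
    required = min_ws_hint(nmer, stride)
    if ws_hint < required:
        raise ValueError(f"ws_hint={ws_hint} is below the minimum {required}")
    return True


# With the defaults from the table (nmer 12, stride 5) the minimum is 16,
# so the default ws_hint of 28 satisfies the rule.
print(min_ws_hint())
print(validate_index_params())
```

Note that, according to the table, these three options are ignored when the legacy format is in effect, so the check only matters for non-legacy indices.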

BLASTN reward/penalty values

BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned, with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [2].

For each reward/penalty pair, a number of different gap costs are supported. A gap cost includes a value to open the gap and a value to extend the gap by a base. Following the convention of the command-line applications, these costs are listed as positive numbers here. MegaBLAST uses a specialized algorithm, described in PMID:10890397, to calculate the default gap costs for a reward/penalty pair. Briefly, the default megaBLAST cost to open a gap is zero, and the cost to extend a gap by two letters is given by the absolute value of two mismatches minus one match. For example, given a reward of 1 and a penalty of -5, extending a gap by two letters costs |2(-5) - 1| = 11, so the cost to extend a gap by one letter is 5.5. The default gap costs for the other tasks supported by the blastn application are 5 to open a gap and 2 to extend it by one base.
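The default megaBLAST gap costs described above can be worked through in a few lines of Python. This is a minimal sketch of the rule as stated in the text (open cost zero, two-letter extension cost equal to the absolute value of two mismatches minus one match); the function names are hypothetical and this is not the BLAST source code.

```python
def reward_penalty_ratio(reward, penalty):
    """Absolute reward/penalty ratio, e.g. 1/-3 gives about 0.33."""
    return abs(reward / penalty)


def megablast_default_gap_costs(reward, penalty):
    """Return (open_cost, extend_cost_per_letter) as positive numbers."""
    two_letter_cost = abs(2 * penalty - reward)  # two mismatches minus one match
    return 0, two_letter_cost / 2


print(round(reward_penalty_ratio(1, -3), 2))
print(megablast_default_gap_costs(1, -5))
```

For a reward of 1 and a penalty of -5 this reproduces the per-letter extension cost of 5.5 given in the text.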

Table D1 presents the supported reward/penalty values and gap costs.

Table D1: Supported reward/penalty values and gap costs for the blastn application. The left-most column presents the supported reward/penalty values. The middle column presents pairs of numbers for the cost to open and extend a gap for each reward/penalty value. Blastn also supports gap costs more stringent than those listed (e.g., for reward/penalty of 1/-3 gap costs of 5/2 or 500/2 are supported). The reward/penalty values are ordered from most to least stringent, with the more stringent values better suited for alignments with high sequence identity. The default megaBLAST gap costs are shown in the right-most column. Accurate statistics for these default megaBLAST gap costs can only be calculated for the most stringent reward/penalty values, but the values listed in the middle column can always be used.

[Images: Table D1]

BLAST Substitution Matrices

BLAST uses a substitution matrix for any program that aligns residues. The program may align residues because both the query and database consist of proteins (e.g. BLASTP) or the program may align DNA translated to protein with protein (e.g. BLASTX). A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2].

In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

Supplementary source:

https://blog.csdn.net/weixin_43202635/article/details/82962032?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-82962032-blog-88382137.pc_relevant_vip_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-1-82962032-blog-88382137.pc_relevant_vip_default&utm_relevant_index=2



[Image: provisional table of recommended substitution matrices and gap costs by query length]

Gap Costs

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. **Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap.** Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
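The affine gap formula can be checked with a short Python sketch. The function names and the example numbers (a = 11, b = 1, and the three pair scores) are illustrative assumptions, not values taken from a real alignment.

```python
def gap_score(k, a, b):
    """Score of a single gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


def raw_score(pair_scores, gap_lengths, a, b):
    """Raw alignment score: sum of aligned-pair scores plus all gap scores."""
    return sum(pair_scores) + sum(gap_score(k, a, b) for k in gap_lengths)


# Illustrative values: existence cost a = 11, per-residue cost b = 1.
print(gap_score(1, 11, 1))               # a gap of length 1 scores -(a + b)
print(gap_score(3, 11, 1))               # a gap of length 3 scores -(a + 3b)
print(raw_score([4, 5, 6], [2], 11, 1))  # three aligned pairs and one gap of 2
```

Under this scheme one long gap is cheaper than several short gaps of the same total length, because the existence cost a is charged once per gap.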


Lambda Ratio

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, **but with infinite gap costs** [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
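A minimal Python sketch of the bit-score conversion and the lambda ratio follows. The numeric parameters are assumptions for illustration only: they are values commonly printed in BLAST output for BLOSUM62 (gapped lambda 0.267 and K 0.041 with gap costs 11/1, ungapped lambda 0.3176) and do not appear in this text, so check them against your own BLAST report before relying on them.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of the gapped lambda to the lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped


# Assumed example parameters (not taken from this document).
LAMBDA_GAPPED, K_GAPPED = 0.267, 0.041
LAMBDA_UNGAPPED = 0.3176

print(round(bit_score(100, LAMBDA_GAPPED, K_GAPPED), 1))
print(round(lambda_ratio(LAMBDA_GAPPED, LAMBDA_UNGAPPED), 2))
```

With these assumed parameters the lambda ratio falls inside the 0.8 to 0.9 range that the text describes as most effective.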

