How to Use NCBI to Predict a Gene's Domains and Conserved Sites

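1. The CD-Search tool identifies conserved domains or functional units within a protein or nucleotide sequence. It is hosted at NCBI: open the NCBI site, select Conserved Domains, and click Search.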

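The page shown below appears; the part highlighted in yellow is the tool covered here. CD-Search accepts only a single sequence per submission, whereas Batch CD-Search accepts multiple sequences.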

2. We start with CD-Search and predict the domains of a single sequence.

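Click CD-Search in the figure above and enter a protein or nucleotide query sequence, either as FASTA-formatted sequence data or as a GI or accession number. In the OPTIONS panel on the right, choose the database to search, the Expect Value threshold, and so on, or keep the default settings. Then click the Submit button.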

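3. After about a minute the results appear, as shown in the figure below. By default the results use the concise display mode (other modes can be chosen from the View drop-down at the upper right of the figure), which shows only the highest-scoring hits for each region of the query sequence. To see all matching regions, change View to the full display.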

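The results contain four types of hits: specific hits, non-specific hits, the superfamilies these hits belong to, and multi-domains. Amino acids at conserved features/sites are marked with small triangles; such sites may be catalytic sites, binding sites, and so on. See the annotations in the figure above for details.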

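If CD-Search finds a specific hit, the association between the query sequence and the matched conserved domain has high confidence, so the function inferred for the query sequence is also highly reliable. Other hit types can still suggest a putative function for the query protein; their reliability is judged by the E-value.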
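The rule of thumb above (trust specific hits first, otherwise fall back on the E-value) can be sketched in a few lines of Python. This is only an illustration: the `Hit` record and the `best_hit` helper are hypothetical names introduced here, not part of any NCBI tool, and the hit-type strings are assumed to match the labels used on the results page.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """Hypothetical record holding the fields needed to rank one CD-Search hit."""
    query: str
    hit_type: str      # e.g. "specific", "non-specific", "superfamily"
    short_name: str
    evalue: float


def best_hit(hits: Sequence[Hit]) -> Optional[Hit]:
    """Pick the most trustworthy hit.

    Specific hits always win; among hits of equal standing,
    the one with the lowest E-value is preferred.
    """
    if not hits:
        return None
    # False sorts before True, so specific hits come first.
    return min(hits, key=lambda h: (h.hit_type.lower() != "specific", h.evalue))
```

Note that a specific hit is chosen even when a non-specific hit has a lower E-value, which mirrors the confidence ordering described in the text.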

4. Batch submission. Click Batch CD-Search to open the page shown below:

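You can choose a file to upload; it must contain no more than 4000 sequences. After setting the other options, enter an email address, and the results will be emailed to you once the run finishes.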
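If a FASTA file holds more sequences than the 4000-sequence limit, it has to be split before uploading. The following is a minimal sketch of one way to do that; the function names (`read_fasta`, `split_into_batches`, `to_fasta`) are made up for this example, and the parser assumes plain FASTA text with `>` header lines.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(records, max_per_batch=4000):
    """Group records into lists of at most max_per_batch entries."""
    records = list(records)
    return [records[i:i + max_per_batch]
            for i in range(0, len(records), max_per_batch)]


def to_fasta(batch):
    """Serialize a batch of (header, sequence) pairs back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in batch)
```

Each batch returned by `split_into_batches` can be written to its own file with `to_fasta` and uploaded as a separate Batch CD-Search job.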

The results page is shown below; click Download to save the results:

The columns of the results are explained below:

| Column | Description |
| --- | --- |
| Query | The ID of the sequence you submitted. |
| Hit type | CD-Search results can include hit types that represent various confidence levels (specific hits, non-specific hits) and domain model scope (superfamilies, multi-domains). They can be seen in both the Concise display and the Full display, except for non-specific hits, which are shown only in the Full display. |
| PSSM-ID | The unique identifier for a domain model's position-specific scoring matrix (PSSM). |
| From..To | The range of amino acids in the query protein sequence to which the domain model aligns. If the alignment found by RPS-BLAST omitted more than 20% of the CD's extent at the N-terminus, the C-terminus, or both, the partial nature of the hit is indicated in the "Incomplete" column. Partial hits can also be spotted in the graphical display as domain model cartoons with jagged edges. |
| E-value | The expect value, or E-value, indicates the statistical significance of the hit: the number of hits with an equal or better score expected by chance. The lower the E-value, the more significant the hit. |
| Bit Score | The alignment score of the hit. |
| Accession | The accession number of the hit, which can be either a domain model or a superfamily cluster. If the hit is a domain model, the accession number (cl) of the superfamily cluster to which it belongs is listed in the "Superfamily" column of the output file. |
| Short name | The short name of a conserved domain, which concisely defines the domain. For example, "Voltage gated ClC" is the short title of the NCBI-curated conserved domain model for the voltage gated chloride channel (cd00400). |
| Incomplete | If the hit to a conserved domain is partial (that is, the alignment found by RPS-BLAST omitted more than 20% of the CD's extent at the N-terminus, the C-terminus, or both), this column contains one of: N (incomplete at the N-terminus), C (incomplete at the C-terminus), NC (incomplete at both termini). If the hit is complete, the column contains a dash (-). |
| Superfamily | Populated only for domain models that are specific or non-specific hits; lists the accession number of the superfamily to which the domain model belongs. If the hit is to a superfamily itself, this column contains a dash because the superfamily accession is already listed in the "Accession" column. |
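Once the hit table has been downloaded, the columns described above can be filtered programmatically. The sketch below rests on several assumptions that should be checked against the actual file: that the download is tab-delimited, that comment lines start with `#`, that the header row starts with `Query`, and that the "From..To" range is stored as two separate columns. The `COLUMNS` list and both function names are hypothetical; adjust them if your file differs.

```python
# Assumed column order of the downloaded hit table (verify against your file).
COLUMNS = ["Query", "Hit type", "PSSM-ID", "From", "To", "E-Value",
           "Bitscore", "Accession", "Short name", "Incomplete", "Superfamily"]


def parse_hits(text):
    """Parse tab-delimited hit rows into dicts, skipping comments and the header."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split("\t")]
        # Skip the header row and any row that does not match the assumed layout.
        if fields[0] == "Query" or len(fields) != len(COLUMNS):
            continue
        hit = dict(zip(COLUMNS, fields))
        hit["From"] = int(hit["From"])
        hit["To"] = int(hit["To"])
        hit["E-Value"] = float(hit["E-Value"])
        hit["Bitscore"] = float(hit["Bitscore"])
        hits.append(hit)
    return hits


def complete_specific_hits(hits, max_evalue=0.01):
    """Keep specific hits that are complete ('-') and pass the E-value cutoff."""
    return [h for h in hits
            if h["Hit type"].lower() == "specific"
            and h["Incomplete"] == "-"
            and h["E-Value"] <= max_evalue]
```

The `max_evalue` cutoff of 0.01 is only a starting point for this example; a stricter value can be passed when fewer, more reliable hits are wanted.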
