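Transcriptome Analysis in Practice, Part 7: Annotating the Assembly with Trinotate

Trinity assembles all the reads into a large number of transcripts, so the next step is to annotate them.

Many annotation tools are available; here we annotate the assembled transcripts with Trinotate.

First, install Trinotate.

Download the latest Trinotate release from the download page and unzip it: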
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft$ wget https://github.com/Trinotate/Trinotate/archive/Trinotate-v3.1.1.zip
--2019-01-30 10:54:44--  https://github.com/Trinotate/Trinotate/archive/Trinotate-v3.1.1.zip
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 302 Found
Location: https://codeload.github.com/Trinotate/Trinotate/zip/Trinotate-v3.1.1 [following]
--2019-01-30 10:54:45--  https://codeload.github.com/Trinotate/Trinotate/zip/Trinotate-v3.1.1
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘Trinotate-v3.1.1.zip’

Trinotate-v3.1.1.zip                                    [                               <=>                                                                               ]  28.59M  5.06MB/s    in 7.5s    

2019-01-30 10:54:53 (3.80 MB/s) - ‘Trinotate-v3.1.1.zip’ saved [29979458]

yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft$ l
mafft_7.407-1_amd64.deb  Trinotate-v3.1.1.zip
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft$ unzip Trinotate-v3.1.1.zip 
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft$ cd Trinotate-Trinotate-v3.1.1/
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ l
admin/  Changelog.txt  notes     README.md   resources/                  sample_data/  Trinotate.github.io/  TrinotateWeb.conf/
auto/   LICENSE.txt    PerlLib/  README.txt  run_TrinotateWebserver.pl*  Trinotate*    TrinotateWeb/         util/
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ pwd
/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ export TRINOTATE_HOME=/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ $TRINOTATE_HOME/
admin/                     PerlLib/                   run_TrinotateWebserver.pl  Trinotate                  TrinotateWeb/              util/                      
auto/                      resources/                 sample_data/               Trinotate.github.io/       TrinotateWeb.conf/   
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ echo 'export TRINOTATE_HOME=/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1' >> ~/.bashrc
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ tail ~/.bashrc 
 fi
fi
export https_proxy='127.0.0.1:8118'
export http_proxy='127.0.0.1:8118'
export PATH=/usr/local/texlive/2018/bin/x86_64-linux:$PATH
export MANPATH=/usr/local/texlive/2018/texmf-dist/doc/man:$MANPATH
export INFOPATH=/usr/local/texlive/2018/texmf-dist/doc/info:$INFOPATH
export PATH=/home/yeyuntian/Biodata/trinitytest/trinityrnaseq-Trinity-v2.8.3:$PATH
export TRINITY_HOME=/home/yeyuntian/Biodata/trinitytest/trinityrnaseq-Trinity-v2.8.3
export TRINOTATE_HOME=/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ source ~/.bashrc 
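Running the echo ... >> ~/.bashrc step more than once leaves duplicate export lines in the file. The following is a minimal sketch of an idempotent variant; the helper name append_once is made up for this illustration, and the demonstration writes to a scratch file rather than to ~/.bashrc.

```shell
#!/bin/sh
# append_once LINE FILE: append LINE to FILE unless an identical line already exists.
# (append_once is a hypothetical helper, not part of Trinotate.)
append_once() {
    line=$1
    file=$2
    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration on a scratch file; pass ~/.bashrc as FILE for real use.
rc_file=$(mktemp)
append_once 'export TRINOTATE_HOME=/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1' "$rc_file"
append_once 'export TRINOTATE_HOME=/home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1' "$rc_file"
grep -c 'TRINOTATE_HOME' "$rc_file"   # prints 1: the second call added nothing
rm -f "$rc_file"
```

The whole-line match means a changed install path is appended as a new line rather than silently skipped.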
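According to the Trinotate documentation, the following software is required:
  1. Trinity (generates the assembled transcript FASTA file)
  2. TransDecoder (predicts the protein-coding regions of the transcripts)
  3. SQLite (integrates the database data)
  4. NCBI BLAST+ (searches against the BLAST database)
  5. HMMER/PFAM (annotates protein domains with the HMMER tools)
The following software is also recommended:
  1. signalP v4 (predicts signal peptides)
  2. tmhmm v2 (predicts transmembrane domains)
  3. RNAMMER (predicts rRNA transcripts)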
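Before going further it can help to check which of the required tools are already on PATH. This is only a sketch: the executable names passed to the helper are assumptions (for example, TransDecoder is assumed to provide TransDecoder.LongOrfs), and missing_tools is a name invented for this example.

```shell
#!/bin/sh
# missing_tools CMD...: print each command name that cannot be found on PATH.
# (missing_tools is a hypothetical helper written for this illustration.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The executable names below are assumptions; adjust them to your installation.
missing=$(missing_tools Trinity TransDecoder.LongOrfs sqlite3 makeblastdb blastx blastp hmmpress hmmscan)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo 'All required tools were found.'
fi
```

A tool reported as missing may simply be installed outside PATH, as Trinity is in this article's ~/.bashrc setup.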
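Required data: the SwissProt and Pfam databases, which the database build step below downloads.
Once Trinotate is installed, we can start running it.
The first step is to build the databases needed later: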
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ $TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl Trinotate
This step downloads the required databases. Here is our run:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ $TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl Trinotate
-- Skipping CMD: wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz", checkpoint exists.
-- Skipping CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_swissprot_parser.pl uniprot_sprot.dat.gz Trinotate, checkpoint exists.
-- Skipping CMD: mv uniprot_sprot.dat.gz.pep uniprot_sprot.pep, checkpoint exists.
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --create
Can't locate DBI.pm in @INC (you may need to install the DBI module) (@INC contains: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/../../PerlLib /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/../../PerlLib/Sqlite_connect.pm line 15.
BEGIN failed--compilation aborted at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/../../PerlLib/Sqlite_connect.pm line 15.
Compilation failed in require at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl line 8.
BEGIN failed--compilation aborted at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl line 8.
Error, cmd: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --create died with ret 512 at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/../PerlLib/Pipeliner.pm line 102.
    Pipeliner::run(Pipeliner=HASH(0x23cf860)) called at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/Build_Trinotate_Boilerplate_SQLite_db.pl line 119
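We ran into trouble here. After searching Google, we found that a missing Perl module had to be installed. The error names DBI.pm; DBD::SQLite depends on DBI, so installing it through CPAN brings DBI in as well.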
perl -MCPAN -e shell
    install DBD::SQLite
    exit
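This is what the actual run looked like. Because this was our first time using CPAN, it printed extra configuration prompts, and the first attempt stopped with a proxy error, so we started the CPAN shell a second time.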
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl -MCPAN -e shell

CPAN.pm requires configuration, but most of it can be done automatically.
If you answer 'no' below, you will enter an interactive dialog for each
configuration option instead.

Would you like to configure as much as possible automatically? [yes] yes

 

Warning: You do not have write permission for Perl library directories.

To install modules, you need to configure a local Perl library directory or
escalate your privileges.  CPAN can help you by bootstrapping the local::lib
module or by configuring itself to use 'sudo' (if available).  You may also
resolve this problem manually if you need to customize your setup.

What approach do you want?  (Choose 'local::lib', 'sudo' or 'manual')
 [local::lib] 


Autoconfiguration complete.

Attempting to bootstrap local::lib...

Writing /home/yeyuntian/.cpan/CPAN/MyConfig.pm for bootstrap...
commit: wrote '/home/yeyuntian/.cpan/CPAN/MyConfig.pm'
Proxy must be specified as absolute URI; '127.0.0.1:8118' is not at /usr/share/perl/5.22/CPAN/FTP.pm line 351.
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl -MCPAN -e shell
Terminal does not support AddHistory.

cpan shell -- CPAN exploration and modules installation (v2.11)
Enter 'h' for help.

cpan[1]> install DBD::SQLite
Catching error: "Proxy must be specified as absolute URI; '127.0.0.1:8118' is not at /usr/share/perl/5.22/CPAN/FTP.pm line 351.\cJ" at /usr/share/perl/5.22/CPAN.pm line 391,  line 1.
    CPAN::shell() called at -e line 1
Fetching with LWP:
http://www.cpan.org/authors/01mailrc.txt.gz
Reading '/home/yeyuntian/.cpan/sources/authors/01mailrc.txt.gz'
............................................................................DONE
Fetching with LWP:
http://www.cpan.org/modules/02packages.details.txt.gz
Reading '/home/yeyuntian/.cpan/sources/modules/02packages.details.txt.gz'
  Database was generated on Sat, 02 Feb 2019 03:55:12 GMT
.............
  New CPAN.pm version (v2.22) available.
  [Currently running version is v2.11]
  You might want to try
    install CPAN
    reload cpan
  to both upgrade CPAN.pm and run the new version without leaving
  the current session.


...............................................................DONE
Fetching with LWP:
http://www.cpan.org/modules/03modlist.data.gz
Reading '/home/yeyuntian/.cpan/sources/modules/03modlist.data.gz'
DONE
Writing /home/yeyuntian/.cpan/Metadata

cpan[2]> exit
Terminal does not support GetHistory.
Lockfile removed.
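The "Proxy must be specified as absolute URI" message in the log above is triggered by a proxy value without a scheme (127.0.0.1:8118). A small sketch of how one might normalize the value and then confirm that the Perl modules load is given here; the helper names normalize_proxy and has_perl_module are invented for this example, and defaulting to http:// is an assumption about the local proxy.

```shell
#!/bin/sh
# normalize_proxy VALUE: print VALUE unchanged if it already has a scheme,
# otherwise prefix it with http:// so that it becomes an absolute URI.
# (Hypothetical helper; http:// is assumed to be the right scheme.)
normalize_proxy() {
    case "$1" in
        *://*) printf '%s\n' "$1" ;;
        *)     printf 'http://%s\n' "$1" ;;
    esac
}

# has_perl_module NAME: succeed if perl can load the named module.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

normalize_proxy '127.0.0.1:8118'   # prints http://127.0.0.1:8118

for module in DBI DBD::SQLite; do
    if has_perl_module "$module"; then
        echo "$module: installed"
    else
        echo "$module: missing"
    fi
done
```

To apply the fix in a session, one could run export http_proxy=$(normalize_proxy "$http_proxy") (and the same for https_proxy) before starting the CPAN shell.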
Once the module is installed, we can continue:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ $TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl Trinotate
-- Skipping CMD: wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz", checkpoint exists.
-- Skipping CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_swissprot_parser.pl uniprot_sprot.dat.gz Trinotate, checkpoint exists.
-- Skipping CMD: mv uniprot_sprot.dat.gz.pep uniprot_sprot.pep, checkpoint exists.
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --create
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/init_Trinotate_sqlite_db.pl --sqlite Trinotate.sqlite
-done creating database Trinotate.sqlite

* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --uniprot_index Trinotate.UniprotIndex
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import Trinotate.UniprotIndex UniprotIndex" | sqlite3 Trinotate.sqlite
sh: 5: sqlite3: not found
Error, cmd: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import Trinotate.UniprotIndex UniprotIndex" | sqlite3 Trinotate.sqlite died with ret 32512 at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/../../PerlLib/Sqlite_connect.pm line 190.
    Sqlite_connect::bulk_load_sqlite("Trinotate.sqlite", "UniprotIndex", "Trinotate.UniprotIndex") called at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl line 95
Error, cmd: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --uniprot_index Trinotate.UniprotIndex died with ret 32512 at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/../PerlLib/Pipeliner.pm line 102.
    Pipeliner::run(Pipeliner=HASH(0x19d2860)) called at /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/Build_Trinotate_Boilerplate_SQLite_db.pl line 119
Another error appeared. Based on a solution posted on SourceForge, we concluded that sqlite3 was not installed.
So we simply install it with apt-get:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ sudo apt-get install sqlite3
[sudo] password for yeyuntian: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  sqlite3-doc
The following NEW packages will be installed:
  sqlite3
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 515 kB of archives.
After this operation, 1,938 kB of additional disk space will be used.
Get:1 https://mirrors.tuna.tsinghua.edu.cn/ubuntu xenial/main amd64 sqlite3 amd64 3.11.0-1ubuntu1 [515 kB]
Fetched 515 kB in 8s (62.2 kB/s) 
Selecting previously unselected package sqlite3.
(Reading database ... 282659 files and directories currently installed.)
Preparing to unpack .../sqlite3_3.11.0-1ubuntu1_amd64.deb ...
Unpacking sqlite3 (3.11.0-1ubuntu1) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up sqlite3 (3.11.0-1ubuntu1) ...
Then we continue:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ $TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl Trinotate
-- Skipping CMD: wget "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz", checkpoint exists.
-- Skipping CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_swissprot_parser.pl uniprot_sprot.dat.gz Trinotate, checkpoint exists.
-- Skipping CMD: mv uniprot_sprot.dat.gz.pep uniprot_sprot.pep, checkpoint exists.
-- Skipping CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --create, checkpoint exists.
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --uniprot_index Trinotate.UniprotIndex
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import Trinotate.UniprotIndex UniprotIndex" | sqlite3 Trinotate.sqlite
memory
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --taxonomy_index Trinotate.TaxonomyIndex
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import Trinotate.TaxonomyIndex TaxonomyIndex" | sqlite3 Trinotate.sqlite
memory
* Running CMD: wget "http://eggnogdb.embl.de/download/latest/data/NOG/NOG.annotations.tsv.gz"
--2019-02-02 15:11:16--  http://eggnogdb.embl.de/download/latest/data/NOG/NOG.annotations.tsv.gz
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 1911409 (1.8M) [application/octet-stream]
Saving to: ‘NOG.annotations.tsv.gz’

NOG.annotations.tsv.gz              100%[==================================================================>]   1.82M   236KB/s    in 7.9s    

2019-02-02 15:11:26 (236 KB/s) - ‘NOG.annotations.tsv.gz’ saved [1911409/1911409]

* Running CMD: gunzip -c NOG.annotations.tsv.gz | /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/print.pl 1 5 > NOG.annotations.tsv.gz.bulk_load
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --eggnog NOG.annotations.tsv.gz.bulk_load
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import NOG.annotations.tsv.gz.bulk_load eggNOGIndex" | sqlite3 Trinotate.sqlite
memory
* Running CMD: wget "http://purl.obolibrary.org/obo/go/go-basic.obo"
--2019-02-02 15:11:27--  http://purl.obolibrary.org/obo/go/go-basic.obo
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 302 Found
Location: http://snapshot.geneontology.org/ontology/go-basic.obo [following]
--2019-02-02 15:11:29--  http://snapshot.geneontology.org/ontology/go-basic.obo
Reusing existing connection to 127.0.0.1:8118.
Proxy request sent, awaiting response... 200 OK
Length: 31348362 (30M) [text/obo]
Saving to: ‘go-basic.obo’

go-basic.obo                        100%[==================================================================>]  29.90M  4.98MB/s    in 5.1s    

2019-02-02 15:11:37 (5.88 MB/s) - ‘go-basic.obo’ saved [31348362/31348362]

* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/obo_to_tab.pl go-basic.obo > go-basic.obo.tab
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --go_obo_tab go-basic.obo.tab
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/obo_tab_to_sqlite_db.pl Trinotate.sqlite go-basic.obo.tab
[47000]   

done.

* Running CMD: wget "ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz"
--2019-02-02 15:11:38--  ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
           => ‘Pfam-A.hmm.gz’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.192.4
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.192.4|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/Pfam/current_release ... done.
==> SIZE Pfam-A.hmm.gz ... 270995712
==> PASV ... done.    ==> RETR Pfam-A.hmm.gz ... done.
Length: 270995712 (258M) (unauthoritative)

Pfam-A.hmm.gz                       100%[==================================================================>] 258.44M   239KB/s    in 33m 56s 

2019-02-02 15:45:41 (130 KB/s) - ‘Pfam-A.hmm.gz’ saved [270995712]

* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/PFAM_dat_parser.pl Pfam-A.hmm.gz
[17900]  * Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --pfam Pfam-A.hmm.gz.pfam_sqlite_bulk_load
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import Pfam-A.hmm.gz.pfam_sqlite_bulk_load PFAMreference" | sqlite3 Trinotate.sqlite
memory
* Running CMD: wget "http://www.geneontology.org/external2go/pfam2go" 
--2019-02-02 15:45:49--  http://www.geneontology.org/external2go/pfam2go
Connecting to 127.0.0.1:8118... connected.
Proxy request sent, awaiting response... 200 OK
Length: 700762 (684K) [text/plain]
Saving to: ‘pfam2go’

pfam2go                             100%[==================================================================>] 684.34K   436KB/s    in 1.6s    

2019-02-02 15:45:52 (436 KB/s) - ‘pfam2go’ saved [700762/700762]

* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/PFAMtoGoParser.pl pfam2go > pfam2go.tab
* Running CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/admin/util/EMBL_dat_to_Trinotate_sqlite_resourceDB.pl --sqlite Trinotate.sqlite --pfam2go pfam2go.tab
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import pfam2go.tab pfam2go" | sqlite3 Trinotate.sqlite
memory
At this point the build is complete.
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ l -alt
total 1417012
drwxrwxr-x 11 yeyuntian yeyuntian      4096 2月   2 15:45 ./
-rw-r--r--  1 yeyuntian yeyuntian 366116864 2月   2 15:45 Trinotate.sqlite
-rw-rw-r--  1 yeyuntian yeyuntian 270995712 2月   2 15:45 Pfam-A.hmm.gz
-rw-rw-r--  1 yeyuntian yeyuntian 237871496 2月   1 19:11 uniprot_sprot.pep
-rw-rw-r--  1 yeyuntian yeyuntian 575923689 2月   1 18:57 uniprot_sprot.dat.gz
These four files are the downloaded database files.
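Next comes the alignment step.
We start with BLAST. Its installation is covered in another article, "Installing and Configuring BLAST+ on Ubuntu", so please refer to that. First build a protein BLAST database from uniprot_sprot.pep: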
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ makeblastdb -in uniprot_sprot.pep -dbtype prot 


Building a new DB, current time: 02/02/2019 16:13:53
New DB name:   /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/uniprot_sprot.pep
New DB title:  uniprot_sprot.pep
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 559077 sequences in 12.2485 seconds.
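Next we prepare the HMMER database. For installing the software, see my other article:

Installing HMMER on Ubuntu and Mining Gene Family Members

Again we build a database. Pfam-A.hmm.gz has to be decompressed (for example with gunzip) before hmmpress can index Pfam-A.hmm: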
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ hmmpress Pfam-A.hmm 
Working...    done.
Pressed and indexed 17929 HMMs (17929 names and 17929 accessions).
Models pressed into binary file:   Pfam-A.hmm.h3m
SSI index for binary model file:   Pfam-A.hmm.h3i
Profiles (MSV part) pressed into:  Pfam-A.hmm.h3f
Profiles (remainder) pressed into: Pfam-A.hmm.h3p
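hmmpress writes the four index files listed above (.h3m, .h3i, .h3f, .h3p). A minimal sketch of a guard that skips the step when those files already exist follows; hmm_is_pressed is a helper name made up for this example.

```shell
#!/bin/sh
# hmm_is_pressed HMMFILE: succeed only if all four hmmpress index files
# exist and are non-empty. (Hypothetical helper for this illustration.)
hmm_is_pressed() {
    for ext in h3m h3i h3f h3p; do
        [ -s "$1.$ext" ] || return 1
    done
    return 0
}

hmm=Pfam-A.hmm
if hmm_is_pressed "$hmm"; then
    echo "$hmm is already pressed; skipping hmmpress."
elif [ -f "$hmm" ] && command -v hmmpress >/dev/null 2>&1; then
    hmmpress "$hmm"
else
    echo "$hmm or hmmpress is not available; nothing done."
fi
```

Checking for the outputs rather than rerunning keeps the script safe to call repeatedly.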
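Now we move on. In "Transcriptome Analysis in Practice, Part 2: De Novo Transcriptome Assembly without a Reference Genome" we obtained the protein sequences encoded by the transcripts. That file is used here: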
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ l -alt
total 4463240
-rw-rw-r--  1 yeyuntian yeyuntian   56068696 2月   2 20:08 Trinity.fasta.transdecoder.pep

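Sequence alignment

The next task is to run the searches and predictions with the BLAST+ and HMMER installations configured earlier (the two tools use different algorithms).
This takes a long time, so we recommend running it on a server inside a screen session.
The commands are given below; on my machine they took nearly two days to finish.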
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ blastx -query Trinity.fasta -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ blastp -query transdecoder.pep -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastp.outfmt6
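Both commands take a -num_threads parameter, which sets the number of CPU threads used for the search. The number of cores currently available can be obtained with: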
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ nproc
4
This is on my laptop, of course; a server will have more.
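The commands above request 8 threads while nproc reports 4 on this laptop. Asking for more threads than there are cores usually brings no benefit, so one option is to cap the request. The sketch below only prints the resulting blastx command instead of running it; available_cores and cap_threads are helper names invented for this example.

```shell
#!/bin/sh
# available_cores: print the number of usable cores, falling back to 1
# when nproc is not installed.
available_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        echo 1
    fi
}

# cap_threads WANTED AVAILABLE: print the smaller of the two integers.
cap_threads() {
    if [ "$1" -le "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

threads=$(cap_threads 8 "$(available_cores)")
# Print the command for inspection; remove the echo and quotes to run it.
echo "blastx -query Trinity.fasta -db uniprot_sprot.pep -num_threads $threads -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6"
```

The same capped value could be passed to blastp -num_threads and hmmscan --cpu.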
The two commands above use the BLAST alignment algorithm. The other search uses HMMER, which is based on hidden Markov models.
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ hmmscan --cpu 12 --domtblout TrinotatePFAM.out Pfam-A.hmm transdecoder.pep > pfam.log
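These runs produce three result files. The Trinotate guide also uses other software to predict signal peptides and transmembrane domains, which we do not demonstrate here. The file names in the commands above differ from those in this run's listings (the protein file is Trinity.fasta.transdecoder.pep and the Pfam table was saved as Trinit_TrinotatePFAM.out), so adjust them to your own files.
The results are as follows: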
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ ll -alt 
total 5636108
-rw-rw-r--  1 yeyuntian yeyuntian   10296365 2月   6 14:37 blastx.outfmt6
drwxrwxr-x 11 yeyuntian yeyuntian       4096 2月   6 14:37 ./
-rw-rw-r--  1 yeyuntian yeyuntian    7231505 2月   6 14:37 blastp.outfmt6
-rw-rw-r--  1 yeyuntian yeyuntian  770576424 2月   6 14:35 pfam.log
-rw-rw-r--  1 yeyuntian yeyuntian  168149360 2月   6 14:29 Trinit_TrinotatePFAM.out
drwxrwxr-x  4 yeyuntian yeyuntian       4096 2月   3 11:15 ../
With that done, we need to merge these results.
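Before merging, a quick sanity check of the BLAST tables can be useful. In the default -outfmt 6 layout, column 1 is the query ID and column 11 the e-value. The sketch below counts distinct queries with at least one hit under a threshold; count_queries_with_hits is a made-up helper name, and the three-row table is invented data used only to show the behaviour.

```shell
#!/bin/sh
# count_queries_with_hits FILE [MAX_EVALUE]: number of distinct query IDs
# (column 1) with at least one hit whose e-value (column 11) is <= MAX_EVALUE.
# (Hypothetical helper; the default threshold mirrors -evalue 1e-3.)
count_queries_with_hits() {
    awk -F '\t' -v max="${2:-1e-3}" '
        ($11 + 0) <= (max + 0) && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Small invented table in -outfmt 6 layout, for demonstration only.
demo=$(mktemp)
printf 'tx1\tsp|P1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n' >  "$demo"
printf 'tx1\tsp|P2\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t150\n' >> "$demo"
printf 'tx2\tsp|P3\t40.0\t50\t30\t0\t1\t150\t1\t50\t5e-4\t40\n'     >> "$demo"
count_queries_with_hits "$demo"        # prints 2 (tx1 and tx2 pass 1e-3)
count_queries_with_hits "$demo" 1e-5   # prints 1 (only tx1 passes 1e-5)
rm -f "$demo"
```

For the real data one would call it as count_queries_with_hits blastx.outfmt6 or count_queries_with_hits blastp.outfmt6.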

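Importing into the SQL database

We now have many results that need to be managed in one place,
so we load them into the Trinotate SQLite database to merge them.
First, get the help text: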
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate -h


   usage: Trinotate <sqlite.db> <command> <input> [...]


     <commands>: 

         * Initial import of transcriptome and protein data:

             init --gene_trans_map <file> --transcript_fasta <file> --transdecoder_pep <file>

         * Transdecoder protein search results:

             LOAD_swissprot_blastp <file>
             LOAD_pfam <file>
             LOAD_tmhmm <file>
             LOAD_signalp <file>

          * Trinity transcript search results:

             LOAD_swissprot_blastx <file>
             LOAD_rnammer <file>
             

          * Load custom blast results using any searchable database


             LOAD_custom_blast --outfmt6 <file> --prog <blastp|blastx> --dbtype <database_name>


          * report generation:

             report [ -E (default: 1e-5) ] [--pfam_cutoff DNC|DGC|DTC|SNC|SGC|STC (default: DNC=domain noise cutoff)]

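So each of the following results has to be loaded into the SQLite database, which in this work is Trinotate.sqlite:
  1. Transcript and protein data
  2. BLAST results (both the blastp protein search and the blastx transcript search)
  3. Pfam results from the HMMER search
The commands below do this (keep in mind that Trinotate here is essentially a tool for loading data into the database).
Importing the protein and transcript data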

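Note the three parameters. The script is run with perl, Trinotate.sqlite is the database to load into, and init marks this as the initial import:
  --gene_trans_map    the transcript-to-gene mapping file
  --transcript_fasta  the transcript FASTA file
  --transdecoder_pep  the protein sequences encoded by the transcripts

yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate Trinotate.sqlite init \
    --gene_trans_map Trinity.fasta.gene_trans_map \
    --transcript_fasta Trinity.fasta \
    --transdecoder_pep Trinity.fasta.transdecoder.pep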
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/util/trinotateSeqLoader/TrinotateSeqLoader.pl --sqlite Trinotate.sqlite --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta Trinity.fasta --transdecoder_pep Trinity.fasta.transdecoder.pep --bulk_load
-parsing gene/trans map file.... done.
-loading Transcripts.
[220400]   
done.
-loading ORFs.
[210000]   
done.

CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import tmp.Transcript.bulk_load Transcript" | sqlite3 Trinotate.sqlite
memory
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import tmp.ORF.bulk_load ORF" | sqlite3 Trinotate.sqlite
memory


Loading complete..
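Next, import the BLAST results. Trinotate is the tool being called, Trinotate.sqlite is the database, and LOAD_swissprot_blastp loads the blastp results:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6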
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/util/trinotateSeqLoader/Trinotate_BLAST_loader.pl --sqlite Trinotate.sqlite --outfmt6 blastp.outfmt6 --prog blastp --dbtype Swissprot
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import tmp.blast_bulk_load.15830 BlastDbase" | sqlite3 Trinotate.sqlite
memory

BlastDbase loading complete..
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6 
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/util/trinotateSeqLoader/Trinotate_BLAST_loader.pl --sqlite Trinotate.sqlite --outfmt6 blastx.outfmt6 --prog blastx --dbtype Swissprot
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import tmp.blast_bulk_load.16407 BlastDbase" | sqlite3 Trinotate.sqlite
memory


BlastDbase loading complete..
Finally, import the HMMER results:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate Trinotate.sqlite LOAD_pfam Trinit_TrinotatePFAM.out 
CMD: /home/yeyuntian/Biosoft/Trinotate-Trinotate-v3.1.1/util/trinotateSeqLoader/Trinotate_PFAM_loader.pl --sqlite Trinotate.sqlite --pfam Trinit_TrinotatePFAM.out
CMD: echo "pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=4000000;
.mode tabs
.import tmp.pfam_bulk_load.16631 HMMERDbase" | sqlite3 Trinotate.sqlite
memory


Loading complete..
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ ll *.sq*
-rw-r--r-- 1 yeyuntian yeyuntian 926683136 2月   6 20:21 Trinotate.sqlite
All of the earlier results are now loaded into the Trinotate.sqlite file.
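The four load steps shown in this section can be collected into one script. The following is a sketch under the assumption that the file names match those used in this run; run and load_all are helper names invented here, and the example is executed in dry-run mode so that it only prints the commands.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
# (run and load_all are hypothetical helpers, not part of Trinotate.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# load_all DB: the four load steps from this section, stopping at the
# first step that fails. File names are the ones used in this run.
load_all() {
    db=$1
    run perl Trinotate "$db" init \
        --gene_trans_map Trinity.fasta.gene_trans_map \
        --transcript_fasta Trinity.fasta \
        --transdecoder_pep Trinity.fasta.transdecoder.pep &&
    run perl Trinotate "$db" LOAD_swissprot_blastp blastp.outfmt6 &&
    run perl Trinotate "$db" LOAD_swissprot_blastx blastx.outfmt6 &&
    run perl Trinotate "$db" LOAD_pfam Trinit_TrinotatePFAM.out
}

# Dry run: print the commands without touching the database.
DRY_RUN=1
load_all Trinotate.sqlite
```

Setting DRY_RUN=0 would execute the steps for real from the Trinotate directory.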
We will use this database elsewhere later, and a report can also be generated from it directly with the following command:
yeyuntian@yeyuntian-RESCUER-R720-15IKBN:~/Biosoft/Trinotate-Trinotate-v3.1.1$ perl Trinotate Trinotate.sqlite report > database_report.xls 
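Despite the .xls extension, the report is written to standard output as plain text, so it can be inspected with ordinary shell tools. The sketch below assumes the report is tab-delimited with one header line and that a lone "." marks a missing annotation; both are assumptions to verify against your own file. list_columns and count_filled are helper names invented for this example, and column 3 is only an example index.

```shell
#!/bin/sh
# list_columns FILE: print the header fields of a tab-delimited file, numbered.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

# count_filled FILE COL: count data rows whose column COL is not the
# placeholder "." (assumed here to mark a missing annotation).
count_filled() {
    awk -F '\t' -v col="$2" 'NR > 1 && $col != "." { n++ } END { print n + 0 }' "$1"
}

report=database_report.xls
if [ -s "$report" ]; then
    list_columns "$report"
    # Column 3 is an example; pick the index from the list printed above.
    count_filled "$report" 3
else
    echo "$report not found; generate the report first."
fi
```

Listing the header first avoids hard-coding column positions that may differ between Trinotate versions.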

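Summary:

1. We used Trinotate to integrate the BLAST and HMMER predictions. Note that Trinotate, a Perl script, serves here as a tool for loading data into the database and viewing it; the truly time-consuming part is the BLAST and HMMER searches.

2. We can also see that managing a SQL database is very important when integrating large data sets.

3. This SQL database can likewise be used from other scripts.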
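As one illustration of using the database from another script, the sketch below counts the rows in a few of its tables with the sqlite3 command-line shell. The table names are taken from the .import lines in the logs of this article; count_sql is a helper name invented here, and when the database or sqlite3 is absent the script only prints the queries.

```shell
#!/bin/sh
# count_sql TABLE: print an SQL statement that counts the rows of TABLE.
# (Hypothetical helper written for this illustration.)
count_sql() {
    printf 'SELECT COUNT(*) FROM %s;\n' "$1"
}

db=Trinotate.sqlite
# Table names as they appear in the .import lines of the logs in this article.
for table in Transcript ORF BlastDbase HMMERDbase; do
    if [ -f "$db" ] && command -v sqlite3 >/dev/null 2>&1; then
        printf '%s\t' "$table"
        sqlite3 "$db" "$(count_sql "$table")"
    else
        count_sql "$table"   # fall back to showing the query only
    fi
done
```

Row counts that stay at zero after loading would suggest that a load step was skipped or pointed at the wrong file.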
