欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://spike.blog.csdn.net/article/details/131934642
MMseq2 是非常强大和高效的生物信息学软件,可以在极短的时间内对大规模的核苷酸和蛋白质序列进行搜索和聚类。主要特点有:
Paper:MMseqs2: sensitive protein sequence searching for the analysis of massive data sets
GitHub:MMseqs2: ultra fast and sensitive sequence search and clustering suite
参考文档:MMseqs2 User Guide
数据库的 FASTA 文件:virus4/gmgc.fa
数据库下载,60G,参考 MSA DB - GMGC:
数据如下:
>GMGC10.033_133_404.UNKNOWN|built-environment
MKGLVVTGLTVAFGFVAYEKLTYVVDVVKHLIA
>GMGC10.026_381_531.UNKNOWN|built-environment
MIWRSGRFAAVALQRKPPVGWAGNLVQPEKLRV
>GMGC10.013_658_495.UNKNOWN|built-environment
MMNKMKSNKFFNTMGMVLSALIDAKAVATTLNK
设置 BOS 命令:
alias bos='bcecmd/bcecmd --conf-path bcecmd/bceconf/ bos'
安装 MMseqs2:
conda install -c conda-forge -c bioconda mmseqs2
版本号 13.45111:
conda list mmseqs2
# packages in environment at miniconda3/envs/torch-def:
#
# Name Version Build Channel
mmseqs2 13.45111 h95f258a_1 bioconda
配置 ~/.bashrc
,即:
if [ -f miniconda3/envs/torch-def/util/bash-completion.sh ]; then
source miniconda3/envs/torch-def/util/bash-completion.sh
fi
测试命令:mmseqs
,输出信息。
默认命令顺序:
mmseqs createdb examples/DB.fasta targetDB
mmseqs createindex targetDB tmp
mmseqs easy-search examples/QUERY.fasta targetDB alnRes.m8 tmp
构建 mmseqs 索引:
mmseqs createdb data/gmgc.fa gmgc.db # 耗时较长
mmseqs createindex gmgc.db tmp
搜索 MSA:
mmseqs easy-search QUERY.fasta gmgc.db alnRes.m8 tmp
createdb
的日志:
createdb gmgc.fa gmgc.db
MMseqs Version: 13.45111
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3
Converting sequences
[294700023] 10m 36s 322ms
Time for merging to gmgc.db_h: 0h 1m 45s 892ms
Time for merging to gmgc.db: 0h 3m 52s 379ms
Database type: Aminoacid
Time for processing: 0h 27m 40s 149ms
createdb
的输出:
gmgc.db
gmgc.db.dbtype
gmgc.db_h
gmgc.db_h.dbtype
gmgc.db_h.index
gmgc.db.index
gmgc.db.lookup
gmgc.db.source
gmgc.fa
createindex
的日志:
indexdb gmgc.db gmgc.db --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --alph-size nucl:5,aa:21 --comp-bias-corr 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score 0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 232
Target split mode. Searching through 5 splits
Estimated memory consumption: 691G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
...
Time for merging to gmgc.db.idx: 0h 0m 0s 523ms
Time for processing: 0h 27m 43s 883ms
搜索 MSA,以 T1157s1_A1029.fasta
为例,使用 mmseqs
进行搜索,即:
mmseqs easy-search T1157s1_A1029.fasta virus4/gmgc/gmgc.db alnRes.m8 tmp
输出日志,默认 -s 5.7
:
-s 1.0
, while a very sensitive search would use a sensitivity of up to -s 7.0
...
createdb T1157s1_A1029.fasta tmp/15503592239095911252/query --dbtype 0 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Converting sequences
Time for merging to query_h: 0h 0m 14s 960ms
Time for merging to query: 0h 0m 3s 422ms
Database type: Aminoacid
Time for processing: 0h 0m 22s 843ms
Create directory tmp/15503592239095911252/search_tmp
search tmp/15503592239095911252/query virus4/gmgc/gmgc.db tmp/15503592239095911252/result tmp/15503592239095911252/search_tmp --alignment-mode 3 -s 5.7 --remove-tmp-files 1
...
搜索出的结果 alnRes.m8
,包括 450 条搜索结果,如下:
A GMGC10.285_640_775.MNL1|built-environment 0.483 954 493 0 39 992 1 954 2.234E-258 832
A GMGC10.241_341_405.MNL1|soil 0.569 657 280 0 3 659 55 705 1.592E-232 756
A GMGC10.275_542_093.MNL1|human-skin 0.615 348 127 0 179 526 2 333 7.241E-130 454
A GMGC10.302_382_477.MNL1|human-gut 0.393 592 290 0 22 613 1 478 1.668E-109 393
A GMGC10.233_643_273.MNL1|soil 0.613 320 120 0 141 451 1 320 2.039E-107 387
A GMGC10.291_266_275.MNL1|human-gut 0.362 544 286 0 7 550 36 484 7.208E-96 351
A GMGC10.295_625_477.MNL1|human-skin 0.801 212 42 0 71 282 1 212 2.547E-94 347
A GMGC10.230_501_499.MNL1|soil 0.632 280 100 0 90 362 3 282 3.678E-93 343
A GMGC10.053_306_367.MNL1|human-gut 0.360 546 297 0 7 552 25 489 2.928E-92 340
A GMGC10.284_538_017.MNL1|built-environment 0.365 529 292 0 22 550 1 461 1.359E-85 320
即,搜索完成。
AF2 官网:https://github.com/deepmind/alphafold
安装下载工具:
apt-get install aria2
下载文件 small_bfd
:
mkdir small_bfd
cd small_bfd/
# 下载文件
aria2c https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz
gunzip bfd-first_non_consensus_sequences.fasta.gz