Viral taxonomic annotation with vConTACT2

Article: Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks
Summary: taxonomic annotation of viral genomes from metagenomes using gene-sharing networks
Journal: Nature Biotechnology
Year: 2019

bitbucket: https://bitbucket.org/MAVERICLab/vcontact2/wiki/Home

Installation

conda create -y -n vcontact2 python=3
conda activate vcontact2
conda install -y -c bioconda vcontact2
conda install -y -c bioconda mcl blast diamond
# -y, --yes             Do not ask for confirmation.

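Get the ClusterONE dependency

# Download the clustering software and move it into the conda bin directory (it can also be downloaded on Windows and copied over)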
wget -c http://www.paccanarolab.org/static_content/clusterone/cluster_one-1.0.jar

java -jar cluster_one-1.0.jar -h
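
Before the first run it can help to confirm that the tools installed above are actually on PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical, and the tool list in the commented usage line is an assumption based on the packages installed above.

```shell
# missing_tools TOOL...: print the name of every tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage inside the activated environment (commented out; tool list is an assumption):
# missing_tools vcontact2 diamond mcl blastp java
```

An empty output means every listed tool was found.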

Inspect the database:

Installing vcontact2 also downloads the reference databases.

ls -alh /route/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/data

Count the protein sequences:

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | wc -l
268145
zcat ViralRefSeq-prokaryotes-v88.faa.gz | grep '^>' | wc -l
230992
zcat ViralRefSeq-prokaryotes-v85.faa.gz | grep '^>' | wc -l
231165
zcat ViralRefSeq-prokaryotes-v201.faa.gz | grep '^>' | wc -l
363514
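
The four counts above can be produced in one pass. This is a minimal sketch, assuming every protein file in the data directory ends in .faa.gz; the function names are hypothetical and the path in the usage line is a placeholder.

```shell
# count_fasta_records FILE: count header lines (starting with ">") in a gzipped FASTA.
count_fasta_records() {
    gzip -dc -- "$1" | grep -c '^>'
}

# count_all_faa DIR: print "file<TAB>count" for every *.faa.gz file in DIR.
count_all_faa() {
    for f in "${1:-.}"/*.faa.gz; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(count_fasta_records "$f")"
    done
}

# Usage (commented out; the path is a placeholder):
# count_all_faa /path/to/site-packages/vcontact2/data
```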

Protein information

zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | head
>NP_037662.1 terminase small subunit [Escherichia virus HK022]
>NP_037663.1 terminase large subunit [Escherichia virus HK022]
>NP_037664.1 head portal protein [Escherichia virus HK022]
>NP_037665.1 head maturation protease [Escherichia virus HK022]
>NP_037666.1 major capsid subunit precursor [Escherichia virus HK022]
>NP_037667.1 gp6 [Escherichia virus HK022]

Annotation information

less -S ViralRefSeq-prokaryotes-v94.Merged-reference.csv | head
Organism/Name,origin,order,family,subfamily,genus
Acholeplasma virus L2,RefSeq-94,,Plasmaviridae,,Plasmavirus
Acholeplasma virus MV-L51,RefSeq-94,,Inoviridae,,Plectrovirus

Protein-to-contig mapping (columns: protein_id,contig_id,keywords)

less -S ViralRefSeq-prokaryotes-v94.protein2contig.csv | head
protein_id,contig_id,keywords
NP_955551.1,Acholeplasma virus L2,envelope protein
NP_040808.1,Acholeplasma virus L2,envelope protein
NP_040809.1,Acholeplasma virus L2,hypothetical protein L2_02
NP_040810.1,Acholeplasma virus L2,hypothetical protein L2_03
NP_040811.1,Acholeplasma virus L2,hypothetical protein L2_04
NP_040812.1,Acholeplasma virus L2,hypothetical protein L2_05

vcontact2 parameters

vcontact2 --help

--raw-proteins FASTA-formatted proteins file
--proteins-fp A file linking the protein name and the genome names (csv or tsv)
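--rel-mode {BLASTP,Diamond,MMSeqs2} Protein alignment method used to compute protein sequence similarity
--pcs-mode {ClusterONE,MCL} Protein clustering method
--vcs-mode {ClusterONE,MCL} Viral clustering method
--c1-bin Path to "cluster_one-1.0.jar"
--db Reference database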
{None,
ProkaryoticViralRefSeq85-ICTV [default],
ProkaryoticViralRefSeq85-Merged,
ProkaryoticViralRefSeq88-Merged,
ProkaryoticViralRefSeq94-Merged,
ProkaryoticViralRefSeq97-Merged,
ProkaryoticViralRefSeq201-Merged,
ArchaeaViralRefSeq85-Merged,
ArchaeaViralRefSeq94-Merged,
ArchaeaViralRefSeq97-Merged,
ArchaeaViralRefSeq201-Merged}

Recommended parameters:

vcontact2 --raw-proteins [proteins file] \
--rel-mode 'Diamond' \
--proteins-fp [gene-to-genome mapping file] \
--db 'ProkaryoticViralRefSeq94-Merged' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin [path to ClusterONE] \
--output-dir [target output directory]

Input data format:

1 Get protein sequences with Prodigal

Extract the IDs and sequences of medium-quality contigs

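# IDs of medium-quality contigs (column 8 of quality_summary.tsv; as written, only "Medium-quality" rows are kept)
awk -F"\t" '$8=="Medium-quality" {print $1}' quality_summary.tsv > medium_more.contigs
# Sequences of those contigs (assumes each record in combined.fna is one header line plus one sequence line)
while read -r i
do
    # -w avoids contig_1 also matching contig_10; -F treats the ID as a literal string
    grep -w -A 1 -F -- "$i" combined.fna >> medium_more.fna
    echo "$i done..."
done < medium_more.contigs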
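
The loop above rescans combined.fna once per ID and relies on single-line sequences. As an alternative, here is a minimal single-pass sketch in awk that matches the ID exactly against the first word of each header and also copes with wrapped (multi-line) sequences. The function name extract_fasta_by_id is hypothetical.

```shell
# extract_fasta_by_id IDS FASTA: print the FASTA records whose ID
# (first word of the header, without ">") is listed in IDS, one ID per line.
extract_fasta_by_id() {
    awk '
        FILENAME == ARGV[1] { if ($1 != "") keep[$1] = 1; next }   # first file: load the IDs
        /^>/ { id = substr($1, 2); printing = (id in keep) }       # header: keep this record or not
        printing                                                   # print header and sequence lines of kept records
    ' "$1" "$2"
}

# Usage (commented out; file names follow the commands above):
# extract_fasta_by_id medium_more.contigs combined.fna > medium_more.fna
```

Because the output is written with ">" rather than ">>", rerunning the command does not duplicate records.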

Gene prediction and translation

prodigal \
-a ./prodigal/out.faa \
-d ./prodigal/out.fna \
-f gff \
-g 11 \
-o ./prodigal/out.gff \
-p single \
-s ./prodigal/out.stat \
-i ./checkv/output_sop/medium_more.fna

2 Prepare the gene2genome file

conda activate vcontact2
vcontact2_gene2genome \
--proteins out.faa \
--output out_map.csv \
--source-type Prodigal-FAA

The output file must use the .csv extension; otherwise the downstream analysis fails with an error.
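
To make the mapping format concrete, here is a minimal sketch of how a protein_id,contig_id,keywords table could be derived from Prodigal FAA headers. It is not the implementation of vcontact2_gene2genome. It assumes Prodigal's convention of naming each protein as the contig ID plus an underscore and a gene index, and the "None" keyword value is a placeholder chosen for illustration. The function name is hypothetical.

```shell
# prodigal_faa_to_map FAA: write protein_id,contig_id,keywords rows to stdout.
prodigal_faa_to_map() {
    awk '
        BEGIN { print "protein_id,contig_id,keywords" }
        /^>/ {
            protein = substr($1, 2)            # first word of the header, e.g. contig_7_3
            contig = protein
            sub(/_[0-9]+$/, "", contig)        # strip the trailing gene index, giving contig_7
            print protein "," contig ",None"   # keyword value is an illustrative placeholder
        }
    ' "$1"
}

# Usage (commented out):
# prodigal_faa_to_map out.faa > out_map.csv
```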

Parameters:
--source-type
{VIRSorter,Prodigal-coords,Prodigal-FAA,MetaGeneMark,NCBICodingSequence,NCBIFasta}

Run log:

vcontact2_gene2genome:174: DeprecationWarning: 'U' mode is deprecated
  with open(results.proteins, 'rU') as proteins_fh:

Result:

3 Run vcontact2 to obtain protein clusters (PCs) and viral clusters (VCs)

vcontact2 \
--rel-mode 'Diamond' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/ \
--db 'ProkaryoticViralRefSeq94-Merged' \
--verbose --threads 4 \
--raw-proteins ./prodigal/out.faa \
--proteins-fp ./prodigal/out_map.csv \
--output-dir ./vcontact2/

Run log:

============================This is vConTACT2 0.9.19

----------------------------------Pre-Analysis

INFO:vcontact2: Found Diamond
INFO:vcontact2: Found MCL
INFO:vcontact2: Identified 4 CPUs
INFO:vcontact2: Using reference database: ProkaryoticViralRefSeq94-Merged
INFO:vcontact2: Using existing directory ./vcontact2/

------------------------------Reference databases

INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user sequences...
INFO:vcontact2: Creating Diamond database and running Diamond...
INFO:vcontact2.protein_clusters: Creating Diamond database...
INFO:vcontact2.protein_clusters: Running Diamond...

-------------------------------Protein clustering

INFO:vcontact2: Loading proteins...
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
DEBUG:vcontact2: Read 268201 proteins from ./prodigal/out_map.csv.
DEBUG:vcontact2.protein_clusters: Generating abc file...
DEBUG:vcontact2.protein_clusters: Running MCL...
INFO:vcontact2: Building the cluster and profiles 
INFO:vcontact2: Saving intermediate files...

----------------------------------Loading data

DEBUG:vcontact2: Read 2617 entries from ./vcontact2/vConTACT2_contigs.csv
INFO:vcontact2: Read 232886 entries (dropped 2328 singletons)

--------------------------------Adding Taxonomy

------------------------Calculating Similarity Networks

DEBUG:vcontact2.pcprofiles: Hypergeometric contig-similarity network:
DEBUG:vcontact2.pcprofiles: 21269 PCs present in strictly more than 3 contigs
DEBUG:vcontact2.pcprofiles: Hypergeometric PCs-similarity network
DEBUG:vcontact2: Network Contigs

------------------------Contig Clustering & Affiliation

DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Saving network in file ./vcontact2/c1.ntw (9513ines).
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
INFO:vcontact2.contig_clusters: Running clusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE results are being saved to ./vcontact2/c1.clusters.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
ERROR:vcontact2: Error in viral clusters
ERROR:vcontact2: type object 'object' has no attribute 'dtype'

Traceback (most recent call last):
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/bin/vcontact2", line 637, in main
    vc = vcontact2.cluster_refinements.ViralClusters(gc.contigs, profiles_fp, optimize=options.optimize)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/cluster_refinements.py", line 37, in __in
    self.metrics = pd.DataFrame(columns=summary_headers)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/frame.py", line 411, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 242, i
    val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
  File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1221, in construc
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

debug round 1

Updated pandas; conda automatically downgraded vcontact2 as a side effect, and the run still failed.

conda install pandas=1.2.3

Downloading and Extracting Packages
certifi-2021.10.8    | 145 KB    | ####################################### | 100%
pandas-1.2.3         | 12.1 MB   | ####################################### | 100%
vcontact2-0.9.15     | 98.0 MB   | ####################################### | 100%
openssl-3.0.0        | 2.9 MB    | ####################################### | 100%
ca-certificates-2021 | 139 KB    | ####################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

# Run
# vConTACT2 0.9.13
ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 2

vcontact2 stayed at 0.9.15 this time; it was not actually updated, and only dependencies such as numpy were upgraded.

conda update vcontact2

Downloading and Extracting Packages
numpy-1.21.2         | 6.2 MB    | ####################################### | 100%

conda list

vcontact2                 0.9.15                     py_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
pandas                    1.2.3            py38h51da96c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy                     1.21.2           py38he2449b9_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'

debug round 3

With pandas at 0.25.3, downgrading numpy from 1.21.2 to 1.19.5 fixed the error.

conda list
pandas                    0.25.3
numpy                     1.21.2
conda install numpy=1.19.5
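
A small guard can flag this version mismatch before a long run. The sketch below is hedged: this post only confirms that numpy 1.19.5 works and 1.21.2 fails with pandas 0.25.3, so the 1.20 cutoff is an assumption. The function names are hypothetical, and sort -V requires GNU coreutils.

```shell
# version_lt A B: succeed when version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# check_numpy_for_old_pandas NUMPY_VERSION: report whether numpy is below the assumed 1.20 cutoff.
check_numpy_for_old_pandas() {
    if version_lt "$1" "1.20"; then
        echo "numpy $1: OK for old pandas"
    else
        echo "numpy $1: consider 'conda install numpy=1.19.5'"
    fi
}

# Usage inside the activated environment (commented out):
# check_numpy_for_old_pandas "$(python -c 'import numpy; print(numpy.__version__)')"
```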

------------------------Contig Clustering & Affiliation-------------------------
DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Network file already exist.
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE file ./vcontact2/c1.clusters already exist.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
DEBUG:vcontact2.cluster_refinements: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.cluster_refinements: Optimizing on distance: 9
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements: Identified a single best composite score 2.761417223726828 for distance 9
INFO:vcontact2.cluster_refinements: Merging optimal distance determined from performance evaluations.
DEBUG:vcontact2.evaluations: 3 taxonomic levels detected: order, family, genus
INFO:vcontact2.evaluations: Performance evaluations at the order level...
INFO:vcontact2.evaluations: Performance evaluations at the family level...
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements:              PPV  Sensitivity  Accuracy
order   1.000000     0.351764  0.593097
family  0.994381     0.630965  0.792098
genus   0.869642     0.972256  0.919519

--------------------------------Protein modules---------------------------------
DEBUG:vcontact2.modules: Filtered 0 edges according to the sig. threshold 1.0.
INFO:vcontact2.modules: Exporting the PC-network for MCL
DEBUG:vcontact2.modules: Saving network in file ./vcontact2/modules.ntwk (2292198 lines)
INFO:vcontact2.modules: Clustering the PC similarity-network
DEBUG:vcontact2.modules: MCL(5.0) results are saved in ./vcontact2/modules_mcl_5.0.clusters.
INFO:vcontact2.modules: Loading the clustering results
DEBUG:vcontact2.modules: Saving 622 modules containing 18958  protein clusters in ./vcontact2/modules_mcl_5.0_modules.pandas.
---------------------------Link modules and clusters----------------------------
INFO:vcontact2.modules: 2844 contigs-modules owning association, 50018 filtered (a contig must have 50% of the PCs to own a module).
INFO:vcontact2.modules: Linking 622 modules with 346 contigs clusters...
INFO:vcontact2.modules: Network done 346 clusters, 622 modules and 297 edges.


----------------------------Exporting results files-----------------------------
INFO:vcontact2: Identifying genomes that are not clustered (i.e. singletons, outliers and overlaps
There were 540 genomes (including refs) that were singleton, outlier or overlaps.
INFO:vcontact2: Building final summary table
INFO:vcontact2.exports.summaries: Reading edges for 2617 contigs
INFO:vcontact2.exports.summaries: Building PC array
INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
...
INFO:vcontact2.exports.summaries: Writing viral cluster overview file...
INFO:vcontact2.exports.summaries: Examining each viral cluster and breaking it down into individual genomes...
INFO:vcontact2.exports.summaries: Writing the genome-by-genome overview file...

yesssssssssssssssss

4 Results
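
The run log above ends by writing a genome-by-genome overview file. A minimal sketch for tallying the values of one column in such a CSV follows. The file name genome_by_genome_overview.csv and the column name "VC Status" in the usage line are assumptions recalled from vConTACT2 output and are not shown in this post. The function name is hypothetical, and the simple comma split does not handle quoted fields that contain commas.

```shell
# count_csv_column FILE COLUMN: count how often each value occurs in the named column,
# most frequent first. Output is "count<TAB>value".
count_csv_column() {
    awk -F',' -v col="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) idx = i   # locate the column by header name
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { n[$idx]++ }
        END { for (v in n) print n[v] "\t" v }
    ' "$1" | sort -k1,1nr
}

# Usage (commented out; file and column names are assumptions):
# count_csv_column ./vcontact2/genome_by_genome_overview.csv "VC Status"
```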

Further reading:
Supplementing and Colouring vConTACT2 Clusters
Applying vContact to Viral Sequences and Visualizing the Output
https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
(2017). vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Computational and Structural Biotechnology Journal. 2020

https://github.com/pandas-dev/pandas/issues/39520

Comments on that issue report: "if you are stuck on pandas==0.24.2 (don't ask); downgrading to numpy==1.19.5 works", and that the same fix "also works with pandas==0.25.3".
