Paper: Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks
Summary: taxonomic classification of viral genomes from metagenomes using gene-sharing networks
Journal: Nature Biotechnology
Year: 2019
bitbucket: https://bitbucket.org/MAVERICLab/vcontact2/wiki/Home
Installation
conda create -n vcontact2 python=3
conda activate vcontact2
conda install -y -c bioconda vcontact2
conda install -y -c bioconda mcl blast diamond
# -y, --yes Do not ask for confirmation.
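Get the ClusterONE dependency
# Download the clustering tool and move it into the conda bin directory (it can also be downloaded on Windows and copied over)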
wget -c http://www.paccanarolab.org/static_content/clusterone/cluster_one-1.0.jar
java -jar cluster_one-1.0.jar -h
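Before going further, it can help to confirm that the external tools installed above are visible in the active environment. The following is a minimal sketch; `check_tools` is a helper name introduced here for illustration and is not part of vcontact2.

```shell
# check_tools: print every tool from the argument list that is NOT on PATH.
# Returns 0 when all tools are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools installed in the steps above (java is needed to run the ClusterONE jar).
check_tools vcontact2 mcl blastp diamond java || echo "install the tools listed above before continuing"
```

The check only looks at PATH, so it should be run after `conda activate vcontact2`.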
Inspect the databases:
Installing vcontact2 also installs the reference databases.
ll -alh /route/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/data
Count the protein sequences:
zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | wc -l
268145
zcat ViralRefSeq-prokaryotes-v88.faa.gz | grep '^>' | wc -l
230992
zcat ViralRefSeq-prokaryotes-v85.faa.gz | grep '^>' | wc -l
231165
zcat ViralRefSeq-prokaryotes-v201.faa.gz | grep '^>' | wc -l
363514
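The four counts above can also be produced in one pass. This is a small sketch that assumes gzip-compressed protein FASTA files; `count_proteins` is a hypothetical helper name.

```shell
# count_proteins: print "<file><TAB><number of FASTA records>" for each gzipped FASTA file given.
count_proteins() {
    for f in "$@"; do
        # gzip -dc behaves like zcat; grep -c counts header lines starting with '>'.
        # "|| true" keeps the loop going when a file has no records (grep exits 1 on zero matches).
        n=$(gzip -dc "$f" | grep -c '^>' || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Usage, inside the vcontact2 data directory shown above:
# count_proteins ViralRefSeq-prokaryotes-v*.faa.gz
```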
Protein headers
zcat ViralRefSeq-prokaryotes-v94.faa.gz | grep '^>' | head
>NP_037662.1 terminase small subunit [Escherichia virus HK022]
>NP_037663.1 terminase large subunit [Escherichia virus HK022]
>NP_037664.1 head portal protein [Escherichia virus HK022]
>NP_037665.1 head maturation protease [Escherichia virus HK022]
>NP_037666.1 major capsid subunit precursor [Escherichia virus HK022]
>NP_037667.1 gp6 [Escherichia virus HK022]
Taxonomy annotation
less -S ViralRefSeq-prokaryotes-v94.Merged-reference.csv | head
Organism/Name,origin,order,family,subfamily,genus
Acholeplasma virus L2,RefSeq-94,,Plasmaviridae,,Plasmavirus
Acholeplasma virus MV-L51,RefSeq-94,,Inoviridae,,Plectrovirus
less -S ViralRefSeq-prokaryotes-v94.protein2contig.csv | head
protein_id,contig_id,keywords
NP_955551.1,Acholeplasma virus L2,envelope protein
NP_040808.1,Acholeplasma virus L2,envelope protein
NP_040809.1,Acholeplasma virus L2,hypothetical protein L2_02
NP_040810.1,Acholeplasma virus L2,hypothetical protein L2_03
NP_040811.1,Acholeplasma virus L2,hypothetical protein L2_04
NP_040812.1,Acholeplasma virus L2,hypothetical protein L2_05
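Since the mapping file has one row per protein, grouping it by `contig_id` gives the number of proteins per reference genome. A minimal sketch follows; `proteins_per_genome` is a hypothetical helper, and it assumes that `contig_id` contains no embedded commas or quotes, because it uses a plain awk split rather than a real CSV parser.

```shell
# proteins_per_genome: read a protein_id,contig_id,keywords CSV on stdin and
# print "<count><TAB><contig_id>", largest count first.
# Assumption: contig_id (column 2) has no embedded commas or quotes.
proteins_per_genome() {
    awk -F',' 'NR > 1 { n[$2]++ } END { for (g in n) printf "%d\t%s\n", n[g], g }' |
        sort -t "$(printf '\t')" -k1,1nr -k2,2
}

# Usage:
# proteins_per_genome < ViralRefSeq-prokaryotes-v94.protein2contig.csv | head
```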
vcontact2 options
vcontact2 --help
--raw-proteins
FASTA-formatted proteins file
--proteins-fp
A file linking the protein name and the genome names (csv or tsv)
--rel-mode
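{BLASTP,Diamond,MMSeqs2} Protein alignment method used to compute protein sequence similarity
--pcs-mode
{ClusterONE,MCL} Protein clustering method
--vcs-mode
{ClusterONE,MCL} Viral cluster (genome) clustering method
--c1-bin
Path to "cluster_one-1.0.jar"
--db
Reference database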
{None,
ProkaryoticViralRefSeq85-ICTV [default],
ProkaryoticViralRefSeq85-Merged,
ProkaryoticViralRefSeq88-Merged,
ProkaryoticViralRefSeq94-Merged,
ProkaryoticViralRefSeq97-Merged,
ProkaryoticViralRefSeq201-Merged,
ArchaeaViralRefSeq85-Merged,
ArchaeaViralRefSeq94-Merged,
ArchaeaViralRefSeq97-Merged,
ArchaeaViralRefSeq201-Merged}
Recommended options:
vcontact2 --raw-proteins [proteins file] \
--rel-mode 'Diamond' \
--proteins-fp [gene-to-genome mapping file] \
--db 'ProkaryoticViralRefSeq94-Merged' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin [path to ClusterONE] \
--output-dir [target output directory]
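Input data preparation:
1 Predict protein sequences with prodigal
Extract the IDs and sequences of medium-quality contigs
# IDs of contigs rated Medium-quality by CheckV (column 8 of quality_summary.tsv)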
cat quality_summary.tsv | awk -F"\t" '{if($8=="Medium-quality") print $1}' > medium_more.contigs
# Sequences of those contigs; IDs are matched exactly and multi-line FASTA records are kept whole
awk 'NR == FNR { keep[$1]; next } /^>/ { p = (substr($1, 2) in keep) } p' medium_more.contigs combined.fna > medium_more.fna
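A quick consistency check after the extraction: the number of FASTA records in medium_more.fna should equal the number of IDs in medium_more.contigs. More sequences than IDs usually means an ID matched more than one header (for example a prefix of a longer ID) or the output file was appended to twice. The sketch below uses a hypothetical helper name, `check_extraction`.

```shell
# check_extraction IDS FASTA: compare the number of IDs with the number of FASTA records.
check_extraction() {
    n_ids=$(grep -c . "$1" || true)      # non-empty lines in the ID list
    n_seqs=$(grep -c '^>' "$2" || true)  # FASTA header lines
    if [ "$n_ids" -eq "$n_seqs" ]; then
        echo "OK: $n_seqs sequences"
    else
        echo "MISMATCH: $n_ids IDs but $n_seqs sequences"
        return 1
    fi
}

# Usage:
# check_extraction medium_more.contigs medium_more.fna
```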
Gene prediction and translation
prodigal \
-a ./prodigal/out.faa \
-d ./prodigal/out.fna \
-f gff \
-g 11 \
-o ./prodigal/out.gff \
-p single \
-s ./prodigal/out.stat \
-i ./checkv/output_sop/medium_more.fna
2 Prepare the gene2genome file
conda activate vcontact2
vcontact2_gene2genome \
--proteins out.faa \
--output out_map.csv \
--source-type Prodigal-FAA
The output file must have a .csv suffix; otherwise later steps fail.
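For Prodigal protein FASTA input, each protein ID is the contig ID followed by an underscore and the gene index, so the mapping can be derived by stripping that suffix. The sketch below illustrates the idea with awk. It is not the implementation of vcontact2_gene2genome; the helper name `faa_to_gene2genome` and the "None" keyword placeholder are assumptions made for illustration.

```shell
# faa_to_gene2genome: read a Prodigal protein FASTA on stdin and print a
# protein_id,contig_id,keywords table.
# Relies on Prodigal's naming scheme: protein ID = <contig ID>_<gene index>.
faa_to_gene2genome() {
    awk 'BEGIN { print "protein_id,contig_id,keywords" }
         /^>/ {
             pid = substr($1, 2)            # header up to the first space, without ">"
             contig = pid
             sub(/_[0-9]+$/, "", contig)    # drop the trailing gene index
             print pid "," contig ",None"   # "None" is a placeholder keyword (assumption)
         }'
}

# Usage:
# faa_to_gene2genome < out.faa | head
```

Comparing a few lines of this output with out_map.csv is an easy way to see whether the contig IDs were split as expected.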
Options:
--source-type
{VIRSorter,Prodigal-coords,Prodigal-FAA,MetaGeneMark,NCBICodingSequence,NCBIFasta}
Log:
vcontact2_gene2genome:174: DeprecationWarning: 'U' mode is deprecated
with open(results.proteins, 'rU') as proteins_fh:
Result: out_map.csv, with the columns protein_id,contig_id,keywords
3 Run vcontact2 to obtain protein clusters (PCs) and viral clusters (VCs)
vcontact2 \
--rel-mode 'Diamond' \
--pcs-mode MCL \
--vcs-mode ClusterONE \
--c1-bin /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/ \
--db 'ProkaryoticViralRefSeq94-Merged' \
--verbose --threads 4 \
--raw-proteins ./prodigal/out.faa \
--proteins-fp ./prodigal/out_map.csv \
--output-dir ./vcontact2/
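Before launching a long run, it may be worth confirming that every protein in the FASTA file has a row in the mapping file. The following is a sketch under the assumption that protein IDs contain no commas; `unmapped_proteins` is a hypothetical helper name.

```shell
# unmapped_proteins FAA MAP: print protein IDs that are in the FASTA but missing from the mapping CSV.
unmapped_proteins() {
    awk -F',' '
        NR == FNR { if (FNR > 1) mapped[$1]; next }   # first file: mapping CSV, header skipped
        /^>/ {
            split(substr($0, 2), h, " ")              # ID = header text up to the first space
            if (!(h[1] in mapped)) print h[1]
        }' "$2" "$1"
}

# Usage (no output means every protein is mapped):
# unmapped_proteins ./prodigal/out.faa ./prodigal/out_map.csv
```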
Log:
============================This is vConTACT2 0.9.19
----------------------------------Pre-Analysis
INFO:vcontact2: Found Diamond
INFO:vcontact2: Found MCL
INFO:vcontact2: Identified 4 CPUs
INFO:vcontact2: Using reference database: ProkaryoticViralRefSeq94-Merged
INFO:vcontact2: Using existing directory ./vcontact2/
------------------------------Reference databases
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user sequences...
INFO:vcontact2: Creating Diamond database and running Diamond...
INFO:vcontact2.protein_clusters: Creating Diamond database...
INFO:vcontact2.protein_clusters: Running Diamond...
-------------------------------Protein clustering
INFO:vcontact2: Loading proteins...
INFO:vcontact2: Merging ProkaryoticViralRefSeq94-Merged to user gene-to-genome mapping...
DEBUG:vcontact2: Read 268201 proteins from ./prodigal/out_map.csv.
DEBUG:vcontact2.protein_clusters: Generating abc file...
DEBUG:vcontact2.protein_clusters: Running MCL...
INFO:vcontact2: Building the cluster and profiles
INFO:vcontact2: Saving intermediate files...
----------------------------------Loading data
DEBUG:vcontact2: Read 2617 entries from ./vcontact2/vConTACT2_contigs.csv
INFO:vcontact2: Read 232886 entries (dropped 2328 singletons)
--------------------------------Adding Taxonomy
------------------------Calculating Similarity Networks
DEBUG:vcontact2.pcprofiles: Hypergeometric contig-similarity network:
DEBUG:vcontact2.pcprofiles: 21269 PCs present in strictly more than 3 contigs
DEBUG:vcontact2.pcprofiles: Hypergeometric PCs-similarity network
DEBUG:vcontact2: Network Contigs
------------------------Contig Clustering & Affiliation
DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Saving network in file ./vcontact2/c1.ntw (9513ines).
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
INFO:vcontact2.contig_clusters: Running clusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE results are being saved to ./vcontact2/c1.clusters.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
ERROR:vcontact2: Error in viral clusters
ERROR:vcontact2: type object 'object' has no attribute 'dtype'
Traceback (most recent call last):
File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/bin/vcontact2", line 637, in main
vc = vcontact2.cluster_refinements.ViralClusters(gc.contigs, profiles_fp, optimize=options.optimize)
File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/vcontact2/cluster_refinements.py", line 37, in __in
self.metrics = pd.DataFrame(columns=summary_headers)
File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/frame.py", line 411, in __init__
mgr = init_dict(data, index, columns, dtype=dtype)
File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 242, i
val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
File "/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/miniconda3/envs/vcontact2/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1221, in construc
dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'
debug round 1
Updated pandas; conda automatically downgraded vcontact2, and the run still failed.
conda install pandas=1.2.3
Downloading and Extracting Packages
certifi-2021.10.8 | 145 KB | ####################################### | 100%
pandas-1.2.3 | 12.1 MB | ####################################### | 100%
vcontact2-0.9.15 | 98.0 MB | ####################################### | 100%
openssl-3.0.0 | 2.9 MB | ####################################### | 100%
ca-certificates-2021 | 139 KB | ####################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
# Run again
# vConTACT2 0.9.13
ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'
debug round 2
vcontact2 stayed at 0.9.15; the update did not take effect, and only dependencies such as numpy were updated.
conda update vcontact2
Downloading and Extracting Packages
numpy-1.21.2 | 6.2 MB | ####################################### | 100%
conda list
vcontact2 0.9.15 py_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
pandas 1.2.3 py38h51da96c_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy 1.21.2 py38he2449b9_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ERROR:vcontact2: Error in contig clustering
ERROR:vcontact2: 'DataFrame' object has no attribute 'ix'
AttributeError: 'DataFrame' object has no attribute 'ix'
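The versions that matter for these errors are those of vcontact2, pandas, and numpy. A small sketch for pulling a single version out of `conda list` output is shown here; `pkg_version` is a hypothetical helper name. Comment lines in the listing are skipped naturally because their first field is "#".

```shell
# pkg_version NAME: read `conda list` output on stdin and print the version of package NAME.
pkg_version() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage, inside the activated environment:
# conda list | pkg_version pandas
# conda list | pkg_version numpy
```

Recording these three values next to each failed run makes it easier to see which combination finally works.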
debug round 3
conda list
pandas 0.25.3
numpy 1.21.2
conda install numpy=1.19.5
------------------------Contig Clustering & Affiliation-------------------------
DEBUG:vcontact2.contig_clusters: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.contig_clusters: Exporting for ClusterONE
DEBUG:vcontact2.contig_clusters: Network file already exist.
INFO:vcontact2.contig_clusters: Clustering the PC Similarity-Network using ClusterONE
DEBUG:vcontact2.contig_clusters: ClusterONE file ./vcontact2/c1.clusters already exist.
INFO:vcontact2.contig_clusters: 346 clusters loaded (singletons and non-connected nodes are dropped).
INFO:vcontact2.contig_clusters: Computing membership matrix...
DEBUG:vcontact2.cluster_refinements: 3 taxonomic levels detected: genus, order, family
INFO:vcontact2.cluster_refinements: Optimizing on distance: 9
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements: Identified a single best composite score 2.761417223726828 for distance 9
INFO:vcontact2.cluster_refinements: Merging optimal distance determined from performance evaluations.
DEBUG:vcontact2.evaluations: 3 taxonomic levels detected: order, family, genus
INFO:vcontact2.evaluations: Performance evaluations at the order level...
INFO:vcontact2.evaluations: Performance evaluations at the family level...
INFO:vcontact2.evaluations: Performance evaluations at the genus level...
INFO:vcontact2.cluster_refinements: PPV Sensitivity Accuracy
order 1.000000 0.351764 0.593097
family 0.994381 0.630965 0.792098
genus 0.869642 0.972256 0.919519
--------------------------------Protein modules---------------------------------
DEBUG:vcontact2.modules: Filtered 0 edges according to the sig. threshold 1.0.
INFO:vcontact2.modules: Exporting the PC-network for MCL
DEBUG:vcontact2.modules: Saving network in file ./vcontact2/modules.ntwk (2292198 lines)
INFO:vcontact2.modules: Clustering the PC similarity-network
DEBUG:vcontact2.modules: MCL(5.0) results are saved in ./vcontact2/modules_mcl_5.0.clusters.
INFO:vcontact2.modules: Loading the clustering results
DEBUG:vcontact2.modules: Saving 622 modules containing 18958 protein clusters in ./vcontact2/modules_mcl_5.0_modules.pandas.
---------------------------Link modules and clusters----------------------------
INFO:vcontact2.modules: 2844 contigs-modules owning association, 50018 filtered (a contig must have 50% of the PCs to own a module).
INFO:vcontact2.modules: Linking 622 modules with 346 contigs clusters...
INFO:vcontact2.modules: Network done 346 clusters, 622 modules and 297 edges.
----------------------------Exporting results files-----------------------------
INFO:vcontact2: Identifying genomes that are not clustered (i.e. singletons, outliers and overlaps
There were 540 genomes (including refs) that were singleton, outlier or overlaps.
INFO:vcontact2: Building final summary table
INFO:vcontact2.exports.summaries: Reading edges for 2617 contigs
INFO:vcontact2.exports.summaries: Building PC array
INFO:vcontact2.exports.summaries: Calculating comparisons for back-calculations
...
INFO:vcontact2.exports.summaries: Writing viral cluster overview file...
INFO:vcontact2.exports.summaries: Examining each viral cluster and breaking it down into individual genomes...
INFO:vcontact2.exports.summaries: Writing the genome-by-genome overview file...
yesssssssssssssssss
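In the performance table of the log above, the Accuracy column matches the geometric mean of PPV and Sensitivity for all three rows. This reading is inferred from the logged numbers, not taken from the vcontact2 source code. The sketch below recomputes it with awk; `geo_mean` is a hypothetical helper name.

```shell
# geo_mean A B: geometric mean of two values, printed with six decimals.
geo_mean() {
    awk -v p="$1" -v s="$2" 'BEGIN { printf "%.6f\n", sqrt(p * s) }'
}

# PPV and Sensitivity from the log above: order, family, genus.
geo_mean 1.000000 0.351764
geo_mean 0.994381 0.630965
geo_mean 0.869642 0.972256
```

Read this way, the order-level result is precise but incomplete (high PPV, low sensitivity), while the genus-level clusters recover most known genera.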
4 Results
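The log above ends by writing a genome-by-genome overview file. To look at only a few of its columns, a header-driven selector such as the sketch below may help. The file name genome_by_genome_overview.csv and the column names Genome and VC in the usage line are assumptions about typical vcontact2 output and should be checked against the actual output directory; `select_columns` is a hypothetical helper that uses a plain awk split rather than a real CSV parser.

```shell
# select_columns "Name1,Name2,...": read a CSV with a header row on stdin and
# print only the named columns, in the order requested.
# Assumption: fields contain no embedded commas or quotes.
select_columns() {
    awk -F',' -v want="$1" '
        BEGIN { n = split(want, names, ",") }
        NR == 1 {
            for (i = 1; i <= NF; i++) pos[$i] = i          # column name -> position
            for (j = 1; j <= n; j++)
                if (!(names[j] in pos)) { print "no column: " names[j] > "/dev/stderr"; exit 1 }
        }
        {
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line "," $(pos[names[j]])
            print line
        }'
}

# Usage (file and column names are assumptions, see above):
# select_columns "Genome,VC" < ./vcontact2/genome_by_genome_overview.csv | head
```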
Further reading:
Supplementing and Colouring vConTACT2 Clusters
Applying vContact to Viral Sequences and Visualizing the Output
https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
(2017). vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ
Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Computational and Structural Biotechnology Journal. 2020
https://github.com/pandas-dev/pandas/issues/39520
if you are stuck on pandas==0.24.2 (don't ask); downgrading to numpy==1.19.5 works
THANK YOU
also works with pandas==0.25.3