PSP - Configuring an Efficient TensorFlow Runtime Environment for AlphaFold2

Follow me on CSDN: https://spike.blog.csdn.net/
Original article: https://blog.csdn.net/caroline_wendy/article/details/130560538

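AlphaFold2, developed by DeepMind, predicts a protein's three-dimensional structure from its amino acid sequence, often with accuracy comparable to experiment. DeepMind and EMBL's European Bioinformatics Institute (EMBL-EBI) jointly created AlphaFold DB, which makes these predictions freely available to the scientific community. The latest database release contains more than 200 million protein structure predictions, including the proteomes of human and more than 20 other species. At its core, AlphaFold2 is a neural-network model: a deep learning algorithm that combines physical and biological knowledge about proteins with information from multiple sequence alignments (MSAs).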

1. Docker environment

The commands are as follows:

# Start the nvidia-docker environment
nvidia-docker run -it --name [docker-name] -v [...]:[...] [nvidia-base]:v1.0

# Install conda (Miniconda)
bash Miniconda3-py38_4.10.3-Linux-x86_64.sh
source ~/.bashrc

# Create the alphafold environment
conda create --name alphafold python==3.8
conda update -n base conda
conda activate alphafold

# Install the conda packages
conda install -y -c conda-forge openmm==7.5.1 cudatoolkit==11.2.2 pdbfixer
conda install -y -c bioconda hmmer hhsuite==3.3.0 kalign2

# Then upgrade openmm and pdbfixer
conda install -y -c conda-forge openmm==7.7.0 
conda install -y -c conda-forge pdbfixer==1.8.1

# Install the pip packages; choose tensorflow-gpu or tensorflow-cpu depending on the machine
pip install absl-py==1.0.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.9 dm-tree==0.1.6 immutabledict==2.0.0 jax==0.3.25 ml-collections==0.1.0 numpy==1.21.6 pandas==1.3.4 protobuf==3.20.1 scipy==1.7.0 tensorflow-gpu==2.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

# Install jax and the CUDA build of jaxlib with pip
pip install --upgrade --no-cache-dir jax==0.3.25 jaxlib==0.3.25+cuda11.cudnn805 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Export the AlphaFold source path (assumes the current directory is the repository root)
export alphafold_path="$(pwd)"

# Download the stereochemical properties file
wget -q -P $alphafold_path/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

# Apply openmm.patch
git checkout v2.3.1  # newer versions removed openmm.patch
cd ~/miniconda3/envs/alphafold/lib/python3.8/site-packages/
patch -p0 < $alphafold_path/docker/openmm.patch
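
Once the installs finish, it can help to confirm that the pinned versions actually landed in the environment. The following is a minimal sketch using only the Python standard library; the `PINS` subset and the helper names `installed_version` and `check_pins` are illustrative and not part of AlphaFold.

```python
from importlib import metadata

# A subset of the pins from the pip command above (illustrative, not exhaustive).
PINS = {
    "absl-py": "1.0.0",
    "biopython": "1.79",
    "dm-haiku": "0.0.9",
    "jax": "0.3.25",
    "numpy": "1.21.6",
    "tensorflow-gpu": "2.9.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed pin matches; anything printed is worth fixing before running inference.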

Test whether TensorFlow installed correctly and whether the GPU is visible:

python3  # start the interactive interpreter

import tensorflow as tf

# tf.test.is_gpu_available() is deprecated in TF 2.x; list the physical GPUs instead
print(f"gpus: {tf.config.list_physical_devices('GPU')}")
gpu_device_name = tf.test.gpu_device_name()
print(f"gpu_device_name: {gpu_device_name}")

from tensorflow.python.client import device_lib
# List all devices on the local machine
local_device_protos = device_lib.list_local_devices()
# Print only the GPU devices
for x in local_device_protos:
    if x.device_type == 'GPU':
        print(x)

Save and reuse the Docker image with the following commands:

# Save the environment
docker ps -l
docker commit [container-id] af2:v1.0
docker save af2:v1.0 | gzip > af2_v1.tar.gz

# Load the environment
docker image load -i af2_v1.tar.gz
nvidia-docker run -it --name [docker-name] -v [...]:[...] af2:v1.0

To switch TensorFlow between the CPU and GPU builds, uninstall the current one and install the other:

pip uninstall tensorflow-cpu tensorflow-estimator tensorflow-io-gcs-filesystem
pip install tensorflow-gpu==2.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple/

Bug 1: OpenMM-related errors

If you hit either of the OpenMM errors below, the fix is given before each traceback:

OpenMM error 1, fixed by: conda install -c conda-forge openmm==7.7.0

Traceback (most recent call last):
  File "run_alphafold.py", line 41, in <module>
    from alphafold.relax import relax
  File "alphafold/relax/relax.py", line 18, in <module>
    from alphafold.relax import amber_minimize
  File "alphafold/relax/amber_minimize.py", line 25, in <module>
    from alphafold.relax import cleanup
  File "alphafold/relax/cleanup.py", line 23, in <module>
    from openmm import app
ModuleNotFoundError: No module named 'openmm'

pdbfixer error 2, fixed by: conda install -c conda-forge pdbfixer==1.8.1

Traceback (most recent call last):
  File "run_alphafold.py", line 41, in <module>
    from alphafold.relax import relax
  File "alphafold/relax/relax.py", line 18, in <module>
    from alphafold.relax import amber_minimize
  File "alphafold/relax/amber_minimize.py", line 25, in <module>
    from alphafold.relax import cleanup
  File "alphafold/relax/cleanup.py", line 22, in <module>
    import pdbfixer
  File "/root/miniconda3/envs/alphafold/lib/python3.8/site-packages/pdbfixer/__init__.py", line 2, in <module>
    from .pdbfixer import PDBFixer
  File "/root/miniconda3/envs/alphafold/lib/python3.8/site-packages/pdbfixer/pdbfixer.py", line 38, in <module>
    from simtk.openmm.app.internal.pdbstructure import PdbStructure
ModuleNotFoundError: No module named 'simtk.openmm.app.internal'

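Reference: PSP - Replacing the MSA Databases, and OpenMM and mmCIF Exceptions

Bug 2: Collecting package metadata (repodata.json): / Killed

Reference: StackOverflow - Collecting package metadata (repodata.json): / Killed

The machine has too little RAM (system memory, not GPU memory). Raising it from 0.5 GB to 8 GB or more resolves the problem.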
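
To see how much system memory is available before running conda, a small sketch like the one below reads `/proc/meminfo` on Linux. The helper names `parse_meminfo` and `total_ram_gib` are illustrative. As a caveat, inside a container with a memory limit `/proc/meminfo` usually reports the host's memory, so the effective limit may be lower than the number printed.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def total_ram_gib(meminfo_text):
    """Return MemTotal in GiB (the kernel reports it in kB, i.e. KiB)."""
    return parse_meminfo(meminfo_text)["MemTotal"] / (1024 ** 2)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        print(f"Total RAM: {total_ram_gib(handle.read()):.1f} GiB")
```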

2. Configure the databases

Reference: the official GitHub repository, GitHub - deepmind/alphafold

2.1 AlphaFold2 Model

As of 2023-05-07, the latest parameter release is alphafold_params_2022-12-06.

Download commands:

mkdir params
cd params/
wget -P . https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar   # 5.2G
tar --extract --verbose --file="alphafold_params_2022-12-06.tar" --directory="." --preserve-permissions

2.2 Small BFD

Download commands:

mkdir small_bfd
cd small_bfd/
wget -P . https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz  # 9.6G
gunzip "bfd-first_non_consensus_sequences.fasta.gz"

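2.3 Database layout

Download the other databases as your project requires. Collect the databases you already have under a single data directory; symbolic links (ln -s) work well for this. The layout is:

bfd/          # multiple files sharing one prefix
mgnify/       # .fa file, 64G
params/       # model parameters, latest release 2022-12-06: monomer, monomer-ptm, multimer_v3
pdb70/        # directory
pdb_mmcif/    # directory
pdb_seqres/   # .txt file used by multimer, 208M
small_bfd/    # FASTA file of BFD, 17G
uniprot/      # FASTA file, 98G; mind the version
uniref30/     # multiple files sharing one prefix; mind the date
uniref90/     # FASTA file, 59G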
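
The symlink step can be scripted. Below is a minimal sketch with the same effect as running `ln -s` once per database; the helper name `link_databases` and the paths in the usage note are hypothetical and should be replaced with your own locations.

```python
import os


def link_databases(sources, data_dir):
    """Symlink each existing source directory into data_dir under the given name.

    Returns the list of names that were newly linked.
    """
    os.makedirs(data_dir, exist_ok=True)
    linked = []
    for name, source in sources.items():
        target = os.path.join(data_dir, name)
        if os.path.lexists(target) or not os.path.isdir(source):
            continue  # skip entries that already exist and sources that are missing
        os.symlink(os.path.abspath(source), target)  # same effect as `ln -s`
        linked.append(name)
    return linked
```

For example, `link_databases({"uniref90": "/mnt/db/uniref90", "params": "/mnt/models/params"}, "/data/af2")` would create `/data/af2/uniref90` and `/data/af2/params` as links, leaving the original directories untouched.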

3. Configure the script

Edit the run script: run_alphafold.sh

Update the database settings, paying attention to which uniref30 version you have:

# Path and user config (change me if required)
uniref90_database_path="$data_dir/uniref90/uniref90.fasta"
uniprot_database_path="$data_dir/uniprot/uniprot.fasta"
mgnify_database_path="$data_dir/mgnify/mgy_clusters_2022_05.fa"
bfd_database_path="$data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
small_bfd_database_path="$data_dir/small_bfd/bfd-first_non_consensus_sequences.fasta"
# uniref30_database_path="$data_dir/uniref30/UniRef30_2021_03"
uniref30_database_path="$data_dir/uniref30/uniclust30_2018_08"
pdb70_database_path="$data_dir/pdb70/pdb70"
pdb_seqres_database_path="$data_dir/pdb_seqres/pdb_seqres.txt"
template_mmcif_dir="$data_dir/pdb_mmcif/mmcif_files"
obsolete_pdbs_path="$data_dir/pdb_mmcif/obsolete.dat"
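
A wrong path here usually surfaces only after the run has started, so a quick check up front can save time. Note that `bfd_database_path` and `uniref30_database_path` are file prefixes rather than files, so a plain existence test would report them as missing. The sketch below handles both cases; the helper name `missing_databases` is illustrative.

```python
import glob
import os


def missing_databases(paths):
    """Return the names whose path is neither an existing file/dir nor a file prefix."""
    missing = []
    for name, path in paths.items():
        if os.path.exists(path):
            continue
        if glob.glob(glob.escape(path) + "*"):
            continue  # prefix-style databases such as bfd and uniref30
        missing.append(name)
    return missing
```

Pass it a dict that maps each setting name to the expanded path, for example `{"uniref90": data_dir + "/uniref90/uniref90.fasta", ...}`, and fix anything it returns before launching a run.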

Update the locations of the MSA search tools:

hhblits_binary_path="/root/miniconda3/envs/alphafold/bin/hhblits"
hhsearch_binary_path="/root/miniconda3/envs/alphafold/bin/hhsearch"
jackhmmer_binary_path="/root/miniconda3/envs/alphafold/bin/jackhmmer"
kalign_binary_path="/root/miniconda3/envs/alphafold/bin/kalign"

Set the database location and the maximum template date:

if [[ "$data_dir" == "" || "$output_dir" == "" || "$fasta_path" == "" || "$max_template_date" == "" ]] ; then
    data_dir=[my data dir];
    max_template_date="2022-04-01";
fi

MSA search takes up a large share of AF2 inference time, so change the default to prefer existing MSA files:

if [[ "$use_precomputed_msas" == "" ]] ; then
    use_precomputed_msas="true"
fi

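4. Configure the source code

To speed up the MSA search, raise the number of CPUs used by the search tools. As the diffs below show, the defaults are 4 for hhblits and 8 for hmmsearch and jackhmmer. Query the number of CPUs on Linux as follows: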

lscpu | grep 'CPU(s):' | head -1 | awk '{print $2}'   # query the number of CPUs
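
lscpu reports the CPUs of the whole machine. A container or job that is restricted to a subset of cores may be able to use fewer than that. The following minimal Python sketch returns the count available to the current process; it assumes Linux for the affinity call and falls back to `os.cpu_count()` elsewhere. The helper name `usable_cpus` is illustrative.

```python
import os


def usable_cpus():
    """Return the number of CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):  # Linux: respects cpuset/taskset restrictions
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(usable_cpus())
```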

Edit alphafold/data/tools/hhblits.py as follows:

                binary_path: str,
                databases: Sequence[str],
-               n_cpu: int = 4,
+               n_cpu: int = [your num],
                n_iter: int = 3,
                e_value: float = 0.001,
                maxseq: int = 1_000_000,

Edit alphafold/data/tools/hmmsearch.py as follows:

       cmd = [
           self.binary_path,
           '--noali',  # Don't include the alignment in stdout.
-          '--cpu', '8'
+          '--cpu', '[your num]'
       ]
       # If adding flags, we have to do so before the output and input:
       if self.flags:

Edit alphafold/data/tools/jackhmmer.py as follows:

                binary_path: str,
                database_path: str,
-               n_cpu: int = 8,
+               n_cpu: int = [your num],
                n_iter: int = 1,
                e_value: float = 0.0001,
                z_value: Optional[int] = None,

You can also change the default model of the monomer_casp14 preset from monomer to monomer_ptm (in alphafold/model/config.py):

-MODEL_PRESETS['monomer_casp14'] = MODEL_PRESETS['monomer']
+# MODEL_PRESETS['monomer_casp14'] = MODEL_PRESETS['monomer']
+MODEL_PRESETS['monomer_casp14'] = MODEL_PRESETS['monomer_ptm']

Regarding the pTM models:

pTM models were fine-tuned to produce pTM (predicted TM-score) and PAE (predicted aligned error) values alongside their structure predictions.

They are fine-tuned from the monomer models.

5. Run inference on a sequence

Inference command:

bash run_alphafold.sh -o mydata/output/ -f mydata/query.fasta -m monomer_casp14 -c full_dbs

Sequence:

>dummy_sequence
GWSTELEKHREELKEFLKKEGITNVEIRIDNGRLEVRVEGGTERLKRFLEELRQKLEKKGYTVDIKIE
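
Before starting a long run, a quick sanity check of the query file can catch stray characters or an accidental second record (per the script's usage text, a FASTA file with multiple sequences is folded as a multimer). This is a minimal sketch; `read_fasta` and `invalid_residues` are illustrative helpers, and the check covers only the 20 standard amino acid letters.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line.upper()
    return [(desc, seq) for desc, seq in records]


def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - VALID_RESIDUES)
```

For the query above, `read_fasta` yields a single record named `dummy_sequence`, and `invalid_residues` on its sequence returns an empty list.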

Main outputs:

  1. PDB files; the best structure is ranked_0.pdb.
  2. ranking_debug.json, the ranking of the PDB models.
  3. msas, the MSA files found by the search: mgnify_hits.sto, pdb_hits.hhr, bfd_uniref_hits.a3m, uniref90_hits.sto, etc.
  4. timings.json, the run times.
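
When reusing precomputed MSAs, it is worth checking that the msas directory actually holds the files listed in item 3. The sketch below assumes a monomer run with full_dbs and uses the file names from that list; the helper name `missing_msa_files` and the directory in the usage note are assumptions to adapt to your own output layout.

```python
import os

# File names from item 3 of the output list (monomer run with full_dbs).
EXPECTED_MSA_FILES = (
    "mgnify_hits.sto",
    "pdb_hits.hhr",
    "bfd_uniref_hits.a3m",
    "uniref90_hits.sto",
)


def missing_msa_files(msa_dir, expected=EXPECTED_MSA_FILES):
    """Return the expected MSA files that are absent or empty in msa_dir."""
    missing = []
    for name in expected:
        path = os.path.join(msa_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

For example, `missing_msa_files("mydata/output/query/msas")` returns an empty list when all four files are present and non-empty; the `query` subdirectory name is assumed to follow the FASTA file name.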

The ranking_debug.json file looks like this:

{
    "plddts": {
        "model_1_pred_0": 86.26850453604357,
        "model_2_pred_0": 85.06505646965638,
        "model_3_pred_0": 87.40822765097714,
        "model_4_pred_0": 84.71053426936133,
        "model_5_pred_0": 82.69870802756033
    },
    "order": [
        "model_3_pred_0",
        "model_1_pred_0",
        "model_2_pred_0",
        "model_4_pred_0",
        "model_5_pred_0"
    ]
}
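
The first entry of "order" is the model with the highest mean pLDDT, which is the one written out as ranked_0.pdb. A minimal sketch for reading the file and confirming that the order follows the scores (the helper names are illustrative, and the check applies to monomer runs, which are ranked by pLDDT):

```python
import json


def ranked_models(ranking):
    """Return (model_name, mean_plddt) pairs in the order given by 'order'."""
    return [(name, ranking["plddts"][name]) for name in ranking["order"]]


def order_matches_plddt(ranking):
    """Check that 'order' is sorted by descending pLDDT."""
    scores = [score for _, score in ranked_models(ranking)]
    return scores == sorted(scores, reverse=True)
```

Load the file with `json.load(open(path))` and pass the resulting dict to `ranked_models`; for the example above the first pair is `model_3_pred_0` with a pLDDT of about 87.4.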

The timings.json file looks like this:

{
    "features": 103.40737819671631,
    "process_features_model_1_pred_0": 3.8775177001953125,
    "predict_and_compile_model_1_pred_0": 116.74437546730042,
    "relax_model_1_pred_0": 11.63992977142334,
    "process_features_model_2_pred_0": 1.3910491466522217,
    "predict_and_compile_model_2_pred_0": 114.51620531082153,
    "relax_model_2_pred_0": 5.43536114692688,
    "process_features_model_3_pred_0": 1.1890630722045898,
    "predict_and_compile_model_3_pred_0": 87.88086938858032,
    "relax_model_3_pred_0": 5.768261194229126,
    "process_features_model_4_pred_0": 1.1486437320709229,
    "predict_and_compile_model_4_pred_0": 87.95040488243103,
    "relax_model_4_pred_0": 5.295060873031616,
    "process_features_model_5_pred_0": 1.2103533744812012,
    "predict_and_compile_model_5_pred_0": 88.90721249580383,
    "relax_model_5_pred_0": 5.518966436386108
}
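
In this run the predict_and_compile entries add up to about 496 s across the five models, compared with about 103 s for feature generation. To get such per-stage totals from any timings.json, a small sketch like the following can be used; the stage names are taken from the keys above and the helper name `stage_totals` is illustrative.

```python
from collections import defaultdict

STAGES = ("process_features", "predict_and_compile", "relax", "features")


def stage_totals(timings):
    """Sum the timings (seconds) per pipeline stage, keyed by stage name."""
    totals = defaultdict(float)
    for key, seconds in timings.items():
        for stage in STAGES:
            if key == stage or key.startswith(stage + "_model_"):
                totals[stage] += seconds
                break
        else:
            totals["other"] += seconds  # keys that match no known stage
    return dict(totals)
```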

The best output PDB structure is shown below:

[Figure: the top-ranked predicted structure, ranked_0.pdb]

References

  1. GitHub - deepmind/alphafold
  2. GitHub - kalininalab/alphafold_non_docker

The source of run_alphafold.sh, before the modifications described above, is as follows:

#!/bin/bash

usage() {
        echo ""
        echo "Please make sure all required parameters are given"
        echo "Usage: $0 <OPTIONS>"
        echo "Required Parameters:"
        echo "-d <data_dir>         Path to directory of supporting data"
        echo "-o <output_dir>       Path to a directory that will store the results."
        echo "-f <fasta_path>       Path to FASTA files containing sequences. If a FASTA file contains multiple sequences, then it will be folded as a multimer. To fold more sequences one after another, write the files separated by a comma"
        echo "-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets"
        echo "Optional Parameters:"
        echo "-g <use_gpu>          Enable NVIDIA runtime to run with GPUs (default: true)"
        echo "-r <run_relax>        Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage (default: true)"
        echo "-e <enable_gpu_relax> Run relax on GPU if GPU is enabled (default: true)"
        echo "-n <openmm_threads>   OpenMM threads (default: all available cores)"
        echo "-a <gpu_devices>      Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)"
        echo "-m <model_preset>     Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')"
        echo "-c <db_preset>        Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')"
        echo "-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')"
        echo "-l <num_multimer_predictions_per_model> How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer (default: 5)"
        echo "-b <benchmark>        Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')"
        echo ""
        exit 1
}

while getopts ":d:o:f:t:g:r:e:n:a:m:c:p:l:b:" i; do
        case "${i}" in
        d)
                data_dir=$OPTARG
        ;;
        o)
                output_dir=$OPTARG
        ;;
        f)
                fasta_path=$OPTARG
        ;;
        t)
                max_template_date=$OPTARG
        ;;
        g)
                use_gpu=$OPTARG
        ;;
        r)
                run_relax=$OPTARG
        ;;
        e)
                enable_gpu_relax=$OPTARG
        ;;
        n)
                openmm_threads=$OPTARG
        ;;
        a)
                gpu_devices=$OPTARG
        ;;
        m)
                model_preset=$OPTARG
        ;;
        c)
                db_preset=$OPTARG
        ;;
        p)
                use_precomputed_msas=$OPTARG
        ;;
        l)
                num_multimer_predictions_per_model=$OPTARG
        ;;
        b)
                benchmark=$OPTARG
        ;;
        esac
done

# Parse input and set defaults
if [[ "$data_dir" == "" || "$output_dir" == "" || "$fasta_path" == "" || "$max_template_date" == "" ]] ; then
    usage
fi

if [[ "$benchmark" == "" ]] ; then
    benchmark=false
fi

if [[ "$use_gpu" == "" ]] ; then
    use_gpu=true
fi

if [[ "$gpu_devices" == "" ]] ; then
    gpu_devices=0
fi

if [[ "$run_relax" == "" ]] ; then
    run_relax="true"
fi

if [[ "$enable_gpu_relax" == "" ]] ; then
    enable_gpu_relax="true"
fi

if [[ "$enable_gpu_relax" == true && "$use_gpu" == true ]] ; then
    use_gpu_relax="true"
else
    use_gpu_relax="false"
fi

if [[ "$num_multimer_predictions_per_model" == "" ]] ; then
    num_multimer_predictions_per_model=5
fi

if [[ "$model_preset" == "" ]] ; then
    model_preset="monomer"
fi

if [[ "$model_preset" != "monomer" && "$model_preset" != "monomer_casp14" && "$model_preset" != "monomer_ptm" && "$model_preset" != "multimer" ]] ; then
    echo "Unknown model preset! Using default ('monomer')"
    model_preset="monomer"
fi

if [[ "$db_preset" == "" ]] ; then
    db_preset="full_dbs"
fi

if [[ "$db_preset" != "full_dbs" && "$db_preset" != "reduced_dbs" ]] ; then
    echo "Unknown database preset! Using default ('full_dbs')"
    db_preset="full_dbs"
fi

if [[ "$use_precomputed_msas" == "" ]] ; then
    use_precomputed_msas="false"
fi

# This bash script looks for the run_alphafold.py script in its current working directory, if it does not exist then exits
current_working_dir=$(pwd)
alphafold_script="$current_working_dir/run_alphafold.py"

if [ ! -f "$alphafold_script" ]; then
    echo "Alphafold python script $alphafold_script does not exist."
    exit 1
fi

# Export ENVIRONMENT variables and set CUDA devices for use
# CUDA GPU control
export CUDA_VISIBLE_DEVICES=-1
if [[ "$use_gpu" == true ]] ; then
    export CUDA_VISIBLE_DEVICES=0

    if [[ "$gpu_devices" ]] ; then
        export CUDA_VISIBLE_DEVICES=$gpu_devices
    fi
fi

# OpenMM threads control
if [[ "$openmm_threads" ]] ; then
    export OPENMM_CPU_THREADS=$openmm_threads
fi

# TensorFlow control
export TF_FORCE_UNIFIED_MEMORY='1'

# JAX control
export XLA_PYTHON_CLIENT_MEM_FRACTION='4.0'

# Path and user config (change me if required)
uniref90_database_path="$data_dir/uniref90/uniref90.fasta"
uniprot_database_path="$data_dir/uniprot/uniprot.fasta"
mgnify_database_path="$data_dir/mgnify/mgy_clusters_2022_05.fa"
bfd_database_path="$data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
small_bfd_database_path="$data_dir/small_bfd/bfd-first_non_consensus_sequences.fasta"
uniref30_database_path="$data_dir/uniref30/UniRef30_2021_03"
pdb70_database_path="$data_dir/pdb70/pdb70"
pdb_seqres_database_path="$data_dir/pdb_seqres/pdb_seqres.txt"
template_mmcif_dir="$data_dir/pdb_mmcif/mmcif_files"
obsolete_pdbs_path="$data_dir/pdb_mmcif/obsolete.dat"

# Binary path (change me if required)
hhblits_binary_path=$(which hhblits)
hhsearch_binary_path=$(which hhsearch)
jackhmmer_binary_path=$(which jackhmmer)
kalign_binary_path=$(which kalign)

command_args="--fasta_paths=$fasta_path --output_dir=$output_dir --max_template_date=$max_template_date --db_preset=$db_preset --model_preset=$model_preset --benchmark=$benchmark --use_precomputed_msas=$use_precomputed_msas --num_multimer_predictions_per_model=$num_multimer_predictions_per_model --run_relax=$run_relax --use_gpu_relax=$use_gpu_relax --logtostderr"

database_paths="--uniref90_database_path=$uniref90_database_path --mgnify_database_path=$mgnify_database_path --data_dir=$data_dir --template_mmcif_dir=$template_mmcif_dir --obsolete_pdbs_path=$obsolete_pdbs_path"

binary_paths="--hhblits_binary_path=$hhblits_binary_path --hhsearch_binary_path=$hhsearch_binary_path --jackhmmer_binary_path=$jackhmmer_binary_path --kalign_binary_path=$kalign_binary_path"

if [[ $model_preset == "multimer" ]]; then
	database_paths="$database_paths --uniprot_database_path=$uniprot_database_path --pdb_seqres_database_path=$pdb_seqres_database_path"
else
	database_paths="$database_paths --pdb70_database_path=$pdb70_database_path"
fi

if [[ "$db_preset" == "reduced_dbs" ]]; then
	database_paths="$database_paths --small_bfd_database_path=$small_bfd_database_path"
else
	database_paths="$database_paths --uniref30_database_path=$uniref30_database_path --bfd_database_path=$bfd_database_path"
fi

# Run AlphaFold with required parameters
# Run the command directly; wrapping it in $(...) would try to execute its stdout as a command
python $alphafold_script $binary_paths $database_paths $command_args
