2021SC@SDUSC
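Earlier posts covered GNN basics, some GCN programming, and how contact maps are generated. This week analyzes the code at https://github.com/595693085/DGraphDTA. Because the code is fairly long, this post focuses on the part that uses the contact map to extract protein and drug features.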
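First, select the CUDA device, instantiate the model, choose the loss function and learning rate, and create the optimizer over the model's parameters: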
USE_CUDA = torch.cuda.is_available()
device = torch.device(cuda_name if USE_CUDA else 'cpu')
model = GNNNet()
model.to(device)
model_st = GNNNet.__name__
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
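Next comes a for loop (in practice it runs only once, because earlier the script sets
datasets = [['davis', 'kiba'][int(sys.argv[1])]]
so datasets holds a single entry, and the command-line argument picks either the davis or the kiba dataset):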
for dataset in datasets:
    train_data, valid_data = create_dataset_for_5folds(dataset, fold)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=TRAIN_BATCH_SIZE, shuffle=True,
                                               collate_fn=collate)
    valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=TEST_BATCH_SIZE, shuffle=False,
                                               collate_fn=collate)
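This part is important, so let's start with the create_dataset_for_5folds(dataset, fold) method.
Given the fold argument, it picks one of the five folds as valid_fold and merges the other four into train_folds. Validating on held-out data in this way gives a more reliable estimate of generalization and helps detect overfitting.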
def create_dataset_for_5folds(dataset, fold=0):
    # load dataset
    dataset_path = 'data/' + dataset + '/'
    # TODO: I was unsure what train_fold_setting1.txt holds. Judging from how it is
    # used below, it appears to be a JSON list of five folds, each a list of sample indices.
    train_fold_origin = json.load(open(dataset_path + 'folds/train_fold_setting1.txt'))
    train_fold_origin = [e for e in train_fold_origin]  # for 5 folds
    .......
    valid_fold = train_fold_origin[fold]  # one fold
    for i in range(len(train_fold_origin)):  # other folds
        if i != fold:
            train_folds += train_fold_origin[i]
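The fold selection above can be sketched on toy data. This is a minimal sketch rather than the repository's code: split_folds and the toy index lists are hypothetical, and it assumes the fold file is a list of index lists, as its use above suggests.

```python
def split_folds(folds, fold=0):
    """Return (train_indices, valid_indices) for the chosen validation fold."""
    valid = list(folds[fold])
    train = []
    for i, part in enumerate(folds):
        if i != fold:
            train += part  # concatenate every fold except the validation one
    return train, valid


# Toy stand-in for the five folds (hypothetical indices).
toy_folds = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
train_idx, valid_idx = split_folds(toy_folds, fold=1)
print(train_idx)  # [0, 1, 4, 5, 6, 7, 8, 9]
print(valid_idx)  # [2, 3]
```

Calling it once per fold value from 0 to 4 would give the five train/validation splits.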
Load the drug (ligand) and protein dictionaries (each maps a key to a sequence string):
ligands = json.load(open(dataset_path + 'ligands_can.txt'), object_pairs_hook=OrderedDict)
proteins = json.load(open(dataset_path + 'proteins.txt'), object_pairs_hook=OrderedDict)
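A small sketch of why the order-preserving load matters: later code looks up drugs and proteins by position (rows and cols of the affinity matrix), so the key order read from the file has to be kept. The JSON string and keys below are hypothetical stand-ins for proteins.txt.

```python
import json
from collections import OrderedDict

# Hypothetical miniature of proteins.txt: key -> amino-acid sequence.
raw = '{"P2": "MKV", "P1": "GAL"}'
proteins = json.loads(raw, object_pairs_hook=OrderedDict)

# Keys and sequences come back in file order, not sorted order.
prot_keys = list(proteins.keys())
prots = list(proteins.values())
print(prot_keys)  # ['P2', 'P1']
print(prots)      # ['MKV', 'GAL']
```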
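Locate the directories holding the aln files (multiple sequence alignments) and the pconsc4 files (contact maps), and record the file path for each protein: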
# load contact and aln
msa_path = 'data/' + dataset + '/aln'
contac_path = 'data/' + dataset + '/pconsc4'
msa_list = []
contact_list = []
for key in proteins:
    msa_list.append(os.path.join(msa_path, key + '.aln'))
    contact_list.append(os.path.join(contac_path, key + '.npy'))
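Store the drugs: drugs holds the RDKit-canonicalized isomeric SMILES, and drug_smiles holds the original SMILES string from the dictionary: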
# smiles
for d in ligands.keys():
    lg = Chem.MolToSmiles(Chem.MolFromSmiles(ligands[d]), isomericSmiles=True)
    drugs.append(lg)
    drug_smiles.append(ligands[d])
Store the protein keys and sequences:
# seqs
for t in proteins.keys():
    prots.append(proteins[t])
    prot_keys.append(t)
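For the davis dataset, transform the affinity values: the raw values are Kd in nM, and they are converted to pKd = -log10(Kd / 1e9):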
if dataset == 'davis':
    affinity = [-np.log10(y / 1e9) for y in affinity]
affinity = np.asarray(affinity)
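As a worked example of this transform, here is a minimal sketch on made-up numbers. It assumes the raw values are dissociation constants in nM (the values themselves are hypothetical), so dividing by 1e9 gives molar units before the negative log is taken.

```python
import numpy as np

# Hypothetical Kd values in nM; NaN marks a pair with no measurement.
kd_nm = np.array([[10000.0, 1.0],
                  [np.nan, 100.0]])

# Same transform as above: nM -> M, then negative base-10 logarithm.
pkd = -np.log10(kd_nm / 1e9)
print(pkd)  # roughly 5 and 9 in the first row, NaN and 7 in the second
```

A smaller Kd (tighter binding) maps to a larger pKd, and NaN entries stay NaN so they can be filtered out afterwards.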
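For the affinity matrix Y, drop the drug-protein pairs whose affinity is NaN and keep only the pairs that have a known affinity value, then pick out the ones belonging to the training folds: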
for opt in opts:
    if opt == 'train':
        rows, cols = np.where(np.isnan(affinity) == False)
        rows, cols = rows[train_folds], cols[train_folds]
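The two indexing steps can be illustrated on a tiny matrix. This is only a sketch: the affinity values and fold_indices are hypothetical, and it assumes the fold indices refer to positions in the list of non-NaN entries, as the code above implies.

```python
import numpy as np

# Rows are drugs, columns are proteins; NaN means "not measured".
affinity = np.array([[5.0, np.nan, 7.2],
                     [np.nan, 6.1, 8.0]])

# Row/column indices of every measured entry, in row-major order:
# (0, 0), (0, 2), (1, 1), (1, 2)
rows, cols = np.where(~np.isnan(affinity))

# Hypothetical fold: positions within the list of measured entries.
fold_indices = [0, 3]
sel_rows, sel_cols = rows[fold_indices], cols[fold_indices]
pairs = list(zip(sel_rows.tolist(), sel_cols.tolist()))
print(pairs)  # [(0, 0), (1, 2)]
```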
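For each pair, skip proteins whose contact map or alignment file is missing (checked by valid_target). Otherwise collect the drug SMILES, protein sequence, protein key, and affinity into a list ls and append it to train_fold_entries: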
for pair_ind in range(len(rows)):
    if not valid_target(prot_keys[cols[pair_ind]], dataset):  # ensure the contact and aln files exists
        continue
    ls = []
    ls += [drugs[rows[pair_ind]]]
    ls += [prots[cols[pair_ind]]]
    ls += [prot_keys[cols[pair_ind]]]
    ls += [affinity[rows[pair_ind], cols[pair_ind]]]
    train_fold_entries.append(ls)
    valid_train_count += 1
Extract the drug features and build a molecular graph for each SMILES string:
# create smile graph
smile_graph = {}
for smile in compound_iso_smiles:
    g = smile_to_graph(smile)
    smile_graph[smile] = g
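Extract the protein features from the protein key, the contact map, and the multiple sequence alignment. After this we will look at the target_to_graph method: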
for key in target_key:
    if not valid_target(key, dataset):  # ensure the contact and aln files exists
        continue
    g = target_to_graph(key, proteins[key], contac_path, msa_path)
    target_graph[key] = g
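The body of valid_target is not shown in this post. Going only by the comment "ensure the contact and aln files exists", a check of that kind might look like the sketch below. The function has_required_files, its signature, and the file names are all hypothetical, not the repository's implementation.

```python
import os
import tempfile


def has_required_files(key, contact_dir, aln_dir):
    """Hypothetical stand-in for valid_target: both files must exist."""
    contact_file = os.path.join(contact_dir, key + '.npy')
    aln_file = os.path.join(aln_dir, key + '.aln')
    return os.path.exists(contact_file) and os.path.exists(aln_file)


with tempfile.TemporaryDirectory() as root:
    contact_dir = os.path.join(root, 'pconsc4')
    aln_dir = os.path.join(root, 'aln')
    os.makedirs(contact_dir)
    os.makedirs(aln_dir)
    # P1 gets both files; P2 gets only a contact map.
    for path in (os.path.join(contact_dir, 'P1.npy'),
                 os.path.join(aln_dir, 'P1.aln'),
                 os.path.join(contact_dir, 'P2.npy')):
        open(path, 'w').close()
    complete = has_required_files('P1', contact_dir, aln_dir)
    incomplete = has_required_files('P2', contact_dir, aln_dir)

print(complete, incomplete)  # True False
```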
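Threshold the contact map: after adding the identity matrix (so that every residue is linked to itself), only residue pairs with a contact probability >= 0.5 are treated as being in contact: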
def target_to_graph(target_key, target_sequence, contact_dir, aln_dir):
    target_edge_index = []
    target_size = len(target_sequence)
    # contact_dir = 'data/' + dataset + '/pconsc4'
    contact_file = os.path.join(contact_dir, target_key + '.npy')
    contact_map = np.load(contact_file)
    contact_map += np.matrix(np.eye(contact_map.shape[0]))
    index_row, index_col = np.where(contact_map >= 0.5)
Use the adjacency matrix obtained from the contact map to build the edge_index of the protein graph:
for i, j in zip(index_row, index_col):
    target_edge_index.append([i, j])
target_feature = target_to_feature(target_key, target_sequence, aln_dir)
target_edge_index = np.array(target_edge_index)
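To make the thresholding and edge construction concrete, here is a minimal sketch on a made-up 3-residue contact map. The probabilities are hypothetical, and the sketch uses np.eye directly and np.stack instead of the loop, which yields the same list of [i, j] pairs.

```python
import numpy as np

# Hypothetical 3-residue contact probability map (symmetric).
contact_map = np.array([[0.0, 0.8, 0.1],
                        [0.8, 0.0, 0.6],
                        [0.1, 0.6, 0.0]])

# Add self-loops so every residue is connected to itself.
contact_map = contact_map + np.eye(contact_map.shape[0])

# Keep the pairs whose contact probability is at least 0.5.
index_row, index_col = np.where(contact_map >= 0.5)

# One [i, j] row per edge, in row-major order.
edge_index = np.stack([index_row, index_col], axis=1)
print(edge_index.tolist())
# [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2], [2, 1], [2, 2]]
```

Residues 0 and 2 have a contact probability of 0.1, so no edge is created between them, while each symmetric contact produces two directed edges.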
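Next, let's look at target_to_feature() to see how the protein node features are obtained. It locates the .aln multiple sequence alignment file and passes it on to target_feature(), which derives the features from the alignment and the sequence: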
def target_to_feature(target_key, target_sequence, aln_dir):
    # aln_dir = 'data/' + dataset + '/aln'
    aln_file = os.path.join(aln_dir, target_key + '.aln')
    # if 'X' in target_sequence:
    #     print(target_key)
    feature = target_feature(aln_file, target_sequence)
    return feature
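This method is straightforward: it computes a PSSM from the .aln file and the sequence, derives the other features from the sequence alone, and concatenates the two as the node features. And that's it!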
def target_feature(aln_file, pro_seq):
    pssm = PSSM_calculation(aln_file, pro_seq)
    other_feature = seq_feature(pro_seq)
    # print('target_feature')
    # print(pssm.shape)
    # print(other_feature.shape)
    # print(other_feature.shape)
    # return other_feature
    return np.concatenate((np.transpose(pssm, (1, 0)), other_feature), axis=1)
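The shapes involved in that final concatenation can be checked with placeholder arrays. This is a sketch under assumptions: the widths 21 and 33 are assumed for illustration (the post does not show PSSM_calculation or seq_feature), and the arrays are filled with zeros and ones rather than real features.

```python
import numpy as np

seq_len = 4       # hypothetical protein length
n_pssm_rows = 21  # assumed: one PSSM row per residue symbol
n_other = 33      # assumed width of the sequence-derived features

pssm = np.zeros((n_pssm_rows, seq_len))      # laid out as (symbols, residues)
other_feature = np.ones((seq_len, n_other))  # laid out as (residues, features)

# Transpose the PSSM to (residues, symbols), then join along the feature axis.
node_features = np.concatenate((np.transpose(pssm, (1, 0)), other_feature), axis=1)
print(node_features.shape)  # (4, 54)
```

The transpose is needed because the concatenation along axis=1 requires both arrays to have one row per residue.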
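That covers the whole process by which DGraphDTA extracts drug and protein features using the contact map. Given the limits of my own background, some detail code that does not affect understanding has been left out here. If you spot any problems, feel free to point them out in the comments!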