I. Source code structure:
class CircularFingerprint(Featurizer):
    def __init__(self, radius=2, size=2048, chiral=False, bonds=True,
                 features=False, sparse=False, smiles=False):
        ...
    def _featurize(self, mol):
        ...
        return fp
Parameters:
__init__ | |
---|---|
radius | int, optional (default 2); Fingerprint radius. |
size | int, optional (default 2048); Length of generated bit vector (the number of features). |
chiral | bool, optional (default False); Whether to consider chirality in fingerprint generation. |
bonds | bool, optional (default True); Whether to consider bond order in fingerprint generation. |
features | bool, optional (default False); Whether to use feature information instead of atom information; see RDKit docs for more info. |
sparse | bool, optional (default False); Whether to return a dict for each molecule containing the sparse fingerprint. |
smiles | bool, optional (default False); Whether to calculate SMILES strings for fragment IDs (only applicable when calculating sparse fingerprints). |
_featurize | |
mol | RDKit Mol; Molecule. |
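
To make the `size` and `sparse` parameters more concrete, here is a toy sketch of how a sparse fingerprint (a dict keyed by fragment ID) can be folded into a fixed-length bit vector. This is not DeepChem's or RDKit's code: the function name `fold_fingerprint` and the fragment IDs are hypothetical, and folding by modulo is assumed here only as the usual idea behind hashed fingerprints.

```python
def fold_fingerprint(sparse_fp, size=2048):
    """Fold a sparse fingerprint {fragment_id: count} into a list of `size` bits."""
    bits = [0] * size
    for fragment_id in sparse_fp:
        # Map each fragment ID to a position in the vector; two different
        # IDs can land on the same position (a bit collision).
        bits[fragment_id % size] = 1
    return bits


# Hypothetical fragment IDs, chosen only for illustration.
toy_sparse_fp = {98513984: 2, 2245384272: 1, 3217380708: 1}
dense = fold_fingerprint(toy_sparse_fp, size=2048)
n_set = sum(dense)  # number of bits switched on
```

A smaller `size` gives a shorter feature vector but makes collisions more likely, which is the trade-off behind the default of 2048.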
II. Implementation:
Input: smiles_file (type: pandas.DataFrame)
compound_ID | smiles |
---|---|
D06396 | N(C1C2CCCN(CC1)C2)C(=O)c1cc(c(cc1OC)N)Cl |
D06056 | c1(c2c([nH]c1)ccc(c2)OC)/C=N/NC(=N)NCCCCC |
import pandas as pd
import numpy as np
from rdkit import Chem
from deepchem.feat import fingerprints as fp

size = 2048  # fingerprint length; reused below to name the columns
ID_list = []
error_ID_list = []  # IDs whose SMILES could not be parsed by RDKit
feature_list = []
# Build the featurizer once, outside the loop. Never pass mol here, because:
# TypeError: __init__() got an unexpected keyword argument 'mol'
engine = fp.CircularFingerprint(radius=2, size=size, chiral=False,
                                bonds=True, features=False, sparse=False, smiles=False)
for index, smiles in zip(smiles_file['compound_ID'], smiles_file['smiles']):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None for an invalid SMILES string
        error_ID_list.append(index)
        continue
    mol = [mol]  # without this line: TypeError: 'Mol' object is not iterable
    feature = engine(mol)  # the result looks like: [array([0, 0, 0, ..., 0, 0, 0])]
    ID_list.append(index)
    # Use extend, not append: append yields [[array([0, 0, 0, ..., 0, 0, 0])], ...],
    # and building the DataFrame then raises ValueError: Must pass 2-d input
    feature_list.extend(feature)
ID_feature_df = pd.DataFrame(feature_list, ID_list)
vec_name = ['feature_{0}'.format(i) for i in range(0, size)]
ID_feature_df.columns = vec_name
ID_feature_df.index.name = 'compound_ID'
ID_feature_df.to_csv('***')  # save the result to a file
print('------------------Program finished--------------------------------------')
Result:
compound_ID | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | ... | feature_2038 | feature_2039 | feature_2040 | feature_2041 | feature_2042 | feature_2043 | feature_2044 | feature_2045 | feature_2046 | feature_2047 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
D06396 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
D06056 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Saving the file may raise OSError: [Errno 28] No space left on device.
Solution: process the data in batches.
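
One possible way to batch the output is sketched below. It assumes that "in batches" means writing a fixed number of rows per file, so that each part can be moved or compressed before the next one is produced. The helper name `write_in_batches`, the `path_template` argument, and the `part_{0}.csv` naming are hypothetical, and the sketch uses only the standard `csv` module rather than pandas.

```python
import csv
from itertools import islice


def write_in_batches(rows, path_template, n_features, batch_size=1000):
    """Write (compound_ID, feature vector) rows to numbered CSV files.

    `rows` may be a generator, so only one batch is held in memory at a time.
    Returns the list of file paths that were written.
    """
    header = ['compound_ID'] + ['feature_{0}'.format(i) for i in range(n_features)]
    rows = iter(rows)
    paths = []
    while True:
        batch = list(islice(rows, batch_size))  # take the next batch_size rows
        if not batch:
            break
        path = path_template.format(len(paths))  # e.g. part_0.csv, part_1.csv, ...
        with open(path, 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for compound_id, feature in batch:
                writer.writerow([compound_id] + list(feature))
        paths.append(path)
    return paths
```

With the lists built in the loop above, a call could look like `write_in_batches(zip(ID_list, feature_list), 'part_{0}.csv', n_features=2048)`.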
References:
MoleculeNet: Models and Featurizations
Source code of ECFP in deepchem