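One-Hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.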
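As a minimal sketch of this definition, the snippet below encodes four states with four bits. The helper name `one_hot` and the choice of four states are assumptions made for illustration only.

```python
import numpy as np


def one_hot(state_index, num_states):
    """Return a length-num_states vector with a single 1 at state_index."""
    vec = np.zeros(num_states, dtype=np.int32)
    vec[state_index] = 1
    return vec


# Four states need four bits; each row has exactly one active bit.
codes = np.stack([one_hot(i, 4) for i in range(4)])
print(codes)
```

Each row is the code of one state, so stacking all codes gives an identity matrix.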
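Each atom is One-Hot encoded as a one-dimensional vector whose length equals the number of all known elements. The One-Hot representation of a material's chemical formula is obtained by adding up the corresponding atom vectors according to the atomic composition and atom counts of the formula, so each entry holds the number of atoms of that element.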
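The summation described above can be sketched as follows. To keep the example short it assumes a hypothetical table `SYMBOLS` holding only the first eight elements, and it takes the composition as a dictionary rather than a formula string; neither comes from the original code.

```python
import numpy as np

# Hypothetical element table: only the first eight elements, ordered by atomic number.
SYMBOLS = ["H", "He", "Li", "Be", "B", "C", "N", "O"]


def element_one_hot(symbol):
    """One-hot vector of a single atom of the given element."""
    vec = np.zeros(len(SYMBOLS), dtype=np.int32)
    vec[SYMBOLS.index(symbol)] = 1
    return vec


def formula_vector(counts):
    """Sum the element vectors, weighting each by its atom count."""
    total = np.zeros(len(SYMBOLS), dtype=np.int32)
    for symbol, n in counts.items():
        total += n * element_one_hot(symbol)
    return total


water = formula_vector({"H": 2, "O": 1})
print(water)  # [2 0 0 0 0 0 0 1]
```

For H2O1 the result is 2 times the H vector plus 1 times the O vector.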
The One_Hot_Vec() function returns the One-Hot vector of a material's chemical formula:
import numpy as np


def One_Hot_Vec(composition):
    """
    Build the 111-dimensional One-Hot vector of a material.
    Every element must carry an explicit count of at most two digits.
    :param composition: chemical formula of the material, e.g. H2O1
    :return: vector with 2 at the H position and 1 at the O position, [2,0,...,0,1,0,...,0]
    """
    one_hot_vector = np.zeros((111), dtype=np.int32)
    number = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    # atomnumber_dict and ELEMENTS are not shown in this section
    ele_to_index, _ = atomnumber_dict(ELEMENTS)  # dictionary mapping element symbol to atomic number
    # First check whether the formula has fewer than four characters
    if len(composition) < 4:
        if len(composition) == 2:  # e.g. H2
            atom_total = composition[1]
            atom_total = int(atom_total)
            element = composition[0]
            atomnumber = ele_to_index[element]
            one_hot_vector[atomnumber - 1] += atom_total
        else:  # len(composition) == 3
            if composition[1] in number:  # e.g. C60
                atom_total = composition[1] + composition[2]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
            else:  # e.g. Si2
                atom_total = composition[2]
                atom_total = int(atom_total)
                element = composition[0] + composition[1]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
    else:
        jump = 0  # set to 1 to skip the second digit of a two-digit count in the loop below
        if composition[1] in number:  # handle the first element here so the loop can always look back to a digit
            if composition[2] in number:  # the atom count has two digits
                atom_total = composition[1] + composition[2]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
            else:
                atom_total = composition[1]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
        else:
            if composition[2] in number:
                if composition[3] in number:
                    jump = 1  # index 3 is the second digit of this count, so the loop must not read it again (e.g. Si12O1)
                    atom_total = composition[2] + composition[3]
                    atom_total = int(atom_total)
                    element = composition[0] + composition[1]
                    atomnumber = ele_to_index[element]
                    one_hot_vector[atomnumber - 1] += atom_total
                else:
                    atom_total = composition[2]
                    atom_total = int(atom_total)
                    element = composition[0] + composition[1]
                    atomnumber = ele_to_index[element]
                    one_hot_vector[atomnumber - 1] += atom_total
        # The first three characters and the last one are handled separately;
        # a formula ending in two digits is covered by this loop
        for i in range(3, len(composition) - 1):
            if jump == 1:
                jump = 0
                continue
            if composition[i] in number:
                if composition[i + 1] in number:  # the atom count has two digits
                    jump = 1
                    if composition[i - 2] in number:  # e.g. H2O1: the preceding element is one letter
                        atom_total = composition[i] + composition[i + 1]
                        atom_total = int(atom_total)
                        element = composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber - 1] += atom_total
                    else:  # e.g. H1Br1: the preceding element is two letters
                        atom_total = composition[i] + composition[i + 1]
                        atom_total = int(atom_total)
                        element = composition[i - 2] + composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber - 1] += atom_total
                else:  # the atom count has one digit
                    if composition[i - 2] in number:  # e.g. H2O1: the preceding element is one letter
                        atom_total = composition[i]
                        atom_total = int(atom_total)
                        element = composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber - 1] += atom_total
                    else:  # e.g. H1Br1: the preceding element is two letters
                        atom_total = composition[i]
                        atom_total = int(atom_total)
                        element = composition[i - 2] + composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber - 1] += atom_total
        # The last character is a digit and the one before it is not
        if composition[len(composition) - 1] in number and composition[len(composition) - 2] not in number:
            if composition[len(composition) - 3] in number:  # e.g. H2O1: the preceding element is one letter
                atom_total = composition[len(composition) - 1]
                atom_total = int(atom_total)
                element = composition[len(composition) - 2]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
            else:  # the preceding element is two letters
                atom_total = composition[len(composition) - 1]
                atom_total = int(atom_total)
                element = composition[len(composition) - 3] + composition[len(composition) - 2]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
    return one_hot_vector
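Testing with H2O, passed as H2O1 because the function expects explicit atom counts, should return a vector with 2 at index 0 (H), 1 at index 7 (O), and 0 everywhere else.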
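For a self-contained check of the H2O case, here is a compact sketch that parses the formula with a regular expression instead of character-by-character lookups. It is an alternative technique, not the One_Hot_Vec() implementation above. `ATOMIC_NUMBER` is a hypothetical stand-in for the symbol-to-atomic-number dictionary and lists only a few elements, and `formula_to_vector` is a name chosen for this example. Unlike One_Hot_Vec(), it treats a missing count as 1 and accepts counts of any length.

```python
import re

import numpy as np

# Hypothetical stand-in for the symbol -> atomic number dictionary (a few elements only).
ATOMIC_NUMBER = {"H": 1, "C": 6, "O": 8, "Si": 14, "Br": 35}
VECTOR_LENGTH = 111  # same vector length as One_Hot_Vec

# One element symbol (capital letter, optional lowercase letter) followed by optional digits.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_to_vector(composition):
    """Return the count vector of a formula such as H2O1, Si12O1 or H2O."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", composition):
        raise ValueError(f"cannot parse formula: {composition!r}")
    vec = np.zeros(VECTOR_LENGTH, dtype=np.int32)
    for symbol, digits in TOKEN.findall(composition):
        count = int(digits) if digits else 1  # a missing count means one atom
        vec[ATOMIC_NUMBER[symbol] - 1] += count  # KeyError for symbols not in the table
    return vec


water = formula_to_vector("H2O1")
idx = np.nonzero(water)[0]
print(idx, water[idx])  # [0 7] [2 1]
```

The only non-zero entries are 2 at index 0 (H, atomic number 1) and 1 at index 7 (O, atomic number 8).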
Reference:
[1] Property Prediction of Crystalline Solids from Composition and Crystal Structure