One-Hot Feature Representation of Materials

  • 1. Introduction
  • 2. Application in materials informatics
  • 3. Code implementation

1. Introduction

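  One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.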
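
  As a small illustration of the definition above, the sketch below encodes N states as N-bit vectors with exactly one active bit. The function name `one_hot` is hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
def one_hot(index, n_states):
    """Return an n_states-long vector with a single 1 at position index."""
    if not 0 <= index < n_states:
        raise ValueError("index out of range")
    vector = [0] * n_states
    vector[index] = 1
    return vector


# Four states need four bits; each state owns exactly one bit.
codes = [one_hot(i, 4) for i in range(4)]
for code in codes:
    print(code)
```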

2. Application in materials informatics

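  Each atom (element) is represented by a one-hot vector whose length equals the number of known elements. The one-hot representation of a material's chemical formula is obtained by adding up these element vectors according to the formula's composition, counting each element's vector once per atom. For H2O, for example, the result has 2 at the H position, 1 at the O position, and 0 everywhere else.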
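
  The summation described above can be sketched on a toy scale. The example below is only an illustration: `ELEMENTS_TOY`, `element_one_hot`, and `formula_vector` are hypothetical names, and the element table is cut down to the first eight elements so that the vectors stay readable.

```python
# Toy element table: only the first eight elements, to keep the vectors short.
# (The vectors in this article cover every known element.)
ELEMENTS_TOY = ["H", "He", "Li", "Be", "B", "C", "N", "O"]


def element_one_hot(symbol):
    """One-hot vector of a single element."""
    vector = [0] * len(ELEMENTS_TOY)
    vector[ELEMENTS_TOY.index(symbol)] = 1
    return vector


def formula_vector(counts):
    """Sum the element one-hot vectors, each weighted by its atom count."""
    total = [0] * len(ELEMENTS_TOY)
    for symbol, count in counts.items():
        for i, bit in enumerate(element_one_hot(symbol)):
            total[i] += count * bit
    return total


print(element_one_hot("H"))              # [1, 0, 0, 0, 0, 0, 0, 0]
print(element_one_hot("O"))              # [0, 0, 0, 0, 0, 0, 0, 1]
print(formula_vector({"H": 2, "O": 1}))  # [2, 0, 0, 0, 0, 0, 0, 1]
```

  Note that the summed vector holds atom counts, so it is no longer strictly one-hot; it is the count-weighted sum of one-hot vectors.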

3. Code implementation

  The One_Hot_Vec() function returns the one-hot vector of a material's chemical formula:

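import numpy as np


def One_Hot_Vec(composition):
    """
    Build the 111-dimensional one-hot vector of a material.
    :param composition: chemical formula with an explicit atom count (one or two digits) after every element, e.g. H2O1
    :return: vector with 2 at the H position and 1 at the O position, [2,0,...,0,1,0...,0]
    """
    one_hot_vector = np.zeros((111), dtype=np.int32)
    number = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    # Dictionary mapping element symbols to atomic numbers
    # (atomnumber_dict and ELEMENTS are defined elsewhere and are not shown in this post)
    ele_to_index, _ = atomnumber_dict(ELEMENTS)
    # Handle short formulas (fewer than four characters) first
    if len(composition) < 4: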
        if len(composition) == 2:
            atom_total = composition[1]
            atom_total = int(atom_total)
            element = composition[0]
            atomnumber = ele_to_index[element]
            one_hot_vector[atomnumber - 1] += atom_total
        else:    # len(composition) == 3
            if composition[1] in number:    # C60
                atom_total = composition[1] + composition[2]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
            else:    # Si2
                atom_total = composition[2]
                atom_total = int(atom_total)
                element = composition[0] + composition[1]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber - 1] += atom_total
    else:
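        if composition[1] in number:    # handle the first element separately, because the for loop below cannot look back far enough to find a preceding digit
            if composition[2] in number:    # the atom count may have two digits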
                atom_total = composition[1]+composition[2]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber-1] += atom_total
            else:
                atom_total = composition[1]
                atom_total = int(atom_total)
                element = composition[0]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber-1] += atom_total

        else:
            if composition[2] in number:
                if composition[3] in number:
                    atom_total = composition[2] + composition[3]
                    atom_total = int(atom_total)
                    element = composition[0] + composition[1]
                    atomnumber = ele_to_index[element]
                    one_hot_vector[atomnumber-1] += atom_total
                else:
                    atom_total = composition[2]
                    atom_total = int(atom_total)
                    element = composition[0] + composition[1]
                    atomnumber = ele_to_index[element]
                    one_hot_vector[atomnumber-1] += atom_total

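        jump = 0    # skip the second digit of a two-digit atom count
        for i in range(3,len(composition)-1):    # the first three characters and the last one are handled separately; a formula ending in two digits is covered by this loop
            if jump == 1:
                jump = 0
                continue
            if i == 3 and composition[1] not in number and composition[2] in number:
                continue    # e.g. Fe12O3: composition[3] belongs to the first element's count, which was already added above
            if composition[i] in number: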
                if composition[i+1] in number:    # two-digit atom count
                    jump = 1
                    if composition[i-2] in number:  # e.g. H2O1: the preceding element symbol is one letter
                        atom_total = composition[i] + composition[i+1]
                        atom_total = int(atom_total)
                        element = composition[i-1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber-1] += atom_total

                    else:  # e.g. H1Br1: the preceding element symbol is two letters
                        atom_total = composition[i] + composition[i+1]
                        atom_total = int(atom_total)
                        element = composition[i-2] + composition[i-1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber-1] += atom_total

                else:    # one-digit atom count
                    if composition[i - 2] in number:  # e.g. H2O1: the preceding element symbol is one letter
                        atom_total = composition[i]
                        atom_total = int(atom_total)
                        element = composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber-1] += atom_total

                    else:        # e.g. H1Br1: the preceding element symbol is two letters
                        atom_total = composition[i]
                        atom_total = int(atom_total)
                        element = composition[i - 2] + composition[i - 1]
                        atomnumber = ele_to_index[element]
                        one_hot_vector[atomnumber-1] += atom_total

        if composition[len(composition)-1] in number and composition[len(composition)-2] not in number:    # the last character is a digit and the one before it is not
            if composition[len(composition)-3] in number:  # e.g. H2O1: the preceding element symbol is one letter
                atom_total = composition[len(composition)-1]
                atom_total = int(atom_total)
                element = composition[len(composition)-2]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber-1] += atom_total
            else:  # the preceding element symbol is two letters
                atom_total = composition[len(composition)-1]
                atom_total = int(atom_total)
                element = composition[len(composition)-3] + composition[len(composition)-2]
                atomnumber = ele_to_index[element]
                one_hot_vector[atomnumber-1] += atom_total

    return one_hot_vector

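  Testing with H2O (passed as H2O1, since the function expects an explicit count after every element) gives the following output:
[Figure 1: output of One_Hot_Vec for H2O]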
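
  For comparison, the same element-count vector can be sketched more compactly with a regular expression. This is not the implementation used in this post: `formula_to_vector` and `ELEMENT_SYMBOLS` are hypothetical names, the symbol table is a stand-in for the ELEMENTS / atomnumber_dict helpers that are not shown here, and a plain list replaces the NumPy array. Like One_Hot_Vec(), it assumes that every element is followed by an explicit count.

```python
import re

# Symbols of elements 1-111 in atomic-number order (stand-in for ELEMENTS).
ELEMENT_SYMBOLS = (
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca "
    "Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr "
    "Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd "
    "Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg "
    "Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm "
    "Md No Lr Rf Db Sg Bh Hs Mt Ds Rg"
).split()
SYMBOL_TO_INDEX = {symbol: i for i, symbol in enumerate(ELEMENT_SYMBOLS)}

# One token = element symbol (uppercase letter, optional lowercase letter) + integer count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d+)")


def formula_to_vector(composition):
    """Return a 111-long list of atom counts for a formula such as 'H2O1'."""
    # Reject anything that is not a sequence of <symbol><count> tokens,
    # e.g. 'H2O' with an implicit count.
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d+)+", composition):
        raise ValueError(f"unsupported formula: {composition!r}")
    vector = [0] * len(ELEMENT_SYMBOLS)
    for symbol, count in TOKEN.findall(composition):
        vector[SYMBOL_TO_INDEX[symbol]] += int(count)
    return vector


vec = formula_to_vector("H2O1")
print(vec[0], vec[7], sum(vec))  # 2 1 3
```

  Tokenizing with a pattern avoids the per-position case analysis and handles counts of any length, at the cost of rejecting formulas with implicit counts outright.
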
Reference:
[1] Property Prediction of Crystalline Solids from Composition and Crystal Structure
