Contents: how to use matminer, automatminer, pandas, and scikit-learn to machine-learn materials properties.
Table of Contents
1. Data retrieval and filtering
Manipulating and inspecting pandas DataFrame objects
Examining the dataset
Indexing the dataset
Filtering the dataset
Generating new columns
2. Generating descriptors for machine learning
Featurizer methods and basics
Featurizing dataframes
Structure featurizers
Conversion featurizers
Advanced capabilities
Handling errors
Citing the authors
3. Machine learning models
Scikit-Learn
Loading and preparing a pre-featurized dataset
Trying a random forest model with scikit-learn
Evaluating model performance
Cross validation
Visualizing model performance
Model interpretation
4. Automated machine learning with automatminer
Fitting and predicting with Automatminer's MatPipe
Fitting the pipeline
Predicting on new data
Examining predictions
Scoring predictions
Examining the internals of MatPipe
Accessing MatPipe's internal objects directly
Persistence of pipelines
A typical machine learning workflow can be summarized as:
1. Obtain raw inputs, such as a list of compositions and the associated target properties to learn.
2. Convert the raw inputs into descriptors or features that a machine learning algorithm can learn from.
3. Train a machine learning model on the data.
4. Plot and analyze the performance of the model.
Matminer interfaces with many materials databases, including: the Materials Project, Citrine, AFLOW, the Materials Data Facility (MDF), and the Materials Platform for Data Science (MPDS). In addition, it includes datasets from the published literature. Matminer hosts a repository of 26 (and growing) datasets drawn from published, peer-reviewed machine learning studies of materials properties or from high-throughput computational publications. In this section, we show how to access and manipulate datasets from the published literature. For more information on accessing other materials databases, see the matminer_examples repository.
The list of literature-based datasets can be printed using the get_available_datasets() function.
This also prints information about each dataset, such as the number of samples, the target property, and how the data were obtained (e.g., by theory or experiment).
from matminer.datasets import get_available_datasets
get_available_datasets()
Output:
boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The Materials Project database that are calculated by the BoltzTraP software package run on the GGA-PBE or GGA+U density functional theory calculation results. The properties are reported at the temperature of 300 Kelvin and the carrier concentration of 1e18 1/cm3.
brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.
castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge positions and heat of formation.
citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured experimentally and retrieved from Citrine database from various references. The reported values are measured at various temperatures of which 295 are at room temperature.
dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.
double_perovskites_gap: Band gap of 1306 double perovskites (a_1-b_1-a_2-b_2-O6) calculated using Gritsenko, van Leeuwen, van Lenthe and Baerends potential (gllbsc) in GPAW.
double_perovskites_gap_lumo: Supplementary lumo data of 55 atoms for the double_perovskites_gap dataset.
elastic_tensor_2015: 1,181 structures with elastic properties calculated with DFT-PBE.
expt_formation_enthalpy: Experimental formation enthalpies for inorganic compounds, collected from years of calorimetric experiments. There are 1,276 entries in this dataset, mostly binary compounds. Matching mpids or oqmdids as well as the DFT-computed formation energies are also added (if any).
expt_gap: Experimental band gap of 6354 inorganic semiconductors.
flla: 3938 structures and computed formation energies from "Crystal Structure Representations for Machine Learning Models of Formation Energies."
glass_binary: Metallic glass formation data for binary alloys, collected from various experimental techniques such as melt-spinning or mechanical alloying. This dataset covers all compositions with an interval of 5 at. % in 59 binary systems, containing a total of 5959 alloys in the dataset. The target property of this dataset is the glass forming ability (GFA), i.e. whether the composition can form monolithic glass or not, which is either 1 for glass forming or 0 for non-full glass forming.
glass_binary_v2: Identical to glass_binary dataset, but with duplicate entries merged. If there was a disagreement in gfa when merging the class was defaulted to 1.
glass_ternary_hipt: Metallic glass formation dataset for ternary alloys, collected from the high-throughput sputtering experiments measuring whether it is possible to form a glass using sputtering. The hipt experimental data are of the Co-Fe-Zr, Co-Ti-Zr, Co-V-Zr and Fe-Ti-Nb ternary systems.
glass_ternary_landolt: Metallic glass formation dataset for ternary alloys, collected from the "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. This dataset contains experimental measurements of whether it is possible to form a glass using a variety of processing techniques at thousands of compositions from hundreds of ternary systems. The processing techniques are designated in the "processing" column. There are originally 7191 experiments in this dataset, will be reduced to 6203 after deduplicated, and will be further reduced to 6118 if combining multiple data for one composition. There are originally 6780 melt-spinning experiments in this dataset, will be reduced to 5800 if deduplicated, and will be further reduced to 5736 if combining multiple experimental data for one composition.
heusler_magnetic: 1153 Heusler alloys with DFT-calculated magnetic and electronic properties. The 1153 alloys include 576 full, 449 half and 128 inverse Heusler alloys. The data are extracted and cleaned (including de-duplicating) from Citrine.
jarvis_dft_2d: Various properties of 636 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
jarvis_dft_3d: Various properties of 25,923 bulk materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
jarvis_ml_dft_training: Various properties of 24,759 bulk and 2D materials computed with the OptB88vdW and TBmBJ functionals taken from the JARVIS DFT database.
m2ax: Elastic properties of 223 stable M2AX compounds from "A comprehensive survey of M2AX phase elastic properties" by Cover et al. Calculations are PAW PW91.
matbench_dielectric: Matbench v0.1 test dataset for predicting refractive index from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having refractive indices less than 1 and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_expt_gap: Matbench v0.1 test dataset for predicting experimental band gap from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, removing compositions with reported band gaps spanning more than a 0.1eV range; remaining compositions were assigned values based on the closest experimental value to the mean experimental value for that composition among all reports. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_expt_is_metal: Matbench v0.1 test dataset for classifying metallicity from composition alone. Retrieved from Zhuo et al. supplementary information. Deduplicated according to composition, ensuring no conflicting reports were entered for any compositions (i.e., no reported compositions were both metal and nonmetal). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_glass: Matbench v0.1 test dataset for predicting full bulk metallic glass formation ability from chemical formula. Retrieved from "Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys,’ a volume of the Landolt– Börnstein collection. Deduplicated according to composition, ensuring no compositions were reported as both GFA and not GFA (i.e., all reports agreed on the classification designation). For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_jdft2d: Matbench v0.1 test dataset for predicting exfoliation energies from crystal structure (computed with the OptB88vdW and TBmBJ functionals). Adapted from the JARVIS DFT database. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_log_gvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average shear modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_log_kvrh: Matbench v0.1 test dataset for predicting DFT log10 VRH-average bulk modulus from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those having negative G_Voigt, G_Reuss, G_VRH, K_Voigt, K_Reuss, or K_VRH and those failing G_Reuss <= G_VRH <= G_Voigt or K_Reuss <= K_VRH <= K_Voigt and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_e_form: Matbench v0.1 test dataset for predicting DFT formation energy from structure. Adapted from Materials Project database. Removed entries having formation energy more than 3.0eV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_gap: Matbench v0.1 test dataset for predicting DFT PBE band gap from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_mp_is_metal: Matbench v0.1 test dataset for predicting DFT metallicity from structure. Adapted from Materials Project database. Removed entries having a formation energy (or energy above the convex hull) more than 150meV and those containing noble gases. Retrieved April 2, 2019. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_perovskites: Matbench v0.1 test dataset for predicting formation energy from crystal structure. Adapted from an original dataset generated by Castelli et al. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_phonons: Matbench v0.1 test dataset for predicting vibration properties from crystal structure. Original data retrieved from Petretto et al. Original calculations done via ABINIT in the harmonic approximation based on density functional perturbation theory. Removed entries having a formation energy (or energy above the convex hull) more than 150meV. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
matbench_steels: Matbench v0.1 test dataset for predicting steel yield strengths from chemical composition alone. Retrieved from Citrine informatics. Deduplicated. For benchmarking w/ nested cross validation, the order of the dataset must be identical to the retrieved data; refer to the Automatminer/Matbench publication for more details.
mp_all_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.
mp_nostruct_20181018: A complete copy of the Materials Project database as of 10/18/2018. mp_all files contain structure data for each material while mp_nostruct does not.
phonon_dielectric_mp: Phonon (lattice/atoms vibrations) and dielectric properties of 1296 compounds computed via ABINIT software package in the harmonic approximation based on density functional perturbation theory.
piezoelectric_tensor: 941 structures with piezoelectric properties, calculated with DFT-PBE.
steel_strength: 312 steels with experimental yield strength and ultimate tensile strength, extracted and cleaned (including de-duplicating) from Citrine.
wolverton_oxides: 4,914 perovskite oxides containing composition data, lattice constants, and formation + vacancy formation energies. All perovskites are of the form ABO3. Adapted from a dataset presented by Emery and Wolverton.
The returned list of dataset names is:
['boltztrap_mp',
'brgoch_superhard_training',
'castelli_perovskites',
'citrine_thermal_conductivity',
'dielectric_constant',
'double_perovskites_gap',
'double_perovskites_gap_lumo',
'elastic_tensor_2015',
'expt_formation_enthalpy',
'expt_gap',
'flla',
'glass_binary',
'glass_binary_v2',
'glass_ternary_hipt',
'glass_ternary_landolt',
'heusler_magnetic',
'jarvis_dft_2d',
'jarvis_dft_3d',
'jarvis_ml_dft_training',
'm2ax',
'matbench_dielectric',
'matbench_expt_gap',
'matbench_expt_is_metal',
'matbench_glass',
'matbench_jdft2d',
'matbench_log_gvrh',
'matbench_log_kvrh',
'matbench_mp_e_form',
'matbench_mp_gap',
'matbench_mp_is_metal',
'matbench_perovskites',
'matbench_phonons',
'matbench_steels',
'mp_all_20181018',
'mp_nostruct_20181018',
'phonon_dielectric_mp',
'piezoelectric_tensor',
'steel_strength',
'wolverton_oxides']
A dataset can be loaded using the load_dataset() function with the dataset name. To save space, datasets are not downloaded automatically when matminer is installed. Instead, the first time a dataset is loaded, it is downloaded from the internet and stored in the matminer installation directory.
Let's load the dielectric constant dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
Output:
Fetching dielectric_constant.json.gz from https://ndownloader.figshare.com/files/13213475 to D:\anaconda3\lib\site-packages\matminer\datasets\dielectric_constant.json.gz
The dataset is provided as a pandas DataFrame object. In Python, you can think of these as a kind of "spreadsheet" object. DataFrames have several useful methods for exploring and cleaning the data, some of which we explore below.
The head() function prints a summary of the first few rows of the dataset. You can scroll across to see more columns. From this, it is easy to see what kinds of data are available in the dataset.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.head())
Output:
material_id ... poscar
0 mp-441 ... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1 mp-22881 ... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2 mp-28013 ... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3 mp-567290 ... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4 mp-560902 ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
[5 rows x 16 columns]
Sometimes, if a dataset is very large, not all of the available columns will be visible. Instead, the full list of columns can be viewed with the columns attribute:
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.columns)
Output:
Index(['material_id', 'formula', 'nsites', 'space_group', 'volume',
'structure', 'band_gap', 'e_electronic', 'e_total', 'n',
'poly_electronic', 'poly_total', 'pot_ferroelectric', 'cif', 'meta',
'poscar'],
dtype='object')
pandas includes a function called describe() that helps determine statistics for the numerical and categorical columns in the data. Note that by default, describe() only reports on the numerical columns.
Sometimes, describe() will reveal outliers that indicate problems in the data.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.describe())
Output:
nsites space_group ... poly_electronic poly_total
count 1056.000000 1056.000000 ... 1056.000000 1056.000000
mean 7.530303 142.970644 ... 7.248049 14.777898
std 3.388443 67.264591 ... 13.054947 19.435303
min 2.000000 1.000000 ... 1.630000 2.080000
25% 5.000000 82.000000 ... 3.130000 7.557500
50% 8.000000 163.000000 ... 4.790000 10.540000
75% 9.000000 194.000000 ... 7.440000 15.482500
max 20.000000 229.000000 ... 256.840000 277.780000
[8 rows x 7 columns]
We can access a particular column of a DataFrame by indexing the object with the column name. For example:
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df["band_gap"])
Output:
0 1.88
1 3.52
2 1.17
3 1.12
4 2.87
...
1051 0.87
1052 3.60
1053 0.14
1054 0.21
1055 0.26
Name: band_gap, Length: 1056, dtype: float64
Alternatively, we can access a particular row of a DataFrame using the iloc attribute.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print(df.iloc[100])
Output:
material_id mp-7140
formula SiC
nsites 4
space_group 186
volume 42.005504
structure [[-1.87933700e-06 1.78517223e+00 2.53458835e...
band_gap 2.3
e_electronic [[6.9589498, -3.29e-06, 0.0014472600000000001]...
e_total [[10.193825310000001, -3.7090000000000006e-05,...
n 2.66
poly_electronic 7.08
poly_total 10.58
pot_ferroelectric False
cif #\#CIF1.1\n###################################...
meta {u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...
poscar Si2 C2\n1.0\n3.092007 0.000000 0.000000\n-1.54...
Name: 100, dtype: object
pandas DataFrame objects make it very easy to filter the data based on a particular column. We can use the typical Python comparison operators (==, >, >=, <, etc.) to filter numerical values. For example, let's find all entries with a unit cell volume of at least 580. We do this by filtering on the volume column.
Note that we first generate a boolean mask, a series of True and False values depending on the comparison. We can then use the mask to filter the DataFrame.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
mask = df["volume"]>=580
print(df[mask])
Output:
material_id ... poscar
206 mp-23280 ... As4 Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216 mp-9064 ... Rb6 Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219 mp-23230 ... P4 Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251 mp-2160 ... Sb8 Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...
[4 rows x 16 columns]
We can use this filtering approach to clean a dataset. For example, if we only want our dataset to contain semiconductors (materials with a nonzero band gap), we can easily achieve this by filtering on the band_gap column.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
mask = df["band_gap"] > 0
semiconductor_df = df[mask]
print(semiconductor_df)
Output:
material_id ... poscar
0 mp-441 ... Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1 mp-22881 ... Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2 mp-28013 ... Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3 mp-567290 ... La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4 mp-560902 ... Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
... ... ... ...
1051 mp-568032 ... Cd1 In2 Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052 mp-696944 ... La2 H2 Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053 mp-16238 ... Li2 Ag1 Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054 mp-4405 ... Rb3 Au1 O1\n1.0\n5.617516 0.000000 0.000000\n0...
1055 mp-3486 ... K2 Sn2 Sb2\n1.0\n4.446803 0.000000 0.000000\n-...
[1056 rows x 16 columns]
Often, a dataset contains many additional columns that are not needed for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove entire columns from a dataset using the drop() function, which can be used to delete both rows and columns.
The function takes a list of items to drop. For columns, these are the column names; for rows, they are the row numbers. Finally, the axis option specifies whether the items to drop are columns (axis=1) or rows (axis=0).
For example, to drop the nsites, space_group, e_electronic, and e_total columns, we can run:
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
print("Before drop:")
print(df.describe())
print("--"*20)
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)
print("After drop:")
print(cleaned_df.describe())
Output:
Before drop:
nsites space_group ... poly_electronic poly_total
count 1056.000000 1056.000000 ... 1056.000000 1056.000000
mean 7.530303 142.970644 ... 7.248049 14.777898
std 3.388443 67.264591 ... 13.054947 19.435303
min 2.000000 1.000000 ... 1.630000 2.080000
25% 5.000000 82.000000 ... 3.130000 7.557500
50% 8.000000 163.000000 ... 4.790000 10.540000
75% 9.000000 194.000000 ... 7.440000 15.482500
max 20.000000 229.000000 ... 256.840000 277.780000
[8 rows x 7 columns]
----------------------------------------
After drop:
volume band_gap n poly_electronic poly_total
count 1056.000000 1056.000000 1056.000000 1056.000000 1056.000000
mean 166.420376 2.119432 2.434886 7.248049 14.777898
std 97.425084 1.604924 1.148849 13.054947 19.435303
min 13.980548 0.110000 1.280000 1.630000 2.080000
25% 96.262337 0.890000 1.770000 3.130000 7.557500
50% 145.944691 1.730000 2.190000 4.790000 10.540000
75% 212.106405 2.885000 2.730000 7.440000 15.482500
max 597.341134 8.320000 16.030000 256.840000 277.780000
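As noted above, the same drop() call also works for rows by setting axis=0. As a minimal sketch using a small stand-in DataFrame (not the dielectric dataset itself), dropping rows by index label looks like:

```python
import pandas as pd

# A tiny example frame standing in for a real dataset
df = pd.DataFrame({"volume": [42.0, 97.1, 145.9],
                   "band_gap": [1.88, 3.52, 1.17]})

# axis=0 (the default) drops rows by their index labels
trimmed = df.drop([0, 1], axis=0)
print(trimmed)  # only the row with index 2 remains
```

Unlike the boolean-mask filtering shown earlier, drop() selects rows by explicit label rather than by a condition.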
pandas DataFrame objects also make it easy to perform simple calculations on the data. Think of this as using formulas in an Excel spreadsheet. All the basic Python math operators (such as +, -, / and *) can be used.
For example, the dielectric dataset contains the electronic contribution to the dielectric constant (in the poly_electronic column) and the total (static) dielectric constant (in the poly_total column). The ionic contribution is given by the difference: poly_ionic = poly_total - poly_electronic.
Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called poly_ionic. This is as simple as assigning the data to the new column, even if the column doesn't exist yet.
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
df['poly_ionic'] = df['poly_total']-df['poly_electronic']
print(df['poly_ionic'])
Output:
0 2.79
1 3.57
2 5.67
3 10.95
4 4.77
...
1051 4.09
1052 3.09
1053 19.99
1054 16.03
1055 3.08
Name: poly_ionic, Length: 1056, dtype: float64
In this section, we will learn how to generate machine learning descriptors from pymatgen materials objects. First, we will generate some descriptors using matminer's "featurizer" classes. Next, we will use some of the DataFrame knowledge from the previous section to examine our descriptors and prepare them as input to a machine learning model.
Featurizers turn materials primitives into machine-learnable features. The general idea is that a featurizer accepts a materials primitive (e.g., a pymatgen Composition) and outputs a vector. For example: f(Fe2O3) -> [1.5, 7.8, 9.1, 0.09]
Matminer contains featurizers for the following pymatgen objects: compositions, crystal structures, crystal sites, band structures, and densities of states.
Depending on the featurizer, the returned features may be: numerical, categorical, or mixed vectors; matrices; or other pymatgen objects (for further processing).
Since we spend most of our time working with pandas DataFrames, all featurizers also operate on pandas DataFrames. We will provide examples of this later in the lesson.
Matminer contains more than 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find the full list of featurizers on the matminer website (https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and convenient error tolerance built into their core methods.
In this lesson, we will go over the main methods common to all featurizers. By the end of this module, you will be able to use a common software interface to generate descriptors for a wide range of materials informatics problems.
At the core of every matminer featurizer is the featurize() method. This method accepts a materials object and returns a machine learning vector or matrix. Let's look at an example with a pymatgen Composition:
from pymatgen import Composition
fe2o3 = Composition("Fe2O3")
print("fe2o3:", fe2o3)
# As a simple example, we will use the ElementFraction featurizer to get element fractions.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
print(element_fractions)
Output:
fe2o3: Fe2 O3
[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
We have successfully generated features for learning, but what do they mean? One way to find out is to read the features section of the featurizer's documentation... but an easier way is to use the feature_labels() method.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)
Output:
['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']
We can now view the labels in the order the features were generated.
from pymatgen import Composition
fe2o3 = Composition("Fe2O3")
# As a simple example, we will use the ElementFraction featurizer to get element fractions.
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
# Now we can featurize our composition.
element_fractions = ef.featurize(fe2o3)
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
Output:
O 0.6
Fe 0.4
We just generated some descriptors and their labels for a single sample, but most of the time our data live in pandas DataFrames. Fortunately, matminer featurizers implement a featurize_dataframe() method for interacting with DataFrames.
Let's grab a new dataset from matminer and use our ElementFraction featurizer on it.
First, we download the dataset as in the previous section. In this example, we will download a dataset of superhard materials.
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
print(df.head())
Output:
Fetching brgoch_superhard_training.json.gz from https://ndownloader.figshare.com/files/13858931 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\datasets\brgoch_superhard_training.json.gz
formula ... suspect_value
0 AlPt3 ... False
1 Mn2Nb ... False
2 HfO2 ... False
3 Cu3Pt ... False
4 Mg3Pt ... False
Next, we can use the featurize_dataframe() method (implemented by all featurizers) to apply ElementFraction to all of the data at once. The only required arguments are the input DataFrame and the input column name (in this case, composition). By default, featurize_dataframe() is parallelized using multiprocessing.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can be set instead)
pd.set_option('display.max_columns', None)
# Disable DataFrame line wrapping (False disables wrapping, True enables it)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("brgoch_superhard_training")
print(df.head())
print("---" * 20)
from matminer.featurizers.composition import ElementFraction
ef = ElementFraction()
if __name__ == '__main__':
    df = ef.featurize_dataframe(df, "composition")
    print(df.head())
Output:
formula bulk_modulus shear_modulus composition material_id structure brgoch_feats suspect_value
0 AlPt3 225.230461 91.197748 (Al, Pt) mp-188 [[0. 0. 0.] Al, [0. 1.96140395 1.96140... {'atomic_number_feat_1': 123.5, 'atomic_number... False
1 Mn2Nb 232.696340 74.590157 (Mn, Nb) mp-12659 [[-2.23765223e-08 1.42974191e+00 5.92614104e... {'atomic_number_feat_1': 45.5, 'atomic_number_... False
2 HfO2 204.573433 98.564374 (Hf, O) mp-352 [[2.24450185 3.85793022 4.83390736] O, [2.7788... {'atomic_number_feat_1': 44.0, 'atomic_number_... False
3 Cu3Pt 159.312640 51.778816 (Cu, Pt) mp-12086 [[0. 1.86144248 1.86144248] Cu, [1.861... {'atomic_number_feat_1': 82.5, 'atomic_number_... False
4 Mg3Pt 69.637565 27.588765 (Mg, Pt) mp-18707 [[0. 0. 2.73626461] Mg, [0. ... {'atomic_number_feat_1': 57.0, 'atomic_number_... False
------------------------------------------------------------
formula bulk_modulus shear_modulus composition material_id structure brgoch_feats suspect_value H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
0 AlPt3 225.230461 91.197748 (Al, Pt) mp-188 [[0. 0. 0.] Al, [0. 1.96140395 1.96140... {'atomic_number_feat_1': 123.5, 'atomic_number... False 0 0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0 0.0 0.00 0.25 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.0 0.0 0.0 0.75 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 Mn2Nb 232.696340 74.590157 (Mn, Nb) mp-12659 [[-2.23765223e-08 1.42974191e+00 5.92614104e... {'atomic_number_feat_1': 45.5, 'atomic_number_... False 0 0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.666667 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.333333 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 HfO2 204.573433 98.564374 (Hf, O) mp-352 [[2.24450185 3.85793022 4.83390736] O, [2.7788... {'atomic_number_feat_1': 44.0, 'atomic_number_... False 0 0 0.0 0.0 0.0 0.0 0.0 0.666667 0.0 0 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.333333 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 Cu3Pt 159.312640 51.778816 (Cu, Pt) mp-12086 [[0. 1.86144248 1.86144248] Cu, [1.861... {'atomic_number_feat_1': 82.5, 'atomic_number_... False 0 0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0 0.0 0.00 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.75 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.0 0.0 0.0 0.25 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 Mg3Pt 69.637565 27.588765 (Mg, Pt) mp-18707 [[0. 0. 2.73626461] Mg, [0. ... {'atomic_number_feat_1': 57.0, 'atomic_number_... False 0 0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0 0.0 0.75 0.00 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.0 0.0 0.0 0.0 0.25 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
We can use the same syntax with other types of featurizers. Let's now assign descriptors to a structure, using the same syntax as for the composition featurizers. First, let's load a dataset containing structures.
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
Output:
Fetching phonon_dielectric_mp.json.gz from https://ndownloader.figshare.com/files/13297571 to D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\datasets\phonon_dielectric_mp.json.gz
mpid ... formula
0 mp-1000 ... BaTe
1 mp-1002124 ... HfC
2 mp-1002164 ... GeC
3 mp-10044 ... BAs
4 mp-1008223 ... CaSe
Let's use DensityFeatures to calculate some basic density features of these structures.
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
print(densityf.feature_labels())
Output:
['density', 'vpa', 'packing fraction']
These are the features we will obtain. Now we use featurize_dataframe() to generate them for every sample in the DataFrame. Because we are using structures as the featurizer input, we select the "structure" column.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can be set instead)
pd.set_option('display.max_columns', None)
# Disable DataFrame line wrapping (False disables wrapping, True enables it)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---"*20)
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    print(df.head())
Output:
mpid eps_electronic eps_total last phdos peak structure formula
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe
------------------------------------------------------------
mpid eps_electronic eps_total last phdos peak structure formula density vpa packing fraction
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe 4.937886 44.545547 0.596286
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC 9.868234 16.027886 0.531426
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC 5.760895 12.199996 0.394180
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs 5.087634 13.991016 0.319600
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe 2.750191 35.937000 0.428523
In addition to the band structure/DOS/structure/composition featurizers, matminer also provides a featurizer interface for converting between pymatgen objects in an error-tolerant fashion (for example, adding oxidation states to a composition). These featurizers can be found in matminer.featurizers.conversions and use the same featurize/featurize_dataframe syntax as the other featurizers.
The dataset we loaded earlier only contains a formula column of string objects. To turn these data into a composition column containing pymatgen Composition objects, we can use the StrToComposition conversion featurizer on the formula column.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("phonon_dielectric_mp")
print(df.head())
print("---"*20)
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
if __name__ == '__main__':
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())
Result:
mpid eps_electronic eps_total last phdos peak structure formula
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe
------------------------------------------------------------
mpid eps_electronic eps_total last phdos peak structure formula composition
0 mp-1000 6.311555 12.773454 98.585771 [[2.8943817 2.04663693 5.01321616] Te, [0. 0.... BaTe (Ba, Te)
1 mp-1002124 24.137743 32.965593 677.585725 [[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78... HfC (Hf, C)
2 mp-1002164 8.111021 11.169464 761.585719 [[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45... GeC (Ge, C)
3 mp-10044 10.032168 10.128936 701.585723 [[0.98372595 0.69559929 1.70386332] B, [0. 0. ... BAs (B, As)
4 mp-1008223 3.979201 6.394043 204.585763 [[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se] CaSe (Ca, Se)
Before we move on, Featurizers have some powerful capabilities worth mentioning (and more that are not covered here).
Real data is often messy, and some featurizers will hit errors on certain samples. Set ignore_errors=True in featurize_dataframe() to skip these errors; if you would like the returned errors to appear in an additional column, also set return_errors=True.
Many featurizers implement methods published in peer-reviewed studies. Please cite the original works using the citations() method, which returns BibTeX-formatted references in a Python list.
In Parts 1 and 2 we demonstrated how to download datasets and add machine-learnable features. In Part 3 we show how to train a machine learning model on a dataset and analyze the results.
This part makes extensive use of scikit-learn, an open-source Python package for machine learning. Matminer is designed to make machine learning with scikit-learn as simple as possible. Other machine learning packages also exist, such as TensorFlow, which implements neural network architectures. These packages can be used with matminer as well but are outside the scope of this workshop.
First, let's load a dataset we can use for machine learning. We have already added some composition and structure features to the elastic_tensor_2015 dataset used in Exercises 1 and 2.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    print(df.head())
Result:
material_id formula nsites space_group volume structure elastic_anisotropy G_Reuss G_VRH G_Voigt K_Reuss K_VRH K_Voigt poisson_ratio compliance_tensor elastic_tensor elastic_tensor_original cif kpoint_density poscar density vpa packing fraction composition
0 mp-10003 Nb4CoSi 12 124 194.419802 [[0.94814328 2.07280467 2.5112 ] Nb, [5.273... 0.030688 96.844535 97.141604 97.438674 194.267623 194.268884 194.270146 0.285701 [[0.004385293093993, -0.0016070693558990002, -... [[311.33514638650246, 144.45092552856926, 126.... [[311.33514638650246, 144.45092552856926, 126.... #\#CIF1.1\n###################################... 7000 Nb8 Co2 Si2\n1.0\n6.221780 0.000000 0.000000\n... 7.834556 16.201654 0.688834 (Nb, Co, Si)
1 mp-10010 Al(CoSi)2 5 164 61.987320 [[0. 0. 0.] Al, [1.96639263 1.13529553 0.75278... 0.266910 93.939650 96.252006 98.564362 173.647763 175.449907 177.252050 0.268105 [[0.0037715428949660003, -0.000844229828709, -... [[306.93357350984974, 88.02634955100905, 105.6... [[306.93357350984974, 88.02634955100905, 105.6... #\#CIF1.1\n###################################... 7000 Al1 Co2 Si2\n1.0\n3.932782 0.000000 0.000000\n... 5.384968 12.397466 0.644386 (Al, Co, Si)
2 mp-10015 SiOs 2 221 25.952539 [[1.480346 1.480346 1.480346] Si, [0. 0. 0.] Os] 0.756489 120.962289 130.112955 139.263621 295.077545 295.077545 295.077545 0.307780 [[0.0019959391925840004, -0.000433146670736000... [[569.5291276937579, 157.8517489654999, 157.85... [[569.5291276937579, 157.8517489654999, 157.85... #\#CIF1.1\n###################################... 7000 Si1 Os1\n1.0\n2.960692 0.000000 0.000000\n0.00... 13.968635 12.976265 0.569426 (Si, Os)
3 mp-10021 Ga 4 63 76.721433 [[0. 1.09045794 0.84078375] Ga, [0. ... 2.376805 12.205989 15.101901 17.997812 49.025963 49.130670 49.235377 0.360593 [[0.021647143908635, -0.005207263618160001, -0... [[69.28798774976904, 34.7875015216915, 37.3877... [[70.13259066665267, 40.60474945058445, 37.387... #\#CIF1.1\n###################################... 7000 Ga4\n1.0\n2.803229 0.000000 0.000000\n0.000000... 6.036267 19.180359 0.479802 (Ga)
4 mp-10025 SiRu2 12 62 160.300999 [[1.0094265 4.24771709 2.9955487 ] Si, [3.028... 0.196930 100.110773 101.947798 103.784823 255.055257 256.768081 258.480904 0.324682 [[0.00410214297725, -0.001272204332729, -0.001... [[349.3767766177825, 186.67131003104407, 176.4... [[407.4791016459293, 176.4759188081947, 213.83... #\#CIF1.1\n###################################... 7000 Si4 Ru8\n1.0\n4.037706 0.000000 0.000000\n0.00... 9.539514 13.358418 0.598395 (Si, Ru)
We first need to split the dataset into the "target" property and the "features" used for learning. In this model we will use the bulk modulus (K_VRH) as the target property. We use the dataframe's values attribute to get the target property as a numpy array rather than a pandas Series object.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    print(y)
Result:
[194.26888436 175.44990675 295.07754499 ... 89.41816126 99.3845653
35.93865993]
Machine learning algorithms can only be trained on numerical features, so we need to remove any non-numerical columns from our dataset. We also want to remove the K_VRH column from the feature set, since the model should not know the target property in advance.
The dataset loaded above includes the structure, formula, and composition columns previously used to generate the machine-learnable features. Let's remove them with the pandas drop() function discussed in Section 1. Remember that axis=1 indicates we are dropping columns rather than rows.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    print("There are {} possible descriptors:".format(len(X.columns)))
    print(X.columns)
Result:
There are 14 possible descriptors:
Index(['nsites', 'space_group', 'volume', 'elastic_anisotropy', 'G_Reuss',
       'G_VRH', 'G_Voigt', 'K_Reuss', 'K_Voigt', 'poisson_ratio',
       'kpoint_density', 'density', 'vpa', 'packing fraction'],
      dtype='object')
The scikit-learn library makes it easy to train machine learning models on the features we have generated. It implements a variety of different regression models and contains tools for cross-validation.
To save time, we will only experiment with a single model in this example, but it is good practice to try several and see which performs best for your machine learning problem. A good "starting" model is a random forest. Let's create one:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
Note that the model we created has n_estimators set to 100. n_estimators is an example of a machine learning hyperparameter, and most models contain many tunable hyperparameters. To obtain good performance, these parameters must be fine-tuned for each individual machine learning problem. There is currently no easy way to know the optimal hyperparameters in advance; typically, trial and error is used.
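Trial and error can be partially automated with scikit-learn's GridSearchCV, which evaluates each hyperparameter setting by cross-validation and keeps the best one. Below is a minimal sketch on synthetic data (X_demo and y_demo are illustrative stand-ins, not the elastic_tensor_2015 features):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the featurized dataset
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = X_demo[:, 0] * 10 + rng.normal(scale=0.5, size=200)

# Evaluate each n_estimators candidate with 5-fold cross-validation
grid = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [10, 50, 100]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)  # the best-scoring setting from the grid
```

The same pattern extends to grids over several hyperparameters at once (e.g., max_depth, min_samples_split), at the cost of fitting one model per combination per fold.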
We can now train our model to predict the target property (y) from the input features (X). This is achieved with the fit() function.
rf.fit(X, y)
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
Next, we need to evaluate the performance of the model. To do this, we first ask the model to predict the bulk modulus for every entry in our original dataframe.
y_pred = rf.predict(X)
Next, we can check the accuracy of our model by looking at the root-mean-square error of the predictions. Scikit-learn provides a mean_squared_error() function to calculate the mean squared error; we then take its square root to obtain the final performance metric.
import numpy as np
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
import numpy as np
from sklearn.metrics import mean_squared_error
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
Result:
training RMSE = 0.801 GPa
A root-mean-square error of 0.801 GPa looks reasonable! However, because the model was trained and evaluated on exactly the same data, this is not a true estimate of how the model would perform on unseen materials — which is the main purpose of a machine learning study.
To get a more accurate estimate of predictive performance, and to verify that we are not overfitting, we need to look at a cross-validation score rather than the fitting score.
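The gap between training error and error on unseen data can be seen directly by holding part of the data out. A minimal sketch with synthetic data (X_demo and y_demo are illustrative stand-ins, not the elastic dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy synthetic regression data
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 5)
y_demo = X_demo[:, 0] * 10 + rng.normal(scale=1.0, size=300)

# Hold out 25% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

rmse_train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
rmse_test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
# The training RMSE is much lower than the held-out RMSE,
# because the forest partially memorizes the training noise
print(rmse_train, rmse_test)
```

Cross-validation, discussed next, generalizes this idea by rotating the held-out portion over the whole dataset.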
In cross-validation, the data is randomly partitioned into n splits (in this case 10), each containing roughly the same number of samples. The model is trained on n-1 of the splits (the training set), and its performance is evaluated by comparing the actual and predicted values on the final split (the test set). This process is repeated so that each split is used as the test set at some point. The cross-validation score is the average score across all test sets.
There are many ways to divide the data into splits. In this example we use the KFold method with 10 splits, i.e., in each fold 90% of the data is used as the training set and 10% as the test set.
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
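To see what KFold actually produces, here is a small sketch: its split() method yields index arrays for the training and test portions of each fold (x_demo is a ten-element stand-in array, so each fold's test set contains exactly one sample):

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(10)  # ten samples to be divided into 10 folds
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=1)

# Each iteration yields 9 training indices and 1 test index;
# over all folds, every sample appears in a test set exactly once
for train_idx, test_idx in kfold_demo.split(x_demo):
    print(train_idx, test_idx)
```

These index arrays are exactly what cross_val_score uses internally to slice X and y for each fold.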
Note that we set random_state=1 (together with shuffle=True) to ensure that every participant gets the same answer for their model.
Finally, the cross-validation score can be obtained automatically using scikit-learn's cross_val_score() function. This function takes a machine learning model, the input features, and the target property as arguments. Note that we pass the kfold object as the cv argument so that cross_val_score() uses the correct test/train splits.
For each split, the model is trained from scratch before its performance is evaluated. Since we must train and predict 10 times, cross-validation typically takes some time to run. In our case the model is fairly small, so the process only takes about a minute. The final cross-validation score is the average over all splits.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    y_pred = rf.predict(X)
    mse = mean_squared_error(y, y_pred)
    print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
    rmse_scores = [np.sqrt(abs(s)) for s in scores]
    print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
Result:
training RMSE = 0.801 GPa
Mean RMSE: 1.731
Note that our RMSE has changed somewhat, since it now reflects the true predictive power of the model. Still, an RMSE of ~1.7 GPa is quite good!
For every sample in the test set of each test/train split, we can visualize the predictive performance of our model by plotting our predictions against the actual values.
First, we obtain the predicted values for each split's test set using the cross_val_predict method. This is similar to cross_val_score, except that it returns the actual predicted values rather than the model score.
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(rf, X, y, cv=kfold)
For plotting, we use matminer's PlotlyFig module, which helps you quickly produce publication-ready plots. PlotlyFig can generate many different types of plot. A detailed explanation of its use is beyond the scope of this tutorial, but examples of the available plots are in the FigRecipes section of the matminer_examples repository.
from matminer.figrecipes.plot import PlotlyFig
pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               mode='notebook')
pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
      labels=df['formula'],
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}],
      showlegends=False)
# This code must be run in Jupyter.
# The simplest way to get Jupyter is to install JupyterLab / Jupyter Notebook through Anaconda.
# Per the matminer.figrecipes docs, NotebookApp.iopub_data_rate_limit=1.0e10 must be set in
# jupyter_notebook_config.py or plotting will fail; alternatively, launch with:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from matminer.figrecipes.plot import PlotlyFig
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    # rf.fit(X, y)
    # y_pred = rf.predict(X)
    # mse = mean_squared_error(y, y_pred)
    # print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    # scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)
    # rmse_scores = [np.sqrt(abs(s)) for s in scores]
    # print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
    y_pred = cross_val_predict(rf, X, y, cv=kfold)
    pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
                   y_title='Predicted bulk modulus (GPa)',
                   mode='notebook')
    pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])],
          labels=df['formula'],
          modes=['markers', 'lines'],
          lines=[{}, {'color': 'black', 'dash': 'dash'}],
          showlegends=False)
Not bad! However, there are clearly some outliers (hover over the points with your mouse to see which materials they are).
An important aspect of machine learning is being able to understand why a model makes certain predictions. Random forest models are particularly easy to interpret because they have a feature_importances_ attribute, which contains the importance of each feature in deciding the final prediction. Let's look at the feature importances of our model.
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    print(rf.feature_importances_)
Result:
[1.33190021e-05 1.59243590e-05 3.00289964e-05 1.31361719e-04
4.30431755e-04 4.14454063e-04 2.24079389e-04 8.12596104e-01
1.85751384e-01 1.10635270e-04 1.32575216e-06 3.02149617e-05
4.80359906e-05 2.02700688e-04]
To make sense of this, we need to know which feature each number corresponds to. We can use PlotlyFig to plot the importances of the five most important features.
importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]
pf = PlotlyFig(y_title='Importance (%)',
               title='Feature by importances',
               mode='notebook')
pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
import pandas as pd
# Show all DataFrame columns (None shows every column; a number can also be set)
pd.set_option('display.max_columns', None)
# Stop the DataFrame repr from wrapping across lines (False disables wrapping)
pd.set_option('expand_frame_repr', False)
from matminer.figrecipes.plot import PlotlyFig
from matminer.datasets.dataset_retrieval import load_dataset
df = load_dataset("elastic_tensor_2015")
from matminer.featurizers.structure import DensityFeatures
densityf = DensityFeatures()
from matminer.featurizers.conversions import StrToComposition
stc = StrToComposition()
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=1)
import numpy as np
if __name__ == '__main__':
    df = densityf.featurize_dataframe(df, "structure")
    df = stc.featurize_dataframe(df, "formula")
    y = df['K_VRH'].values
    X = df.drop(['material_id', "structure", "formula", "composition", "K_VRH", 'elastic_tensor_original', 'poscar', 'compliance_tensor', 'elastic_tensor', 'cif'], axis=1)
    rf.fit(X, y)
    importances = rf.feature_importances_
    included = X.columns.values
    indices = np.argsort(importances)[::-1]
    pf = PlotlyFig(y_title='Importance (%)',
                   title='Feature by importances',
                   mode='notebook')
    pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
Automatminer is a package for automatically creating ML pipelines using matminer's featurizers, feature-reduction techniques, and automated machine learning (AutoML). Automatminer works end to end, from raw data to predictions, without requiring any human input.
Put in a dataset, get out a machine that predicts materials properties. Automatminer is competitive with state-of-the-art hand-tuned machine learning models across multiple domains of materials informatics. It also includes utilities for running MatBench, a materials-science ML benchmark. Learn more about Automatminer and MatBench in the official documentation.
How does automatminer work? Automatminer automatically decorates a dataset using hundreds of descriptor techniques from matminer's descriptor library, picks the most useful features for learning, and runs a separate AutoML pipeline. Once the pipeline has been fit, it can be summarized in a text file, saved to disk, or used to make predictions on new materials. In overview, materials primitives (such as crystal structures) go in at one end and property predictions come out the other; MatPipe handles the intermediate operations such as assigning descriptors, cleaning problematic data, data transformations, imputation, and machine learning.
MatPipe is the central object of Automatminer. It has an sklearn BaseEstimator syntax for fit and predict operations: simply fit on your training data, then predict on your test data. MatPipe uses pandas dataframes as inputs and outputs — put in dataframes (of materials), get out dataframes (of property predictions).
Overview: in this section we cover the basic steps of training and predicting with automatminer, and use automatminer's API to look inside our automated pipeline. First, we load a dataset of roughly 4,600 dielectric constants from the Materials Project. Next, we fit an Automatminer MatPipe to the data. Then we predict dielectric constants from structures and see how our predictions fare (note that this is not an easy problem!). We use MatPipe's introspection methods to examine our pipeline. Finally, we look at how to save and load a pipeline for reproducible predictions. Note: for brevity, we use a single train-test split in this notebook; to run a full Automatminer benchmark, see the MatPipe.benchmark documentation.
Preparing a dataset for machine learning: let's load a dataset to play with. In this example we use matminer to load one of the MatBench v0.1 datasets.
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    print(df.head())
Result:
structure n
0 [[4.29304147 2.4785886 1.07248561] S, [4.2930... 1.752064
1 [[3.95051434 4.51121437 0.28035002] K, [4.3099... 1.652859
2 [[-1.78688104 4.79604117 1.53044621] Rb, [-1... 1.867858
3 [[4.51438064 4.51438064 0. ] Mn, [0.133... 2.676887
4 [[-4.36731958 6.8886097 0.50929706] Li, [-2... 1.793232
Inspecting the dataset, we can see that it contains only a "structure" column and an "n" (dielectric constant) column.
Next, we generate a train-test split for evaluating automatminer.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
Let's remove the target property from the test dataframe, so we can be sure we are not giving automatminer any test information. Our target variable is "n".
target = "n"
prediction_df = test_df.drop(columns=[target])
prediction_df.head()
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    print(prediction_df.head())
Result:
structure
1802 [[3.71205866 2.14315394 1.14375057] Si, [-3.71...
1881 [[0. 0. 0.] Cd, [1.35314892 0.95682078 2.34372...
1288 [[-0.50714072 4.9893142 6.08288682] K, [-1....
4490 [[3.90704797 2.76270011 6.76720559] Si, [0.558...
32 [[1.91506173 1.23473956 4.58373805] P, [ 5.553...
We now have everything we need to launch our AutoML pipeline. For simplicity, we will use a MatPipe preset. MatPipe is highly customizable with hundreds of configuration options, but most use cases are served by one of the preset configurations, obtained with the from_preset method.
In this example, in the interest of time, we will use the "debug" preset, which spends about 1.5 minutes on machine learning. If you have more time, the "express" preset is a good choice.
from automatminer import MatPipe
pipe = MatPipe.from_preset("debug")
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
Result:
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.metrics.scorer module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.feature_selection.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.feature_selection. Anything that cannot be imported from sklearn.feature_selection is now part of the private API.
warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.neighbors.unsupervised module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.
warnings.warn(message, FutureWarning)
D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\externals\joblib\__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
warnings.warn(msg, category=FutureWarning)
To fit an Automatminer MatPipe to the data, pass in your training data and the desired target.
pipe.fit(train_df, target)
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
Result:
2021-03-15 20:45:47 INFO Problem type is: regression
2021-03-15 20:45:47 INFO Fitting MatPipe pipeline to data.
2021-03-15 20:45:47 INFO AutoFeaturizer: Starting fitting.
2021-03-15 20:45:47 INFO AutoFeaturizer: Adding compositions from structures.
2021-03-15 20:45:47 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
StructureToOxidStructure: 0%| | 0/3811 [00:00, ?it/s]D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.metrics.scorer module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
StructureToComposition: 0%| | 0/3811 [00:00, ?it/s]D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.metrics.scorer module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
zeros[:len(eigs)] = eigs
D:\anaconda3\envs\pythonProject1\lib\site-packages\matminer\featurizers\structure.py:743: ComplexWarning: Casting complex values to real discards the imaginary part
zeros[:len(eigs)] = eigs
SineCoulombMatrix: 100%|██████████| 3811/3811 [00:17<00:00, 219.80it/s]
2021-03-15 20:48:44 INFO AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2021-03-15 20:48:44 INFO AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2021-03-15 20:48:44 INFO AutoFeaturizer: Finished transforming.
2021-03-15 20:48:44 INFO DataCleaner: Starting fitting.
2021-03-15 20:48:44 INFO DataCleaner: Cleaning with respect to samples with sample na_method 'drop'
2021-03-15 20:48:44 INFO DataCleaner: Replacing infinite values with nan for easier screening.
2021-03-15 20:48:44 INFO DataCleaner: Before handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO DataCleaner: 0 samples did not have target values. They were dropped.
2021-03-15 20:48:44 INFO DataCleaner: Handling feature na by max na threshold of 0.01 with method 'drop'.
2021-03-15 20:48:44 INFO DataCleaner: After handling na: 3811 samples, 421 features
2021-03-15 20:48:44 INFO DataCleaner: Finished fitting.
2021-03-15 20:48:44 INFO FeatureReducer: Starting fitting.
2021-03-15 20:48:45 INFO FeatureReducer: 285 features removed due to cross correlation more than 0.95
2021-03-15 20:52:46 INFO TreeFeatureReducer: Finished tree-based feature reduction of 135 initial features to 13
2021-03-15 20:52:46 INFO FeatureReducer: Finished fitting.
2021-03-15 20:52:46 INFO FeatureReducer: Starting transforming.
2021-03-15 20:52:46 INFO FeatureReducer: Finished transforming.
2021-03-15 20:52:46 INFO TPOTAdaptor: Starting fitting.
27 operators have been imported by TPOT.
Optimization Progress: 0%| | 0/10 [00:00, ?pipeline/s]D:\anaconda3\envs\pythonProject1\lib\site-packages\sklearn\utils\deprecation.py:144: FutureWarning: The sklearn.metrics.scorer module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.metrics. Anything that cannot be imported from sklearn.metrics is now part of the private API.
ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=True, ExtraTreesRegressor__max_features=0.8500000000000002, ExtraTreesRegressor__min_samples_leaf=1, ExtraTreesRegressor__min_samples_split=5, ExtraTreesRegressor__n_estimators=500)
Optimization Progress: 100%|██████████| 30/30 [00:57<00:00, 1.41s/pipeline]Generation 2 - Current Pareto front scores:
-3 -0.4658520716454634 ExtraTreesRegressor(StandardScaler(SelectPercentile(input_matrix, SelectPercentile__percentile=99)), ExtraTreesRegressor__bootstrap=True, ExtraTreesRegressor__max_features=0.8500000000000002, ExtraTreesRegressor__min_samples_leaf=1, ExtraTreesRegressor__min_samples_split=5, ExtraTreesRegressor__n_estimators=500)
Optimization Progress: 100%|██████████| 30/30 [00:58<00:00, 1.41s/pipeline]_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
Optimization Progress: 100%|██████████| 30/30 [00:58<00:00, 1.41s/pipeline]_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
1.01 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
2021-03-15 20:53:51 INFO TPOTAdaptor: Finished fitting.
2021-03-15 20:53:51 INFO MatPipe successfully fit.
Our MatPipe is now fit. Let's predict our test data with MatPipe.predict. This should only take a few minutes.
prediction_df = pipe.predict(prediction_df)
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
Result:
2020-07-27 14:36:25 INFO Beginning MatPipe prediction using fitted pipeline.
2020-07-27 14:36:25 INFO AutoFeaturizer: Starting transforming.
2020-07-27 14:36:25 INFO AutoFeaturizer: Adding compositions from structures.
2020-07-27 14:36:25 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:09 INFO AutoFeaturizer: Guessing oxidation states of compositions, as they were not present in input.
2020-07-27 14:37:15 INFO AutoFeaturizer: Featurizing with ElementProperty.
2020-07-27 14:37:19 INFO AutoFeaturizer: Guessing oxidation states of structures if they were not present in input.
2020-07-27 14:37:22 INFO AutoFeaturizer: Featurizing with SineCoulombMatrix.
2020-07-27 14:37:28 INFO AutoFeaturizer: Featurizer type bandstructure not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO AutoFeaturizer: Featurizer type dos not in the dataframe. Skipping...
2020-07-27 14:37:28 INFO AutoFeaturizer: Finished transforming.
2020-07-27 14:37:28 INFO DataCleaner: Starting transforming.
2020-07-27 14:37:28 INFO DataCleaner: Cleaning with respect to samples with sample na_method 'fill'
2020-07-27 14:37:28 INFO DataCleaner: Replacing infinite values with nan for easier screening.
2020-07-27 14:37:28 INFO DataCleaner: Before handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO DataCleaner: After handling na: 953 samples, 420 features
2020-07-27 14:37:28 INFO DataCleaner: Target not found in df columns. Ignoring...
2020-07-27 14:37:28 INFO DataCleaner: Finished transforming.
2020-07-27 14:37:28 INFO FeatureReducer: Starting transforming.
2020-07-27 14:37:28 WARNING FeatureReducer: Target not found in columns to transform.
2020-07-27 14:37:28 INFO FeatureReducer: Finished transforming.
2020-07-27 14:37:28 INFO TPOTAdaptor: Starting predicting.
2020-07-27 14:37:28 INFO TPOTAdaptor: Prediction finished successfully.
2020-07-27 14:37:28 INFO TPOTAdaptor: Finished predicting.
2020-07-27 14:37:28 INFO MatPipe prediction completed.
MatPipe将预测放在一个名为“{target} predicted”的列中:
prediction_df.head()
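由上可知,预测列名由目标名加上 " predicted" 后缀构成。下面用一个小片段演示如何据此取出预测列(目标名 "n" 来自上文,数值为假设的演示数据):

```python
import pandas as pd

# 演示:预测列名 = 目标名 + " predicted"(数值为假设的演示数据)
target = "n"
pred_col = target + " predicted"
prediction_df = pd.DataFrame({pred_col: [1.95, 3.30, 1.66]})
print(prediction_df[pred_col].tolist())  # [1.95, 3.3, 1.66]
```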
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
    print(prediction_df.head())
结果:
MagpieData range AtomicWeight ... n predicted
1802 102.710600 ... 1.951822
1881 15.189000 ... 3.295348
1288 49.380600 ... 1.656971
4490 0.000000 ... 4.706100
32 32.572238 ... 2.754411
现在让我们用平均绝对误差(MAE)给预测打分,并与 scikit-learn 的虚拟回归器(DummyRegressor)基线作比较。
from matminer.datasets.dataset_retrieval import load_dataset
import pymatgen
from sklearn.model_selection import train_test_split
from automatminer import MatPipe
from sklearn.metrics import mean_absolute_error
from sklearn.dummy import DummyRegressor
if __name__ == '__main__':
    df = load_dataset("matbench_dielectric")
    train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=20191014)
    target = "n"
    prediction_df = test_df.drop(columns=[target])
    pipe = MatPipe.from_preset("debug")
    pipe.fit(train_df, target)
    prediction_df = pipe.predict(prediction_df)
    # fit the dummy
    dr = DummyRegressor()
    dr.fit(train_df["structure"], train_df[target])
    dummy_test = dr.predict(test_df["structure"])
    # Score dummy and MatPipe
    true = test_df[target]
    matpipe_test = prediction_df[target + " predicted"]
    mae_matpipe = mean_absolute_error(true, matpipe_test)
    mae_dummy = mean_absolute_error(true, dummy_test)
    print("Dummy MAE: {}".format(mae_dummy))
    print("MatPipe MAE: {}".format(mae_matpipe))
结果:
Dummy MAE: 0.7772666142371938
MatPipe MAE: 0.5030822760911582
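MAE(平均绝对误差)的定义可以用一个极简示例核对(数值纯属演示,与上面 matbench_dielectric 的结果无关):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# 演示数据(假设值)
true = np.array([2.0, 3.0, 1.5])
pred = np.array([2.5, 2.0, 1.5])

mae = mean_absolute_error(true, pred)
# MAE 即绝对误差的均值
assert mae == np.mean(np.abs(true - pred))
print(mae)  # 0.5
```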
可以通过 dict/文本摘要检查 MatPipe 的内部:MatPipe.inspect 给出完整详尽的版本(列出所有内部属性的全名),MatPipe.summarize 则给出简明的执行摘要。
import pprint
# Get a summary and save a copy to json
summary = pipe.summarize(filename="MatPipe_predict_experimental_gap_from_composition_summary.json")
pprint.pprint(summary)
结果:
{'data_cleaning': {'drop_na_targets': 'True',
'encoder': 'one-hot',
'feature_na_method': 'drop',
'na_method_fit': 'drop',
'na_method_transform': 'fill'},
'feature_reduction': {'reducer_params': "{'tree': {'importance_percentile': "
"0.9, 'mode': 'regression', "
"'random_state': 0}}",
'reducers': "('corr', 'tree')"},
'features': ['MagpieData range AtomicWeight',
'MagpieData avg_dev AtomicWeight',
'MagpieData mean MeltingT',
'MagpieData maximum Electronegativity',
'MagpieData mean Electronegativity',
'MagpieData avg_dev Electronegativity',
'MagpieData avg_dev NUnfilled',
'MagpieData mean GSvolume_pa',
'sine coulomb matrix eig 0',
'sine coulomb matrix eig 6',
'sine coulomb matrix eig 7'],
'featurizers': {'bandstructure': [BandFeaturizer()],
'composition': [ElementProperty(data_source=,
features=['Number', 'MendeleevNumber', 'AtomicWeight',
'MeltingT', 'Column', 'Row', 'CovalentRadius',
'Electronegativity', 'NsValence', 'NpValence',
'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
'GSvolume_pa', 'GSbandgap', 'GSmagmom',
'SpaceGroupNumber'],
stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
'mode'])],
'dos': [DOSFeaturizer()],
'structure': [SineCoulombMatrix()]},
'ml_model': 'Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),\n'
" steps=[('selectpercentile',\n"
' SelectPercentile(percentile=23,\n'
' score_func=)),\n'
" ('robustscaler', RobustScaler()),\n"
" ('randomforestregressor',\n"
' RandomForestRegressor(bootstrap=False, '
'max_features=0.05,\n'
' min_samples_leaf=7, '
'min_samples_split=5,\n'
' n_estimators=20))])'}
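summarize(filename=...) 同时会把摘要保存为 JSON 文件,之后可以用标准库 json 读回。下面是一个示意(为保证示例可独立运行,先写入一个玩具摘要;文件名为假设的演示名):

```python
import json

# 写入一个玩具摘要,模拟 summarize(filename=...) 保存的文件(内容为假设)
toy_summary = {"features": ["MagpieData mean MeltingT"],
               "ml_model": "RandomForestRegressor(...)"}
fname = "matpipe_summary_demo.json"  # 假设的演示文件名
with open(fname, "w") as f:
    json.dump(toy_summary, f)

# 读回摘要并检查其中的特征列表
with open(fname) as f:
    reloaded = json.load(f)
print(reloaded["features"])  # ['MagpieData mean MeltingT']
```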
# Explain the MatPipe's internals more comprehensively
details = pipe.inspect(filename="MatPipe_predict_experimental_gap_from_composition_details.json")
print(details)
结果:
{'autofeaturizer': {'autofeaturizer': {'cache_src': None, 'preset': 'debug',
    'featurizers': {'composition': [ElementProperty(...)],
                    'structure': [SineCoulombMatrix()],
                    'bandstructure': [BandFeaturizer()],
                    'dos': [DOSFeaturizer()]},
    'exclude': [], 'functionalize': False, 'ignore_cols': [],
    'fitted_input_df': {'obj': , 'columns': 2, 'samples': 3811},
    'converted_input_df': {'obj': , 'columns': 3, 'samples': 3811},
    'ignore_errors': True, 'drop_inputs': True, 'multiindex': False,
    'do_precheck': True, 'n_jobs': 2, 'guess_oxistates': True,
    'features': ['MagpieData minimum Number', 'MagpieData maximum Number',
                 ..., 'sine coulomb matrix eig 287'],
    'auto_featurizer': True, 'removed_featurizers': [],
    'composition_col': 'composition', 'structure_col': 'structure',
    'bandstruct_col': 'bandstructure', 'dos_col': 'dos', 'is_fit': True,
    'fittable_fcls': {'BagofBonds', 'PartialRadialDistributionFunction', 'BondFractions'},
    'needs_fit': False, 'min_precheck_frac': 0.9}},
 'cleaner': {'cleaner': {'max_na_frac': 0.01, 'feature_na_method': 'drop',
    'encoder': 'one-hot', 'encode_categories': True, 'drop_na_targets': True,
    'na_method_fit': 'drop', 'na_method_transform': 'fill',
    'dropped_features': [], 'object_cols': [],
    'number_cols': ['MagpieData minimum Number', ..., 'sine coulomb matrix eig 287'],
    'fitted_df': {'obj': , 'columns': 421, 'samples': 3811},
    'fitted_target': 'n',
    'dropped_samples': {'obj': , 'columns': 421, 'samples': 0},
    'max_problem_col_warning_threshold': 0.3, 'warnings': [], 'is_fit': True}},
 'reducer': {'reducer': {'reducers': ('corr', 'tree'), 'corr_threshold': 0.95,
    'n_pca_features': 'auto', 'tree_importance_percentile': 0.9,
    'n_rebate_features': 0.3, '_keep_features': [], '_remove_features': [],
    'removed_features': {'corr': ['MagpieData range Number',
                                  'MagpieData mean Number', ...], ...},
    ...}},
 ...}
(inspect 的完整输出非常长,会逐项列出全部 420 个特征名以及各内部组件的所有属性;上面仅保留结构性的开头部分,冗长的特征列表已用省略号替代。)
matrix eig 149', 'sine coulomb matrix eig 150', 'sine coulomb matrix eig 152', 'sine coulomb matrix eig 153', 'sine coulomb matrix eig 154', 'sine coulomb matrix eig 155', 'sine coulomb matrix eig 156', 'sine coulomb matrix eig 157', 'sine coulomb matrix eig 158', 'sine coulomb matrix eig 159', 'sine coulomb matrix eig 160', 'sine coulomb matrix eig 161', 'sine coulomb matrix eig 162', 'sine coulomb matrix eig 163', 'sine coulomb matrix eig 164', 'sine coulomb matrix eig 165', 'sine coulomb matrix eig 166', 'sine coulomb matrix eig 167', 'sine coulomb matrix eig 168', 'sine coulomb matrix eig 169', 'sine coulomb matrix eig 170', 'sine coulomb matrix eig 171', 'sine coulomb matrix eig 172', 'sine coulomb matrix eig 173', 'sine coulomb matrix eig 174', 'sine coulomb matrix eig 175', 'sine coulomb matrix eig 176', 'sine coulomb matrix eig 177', 'sine coulomb matrix eig 178', 'sine coulomb matrix eig 179', 'sine coulomb matrix eig 180', 'sine coulomb matrix eig 181', 'sine coulomb matrix eig 182', 'sine coulomb matrix eig 183', 'sine coulomb matrix eig 184', 'sine coulomb matrix eig 185', 'sine coulomb matrix eig 186', 'sine coulomb matrix eig 187', 'sine coulomb matrix eig 188', 'sine coulomb matrix eig 189', 'sine coulomb matrix eig 190', 'sine coulomb matrix eig 191', 'sine coulomb matrix eig 192', 'sine coulomb matrix eig 193', 'sine coulomb matrix eig 194', 'sine coulomb matrix eig 195', 'sine coulomb matrix eig 196', 'sine coulomb matrix eig 197', 'sine coulomb matrix eig 198', 'sine coulomb matrix eig 199', 'sine coulomb matrix eig 200', 'sine coulomb matrix eig 201', 'sine coulomb matrix eig 202', 'sine coulomb matrix eig 203', 'sine coulomb matrix eig 204', 'sine coulomb matrix eig 205', 'sine coulomb matrix eig 206', 'sine coulomb matrix eig 207', 'sine coulomb matrix eig 208', 'sine coulomb matrix eig 209', 'sine coulomb matrix eig 210', 'sine coulomb matrix eig 211', 'sine coulomb matrix eig 212', 'sine coulomb matrix eig 213', 'sine coulomb matrix eig 
214', 'sine coulomb matrix eig 215', 'sine coulomb matrix eig 216', 'sine coulomb matrix eig 217', 'sine coulomb matrix eig 218', 'sine coulomb matrix eig 219', 'sine coulomb matrix eig 220', 'sine coulomb matrix eig 221', 'sine coulomb matrix eig 222', 'sine coulomb matrix eig 223', 'sine coulomb matrix eig 224', 'sine coulomb matrix eig 225', 'sine coulomb matrix eig 226', 'sine coulomb matrix eig 227', 'sine coulomb matrix eig 228', 'sine coulomb matrix eig 229', 'sine coulomb matrix eig 230', 'sine coulomb matrix eig 231', 'sine coulomb matrix eig 232', 'sine coulomb matrix eig 233', 'sine coulomb matrix eig 234', 'sine coulomb matrix eig 235', 'sine coulomb matrix eig 236', 'sine coulomb matrix eig 237', 'sine coulomb matrix eig 238', 'sine coulomb matrix eig 239', 'sine coulomb matrix eig 240', 'sine coulomb matrix eig 241', 'sine coulomb matrix eig 242', 'sine coulomb matrix eig 243', 'sine coulomb matrix eig 244', 'sine coulomb matrix eig 245', 'sine coulomb matrix eig 246', 'sine coulomb matrix eig 247', 'sine coulomb matrix eig 248', 'sine coulomb matrix eig 249', 'sine coulomb matrix eig 250', 'sine coulomb matrix eig 251', 'sine coulomb matrix eig 252', 'sine coulomb matrix eig 253', 'sine coulomb matrix eig 254', 'sine coulomb matrix eig 255', 'sine coulomb matrix eig 256', 'sine coulomb matrix eig 257', 'sine coulomb matrix eig 258', 'sine coulomb matrix eig 259', 'sine coulomb matrix eig 260', 'sine coulomb matrix eig 261', 'sine coulomb matrix eig 262', 'sine coulomb matrix eig 263', 'sine coulomb matrix eig 264', 'sine coulomb matrix eig 265', 'sine coulomb matrix eig 266', 'sine coulomb matrix eig 267', 'sine coulomb matrix eig 268', 'sine coulomb matrix eig 269', 'sine coulomb matrix eig 270', 'sine coulomb matrix eig 271', 'sine coulomb matrix eig 272', 'sine coulomb matrix eig 273', 'sine coulomb matrix eig 274', 'sine coulomb matrix eig 275', 'sine coulomb matrix eig 276', 'sine coulomb matrix eig 277', 'sine coulomb matrix eig 278', 'sine 
coulomb matrix eig 279', 'sine coulomb matrix eig 280', 'sine coulomb matrix eig 281', 'sine coulomb matrix eig 282', 'sine coulomb matrix eig 283', 'sine coulomb matrix eig 284', 'sine coulomb matrix eig 285', 'sine coulomb matrix eig 286', 'sine coulomb matrix eig 287'], 'tree': ['MagpieData minimum Number', 'MagpieData maximum Number', 'MagpieData mode Number', 'MagpieData maximum MendeleevNumber', 'MagpieData range MendeleevNumber', 'MagpieData mean MendeleevNumber', 'MagpieData avg_dev MendeleevNumber', 'MagpieData mode MendeleevNumber', 'MagpieData minimum MeltingT', 'MagpieData range MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT', 'MagpieData maximum Column', 'MagpieData range Column', 'MagpieData mean Column', 'MagpieData avg_dev Column', 'MagpieData mode Column', 'MagpieData minimum Row', 'MagpieData maximum Row', 'MagpieData range Row', 'MagpieData mean Row', 'MagpieData avg_dev Row', 'MagpieData mode Row', 'MagpieData minimum CovalentRadius', 'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData mean CovalentRadius', 'MagpieData avg_dev CovalentRadius', 'MagpieData mode CovalentRadius', 'MagpieData minimum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mode Electronegativity', 'MagpieData minimum NsValence', 'MagpieData maximum NsValence', 'MagpieData mode NsValence', 'MagpieData minimum NpValence', 'MagpieData maximum NpValence', 'MagpieData range NpValence', 'MagpieData mean NpValence', 'MagpieData avg_dev NpValence', 'MagpieData mode NpValence', 'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence', 'MagpieData avg_dev NdValence', 'MagpieData mode NdValence', 'MagpieData minimum NfValence', 'MagpieData maximum NfValence', 'MagpieData mean NfValence', 'MagpieData avg_dev NfValence', 'MagpieData mode NfValence', 'MagpieData minimum NValence', 'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData 
mean NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence', 'MagpieData maximum NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled', 'MagpieData mode NsUnfilled', 'MagpieData minimum NpUnfilled', 'MagpieData maximum NpUnfilled', 'MagpieData range NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled', 'MagpieData mode NpUnfilled', 'MagpieData minimum NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData mode NdUnfilled', 'MagpieData minimum NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled', 'MagpieData minimum NUnfilled', 'MagpieData maximum NUnfilled', 'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData mode NUnfilled', 'MagpieData minimum GSvolume_pa', 'MagpieData range GSvolume_pa', 'MagpieData avg_dev GSvolume_pa', 'MagpieData mode GSvolume_pa', 'MagpieData minimum GSbandgap', 'MagpieData maximum GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData mode GSbandgap', 'MagpieData minimum GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData mode GSmagmom', 'MagpieData minimum SpaceGroupNumber', 'MagpieData maximum SpaceGroupNumber', 'MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber', 'sine coulomb matrix eig 1', 'sine coulomb matrix eig 2', 'sine coulomb matrix eig 3', 'sine coulomb matrix eig 4', 'sine coulomb matrix eig 5', 'sine coulomb matrix eig 8', 'sine coulomb matrix eig 9', 'sine coulomb matrix eig 10', 'sine coulomb matrix eig 12', 'sine coulomb matrix eig 13', 'sine coulomb matrix eig 14', 'sine coulomb matrix eig 16', 'sine coulomb matrix eig 18', 'sine coulomb matrix eig 19', 'sine coulomb matrix eig 20', 'sine coulomb matrix eig 24', 'sine coulomb matrix eig 31', 'sine coulomb matrix eig 34', 'sine coulomb matrix eig 36', 'sine coulomb matrix eig 41', 'sine coulomb matrix eig 42', 'sine coulomb matrix eig 46', 'sine coulomb matrix eig 69', 'sine coulomb 
matrix eig 75', 'sine coulomb matrix eig 81', 'sine coulomb matrix eig 82', 'sine coulomb matrix eig 85', 'sine coulomb matrix eig 99', 'sine coulomb matrix eig 151']}, 'retained_features': ['sine coulomb matrix eig 6', 'MagpieData range AtomicWeight', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'MagpieData mean MeltingT', 'MagpieData mean Electronegativity', 'MagpieData avg_dev AtomicWeight', 'MagpieData maximum Electronegativity', 'sine coulomb matrix eig 7', 'MagpieData avg_dev Electronegativity'], 'reducer_params': {'tree': {'importance_percentile': 0.9, 'mode': 'regression', 'random_state': 0}}, '_pca': None, '_pca_feats': None, 'is_fit': True}}, 'learner': {'learner': {'mode': 'regression', 'tpot_kwargs': {'max_time_mins': 1, 'max_eval_time_mins': 1, 'population_size': 10, 'n_jobs': 2, 'cv': 5, 'verbosity': 3, 'memory': 'auto', 'template': 'Selector-Transformer-Regressor', 'config_dict': {'sklearn.linear_model.ElasticNetCV': {'l1_ratio': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'tol': [1e-05, 0.0001, 0.001, 0.01, 0.1]}, 'sklearn.ensemble.ExtraTreesRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'bootstrap': [True, False]}, 'sklearn.ensemble.GradientBoostingRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'loss': ['ls', 'lad', 'huber', 'quantile'], 'learning_rate': [0.01, 0.1, 0.5, 1.0], 'max_depth': range(1, 11, 2), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'subsample': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'max_features': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'alpha': [0.75, 0.8, 0.85, 0.9, 0.95, 0.99]}, 'sklearn.tree.DecisionTreeRegressor': {'max_depth': range(1, 11, 2), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3)}, 'sklearn.neighbors.KNeighborsRegressor': {'n_neighbors': range(1, 101), 'weights': ['uniform', 'distance'], 'p': [1, 2]}, 'sklearn.linear_model.LassoLarsCV': {'normalize': [True, False]}, 'sklearn.svm.LinearSVR': {'loss': ['epsilon_insensitive', 'squared_epsilon_insensitive'], 'dual': [True, False], 'tol': [1e-05, 0.0001, 0.001, 0.01, 0.1], 'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 15.0, 20.0, 25.0], 'epsilon': [0.0001, 0.001, 0.01, 0.1, 1.0]}, 'sklearn.ensemble.RandomForestRegressor': {'n_estimators': [20, 100, 200, 500, 1000], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]), 'min_samples_split': range(2, 21, 3), 'min_samples_leaf': range(1, 21, 3), 'bootstrap': [True, False]}, 'sklearn.linear_model.RidgeCV': {}, 'sklearn.preprocessing.Binarizer': {'threshold': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])}, 'sklearn.decomposition.FastICA': {'tol': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])}, 'sklearn.cluster.FeatureAgglomeration': {'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']}, 'sklearn.preprocessing.MaxAbsScaler': {}, 'sklearn.preprocessing.MinMaxScaler': {}, 'sklearn.preprocessing.Normalizer': {'norm': ['l1', 'l2', 'max']}, 'sklearn.kernel_approximation.Nystroem': {'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'], 'gamma': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'n_components': range(1, 11)}, 'sklearn.decomposition.PCA': {'svd_solver': ['randomized'], 'iterated_power': range(1, 11)}, 'sklearn.preprocessing.PolynomialFeatures': {'degree': [2], 'include_bias': [False], 'interaction_only': [False]}, 'sklearn.kernel_approximation.RBFSampler': {'gamma': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])}, 'sklearn.preprocessing.RobustScaler': {}, 'sklearn.preprocessing.StandardScaler': {}, 'tpot.builtins.ZeroCount': {}, 'tpot.builtins.OneHotEncoder': {'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25], 'sparse': [False], 'threshold': [10]}, 'sklearn.feature_selection.SelectFwe': {'alpha': array([0. , 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008,
0.009, 0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017,
0.018, 0.019, 0.02 , 0.021, 0.022, 0.023, 0.024, 0.025, 0.026,
0.027, 0.028, 0.029, 0.03 , 0.031, 0.032, 0.033, 0.034, 0.035,
0.036, 0.037, 0.038, 0.039, 0.04 , 0.041, 0.042, 0.043, 0.044,
0.045, 0.046, 0.047, 0.048, 0.049]), 'score_func': {'sklearn.feature_selection.f_regression': None}}, 'sklearn.feature_selection.SelectPercentile': {'percentile': range(1, 100), 'score_func': {'sklearn.feature_selection.f_regression': None}}, 'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}, 'sklearn.feature_selection.SelectFromModel': {'threshold': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ]), 'estimator': {'sklearn.ensemble.ExtraTreesRegressor': {'n_estimators': [100], 'max_features': array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])}}}}, 'scoring': 'neg_mean_absolute_error'}, 'models': None, 'random_state': None, 'greater_score_is_better': None, '_fitted_target': 'n', '_backend': TPOTRegressor(config_dict={'sklearn.cluster.FeatureAgglomeration': {'affinity': ['euclidean',
'l1',
'l2',
'manhattan',
'cosine'],
'linkage': ['ward',
'complete',
'average']},
'sklearn.decomposition.FastICA': {'tol': array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])},
'sklearn.decomposition.PCA': {'iterated_power'...
'tpot.builtins.OneHotEncoder': {'minimum_fraction': [0.05,
0.1,
0.15,
0.2,
0.25],
'sparse': [False],
'threshold': [10]},
'tpot.builtins.ZeroCount': {}},
log_file=,
max_eval_time_mins=1, max_time_mins=1, memory='auto', n_jobs=2,
population_size=10, scoring='neg_mean_absolute_error',
template='Selector-Transformer-Regressor', verbosity=3), '_features': ['MagpieData range AtomicWeight', 'MagpieData avg_dev AtomicWeight', 'MagpieData mean MeltingT', 'MagpieData maximum Electronegativity', 'MagpieData mean Electronegativity', 'MagpieData avg_dev Electronegativity', 'MagpieData avg_dev NUnfilled', 'MagpieData mean GSvolume_pa', 'sine coulomb matrix eig 0', 'sine coulomb matrix eig 6', 'sine coulomb matrix eig 7'], 'from_serialized': False, '_best_models': None, 'is_fit': True}}, 'pre_fit_df': {'obj': , 'columns': 2, 'samples': 3811}, 'post_fit_df': {'obj': , 'columns': 12, 'samples': 3811}, 'ml_type': 'regression', 'target': 'n', 'version': '1.0.3.20191111', 'is_fit': True}
You can also access MatPipe's internal objects directly, rather than going through a text digest; you just need to know which attributes to access. See the online API documentation or the source code for more information.
# Access some attributes of MatPipe directly, instead of via a text digest
print(pipe.learner.best_pipeline)
Output:
Pipeline(memory=Memory(location=/var/folders/x6/mzkjfgpx3m9cr_6mcy9759qw0000gn/T/tmps0ji7j_y/joblib),
steps=[('selectpercentile',
SelectPercentile(percentile=23,
score_func=<function f_regression at ...>)),
('robustscaler', RobustScaler()),
('randomforestregressor',
RandomForestRegressor(bootstrap=False, max_features=0.05,
min_samples_leaf=7, min_samples_split=5,
n_estimators=20))])
print(pipe.autofeaturizer.featurizers["composition"])
Output:
[ElementProperty(data_source=<matminer.utils.data.MagpieData object at ...>,
features=['Number', 'MendeleevNumber', 'AtomicWeight',
'MeltingT', 'Column', 'Row', 'CovalentRadius',
'Electronegativity', 'NsValence', 'NpValence',
'NdValence', 'NfValence', 'NValence', 'NsUnfilled',
'NpUnfilled', 'NdUnfilled', 'NfUnfilled', 'NUnfilled',
'GSvolume_pa', 'GSbandgap', 'GSmagmom',
'SpaceGroupNumber'],
stats=['minimum', 'maximum', 'range', 'mean', 'avg_dev',
'mode'])]
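As a quick sanity check on the output above: the `ElementProperty` featurizer combines each of the 22 listed elemental properties with each of the 6 listed statistics, which is where the many `MagpieData <stat> <property>` columns in the earlier digest come from.

```python
# Feature count implied by the printed ElementProperty configuration:
# 22 elemental properties x 6 statistics per property.
n_properties = 22  # Number, MendeleevNumber, ..., SpaceGroupNumber
n_stats = 6        # minimum, maximum, range, mean, avg_dev, mode
print(n_properties * n_stats)  # 132 composition features from this featurizer alone
```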
print(pipe.autofeaturizer.featurizers["structure"])
Output:
[SineCoulombMatrix()]
Being able to reproduce your results is an important aspect of materials informatics. MatPipe provides convenient methods for saving and loading entire pipelines so that others can use them.
Save a MatPipe for later use with MatPipe.save, and load it back with MatPipe.load.
filename = "MatPipe_predict_experimental_gap_from_composition.p"
pipe.save(filename)
pipe_loaded = MatPipe.load(filename)
Output:
2020-07-27 14:37:33 INFO Loaded MatPipe from file MatPipe_predict_experimental_gap_from_composition.p.
2020-07-27 14:37:33 WARNING Only use this model to make predictions (do not retrain!). Backend was serialized as only the top model, not the full automl backend.
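Under the hood, `MatPipe.save` serializes the fitted pipeline object to disk and `MatPipe.load` deserializes it (note the `.p` pickle extension above). The round trip can be sketched with Python's standard `pickle` module; this is a minimal sketch using a stand-in dict, not automatminer's actual implementation:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted pipeline object; MatPipe.save/load
# follow the same serialize-to-disk pattern (a sketch, not automatminer code).
fitted = {"target": "n", "ml_type": "regression", "is_fit": True}

filename = os.path.join(tempfile.mkdtemp(), "MatPipe_sketch.p")

with open(filename, "wb") as f:
    pickle.dump(fitted, f)   # analogous to pipe.save(filename)

with open(filename, "rb") as f:
    loaded = pickle.load(f)  # analogous to MatPipe.load(filename)

print(loaded == fitted)  # True: the loaded object matches what was saved
```

As the warning in the log output indicates, a loaded pipeline should only be used for making predictions, not for retraining, because only the top model (not the full automl backend) is serialized.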