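This article describes how to convert various kinds of models into PMML files. The online model prediction service needs to support the following kinds of models:
Take LogisticRegressionModel as an example and check whether it has a native method for generating PMML. In the current version (spark-mllib_2.11:2.3.2) there are two implementations of this class: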
extends org.apache.spark.mllib.regression.GeneralizedLinearModel with org.apache.spark.mllib.classification.ClassificationModel with scala.Serializable with org.apache.spark.mllib.util.Saveable with org.apache.spark.mllib.pmml.PMMLExportable
PMMLExportable provides the corresponding toPMML method, as shown in the figure below:
extends org.apache.spark.ml.classification.ProbabilisticClassificationModel[org.apache.spark.ml.linalg.Vector, org.apache.spark.ml.classification.LogisticRegressionModel] with org.apache.spark.ml.classification.LogisticRegressionParams with org.apache.spark.ml.util.MLWritable
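The spark.ml version does not mix in PMMLExportable, so the rest of this part looks at how to generate a PMML file from a model under spark.ml.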
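import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.ml.Pipeline
import org.jpmml.model.MetroJAXBUtil
import org.jpmml.sparkml.PMMLBuilder

// Fit the pipeline, then build the PMML object from the training schema and the fitted model
val pipeline = new Pipeline().setStages(Array(vectorAssembler, lr))
val pipelineModel = pipeline.fit(featuresAndLabelDF)
val pmml = new PMMLBuilder(featuresAndLabelDF.schema, pipelineModel).build()
// Write the PMML file through the Hadoop FileSystem API, replacing any existing file at that path
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val path = new Path(outPutPMMLPath)
if (fs.exists(path)) {
  fs.delete(path, true)
}
val out = fs.create(path)
try MetroJAXBUtil.marshal(pmml, out) finally out.close()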
User class threw exception: java.lang.IllegalArgumentException: Expected string, integral, double or boolean data type, got vector data type
// 1. Define a UDF that converts the Vector type to an Array type:
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val vecToArray = udf((x: linalg.Vector) => x.toArray)
val dfArr = df.withColumn("featuresArr", vecToArray(col(dp.trainFeatures)))
// 2. Generate placeholder column names according to the vector size, and build the DataFrame
val elements = Array("col0","col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9",
"col10", "col11", "col12", "col13", "col14", "col15", "col16")
val sqlExpr = elements.zipWithIndex.map{
case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
val sqlExprWithLabel = sqlExpr.+:(col(labelCol))
val featuresAndLabelDF = dfArr.select(sqlExprWithLabel: _*)
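To make the effect of the two steps above concrete, here is a plain-Python sketch of the same reshaping, with no Spark involved: one vector-valued field becomes one scalar field per element, which matches the scalar data types listed in the exception message. The function name `flatten_features` and the `features` / `label` keys are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Sequence


def flatten_features(rows: Sequence[dict], features_key: str = "features",
                     label_key: str = "label") -> List[Dict[str, float]]:
    """Expand a list-valued field into scalar fields col0, col1, ..."""
    flat_rows = []
    for row in rows:
        flat = {label_key: row[label_key]}
        # One scalar column per vector element, named by its position.
        for idx, value in enumerate(row[features_key]):
            flat[f"col{idx}"] = value
        flat_rows.append(flat)
    return flat_rows


if __name__ == "__main__":
    sample = [{"label": 1.0, "features": [0.5, 2.0, 3.5]}]
    print(flatten_features(sample))
```

Deriving the names from the element position also avoids hard-coding a fixed list of 17 aliases when the vector size changes.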
- Header
- DataDictionary
  - DataField
- RegressionModel
  - MiningSchema
  - Output
  - RegressionTable
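In a PMML document these elements are nested: DataField entries sit inside DataDictionary, and MiningSchema, Output and RegressionTable sit inside RegressionModel. As a rough way to inspect that layout, the sketch below walks a small hand-written PMML snippet with the standard library XML parser. The snippet is an assumption made for illustration and is much shorter than a real export; the helper names `local_name` and `outline` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written minimal document for illustration only; a real export carries more content.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header/>
  <DataDictionary>
    <DataField name="label" optype="categorical" dataType="integer"/>
    <DataField name="col0" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <MiningSchema/>
    <Output/>
    <RegressionTable intercept="0.0" targetCategory="1"/>
    <RegressionTable intercept="0.0" targetCategory="0"/>
  </RegressionModel>
</PMML>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def outline(xml_text: str, max_depth: int = 2) -> list:
    """Return (depth, element name) pairs for the top levels of the tree."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(node, depth):
        if depth > max_depth:
            return
        result.append((depth, local_name(node.tag)))
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return result


if __name__ == "__main__":
    for depth, name in outline(SAMPLE_PMML):
        print("  " * depth + name)
```

Running the same `outline` function on the text of a generated PMML file is a quick check that the export contains the expected sections.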
import os
os.environ["PATH"] = '/opt/soft/jdk/jdk1.8.0_66/bin' + os.pathsep + os.environ["PATH"]
print(os.environ["PATH"])
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
lr = LogisticRegression(penalty='l2', tol=0.000001, C=0.01, max_iter=10, random_state=9999)
pipeline = PMMLPipeline([("classifier", lr)])
pipeline.fit(train_X, train_Y)
sklearn2pmml(pipeline, "demo.pmml", with_repr = True)
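Key point 1: if the PATH environment variable also contains a JDK 1.7 entry, put the JDK 1.8 entry in front of it.
Key point 2: the model (lr) must be wrapped in a PMMLPipeline.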
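For key point 1, a small sketch of how the PATH ordering could be enforced and checked from Python. It moves a given directory to the front of a PATH string, dropping any later duplicate, and then asks which `java` that PATH resolves to. The helper names `prepend_to_path` and `first_java` are hypothetical; the JDK directory is the one used in the example above.

```python
import os
import shutil
from typing import Optional


def prepend_to_path(directory: str, path_value: str) -> str:
    """Return path_value with directory moved (or added) to the front."""
    entries = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + entries)


def first_java(path_value: str) -> Optional[str]:
    """Return the java executable that this PATH value resolves to, or None."""
    return shutil.which("java", path=path_value)


if __name__ == "__main__":
    jdk_bin = "/opt/soft/jdk/jdk1.8.0_66/bin"  # directory taken from the example above
    os.environ["PATH"] = prepend_to_path(jdk_bin, os.environ.get("PATH", ""))
    print(first_java(os.environ["PATH"]))
```

If the printed location is not inside the JDK 1.8 directory, an earlier PATH entry is still shadowing it.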
Demo1: wrap the model in a PMMLPipeline and then save it
lr2 = LogisticRegression(penalty='l2', tol=0.000001, C=0.01, max_iter=10, random_state=9999)
pipeline = PMMLPipeline([("classifier", lr2)])
pipeline.fit(train_X, train_Y)
import joblib  # sklearn.externals.joblib no longer exists in scikit-learn 0.23 and later
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Demo2: save the native model with joblib
lr3 = LogisticRegression(penalty='l2', tol=0.000001, C=0.01, max_iter=10, random_state=9999)
lr3.fit(train_X, train_Y)
local_model_path = "lr3.pkl"
joblib.dump(lr3, local_model_path)
from sklearn.datasets import load_boston
from lightgbm import LGBMRegressor
boston = load_boston()
lgbm = LGBMRegressor(objective = "regression")
lgbm.fit(boston.data, boston.target, feature_name = boston.feature_names.tolist())
lgbm.booster_.save_model("lightgbm.txt")
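The text file written by `save_model` is not yet PMML. One common route, assumed here because the text does not name the tool, is the JPMML-LightGBM command-line converter, which reads the text model and writes a PMML file. The sketch below only assembles that command and, optionally, runs it. The jar path is a placeholder to replace with your own build, and the flag names should be checked against the README of the converter version in use.

```python
import subprocess
from typing import List


def build_convert_command(jar_path: str, model_txt: str, pmml_out: str) -> List[str]:
    """Assemble the converter invocation as an argument list (no shell quoting needed)."""
    return [
        "java", "-jar", jar_path,
        "--lgbm-input", model_txt,
        "--pmml-output", pmml_out,
    ]


def convert(jar_path: str, model_txt: str, pmml_out: str) -> None:
    """Run the converter; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_convert_command(jar_path, model_txt, pmml_out), check=True)


if __name__ == "__main__":
    # "jpmml-lightgbm-executable.jar" is a placeholder name, not a real artifact path.
    print(" ".join(build_convert_command(
        "jpmml-lightgbm-executable.jar", "lightgbm.txt", "lightgbm.pmml")))
```

Passing the arguments as a list keeps file names with spaces intact, which a single shell string would not.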
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
X, y = data.iloc[:,:-1],data.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
import os
import xgboost as xgb
os.environ["PATH"] = '/opt/soft/jdk/jdk1.8.0_66/bin' + os.pathsep + os.environ["PATH"]
print(os.environ["PATH"])
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
# Use an XGBoost model
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                          max_depth = 5, alpha = 10, n_estimators = 10)
pipeline = PMMLPipeline([("classifier", xg_reg)])
pipeline.fit(X_train,y_train)
sklearn2pmml(pipeline, "xgboost_sklearn2pmml.pmml", with_repr = True)
def create_feature_map(file_name, features):
    # One line per feature: index, name, and type ('i' for indicator, 'q' for quantity)
    with open(file_name, 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))
create_feature_map("xgboost.fmap", boston.feature_names)
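The feature map written above is a plain tab-separated file with one line per feature: index, name, and a type flag (`q` for quantity, `i` for indicator). As a sanity check before handing the file to a converter, the sketch below reads such a file back and verifies that the indices are consecutive starting at 0. The helper name `read_feature_map` is hypothetical.

```python
import os
import tempfile
from typing import List, Tuple


def read_feature_map(file_name: str) -> List[Tuple[int, str, str]]:
    """Parse 'index<TAB>name<TAB>type' lines and check that indices run 0, 1, 2, ..."""
    entries = []
    with open(file_name) as infile:
        for expected_idx, line in enumerate(infile):
            idx, name, ftype = line.rstrip("\n").split("\t")
            if int(idx) != expected_idx:
                raise ValueError(f"expected index {expected_idx}, got {idx}")
            entries.append((int(idx), name, ftype))
    return entries


if __name__ == "__main__":
    # Write a two-feature map in the same format, then read it back.
    path = os.path.join(tempfile.mkdtemp(), "demo.fmap")
    with open(path, "w") as outfile:
        outfile.write("0\tCRIM\tq\n1\tZN\tq\n")
    print(read_feature_map(path))
```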
Demo1:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
X, y = data.iloc[:,:-1],data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
max_depth = 5, reg_alpha = 10, n_estimators = 10)
xg_reg.fit(X_train, y_train)
xg_reg.save_model("xgboost_1.model")
Demo2:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
boston = load_boston()
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
data['PRICE'] = boston.target
X, y = data.iloc[:,:-1],data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
params = {
"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 5, 'alpha': 10}
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
xg_reg.save_model("xgboost_2.model")
Note: Demo1 uses the parameter reg_alpha, whereas Demo2 uses alpha.
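The difference comes from the two APIs: the scikit-learn wrapper (`XGBRegressor`) names the L1 regularization term `reg_alpha`, while the parameter dictionary passed to `xgb.train` uses `alpha`, and the number of trees moves from `n_estimators` into the `num_boost_round` argument. A minimal sketch of that translation follows. The helper `to_native_params` is hypothetical, the rename table covers only a small subset of parameters, and `default_rounds` is an arbitrary fallback chosen for this example.

```python
from typing import Dict, Tuple

# Wrapper-style name -> native parameter name (small illustrative subset only).
RENAMES = {"reg_alpha": "alpha", "reg_lambda": "lambda"}


def to_native_params(sklearn_kwargs: Dict[str, object],
                     default_rounds: int = 10) -> Tuple[Dict[str, object], int]:
    """Split wrapper-style keyword arguments into (params dict, num_boost_round)."""
    kwargs = dict(sklearn_kwargs)  # copy so the caller's dict is not modified
    num_boost_round = int(kwargs.pop("n_estimators", default_rounds))
    params = {RENAMES.get(key, key): value for key, value in kwargs.items()}
    return params, num_boost_round


if __name__ == "__main__":
    demo1_kwargs = {"objective": "reg:linear", "colsample_bytree": 0.3,
                    "learning_rate": 0.1, "max_depth": 5,
                    "reg_alpha": 10, "n_estimators": 10}
    print(to_native_params(demo1_kwargs))
```

Applied to the Demo1 arguments, this yields the same dictionary and round count that Demo2 passes to `xgb.train`.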