Handling null values in Spark RandomForest input features

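When training a RandomForest model, the input features include both numeric and string columns. Calling df.na.fill(0.0) on the whole DataFrame before fitting leads to a NullPointerException or to "Cannot have an empty string for name", because na.fill(0.0) only fills numeric columns and leaves nulls and empty strings in the string columns untouched.
If the string columns are not needed as features, simply exclude them.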

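import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.RFormula

// splitData and getRFModel are the author's own helpers (not shown here).
val (trainingData, testData) = splitData(featsDFWithLabel, trainingSampleRatio)
// Use every column as a feature except the ones subtracted below.
val formula = new RFormula()
      .setFormula("label ~ . - user_id - relation_type - industry - is_phonenum - relation_type_definite - type - category - mark - " +
                  "name - job_first_level - job_second_level - result - " +
                  "duration - phone_label - dt")
      .setFeaturesCol("features")
      .setLabelCol("label")
val pipelineModel: PipelineModel = getRFModel(formula, trainingData)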
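
Typing a long exclusion list by hand makes it easy to leave a stray "-" in the formula. As a minimal sketch, the formula string could be assembled from a list of column names instead. The object FormulaBuilder and the method excludeFormula are hypothetical names introduced here for illustration; they are not part of Spark.

```scala
object FormulaBuilder {
  // Builds an R-style formula of the form "label ~ . - colA - colB".
  // Blank names are dropped and duplicates are collapsed, so the result
  // never contains a dangling "-" operator.
  def excludeFormula(label: String, excluded: Seq[String]): String = {
    val cols = excluded.map(_.trim).filter(_.nonEmpty).distinct
    if (cols.isEmpty) s"$label ~ ."
    else s"$label ~ . - ${cols.mkString(" - ")}"
  }
}
```

The returned string can then be passed to RFormula's setFormula in place of the hand-written literal.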

If string columns are needed as features, they have to be handled separately from the numeric ones.

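val trainingDFNew = trainingDF
      // Nulls in string columns become a placeholder category; numeric nulls become 0.0.
      .na.fill(Map("industry" -> "empty", "category" -> "empty", "phone_label" -> "empty"))
      .na.fill(0.0)
      // Empty strings must be replaced as well, or one-hot encoding fails.
      .na.replace("industry", Map("" -> "empty"))
      .na.replace("category", Map("" -> "empty"))
      .na.replace("phone_label", Map("" -> "empty"))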

Error:

ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.

If a string column contains empty values, those must be handled too; otherwise one-hot encoding throws this error.
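
Listing every string column by hand does not scale once the schema changes. The following is only a sketch, assuming Spark SQL is on the classpath: it reads the string columns from the schema and applies the same steps (placeholder for null, placeholder for empty string, 0.0 for numeric nulls). NullFill and fillForModel are hypothetical names, and the default placeholder "empty" is simply the value used in this post.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

object NullFill {
  // Hypothetical helper: cleans every string column found in the schema,
  // then fills the remaining numeric nulls with 0.0.
  def fillForModel(df: DataFrame, placeholder: String = "empty"): DataFrame = {
    val stringCols: Seq[String] =
      df.schema.fields.filter(_.dataType == StringType).map(_.name).toSeq
    if (stringCols.isEmpty) {
      df.na.fill(0.0)
    } else {
      df.na.fill(placeholder, stringCols)                 // null -> placeholder
        .na.replace(stringCols, Map("" -> placeholder))   // ""   -> placeholder
        .na.fill(0.0)                                     // numeric null -> 0.0
    }
  }
}
```

Calling NullFill.fillForModel(trainingDF) would then stand in for the chained na.fill / na.replace calls. String columns that should not become features still need to be excluded in the formula.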
