I've spent the last two weeks wrestling with xgboost4j-spark, and it nearly broke me. It's hard!
Below are the main problems I ran into, which I hope will be of some help to anyone just getting started with xgboost4j-spark. Feel free to leave a comment if you'd like to discuss.
The main issues:
1. First, check which Spark version(s) the cluster you'll be using actually runs. It may support both Spark 2.1 and Spark 2.3; if so, great.
2. Find out which version is used for online deployment or offline prediction. In my case only Spark 2.1 was available, and different clusters ran different versions.
3. Match the Spark version to the xgboost4j-spark version (this is critical; a mismatch produces all kinds of strange errors):
| Spark version | xgboost4j-spark version |
| --- | --- |
| 2.1 | 0.72 |
| 2.3+ | 0.72 ~ 0.90 |
If your Spark is 2.1, xgboost4j-spark must be 0.72; don't bother with anything higher, it most likely won't run. If your Spark is 2.3+, pick whatever suits you in the 0.72 ~ 0.90 range. I haven't tried 1.0.0 yet; the official docs appear to require Spark 2.4+ for it. At minimum, I have personally verified that Spark 2.1 with xgboost4j-spark 0.72 and Spark 2.3 with xgboost4j-spark 0.90 both run, in local debugging and on the cluster. (One more note: early stopping in 0.72 has a bug.)
4. Dependency configuration in the project's pom.xml
Here I take Spark 2.3 as the example; adjust to your actual setup (a sketch of the dependency block follows below). Once it's configured, right-click pom.xml --> Maven --> reimport.
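The original dependency snippet didn't survive formatting here, so below is a minimal sketch of what it would look like, assuming Spark 2.3 on Scala 2.11 with xgboost4j-spark 0.90; swap in your own versions.

<dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j-spark</artifactId>
    <version>0.90</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.3.0</version>
</dependency>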
5. Running xgboost4j-spark in practice
1) The parameter ("num_class" -> 2) is meant for multi-class problems. If yours is a binary classification problem, don't set it at all; only set ("objective" -> "binary:logistic"). Otherwise you get a strange error where the size of preds is exactly twice the size of label (the error below; a small params sketch follows it). For details see https://www.gitmemory.com/issue/dmlc/xgboost/4552/503344233
ml.dmlc.xgboost4j.java.XGBoostError: [21:37:48] /xgboost/src/objective/regression_obj.cu:65: Check failed: preds.Size() == info.labels_.Size() (800 vs. 400) : labels are not correctly provided, preds.size=800, label.size=400
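To make it concrete, here is a minimal sketch of the failing versus the correct parameter map for a binary problem (values are illustrative, taken from the demo further down):

// Triggers the "preds.Size() == info.labels_.Size() (800 vs. 400)" check failure:
// num_class is set together with binary:logistic
val badParams = Map("objective" -> "binary:logistic", "num_class" -> 2, "num_round" -> 100)
// Correct for binary classification: omit num_class entirely
val goodParams = Map("objective" -> "binary:logistic", "num_round" -> 100)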
2) When submitting the job on the cluster, it may fail for no apparent reason. This can be caused by Spark speculative execution: a speculatively re-launched (and later killed) training task breaks the distributed XGBoost workers. Try the setting below; a sketch of setting it in code follows.
--conf spark.speculation=false
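If you'd rather pin this in code than on the command line, a minimal sketch, assuming you build the session yourself as in the demo below:

val spark = SparkSession.builder
  .appName("xgb")
  .config("spark.speculation", "false")  // turn off speculative re-launch of slow tasks
  .getOrCreate()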
6. A local xgboost4j-spark demo
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object XgbDemo {
  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getLogger("akka").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder.appName("xgb").master("local[*]").getOrCreate()
    import spark.implicits._

    // Load the CSV and name the columns: the first column is the label, the rest are features.
    val data: DataFrame = spark.read
      .option("inferSchema", true)
      .option("header", false)
      .csv("./data/Chapter6Data/admission1.csv")
      .toDF("label", "x1", "x2", "x3")

    // Assemble the feature columns into the single vector column XGBoost expects.
    val vectorAssembler: VectorAssembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")
    val train: DataFrame = vectorAssembler.transform(data)
    train.printSchema()
    train.show(2)
    println(train.count())

    // Binary classification: use binary:logistic and do NOT set num_class (see section 5).
    // Also note the parameter is "num_workers", not "num_works".
    val paramsMap = Map(
      "eta" -> 0.1f,
      "max_depth" -> 3,
      "objective" -> "binary:logistic",
      "num_round" -> 100,
      "num_workers" -> 1
    )

    val xgb = new XGBoostClassifier(paramsMap)
    xgb.setFeaturesCol("features")
    xgb.setLabelCol("label")

    val clf: XGBoostClassificationModel = xgb.fit(train)
    val trainPrediction: DataFrame = clf.transform(train)

    // Evaluate AUC on the training set; the label was inferred as Int from the CSV, so cast it.
    val scoreTrain: RDD[(Double, Double)] = trainPrediction.select($"prediction", $"label").rdd.map { row: Row =>
      val pred: Double = row.getDouble(0)
      val label: Double = row.getInt(1).toDouble
      (pred, label)
    }
    val trainMetric = new BinaryClassificationMetrics(scoreTrain)
    val trainAuc: Double = trainMetric.areaUnderROC()
    println("@@@ xgb score info : \n" + "train AUC :" + trainAuc)
  }
}
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,646 INFO [89] train-error:0.192500
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,646 INFO [90] train-error:0.192500
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,647 INFO [91] train-error:0.190000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,647 INFO [92] train-error:0.190000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,647 INFO [93] train-error:0.192500
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,647 INFO [94] train-error:0.192500
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,647 INFO [95] train-error:0.192500
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,648 INFO [96] train-error:0.190000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,648 INFO [97] train-error:0.185000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,648 INFO [98] train-error:0.185000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,649 INFO [99] train-error:0.185000
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,649 DEBUG Recieve shutdown signal from 0
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,649 INFO @tracker All nodes finishes job
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: 2020-06-26 21:45:19,649 INFO @tracker 0.0302619934082 secs between node start and job finish
20/06/26 21:45:19 INFO RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 0
20/06/26 21:45:20 INFO RabitTracker: Tracker Process ends with exit code 0
20/06/26 21:45:20 INFO XGBoostSpark: Rabit returns with exit code 0
@@@ xgb score info :
train AUC :0.7255054656629459
7. ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
Environment: Spark 2.3, xgboost4j-spark 0.82, Scala 2.11.8
Errors when running xgboost4j-spark on the cluster in yarn-cluster mode:
(1) 2020-08-12,21:50:15,724 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: TaskKilled(another attempt succeeded), stopping SparkContext
(2) 2020-08-12,21:50:16,765 ERROR ml.dmlc.xgboost4j.java.RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 8 cancelled because SparkContext was shut down
(3) 2020-08-12,21:50:21,777 INFO XGBoostSpark: Rabit returns with exit code 143
2020-08-12,21:50:21,779 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: User class threw exception: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
I checked the issues on the dmlc project page and found that many people have hit this error without being able to pinpoint where it actually comes from. Below is the full story of how I spent three days tracking it down... painful.
1) First instinct: check the code. No problems found.
2) Looked at the web UI. Every run failed at the foreachPartition at XGBoost.scala:397 job; drilling into its stages showed the failure at the same point, and oddly some tasks succeeded while a particular task always failed.
3) Increased the number of executors and the memory, threw more resources at it. Same error.
4) Went back to the error messages. The first one, from TaskFailedListener, says a task failed? But the Spark web UI showed no failed tasks. Back in the xgboost4j-spark source, TaskFailedListener is documented as "A tracker that ensures enough number of executor cores are alive. Throws an exception when the number of alive cores is less than nWorkers." In other words, an executor blew up during training, which triggers this error. The puzzle: the web UI showed no failed executor and no failed task. So clearly something was making a task, and with it an executor, fail without the failure being recorded in the driver or executor logs.
5) More or less by accident, I lowered the executor count to 36 and reran. Same error, but this time a failed executor actually showed up. I jumped into its log and got quite a shock: the failure came from the VectorAssembler step, which was choking on NA values. The frustrating part: the code already called na.fill(-1) on the data, and yet NAs remained.
6) Patched the code to properly fill the remaining missing values (see the sketch below), and the job finally ran!!!
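A minimal sketch of that kind of fix, reusing the column names from the demo above (your real columns and fill value will differ). One plausible way na.fill(-1) can silently miss values: the numeric overload only touches columns of a matching numeric type, and a later join or transform can reintroduce nulls, so it's safest to fill the exact feature columns immediately before assembling:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Fill nulls/NaN in the feature columns right before assembling, so nothing
// can reintroduce missing values in between (column names are from the demo above).
def assembleSafely(data: DataFrame): DataFrame = {
  val featureCols = Array("x1", "x2", "x3")
  val filled = data.na.fill(-1.0, featureCols)  // fills both null and NaN in numeric columns
  new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")
    .transform(filled)
}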
When ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed shows up, 99% of the time the cause is a bug in your own code; occasionally it's the Spark resource parameters. As long as the code is sound, experiment with the Spark resource settings (see the sketch below) and you should be fine.
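For reference, the resource knobs I mean are the usual spark-submit ones; a sketch with placeholder values (the class name and jar are hypothetical, the numbers are things to tune, not recommendations):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 36 \
  --executor-cores 2 \
  --executor-memory 8g \
  --conf spark.speculation=false \
  --class com.example.XgbDemo \
  xgb-demo.jar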