When doing machine learning with PySpark, you must specify the name of the input featuresCol when instantiating a model object. featuresCol is a single column assembled from the data's X, i.e. a 'vector'. Otherwise, fit() fails with an error like:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/spark/spark-2.4.4/python/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/data/spark/spark-2.4.4/python/pyspark/ml/wrapper.py", line 295, in _fit
    java_model = self._fit_java(dataset)
  File "/data/spark/spark-2.4.4/python/pyspark/ml/wrapper.py", line 292, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/data/spark/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/data/spark/spark-2.4.4/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist.'
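The fix is to make sure featuresCol names a vector column that actually exists in the DataFrame. A minimal sketch, assuming a trainingData DataFrame with columns "features" and "label" (built below):

from pyspark.ml.classification import LogisticRegression

# featuresCol / labelCol must match existing column names,
# otherwise fit() raises 'Field "features" does not exist.'
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(trainingData)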
On Stack Overflow, desertnaut answered:

Spark dataframes are not used like that in Spark ML; all your features need to be vectors in a single column, usually named features. Here is how you can do it using the 5 rows you have provided as an example:
spark.version
# u'2.2.0'
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
# your sample data:
temp_df = spark.createDataFrame([
    Row(V4366=0.0, V4460=0.232, V4916=-0.017, V1495=-0.104, V1639=0.005, V1967=-0.008, V3049=0.177, V3746=-0.675, V3869=-3.451, V524=0.004, V5409=0),
    Row(V4366=0.0, V4460=0.111, V4916=-0.003, V1495=-0.137, V1639=0.001, V1967=-0.01, V3049=0.01, V3746=-0.867, V3869=-2.759, V524=0.0, V5409=0),
    Row(V4366=0.0, V4460=-0.391, V4916=-0.003, V1495=-0.155, V1639=-0.006, V1967=-0.019, V3049=-0.706, V3746=0.166, V3869=0.189, V524=0.001, V5409=0),
    Row(V4366=0.0, V4460=0.098, V4916=-0.012, V1495=-0.108, V1639=0.005, V1967=-0.002, V3049=0.033, V3746=-0.787, V3869=-0.926, V524=0.002, V5409=0),
    Row(V4366=0.0, V4460=0.026, V4916=-0.004, V1495=-0.139, V1639=0.003, V1967=-0.006, V3049=-0.045, V3746=-0.208, V3869=-0.782, V524=0.001, V5409=0)
])
trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
trainingData.show()
# +--------------------+-----+
# | features|label|
# +--------------------+-----+
# |[-0.104,0.005,-0....| 0|
# |[-0.137,0.001,-0....| 0|
# |[-0.155,-0.006,-0...| 0|
# |[-0.108,0.005,-0....| 0|
# |[-0.139,0.003,-0....| 0|
# +--------------------+-----+
In other words, all input features must be packed into a single 'vector' column, which can be done with:
from pyspark.ml.linalg import Vectors
trainingData = temp_df.rdd.map(lambda x: (Vectors.dense(x[0:-1]), x[-1])).toDF(["features", "label"])
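The same conversion can also be done without dropping to the RDD API, using VectorAssembler. This alternative is not from the quoted answer; it is a sketch that treats V5409 as the label, matching the sample data above:

from pyspark.ml.feature import VectorAssembler

# every column except the label goes into the single vector column
feature_cols = [c for c in temp_df.columns if c != "V5409"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
trainingData = (assembler.transform(temp_df)
                .withColumnRenamed("V5409", "label")
                .select("features", "label"))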
Also on Stack Overflow, an answer with 100+ upvotes argues:

Personally I would go with Python UDF and wouldn't bother with anything else:

- Vectors are not native SQL types so there will be performance overhead one way or another. In particular this process requires two steps where data is first converted from external type to row, and then from row to internal representation using generic RowEncoder.
- A Pipeline will be much more expensive than a simple conversion. Moreover it requires a process which is opposite to the one described above.
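A sketch of the UDF approach that answer recommends, assuming a hypothetical DataFrame df whose features already sit in an array column named raw_features:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# wrap the array column into an ML Vector column named "features"
as_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df_with_features = df.withColumn("features", as_vector("raw_features"))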