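from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()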
# Training documents as (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
training.show()
# Configure a pipeline with three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)
The error output consists roughly of the following pieces (pick out the key parts yourself):
Exception happened during processing of request from ('127.0.0.1', 48756)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/root/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:44278)
py4j.protocol.Py4JError: An error occurred while calling o46.fit
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
py4j.protocol.Py4JError: An error occurred while calling o46.fit
During handling of the above exception, another exception occurred:
Py4JError: An error occurred while calling o46.fit
Analysis:
When the script runs, this line triggers the error:
model = pipeline.fit(training)
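The error occurred on an Alibaba Cloud ECS instance (CentOS, student-tier plan). I searched Baidu and tried many of the causes and fixes suggested online, but none of them worked. I then ran the same code on a local virtual machine, and it succeeded. The local VM and the Alibaba Cloud student instance have identical software environments; only the hardware specifications differ. So the most likely cause is that the ECS instance's specifications are too low.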
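One way to check the "specifications too low" hypothesis before calling pipeline.fit is to look at how much memory the machine actually has free. "Answer from Java side is empty" is what Py4J reports when the Python side gets no reply from the JVM, which fits a JVM that exited, for example after running out of memory. The sketch below is only an illustration under that assumption: the helper names and the 2048 MiB threshold are hypothetical and not part of Spark, and it reads /proc/meminfo, so it only applies to Linux hosts such as the CentOS instance above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_mb(text):
    """Return MemAvailable in MiB, falling back to MemFree if it is absent."""
    info = parse_meminfo(text)
    kb = info.get("MemAvailable", info.get("MemFree", 0))
    return kb // 1024


def enough_memory(text, required_mb=2048):
    # required_mb is a hypothetical threshold chosen for illustration,
    # not a figure from the Spark documentation.
    return available_mb(text) >= required_mb


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    print("Available memory (MiB):", available_mb(meminfo))
    print("Meets the chosen threshold:", enough_memory(meminfo))
```

If the reported value is small on the ECS instance and comfortably larger on the local VM, that supports the conclusion above; it does not prove it, since the JVM can also exit for other reasons.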
Cause of the error: could it be that Python 3.x has no long type, only int? Python 2.x has both a long type and an int type.
Change long to int.
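For reference, the long-to-int change mentioned above can be sketched as follows. This only illustrates the language difference: in Python 3 the name long does not exist and int already handles arbitrarily large values. The names integer_type and to_integer are hypothetical, and nothing in the traceback above shows that long is involved in this particular error.

```python
try:
    integer_type = long  # defined only in Python 2
except NameError:
    integer_type = int   # Python 3: int covers arbitrarily large integers


def to_integer(value):
    """Convert value with whichever integer type the interpreter provides."""
    return integer_type(value)


if __name__ == "__main__":
    print(to_integer("12345678901234567890"))
```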