Goal: find every sentence that contains "spark" — set a sentence's label to 1 if it contains "spark", and to 0 if it does not.
The complete code is below, followed by a step-by-step walkthrough.
In Spark 2.0 and later, pyspark automatically creates a SparkSession object named spark at startup. When one needs to be created by hand, a SparkSession is obtained through its builder:
# Python code
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
spark.stop()
%%python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0)],
["id", "test", "label"])  # note: the column is named "test", but the Tokenizer below reads "text"; this mismatch causes the error shown further down
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages = [tokenizer, hashingTF, lr])
model = pipeline.fit(training)
Hmm, this cell errored (it's an old note and I forget whether I ever fixed it). The last line of the traceback actually gives the cause away: the DataFrame column is named "test", while the Tokenizer's inputCol is "text":
Traceback (most recent call last):
File "python cell", line 21, in
File "D:\Program Files\Anaconda3\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "D:\Program Files\Anaconda3\lib\site-packages\pyspark\ml\pipeline.py", line 107, in _fit
dataset = stage.transform(dataset)
File "D:\Program Files\Anaconda3\lib\site-packages\pyspark\ml\base.py", line 173, in transform
return self._transform(dataset)
File "D:\Program Files\Anaconda3\lib\site-packages\pyspark\ml\wrapper.py", line 312, in _transform
return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
File "D:\Program Files\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "D:\Program Files\Anaconda3\lib\site-packages\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: 'Field "text" does not exist.\nAvailable fields: id, test, label'
A PipelineStage is either a transformer (Transformer) or an estimator (Estimator); concretely here, that means tokenizer, hashingTF, and lr:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
This also errored:
:2: error: illegal start of simple expression
pipeline = Pipeline(stages=[tokerizer, bashingTF, lr])
("illegal start of simple expression" is a Scala compiler message, so this cell was most likely executed by the notebook's Scala interpreter rather than Python.)
My Jupyter Notebook stopped working after a Chrome upgrade, but the old notes were saved locally, so I opened them with a VS Code extension and moved everything over to CSDN. Some of the code errored when I originally ran it, and I no longer remember whether those problems were ever resolved.
These are just my study notes; anyone interested is welcome to discuss.