An engine for deploying Spark jobs as distributed web services. It is brand new, so let's try it out together.
Distributed: takes full advantage of Spark's node-, JVM-, and thread-level parallelism.
Fast: no single-node bottleneck and no Python round-trip overhead; requests can be routed through network switches directly to the worker JVMs, and a web service spins up in seconds.
Low latency: with continuous serving, latencies as low as 1 ms are achievable.
Deployable anywhere: works wherever Spark runs, e.g. Databricks, HDInsight, AZTK, DSVM, on-premises machines, or your own cluster; usable from Spark, PySpark, and SparklyR.
Lightweight: no dependency on an additional (costly) Kafka or Kubernetes cluster.
Idiomatic: uses the same API as batch and structured streaming.
Flexible: launch and manage multiple services on a single Spark cluster; synchronous and asynchronous service management and scaling; deploy any Spark job that can be expressed as a structured streaming query; combine serving sources/sinks with other Spark sources/sinks for more complex deployments.
We will use Spark Serving to deploy a trained model as a real-time web service.
First, we import the required packages:
import os
import sys
import requests
import numpy as np
import pandas as pd
import mmlspark
import pyspark
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
spark = pyspark.sql.SparkSession.builder.appName("byzMLSparkApp") \
.getOrCreate()
#.config("spark.jars.packages", "Azure:mmlspark:0.14")
# Better not to load the package this way: it can conflict with JARs already on the classpath, or fail to resolve dependencies such as spray's JsonReader and trigger Py4J errors.
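If you do want the mmlspark package resolved automatically, one alternative sketch (the Azure:mmlspark:0.14 coordinates are taken from the commented-out line above; depending on where the package is hosted you may also need a --repositories flag) is to set PYSPARK_SUBMIT_ARGS before the first SparkSession is created, so spark-submit resolves the dependency at launch time rather than inside a running session:
import os

# Must be set before any SparkSession/SparkContext exists in this Python process;
# the trailing "pyspark-shell" tells spark-submit how the arguments are used.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.14 pyspark-shell"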
Download the data. Every field in the file is padded with spaces, so leading and trailing spaces have to be stripped before the data can be used.
dataFilePath = "/home/raini/dataset/AdultCensusIncome.csv"
# Strip leading/trailing spaces
## Previously we had to declare a type for every field; MMLSpark can infer the types automatically
data = pd.read_csv(dataFilePath, dtype={"hours-per-week": np.float64})
for col in data.columns:
    # every value is padded with spaces, e.g. " State-gov"
    data[col] = data[col].astype(str).str.strip()
df = spark.createDataFrame(data)
df.show(10)
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age| workclass|fnlwgt|education|Education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40.0| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13.0| United-States| <=50K|
| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40.0| United-States| <=50K|
| 53| Private|234721| 11th| 7| Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40.0| United-States| <=50K|
| 28| Private|338409|Bachelors| 13| Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40.0| Cuba| <=50K|
| 37| Private|284582| Masters| 14| Married-civ-spouse| Exec-managerial| Wife|White|Female| 0| 0| 40.0| United-States| <=50K|
| 49| Private|160187| 9th| 5|Married-spouse-ab...| Other-service|Not-in-family|Black|Female| 0| 0| 16.0| Jamaica| <=50K|
| 52|Self-emp-not-inc|209642| HS-grad| 9| Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 45.0| United-States| >50K|
| 31| Private| 45781| Masters| 14| Never-married| Prof-specialty|Not-in-family|White|Female| 14084| 0| 50.0| United-States| >50K|
| 42| Private|159449|Bachelors| 13| Married-civ-spouse| Exec-managerial| Husband|White| Male| 5178| 0| 40.0| United-States| >50K|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
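As a quick optional check that the padding spaces are really gone, look at the distinct values of one of the categorical columns:
# Values such as "State-gov" should no longer carry a leading space
df.select("workclass").distinct().show(truncate=False)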
Select the fields we need for modeling; the label y is [sex].
df = df.select(["age", "fnlwgt", "hours-per-week", "Education-num", "sex"])
train, test = df.randomSplit([0.75, 0.25], seed=123)
df.limit(10).show()
+---+------+--------------+-------------+------+
|age|fnlwgt|hours-per-week|Education-num| sex|
+---+------+--------------+-------------+------+
| 39| 77516| 40.0| 13| Male|
| 50| 83311| 13.0| 13| Male|
| 38|215646| 40.0| 9| Male|
| 53|234721| 40.0| 7| Male|
| 28|338409| 40.0| 13|Female|
| 37|284582| 40.0| 14|Female|
| 49|160187| 16.0| 5|Female|
| 52|209642| 45.0| 9| Male|
| 31| 45781| 50.0| 14|Female|
| 42|159449| 40.0| 13| Male|
+---+------+--------------+-------------+------+
TrainClassifier can be used to initialize and fit a model. It wraps the SparkML classifiers, and the numFeatures parameter controls the number of hashed features. Use help(mmlspark.TrainClassifier) to see the different parameters.
Note: it implicitly converts the data into the format the algorithm expects; more specifically, it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on.
model = TrainClassifier(model=LogisticRegression(), labelCol="sex", numFeatures=256).fit(train)
print(train.schema) # Training succeeded even though every column is still StringType, so the String-to-Float conversion happened inside TrainClassifier
StructType(List(StructField(age,StringType,true),StructField(fnlwgt,StringType,true),StructField(hours-per-week,StringType,true),StructField(Education-num,StringType,true),StructField(sex,StringType,true)))
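For reference, here is a rough sketch of the plain SparkML pipeline that TrainClassifier spares you from writing by hand. It only illustrates the implicit featurization, it is not MMLSpark's exact internals, and the casts are needed because our columns are still strings:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import col as sql_col

# Cast the string feature columns back to numbers, index the string label,
# assemble a feature vector, then fit the classifier.
feature_cols = ["age", "fnlwgt", "hours-per-week", "Education-num"]
numeric = train
for c in feature_cols:
    numeric = numeric.withColumn(c, sql_col(c).cast("double"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="sex", outputCol="label"),
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
#manual_model = pipeline.fit(numeric)  # roughly what the TrainClassifier call above did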
from mmlspark import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)
prediction.printSchema()
root
|-- age: long (nullable = true)
|-- fnlwgt: long (nullable = true)
|-- hours-per-week: double (nullable = true)
|-- Education-num: long (nullable = true)
|-- sex: string (nullable = true)
|-- scores: vector (nullable = true)
|-- scored_probabilities: vector (nullable = true)
|-- scored_labels: double (nullable = false)
prediction.show()
+---+------+--------------+-------------+------+--------------------+--------------------+-------------+
|age|fnlwgt|hours-per-week|Education-num| sex| scores|scored_probabilities|scored_labels|
+---+------+--------------+-------------+------+--------------------+--------------------+-------------+
| 50| 83311| 13.0| 13| Male|[-0.3112541859874...|[0.42280863585226...| 1.0|
| 28|338409| 40.0| 13|Female|[2.03573333830975...|[0.88449809518427...| 0.0|
| 49|160187| 16.0| 5|Female|[-0.6237799299103...|[0.34892225427013...| 1.0|
| 37|280464| 80.0| 10| Male|[-6.0676335524913...|[0.00231129449917...| 1.0|
+---+------+--------------+-------------+------+--------------------+--------------------+-------------+
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).show()
+---------------+--------------------+------------------+------------------+------------------+------------------+
|evaluation_type| confusion_matrix| accuracy| precision| recall| AUC|
+---------------+--------------------+------------------+------------------+------------------+------------------+
| Classification|517.0 2218.0
4...|0.6752094627895515|0.6911293691686394|0.9223192715108716|0.6504545242978628|
+---------------+--------------------+------------------+------------------+------------------+------------------+
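As a sanity check, the accuracy can also be recomputed by hand from the prediction DataFrame. This is a sketch; it assumes Male was indexed to 1.0 (which matches the scored_labels shown above), so flip the mapping if the number comes out inverted:
from pyspark.sql.functions import col, when

# Re-encode the string label with the assumed 0/1 mapping and compare with scored_labels
labeled = prediction.withColumn("label", when(col("sex") == "Male", 1.0).otherwise(0.0))
manual_accuracy = labeled.filter(col("label") == col("scored_labels")).count() / float(labeled.count())
print("Manual accuracy: {:.4f}".format(manual_accuracy))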
First, we define the input and output of the web service. For more information, see the documentation for Spark Serving.
Spark Serving adds special streaming sources and sinks that can turn any structured streaming job into a web service. It ships with two deployment options, which differ in the load-balancing method used: each builds a serving DataFrame in its own way, and the matching web requests are answered by the service started with the final df.writeStream.
More deployment options are covered in the Spark Serving documentation.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import *
import uuid
from mmlspark import request_to_string, string_to_response
## Define the host, port, and API name the input data arrives on, plus the input schema
serving_inputs = spark.readStream.server() \
.address("localhost", 8898, "my_api") \
.load()\
.parseRequest(test.schema)
## Custom input schema:
#.parseRequest(StructType().add("S1", StringType()).add("S2", IntegerType()))
## Adding a column:
#replies = df.withColumn("S1Length", length(col("S1"))).makeReply("S1Length")
## Define the transformation applied to the incoming data and the field name returned in the reply
serving_outputs = model.transform(serving_inputs) \
.makeReply("scored_labels")
## Start the service; replies are routed back to requests received on "my_api"
server = serving_outputs.writeStream \
.server() \
.replyTo("my_api") \
.queryName("my_query") \
.option("checkpointLocation", "checkpoints-{}".format(uuid.uuid1())) \
.start()
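Before sending requests, it can help to confirm that the streaming query behind the service is actually running (server is the StreamingQuery returned by start()):
import time

time.sleep(5)            # give the source a moment to bind the port
print(server.isActive)   # True while the serving query is running
print(server.status)     # trigger / data-availability status reported by Spark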
data = u'{"age": 31,"fnlwgt":45781,"hours-per-week":50.0,"Education-num":14}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))
Result: Response {"scored_labels":1.0}
data = u'{"age":42,"fnlwgt":159449,"hours-per-week":40.0,"Education-num":13}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))
Result: Response {"scored_labels":1.0}
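To relate this back to the latency claim at the top of the post, here is a rough client-side timing of a single request. It measures network plus serving overhead as seen by the client, not the server-side latency itself:
import time

payload = u'{"age": 31,"fnlwgt":45781,"hours-per-week":50.0,"Education-num":14}'
start = time.time()
r = requests.post(data=payload, url="http://localhost:8898/my_api")
print("Response {} in {:.1f} ms".format(r.text, (time.time() - start) * 1000))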
import time
time.sleep(20) # wait for server to finish setting up (just to be safe)
server.stop()