[MMLSpark] Deploying a Model as a Real-Time Web Service with Spark Serving

An engine for deploying Spark jobs as distributed web services. It is fairly new, so let's give it a try.

Features

Distributed: takes full advantage of Spark's node-, JVM-, and thread-level parallelism.
Fast: no single-node bottleneck and no Python round-trip overhead. Requests can be routed through network switches directly to the worker JVMs. A web service spins up in seconds.
Low latency: with continuous serving, latencies as low as 1 millisecond are achievable.
Deployable anywhere: works wherever Spark runs, e.g. Databricks, HDInsight, AZTK, DSVM, a local machine, or your own cluster. Usable from Spark, PySpark, and SparklyR.
Lightweight: no dependency on an additional (costly) Kafka or Kubernetes cluster.
Consistent: uses the same API as batch processing and Structured Streaming.
Flexible: spin up and manage multiple services on a single Spark cluster; synchronous and asynchronous service management and scaling; deploy any Spark job that can be expressed as a structured streaming query; combine the serving sources/sinks with other Spark sources/sinks for more complex deployments.

Using the Adult Census Income dataset (click to download) for prediction

We will use Spark Serving to deploy the model as a real-time web service.

Import dependencies

First, import the required packages:

import os
import sys
import requests
import numpy as np
import pandas as pd
import mmlspark
import pyspark
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression

spark = pyspark.sql.SparkSession.builder.appName("byzMLSparkApp") \
            .getOrCreate()
            # .config("spark.jars.packages", "Azure:mmlspark:0.14")
            # Better not to add the package this way: it can conflict with JARs already on the
            # classpath, or fail to pull in dependencies such as spray's JsonReader and cause Py4J errors.

Data cleaning

Download the data. Every field contains stray whitespace, so we strip it before the data can be used.

dataFilePath = "/home/raini/dataset/AdultCensusIncome.csv"

# Strip the stray spaces from every field
## Previously we had to declare a type for every field; MMLSpark can infer types automatically
data = pd.read_csv(dataFilePath, dtype={"hours-per-week": np.float64})
for col in data.columns:
    data[col] = data[col].astype(str).str.replace(" ", "")

df = spark.createDataFrame(data)
df.show(10)

+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|Education-num|      marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|       Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|          40.0| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|  Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|          13.0| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|            Divorced|Handlers-cleaners|Not-in-family|White|  Male|           0|           0|          40.0| United-States| <=50K|
| 53|         Private|234721|     11th|            7|  Married-civ-spouse|Handlers-cleaners|      Husband|Black|  Male|           0|           0|          40.0| United-States| <=50K|
| 28|         Private|338409|Bachelors|           13|  Married-civ-spouse|   Prof-specialty|         Wife|Black|Female|           0|           0|          40.0|          Cuba| <=50K|
| 37|         Private|284582|  Masters|           14|  Married-civ-spouse|  Exec-managerial|         Wife|White|Female|           0|           0|          40.0| United-States| <=50K|
| 49|         Private|160187|      9th|            5|Married-spouse-ab...|    Other-service|Not-in-family|Black|Female|           0|           0|          16.0|       Jamaica| <=50K|
| 52|Self-emp-not-inc|209642|  HS-grad|            9|  Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|          45.0| United-States|  >50K|
| 31|         Private| 45781|  Masters|           14|       Never-married|   Prof-specialty|Not-in-family|White|Female|       14084|           0|          50.0| United-States|  >50K|
| 42|         Private|159449|Bachelors|           13|  Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|        5178|           0|          40.0| United-States|  >50K|
+---+----------------+------+---------+-------------+--------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+

Data modeling

Select the fields we need for modeling; the label (y) is [sex].

df = df.select(["age", "fnlwgt", "hours-per-week", "Education-num", "sex"])
train, test = df.randomSplit([0.75, 0.25], seed=123)
df.limit(10).show()

+---+------+--------------+-------------+------+
|age|fnlwgt|hours-per-week|Education-num|   sex|
+---+------+--------------+-------------+------+
| 39| 77516|          40.0|           13|  Male|
| 50| 83311|          13.0|           13|  Male|
| 38|215646|          40.0|            9|  Male|
| 53|234721|          40.0|            7|  Male|
| 28|338409|          40.0|           13|Female|
| 37|284582|          40.0|           14|Female|
| 49|160187|          16.0|            5|Female|
| 52|209642|          45.0|            9|  Male|
| 31| 45781|          50.0|           14|Female|
| 42|159449|          40.0|           13|  Male|
+---+------+--------------+-------------+------+

Model training

TrainClassifier can be used to initialize and fit a model. It wraps the SparkML classifiers, and the numFeatures parameter controls the number of hashed features. Use help(mmlspark.TrainClassifier) to see the different parameters.

Note: it implicitly converts the data into the format the algorithm expects; more concretely, it tokenizes and hashes strings, one-hot encodes categorical variables, assembles the features into a vector, and so on.

model = TrainClassifier(model=LogisticRegression(), labelCol="sex", numFeatures=256).fit(train)

print(train.schema) # Training succeeds even though the columns are still strings: TrainClassifier converts them internally
StructType(List(StructField(age,StringType,true),StructField(fnlwgt,StringType,true),StructField(hours-per-week,StringType,true),StructField(Education-num,StringType,true),StructField(sex,StringType,true)))
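
For intuition, the sketch below shows roughly what TrainClassifier saves you from writing by hand. It is only an approximation using plain SparkML, not what TrainClassifier actually runs internally: the string-typed columns are cast to doubles, the label is indexed, the features are assembled into a vector, and the classifier is fitted.

from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

feature_cols = ["age", "fnlwgt", "hours-per-week", "Education-num"]

## Cast the string-typed numeric columns to double explicitly
train_cast = train.select([col(c).cast("double").alias(c) for c in feature_cols] + [col("sex")])

manual_pipeline = Pipeline(stages=[
    StringIndexer(inputCol="sex", outputCol="label"),              # index the string label
    VectorAssembler(inputCols=feature_cols, outputCol="features"), # assemble the feature vector
    LogisticRegression(labelCol="label", featuresCol="features"),
])
manual_model = manual_pipeline.fit(train_cast)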

Model prediction

from mmlspark import ComputeModelStatistics, TrainedClassifierModel
prediction = model.transform(test)

prediction.printSchema()
root
 |-- age: long (nullable = true)
 |-- fnlwgt: long (nullable = true)
 |-- hours-per-week: double (nullable = true)
 |-- Education-num: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- scores: vector (nullable = true)
 |-- scored_probabilities: vector (nullable = true)
 |-- scored_labels: double (nullable = false)

prediction.show()
 +---+------+--------------+-------------+------+--------------------+--------------------+-------------+
 |age|fnlwgt|hours-per-week|Education-num|   sex|              scores|scored_probabilities|scored_labels|
 +---+------+--------------+-------------+------+--------------------+--------------------+-------------+
 | 50| 83311|          13.0|           13|  Male|[-0.3112541859874...|[0.42280863585226...|          1.0|
 | 28|338409|          40.0|           13|Female|[2.03573333830975...|[0.88449809518427...|          0.0|
 | 49|160187|          16.0|            5|Female|[-0.6237799299103...|[0.34892225427013...|          1.0|
 | 37|280464|          80.0|           10|  Male|[-6.0676335524913...|[0.00231129449917...|          1.0|
 +---+------+--------------+-------------+------+--------------------+--------------------+-------------+

Compute model statistics

metrics = ComputeModelStatistics().transform(prediction)

metrics.limit(10).show()
+---------------+--------------------+------------------+------------------+------------------+------------------+
|evaluation_type|    confusion_matrix|          accuracy|         precision|            recall|               AUC|
+---------------+--------------------+------------------+------------------+------------------+------------------+
| Classification|517.0  2218.0  4...|0.6752094627895515|0.6911293691686394|0.9223192715108716|0.6504545242978628|
+---------------+--------------------+------------------+------------------+------------------+------------------+
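
To log or compare individual metrics, the values can be pulled out of the one-row result; the column names below follow the output shown above:

## Extract scalar metrics from the single-row result
row = metrics.select("accuracy", "precision", "recall", "AUC").first()
print("accuracy={:.4f}  AUC={:.4f}".format(row["accuracy"], row["AUC"]))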

Define the web service

First, we define the web service's inputs and outputs. For more information, see the documentation for Spark Serving.

Create the stream

Spark Serving adds special streaming sources and sinks that turn any structured streaming job into a web service. It ships with the following deployment options, which differ in the load-balancing method used:

  1. spark.readStream.server(): for a head-node load-balanced service, deployed with the HTTPSource and HTTPSink classes. This mode allows more complex windowing, repartitioning, and SQL operations, and is also well suited to quick setup and testing since it requires no extra load balancer or network switch.
  2. spark.readStream.distributedServer(): for a custom load-balanced service, the DistributedHTTPSource and DistributedHTTPSink classes configure Spark serving behind your own load balancer.
  3. spark.readStream.continuousServer(): for custom load balancing, builds a continuous server with sub-millisecond latency. It needs special setup: df.writeStream.continuousServer().trigger(continuous="1 second")… (a sketch follows below).

These calls create the different serving DataFrames; the matching df.writeStream counterparts are then used on the reply side of the web request.
For more deployment options, see the Serving documentation.
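
As an aside on option 3, here is a rough sketch of what a continuous server might look like. This is an assumption-laden illustration: it reuses the address/parseRequest/makeReply/replyTo chain from the head-node example below together with the trigger line quoted above; the port 8899 and the names my_continuous_api / my_continuous_query are placeholders. The rest of this tutorial uses option 1.

import uuid

## Continuous serving (sketch): same request/reply chain, plus a continuous trigger
continuous_inputs = spark.readStream.continuousServer() \
    .address("localhost", 8899, "my_continuous_api") \
    .load() \
    .parseRequest(test.schema)

continuous_server = model.transform(continuous_inputs) \
    .makeReply("scored_labels") \
    .writeStream.continuousServer() \
    .trigger(continuous="1 second") \
    .replyTo("my_continuous_api") \
    .queryName("my_continuous_query") \
    .option("checkpointLocation", "checkpoints-continuous-{}".format(uuid.uuid1())) \
    .start()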

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import *
import uuid
from mmlspark import request_to_string, string_to_response

## Define the host, port, and API name the incoming requests flow to, plus the schema of the input data
serving_inputs = spark.readStream.server() \
    .address("localhost", 8898, "my_api") \
    .load()\
    .parseRequest(test.schema)

    ## Custom request schema:
    #.parseRequest(StructType().add("S1", StringType()).add("S2", IntegerType()))
## Adding a derived column:
#replies = df.withColumn("fooLength", length(col("S1"))).makeReply("fooLength")

## Define the transformation applied to the incoming data and the field to return
serving_outputs = model.transform(serving_inputs) \
  .makeReply("scored_labels")

## Start the service and reply to requests submitted to "my_api"
server = serving_outputs.writeStream \
    .server() \
    .replyTo("my_api") \
    .queryName("my_query") \
    .option("checkpointLocation", "checkpoints-{}".format(uuid.uuid1())) \
    .start()
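
The object returned by start() is an ordinary Structured Streaming query handle, so the standard monitoring calls can confirm the service is up before sending requests (a quick check, not part of the original walkthrough):

## Sanity-check the serving query
print(server.isActive)                          # True once the stream has started
print(server.status)                            # current status of the streaming query
print([q.name for q in spark.streams.active])   # should include 'my_query'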

Submit test requests to the service and check the results

data = u'{"age": 31,"fnlwgt":45781,"hours-per-week":50.0,"Education-num":14}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))

Output: Response {"scored_labels":1.0}

data = u'{"age":42,"fnlwgt":159449,"hours-per-week":40.0,"Education-num":13}'
r = requests.post(data=data, url="http://localhost:8898/my_api")
print("Response {}".format(r.text))

Output: Response {"scored_labels":1.0}
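
Instead of hand-writing the JSON string, the payload can also be built from a Python dict; the sample values below are made up, and the keys must match the columns in test.schema:

import json

sample = {"age": 39, "fnlwgt": 77516, "hours-per-week": 40.0, "Education-num": 13}
r = requests.post(data=json.dumps(sample), url="http://localhost:8898/my_api")
print("Response {}".format(r.text))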

Stop the service

import time
time.sleep(20) # wait for server to finish setting up (just to be safe)
server.stop()
