Alink is a new-generation machine learning algorithm platform that Alibaba's Computing Platform (PAI) team has been building on top of the Flink real-time compute engine since 2017. It provides a rich library of algorithm components and a convenient operator framework, so developers can assemble the full model-development workflow in a few steps: data processing, feature engineering, model training, and model prediction.
Thanks to Flink's unified batch/stream runtime, Alink offers a consistent API for both batch and streaming jobs. In practice, the limitations of Flink's existing machine learning library, FlinkML, became apparent: it supports only a dozen or so algorithms, and its data structures are not general enough. Valuing the performance of the underlying Flink engine, the team redesigned and rebuilt the machine learning library from scratch on Flink. It went live inside Alibaba in 2018 and has since been continuously improved and hardened across the company's diverse business scenarios.
FlinkML is the machine learning library that already exists in the Flink community; it has been around for a long time and is updated slowly. Alink was written from scratch on a recent version of Flink and shares no code with FlinkML. It was developed by the PAI team of Alibaba's Computing Platform division, proven in internal use at Alibaba, and has now been officially open-sourced.
from pyalink.alink import *
resetEnv()
useLocalEnv(1, config=None)
Use one of the following commands to start using PyAlink:
JVM listening on 127.0.0.1:50568
MLEnv(benv=JavaObject id=o2, btenv=JavaObject id=o5, senv=JavaObject id=o3, stenv=JavaObject id=o6)
## read data
URL = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/review_rating_train.csv"
SCHEMA_STR = "review_id bigint, rating5 bigint, rating3 bigint, review_context string"
LABEL_COL = "rating5"
TEXT_COL = "review_context"
VECTOR_COL = "vec"
PRED_COL = "pred"
PRED_DETAIL_COL = "predDetail"
source = CsvSourceBatchOp() \
.setFilePath(URL)\
.setSchemaStr(SCHEMA_STR)\
.setFieldDelimiter("_alink_")\
.setQuoteChar(None)
## Split data for train and test: the 0.9 fraction stays on the main output,
## and the remaining rows are exposed as side output 0
trainData = SplitBatchOp().setFraction(0.9).linkFrom(source)
testData = trainData.getSideOutput(0)
pipeline = (
Pipeline()
.add(
Segment()
.setSelectedCol(TEXT_COL)
)
.add(
StopWordsRemover()
.setSelectedCol(TEXT_COL)
).add(
DocHashCountVectorizer()
.setFeatureType("WORD_COUNT")
.setSelectedCol(TEXT_COL)
.setOutputCol(VECTOR_COL)
)
)
## naiveBayes model
naiveBayes = (
NaiveBayesTextClassifier()
.setVectorCol(VECTOR_COL)
.setLabelCol(LABEL_COL)
.setPredictionCol(PRED_COL)
.setPredictionDetailCol(PRED_DETAIL_COL)
)
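NaiveBayesTextClassifier trains a multinomial naive Bayes model on the hashed word-count vectors. As a rough, self-contained sketch of the scoring rule it relies on (a toy vocabulary and a plain-NumPy implementation, not Alink's internals):

```python
import numpy as np

# toy word-count matrix: 4 documents over a 3-word vocabulary, two classes
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 2, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
# log prior: log P(c)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# log likelihood with Laplace smoothing: log P(word | c)
counts = np.array([X[y == c].sum(axis=0) for c in classes]) + 1.0
log_lik = np.log(counts / counts.sum(axis=1, keepdims=True))

def predict(x):
    # multinomial NB: argmax_c [ log P(c) + sum_w x_w * log P(w | c) ]
    return int(np.argmax(log_prior + x @ log_lik.T))

print(predict(np.array([2, 0, 1])))  # resembles the class-0 documents
```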
pipeline.add(naiveBayes)
%timeit pipeline.fit(trainData)
473 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
model = pipeline.fit(trainData)  # fit once more; assignments inside %timeit do not persist
## evaluation
predict = model.transform(testData)
metrics = (
EvalMultiClassBatchOp()
.setLabelCol(LABEL_COL)
.setPredictionDetailCol(PRED_DETAIL_COL)
.linkFrom(predict)
.collectMetrics()
)
print("ConfusionMatrix:", metrics.getConfusionMatrix())
print("LabelArray:", metrics.getLabelArray())
print("LogLoss:", metrics.getLogLoss())
print("Accuracy:", metrics.getAccuracy())
print("Kappa:", metrics.getKappa())
print("MacroF1:", metrics.getMacroF1())
print("Label 1 Accuracy:", metrics.getAccuracy("1"))
print("Label 1 Kappa:", metrics.getKappa("1"))
print("Label 1 Precision:", metrics.getPrecision("1"))
ConfusionMatrix: [[4987, 327, 229, 204, 292], [28, 1223, 164, 147, 108], [1, 1, 269, 10, 11], [0, 0, 0, 10, 0], [0, 2, 1, 2, 83]]
LabelArray: ['5', '4', '3', '2', '1']
LogLoss: 2.330945631084851
Accuracy: 0.8114582047166317
Kappa: 0.6190950197563011
MacroF1: 0.5123859853163818
Label 1 Accuracy: 0.9486356340288925
Label 1 Kappa: 0.27179135595030096
Label 1 Precision: 0.9431818181818182
# set env
from pyalink.alink import *
import sys, os
resetEnv()
useLocalEnv(2)
Use one of the following commands to start using PyAlink:
JVM listening on 127.0.0.1:51134
JavaObject id=o6
# schema of train data
schemaStr = "id string, click string, dt string, C1 string, banner_pos int, site_id string, \
site_domain string, site_category string, app_id string, app_domain string, \
app_category string, device_id string, device_ip string, device_model string, \
device_type string, device_conn_type string, C14 int, C15 int, C16 int, C17 int, \
C18 int, C19 int, C20 int, C21 int"
# prepare batch train data
batchTrainDataFn = "http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv"
trainBatchData = CsvSourceBatchOp().setFilePath(batchTrainDataFn) \
    .setSchemaStr(schemaStr) \
    .setIgnoreFirstLine(True)
# feature fit
labelColName = "click"
vecColName = "vec"
numHashFeatures = 30000
selectedColNames =["C1","banner_pos","site_category","app_domain",
"app_category","device_type","device_conn_type",
"C14","C15","C16","C17","C18","C19","C20","C21",
"site_id","site_domain","device_id","device_model"]
categoryColNames = ["C1","banner_pos","site_category","app_domain",
"app_category","device_type","device_conn_type",
"site_id","site_domain","device_id","device_model"]
numericalColNames = ["C14","C15","C16","C17","C18","C19","C20","C21"]
# prepare stream train data
wholeDataFile = "http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-ctr-train-8M.csv"
data = CsvSourceStreamOp() \
    .setFilePath(wholeDataFile) \
    .setSchemaStr(schemaStr) \
    .setIgnoreFirstLine(True)
# split stream to train and eval data
splitter = SplitStreamOp().setFraction(0.5).linkFrom(data)
train_stream_data = splitter
test_stream_data = splitter.getSideOutput(0)
*Step 1: Feature engineering*
*Step 2: Batch model training*
*Step 3: Online model training (FTRL)*
*Step 4: Online prediction*
*Step 5: Online evaluation*
# set up feature engineering pipeline
feature_pipeline = Pipeline() \
.add(StandardScaler() \
.setSelectedCols(numericalColNames)) \
.add(FeatureHasher() \
.setSelectedCols(selectedColNames) \
.setCategoricalCols(categoryColNames) \
.setOutputCol(vecColName) \
.setNumFeatures(numHashFeatures))
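FeatureHasher implements the hashing trick: every (column, value) pair is hashed into one of numHashFeatures slots, so no vocabulary has to be built or stored. A minimal illustration in plain Python (the hash function and weight conventions here are assumptions for illustration; Alink's exact hashing differs):

```python
# sketch of the hashing trick for mixed categorical/numerical columns
NUM_HASH_FEATURES = 30000

def hash_features(row, categorical_cols, numerical_cols, num_features=NUM_HASH_FEATURES):
    vec = {}
    for col in categorical_cols:
        # categorical column: hash "col=value", contribute weight 1.0
        idx = hash(f"{col}={row[col]}") % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    for col in numerical_cols:
        # numerical column: hash the column name, use the value as the weight
        idx = hash(col) % num_features
        vec[idx] = vec.get(idx, 0.0) + float(row[col])
    return vec  # sparse vector as an index -> weight map

# hypothetical avazu-style row (values made up for illustration)
row = {"site_category": "28905ebd", "banner_pos": "0", "C14": 15706}
print(hash_features(row, ["site_category", "banner_pos"], ["C14"]))
```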
# fit and save feature pipeline model
FEATURE_PIPELINE_MODEL_FILE = os.path.join(os.getcwd(), "feature_pipe_model.csv")
feature_pipeline.fit(trainBatchData).save(FEATURE_PIPELINE_MODEL_FILE)
BatchOperator.execute()
# load pipeline model
feature_pipelineModel = PipelineModel.load(FEATURE_PIPELINE_MODEL_FILE)
# train the initial batch model (note: linkFrom is lazy, it only builds the job
# graph; actual training runs when a downstream job executes)
lr = LogisticRegressionTrainBatchOp()
initModel = lr.setVectorCol(vecColName) \
    .setLabelCol(labelColName) \
    .setWithIntercept(True) \
    .setMaxIter(10) \
    .linkFrom(feature_pipelineModel.transform(trainBatchData))
# ftrl train
model = FtrlTrainStreamOp(initModel) \
.setVectorCol(vecColName) \
.setLabelCol(labelColName) \
.setWithIntercept(True) \
.setAlpha(0.1) \
.setBeta(0.1) \
.setL1(0.01) \
.setL2(0.01) \
.setTimeInterval(10) \
.setVectorSize(numHashFeatures) \
.linkFrom(feature_pipelineModel.transform(train_stream_data))
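The Alpha, Beta, L1 and L2 parameters above configure the standard FTRL-Proximal update. A per-coordinate sketch of that update rule in plain Python (an illustration of the algorithm, not Alink's implementation):

```python
import math

class FtrlProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (algorithm sketch)."""

    def __init__(self, dim, alpha=0.1, beta=0.1, l1=0.01, l2=0.01):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated adjusted gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        # closed-form coordinate solution of the FTRL-Proximal objective
        if abs(self.z[i]) <= self.l1:
            return 0.0  # L1 keeps small coordinates at exactly zero
        sign = -1.0 if self.z[i] < 0.0 else 1.0
        return -(self.z[i] - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        # x is a sparse feature map {index: value}
        s = sum(self.weight(i) * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, x, label):
        p = self.predict(x)
        for i, v in x.items():
            g = (p - label) * v  # gradient of the log loss
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g

# tiny demo: feature 1 only occurs with positives, feature 2 only with negatives
ftrl = FtrlProximal(dim=3)
for _ in range(200):
    ftrl.update({0: 1.0, 1: 1.0}, 1)
    ftrl.update({0: 1.0, 2: 1.0}, 0)
print(ftrl.predict({0: 1.0, 1: 1.0}), ftrl.predict({0: 1.0, 2: 1.0}))
```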
# ftrl predict
predResult = FtrlPredictStreamOp(initModel) \
.setVectorCol(vecColName) \
.setPredictionCol("pred") \
.setReservedCols([labelColName]) \
.setPredictionDetailCol("details") \
.linkFrom(model, feature_pipelineModel.transform(test_stream_data))
predResult.print(key="predResult", refreshInterval = 30, maxLimit=20)
'DataStream predResult: (Updated on 2019-12-05 15:03:33)'
| | click | pred | details |
|---|---|---|---|
| 0 | 0 | 0 | {"0":"0.9046159047711626","1":"0.0953840952288… |
| 1 | 1 | 0 | {"0":"0.7301554114492774","1":"0.2698445885507… |
| 2 | 0 | 0 | {"0":"0.9354702479573089","1":"0.0645297520426… |
| 3 | 1 | 0 | {"0":"0.7472443769874088","1":"0.2527556230125… |
| 4 | 0 | 0 | {"0":"0.7313933609276811","1":"0.2686066390723… |
| 5 | 0 | 0 | {"0":"0.7579078017993002","1":"0.2420921982006… |
| 6 | 0 | 0 | {"0":"0.9658883764493819","1":"0.0341116235506… |
| 7 | 0 | 0 | {"0":"0.8916428187684737","1":"0.1083571812315… |
| 8 | 0 | 0 | {"0":"0.964470362868512","1":"0.03552963713148… |
| 9 | 0 | 0 | {"0":"0.7879843998010425","1":"0.2120156001989… |
| 10 | 0 | 0 | {"0":"0.7701207324521978","1":"0.2298792675478… |
| 11 | 0 | 0 | {"0":"0.8816330561252186","1":"0.1183669438747… |
| 12 | 0 | 0 | {"0":"0.8671197714269967","1":"0.1328802285730… |
| 13 | 0 | 0 | {"0":"0.9355228418514457","1":"0.0644771581485… |
| 14 | 0 | 0 | {"0":"0.9098863130943347","1":"0.0901136869056… |
| 15 | 0 | 0 | {"0":"0.7917622336863489","1":"0.2082377663136… |
| 16 | 0 | 0 | {"0":"0.8377318499121809","1":"0.1622681500878… |
| 17 | 0 | 0 | {"0":"0.9647915025127575","1":"0.0352084974872… |
| 18 | 0 | 0 | {"0":"0.7313985049080408","1":"0.2686014950919… |
| 19 | 1 | 0 | {"0":"0.8541619467983884","1":"0.1458380532016… |
# ftrl eval
EvalBinaryClassStreamOp() \
.setLabelCol(labelColName) \
.setPredictionCol("pred") \
.setPredictionDetailCol("details") \
.setTimeInterval(10) \
.linkFrom(predResult) \
.link(JsonValueStreamOp() \
.setSelectedCol("Data") \
.setReservedCols(["Statistics"]) \
.setOutputCols(["Accuracy", "AUC", "ConfusionMatrix"]) \
.setJsonPath(["$.Accuracy", "$.AUC", "$.ConfusionMatrix"])) \
.print(key="evaluation", refreshInterval = 30, maxLimit=20)
StreamOperator.execute()
'DataStream evaluation: (Updated on 2019-12-05 15:03:31)'
| | Statistics | Accuracy | AUC | ConfusionMatrix |
|---|---|---|---|---|
| 0 | all | 0.8286096670786908 | 0.7182165258211499 | [[5535,5007],[112297,561587]] |
| 1 | window | 0.8464953470502861 | 0.7283501551891348 | [[485,456],[8534,49090]] |
| 2 | all | 0.830019475336848 | 0.7191075542108774 | [[6020,5463],[120831,610677]] |
| 3 | window | 0.8455799884444143 | 0.7227709897015594 | [[512,416],[8671,49247]] |
| 4 | all | 0.8311614455307001 | 0.719465721678977 | [[6532,5879],[129502,659924]] |
| 5 | window | 0.8444954128440367 | 0.7259189182276968 | [[545,455],[8698,49162]] |
| 6 | all | 0.8320733080282608 | 0.7199603254520217 | [[7077,6334],[138200,709086]] |
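JsonValueStreamOp pulls individual fields out of the JSON blob in the Data column using the given JsonPath expressions. For the simple top-level paths used here, extraction reduces to plain key lookups (the sample JSON below is reconstructed from the first row of the evaluation table; its exact field layout is an assumption):

```python
import json

# hypothetical sample of the JSON emitted by the evaluation operator
raw = ('{"Accuracy": 0.8286096670786908, "AUC": 0.7182165258211499, '
       '"ConfusionMatrix": [[5535, 5007], [112297, 561587]]}')

record = json.loads(raw)
# "$.Accuracy", "$.AUC" and "$.ConfusionMatrix" are top-level JsonPath
# expressions, so each one is equivalent to a dictionary access:
accuracy = record["Accuracy"]
auc = record["AUC"]
confusion_matrix = record["ConfusionMatrix"]
print(accuracy, auc, confusion_matrix)
```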
from pyalink.alink import *
resetEnv()
useLocalEnv(1, config=None)
Use one of the following commands to start using PyAlink:
- useLocalEnv(parallelism, flinkHome=None, config=None)
- useRemoteEnv(host, port, parallelism, flinkHome=None, localIp="localhost", config=None)
Call resetEnv() to reset the environment and switch to another.
JVM listening on 127.0.0.1:57785
JavaObject id=o6
## prepare data
import numpy as np
import pandas as pd
data = np.array([
[0, 0.0, 0.0, 0.0],
[1, 0.1, 0.1, 0.1],
[2, 0.2, 0.2, 0.2],
[3, 9, 9, 9],
[4, 9.1, 9.1, 9.1],
[5, 9.2, 9.2, 9.2]
])
df = pd.DataFrame({"id": data[:, 0], "f0": data[:, 1], "f1": data[:, 2], "f2": data[:, 3]})
inOp = BatchOperator.fromDataframe(df, schemaStr='id double, f0 double, f1 double, f2 double')
FEATURE_COLS = ["f0", "f1", "f2"]
VECTOR_COL = "vec"
PRED_COL = "pred"
vectorAssembler = (
VectorAssembler()
.setSelectedCols(FEATURE_COLS)
.setOutputCol(VECTOR_COL)
)
kMeans = (
KMeans()
.setVectorCol(VECTOR_COL)
.setK(2)
.setPredictionCol(PRED_COL)
)
pipeline = Pipeline().add(vectorAssembler).add(kMeans)
result = pipeline.fit(inOp).transform(inOp).firstN(9).collectToDataframe()
print(result)

| | id | f0 | f1 | f2 | vec | pred |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 0.0 0.0 | 1 |
| 1 | 1.0 | 0.1 | 0.1 | 0.1 | 0.1 0.1 0.1 | 1 |
| 2 | 2.0 | 0.2 | 0.2 | 0.2 | 0.2 0.2 0.2 | 1 |
| 3 | 3.0 | 9.0 | 9.0 | 9.0 | 9.0 9.0 9.0 | 0 |
| 4 | 4.0 | 9.1 | 9.1 | 9.1 | 9.1 9.1 9.1 | 0 |
| 5 | 5.0 | 9.2 | 9.2 | 9.2 | 9.2 9.2 9.2 | 0 |

%timeit pipeline.fit(inOp).transform(inOp).firstN(9).collectToDataframe()
301 ms ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
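For intuition, the clustering that KMeans performs on these six points can be sketched as Lloyd's algorithm in plain NumPy (a 2-cluster toy implementation, not Alink's):

```python
import numpy as np

# the same six 3-d points as above: two well-separated groups near 0 and near 9
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
              [9.0, 9.0, 9.0], [9.1, 9.1, 9.1], [9.2, 9.2, 9.2]])

def two_means(X, iters=10):
    # deterministic init for k=2: the first point, plus the point farthest from it
    centers = X[[0, int(np.argmax(((X - X[0]) ** 2).sum(axis=1)))]]
    for _ in range(iters):
        # assignment step: nearest center for every point
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    return labels, centers

labels, centers = two_means(X)
print(labels)  # the first three points share one label, the last three the other
```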
from pyalink.alink import *
resetEnv()
useLocalEnv(1, config=None)
Use one of the following commands to start using PyAlink:
- useLocalEnv(parallelism, flinkHome=None, config=None)
- useRemoteEnv(host, port, parallelism, flinkHome=None, localIp="localhost", config=None)
Call resetEnv() to reset the environment and switch to another.
JVM listening on 127.0.0.1:57514
JavaObject id=o6
Handwritten digit recognition (MNIST)
URL = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/mnist_dense.csv"
SCHEMA_STR = "label bigint, bitmap string"
mnist_data = CsvSourceBatchOp() \
.setFilePath(URL) \
.setSchemaStr(SCHEMA_STR)\
.setFieldDelimiter(";")
splitter = SplitBatchOp().setFraction(0.8)
train = splitter.linkFrom(mnist_data)
test = splitter.getSideOutput(0)
softmax = Softmax().setVectorCol("bitmap").setLabelCol("label") \
.setPredictionCol("pred").setPredictionDetailCol("detail") \
.setEpsilon(0.0001).setMaxIter(200)
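Softmax here is multinomial logistic regression: the model learns one weight vector per digit class and normalizes the class scores with the softmax function. The scoring step can be sketched in NumPy (the weights below are made up for illustration):

```python
import numpy as np

def softmax_predict(x, W, b):
    # multinomial logistic regression: P(y=k | x) proportional to exp(w_k . x + b_k)
    logits = W @ x + b
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy example: 3 classes, 4 features (weights are hypothetical)
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
b = np.zeros(3)
x = np.array([0.2, 0.1, 2.0, 1.0])

probs = softmax_predict(x, W, b)
print(probs.argmax())  # the third class dominates for this input
```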
%timeit softmax.fit(train)
20.7 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
model = softmax.fit(train)  # fit once more; assignments inside %timeit do not persist
res = model.transform(test)
evaluation = EvalMultiClassBatchOp().setLabelCol("label").setPredictionCol("pred")
metrics = evaluation.linkFrom(res).collectMetrics()
print("ConfusionMatrix:", metrics.getConfusionMatrix())
print("LabelArray:", metrics.getLabelArray())
print("LogLoss:", metrics.getLogLoss())
print("TotalSamples:", metrics.getTotalSamples())
print("ActualLabelProportion:", metrics.getActualLabelProportion())
print("ActualLabelFrequency:", metrics.getActualLabelFrequency())
print("Accuracy:", metrics.getAccuracy())
print("Kappa:", metrics.getKappa())
ConfusionMatrix: [[170, 3, 5, 0, 1, 7, 2, 2, 1, 0], [2, 154, 2, 1, 14, 3, 6, 9, 0, 2], [9, 3, 174, 0, 3, 3, 3, 3, 0, 0], [0, 0, 1, 162, 5, 4, 2, 6, 0, 7], [5, 9, 2, 5, 160, 1, 8, 1, 0, 0], [11, 4, 2, 0, 4, 187, 1, 2, 1, 1], [2, 5, 2, 2, 6, 1, 170, 4, 1, 0], [0, 2, 8, 4, 2, 4, 8, 180, 6, 1], [1, 3, 3, 1, 3, 1, 3, 3, 209, 0], [2, 2, 2, 0, 3, 1, 1, 2, 0, 179]]
LabelArray: ['9', '8', '7', '6', '5', '4', '3', '2', '1', '0']
LogLoss: None
TotalSamples: 2000
ActualLabelProportion: [0.101, 0.0925, 0.1005, 0.0875, 0.1005, 0.106, 0.102, 0.106, 0.109, 0.095]
ActualLabelFrequency: [202, 185, 201, 175, 201, 212, 204, 212, 218, 190]
Accuracy: 0.8725
Kappa: 0.858283141946106
This document has introduced the origin of Alink and its relationship to Flink, and walked through several basic end-to-end examples. The example code can be run directly in JupyterLab to obtain both training and prediction results. Alink's core runs on Flink, with PyAlink exposing it through Python, and its pipeline-style machine learning workflow closely resembles Spark ML's. Roughly speaking, Alink can be understood as a counterpart to Spark ML that additionally supports online learning over streaming data. We will continue to explore this direction.