sparkMlib_doc

Model input/output mapping

Input table (Hive) -> model parameters -> output model (HDFS)
- DecisionTree
- GBTC
- LogisticRegression
- NaiveBayes
- RandomForest
- BisectingKMeans
- IDFTrain
- ALS
- DecisionTreeRegression
- LinearRegression
- RandomForestRegression

Example:
'{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10}'
'{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'
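
The three quoted arguments above are JSON objects passed positionally after the jar. As a rough sketch, they could be assembled programmatically as follows. The helper name `build_train_args` is hypothetical and not part of grimoire; the class path, master, and jar name are copied from the submit commands later in this document.

```python
import json

def build_train_args(main_class, input_table, params, model_target, overwrite=True):
    """Assemble a spark-submit argument list for a training job (hypothetical helper)."""
    return [
        "spark-submit", "--class", main_class, "--master", "local[4]",
        "grimoire-assembly-0.1.0.jar",
        json.dumps({"InputHiveTable": input_table}),          # 1st argument: input Hive table
        json.dumps(params),                                   # 2nd argument: model parameters
        json.dumps({"OutputModelTarget": model_target,        # 3rd argument: output model on HDFS
                    "Overwrite": overwrite}),
    ]

args = build_train_args(
    "grimoire.ml.classify.conjure.GBTC",
    "DecisionTree",
    {"LabelCol": "label", "FeaturesCol": "features",
     "PredictionCol": "prediction", "MaxIter": 10},
    "hdfs:///user/model/GBTC",
)
print(args[-1])
```

Passing the list to `subprocess.run(args)` avoids shell quoting problems with the embedded JSON.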

Input table (Hive) -> input model (HDFS) -> output table (Hive)
- DecisionTreePre
- GBTCPre
- LogisticRegressionPre
- NaiveBayesPre
- RandomForestPre
- BisectingKMeansPre
- TFIDFVectorize
- ALSPre
- DecisionTreeRegressionPre
- LinearRegressionPre
- RandomForestRegressionPre

Example:
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"OutputHiveTable":"DecisionTreeTestPre"}'

Input table (Hive) -> model parameters -> output table (Hive)
- Binarizer
- Bucketizer
- ChiSqSelector
- CountVectorizer
- DCT
- TFVectorize
- IndexToString
- MinMaxScaler
- NGram
- Normalizer
- OneHotEncoder
- PCA
- RegexTokenizer
- StandardScaler
- StringIndexer
- VectorAssembler
- VectorIndexer

Example:
'{"InputHiveTable":"Binarizer"}'
'{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}'
'{"OutputHiveTable":"Binarizer"}'

Model name -> input table (Hive) -> input model (HDFS) -> model parameters -> output result (JSON saved to HDFS)
- BinaryClassificationMetrics
- RegressionMetrics

Example:
'DecisionTreeRegression'
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"LabelCol":"label","PredictionCol":"prediction"}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

Input table (Hive) -> model parameters -> output result (saved to HDFS as JSON)
- StatisticalSummary
- Correlations

'{"InputHiveTable":"DecisionTree"}'
'{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

Input table (Hive) -> model parameters
- RandomSplit
  Note: the training set and the test set are saved to Hive by default.

Example:
'{"InputHiveTable":"DecisionTreeRegression"}'
'{"RandomRate":0.5}'
'{"OutputTrainingSetTable":"trainSet"}'
'{"OutputTestSetTable":"testSet"}'

Features

Feature extraction

TFVectorize
1.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
NumFeatures	int	Number of features. Should be greater than 0.	20	false
Binary	bool	Binary toggle to control term frequency counts.	false	false

1.2 Test case
feature.TFVectorize
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.TFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"wfwfwfwf"}' '{"InputCol":"segmented","OutputCol":"tf","NumFeatures":100,"Binary":true}' '{"OutputHiveTable":"test2"}'
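
The `NumFeatures` and `Binary` parameters suggest a hashing-based term-frequency transform. The sketch below illustrates that idea in plain Python, on the assumption that TFVectorize behaves like the hashing trick; `zlib.crc32` is only a stand-in hash (Spark's HashingTF uses MurmurHash3), so the bucket indices will not match Spark's output.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=20, binary=False):
    """Map a token list to a fixed-length term-frequency vector via the hashing trick."""
    vec = [0.0] * num_features
    for token, count in Counter(tokens).items():
        # Stand-in hash function, chosen for illustration only.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        if binary:
            vec[index] = 1.0           # any non-zero count becomes 1
        else:
            vec[index] += float(count)  # colliding tokens share a bucket
    return vec

print(hashing_tf(["spark", "hive", "spark"], num_features=8))
```

A larger `NumFeatures` lowers the chance that two different tokens collide in the same bucket.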

IDFTrainSpell
2.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
MinDocFreq	int	The minimum number of documents in which a term should appear.	0	false

2.2 Test case
feature.IDFTrain
2.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.IDFTrain --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputCol":"tf", "OutputCol":"tfidf"}' '{"OutputModelTarget":"hdfs:///user/model/idf2","Overwrite":true}'
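
To make `MinDocFreq` concrete, here is a small sketch of the inverse document frequency weighting that Spark MLlib documents, idf = log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. It assumes IDFTrain follows that formula; the function name and the dense list-of-lists input are illustrative only.

```python
import math

def idf_weights(doc_term_vectors, min_doc_freq=0):
    """Compute one IDF weight per term from dense term-frequency vectors."""
    num_docs = len(doc_term_vectors)
    num_terms = len(doc_term_vectors[0])
    weights = []
    for j in range(num_terms):
        # Document frequency: number of documents in which term j occurs.
        df = sum(1 for vec in doc_term_vectors if vec[j] > 0)
        if df < min_doc_freq:
            weights.append(0.0)  # terms that are too rare are zeroed out
        else:
            weights.append(math.log((num_docs + 1) / (df + 1)))
    return weights

print(idf_weights([[1, 0], [1, 1]]))
```

A term that occurs in every document gets weight 0, so it contributes nothing after the TF-IDF product.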

TFIDFVectorize
3.1 Parameters

Parameter	Type	Description	Default	Required
OutputCol	string	Param for output column name.	output	true

3.2 Test case
feature.TFIDFVectorize
3.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.TFIDFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputModelSource":"hdfs:///user/model/idf2"}' '{"OutputCol":"tfidf"}' '{"OutputHiveTable":"TFIDFVectorize"}'

CountVectorizer
4.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
VocabSize	int	Max size of the vocabulary.	262144	false
MinDF	double	Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer greater than or equal to 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents.	1	false
Binary	boolean	Binary toggle to control term frequency counts. If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.	false	false

4.2 Test case
feature.CountVectorizer
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.CountVectorizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"countvectorizer"}' '{"InputCol":"words","OutputCol":"features","VocabSize":3,"MinDF":2}' '{"OutputHiveTable":"CountVectorzerConjure"}'

Feature transformation

RegexTokenizer
1.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
Pattern	String	Regex pattern used to match delimiters if gaps is true or tokens if gaps is false.	\s+	false
MinTokenLength	int	Minimum token length, greater than or equal to 0, to avoid returning empty strings	1	false
ToLowercase	boolean	Indicates whether to convert all characters to lowercase before tokenizing.	true	false

1.2 Test case
feature.RegexTokenizer
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.RegexTokenizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"RegexTokenizer"}' '{"InputCol":"sentence","OutputCol":"words","Pattern":"\\W"}' '{"OutputHiveTable":"RegexTokenizerConjure"}'
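
The effect of `Pattern`, `MinTokenLength`, and `ToLowercase` can be sketched in plain Python. This assumes the pattern matches delimiters (gaps is true, which is Spark's default; the table above does not expose a gaps setting) and uses Python's `re` module, whose syntax differs from Java regex in some corner cases.

```python
import re

def regex_tokenize(text, pattern=r"\s+", min_token_length=1, to_lowercase=True):
    """Split text on a delimiter pattern and drop tokens that are too short."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text)
    # Splitting on adjacent delimiters yields empty strings; the length filter removes them.
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("Hi, I heard about Spark", pattern=r"\W"))
```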

NGram
2.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
N	int	Minimum n-gram length, greater than or equal to 1.	2	true

2.2 Test case
feature.NGram
2.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.NGram --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"NGram"}' '{"InputCol":"words","OutputCol":"ngrams","N":2}' '{"OutputHiveTable":"NGramConjure"}'
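
As an illustration of what `N` controls, the sketch below builds space-joined n-grams from a token list, which is how Spark's NGram transformer is documented to behave. The function name is illustrative.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, each joined by a single space."""
    # When the input has fewer than n tokens the range is empty and no n-grams are produced.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["i", "heard", "about", "spark"], n=2))
```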

Binarizer
3.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
Threshold	Double	Param for threshold used to binarize continuous features.	0.5	false

3.2 Test case
feature.Binarizer
3.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Binarizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Binarizer"}' '{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}' '{"OutputHiveTable":"BinarizerConjure"}'
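
A minimal sketch of the thresholding rule, assuming the same convention as Spark's Binarizer: values strictly greater than `Threshold` become 1.0, and values equal to or below it become 0.0.

```python
def binarize(values, threshold=0.5):
    """Map each continuous value to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

print(binarize([0.1, 0.5, 0.8]))
```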

PCA
4.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
K	Int	The number of principal components.	4	true

4.2 Test case
feature.PCA
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.PCA --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"PCA"}' '{"InputCol":"features","OutputCol":"pcaFeatures","K":3}' '{"OutputHiveTable":"PCAConjure"}'

DCT
5.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
Inverse	Boolean	Indicates whether to perform the inverse DCT (true) or the forward DCT (false).	null	true

5.2 Test case
feature.DCT
5.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.DCT --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DCT"}' '{"InputCol":"features","OutputCol":"featuresDCT","Inverse":false}' '{"OutputHiveTable":"DCTConjure"}'

StringIndexer
6.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
HandleInvalid	string	Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later.	error	false

6.2 Test case
feature.StringIndexer
6.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StringIndexer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StringIndexer"}' '{"InputCol":"category","OutputCol":"categoryIndex"}' '{"OutputHiveTable":"StringIndexerConjure"}'
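
For orientation, the following sketch shows the indexing scheme Spark's StringIndexer uses by default: labels are ordered by descending frequency, so the most frequent label gets index 0. Breaking ties alphabetically is an assumption of this sketch and may differ between Spark versions.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Assign indices to labels, most frequent first (ties broken alphabetically here)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = fit_string_indexer(["a", "b", "c", "a", "a", "c"])
print(mapping)
```

With `HandleInvalid` set to error, a label missing from this mapping at transform time would raise an error; with skip, the row would be dropped.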

IndexToString
7.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true

7.2 Test case
feature.IndexToString
7.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.IndexToString --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"IndexToString"}' '{"InputCol":"categoryIndex","OutputCol":"originalCategory"}' '{"OutputHiveTable":"IndexToStringConjure"}'

OneHotEncoder
8.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
DropLast	Boolean	Whether to drop the last category in the encoded vector	true	false

8.2 Test case
feature.OneHotEncoder
8.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.OneHotEncoder --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"OneHotEncoder"}' '{"InputCol":"categoryIndex","OutputCol":"categoryVec","DropLast":false}' '{"OutputHiveTable":"OneHotEncoderConjure"}'
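
The effect of `DropLast` is easiest to see on a small example. The sketch below uses dense lists for readability (Spark emits sparse vectors) and assumes the category count is known in advance.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a one-hot vector, optionally omitting the last category."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    # With drop_last=True the last category maps to the all-zeros vector.
    return vec

print(one_hot(2, 3, drop_last=True))
print(one_hot(2, 3, drop_last=False))
```

Dropping the last category keeps the encoded columns linearly independent, which matters for models that fit an intercept.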

VectorAssembler
9.1 Parameters

Parameter	Type	Description	Default	Required
InputCols	seq[string]	Param for input column names.	null	true
OutputCol	string	Param for output column name.	output	true

9.2 Test case
feature.VectorAssembler
9.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.VectorAssembler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"VectorAssembler"}' '{"InputCols":["hour", "mobile", "userFeatures"],"OutputCol":"features"}' '{"OutputHiveTable":"VectorAssemblerConjure"}'

Normalizer
10.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
P	double	Normalization in Lp space.	null	true

10.2 Test case
feature.Normalizer
10.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Normalizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Normalizer"}' '{"InputCol":"features","OutputCol":"normFeatures","P":1}' '{"OutputHiveTable":"NormalizerConjure"}'
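
To show what `P` means, here is a small sketch that divides a vector by its Lp norm. It operates on a plain Python list rather than a Spark vector, and returning an all-zero vector unchanged is a choice made for this sketch.

```python
def normalize(vec, p=2.0):
    """Scale a vector to unit Lp norm."""
    if p == float("inf"):
        norm = max(abs(x) for x in vec)
    else:
        norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    # An all-zero vector has norm 0 and is returned as-is to avoid division by zero.
    return [x / norm for x in vec] if norm > 0 else list(vec)

print(normalize([1.0, 0.5, -1.0], p=1))
```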

StandardScaler
11.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
WithStd	bool	Whether to scale the data to unit standard deviation.	false	false
WithMean	bool	Whether to center the data with mean before scaling.	true	false

11.2 Test case
feature.StandardScaler
11.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StandardScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StandardScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures","WithStd":true}' '{"OutputHiveTable":"StandardScalerConjure"}'

MinMaxScaler
12.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true

12.2 Test case
feature.MinMaxScaler
12.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.MinMaxScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"MinMaxScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures"}' '{"OutputHiveTable":"MinMaxScalerConjure"}'
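
The table exposes no range parameters, so the sketch below assumes the target range is [0, 1], which is Spark's MinMaxScaler default. Each feature column is rescaled independently, and a constant column maps to the midpoint of the range, as Spark documents.

```python
def min_max_scale(rows, lo=0.0, hi=1.0):
    """Rescale every feature column of a row-major dataset to the range [lo, hi]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        out = []
        for x, mn, mx in zip(row, mins, maxs):
            if mx == mn:
                out.append(0.5 * (lo + hi))  # constant column: use the midpoint
            else:
                out.append((x - mn) / (mx - mn) * (hi - lo) + lo)
        scaled.append(out)
    return scaled

print(min_max_scale([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]))
```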

Bucketizer
13.1 Parameters

Parameter	Type	Description	Default	Required
InputCol	string	Param for input column name.	null	true
OutputCol	string	Param for output column name.	output	true
Splits	seq[double]	Split points for mapping continuous features into buckets. With n+1 splits there are n buckets; splits must be strictly increasing.	null	true

13.2 Test case
feature.Bucketizer
13.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Bucketizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Bucketizer"}' '{"InputCol":"features","OutputCol":"bucketedFeatures","Splits":[-1000, -0.5, 0.0, 0.5, 1000]}' '{"OutputHiveTable":"BucketizerConjure"}'
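
How `Splits` maps a value to a bucket index can be sketched with a binary search. This follows the convention documented for Spark's Bucketizer: a bucket defined by splits x, y holds values in [x, y), except the last bucket, which also includes its upper bound. Raising `ValueError` for out-of-range values is a choice made for this sketch.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket that contains value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the split range")
    if value == splits[-1]:
        return float(len(splits) - 2)  # the last bucket is closed on the right
    return float(bisect_right(splits, value) - 1)

splits = [-1000, -0.5, 0.0, 0.5, 1000]
print([bucketize(v, splits) for v in [-999.9, -0.5, 0.2, 1000]])
```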

RandomSplit
14.1 Parameters

Parameter	Type	Description	Default	Required
RandomRate	double	Fraction of rows randomly assigned to the training set.	null	true
Trainset	string	Output table for the training split.	trainset	true
TestSet	string	Output table for the test split.	testset	true

14.2 Test case
feature.RandomSplit
14.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.RandomSplit --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"RandomRate":0.5}' '{"OutputTrainingSetTable":"trainSet"}' '{"OutputTestSetTable":"testSet"}'
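
A per-row random split can be sketched as follows, on the assumption that `RandomRate` is the probability that a row goes to the training set. The function name and the fixed seed are illustrative; as with Spark's `randomSplit`, the resulting sizes are only approximately proportional to the rate.

```python
import random

def random_split(rows, random_rate=0.5, seed=42):
    """Assign each row to the training set with probability random_rate, else to the test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < random_rate else test).append(row)
    return train, test

train, test = random_split(list(range(100)), random_rate=0.5)
print(len(train), len(test))
```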

Feature selection

ChiSqSelector
1.1 Parameters

Parameter	Type	Description	Default	Required
FeaturesCol	string	Column of features.	features	true
OutputCol	string	Param for output column name.	output	true
NumTopFeatures	Int	Number of features that selector will select, ordered by ascending p-value.	50	true
LabelCol	String	Column name of label.	label	true

1.2 Test case
feature.ChiSqSelector
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.ChiSqSelector --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ChiSqSelector"}' '{"NumTopFeatures":1,"FeaturesCol":"features","LabelCol":"clicked","OutputCol":"chisq"}' '{"OutputHiveTable":"ChiSqSelectorConjure"}'

Classification

LogisticRegression
1.1 Parameters

Parameter	Type	Description	Default	Required
MaxIter	Int	Maximum number of training iterations.	100	false
RegParam	double	Regularization parameter.	0.0	false
ElasticNetParam	double	The ElasticNet mixing parameter α, in range [0, 1].	0.0	true
Family	String	Label distribution family: auto, binomial, or multinomial.	auto	false
LabelCol	String	Column name of label.	label	true
FeaturesCol	String	Column of features.	features	true
FitIntercept	boolean	Param for whether to fit an intercept term.	true	false
Standardization	boolean	Param for whether to standardize the training features before fitting the model.	true	false
Threshold	double	Threshold in binary classification prediction, in range [0, 1].	0.5	false
Tol	double	Param for the convergence tolerance for iterative algorithms (>= 0).	1.0E-6	false
ProbabilityCol	String	Param for column name for predicted class conditional probabilities.	probability	false
RawPredictionCol	String	Param for raw prediction (a.k.a. confidence) column name.	rawPrediction	false
PredictionCol	String	Param for prediction column name.	prediction	true

1.2 Test case
classify.LogisticRegression
1.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.LogisticRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8,"LabelCol":"label","FeaturesCol":"features","PredictionCol":"predic"}' '{"OutputModelTarget":"hdfs:///user/model/LogisticRegression","Overwrite":true}'

LogisticRegressionPre
2.1 Test case
classify.LogisticRegressionPre
2.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.LogisticRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"InputModelSource":"hdfs:///user/model/LogisticRegression"}' '{"OutputHiveTable":"LogisticRegressionTestPre"}'

DecisionTree
3.1 Parameters

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	true
FeaturesCol	String	Column of features.	features	true
RawPredictionCol	String	Param for raw prediction (a.k.a. confidence) column name.	rawPrediction	false
Seed	long	Random seed.	159147643	false
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1).	10	false
Impurity	String	Criterion used for information gain calculation (case-insensitive).	gini	false
MaxBins	int	Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node.	32	false
MaxDepth	int	Maximum depth of the tree (>= 0).	5	false
Threshold	double	Limit of calculation	0.5	false
MinInfoGain	double	Minimum information gain for a split to be considered at a tree node.	0.0	false
MinInstancesPerNode	int	Minimum number of instances each child must have after split.	1	false
PredictionCol	String	Param for prediction column name.	prediction	true
ProbabilityCol	String	Param for Column name for predicted class conditional probabilities.	probabilities	false

3.2 Test case
classify.DecisionTree
3.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.DecisionTree --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTree","Overwrite":true}'

DecisionTreePre
4.1 Test case
classify.DecisionTreePre
4.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.DecisionTreePre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/DecisionTree"}' '{"OutputHiveTable":"DecisionTreeTestPre"}'

GBTC
5.1 Model parameters

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
MaxIter	int	Max iteration of train (default: 20)	20	False
Impurity	String	Criterion used for information gain calculation (case-insensitive). (default: gini)	gini	False
MaxBins	int	Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32)	32	False
MaxDepth	int	Maximum depth of the tree (>= 0). (default: 5)	5	False
MinInfoGain	double	Minimum information gain for a split to be considered at a tree node. (default: 0.0)	0.0	False
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10)	10	False
MinInstancesPerNode	int	Minimum number of instances each child must have after split. (default: 1)	1	False
PredictionCol	String	Param for prediction column name.	prediction	True
Seed	long	Random seed. (default: -1287390502)	-1287390502	False
RawPredictionCol	String	Param for raw prediction (a.k.a. confidence) column name. (default: rawPrediction)	rawPrediction	False
subsamplingRate	double	Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)	1.0	False
LossType	String	Loss function which GBT tries to minimize. (default: squared)	squared	False
StepSize	double	Step size (learning rate) in the interval (0, 1] for shrinking the contribution of each estimator. (default: 0.1)	0.1	False

5.2 Test case
classify.GBTC
5.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.GBTC --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"LossType":"logistic"}' '{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'

GBTCPre
6.1 Test case
classify.GBTCPre
6.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.GBTCPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/GBTC"}' '{"OutputHiveTable":"GBTCTestPre"}'

NaiveBayes
7.1 Model parameters

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
WeightCol	String	Param for weight column name. (default: null)	None	False
ModelType	String	The model type which is a string (case-sensitive). (default: multinomial)	multinomial	False
Smoothing	double	The smoothing parameter. (default: 1.0)	1.0	False
PredictionCol	String	Param for prediction column name.	prediction	True
ProbabilityCol	String	Param for Column name for predicted class conditional probabilities. (default: probabilities)	probabilities	False
RawPredictionCol	String	Param for raw prediction (a.k.a. confidence) column name. (default: rawPrediction)	rawPrediction	False
Thresholds	Seq	Param for Thresholds in multi-class classification to adjust the probability of predicting each class. (default: null)	None	False

7.2 Test case
classify.NaiveBayes
7.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.NaiveBayes --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/NaiveBayes","Overwrite":true}'

NaiveBayesPre
8.1 Test case
classify.NaiveBayesPre
8.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.NaiveBayesPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/NaiveBayes"}' '{"OutputHiveTable":"NaiveBayesTestPre"}'

RandomForest
9.1 Model parameters

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
NumTrees	int	Number of trees to train.	20	True
Impurity	String	Criterion used for information gain calculation (case-insensitive). (default: gini)	gini	False
MaxBins	int	Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32)	32	False
MaxDepth	int	Maximum depth of the tree (>= 0). (default: 5)	5	False
MinInfoGain	double	Minimum information gain for a split to be considered at a tree node. (default: 0.0)	0.0	False
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10)	10	False
MinInstancesPerNode	int	Minimum number of instances each child must have after split. (default: 1)	1	False
PredictionCol	String	Param for prediction column name.	prediction	True
FeatureSubsetStrategy	String	The number of features to consider for splits at each tree node. (default: auto)	auto	False
ProbabilityCol	String	Param for Column name for predicted class conditional probabilities. (default: probabilities)	probabilities	False
Thresholds	Seq	Param for Thresholds in multi-class classification to adjust the probability of predicting each class. (default: null)	None	False
Seed	long	Random seed. (default: 159147643)	159147643	False
RawPredictionCol	String	Param for raw prediction (a.k.a. confidence) column name. (default: rawPrediction)	rawPrediction	False
subsamplingRate	double	Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)	1.0	False

9.2 Test case
classify.RandomForest
9.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.RandomForest --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":10}' '{"OutputModelTarget":"hdfs:///user/model/RandomForest","Overwrite":true}'

RandomForestPre
10.1 Test case
classify.RandomForestPre
10.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.RandomForestPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"OutputHiveTable":"RandomForestTestPre"}'

Regression

DecisionTreeRegression
1.1 Model parameters

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
PredictionCol	String	Param for prediction column name.	prediction	True
varianceCol	String	Param for Column name for the biased sample variance of prediction. (default: null)	None	False
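CacheNodeIds	boolean	Whether to cache node IDs for each training instance; if false, trees are passed to executors to match instances with nodes. (default: false)	false	False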
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10)	10	False
Impurity	String	Criterion used for information gain calculation (case-insensitive). (default: variance)	variance	False
MaxBins	int	Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32)	32	False
MaxDepth	int	Maximum depth of the tree (>= 0). (default: 5)	5	False
maxMemoryInMB	int	Maximum memory in MB allocated to histogram aggregation. (default: 256)	256	False
MinInfoGain	double	Minimum information gain for a split to be considered at a tree node. (default: 0.0)	0.0	False
MinInstancesPerNode	int	Minimum number of instances each child must have after split. (default: 1)	1	False
Seed	long	Random seed. (default: 926680331)	926680331	False

1.2 Test case
regression.DecisionTreeRegression
1.3 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTreeRegression","Overwrite":true}'

DecisionTreeRegressionPre
2.1 Test case
regression.DecisionTreeRegressionPre
2.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/DecisionTreeRegression"}' '{"OutputHiveTable":"DecisionTreeRegressionPre"}'

LinearRegression

Parameter	Type	Description	Default	Required
MaxIter	int	Maximum number of training iterations. (default: 100)	100	False
RegParam	double	Regularization parameter. (default: 0.0)	0.0	False
ElasticNetParam	double	The ElasticNet mixing parameter α, in range [0, 1]. (default: 0.0)	0.0	False
FitIntercept	boolean	Param for whether to fit an intercept term. (default: true)	true	False
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
WeightCol	String	Param for weight column name. (default: null)	None	False
AggregationDepth	int	Suggested depth for treeAggregate (>= 2). (default: 2)	2	False
Standardization	boolean	Param for whether to standardize the training features before fitting the model. (default: true)	true	False
Solver	String	Param for the solver algorithm for optimization. (default: auto)	auto	False
Tol	double	Param for the convergence tolerance for iterative algorithms (>= 0). (default: 1.0E-6)	1.0E-6	False
PredictionCol	String	Param for prediction column name.	prediction	True

3.1 Test case
regression.LinearRegression
3.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.LinearRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8}' '{"OutputModelTarget":"hdfs:///user/model/LinearRegression","Overwrite":true}'

LinearRegressionPre
4.1 Test case
regression.LinearRegressionPre
4.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.LinearRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/LinearRegression"}' '{"OutputHiveTable":"LinearRegressionPre"}'

RandomForestRegression

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
FeaturesCol	String	Column of features.	features	True
PredictionCol	String	Param for prediction column name.	prediction	True
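CacheNodeIds	boolean	Whether to cache node IDs for each training instance; if false, trees are passed to executors to match instances with nodes. (default: false)	false	False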
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10)	10	False
Seed	long	Random seed. (default: 235498149)	235498149	False
Impurity	String	Criterion used for information gain calculation (case-insensitive). (default: variance)	variance	False
MaxBins	int	Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32)	32	False
MaxDepth	int	Maximum depth of the tree (>= 0). (default: 5)	5	False
maxMemoryInMB	int	Maximum memory in MB allocated to histogram aggregation. (default: 256)	256	False
MinInfoGain	double	Minimum information gain for a split to be considered at a tree node. (default: 0.0)	0.0	False
MinInstancesPerNode	int	Minimum number of instances each child must have after split. (default: 1)	1	False
NumTrees	int	Number of trees to train.	10	True
subsamplingRate	double	Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)	1.0	False
FeatureSubsetStrategy	String	The number of features to consider for splits at each tree node. (default: auto)	auto	False

5.1 Test case
regression.RandomForestRegression
5.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.RandomForestRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":20}' '{"OutputModelTarget":"hdfs:///user/model/RandomForestRegression","Overwrite":true}'

RandomForestRegressionPre
6.1 Test case
regression.RandomForestRegressionPre
6.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.RandomForestRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/RandomForestRegression"}' '{"OutputHiveTable":"RandomForestRegressionPre"}'

Clustering

BisectingKMeans

Parameter	Type	Description	Default	Required
K	int	The number of clusters to infer. Must be > 1.	4	True
Seed	long	Random seed. (default: 566573821)	566573821	False
PredictionCol	String	Param for prediction column name.	prediction	True
FeaturesCol	String	Column of features.	features	True
MaxIter	int	Param for maximum number of iterations (>= 0). (default: 20)	20	False
minDivisibleClusterSize	int	Minimum size of a divisible cluster. (default: 1)	1	False

1.1 Test case
clustering.BisectingKMeans
1.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeans --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","K":2}' '{"OutputModelTarget":"hdfs:///user/model/BisectingKMeans","Overwrite":true}'

BisectingKMeansPre
2.1 Test case
clustering.BisectingKMeansPre
2.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeansPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"InputModelSource":"hdfs:///user/model/BisectingKMeans"}' '{"OutputHiveTable":"BisectingKMeansPre"}'

Collaborative filtering

ALS

Parameter	Type	Description	Default	Required
MaxIter	int	Max iteration of train (default: 10)	10	False
RegParam	double	Regularize parameter (default: 0.1)	0.1	False
UserCol	String	Param for the column name for user ids.	user	True
ItemCol	String	Param for the column name for item ids.	item	True
RatingCol	String	Param for the column name for ratings.	rating	True
PredictionCol	String	Param for prediction column name.	prediction	True
Alpha	double	Param for the alpha parameter in the implicit preference formulation (nonnegative). (default: 1.0)	1.0	False
CheckpointInterval	int	Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10)	10	False
FinalStorageLevel	String	StorageLevel for the ALS model factors. (default: MEMORY_AND_DISK)	MEMORY_AND_DISK	False
Nonnegative	boolean	Param for whether to apply nonnegativity constraints. (default: false)	false	False
NumUserBlocks	int	Param for number of user blocks (positive). (default: 10)	10	False
NumItemBlocks	int	Param for number of item blocks (positive). (default: 10)	10	False
Rank	int	Param for rank of the matrix factorization (positive). (default: 10)	10	False
Seed	long	Random seed. (default: 1994790107)	1994790107	False
ImplicitPrefs	boolean	Param to decide whether to use implicit preference. (default: false)	false	False

1.1 Test case
filtering.ALS
1.2 Submit command
spark-submit --class grimoire.ml.filtering.conjure.ALS --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"MaxIter":5,"RegParam":0.01,"UserCol":"user","ItemCol":"product","RatingCol":"rating","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/ALS","Overwrite":true}'

ALSPre
2.1 Test case
filtering.ALSPre
2.2 Submit command
spark-submit --class grimoire.ml.filtering.conjure.ALSPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"InputModelSource":"hdfs:///user/model/ALS"}' '{"OutputHiveTable":"ALSPre"}'

Statistics

Correlations

Parameter	Type	Description	Default	Required
InputCols	Seq	Names of input columns. (default: null)	None	False
CorrelationMethod	String	Correlation method: pearson or spearman. (default: pearson)	pearson	False

1.1 Test case
statistics.Correlations
1.2 Submit command
spark-submit --class grimoire.ml.statistics.conjure.Correlations --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"CorrelationMethod":"pearson"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss.txt"}'
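
For reference, the pearson method computes the standard Pearson correlation coefficient for each pair of input columns. The sketch below shows that computation in plain Python; the function names are illustrative, and the layout of the JSON that the job writes is not specified in this document.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Pairwise correlations for a list of columns, as a nested list."""
    return [[pearson(a, b) for b in columns] for a in columns]

print(correlation_matrix([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]))
```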

StatisticalSummary

Parameter	Type	Description	Default	Required
InputCols	Seq	Param for input column names.	None	True
rowLabels	Seq	The labels of rows. (default: List(count, max, min, mean, normL1, normL2, numNonzeros, variance))	List(count, max, min, mean, normL1, normL2, numNonzeros, variance)	False
ColLabels	Seq	The labels of columns.	None	True
transposed	boolean	Whether the summary is transposed. (default: true)	true	False
numCols	int	The number of columns. (default: null)	None	False
numRows	int	The number of rows. (default: null)	None	False

2.1 Test case
statistics.StatisticalSummary
2.2 Submit command
spark-submit --class grimoire.ml.statistics.conjure.StatisticalSummary --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

Model evaluation

BinaryClassificationMetrics

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
PredictionCol	String	Param for prediction column name.	prediction	True

1.1 Test case
evaluate.BinaryClassificationMetrics
1.2 Submit command
spark-submit --class grimoire.ml.evaluate.conjure.BinaryClassificationMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'RandomForest' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

RegressionMetrics

Parameter	Type	Description	Default	Required
LabelCol	String	Column name of label.	label	True
PredictionCol	String	Param for prediction column name.	prediction	True

1.1 Test case
evaluate.RegressionMetrics
1.2 Submit command
spark-submit --class grimoire.ml.evaluate.conjure.RegressionMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'LinearRegression' '{"InputHiveTable":"REGRESSION"}' '{"InputModelSource":"hdfs:///user/model/REGRESSIONMOD"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
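
This document does not list which metrics the job writes to the JSON target. As a hedged illustration, the sketch below computes four quantities that Spark's RegressionMetrics exposes (mean squared error, root mean squared error, mean absolute error, and R²) from a label column and a prediction column; the dictionary keys are illustrative.

```python
import math

def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, MAE, and R^2 for paired labels and predictions."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)   # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```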

sparkMlib_doc_1.0