Model input/output mapping
- Input table (Hive) -> model parameters -> output model (HDFS)
- DecisionTree
- GBTC
- LogisticRegression
- NaiveBayes
- RandomForest
- BisectingKMeans
- IDFTrain
- ALS
- DecisionTreeRegression
- LinearRegression
- RandomForestRegression
Example:
'{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10}'
'{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'
- Input table (Hive) -> input model (HDFS) -> output table (Hive)
- DecisionTreePre
- GBTCPre
- LogisticRegressionPre
- NaiveBayesPre
- RandomForestPre
- BisectingKMeansPre
- TFIDFVectorize
- ALSPre
- DecisionTreeRegressionPre
- LinearRegressionPre
- RandomForestRegressionPre
Example:
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"OutputHiveTable":"DecisionTreeTestPre"}'
- Input table (Hive) -> model parameters -> output table (Hive)
- Binarizer
- Bucketizer
- ChiSqSelector
- CountVectorizer
- DCT
- TFVectorize
- IndexToString
- MinMaxScaler
- NGram
- Normalizer
- OneHotEncoder
- PCA
- RegexTokenizer
- StandardScaler
- StringIndexer
- VectorAssembler
- VectorIndexer
Example:
'{"InputHiveTable":"Binarizer"}'
'{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}'
'{"OutputHiveTable":"Binarizer"}'
- Model name -> input table (Hive) -> input model (HDFS) -> model parameters -> output model (HDFS)
- BinaryClassificationMetrics
- RegressionMetrics
Example:
'DecisionTreeRegression'
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"LabelCol":"label","PredictionCol":"prediction"}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
- Input table (Hive) -> model parameters -> output result (saved to HDFS as JSON)
- StatisticalSummary
- Correlations
'{"InputHiveTable":"DecisionTree"}'
'{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
- Input table (Hive) -> model parameters
- RandomSplit
Note: the training set and the test set are saved to Hive by default.
Example:
'{"InputHiveTable":"DecisionTreeRegression"}'
'{"RandomRate":0.5}'
'{"OutputTrainingSetTable":"trainSet"}'
'{"OutputTestSetTable":"testSet"}'
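The layouts above share one shape: a main class followed by single-quoted JSON objects. As a minimal sketch, the command line could be assembled from Python dictionaries so that the JSON is always well formed and correctly quoted. The helper name `build_submit_command` is hypothetical; the class name, jar name, and `--master` value are copied from the commands in this document.

```python
import json
import shlex

# Values copied from the commands in this document; adjust for your deployment.
JAR = "grimoire-assembly-0.1.0.jar"
MASTER = "local[4]"


def build_submit_command(class_name, *args):
    """Return a spark-submit command line (hypothetical helper).

    Each argument is either a dict, serialized as compact JSON, or a plain
    string such as a model name, which is passed through unchanged.
    """
    argv = ["spark-submit", "--class", class_name, "--master", MASTER, JAR]
    for arg in args:
        if isinstance(arg, str):
            argv.append(arg)
        else:
            argv.append(json.dumps(arg, separators=(",", ":")))
    # shlex.join quotes every argument so the JSON survives the shell.
    return shlex.join(argv)


command = build_submit_command(
    "grimoire.ml.feature.conjure.Binarizer",
    {"InputHiveTable": "Binarizer"},
    {"InputCol": "feature", "OutputCol": "binarized_feature", "Threshold": 0.5},
    {"OutputHiveTable": "Binarizer"},
)
print(command)
```

Building the arguments from dictionaries keeps the input, parameter, and output objects separate, which mirrors the argument order used throughout this document.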
Features
Feature extraction
- TFVectorize
1.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
NumFeatures | int | Number of features. Should be greater than 0. | 20 | false |
Binary | bool | Binary toggle to control term frequency counts. | false | false |
1.2 Test case
feature.TFVectorize
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.TFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"wfwfwfwf"}' '{"InputCol":"segmented","OutputCol":"tf","NumFeatures":100,"Binary":true}' '{"OutputHiveTable":"test2"}'
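The `NumFeatures` and `Binary` parameters match those of a hashing term-frequency transformer. The sketch below illustrates the idea in plain Python, assuming TFVectorize behaves like Spark ML's HashingTF. The function name `hashing_tf` is hypothetical, and `zlib.crc32` is only a stand-in for the MurmurHash3 function that Spark uses, so the bucket indices will differ from real output.

```python
import zlib


def hashing_tf(tokens, num_features, binary=False):
    """Map tokens to a fixed-length term-frequency vector via the hashing trick."""
    if num_features <= 0:
        raise ValueError("num_features must be greater than 0")
    vector = [0.0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        # With binary=True every non-zero count is clamped to 1.0.
        vector[index] = 1.0 if binary else vector[index] + 1.0
    return vector


tokens = ["spark", "hive", "spark", "hdfs"]
counts = hashing_tf(tokens, num_features=100)
flags = hashing_tf(tokens, num_features=100, binary=True)
print(sum(counts), max(flags))
```

A small `NumFeatures` increases the chance that two different terms share a bucket, which is the usual trade-off of the hashing trick.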
- IDFTrain
2.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
MinDocFreq | int | The minimum number of documents in which a term should appear. | 0 | false |
2.2 Test case
feature.IDFTrain
2.3 Example submit command
spark-submit --class grimoire.ml.feature.conjure.IDFTrain --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputCol":"tf", "OutputCol":"tfidf"}' '{"OutputModelTarget":"hdfs:///user/model/idf2","Overwrite":true}'
- TFIDFVectorize
3.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
OutputCol | string | Param for output column name. | output | true |
3.2 Test case
feature.TFIDFVectorize
3.3 Example submit command
spark-submit --class grimoire.ml.feature.conjure.TFIDFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputModelSource":"hdfs:///user/model/idf2"}' '{"OutputCol":"tfidf"}' '{"OutputHiveTable":"TFIDFVectorize"}'
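IDFTrain fits the weights and TFIDFVectorize applies them. The following sketch shows both steps in plain Python, assuming the components delegate to Spark ML's IDF, which uses the smoothed formula log((m + 1) / (df + 1)) and assigns weight 0 to terms that appear in fewer than `MinDocFreq` documents. The function names are hypothetical.

```python
import math


def fit_idf(tf_vectors, min_doc_freq=0):
    """Compute smoothed IDF weights: log((m + 1) / (df + 1))."""
    num_docs = len(tf_vectors)
    size = len(tf_vectors[0])
    weights = []
    for j in range(size):
        doc_freq = sum(1 for vec in tf_vectors if vec[j] > 0)
        if doc_freq < min_doc_freq:
            weights.append(0.0)  # term filtered out by MinDocFreq
        else:
            weights.append(math.log((num_docs + 1) / (doc_freq + 1)))
    return weights


def apply_idf(tf_vector, idf_weights):
    """Scale each term frequency by its IDF weight."""
    return [tf * w for tf, w in zip(tf_vector, idf_weights)]


docs = [[1.0, 0.0, 2.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
idf = fit_idf(docs, min_doc_freq=1)
print(idf)
print(apply_idf(docs[0], idf))
```

A term that occurs in every document (the first column here) receives weight 0, so it contributes nothing to the TF-IDF vector.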
- CountVectorizer
4.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
VocabSize | int | Max size of the vocabulary. | 262144 | false |
MinDF | double | Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer greater than or equal to 1, this specifies the number of documents the term must appear in; if this is a double in [0,1), then this specifies the fraction of documents. | 1 | false |
Binary | boolean | Binary toggle to control term frequency counts. If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. | false | false |
4.2 Test case
feature.CountVectorizer
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.CountVectorizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"countvectorizer"}' '{"InputCol":"words","OutputCol":"features","VocabSize":3,"MinDF":2}' '{"OutputHiveTable":"CountVectorzerConjure"}'
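The interplay of `VocabSize`, `MinDF`, and `Binary` can be illustrated with a small plain-Python sketch. It assumes the semantics described in the table: `MinDF` is a document count when it is at least 1 and a fraction of documents otherwise, and the vocabulary keeps the most frequent terms. The function names and the alphabetical tie-breaking are assumptions made for this illustration.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Pick up to vocab_size terms that appear in enough documents."""
    # min_df >= 1 is a document count; min_df in [0, 1) is a fraction of documents.
    min_docs = min_df if min_df >= 1.0 else min_df * len(docs)
    term_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    eligible = [(t, c) for t, c in term_counts.items() if doc_counts[t] >= min_docs]
    # Most frequent terms first; ties are broken alphabetically in this sketch.
    eligible.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in eligible[:vocab_size]]


def count_vector(doc, vocabulary, binary=False):
    """Count vocabulary terms in one document; binary clamps counts to 1.0."""
    counts = Counter(doc)
    if binary:
        return [1.0 if counts[t] else 0.0 for t in vocabulary]
    return [float(counts[t]) for t in vocabulary]


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a", "d"]]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
print(vocabulary)
print(count_vector(docs[1], vocabulary))
```

The term `d` occurs in only one document, so `min_df=2` removes it before the vocabulary size limit is applied.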
Feature transformation
- RegexTokenizer
1.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
Pattern | String | Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. | \s+ | false |
MinTokenLength | int | Minimum token length, greater than or equal to 0, to avoid returning empty strings | 1 | false |
ToLowercase | boolean | Indicates whether to convert all characters to lowercase before tokenizing. | true | false |
1.2 Test case
feature.RegexTokenizer
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.RegexTokenizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"RegexTokenizer"}' '{"InputCol":"sentence","OutputCol":"words","Pattern":"\\W"}' '{"OutputHiveTable":"RegexTokenizerConjure"}'
- NGram
2.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
N | int | Minimum n-gram length, greater than or equal to 1. | 2 | true |
2.2 Example
feature.NGram
2.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.NGram --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"NGram"}' '{"InputCol":"words","OutputCol":"ngrams","N":2}' '{"OutputHiveTable":"NGramConjure"}'
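As an illustration of what `N` controls, the sketch below builds n-grams from a token list in plain Python. It assumes the component follows Spark ML's NGram, which joins each window of `N` consecutive tokens with a space and returns an empty list when the input has fewer than `N` tokens. The function name `ngrams` is hypothetical.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a single space."""
    if n < 1:
        raise ValueError("n must be greater than or equal to 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = ["Hi", "I", "heard", "about", "Spark"]
bigrams = ngrams(words, n=2)
print(bigrams)
```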
- Binarizer
3.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
Threshold | Double | Param for threshold used to binarize continuous features. | 0.5 | false |
3.2 Example
feature.Binarizer
3.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Binarizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Binarizer"}' '{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}' '{"OutputHiveTable":"BinarizerConjure"}'
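The effect of `Threshold` can be sketched in a few lines of Python. The sketch assumes the Spark ML Binarizer convention, where values strictly greater than the threshold become 1.0 and all others become 0.0. The function name `binarize` is hypothetical.

```python
def binarize(values, threshold=0.5):
    """Map values above the threshold to 1.0 and everything else to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


binarized = binarize([0.1, 0.8, 0.5], threshold=0.5)
print(binarized)
```

Note that a value equal to the threshold maps to 0.0 under this convention.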
- PCA
4.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
K | Int | The number of principal components to keep. | 4 | true |
4.2 Test case
feature.PCA
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.PCA --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"PCA"}' '{"InputCol":"features","OutputCol":"pcaFeatures","K":3}' '{"OutputHiveTable":"PCAConjure"}'
- DCT
5.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
Inverse | Boolean | Indicates whether to perform the inverse DCT (true) or the forward DCT (false). | null | true |
5.2 Test case
feature.DCT
5.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.DCT --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DCT"}' '{"InputCol":"features","OutputCol":"featuresDCT","Inverse":false}' '{"OutputHiveTable":"DCTConjure"}'
- StringIndexer
6.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
HandleInvalid | string | Param for how to handle invalid entries. Options are skip (which will filter out rows with bad values), or error (which will throw an error). More options may be added later. | error | false |
6.2 Test case
feature.StringIndexer
6.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StringIndexer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StringIndexer"}' '{"InputCol":"category","OutputCol":"categoryIndex"}' '{"OutputHiveTable":"StringIndexerConjure"}'
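To make the role of `HandleInvalid` concrete, here is a plain-Python sketch of label indexing. It assumes the component follows Spark ML's StringIndexer, which assigns index 0 to the most frequent label. The function names are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Assign index 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; treat that as an assumption.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def apply_string_indexer(labels, mapping, handle_invalid="error"):
    """Index labels; unseen labels are dropped ("skip") or rejected ("error")."""
    indexed = []
    for label in labels:
        if label in mapping:
            indexed.append(mapping[label])
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"Unseen label: {label}")
    return indexed


mapping = fit_string_indexer(["a", "b", "c", "a", "a", "c"])
print(mapping)
print(apply_string_indexer(["a", "d", "c"], mapping, handle_invalid="skip"))
```

With `handle_invalid="error"` the unseen label `d` would raise instead of being dropped.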
- IndexToString
7.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
7.2 Test case
feature.IndexToString
7.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.IndexToString --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"IndexToString"}' '{"InputCol":"categoryIndex","OutputCol":"originalCategory"}' '{"OutputHiveTable":"IndexToStringConjure"}'
- OneHotEncoder
8.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
DropLast | Boolean | Whether to drop the last category in the encoded vector | true | false |
8.2 Test case
feature.OneHotEncoder
8.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.OneHotEncoder --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"OneHotEncoder"}' '{"InputCol":"categoryIndex","OutputCol":"categoryVec","DropLast":false}' '{"OutputHiveTable":"OneHotEncoderConjure"}'
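The `DropLast` flag changes the length of the encoded vector. The plain-Python sketch below shows the difference, assuming the Spark ML OneHotEncoder convention in which the last category is encoded as an all-zero vector when it is dropped. The function name `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a one-hot vector.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(2, 3, drop_last=True))
print(one_hot(2, 3, drop_last=False))
```

Dropping the last category keeps the encoded columns linearly independent, which matters for models with an intercept term.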
- VectorAssembler
9.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCols | seq[string] | Param for input column names. | null | true |
OutputCol | string | Param for output column name. | output | true |
9.2 Test case
feature.VectorAssembler
9.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.VectorAssembler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"VectorAssembler"}' '{"InputCols":["hour", "mobile", "userFeatures"],"OutputCol":"features"}' '{"OutputHiveTable":"VectorAssemblerConjure"}'
- Normalizer
10.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
P | double | Normalization in Lp space. | null | true |
10.2 Test case
feature.Normalizer
10.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Normalizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Normalizer"}' '{"InputCol":"features","OutputCol":"normFeatures","P":1}' '{"OutputHiveTable":"NormalizerConjure"}'
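The `P` parameter selects the norm used to scale each vector to unit length. A minimal plain-Python sketch, assuming the usual L^p definition with `p >= 1` and the maximum norm for `p = inf`; the function name `normalize` is hypothetical.

```python
import math


def normalize(vector, p=2.0):
    """Scale a vector so that its L^p norm equals 1."""
    if p < 1.0:
        raise ValueError("p must be greater than or equal to 1")
    if math.isinf(p):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]


print(normalize([1.0, 0.5, -1.0], p=1))
```

With `P` set to 1, as in the command above, the absolute values of each output vector sum to 1.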
- StandardScaler
11.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
WithStd | bool | Whether to scale the data to unit standard deviation. | false | false |
WithMean | bool | Whether to center the data with mean before scaling. | true | false |
11.2 Test case
feature.StandardScaler
11.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StandardScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StandardScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures","WithStd":true}' '{"OutputHiveTable":"StandardScalerConjure"}'
- MinMaxScaler
12.1 Parameter configuration
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
12.2 Test case
feature.MinMaxScaler
12.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.MinMaxScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"MinMaxScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures"}' '{"OutputHiveTable":"MinMaxScalerConjure"}'
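The table lists no range parameters, so the sketch below assumes the Spark ML MinMaxScaler defaults, which rescale each feature to [0, 1]. It works on a single feature column in plain Python; the function name `min_max_scale` and the `lo`/`hi` arguments are hypothetical.

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Linearly rescale one feature column to the range [lo, hi]."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # A constant column maps to the midpoint of the target range.
        return [0.5 * (hi + lo)] * len(column)
    return [(x - c_min) / (c_max - c_min) * (hi - lo) + lo for x in column]


print(min_max_scale([1.0, 2.0, 3.0]))
```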
- Bucketizer
13.1 Parameter configuration
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCol | string | Param for input column name. | null | true |
OutputCol | string | Param for output column name. | output | true |
Splits | seq[double] | Parameter for mapping continuous features into buckets. | null | true |
13.2 Test case
feature.Bucketizer
13.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Bucketizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Bucketizer"}' '{"InputCol":"features","OutputCol":"bucketedFeatures","Splits":[-1000, -0.5, 0.0, 0.5, 1000]}' '{"OutputHiveTable":"BucketizerConjure"}'
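How `Splits` maps a value to a bucket can be sketched with `bisect`. The sketch assumes the Spark ML Bucketizer convention: n + 1 split points define n buckets, each bucket covers [x, y), and the last bucket also includes its upper bound. The function name `bucketize` is hypothetical.

```python
import bisect


def bucketize(value, splits):
    """Return the bucket index of value for strictly increasing split points."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the split range")
    if value == splits[-1]:
        return float(len(splits) - 2)  # the last bucket includes its upper bound
    return float(bisect.bisect_right(splits, value) - 1)


splits = [-1000, -0.5, 0.0, 0.5, 1000]
buckets = [bucketize(v, splits) for v in [-999.9, -0.5, -0.3, 0.0, 0.2, 1000]]
print(buckets)
```

Using very large outer split points, as in the command above, is a simple way to make sure that every input value falls inside some bucket.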
- RandomSplit
14.1 Parameter configuration
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
RandomRate | double | Random sampling rate for the training set. | null | true |
Trainset | string | Output for the training split of the dataset. | trainset | true |
TestSet | string | Output for the test split of the dataset. | testset | true |
14.2 Test case
feature.RandomSplit
14.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.RandomSplit --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"RandomRate":0.5}' '{"OutputTrainingSetTable":"trainSet"}' '{"OutputTestSetTable":"testSet"}'
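As an illustration of `RandomRate`, the sketch below splits rows by drawing one random number per row. This assumes per-row sampling, which is how Spark's `randomSplit` works, so the resulting sizes only approximate the requested rate. The function name `random_split` and the `seed` argument are hypothetical.

```python
import random


def random_split(rows, rate, seed=42):
    """Send each row to the training set with probability rate, else to the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < rate else test).append(row)
    return train, test


train, test = random_split(list(range(100)), rate=0.5, seed=7)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when the training and test tables are regenerated.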
Feature selection
- ChiSqSelector
1.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
FeaturesCol | string | Column of features. | features | true |
OutputCol | string | Param for output column name. | output | true |
NumTopFeatures | Int | Number of features that selector will select, ordered by ascending p-value. | 50 | true |
LabelCol | String | Column name of label. | label | true |
1.2 Test case
feature.ChiSqSelector
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.ChiSqSelector --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ChiSqSelector"}' '{"NumTopFeatures":1,"FeaturesCol":"features","LabelCol":"clicked","OutputCol":"chisq"}' '{"OutputHiveTable":"ChiSqSelectorConjure"}'
Classification
- LogisticRegression
1.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
MaxIter | Int | Maximum number of training iterations. | 100 | false |
RegParam | double | Regularization parameter. | 0.0 | false |
ElasticNetParam | double | The ElasticNet mixing parameter α, in range [0, 1]. | 0.0 | true |
Family | String | Binomial logistic regression or multinomial logistic regression. | auto | false |
LabelCol | String | Column name of label. | label | true |
FeaturesCol | String | Column of features. | features | true |
FitIntercept | boolean | Param for whether to fit an intercept term. | true | false |
Standardization | boolean | Param for whether to standardize the training features before fitting the model. | true | false |
Threshold | double | Threshold in binary classification prediction, in range [0, 1]. | 0.5 | false |
Tol | double | Param for the convergence tolerance for iterative algorithms (>= 0). | 1.0E-6 | false |
ProbabilityCol | String | Param for column name for predicted class conditional probabilities. | probability | false |
RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
PredictionCol | String | Param for prediction column name. | prediction | true |
1.2 Test case
classify.LogisticRegression
1.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.LogisticRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8,"LabelCol":"label","FeaturesCol":"features","PredictionCol":"predic"}' '{"OutputModelTarget":"hdfs:///user/model/LogisticRegression","Overwrite":true}'
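`RegParam` and `ElasticNetParam` work together. The sketch below evaluates the elastic net penalty as defined in the Spark ML documentation, RegParam * (α * ||w||_1 + (1 - α) / 2 * ||w||_2^2), and assumes the component passes both values to Spark unchanged. The function name and the sample weights are illustrative only.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return reg_param * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


weights = [1.0, -2.0]
# Same RegParam and ElasticNetParam as the submit command above.
penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)
print(penalty)
```

Setting `ElasticNetParam` to 0 gives a pure L2 (ridge) penalty and setting it to 1 gives a pure L1 (lasso) penalty.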
- LogisticRegressionPre
2.1 Test case
classify.LogisticRegressionPre
2.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.LogisticRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"InputModelSource":"hdfs:///user/model/LogisticRegression"}' '{"OutputHiveTable":"LogisticRegressionTestPre"}'
- DecisionTree
3.1 Parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | true |
FeaturesCol | String | Column of features. | features | true |
RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
Seed | long | Random seed. | 159147643 | false |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). | 10 | false |
Impurity | String | Criterion used for information gain calculation (case-insensitive). | gini | false |
MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
Threshold | double | Limit of calculation | 0.5 | false |
MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
PredictionCol | String | Param for prediction column name. | prediction | true |
ProbabilityCol | String | Param for Column name for predicted class conditional probabilities. | probabilities | false |
3.2 Test case
classify.DecisionTree
3.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.DecisionTree --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTree","Overwrite":true}'
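To show how `Impurity` and `MinInfoGain` relate, the sketch below computes Gini impurity and the information gain of one candidate split from per-class counts. It is a plain-Python illustration of the standard definitions, not the component's own code, and the function names are hypothetical.

```python
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def information_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
    return gini(parent) - weighted


# A perfect split of a balanced two-class node.
gain = information_gain(parent=[5, 5], left=[5, 0], right=[0, 5])
print(gini([5, 5]), gain)
```

A candidate split is only considered when its gain is at least `MinInfoGain`, so raising that value produces smaller trees.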
- DecisionTreePre
4.1 Test case
classify.DecisionTreePre
4.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.DecisionTreePre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/DecisionTree"}' '{"OutputHiveTable":"DecisionTreeTestPre"}'
- GBTC
5.1 Model parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
MaxIter | int | Maximum number of training iterations. (default: 20) | 20 | False |
Impurity | String | Criterion used for information gain calculation (case-insensitive). (default: gini) | gini | False |
MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32) | 32 | False |
MaxDepth | int | Maximum depth of the tree (>= 0). (default: 5) | 5 | False |
MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. (default: 0.0) | 0.0 | False |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10) | 10 | False |
MinInstancesPerNode | int | Minimum number of instances each child must have after split. (default: 1) | 1 | False |
PredictionCol | String | Param for prediction column name. | prediction | True |
Seed | long | Random seed. (default: -1287390502) | -1287390502 | False |
RawPredictionCol | String | Param for raw prediction column name. (default: rawPrediction) | rawPrediction | False |
subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0) | 1.0 | False |
LossType | String | Loss function which GBT tries to minimize. (default: squared) | squared | False |
StepSize | double | Set the initial step size of SGD for the first step. In subsequent steps, the step size will decrease with stepSize/sqrt(t). (default: 0.1) | 0.1 | False |
5.2 Test case
classify.GBTC
5.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.GBTC --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"LossType":"logistic"}' '{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'
- GBTCPre
6.1 Test case
classify.GBTCPre
6.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.GBTCPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/GBTC"}' '{"OutputHiveTable":"GBTCTestPre"}'
- NaiveBayes
7.1 Model parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
WeightCol | String | Param for weight column name. (default: null) | None | False |
ModelType | String | The model type which is a string (case-sensitive). (default: multinomial) | multinomial | False |
Smoothing | double | The smoothing parameter. (default: 1.0) | 1.0 | False |
PredictionCol | String | Param for prediction column name. | prediction | True |
ProbabilityCol | String | Param for Column name for predicted class conditional probabilities. (default: probabilities) | probabilities | False |
RawPredictionCol | String | Param for raw prediction column name. (default: rawPrediction) | rawPrediction | False |
Thresholds | Seq | Param for Thresholds in multi-class classification to adjust the probability of predicting each class. (default: null) | None | False |
7.2 Test case
classify.NaiveBayes
7.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.NaiveBayes --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/NaiveBayes","Overwrite":true}'
- NaiveBayesPre
8.1 Test case
classify.NaiveBayesPre
8.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.NaiveBayesPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/NaiveBayes"}' '{"OutputHiveTable":"NaiveBayesTestPre"}'
- RandomForest
9.1 Model parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
NumTrees | int | Number of trees to train. | 20 | True |
Impurity | String | Criterion used for information gain calculation (case-insensitive). (default: gini) | gini | False |
MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32) | 32 | False |
MaxDepth | int | Maximum depth of the tree (>= 0). (default: 5) | 5 | False |
MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. (default: 0.0) | 0.0 | False |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10) | 10 | False |
MinInstancesPerNode | int | Minimum number of instances each child must have after split. (default: 1) | 1 | False |
PredictionCol | String | Param for prediction column name. | prediction | True |
FeatureSubsetStrategy | String | The number of features to consider for splits at each tree node. (default: auto) | auto | False |
ProbabilityCol | String | Param for Column name for predicted class conditional probabilities. (default: probabilities) | probabilities | False |
Thresholds | Seq | Param for Thresholds in multi-class classification to adjust the probability of predicting each class. (default: null) | None | False |
Seed | long | Random seed. (default: 159147643) | 159147643 | False |
RawPredictionCol | String | Param for raw prediction column name. (default: rawPrediction) | rawPrediction | False |
subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0) | 1.0 | False |
9.2 Test case
classify.RandomForest
9.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.RandomForest --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":10}' '{"OutputModelTarget":"hdfs:///user/model/RandomForest","Overwrite":true}'
- RandomForestPre
10.1 Test case
classify.RandomForestPre
10.2 Submit command
spark-submit --class grimoire.ml.classify.conjure.RandomForestPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"OutputHiveTable":"RandomForestTestPre"}'
Regression
- DecisionTreeRegression
1.1 Model parameters
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
PredictionCol | String | Param for prediction column name. | prediction | True |
varianceCol | String | Param for Column name for the biased sample variance of prediction. (default: null) | None | False |
CacheNodeIds | boolean | Whether to cache node IDs for each instance, which can speed up training of deeper trees. (default: false) | false | False |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10) | 10 | False |
Impurity | String | Criterion used for information gain calculation (case-insensitive). (default: variance) | variance | False |
MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32) | 32 | False |
MaxDepth | int | Maximum depth of the tree (>= 0). (default: 5) | 5 | False |
maxMemoryInMB | int | Maximum memory in MB allocated to histogram aggregation. (default: 256) | 256 | False |
MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. (default: 0.0) | 0.0 | False |
MinInstancesPerNode | int | Minimum number of instances each child must have after split. (default: 1) | 1 | False |
Seed | long | Random seed. (default: 926680331) | 926680331 | False |
1.2 Test case
regression.DecisionTreeRegression
1.3 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTreeRegression","Overwrite":true}'
- DecisionTreeRegressionPre
2.1 Test case
regression.DecisionTreeRegressionPre
2.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/DecisionTreeRegression"}' '{"OutputHiveTable":"DecisionTreeRegressionPre"}'
- LinearRegression
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
MaxIter | int | Maximum number of training iterations. (default: 100) | 100 | False |
RegParam | double | Regularization parameter. (default: 0.0) | 0.0 | False |
ElasticNetParam | double | The ElasticNet mixing parameter α, in range [0, 1]. (default: 0.0) | 0.0 | False |
FitIntercept | boolean | Param for whether to fit an intercept term. (default: true) | true | False |
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
WeightCol | String | Param for weight column name. (default: null) | None | False |
AggregationDepth | int | Suggested depth for treeAggregate (>= 2). (default: 2) | 2 | False |
Standardization | boolean | Param for whether to standardize the training features before fitting the model. (default: true) | true | False |
Solver | String | Param for the solver algorithm for optimization. (default: auto) | auto | False |
Tol | double | Param for the convergence tolerance for iterative algorithms (>= 0). (default: 1.0E-6) | 1.0E-6 | False |
PredictionCol | String | Param for prediction column name. | prediction | True |
3.1 Test case
regression.LinearRegression
3.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.LinearRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8}' '{"OutputModelTarget":"hdfs:///user/model/LinearRegression","Overwrite":true}'
- LinearRegressionPre
4.1 Test case
regression.LinearRegressionPre
4.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.LinearRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/LinearRegression"}' '{"OutputHiveTable":"LinearRegressionPre"}'
- RandomForestRegression
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
FeaturesCol | String | Column of features. | features | True |
PredictionCol | String | Param for prediction column name. | prediction | True |
CacheNodeIds | boolean | Whether to cache node IDs for each instance, which can speed up training of deeper trees. (default: false) | false | False |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10) | 10 | False |
Seed | long | Random seed. (default: 235498149) | 235498149 | False |
Impurity | String | Criterion used for information gain calculation (case-insensitive). (default: variance) | variance | False |
MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. (default: 32) | 32 | False |
MaxDepth | int | Maximum depth of the tree (>= 0). (default: 5) | 5 | False |
maxMemoryInMB | int | Maximum memory in MB allocated to histogram aggregation. (default: 256) | 256 | False |
MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. (default: 0.0) | 0.0 | False |
MinInstancesPerNode | int | Minimum number of instances each child must have after split. (default: 1) | 1 | False |
NumTrees | int | Number of trees to train. | 10 | True |
subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0) | 1.0 | False |
FeatureSubsetStrategy | String | The number of features to consider for splits at each tree node. (default: auto) | auto | False |
5.1 Test case
regression.RandomForestRegression
5.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.RandomForestRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":20}' '{"OutputModelTarget":"hdfs:///user/model/RandomForestRegression","Overwrite":true}'
- RandomForestRegressionPre
6.1 Test case
regression.RandomForestRegressionPre
6.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.RandomForestRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/RandomForestRegression"}' '{"OutputHiveTable":"RandomForestRegressionPre"}'
Clustering
- BisectingKMeans
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
K | int | The number of clusters to infer. Must be > 1. | 4 | True |
Seed | long | Random seed. (default: 566573821) | 566573821 | False |
PredictionCol | String | Param for prediction column name. | prediction | True |
FeaturesCol | String | Column of features. | features | True |
MaxIter | int | Param for maximum number of iterations (>= 0). (default: 20) | 20 | False |
minDivisibleClusterSize | int | Minimum size of a divisible cluster. (default: 1) | 1 | False |
1.1 Test case
clustering.BisectingKMeans
1.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeans --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","K":2}' '{"OutputModelTarget":"hdfs:///user/model/BisectingKMeans","Overwrite":true}'
- BisectingKMeansPre
2.1 Test case
clustering.BisectingKMeansPre
2.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeansPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"InputModelSource":"hdfs:///user/model/BisectingKMeans"}' '{"OutputHiveTable":"BisectingKMeansPre"}'
Collaborative filtering
- ALS
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
MaxIter | int | Maximum number of training iterations. (default: 10) | 10 | False |
RegParam | double | Regularization parameter. (default: 0.1) | 0.1 | False |
UserCol | String | Param for the column name for user ids. | user | True |
ItemCol | String | Param for the column name for item ids. | item | True |
RatingCol | String | Param for the column name for ratings. | rating | True |
PredictionCol | String | Param for prediction column name. | prediction | True |
Alpha | double | Param for the alpha parameter in the implicit preference formulation (nonnegative). (default: 1.0) | 1.0 | False |
CheckpointInterval | int | Param for set checkpoint interval (>= 1) or disable checkpoint (-1). (default: 10) | 10 | False |
FinalStorageLevel | String | Param for the StorageLevel of the ALS model factors. (default: MEMORY_AND_DISK) | MEMORY_AND_DISK | False |
Nonnegative | boolean | Param for whether to apply nonnegativity constraints. (default: false) | false | False |
NumUserBlocks | int | Param for number of user blocks (positive). (default: 10) | 10 | False |
NumItemBlocks | int | Param for number of item blocks (positive). (default: 10) | 10 | False |
Rank | int | Param for rank of the matrix factorization (positive). (default: 10) | 10 | False |
Seed | long | Random seed. (default: 1994790107) | 1994790107 | False |
ImplicitPrefs | boolean | Param to decide whether to use implicit preference. (default: false) | false | False |
1.1 Test case
filtering.ALS
1.2 Submit parameters
spark-submit --class grimoire.ml.filtering.conjure.ALS --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"MaxIter":5,"RegParam":0.01,"UserCol":"user","ItemCol":"product","RatingCol":"rating","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/ALS","Overwrite":true}'
2.1 Test case
filtering.ALSPre
2.2 Submit parameters
spark-submit --class grimoire.ml.filtering.conjure.ALSPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"InputModelSource":"hdfs:///user/model/ALS"}' '{"OutputHiveTable":"ALSPre"}'
Statistics
- Correlations
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCols | Seq | names of input columns (default: null) | None | False |
CorrelationMethod | String | Correlation method: pearson or spearman. (default: pearson) | pearson | False |
1.1 Test case
statistics.Correlations
1.2 Submit parameters
spark-submit --class grimoire.ml.statistics.conjure.Correlations --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"CorrelationMethod":"pearson"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss.txt"}'
- StatisticalSummary
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
InputCols | Seq | Param for input column names. | None | True |
rowLabels | Seq | the labels of rows. (default: List(count, max, min, mean, normL1, normL2, numNonzeros, variance)) | List(count, max, min, mean, normL1, normL2, numNonzeros, variance) | False |
ColLabels | Seq | the labels of columns. | None | True |
transposed | boolean | whether or not is transposed (default: true) | true | False |
numCols | int | the number of columns (default: null) | None | False |
numRows | int | the number of rows (default: null) | None | False |
2.1 Test case
statistics.StatisticalSummary
2.2 Submit parameters
spark-submit --class grimoire.ml.statistics.conjure.StatisticalSummary --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
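StatisticalSummary requires both `InputCols` and `ColLabels`, and the command above passes the same four names to each. A small sketch of building that JSON argument from one list of column names, so the two arrays cannot drift apart, might look like this. The `json_array` helper and the `COLS` and `PARAMS` variables are hypothetical names, and the helper assumes column names contain no quotes or backslashes.

```shell
# Hypothetical helper: turn its arguments into a JSON array of strings.
# Assumes names need no JSON escaping (no quotes or backslashes).
json_array() {
  out=''
  for col in "$@"; do
    # Prepend a comma only when out is already non-empty.
    out="${out:+$out,}\"$col\""
  done
  printf '[%s]' "$out"
}

# Build the column list once and reuse it for both required keys.
COLS=$(json_array f1 f2 f3 f4)
PARAMS="{\"InputCols\":$COLS,\"ColLabels\":$COLS}"

# Print the argument that would be passed as the second JSON parameter.
printf '%s\n' "$PARAMS"
```

The printed string can then be passed, single-quoted or as `"$PARAMS"`, in place of the second JSON argument of the submit command.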
Model evaluation
- BinaryClassificationMetrics
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
PredictionCol | String | Param for prediction column name. | prediction | True |
1.1 Test case
evaluate.BinaryClassificationMetrics
1.2 Submit parameters
spark-submit --class grimoire.ml.evaluate.conjure.BinaryClassificationMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'RandomForest' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
- RegressionMetrics
Parameter | Type | Description | Default | Required |
---|---|---|---|---|
LabelCol | String | Column name of label. | label | True |
PredictionCol | String | Param for prediction column name. | prediction | True |
2.1 Test case
evaluate.RegressionMetrics
2.2 Submit parameters
spark-submit --class grimoire.ml.evaluate.conjure.RegressionMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'LinearRegression' '{"InputHiveTable":"REGRESSION"}' '{"InputModelSource":"hdfs:///user/model/REGRESSIONMOD"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
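Unlike the other jobs, the evaluation commands take five positional arguments after the jar: the model name, the input Hive table, the model source, the metric parameters, and the JSON output target. The sketch below only assembles and prints such a command so the argument order is explicit. `eval_cmd` and `JAR` are hypothetical names, the label and prediction columns are fixed to the table defaults, and `local[4]` is quoted here to keep the shell from treating it as a glob pattern.

```shell
JAR='grimoire-assembly-0.1.0.jar'

# Hypothetical helper: prints (does not run) an evaluation command.
# Arguments: METRICS_CLASS MODEL_NAME HIVE_TABLE MODEL_PATH JSON_TARGET
eval_cmd() {
  if [ "$#" -ne 5 ]; then
    echo "usage: eval_cmd METRICS_CLASS MODEL_NAME HIVE_TABLE MODEL_PATH JSON_TARGET" >&2
    return 1
  fi
  # One %s for the class, then six single-quoted positional arguments.
  printf "spark-submit --class grimoire.ml.evaluate.conjure.%s --master 'local[4]' '%s' '%s' '%s' '%s' '%s' '%s'\n" \
    "$1" "$JAR" "$2" \
    "{\"InputHiveTable\":\"$3\"}" \
    "{\"InputModelSource\":\"$4\"}" \
    '{"LabelCol":"label","PredictionCol":"prediction"}' \
    "{\"OutPutJsonTarget\":\"$5\"}"
}

# Reproduces the RegressionMetrics command shown above.
eval_cmd RegressionMetrics LinearRegression REGRESSION \
  hdfs:///user/model/REGRESSIONMOD hdfs:///user/example/jsonss
```

Calling `eval_cmd` with `BinaryClassificationMetrics` as the first argument would print the matching command for a classifier in the same way.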