sparkMlib_doc_1.0

Model input/output mapping

  1. Input table (Hive) -> model parameters -> output model (HDFS)
    • DecisionTree
    • GBTC
    • LogisticRegression
    • NaiveBayes
    • RandomForest
    • BisectingKMeans
    • IDFTrain
    • ALS
    • DecisionTreeRegression
    • LinearRegression
    • RandomForestRegression

Example:
'{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10}'
'{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'

  2. Input table (Hive) -> input model (HDFS) -> output table (Hive)
    • DecisionTreePre
    • GBTCPre
    • LogisticRegressionPre
    • NaiveBayesPre
    • RandomForestPre
    • BisectingKMeansPre
    • TFIDFVectorize
    • ALSPre
    • DecisionTreeRegressionPre
    • LinearRegressionPre
    • RandomForestRegressionPre

Example:
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"OutputHiveTable":"DecisionTreeTestPre"}'

  3. Input table (Hive) -> model parameters -> output table (Hive)
    • Binarizer
    • Bucketizer
    • ChiSqSelector
    • CountVectorizer
    • DCT
    • TFVectorize
    • IndexToString
    • MinMaxScaler
    • NGram
    • Normalizer
    • OneHotEncoder
    • PCA
    • RegexTokenizer
    • StandardScaler
    • StringIndexer
    • VectorAssembler
    • VectorIndexer

Example:
'{"InputHiveTable":"Binarizer"}'
'{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}'
'{"OutputHiveTable":"Binarizer"}'

  4. Model name -> input table (Hive) -> input model (HDFS) ->
    model parameters -> output result (JSON saved to HDFS)
    • BinaryClassificationMetrics
    • RegressionMetrics

Example:
'DecisionTreeRegression'
'{"InputHiveTable":"DecisionTree"}'
'{"InputModelSource":"hdfs:///user/model/DecisionTreeModel"}'
'{"LabelCol":"label","PredictionCol":"prediction"}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

  5. Input table (Hive) -> model parameters -> output result (saved to HDFS as JSON)
    • StatisticalSummary
    • Correlations

Example:
'{"InputHiveTable":"DecisionTree"}'
'{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}'
'{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

  6. Input table (Hive) -> model parameters
    • RandomSplit
      Note: the training and test sets are saved to Hive by default.

Example:
'{"InputHiveTable":"DecisionTreeRegression"}'
'{"RandomRate":0.5}'
'{"OutputTrainingSetTable":"trainSet"}'
'{"OutputTestSetTable":"testSet"}'
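
Every pattern above comes down to a class name followed by a few single-quoted JSON arguments. The Python sketch below shows one way to assemble such a command line. The helper name `build_submit_command` is hypothetical and not part of grimoire; the jar name and master URL are simply copied from the examples in this document.

```python
import json
import shlex


def build_submit_command(class_name, json_args,
                         jar="grimoire-assembly-0.1.0.jar", master="local[4]"):
    """Assemble a spark-submit command line (hypothetical helper, not part of grimoire).

    json_args is a list of dicts, or plain strings such as the model name
    used by the metrics jobs. Dicts are serialized to compact JSON.
    """
    parts = ["spark-submit", "--class", class_name, "--master", master, jar]
    for arg in json_args:
        # Plain strings pass through unchanged; dicts become compact JSON.
        text = arg if isinstance(arg, str) else json.dumps(arg, separators=(",", ":"))
        parts.append(text)
    # shlex.quote adds single quotes wherever the shell would need them.
    return " ".join(shlex.quote(p) for p in parts)


command = build_submit_command(
    "grimoire.ml.feature.conjure.RandomSplit",
    [
        {"InputHiveTable": "DecisionTreeRegression"},
        {"RandomRate": 0.5},
        {"OutputTrainingSetTable": "trainSet"},
        {"OutputTestSetTable": "testSet"},
    ],
)
print(command)
```

Building the arguments from dictionaries avoids hand-written quoting mistakes such as a missing closing quote around a JSON argument.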

Features

Feature extraction

  1. TFVectorize
    1.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| NumFeatures | int | Number of features. Should be greater than 0. | 20 | false |
| Binary | bool | Binary toggle to control term frequency counts. | false | false |

1.2 Test case
feature.TFVectorize
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.TFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"wfwfwfwf"}' '{"InputCol":"segmented","OutputCol":"tf","NumFeatures":100,"Binary":true}' '{"OutputHiveTable":"test2"}'
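
The NumFeatures and Binary parameters suggest that TFVectorize wraps a hashing-based term-frequency transform; that is an assumption, not something this document states. Under that assumption, a minimal Python sketch of the idea follows. `zlib.crc32` is a stand-in hash picked for the sketch, so the bucket indices will not match what Spark produces.

```python
import zlib


def hashing_tf(tokens, num_features=20, binary=False):
    """Map tokens to a sparse {index: value} term-frequency vector.

    zlib.crc32 is a stand-in hash for this sketch only; the real
    implementation may use a different hash function.
    """
    if num_features <= 0:
        raise ValueError("num_features should be greater than 0")
    vector = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        # With binary=True every non-zero count is clamped to 1.0.
        vector[index] = 1.0 if binary else vector.get(index, 0.0) + 1.0
    return vector


tokens = ["spark", "hive", "spark", "hdfs"]
counts = hashing_tf(tokens, num_features=100)
flags = hashing_tf(tokens, num_features=100, binary=True)
print(counts)
print(flags)
```

A small NumFeatures increases the chance that two different terms share a bucket, which is why the example command raises it from the default.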

  2. IDFTrainSpell
    2.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| MinDocFreq | int | The minimum number of documents in which a term should appear. | 0 | false |

2.2 Test case
feature.IDFTrain
2.3 Submit command example
spark-submit --class grimoire.ml.feature.conjure.IDFTrain --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputCol":"tf", "OutputCol":"tfidf"}' '{"OutputModelTarget":"hdfs:///user/model/idf2","Overwrite":true}'

  3. TFIDFVectorize
    3.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| OutputCol | string | Param for output column name. | output | true |

3.2 Test case
feature.TFIDFVectorize
3.3 Submit command example
spark-submit --class grimoire.ml.feature.conjure.TFIDFVectorize --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"test2"}' '{"InputModelSource":"hdfs:///user/model/idf2"}' '{"OutputCol":"tfidf"}' '{"OutputHiveTable":"TFIDFVectorize"}'

  4. CountVectorizer
    4.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| VocabSize | int | Max size of the vocabulary. | 262144 | false |
| MinDF | double | Specifies the minimum number of different documents a term must appear in to be included in the vocabulary. If this is an integer greater than or equal to 1, it specifies the number of documents the term must appear in; if it is a double in [0,1), it specifies the fraction of documents. | 1 | false |
| Binary | boolean | Binary toggle to control term frequency counts. If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. | false | false |

4.2 Test case
feature.CountVectorizer
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.CountVectorizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"countvectorizer"}' '{"InputCol":"words","OutputCol":"features","VocabSize":3,"MinDF":2}' '{"OutputHiveTable":"CountVectorzerConjure"}'
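
The two readings of MinDF (absolute document count versus fraction of documents) are easy to mix up. The Python sketch below illustrates the vocabulary selection as the parameter descriptions read. The function name `build_vocabulary` is hypothetical, and the alphabetical tie-break is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size=262144, min_df=1.0):
    """Pick vocabulary terms following the MinDF and VocabSize descriptions."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences across the corpus
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    # MinDF >= 1 is an absolute document count; MinDF in [0, 1) is a fraction.
    required = min_df if min_df >= 1 else min_df * len(docs)
    kept = [term for term in term_freq if doc_freq[term] >= required]
    # Most frequent terms first; ties broken alphabetically (sketch assumption).
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept[:vocab_size]


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"], ["a", "d"]]
print(build_vocabulary(docs, vocab_size=3, min_df=2))    # ['a', 'b', 'c']
print(build_vocabulary(docs, vocab_size=3, min_df=0.9))  # ['a']
```

With three documents, `min_df=2` keeps every term seen in at least two documents, while `min_df=0.9` requires a term to appear in at least 2.7 documents, which only `a` satisfies.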

Feature transformation

  1. RegexTokenizer
    1.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| Pattern | String | Regex pattern used to match delimiters if gaps is true or tokens if gaps is false. | \s+ | false |
| MinTokenLength | int | Minimum token length, greater than or equal to 0, to avoid returning empty strings. | 1 | false |
| ToLowercase | boolean | Indicates whether to convert all characters to lowercase before tokenizing. | true | false |

1.2 Test case
feature.RegexTokenizer
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.RegexTokenizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"RegexTokenizer"}' '{"InputCol":"sentence","OutputCol":"words","Pattern":"\\W"}' '{"OutputHiveTable":"RegexTokenizerConjure"}'
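
The Pattern value is written as `"\\W"` in the command because the backslash has to be escaped inside JSON. The Python sketch below shows the decoded pattern and the resulting tokens. It assumes the delimiter-matching mode (gaps is true), since no Gaps parameter is listed above, and the function name `regex_tokenize` is hypothetical.

```python
import json
import re


def regex_tokenize(text, pattern=r"\s+", min_token_length=1, to_lowercase=True):
    """Split text on delimiter matches (the gaps=true behaviour described above)."""
    if to_lowercase:
        text = text.lower()
    # Tokens shorter than min_token_length (such as empty strings) are dropped.
    return [tok for tok in re.split(pattern, text) if len(tok) >= min_token_length]


# In the JSON argument the backslash is escaped, so "\\W" decodes to the regex \W.
params = json.loads(r'{"InputCol":"sentence","OutputCol":"words","Pattern":"\\W"}')
print(regex_tokenize("Logistic,regression,models are NEAT", params["Pattern"]))
# ['logistic', 'regression', 'models', 'are', 'neat']
```

Python's `re` and Java's regex engine differ in some corner cases (for example Unicode handling of `\W`), so treat the output as illustrative for ASCII input.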

  2. NGram
    2.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| N | int | Minimum n-gram length, greater than or equal to 1. | 2 | true |

2.2 Example
feature.NGram
2.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.NGram --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"NGram"}' '{"InputCol":"words","OutputCol":"ngrams","N":2}' '{"OutputHiveTable":"NGramConjure"}'
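
As a quick illustration of what N controls, the Python sketch below builds space-joined n-grams from a list of words. The function name `ngrams` is hypothetical; it mirrors the general n-gram definition rather than grimoire's own code.

```python
def ngrams(words, n=2):
    """Return every run of n consecutive words, joined with a single space."""
    if n < 1:
        raise ValueError("n must be greater than or equal to 1")
    # If there are fewer than n words the range is empty and no n-gram is produced.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams(["Hi", "I", "heard", "about", "Spark"], 2))
# ['Hi I', 'I heard', 'heard about', 'about Spark']
```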

  3. Binarizer
    3.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| Threshold | Double | Param for threshold used to binarize continuous features. | 0.5 | false |

3.2 Example
feature.Binarizer
3.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Binarizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Binarizer"}' '{"InputCol":"feature","OutputCol":"binarized_feature","Threshold":0.5}' '{"OutputHiveTable":"BinarizerConjure"}'
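
The thresholding rule can be sketched in a few lines of Python. This follows Spark's Binarizer convention, where values strictly greater than the threshold become 1.0 and everything else becomes 0.0; that grimoire keeps this convention is an assumption, and `binarize` is a hypothetical name.

```python
def binarize(values, threshold=0.5):
    """Values strictly greater than the threshold map to 1.0, all others to 0.0."""
    return [1.0 if value > threshold else 0.0 for value in values]


print(binarize([0.1, 0.8, 0.5, 0.2]))  # [0.0, 1.0, 0.0, 0.0]
```

Note that a value exactly equal to the threshold lands on the 0.0 side.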

  4. PCA
    4.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| K | Int | The number of principal components. | 4 | true |

4.2 Test case
feature.PCA
4.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.PCA --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"PCA"}' '{"InputCol":"features","OutputCol":"pcaFeatures","K":3}' '{"OutputHiveTable":"PCAConjure"}'

  5. DCT
    5.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| Inverse | Boolean | Indicates whether to perform the inverse DCT (true) or the forward DCT (false). | null | true |

5.2 Test case
feature.DCT
5.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.DCT --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DCT"}' '{"InputCol":"features","OutputCol":"featuresDCT","Inverse":false}' '{"OutputHiveTable":"DCTConjure"}'

  6. StringIndexer
    6.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| HandleInvalid | string | Param for how to handle invalid entries. Options are skip (which filters out rows with bad values) or error (which throws an error). More options may be added later. | error | false |

6.2 Test case
feature.StringIndexer
6.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StringIndexer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StringIndexer"}' '{"InputCol":"category","OutputCol":"categoryIndex"}' '{"OutputHiveTable":"StringIndexerConjure"}'

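  7. IndexToString
    7.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |

7.2 Test case
feature.IndexToString
7.3 Submit command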
spark-submit --class grimoire.ml.feature.conjure.IndexToString --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"IndexToString"}' '{"InputCol":"categoryIndex","OutputCol":"originalCategory"}' '{"OutputHiveTable":"IndexToStringConjure"}'

  8. OneHotEncoder
    8.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| DropLast | Boolean | Whether to drop the last category in the encoded vector. | true | false |

8.2 Test case
feature.OneHotEncoder
8.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.OneHotEncoder --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"OneHotEncoder"}' '{"InputCol":"categoryIndex","OutputCol":"categoryVec","DropLast":false}' '{"OutputHiveTable":"OneHotEncoderConjure"}'
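
The effect of DropLast is easiest to see on a tiny example. The Python sketch below uses dense lists for readability (Spark emits sparse vectors), and the function name `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense 0/1 vector."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in [0, num_categories)")
    # Dropping the last category shortens the vector by one slot.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


print(one_hot(0, 3))                   # [1.0, 0.0]
print(one_hot(2, 3))                   # [0.0, 0.0]  (last category is all zeros)
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```

With DropLast enabled the last category is represented by the all-zero vector, which keeps the encoded columns from summing to a constant.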

  9. VectorAssembler
    9.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCols | seq[string] | Param for input column names. | null | true |
| OutputCol | string | Param for output column name. | output | true |

9.2 Test case
feature.VectorAssembler
9.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.VectorAssembler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"VectorAssembler"}' '{"InputCols":["hour", "mobile", "userFeatures"],"OutputCol":"features"}' '{"OutputHiveTable":"VectorAssemblerConjure"}'

  10. Normalizer
    10.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| P | double | Normalization in Lp space. | null | true |

10.2 Test case
feature.Normalizer
10.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Normalizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Normalizer"}' '{"InputCol":"features","OutputCol":"normFeatures","P":1}' '{"OutputHiveTable":"NormalizerConjure"}'
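
To make "normalization in Lp space" concrete, the Python sketch below divides a vector by its p-norm, using P = 1 as in the command above. The function name `normalize` is hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch.

```python
def normalize(vector, p=2.0):
    """Scale a vector to unit p-norm; an all-zero vector is returned unchanged."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if p == float("inf"):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return list(vector) if norm == 0 else [x / norm for x in vector]


print(normalize([1.0, 0.5, -1.0], p=1))  # [0.4, 0.2, -0.4]
```

Here the L1 norm is 1.0 + 0.5 + 1.0 = 2.5, so each component is divided by 2.5.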

  11. StandardScaler
    11.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| WithStd | bool | Whether to scale the data to unit standard deviation. | false | false |
| WithMean | bool | Whether to center the data with mean before scaling. | true | false |

11.2 Test case
feature.StandardScaler
11.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.StandardScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"StandardScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures","WithStd":true}' '{"OutputHiveTable":"StandardScalerConjure"}'

  12. MinMaxScaler
    12.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |

12.2 Test case
feature.MinMaxScaler
12.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.MinMaxScaler --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"MinMaxScaler"}' '{"InputCol":"features","OutputCol":"scaledFeatures"}' '{"OutputHiveTable":"MinMaxScalerConjure"}'

  13. Bucketizer
    13.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCol | string | Param for input column name. | null | true |
| OutputCol | string | Param for output column name. | output | true |
| Splits | seq[double] | Split points for mapping continuous features into buckets. | null | true |

13.2 Test case
feature.Bucketizer
13.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.Bucketizer --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"Bucketizer"}' '{"InputCol":"features","OutputCol":"bucketedFeatures","Splits":[-1000, -0.5, 0.0, 0.5, 1000]}' '{"OutputHiveTable":"BucketizerConjure"}'
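
The Splits list in the command defines four buckets from five split points. The Python sketch below shows how a value maps to a bucket index, following Spark's Bucketizer convention that each bucket is [lower, upper) except the last, which also includes its upper bound. The function name `bucketize` is hypothetical.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the bucket index for value, given strictly increasing split points."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    if value == splits[-1]:
        # The last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1


splits = [-1000, -0.5, 0.0, 0.5, 1000]
print([bucketize(v, splits) for v in [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]])
# [0, 1, 1, 2, 2, 3]
```

The wide outer bounds (-1000 and 1000) act as catch-all limits so that ordinary values never fall outside the covered range.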

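  14. RandomSplit
    14.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| RandomRate | double | Random split ratio for the training set. | null | true |
| Trainset | string | Training set produced by the split. | trainset | true |
| TestSet | string | Test set produced by the split. | testset | true |

14.2 Test case
feature.RandomSplit
14.3 Submit command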
spark-submit --class grimoire.ml.feature.conjure.RandomSplit --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"RandomRate":0.5}'
'{"OutputTrainingSetTable":"trainSet"}'
'{"OutputTestSetTable":"testSet"}'

Feature selection

  1. ChiSqSelector
    1.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| FeaturesCol | string | Column of features. | features | true |
| OutputCol | string | Param for output column name. | output | true |
| NumTopFeatures | Int | Number of features that the selector will select, ordered by ascending p-value. | 50 | true |
| LabelCol | String | Column name of label. | label | true |

1.2 Test case
feature.ChiSqSelector
1.3 Submit command
spark-submit --class grimoire.ml.feature.conjure.ChiSqSelector --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ChiSqSelector"}' '{"NumTopFeatures":1,"FeaturesCol":"features","LabelCol":"clicked","OutputCol":"chisq"}' '{"OutputHiveTable":"ChiSqSelectorConjure"}'

Classification

  1. LogisticRegression
    1.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| MaxIter | Int | Maximum number of training iterations. | 100 | false |
| RegParam | double | Regularization parameter. | 0.0 | false |
| ElasticNetParam | double | The ElasticNet mixing parameter α, in range [0, 1]. | 0.0 | true |
| Family | String | Binomial or multinomial logistic regression. | auto | false |
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| FitIntercept | boolean | Param for whether to fit an intercept term. | true | false |
| Standardization | boolean | Param for whether to standardize the training features before fitting the model. | true | false |
| Threshold | double | Threshold in binary classification prediction, in range [0, 1]. | 0.5 | false |
| Tol | double | Param for the convergence tolerance for iterative algorithms (>= 0). | 1.0E-6 | false |
| ProbabilityCol | String | Param for column name of predicted class conditional probabilities. | probability | false |
| RawPredictionCol | String | Param for raw prediction column name. | RawPredictionCol | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |

1.2 Test case
classify.LogisticRegression
1.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.LogisticRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8,"LabelCol":"label","FeaturesCol":"features","PredictionCol":"predic"}' '{"OutputModelTarget":"hdfs:///user/model/LogisticRegression","Overwrite":true}'

  2. LogisticRegressionPre
    2.1 Test case
    classify.LogisticRegressionPre
    2.2 Submit command
    spark-submit --class grimoire.ml.classify.conjure.LogisticRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"logistregression_train"}' '{"InputModelSource":"hdfs:///user/model/LogisticRegression"}' '{"OutputHiveTable":"LogisticRegressionTestPre"}'

  3. DecisionTree
    3.1 Parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
| Seed | long | Random seed. | 159147643 | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| Impurity | String | Criterion used for information gain calculation (case-insensitive). | gini | false |
| MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
| MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
| Threshold | double | Threshold used in prediction. | 0.5 | false |
| MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
| MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| ProbabilityCol | String | Param for column name of predicted class conditional probabilities. | probabilities | false |

3.2 Test case
classify.DecisionTree
3.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.DecisionTree --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTree","Overwrite":true}'

  4. DecisionTreePre
    4.1 Test case
    classify.DecisionTreePre
    4.2 Submit command
    spark-submit --class grimoire.ml.classify.conjure.DecisionTreePre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/DecisionTree"}' '{"OutputHiveTable":"DecisionTreeTestPre"}'

  5. GBTC
    5.1 Model parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| MaxIter | int | Maximum number of training iterations. | 20 | false |
| Impurity | String | Criterion used for information gain calculation (case-insensitive). | gini | false |
| MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
| MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
| MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| Seed | long | Random seed. | -1287390502 | false |
| RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
| subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. | 1.0 | false |
| LossType | String | Loss function which GBT tries to minimize. | squared | false |
| StepSize | double | Step size (learning rate) in range (0, 1] for shrinking the contribution of each estimator. | 0.1 | false |

5.2 Test case
classify.GBTC
5.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.GBTC --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"LossType":"logistic"}' '{"OutputModelTarget":"hdfs:///user/model/GBTC","Overwrite":true}'

  6. GBTCPre
    6.1 Test case
    classify.GBTCPre
    6.2 Submit command
    spark-submit --class grimoire.ml.classify.conjure.GBTCPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/GBTC"}' '{"OutputHiveTable":"GBTCTestPre"}'

  7. NaiveBayes
    7.1 Model parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| WeightCol | String | Param for weight column name. | None | false |
| ModelType | String | The model type, given as a case-sensitive string. | multinomial | false |
| Smoothing | double | The smoothing parameter. | 1.0 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| ProbabilityCol | String | Param for column name of predicted class conditional probabilities. | probabilities | false |
| RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
| Thresholds | Seq | Param for thresholds in multi-class classification to adjust the probability of predicting each class. | None | false |

7.2 Test case
classify.NaiveBayes
7.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.NaiveBayes --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/NaiveBayes","Overwrite":true}'

  8. NaiveBayesPre
    8.1 Test case
    classify.NaiveBayesPre
    8.2 Submit command
    spark-submit --class grimoire.ml.classify.conjure.NaiveBayesPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/NaiveBayes"}' '{"OutputHiveTable":"NaiveBayesTestPre"}'

  9. RandomForest
    9.1 Model parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| NumTrees | int | Number of trees to train. | 20 | true |
| Impurity | String | Criterion used for information gain calculation (case-insensitive). | gini | false |
| MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
| MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
| MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| FeatureSubsetStrategy | String | The number of features to consider for splits at each tree node. | auto | false |
| ProbabilityCol | String | Param for column name of predicted class conditional probabilities. | probabilities | false |
| Thresholds | Seq | Param for thresholds in multi-class classification to adjust the probability of predicting each class. | None | false |
| Seed | long | Random seed. | 159147643 | false |
| RawPredictionCol | String | Param for raw prediction column name. | rawPrediction | false |
| subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. | 1.0 | false |

9.2 Test case
classify.RandomForest
9.3 Submit command
spark-submit --class grimoire.ml.classify.conjure.RandomForest --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":10}' '{"OutputModelTarget":"hdfs:///user/model/RandomForest","Overwrite":true}'

  10. RandomForestPre
    10.1 Test case
    classify.RandomForestPre
    10.2 Submit command
    spark-submit --class grimoire.ml.classify.conjure.RandomForestPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"OutputHiveTable":"RandomForestTestPre"}'

Regression

  1. DecisionTreeRegression
    1.1 Model parameters
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| varianceCol | String | Param for column name of the biased sample variance of prediction. | None | false |
| CacheNodeIds | boolean | Whether to cache node IDs for each instance. | false | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| Impurity | String | Criterion used for information gain calculation (case-insensitive). | variance | false |
| MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
| MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
| maxMemoryInMB | int | Maximum memory in MB allocated to histogram aggregation. | 256 | false |
| MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
| MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
| Seed | long | Random seed. | 926680331 | false |

1.2 Test case
regression.DecisionTreeRegression
1.3 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/DecisionTreeRegression","Overwrite":true}'

  2. DecisionTreeRegressionPre
2.1 Test case
regression.DecisionTreeRegressionPre
2.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.DecisionTreeRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/DecisionTreeRegression"}' '{"OutputHiveTable":"DecisionTreeRegressionPre"}'

  3. LinearRegression
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| MaxIter | int | Maximum number of training iterations. | 100 | false |
| RegParam | double | Regularization parameter. | 0.0 | false |
| ElasticNetParam | double | The ElasticNet mixing parameter α, in range [0, 1]. | 0.0 | false |
| FitIntercept | boolean | Param for whether to fit an intercept term. | true | false |
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| WeightCol | String | Param for weight column name. | None | false |
| AggregationDepth | int | Suggested depth for treeAggregate (>= 2). | 2 | false |
| Standardization | boolean | Param for whether to standardize the training features before fitting the model. | true | false |
| Solver | String | Param for the solver algorithm for optimization. | auto | false |
| Tol | double | Param for the convergence tolerance for iterative algorithms (>= 0). | 1.0E-6 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |

3.1 Test case
regression.LinearRegression
3.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.LinearRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","MaxIter":10,"RegParam":0.3,"ElasticNetParam":0.8}' '{"OutputModelTarget":"hdfs:///user/model/LinearRegression","Overwrite":true}'

  4. LinearRegressionPre
    4.1 Test case
    regression.LinearRegressionPre
    4.2 Submit command
    spark-submit --class grimoire.ml.regression.conjure.LinearRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/LinearRegression"}' '{"OutputHiveTable":"LinearRegressionPre"}'

  5. RandomForestRegression
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| FeaturesCol | String | Column of features. | features | true |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| CacheNodeIds | boolean | Whether to cache node IDs for each instance. | false | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| Seed | long | Random seed. | 235498149 | false |
| Impurity | String | Criterion used for information gain calculation (case-insensitive). | variance | false |
| MaxBins | int | Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. | 32 | false |
| MaxDepth | int | Maximum depth of the tree (>= 0). | 5 | false |
| maxMemoryInMB | int | Maximum memory in MB allocated to histogram aggregation. | 256 | false |
| MinInfoGain | double | Minimum information gain for a split to be considered at a tree node. | 0.0 | false |
| MinInstancesPerNode | int | Minimum number of instances each child must have after split. | 1 | false |
| NumTrees | int | Number of trees to train. | 10 | true |
| subsamplingRate | double | Fraction of the training data used for learning each decision tree, in range (0, 1]. | 1.0 | false |
| FeatureSubsetStrategy | String | The number of features to consider for splits at each tree node. | auto | false |

5.1 Test case
regression.RandomForestRegression
5.2 Submit command
spark-submit --class grimoire.ml.regression.conjure.RandomForestRegression --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","NumTrees":20}' '{"OutputModelTarget":"hdfs:///user/model/RandomForestRegression","Overwrite":true}'

  6. RandomForestRegressionPre
    6.1 Test case
    regression.RandomForestRegressionPre
    6.2 Submit command
    spark-submit --class grimoire.ml.regression.conjure.RandomForestRegressionPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"DecisionTreeRegression"}' '{"InputModelSource":"hdfs:///user/model/RandomForestRegression"}' '{"OutputHiveTable":"RandomForestRegressionPre"}'

Clustering

  1. BisectingKMeans
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| K | int | The number of clusters to infer. Must be > 1. | 4 | true |
| Seed | long | Random seed. | 566573821 | false |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| FeaturesCol | String | Column of features. | features | true |
| MaxIter | int | Param for maximum number of iterations (>= 0). | 20 | false |
| minDivisibleClusterSize | int | Minimum size of a divisible cluster. | 1 | false |

1.1 Test case
clustering.BisectingKMeans
1.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeans --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"LabelCol":"label","FeaturesCol":"features","PredictionCol":"prediction","K":2}' '{"OutputModelTarget":"hdfs:///user/model/BisectingKMeans","Overwrite":true}'

  2. BisectingKMeansPre
2.1 Test case
clustering.BisectingKMeansPre
2.2 Submit command
spark-submit --class grimoire.ml.clustering.conjure.BisectingKMeansPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"BisectingKMeans"}' '{"InputModelSource":"hdfs:///user/model/BisectingKMeans"}' '{"OutputHiveTable":"BisectingKMeansPre"}'

Collaborative filtering

  1. ALS
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| MaxIter | int | Maximum number of training iterations. | 10 | false |
| RegParam | double | Regularization parameter. | 0.1 | false |
| UserCol | String | Param for the column name for user ids. | user | true |
| ItemCol | String | Param for the column name for item ids. | item | true |
| RatingCol | String | Param for the column name for ratings. | rating | true |
| PredictionCol | String | Param for prediction column name. | prediction | true |
| Alpha | double | Param for the alpha parameter in the implicit preference formulation (nonnegative). | 1.0 | false |
| CheckpointInterval | int | Param to set the checkpoint interval (>= 1) or disable checkpointing (-1). | 10 | false |
| FinalStorageLevel | String | Storage level for the final ALS model factors. | MEMORY_AND_DISK | false |
| Nonnegative | boolean | Param for whether to apply nonnegativity constraints. | false | false |
| NumUserBlocks | int | Param for number of user blocks (positive). | 10 | false |
| NumItemBlocks | int | Param for number of item blocks (positive). | 10 | false |
| Rank | int | Param for rank of the matrix factorization (positive). | 10 | false |
| Seed | long | Random seed. | 1994790107 | false |
| ImplicitPrefs | boolean | Param to decide whether to use implicit preference. | false | false |

1.1 Test case
filtering.ALS
1.2 Submit command
spark-submit --class grimoire.ml.filtering.conjure.ALS --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"MaxIter":5,"RegParam":0.01,"UserCol":"user","ItemCol":"product","RatingCol":"rating","PredictionCol":"prediction"}' '{"OutputModelTarget":"hdfs:///user/model/ALS","Overwrite":true}'

  2. ALSPre
2.1 Test case
filtering.ALSPre
2.2 Submit command
spark-submit --class grimoire.ml.filtering.conjure.ALSPre --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"ALS"}' '{"InputModelSource":"hdfs:///user/model/ALS"}' '{"OutputHiveTable":"ALSPre"}'

Statistics

  1. Correlations
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCols | Seq | Names of input columns. | None | false |
| CorrelationMethod | String | Correlation method: pearson or spearman. | pearson | false |

1.1 Test case
statistics.Correlations
1.2 Submit command
spark-submit --class grimoire.ml.statistics.conjure.Correlations --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"CorrelationMethod":"pearson"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss.txt"}'
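
For readers unfamiliar with the default method, the Python sketch below computes a Pearson correlation matrix over named columns. It illustrates the statistic only; the layout of the JSON that grimoire writes to HDFS is not described in this document, and the function names and sample data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a {name: values} mapping."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


# Made-up sample values, used only to exercise the functions.
columns = {"f1": [1.0, 2.0, 3.0, 4.0], "f2": [2.0, 4.0, 6.0, 8.0], "f3": [4.0, 3.0, 2.0, 1.0]}
for row in correlation_matrix(columns):
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal; a constant column would make the denominator zero, which this sketch does not guard against.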

  2. StatisticalSummary
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| InputCols | Seq | Param for input column names. | None | true |
| rowLabels | Seq | The labels of rows. | List(count, max, min, mean, normL1, normL2, numNonzeros, variance) | false |
| ColLabels | Seq | The labels of columns. | None | true |
| transposed | boolean | Whether the result is transposed. | true | false |
| numCols | int | The number of columns. | None | false |
| numRows | int | The number of rows. | None | false |

2.1 Test case
statistics.StatisticalSummary
2.2 Submit command
spark-submit --class grimoire.ml.statistics.conjure.StatisticalSummary --master local[4] 'grimoire-assembly-0.1.0.jar' '{"InputHiveTable":"iris"}' '{"InputCols":["f1","f2","f3","f4"],"ColLabels":["f1","f2","f3","f4"]}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

Model evaluation

  1. BinaryClassificationMetrics
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| PredictionCol | String | Param for prediction column name. | prediction | true |

1.1 Test case
evaluate.BinaryClassificationMetrics
1.2 Submit command
spark-submit --class grimoire.ml.evaluate.conjure.BinaryClassificationMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'RandomForest' '{"InputHiveTable":"DecisionTree"}' '{"InputModelSource":"hdfs:///user/model/RandomForest"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'

  2. RegressionMetrics
| Name | Type | Description | Default | Required |
|---|---|---|---|---|
| LabelCol | String | Column name of label. | label | true |
| PredictionCol | String | Param for prediction column name. | prediction | true |

2.1 Test case
evaluate.RegressionMetrics
2.2 Submit command
spark-submit --class grimoire.ml.evaluate.conjure.RegressionMetrics --master local[4] 'grimoire-assembly-0.1.0.jar' 'LinearRegression' '{"InputHiveTable":"REGRESSION"}' '{"InputModelSource":"hdfs:///user/model/REGRESSIONMOD"}' '{"LabelCol":"label","PredictionCol":"prediction"}' '{"OutPutJsonTarget":"hdfs:///user/example/jsonss"}'
