PyCaret Tutorial

Documentation: https://pycaret.org/setup/

Module overview:

1. Importing data

# Importing data using pandas
import pandas as pd
data = pd.read_csv('c:/path_to_data/file.csv')
# Loading data from pycaret
from pycaret.datasets import get_data
data = get_data('juice')

2. Importing the modeling modules

# Classification
from pycaret.classification import *
# Regression
from pycaret.regression import *
# Clustering
from pycaret.clustering import *
# Anomaly Detection
from pycaret.anomaly import *
# Natural Language Processing
from pycaret.nlp import *
# Association Rule Mining
from pycaret.arules import *

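3. Function overview

1. Comparing models: the compare_models function
It evaluates every available model with K-fold cross-validation and then compares their performance.
Classification metrics: Accuracy, AUC, Recall, Precision, F1, Kappa
Regression metrics: MAE, MSE, RMSE, R2, RMSLE, MAPE

The output is a table of each model's average score across the K folds. The fold parameter of compare_models sets K (default 10). Results are sorted from highest to lowest by the chosen metric (default: Accuracy for classification, R2 for regression). By default turbo=True, which skips estimators with long training times so that the comparison finishes in reasonable time; set turbo=False to include them. This function is only available in the pycaret.classification and pycaret.regression modules.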

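Example
# Classification
from pycaret.datasets import get_data
from pycaret.classification import *
diabetes = get_data('diabetes')
clf1 = setup(data=diabetes, target='Class variable')
compare_models(fold=10, sort='AUC', turbo=False)
# Regression (run separately: this star import replaces the classification functions of the same name)
from pycaret.regression import *
boston = get_data('boston')
reg1 = setup(data=boston, target='medv')
compare_models()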
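
To make the idea behind compare_models concrete, here is a minimal sketch in plain Python of K-fold scoring followed by ranking. It is not PyCaret's implementation: the two toy "models" and every function name (k_fold_indices, fit_majority, fit_always_one, cross_val_accuracy) are hypothetical and exist only for illustration.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def fit_majority(y_train):
    """Toy model: always predict the most frequent training label."""
    label = max(sorted(set(y_train)), key=y_train.count)
    return lambda: label

def fit_always_one(y_train):
    """Toy model: always predict class 1, ignoring the training data."""
    return lambda: 1

def cross_val_accuracy(fit, y, k):
    """Average accuracy of a toy model over k folds."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        predict = fit([y[i] for i in train_idx])
        hits = [1.0 if y[i] == predict() else 0.0 for i in test_idx]
        fold_scores.append(mean(hits))
    return mean(fold_scores)

y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
models = {"majority": fit_majority, "always_one": fit_always_one}
scores = {name: cross_val_accuracy(fit, y, k=5) for name, fit in models.items()}
# Sort from highest to lowest score, as the comparison table does.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Each model is scored on every fold, the fold scores are averaged, and the models are then sorted by that average.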

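2. Creating a model: the create_model function
It needs only one argument: the model's abbreviation, passed as a string.
For supervised learning (classification and regression), the function returns a table of K-fold cross-validated evaluation metrics together with the trained model object.
For unsupervised learning (clustering, anomaly detection, natural language processing, association rule mining), it returns only the trained model object.
The fold parameter of create_model sets K (default 10), the round parameter sets the number of decimal places in the results, and an ensemble parameter is also available.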

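3. Tuning a model: the tune_model function
Argument: the model abbreviation, the same as for create_model.
Hyperparameter optimization needs an objective function. In supervised learning the objective is tied automatically to the target variable; in unsupervised learning, pycaret lets you define it yourself through the supervised_target parameter of tune_model. For supervised learning, the function returns a table of K-fold cross-validated evaluation metrics and the trained model object; for unsupervised learning it returns only the model object.
The fold parameter sets the number of cross-validation folds and the round parameter sets the number of decimal places in the output. tune_model performs a random grid search over a predefined search space, so its result depends on the number of iterations, which is set with the n_iter parameter. Increasing n_iter may lengthen training time but usually yields a better-optimized model.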
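
The effect of n_iter can be illustrated with a minimal random search in plain Python. This is only a sketch under stated assumptions: the search space, the toy objective and the function names (random_search, toy_objective) are hypothetical, and PyCaret's own search spaces and scoring are not reproduced here.

```python
import random

def random_search(objective, search_space, n_iter, seed=0):
    """Sample n_iter random parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical predefined search space.
search_space = {
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.5],
}

def toy_objective(params):
    """Toy score that peaks (at 0) for max_depth=5 and learning_rate=0.1."""
    return -abs(params["max_depth"] - 5) - 10 * abs(params["learning_rate"] - 0.1)

_, score_few = random_search(toy_objective, search_space, n_iter=3)
_, score_many = random_search(toy_objective, search_space, n_iter=50)
print(score_few, score_many)
```

With the same seed, the 50-iteration run tries everything the 3-iteration run tried and more, so its best score can only be equal or better, at the price of more evaluations.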

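4. Ensembling, part 1: the ensemble_model function
It takes a single mandatory argument: a trained model object.
The function returns a table of K-fold cross-validated evaluation metrics and the trained model object.
The fold parameter sets the number of cross-validation folds and the round parameter sets the number of decimal places in the output. The method parameter selects one of two ensembling methods. Both resample the dataset and fit multiple learners, so n_estimators controls the number of base learners.
This function applies only to classification and regression tasks.
The two methods:
Bagging: aims to improve stability and accuracy; it also reduces variance, which helps avoid overfitting. It is usually applied to decision-tree methods but can be used with any type of method.
Boosting: aims to reduce bias and variance; boosting algorithms turn weak learners (only slightly better than random guessing) into a strong learner.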
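
As an illustration of the bagging idea (bootstrap resampling plus voting), here is a minimal plain-Python sketch. It assumes a toy 1-nearest-neighbour base learner on one-dimensional data; all names (bootstrap_sample, fit_1nn, bagging_fit, bagging_predict) are hypothetical and this is not how PyCaret implements ensemble_model.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_1nn(X, y):
    """Toy base learner: 1-nearest-neighbour on a single numeric feature."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda i: abs(X[i] - x))
        return y[nearest]
    return predict

def bagging_fit(X, y, n_estimators=10, seed=0):
    """Fit one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_1nn(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]

def bagging_predict(estimators, x):
    """Majority vote over the base learners."""
    votes = Counter(est(x) for est in estimators)
    return votes.most_common(1)[0][0]

X = [0.1, 0.2, 0.3, 0.4, 0.5, 2.1, 2.2, 2.3, 2.4, 2.5]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
estimators = bagging_fit(X, y, n_estimators=25)
print(bagging_predict(estimators, 0.0), bagging_predict(estimators, 3.0))
```

Here n_estimators plays the same role as described above: it is the number of resampled base learners whose votes are combined.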

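5. Ensembling, part 2: the blend_models function
The function returns a table of K-fold cross-validated evaluation metrics and the trained model object.
Blending uses the consensus among estimators to make the final decision. The main idea is to combine different machine learning algorithms and use a majority vote or the average predicted probability as the final prediction. The estimator_list parameter specifies which models to combine; if no list is passed, all models in the model library are used. The method parameter can be set to soft or hard: soft averages the predicted probabilities, hard votes on the predicted class labels.
The fold parameter sets the number of cross-validation folds and the round parameter sets the number of decimal places in the output.
This function applies only to classification and regression tasks.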
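
The difference between hard and soft voting is easiest to see on a small example. The sketch below is plain Python with hypothetical function names (hard_vote, soft_vote); it assumes a binary problem where each estimator outputs the probability of class 1.

```python
from collections import Counter

def hard_vote(probabilities, threshold=0.5):
    """Each estimator casts a class-label vote; the majority wins."""
    labels = [1 if p >= threshold else 0 for p in probabilities]
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities, threshold=0.5):
    """Average the predicted probabilities, then apply the threshold once."""
    average = sum(probabilities) / len(probabilities)
    return 1 if average >= threshold else 0

# Probability of class 1 from three hypothetical estimators.
probs = [0.45, 0.45, 0.9]
print(hard_vote(probs), soft_vote(probs))
```

Two of the three estimators vote for class 0, so hard voting returns 0, while the one confident estimator pulls the average probability above 0.5, so soft voting returns 1.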

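6. Ensembling, part 3: the stack_models function
The function returns a table of K-fold cross-validated evaluation metrics and the trained model object.
Stacking is a multi-layer, multi-model ensembling method. Each layer can contain several models, and each layer learns from the outputs of the models in the layer before it.
The estimator_list parameter specifies which models to combine, and meta_model passes the meta model (Logistic Regression by default). The method parameter can be set to soft or hard: soft uses the predicted probabilities, hard uses the predicted class labels.
The fold parameter sets the number of cross-validation folds and the round parameter sets the number of decimal places in the output. The restack parameter (default True) controls whether the raw data is exposed to the meta model; if it is set to False, only the base learners' predictions are used to generate the final prediction.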
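
What restack changes is the input the meta model sees. A minimal sketch in plain Python, assuming toy base models that are simple functions of a row (build_meta_features, model_a and model_b are hypothetical names, not PyCaret functions):

```python
def build_meta_features(X, base_models, restack=True):
    """Build the meta model's inputs from base-model predictions.

    With restack=True the raw features are kept next to the predictions;
    with restack=False only the predictions are passed on.
    """
    meta = []
    for row in X:
        predictions = [model(row) for model in base_models]
        meta.append(list(row) + predictions if restack else predictions)
    return meta

# Two toy base models standing in for trained estimators.
def model_a(row):
    return row[0] + row[1]

def model_b(row):
    return row[0] * row[1]

X = [[1.0, 2.0], [3.0, 4.0]]
print(build_meta_features(X, [model_a, model_b], restack=True))
print(build_meta_features(X, [model_a, model_b], restack=False))
```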

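7. Ensembling, part 4: the create_stacknet function
Multi-layer stacked models.
The predictions of each layer are passed as input to the next layer until the meta model is reached. The estimator_list parameter takes the learners for each layer, as a list of nested lists.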

8. Plotting a model: the plot_model function

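9. Interpreting a model: the interpret_model function
Interpreting complex models is essential; analyzing which variables the model considers important helps with debugging it.
Arguments: the trained model object and the type of plot. Interpretability is based on the SHAP library, so it only supports tree-based models. The plot parameter specifies the plot type.
This function applies only to classification and regression tasks.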

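10. Assigning labels: the assign_model function
Argument: the trained model object.
In unsupervised learning (clustering, anomaly detection, or natural language processing), you are usually interested in the labels the model generates. In clustering, the cluster a data point belongs to is a label; in anomaly detection, whether an observation is an outlier is a binary label; in natural language processing, the topic a document belongs to is a label.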

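11. Calibrating a model: the calibrate_model function
Argument: the trained model object.
The function returns a table of K-fold cross-validated evaluation metrics and the trained model object.
This function applies only to classification tasks.
In classification you often need not only the predicted class label but also the predicted probability, which gives you a measure of confidence. Some models give poor probability estimates for the classes; with a well-calibrated classifier, the output probability can be read directly as a confidence level.
Options for the method parameter:
(1). 'sigmoid': Platt scaling, a parametric approach
Reference: https://blog.csdn.net/wangxiao7474/article/details/81067436
(2). 'isotonic': isotonic regression, a non-parametric approach
Not recommended when there are too few calibration samples (<1000).
Reference: https://blog.csdn.net/wangxiao7474/article/details/81069815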
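
To show what the isotonic option does conceptually, here is a minimal plain-Python sketch of the pool-adjacent-violators algorithm, the standard way to fit an isotonic (non-decreasing) regression. The function name isotonic_fit and the toy data are hypothetical, and this is not the code PyCaret or scikit-learn runs internally.

```python
def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to values."""
    blocks = []  # each block is [sum_of_values, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# True labels of five samples, ordered by increasing uncalibrated model score.
labels_sorted_by_score = [0, 1, 0, 1, 1]
calibrated = isotonic_fit(labels_sorted_by_score)
print(calibrated)
```

The fitted values can be read as calibrated probabilities for each score position: the out-of-order pair (1, 0) in the middle is pooled into a shared probability of 0.5.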

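12. Optimizing the threshold: the optimize_threshold function
Arguments: the trained model object and a loss function (expressed in terms of true positives, true negatives, false positives and false negatives).
Returns: an interactive plot with the candidate thresholds on the x-axis and the loss on the y-axis; a vertical line marks the classifier's optimal threshold. The probability_threshold parameter of predict_model can then be used to set the probability threshold, which is normally 0.5.
In classification, the cost of a false positive is almost never the same as the cost of a false negative. For example, classifying a malignant tumor as benign is far more costly than classifying a benign one as malignant.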
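
The threshold sweep can be sketched in a few lines of plain Python. The sketch assumes that each outcome type carries a non-negative cost and that the best threshold is the one with the lowest total cost; total_cost and best_threshold are hypothetical names, and the sign convention PyCaret itself uses for these values is not reproduced here.

```python
def total_cost(y_true, y_prob, threshold,
               true_positive=0, true_negative=0, false_positive=0, false_negative=0):
    """Sum the cost of every prediction at the given probability threshold."""
    cost = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            cost += true_positive
        elif predicted == 0 and actual == 0:
            cost += true_negative
        elif predicted == 1 and actual == 0:
            cost += false_positive
        else:
            cost += false_negative
    return cost

def best_threshold(y_true, y_prob, **costs):
    """Try thresholds 0.00, 0.01, ..., 1.00 and return the cheapest (first on ties)."""
    candidates = [t / 100 for t in range(101)]
    return min(candidates, key=lambda t: total_cost(y_true, y_prob, t, **costs))

y_true = [0, 0, 1, 1]
y_prob = [0.2, 0.6, 0.4, 0.9]
# A missed positive is assumed to be ten times as costly as a false alarm.
best = best_threshold(y_true, y_prob, false_positive=1, false_negative=10)
print(best, total_cost(y_true, y_prob, best, false_positive=1, false_negative=10))
```

Because false negatives are expensive in this example, the optimal threshold drops well below 0.5 so that the positive sample scored 0.4 is still caught.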

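13. Making predictions: the predict_model function
Arguments: the trained model object and the dataset X.
Once a model has been deployed to the cloud or saved locally with save_model, predict_model can be used to make predictions on new data. For classification, labels are created using a threshold of 0.5; if optimize_threshold gave you a different threshold, pass it to predict_model through the probability_threshold parameter. The function can also be used to generate predictions for the training and test sets.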

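14. Finalizing a model: the finalize_model function
Argument: the trained model object.
Returns: the model trained on the entire dataset.
Finalizing the model is the last step of a supervised experiment workflow. When you model with pycaret, the data is first split into a training set and a test set (70:30 by default). All functions use the training set to create, tune or ensemble models, and the test set is used to diagnose overfitting and underfitting. However, once you have used predict_model to predict on the test set and have chosen to deploy the model, you will want to train it one last time on the entire dataset.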

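15. Deploying and saving a model: the deploy_model and save_model functions
Once the final model has been fixed with finalize_model, it can be deployed. There are two ways:
1. save_model: saves the model locally as a pkl file
2. deploy_model: deploys the model to the cloud
Deploying to the cloud requires configuring the environment from the command line:

Generate the following information through IAM in your Amazon console account (you need to apply for it yourself):
AWS Access Key ID
AWS Secret Access Key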
Default Region Name (can be seen under Global settings on your AWS console)
Default output format (must be left blank)

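16. Saving an experiment: the save_experiment function
This function saves the entire environment (including the transformation pipeline, the created models and all intermediate outputs) as a pkl file for later use.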

17. Dataset handling
Train/test split -> train_size: float, default = 0.7 (Parameters in setup)

Sampling -> sampling: bool, default = True; sample_estimator: object, default = None (if None, a linear model is used by default) (Parameters in setup).
When the dataset has more than 25,000 samples, pycaret samples it by default.

18. Data processing

(a). Missing value imputation (Parameters in setup)
numeric_imputation: string, default = 'mean' (or 'median')
categorical_imputation: string, default = 'constant' (fills with not_available; or 'mode')
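
The imputation options can be illustrated with a minimal plain-Python sketch, assuming missing values are represented by None. The helper names (impute_numeric, impute_categorical) are hypothetical; PyCaret applies these rules inside its own pipeline.

```python
from statistics import mean, median, mode

def impute_numeric(values, method="mean"):
    """Fill None with the mean or the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values, method="constant"):
    """Fill None with the constant 'not_available' or with the most frequent level."""
    observed = [v for v in values if v is not None]
    fill = "not_available" if method == "constant" else mode(observed)
    return [fill if v is None else v for v in values]

print(impute_numeric([1.0, None, 3.0]))
print(impute_numeric([1.0, None, 3.0, 20.0], method="median"))
print(impute_categorical(["a", None, "b", "a"]))
print(impute_categorical(["a", None, "b", "a"], method="mode"))
```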

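(b). Changing data types (Parameters in setup)
numeric_features: string, default = None; columns to treat as numeric
categorical_features: string, default = None; columns to treat as categorical
date_features: string, default = None; columns to treat as dates; date features are not used in model training
ignore_features: string, default = None; columns that should not be fed to the model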

(c). One-hot encoding (Parameters in setup)
Categorical variables are one-hot encoded by default.

(d). Ordinal encoding (Parameters in setup)
ordinal_features: dictionary, default = None, e.g. ordinal_features = { 'column_name' : ['low', 'medium', 'high'] }
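
The contrast between one-hot and ordinal encoding can be sketched in plain Python. The function names (one_hot, ordinal_encode) are hypothetical and the column order chosen for one-hot encoding (sorted levels) is an assumption made for this illustration.

```python
def one_hot(values):
    """One 0/1 column per level; levels are sorted to fix the column order."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels

def ordinal_encode(values, ordered_levels):
    """Replace each level by its position in the user-supplied order."""
    rank = {level: position for position, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

rows, levels = one_hot(["red", "blue", "red"])
print(levels, rows)
print(ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"]))
```

Ordinal encoding keeps the order low < medium < high in a single numeric column, whereas one-hot encoding treats the levels as unordered.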

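(e). Cardinal encoding (Parameters in setup)
high_cardinality_features: string, default = None
high_cardinality_method: string, default = 'frequency', which replaces the original values with their frequency distribution;
'clustering' replaces the original values with cluster labels; the number of clusters is chosen using the Calinski-Harabasz and Silhouette criteria. Link explaining these metrics: https://www.jianshu.com/p/841ecdaab847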
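
A minimal sketch of frequency encoding in plain Python, assuming each level is replaced by its raw count in the column (a relative frequency would work the same way). The name frequency_encode is hypothetical.

```python
from collections import Counter

def frequency_encode(values):
    """Replace every level by the number of times it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

print(frequency_encode(["a", "b", "a", "c", "a"]))
```

However many distinct levels the column has, the result is a single numeric column, which is why this suits high-cardinality features.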

(f). Handling unknown categories (Parameters in setup)
handle_unknown_categorical: bool, default = True
unknown_categorical_method: string, default = 'least_frequent' (or 'most_frequent')

19. Data transformation

(a). Normalization (Parameters in setup)
normalize: bool, default = False
normalize_method: string, default = 'zscore' (or 'minmax', 'maxabs', 'robust')
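
Three of the normalization methods can be written out in a few lines of plain Python. This is a sketch with hypothetical function names; it assumes the population standard deviation for zscore and leaves out 'robust', which rescales using the median and the interquartile range.

```python
from statistics import mean, pstdev

def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def maxabs(values):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    scale = max(abs(v) for v in values)
    return [v / scale for v in values]

print(zscore([1.0, 2.0, 3.0]))
print(minmax([1.0, 2.0, 3.0]))
print(maxabs([-4.0, 2.0]))
```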

(b). Feature transformation (Parameters in setup)
transformation: bool, default = False
transformation_method: string, default = 'yeo-johnson' (or 'quantile')
Link: https://blog.csdn.net/qq_38958113/article/details/98051207

(c). Target transformation (Parameters in setup): changes the shape of the target variable's distribution
transform_target: bool, default = False
transform_target_method: string, default = 'box-cox'
Link: https://zhuanlan.zhihu.com/p/38956042

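20. Feature engineering

(a). Feature interaction (Parameters in setup)
feature_interaction: bool, default = False; combines features as a * b; may not be feasible in a high-dimensional feature space
feature_ratio: bool, default = False; combines features as a / b; may not be feasible in a high-dimensional feature space
interaction_threshold: float, default = 0.01; works like polynomial_threshold and is used to compress the sparse matrix of newly created interaction features. Features whose importance, based on Random Forest, AdaBoost and linear correlation, falls within the defined threshold are kept in the dataset; the remaining features are dropped.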

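(b). Polynomial features (Parameters in setup)
Polynomial features are meant to handle non-linear relationships.
polynomial_features: bool, default = False
polynomial_degree: int, default = 2; the degree of the polynomial features, e.g. [a, b] -> [1, a, b, a^2, ab, b^2]
polynomial_threshold: float, default = 0.1; features whose importance falls within this threshold are kept in the dataset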
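
The expansion [a, b] -> [1, a, b, a^2, ab, b^2] can be reproduced with a short plain-Python sketch. The function name polynomial_features is reused here only for illustration and is a standalone helper, not the PyCaret parameter.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree=2):
    """All products of the inputs up to the given degree, including the constant 1."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))  # prod(()) == 1 gives the bias term
    return features

# With a = 2 and b = 3 this yields [1, a, b, a^2, ab, b^2].
print(polynomial_features([2, 3]))
```

The number of generated columns grows quickly with the degree and the number of inputs, which is why a threshold is used afterwards to keep only the important ones.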

(c). Trigonometric features (Parameters in setup)
trigonometry_features: bool, default = False

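(d). Group features (Parameters in setup)
When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, statistics such as the mean, median, variance and standard deviation can be computed from the existing features.
group_features: list or list of list, default = None
group_names: list, default = None; must have the same length as group_features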

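(e). Binning numeric features (Parameters in setup)
Binning discretizes continuous variables. It is effective when a continuous feature has too many unique values or a few extreme values.
bin_numeric_features: list, default = None. Binning uses clustering: values that share a cluster center fall into the same bin. The number of clusters is determined with the 'sturges' rule, which is only optimal for Gaussian data and underestimates the number of bins for large non-Gaussian datasets.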
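
Sturges' rule sets the number of bins to ceil(log2(n)) + 1 for n samples. A minimal sketch (sturges_bins is a hypothetical helper name):

```python
from math import ceil, log2

def sturges_bins(n_samples):
    """Sturges' rule: number of bins = ceil(log2(n)) + 1."""
    return ceil(log2(n_samples)) + 1

for n in (100, 1000, 25000):
    print(n, sturges_bins(n))
```

The bin count grows only logarithmically with the sample size, which is why large non-Gaussian datasets end up with too few bins.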

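(f). Combining rare levels (Parameters in setup)
combine_rare_levels: bool, default = False
When set to True, all levels of a categorical feature whose frequency is below a certain threshold are combined into a single level.
rare_level_threshold: float, default = 0.1; the threshold below which levels are combined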
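
A minimal plain-Python sketch of combining rare levels, assuming the threshold is a relative frequency and that rare levels are replaced by a placeholder label. The function name and the label "rare" are hypothetical.

```python
from collections import Counter

def combine_rare_levels(values, threshold=0.1, rare_label="rare"):
    """Replace levels whose relative frequency is below threshold by one shared label."""
    counts = Counter(values)
    n = len(values)
    return [rare_label if counts[v] / n < threshold else v for v in values]

column = ["a"] * 12 + ["b"] * 6 + ["c", "d"]
combined = combine_rare_levels(column, threshold=0.1)
print(Counter(combined))
```

Levels c and d each make up 5% of the column, so both are merged into the single level "rare".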

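21. Feature selection

(a). Feature importance (Parameters in setup)
Feature selection is the process of choosing the features in the dataset that contribute most to predicting the target variable; its purpose is to prevent overfitting.
feature_selection: bool, default = False; uses several supervised feature selection techniques; feature_selection_threshold controls the size of the selected subset
Common feature importance techniques: Random Forest, AdaBoost, and linear correlation with the target variable. When polynomial features and feature interaction are used, set feature_selection_threshold to a relatively low value.
feature_selection_threshold: float, default = 0.8; a larger value results in a larger feature space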

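(b). Removing multicollinearity (Parameters in setup)
Multicollinearity: https://zhuanlan.zhihu.com/p/72722146
remove_multicollinearity: bool, default = False
multicollinearity_threshold: float, default = 0.9; when the correlation between two features exceeds this threshold, the one less correlated with the target variable is dropped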
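
The rule can be sketched in plain Python with a hand-written Pearson correlation. The helper names (pearson, drop_multicollinear) and the toy columns are hypothetical, and PyCaret's own procedure may differ in detail.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def drop_multicollinear(features, target, threshold=0.9):
    """For each highly correlated pair, drop the feature less correlated with the target."""
    names = list(features)
    kept = list(names)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue
            if abs(pearson(features[first], features[second])) > threshold:
                weaker = min((first, second),
                             key=lambda name: abs(pearson(features[name], target)))
                kept.remove(weaker)
    return kept

features = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],   # almost a multiple of x1
    "x3": [5, 1, 4, 2, 3],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]
print(drop_multicollinear(features, target))
```

x1 and x2 are correlated above 0.9, and x2 is the less correlated with the target, so x2 is dropped while x3 is unaffected.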

(c). Principal component analysis (Parameters in setup)
A dimensionality reduction technique: https://blog.csdn.net/luoluonuoyasuolong/article/details/90711318
pca: bool, default = False
pca_method: string, default = 'linear' (or 'kernel')
pca_components: int/float, default = 0.99; the fraction of information to retain

(d). Removing low-variance features (Parameters in setup)
https://blog.csdn.net/weixin_42394925/article/details/102467872
ignore_low_variance: bool, default = False
A feature is considered low-variance when both of the following conditions hold:
Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times.
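
The two conditions translate directly into code. A minimal plain-Python sketch, assuming both conditions must hold and that a constant column counts as low-variance (is_low_variance is a hypothetical helper name):

```python
from collections import Counter

def is_low_variance(values):
    """Check the unique-value ratio (< 10%) and the dominance ratio (> 20)."""
    counts = Counter(values).most_common()
    if len(counts) < 2:
        return True  # assumption: a constant column is treated as low-variance
    unique_ratio = len(counts) / len(values)
    dominance = counts[0][1] / counts[1][1]
    return unique_ratio < 0.10 and dominance > 20

print(is_low_variance(["a"] * 96 + ["b"] * 4))   # 2% unique, 96 / 4 = 24
print(is_low_variance(["a"] * 60 + ["b"] * 40))  # 2% unique, 60 / 40 = 1.5
```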

22. Feature generation with unsupervised learning

(a). Creating cluster labels (Parameters in setup)
Cluster labels are generated and then used as a new feature.
create_clusters: bool, default = False
cluster_iter: int, default = 20

(b). Removing outliers (Parameters in setup)
remove_outliers: bool, default = False https://www.cnblogs.com/glove/p/7256189.html
outliers_threshold: float, default = 0.05; the percentage of outliers to remove; 0.05 means that 2.5% is removed from each tail of the distribution

4. Module overview: classification

(a). Setting up the environment
setup(data, target, train_size = 0.7, sampling = True, sample_estimator = None, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, high_cardinality_method = 'frequency', numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_outliers = False, outliers_threshold = 0.05, remove_multicollinearity = False, multicollinearity_threshold = 0.9, create_clusters = False, cluster_iter = 20, polynomial_features = False, polynomial_degree = 2, trigonometry_features = False, polynomial_threshold = 0.1, group_features = None, group_names = None, feature_selection = False, feature_selection_threshold = 0.8, feature_interaction = False, feature_ratio = False, interaction_threshold = 0.01, session_id = None, silent=False, profile = False)
Required arguments: data, target

Description:
This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: dataframe {array-like, sparse matrix} and name of the target column. All other parameters are optional.

(b). Comparing models
compare_models(blacklist = None, fold = 10, round = 4, sort = 'Accuracy', turbo = True)

Description:
This function uses all models in the model library and scores them using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default CV = 10 Folds) of all the available models in the model library

(c). Creating a model
create_model(estimator = None, ensemble = False, method = None, fold = 10, round = 4, verbose = True)

Description:
This function creates a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Fold). This function returns a trained model object. setup() function must be called before using create_model().

(d). Tuning a model
tune_model(estimator = None, fold = 10, round = 4, n_iter = 10, optimize = 'Accuracy', ensemble = False, method = None, verbose = True)

Description:
This function tunes the hyperparameters of a model and scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (by default = 10 Folds). This function returns a trained model object.

(e). Ensembling a single model
ensemble_model(estimator, method = 'Bagging', fold = 10, n_estimators = 10, round = 4, verbose = True)

Description:
This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Fold). Model must be created using create_model() or tune_model(). This function returns a trained model object.

(f). Blending multiple models
blend_models(estimator_list = 'All', fold = 10, round = 4, method = 'hard', turbo = True, verbose = True)

Description:
This function creates a Soft Voting / Majority Rule classifier for all the estimators in the model library (excluding the few when turbo is True) or for specific trained estimators passed as a list in estimator_list param. It scores it using Stratified Cross Validation. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default CV = 10 Folds). This function returns a trained model object.

(g). Stacking models (2 layers)
stack_models(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, plot = False, finalize = False, verbose = True)

Description:
This function creates a meta model and scores it using Stratified Cross Validation. The predictions from the base level models as passed in the estimator_list param are used as input features for the meta model. The restacking parameter controls the ability to expose raw features to the meta model when set to True (default = True). The output prints the score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Folds).
This function returns a container which is the list of all models in stacking.

(h). Multi-layer stacking (3 or more layers)
create_stacknet(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, finalize = False, verbose = True)

Description:
This function creates a sequential stack net using cross validated predictions at each layer. The final score grid contains predictions from the meta model using Stratified Cross Validation. Base level models can be passed as estimator_list param, the layers can be organized as a sub list within the estimator_list object. Restacking param controls the ability to expose raw features to meta model.
This function returns a container which is the list of all models in stacking.

(i). Plotting a model
plot_model(estimator = None, plot = 'auc')

Description:
This function takes a trained model object and returns a plot based on the test / hold-out set

(j). Evaluating a model
evaluate_model(estimator): calls plot_model internally

Description:
This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

(k). Interpreting a model
interpret_model(estimator, plot = 'summary', feature = None, observation = None)

Description:
This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms. This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

(l). Calibrating a model
calibrate_model(estimator, method = 'sigmoid', fold=10, round=4, verbose=True)
Description:
This function takes the input of trained estimator and performs probability calibration with sigmoid or isotonic regression. The output prints a score grid that shows Accuracy, AUC, Recall, Precision, F1 and Kappa by fold (default = 10 Fold). The output of the original estimator and the calibrated estimator (created using this function) might not differ much. In order to see the calibration differences, use ‘calibration’ plot in plot_model to see the difference before and after.

(m). Optimizing the threshold
optimize_threshold(estimator, true_positive = 0, true_negative = 0, false_positive = 0, false_negative = 0)
Description:
This function optimizes the probability threshold for a trained model using a custom cost function that can be defined using a combination of True Positives, True Negatives, False Positives (also known as Type I error), and False Negatives (Type II error). This function returns a plot of optimized cost as a function of probability threshold between 0 and 100.

(n). Making predictions
predict_model(estimator, data=None, probability_threshold=None, platform=None, authentication=None)
Description:
This function is used to predict new data using a trained estimator. It accepts an estimator created using one of the functions in pycaret that returns a trained model object or a list of trained model objects created using stack_models() or create_stacknet(). New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

(o). Finalizing a model (training on the entire dataset)
finalize_model(estimator)
Description:
This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

(p). Deploying a model
deploy_model(model, model_name, authentication, platform = 'aws')
Description:
(In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

(r). Saving a model
save_model(model, model_name, verbose=True)
Description:
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

(s). Loading a model
load_model(model_name, platform = None, authentication = None, verbose=True)
Description:
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

(t). Saving an experiment
save_experiment(experiment_name=None)
Description:
This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.

(u). Loading an experiment
load_experiment(experiment_name)
Description:
This function loads a previously saved experiment from the current active directory into the current python environment. Load object must be a pickle file.

5. Module overview: regression

(a). Setting up the environment
setup(data, target, train_size = 0.7, sampling = True, sample_estimator = None, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, high_cardinality_method = 'frequency', numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_outliers = False, outliers_threshold = 0.05, remove_multicollinearity = False, multicollinearity_threshold = 0.9, create_clusters = False, cluster_iter = 20, polynomial_features = False, polynomial_degree = 2, trigonometry_features = False, polynomial_threshold = 0.1, group_features = None, group_names = None, feature_selection = False, feature_selection_threshold = 0.8, feature_interaction = False, feature_ratio = False, interaction_threshold = 0.01, transform_target = False, transform_target_method = 'box-cox', session_id = None, silent=False, profile = False)
Description:
This function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: dataframe {array-like, sparse matrix} and name of the target column. All other parameters are optional.

(b). Comparing models
compare_models(blacklist = None, fold = 10, round = 4, sort = 'R2', turbo = True)

Description:
This function uses all models in the model library and scores them using K-fold Cross Validation. The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 Folds) of all the available models in model library.

(c). Creating a model
create_model(estimator = None, ensemble = False, method = None, fold = 10, round = 4, verbose = True)

Description:
This function creates a model and scores it using K-fold Cross Validation. (default = 10 Fold). The output prints a score grid that shows MAE, MSE, RMSE, RMSLE, R2 and MAPE. This function returns a trained model object. setup() function must be called before using create_model()

(d). Tuning a model
tune_model(estimator = None, fold = 10, round = 4, n_iter = 10, optimize = 'r2', ensemble = False, method = None, verbose = True)

Description:
This function tunes the hyperparameters of a model and scores it using K-fold Cross Validation. The output prints the score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (by default = 10 Folds). This function returns a trained model object.
tune_model() only accepts a string parameter for estimator.

(e). Ensembling a single model
ensemble_model(estimator, method = 'Bagging', fold = 10, n_estimators = 10, round = 4, verbose = True)

Description:
This function ensembles the trained base estimator using the method defined in ‘method’ param (default = ‘Bagging’). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default CV = 10 Folds). Model must be created using create_model() or tune_model(). This function returns a trained model object.

(f). Blending multiple models
blend_models(estimator_list = 'All', fold = 10, round = 4, turbo = True, verbose = True)

Description:
This function creates an ensemble meta-estimator that fits a base regressor on the whole dataset. It then averages the predictions to form a final prediction. By default, this function will use all estimators in the model library (excl. the few estimators when turbo is True) or a specific trained estimator passed as a list in estimator_list param. It scores it using K-fold Cross Validation. The output prints the score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 Fold). This function returns a trained model object.

(g). Stacking models (2 layers)
stack_models(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, plot = False, finalize = False, verbose = True)

Description:
This function creates a meta model and scores it using K-fold Cross Validation. The predictions from the base level models as passed in the estimator_list param are used as input features for the meta model. The restacking parameter controls the ability to expose raw features to the meta model when set to True (default = True). The output prints a score grid that shows MAE, MSE, RMSE, R2, RMSLE and MAPE by fold (default = 10 Folds).
This function returns a container which is the list of all models in stacking.

(h). Multi-layer stacking (3 or more layers)
create_stacknet(estimator_list, meta_model = None, fold = 10, round = 4, method = 'soft', restack = True, finalize = False, verbose = True)

Description:
This function creates a sequential stack net using cross validated predictions at each layer. The final score grid contains predictions from the meta model using K-fold Cross Validation. Base level models can be passed as estimator_list param, the layers can be organized as a sub list within the estimator_list object. Restacking param controls the ability to expose raw features to meta model.
This function returns a container which is the list of all models in stacking.
This will result in the stacking of models in multiple layers. The first layer contains dt and rf, the predictions of which are used by models in the second layer to generate predictions which are then used by the meta model to generate final predictions. By default, the meta model is Linear Regression but can be changed with meta_model param.

(i). Plotting a model
plot_model(estimator = None, plot = 'residuals')

Description:
This function takes a trained model object and returns a plot based on the test / hold-out set

(j). Evaluating a model
evaluate_model(estimator): calls plot_model internally

Description:
This function displays a user interface for all of the available plots for a given estimator. It internally uses the plot_model() function.

(k). Interpreting a model
interpret_model(estimator, plot = 'summary', feature = None, observation = None)

Description:
This function takes a trained model object and returns an interpretation plot based on the test / hold-out set. It only supports tree based algorithms. This function is implemented based on the SHAP (SHapley Additive exPlanations), which is a unified approach to explain the output of any machine learning model. SHAP connects game theory with local explanations.

(l). Making predictions
predict_model(estimator, data=None, probability_threshold=None, platform=None, authentication=None)
Description:
This function is used to predict new data using a trained estimator. It accepts an estimator created using one of the functions in pycaret that returns a trained model object or a list of trained model objects created using stack_models() or create_stacknet(). New unseen data can be passed to the data param as a pandas DataFrame. If data is not passed, the test / hold-out set separated at the time of setup() is used to generate predictions.

(m). Finalizing a model (training on the entire dataset)
finalize_model(estimator)
Description:
This function fits the estimator onto the complete dataset passed during the setup() stage. The purpose of this function is to prepare for final model deployment after experimentation.

(n). Deploying a model
deploy_model(model, model_name, authentication, platform = 'aws')
Description:
(In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

(o). Saving a model
save_model(model, model_name, verbose=True)
Description:
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

(p). Loading a model
load_model(model_name, platform = None, authentication = None, verbose=True)
Description:
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

(q). Saving an experiment
save_experiment(experiment_name=None)
Description:
This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.

(r). Loading an experiment
load_experiment(experiment_name)
Description:
This function loads a previously saved experiment from the current active directory into the current python environment. Load object must be a pickle file.

6. Module overview: clustering

(a). Setting up the environment
setup(data, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_multicollinearity = False, multicollinearity_threshold = 0.9, group_features = None, group_names = None, supervised = False, supervised_target = None, session_id = None, profile = False, verbose=True)
Description:
This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: dataframe {array-like, sparse matrix}.

(b). Creating a model
create_model(model = None, num_clusters = None, verbose=True)

Description:
This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model(). This function returns a trained model object.

(c). Assigning labels
assign_model(model, transformation = False, verbose = True)
Description:
This function assigns each data point in the dataset passed during the setup stage to one of the clusters, using the trained model object passed as the model param. create_model() must be called before using assign_model(). This function returns a pandas DataFrame.

(d). Plotting a model
plot_model(model, plot='cluster', feature = None, label = False)
Description:
This function takes a trained model object and returns a plot on the dataset passed during setup stage. This function internally calls assign_model before generating a plot.

(e). Tuning a model
tune_model(model = None, supervised_target = None, estimator = None, optimize = None, fold = 10)
Description:
This function tunes the num_clusters model parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, supervised estimator is Linear.
This function returns the tuned model object.

(f). Making predictions
predict_model(model, data, platform=None, authentication=None)
Description:
This function is used to predict new data using a trained model. It requires a trained model object created using one of the functions in pycaret that returns a trained model object. New data must be passed to the data param as a pandas DataFrame.

(g). Deploying a model
deploy_model(model, model_name, authentication, platform = 'aws')
Description:
(In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

(h). Saving a model
save_model(model, model_name, verbose=True)
Description:
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

(i). Loading a model
load_model(model_name, platform = None, authentication = None, verbose=True)
Description:
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

(j). Saving an experiment
save_experiment(experiment_name=None)
Description:
This function saves the entire experiment into the current active directory. All outputs using pycaret are internally saved into a binary list which is pickled when save_experiment() is used.

(k). Loading an experiment
load_experiment(experiment_name)
Description:
This function loads a previously saved experiment from the current active directory into the current python environment. Load object must be a pickle file.

7. Module overview: anomaly detection

(a). Setting up the environment
setup(data, categorical_features = None, categorical_imputation = 'constant', ordinal_features = None, high_cardinality_features = None, numeric_features = None, numeric_imputation = 'mean', date_features = None, ignore_features = None, normalize = False, normalize_method = 'zscore', transformation = False, transformation_method = 'yeo-johnson', handle_unknown_categorical = True, unknown_categorical_method = 'least_frequent', pca = False, pca_method = 'linear', pca_components = None, ignore_low_variance = False, combine_rare_levels = False, rare_level_threshold = 0.10, bin_numeric_features = None, remove_multicollinearity = False, multicollinearity_threshold = 0.9, group_features = None, group_names = None, supervised = False, supervised_target = None, session_id = None, profile = False, verbose=True)

Description:
This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: dataframe {array-like, sparse matrix}.

(b). Create model
create_model(model = None, fraction = 0.05, verbose=True)
Description:
This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model(). This function returns a trained model object.

(c). Assign labels
assign_model(model, transformation = False, verbose = True)
Description:
This function flags each data point in the dataset passed during the setup stage as either an outlier or an inlier (1 = outlier, 0 = inlier), using the trained model object passed as the model param. create_model() must be called before using assign_model(). This function returns a data frame with the Outlier flag (1 = outlier, 0 = inlier) and, when score is set to True, the decision score.
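
The relationship between the fraction param of create_model() and the 1/0 Outlier flag can be pictured with a small standalone sketch. It assumes the detector has already produced one decision score per row, with higher meaning more anomalous, and simply flags the top fraction of rows. The `flag_outliers` helper and the sample scores are hypothetical; the detectors used by pycaret derive their own thresholds, so treat this as a simplification rather than the library's implementation.

```python
def flag_outliers(scores, fraction=0.05):
    """Return 1 for the highest-scoring `fraction` of points, 0 otherwise."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Flag at least one point so that small datasets still get an outlier.
    n_outliers = max(1, round(len(scores) * fraction))
    # Row indices ordered from most to least anomalous.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Hypothetical decision scores for ten rows.
scores = [0.1, 0.2, 0.15, 3.5, 0.05, 0.3, 0.25, 0.12, 0.18, 2.9]
labels = flag_outliers(scores, fraction=0.2)
print(labels)
```

With ten rows and fraction=0.2, the two rows with the largest scores receive the flag 1 and the rest receive 0.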

(d). Plot model
plot_model(model, plot='tsne', feature = None)
Description:
This function takes a trained model object and returns a plot on the dataset passed during setup stage. This function internally calls assign_model before generating a plot.

(e). Tune model
tune_model(model = None, supervised_target = None, method='drop', estimator = None, optimize = None, fold = 10)
Description:
This function tunes the fraction parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, the supervised estimator is Linear. This function returns the tuned model object.

(f). Predict
predict_model(model, data, platform=None, authentication=None)
Description:
This function is used to predict new data using a trained model. It requires a trained model object created by one of the functions in pycaret that return a trained model object. New data must be passed to the data param as a pandas DataFrame.

(g). Deploy model
deploy_model(model, model_name, authentication, platform = 'aws')
Description:
(In Preview)
This function deploys the transformation pipeline and trained model object for production use. The platform of deployment can be defined under the platform param along with the applicable authentication tokens which are passed as a dictionary to the authentication param.

(h). Save model
save_model(model, model_name, verbose=True)
Description:
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

(i). Load model
load_model(model_name, platform = None, authentication = None, verbose=True)
Description:
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

(j). Save experiment
save_experiment(experiment_name=None)
Description:
This function saves the entire experiment into the current active directory. All outputs produced with pycaret are stored internally in a binary list, which is pickled when save_experiment() is called.

(k). Load experiment
load_experiment(experiment_name)
Description:
This function loads a previously saved experiment from the current active directory into the current Python environment. The load object must be a pickle file.

8. Module overview: natural language processing

(a). Set up the environment
setup(data, target=None, custom_stopwords=None, session_id = None)
Description:
This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes one mandatory parameter: dataframe {array-like, sparse matrix} or an object of type list. If a dataframe is passed, the target column containing the text must be specified. When the data passed is of type list, no target parameter is required. All other parameters are optional. This module only supports the English language at this time.

(b). Create model
create_model(model=None, multi_core=False, num_topics = None, verbose=True)

Description:
This function creates a model on the dataset passed as a data param during the setup stage. setup() function must be called before using create_model(). This function returns a trained model object.

(c). Assign labels
assign_model(model, verbose = True)
Description:
This function assigns each data point in the dataset passed during the setup stage to one of the topics, using the trained model object passed as the model param. create_model() must be called before using assign_model(). This function returns a data frame with topic weights, the dominant topic and the % of the dominant topic (where applicable).
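
How a "dominant topic" and its percentage follow from per-document topic weights can be sketched in a few lines of plain Python. This is an illustration only: the `dominant_topic` helper and the weights are hypothetical, and the "Topic N" labels are an assumption borrowed from the plot_model description rather than a guaranteed output format of assign_model().

```python
def dominant_topic(weights):
    """Return (topic label, share of the total weight) for one document."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive number")
    # Index of the topic with the largest weight.
    best = max(range(len(weights)), key=lambda i: weights[i])
    return f"Topic {best}", round(weights[best] / total, 2)


# Hypothetical topic weights: one row per document, one column per topic.
doc_topic_weights = [
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.25, 0.50, 0.25],
]

assignments = [dominant_topic(w) for w in doc_topic_weights]
for row in assignments:
    print(row)
```

Each document is assigned the topic that carries the largest weight, and the second value is that topic's share of the document's total weight.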

(d). Plot model
plot_model(model = None, plot = 'frequency', topic_num = None)

Description:
This function takes an optional trained model object and returns a plot based on the inferred dataset, internally calling assign_model before generating the plot. When no model is passed, the plot covers the entire dataset instead of a single topic, so plot_model can be used with or without a model. When a trained model object is passed, the plot is based on the first topic, i.e. 'Topic 0', by default. This can be changed using the topic_num param.

(e). Evaluate model
evaluate_model(model)
Description:
This function displays the user interface for all the available plots for a given model. It internally uses the plot_model() function.

(f). Tune model
tune_model(model=None, multi_core=False, supervised_target=None, estimator=None, optimize=None, auto_fe = True, fold=10)
Description:
This function tunes the num_topics model parameter using a predefined grid with the objective of optimizing a supervised learning metric as defined in the optimize param. You can choose the supervised estimator from a large library available in pycaret. By default, the supervised estimator is Linear. This function returns the tuned model object.

(g). Save model
save_model(model, model_name, verbose=True)
Description:
This function saves the transformation pipeline and trained model object into the current active directory as a pickle file for later use.

(h). Load model
load_model(model_name, platform = None, authentication = None, verbose=True)
Description:
This function loads a previously saved transformation pipeline and model from the current active directory into the current python environment. Load object must be a pickle file.

(i). Save experiment
save_experiment(experiment_name=None)
Description:
This function saves the entire experiment into the current active directory. All outputs produced with pycaret are stored internally in a binary list, which is pickled when save_experiment() is called.

(j). Load experiment
load_experiment(experiment_name)
Description:
This function loads a previously saved experiment from the current active directory into the current Python environment. The load object must be a pickle file.

9. Module overview: association rule mining

(a). Set up the environment
setup(data, transaction_id, item_id, ignore_items = None, session_id = None)
Description:
This function initializes the environment in pycaret. setup() must be called before executing any other function in pycaret. It takes three mandatory parameters: (i) dataframe {array-like, sparse matrix}, (ii) the transaction_id param identifying the basket and (iii) the item_id param used to create rules. These three params are normally found in any transactional dataset. pycaret internally converts the dataframe into a sparse matrix, which is required for association rule mining.
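
The conversion from a transactional table to a basket matrix can be illustrated without pycaret. The sketch below assumes the data arrives as (transaction_id, item_id) pairs, groups them into baskets, drops any ignored items, and builds a 0/1 row per transaction. The helper names and the sample rows are hypothetical, and a dense list is used where a real implementation would use a sparse matrix.

```python
from collections import defaultdict


def build_baskets(rows, ignore_items=None):
    """Group (transaction_id, item_id) rows into one item set per transaction."""
    ignored = set(ignore_items or [])
    baskets = defaultdict(set)
    for transaction_id, item_id in rows:
        if item_id not in ignored:
            baskets[transaction_id].add(item_id)
    return dict(baskets)


def one_hot(baskets):
    """Return the sorted item list and a 0/1 row per transaction."""
    items = sorted({item for basket in baskets.values() for item in basket})
    matrix = {
        tid: [1 if item in basket else 0 for item in items]
        for tid, basket in baskets.items()
    }
    return items, matrix


# Hypothetical transactional data in long format.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "bag"),
    (3, "milk"),
]

baskets = build_baskets(rows, ignore_items=["bag"])
items, matrix = one_hot(baskets)
print(items)
print(matrix)
```

Most entries in such a matrix are zero for realistic catalogues, which is why a sparse representation is the usual choice.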

(b). Create model
create_model(metric='confidence', threshold = 0.5, min_support = 0.05, round = 4)
Description:
This function creates an association rules model using data and identifiers passed at setup stage. This function internally transforms the data for association rule mining. setup() function must be called before using create_model().
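
The roles of min_support, metric='confidence' and threshold can be shown with a hand-rolled sketch restricted to single-item rules of the form A -> B. It is an assumption-laden illustration, not pycaret's algorithm: the `pair_rules` helper and the sample baskets are hypothetical, and real rule miners also handle multi-item antecedents and consequents.

```python
from itertools import permutations


def pair_rules(baskets, min_support=0.05, threshold=0.5, ndigits=4):
    """Return (antecedent, consequent, support, confidence) for A -> B rules."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})

    def support(itemset):
        # Share of baskets that contain every item in `itemset`.
        return sum(1 for basket in baskets if itemset <= basket) / n

    rules = []
    for antecedent, consequent in permutations(items, 2):
        joint = support({antecedent, consequent})
        if joint < min_support:
            continue  # the pair is too rare to produce a rule
        confidence = joint / support({antecedent})
        if confidence >= threshold:
            rules.append(
                (antecedent, consequent, round(joint, ndigits), round(confidence, ndigits))
            )
    return rules


# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rules = pair_rules(baskets, min_support=0.5, threshold=0.6)
for rule in rules:
    print(rule)
```

Support filters out rare item pairs first, and the confidence threshold then keeps only rules where the consequent appears in a large enough share of the baskets containing the antecedent.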

(c). Plot model
plot_model(model, plot='2d')
Description:
This function takes the model dataframe returned by the create_model() function. '2d' and '3d' plots are available.
