AutoGluon Code

Tutorial source: Multimodal Prediction, AutoGluon Documentation 0.5.2

The documentation also covers:

image classification, multilingual text, multimodal data, CLIP, and more.

Contents

Getting Started: Tabular Prediction

Text Classification

Text Similarity

GPU Settings


Getting Started: Tabular Prediction

# Simple example
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data=train_data)
predictions = predictor.predict(test_data)
score = predictor.evaluate(test_data)

Prefer CSV files, and keep Chinese characters out of them where possible to avoid UTF-8 conversion trouble.
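If non-ASCII text cannot be avoided, one option is to state the encoding explicitly both when writing and when reading the file. The sketch below uses only Python's standard csv module; the file name, columns, and values are made up for illustration. pandas' read_csv also takes an encoding argument, so the same idea should carry over when loading a DataFrame before handing it to AutoGluon.

```python
import csv
import os
import tempfile

# Hypothetical rows containing non-ASCII text (illustration only).
rows = [
    {"city": "北京", "class": "A"},
    {"city": "São Paulo", "class": "B"},
]

path = os.path.join(tempfile.mkdtemp(), "demo_utf8.csv")

# Write with an explicit encoding so the bytes on disk are unambiguous.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["city", "class"])
    writer.writeheader()
    writer.writerows(rows)

# Read back with the same encoding; a mismatch here is the usual cause of garbled text.
with open(path, encoding="utf-8", newline="") as f:
    loaded = [dict(r) for r in csv.DictReader(f)]

print(loaded == rows)
```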

Tabular prediction:

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample for the demo; also handy for quick experiments
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

# Split the dataset
df = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train = df[:500].copy()
test = df[500:].copy()
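Slicing off the first 500 rows only gives a fair split if the file is not sorted in some meaningful way. When that is in doubt, shuffling row positions with a fixed seed first is a common alternative. The helper below is a hypothetical, standard-library sketch of the idea, not part of AutoGluon; with a DataFrame the resulting positions would typically be passed to df.iloc.

```python
import random


def shuffled_split(n_rows, n_train, seed=0):
    """Return (train_idx, test_idx) after shuffling row positions reproducibly."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


# Example sizes are arbitrary; 500 mirrors the training size used above.
train_idx, test_idx = shuffled_split(n_rows=2000, n_train=500)

# With a DataFrame: train = df.iloc[train_idx].copy(); test = df.iloc[test_idx].copy()
print(len(train_idx), len(test_idx))
```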

# Inspect a feature or the label column
label = 'class'
print("Summary of class variable: \n", train_data[label].describe())

# Train
save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

#GPU
predictor = TabularPredictor(label=label).fit(
    train_data,
    ag_args_fit={'num_gpus': 1}
)

# Test set
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating
test_data_nolab.head()

# Predict -> concise results
#predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file
y_pred = predictor.predict(test_data_nolab)
print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

# Overview of all models and their scores (recommended)
predictor.leaderboard(test_data, silent=True)

# Predicted class probabilities
pred_probs = predictor.predict_proba(test_data_nolab)
pred_probs.head(5)
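Class probabilities are useful when the default decision rule is not what you want, for example when the positive class should be flagged at a lower probability than 0.5. The sketch below shows that idea on made-up numbers; the helper name, the dict-per-row layout, and the class names are assumptions for illustration and do not reflect the exact object predict_proba returns.

```python
def labels_from_probs(prob_rows, positive_class, negative_class, threshold=0.5):
    """Pick the positive class when its probability reaches the threshold."""
    return [
        positive_class if row[positive_class] >= threshold else negative_class
        for row in prob_rows
    ]


# Made-up probabilities, one dict per row (class name -> probability).
probs_demo = [
    {"<=50K": 0.9, ">50K": 0.1},
    {"<=50K": 0.6, ">50K": 0.4},
    {"<=50K": 0.2, ">50K": 0.8},
]

default_labels = labels_from_probs(probs_demo, ">50K", "<=50K")
lenient_labels = labels_from_probs(probs_demo, ">50K", "<=50K", threshold=0.3)

print(default_labels)
print(lenient_labels)
```

Lowering the threshold moves the second row into the positive class, which trades precision for recall.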

# Settings for better results
time_limit = 60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # other available metrics are listed below
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(test_data, silent=True)

Options for the presets parameter:

| Preset | Model Quality | Use Cases | Fit Time (Ideal) | Inference Time (Relative to medium_quality) | Disk Usage |
|---|---|---|---|---|---|
| best_quality | State-of-the-art (SOTA), much better than high_quality | When accuracy is what matters | 16x+ | 32x+ | 16x+ |
| high_quality | Better than good_quality | When a very powerful, portable solution with fast inference is required: large-scale batch inference | 16x | 4x | 2x |
| good_quality | Significantly better than medium_quality | When a powerful, highly portable solution with very fast inference is required: billion-scale batch inference, sub-100ms online inference, edge devices | 16x | 2x | 1x |
| medium_quality | Competitive with other top AutoML frameworks | Initial prototyping, establishing a performance baseline | 1x | 1x | 1x |
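One way to read the table is as a budget question: pick the highest-quality preset whose relative inference time is still acceptable. The helper below is a hypothetical sketch of that lookup using the inference-time column above; it is not an AutoGluon function, and the multipliers are rough guidance rather than measurements.

```python
# Relative inference time versus medium_quality, copied from the table above
# (best_quality is listed as "32x+", so 32 is a lower bound).
RELATIVE_INFERENCE_TIME = {
    "best_quality": 32,
    "high_quality": 4,
    "good_quality": 2,
    "medium_quality": 1,
}

# Presets ordered from highest to lowest model quality.
QUALITY_ORDER = ["best_quality", "high_quality", "good_quality", "medium_quality"]


def pick_preset(max_relative_inference_time):
    """Return the highest-quality preset that fits the inference-time budget."""
    for name in QUALITY_ORDER:
        if RELATIVE_INFERENCE_TIME[name] <= max_relative_inference_time:
            return name
    raise ValueError("no preset fits a budget below 1x")


print(pick_preset(5))
```

The returned string could then be passed as presets=... to fit().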

TabularPredictor():

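eval_metric: non-default metrics you may want to use include:

'f1' (for binary classification), 'roc_auc' (for binary classification), 'log_loss' (for classification), 'mean_absolute_error' (for regression), and 'median_absolute_error' (for regression). You can also define your own custom metric function.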

Otherwise, options for classification:

['accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'roc_auc_ovo_macro', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score']

Options for regression:

['root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error', 'median_absolute_error', 'mean_absolute_percentage_error', 'r2']
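The choice of metric matters most on imbalanced data, where accuracy can look fine while the positive class is mostly missed. The sketch below computes accuracy, precision, recall, and F1 by hand on made-up labels to show the gap. The function name is hypothetical and this is a plain re-implementation for illustration, not how AutoGluon computes its scores.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 positives out of 8, and the model finds only one of them.
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

scores = binary_scores(y_true, y_pred)
print(scores)
```

Here accuracy is 0.75 while F1 is only 0.5, because two of the three positives are missed.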

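problem_type:

The type of prediction problem, i.e. whether this is a binary or multiclass classification problem or a regression problem (options: 'binary', 'multiclass', 'regression', 'quantile'). If problem_type = None, the problem type is inferred from the label values in the provided dataset.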
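To give a feel for what inferring the problem type from label values can look like, here is a deliberately simplified heuristic. It is an assumption-laden sketch, not AutoGluon's actual rule: the function name and the max_classes cutoff are invented for illustration, and edge cases such as missing values are ignored.

```python
def guess_problem_type(labels, max_classes=20):
    """Simplified heuristic; NOT AutoGluon's actual inference logic."""
    unique = set(labels)
    if len(unique) == 2:
        return "binary"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique
    )
    if not all_numeric:
        # Strings or other non-numeric labels are treated as classes.
        return "multiclass"
    has_fraction = any(float(v) != int(v) for v in unique)
    if has_fraction or len(unique) > max_classes:
        # Fractional or very many distinct numeric values suggest regression.
        return "regression"
    return "multiclass"


print(guess_problem_type(["<=50K", ">50K", "<=50K"]))
print(guess_problem_type(["cat", "dog", "bird"]))
print(guess_problem_type([0.5, 1.2, 3.3]))
```

When the guess would be ambiguous, for example integer ratings from 1 to 5, setting problem_type explicitly avoids surprises.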

Text Classification

%matplotlib inline

import numpy as np
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

# Sentiment classification
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000  # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)

# Train
from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label='label', eval_metric='acc', path='./automm_sst')
predictor.fit(train_data, time_limit=60)  # 60 s limit; the official docs recommend 1 h, or leaving it unset

# Evaluate
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

# Predict
# Class labels
sentence1 = "it's a charming and often affecting journey."
sentence2 = "It's slow, very, very, very slow."
predictions = predictor.predict({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Sentiment":', predictions[0])
print('"Sentence":', sentence2, '"Predicted Sentiment":', predictions[1])
# Probabilities
probs = predictor.predict_proba({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Class-Probabilities":', probs[0])
print('"Sentence":', sentence2, '"Predicted Class-Probabilities":', probs[1])
# Whole test set
test_predictions = predictor.predict(test_data)
test_predictions.head()

# Extract embeddings
embeddings = predictor.extract_embedding(test_data)
print(embeddings.shape)
# Visualize
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2, random_state=123).fit_transform(embeddings)
for val, color in [(0, 'red'), (1, 'blue')]:
    idx = (test_data['label'].to_numpy() == val).nonzero()
    plt.scatter(X_embedded[idx, 0], X_embedded[idx, 1], c=color, label=f'label={val}')
plt.legend(loc='best')
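Besides plotting, extracted embeddings are often compared directly, with cosine similarity as a common choice. The sketch below implements it in plain Python on tiny made-up vectors; the helper name is hypothetical, and real embeddings would be much longer rows of the array returned above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional "embeddings" for illustration.
emb_a = [1.0, 0.0, 1.0]
emb_b = [2.0, 0.0, 2.0]   # same direction as emb_a, different length
emb_c = [0.0, 1.0, 0.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab, sim_ac)
```

Vectors pointing the same way score close to 1 regardless of length, and orthogonal vectors score 0.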


Text Similarity

sts_train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
sts_test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
sts_train_data.head(10)

# Train
predictor_sts = MultiModalPredictor(label='score', path='./automm_sts')
predictor_sts.fit(sts_train_data, time_limit=60)

# Evaluate
test_score = predictor_sts.evaluate(sts_test_data, metrics=['rmse', 'pearsonr', 'spearmanr'])
print('RMSE = {:.2f}'.format(test_score['rmse']))
print('PEARSONR = {:.4f}'.format(test_score['pearsonr']))
print('SPEARMANR = {:.4f}'.format(test_score['spearmanr']))
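The three metrics answer different questions: RMSE measures how far predictions are from the true scores, Pearson measures linear agreement, and Spearman only looks at rank order. The plain-Python sketch below makes the difference concrete on made-up numbers; the function names are hypothetical and this is not the implementation AutoGluon uses.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up scores: predictions are monotonic in the truth but not linear.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 9.0, 16.0, 25.0]

print(rmse(y_true, y_pred), pearson(y_true, y_pred), spearman(y_true, y_pred))
```

Because the predictions preserve the ordering exactly, Spearman is 1 even though Pearson is below 1 and the RMSE is large.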

# Predict
sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']

score1 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[1]]}, as_pandas=False)

score2 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[2]]}, as_pandas=False)

score3 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[3]]}, as_pandas=False)
print(score1, score2, score3)

GPU Settings

# by default, all available gpus are used by AutoMM
predictor.fit(train_data, hyperparameters={"env.num_gpus": -1})
# use 1 gpu only
predictor.fit(train_data, hyperparameters={"env.num_gpus": 1})
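Before hard-coding a GPU count, it can help to check how many devices are actually visible. The sketch below asks PyTorch if it is installed and falls back to 0 otherwise; the helper name is hypothetical, and treating 0 as "run on CPU" for env.num_gpus is an assumption worth confirming against the AutoGluon version in use.

```python
def detect_num_gpus():
    """Return the number of CUDA devices PyTorch can see, or 0 if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


num_gpus = detect_num_gpus()

# Hyperparameter override in the same shape as the calls above.
hyperparameters = {"env.num_gpus": num_gpus}
print(hyperparameters)
```

The resulting dict could then be passed as hyperparameters=... to fit().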
