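Tutorial source: Multimodal Prediction (AutoGluon Documentation 0.5.2)
The documentation also covers:
image classification, multilingual text, multimodal data, CLIP, and more
Contents
Getting started: tabular prediction
Text classification
Text similarity
GPU settings
#Quick example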
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
predictor = TabularPredictor(label='class').fit(train_data=train_data)
predictions = predictor.predict(test_data)
score = predictor.evaluate(test_data)
CSV input works best. Avoid Chinese characters in the file so you do not have to deal with UTF-8 conversion problems.
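If a CSV does contain non-ASCII text in another encoding, it can be rewritten as UTF-8 before loading. The following is a minimal sketch using only the Python standard library. The helper name `reencode_csv` is hypothetical, and the default source encoding (GBK) is an assumption that happens to be common for CSVs exported on Chinese-locale systems; check your own files.

```python
import os
import tempfile


def reencode_csv(src_path, dst_path, src_encoding="gbk"):
    """Rewrite a text file as UTF-8 (hypothetical helper, not an AutoGluon API).

    src_encoding is an assumption; pass whatever encoding your file really uses.
    """
    # newline="" leaves the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return dst_path


# Small self-check on a throwaway file
workdir = tempfile.mkdtemp()
gbk_file = os.path.join(workdir, "raw.csv")
utf8_file = os.path.join(workdir, "clean.csv")
with open(gbk_file, "w", encoding="gbk", newline="") as f:
    f.write("name,class\n张三,A\n")

reencode_csv(gbk_file, utf8_file)
with open(utf8_file, encoding="utf-8") as f:
    roundtrip = f.read()
```

The converted file can then be passed to `TabularDataset` like any other CSV path.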
Tabular prediction:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample size, used for the demo; also handy for quick experiments
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()
#Dataset split
df = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train = df[:500].copy()
test = df[500:].copy()
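Slicing off the first 500 rows only gives a fair split if the file is already in random order. As a minimal sketch, a seeded random split over row positions could look like this. `split_positions` is a hypothetical helper written for illustration, and its output is meant to be used with `df.iloc[...]`.

```python
import random


def split_positions(n_rows, n_train, seed=0):
    """Return (train_positions, test_positions) as sorted lists of row indices.

    Hypothetical helper for illustration; with pandas the positions can be
    used as df.iloc[train_positions] and df.iloc[test_positions].
    """
    if not 0 <= n_train <= n_rows:
        raise ValueError("n_train must be between 0 and n_rows")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded, so the split is reproducible
    return sorted(positions[:n_train]), sorted(positions[n_train:])


train_pos, test_pos = split_positions(n_rows=2000, n_train=500)
```

With the data frame from the snippet above, this would be used as `train = df.iloc[train_pos].copy()` and `test = df.iloc[test_pos].copy()`.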
#Inspect a feature or the label
label = 'class'
print("Summary of class variable: \n", train_data[label].describe())
#Train
save_path = 'agModels-predictClass' # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)
#GPU
predictor = TabularPredictor(label=label).fit(
    train_data,
    ag_args_fit={'num_gpus': 1}
)
#Test set
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label] # values to predict
test_data_nolab = test_data.drop(columns=[label]) # delete label column to prove we're not cheating
test_data_nolab.head()
#Predict (concise results)
#predictor = TabularPredictor.load(save_path) # unnecessary, just demonstrates how to load previously-trained predictor from file
y_pred = predictor.predict(test_data_nolab)
print("Predictions: \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)
#Overview of all models and their evaluation (recommended)
predictor.leaderboard(test_data, silent=True)
#Predicted class probabilities
pred_probs = predictor.predict_proba(test_data_nolab)
pred_probs.head(5)
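For a binary problem, class probabilities can be turned into labels with a decision threshold other than 0.5. The sketch below works on plain dictionaries so that it runs without AutoGluon. The class names `neg` and `pos` and the helper `labels_from_probs` are made up for illustration.

```python
def labels_from_probs(prob_rows, positive_class, negative_class, threshold=0.5):
    """Map per-row class probabilities to labels using a custom threshold.

    prob_rows: iterable of dicts {class_name: probability}, one per row.
    Hypothetical helper, not part of AutoGluon.
    """
    labels = []
    for row in prob_rows:
        if row[positive_class] >= threshold:
            labels.append(positive_class)
        else:
            labels.append(negative_class)
    return labels


rows = [
    {"neg": 0.90, "pos": 0.10},
    {"neg": 0.65, "pos": 0.35},
    {"neg": 0.20, "pos": 0.80},
]
default_labels = labels_from_probs(rows, "pos", "neg")                 # threshold 0.5
recall_labels = labels_from_probs(rows, "pos", "neg", threshold=0.3)   # favors the positive class
```

If `pred_probs` is a pandas DataFrame with one column per class, `pred_probs.to_dict('records')` produces rows in this shape.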
#Better settings
time_limit = 60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # several other evaluation metrics are available, see below
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(test_data, silent=True)
The presets parameter has four options:
Preset | Model Quality | Use Cases | Fit Time (Ideal) | Inference Time (Relative to medium_quality) | Disk Usage |
best_quality | State-of-the-art (SOTA), much better than high_quality | When accuracy is what matters | 16x+ | 32x+ | 16x+ |
high_quality | Better than good_quality | When a very powerful, portable solution with fast inference is required: Large-scale batch inference | 16x | 4x | 2x |
good_quality | Significantly better than medium_quality | When a powerful, highly portable solution with very fast inference is required: Billion-scale batch inference, sub-100ms online-inference, edge-devices | 16x | 2x | 1x |
medium_quality | Competitive with other top AutoML Frameworks | Initial prototyping, establishing a performance baseline | 1x | 1x | 1x |
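One way to read the table is as a budget lookup: pick the highest-quality preset whose relative costs you can afford. The sketch below hard-codes the fit-time and inference-time columns from the table. `PRESETS` and `pick_preset` are hypothetical names for this illustration and are not part of AutoGluon.

```python
# Relative costs copied from the table above (medium_quality = 1x).
# The "+" on best_quality means "at least", so those values are lower bounds.
PRESETS = [  # ordered from highest to lowest model quality
    ("best_quality", {"fit": 16, "inference": 32}),
    ("high_quality", {"fit": 16, "inference": 4}),
    ("good_quality", {"fit": 16, "inference": 2}),
    ("medium_quality", {"fit": 1, "inference": 1}),
]


def pick_preset(max_fit, max_inference):
    """Return the highest-quality preset whose relative costs fit the budget."""
    for name, cost in PRESETS:
        if cost["fit"] <= max_fit and cost["inference"] <= max_inference:
            return name
    raise ValueError("no preset fits the given budget")


chosen = pick_preset(max_fit=16, max_inference=4)
```

The returned name is the string you would pass as `presets=` to `fit`.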
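TabularPredictor():
eval_metric: available non-default metrics include:
'f1' (for binary classification), 'roc_auc' (for binary classification), 'log_loss' (for classification), 'mean_absolute_error' (for regression), 'median_absolute_error' (for regression). You can also define your own custom metric function.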
Otherwise, options for classification:
['accuracy', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_weighted', 'roc_auc', 'roc_auc_ovo_macro', 'average_precision', 'precision', 'precision_macro', 'precision_micro', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_weighted', 'log_loss', 'pac_score']
Options for regression:
['root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error', 'median_absolute_error', 'mean_absolute_percentage_error', 'r2']
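To make a few of the classification metric names concrete, here is a minimal pure-Python sketch of accuracy, precision, recall and F1 for a binary task. It uses the textbook definitions and is not AutoGluon's implementation. The helper name `binary_metrics` and the toy labels are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred, positive=1)
```

In this toy example there are 3 true positives, 1 false positive and 1 false negative, so all four scores come out at 0.75.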
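problem_type:
The type of prediction problem, i.e. whether this is a binary or multiclass classification problem or a regression problem (options: 'binary', 'multiclass', 'regression', 'quantile'). If problem_type = None, the problem type is inferred from the label values in the provided dataset.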
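The idea of inferring the problem type from label values can be illustrated with a deliberately simplified heuristic. This is only a sketch. AutoGluon's own inference logic is more involved, and both the helper name `guess_problem_type` and the `max_classes` cutoff are assumptions made for this example.

```python
def guess_problem_type(labels, max_classes=20):
    """Very rough guess at the problem type from label values.

    Illustrative heuristic only; max_classes is an arbitrary cutoff chosen
    for this sketch, not a value taken from AutoGluon.
    """
    values = [v for v in labels if v is not None]
    if not values:
        raise ValueError("no labels given")
    unique = set(values)
    if len(unique) == 2:
        return "binary"
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    has_fraction = any(isinstance(v, float) and not v.is_integer() for v in values)
    if all_numeric and (has_fraction or len(unique) > max_classes):
        return "regression"
    return "multiclass"


guesses = {
    "income": guess_problem_type(["<=50K", ">50K", "<=50K"]),
    "price": guess_problem_type([0.5, 1.2, 3.3, 4.1]),
    "species": guess_problem_type(["a", "b", "c", "a"]),
}
```

When a guess like this would be wrong for your data (for example, integer class IDs with many distinct values), set `problem_type` explicitly when constructing the predictor.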
Text classification:
%matplotlib inline
import numpy as np
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
#Sentiment classification
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000 # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)
#Train
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label='label', eval_metric='acc', path='./automm_sst')
predictor.fit(train_data, time_limit=60)  # 60 s for the demo; the official docs recommend 1 hour, or no limit at all
#Evaluate
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)
#Predict
#Class labels
sentence1 = "it's a charming and often affecting journey."
sentence2 = "It's slow, very, very, very slow."
predictions = predictor.predict({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Sentiment":', predictions[0])
print('"Sentence":', sentence2, '"Predicted Sentiment":', predictions[1])
#Probabilities
probs = predictor.predict_proba({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Class-Probabilities":', probs[0])
print('"Sentence":', sentence2, '"Predicted Class-Probabilities":', probs[1])
#Whole test set
test_predictions = predictor.predict(test_data)
test_predictions.head()
#Extract embeddings
embeddings = predictor.extract_embedding(test_data)
print(embeddings.shape)
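One common use of such embeddings is comparing two rows by cosine similarity. The following is a minimal pure-Python sketch. The 3-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and `cosine_similarity` is a hypothetical helper, not an AutoGluon function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for real embeddings
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]    # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
```

Assuming `embeddings` is an array with one row per example, two rows such as `embeddings[0]` and `embeddings[1]` could be passed in the same way.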
#Visualize
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2, random_state=123).fit_transform(embeddings)
for val, color in [(0, 'red'), (1, 'blue')]:
    idx = (test_data['label'].to_numpy() == val).nonzero()
    plt.scatter(X_embedded[idx, 0], X_embedded[idx, 1], c=color, label=f'label={val}')
plt.legend(loc='best')
Text similarity:
sts_train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
sts_test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
sts_train_data.head(10)
#Train
predictor_sts = MultiModalPredictor(label='score', path='./automm_sts')
predictor_sts.fit(sts_train_data, time_limit=60)
#Evaluate
test_score = predictor_sts.evaluate(sts_test_data, metrics=['rmse', 'pearsonr', 'spearmanr'])
print('RMSE = {:.2f}'.format(test_score['rmse']))
print('PEARSONR = {:.4f}'.format(test_score['pearsonr']))
print('SPEARMANR = {:.4f}'.format(test_score['spearmanr']))
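The three reported metrics measure different things: RMSE is the typical size of the error, Pearson correlation measures linear agreement, and Spearman correlation measures agreement in rank order. A minimal pure-Python sketch using the textbook definitions follows. The helper names and toy numbers are assumptions for illustration, not AutoGluon code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotone in y_true, but not linear

results = {
    "rmse": rmse(y_true, y_pred),
    "pearsonr": pearson(y_true, y_pred),
    "spearmanr": spearman(y_true, y_pred),
}
```

Because the toy predictions keep the ordering of the true scores, the Spearman correlation is 1 even though the RMSE is large and the Pearson correlation is below 1.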
#Predict
sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']
score1 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[1]]}, as_pandas=False)
score2 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[2]]}, as_pandas=False)
score3 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[3]]}, as_pandas=False)
print(score1, score2, score3)
GPU settings:
# by default, all available gpus are used by AutoMM
predictor.fit(train_data, hyperparameters={"env.num_gpus": -1})
# use 1 gpu only
predictor.fit(train_data, hyperparameters={"env.num_gpus": 1})
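Which physical GPUs the process can see at all can also be restricted at the CUDA level. The sketch below assumes an NVIDIA/CUDA setup. `CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable rather than an AutoGluon option, and `limit_visible_gpus` is a hypothetical helper name.

```python
import os


def limit_visible_gpus(device_ids):
    """Expose only the given GPU indices to CUDA (hypothetical helper)."""
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Make only the first GPU visible. This has to run before any CUDA-using
# library (for example torch) initializes, so call it at the top of the script.
visible = limit_visible_gpus([0])
```

With only one device visible, a setting of `"env.num_gpus": -1` would then have just that one GPU to use.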