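I recently came across a tool called AutoGluon. It covers a wide range of tasks, including time series forecasting, an EDA (exploratory data analysis) module, image-text matching, object detection, named entity recognition, and text classification. The tutorials are available at:
https://auto.gluon.ai/dev/tutorials/
AutoGluon can make predictions on both numeric and text data.
Prediction on numeric data (for example, the iris dataset) is well studied.
Prediction on text data (for example, the Titanic dataset) has traditionally meant converting the text into numbers first. AutoGluon improves on this: according to its documentation, it uses transformers to learn from text.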
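To make "converting text into numbers" concrete, here is a minimal sketch of one common approach, integer label encoding, in plain Python. The helper name label_encode and the column values are made up for illustration, and this is not a description of what AutoGluon does internally.

```python
def label_encode(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        encoded.append(mapping[value])
    return encoded, mapping

# Hypothetical values for a text column such as a port of embarkation.
ports = ["Southampton", "Cherbourg", "Southampton", "Queenstown"]
encoded, mapping = label_encode(ports)
print(encoded)
print(mapping)
```

An encoding like this discards the meaning of the words themselves, which is the limitation a transformer-based approach is meant to address.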
Numeric prediction:
from autogluon.tabular import TabularDataset, TabularPredictor
# Load the example training and test sets (CSV files hosted on S3)
data_root = 'https://autogluon.s3.amazonaws.com/datasets/Inc/'
train_data = TabularDataset(data_root + 'train.csv')
test_data = TabularDataset(data_root + 'test.csv')
# Train with the 'class' column as the label, then predict on the test set
predictor = TabularPredictor(label='class').fit(train_data=train_data)
predictions = predictor.predict(test_data)
Text prediction:
from autogluon.multimodal import MultiModalPredictor
import uuid
time_limit = 3 * 60 # set to larger value in your applications
model_path = f"./tmp/{uuid.uuid4().hex}-automm_text_book_price_prediction"
# Note: train_data here must be a table that contains a 'Price' column (the label)
predictor = MultiModalPredictor(label='Price', path=model_path)
predictor.fit(train_data, time_limit=time_limit)
Preparing the data (for a named entity recognition example):
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/train_v2.csv')
test_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/test_v2.csv')
train_data.head(5)
The data looks like this; the dashes separate the raw input from its annotations:
what movies star bruce willis-------------[{"entity_group": "ACTOR", "start": 17, "end":…
show me films with drew barrymore from the 1980s------------[{"entity_group": "ACTOR", "start": 19, "end":…
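If you want to prepare your own data in the same shape, the annotation column can be generated from known entity substrings. The sketch below uses a hypothetical helper, build_annotations, which is not part of AutoGluon. It assumes the annotations are stored as a JSON string and that "end" is an exclusive character offset, so that sentence[start:end] yields the entity text.

```python
import json

def build_annotations(sentence, entities):
    """Build a JSON annotation string from (substring, entity_group) pairs.

    Offsets are character positions; "end" is treated as exclusive
    (an assumption, so that sentence[start:end] gives the entity text).
    """
    annotations = []
    for text, group in entities:
        start = sentence.find(text)
        if start == -1:
            raise ValueError(f"{text!r} not found in sentence")
        annotations.append(
            {"entity_group": group, "start": start, "end": start + len(text)}
        )
    return json.dumps(annotations)

sentence = "what movies star bruce willis"
annotation = build_annotations(sentence, [("bruce willis", "ACTOR")])
print(annotation)
```

Note that str.find only locates the first occurrence, so a sentence that repeats the same entity text would need extra handling.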
Training the model:
from autogluon.multimodal import MultiModalPredictor
import uuid
label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner" # You can rename it to the model path you like
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
    train_data=train_data,
    hyperparameters={'model.ner_text.checkpoint_name': 'google/electra-small-discriminator'},
    time_limit=300,  # seconds
)
Results:
from autogluon.multimodal.utils import visualize_ner
sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
predictions = predictor.predict({'text_snippet': [sentence]})
print('Predicted entities:', predictions[0])
# Visualize
visualize_ner(sentence, predictions[0])
Partial output (the result can also be rendered visually):
Predicted entities: [{'entity_group': 'TITLE', 'start': 0, 'end': 15}, {'entity_group': 'GENRE', 'start': 22, 'end': 44}, {'entity_group': 'DIRECTOR', 'start': 74, 'end': 87}]
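The start and end values are character offsets into the input sentence. As a small sketch in plain Python, the entity text can be recovered by slicing the sentence with each span. The function name extract_entities is my own; the sentence and spans are copied from the output above.

```python
def extract_entities(sentence, predictions):
    """Return (entity_group, text) pairs by slicing the sentence at each span."""
    return [
        (p["entity_group"], sentence[p["start"]:p["end"]])
        for p in predictions
    ]

sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
predicted = [
    {"entity_group": "TITLE", "start": 0, "end": 15},
    {"entity_group": "GENRE", "start": 22, "end": 44},
    {"entity_group": "DIRECTOR", "start": 74, "end": 87},
]
entities = extract_entities(sentence, predicted)
for group, text in entities:
    print(f"{group}: {text}")
```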
Continued training:
If the data has been updated, you can also continue training on top of the old model:
new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1', 'ACTOR'])
print(test_score)
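To give a rough idea of what an F1 score over entities measures, here is a simplified sketch that counts a predicted span as correct only if its entity group, start, and end all match a gold span exactly. This is my own illustration with made-up spans; the way AutoGluon actually computes overall_f1 may differ.

```python
def span_f1(gold, predicted):
    """Exact-match precision, recall and F1 over (entity_group, start, end) spans."""
    gold_set = {(g["entity_group"], g["start"], g["end"]) for g in gold}
    pred_set = {(p["entity_group"], p["start"], p["end"]) for p in predicted}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up gold and predicted spans for a single sentence.
gold = [
    {"entity_group": "ACTOR", "start": 17, "end": 29},
    {"entity_group": "YEAR", "start": 39, "end": 44},
]
pred = [
    {"entity_group": "ACTOR", "start": 17, "end": 29},
    {"entity_group": "GENRE", "start": 0, "end": 4},
]
precision, recall, f1 = span_f1(gold, pred)
print(precision, recall, f1)
```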
More models
For more examples, see: https://github.com/autogluon/autogluon/tree/master/examples/automm