Flair-1

Reading motivation:

Noticed that it tops the NER leaderboards (NLP-progress). PS: the OntoNotes on those leaderboards is not the 89-class OntoNotes I work with; it is the 18-class version that industry pays more attention to.

Main contributions:

Contribution 1:

Contextual string embeddings: a language model built at the character level (character language model).


There is not much to say about the training details. Character-level embeddings; that's it.

Contribution 2:

A rather nice framework that integrates all kinds of embeddings.
PS: what's really neat is that it also provides StackedEmbeddings, which must save a lot of code; details below.
Embeddings available there include XLNet, FlairEmbeddings, BERT, ELMo, Word2vec and so on; see the link above.

Getting-started demo

In the code below, load automatically downloads the resource if it is not already present locally. The annoying part is that the file cannot be reached without a VPN; sigh. Downloading it yourself is fairly quick, though.
Download link reference 1: ner
Download link reference 2: onto-ner
Other download URLs can be constructed from these.
A tokenizer can be configured on the Sentence below (see the sketch that follows).
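
For reference, a minimal sketch of the tokenizer switch. The use_tokenizer flag exists in the flair version used here; newer versions also accept a tokenizer object, so treat the exact argument as an assumption:

from flair.data import Sentence

# without a tokenizer, the text is simply split on whitespace,
# so 'green.' stays a single token
s1 = Sentence('The grass is green.')
# with use_tokenizer=True, flair runs its built-in (segtok) tokenizer,
# so the trailing period becomes its own token
s2 = Sentence('The grass is green.', use_tokenizer=True)
print(s1)  # 4 tokens
print(s2)  # 5 tokens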

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence [Sentence is a class: essentially a list of Token objects]
sentence = Sentence('The George Washington went to Washington .')

# load the NER tagger model
# tagger = SequenceTagger.load('ner') 
tagger = SequenceTagger.load('/home/huyufeng/flair/flair/checkpoints/en-ner-conll03-v0.4.pt')
tagger = SequenceTagger.load('/home/huyufeng/flair/flair/checkpoints/en-ner-ontonotes-v0.4.pt')

# run NER over sentence
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())
#>>> The George <B-PER> Washington <E-PER> went to Washington <S-LOC> .

The above prints only brief information; the following prints the full details:

>>> print(sentence.to_dict(tag_type='ner'))
{"text": "The George Washington went to Washington .",
 "labels": [],
 "entities": [{"text": "George Washington",
    "start_pos": 4, "end_pos": 21, "type": "PER",
    "confidence": 0.9787668585777283},
    {"text": "Washington",
    "start_pos": 30, "end_pos": 40, "type": "LOC",
    "confidence": 0.9987319111824036}]}
>>> 
>>> for entity in sentence.get_spans('ner'):
...     print(entity)
... 
PER-span [2,3]: "George Washington"
LOC-span [6]: "Washington"

That concludes the example. It shows how convenient flair is: a model can be deployed very quickly. Of course, the above goes straight from sentence to tags, end to end, no middleman.

-------------------- End of stage one --------------------

Constructing a dataset

Of course, now that the structure of Sentence is fixed, the next step is to construct datasets (a small sketch follows the snippet below).

sentence = Sentence('The grass is green .')
print(sentence)
sentence[3].add_tag('ner', 'color')
print(sentence.to_tagged_string())
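
To build up a small hand-labeled dataset, the same pattern simply repeats over a list of sentences. A minimal sketch; the texts and tag values are made up for illustration:

from flair.data import Sentence

# each training example is a Sentence with gold tags attached to its tokens
raw = [('The grass is green .', 3, 'color'),
       ('The sky is blue .', 3, 'color')]

dataset = []
for text, idx, tag in raw:
    s = Sentence(text)
    s[idx].add_tag('ner', tag)  # attach the gold tag to the token at position idx
    dataset.append(s)

print(dataset[0].to_tagged_string())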



Embedding

The following covers four embeddings: GloVe, ELMo, Flair and BERT.

from flair.embeddings import WordEmbeddings, ELMoEmbeddings, FlairEmbeddings, BertEmbeddings
from flair.data import Sentence
s = Sentence("it was filthy to do such dirty work")

root = "/home/huyufeng/elmo/dataset/"
weight_file_path = "elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5"
options_file_path = "elmo_2x2048_256_2048cnn_1xhighway_options.json"
elmo_embedding = ELMoEmbeddings(options_file=root + options_file_path, weight_file=root + weight_file_path)
elmo_embedding.embed(s)
for token in s:
    print(token)
    print(token.embedding)
    print(token.embedding.size())


glove_path = "/home/huyufeng/flair/flair/checkpoints/glove.gensim.vectors.npy"
glove_embedding = WordEmbeddings(glove_path)
glove_embedding.embed(s)
for token in s:
    print(token)
    print(token.embedding)
    print(token.embedding.size())


flair_embedding_forward = FlairEmbeddings('news-forward')
# flair_embedding_forward = FlairEmbeddings('news-backward')
flair_embedding_forward.embed(s)
for token in s:
    print(token)
    print(token.embedding)
    print(token.embedding.size())  # 3584: token.embedding concatenates every embedding added to the token so far in this session


bert_path = "/home/huyufeng/glove/uncased_L-12_H-768_A-12"
bert_embedding = BertEmbeddings(bert_path)
bert_embedding.embed(s)
for token in s:
    print(token)
    print(token.embedding)
    print(token.embedding.size())  # 3072 = 4 * 768 (bert-base, last four layers by default)

Stacking: embedding with several embeddings simultaneously

The stacked embeddings are simply concatenated.
See the link for details; I will fill this in once I actually use it.
A single class handles it all, very convenient (usage sketch after the snippet below).

from flair.embeddings import ELMoEmbeddings, FlairEmbeddings, BertEmbeddings, StackedEmbeddings

# the backward Flair embedding was only commented out above, so define it here
flair_embedding_backward = FlairEmbeddings('news-backward')

# create a StackedEmbedding object that combines elmo, bert and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
                                        elmo_embedding,
                                        bert_embedding,
                                        flair_embedding_forward,
                                        flair_embedding_backward,
                                       ])
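
Usage is the same as for a single embedding. A quick sketch, reusing the session above (Sentence is already imported there):

# embed a fresh sentence with the stacked embedding
s_stacked = Sentence('it was filthy to do such dirty work')
stacked_embeddings.embed(s_stacked)
for token in s_stacked:
    print(token)
    print(token.embedding.size())  # the individual embedding sizes, concatenated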

Document Embeddings

Word embeddings produce a words × dim matrix.
Document embeddings produce a single dim vector, i.e. one embedding for the whole sentence.
Two methods are provided: pooling and an RNN.
The pooling method defaults to mean, with [mean, max, min] available, and works out of the box.
The RNN method defaults to a GRU, with ['GRU', 'LSTM'] available; caveat: the GRU/LSTM presumably needs to be trained before its output is meaningful.

from flair.embeddings import ELMoEmbeddings, FlairEmbeddings, BertEmbeddings, StackedEmbeddings, DocumentPoolEmbeddings, DocumentRNNEmbeddings

# create an example sentence
sentence = Sentence('The grass is green . And the sky is blue .')
sentence2 = Sentence('The grass is green . And the sky is blue .')

# initialize the document embeddings, mode = mean
document_embeddings = DocumentPoolEmbeddings([bert_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward])
# embed the sentence with our document embedding
document_embeddings.embed(sentence)
print(sentence.get_embedding()) #7168
>>> tensor([-0.0132, -0.1393,  0.0427,  ..., -0.0013, -0.0026,  0.0170],
       grad_fn=<...>)


document_embeddings_rnn = DocumentRNNEmbeddings([bert_embedding,
                                              flair_embedding_backward,
                                              flair_embedding_forward])
# embed the sentence with our document embedding
document_embeddings_rnn.embed(sentence2)
print(sentence2.get_embedding()) #7296
>>> tensor([-0.0651,  0.6252,  0.2668,  ..., -0.0013, -0.0026,  0.0170],
       grad_fn=<...>)
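
The pooling mode and RNN cell mentioned above are chosen at construction time. A minimal sketch; the argument names pooling= and rnn_type= follow the flair documentation but may differ between versions, so treat them as assumptions:

# max-pooling instead of the default mean
document_embeddings_max = DocumentPoolEmbeddings(
    [bert_embedding, flair_embedding_backward, flair_embedding_forward],
    pooling='max')

# LSTM instead of the default GRU
document_embeddings_lstm = DocumentRNNEmbeddings(
    [bert_embedding, flair_embedding_backward, flair_embedding_forward],
    rnn_type='LSTM')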

Loading Training Data

There are all sorts of built-in datasets; worth playing with.

Dataset              Description                        Notes
'UD_ENGLISH'         Universal Dependencies treebank
'WIKINER_ENGLISH'    WikiNER
'NEWSGROUPS'         text classification                seems quite similar to my own task
'IMDB'               (ahem)                             where my interest lies

Loading the data

import flair.datasets
corpus = flair.datasets.IMDB()
news_corpus = flair.datasets.NEWSGROUPS()

Printing the data

print(corpus) 
>>> Corpus: 10183 train + 1131 dev + 7532 test sentences
print(len(corpus.test)) #[train, test, dev]
print(corpus.test[0])
print(corpus.test[0].to_tagged_string('pos'))
print(corpus.test[0].labels)  # the text-classification label (just one label); unlike the tagged output above, there are also NER-style corpora with token tags

Downsampling the data to 10%

downsampled_corpus = flair.datasets.IMDB().downsample(0.1)

Data statistics: this gives a detailed breakdown over all classes. See the appendix for the output.

stats = corpus.obtain_statistics()
print(stats)

Multi-Corpus

Not sure yet what this is useful for.

from flair.data import MultiCorpus
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
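
The three corpora in that snippet are not defined above; a minimal sketch of how they might be loaded, assuming the UD datasets shipped with flair.datasets:

import flair.datasets
from flair.data import MultiCorpus

english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()

multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
print(multi_corpus)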

Reading data that is not in the built-in list above, in a column format like this (link):

George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

pass
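
A sketch for this skipped part: column-formatted data like the sample above can be loaded with ColumnCorpus from flair.datasets (in older versions this lived in the NLPTaskDataFetcher). Folder and file names are placeholders:

from flair.datasets import ColumnCorpus

# which column holds which information (matches the sample above: word, POS, NER)
columns = {0: 'text', 1: 'pos', 2: 'ner'}

corpus = ColumnCorpus('/path/to/data_folder', columns,
                      train_file='train.txt',
                      dev_file='dev.txt',
                      test_file='test.txt')
print(corpus)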

Reading CSV data that is not in the built-in list above, formatted like [link]:

pass
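
A sketch for this skipped part as well, assuming CSVClassificationCorpus from flair.datasets (available in newer flair versions; argument names may differ). The folder, column mapping and delimiter are placeholders:

from flair.datasets import CSVClassificationCorpus

# map CSV columns to their meaning; here column 0 is the label, column 1 the text
column_name_map = {0: 'label', 1: 'text'}

csv_corpus = CSVClassificationCorpus('/path/to/csv_folder',
                                     column_name_map,
                                     skip_header=True,
                                     delimiter=',')
print(csv_corpus)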

Side note

Browsing a function's call sites on GitHub actually seems a bit more comfortable than in VS Code; maybe it's time to switch to reading code on GitHub.

reference

git - flair

Study outline

  • Tutorial 1: Basics
  • Tutorial 2: Tagging your Text
  • Tutorial 3: Embedding Words
  • Tutorial 4: List of All Word Embeddings
  • Tutorial 5: Embedding Documents
  • Tutorial 6: Loading your own Corpus
  • Tutorial 7: Training your own Models
  • Tutorial 8: Optimizing your Models
  • Tutorial 9: Training your own Flair Embeddings

Appendix: sentence classification

Sentence classification, tried out along the way; the results may well be useful later. Here is an IMDB positive/negative example (i.e. sentiment analysis; besides sentiment there are also models for detecting offensive language).

sentence = Sentence('France is the current world cup winner.')

# add a label to a sentence
sentence.add_label('sports')
# a sentence can also belong to multiple classes
sentence.add_labels(['sports', 'world cup'])
# you can also set the labels while initializing the sentence
sentence = Sentence('France is the current world cup winner.', labels=['sports', 'world cup'])
print(sentence)
for label in sentence.labels:
    print(label)

The above sets labels on a sentence by hand; below is actual sentence classification with a pretrained model.

from flair.models import TextClassifier
classifier = TextClassifier.load('/home/huyufeng/flair/flair/checkpoints/imdb-v0.4.pt')

s2 = Sentence('I feel bad about this movie')
s3 = Sentence('I feel really bad about characters\' suffering, It touched something deeply in my heart, absolutely, it is good')
# predict sentiment labels
classifier.predict([s2,  s3])

# print sentence with predicted labels
>>> print(s2.labels)
[NEGATIVE (0.9519780874252319)]
>>> print(s3.labels)
[POSITIVE (0.9950523972511292)]

Appendix: model and embedding directory

For my own reference when locating files: 246

Models

root = /home/huyufeng/flair/flair/checkpoints
Positive-negative classification: imdb-v0.4.pt
NER (OntoNotes): en-ner-ontonotes-v0.4.pt
NER (CoNLL-03): en-ner-conll03-v0.4.pt

Embedding directory

ELMO:
root = /home/huyufeng/elmo/dataset/
elmo_2x1024_128_2048cnn_1xhighway_weights.hdf5
elmo_2x1024_128_2048cnn_1xhighway_options.json
elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5
elmo_2x2048_256_2048cnn_1xhighway_options.json
elmo_2x4096_512_2048cnn_2xhighway_5.5B_weights.hdf5
elmo_2x4096_512_2048cnn_2xhighway_5.5B_options.json

GLOVE
root = /home/huyufeng/flair/flair/checkpoints
glove.gensim.vectors.npy

FlairEmbedding
FlairEmbeddings('news-forward')  # already downloaded; loads directly

BertEmbedding
/home/huyufeng/glove/uncased_L-12_H-768_A-12
Just point it at the folder. The catch is that bert_config.json has to be renamed to config.json for it to work; that wasted half an hour.

Appendix: dataset directory

For my own reference when locating files: 246

root = /home/huyufeng/flair/flair/checkpoints/DATASET

Dataset              Location
'WIKINER_ENGLISH'    aij-wikiner-en-wp3.bz2
'NEWSGROUPS'         20news-bydate.tar.gz
'IMDB'               aclImdb_v1.tar.gz
import flair.datasets
corpus = flair.datasets.IMDB()

PS

Learned a nice code-organization trick: put an __init__.py in the package folder and write the imports in it, so everything can then be imported directly (sketch below).

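A minimal sketch of the pattern; the package and module names (mypkg, submodule, SomeClass) are hypothetical:

# mypkg/__init__.py
from .submodule import SomeClass  # re-export, so users can do `from mypkg import SomeClass`

# user code elsewhere
import mypkg
obj = mypkg.SomeClass()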

import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

Appendix: data statistics

{
    "TRAIN": {
        "dataset": "TRAIN",
        "total_number_of_documents": 10183,
        "number_of_documents_per_class": {
            "rec.motorcycles": 535,
            "comp.sys.mac.hardware": 513,
            "comp.windows.x": 530,
            "sci.electronics": 523,
            "talk.politics.mideast": 526,
            "misc.forsale": 533,
            "talk.politics.guns": 487,
            "soc.religion.christian": 535,
            "rec.autos": 534,
            "alt.atheism": 435,
            "comp.os.ms-windows.misc": 529,
            "sci.med": 535,
            "rec.sport.baseball": 528,
            "sci.crypt": 540,
            "comp.graphics": 535,
            "talk.religion.misc": 339,
            "rec.sport.hockey": 535,
            "comp.sys.ibm.pc.hardware": 533,
            "talk.politics.misc": 418,
            "sci.space": 540
        },
        "number_of_tokens_per_tag": {},
        "number_of_tokens": {
            "total": 3397464,
            "min": 22,
            "max": 13487,
            "avg": 333.6407738387509
        }
    },
 "TEST": {...

Appendix: why choose flair

Unlike Facebook's FastText or even Google's AutoML Natural Language platform, doing text classification with Flair is still relatively low-level work. We get full control over how text is embedded and how training proceeds, by setting parameters such as the learning rate, batch size, anneal factor, loss function, optimizer choice and so on... To get the best performance, these hyperparameters need to be tuned. Flair ships a wrapper around Hyperopt, the well-known hyperparameter tuning library, which can be used to tune them for the best results.
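
For reference, a sketch of how a search space is declared with that Hyperopt wrapper, based on flair's hyperparameter tuning tutorial; the exact parameter-selector class and its arguments depend on the flair version, so this is an outline rather than a definitive call:

from hyperopt import hp
from flair.hyperparameter.param_selection import SearchSpace, Parameter

# declare which hyperparameters to search over and their candidate values
search_space = SearchSpace()
search_space.add(Parameter.LEARNING_RATE, hp.choice, options=[0.05, 0.1, 0.15, 0.2])
search_space.add(Parameter.MINI_BATCH_SIZE, hp.choice, options=[8, 16, 32])
search_space.add(Parameter.ANNEAL_FACTOR, hp.choice, options=[0.25, 0.5])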

In this article, for simplicity, we used the default hyperparameters. With mostly default settings, our Flair model reached an f1-score of 0.973 after 10 training epochs.

For comparison, we trained text classification models with FastText and with the AutoML Natural Language platform. Running FastText with default parameters gave an f1-score of 0.883, so the Flair model outperforms it by a wide margin. However, FastText needed only a few seconds to train, whereas our Flair model took 5 minutes.

We also compared against the results from Google's AutoML Natural Language platform. The platform first needed 20 minutes to parse the dataset. After that, training took almost 3 hours and reached an f1-score of 99.211, slightly better than the model we trained ourselves.
