Using the BERT model in PyTorch to obtain sentence vectors for downstream NLP tasks

1. Install pytorch-pretrained-BERT

pip install pytorch-pretrained-bert

My Python version is 3.6.
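
A quick sanity check that the installation worked (a minimal sketch; it only imports the classes used later and prints the PyTorch version):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# If both imports succeed, the package is installed correctly
print(torch.__version__)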

2. Download the model and vocabulary:

The models and vocabulary files are hosted at: https://s3.amazonaws.com/models.huggingface.co

For example, download bert-base-cased.tar.gz and its vocabulary file:

https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz

https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt

Put both files into a folder named bert-base-cased_file. Make sure the folder name is not identical to the model name: if it matches a built-in shortcut such as bert-base-cased, from_pretrained resolves it to the download URL instead of the local directory. Rename bert-base-cased-vocab.txt to vocab.txt and extract bert-base-cased.tar.gz inside the folder. Note that the code in step 3 loads the uncased model from a folder named bert-base-uncased_file, prepared in the same way.
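
After extraction, the folder should contain vocab.txt together with bert_config.json and pytorch_model.bin from the archive. A minimal sketch to verify the layout (the folder name bert-base-cased_file is the one chosen above):

import os

model_dir = 'bert-base-cased_file'  # folder created above
for name in ['vocab.txt', 'bert_config.json', 'pytorch_model.bin']:
    path = os.path.join(model_dir, name)
    print(path, 'OK' if os.path.exists(path) else 'MISSING')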

3. Obtain the hidden-layer vectors and the pooled output vector

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
# See also: BERT Word Embeddings Tutorial, https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased_file')

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased_file')
model.eval()

# If you have a GPU, put everything on cuda (fall back to CPU otherwise)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, pooled_output = model(tokens_tensor, segments_tensors)
    print(pooled_output)
# We have hidden states for each of the 12 layers in the bert-base-uncased model
assert len(encoded_layers) == 12
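
To see what the forward pass actually returned, here is a short check that continues from the variables above: encoded_layers is a list with one tensor per layer, and pooled_output is the vector derived from the [CLS] token that feeds the next-sentence-prediction head.

# Each layer output has shape [batch_size, sequence_length, hidden_size]
print(encoded_layers[-1].shape)   # torch.Size([1, 14, 768])
# pooled_output has shape [batch_size, hidden_size]
print(pooled_output.shape)        # torch.Size([1, 768])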

Q: Where do you get the fixed representation? Did you do pooling or something?

A: I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling (see the code sketch after this Q&A section).

Q: Why not use the hidden state of the first token, i.e. the [CLS]?

A: Because a pre-trained model is not fine-tuned on any downstream tasks yet. In this case, the hidden state of [CLS] is not a good sentence representation. If later you fine-tune the model, you may use get_pooled_output() to get the fixed length representation as well.

Q: Why not the last hidden layer? Why second-to-last?

A: The last layer is too close to the target functions (i.e. masked language model and next sentence prediction) during pre-training, and may therefore be biased towards those targets.
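
Putting the first answer into code: a minimal sketch, continuing from the variables in step 3, that average-pools the second-to-last hidden layer over all tokens to obtain a fixed-length sentence vector.

# encoded_layers holds 12 layer outputs, each of shape [1, seq_len, 768]
second_to_last = encoded_layers[-2]             # [1, 14, 768]
sentence_vector = second_to_last.mean(dim=1)    # average over tokens -> [1, 768]
print(sentence_vector.shape)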

Reference: https://pythonawesome.com/mapping-a-variable-length-sentence-to-a-fixed-length-vector-using-pretrained-bert-model/
