We tag every token in a sentence with a named-entity label, in the following format:

```
John  lives in New  York
B-PER O     O  B-LOC I-LOC
```

Our data is split into two files, sentences.txt and labels.txt:
```
#sentences.txt
John lives in New York
Where is John ?

#labels.txt
B-PER O O B-LOC I-LOC
O O B-PER O
```
Suppose we run build_vocab.py to build the vocabulary in /data; this generates two files:
```
#words.txt
John
lives
in
...

#tags.txt
B-PER
B-LOC
```
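The build_vocab.py script itself is not shown here. As a rough idea, a minimal sketch under the assumption that it only collects the unique tokens from the training files and prepends the special PAD and UNK entries (the file paths and the actual CS230 script may differ) could look like this:

```python
# minimal sketch of build_vocab.py (an assumption, not the original script)
def build_vocab(input_path, output_path, extra_tokens=()):
    tokens = set()
    with open(input_path) as f:
        for line in f.read().splitlines():
            tokens.update(line.split(' '))
    with open(output_path, 'w') as f:
        for token in list(extra_tokens) + sorted(tokens):
            f.write(token + '\n')

build_vocab('data/sentences.txt', 'data/words.txt', extra_tokens=('PAD', 'UNK'))
build_vocab('data/labels.txt', 'data/tags.txt')
```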
In NLP applications, a sentence is represented by the indices of its tokens instead of the tokens themselves. For example, if our vocabulary is {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5}, then "Where is John ?" is represented as [3, 1, 2, 5].
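As a quick illustration with the toy vocabulary above:

```python
vocab = {'is': 1, 'John': 2, 'Where': 3, '.': 4, '?': 5}
sentence = "Where is John ?"
print([vocab[token] for token in sentence.split(' ')])  # [3, 1, 2, 5]
```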
To do this, we read the words.txt vocabulary and assign an index to each word. words.txt contains two special tokens: UNK, which stands in for any word that is not in the vocabulary, and PAD, which is used for padding sentences.
```python
vocab = {}
with open(words_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i
```
We process tags.txt in the same way to obtain a tag_map.
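For example, assuming tags_path points to tags.txt (analogous to words_path above), the tag map can be built with the same loop:

```python
tag_map = {}
with open(tags_path) as f:
    for i, t in enumerate(f.read().splitlines()):
        tag_map[t] = i
```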
Next, we read the text files and convert each token and label to its index:
```python
train_sentences = []
train_labels = []

with open(train_sentences_file) as f:
    for sentence in f.read().splitlines():
        # replace each token by its index if it is in vocab,
        # else use the index of UNK
        s = [vocab[token] if token in vocab
             else vocab['UNK']
             for token in sentence.split(' ')]
        train_sentences.append(s)

with open(train_labels_file) as f:
    for sentence in f.read().splitlines():
        # replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        train_labels.append(l)
```
Sentences are generally not of equal length, so we need to pad them with PAD. Let's say we have a batch of sentences, batch_sentences, as a Python list of lists, along with a corresponding batch_tags, which holds a tag for each token in batch_sentences. We then:

1. Compute the length of the longest sentence in the batch and pad every shorter sentence with PAD.
2. Initialize the batch with shape (num_sentences, batch_max_len); since the embedding layer expects long-type input, convert the result to a LongTensor.
```python
import numpy as np
import torch
from torch.autograd import Variable

# compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_sentences])

# prepare a numpy array with the data, initializing the data with 'PAD'
# and all labels with -1; initializing labels to -1 differentiates tokens
# with tags from 'PAD' tokens
batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

# copy the data to the numpy array
for j in range(len(batch_sentences)):
    cur_len = len(batch_sentences[j])
    batch_data[j][:cur_len] = batch_sentences[j]
    batch_labels[j][:cur_len] = batch_tags[j]

# since all data are indices, we convert them to torch LongTensors
batch_data, batch_labels = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)

# convert Tensors to Variables
batch_data, batch_labels = Variable(batch_data), Variable(batch_labels)
```
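The snippet above assumes that batch_sentences and batch_tags have already been sliced out of train_sentences and train_labels. A minimal, hedged sketch of such a batching helper (the batch size and the lack of shuffling are illustrative assumptions, not from the original tutorial):

```python
def batch_iterator(sentences, labels, batch_size=32):
    # yield successive (batch_sentences, batch_tags) slices;
    # the padding code above is then applied to each slice
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size], labels[i:i + batch_size]
```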
The model itself is an embedding layer, followed by an LSTM, followed by a fully connected layer that produces a score for each tag at every token position:

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, params):
        super(Net, self).__init__()

        # maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)

        # the LSTM takes the embedded sentence as input
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)

        # fc layer transforms the LSTM output into the final output (one score per tag)
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

    def forward(self, s):
        # apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)            # dim: batch_size x batch_max_len x embedding_dim

        # run the LSTM along the sentences of length batch_max_len
        s, _ = self.lstm(s)              # dim: batch_size x batch_max_len x lstm_hidden_dim

        # reshape the output so that each row contains one token
        s = s.view(-1, s.shape[2])       # dim: batch_size*batch_max_len x lstm_hidden_dim

        # apply the fully connected layer and obtain the output for each token
        s = self.fc(s)                   # dim: batch_size*batch_max_len x num_tags

        return F.log_softmax(s, dim=1)   # dim: batch_size*batch_max_len x num_tags
```
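As a quick sanity check, here is a hedged sketch of instantiating the model and running it on the padded batch from above. The SimpleNamespace-based params and the concrete hyperparameter values are illustrative assumptions, not part of the original code:

```python
from types import SimpleNamespace

# hypothetical hyperparameters for illustration only
params = SimpleNamespace(vocab_size=len(vocab), embedding_dim=50,
                         lstm_hidden_dim=50, number_of_tags=len(tag_map))

model = Net(params)
outputs = model(batch_data)
print(outputs.shape)  # torch.Size([batch_size*batch_max_len, num_tags])
```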
Finally, we need a custom loss function; its main job is to mask out the contribution of the PAD tokens:
```python
def loss_fn(outputs, labels):
    # reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    # mask out 'PAD' tokens (which were labelled -1)
    mask = (labels >= 0).float()

    # the number of tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask).item())

    # pick the values corresponding to the labels and multiply by the mask
    # (the -1 labels index the last tag, but those positions are zeroed by the mask)
    outputs = outputs[range(outputs.shape[0]), labels]*mask

    # cross entropy loss for all non 'PAD' tokens
    return -torch.sum(outputs)/num_tokens
```
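To show how the pieces fit together, here is a hedged sketch of a single training step on the batch prepared above; the choice of the Adam optimizer and the learning rate are assumptions for illustration:

```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)

# one training step on the padded batch
optimizer.zero_grad()
outputs = model(batch_data)              # dim: batch_size*batch_max_len x num_tags
loss = loss_fn(outputs, batch_labels)    # masked cross-entropy over non-PAD tokens
loss.backward()
optimizer.step()
```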
References:
https://cs230.stanford.edu/blog/namedentity/#goals-of-this-tutorial
https://github.com/cs230-stanford/cs230-code-examples