PyTorch NER

Named entity recognition tags each token of a sentence with an entity label, for example:

```
John   lives in New   York
B-PER  O     O  B-LOC I-LOC
```

Our data is split into two files, sentences.txt and labels.txt:

```
#sentences.txt
John lives in New York
Where is John ?

#labels.txt
B-PER O O B-LOC I-LOC
O O B-PER O
```

We assume that running build_vocab.py builds the vocabularies in /data, generating two files:

```
#words.txt
John
lives
in
...
```


```
#tags.txt
B-PER
B-LOC
```
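Roughly speaking, building words.txt amounts to collecting every token that appears in sentences.txt plus the special tokens. The sketch below is only an assumption about what build_vocab.py does, not the actual script, and the paths are placeholders:

```python
#collect every token seen in the training sentences (paths are placeholders)
words = set()
with open('data/sentences.txt') as f:
    for line in f.read().splitlines():
        words.update(line.split(' '))

#write the vocabulary, putting the special PAD and UNK tokens first
with open('data/words.txt', 'w') as f:
    for w in ['PAD', 'UNK'] + sorted(words):
        f.write(w + '\n')
```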

Loading the text files

In NLP applications, words are represented by numbers.
Suppose the vocabulary is {'is':1, 'John':2, 'Where':3, '.':4, '?':5}; then "Where is John ?" is represented as [3, 1, 2, 5].
We read the words.txt vocabulary and map each word to an index.
words.txt contains two special tokens: UNK, which stands in for words that are not in the vocabulary, and PAD, which is used when padding sentences.

```python
#build the word -> index mapping from words.txt
vocab = {}
with open(words_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i
```

Do the same for tags.txt.
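A minimal sketch of that step, assuming a hypothetical tags_path pointing at the generated tags.txt, produces the tag_map used below:

```python
#build the tag -> index mapping from tags.txt (tags_path is a placeholder)
tag_map = {}
with open(tags_path) as f:
    for i, t in enumerate(f.read().splitlines()):
        tag_map[t] = i
```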

Next, read the text files and convert each sentence and label sequence to a list of indices:

```python
train_sentences = []
train_labels = []

with open(train_sentences_file) as f:
    for sentence in f.read().splitlines():
        #replace each token by its index if it is in vocab
        #else use index of UNK
        s = [vocab[token] if token in vocab
             else vocab['UNK']
             for token in sentence.split(' ')]
        train_sentences.append(s)

with open(train_labels_file) as f:
    for sentence in f.read().splitlines():
        #replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        train_labels.append(l)
```
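As a quick illustrative sanity check (not part of the original tutorial), every sentence should have exactly one label per token:

```python
#each label sequence must line up one-to-one with its sentence
assert len(train_sentences) == len(train_labels)
for s, l in zip(train_sentences, train_labels):
    assert len(s) == len(l)
```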

Batch preprocessing

Sentences in a batch can have different lengths, so shorter sentences need to be padded with PAD.
Let's say we have a batch of sentences batch_sentences that is a Python list of lists, with its corresponding batch_tags, which has a tag for each token in batch_sentences:

1. First compute the length of the longest sentence in the batch; shorter sentences are padded with PAD.
2. Then initialize the batch with shape (num_sentences, batch_max_len). Since the Embedding layer expects long-type inputs, convert the batch to a LongTensor.

```python
import numpy as np
import torch
from torch.autograd import Variable

#compute length of longest sentence in batch
batch_max_len = max([len(s) for s in batch_sentences])

#prepare a numpy array with the data, initializing the data with 'PAD'
#and all labels with -1; initializing labels to -1 differentiates tokens
#with tags from 'PAD' tokens
batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))

#copy the data to the numpy array
for j in range(len(batch_sentences)):
    cur_len = len(batch_sentences[j])
    batch_data[j][:cur_len] = batch_sentences[j]
    batch_labels[j][:cur_len] = batch_tags[j]

#since all data are indices, we convert them to torch LongTensors
batch_data, batch_labels = torch.LongTensor(batch_data), torch.LongTensor(batch_labels)

#convert Tensors to Variables
batch_data, batch_labels = Variable(batch_data), Variable(batch_labels)
```
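In the full pipeline this padding logic usually sits inside a batch generator. Here is a minimal sketch under that assumption; the name data_iterator, the shuffling policy, and the batch_size argument are illustrative rather than the tutorial's exact API:

```python
import random

def data_iterator(sentences, labels, batch_size):
    #yield padded (batch_data, batch_labels) LongTensor pairs, one batch at a time
    order = list(range(len(sentences)))
    random.shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch_sentences = [sentences[i] for i in idx]
        batch_tags = [labels[i] for i in idx]
        batch_max_len = max(len(s) for s in batch_sentences)
        batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len))
        batch_labels = -1*np.ones((len(batch_sentences), batch_max_len))
        for j in range(len(batch_sentences)):
            cur_len = len(batch_sentences[j])
            batch_data[j][:cur_len] = batch_sentences[j]
            batch_labels[j][:cur_len] = batch_tags[j]
        yield Variable(torch.LongTensor(batch_data)), Variable(torch.LongTensor(batch_labels))
```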

Recurrent neural network

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, params):
        super(Net, self).__init__()
        #maps each token to an embedding_dim vector
        self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim)
        #the LSTM takes the embedded sentence as input
        self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True)
        #fc layer transforms the LSTM output into the final per-tag scores
        self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags)

    def forward(self, s):
        #apply the embedding layer that maps each token to its embedding
        s = self.embedding(s)   # dim: batch_size x batch_max_len x embedding_dim
        #run the LSTM along the sentences of length batch_max_len
        s, _ = self.lstm(s)     # dim: batch_size x batch_max_len x lstm_hidden_dim
        #reshape the Variable so that each row contains one token
        s = s.view(-1, s.shape[2])  # dim: batch_size*batch_max_len x lstm_hidden_dim
        #apply the fully connected layer and obtain the output for each token
        s = self.fc(s)          # dim: batch_size*batch_max_len x num_tags
        return F.log_softmax(s, dim=1)   # dim: batch_size*batch_max_len x num_tags
```
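A quick usage sketch; the Params container and the hyperparameter values below are made up for illustration (in practice they would come from your own configuration):

```python
from types import SimpleNamespace

#illustrative hyperparameters, not the tutorial's settings
params = SimpleNamespace(vocab_size=len(vocab), embedding_dim=50,
                         lstm_hidden_dim=50, number_of_tags=len(tag_map))

model = Net(params)
outputs = model(batch_data)   # dim: batch_size*batch_max_len x num_tags
```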

Custom loss function

The main point is to remove the contribution of the PAD tokens.

```python
def loss_fn(outputs, labels):
    #reshape labels to give a flat vector of length batch_size*seq_len
    labels = labels.view(-1)

    #mask out 'PAD' tokens (their label was initialized to -1)
    mask = (labels >= 0).float()

    #the number of real tokens is the sum of elements in mask
    num_tokens = int(torch.sum(mask).data[0])

    #pick the log-probability of the correct label for each token and zero out 'PAD' positions
    #(a -1 label indexes the last class, but the mask removes its contribution)
    outputs = outputs[range(outputs.shape[0]), labels]*mask

    #cross entropy loss averaged over all non 'PAD' tokens
    return -torch.sum(outputs)/num_tokens
```
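Putting the pieces together, a minimal training-loop sketch; the optimizer, learning rate, epoch count, and batch size are illustrative assumptions rather than the tutorial's settings:

```python
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=1e-3)   # illustrative choice

for epoch in range(10):   # illustrative number of epochs
    for batch_data, batch_labels in data_iterator(train_sentences, train_labels, batch_size=32):
        outputs = model(batch_data)             # dim: batch_size*batch_max_len x num_tags
        loss = loss_fn(outputs, batch_labels)   # 'PAD' positions are masked out
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```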

References:
https://cs230.stanford.edu/blog/namedentity/#goals-of-this-tutorial
https://github.com/cs230-stanford/cs230-code-examples
