BERT-based Data Preprocessing in NLP

1. Chinese Named Entity Recognition

1. Raw data format

海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O


日 B-LOC
俄 B-LOC
两 O
国 O
国 O
内 O
政 O
局 O
都 O
充 O
满 O
变 O
数 O
, O
尽 O
管 O
日 B-LOC
俄 B-LOC
关 O
系 O
目 O
前 O
是 O
历 O
史 O
最 O
佳 O
时 O
期 O
, O
但 O
其 O
脆 O
弱 O
性 O
不 O
言 O
自 O
明 O
。 O

2. Reading the data

Read the text and its corresponding labels into two aligned lists, as shown below:

['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

3. Converting the data into BERT's input format

First, build a label vocabulary:

VOCAB = ("[PAD]", "O", "B-TIM", "I-TIM", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]")

Here [PAD] is a padding symbol; you can name it whatever you like (defining it as 'x' would work just as well). The dictionary built from VOCAB is:

{0: '[PAD]', 1: 'O', 2: 'B-TIM', 3: 'I-TIM', 4: 'B-PER', 5: 'I-PER', 6: 'B-ORG', 7: 'I-ORG', 8: 'B-LOC', 9: 'I-LOC', 10: '[CLS]', 11: '[SEP]'}
Then, using this dictionary, produce the inputs BERT expects:
input_ids: the text converted to the corresponding token IDs in BERT's vocabulary
input_mask: the attention mask; 1 marks real tokens (the [CLS] and [SEP] positions count as well), 0 marks padding
segment_ids: the token_type_ids that tell the two sentences of a pair apart (0 for the first sentence, 1 for the second); with a single sentence they are all 0
label_ids: the position of each entity label in the label dictionary

  • Add a [CLS] at the start and a [SEP] at the end of the input sentence and convert them to IDs; in BERT's vocabulary [CLS] is 101 and [SEP] is 102
  • Add [CLS] and [SEP] to the labels as well; in the label dictionary [CLS] maps to 10 and [SEP] to 11
    Converted into BERT's input format, the example looks like this:
'input_ids': [101, 3862, 7157, 3683, 6612, 1765, 4157, 1762, 1336, 7305, 680, 7032, 7305, 722, 7313, 4638, 3862, 1818, 511, 102]
'input_mask':[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
'segment_ids':[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
'label_ids':[10, 1, 1, 1, 1, 1, 1, 1, 8, 9, 1, 8, 9, 1, 1, 1, 1, 1, 1, 11]
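
As a quick sanity check (not part of the original article), the example above can be reproduced with a few lines. This is only a sketch and assumes the bert-base-chinese checkpoint is available at ./bert-base-chinese, as in the code of section 5:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./bert-base-chinese')

# the label dictionary from above
VOCAB = ("[PAD]", "O", "B-TIM", "I-TIM", "B-PER", "I-PER",
         "B-ORG", "I-ORG", "B-LOC", "I-LOC", "[CLS]", "[SEP]")
tag2idx = {tag: idx for idx, tag in enumerate(VOCAB)}

chars = ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与',
         '金', '门', '之', '间', '的', '海', '域', '。']
tags = ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O',
        'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

# encode_plus adds [CLS]/[SEP] automatically, so do the same for the labels
encoded = tokenizer.encode_plus(chars)
label_ids = [tag2idx['[CLS]']] + [tag2idx[t] for t in tags] + [tag2idx['[SEP]']]

# tokens and labels must stay aligned one-to-one
assert len(encoded['input_ids']) == len(label_ids)
print(encoded['input_ids'])   # [101, 3862, 7157, ..., 511, 102]
print(label_ids)              # [10, 1, 1, ..., 1, 11]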

4. Padding the data

The sequences fed to the BERT model within a batch must all have the same length, and the length cannot exceed 512 tokens.
input_ids: append 0s to the end of the list; the sentence itself is padded with [PAD], which sits at position 0 in BERT's vocabulary
input_mask: pad with 0
segment_ids: pad with 0
label_ids: pad with 0; in my own tag dictionary the padding symbol is [PAD], which also sits at position 0 (this is configurable)
I set the maximum length to 100; padded to a uniform length, the result is as follows:

input_ids=[101, 3862, 7157, 3683, 6612, 1765, 4157, 1762, 1336, 7305, 680, 7032, 7305, 722, 7313, 4638, 3862, 1818, 511, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_mask=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
segment_ids=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_ids=[10, 1, 1, 1, 1, 1, 1, 1, 8, 9, 1, 8, 9, 1, 1, 1, 1, 1, 1, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
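
One caveat about the padding lambda used in section 5: it only appends zeros, so a sequence longer than max_len would silently keep its original length. In practice the sentences are already truncated to MAX_LEN when loaded, but a slightly more defensive helper (a sketch, not the author's original code) could pad or truncate in one place:

def pad_or_truncate(seq, max_len, pad_id=0):
    # pad with pad_id up to max_len, or cut the sequence down if it is too long
    if len(seq) >= max_len:
        return seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

# example with max_len = 25 for readability
input_ids = [101, 3862, 7157, 3683, 6612, 1765, 4157, 1762, 1336, 7305,
             680, 7032, 7305, 722, 7313, 4638, 3862, 1818, 511, 102]
print(pad_or_truncate(input_ids, 25))
# [101, 3862, ..., 511, 102, 0, 0, 0, 0, 0]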

5. Step-by-step code

To make the process easier to follow, I wrapped each step in its own function. The code is as follows:

from transformers import BertTokenizer
bert_model='./bert-base-chinese'
tokenizer=BertTokenizer.from_pretrained(bert_model)
# custom label vocabulary
VOCAB = (
    "[PAD]",
    "O",
    'B-TIM',
    'I-TIM',
    "B-PER",
    "I-PER",
    "B-ORG",
    "I-ORG",
    "B-LOC",
    "I-LOC",
    "[CLS]",
    "[SEP]"
)
# label-to-index and index-to-label mappings
tag2idx = {tag: idx for idx, tag in enumerate(VOCAB)}
idx2tag = {idx: tag for idx, tag in enumerate(VOCAB)}
MAX_LEN=510

# define a class to hold the processed data
class Input_futures(object):
    def __init__(self,input_ids,input_mask,segment_ids,label_ids):
        self.input_ids=input_ids
        self.input_mask=input_mask
        self.segment_ids=segment_ids
        self.label_ids=label_ids

class Input_example(object):
    def __init__(self, input_ids, input_mask, segment_ids, label_ids):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_ids = label_ids


def load_data(file_path):
    with open(file_path,'r',encoding='utf-8')as f:
        lines=f.read().strip().split('\n\n')
    sents, tags_li = [], []  # list of lists

    for i, entry in enumerate(lines):
        words = [line.split()[0] for line in entry.splitlines()]
        tags = ([line.split()[-1] for line in entry.splitlines()])
        sents.append(words[:MAX_LEN])
        tags_li.append(tags[:MAX_LEN])
    return zip(sents,tags_li)

def convert_to_feature(entitys):
    input_ids=[]
    input_mask=[]
    segment_ids=[]
    label_ids=[]
    for entity in entitys:
        input=tokenizer.encode_plus(entity[0])
        # prepend [CLS] and append [SEP] to the label sequence
        label=list(entity[1])
        label.insert(0,'[CLS]')
        label.append('[SEP]')
        label_id=[tag2idx[each] for each in label]
        input_ids.append(input['input_ids'])
        input_mask.append(input['attention_mask'])
        segment_ids.append(input['token_type_ids'])
        label_ids.append(label_id)
    feature = Input_futures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids
    )
    return feature

def convert_to_example(feature, max_len):
    f = lambda samples, max_len: [sample + [0] * (max_len - len(sample)) for sample in samples]
    input_ids=f(feature.input_ids,max_len)
    input_mask=f(feature.input_mask,max_len)
    segment_ids=f(feature.segment_ids,max_len)
    label_ids=f(feature.label_ids,max_len)
    example=Input_example(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=label_ids
    )
    return example

if __name__ == '__main__':
    print('Start'.center(50, '#'))
    # 1. read the data
    path = './data/train.txt'
    entitys = load_data(path)
    # 2. convert the data into BERT's input format
    feature = convert_to_feature(entitys)
    # 3. pad the data
    example = convert_to_example(feature, 100)
    print('End'.center(50, '#'))

6. Wrapping everything into a Dataset class

Writing this code again every time is tedious, so I wrapped it into a Dataset class. As long as the data is prepared in the required format, the class can be used directly, which saves rewriting the code.

  • Preliminary work
    Package the data into the following format and save it as a pkl file
['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

Packaging the data:

def load_data(file_path):
    with open(file_path,'r',encoding='utf-8')as f:
        lines=f.read().strip().split('\n\n')
    sents, tags_li = [], []  # list of lists

    for i, entry in enumerate(lines):
        words = [line.split()[0] for line in entry.splitlines()]
        tags = ([line.split()[-1] for line in entry.splitlines()])
        sents.append(words[:MAX_LEN])
        tags_li.append(tags[:MAX_LEN])
    return zip(sents,tags_li)

Note: the data is packed into a zip object here, so when it is read back each sent and its tags come out together as one item.
Create two utility functions, one to save data in pkl format and one to read it back:

import pickle as pkl
import codecs

def save_pkl(path, obj):
    print(f'save  in {path}')
    with codecs.open(path, 'wb') as f:
        pkl.dump(obj, f)

def load_pkl(path):
    print(f'load in {path}')
    with codecs.open(path,'rb') as f:
        data = pkl.load(f)
    return data

Once the two steps above are done and the data is saved in the specified format, the wrapped code below can be dropped straight into a project, which saves a lot of repeated work.
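
The preliminary step itself (turning train.txt into train.pkl) is not spelled out above; here is a minimal sketch, assuming the load_data and save_pkl functions shown in this section are in scope (for example, placed in the same script):

# Materialize the zip into a list first: a zip object is a one-shot iterator,
# while NerDataset below indexes the samples repeatedly.
entities = list(load_data('./data/train.txt'))

# each item is (['海', '钓', ...], ['O', 'O', ...])
save_pkl('./data/train.pkl', entities)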

  • Wrapping the Dataset class
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer
from utils.util import load_pkl
bert_model='./bert-base-chinese'
tokenizer=BertTokenizer.from_pretrained(bert_model)

class NerDataset(Dataset):
    def __init__(self,file_path,tag2idx):
        self.tag2idx=tag2idx
        self.entities=list(load_pkl(file_path))

    def __getitem__(self, item):
        entity=self.entities[item]
        feature = {}
        input = tokenizer.encode_plus(entity[0])
        # prepend [CLS] and append [SEP] to the label sequence
        label = list(entity[1])  # copy the list so repeated epochs do not insert again
        label.insert(0, '[CLS]')
        label.append('[SEP]')
        label_id = [self.tag2idx[each] for each in label]
        feature['input_ids']=input['input_ids']
        feature['attention_mask']=input['attention_mask']
        feature['token_type_ids']=input['token_type_ids']
        feature['label_ids']=label_id
        feature['len']=len(input['input_ids'])
        return feature

    def __len__(self):
        return len(self.entities)

def collate_fn(batch):
    all_input_ids = []
    all_attention_mask = []
    all_token_type_ids = []
    all_labels = []
    # compute the maximum length within this batch
    lens = [data['len'] for data in batch]
    max_len=max(lens)
    f = lambda feature, max_len: feature + [0] * (max_len - len(feature))
    for feature in batch:
        input_ids = f(feature['input_ids'], max_len)
        attention_mask = f(feature['attention_mask'], max_len)
        token_type_ids = f(feature['token_type_ids'], max_len)
        label_ids = f(feature['label_ids'], max_len)
        all_input_ids.append(input_ids)
        all_attention_mask.append(attention_mask)
        all_token_type_ids.append(token_type_ids)
        all_labels.append(label_ids)
    # finally convert everything to tensors and return
    return torch.tensor(all_input_ids),torch.tensor(all_token_type_ids),torch.tensor(all_attention_mask),torch.tensor(all_labels)
  • Usage
from torch.utils.data import DataLoader
from NER.NerDataSet import NerDataset, collate_fn
VOCAB = (
    "[PAD]",
    "O",
    'B-TIM',
    'I-TIM',
    "B-PER",
    "I-PER",
    "B-ORG",
    "I-ORG",
    "B-LOC",
    "I-LOC",
    "[CLS]",
    "[SEP]"
)
# label-to-index and index-to-label mappings
tag2idx = {tag: idx for idx, tag in enumerate(VOCAB)}
idx2tag = {idx: tag for idx, tag in enumerate(VOCAB)}
MAX_LEN=510
if __name__=="__main__":
    print("程序开始".center(40,'#'))
    save_path = './data/train.pkl'

    train_data = NerDataset(file_path=save_path,tag2idx=tag2idx)
    train_iter=DataLoader(dataset=train_data,
                          batch_size=4,
                          shuffle=True,
                          collate_fn=collate_fn)

    for i,batch in enumerate(train_iter):
        print(f'batch {i}')
        input_ids, token_type_ids, attention_mask, labels_ids = batch
        print(input_ids)
        print(token_type_ids)
        print(attention_mask)
        print(labels_ids)
  • Results
    Since batch_size is set to 4, each batch contains exactly four samples, and the samples come out in shuffled order.
batch 0
tensor([[ 101, 1046, 7716, 2209, 4638, 1957, 1036, 6375, 2025,  791, 2399, 6438,
          758, 2399, 5277, 8024, 1961, 2792, 1762, 4638, 4408,  677, 3300,  124,
          121, 1914, 1399, 1398, 2110, 8024, 6421, 4408, 4638,  100, 2157, 1999,
          833,  100, 4507,  122,  121, 1399, 2157, 7270, 5299, 2768,  511,  102,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0],
        [ 101, 3189,  915,  697, 1744, 1744, 1079, 3124, 2229, 6963, 1041, 4007,
         1359, 3144, 8024, 2226, 5052, 3189,  915, 1068, 5143, 4680, 1184, 3221,
         1325, 1380, 3297,  881, 3198, 3309, 8024,  852, 1071, 5546, 2483, 2595,
          679, 6241, 5632, 3209,  511,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0],
        [ 101,  852,  868,  711,  671,  702, 1066,  772, 1054, 1447,  510,  782,
         3696, 1062,  789, 8024, 2418, 2496, 5541, 2577, 2160, 7333, 8024, 4696,
         3633,  976, 1168,  100, 1044, 1921,  678,  722, 2569, 5445, 2569, 8024,
         1400, 1921,  678,  722,  727, 5445,  727,  100, 8024, 3909, 1265,  702,
          782, 4638, 1399, 1164, 2533, 1927, 1469, 2143, 6802, 2650, 1599, 8024,
         2828, 3121, 7484, 1920,  689, 3030, 1762, 7674,  855, 8024, 6821, 3416,
         2798, 5543, 6631, 6632, 5632, 2769, 8024, 3030, 5564,  686,  921, 8024,
         3300, 2792,  868,  711,  511,  102],
        [ 101, 3763, 4294, 7339, 3136, 5298,  877, 7440, 2861, 8038,  697, 3118,
         7339, 6963, 2682, 5526, 8024, 1728, 3634, 6963,  868, 1139,  749, 3297,
         1920, 4638, 1222, 1213,  511,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[10,  4,  5,  5,  1,  1,  1,  4,  5,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  6,  7,
          7,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 11,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [10,  8,  8,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  8,
          8,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1, 11,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [10,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 11],
        [10,  6,  7,  7,  1,  1,  4,  5,  5,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 11,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
batch 1
tensor([[ 101, 6821, 2429,  898, 2255,  988, 3717, 4638, 1300, 4289, 7667, 4507,
         1744, 1079,  671, 3837, 4638, 6392, 6369, 2360,  712, 2898, 6392, 6369,
         8024, 3146,  702, 2456, 5029, 5408, 5125, 5401, 5445, 2612, 2131,  511,
          102],
        [ 101, 3862, 7157, 3683, 6612, 1765, 4157, 1762, 1336, 7305,  680, 7032,
         7305,  722, 7313, 4638, 3862, 1818,  511,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0],
        [ 101, 1762, 1355, 6809, 1744, 2157, 8024, 2593, 3131,  924, 7372, 1282,
         1146, 3249, 1350, 8024, 2347, 2768,  711, 4852,  833,  924, 7397,  860,
         5143, 4638, 7028, 6206, 5299, 2768, 6956, 1146,  511,  102,    0,    0,
            0],
        [ 101, 1346, 1217, 3635, 6121, 4638, 3300, 4511, 3300, 1957, 8024, 3300,
         2399, 6768,  782, 8024,  738, 3300,  704, 2399,  782,  511,  102,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[10,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         11],
        [10,  1,  1,  1,  1,  1,  1,  1,  8,  9,  1,  8,  9,  1,  1,  1,  1,  1,
          1, 11,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0],
        [10,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 11,  0,  0,
          0],
        [10,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1, 11,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0]])

7. Summary

In a named entity recognition task every character gets its own label, so when padding the data the labels have to be padded as well. BERT's input format itself is fixed, so the other tasks below go through the same preprocessing steps; the only difference is the ground-truth targets.
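
As an illustration only (not part of the original article), the four padded tensors produced by the DataLoader above already match the signature of a Hugging Face token-classification model; num_labels is simply len(VOCAB). In real training the padded label positions would usually be set to -100 so they are ignored by the loss, which this sketch skips:

from transformers import BertForTokenClassification

# sketch: same bert-base-chinese checkpoint and 12-entry VOCAB as above,
# and train_iter is the DataLoader built in the usage example
model = BertForTokenClassification.from_pretrained('./bert-base-chinese', num_labels=12)

for input_ids, token_type_ids, attention_mask, label_ids in train_iter:
    outputs = model(input_ids=input_ids,
                    token_type_ids=token_type_ids,
                    attention_mask=attention_mask,
                    labels=label_ids)
    print(outputs.loss, outputs.logits.shape)  # logits: (batch, seq_len, num_labels)
    break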

2. Text Classification

To be added later.

3. Multi-label Text Classification

To be added later.

4. Relation Extraction

To be added later.
