tokenizer.batch_encode_plus

The comment after each print call shows its output.

from transformers import BertTokenizer

# Load the tokenizer from a local copy of bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('C:\\Users\\lgy\\Desktop\\fsdownload\\bert-base-uncased')
print(tokenizer.mask_token) # [MASK]
print(tokenizer.convert_tokens_to_ids('a')) # 1037
print(tokenizer.convert_ids_to_tokens(1037)) # a

string = "test batch encode plus"
strings = [string, string]
tokens = tokenizer.tokenize(string)
print(tokens) # ['test', 'batch', 'en', '##code', 'plus']
# Truncate sequences longer than max_length and pad shorter ones up to it
out = tokenizer.batch_encode_plus(strings, max_length=10, padding='max_length', truncation='longest_first')
print(out)
# {'input_ids': [[101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0], [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]],
#  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
#  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]}
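
To make the pad-and-truncate behavior in the output above concrete, here is a minimal pure-Python sketch. It is not the library's implementation: pad_or_truncate is a hypothetical helper, the ids 101 ([CLS]), 102 ([SEP]) and 0 (padding) are read off the output above, and it assumes a single sequence with max_length of at least 2 and truncation from the right.

```python
from typing import Dict, List

# Special-token ids as they appear in the output above (assumed, not queried from the tokenizer)
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_or_truncate(content_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Hypothetical helper mimicking padding='max_length' plus truncation for one sequence."""
    # Reserve two slots for [CLS] and [SEP], then cut any overflow from the right.
    kept = content_ids[: max(max_length - 2, 0)]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    n_real = len(input_ids)
    n_pad = max(max_length - n_real, 0)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "token_type_ids": [0] * (n_real + n_pad),
        # 1 marks real tokens, 0 marks padding
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


# Ids of 'test', 'batch', 'en', '##code', 'plus' from the output above
content = [3231, 14108, 4372, 16044, 4606]
padded = pad_or_truncate(content, 10)
truncated = pad_or_truncate(content, 5)
print(padded["input_ids"])     # [101, 3231, 14108, 4372, 16044, 4606, 102, 0, 0, 0]
print(truncated["input_ids"])  # [101, 3231, 14108, 4372, 102]
```

With max_length=10 the seven real ids are followed by three padding ids, matching the output above. With max_length=5 the sketch drops the last two content tokens but keeps [CLS] and [SEP].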
