Basic usage of the DistilBERT tokenizer

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumption: the dataset is CoNLL-2003, which matches the sample sentence in the output below
datasets = load_dataset("conll2003")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokens = datasets["train"][4]['tokens']   # list of words
token_strings = ' '.join(tokens)          # raw sentence string

print('sentence:', token_strings, end='\n\n')
print('after tokenizer:', tokenizer(token_strings), end='\n\n')
print('word ids:', tokenizer(token_strings).word_ids(), end='\n\n')
print('tokenize result:', tokenizer.tokenize(token_strings), end='\n\n')
# Differs from input_ids only by the special tokens at the start and end
print('tokens converted to ids:', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(token_strings)))

Output

sentence: Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .

after tokenizer: {'input_ids': [101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

word ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 16, 17, 18, 19, 20, 20, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, None]

tokenize result: ['germany', "'", 's', 'representative', 'to', 'the', 'european', 'union', "'", 's', 'veterinary', 'committee', 'werner', 'z', '##wing', '##mann', 'said', 'on', 'wednesday', 'consumers', 'should', 'buy', 'sheep', '##me', '##at', 'from', 'countries', 'other', 'than', 'britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.']

tokens converted to ids: [2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012]
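
The comment in the code notes that these ids differ from input_ids only by the special tokens added at the start and end. A minimal check of that claim (a sketch, assuming the same distilbert-base-uncased tokenizer loaded above):

# ids 101 and 102 in this uncased BERT-style vocabulary are the
# [CLS] and [SEP] special tokens wrapped around every encoded sentence
print(tokenizer.convert_ids_to_tokens([101, 102]))     # ['[CLS]', '[SEP]']
print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # 101 102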
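
The word_ids() output is what makes this mapping useful for token classification: repeated indices (13, 13, 13 for 'z', '##wing', '##mann') mark subwords of the same original word, and None marks the special tokens. Below is a hedged sketch of the usual label-alignment pattern, assuming the dataset is CoNLL-2003 so each example carries per-word ner_tags:

# Sketch: align word-level NER labels to subword tokens via word_ids().
# Special tokens (word id None) get -100 so the loss ignores them;
# each subword inherits the label of the word it was split from.
labels = datasets["train"][4]["ner_tags"]   # one label per word
encoding = tokenizer(tokens, is_split_into_words=True)
aligned_labels = [
    -100 if word_id is None else labels[word_id]
    for word_id in encoding.word_ids()
]
print(aligned_labels)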
