Most models in transformers expect the same kinds of inputs.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sequence = "A Titan RTX has 24GB of VRAM"
# The tokenizer splits the sequence into tokens according to its vocabulary,
# using a subword algorithm rather than simple whitespace splitting:
# for example, "RTX" is split into "rt" and "##x".
tokenized_sequence = tokenizer.tokenize(sequence)
# Calling the tokenizer directly returns all the inputs the corresponding model needs:
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
inputs = tokenizer(sequence)
# input_ids are the numeric token indices; decode() maps them back to a string
encoded_sequence = inputs["input_ids"]
decoded_sequence = tokenizer.decode(encoded_sequence)
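Printing the intermediate values makes each step visible (a quick sketch; the exact subword splits and ids depend on the bert-base-uncased vocabulary, so the values in the comments are only indicative):
print(tokenized_sequence)  # e.g. ['a', 'titan', 'rt', '##x', 'has', '24', '##gb', 'of', 'vr', '##am']
print(encoded_sequence)    # the corresponding vocabulary indices, with the [CLS]/[SEP] ids added
print(decoded_sequence)    # roughly "[CLS] a titan rtx has 24gb of vram [SEP]"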
When sequences in a batch are padded to the same length, the attention mask tells the model which tokens should be attended to and which are just padding.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# sequence_a and sequence_b have different lengths
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
# padding is applied when tokenizing the batch
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
print(padded_sequences["input_ids"])
# in the attention_mask for sequence_a, the trailing 0s mark the padded positions that should not be attended to
print(padded_sequences["attention_mask"])
Some models take a pair of sentences as input, for example sequence-pair classification (such as similarity judgments) or question answering. The input format is:
# [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
# in token_type_ids the tokens of the first sentence are marked 0 and those of the second sentence 1
print(encoded_dict['token_type_ids'])
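Decoding shows the [CLS]/[SEP] layout, and such a pair is exactly what a question-answering head consumes; a rough sketch, assuming the SQuAD-fine-tuned checkpoint bert-large-uncased-whole-word-masking-finetuned-squad is available (for QA the question normally goes first and the context second):
print(decoded)  # roughly "[CLS] huggingface is based in nyc [SEP] where is huggingface based? [SEP]"
import torch
from transformers import BertForQuestionAnswering
qa_name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed QA checkpoint
qa_tokenizer = BertTokenizer.from_pretrained(qa_name)
qa_model = BertForQuestionAnswering.from_pretrained(qa_name)
qa_inputs = qa_tokenizer(sequence_b, sequence_a, return_tensors="pt")  # question first, context second
with torch.no_grad():
    qa_outputs = qa_model(**qa_inputs)
# take the most likely start/end positions and decode that answer span
start = int(qa_outputs.start_logits.argmax())
end = int(qa_outputs.end_logits.argmax())
print(qa_tokenizer.decode(qa_inputs["input_ids"][0][start:end + 1]))  # expected to be roughly "nyc"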
If the position_ids argument is not provided, the model creates it automatically, so we usually do not need to worry about it.
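They can still be passed explicitly if needed; a minimal sketch, assuming PyTorch and the encoded_dict from above, where position_ids is simply the absolute positions 0, 1, 2, ...:
import torch
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-uncased")
input_ids = torch.tensor([encoded_dict["input_ids"]])
token_type_ids = torch.tensor([encoded_dict["token_type_ids"]])
# the same default the model would build on its own when position_ids is omitted
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)
with torch.no_grad():
    outputs = model(input_ids=input_ids, token_type_ids=token_type_ids, position_ids=position_ids)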
Reference:
https://huggingface.co/transformers/glossary.html#input-ids