Notes on Reading the Source of tokenizer.encode() in the Transformers Package

1 Introduction

  • Hugging Face's transformers package makes it extremely easy to pull in pretrained models: BERT, ALBERT, GPT-2, and so on.
    from transformers import BertTokenizer, BertForTokenClassification
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForTokenClassification.from_pretrained('bert-base-uncased')
  • These few lines load the bert-base-uncased pretrained model together with BertForTokenClassification, the fine-tuning head for token classification tasks such as NER; it could hardly be more convenient.
  • To convert our sentence from tokens to ids, here is an example adapted from the official docs:
import torch

input_ids = torch.tensor(
    tokenizer.encode("Hello, my dog is cuting",
                     add_special_tokens=True)).unsqueeze(0)  # add a batch dimension
  • This NER example uses the tokenizer's encode method, which raises two questions:
    • What is the difference between encode and tokenize? What does each method output?
    • What operations can encode perform on a sentence? Which arguments are available?


We will explore and sort these out one by one below.

2 Reading the Code

2.1 The difference between the encode and tokenize methods
  • Comparing code directly is the most intuitive way (the sentence being processed is made up, to show off the WordPiece behavior):
sentence = "Hello, my son is cuting."
input_ids_method1 = torch.tensor(
    tokenizer.encode(sentence, add_special_tokens=True))  # Batch size 1
    # tensor([ 101, 7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012,  102])

input_token2 = tokenizer.tokenize(sentence)
# ['hello', ',', 'my', 'son', 'is', 'cut', '##ing', '.']
input_ids_method2 = tokenizer.convert_tokens_to_ids(input_token2)
# tensor([7592, 1010, 2026, 2365, 2003, 3013, 2075, 1012])
# 并没有开头和结尾的标记:[cls]、[sep]
  • As the example shows, the encode method produces model-ready input in a single step.
  • By contrast, tokenize only performs tokenization (splitting into WordPiece tokens), and after tokenizing you still have to call convert_tokens_to_ids by hand, which is more cumbersome.
  • Reading the source shows that encode calls tokenize internally, so by setting encode's arguments we can convert raw data into trainable form in one step, as the sketch below verifies. The following introduces encode's arguments and what they do.
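  • A quick check of that equivalence (a minimal sketch reusing the sentence and bert-base-uncased tokenizer from above; [CLS] and [SEP] are BERT's special tokens, with ids 101 and 102):
ids_manual = tokenizer.convert_tokens_to_ids(
    ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]'])
# encode == tokenize -> convert_tokens_to_ids, wrapped in the special tokens
assert ids_manual == tokenizer.encode(sentence, add_special_tokens=True)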
2.2 The arguments of tokenizer.encode()
  • The source:
    def encode(
        self,
        text: str,  # the sentence to convert
        text_pair: Optional[str] = None,   
        add_special_tokens: bool = True, 
        max_length: Optional[int] = None,  
        stride: int = 0,
        truncation_strategy: str = "longest_first",
        pad_to_max_length: bool = False,
        return_tensors: Optional[str] = None,
        **kwargs
    ):
        """
        Converts a string in a sequence of ids (integer), using the tokenizer and vocabulary.

        Same as doing ``self.convert_tokens_to_ids(self.tokenize(text))``.

        Args:
            text (:obj:`str` or :obj:`List[str]`):
                The first sequence to be encoded. This can be a string, a list of strings (tokenized string using
                the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
                method)
            text_pair (:obj:`str` or :obj:`List[str]`, `optional`, defaults to :obj:`None`):
                Optional second sequence to be encoded. This can be a string, a list of strings (tokenized
                string using the `tokenize` method) or a list of integers (tokenized string ids using the
                `convert_tokens_to_ids` method)
            add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`):
                If set to ``True``, the sequences will be encoded with the special tokens relative
                to their model.
            max_length (:obj:`int`, `optional`, defaults to :obj:`None`):
                If set to a number, will limit the total sequence returned so that it has a maximum length.
                If there are overflowing tokens, those will be added to the returned dictionary
            stride (:obj:`int`, `optional`, defaults to ``0``):
                If set to a number along with max_length, the overflowing tokens returned will contain some tokens
                from the main sequence returned. The value of this argument defines the number of additional tokens.
            truncation_strategy (:obj:`str`, `optional`, defaults to `longest_first`):
                String selected in the following options:

                - 'longest_first' (default) Iteratively reduce the inputs sequence until the input is under max_length
                  starting from the longest one at each token (when there is a pair of input sequences)
                - 'only_first': Only truncate the first sequence
                - 'only_second': Only truncate the second sequence
                - 'do_not_truncate': Does not truncate (raise an error if the input sequence is longer than max_length)
            pad_to_max_length (:obj:`bool`, `optional`, defaults to :obj:`False`):
                If set to True, the returned sequences will be padded according to the model's padding side and
                padding index, up to their max length. If no max length is specified, the padding is done up to the
                model's max length. The tokenizer padding sides are handled by the class attribute `padding_side`
                which can be set to the following strings:

                - 'left': pads on the left of the sequences
                - 'right': pads on the right of the sequences
                Defaults to False: no padding.
            return_tensors (:obj:`str`, `optional`, defaults to :obj:`None`):
                Can be set to 'tf' or 'pt' to return respectively TensorFlow :obj:`tf.constant`
                or PyTorch :obj:`torch.Tensor` instead of a list of python integers.
            **kwargs: passed to the `self.tokenize()` method
        """
        encoded_inputs = self.encode_plus(
            text,
            text_pair=text_pair,
            max_length=max_length,
            add_special_tokens=add_special_tokens,
            stride=stride,
            truncation_strategy=truncation_strategy,
            pad_to_max_length=pad_to_max_length,
            return_tensors=return_tensors,
            **kwargs,
        )

        return encoded_inputs["input_ids"]
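
  • As the last line shows, encode is a thin wrapper that extracts the "input_ids" entry from the dictionary returned by encode_plus. A small sketch (the exact set of keys depends on the transformers version):
enc = tokenizer.encode_plus(sentence, add_special_tokens=True)
print(enc.keys())        # e.g. dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(enc["input_ids"])  # the same list that tokenizer.encode(sentence) returns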

  • add_special_tokens: bool = True  Converts the sentence into the input form the model expects (for BERT, wrapping it in [CLS] ... [SEP]); enabled by default.
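    A quick comparison (a sketch; 101, 102, and 7592 are the bert-base-uncased ids for [CLS], [SEP], and 'hello'):
print(tokenizer.encode("hello", add_special_tokens=True))   # [101, 7592, 102]
print(tokenizer.encode("hello", add_special_tokens=False))  # [7592]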

  • max_length: Optional[int] = None  Sets the maximum length. If left unset, the model's own limit applies (512 for BERT), and a sentence longer than 512 tokens produces the error below:

Token indices sequence length is longer than the specified maximum sequence length for this model (5904 > 512). Running this sequence through the model will result in indexing errors

In that case we either have to cut the sentence down ourselves, or set this argument to the maximum length we want; the function will then keep only max_length - 2 tokens (leaving room for [CLS] and [SEP]) and convert them to ids.
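For example (a sketch; with the default 'longest_first' strategy the overflow is simply cut off):
long_sentence = "hello " * 600   # well past the 512-token limit
ids = tokenizer.encode(long_sentence, add_special_tokens=True, max_length=512)
print(len(ids))                  # 512 = [CLS] + 510 content tokens + [SEP]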

  • pad_to_max_length: bool = False
    Whether to pad sequences up to the maximum length; disabled by default. You can set tokenizer.padding_side = 'left' to insert the padding on the left instead of the (default) right.
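    For example (a sketch using the 2.x-era pad_to_max_length flag from the signature above; [PAD] has id 0 in bert-base-uncased):
tokenizer.padding_side = 'left'   # default is 'right'
ids = tokenizer.encode("hello", add_special_tokens=True,
                       max_length=8, pad_to_max_length=True)
print(ids)                        # [0, 0, 0, 0, 0, 101, 7592, 102]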

  • truncation_strategy: str = "longest_first"
    The truncation strategy; there are four ways to cut the input down to max_length:

    • 'longest_first' (default): iteratively remove one token at a time from whichever sequence is currently longer, until the input fits
    • 'only_first': truncate only the first sequence
    • 'only_second': truncate only the second sequence
    • 'do_not_truncate': do not truncate; raise an error if the input is longer than max_length
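
    For example, with a sentence pair, 'only_second' leaves the first sequence intact and cuts only text_pair (a sketch; the sentences are made up):
ids = tokenizer.encode("a short question",
                       text_pair="a very long passage " * 100,
                       add_special_tokens=True, max_length=32,
                       truncation_strategy="only_second")
print(len(ids))  # 32; the question survives intact, only the passage is truncated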

3 Closing Thoughts

  • Once you have read an API's source, that API leaves its imprint on all the code you write afterwards.
  • The barrier to entry for NLP keeps getting lower, and rapid model-assembly packages like Hugging Face's Transformers will only become more numerous and more convenient to use. In my view, beyond a large stockpile of models and algorithms, the core competence of an applied NLP engineer is to size up a task the way an old Chinese-medicine doctor examines a patient and immediately propose an experiment plan, land the algorithm fast, and get the project shipped. Since speed is the goal, we had better keep a few of these "crude but effective" kitchen-knife weapons at hand (however good your kung fu, you still fear the kitchen knife).
