When you need a DataCollator, and some common DataCollators

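DataCollator: if you do not specify one, a default DataCollator is used, and its job is to convert the inputs into tensors. (In Trainer, for example, the default is default_data_collator, or DataCollatorWithPadding when a tokenizer is passed.) The most common reason to specify one by hand is that the data has not been padded yet and needs dynamic padding, i.e. padding each batch to the length of its own longest example. In other words, if padding was already done in data_process and there are no other special requirements, you probably do not need to specify a collator at all.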
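To make "dynamic padding" concrete, here is a minimal plain-Python sketch of the idea. It is not the library's implementation: the function name `pad_batch`, the pad id `0`, and the token ids are assumptions chosen for illustration, and real collators return tensors rather than lists.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every example to the longest length found in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


# Two unpadded examples of different lengths (hypothetical token ids)
features = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
batch = pad_batch(features)
```

Because the target length is computed per batch, a batch of short examples stays short instead of being padded to a global maximum length.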

class DataCollatorForSeq2Seq: Data collator that will dynamically pad the inputs received, as well as the labels. (It treats the input and the output separately.)

class DataCollatorWithPadding: Data collator that will dynamically pad the inputs received.

class DataCollatorForTokenClassification: Data collator that will dynamically pad the inputs received, as well as the labels.
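Padding "the inputs as well as the labels", as the Seq2Seq and token-classification collators are described as doing, can be sketched in plain Python as follows. This is an illustration, not the library code: `pad_with_labels` and the token ids are hypothetical. The label pad value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, so padded label positions do not contribute to the loss.

```python
from typing import Dict, List

LABEL_PAD_ID = -100  # ignored by PyTorch's cross-entropy loss by default


def pad_with_labels(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,
    label_pad_id: int = LABEL_PAD_ID,
) -> Dict[str, List[List[int]]]:
    """Pad input_ids and labels separately, each to its own batch maximum."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "labels": []}
    for f in features:
        # Inputs are padded with the tokenizer's pad id
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * (max_in - len(f["input_ids"])))
        # Labels are padded with the ignore value, not with the pad id
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - len(f["labels"])))
    return batch


# Seq2seq-style examples: input and label lengths differ (hypothetical ids)
seq2seq_features = [
    {"input_ids": [5, 6, 7], "labels": [9]},
    {"input_ids": [5, 6], "labels": [9, 8, 7]},
]
padded = pad_with_labels(seq2seq_features)
```

For token classification the labels have one entry per input token, so both fields end up with the same padded length; for seq2seq the two lengths are independent, which is why each is padded to its own maximum here.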

class DataCollatorForLanguageModeling: Data collator used for language modeling. Inputs are dynamically padded to the maximum length of a batch if they are not all of the same length.

  args:
  ① mlm (`bool`, *optional*, defaults to `True`): Whether or not to use masked language modeling. If set to `False`, the labels are the same as the inputs with the padding tokens ignored (by setting them to -100). Otherwise, the labels are -100 for non-masked tokens and the value to predict for the masked token.
  ② mlm_probability (`float`, *optional*, defaults to 0.15): The probability with which to (randomly) mask tokens in the input, when `mlm` is set to `True`.

Tip: For best performance, this data collator should be used with a dataset having items that are dictionaries or BatchEncoding, with the `"special_tokens_mask"` key, as returned by a [`PreTrainedTokenizer`] or a [`PreTrainedTokenizerFast`] with the argument `return_special_tokens_mask=True`.
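The two `mlm` modes and the role of `special_tokens_mask` can be illustrated with a simplified plain-Python sketch. The function names, token ids, and mask id `103` are assumptions for illustration. Note also that this sketch always substitutes the mask token at a selected position, whereas the library's collator uses a more elaborate replacement scheme, so treat it as a picture of how the labels are built rather than as the actual implementation.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100


def causal_lm_labels(input_ids: List[int], pad_token_id: int) -> List[int]:
    """mlm=False: labels copy the inputs, with padding positions set to -100."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in input_ids]


def masked_lm_example(
    input_ids: List[int],
    special_tokens_mask: List[int],
    mask_token_id: int,
    mlm_probability: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """mlm=True (simplified): select non-special positions with probability mlm_probability."""
    rng = rng or random.Random()
    masked_inputs: List[int] = []
    labels: List[int] = []
    for tok, is_special in zip(input_ids, special_tokens_mask):
        if not is_special and rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)  # hide the token from the model
            labels.append(tok)                   # the model must predict the original token
        else:
            masked_inputs.append(tok)            # left unchanged
            labels.append(IGNORE_INDEX)          # no loss at this position
    return masked_inputs, labels


# Hypothetical padded example: 0 is the pad id; special and pad positions are marked with 1
ids = [101, 7592, 2088, 999, 102, 0, 0]
special = [1, 0, 0, 0, 1, 1, 1]

clm_labels = causal_lm_labels(ids, pad_token_id=0)
mlm_inputs, mlm_labels = masked_lm_example(ids, special, mask_token_id=103, rng=random.Random(0))
```

The special-tokens mask keeps positions such as the start, end, and padding tokens from ever being selected for masking, which is why the collator needs that information for each example.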
