The tokenizers decoders module

Module Introduction

The decoders module is responsible for converting ids back into readable text. The Decoder classes in the decoders module mainly undo the special characters introduced by the PreTokenizer classes in the pre_tokenizers module. For example, Metaspace in the pre_tokenizers module replaces spaces with the "▁" character, and Metaspace in the decoders module turns "▁" back into spaces. Likewise, ByteLevel in pre_tokenizers encodes spaces as the symbol "Ġ", and the corresponding ByteLevel in decoders decodes "Ġ" back into spaces.
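
For a quick look at this round trip, the standalone decoders can be applied directly to a list of string tokens through their decode() method (available in recent versions of the tokenizers library); the tokens below are hand-written for illustration:

>>> from tokenizers import decoders
>>> decoders.Metaspace().decode(["▁this", "▁is", "▁a", "▁text"])  # "▁" becomes a space again
this is a text
>>> decoders.ByteLevel().decode(["this", "Ġis", "Ġa", "Ġtext"])  # "Ġ" becomes a space again
this is a text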

What the decoders module provides are subclasses of Decoder. The official documentation describes Decoder as follows; in short, a Decoder is responsible for mapping the tokenized input back to the original string, and the decoder is usually chosen to match the PreTokenizer that was used earlier.

Decoding: In charge of mapping back a tokenized input to the original string. The decoder is usually chosen according to the PreTokenizer we used previously.

Module Usage

1. BPEDecoder
tokenizers.decoders.BPEDecoder(suffix = '</w>')

The BPEDecoder decoder merges subwords back into words and handles the "</w>" suffix appended to the end of each word, converting it into a space.

>>> from datasets import load_dataset
>>> from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors, trainers

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
>>> tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
>>> special_tokens = ["<unk>"]
>>> trainer = trainers.BpeTrainer(special_tokens=special_tokens,
                                  end_of_word_suffix="</w>")
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.decoder = decoders.BPEDecoder()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text !
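
The suffix handling can also be seen by applying the decoder directly to a list of tokens; the subword split below is hand-written for illustration and is not taken from the vocabulary trained above (a minimal sketch):

>>> decoders.BPEDecoder(suffix="</w>").decode(["to", "ken", "izer</w>", "is</w>", "fun</w>"])  # "</w>" marks word boundaries
tokenizer is fun
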
2. ByteLevel
tokenizers.decoders.ByteLevel()

The ByteLevel decoder is used together with the ByteLevel pre-tokenizer: it converts ids back into text and decodes the symbol "Ġ" back into spaces.

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")

>>> tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
>>> special_tokens = ["<unk>", "<pad>", "<s>", "</s>", "<mask>"]
>>> trainer = trainers.BpeTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.post_processor = processors.RobertaProcessing(sep=("</s>", tokenizer.token_to_id("</s>")),
	                                                        cls=("<s>", tokenizer.token_to_id("<s>")),
	                                                        trim_offsets=True,
	                                                        add_prefix_space=False)
>>> tokenizer.decoder = decoders.ByteLevel()

>>> tokenizer.encode("this is a text!").ids
[2, 256, 202, 305, 176, 4452, 5, 3]
>>> tokenizer.pre_tokenizer.pre_tokenize_str("this is a text!")
[('this', (0, 4)), ('Ġis', (4, 7)), ('Ġa', (7, 9)), ('Ġtext', (9, 14)), ('!', (14, 15))]
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!

3. CTC

tokenizers.decoders.CTC(pad_token = '<pad>', word_delimiter_token = '|', cleanup = True)

The CTC decoder handles CTC-style output (e.g. from speech models such as Wav2Vec2): consecutive duplicate tokens are collapsed, the padding token is removed, and the word delimiter token "|" that separates words is converted into a space.
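
A minimal sketch with hand-written character tokens, assuming the default pad_token "<pad>" and word delimiter "|":

>>> from tokenizers import decoders
>>> ctc = decoders.CTC()  # pad_token="<pad>", word_delimiter_token="|", cleanup=True
>>> ctc.decode(["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"])
hello world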

4. Metaspace

tokenizers.decoders.Metaspace(replacement="▁", add_prefix_space=True)

The Metaspace decoder is used together with the Metaspace pre-tokenizer: it converts ids back into text and turns the "▁" character back into spaces.

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.Unigram())
>>> tokenizer.normalizer = normalizers.Sequence(
	    [normalizers.Replace("``", '"'), normalizers.Replace("''", '"'), normalizers.Lowercase()]
	)
>>> tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.UnigramTrainer(special_tokens=special_tokens, unk_token="[UNK]")
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> cls_token_id = tokenizer.token_to_id("[CLS]")
>>> sep_token_id = tokenizer.token_to_id("[SEP]")
>>> tokenizer.post_processor = processors.TemplateProcessing(
	    single="[CLS]:0 $A:0 [SEP]:0",
	    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
	    special_tokens=[
	        ("[CLS]", cls_token_id),
	        ("[SEP]", sep_token_id),
	    ],
	)
>>> tokenizer.decoder = decoders.Metaspace()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!

5. WordPiece

tokenizers.decoders.WordPiece(prefix = '##', cleanup = True)

The WordPiece decoder handles the "##" prefix on subwords and merges the subwords back into full words.

>>> def batch_iterator():
	    for i in range(0, len(dataset), 1000):
	        yield dataset[i: i + 1000]["text"]


>>> dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")
>>> tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
>>> tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
>>> special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
>>> trainer = trainers.WordPieceTrainer(special_tokens=special_tokens)
>>> tokenizer.train_from_iterator(batch_iterator(), trainer)

>>> tokenizer.post_processor = processors.BertProcessing(sep=("[SEP]", tokenizer.token_to_id("[SEP]")),
                                                          cls=("[CLS]", tokenizer.token_to_id("[CLS]")))
>>> tokenizer.decoder = decoders.WordPiece()
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids)
this is a text!
>>> tokenizer.decode(tokenizer.encode("this is a text!").ids, skip_special_tokens=False)
[CLS] this is a text! [SEP]
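
The "##" prefix handling can also be checked by applying the decoder directly to a list of tokens; the subword split below is hand-written for illustration (a minimal sketch):

>>> decoders.WordPiece().decode(["this", "is", "un", "##pleas", "##ant"])  # "##" pieces are glued onto the previous token
this is unpleasant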
