torchtext.data: Field and RawField

While trying to modify the OpenNMT code today, I came across this snippet in the preprocess stage:

    fields = inputters.get_fields( 
        opt.data_type,
        src_nfeats,
        tgt_nfeats,
        dynamic_dict=opt.dynamic_dict,
        src_truncate=opt.src_seq_length_trunc,
        tgt_truncate=opt.tgt_seq_length_trunc)

The individual components of fields, however, are of different types:

    fields["tgt"] = fields_getters["text"](**tgt_field_kwargs)          # TextMultiField 

    indices = Field(use_vocab=False, dtype=torch.long, sequential=False) # Field
    fields["indices"] = indices

    if dynamic_dict:
        src_map = Field(
            use_vocab=False, dtype=torch.float,
            postprocessing=make_src, sequential=False)
        fields["src_map"] = src_map

        src_ex_vocab = RawField()
        fields["src_ex_vocab"] = src_ex_vocab

        align = Field(
            use_vocab=False, dtype=torch.long,
            postprocessing=make_tgt, sequential=False)
        fields["alignment"] = align

As you can see, some are TextMultiField, while others are Field or RawField.

Why is that? What is the difference between them?

 

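Field and RawField come from torchtext (TextMultiField is OpenNMT's own class), so I opened the official torchtext documentation: https://torchtext.readthedocs.io/en/latest/data.html#field

It says the following: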

Field

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

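In short: a Field describes one kind of data together with the recipe for turning it into a tensor. That recipe includes the tokenization method, the Vocab object that maps each possible value to a number, and the type of tensor to produce.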

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
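To make the description above concrete, here is a minimal sketch of the tokenize, build-vocab, numericalize and pad steps, including the shared-vocabulary case. `MiniField` is a hypothetical, simplified stand-in written for this note, not torchtext's implementation; plain Python lists stand in for tensors, and the QA sentences are made up.

```python
from collections import Counter


class MiniField:
    """Hypothetical, simplified stand-in for torchtext.data.Field (not the real class)."""

    def __init__(self, tokenize=str.split, pad_token="<pad>", unk_token="<unk>"):
        self.tokenize = tokenize
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.stoi = {}  # token -> index, filled in by build_vocab

    def build_vocab(self, *columns):
        # Count tokens over every column that uses this field.
        counts = Counter(
            tok for col in columns for text in col for tok in self.tokenize(text)
        )
        # Special tokens first, then tokens by descending frequency
        # (ties broken alphabetically so the result is deterministic).
        itos = [self.unk_token, self.pad_token]
        itos += [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
        self.stoi = {tok: i for i, tok in enumerate(itos)}

    def process(self, batch):
        # Tokenize, pad every example to the longest one, then map tokens to indices.
        tokenized = [self.tokenize(text) for text in batch]
        max_len = max(len(toks) for toks in tokenized)
        unk = self.stoi[self.unk_token]
        return [
            [self.stoi.get(tok, unk) for tok in toks + [self.pad_token] * (max_len - len(toks))]
            for toks in tokenized
        ]


questions = ["who wrote hamlet", "who painted guernica"]
answers = ["shakespeare wrote hamlet", "picasso"]

text = MiniField()
text.build_vocab(questions, answers)  # one field, two columns -> one shared vocabulary

q_ids = text.process(questions)
a_ids = text.process(answers)  # "hamlet" gets the same index here as in q_ids
```

Because both columns go through the same field object, a word that appears in a question and in an answer maps to the same index. Two separate fields would each build their own vocabulary.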

Variables: see the documentation linked above for details.

 

RawField

class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
Defines a general datatype.

Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.

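In short: RawField is the general case. Every column of a dataset (sentences and their classes in text classification, paired texts in machine translation) is represented by a RawField object. A RawField assumes nothing about the data and only holds the parameters that say how it should be processed, namely the optional preprocessing and postprocessing steps.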
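The contrast with the OpenNMT snippet at the top can be sketched as follows. Both classes are hypothetical stand-ins written for this note, not torchtext code: `MiniRawField` mimics a RawField that only applies optional hooks, and `MiniNumericField` mimics the idea behind `Field(use_vocab=False, sequential=False)`, where the values are already numbers and need no tokenization or vocabulary lookup. The `cast` parameter is an assumed name playing the role of `dtype`.

```python
class MiniRawField:
    """Hypothetical stand-in for torchtext.data.RawField (not the real class)."""

    def __init__(self, preprocessing=None, postprocessing=None):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing

    def preprocess(self, x):
        # Applied to a single example; a no-op when no hook is given.
        return self.preprocessing(x) if self.preprocessing is not None else x

    def process(self, batch):
        # Applied to a whole batch; without a hook the batch comes back unchanged.
        return self.postprocessing(batch) if self.postprocessing is not None else batch


class MiniNumericField(MiniRawField):
    """Hypothetical stand-in for the idea of Field(use_vocab=False, sequential=False)."""

    def __init__(self, cast=int, **kwargs):
        super().__init__(**kwargs)
        self.cast = cast  # assumed name; plays the role of dtype

    def process(self, batch):
        # No tokenization, no vocab lookup: just coerce every value to one numeric type.
        # The real Field would build a tensor here; a plain list stands in for it.
        return [self.cast(x) for x in batch]


src_ex_vocab = MiniRawField()
indices = MiniNumericField(cast=int)

raw_batch = [{"the": 0, "cat": 1}, {"a": 0}]  # arbitrary per-example Python objects
raw_out = src_ex_vocab.process(raw_batch)     # passed through untouched
idx_out = indices.process(["0", "1", "2"])    # coerced to a uniform numeric batch
```

Under these assumptions, a raw field suits data that cannot be turned into a tensor at all (such as a per-example vocabulary object), while a vocab-free numeric field suits data that is already numeric.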

Variables: see the documentation linked above for details.

 

 

 

 
