Task04 Writing the BERT Model

1 BertTokenizer (Tokenization)

Components: BasicTokenizer and WordpieceTokenizer

Main responsibilities of BasicTokenizer:

Splits text on punctuation and whitespace; Chinese characters are preprocessed by inserting spaces around them, so they are split character by character

Tokens listed in never_split are protected from being split

Optionally lowercases the text (do_lower_case)

Removes invalid characters and cleans up whitespace

Main responsibilities of WordpieceTokenizer:

Further splits words into subwords

A subword sits between a character and a whole word: it preserves most of the word's meaning while avoiding the vocabulary explosion caused by English inflection (plurals, tenses) and the out-of-vocabulary (OOV) problem

Splitting stems from inflectional affixes shrinks the vocabulary and makes training easier (see the sketch after this list)
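A minimal sketch of the greedy longest-match-first idea behind WordPiece, using a tiny made-up vocabulary (the real vocab.txt of bert-base-uncased is far larger; this toy vocabulary is an assumption for illustration only):

# Toy WordPiece split with a hypothetical mini-vocabulary.
vocab = {"un", "##aff", "##able", "play", "##ing"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split, mirroring WordpieceTokenizer.tokenize."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:          # no piece matched: the whole word becomes [UNK]
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']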

Common BertTokenizer methods:

from_pretrained: initializes a tokenizer from a model name or a directory containing the vocabulary file (vocab.txt);

tokenize: splits text (a word or a sentence) into a list of subwords;

convert_tokens_to_ids: converts a list of subwords into the corresponding vocabulary indices;

convert_ids_to_tokens: the inverse of the previous method;

convert_tokens_to_string: joins a subword list back into words or a sentence by merging the "##" pieces;

encode:

for a single sentence, tokenizes it, adds the special tokens to form the structure "[CLS], x, [SEP]", and converts the result into vocabulary indices;

for a sentence pair (if more sentences are given, only the first two are used), tokenizes both and adds special tokens to form "[CLS], x1, [SEP], x2, [SEP]", then converts the result into indices;

decode: turns the output of encode back into a full sentence (see the usage sketch after this list).
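A minimal usage sketch of these methods (assuming the bert-base-uncased vocabulary can be downloaded or is already cached; the example tokenization in the comments is indicative, not guaranteed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unaffable weather")       # e.g. ['un', '##aff', '##able', 'weather']
ids = tokenizer.convert_tokens_to_ids(tokens)          # subword indices from vocab.txt
print(tokenizer.convert_ids_to_tokens(ids))            # back to the subword list
print(tokenizer.convert_tokens_to_string(tokens))      # "unaffable weather"

encoded = tokenizer.encode("unaffable weather")        # adds [CLS]/[SEP] and converts to ids
print(tokenizer.decode(encoded))                       # roughly "[CLS] unaffable weather [SEP]"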

2 BertModel (the BERT model itself)

Overall structure: essentially a stack of Transformer encoder layers

embeddings: an instance of BertEmbeddings; maps input token ids to their vector representations;

encoder: an instance of BertEncoder;

pooler: an instance of BertPooler; this part is optional

Common BertModel methods:

get_input_embeddings: returns the word_embeddings part of the embedding module, i.e. the token embedding matrix;

set_input_embeddings: assigns new weights to word_embeddings;

_prune_heads: prunes attention heads; the input is a dict of the form {layer_num: list of heads to prune in this layer}, so selected heads in selected layers can be removed (see the sketch below).
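A hedged sketch of head pruning through the public prune_heads wrapper, which delegates to _prune_heads (the layer and head indices here are arbitrary choices for illustration):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Prune heads 0 and 1 in layer 0, and head 2 in layer 5.
model.prune_heads({0: [0, 1], 5: [2]})

# Layer 0 should now have 12 - 2 = 10 attention heads left.
print(model.encoder.layer[0].attention.self.num_attention_heads)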

2.1 BertEmbeddings

Output: the sum of the three parts word_embeddings, token_type_embeddings and position_embeddings, passed through one LayerNorm + Dropout layer; the output has shape (batch_size, sequence_length, hidden_size)

word_embeddings: the embeddings of the subword tokens

token_type_embeddings: indicates which sentence the current token belongs to, distinguishing the two sentences of a pair (and real content from padding)

position_embeddings: a learned embedding for each position in the sequence, used to encode word order

Why LayerNorm + Dropout is needed:

after layer normalization, the embedding distribution lies in a ball-shaped region centered at the origin with standard deviation 1, growing sparser toward the outside (a simplified sketch of the forward pass follows)
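A simplified sketch of the BertEmbeddings forward pass (field names follow the HuggingFace implementation; details such as registered position_ids buffers are omitted):

import torch
from torch import nn

class SimpleBertEmbeddings(nn.Module):
    """Sum the three embeddings, then apply LayerNorm and Dropout."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        embeddings = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        # shape: (batch_size, sequence_length, hidden_size)
        return self.dropout(self.LayerNorm(embeddings))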

2.2 BertEncoder

Technical aside: gradient checkpointing reduces memory usage by keeping fewer nodes of the computation graph during the forward pass and recomputing the discarded activations during the backward pass (a minimal sketch follows)
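A minimal sketch with torch.utils.checkpoint, the primitive BertEncoder relies on when gradient checkpointing is enabled (the layer and tensor sizes here are placeholders):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(2, 16, 768, requires_grad=True)

# Intermediate activations inside `layer` are not stored;
# they are recomputed during the backward pass.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 16, 768])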

2.2.1 BertAttention

BertSelfAttention

Initialization: checks that the hidden size is an exact multiple of the number of attention heads, then assigns the various parameters

Forward pass:

The basic formulas of multi-head self-attention:

\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \\ \text{head}_i = \text{SDPA}(QW_i^Q, KW_i^K, VW_i^V) \\ \text{SDPA}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
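A compact PyTorch sketch of the scaled dot-product attention formula above (head counts and sizes are illustrative; masking and dropout are omitted):

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

batch, heads, seq_len, head_dim = 2, 12, 8, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

context = scaled_dot_product_attention(q, k, v)
# The heads are then concatenated and projected by W^O to form MHA(Q, K, V).
print(context.shape)  # torch.Size([2, 12, 8, 64])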

transpose_for_scores: reshapes the hidden_size dimension into (num_attention_heads, attention_head_size) and swaps the middle two dimensions so that the per-head matrix multiplications can be carried out (see the shape example below)
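A small shape example of what transpose_for_scores does (sizes match bert-base: 12 heads of size 64):

import torch

hidden = torch.randn(2, 8, 768)                    # (batch, seq_len, hidden_size)
num_heads, head_size = 12, 64
x = hidden.view(2, 8, num_heads, head_size)        # split hidden_size into heads
x = x.permute(0, 2, 1, 3)                          # (batch, heads, seq_len, head_size)
print(x.shape)  # torch.Size([2, 12, 8, 64])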

torch.einsum: sums products of the input elements according to an Einstein-summation index string (a tiny example follows)
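A tiny example of the einsum pattern used in the relative-position code below (tensor sizes are arbitrary):

import torch

# "bhld,lrd->bhlr": for each batch b and head h, the query at position l is
# multiplied with the relative-position embedding for offset (l, r) and summed over d.
query = torch.randn(2, 12, 8, 64)    # (batch, heads, seq_len, head_dim)
rel_pos = torch.randn(8, 8, 64)      # (seq_len, seq_len, head_dim)

scores = torch.einsum("bhld,lrd->bhlr", query, rel_pos)
print(scores.shape)  # torch.Size([2, 12, 8, 8])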

position_embedding_type:

absolute: the default; no extra processing is needed

relative_key: adds one extra attention-score term in which the query is multiplied with a relative-distance (relative key) embedding

relative_key_query: adds two extra score terms, multiplying the relative-distance embedding with both the query and the key

BertSelfOutput:

The forward pass applies the Dropout + LayerNorm combination; the residual connection eases the training difficulty caused by very deep networks and keeps the layer more sensitive to its original input.

2.2.2 BertIntermediate

Main structure: a fully connected layer followed by an activation

Fully connected layer: expands the hidden dimension to intermediate_size

Activation: the default activation is GELU, approximated with an expression involving tanh (see the sketch below)
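The widely used tanh approximation of GELU; the exact GELU is defined through the Gaussian CDF, and recent PyTorch versions expose this approximation via F.gelu(..., approximate="tanh"):

import math
import torch
import torch.nn.functional as F

def gelu_tanh_approx(x):
    # GELU(x) ≈ 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x**3) ))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-3, 3, 7)
print(gelu_tanh_approx(x))
print(F.gelu(x, approximate="tanh"))  # should match the line above closely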

2.2.3 BertOutput

Main structure: a fully connected layer, Dropout + LayerNorm, and a residual connection

2.3 BertPooler

Main purpose: take the first token of the sentence, i.e. the vector corresponding to [CLS], pass it through a fully connected layer and an activation function, and output the result.

3 Hands-on Practice

3.1 BertTokenizer Code

import collections

import os

import unicodedata

from typing import List, Optional, Tuple

from transformers.tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace

from transformers.utils import logging

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

PRETRAINED_VOCAB_FILES_MAP = {

    "vocab_file": {

        "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",

    }

}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {

    "bert-base-uncased": 512,

}

PRETRAINED_INIT_CONFIGURATION = {

    "bert-base-uncased": {"do_lower_case": True},

}

def load_vocab(vocab_file):

    """Loads a vocabulary file into a dictionary."""

    vocab = collections.OrderedDict()

    with open(vocab_file, "r", encoding="utf-8") as reader:

        tokens = reader.readlines()

    for index, token in enumerate(tokens):

        token = token.rstrip("\n")

        vocab[token] = index

    return vocab

def whitespace_tokenize(text):

    """Runs basic whitespace cleaning and splitting on a piece of text."""

    text = text.strip()

    if not text:

        return []

    tokens = text.split()

    return tokens

class BertTokenizer(PreTrainedTokenizer):

    vocab_files_names = VOCAB_FILES_NAMES

    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP

    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(

        self,

        vocab_file,

        do_lower_case=True,

        do_basic_tokenize=True,

        never_split=None,

        unk_token="[UNK]",

        sep_token="[SEP]",

        pad_token="[PAD]",

        cls_token="[CLS]",

        mask_token="[MASK]",

        tokenize_chinese_chars=True,

        strip_accents=None,

        **kwargs

    ):

        super().__init__(

            do_lower_case=do_lower_case,

            do_basic_tokenize=do_basic_tokenize,

            never_split=never_split,

            unk_token=unk_token,

            sep_token=sep_token,

            pad_token=pad_token,

            cls_token=cls_token,

            mask_token=mask_token,

            tokenize_chinese_chars=tokenize_chinese_chars,

            strip_accents=strip_accents,

            **kwargs,

        )

        if not os.path.isfile(vocab_file):

            raise ValueError(

                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "

                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"

            )

        self.vocab = load_vocab(vocab_file)

        self.ids_to_tokens = collections.OrderedDict(

            [(ids, tok) for tok, ids in self.vocab.items()])

        self.do_basic_tokenize = do_basic_tokenize

        if do_basic_tokenize:

            self.basic_tokenizer = BasicTokenizer(

                do_lower_case=do_lower_case,

                never_split=never_split,

                tokenize_chinese_chars=tokenize_chinese_chars,

                strip_accents=strip_accents,

            )

        self.wordpiece_tokenizer = WordpieceTokenizer(

            vocab=self.vocab, unk_token=self.unk_token)

    @property

    def do_lower_case(self):

        return self.basic_tokenizer.do_lower_case

    @property

    def vocab_size(self):

        return len(self.vocab)

    def get_vocab(self):

        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):

        split_tokens = []

        if self.do_basic_tokenize:

            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):

                # If the token is part of the never_split set

                if token in self.basic_tokenizer.never_split:

                    split_tokens.append(token)

                else:

                    split_tokens += self.wordpiece_tokenizer.tokenize(token)

        else:

            split_tokens = self.wordpiece_tokenizer.tokenize(text)

        return split_tokens

    def _convert_token_to_id(self, token):

        """Converts a token (str) in an id using the vocab."""

        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):

        """Converts an index (integer) in a token (str) using the vocab."""

        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):

        """Converts a sequence of tokens (string) in a single string."""

        out_string = " ".join(tokens).replace(" ##", "").strip()

        return out_string

    def build_inputs_with_special_tokens(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None

    ) -> List[int]:

        """

        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and

        adding special tokens. A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``

        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs to which the special tokens will be added.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

        Returns:

            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.

        """

        if token_ids_1 is None:

            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]

        cls = [self.cls_token_id]

        sep = [self.sep_token_id]

        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False

    ) -> List[int]:

        """

        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding

        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):

                Whether or not the token list is already formatted with special tokens for the model.

        Returns:

            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

        """

        if already_has_special_tokens:

            return super().get_special_tokens_mask(

                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True

            )

        if token_ids_1 is not None:

            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]

        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None

    ) -> List[int]:

        """

        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence

        pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1

            | first sequence    | second sequence |

        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

        Returns:

            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given

            sequence(s).

        """

        sep = [self.sep_token_id]

        cls = [self.cls_token_id]

        if token_ids_1 is None:

            return len(cls + token_ids_0 + sep) * [0]

        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:

        index = 0

        if os.path.isdir(save_directory):

            vocab_file = os.path.join(

                save_directory, (filename_prefix + "-" if filename_prefix else "") +

                VOCAB_FILES_NAMES["vocab_file"]

            )

        else:

            vocab_file = (filename_prefix +

                          "-" if filename_prefix else "") + save_directory

        with open(vocab_file, "w", encoding="utf-8") as writer:

            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):

                if index != token_index:

                    logger.warning(

                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."

                        " Please check that the vocabulary is not corrupted!"

                    )

                    index = token_index

                writer.write(token + "\n")

                index += 1

        return (vocab_file,)

class BasicTokenizer(object):

    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):

        if never_split is None:

            never_split = []

        self.do_lower_case = do_lower_case

        self.never_split = set(never_split)

        self.tokenize_chinese_chars = tokenize_chinese_chars

        self.strip_accents = strip_accents

    def tokenize(self, text, never_split=None):

        """

        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see

        WordPieceTokenizer.

        Args:

            **never_split**: (`optional`) list of str

                Kept for backward compatibility purposes. Now implemented directly at the base class level (see

                :func:`PreTrainedTokenizer.tokenize`) List of token not to split.

        """

        # union() returns a new set by concatenating the two sets.

        never_split = self.never_split.union(

            set(never_split)) if never_split else self.never_split

        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese

        # models. This is also applied to the English models now, but it doesn't

        # matter since the English models were not trained on any Chinese data

        # and generally don't have any Chinese data in them (there are Chinese

        # characters in the vocabulary because Wikipedia does have some Chinese

        # words in the English Wikipedia.).

        if self.tokenize_chinese_chars:

            text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)

        split_tokens = []

        for token in orig_tokens:

            if token not in never_split:

                if self.do_lower_case:

                    token = token.lower()

                    if self.strip_accents is not False:

                        token = self._run_strip_accents(token)

                elif self.strip_accents:

                    token = self._run_strip_accents(token)

            split_tokens.extend(self._run_split_on_punc(token, never_split))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))

        return output_tokens

    def _run_strip_accents(self, text):

        """Strips accents from a piece of text."""

        text = unicodedata.normalize("NFD", text)

        output = []

        for char in text:

            cat = unicodedata.category(char)

            if cat == "Mn":

                continue

            output.append(char)

        return "".join(output)

    def _run_split_on_punc(self, text, never_split=None):

        """Splits punctuation on a piece of text."""

        if never_split is not None and text in never_split:

            return [text]

        chars = list(text)

        i = 0

        start_new_word = True

        output = []

        while i < len(chars):

            char = chars[i]

            if _is_punctuation(char):

                output.append([char])

                start_new_word = True

            else:

                if start_new_word:

                    output.append([])

                start_new_word = False

                output[-1].append(char)

            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):

        """Adds whitespace around any CJK character."""

        output = []

        for char in text:

            cp = ord(char)

            if self._is_chinese_char(cp):

                output.append(" ")

                output.append(char)

                output.append(" ")

            else:

                output.append(char)

        return "".join(output)

    def _is_chinese_char(self, cp):

        """Checks whether CP is the codepoint of a CJK character."""

        # This defines a "chinese character" as anything in the CJK Unicode block:

        #  https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)

        #

        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,

        # despite its name. The modern Korean Hangul alphabet is a different block,

        # as is Japanese Hiragana and Katakana. Those alphabets are used to write

        # space-separated words, so they are not treated specially and handled

        # like the all of the other languages.

        if (

            (cp >= 0x4E00 and cp <= 0x9FFF)

            or (cp >= 0x3400 and cp <= 0x4DBF)  #

            or (cp >= 0x20000 and cp <= 0x2A6DF)  #

            or (cp >= 0x2A700 and cp <= 0x2B73F)  #

            or (cp >= 0x2B740 and cp <= 0x2B81F)  #

            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #

            or (cp >= 0xF900 and cp <= 0xFAFF)

            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #

        ):  #

            return True

        return False

    def _clean_text(self, text):

        """Performs invalid character removal and whitespace cleanup on text."""

        output = []

        for char in text:

            cp = ord(char)

            if cp == 0 or cp == 0xFFFD or _is_control(char):

                continue

            if _is_whitespace(char):

                output.append(" ")

            else:

                output.append(char)

        return "".join(output)

class WordpieceTokenizer(object):

    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):

        self.vocab = vocab

        self.unk_token = unk_token

        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):

        """

        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform

        tokenization using the given vocabulary.

        For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.

        Args:

          text: A single token or whitespace separated tokens. This should have

            already been passed through `BasicTokenizer`.

        Returns:

          A list of wordpiece tokens.

        """

        output_tokens = []

        for token in whitespace_tokenize(text):

            chars = list(token)

            if len(chars) > self.max_input_chars_per_word:

                output_tokens.append(self.unk_token)

                continue

            is_bad = False

            start = 0

            sub_tokens = []

            while start < len(chars):

                end = len(chars)

                cur_substr = None

                while start < end:

                    substr = "".join(chars[start:end])

                    if start > 0:

                        substr = "##" + substr

                    if substr in self.vocab:

                        cur_substr = substr

                        break

                    end -= 1

                if cur_substr is None:

                    is_bad = True

                    break

                sub_tokens.append(cur_substr)

                start = end

            if is_bad:

                output_tokens.append(self.unk_token)

            else:

                output_tokens.extend(sub_tokens)

        return output_tokens


bt = BertTokenizer.from_pretrained('bert-base-uncased')

bt('I like natural language progressing!')


{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
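To map the ids above back to text, decode can be applied to the input_ids (the exact spacing of the output depends on the tokenizer's cleanup settings):

print(bt.decode([101, 1045, 2066, 3019, 2653, 27673, 999, 102]))
# expected to read roughly: "[CLS] i like natural language progressing! [SEP]"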


3.2 BertSelfAttention

import math

import torch

from torch import nn

class BertSelfAttention(nn.Module):

    def __init__(self, config):

        super().__init__()

        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):

            raise ValueError(

                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "

                f"heads ({config.num_attention_heads})"

            )

        self.num_attention_heads = config.num_attention_heads

        self.attention_head_size = int(

            config.hidden_size / config.num_attention_heads)

        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)

        self.key = nn.Linear(config.hidden_size, self.all_head_size)

        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

        self.position_embedding_type = getattr(

            config, "position_embedding_type", "absolute")

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":

            self.max_position_embeddings = config.max_position_embeddings

            self.distance_embedding = nn.Embedding(

                2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x):

        new_x_shape = x.size()[

            :-1] + (self.num_attention_heads, self.attention_head_size)

        x = x.view(*new_x_shape)

        return x.permute(0, 2, 1, 3)

    def forward(

        self,

        hidden_states,

        attention_mask=None,

        head_mask=None,

        encoder_hidden_states=None,

        encoder_attention_mask=None,

        past_key_value=None,

        output_attentions=False,

    ):

        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys

        # and values come from an encoder; the attention mask needs to be

        # such that the encoder's padding tokens are not attended to.

        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:

            # reuse k,v, cross_attentions

            key_layer = past_key_value[0]

            value_layer = past_key_value[1]

            attention_mask = encoder_attention_mask

        elif is_cross_attention:

            key_layer = self.transpose_for_scores(

                self.key(encoder_hidden_states))

            value_layer = self.transpose_for_scores(

                self.value(encoder_hidden_states))

            attention_mask = encoder_attention_mask

        elif past_key_value is not None:

            key_layer = self.transpose_for_scores(self.key(hidden_states))

            value_layer = self.transpose_for_scores(self.value(hidden_states))

            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)

            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)

        else:

            key_layer = self.transpose_for_scores(self.key(hidden_states))

            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        if self.is_decoder:

            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.

            # Further calls to cross_attention layer can then reuse all cross-attention

            # key/value_states (first "if" case)

            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of

            # all previous decoder key/value_states. Further calls to uni-directional self-attention

            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)

            # if encoder bi-directional self-attention `past_key_value` is always `None`

            past_key_value = (key_layer, value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.

        attention_scores = torch.matmul(

            query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":

            seq_length = hidden_states.size()[1]

            position_ids_l = torch.arange(

                seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)

            position_ids_r = torch.arange(

                seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)

            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(

                distance + self.max_position_embeddings - 1)

            positional_embedding = positional_embedding.to(

                dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":

                relative_position_scores = torch.einsum(

                    "bhld,lrd->bhlr", query_layer, positional_embedding)

                attention_scores = attention_scores + relative_position_scores

            elif self.position_embedding_type == "relative_key_query":

                relative_position_scores_query = torch.einsum(

                    "bhld,lrd->bhlr", query_layer, positional_embedding)

                relative_position_scores_key = torch.einsum(

                    "bhrd,lrd->bhlr", key_layer, positional_embedding)

                attention_scores = attention_scores + \

                    relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / \

            math.sqrt(self.attention_head_size)

        if attention_mask is not None:

            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)

            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.

        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might

        # seem a bit unusual, but is taken from the original Transformer paper.

        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to

        if head_mask is not None:

            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

        new_context_layer_shape = context_layer.size()[

            :-2] + (self.all_head_size,)

        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (

            context_layer,)

        if self.is_decoder:

            outputs = outputs + (past_key_value,)

        return outputs


3.3 BertSelfOutput

class BertSelfOutput(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

        self.LayerNorm = nn.LayerNorm(

            config.hidden_size, eps=config.layer_norm_eps)

        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):

        hidden_states = self.dense(hidden_states)

        hidden_states = self.dropout(hidden_states)

        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states


3.4 BertOutput

class BertOutput(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)

        self.LayerNorm = nn.LayerNorm(

            config.hidden_size, eps=config.layer_norm_eps)

        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):

        hidden_states = self.dense(hidden_states)

        hidden_states = self.dropout(hidden_states)

        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states


3.5 BertPooler

class BertPooler(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

        self.activation = nn.Tanh()

    def forward(self, hidden_states):

        # We "pool" the model by simply taking the hidden state corresponding

        # to the first token.

        first_token_tensor = hidden_states[:, 0]

        pooled_output = self.dense(first_token_tensor)

        pooled_output = self.activation(pooled_output)

        return pooled_output


from transformers.models.bert.configuration_bert import *

import torch

# Configuration parameters

config = BertConfig.from_pretrained("bert-base-uncased")

bert_pooler = BertPooler(config = config)

print("input to bert pooler size: {}".format(config.hidden_size))

# Call bert_pooler

batch_size = 1

seq_len = 2

hidden_size = 768

x = torch.rand(batch_size, seq_len, hidden_size)

y = bert_pooler(x)

print(y.size())


input to bert pooler size: 768

torch.Size([1, 768])


4 Summary

This task walked through the BERT source code, covering BertTokenizer and BertModel. BertTokenizer splits sentences and decomposes words into subwords. BertModel is the main BERT model class and consists of three parts: BertEmbeddings, BertEncoder and BertPooler. BertEmbeddings builds the embeddings from the word, position and token_type embeddings; BertEncoder is made up of BertAttention, BertIntermediate and BertOutput; and BertPooler extracts the first token ([CLS]) of the sentence. This walkthrough is best read alongside the BERT architecture described in Task03.
