Task04 Writing the BERT Model

1 BertTokenizer (Tokenization)

Components: BasicTokenizer and WordpieceTokenizer

Main responsibilities of BasicTokenizer:

Splits text on punctuation and whitespace; Chinese characters are preprocessed by inserting spaces around them, so they are split character by character

Tokens listed in never_split are protected from being split

Optionally lowercases the text (do_lower_case)

Removes invalid characters and cleans up whitespace

Main responsibilities of WordpieceTokenizer:

Further splits words into subwords

A subword sits between a character and a whole word: it preserves most of the word's meaning while avoiding the vocabulary explosion caused by English inflection (plurals, tenses) and the out-of-vocabulary (OOV) problem

Splitting stems from inflectional affixes shrinks the vocabulary and makes training easier (see the sketch after this list)
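A minimal sketch of the greedy longest-match-first idea behind WordPiece, using a tiny made-up vocabulary (the real vocab.txt of bert-base-uncased is far larger; this toy vocabulary is an assumption for illustration only):

# Toy WordPiece split with a hypothetical mini-vocabulary.
vocab = {"un", "##aff", "##able", "play", "##ing"}

def wordpiece(word, vocab):
    """Greedy longest-match-first split, mirroring WordpieceTokenizer.tokenize."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:          # no piece matched: the whole word becomes [UNK]
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", vocab))    # ['play', '##ing']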

Common BertTokenizer methods:

from_pretrained: initializes a tokenizer from a model name or a directory containing the vocabulary file (vocab.txt);

tokenize: splits text (a word or a sentence) into a list of subwords;

convert_tokens_to_ids: converts a list of subwords into the corresponding vocabulary indices;

convert_ids_to_tokens: the inverse of the previous method;

convert_tokens_to_string: joins a subword list back into words or a sentence by merging the "##" pieces;

encode:

for a single sentence, tokenizes it, adds the special tokens to form the structure "[CLS], x, [SEP]", and converts the result into vocabulary indices;

for a sentence pair (if more sentences are given, only the first two are used), tokenizes both and adds special tokens to form "[CLS], x1, [SEP], x2, [SEP]", then converts the result into indices;

decode: turns the output of encode back into a full sentence (see the usage sketch after this list).
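A minimal usage sketch of these methods (assuming the bert-base-uncased vocabulary can be downloaded or is already cached; the example tokenization in the comments is indicative, not guaranteed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unaffable weather")       # e.g. ['un', '##aff', '##able', 'weather']
ids = tokenizer.convert_tokens_to_ids(tokens)          # subword indices from vocab.txt
print(tokenizer.convert_ids_to_tokens(ids))            # back to the subword list
print(tokenizer.convert_tokens_to_string(tokens))      # "unaffable weather"

encoded = tokenizer.encode("unaffable weather")        # adds [CLS]/[SEP] and converts to ids
print(tokenizer.decode(encoded))                       # roughly "[CLS] unaffable weather [SEP]"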

2 BertModel (the BERT model itself)

Overall structure: essentially a stack of Transformer encoder layers

embeddings: an instance of BertEmbeddings; maps input token ids to their vector representations;

encoder: an instance of BertEncoder;

pooler: an instance of BertPooler; this part is optional

Common BertModel methods:

get_input_embeddings: returns the word_embeddings part of the embedding module, i.e. the token embedding matrix;

set_input_embeddings: assigns new weights to word_embeddings;

_prune_heads: prunes attention heads; the input is a dict of the form {layer_num: list of heads to prune in this layer}, so selected heads in selected layers can be removed (see the sketch below).
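A hedged sketch of head pruning through the public prune_heads wrapper, which delegates to _prune_heads (the layer and head indices here are arbitrary choices for illustration):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Prune heads 0 and 1 in layer 0, and head 2 in layer 5.
model.prune_heads({0: [0, 1], 5: [2]})

# Layer 0 should now have 12 - 2 = 10 attention heads left.
print(model.encoder.layer[0].attention.self.num_attention_heads)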

2.1 BertEmbeddings

Output: the sum of the three parts word_embeddings, token_type_embeddings and position_embeddings, passed through one LayerNorm + Dropout layer; the output has shape (batch_size, sequence_length, hidden_size)

word_embeddings: the embeddings of the subword tokens

token_type_embeddings: indicates which sentence the current token belongs to, distinguishing the two sentences of a pair (and real content from padding)

position_embeddings: a learned embedding for each position in the sequence, used to encode word order

Why LayerNorm + Dropout is needed:

after layer normalization, the embedding distribution lies in a ball-shaped region centered at the origin with standard deviation 1, growing sparser toward the outside (a simplified sketch of the forward pass follows)
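A simplified sketch of the BertEmbeddings forward pass (field names follow the HuggingFace implementation; details such as registered position_ids buffers are omitted):

import torch
from torch import nn

class SimpleBertEmbeddings(nn.Module):
    """Sum the three embeddings, then apply LayerNorm and Dropout."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, device=input_ids.device).unsqueeze(0)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        embeddings = (
            self.word_embeddings(input_ids)
            + self.position_embeddings(position_ids)
            + self.token_type_embeddings(token_type_ids)
        )
        # shape: (batch_size, sequence_length, hidden_size)
        return self.dropout(self.LayerNorm(embeddings))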

2.2 BertEncoder

Technical aside: gradient checkpointing reduces memory usage by keeping fewer nodes of the computation graph during the forward pass and recomputing the discarded activations during the backward pass (a minimal sketch follows)
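A minimal sketch with torch.utils.checkpoint, the primitive BertEncoder relies on when gradient checkpointing is enabled (the layer and tensor sizes here are placeholders):

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(2, 16, 768, requires_grad=True)

# Intermediate activations inside `layer` are not stored;
# they are recomputed during the backward pass.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([2, 16, 768])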

2.2.1 BertAttention

BertSelfAttention

Initialization: checks that the hidden size is an exact multiple of the number of attention heads, then assigns the various parameters

Forward pass:

The basic formulas of multi-head self-attention:

\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \\ \text{head}_i = \text{SDPA}(QW_i^Q, KW_i^K, VW_i^V) \\ \text{SDPA}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
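A compact PyTorch sketch of the scaled dot-product attention formula above (head counts and sizes are illustrative; masking and dropout are omitted):

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(q.size(-1))
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

batch, heads, seq_len, head_dim = 2, 12, 8, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

context = scaled_dot_product_attention(q, k, v)
# The heads are then concatenated and projected by W^O to form MHA(Q, K, V).
print(context.shape)  # torch.Size([2, 12, 8, 64])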

transpose_for_scores: reshapes the hidden_size dimension into (num_attention_heads, attention_head_size) and swaps the middle two dimensions so that the per-head matrix multiplications can be carried out (see the shape example below)
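A small shape example of what transpose_for_scores does (sizes match bert-base: 12 heads of size 64):

import torch

hidden = torch.randn(2, 8, 768)                    # (batch, seq_len, hidden_size)
num_heads, head_size = 12, 64
x = hidden.view(2, 8, num_heads, head_size)        # split hidden_size into heads
x = x.permute(0, 2, 1, 3)                          # (batch, heads, seq_len, head_size)
print(x.shape)  # torch.Size([2, 12, 8, 64])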

torch.einsum: sums products of the input elements according to an Einstein-summation index string (a tiny example follows)
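A tiny example of the einsum pattern used in the relative-position code below (tensor sizes are arbitrary):

import torch

# "bhld,lrd->bhlr": for each batch b and head h, the query at position l is
# multiplied with the relative-position embedding for offset (l, r) and summed over d.
query = torch.randn(2, 12, 8, 64)    # (batch, heads, seq_len, head_dim)
rel_pos = torch.randn(8, 8, 64)      # (seq_len, seq_len, head_dim)

scores = torch.einsum("bhld,lrd->bhlr", query, rel_pos)
print(scores.shape)  # torch.Size([2, 12, 8, 8])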

position_embedding_type:

absolute: the default; no extra processing is needed

relative_key: adds one extra attention-score term in which the query is multiplied with a relative-distance (relative key) embedding

relative_key_query: adds two extra score terms, multiplying the relative-distance embedding with both the query and the key

BertSelfOutput:

The forward pass applies the Dropout + LayerNorm combination; the residual connection eases the training difficulty caused by very deep networks and keeps the layer more sensitive to its original input.

2.2.2 BertIntermediate

Main structure: a fully connected layer followed by an activation

Fully connected layer: expands the hidden dimension to intermediate_size

Activation: the default activation is GELU, approximated with an expression involving tanh (see the sketch below)
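The widely used tanh approximation of GELU; the exact GELU is defined through the Gaussian CDF, and recent PyTorch versions expose this approximation via F.gelu(..., approximate="tanh"):

import math
import torch
import torch.nn.functional as F

def gelu_tanh_approx(x):
    # GELU(x) ≈ 0.5 * x * (1 + tanh( sqrt(2/pi) * (x + 0.044715 * x**3) ))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

x = torch.linspace(-3, 3, 7)
print(gelu_tanh_approx(x))
print(F.gelu(x, approximate="tanh"))  # should match the line above closely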

2.2.3 BertOutput

Main structure: a fully connected layer, Dropout + LayerNorm, and a residual connection

2.3 BertPooler

Main purpose: take the first token of the sentence, i.e. the vector corresponding to [CLS], pass it through a fully connected layer and an activation function, and output the result.

3 Hands-on Practice

3.1 BertTokenizer Code

import collections

import os

import unicodedata

from typing import List, Optional, Tuple

from transformers.tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace

from transformers.utils import logging

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}

PRETRAINED_VOCAB_FILES_MAP = {

    "vocab_file": {

        "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",

    }

}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {

    "bert-base-uncased": 512,

}

PRETRAINED_INIT_CONFIGURATION = {

    "bert-base-uncased": {"do_lower_case": True},

}

def load_vocab(vocab_file):

    """Loads a vocabulary file into a dictionary."""

    vocab = collections.OrderedDict()

    with open(vocab_file, "r", encoding="utf-8") as reader:

        tokens = reader.readlines()

    for index, token in enumerate(tokens):

        token = token.rstrip("\n")

        vocab[token] = index

    return vocab

def whitespace_tokenize(text):

    """Runs basic whitespace cleaning and splitting on a piece of text."""

    text = text.strip()

    if not text:

        return []

    tokens = text.split()

    return tokens

class BertTokenizer(PreTrainedTokenizer):

    vocab_files_names = VOCAB_FILES_NAMES

    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP

    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION

    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

    def __init__(

        self,

        vocab_file,

        do_lower_case=True,

        do_basic_tokenize=True,

        never_split=None,

        unk_token="[UNK]",

        sep_token="[SEP]",

        pad_token="[PAD]",

        cls_token="[CLS]",

        mask_token="[MASK]",

        tokenize_chinese_chars=True,

        strip_accents=None,

        **kwargs

    ):

        super().__init__(

            do_lower_case=do_lower_case,

            do_basic_tokenize=do_basic_tokenize,

            never_split=never_split,

            unk_token=unk_token,

            sep_token=sep_token,

            pad_token=pad_token,

            cls_token=cls_token,

            mask_token=mask_token,

            tokenize_chinese_chars=tokenize_chinese_chars,

            strip_accents=strip_accents,

            **kwargs,

        )

        if not os.path.isfile(vocab_file):

            raise ValueError(

                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "

                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"

            )

        self.vocab = load_vocab(vocab_file)

        self.ids_to_tokens = collections.OrderedDict(

            [(ids, tok) for tok, ids in self.vocab.items()])

        self.do_basic_tokenize = do_basic_tokenize

        if do_basic_tokenize:

            self.basic_tokenizer = BasicTokenizer(

                do_lower_case=do_lower_case,

                never_split=never_split,

                tokenize_chinese_chars=tokenize_chinese_chars,

                strip_accents=strip_accents,

            )

        self.wordpiece_tokenizer = WordpieceTokenizer(

            vocab=self.vocab, unk_token=self.unk_token)

    @property

    def do_lower_case(self):

        return self.basic_tokenizer.do_lower_case

    @property

    def vocab_size(self):

        return len(self.vocab)

    def get_vocab(self):

        return dict(self.vocab, **self.added_tokens_encoder)

    def _tokenize(self, text):

        split_tokens = []

        if self.do_basic_tokenize:

            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):

                # If the token is part of the never_split set

                if token in self.basic_tokenizer.never_split:

                    split_tokens.append(token)

                else:

                    split_tokens += self.wordpiece_tokenizer.tokenize(token)

        else:

            split_tokens = self.wordpiece_tokenizer.tokenize(text)

        return split_tokens

    def _convert_token_to_id(self, token):

        """Converts a token (str) in an id using the vocab."""

        return self.vocab.get(token, self.vocab.get(self.unk_token))

    def _convert_id_to_token(self, index):

        """Converts an index (integer) in a token (str) using the vocab."""

        return self.ids_to_tokens.get(index, self.unk_token)

    def convert_tokens_to_string(self, tokens):

        """Converts a sequence of tokens (string) in a single string."""

        out_string = " ".join(tokens).replace(" ##", "").strip()

        return out_string

    def build_inputs_with_special_tokens(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None

    ) -> List[int]:

        """

        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and

        adding special tokens. A BERT sequence has the following format:

        - single sequence: ``[CLS] X [SEP]``

        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs to which the special tokens will be added.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

        Returns:

            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.

        """

        if token_ids_1 is None:

            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]

        cls = [self.cls_token_id]

        sep = [self.sep_token_id]

        return cls + token_ids_0 + sep + token_ids_1 + sep

    def get_special_tokens_mask(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False

    ) -> List[int]:

        """

        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding

        special tokens using the tokenizer ``prepare_for_model`` method.

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):

                Whether or not the token list is already formatted with special tokens for the model.

        Returns:

            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

        """

        if already_has_special_tokens:

            return super().get_special_tokens_mask(

                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True

            )

        if token_ids_1 is not None:

            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]

        return [1] + ([0] * len(token_ids_0)) + [1]

    def create_token_type_ids_from_sequences(

        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None

    ) -> List[int]:

        """

        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence

        pair mask has the following format:

        ::

            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1

            | first sequence    | second sequence |

        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

        Args:

            token_ids_0 (:obj:`List[int]`):

                List of IDs.

            token_ids_1 (:obj:`List[int]`, `optional`):

                Optional second list of IDs for sequence pairs.

        Returns:

            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given

            sequence(s).

        """

        sep = [self.sep_token_id]

        cls = [self.cls_token_id]

        if token_ids_1 is None:

            return len(cls + token_ids_0 + sep) * [0]

        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:

        index = 0

        if os.path.isdir(save_directory):

            vocab_file = os.path.join(

                save_directory, (filename_prefix + "-" if filename_prefix else "") +

                VOCAB_FILES_NAMES["vocab_file"]

            )

        else:

            vocab_file = (filename_prefix +

                          "-" if filename_prefix else "") + save_directory

        with open(vocab_file, "w", encoding="utf-8") as writer:

            for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]):

                if index != token_index:

                    logger.warning(

                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."

                        " Please check that the vocabulary is not corrupted!"

                    )

                    index = token_index

                writer.write(token + "\n")

                index += 1

        return (vocab_file,)

class BasicTokenizer(object):

    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):

        if never_split is None:

            never_split = []

        self.do_lower_case = do_lower_case

        self.never_split = set(never_split)

        self.tokenize_chinese_chars = tokenize_chinese_chars

        self.strip_accents = strip_accents

    def tokenize(self, text, never_split=None):

        """

        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see

        WordPieceTokenizer.

        Args:

            **never_split**: (`optional`) list of str

                Kept for backward compatibility purposes. Now implemented directly at the base class level (see

                :func:`PreTrainedTokenizer.tokenize`) List of token not to split.

        """

        # union() returns a new set by concatenating the two sets.

        never_split = self.never_split.union(

            set(never_split)) if never_split else self.never_split

        text = self._clean_text(text)

        # This was added on November 1st, 2018 for the multilingual and Chinese

        # models. This is also applied to the English models now, but it doesn't

        # matter since the English models were not trained on any Chinese data

        # and generally don't have any Chinese data in them (there are Chinese

        # characters in the vocabulary because Wikipedia does have some Chinese

        # words in the English Wikipedia.).

        if self.tokenize_chinese_chars:

            text = self._tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)

        split_tokens = []

        for token in orig_tokens:

            if token not in never_split:

                if self.do_lower_case:

                    token = token.lower()

                    if self.strip_accents is not False:

                        token = self._run_strip_accents(token)

                elif self.strip_accents:

                    token = self._run_strip_accents(token)

            split_tokens.extend(self._run_split_on_punc(token, never_split))

        output_tokens = whitespace_tokenize(" ".join(split_tokens))

        return output_tokens

    def _run_strip_accents(self, text):

        """Strips accents from a piece of text."""

        text = unicodedata.normalize("NFD", text)

        output = []

        for char in text:

            cat = unicodedata.category(char)

            if cat == "Mn":

                continue

            output.append(char)

        return "".join(output)

    def _run_split_on_punc(self, text, never_split=None):

        """Splits punctuation on a piece of text."""

        if never_split is not None and text in never_split:

            return [text]

        chars = list(text)

        i = 0

        start_new_word = True

        output = []

        while i < len(chars):

            char = chars[i]

            if _is_punctuation(char):

                output.append([char])

                start_new_word = True

            else:

                if start_new_word:

                    output.append([])

                start_new_word = False

                output[-1].append(char)

            i += 1

        return ["".join(x) for x in output]

    def _tokenize_chinese_chars(self, text):

        """Adds whitespace around any CJK character."""

        output = []

        for char in text:

            cp = ord(char)

            if self._is_chinese_char(cp):

                output.append(" ")

                output.append(char)

                output.append(" ")

            else:

                output.append(char)

        return "".join(output)

    def _is_chinese_char(self, cp):

        """Checks whether CP is the codepoint of a CJK character."""

        # This defines a "chinese character" as anything in the CJK Unicode block:

        #  https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)

        #

        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,

        # despite its name. The modern Korean Hangul alphabet is a different block,

        # as is Japanese Hiragana and Katakana. Those alphabets are used to write

        # space-separated words, so they are not treated specially and handled

        # like the all of the other languages.

        if (

            (cp >= 0x4E00 and cp <= 0x9FFF)

            or (cp >= 0x3400 and cp <= 0x4DBF)  #

            or (cp >= 0x20000 and cp <= 0x2A6DF)  #

            or (cp >= 0x2A700 and cp <= 0x2B73F)  #

            or (cp >= 0x2B740 and cp <= 0x2B81F)  #

            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #

            or (cp >= 0xF900 and cp <= 0xFAFF)

            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #

        ):  #

            return True

        return False

    def _clean_text(self, text):

        """Performs invalid character removal and whitespace cleanup on text."""

        output = []

        for char in text:

            cp = ord(char)

            if cp == 0 or cp == 0xFFFD or _is_control(char):

                continue

            if _is_whitespace(char):

                output.append(" ")

            else:

                output.append(char)

        return "".join(output)

class WordpieceTokenizer(object):

    """Runs WordPiece tokenization."""

    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):

        self.vocab = vocab

        self.unk_token = unk_token

        self.max_input_chars_per_word = max_input_chars_per_word

    def tokenize(self, text):

        """

        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform

        tokenization using the given vocabulary.

        For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.

        Args:

          text: A single token or whitespace separated tokens. This should have

            already been passed through `BasicTokenizer`.

        Returns:

          A list of wordpiece tokens.

        """

        output_tokens = []

        for token in whitespace_tokenize(text):

            chars = list(token)

            if len(chars) > self.max_input_chars_per_word:

                output_tokens.append(self.unk_token)

                continue

            is_bad = False

            start = 0

            sub_tokens = []

            while start < len(chars):

                end = len(chars)

                cur_substr = None

                while start < end:

                    substr = "".join(chars[start:end])

                    if start > 0:

                        substr = "##" + substr

                    if substr in self.vocab:

                        cur_substr = substr

                        break

                    end -= 1

                if cur_substr is None:

                    is_bad = True

                    break

                sub_tokens.append(cur_substr)

                start = end

            if is_bad:

                output_tokens.append(self.unk_token)

            else:

                output_tokens.extend(sub_tokens)

        return output_tokens


bt = BertTokenizer.from_pretrained('bert-base-uncased')

bt('I like natural language progressing!')


{'input_ids': [101, 1045, 2066, 3019, 2653, 27673, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
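To map the ids above back to text, decode can be applied to the input_ids (the exact spacing of the output depends on the tokenizer's cleanup settings):

print(bt.decode([101, 1045, 2066, 3019, 2653, 27673, 999, 102]))
# expected to read roughly: "[CLS] i like natural language progressing! [SEP]"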


3.2 BertSelfAttention

import math

import torch

from torch import nn

class BertSelfAttention(nn.Module):

    def __init__(self, config):

        super().__init__()

        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):

            raise ValueError(

                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "

                f"heads ({config.num_attention_heads})"

            )

        self.num_attention_heads = config.num_attention_heads

        self.attention_head_size = int(

            config.hidden_size / config.num_attention_heads)

        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)

        self.key = nn.Linear(config.hidden_size, self.all_head_size)

        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

        self.position_embedding_type = getattr(

            config, "position_embedding_type", "absolute")

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":

            self.max_position_embeddings = config.max_position_embeddings

            self.distance_embedding = nn.Embedding(

                2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x):

        new_x_shape = x.size()[

            :-1] + (self.num_attention_heads, self.attention_head_size)

        x = x.view(*new_x_shape)

        return x.permute(0, 2, 1, 3)

    def forward(

        self,

        hidden_states,

        attention_mask=None,

        head_mask=None,

        encoder_hidden_states=None,

        encoder_attention_mask=None,

        past_key_value=None,

        output_attentions=False,

    ):

        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys

        # and values come from an encoder; the attention mask needs to be

        # such that the encoder's padding tokens are not attended to.

        is_cross_attention = encoder_hidden_states is not None

        if is_cross_attention and past_key_value is not None:

            # reuse k,v, cross_attentions

            key_layer = past_key_value[0]

            value_layer = past_key_value[1]

            attention_mask = encoder_attention_mask

        elif is_cross_attention:

            key_layer = self.transpose_for_scores(

                self.key(encoder_hidden_states))

            value_layer = self.transpose_for_scores(

                self.value(encoder_hidden_states))

            attention_mask = encoder_attention_mask

        elif past_key_value is not None:

            key_layer = self.transpose_for_scores(self.key(hidden_states))

            value_layer = self.transpose_for_scores(self.value(hidden_states))

            key_layer = torch.cat([past_key_value[0], key_layer], dim=2)

            value_layer = torch.cat([past_key_value[1], value_layer], dim=2)

        else:

            key_layer = self.transpose_for_scores(self.key(hidden_states))

            value_layer = self.transpose_for_scores(self.value(hidden_states))

        query_layer = self.transpose_for_scores(mixed_query_layer)

        if self.is_decoder:

            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.

            # Further calls to cross_attention layer can then reuse all cross-attention

            # key/value_states (first "if" case)

            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of

            # all previous decoder key/value_states. Further calls to uni-directional self-attention

            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)

            # if encoder bi-directional self-attention `past_key_value` is always `None`

            past_key_value = (key_layer, value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.

        attention_scores = torch.matmul(

            query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":

            seq_length = hidden_states.size()[1]

            position_ids_l = torch.arange(

                seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)

            position_ids_r = torch.arange(

                seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)

            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(

                distance + self.max_position_embeddings - 1)

            positional_embedding = positional_embedding.to(

                dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":

                relative_position_scores = torch.einsum(

                    "bhld,lrd->bhlr", query_layer, positional_embedding)

                attention_scores = attention_scores + relative_position_scores

            elif self.position_embedding_type == "relative_key_query":

                relative_position_scores_query = torch.einsum(

                    "bhld,lrd->bhlr", query_layer, positional_embedding)

                relative_position_scores_key = torch.einsum(

                    "bhrd,lrd->bhlr", key_layer, positional_embedding)

                attention_scores = attention_scores + \

                    relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / \

            math.sqrt(self.attention_head_size)

        if attention_mask is not None:

            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)

            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.

        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might

        # seem a bit unusual, but is taken from the original Transformer paper.

        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to

        if head_mask is not None:

            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()

        new_context_layer_shape = context_layer.size()[

            :-2] + (self.all_head_size,)

        context_layer = context_layer.view(*new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (

            context_layer,)

        if self.is_decoder:

            outputs = outputs + (past_key_value,)

        return outputs


3.3 BertSelfOutput

class BertSelfOutput(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

        self.LayerNorm = nn.LayerNorm(

            config.hidden_size, eps=config.layer_norm_eps)

        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):

        hidden_states = self.dense(hidden_states)

        hidden_states = self.dropout(hidden_states)

        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states


3.4 BertOutput

class BertOutput(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)

        self.LayerNorm = nn.LayerNorm(

            config.hidden_size, eps=config.layer_norm_eps)

        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):

        hidden_states = self.dense(hidden_states)

        hidden_states = self.dropout(hidden_states)

        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states


3.5 BertPooler

class BertPooler(nn.Module):

    def __init__(self, config):

        super().__init__()

        self.dense = nn.Linear(config.hidden_size, config.hidden_size)

        self.activation = nn.Tanh()

    def forward(self, hidden_states):

        # We "pool" the model by simply taking the hidden state corresponding

        # to the first token.

        first_token_tensor = hidden_states[:, 0]

        pooled_output = self.dense(first_token_tensor)

        pooled_output = self.activation(pooled_output)

        return pooled_output


from transformers.models.bert.configuration_bert import *

import torch

# Configuration parameters

config = BertConfig.from_pretrained("bert-base-uncased")

bert_pooler = BertPooler(config = config)

print("input to bert pooler size: {}".format(config.hidden_size))

# Call bert_pooler

batch_size = 1

seq_len = 2

hidden_size = 768

x = torch.rand(batch_size, seq_len, hidden_size)

y = bert_pooler(x)

print(y.size())


input to bert pooler size: 768

torch.Size([1, 768])


4 Summary

This task walked through the BERT source code, covering BertTokenizer and BertModel. BertTokenizer splits sentences and decomposes words into subwords. BertModel is the main BERT model class and consists of three parts: BertEmbeddings, BertEncoder and BertPooler. BertEmbeddings builds the embeddings from the word, position and token_type embeddings; BertEncoder is made up of BertAttention, BertIntermediate and BertOutput; and BertPooler extracts the first token ([CLS]) of the sentence. This walkthrough is best read alongside the BERT architecture described in Task03.
