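Google's work shows at every turn how much money matters; after all, **All you need is money**. BERT stacks 12 (Base) or 24 (Large) Transformer encoder layers, so training is extremely complex, demands high-end hardware, and takes a great deal of time. Thankfully, Google has released several pre-trained models that we can fine-tune directly. This is also where NLP is heading: transfer learning.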
BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Uncased (Orig, not recommended) (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters
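The parameter counts in the list can be roughly reproduced from the layer and hidden sizes. The sketch below is only an estimate under stated assumptions: a vocabulary of 30522 entries (the English uncased vocabulary; other models differ), 512 position embeddings, 2 segment types, and a feed-forward width of four times the hidden size. The function name `bert_param_count` is made up for this illustration.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder, without any task-specific head."""
    intermediate = 4 * hidden          # assumed feed-forward width
    layer_norm = 2 * hidden            # scale and bias of one LayerNorm
    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (with biases), followed by one LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    # Dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the 110M figure for Base and a little below the 340M figure for Large; most of the weights sit in the stacked encoder layers rather than in the embeddings.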
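These are some of the models Google has pre-trained. They can be downloaded from the project.
We use BERT-Base, Chinese: Chinese Simplified and Traditional, a character-level Chinese model trained on both simplified and traditional Chinese.
After downloading the model, clone Google's BERT project: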
git clone
To extract word vectors for a text, use the extract_features.py script in the project. The official example:
# Sentence A and Sentence B are separated by the ||| delimiter for sentence pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8
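If you would rather drive the script from Python, the same flags can be assembled into an argument list. This is a minimal sketch, not part of the BERT project: the helper name `build_extract_command` and the directory `/path/to/chinese_model` are placeholders invented for illustration, and the file names inside the model directory are taken from the official example.

```python
import os


def build_extract_command(bert_dir, input_file, output_file,
                          layers=(-1, -2, -3, -4),
                          max_seq_length=128, batch_size=8):
    """Return the extract_features.py command as a list of arguments."""
    return [
        "python", "extract_features.py",
        f"--input_file={input_file}",
        f"--output_file={output_file}",
        f"--vocab_file={os.path.join(bert_dir, 'vocab.txt')}",
        f"--bert_config_file={os.path.join(bert_dir, 'bert_config.json')}",
        f"--init_checkpoint={os.path.join(bert_dir, 'bert_model.ckpt')}",
        # The script expects a comma-separated list such as -1,-2,-3,-4.
        "--layers=" + ",".join(str(i) for i in layers),
        f"--max_seq_length={max_seq_length}",
        f"--batch_size={batch_size}",
    ]


# Placeholder model directory; replace it with your own BERT_BASE_DIR.
cmd = build_extract_command("/path/to/chinese_model",
                            "/tmp/input.txt", "/tmp/output.jsonl")
print(" ".join(cmd))
```

The list can then be handed to `subprocess.run(cmd, check=True, cwd=...)`, with `cwd` pointing at the cloned BERT repository so that `extract_features.py` is found.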
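For a sentence pair, write the input line as:
吃了吗? ||| 吃过了
where ||| is the delimiter between sentence A and sentence B (here, "Have you eaten?" and "Yes, I have").
For single-sentence input, put one sentence per line and do not use the ||| delimiter.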
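The input format described above can be produced with a small helper. This is only a sketch: the function names `format_input_line` and `write_input_file` are invented for this example and are not part of the BERT project.

```python
def format_input_line(sentence_a, sentence_b=None):
    """Return one input line: 'A ||| B' for a pair, or just 'A' for a single sentence."""
    if sentence_b is None:
        return sentence_a.strip()
    # The delimiter is written with a space on each side, as in the official example.
    return f"{sentence_a.strip()} ||| {sentence_b.strip()}"


def write_input_file(path, examples):
    """Write (sentence_a, sentence_b_or_None) tuples to a file, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence_a, sentence_b in examples:
            f.write(format_input_line(sentence_a, sentence_b) + "\n")


lines = [
    format_input_line("吃了吗?", "吃过了"),   # sentence pair
    format_input_line("今天天气不错"),          # single sentence
]
for line in lines:
    print(line)
```

Calling `write_input_file("/tmp/input.txt", [("吃了吗?", "吃过了"), ("今天天气不错", None)])` would create a file that can be passed to `--input_file`.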
**vocab_file:** path to the vocabulary file. BERT_BASE_DIR is the directory where the downloaded pre-trained model BERT-Base, Chinese: Chinese Simplified and Traditional was unpacked (same below).
**layers:** which layers' outputs to write out; -1 is the last layer, -2 is the second-to-last, and so on.
**max_seq_length:** the maximum sequence length in tokens; set it according to your task. If your GPU has little memory, reduce this value to save memory.
**batch_size:** the number of sequences processed per batch.

Each line of the output file is one JSON object of the following form (pretty-printed here, with the vectors truncated):

{
  "linex_index": 0,
  "features": [
    { "token": "[CLS]", // start-of-sentence marker
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ] },
    { "token": "\u769f", // first character of the sentence
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the last layer (-1) for the first character
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the second-to-last layer (-2) for the first character
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the third-to-last layer (-3) for the first character
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ] },
    { "token": "\u45ef", // second character of the sentence
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ] },
    { "token": "[SEP]", // end-of-sentence marker
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ] }
  ]
}
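To turn this output into usable vectors, each line can be parsed with the standard `json` module. The sketch below is an illustration rather than part of the BERT project: the function names are invented, the sample record uses made-up 2-dimensional values instead of real 768-dimensional ones, and averaging the token vectors (skipping `[CLS]` and `[SEP]`) is just one common way to get a sentence vector, not something the script itself does.

```python
import json


def token_vectors(json_line, layer_index=-1):
    """Return (token, vector) pairs for one output line, taken from a single layer."""
    record = json.loads(json_line)
    pairs = []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                pairs.append((feature["token"], layer["values"]))
                break
    return pairs


def mean_sentence_vector(pairs, skip=("[CLS]", "[SEP]")):
    """Average the token vectors element-wise, ignoring the special tokens in `skip`."""
    vectors = [vec for token, vec in pairs if token not in skip]
    if not vectors:
        raise ValueError("no token vectors left to average")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Made-up record with 2-dimensional vectors, shaped like one line of the output file.
sample_line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [9.0, 9.0]},
                                      {"index": -2, "values": [8.0, 8.0]}]},
        {"token": "吃", "layers": [{"index": -1, "values": [1.0, 2.0]},
                                   {"index": -2, "values": [0.5, 0.5]}]},
        {"token": "了", "layers": [{"index": -1, "values": [3.0, 4.0]},
                                   {"index": -2, "values": [1.5, 1.5]}]},
        {"token": "[SEP]", "layers": [{"index": -1, "values": [7.0, 7.0]},
                                      {"index": -2, "values": [6.0, 6.0]}]},
    ],
})

pairs = token_vectors(sample_line, layer_index=-1)
sentence_vec = mean_sentence_vector(pairs)
print([token for token, _ in pairs])
print(sentence_vec)
```

For a real output file, read it line by line and call `token_vectors` on each line; every line corresponds to one line of the input file.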