Among today's open-source models, LLaMA is undoubtedly a shining ⭐️, but compared with Chinese models such as ChatGLM and BaiChuan, its support for Chinese leaves much to be desired. The original LLaMA vocabulary contains 32K tokens, of which only a few hundred cover Chinese, so many Chinese characters fall back to multiple byte-level tokens and encoding/decoding of Chinese text is inefficient.
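As a quick sanity check on that claim, the sketch below (assuming the original LLaMA-2 tokenizer is available locally in a directory called llama-2-7b-bin, the same path used later in this post) counts the vocabulary pieces that contain CJK characters:

from transformers import LlamaTokenizer

# A minimal sketch (tokenizer path is an assumption): count how many of the
# original 32K pieces contain at least one CJK character.
tok = LlamaTokenizer.from_pretrained('llama-2-7b-bin')
vocab = tok.get_vocab()
print(len(vocab))  # 32000
cjk_pieces = [p for p in vocab if any('\u4e00' <= ch <= '\u9fff' for ch in p)]
print(len(cjk_pieces))  # on the order of a few hundred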
To use the LLaMA family of models with Chinese text, the vocabulary has to be extended: train a new vocabulary with the sentencepiece tool, then merge it with the original vocabulary to obtain a new, larger one.
This post breaks Chinese vocabulary extension for LLaMA into the following steps: training-data preparation, vocabulary training, vocabulary merging, and vocabulary testing.
The training text used here is the novel 天龙八部 (Demi-Gods and Semi-Devils) from the MedicalGPT repository.
The data is a plain txt file, with one line of text per training sample.
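If you start from a raw text dump instead of a line-per-sample file, a minimal preparation sketch (the raw input file name here is an assumption) could look like this:

# Keep only non-empty lines so the output contains one text sample per line
# (tianlongbabu_raw.txt is a hypothetical raw dump of the novel).
with open('tianlongbabu_raw.txt', encoding='utf-8') as fin, \
        open('tianlongbabu.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        line = line.strip()
        if line:
            fout.write(line + '\n')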
import sentencepiece as spm

# Train a BPE model on the raw corpus, one sentence per line.
spm.SentencePieceTrainer.train(
    input='tianlongbabu.txt',            # training corpus, one sample per line
    model_prefix='bpe_llama',            # output files: bpe_llama.model / bpe_llama.vocab
    shuffle_input_sentence=False,
    train_extremely_large_corpus=True,
    max_sentence_length=2048,
    pad_id=3,
    model_type='BPE',
    vocab_size=5000,                     # size of the new Chinese vocabulary
    split_digits=True,                   # split numbers into single digits, as LLaMA does
    split_by_unicode_script=True,
    byte_fallback=True,                  # fall back to bytes for unseen characters, as LLaMA does
    allow_whitespace_only_pieces=True,
    remove_extra_whitespaces=False,
    normalization_rule_name="nfkc",
)
print('Training complete')
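Training produces two files named after model_prefix: bpe_llama.model (the binary model) and bpe_llama.vocab (a human-readable list of pieces and scores). A quick peek at the learned pieces:

# Print the first few entries of the generated vocabulary file; each line is
# a piece and its score separated by a tab.
with open('bpe_llama.vocab', encoding='utf-8') as f:
    for _, line in zip(range(10), f):
        print(line.rstrip('\n'))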
The full list of sentencepiece training options can be printed with spm_train --help (Usage: …/build/src/spm_train [options] files); the parameters used above are only a small subset.
Once training has finished, load the resulting bpe_llama.model and check how it segments a sample sentence from the corpus:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load("bpe_llama.model")
print(sp.encode_as_pieces("这老者姓左,名叫子穆,是“无量剑”东宗的掌门。那道姑姓辛,道号双清,是“无量剑”西宗掌门。"))
print(sp.encode_as_ids("这老者姓左,名叫子穆,是“无量剑”东宗的掌门。那道姑姓辛,道号双清,是“无量剑”西宗掌门。"))
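Because byte_fallback is enabled, characters that did not receive their own piece are still encoded as byte pieces rather than <unk>, so plain Chinese text round-trips losslessly:

# Round-trip check: ids decode back to the original text, even for characters
# covered only by byte-fallback pieces.
text = "这老者姓左"
ids = sp.encode_as_ids(text)
print(ids)
print(sp.decode_ids(ids))  # should reproduce the input text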
Next, merge the newly trained Chinese vocabulary into the original LLaMA tokenizer:
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

llama_tokenizer_dir = 'llama-2-7b-bin'
chinese_sp_model_file = 'bpe_llama.model'

# Load the original LLaMA tokenizer and the newly trained Chinese BPE model
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

# Parse both models into sentencepiece ModelProto objects
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# Vocabulary sizes before merging
print(len(llama_tokenizer), len(chinese_sp_model))

# Add the new Chinese pieces to the LLaMA vocabulary
llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)
print(f"Before:{len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
print(f"New model pieces: {len(llama_spm.pieces)}")

output_sp_dir = '../merged_tokenizer_sp'
output_hf_dir = '../merged_tokenizer_hf'
os.makedirs(output_sp_dir, exist_ok=True)

vocab_content = ''
for p in llama_spm.pieces:
    vocab_content += f"{p.piece} {p.score}\n"

# Save the merged vocabulary file
with open(output_sp_dir + '/chinese_llama.vocab', "w", encoding="utf-8") as f:
    f.write(vocab_content)
# Save the merged sentencepiece model
with open(output_sp_dir + '/chinese_llama.model', 'wb') as f:
    f.write(llama_spm.SerializeToString())
# Save as a HuggingFace LlamaTokenizer
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir + '/chinese_llama.model')
tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")
Finally, compare the original LLaMA tokenizer with the merged one (continuing in the same session, so llama_tokenizer_dir and output_hf_dir are still defined):
from transformers import LlamaTokenizer

llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)

# The special tokens are carried over unchanged
print(chinese_llama_tokenizer.all_special_tokens)
print(chinese_llama_tokenizer.all_special_ids)
print(chinese_llama_tokenizer.special_tokens_map)

# text = '''白日依山尽,黄河入海流。欲穷千里目,更上一层楼。'''  # alternative test text
text = '''大模型是指具有非常大的参数数量的人工神经网络模型。 在深度学习领域,大模型通常是指具有数亿到数万亿参数的模型。'''
print("Test text:\n", text)
print(f"Tokenized by LLaMA tokenizer:{len(llama_tokenizer.tokenize(text))},{llama_tokenizer.tokenize(text)}")
print(f"Tokenized by GoGPT-LLaMA tokenizer:{len(chinese_llama_tokenizer.tokenize(text))},{chinese_llama_tokenizer.tokenize(text)}")
As the output shows, the number of tokens for the Chinese text drops significantly with the merged tokenizer, while the tokenization of English text is unaffected.
Note: when running continued pre-training on a LLaMA model with the extended Chinese vocabulary, the embedding layer must be resized with model.resize_token_embeddings(len(tokenizer)), because the vocabulary size has changed.
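A minimal sketch of that resizing step, reusing the model and merged-tokenizer paths from above (the output directory name is an assumption):

from transformers import LlamaForCausalLM, LlamaTokenizer

# Resize the token embeddings to the new vocabulary size before continued
# pre-training, then save the adjusted model (output path is hypothetical).
model = LlamaForCausalLM.from_pretrained('llama-2-7b-bin')
tokenizer = LlamaTokenizer.from_pretrained('../merged_tokenizer_hf')
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained('llama-2-7b-chinese-resized')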
[1] https://github.com/shibing624/MedicalGPT/tree/main
[2] https://github.com/yanqiangmiffy/how-to-train-tokenizer/tree/main