BPE (Byte Pair Encoding). Applied to machine translation in 2016 to address the out-of-vocabulary (OOV) and rare-word problems. Paper: "Neural Machine Translation of Rare Words with Subword Units", published at ACL 2016.
http://www.sohu.com/a/115373230_465975
tensor2tensor uses BPE; see for example:
data_generators/problem.py
data_generators/translate_ende.py
1. Reference: https://plmsmile.github.io/2017/10/19/subword-units/
import re

def process_raw_words(words, endtag='-'):
    '''Split each word into its smallest symbols and append an end-of-word tag.'''
    vocabs = {}
    for word, count in words.items():
        # insert a space before every letter
        word = re.sub(r'([a-zA-Z])', r' \1', word)
        word += ' ' + endtag
        vocabs[word] = count
    return vocabs

def get_symbol_pairs(vocabs):
    '''Collect every adjacent symbol pair (length 2) in the vocabulary and count its occurrences.
    Args:
        vocabs: dict of (word, count); each word is already split into its smallest symbols
    Returns:
        pairs: dict of ((symbol1, symbol2), count)
    '''
    pairs = dict()
    for word, freq in vocabs.items():
        # symbols of this word
        symbols = word.split()
        for i in range(len(symbols) - 1):
            p = (symbols[i], symbols[i + 1])
            pairs[p] = pairs.get(p, 0) + freq
    return pairs

def merge_symbols(symbol_pair, vocabs):
    '''Replace every occurrence of the string 'a b' with 'ab' in all words of vocabs.
    Args:
        symbol_pair: (a, b), the two symbols to merge
        vocabs: dict of (word, count) where each word is a space-separated sequence of subwords
    Returns:
        vocabs_new: new vocabulary with 'a b' replaced by 'ab'
    '''
    vocabs_new = {}
    raw = ' '.join(symbol_pair)
    merged = ''.join(symbol_pair)
    # escape non-alphanumeric characters
    bigram = re.escape(raw)
    # match 'a b' only when it is not glued to neighbouring symbols
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, count in vocabs.items():
        word_new = p.sub(merged, word)
        vocabs_new[word_new] = count
    return vocabs_new
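The reference code then repeatedly picks the most frequent pair and merges it. A minimal driver along these lines (the function name process_bpe is illustrative; it assumes 10 merge steps and Python 3.7+ dict ordering, so ties go to the pair seen first) reproduces the output shown below:
def process_bpe(vocabs, num_merges=10):
    '''Run num_merges BPE merge steps over the prepared vocabulary.'''
    for _ in range(num_merges):
        pairs = get_symbol_pairs(vocabs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocabs = merge_symbols(best, vocabs)
    return vocabs

raw_words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(process_bpe(process_raw_words(raw_words), num_merges=10))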
Output:
Before: {"low":5, "lower":2, "newest":6, "widest":3}
After BPE: {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}
{"low":5, "lower":2, "newest":6, "widest":3} is the original frequency of each word. In the final output, splitting each entry on spaces gives the modeling units, here: low, e, r, newest, wi, d, est (plus the end tag). Any output text can then be mapped through these modeling units into a sequence of symbols.
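For instance, splitting the BPE result above on spaces and collecting the unique symbols gives the modeling-unit inventory (a small sketch over the dictionary printed above):
bpe_vocab = {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}
units = set()
for word in bpe_vocab:
    units.update(word.split())
print(sorted(units))   # ['-', 'd', 'e', 'est-', 'low', 'low-', 'newest-', 'r', 'wi']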
2. Reference: "Neural Machine Translation of Rare Words with Subword Units"
Paper walkthrough: http://www.sohu.com/a/115373230_465975
import re, collections

def get_stats(vocab):
    '''Count the frequency of every adjacent symbol pair in the vocabulary.'''
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    '''Replace every stand-alone occurrence of 'a b' with 'ab' in all words.'''
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print("merge %d: %s" % (i + 1, str(best)))
    print("vocab:", vocab)
Personally, I still find sentencepiece the handiest tool for this kind of tokenization~~
Reference: https://github.com/google/sentencepiece/tree/master/python
Tokenizing into 20k label ids:
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=/data/yelong/bpe_test/lib.txt --model_prefix=/data/yelong/bpe_test/bpe --vocab_size=20000 --model_type=bpe')
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
with open('/data/yelong/bpe_test/wav/train/text.txt', 'a') as fid, open('/data/yelong/bpe_test/wav/train/train.txt') as did:
    for line in did:
        a = line.strip().split()[1:]       # transcript words, e.g. "TWO COME MUSE MIGRATE"
        aa = ' '.join([t for t in a])
        listid = sp.EncodeAsIds(aa)        # subword label ids for the transcript
        strid = ' '.join([str(t) for t in listid])
        b = line.strip().split()[:1]       # first field of the line, written through unchanged
        b = ''.join([t for t in b])
        fid.write(b + ' ' + strid + '\n')
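As a quick sanity check (a sketch reusing the transcript string from the comment above and the same assumed model path), the generated ids can be decoded back into text:
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
text = "TWO COME MUSE MIGRATE"
print(sp.EncodeAsPieces(text))   # subword pieces for the line
ids = sp.EncodeAsIds(text)
print(ids)                       # the label ids written to text.txt
print(sp.DecodeIds(ids))         # should recover "TWO COME MUSE MIGRATE"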
Training produces two files, .model and .vocab.
bpe.vocab:
<unk> 0
<s> 0
</s> 0
▁T -0
HE -1
▁A -2
▁THE -3
IN -4
▁S -5
▁W -6
This is just a piece-to-score mapping; the number in the right-hand column is not the id. model_type can be one of several options (unigram (default), bpe, char, or word), and when you pick unigram, for example, the right-hand column is a fractional number, so it clearly is not an id.
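To confirm that the id is simply the 0-based line number in bpe.vocab rather than the score column, something like the following can be checked (same assumed model path; '▁T' is taken from the listing above):
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
print(sp.GetPieceSize())       # 20000, i.e. the --vocab_size used for training
print(sp.IdToPiece(3))         # '▁T', the 4th line of bpe.vocab (lines 0-2 are <unk>, <s>, </s>)
print(sp.PieceToId('▁T'))      # 3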
So in the nabu config I should not have listed only 0-19996 in the alphabet (19996 is just the number at the end of bpe.vocab); it should be 0-19999.
Verified: every id from 0 to 19999 has a corresponding piece. Verification method:
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("/data/yelong/bpe_test/bpe.model")
>>> for i in range(20000):
...     sp.IdToPiece(i)
Every id prints a piece. (If an id had no corresponding piece, the call would raise an error and exit.)