One important use of BERT is generating word vectors. This post describes how to obtain them.
To extract BERT word vectors we use Dr. Han Xiao's bert-as-service; usage is as follows.
Requirements: Python >= 3.5, TensorFlow >= 1.10
Install the required packages:
pip install bert-serving-server
pip install bert-serving-client
Download and unzip the pretrained Chinese BERT model: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
Start bert-as-service:
In a cmd window, change to the folder containing bert-serving-start.exe (usually the Scripts folder under your Python installation path), then run:
bert-serving-start -model_dir C:\Users\admin\Desktop\text_cf\chinese_L-12_H-768_A-12
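If your machine has spare CPU cores, the service can answer requests in parallel; a sketch, assuming bert-as-service's -num_worker flag (set it to however many workers you want):
bert-serving-start -model_dir C:\Users\admin\Desktop\text_cf\chinese_L-12_H-768_A-12 -num_worker=2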
A quick test:
from bert_serving.client import BertClient
bc = BertClient()
d = bc.encode(['你好', '我', '朋友'])
>>>array([[ 0.28940195, -0.13572705, 0.07591176, ..., -0.14091267,
0.5463005 , -0.30118063],
[-0.17258267, 0.05145651, 0.3027011 , ..., 0.06416287,
0.11442862, -0.33527803],
[-0.16574037, 0.29926932, 0.00558878, ..., -0.14497901,
0.64227146, -0.3119482 ]], dtype=float32)
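Each input string is pooled into a single 768-dimensional vector (BERT-base's hidden size), so the three inputs above yield a (3, 768) array:
d.shape
>>>(3, 768)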
Now for the actual work.
# Read the data
import pandas as pd
import numpy as np
df = pd.read_csv(...)
# Download tokenization.py from https://github.com/google-research/bert into the working directory
from tokenization import BasicTokenizer
tokenizer = BasicTokenizer()
# Tokenize the texts
df['cutted'] = df['text'].apply(lambda x: tokenizer.tokenize(x))
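BasicTokenizer splits Chinese text into individual characters (each CJK character becomes its own token), for example:
tokenizer.tokenize('你好朋友')
>>>['你', '好', '朋', '友']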
def padding_sentences(input_sentences, padding_token, padding_sentence_length=200):
    # Truncate or pad every token list to a fixed length
    max_sentence_length = padding_sentence_length
    padded = []
    for sentence in input_sentences:
        if len(sentence) > max_sentence_length:
            # Truncate overly long sentences
            padded.append(sentence[:max_sentence_length])
        else:
            # Pad short sentences (build a new list so the input is not mutated in place)
            padded.append(sentence + [padding_token] * (max_sentence_length - len(sentence)))
    return (padded, max_sentence_length)
# Truncate or pad each text to a fixed length ('[UNK]' is used here as the padding token)
sentences, max_document_length = padding_sentences(df['cutted'], '[UNK]')
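For example, with a padding length of 4:
padding_sentences([['你', '好'], ['你', '好', '朋', '友', '们']], '[UNK]', padding_sentence_length=4)
>>>([['你', '好', '[UNK]', '[UNK]'], ['你', '好', '朋', '友']], 4)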
from bert_serving.client import BertClient
bc = BertClient()
vec = []
for i in range(len(sentences)):
    # Passing a token list encodes every token as its own input,
    # so each text yields a (200, 768) matrix of per-token vectors
    bert_vec = bc.encode(sentences[i])
    print(i, bert_vec.shape)
    vec.append(bert_vec)
bert_vec = np.array(vec)  # note: encoding the whole corpus this way is very slow, use with caution
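Because the encoding step is so slow, it is worth caching the result to disk so it only has to be computed once; a sketch ('bert_vec.npy' is an arbitrary file name):
np.save('bert_vec.npy', bert_vec)   # cache the (N, 200, 768) array
bert_vec = np.load('bert_vec.npy')  # reload later without re-encoding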
# The vectors are ready; for the follow-up TextCNN text classification,
# see https://blog.csdn.net/hufei_neo/article/details/98732734
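As a preview of that step, here is a minimal TextCNN sketch over the (N, 200, 768) array built above, assuming tf.keras and integer class labels in a hypothetical df['label'] column; see the linked post for the full pipeline.

import tensorflow as tf

def build_textcnn(seq_len=200, emb_dim=768, num_classes=2):
    # Input: one precomputed 768-dim BERT vector per token
    inputs = tf.keras.Input(shape=(seq_len, emb_dim))
    pooled = []
    for k in (3, 4, 5):  # convolve over 3/4/5-token windows
        conv = tf.keras.layers.Conv1D(128, k, activation='relu')(inputs)
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
    x = tf.keras.layers.Concatenate()(pooled)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_textcnn()
model.fit(bert_vec, df['label'].values, epochs=5, batch_size=32)  # df['label'] is an assumed label column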