At the request of many beginners, this article focuses on the implementation code and the data.
The cleaned-up code is available at https://github.com/chenmingwei00/upload_KBQA.git
The trained data and parameters are at: https://pan.baidu.com/s/1Dv9Md94AUjlCk5JlWKll-g
Extraction code: qwkk
The code is explained in execution order, so debugging should follow that same order.
(1)Install SQL Server;
(2)Download the data: http://www.openkg.cn/dataset/cndbpedia
There are two datasets. The first, the triples data, is explained clearly on the website, so I won't say more about it; it consists entirely of triples. The second is the mention2entity file: for example, for the mention 张杰 there may be several corresponding entities, one a singer, one a writer, and so on.
(3)Import the data into SQL Server and add an index on the entity column (to speed up queries, since every query filters on entity). A minimal sketch of this step is shown right after this list.
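For step (3), here is a hedged sketch of the index creation, assuming the triples have already been bulk-loaded into the table [chentian].[dbo].[baike_triples1] with columns entity/property/value (the table name matches what the code below queries; the pymssql connection values are placeholders, not the project's real settings):

import pymssql

# Placeholders: point these at your own SQL Server instance.
conn = pymssql.connect(server='127.0.0.1', user='sa',
                       password='your_password', database='chentian')
cursor = conn.cursor()
# A nonclustered index on entity speeds up the WHERE entity = '...' and
# WHERE entity IN (...) lookups that the QA code issues for every question.
cursor.execute(
    "CREATE NONCLUSTERED INDEX idx_entity "
    "ON [chentian].[dbo].[baike_triples1] (entity)")
conn.commit()
conn.close()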
This walkthrough starts from the trained model and works backwards, which should make it easier to follow; the implementation differs somewhat from the paper. The order in which the code is introduced matters: it follows the execution order. Please install the required dependencies yourself.
This is the top-level entry point: a simple HTTP interface built with Flask. The code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Fri Sep 15 11:00:22 2017
@author: Administrator
"""
from flask import Flask, request, render_template, jsonify
from urllib import parse
import main_qa

application1 = Flask(__name__)
robot = main_qa.Robots()  # instantiate the QA robot once at startup

@application1.route("/")
def api_index():
    return render_template('index.html')

@application1.route('/get_answer', methods=['POST'])
def get_answer():
    # the POST body is form-encoded; take the 'text' field as the question
    que = parse.parse_qs(request.get_data().decode('utf-8'))
    print(que['text'][0])
    resu = robot.get_answer_qa(que['text'][0])
    print(resu)
    return jsonify({"key": resu})

if __name__ == "__main__":
    application1.run('127.0.0.1', 6550, debug=True)
As you can see, this file is simple: 1/ there is a single route, /get_answer, whose handler receives the user's question in que; 2/ it imports the main_qa module, instantiates an object with robot = main_qa.Robots(), and calls get_answer_qa to obtain the answer. So the real work is in the main_qa file. A sketch of calling this endpoint follows.
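As a quick usage sketch (assuming the server above is running locally on port 6550; the question string is just this post's running example):

import requests

# POST a form-encoded question to the Flask endpoint started above.
resp = requests.post('http://127.0.0.1:6550/get_answer',
                     data={'text': '三国演义的作者是谁'})
print(resp.json()['key'])  # the answer produced by get_answer_qa

requests.post(..., data={...}) sends exactly the form-encoded body that parse.parse_qs expects on the server side.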
Judging by its size, the file looks complicated, mostly because my coding skills are limited, so let's go through it from the top. The running example is the question 三国演义的作者是谁 ("Who is the author of Romance of the Three Kingdoms?"); the auxiliary files it uses will be introduced as they come up.
#! -*- coding:utf-8 -*-
import math
import pickle
import jieba
import gensim
import pandas as pd
import jieba.analyse
import jieba.posseg
from stanfordcorenlp import StanfordCoreNLP
from KBQA_small_data_version1.kbqa.connectSQLServer import connectSQL
from KBQA_small_data.kbqa.entity_recognize import Entity
import numpy as np
import re
I'll skip the third-party dependencies and only describe my own modules:
1/ connectSQLServer wraps the database access; I didn't use an ORM, so the hand-written SQL statements are a bit unfriendly. A hypothetical sketch of the wrapper follows.
2/ entity_recognize is, as the name suggests, the named entity recognition module.
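The connectSQLServer module itself is not shown in this post. Purely as a hypothetical sketch of what the wrapper presumably does, inferred only from how querySQL.Query(...) is used below (it must turn a raw SQL string into a pandas DataFrame; the real implementation in the repo may differ):

import pandas as pd
import pymssql

class connectSQL:
    # Hypothetical reimplementation; the real one lives in the repo's
    # connectSQLServer.py.
    def __init__(self, host, user, password, database):
        self.conn = pymssql.connect(server=host, user=user,
                                    password=password, database=database)

    def Query(self, sql):
        # Execute a raw SQL string and return the rows as a DataFrame,
        # matching how result = querySQL.Query(temp_sql) is used later.
        return pd.read_sql(sql, self.conn)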
host = '172.16.211.128'
user = 'sa'
password = 'chentian184616_'
database = 'chentian'
querySQL = connectSQL(host, user, password, database)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 5000)
pd.set_option('display.width', 1000000)
These are the database settings. Frankly, credentials should not be hard-coded like this, but I won't dwell on conventions here and will stick to the logic. A minimal sketch of a cleaner alternative follows.
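A minimal sketch of the cleaner alternative (the KBQA_DB_* variable names are my own choice, not part of the project):

import os

# Read credentials from the environment instead of hard-coding them.
host = os.environ.get('KBQA_DB_HOST', '127.0.0.1')
user = os.environ.get('KBQA_DB_USER', 'sa')
password = os.environ['KBQA_DB_PASSWORD']  # fail fast if unset
database = os.environ.get('KBQA_DB_NAME', 'chentian')
querySQL = connectSQL(host, user, password, database)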
class Robots:
    def __init__(self):
        # probability tables trained offline and stored as pickles
        pkl_file = open('../../KBQA_small_data/data/entity_template.pkl', 'rb')
        self.template_property = pickle.load(pkl_file)
        ppt_file = open('../../KBQA_small_data/data/ppt_update_update1.pkl', 'rb')
        self.ppt_property = pickle.load(ppt_file)
        concept_fre = open('../../KBQA_small_data/data/concept_count.pkl', 'rb')
        self.concept_fre = pickle.load(concept_fre)
        # POS tags used as rules for entity extraction and filtering
        self.jieba_pos = ['i', 'j', 'l', 'nr', 'nt', 'nz', 'b', 'nrfg', 'zg']
        self.unused_pos = ['b', 'c', 'dg', 'e', 'o', 'p', 'r', 'u', 'w', 'y', 'z', 'uj', 'x']
        self.stanford_pos = ['NR']
        self.tf_idf = jieba.analyse.extract_tags
        self.nlp = StanfordCoreNLP(path_or_host='../../stanford-corenlp/stanford-corenlp-full-2017-06-09/', lang='zh')
        self.sql2 = "SELECT * FROM [chentian].[dbo].[baike_triples1] WHERE entity ='%s' "
        self.sql = "SELECT * FROM [chentian].[dbo].[baike_triples1] WHERE entity in %(name)s "
        self.sq3 = "SELECT * FROM [chentian].[dbo].[m2e1] where entity='%s'"
        self.entity_re = Entity()
        # word2vec model used for sentence/property similarity
        self.model = gensim.models.Word2Vec.load('../../w2vModel/corpus.model')
All of the model's parameters and data are stored in pickles; introducing them one by one:
entity_template.pkl: the probability of each template for an entity's concept
ppt_update_update1: the probability of each property given a template
concept_count: this should be the probability of the concepts associated with an entity
The POS tag lists that follow exist because off-the-shelf named entity recognition performs poorly here, so POS-based rules are used to recognize entities instead; a model trained on your own data would work better.
An Entity object is instantiated for entity recognition.
self.model is used for sentence similarity matching. A sketch of the assumed pickle contents follows.
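Before diving in, here is a hedged sketch of what the three pickles appear to contain, inferred purely from how get_answer_qa uses them below (the concrete template, concept, and numbers are made-up illustrations):

# self.template_property: template string -> its known answer properties;
# used as 'temp_template in self.template_property' plus a lookup.
template_property = {
    '$$$$$书籍$$$$$的作者是谁': ['作者', 'BaiduCARD'],
}
# self.ppt_property: template string -> {property: probability}; sorted by
# probability and truncated to the top four intents.
ppt_property = {
    '$$$$$书籍$$$$$的作者是谁': {'作者': 0.83, 'BaiduCARD': 0.10},
}
# self.concept_fre: concept -> frequency/probability, used as the template score.
concept_fre = {'书籍': 0.02}
# Final ranking score of a candidate triple, as computed below:
# score = concept_fre[concept] * ppt_property[template][property]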
get_answer_qa is effectively the main function; let's analyze it carefully to understand the whole pipeline.
    def get_answer_qa(self, sentence):
        """
        Recognize the entities in the user's question, find each entity's
        concepts to form templates, then match them against the trained
        template library to locate the answering property.
        :return:
        """
        final_result = []
        final_result_final = []
        second_result = []
        question_template = []
        template_property = {}  # template -> its properties, already sorted
        entities = self.entity_recognize(sentence)
Entity recognition is performed first, using the rule-based method described later in this post.
        for entity in entities:
            entity = entity.replace("'", "''")
            # A single quote inside an entity must be doubled ('') so SQL Server
            # reads it as a literal quote. get_synonym1 pulls the candidate
            # entities for this mention out of the m2e table; this is the part
            # that deserves closer analysis.
            real_entity = [k.replace("'", "") for k in
                           self.entity_re.get_synonym1(entity)['real_entities']]
            if len(real_entity) == 0:
                # the mention is unambiguous in m2e: it is its own entity
                real_entity = "('" + str(entity) + "')"
            elif len(real_entity) == 1:
                real_entity = "('" + str(real_entity[0]) + "')"
            else:
                real_entity = tuple(real_entity)
The code above fetches the candidate entities from m2e. The result is a tuple such as ('张杰(上海市浦东法院民五庭庭长)', '张杰(世界书画报社长总编辑)', '张杰(东北林业大学生命科学学院副教授)'), i.e. several concrete candidate entities, which formats directly into the SQL IN clause below.
            temp_sql = self.sql % {'name': real_entity}  # real_entity is a tuple
            # One SQL Server 'IN (e1, e2, e3)' query fetches the triples of all
            # candidate entities at once, instead of looping a SELECT per entity.
            result = querySQL.Query(temp_sql)
            result['template_score'] = ''
            result['property_score'] = ''
            result['score'] = ''
            concepts = result[result['property'] == 'BaiduTAG']['value']  # all concepts of the candidates
            for pro in concepts:
                # substitute the concept for the entity to build the question template
                temp_template = sentence.replace(entity, '$$$$$' + pro + "$$$$$")
                if temp_template in self.template_property:  # is it one of the trained templates?
                    predicts = self.template_property[temp_template]  # the template's properties
                    property_fre = self.ppt_property[temp_template]  # dict: intent -> probability for this template
                    # keep the 4 most probable intents of this template
                    property_fre = dict(sorted(property_fre.items(), key=lambda d: d[1], reverse=True)[:4])
                    template_property[temp_template] = property_fre
                    for predict in list(property_fre.keys()):
                        if predict == "BaiduTAG":
                            continue
                        # result is a DataFrame of candidate (entity, property, value) triples
                        if predict == 'BaiduCARD':
                            final_result_final.append(result[result['property'] == 'BaiduCARD'])
                        elif len(result[result['property'] == predict]) != 0:
                            # store the template score, ready for ranking
                            result.loc[result['property'] == predict, ['template_score']] = self.concept_fre[pro]
                            result.loc[result['property'] == predict, ['property_score']] = property_fre[predict]
                            # probability computation from the paper:
                            # score = P(template) * P(property | template), stored in the score column
                            result.loc[result['property'] == predict, ['score']] = self.concept_fre[pro] * property_fre[predict]
                            final_result.append(result[result['property'] == predict])
            second_result.append(result)
The remaining code is straightforward: it sorts the scored results and returns the best answer.
        if len(final_result) != 0:
            final_result = pd.concat(final_result).drop_duplicates()
            # the highest-scoring (entity, property, value) row wins
            tempresult = ''.join(list(final_result.sort_values(by=['score'], ascending=False)
                                      .loc[:, ['entity', 'property', 'value']].iloc[0]))
            return tempresult
            # return self.sort_result(final_result)
        elif len(final_result_final) != 0:
            # only BaiduCARD summaries matched: rank them by popularity and similarity
            final_result_final = pd.concat(final_result_final).drop_duplicates()
            return self.sort_result(final_result_final, sentence)
        else:
            if len(second_result) != 0:
                # nothing matched a trained template: fall back to ranking all candidate triples
                final_result = pd.concat(second_result).drop_duplicates()
                final_result = list(self.sort_result(final_result, sentence).reset_index().loc[0])[1:]
                return ''.join(final_result[:2]) + ":" + final_result[-1]
            else:
                return 'no_answer'
    def sort_result(self, data_fream, sentence):
        """
        Rank the final results by popularity.
        :param data_fream: input DataFrame
        :return:
        """
        entities = data_fream['entity']
        entities = list(set(entities))
        if len(entities) >= 1:
            data_fream['score'] = ''
            data_fream['property_score'] = ''
            data_fream['cos_score'] = ''
            for ele in entities:
                if len(ele.split("(")) > 1:
                    # entity of the form '张杰(歌手)': split off the disambiguating suffix
                    ele_temp = ele.split("(")[1].replace(')', "")
                    entity = ele.split("(")[0]  # the entity as it appears in the question
                    important_words = self.tf_idf(ele_temp)
                    important_words = important_words[:math.ceil(len(important_words) * 0.8)]
                    scorce = 0
                    for word in important_words:
                        if word == entity:
                            continue  # a modifier equal to the question entity does not count as a similar word (2017/12/27)
                        try:
                            scorce += self.model.similarity(entity, word)
                        except:
                            scorce = 0
                    data_fream.loc[data_fream['entity'] == ele, ['score']] = scorce
                    # keep the content words of the question (entity removed)
                    property_word = []
                    rest_words = sentence.replace(entity, '')
                    pos_words = jieba.posseg.cut(rest_words)
                    for i in pos_words:
                        if i.flag not in self.unused_pos:
                            property_word.append(i.word)
                    properties = list(data_fream['property'])
                    for pro in properties:
                        # bag-of-words vectors for the property and the question remainder
                        ask_vec = np.zeros(400)
                        query_vec = np.zeros(400)
                        pro_words = '|'.join(jieba.cut(pro)).split("|")
                        for wor in pro_words:
                            try:
                                ask_vec += self.model[wor]
                            except:
                                continue
                        for wor1 in property_word:
                            try:
                                query_vec += self.model[wor1]
                            except:
                                continue
                        cos_simil = self.cosSimil(ask_vec, query_vec)  # +perSimil
                        data_fream.loc[(data_fream['entity'] == ele) & (data_fream['property'] == pro), ['cos_score']] = cos_simil
                else:
                    # entity without a disambiguating suffix: only the property similarity is scored
                    property_word = []
                    rest_words = sentence.replace(ele, '')
                    pos_words = jieba.posseg.cut(rest_words)
                    for i in pos_words:
                        if i.flag not in self.unused_pos:
                            property_word.append(i.word)
                    properties = list(data_fream['property'])
                    for pro in properties:
                        ask_vec = np.zeros(400)
                        query_vec = np.zeros(400)
                        pro_words = '|'.join(jieba.cut(pro)).split("|")
                        for wor in pro_words:
                            try:
                                ask_vec += self.model[wor]
                            except:
                                continue
                        for wor1 in property_word:
                            try:
                                query_vec += self.model[wor1]
                            except:
                                continue
                        cos_simil = self.cosSimil(ask_vec, query_vec)  # +perSimil
                        data_fream.loc[(data_fream['entity'] == ele) & (data_fream['property'] == pro), ['cos_score']] = cos_simil
            # keep the top 30% by property similarity and the top 30% by entity popularity
            fin_data = []
            arclen = math.ceil(len(data_fream) * 0.3)
            fin_data.append(data_fream.sort_values(by='cos_score', ascending=False)[:arclen])
            fin_data.append(data_fream.sort_values(by='score', ascending=False)[:arclen])
            return pd.concat(fin_data).loc[:, ['entity', 'property', 'value']]
        else:
            return data_fream.loc[:, ['entity', 'property', 'value']]
        # entity_score[ele] = scorce
        # entity_score = dict(sorted(entity_score.items(), key=lambda d: d[1], reverse=True))
    def cosSimil(self, v1, v2):
        # cosine similarity; the tiny epsilon avoids division by zero
        return np.dot(v1, v2) / (
            math.sqrt(sum(v1 ** 2)) * math.sqrt(sum(v2 ** 2)) + 1e-33)
    def entity_recognize(self, sentence):
        """
        Recognize the entities in the question, using a series of rules derived
        from the characteristics of the train.json training file.
        :param sentence: the user's question
        :return: the recognized entities
        """
        # anything quoted in 《》, “” or ‘’ is taken as the entity directly
        if re.search('《.*》', sentence) != None:
            return [re.search('《.*》', sentence).group().replace("《", "").replace("》", "")]
        if re.search('“.*”', sentence):
            return [re.search('“.*”', sentence).group().replace("“", "").replace("”", "")]
        if re.search('‘.*’', sentence):
            return [re.search('‘.*’', sentence).group().replace("‘", "").replace("’", "")]
        jieba_cut = "|".join(jieba.cut(sentence)).split("|")
        if "是谁唱的" in sentence or "是谁写的" in sentence or "谁唱" in sentence or "谁写" in sentence:
            # 'who sang/wrote X' questions: the entity is everything before 是/谁
            question_entity = ''
            for e in sentence:
                if e == "是" or e == "谁":
                    break
                question_entity += e
            question_entity = [question_entity]
        else:
            question_entity = self.nlp.ner(sentence)  # Stanford NER result, with its own tokenization
            pos_jieba = jieba.posseg.cut(sentence)
            tf_idf = jieba.analyse.extract_tags
            JIE = tf_idf(sentence)
            if len(jieba_cut) < len(question_entity):  # jieba produced fewer tokens than Stanford
                # merge Stanford's finer-grained NER tokens back into jieba's words
                final_words = []
                for ele in jieba_cut:
                    tem_word = ''
                    flag = False
                    for el in question_entity:
                        if el[0] in ele:
                            if el[1] != 'O' and el[1] != 'NT' and el[1] != 'NUMBER':
                                flag = True
                            tem_word += el[0]
                    if flag == True:
                        final_words.append(tem_word)
                question_entity = final_words
            else:
                question_entity = self.entity_re.entity_connect(question_entity)
            if len(question_entity) == 0:
                # fall back to Stanford POS: take proper nouns (NR)
                stanford_pos = self.nlp.pos_tag(sentence)
                for wor in stanford_pos:
                    if wor[1] in self.stanford_pos:
                        question_entity = [wor[0]]
            if len(question_entity) == 0:
                # fall back to jieba POS tags
                for i in pos_jieba:
                    if i.flag in self.jieba_pos:
                        question_entity.append(i.word)
        # Adjacent tokens are connected into one entity and looked up in the KB,
        # shrinking the span step by step. If the sentence contains no entity at
        # all, candidates must be sought in m2e, taking nouns ('NN') as fallbacks.
        if len(question_entity) == 0:
            jieba_entity = []
            jieba_pos = jieba.posseg.cut(sentence)
            for i in jieba_pos:
                if i.flag in self.jieba_pos:
                    jieba_entity.append(i.word)
            question_entity = jieba_entity
        if len(question_entity) == 0:
            # jieba's keywords fit Chinese better than Stanford's tokenization
            words_tag_jieba = JIE[:math.ceil(len(JIE) * 0.3)]
            question_entities = []
            try:
                words_tag = self.nlp.pos_tag("".join(words_tag_jieba))
                if len(words_tag_jieba) < len(words_tag):
                    final_words = []
                    for ele in words_tag_jieba:
                        tem_word = ''
                        for el in words_tag:
                            if el[0] in ele:
                                tem_word += el[0]
                        final_words.append(tem_word)
                    question_entity = final_words
                else:
                    for value in words_tag:
                        if value[1] in self.stanford_pos:
                            question_entities.append(value[0])
                    question_entity = question_entities
            except:
                return []  # the original returned 0 here, which the caller cannot iterate
        if len(question_entity) == 0:
            # last resort: take the top tf-idf keyword (or the whole sentence)
            tf_idf = jieba.analyse.extract_tags
            JIE = tf_idf(sentence)
            if len(JIE) == 0:
                JIE = [sentence]
            extract = {}  # would hold [entity, property, value] extracted from the question (unused here)
            question_entity.append(JIE[0])
        # question_entity = self.connect_entity(jieba_cut, question_entity)
        return question_entity
    def connect_entity(self, question, question_entity):
        prio = []
        real_enity = []
        for question_e in question_entity:
            if question_e in question:
                prio.append(question.index(question_e))
        k = 1
        # print(question_entity)
        while k
The rest is covered in part two.