Designing a Chatbot Using Python: A Modified Approach
Chatbots have become extremely popular. Highly developed assistants like Siri, Cortana, and Alexa have surprised people with their intelligence and capabilities. A chatbot is commonly defined as:
Chatbots can be as simple as rudimentary programs that answer a simple query with a single-line response, or as sophisticated as digital assistants that learn and evolve to deliver increasing levels of personalization as they gather and process information.
In many articles, I have seen a basic comment-response model. My attempt here is to take this a bit further: to handle a broader range of user inputs and to give the chatbot additional capabilities using simple web requests, browser automation, and scraping.
Before we start, let me set expectations about what we are building. We will not attempt a super-smart assistant like Siri, because that would require enormous experience and expertise. Still, it would be pretty cool if our chatbot could help us book a hotel, play a song, report the weather, and so on. We will try to implement all of these features using some basic Python web-handling libraries together with NLTK, Python's Natural Language Toolkit. So, let's start.
Based on their design, chatbots are mainly of two types:
- Rule-based
- Self-learned
We are going to use a combination of the two. In the rule-based approach, a set of ground rules is defined and the chatbot can only operate within those rules. In the self-learned version, neural networks are trained on a set of example interactions to choose a reply to the user. We will use the rule-based approach for the task-oriented parts and the self-learned approach for general conversation. I found this combined approach much more effective than a fully self-learned one.
Working of the NLTK Library
Before we jump into the application, let us look at how NLTK works and how it is used in natural language processing. There are five main components of NLP:
- Morphological and Lexical Analysis
- Syntactic Analysis
- Semantic Analysis
- Discourse Integration
- Pragmatic Analysis
Morphological and lexical analysis: this covers analyzing, identifying, and describing the structure of words. It includes dividing a text into paragraphs, sentences, and words.
Syntactic analysis: words are commonly accepted as the smallest units of syntax. Syntax refers to the principles and rules that govern sentence structure in a given language, and it focuses on the ordering of words, which can affect meaning.
Semantic analysis: this component maps linear sequences of words into structures and shows how the words are associated with each other. Semantics focuses only on the literal meaning of words, phrases, and sentences.
Discourse integration: this is about a sense of context. The meaning of any single sentence depends on the sentences that come before it, and it can also shape the meaning of the sentence that follows.
Pragmatic analysis: this deals with the overall communicative and social context and its effect on interpretation, that is, deriving how language is meaningfully used in a given situation.
Now, let’s talk about the methods or functions used to implement these five components:
Tokenization: the process by which a large quantity of text is divided into smaller parts called tokens. It takes a sentence and decomposes it into its smallest extractable units, usually words.
Part-of-speech tagging: a very useful tool in NLP. If we can tag each word with the part of speech it belongs to (verb, noun, adjective, and so on), it becomes much easier to understand the context of a sentence.
Lemmatization: another very useful operation. Words that share the same meaning but vary with context or inflection are reduced to their root form. This is very important for pattern matching and for rule-based approaches.
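As a quick standalone illustration of these three operations with NLTK (it assumes the punkt, averaged_perceptron_tagger, and wordnet resources have already been fetched with nltk.download):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

sentence = "He was playing songs"
tokens = word_tokenize(sentence)            # ['He', 'was', 'playing', 'songs']
print(nltk.pos_tag(tokens))                 # e.g. [('He', 'PRP'), ('was', 'VBD'), ('playing', 'VBG'), ('songs', 'NNS')]

lem = WordNetLemmatizer()
print(lem.lemmatize("playing", pos="v"))    # 'play'
print(lem.lemmatize("songs"))               # 'song' (default part of speech is noun)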
All these facilities are provided by Python's NLTK library. Now, let's check how the self-learning algorithm works.
Working of the Self-Learning Algorithm
We use a bag-of-words approach to train our model. We have a file that contains all the intents and input patterns. It looks like this:
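(The screenshot from the original post is not reproduced here; the snippet below is an illustrative intents.json built from the fields described next, not the author's exact file.)

{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hello", "Hi there"],
      "responses": ["Hello", "Hi There"]
    },
    {
      "tag": "goodbye",
      "patterns": ["Bye", "See you later"],
      "responses": ["Goodbye!", "Talk to you soon"]
    }
  ]
}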
We can see a 'tag' field, which depicts the intention: the actual subject or motive behind a given sentence. For example, there are intentions like 'introduction', 'thanks', 'greeting', and 'goodbye', which are basically motives. Next, there is a 'patterns' field, which lists the patterns or types of sentences that can carry the corresponding motive. Then there is a 'responses' field containing candidate responses the bot may use when the corresponding motive is detected. For example, for the tag 'greeting', the patterns may be 'Hi' and 'Hello', and the corresponding responses may be 'Hello' and 'Hi There'.
Now, we pick all the unique words from the patterns, lemmatize them, convert them to lowercase, and append them to create a list of words. This list will look something like this:
['hi', 'there', 'how', 'are', 'you', ..., 'what', 'can', 'do']
This will be our bag of words, the vocabulary for model training. We then encode each input sentence against this list to create the vectors that are fed to the model as the training set.
Then we have the tags, which are also put in a list. It looks like this:
['greetings', 'goodbye', 'thanks', ...]
This is also one-hot encoded to create the target data for training our model. Now, how exactly does that work?
First, we take the filtered word list; in our case its length is 47. We create a list of size 47 with all entries set to 0, and then set to 1 the indices corresponding to the words present in the input sentence. For example, if the input sentence is 'Hi there', the list becomes:
[1, 1, 0, 0, 0, ...]
Only the indices corresponding to 'hi' and 'there' are 1; all others are 0. We build such an encoded vector for every input statement in our batch. So, if the batch size is n, we get an n x 47 array, which is the input dimension of the X values of the training set.
Similarly, we build the target set. We have 7 tags, so we create a list of size 7. The index corresponding to the tag of the input statement is set to 1 and all others to 0. For example, for 'Hi there' the tag in our training data is 'greeting', so the index corresponding to the greeting tag is 1 and all others are 0. Again, if the batch size is n, we get an n x 7 array, which is the dimension of the Y values, the targets of the training set.
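To make the encoding concrete, here is a minimal standalone sketch (the 5-word vocabulary and 3 tags are truncated stand-ins for the article's 47 words and 7 tags, so the vectors here are shorter than in the real dataset):

# truncated stand-ins for the real 47-word vocabulary and 7-tag list
vocab = ['hi', 'there', 'how', 'are', 'you']
tags = ['greetings', 'goodbye', 'thanks']

def encode(items, reference):
    # 1 where the reference entry appears in items, 0 elsewhere
    return [1 if ref in items else 0 for ref in reference]

x = encode(['hi', 'there'], vocab)   # -> [1, 1, 0, 0, 0]
y = encode(['greetings'], tags)      # -> [1, 0, 0]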
Every sentence that appears in the patterns of all the intents together makes up our total dataset, and the model is trained on it.
Usage: when a user inputs a sentence or statement, it is also tokenized, lemmatized, and converted to lowercase. After this preprocessing, we encode the user input against our vocabulary to form a 47-length vector, feed it to the model, and obtain the predicted tag of the user statement. We then pick a random response for that tag from our response lists.
Application
Let's jump into the code:
from nltk.tokenize import word_tokenize
import nltk
from nltk.stem import WordNetLemmatizer
import re

lem = WordNetLemmatizer()


def filter_command(phrase):
    tokens = []
    tok = word_tokenize(phrase)
    for t in tok:
        tokens.append(t.lower())
    tags = nltk.pos_tag(tokens)

    work = []
    work_f = []
    subject = []
    number = []
    adj = []
    query = []
    task = 0  # default: no task detected, hand over to the self-learned model
    for tup in tags:
        if "VB" in tup[1]:
            work.append(tup[0])
        if "CD" in tup[1]:
            number.append(tup[0])
        if "JJ" in tup[1]:
            adj.append(tup[0])
        if "NN" in tup[1]:
            subject.append(tup[0])
        if "W" in tup[1] and "that" not in tup[0]:
            query.append(tup[0])
    for w in work:
        work_f.append(lem.lemmatize(w.lower()))
    if query:
        if "you" in tokens or "your" in tokens:
            task = 0
        # 'and' (not 'or') so that the weather/news/play branches below stay reachable
        elif 'weather' not in tokens and 'news' not in tokens and 'headlines' not in tokens:
            task = 1
        elif 'play' in work_f or 'song' in subject or 'play' in subject:
            task = 2
        elif 'book' in work_f or 'book' in tokens[0]:
            task = 3
        elif 'weather' in subject:
            task = 4
        elif 'news' in subject or 'headlines' in subject:
            task = 5
    else:
        if '?' in tokens and 'you' not in tokens and 'your' not in tokens:
            task = 1
        else:
            task = 0
    return task, work_f, subject, number, adj, query
This is our main library code, the meeting point between the rule-based and self-learned parts. The user's message is picked up here and classified; in other words, it decides whether the statement assigns a task or is just casual conversation.
We first tokenize the statement and tag the parts of speech. If it is a question, it will contain a question mark or a 'wh' word; when these characteristics are detected, the statement is classified as a query and handled accordingly. Similarly, if the user asks for the weather or the news, those terms appear as subjects of the statement. If the statement is neither a defined query nor a task, the task is set to 0 and handed over to our self-learned model, which then tries to classify it. A couple of example calls are sketched below.
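For illustration, here is how two hypothetical inputs might be classified (the exact outcome depends on how NLTK's tagger labels each word):

task, work_f, subject, number, adj, query = filter_command("What is the weather in Delhi?")
# 'what' lands in query and 'weather' in subject, so task should come back as 4 (weather report)

task, *_ = filter_command("Hello, how are you doing?")
# 'you' is present, so task is 0 and the message is handed to the self-learned model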
from create_response import response

flag = 0
while flag == 0:
    user_response = input()
    flag = response(user_response)
This is our caller module, which runs in a while loop until the response function sets the flag to 1.
import json
import random

from preprocess_predicted import predict_c
from lib import filter_command
from support import search_2, play, book, get_weather, get_news_update, scrape


def get_res(res):
    with open('intents.json') as file:
        intents = json.load(file)
    tag = res[0]['intent']
    all_tags = intents['intents']
    for tags in all_tags:
        if tags['tag'] == tag:
            result = random.choice(tags['responses'])
            break
    return result, tag


def response(phrase):
    flag_1 = 0
    task, work_f, subject, number, adj, query = filter_command(phrase)
    if task == 0:
        res_1 = predict_c(phrase)
        result, tag = get_res(res_1)
        if tag == "noanswer":
            result = "Here are some google search results"
            search_2(subject, phrase)
        if tag == 'goodbye':
            flag_1 = 1
    elif task == 1:
        scrape(phrase)
        result = "Here are some results"
    elif task == 2:
        play(phrase, subject)
        result = "Here you go"
    elif task == 3:
        book(phrase)
        result = "Here are some results"
    elif task == 4:
        get_weather()
        result = "Here are the results"
    elif task == 5:
        get_news_update()
        result = "Here are the results"
    else:
        result = "Sorry, I don't think I understand"
    print(result)
    return flag_1
This is our main response-creator code. It receives the user sentence from the caller and calls filter_command() to check whether a classified task is detected. There are six tasks (0-5), with 0 meaning no task detected. Any 'wh' query or statement carrying a '?' is handled by the scrape() function (task 1). Task 2 plays a YouTube video, task 3 books a ticket or a room, task 4 fetches the weather report, and task 5 gives a news update. Task 0 invokes the self-learning model.
All the task handlers are defined in a support file. They mostly use Selenium automation and web scraping. Let me show three of them.
# support.py -- imports needed by the handlers shown below
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def play(message, subject):
    result = message
    ext = "https://www.youtube.com/results?search_query="
    messg = result.replace(" ", "+")
    msg = ext + messg

    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 3)
    presence = EC.presence_of_element_located
    visible = EC.visibility_of_element_located

    driver.get(msg)
    wait.until(visible((By.ID, "video-title")))
    names = driver.find_elements_by_id("video-title")
    i = 0
    for name in names:
        print(name.text)
        if len(subject) == 2:
            s = subject[1]
        else:
            s = subject[0]
        if s not in name.text.lower():
            i += 1
            continue
        else:
            break
    print(i)
    driver.find_elements_by_id("video-title")[i].click()
    url = driver.current_url
    time_prev = int(round(time.time()))
    # read the duration label of the matched result via a (brittle) absolute XPath
    s = driver.find_elements_by_xpath("//html/body/ytd-app/div/ytd-page-manager/ytd-search/div[1]/ytd-two-column-search-results-renderer/div/ytd-section-list-renderer/div[2]/ytd-item-section-renderer/div[3]/ytd-video-renderer[1]/div[1]/ytd-thumbnail/a/div[1]/ytd-thumbnail-overlay-time-status-renderer/span")[0].text
    time_k = int(s.split(':')[0]) * 60 + int(s.split(':')[1])
    boo = True
    while boo:
        # quit the browser a couple of seconds after the video length has elapsed
        time_now = int(round(time.time()))
        if time_now - time_prev == int(time_k + 2):
            driver.quit()
            boo = False
This is the YouTube automation. It searches for a video and walks through the result titles, looking for the subject extracted from our message. This is done to skip the advertisement videos that appear at the top of the result list, so we need to be careful to use the correct name in our statement. Another problem is the 's' value: it is supposed to hold the video length, read via XPath, so that we can close the driver once the video finishes. But because YouTube keeps changing its page structure, this sometimes raises errors. It could be solved with the official API, but I preferred not to use it.
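For reference, the response module ends up calling this handler roughly as follows; the sentence and subject below are illustrative, and the subject list normally comes from filter_command:

# e.g. the user says "Play shape of you"
# filter_command would typically place 'shape' into subject
play("Play shape of you", ["shape"])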
Next is the news scraper:
import requests
from bs4 import BeautifulSoup


def get_news_update():
    url = "https://www.telegraphindia.com/"
    news = requests.get(url)
    news_c = news.content
    soup = BeautifulSoup(news_c, 'html5lib')
    # the site marks its two headline styles with two different CSS classes
    headline_1 = soup.find_all('h2', class_="fs-32 uk-text-1D noto-bold uk-margin-small-bottom ellipsis_data_2 firstWord")
    headline_2 = soup.find_all('h2', class_="fs-20 pd-top-5 noto-bold uk-text-1D uk-margin-small-bottom ellipsis_data_2")
    headlines = []
    for i in headline_1:
        h = i.get_text()[1:]
        headlines.append(h)
    for i in headline_2:
        h = i.get_text()[1:]
        headlines.append(h)
    for i in headlines:
        # collapse whitespace before printing
        l = i.split()
        new_string = " ".join(l)
        print(new_string)
This is the headline scraper. We use The Telegraph as the source of our news headlines. The page has two types of headlines, marked by two CSS classes; we scrape both and extract the text with BeautifulSoup.
Lastly, let's look at the query-answering code, which is also a kind of scraper.
import time
import webbrowser

import requests
from bs4 import BeautifulSoup
# googlesearch here is the module provided by the 'google' PyPI package
# (an assumption about the author's environment)
from googlesearch import search


def scrape(phrase):
    flag = 0
    ext = "https://www.google.com/search?q="
    links = search(phrase, num=5, stop=5, pause=2)
    msg = phrase.replace(" ", "+")
    url = ext + msg
    i = 0
    for link in links:
        i += 1
        if 'wikipedia' in link:
            flag = 1
            l = link
            break
    if flag == 1:
        wiki = requests.get(l)
        wiki_c = wiki.content
        soup = BeautifulSoup(wiki_c, 'html.parser')
        data = soup.find_all('p')
        print("Source: wikipedia")
        print(data[0].get_text())
        print(data[1].get_text())
        print(data[2].get_text())
        print(data[3].get_text())
    else:
        print("wikipedia source not available")
        print("Providing search results")
        webbrowser.open(url, new=1)
        time.sleep(3)
Here we obtain the top results from a Google search; if a Wikipedia page is among them, we scrape it and print the first four paragraphs of the page. If no Wikipedia link is found, we simply open a basic Google search in the browser.
Now, let’s move to the self-learning model part.
First, the data preprocessing part:
import json
import pickle

import nltk
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()


def preprocess():
    words = []
    words_f = []
    tags = []
    docs = []
    ignore_words = ['?', '!', ',', '.']
    with open('intents.json') as file:
        intents = json.load(file)
    for i in intents["intents"]:
        for pattern in i['patterns']:
            w = nltk.word_tokenize(pattern)
            words.extend(w)
            docs.append((w, i['tag']))
            if i['tag'] not in tags:
                tags.append(i['tag'])
    for w in words:
        if w in ignore_words:
            continue
        else:
            w_1 = lem.lemmatize(w.lower())
            words_f.append(w_1)
    words_f = sorted(list(set(words_f)))
    pickle.dump(words_f, open('words.pkl', 'wb'))
    pickle.dump(tags, open('tags.pkl', 'wb'))
    return words_f, tags, docs
This code handles the preprocessing. All the patterns under all tags are tokenized, and the resulting words are put into a list, so this list contains repeated values. The corresponding tags are saved in the 'tags' list. The docs list stores tuples in the format (tokenized_words, tag), for example (['hi', 'there'], 'greeting').
The words_f list is the cleaned-up version of this list: lemmatized, lowercased, and with no repeated words.
import random

import numpy as np
from nltk.stem import WordNetLemmatizer

from preprocessing import preprocess

lem = WordNetLemmatizer()


def data_creator():
    words, tags, docs = preprocess()
    out_array = [0] * len(tags)
    train = []
    for doc in docs:
        bag_of_words = []
        patt = doc[0]       # the tokenized-words part of the (words, tag) tuple
        patt_f = []
        for pat in patt:
            p = lem.lemmatize(pat.lower())
            patt_f.append(p)
        # creating the vector of words
        for word in words:
            if word in patt_f:
                bag_of_words.append(1)
            else:
                bag_of_words.append(0)
        # one-hot encode the tag of this pattern
        output_req = list(out_array)
        output_req[tags.index(doc[1])] = 1
        train.append([bag_of_words, output_req])
    random.shuffle(train)
    # dtype=object keeps the [bag, one-hot] pairs intact on newer NumPy versions
    train = np.array(train, dtype=object)
    X_train = list(train[:, 0])
    Y_train = list(train[:, 1])
    np.save('X_train.npy', X_train)
    np.save('Y_train.npy', Y_train)
Here the bag-of-words vectors are formed. This snippet takes the tokenized words from each doc entry to create the encoded vectors that make up X_train, and encodes the corresponding tags to form Y_train, the target values of our training set.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam

from training_data_creation import data_creator


def model_c():
    data_creator()
    X = np.load('X_train.npy')
    Y = np.load('Y_train.npy')
    model = Sequential()
    model.add(Dense(32, input_shape=(len(X[0]),), activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(len(Y[0]), activation='softmax'))
    adam = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model, X, Y
The function returns our model along with the training inputs (X) and targets (Y). The model has three dense hidden layers (32, 64, and 128 units) with dropout, uses categorical cross-entropy as the loss function, softmax activation on the final layer, and the Adam optimizer.
import tensorflow as tf

from model import model_c


def train():
    callback = tf.keras.callbacks.ModelCheckpoint(filepath='chatbot_model.h5',
                                                  monitor='loss',
                                                  verbose=1,
                                                  save_best_only=True,
                                                  save_weights_only=False,
                                                  mode='auto')
    model, X, Y = model_c()
    model.fit(X, Y, epochs=500, batch_size=16, callbacks=[callback])


train()
The model is trained for 500 epochs, with the ModelCheckpoint callback saving only the best weights (by training loss).
import pickle

import nltk
import numpy as np
import tensorflow as tf
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()


def preprocess(phrase):
    words = nltk.word_tokenize(phrase)
    words_f = []
    for word in words:
        w = lem.lemmatize(word.lower())
        words_f.append(w)
    return words_f


def bag_of_words(phrase, words):
    obt_words = preprocess(phrase)
    bag_of_words = [0] * len(words)
    for o in obt_words:
        for w in words:
            if o == w:
                bag_of_words[words.index(w)] = 1
    b_n = np.array(bag_of_words)
    return b_n


def predict_c(phrase):
    model = tf.keras.models.load_model('chatbot_model.h5')
    words = []
    with open("words.pkl", "rb") as openfile:
        while True:
            try:
                words.append(pickle.load(openfile))
            except EOFError:
                break
    tags = []
    with open("tags.pkl", "rb") as openfile:
        while True:
            try:
                tags.append(pickle.load(openfile))
            except EOFError:
                break
    to_pred = bag_of_words(phrase, words[0])
    pred = model.predict(np.array([to_pred]))[0]
    threshold = 0.25
    results = [[i, r] for i, r in enumerate(pred) if r > threshold]
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": tags[0][r[0]], "prob": str(r[1])})
    return return_list
The above code preprocesses a new user statement for prediction: it tokenizes, lowercases, and lemmatizes the words, forms the encoded vector using our vocabulary, and sends it to the model. The bot's response then depends on the predicted tag.
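As a quick sanity check after training, one might call the predictor directly; the probability shown is illustrative, not a guaranteed output:

preds = predict_c("Hi there")
print(preds)
# e.g. [{'intent': 'greeting', 'prob': '0.93'}]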
This is the overall operation of our bot.
Results
A short video demo shows the bot in use and in action. I suppose there are some CUDA version issues on my machine, so I got a few warnings, but that is nothing to worry about.
Challenges
There are a few drawbacks and challenges in this field. For example, in POS tagging the word 'play' is sometimes tagged as a noun and sometimes as a verb; the same issue exists with 'book'. I have handled these exceptions here, but in large real-world scenarios such ambiguities matter. Things are less fragile with self-learned models, but their challenge is that they need a huge training set that has to be designed manually. That is why chatbots are usually built for specific purposes, such as handling front-office client complaints and interactions up to a certain level and recording the issues. Still, they are developing at a very fast rate, and I hope we will soon see far more capable bots.
Conclusion
In this article, we looked at one way a basic, plain chatbot can be given some extra features. I hope this helps.
Here is the Github link.
Original article: https://towardsdatascience.com/designing-a-chatbot-using-python-a-modified-approach-96f09fd89c6d