Automated Content Generation: How a Language Model Could Help You Win the Race of SEO

Natural Language Processing

Imagine you are starting a new business. You have an amazing product that could potentially disrupt a 100-million-dollar industry, but virtually nobody knows about it. How would you get your first set of early adopters?


The easy answer is to make your content discoverable on the internet — getting into the top results on the first page of the commonly used search engines would do wonders for your website’s organic traffic. However, it has never been harder to achieve such a goal, as simply flooding the web with the same content will not bump your rank on modern search engines. A better way would be to spin some paraphrased content from the same base, but that would involve heavy human labor that could be costly for a new business like yours. What would be the next best alternative?


What if I told you that you could put the entire process on automation? With the advancement of Natural Language Processing, we are able to automatically find synonyms for some words in an article, creating a spun version of the original content. In this blog post, we will use a relatively simple language model, the pentagram (a 5-gram), to build a content spinner that takes in a collection of content texts and produces their corresponding paraphrased counterparts.


General Idea

You probably already noticed that paraphrasing a sentence is just replacing some of the words with their synonyms, which lets the sentence keep a similar meaning. Our pentagram model operates on the same principle — we aim to replace a word that occurs between some context words with other words that appear within the same context. In doing so, we assume that words appearing between the same surrounding words have the same or a similar meaning. We will then use a large text corpus to find the probability of each word appearing between those context words and sample the replacement word from that distribution.


For example, suppose we have the following content


I like eating ice cream

After we go through the entire text corpus, we find that the words “eating”, “making”, and “having” appear in this context with probabilities of 0.2, 0.3 and 0.5 respectively. We then randomly sample from these 3 words according to their distribution — there is a 50% chance we will get “having” as the replacement for the original “eating”.


"I like eating ice cream" -> Context: (I, like, ice, cream)Possible words for this context,eating: 0.2 probability
making: 0.3 probability
having: 0.5 probabilitySample a word from the above 3 choices, the original sentence will become"I like having ice cream" (50% chance)
"I like making ice cream" (30% chance)
"I like eating ice cream" (20% chance)

Now that we understand the general idea of the automated content spinning process, let's discuss the implementation details of this fascinating tool.


The Text Corpus Loader

We will start by loading in the text corpus as word tokens, which will be used to find all the context words. The text data here is the same Johns Hopkins University multi-domain sentiment dataset used in my Text Vectorizations post, enriched with a Wikipedia extract sample so that we have enough words to capture sufficient samples within a given context and the spinner can generate better-fitting paraphrased content. The loader starts by reading the text data from local directories using relative paths. The review data comes in different categories, which requires us to save it as a dictionary; the Wikipedia data can simply be read in as strings.


Then we will use NLTK’s word tokenizer to transform the entire text corpus into tokens. Due to the enormous size of the text data, we will use a caching strategy to avoid a long processing time for this tokenization task. The resulting tokens preserve the ordering of the original sentences, which makes finding the correct context possible.


Notice that here we DO NOT perform any of the usual preprocessing steps (lowercasing, lemmatization, stop-word removal, etc.), since we want to preserve the original structure and meaning of the input content. Removing stop words from the possible pentagrams would make it quite challenging for the downstream implementation to recover the original text, unless we stored the indices of every word that got preprocessed away. The full implementation of the LoadTextCorpus class is the following.

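A minimal sketch of such a loader is shown here, assuming the review categories live in subdirectories of a local data folder and that the tokens are cached with pickle; the file paths, filenames, and cache location are placeholders and the actual class may differ.

import os
import pickle
from nltk.tokenize import word_tokenize  # requires NLTK's punkt models (nltk.download('punkt'))

class LoadTextCorpus:
    """Load the review and Wikipedia text, tokenize it, and cache the tokens to disk."""

    def __init__(self, review_dir="data/sorted_data", wiki_path="data/wiki_sample.txt",
                 cache_path="cache/tokens.pkl"):
        # Relative paths to the raw text data; adjust to your local layout
        self.review_dir = review_dir
        self.wiki_path = wiki_path
        self.cache_path = cache_path

    def _read_reviews(self):
        # Each category lives in its own subdirectory; keep them in a dict keyed by category
        reviews = {}
        for category in os.listdir(self.review_dir):
            cat_dir = os.path.join(self.review_dir, category)
            if not os.path.isdir(cat_dir):
                continue
            texts = []
            for fname in os.listdir(cat_dir):
                if fname.endswith(".review"):
                    with open(os.path.join(cat_dir, fname), encoding="utf-8", errors="ignore") as f:
                        texts.append(f.read())
            reviews[category] = " ".join(texts)
        return reviews

    def _read_wiki(self):
        # The Wikipedia extract is read in as a single string
        with open(self.wiki_path, encoding="utf-8", errors="ignore") as f:
            return f.read()

    def load_tokens(self):
        # Tokenizing the full corpus is slow, so cache the token list with pickle
        if os.path.exists(self.cache_path):
            with open(self.cache_path, "rb") as f:
                return pickle.load(f)
        full_text = " ".join(self._read_reviews().values()) + " " + self._read_wiki()
        # No lowercasing, lemmatization, or stop-word removal: preserve the original structure
        tokens = word_tokenize(full_text)
        os.makedirs(os.path.dirname(self.cache_path), exist_ok=True)
        with open(self.cache_path, "wb") as f:
            pickle.dump(tokens, f)
        return tokens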

The Pentagram Model

As we are attempting to sample from the word distribution of a surrounding context, we first need to have such a distribution in place for all possible word contexts. For any word in the content, we look at the two preceding and two trailing words as the context. One context can have different words mapping to it, so we save the result as a dictionary with the context tuples as keys. Using the above example, the dictionary will look like the following,


{("I", "like", "ice", "cream"):{"eating": 0.2, "making": 0.3, "having": 0.5}, ...}

As the documents get longer and longer, more and more context combinations will be inserted into this dictionary. We will loop through the entire token set to find all possible contexts and append each occurrence of the middle word to a list, which is implemented in a method called get_pentagrams

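A sketch of how such a get_pentagrams step could collect the middle words for each 4-word context (the function form shown here is an assumption; the post wraps this logic in a class):

from collections import defaultdict

def get_pentagrams(tokens):
    # Map each (w-2, w-1, w+1, w+2) context to the list of middle words observed there
    pentagrams = defaultdict(list)
    for i in range(2, len(tokens) - 2):
        context = (tokens[i - 2], tokens[i - 1], tokens[i + 1], tokens[i + 2])
        pentagrams[context].append(tokens[i])
    return pentagrams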

Then we will aggregate all the occurrences of each middle word and divide by the total number of word occurrences for that particular context, giving the probability of each middle word occurring within the context. The operation is implemented in the build method below.


For the same speed considerations, we will employ the caching strategy again to avoid long processing times. We will save the above-described dictionary to a local pickle file. Once that file exists, the pentagram model dictionary will be loaded directly instead of running through the long for loop. The full implementation is shown below.

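A minimal sketch of the build step and the pickle-based caching, assuming it takes the dictionary produced by the get_pentagrams sketch above (the cache path and function form are placeholder assumptions):

import os
import pickle
from collections import Counter

def build(pentagrams, cache_path="cache/pentagram_probs.pkl"):
    # Reuse the cached probabilities when they already exist to skip the long loop
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    probs = {}
    # Turn the raw middle-word counts into a probability distribution per context
    for context, middle_words in pentagrams.items():
        counts = Counter(middle_words)
        total = sum(counts.values())
        probs[context] = {word: count / total for word, count in counts.items()}
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(probs, f)
    return probs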

Now that we have all the possible words for every given context, we can build a module that generates the spun content by sampling from this distribution.


The Content Spinner

Imagine we are looking at a specific word and its context — how should we find a replacement word that fits that exact same context? Since we already have the word distribution for this context, we can use the choice method from numpy’s random module. We first check whether the surrounding context exists in the precomputed pentagram probabilities, and return the original word if it doesn’t.

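A hedged sketch of what such a predict_word step could look like, with the probabilities dictionary and the function signature as assumptions:

import numpy as np

def predict_word(probs, context, original_word):
    # Fall back to the original word when the context was never seen in the corpus
    if context not in probs:
        return original_word
    distribution = probs[context]
    words = list(distribution.keys())
    weights = list(distribution.values())
    # Sample a replacement word according to its probability within this context
    return np.random.choice(words, p=weights)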

Using the above predict_word method, we are able to perform the paraphrasing for the entire content. For every sentence, we need to use the word tokenizer to split it into word tokens. We will set up the generate_spinned_content method, which takes in a list of content and returns the spun content


The full implementation of the ContentSpinner class is shown below.

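A minimal, self-contained sketch of how the class could be put together, folding the predict_word logic from above into it; the constructor argument and the plain space-joined output are simplifying assumptions, so the original implementation may differ.

import numpy as np
from nltk.tokenize import word_tokenize

class ContentSpinner:
    def __init__(self, pentagram_probs):
        # Dictionary mapping a 4-word context tuple to {middle_word: probability}
        self.probs = pentagram_probs

    def predict_word(self, context, original_word):
        # Return the original word when the surrounding context is unknown
        if context not in self.probs:
            return original_word
        distribution = self.probs[context]
        words = list(distribution.keys())
        weights = list(distribution.values())
        return np.random.choice(words, p=weights)

    def generate_spinned_content(self, contents):
        # Take a list of texts and return their spun counterparts
        spinned = []
        for content in contents:
            tokens = word_tokenize(content)
            new_tokens = list(tokens)
            for i in range(2, len(tokens) - 2):
                context = (tokens[i - 2], tokens[i - 1], tokens[i + 1], tokens[i + 2])
                new_tokens[i] = self.predict_word(context, tokens[i])
            # Simple space join for the sketch; a real implementation would detokenize properly
            spinned.append(" ".join(new_tokens))
        return spinned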

That is it! We can now give the ContentSpinner a list of texts and see what it comes up with.


How would the spun content look?

We will randomly select 5 sets of texts and compare the original and spun contents. We can use the following code snippet to accomplish such a task.

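A minimal version of such a comparison, assuming contents holds the original texts and spinner is an instance of the ContentSpinner sketched above (both names are illustrative):

import random

# Pick 5 random texts and print the original next to the spun version
sample_texts = random.sample(contents, 5)
spinned_texts = spinner.generate_spinned_content(sample_texts)
for original, spinned in zip(sample_texts, spinned_texts):
    print("Original:", original)
    print("Spinned :", spinned)
    print("-" * 40)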

It turns out we get some rather successful spun content, with multiple consecutive words replaced by the model.


“extremely pleased” got replaced with “very happy”, while the rest of the content did not change much. Here is a slightly longer one,


“I decided it was time to upgrade” becomes “I decided it is necessary to go” (wow!). Using a larger surrounding context gives the model the ability to capture the potential “synonym” distributions more accurately. If we only used a surrounding context of one word on each side (a trigram), we would probably not get results as good as these.


But as you can see, there are also some epic failures in the spun content, where the meaning got completely twisted by the content spinner. The sentence “...aesthetically pleasing with the exception of sub unit…” becomes “…aesthetically pleasing with the results of the political unit…” (lol). This is due to a limitation of our pentagram model — since the replacement word is simply one of the possible middle words within that context, there is no guarantee that all of those words are synonyms. In fact, assuming that all words appearing between a certain context have the same meaning is a very aggressive assumption. For example, we can go back to the ice cream example,


"I like eating ice cream" -> Context: (I, like, ice, cream)Possible words for this context,eating: 0.2 probability
making: 0.3 probability
having: 0.5 probability

eating, making, and having are all possible words within the context (I, like, ice, cream), but obviously they mean different things when used in the sentence.


What could we potentially do to improve these results? It turns out we are not entirely out of luck. We will talk about some potential improvements in the final concluding section.


Potential Improvements and Conclusion

When we look at a surrounding context of 4 words in a long sentence, there is a significant chance that the context offers no word choices other than the one already in the sentence. In that case, the original word is not replaced, which is also why the majority of the content remains original. Even when the surrounding context does offer some different word choices, if the choices are limited we are more prone to replace the original word with one that completely alters the meaning of the content. The obvious remedy is to use a larger surrounding context, as more surrounding words would eliminate the erroneous choices from the “possible word distribution”. However, this would in turn make the number of possible words for each (now larger) context smaller.


A better way would be to include the part of speech of each word in the surrounding context. Specifically, if we look at the same ice cream example


"I like eating ice cream" -> Context: (I(pronoun), like(verb), ice(noun), cream(noun))

By doing this we have a more accurate way to distinguish the same word when it belongs to different parts of speech, based on the context it is used in. The word “like”, for example, is a verb in this sentence, but could potentially be an adverb, a conjunction, or even a noun in other sentences.

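A rough sketch of this idea (not something the post implements) could use NLTK’s pos_tag to attach a tag to each context word; the printed tags below are illustrative:

import nltk
from nltk.tokenize import word_tokenize  # pos_tag also needs nltk.download('averaged_perceptron_tagger')

sentence = "I like eating ice cream"
tokens = word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Build the context for the middle word from (word, tag) pairs instead of bare words
i = 2  # position of "eating"
context = (tagged[i - 2], tagged[i - 1], tagged[i + 1], tagged[i + 2])
print(context)
# e.g. (('I', 'PRP'), ('like', 'VBP'), ('ice', 'NN'), ('cream', 'NN'))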

After some patient waiting and some hard work on designing the data structure, we can now use this model to automatically generate content and help our startup gain more coverage on the search engines!


Gain Access to Expert View — Subscribe to DDI Intel


Translated from: https://medium.com/datadriveninvestor/automated-content-generations-how-a-language-model-could-help-you-win-the-race-of-seo-cd65c4a00067
