The Best Document Similarity Algorithm in 2020: A Beginner's Guide
If you want to know the best algorithm for the document similarity task in 2020, you’ve come to the right place.
With 33,914 New York Times articles, I’ve tested 5 popular algorithms for the quality of document similarity. They range from a traditional statistical approach to a modern deep learning approach.
Each implementation is less than 50 lines of code. And all of the models used are taken from the Internet. So you will be able to use them out of the box with no prior data science knowledge, while expecting similar results.
In this post, you’ll learn how to implement each algorithm and how the best one is chosen. The agenda goes by:
- Defining the best;
- Experiment goal statement;
- Data setup;
- Comparison criteria;
- Algorithm setup;
- Picking the winner;
- Suggestion for starters;
You want to dip your toes into natural language processing and AI. You want to spice up the user experience with relevant suggestions. You want to upgrade the old existing algorithm. Then you will love this post.
Shall we get started?
Data Scientists Argue For The Absolute Best
You decide to search for the term “best document similarity algorithms”.
Then you will get search results ranging from academic papers and blogs to Q&A sites. Some focus on a tutorial for a specific algorithm, while others focus on a theoretical overview.
In academic papers, a headline says this algorithm achieved 80% accuracy, beating all the others that achieved only 75%. Ok. But is that difference enough to be noticeable in our eyes? What about a 2% increase? How easy is it to implement that algorithm? Scientists have a bias towards going for the best score on a given test set, leaving out the practical implications.
In Q&A, hyped enthusiasts dominate the conversation. Some say the best algorithm today is BERT; the concept is so revolutionary it beats everything else. The cynics, on the other hand, say everything depends on the job. Some answers are from years ago, predating deep learning. Take a look at this Stackoverflow question. When the most-voted answer was written in 2012, it’s hard to judge what it truly means to us.
Google would be happy to throw millions of dollars of engineering power and the latest computational power at improving their search by just 1%. That will likely be neither practical nor meaningful for us.
What is the trade-off between the performance gain and the technical expertise required for implementation? How much memory does it require? How fast does it run with minimal preprocessing?
What you want to see is how one algorithm is better than another in a practical sense.
This post will provide you with a guideline as to which algorithm to implement for your next document similarity problem.
Diverse Algorithms, Full-Length Popular Articles, Pretrained Models
There are 4 goals in this experiment:
- By running multiple algorithms on the same dataset, you will see how each algorithm fares against another and by how much.
- By using full-length articles from popular media as our dataset, you will discover the effectiveness of real-world applications.
- By accessing article URLs, you will be able to compare the differences in result quality.
- By only using the pretrained models available publicly, you will be able to set up your own document similarity and expect similar output.
“Pre-trained models are your friend.”
- Cathal Horan, machine learning engineer at Intercom
Data Setup — 5 Base Articles
For this experiment, 33,914 New York Times articles were selected. They span from 2018 to June 2020. The data was mostly collected from the RSS feed, parsed with full content. The average article length is 6,500 characters.
From the pool, we choose 5 as the basis for similarity search. Each represents a different category.
On top of the semantic categories, we will measure writing formats as well. More details are below.
How My Worst Date Ever Became My Best (Lifestyle, Human Interest)
A Deep-Sea Magma Monster Gets a Body Scan (Science, Informational)
Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled (Business, News)
Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal (Sports, News)
2020 Democrats Seek Voters in an Unusual Spot: Fox News (Politics, News)
Judgment Criteria
We will use 5 criteria to judge the nature of similarities. Please skip this section if you just want to see the results.
- Tag Overlap
- Section
- Subsections
- Story Style
- Theme
Tags are the closest proxy to human judgment of content similarity. Journalists themselves write down the tags by hand. You can inspect them in the news_keywords meta tags in the HTML headers. The best part of using tags is that we can objectively measure how much overlap two pieces of content have. Each article carries anywhere from 1 to 12 tags. The more tags 2 articles share, the more similar they are.
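As an illustration (not part of the experiment code), here is a minimal sketch of how tag overlap could be counted from the news_keywords meta tag, assuming the article HTML is on hand and BeautifulSoup is installed:

```python
from bs4 import BeautifulSoup

def extract_tags(html):
    # Read the comma-separated news_keywords tags from an article's HTML header.
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "news_keywords"})
    if meta is None:
        return set()
    return {tag.strip().lower() for tag in meta["content"].split(",")}

def tag_overlap(html_a, html_b):
    # The number of shared tags is our objective proxy for human-judged similarity.
    return len(extract_tags(html_a) & extract_tags(html_b))
```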
Second, we look at the section. That’s how the New York Times categorizes articles at the highest level: science, politics, sports, etc. The first part of the URL displays the section (or slug) right after the domain (nytimes.com/…).
The third is the subsection. For example, an opinion section can be subdivided into world, or world into Australia. Not all articles contain it, and it is not as significant as the other two.
The fourth is the style of writing. Most analyses of document comparison only look at the semantics. But since we are comparing recommendations in a practical use case, we want similar writing styles as well. For example, you do not want to get a commercially focused read about “the top 10 running shoes” right after “running shoes and orthotics” from an academic journal. We will group articles based on the writing guidelines taught at Jefferson County Schools. The list covers human interest, personality, the best (e.g., product review), news, how-to, past events, and informational.
5 Algorithm Candidates
These are the algorithms we will look at.
- Jaccard
- TF-IDF
- Doc2vec
- USE
- BERT
Each algorithm was run against 33,914 articles to find the top 3 articles with the highest scores. That process was repeated for each of the base articles.
The input was the article content in full length. Titles were ignored.
Be aware that some of these algorithms were not built with document similarity in mind. But with such diverse opinions on the internet, we shall see the outcome with our own eyes.
We will not focus on conceptual understanding, nor on a detailed code review. Rather, the aim is to demonstrate how easy the setup is. Don’t worry if you do not understand every detail explained here; it’s not important for following along with the post. To understand the theory, check the reading list at the bottom for the excellent blogs written by others.
You can find the entire codebase in the Github repo.
If you just want to see the results, skip this section.
Jaccard
Paul Jaccard proposed this formula over a century ago. And the concept has been the standard go-to for similarity tasks for a long time.
Luckily, you will find Jaccard the easiest algorithm to understand. The math is straightforward, with no vectorization. And it lets you write the code from scratch.
Also, Jaccard is one of the few algorithms here that does not use cosine similarity. It tokenizes the words and calculates the intersection over union.
We use NLTK to preprocess the text.
Steps:
- Lowercase all text
- Tokenize
- Remove stop words
- Remove punctuation
- Lemmatize
- Calculate intersection/union in 2 documents
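For reference, a minimal sketch of those preprocessing steps with NLTK could look like the following (the repo’s actual preprocess helper may differ slightly):

```python
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # Requires nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet") once.
    # Lowercase, tokenize, drop stop words and punctuation, then lemmatize.
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    return [lemmatizer.lemmatize(t) for t in tokens]
```

The Jaccard score itself is then just the intersection over the union of the two token sets: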
```python
def calculate_jaccard(word_tokens1, word_tokens2):
    # Combine both tokens to find union.
    both_tokens = word_tokens1 + word_tokens2
    union = set(both_tokens)
    # Calculate intersection.
    intersection = set()
    for w in word_tokens1:
        if w in word_tokens2:
            intersection.add(w)
    jaccard_score = len(intersection) / len(union)
    return jaccard_score
```
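To turn that score into recommendations, one possible way to rank the pool against a base article looks like this (base_document and documents are assumed to hold the raw article texts, as in the later snippets):

```python
# Score every candidate against the base article and keep the 3 most similar.
base_tokens = preprocess(base_document)
scores = [calculate_jaccard(base_tokens, preprocess(document)) for document in documents]
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
```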
TF-IDF
This is another well-established algorithm that has been around since 1972. With decades of battle testing, it is the default search implementation of Elasticsearch.
Scikit-learn offers a nice out-of-the-box implementation of TF-IDF. TfidfVectorizer lets anyone try it in the blink of an eye.
The TF-IDF word vectors are compared with scikit-learn’s cosine similarity. We will be using this cosine similarity for the rest of the examples. Cosine similarity is such an important concept used in many machine learning tasks that it might be worth your time to familiarize yourself with it (academic overview).
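For intuition, cosine similarity boils down to a one-liner; this sketch is equivalent to what scikit-learn computes for a pair of vectors:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors: dot product over the product of norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```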
Thanks to scikit-learn, this algorithm came out with the fewest lines of code.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def process_tfidf_similarity():
    vectorizer = TfidfVectorizer()
    # To make uniformed vectors, both documents need to be combined first.
    documents.insert(0, base_document)
    embeddings = vectorizer.fit_transform(documents)
    cosine_similarities = cosine_similarity(embeddings[0:1], embeddings[1:]).flatten()
```
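To get the top 3 matches from those scores, a possible continuation is the following (score index i corresponds to documents[i + 1], since the base article was inserted at position 0):

```python
# Indices of the 3 highest-scoring candidate articles.
top_indices = cosine_similarities.argsort()[-3:][::-1]
```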
Doc2vec
Word2vec came out in 2013 and took developers’ breath away at the time. You might have heard of the famous demonstration:
King - Man + Woman = Queen
Word2vec is great at understanding an individual word, but vectorizing a whole sentence takes a long time. Let alone an entire document.
Instead, we will use Doc2vec, a similar embedding algorithm that vectorizes paragraphs instead of each word (2014, Google Inc). For a more digestible introduction, you can check out this intro blog by Gidi Shperber.
Unfortunately for Doc2vec, no corporation-sponsored pretrained model has been published. We will use the pretrained enwiki_dbow model from this repo. It is trained on English Wikipedia (the article count is unspecified, but the model size is a decent 1.5 GB).
The official documentation for Doc2vec states that you can insert input of any length. Once tokenized, we will feed in the entire document using the gensim library.
```python
from gensim.models.doc2vec import Doc2Vec

def process_doc2vec_similarity():
    filename = './models/enwiki_dbow/doc2vec.bin'
    model = Doc2Vec.load(filename)
    tokens = preprocess(base_document)
    # Only handle words that appear in the doc2vec pretrained vectors. The enwiki_dbow model contains a vocabulary of 669,549 words.
    tokens = list(filter(lambda x: x in model.wv.vocab.keys(), tokens))
    base_vector = model.infer_vector(tokens)
```
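The rest of the comparison could then proceed along these lines, inferring a vector for each candidate the same way and ranking by cosine similarity (a sketch; the repo’s actual loop may differ):

```python
from sklearn.metrics.pairwise import cosine_similarity

vectors = []
for document in documents:
    tokens = preprocess(document)
    # Keep only words present in the pretrained vocabulary, as above.
    tokens = list(filter(lambda x: x in model.wv.vocab.keys(), tokens))
    vectors.append(model.infer_vector(tokens))

scores = cosine_similarity([base_vector], vectors).flatten()
```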
Universal Sentence Encoder (USE)
This is a popular algorithm published by Google much more recently, in May 2018 (the famous Ray Kurzweil was behind the publication). The implementation details are well documented in Google’s TensorFlow.
We will use the latest official pretrained model by Google: Universal Sentence Encoder 4.
As the name suggests, it was built with sentences in mind. But the official documentation does not constrain the input size. There’s nothing stopping us from using it for a document comparison task.
The whole document is fed into TensorFlow as is. No tokenization is done.
```python
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

def process_use_similarity():
    filename = "./models/universal-sentence-encoder_4"
    model = hub.load(filename)
    base_embeddings = model([base_document])
    embeddings = model(documents)
    scores = cosine_similarity(base_embeddings, embeddings).flatten()
```
Bidirectional Encoder Representations from Transformers (BERT)
This is the big shot. Google open-sourced the BERT algorithm in November 2018. The following year, Google’s vice president of Search published a blog post calling BERT their biggest leap in the past 5 years.
It is specifically built to understand your search query. When it comes to understanding the context of a sentence, BERT seems to outperform everything else mentioned here.
The original BERT task was not meant to handle a large amount of text input. For embedding multiple sentences, we will use the Sentence Transformers open source project published by UKPLab (from a German university), whose calculation speed is superior. They also provide us with a pretrained model that is comparable to the original.
So each document is tokenized into sentences. And the sentence embeddings are averaged to represent the document as one vector.
```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

def process_bert_similarity():
    # This will download and load the pretrained model offered by UKPLab.
    model = SentenceTransformer('bert-base-nli-mean-tokens')
    sentences = sent_tokenize(base_document)
    base_embeddings_sentences = model.encode(sentences)
    base_embeddings = np.mean(np.array(base_embeddings_sentences), axis=0)
```
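To finish the comparison, each candidate document could be embedded the same way (sentence-tokenize, encode, average) and scored against the base vector; a sketch:

```python
from sklearn.metrics.pairwise import cosine_similarity

vectors = []
for document in documents:
    sentences = sent_tokenize(document)
    embeddings_sentences = model.encode(sentences)
    vectors.append(np.mean(np.array(embeddings_sentences), axis=0))

scores = cosine_similarity([base_embeddings], vectors).flatten()
```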
Winner Algorithms
Let’s see how each algorithm performs on our 5 different types of articles. We select the top 3 articles by highest score for comparison.
In this blog post, we will go through only the results of the best-performing algorithm for each of the five. For the full results along with individual article links, please see the algorithm directories in the repo.
1. How My Worst Date Ever Became My Best
BERT wins.
The article is a human interest story that involves a romantic date for a divorced woman in her 50s.
This style of writing does not carry specific nouns like celebrity names. It is not time-sensitive either. A human interest story from 2010 would likely be just as relevant today. Therefore, no algorithm was far off in the comparisons.
It was a close call against USE. While one of USE’s picks took a detour into a social issue such as LGBTQ, BERT focused solely on romance and dating. Other algorithms diverged into topics on family and children, possibly from seeing the word “ex-husband”.
2. A Deep-Sea Magma Monster Gets a Body Scan
TF-IDF wins.
This scientific article talks about 3D scanning an active volcano in the ocean.
3D scanning, volcano, and ocean are rare terms, and they are easy to pick up. All algorithms fared well.
TF-IDF correctly picked articles that only talk about volcanoes within the earth’s oceans. USE was a good contender as well, though it focused on volcanoes on Mars instead of oceans. Others picked up articles about Russian military submarines, which are not science-related and are off topic.
3. Renault and Nissan Try a New Way After Years When Carlos Ghosn Ruled
TF-IDF wins.
The article talks about what has happened to Renault and Nissan after the former CEO Carlos Ghosn escaped.
The ideal matches would talk about those 3 entities. Compared to the previous two, this article is much more event-driven and time-sensitive. The relevant news should be from around that date or after (the article is from November 2019).
TF-IDF correctly chose articles that focus on the Nissan CEO. Others picked articles about generic automotive industry news, such as the alliance between Fiat Chrysler and Peugeot.
It’s also worth mentioning that Doc2vec and USE led to exactly the same result.
4. Dominic Thiem Beats Rafael Nadal in Australian Open Quarterfinal
Tie among Jaccard, TF-IDF, and USE.
The article is about tennis player Dominic Thiem at the 2020 Australian Open.
The news is event-driven and very specific to the individual. So ideally the matches will be about Dominic Thiem and the Australian Open.
Unfortunately, the results suffered from a lack of sufficient data. All of the matches talked about tennis. But some were about Dominic at the 2018 French Open, or about Roger Federer at the Australian Open.
The result is a tie among the 3 algorithms. This speaks to something crucially important: we need to do our best at collecting, diversifying, and expanding the data pool for the best similarity matching results.
5. 2020 Democrats Seek Voters in an Unusual Spot: Fox News
USE wins.
The article is about Democrats, with a particular focus on Bernie Sanders, appearing on Fox News ahead of the 2020 election.
Each topic can be big on its own. There is an abundance of articles about Democratic Party candidates and the election. Since the gist of the story is its novelty, we prioritized matches that discuss the relationship between a Democratic Party candidate and Fox.
A side note: in practice, you want to be careful with suggestions in politics. Mixing liberal and conservative news can easily upset readers. Since we are dealing with the New York Times alone, it will not be our concern here.
USE found articles that talked about Bernie Sanders and cable TV networks such as Fox and MSNBC. Others picked articles that discuss other Democratic candidates in the 2020 election. Those were considered too generic.
King of Speed
Before declaring the winner, we need to talk about run time. Each algorithm performed very differently in terms of speed.
The result was that the TF-IDF implementation was way faster than any other. To run through 33,914 documents on a single CPU from start to end (tokenization, vectorization, and comparison), it took:
- TF-IDF: 1.5 min.
- Jaccard: 13 min.
- Doc2vec: 43 min.
- USE: 62 min.
- BERT: 50+ hours (each sentence was vectorized).
TF-IDF took only one and a half minutes. That’s 2.5% of what USE took. Of course, you can incorporate various efficiency enhancements. But the potential gain needs to be considered first. It gives us another reason to take a hard look at the development difficulty trade-off.
Here are the winner algorithms for each article.
- BERT
- TF-IDF
- TF-IDF
- Tie among Jaccard, TF-IDF, and USE
- USE
From the results, we can say that for document similarity in news articles, TF-IDF is the best candidate. That’s particularly true if you use it with minimal customization. It is also surprising given that TF-IDF is the second oldest of the algorithms tested. You might rather be disappointed that modern state-of-the-art deep learning does not mean much for this task.
Of course, each deep learning technique can be improved by training your own model and preprocessing the data better. But all of that comes with development costs. You want to think hard about how much improvement that effort will bring relative to the naive TF-IDF method.
Lastly, it is fair to say that we should forget about Jaccard and Doc2vec altogether for document similarity. They do not bring any benefit compared to the alternatives today.
Recommendation For Starters
Say you are deciding to implement a similarity algorithm in your application from scratch; here is my recommendation.
1. Implement TF-IDF first
Despite the deep learning hype, the state of the art for out-of-the-box document similarity matching is TF-IDF. It gives you a high-quality result. And best of all, it is lightning fast.
As we’ve seen, upgrading it to deep learning methods might or might not give you better performance. A lot of thought must go into calculating the trade-off beforehand.
2. Accumulate Better Data
Andrew Ng gave the analogy “data is the new oil” back in 2017. You can’t expect your car to run without oil. And that oil has to be good.
Document similarity relies as much on data diversity as on the specific algorithm. You should put most of your effort into finding unique data to enhance your similarity results.
3. Upgrade to Deep Learning
If and only if you are dissatisfied with the results of TF-IDF, migrate to USE or BERT. Set up the data pipeline and upgrade your infrastructure. You will need to take the explosive calculation time into account. You will likely precompute the document embeddings, so you can handle similarity matching at run time much faster. Google wrote a tutorial on that topic.
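For example, with a sentence-transformers style model, the corpus embeddings could be computed once offline and cached, so run time only pays for embedding the query. A sketch; all_documents and new_document are placeholder names:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

# Offline step: embed the whole corpus once and persist it.
corpus_embeddings = model.encode(all_documents)   # all_documents: list of article texts
np.save("corpus_embeddings.npy", corpus_embeddings)

# At run time: embed only the new article and compare against the cached matrix.
corpus_embeddings = np.load("corpus_embeddings.npy")
query_embedding = model.encode([new_document])
scores = cosine_similarity(query_embedding, corpus_embeddings).flatten()
top_indices = scores.argsort()[-3:][::-1]
```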
4. Tweak the Deep Learning Algorithm
You can slowly upgrade your model: training your own model, fitting the pretrained model to your specific domain, etc. There are also many different deep learning models available today. You can try them one by one to see which one fits your specific requirements the most.
Document Similarity Is One Of Many NLP Tasks
You can achieve document similarity with various algorithms: some are traditional statistical approaches and others are cutting-edge deep learning methods. We have seen how they compare to one another on real-world New York Times articles.
With TF-IDF, you can easily start your own document similarity project on your local laptop. No fancy GPU is necessary. No large amount of memory is necessary. With high-quality data, you will still get competitive results.
Granted, if you want to do other tasks such as sentiment analysis or classification, deep learning should suit your job. But, while researchers try to push the boundaries of deep learning efficiency and performance, it’s not healthy for all of us to live in the hype loop. It creates tremendous anxiety and insecurity for newcomers.
Staying empirical can keep our eyes on reality.
Hopefully, the blog has encouraged you to start your own NLP project.
Let’s start getting our hands dirty.
This post was made in conjunction with the document clustering of the Kaffae project I’m working on. Along that line, I am planning to do another series on sentence similarity.
Stay tuned
Further Reading
An article covering TF-IDF and cosine similarity with examples: “Overview of Text Similarity Metrics in Python”.
An academic paper discussing how cosine similarity is used in various NLP machine learning tasks: “Cosine Similarity”.
A discussion of sentence similarity in different algorithms: “Text Similarities: Estimate the degree of similarity between two texts”.
An examination of various deep learning models in text analysis: “When Not to Choose the Best NLP Model”.
A conceptual dive into the BERT model: “A review of BERT based models”.
A literature review on document embeddings: “Document Embedding Techniques”.
Translated from: https://towardsdatascience.com/the-best-document-similarity-algorithm-in-2020-a-beginners-guide-a01b9ef8cf05