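In the previous chapter, we came across common words that made it difficult to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.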

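Earlier, we talked about how much text data already exists and how more is generated every day. We need some way to manage and search through this data. If there is an ID or a title, we can of course index the data on that field, but how do we search by content? With structured data, we can create logical expressions and retrieve all rows that satisfy them. The same can be done with text, though less exactly.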



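Query q: a logical statement describing the document or kind of documents you are looking for

Query term q_t: a term in the query, generally a token

Corpus of documents D: a collection of documents

Document d: a document in D with terms t_d that describe the document

Ranking function r(q, D): a function that ranks the documents in D according to their relevance to the query q

Result R: the ranked list of documents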
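To make these definitions concrete, here is a minimal sketch in plain Python. The toy corpus and the overlap-based `rank` function are illustrative assumptions, not part of this chapter's code; the function simply orders documents by how many query terms they contain.

```python
# Toy corpus D: a mapping from a document id to its set of terms t_d.
# (The documents and names here are made up for illustration.)
D = {
    0: {"spark", "nlp", "search"},
    1: {"search", "engine", "index"},
    2: {"cooking", "recipes"},
}

def rank(q, D):
    """A naive ranking function r(q, D): order documents by how many
    query terms q_t they contain, dropping documents with no matches."""
    scores = {i: len(q & terms) for i, terms in D.items()}
    hits = [i for i in scores if scores[i] > 0]
    # sort by descending score, breaking ties by document id
    return sorted(hits, key=lambda i: (-scores[i], i))

q = {"search", "engine"}   # the query, as a set of query terms
R = rank(q, D)             # the result: a ranked list of document ids
print(R)
```

Document 1 contains both query terms and document 0 contains one, so document 1 is ranked first and document 2 is left out of the result.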




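The index in an inverted index is slightly different from a traditional index. It takes its inspiration from the mathematical concept of indexing, that is, assigning indices to the elements of a set. Recall our set of documents D. We can assign a number to each document, creating a mapping from integers to documents, i -> d.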

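Let's create this index for our DataFrame. Normally, we would store an inverted index in a data store that allows quick lookups. Spark DataFrames are not designed for quick lookups. We will introduce the tools used for search.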


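Let's look at how we would build an inverted index in Spark. Here are the steps we will follow:

  1. Load the data.
  2. Create the index: i -> d*

    • Since we are using Spark, we will generate this index on the rows.
  3. Process the text.
  4. Create the inverted index from terms to documents: t_d -> i*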


We will create an inverted index for the mini_newsgroups data set.

import os

from pyspark.sql.types import *
from pyspark.sql.functions import collect_set
from pyspark.sql import Row
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

spark = sparknlp.start()
path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()


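Now we need to create the index. Spark assumes the data is distributed, so to assign an index we need to use the lower-level RDD API. zipWithIndex assigns indices by partition order first and then by the order of the items within each partition.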

rows_w_indexed = texts.rdd.zipWithIndex()
(path, text), i = rows_w_indexed.first()

Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404
Newsgroups: sci.astro,sci.space
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...
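zipWithIndex numbers items first by partition and then by position within each partition. That rule can be sketched without a cluster. The function below is a plain-Python stand-in, not Spark's implementation: it treats a list of lists as the partitions of an RDD. The name `zip_with_index` and the sample partitions are assumptions made for illustration.

```python
def zip_with_index(partitions):
    """Emulate the numbering produced by RDD.zipWithIndex on a list of
    partitions (each partition is a list of items)."""
    indexed = []
    next_index = 0
    for partition in partitions:      # partitions are visited in order
        for item in partition:        # items keep their order in a partition
            indexed.append((item, next_index))
            next_index += 1
    return indexed

# four hypothetical partitions, one of them empty
partitions = [["doc_a", "doc_b"], ["doc_c"], [], ["doc_d"]]
pairs = zip_with_index(partitions)
print(pairs)
```

As with the real method, each element is paired with its index as `(item, index)`, which is why the listing above unpacks the first row as `(path, text), i`.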

Now that we have created the index, we need to create a DataFrame as we did before, except that now we need to add the index to our Rows. Table 6-1 shows the result.

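indexed = rows_w_indexed.map(
    lambda row_index: Row(
        path=row_index[0]['path'],
        text=row_index[0]['text'],
        index=row_index[1])
)
# access fields by name so the unpacking does not depend on Row field order
first = indexed.first()
(i, path, text) = (first['index'], first['path'], first['text'])
indexed_schema = schema.add(StructField('index', IntegerType()))

indexed = spark.createDataFrame(indexed, schema=indexed_schema)\
    .persist()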

Table 6-1. Indexed documents
  path text index
0 file:/.../spark-nlp-book/data/m... Newsgroups:rec.motorcycles\nFrom[email protected]... 0
1 file:/.../spark-nlp-book/data/m... Path:cantaloupe.srv.cs.cmu.edu!das-news.harva... 1
2 file:/.../spark-nlp-book/data/m... Newsgroups:rec.motorcycles\nPath:cantaloupe.... 2
3 file:/.../spark-nlp-book/data/m... Xref:cantaloupe.srv.cs.cmu.edu rec.motorcycle... 3
4 file:/.../spark-nlp-book/data/m... Path:cantaloupe.srv.cs.cmu.edu!das-news.harva... 4
5 file:/.../spark-nlp-book/data/m... Path:cantaloupe.srv.cs.cmu.edu!magnesium.club... 5
6 file:/.../spark-nlp-book/data/m... Newsgroups:rec.motorcycles\nPath:cantaloupe.... 6
7 file:/.../spark-nlp-book/data/m... Newsgroups:rec.motorcycles\nPath:cantaloupe.... 7
8 file:/.../spark-nlp-book/data/m... Path:cantaloupe.srv.cs.cmu.edu!rochester!udel... 8
9 file:/.../spark-nlp-book/data/m... Path:cantaloupe.srv.cs.cmu.edu!crabapple.srv.... 9

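Each document d is a collection of terms, t_d. So our index is a mapping from integers to collections of terms.

An inverted index, on the other hand, is a mapping from terms t_d to integers, inv-index: t_d -> i, j, k, .... This allows us to quickly look up which documents contain a given term.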

Step 3

Now let's process the text (see Table 6-2).

from sparknlp.pretrained import PretrainedPipeline

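# Intermediate column names ('document', 'token', 'lemma') are assumed here;
# 'text' and 'normalized' are the columns used elsewhere in this chapter.
assembler = DocumentAssembler()\
    .setInputCol('text').setOutputCol('document')
tokenizer = Tokenizer()\
    .setInputCols(['document']).setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['token']).setOutputCol('lemma')
normalizer = Normalizer()\
    .setInputCols(['lemma']).setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized']).setOutputCols(['normalized'])\
    .setOutputAsArray(True)

# fit the pipeline so that it can be used to transform data
pipeline = Pipeline().setStages([
    assembler, tokenizer, 
    lemmatizer, normalizer, finisher
]).fit(indexed)

indexed_w_tokens = pipeline.transform(indexed)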

Table 6-2. Documents with normalized tokens

path text index normalized
0 file:/.../spark-nlp-book/data/m... ... 0 [newsgroups, recmotorcycles, from, lisaalexcom...
1 file:/.../spark-nlp-book/data/m... ... 1 [path, cantaloupesrvcscmuedudasnewsharvardedun...
2 file:/.../spark-nlp-book/data/m... ... 2 [newsgroups, recmotorcycles, path, cantaloupes...
3 file:/.../spark-nlp-book/data/m... ... 3 [xref, cantaloupesrvcscmuedu, recmotorcyclesha...
4 file:/.../spark-nlp-book/data/m... ... 4 [path, cantaloupesrvcscmuedudasnewsharvardeduo...
5 file:/.../spark-nlp-book/data/m... ... 5 [path, cantaloupesrvcscmuedumagnesiumclubcccmu...
6 file:/.../spark-nlp-book/data/m... ... 6 [newsgroups, recmotorcycles, path, cantaloupes...
7 file:/.../spark-nlp-book/data/m... ... 7 [newsgroups, recmotorcycles, path, cantaloupes...
8 file:/.../spark-nlp-book/data/m... ... 8 [path, cantaloupesrvcscmuedurochesterudelbogus...
9 file:/.../spark-nlp-book/data/m... ... 9 [path, cantaloupesrvcscmueducrabapplesrvcscmue...

Because we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas and use our index field as the DataFrame index.

doc_index = indexed_w_tokens.select('index', 'path', 'text').toPandas()
doc_index = doc_index.set_index('index')


Now, let's create the inverted index. We will use Spark SQL to do this. The result is shown in Table 6-3.

SELECT term, collect_set(index) AS documents
FROM (
    SELECT index, explode(normalized) AS term
    FROM indexed_w_tokens
)
GROUP BY term
ORDER BY term

inverted_index = indexed_w_tokens\
    .selectExpr('index', 'explode(normalized) AS term')\
    .groupBy('term').agg(collect_set('index').alias('documents'))\
    .orderBy('term')

Table 6-3. Inverted index (mapping from term to document index)
  term documents
0 aaangel.qdeck.com [198]
1 accumulation [857]
2 adventists [314]
3 aecfb.student.cwru.edu [526]
4 again...hmmm [1657]
5 alt.binaries.pictures [815]
6 amplifier [630, 624, 654]
7 antennae [484, 482]
8 apr..gordian.com [1086]
9 apr..lokkur.dexter.mi.us [292]

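This is our inverted index. We can see that the term "amplifier" occurs in documents 630, 624, and 654. With this information, we can quickly find all documents that contain particular terms.

Another benefit is that the size of this inverted index depends on the size of our vocabulary, not on the amount of text in our corpus, so it is not big data. The inverted index grows only with new terms and document indices. For very large corpora, this can still be a large amount of data for a single machine. In the case of the mini_newsgroups data set, however, it is easily manageable.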



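For us, since we have such a small number of documents, the inverted index has more entries than the index. Word frequencies follow Zipf's law: the frequency of a word is inversely proportional to its rank when words are sorted by frequency. As a result, the most-used English words are already in our inverted index. The size of the index can be constrained further by not tracking words that do not occur at least a certain number of times.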
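Dropping rare terms can be sketched as a simple filter over a term-to-documents dictionary. The helper `prune`, its `min_docs` parameter, and the toy index are assumptions made for illustration. The threshold here is the number of documents a term appears in, because an inverted index of document sets does not store raw occurrence counts.

```python
def prune(term_to_docs, min_docs):
    """Keep only the terms that occur in at least `min_docs` documents."""
    return {
        term: docs
        for term, docs in term_to_docs.items()
        if len(docs) >= min_docs
    }

# Toy inverted index: term -> set of document indices (made-up values).
toy_index = {
    "the": {0, 1, 2, 3},
    "amplifier": {1, 2, 3},
    "antennae": {0, 3},
    "hmmm": {2},
}

pruned = prune(toy_index, min_docs=2)
print(sorted(pruned))
```

Only "hmmm", which appears in a single document, is dropped. Because of Zipf's law, a small threshold like this can remove a large share of the vocabulary in a real corpus.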

inverted_index = {
    term: set(docs) 
    for term, docs in inverted_index.collect()
}


lang_docs = inverted_index['language']
print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...')
print('number of docs', len(lang_docs))
docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433,  ...
number of docs 44
info_docs = inverted_index['information']
print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...')
print('number of docs', len(info_docs))
docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22,  ...
number of docs 215
filter_set = list(lang_docs | info_docs)
print('number of docs in filter set', len(filter_set))
number of docs in filter set 246
intersection = list(lang_docs & info_docs)
print('number of docs in intersection set', len(intersection))
number of docs in intersection set 13

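Let's print out lines from our filter set. Here, the filter set is the result set, but in general the filter set is ranked by r(q, D), which produces the result set.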


k = 1
for i in filter_set:
    path, text = doc_index.loc[i]
    lines = text.split('\n')
    print(path.split('/')[-1], 'length:', len(text))
    for line_number, line in enumerate(lines):
        if 'information' in line or 'language' in line:
            print(line_number, line)
    k += 1
    if k > 5:
        break
178813 length: 1783
14 >>    Where did you get this information?  The FBI stated ...

104863 length: 2795
14 of information that I received, but as my own bad mouthing) ...

104390 length: 2223
51 ... appropriate disclaimer to outgoing public information,

178569 length: 11735
60  confidential information obtained illegally from law ...
64  ... to allegations of obtaining confidential information from
86  employee and have said they simply traded information with ...
91  than truthful" in providing information during an earlier ...
125  and Department of Motor Vehicles information such as ...
130  Bullock also provided information to the South African ...
142  information.
151  exchanged information with the FBI and worked with ...
160  information in Los Angeles, San Francisco, New York, ...
168  some confidential information in the Anti-Defamation ...
182  police information on citizens and groups.
190  ... spying operations, which collected information on more than
209  information to police, journalists, academics, government ...
211  information illegally, he said.
215  identity of any source of information," Foxman said.

104616 length: 1846
45 ... an appropriate disclaimer to outgoing public information,

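Now that we have our result set, how should we rank the results? We could just count the occurrences of our search terms, but that would be biased toward long documents. Also, what happens if our query includes a very common word like "the"? If we just use counts, common words like "the" will dominate our results. In our result set, the document with the most occurrences of the query terms is also the longest text. We could say that the more query terms a document contains, the more relevant it is, but this has problems too. What do we do with one-term queries? In our example, only 13 of the 246 documents in the filter set contain both terms. And if our query has common words, for example "the cat in the hat", should "the" and "in" have the same importance as "cat" and "hat"? To solve this problem, we need a more flexible model for our documents and queries.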
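The bias toward long documents is easy to reproduce on toy data. In the sketch below, the two scoring functions and the sample documents are hypothetical; dividing by document length is only one simple normalization and is not the approach this chapter goes on to use.

```python
def count_score(tokens, query_terms):
    """Raw number of query-term occurrences in the document."""
    return sum(1 for token in tokens if token in query_terms)

def normalized_score(tokens, query_terms):
    """Occurrences divided by document length (0.0 for empty documents)."""
    return count_score(tokens, query_terms) / len(tokens) if tokens else 0.0

query_terms = {"cat", "hat"}
short_doc = "the cat in the hat".split()
# a long document that mentions "cat" a few times but is mostly about
# something else
long_doc = ("the cat " + "walked down the long road " * 10).split() * 3

print(count_score(short_doc, query_terms), count_score(long_doc, query_terms))
print(normalized_score(short_doc, query_terms),
      normalized_score(long_doc, query_terms))
```

By raw count the long document wins, even though only 3 of its 156 tokens match the query, while 2 of the short document's 5 tokens do.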


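In the previous chapter, we introduced the concept of vectorizing documents. We talked about creating binary vectors, in which 1 means that the word is present in the document. We can also use counts.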
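As a small illustration of the difference, the following sketch builds both kinds of vector for one short document. The vocabulary and the document are toy values chosen for this example.

```python
from collections import Counter

# toy vocabulary in sorted order; position in the list is the vector index
vocabulary = ["cat", "dog", "hat", "in", "the"]
tokens = "the cat in the hat".split()

counts = Counter(tokens)
# count vector: how many times each vocabulary term occurs
count_vector = [counts[term] for term in vocabulary]
# binary vector: 1 if the term occurs at all, otherwise 0
binary_vector = [1 if counts[term] > 0 else 0 for term in vocabulary]

print(count_vector)
print(binary_vector)
```

The two vectors differ only in the position for "the", which occurs twice.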


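Let's calculate the vectors for our data set. In the previous chapter, we used CountVectorizer for this. Here we will build the vectors in Python, and the way we build them will help us understand how libraries implement vectorization.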

class SparseVector(object):
    def __init__(self, indices, values, length):
        # if the indices are not in ascending order, we need 
        # to sort them
        is_ascending = True
        for i in range(len(indices) - 1):
            is_ascending = is_ascending and indices[i] < indices[i+1]
        if not is_ascending:
            pairs = zip(indices, values)
            sorted_pairs = sorted(pairs, key=lambda x: x[0])
            indices, values = zip(*sorted_pairs)
        self.indices = indices
        self.values = values
        self.length = length
    def __getitem__(self, index):
        try:
            return self.values[self.indices.index(index)]
        except ValueError:
            return 0.0
    def dot(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res = 0
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res += self.values[i] * other.values[j]
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return res
    def hadamard(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res_indices = []
        res_values = []
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res_indices.append(self.indices[i])
                res_values.append(self.values[i] * other.values[j])
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return SparseVector(res_indices, res_values, self.length)
    def sum(self):
        return sum(self.values)
    def __repr__(self):
        return 'SparseVector({}, {})'.format(
            dict(zip(self.indices, self.values)), self.length)


from collections import Counter

vocabulary = set()
vectors = {}

for row in indexed_w_tokens.toLocalIterator():
    counts = Counter(row['normalized'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts
vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)


for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)

vectors[42]
SparseVector({56: 1, 630: 1, 678: 1, 937: 1, 952: 1, 1031: 1, 1044: 1,
1203: 1, 1348: 1, 1396: 5, 1793: 1, 2828: 1, 3264: 3, 3598: 3, 3753: 1,
4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 8532: 1, 9570: 1, 11031: 1,
11731: 1, 12509: 1, 13555: 1, 13772: 1, 14918: 1, 15205: 1, 15350: 1,
15475: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 17627: 1, 17798: 1,
17931: 2, 18178: 1, 18329: 2, 18505: 1, 18730: 3, 18776: 1, 19346: 1,
19620: 1, 20381: 1, 20475: 1, 20594: 1, 20782: 1, 21831: 1, 21856: 1,
21907: 1, 22560: 1, 22565: 2, 22717: 1, 23714: 1, 23813: 1, 24145: 1,
24965: 3, 25937: 1, 26437: 1, 26438: 1, 26592: 1, 26674: 1, 26679: 1,
27091: 1, 27109: 1, 27491: 2, 27500: 1, 27670: 1, 28583: 1, 28864: 1,
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1,
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35635: 1, 35724: 1,
36136: 1, 36596: 1, 36672: 1, 37048: 1, 37854: 1, 37867: 3, 37872: 1,
37876: 3, 37891: 1, 37907: 1, 37949: 1, 38002: 1, 38224: 1, 38225: 2,
38226: 3, 38317: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 41239: 1,
41240: 1, 41276: 1, 41292: 1, 41507: 1, 41731: 1, 42384: 2}, 42624)








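These common words that we wish to remove are called stop words. The term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. Default stop-word lists are available, but it is often necessary to modify a generic stop-word list for a particular task.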

from pyspark.ml.feature import StopWordsRemover

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

filtered = sw_remover.transform(indexed_w_tokens)
from collections import Counter

vocabulary_filtered = set()
vectors_filtered = {}

for row in filtered.toLocalIterator():
    counts = Counter(row['filtered'])
    vocabulary_filtered.update(counts.keys())
    vectors_filtered[row['index']] = counts
vocabulary_filtered = list(sorted(vocabulary_filtered))
inv_vocabulary_filtered = {
    term: ix 
    for ix, term in enumerate(vocabulary_filtered)
}
# length of the unfiltered vocabulary, as reported in the output below
vocab_len_filtered = len(vocabulary)
for index in vectors_filtered:
    terms, values = zip(*vectors_filtered[index].items())
    indices = [inv_vocabulary_filtered[term] for term in terms]
    vectors_filtered[index] = \
        SparseVector(indices, values, vocab_len_filtered)
SparseVector({630: 1, 678: 1, 952: 1, 1031: 1, 1044: 1, 1203: 1, 1348: 1,
1793: 1, 2828: 1, 3264: 3, 4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 
8532: 1, 9570: 1, 11031: 1, 11731: 1, 12509: 1, 13555: 1, 13772: 1, 
14918: 1, 15205: 1, 15350: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 
17627: 1, 17798: 1, 17931: 2, 18178: 1, 18505: 1, 18776: 1, 20475: 1, 
20594: 1, 20782: 1, 21831: 1, 21856: 1, 21907: 1, 22560: 1, 22565: 2, 
22717: 1, 23714: 1, 23813: 1, 24145: 1, 25937: 1, 26437: 1, 26438: 1, 
26592: 1, 26674: 1, 26679: 1, 27109: 1, 27491: 2, 28583: 1, 28864: 1, 
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1, 
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35724: 1, 36136: 1, 
36596: 1, 36672: 1, 37048: 1, 37872: 1, 37891: 1, 37949: 1, 38002: 1, 
38224: 1, 38225: 2, 38226: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 
41239: 1, 41240: 1, 41276: 1, 41731: 1}, 42624)






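We multiply these values by the term frequency, which is the frequency of a word in a given document. The product of inverse document frequency and term frequency is the TF.IDF.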
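A small worked example may help before moving to the full data set. The sketch below uses a toy corpus and plain dictionaries, with tf = log(1 + count) and idf = log(N / (1 + df)). Here N is the number of documents, which is the conventional choice and an assumption of this sketch; the chapter's own listing uses the vocabulary size in the numerator.

```python
from collections import Counter
from math import log

# toy corpus: document index -> tokens (made-up documents)
docs = {
    0: "the cat sat on the mat".split(),
    1: "the dog sat on the log".split(),
    2: "the cat chased the dog".split(),
    3: "a bird sang".split(),
}

n_docs = len(docs)

# document frequency: the number of documents that contain each term
df = Counter()
for tokens in docs.values():
    df.update(set(tokens))      # each document counts a term only once

idf = {term: log(n_docs / (1 + count)) for term, count in df.items()}

# TF.IDF: dampened term frequency times inverse document frequency
tfidf = {
    i: {term: log(1 + tf) * idf[term] for term, tf in Counter(tokens).items()}
    for i, tokens in docs.items()
}

print(tfidf[0]["the"], tfidf[0]["cat"], tfidf[0]["mat"])
```

In document 0, "the" occurs twice but appears in three of the four documents, so its weight is zero. "mat" occurs once and appears nowhere else, so it receives the highest weight of the three.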

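Let's calculate it with our vectors. We actually already have the term frequencies, so all we need to do is calculate idf, transform the values with log, and multiply tf by idf.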

idf = Counter()

for vector in vectors.values():
    # count each term once per document it appears in
    idf.update(vector.indices)
for ix, count in idf.most_common(20):
    print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))
11031 date                 2000
15475 from                 2000
23813 messageid            2000
26438 newsgroups           2000
28583 path                 2000
36672 subject              2000
21907 lines                1993
27897 organization         1925
37876 the                  1874
 1793 apr                  1861
 3598 be                   1837
38317 to                   1767
27500 of                   1756
   56 a                    1730
16266 gmt                  1717
18329 i                    1708
18730 in                   1695
 1396 and                  1674
15166 for                  1474
17238 have                 1459

We can now make idf a SparseVector. We know it contains all the words, so it won't actually be sparse, but this will help us implement the next steps.

indices, values = zip(*idf.items())
idf = SparseVector(indices, values, vocab_len)
from math import log

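# dampen the raw counts: tf = log(1 + count)
for index, vector in vectors.items():
    vector.values = list(map(lambda v: log(1+v), vector.values))
# idf = log(vocab_len / (1 + document frequency)), as in the outputs below;
# the more common convention puts the number of documents in the numerator
idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))
tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}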
SparseVector({56: 2.2206482367540246, 630: 5.866068667810157, 
678: 5.793038323439593, 937: 2.7785503981772224, 952: 5.157913986067814, 
41731: 2.4998956290056062, 42384: 3.8444034764394415}, 42624)

Let's look at the TF.IDF values for "be" and "the". Let's also look at one of the terms with a higher TF.IDF than these common words.

tfidf[42][3598] # be
tfidf[42][37876] # the
vocabulary[17236], tfidf[42][17236]
('hausmann', 10.188396765921954)


Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard...
From: [email protected] (Bake Timmons)
Newsgroups: alt.atheism
Subject: Re: Amusing atheists and agnostics
Date: 20 Apr 93 06:00:04 GMT
Sender: [email protected]
Lines: 32

Maddi Hausmann chirps:

>[email protected] (Bake Timmons) writes: >


>"Killfile" Keith Allen Schneider = Frank "Closet Theist" O'Dwyer = ...

= Maddi "The Mad Sound-O-Geek" Hausmann


Bake Timmons, III

We can see that the document is talking about a person named "Maddi Hausmann".


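Spark has stages for calculating TF.IDF in MLlib. If you have a column that contains arrays of strings, you can use either CountVectorizer, which we are already familiar with, or HashingTF to get the tf values. HashingTF uses the hashing trick, in which you decide on a vector space beforehand and hash the words into that vector space. If there is a collision, the colliding words are counted as the same word. This lets you trade off memory efficiency against accuracy. As you make the predetermined vector space larger, the output vectors become larger, but the chance of collisions decreases.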
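The hashing trick itself fits in a few lines of plain Python. The sketch below is not Spark's HashingTF: the function `hashing_tf` is a hypothetical helper, and `zlib.crc32` is used only because it gives the same result on every run.

```python
import zlib

def hashing_tf(tokens, num_features):
    """Map tokens to a fixed-length count vector with the hashing trick.
    zlib.crc32 is chosen for determinism; it is not the hash function
    that Spark's HashingTF uses."""
    vector = [0] * num_features
    for token in tokens:
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] += 1
    return vector

tokens = "the cat in the hat".split()
small = hashing_tf(tokens, num_features=2)      # collisions are unavoidable
large = hashing_tf(tokens, num_features=1024)   # collisions are unlikely

print(small)
print(sum(large), len(large))
```

With only two buckets, the four distinct words must share positions, so their counts are merged. With 1,024 buckets the vector is much longer, but the words are far less likely to collide. No vocabulary has to be built or stored in either case.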



Now that we have calculated the TF.IDF values, let's build a search function. First, we need a function to process the query.

def process_query(query, pipeline):
    data = spark.createDataFrame([(query,)], ['text'])
    return pipeline.transform(data).first()['normalized']


def get_filter_set(processed_query):
    filter_set = set()
    # find all the documents that contain any of the terms
    return filter_set


def get_score(index, terms): 
    return  # return a single score


def display(index, score, terms): 
    path = doc_index.loc[index]['path']
    # query terms that are in the vocabulary and occur in this document
    hits = [
        term for term in terms
        if term in inv_vocabulary and tfidf[index][inv_vocabulary[term]] > 0.
    ]
    print('terms', terms, 'hits', hits) 
    print('score', score) 
    print('path', path) 
    print('length', len(doc_index.loc[index]['text']))


def search(query, pipeline, k=5):
    processed_query = process_query(query, pipeline)
    filter_set = get_filter_set(processed_query)
    scored = {index: get_score(index, processed_query) for index in filter_set}
    display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k]
    for index in display_list:
        display(index, scored[index], processed_query)
search('search engine', pipeline)

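You should be able to implement get_filter_set and get_score easily using the examples in this chapter. Try out a few queries. You will likely notice two big limitations: there is no N-gram support, and the ranker is biased toward longer documents. What could you modify to fix these problems?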
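As one possible starting point for the N-gram limitation, the sketch below generates N-grams from a token list. The helper `ngrams` is hypothetical and is not part of the chapter's code. Indexing such terms alongside the single tokens would let a query like "search engine" match the phrase and not just the individual words.

```python
def ngrams(tokens, n):
    """Return the n-grams in `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["search", "engine", "optimization"]
terms = tokens + ngrams(tokens, 2)   # unigrams plus bigrams
print(terms)
```

The same function would need to be applied both to the documents when the index is built and to the query when it is processed, so that the two share one vocabulary.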
