Text retrieval in Python: a text similarity algorithm

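Goal

Given one or more search terms, such as "高血压 患者" (hypertension, patient), find the n most relevant documents in an existing collection of texts.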

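Background

A common strategy for text retrieval is to use a ranking function that scores every document against the search terms, sort the documents by that score, and keep the top n, much as a search engine such as Baidu does.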
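
The ranking strategy can be sketched in a few lines. This is only an illustration of the idea, not the scoring used later in this post: the toy `score` function just counts how often the query terms occur in a pre-tokenized document, and the names `score`, `top_n`, and the sample documents are hypothetical.

```python
def score(query_terms, doc_tokens):
    """Toy ranking function: total occurrences of the query terms in the document."""
    return sum(doc_tokens.count(term) for term in query_terms)

def top_n(query_terms, docs, n):
    """Return the indices of the n highest-scoring documents, best first."""
    scored = [(score(query_terms, tokens), idx) for idx, tokens in enumerate(docs)]
    # Sort by descending score; break ties by the lower document index
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in ranked[:n]]

# Made-up, already tokenized documents
docs = [
    ["高血压", "患者", "饮食", "建议"],
    ["糖尿病", "患者", "护理"],
    ["高血压", "高血压", "患者", "用药"],
]
print(top_n(["高血压", "患者"], docs, n=2))  # [2, 0]
```

Swapping in a better `score` (for example a TF-IDF cosine similarity) changes the quality of the ranking but not the overall shape of the procedure.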

Algorithm: model selection

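  • 1. A similarity model over word vectors (nominally word2vec; the code below uses bag-of-words vectors weighted by TF-IDF)
  • 2. The Python implementation uses the gensim library
  • 3. Chinese word segmentation with "jieba"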

Step-by-step implementation:

  • jieba.cut

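The method takes three input parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used.

Build the stop word list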

import codecs

# Load one stop word per line and strip surrounding whitespace
stop_words = 'demo/stop_words.txt'
stopwords = codecs.open(stop_words, 'r', encoding='utf8').readlines()
stopwords = [w.strip() for w in stopwords]

 

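Part-of-speech flags to filter out after jieba tagging [punctuation, conjunctions, particles, adverbs, prepositions, time words, '的', numerals, locality words, pronouns]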
stop_flag = ['x', 'c', 'u','d', 'p', 't', 'uj', 'm', 'f', 'r']

 

Tokenize one article and remove stop words
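import jieba.posseg as pseg

def tokenization(filename):
    result = []
    # Assumes the articles are UTF-8 encoded, like the stop word file
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
        words = pseg.cut(text)
    # Keep a word only if neither its POS flag nor the word itself is filtered
    for word, flag in words:
        if flag not in stop_flag and word not in stopwords:
            result.append(word)
    return result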
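
The filtering rule inside `tokenization` can be tried without reading a file or running jieba. The sketch below applies the same two conditions to hand-written (word, flag) pairs. The pairs, the tiny stop word set, and the helper name `filter_tokens` are made up for illustration; the flags are plausible values, not captured jieba output.

```python
# Same flags as stop_flag above, stored as a set for fast membership tests
STOP_FLAGS = {'x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r'}
STOP_WORDS = {'进行', '相关'}  # hypothetical stop words

def filter_tokens(tagged, stop_flags=STOP_FLAGS, stop_words=STOP_WORDS):
    """Drop tokens whose POS flag or surface form is on a stop list."""
    return [word for word, flag in tagged
            if flag not in stop_flags and word not in stop_words]

# Hand-written (word, flag) pairs standing in for tagger output
tagged = [
    ('高血压', 'n'),
    ('的', 'uj'),
    ('患者', 'n'),
    ('进行', 'v'),
    ('治疗', 'v'),
    ('。', 'x'),
]
print(filter_tokens(tagged))  # ['高血压', '患者', '治疗']
```

Using sets rather than lists keeps each membership test constant-time, which starts to matter once the stop word file has thousands of entries.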

 

Preprocess every text file in the directory and build the dictionary
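import os
import re

corpus = []
dirname = 'demo/articles'
filenames = []
for root, dirs, files in os.walk(dirname):
    for f in files:
        # Accept file names made of Chinese characters followed by ".txt"
        if re.match(r'[\u4e00-\u9fa5]*\.txt$', f):
            # os.walk yields bare names, so join with root to get an openable path
            corpus.append(tokenization(os.path.join(root, f)))
            filenames.append(f)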
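
To see which files a walk like this picks up, the following sketch builds a throwaway directory and collects the matching paths. It is a simplified stand-in, not the loop above: the helper name `find_articles` and the sample file names are hypothetical, and it uses `fullmatch` so the whole file name has to fit the pattern.

```python
import os
import re
import tempfile

# Chinese characters followed by a literal ".txt"
TXT_PATTERN = re.compile(r'[\u4e00-\u9fa5]*\.txt')

def find_articles(dirname):
    """Return sorted full paths of files whose entire name matches TXT_PATTERN."""
    paths = []
    for root, _dirs, files in os.walk(dirname):
        for name in files:
            if TXT_PATTERN.fullmatch(name):
                # Join with root so files in subdirectories can be opened too
                paths.append(os.path.join(root, name))
    return sorted(paths)

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ['湿地保护.txt', 'notes.md', os.path.join('sub', '宿舍楼.txt')]:
        with open(os.path.join(tmp, rel), 'w', encoding='utf-8') as fh:
            fh.write('示例')
    found = [os.path.relpath(p, tmp) for p in find_articles(tmp)]
    print(found)
```

The file in `sub` is only reachable because its path is joined with `root`; passing the bare name to `open` would fail for anything outside the current working directory.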
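from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(corpus)
print(len(dictionary))

Build the bag-of-words model

doc_vectors = [dictionary.doc2bow(text) for text in corpus]  # the corpus as bag-of-words vectors

Build the TF-IDF model

# doc_test_vec is the bag-of-words vector of the query. Its construction is not shown here;
# presumably it is dictionary.doc2bow applied to the tokenized query.
tfidf = models.TfidfModel(doc_vectors)  # fit the TF-IDF model on the corpus
tfidf_vectors = tfidf[doc_test_vec]  # TF-IDF weight of each query term

Compute similarity with a similarity matrix

index = similarities.MatrixSimilarity(tfidf[doc_vectors])
sim = index[tfidf[doc_test_vec]]  # similarity of the query to every document

Sort by similarity

scores = sorted(enumerate(sim), key=lambda item: -item[1])  # highest score first
print(scores[0][1])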
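
What these gensim calls compute can be approximated in plain Python. The sketch below is a stand-in, not gensim's implementation: it assumes the weighting is raw term frequency times log2(N/df) followed by L2 normalization, and that the similarity is the cosine between the resulting vectors. The helper names (`tfidf_vectors`, `weigh`, `cosine`) and the three tiny documents are made up for the example.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """TF-IDF weights for one token list, L2-normalized; unknown terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(corpus):
    """Return (vectors, idf) where vectors holds one sparse dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in corpus for term in set(doc))
    idf = {term: math.log2(n_docs / count) for term, count in df.items()}
    return [weigh(doc, idf) for doc in corpus], idf

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Made-up, already tokenized corpus
corpus = [
    ["宿舍楼", "建设", "项目"],
    ["湿地", "保护", "恢复", "建设", "工程"],
    ["道路", "维修", "工程"],
]
vectors, idf = tfidf_vectors(corpus)
query = weigh(["湿地", "保护", "工程"], idf)
sim = [cosine(query, vec) for vec in vectors]
scores = sorted(enumerate(sim), key=lambda item: -item[1])
print(scores[0][0])  # index of the best match: 1
```

Terms that occur in every document get an IDF of zero under this weighting, so they contribute nothing to the score; rare terms such as "湿地" dominate the match.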

Sample score output:

Test query: 富宁县里达中学宿舍楼建设项目
Matches (sorted by similarity, highest first): [(2, 1.0), (31, 0.07981655), (43, 0.077732354), (33, 0.06620947), (30, 0.065360494), (14, 0.061563488), (6, 0.05077639), (22, 0.05062699), (7, 0.044322222), (42, 0.044024862), (21, 0.043359876), (26, 0.035853535), (27, 0.03457492), (29, 0.033902794), (45, 0.03236963), (25, 0.031936638), (40, 0.030814772), (48, 0.030788476), (20, 0.027607089), (8, 0.02558621), (11, 0.024541285), (5, 0.024447413), (28, 0.020779021), (4, 0.020459857), (13, 0.015429099), (34, 0.014453442), (50, 0.011855431), (36, 0.006562164), (0, 0.006476198), (32, 0.0051991176), (46, 0.00477116), (35, 0.0047449875), (38, 0.004728446), (18, 0.004499278), (41, 0.004158474), (44, 0.0037516006), (47, 0.0036311403), (15, 0.003384664), (37, 0.00318741), (23, 0.0030692797), (17, 0.0022487652), (39, 0.0020392523), (24, 0.0016430109), (12, 0.0014699087), (1, 0.0), (3, 0.0), (9, 0.0), (10, 0.0), (16, 0.0), (19, 0.0), (49, 0.0)]
Result: winning-bid project: 富宁县里达中学宿舍楼建设项目; highest similarity: 1.0



 

Test query: 湿地保护与恢复建设工程
Matches (sorted by similarity, highest first): [(13, 0.57420367), (40, 0.10633894), (48, 0.106248185), (43, 0.10532686), (49, 0.0816016), (12, 0.077999234), (31, 0.07725123), (25, 0.07712983), (11, 0.058984473), (50, 0.05736675), (7, 0.047928438), (34, 0.04754001), (33, 0.04504219), (30, 0.038571842), (22, 0.037484765), (27, 0.03233484), (45, 0.031974725), (14, 0.0313408), (26, 0.030683806), (5, 0.030661184), (2, 0.026870431), (4, 0.02638424), (8, 0.026375605), (20, 0.02581845), (35, 0.024404963), (32, 0.019936334), (28, 0.019432766), (44, 0.018292043), (42, 0.018038727), (38, 0.01745583), (6, 0.017230202), (17, 0.015729848), (46, 0.013131632), (29, 0.012461022), (19, 0.0117950225), (47, 0.0064870343), (0, 0.0), (1, 0.0), (3, 0.0), (9, 0.0), (10, 0.0), (15, 0.0), (16, 0.0), (18, 0.0), (21, 0.0), (23, 0.0), (24, 0.0), (36, 0.0), (37, 0.0), (39, 0.0), (41, 0.0)]
Result: winning-bid project: 四川省若尔盖国际重要湿地保护与恢复建设工程第1标段; highest similarity: 0.57420367
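
The score lists hold document indices, so reporting a project name means mapping the best indices back to the `filenames` list collected earlier. A minimal sketch of that step follows. The helper name `best_matches`, its `min_score` cutoff, and the sample data are hypothetical, not part of the code above.

```python
def best_matches(sim, filenames, n=3, min_score=0.0):
    """Return up to n (filename, score) pairs scoring above min_score, best first."""
    ranked = sorted(enumerate(sim), key=lambda item: -item[1])
    return [(filenames[i], s) for i, s in ranked[:n] if s > min_score]

filenames = ['a.txt', 'b.txt', 'c.txt', 'd.txt']  # made-up file names
sim = [0.0, 0.57, 0.11, 0.08]                     # made-up similarity scores
print(best_matches(sim, filenames, n=2))  # [('b.txt', 0.57), ('c.txt', 0.11)]
```

A cutoff is worth considering because, as in the second sample above, the best match can score well below 1.0 when the query is only part of a longer title, and documents with a score of 0.0 share no weighted term with the query at all.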
