Applying Cosine Similarity in Python

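This article runs a cosine similarity analysis on two texts of unknown category to decide which of them belongs to the military category.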

Theory

  • In two-dimensional space, the dot product formula for vectors gives:

$$a \cdot b = |a|\,|b|\cos\theta$$

  • If vectors a and b have coordinates (x1, y1) and (x2, y2), then:

$$\cos\theta = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$

Extending to n dimensions, with vector A = (A1, A2, ..., Ai, ..., An) and vector B = (B1, B2, ..., Bi, ..., Bn), we have:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

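  • The cosine lies in [-1, 1]. The closer it is to 1, the closer the directions of the two vectors (the more similar the documents); a value near 0 means the vectors are nearly orthogonal; the closer it is to -1, the more opposite their directions.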
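
To make the three regimes above concrete, here is a minimal sketch that evaluates the n-dimensional formula on a few hand-picked 2D vectors. The function name `cosine` and the sample vectors are illustrative assumptions, not part of the article's code.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))       # sum of A_i * B_i
    norm_a = math.sqrt(sum(x * x for x in a))    # sqrt of sum of A_i^2
    norm_b = math.sqrt(sum(y * y for y in b))    # sqrt of sum of B_i^2
    return dot / (norm_a * norm_b)


same_direction = cosine((1, 2), (2, 4))    # parallel vectors, close to 1
orthogonal = cosine((1, 0), (0, 1))        # perpendicular vectors, exactly 0
opposite = cosine((1, 2), (-1, -2))        # opposite vectors, close to -1
print(same_direction, orthogonal, opposite)
```

Word-frequency vectors have no negative components, so for the texts compared below the value always falls in [0, 1].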

Code

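- Files used:
  • military_sample.txt: sample text about military topics
  • military_unknow.txt: unclassified test text about military topics
  • meeting_unknow.txt: unclassified text about a meeting
  • stop_word.txt: stop-word list
- Steps:
  • 1) Write a cosine similarity function in Python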
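import numpy as np

def get_cossimi(x, y):
    # Convert the inputs to arrays so the arithmetic is element-wise
    myx = np.array(x)
    myy = np.array(y)
    cos1 = np.sum(myx * myy)              # dot product of x and y
    cos21 = np.sqrt(np.sum(myy * myy))    # norm of y
    cos22 = np.sqrt(np.sum(myx * myx))    # norm of x
    return cos1 / float(cos22 * cos21)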
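
One edge case is worth noting: get_cossimi divides by the product of the two norms, which is zero whenever one vector is all zeros, for example when a test text shares no words with the sample. The sketch below shows one possible guard. The name `safe_cossimi`, the choice to return 0.0 in that case, and the use of the standard library instead of numpy are all assumptions made for illustration.

```python
import math


def safe_cossimi(x, y):
    """Cosine similarity with a zero-vector guard (hypothetical variant)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumed convention: an all-zero vector has similarity 0 with anything
        return 0.0
    return dot / (norm_x * norm_y)


no_overlap = safe_cossimi([3, 1, 2], [0, 0, 0])   # guarded case
identical = safe_cossimi([3, 1, 2], [3, 1, 2])    # close to 1
print(no_overlap, identical)
```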
  • 2) Preprocess the military sample text (segment it into words, remove stop words, and count the frequency of each word)
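import jieba

# Read the military sample text and segment it with jieba
with open('../cosine_similarity/military_sample.txt', 'r', encoding='utf-8') as sample0:
    sample = sample0.read()
sample_cut = jieba.cut(sample)

# Read the stop-word list (one word per line); a set makes lookups fast
with open('../cosine_similarity/stop_word.txt', 'r', encoding='utf-8') as stop0:
    stop = stop0.read()
stop = set(stop.split('\n'))

# Count the frequency of each term
test_words = {}   # vocabulary template: every sample word mapped to 0
all_words = {}    # word frequencies of the sample text
for myword in sample_cut:
    if myword.strip() not in stop:
        test_words.setdefault(myword, 0)
        all_words.setdefault(myword, 0)
        all_words[myword] += 1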
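  • The dict.setdefault(key, default=None) method:
    – If key is not in the dictionary, it is inserted with the value default and that value is returned; if key already exists, the dictionary is left unchanged and the existing value is returned.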
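
The behaviour of setdefault described above can be checked with a small sketch. The word list here is made up for illustration and stands in for the output of jieba.cut.

```python
# Hypothetical token list standing in for segmented text
words = ["tank", "navy", "tank"]

counts = {}
for word in words:
    counts.setdefault(word, 0)   # inserts word with value 0 only if it is missing
    counts[word] += 1

# On an existing key, setdefault ignores the default and returns the stored value
returned = counts.setdefault("tank", 99)

print(counts)     # {'tank': 2, 'navy': 1}
print(returned)   # 2
```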

  • 3) Read the two texts to be classified and segment them into words

import copy

# Read the first text to be classified and segment it
with open('../cosine_similarity/military_unknow.txt', 'r', encoding='utf-8') as military_unknow0:
    military_unknow = military_unknow0.read()
military_unknow_cut = jieba.cut(military_unknow)
# Read the second text to be classified and segment it
with open('../cosine_similarity/meeting_unknow.txt', 'r', encoding='utf-8') as meeting_unknow0:
    meeting_unknow = meeting_unknow0.read()
meeting_unknow_cut = jieba.cut(meeting_unknow)
# Remove stop words and count frequencies over the sample vocabulary only
military_unknow_word = copy.deepcopy(test_words)
for myword in military_unknow_cut:
    if myword.strip() not in stop:
        if myword in military_unknow_word:
            military_unknow_word[myword] += 1

meeting_unknow_word = copy.deepcopy(test_words)
for myword in meeting_unknow_cut:
    if myword.strip() not in stop:
        if myword in meeting_unknow_word:
            meeting_unknow_word[myword] += 1
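  • The copy.deepcopy(object) method:
    – deepcopy() builds a new, fully independent copy of an object, recursively copying everything it contains. Later changes to the original do not affect the copy, and vice versa.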
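
A short sketch of why the copy matters: each test text needs its own counter dictionary, so that counting one text does not alter the shared template. The dictionaries below are made-up stand-ins for test_words. The second half contrasts copy.copy with copy.deepcopy on a nested object, which is where the two differ.

```python
import copy

# Hypothetical vocabulary template, standing in for test_words
template = {"tank": 0, "navy": 0}

counts = copy.deepcopy(template)
counts["tank"] += 1              # only the copy changes

# With nested containers, a shallow copy still shares the inner objects
nested = {"words": ["tank"]}
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
nested["words"].append("navy")   # visible through shallow, not through deep

print(template, counts)
print(shallow["words"], deep["words"])
```

For a flat dictionary of integers such as test_words, dict.copy() would give the same result; deepcopy is simply the safer default.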

  • 4) Compute and print the cosine similarity between each test text and the sample text

sample_data = []
military_data = []
meeting_data = []
# Build the three frequency vectors in the same word order
for key in all_words.keys():
    sample_data.append(all_words[key])
    military_data.append(military_unknow_word[key])
    meeting_data.append(meeting_unknow_word[key])
military_similarity = get_cossimi(sample_data, military_data)
meeting_similarity = get_cossimi(sample_data, meeting_data)
print(military_similarity)
print(meeting_similarity)
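
The key point of steps 2 to 4 is that all three vectors are indexed by the sample's vocabulary, so position i always refers to the same word. The sketch below replays the pipeline on tiny, made-up token lists so it runs without jieba or the text files. The tokens, the helper names `count_against` and `cosine`, and the pure-Python cosine are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def count_against(vocab, tokens):
    """Count only tokens that appear in vocab; other words are ignored."""
    counts = dict.fromkeys(vocab, 0)
    for token in tokens:
        if token in counts:
            counts[token] += 1
    return counts


# Made-up, already segmented and stop-word-filtered texts
sample_tokens = ["tank", "navy", "tank", "missile"]
military_tokens = ["tank", "missile", "drill", "tank"]
meeting_tokens = ["agenda", "vote", "navy"]

# Sample word frequencies define the vocabulary and its order
all_words = {}
for token in sample_tokens:
    all_words[token] = all_words.get(token, 0) + 1

military_counts = count_against(all_words, military_tokens)
meeting_counts = count_against(all_words, meeting_tokens)

# Align the three vectors on the same key order
sample_vec = [all_words[k] for k in all_words]
military_vec = [military_counts[k] for k in all_words]
meeting_vec = [meeting_counts[k] for k in all_words]

military_sim = cosine(sample_vec, military_vec)
meeting_sim = cosine(sample_vec, meeting_vec)
print(sample_vec, military_vec, meeting_vec)
print(military_sim, meeting_sim)
```

Words that never occur in the sample ("drill", "agenda", "vote") contribute nothing, so a test text is scored only on how it uses the sample's vocabulary.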
  • 5) Complete code:
# -*- coding: utf-8 -*-
import copy

import jieba
import numpy as np

# Custom cosine similarity function
def get_cossimi(x, y):
    myx = np.array(x)
    myy = np.array(y)
    cos1 = np.sum(myx * myy)              # dot product of x and y
    cos21 = np.sqrt(np.sum(myy * myy))    # norm of y
    cos22 = np.sqrt(np.sum(myx * myx))    # norm of x
    return cos1 / float(cos22 * cos21)

# Read the sample text and segment it
with open('../cosine_similarity/military_sample.txt', 'r', encoding='utf-8') as sample0:
    sample = sample0.read()
sample_cut = jieba.cut(sample)
# Read the stop-word list (one word per line)
with open('../cosine_similarity/stop_word.txt', 'r', encoding='utf-8') as stop0:
    stop = stop0.read()
stop = set(stop.split('\n'))

test_words={}
all_words={}
for myword in sample_cut:
    if myword.strip() not in stop:
        test_words.setdefault(myword,0)
        all_words.setdefault(myword,0)
        all_words[myword]+=1


# Read the texts to be classified
# First text: read and segment
with open('../cosine_similarity/military_unknow.txt', 'r', encoding='utf-8') as military_unknow0:
    military_unknow = military_unknow0.read()
military_unknow_cut = jieba.cut(military_unknow)

# Second text: read and segment
with open('../cosine_similarity/meeting_unknow.txt', 'r', encoding='utf-8') as meeting_unknow0:
    meeting_unknow = meeting_unknow0.read()
meeting_unknow_cut = jieba.cut(meeting_unknow)

# Remove stop words from the test texts and build their word-frequency vectors
military_unknow_word=copy.deepcopy(test_words)
for myword in military_unknow_cut:
    if myword.strip() not in stop:
        if myword in military_unknow_word:
            military_unknow_word[myword]+=1

meeting_unknow_word=copy.deepcopy(test_words)
for myword in meeting_unknow_cut:
    if myword.strip() not in stop:
        if myword in meeting_unknow_word:
            meeting_unknow_word[myword]+=1

# Compute and print the cosine similarity between the sample and each test text
sample_data = []
military_data = []
meeting_data = []
for key in all_words.keys():
    sample_data.append(all_words[key])
    military_data.append(military_unknow_word[key])
    meeting_data.append(meeting_unknow_word[key])
military_similarity = get_cossimi(sample_data, military_data)
meeting_similarity = get_cossimi(sample_data, meeting_data)
print("Word frequencies of the military sample text:")
print(sample_data)
print("Word frequencies of the unclassified military text:")
print(military_data)
print("Word frequencies of the unclassified meeting text:")
print(meeting_data)
print('------------------------------------')
print('------------------------------------')
print('Similarity between [military_unknow] and sample [military_sample]: %f' % military_similarity)
print('Similarity between [meeting_unknow] and sample [military_sample]: %f' % meeting_similarity)
  • Final result:

[Figure: screenshot of the program output]

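  • The results of running the code above are satisfactory: the similarity between military_unknow.txt and military_sample.txt is 0.48, while the similarity between meeting_unknow.txt and military_sample.txt is 0.19. military_unknow.txt is therefore classified as military.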
