Purpose
Theory
Algorithm: model selection
Step-by-step implementation:
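The jieba.cut method accepts three input parameters: the string to be segmented; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used.

Build the stop-word list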
    import codecs

    # Read the stop-word file, one word per line
    stop_words = 'demo/stop_words.txt'
    with codecs.open(stop_words, 'r', encoding='utf8') as f:
        stopwords = [w.strip() for w in f]
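Every token is later checked against the stop-word list, so a set may be a better container than a list when the file is large. The sketch below assumes one word per line; `load_stopwords` and the temporary sample file are hypothetical names used only for illustration.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, 'r', encoding='utf8') as f:
        return {line.strip() for line in f if line.strip()}


# Hypothetical sample file standing in for demo/stop_words.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf8') as tmp:
    tmp.write('的\n了\n\n和\n')

sample_stopwords = load_stopwords(tmp.name)
os.remove(tmp.name)
print(sorted(sample_stopwords))
```

Membership tests such as `word not in sample_stopwords` then take roughly constant time regardless of the list size.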
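    # Part-of-speech tags to discard (punctuation, conjunctions, particles,
    # adverbs, prepositions, time words, numerals, locatives, pronouns)
    stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r']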
    import jieba.posseg as pseg

    def tokenization(filename):
        """Segment a file, dropping stop words and unwanted parts of speech."""
        result = []
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
            words = pseg.cut(text)
        for word, flag in words:
            if flag not in stop_flag and word not in stopwords:
                result.append(word)
        return result
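The filtering step can be tried without jieba by feeding it hand-written (word, flag) pairs. The pairs below are made up for illustration and are not real jieba output, and `filter_tokens` is a hypothetical helper rather than part of the original code.

```python
def filter_tokens(tagged_words, stop_flags, stop_words):
    """Keep words whose POS flag and surface form are both allowed."""
    return [word for word, flag in tagged_words
            if flag not in stop_flags and word not in stop_words]


# Hand-written (word, flag) pairs; not actual jieba output
tagged = [('湿地', 'n'), ('的', 'uj'), ('保护', 'v'), ('与', 'c'),
          ('恢复', 'v'), ('工程', 'n'), (',', 'x')]
stop_flags = {'x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r'}
stop_words = {'工程'}

kept = filter_tokens(tagged, stop_flags, stop_words)
print(kept)
```

Here the particle, conjunction, and punctuation are removed by their flags, while '工程' is removed because it appears in the stop-word set.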
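    import os
    import re

    corpus = []
    dirname = 'demo/articles'
    filenames = []
    for root, dirs, files in os.walk(dirname):
        for f in files:
            # Keep only files named with Chinese characters and a .txt extension
            if re.match(r'[\u4e00-\u9fa5]*\.txt', f):
                # Pass the full path so the file can be opened from any directory
                corpus.append(tokenization(os.path.join(root, f)))
                filenames.append(f)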
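Because an unescaped `.` in a regular expression matches any character, a file-name filter of this kind is worth checking against a few names. The sketch below uses a stricter pattern than the walk above (at least one Chinese character, a literal dot, and a full-string match); that strictness is an assumption about the intent, and the file names are made up.

```python
import re

# At least one Chinese character followed by a literal ".txt"
PATTERN = re.compile(r'[\u4e00-\u9fa5]+\.txt')


def is_chinese_txt(name):
    """Return True if the whole name is Chinese characters plus .txt."""
    return PATTERN.fullmatch(name) is not None


for name in ['招标公告.txt', 'readme.txt', '招标公告.doc', '公告txt']:
    print(name, is_chinese_txt(name))
```

Only the first name passes; the last one would slip through a pattern whose dot is left unescaped.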
    from gensim import corpora, models, similarities

    dictionary = corpora.Dictionary(corpus)
    print(len(dictionary))

Build the bag-of-words model

    # Generate the bag-of-words vectors for the corpus
    doc_vectors = [dictionary.doc2bow(text) for text in corpus]

Build the TF-IDF model

    # Fit the TF-IDF model on the corpus
    tfidf = models.TfidfModel(doc_vectors)
    # doc_test_vec is the bag-of-words vector of the query text
    tfidf_vectors = tfidf[doc_test_vec]  # TF-IDF value of each word

Compute similarity with a similarity matrix

    index = similarities.MatrixSimilarity(tfidf[doc_vectors])
    sim = index[tfidf[doc_test_vec]]  # similarity score per document index
    print(sim)

Sort by similarity

    scores = sorted(enumerate(sim), key=lambda item: -item[1])  # descending
    print(scores[0][1])
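To see what the gensim calls above compute, the pipeline can be approximated in plain Python. This is a sketch and not gensim's implementation: it assumes raw term counts weighted by log2(N/df) and L2-normalised vectors, which is my understanding of the `TfidfModel` defaults, and cosine similarity for the comparison. The toy corpus and all function names are invented for illustration.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Map each term to log2(N / document frequency)."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log2(n_docs / count) for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Return an L2-normalised {term: weight} vector; unknown terms are dropped."""
    counts = Counter(term for term in doc if term in idf)
    vec = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Toy, already-tokenised corpus (hypothetical data)
toy_corpus = [
    ['中学', '宿舍楼', '建设', '项目'],
    ['湿地', '保护', '恢复', '建设', '工程'],
    ['道路', '改造', '工程'],
]
idf = build_idf(toy_corpus)
doc_vecs = [tfidf_vector(doc, idf) for doc in toy_corpus]

query = tfidf_vector(['湿地', '保护', '工程'], idf)
toy_sim = [cosine(query, vec) for vec in doc_vecs]
toy_scores = sorted(enumerate(toy_sim), key=lambda item: -item[1])
print(toy_scores)
```

The query shares its two rarest terms with the second document, so that document ranks first; the first document shares no terms and scores zero.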
# Sample score output:
Test data: 富宁县里达中学宿舍楼建设项目
Matched result set (in descending order of match score): [(2, 1.0), (31, 0.07981655), (43, 0.077732354), (33, 0.06620947), (30, 0.065360494), (14, 0.061563488), (6, 0.05077639), (22, 0.05062699), (7, 0.044322222), (42, 0.044024862), (21, 0.043359876), (26, 0.035853535), (27, 0.03457492), (29, 0.033902794), (45, 0.03236963), (25, 0.031936638), (40, 0.030814772), (48, 0.030788476), (20, 0.027607089), (8, 0.02558621), (11, 0.024541285), (5, 0.024447413), (28, 0.020779021), (4, 0.020459857), (13, 0.015429099), (34, 0.014453442), (50, 0.011855431), (36, 0.006562164), (0, 0.006476198), (32, 0.0051991176), (46, 0.00477116), (35, 0.0047449875), (38, 0.004728446), (18, 0.004499278), (41, 0.004158474), (44, 0.0037516006), (47, 0.0036311403), (15, 0.003384664), (37, 0.00318741), (23, 0.0030692797), (17, 0.0022487652), (39, 0.0020392523), (24, 0.0016430109), (12, 0.0014699087), (1, 0.0), (3, 0.0), (9, 0.0), (10, 0.0), (16, 0.0), (19, 0.0), (49, 0.0)]
Analysis result: winning-bid project: 富宁县里达中学宿舍楼建设项目; maximum match score: 1.0

Test data: 湿地保护与恢复建设工程
Matched result set (in descending order of match score): [(13, 0.57420367), (40, 0.10633894), (48, 0.106248185), (43, 0.10532686), (49, 0.0816016), (12, 0.077999234), (31, 0.07725123), (25, 0.07712983), (11, 0.058984473), (50, 0.05736675), (7, 0.047928438), (34, 0.04754001), (33, 0.04504219), (30, 0.038571842), (22, 0.037484765), (27, 0.03233484), (45, 0.031974725), (14, 0.0313408), (26, 0.030683806), (5, 0.030661184), (2, 0.026870431), (4, 0.02638424), (8, 0.026375605), (20, 0.02581845), (35, 0.024404963), (32, 0.019936334), (28, 0.019432766), (44, 0.018292043), (42, 0.018038727), (38, 0.01745583), (6, 0.017230202), (17, 0.015729848), (46, 0.013131632), (29, 0.012461022), (19, 0.0117950225), (47, 0.0064870343), (0, 0.0), (1, 0.0), (3, 0.0), (9, 0.0), (10, 0.0), (15, 0.0), (16, 0.0), (18, 0.0), (21, 0.0), (23, 0.0), (24, 0.0), (36, 0.0), (37, 0.0), (39, 0.0), (41, 0.0)]
Analysis result: winning-bid project: 四川省若尔盖国际重要湿地保护与恢复建设工程第1标段; maximum match score: 0.57420367
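The "analysis result" line in the samples pairs the top-ranked index with a project name. One way such a look-up could be done is sketched below; the `names` mapping and the `best_match` helper are hypothetical, and only the first three score pairs of the second sample are reused.

```python
def best_match(scores, names):
    """Return (name, score) for the highest-scoring entry."""
    index, score = max(scores, key=lambda item: item[1])
    return names.get(index, '<unknown>'), score


# First three pairs from the second sample
sample_scores = [(13, 0.57420367), (40, 0.10633894), (48, 0.106248185)]
# Hypothetical index-to-name table; only index 13 is known from the sample
names = {13: '四川省若尔盖国际重要湿地保护与恢复建设工程第1标段'}

name, score = best_match(sample_scores, names)
print(f'Winning-bid project: {name}; maximum match score: {score}')
```

In the original pipeline the names would presumably come from the `filenames` list collected while walking the corpus directory, but that mapping is not shown in the source.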