The main class is Similarity, which builds an index for a given set of documents. Once the index is built, you can perform efficient queries like “Tell me how similar is this query document to each document in the index?”. The result is a vector of numbers as large as the size of the initial set of documents, that is, one float for each index document. Alternatively, you can also request only the top-N most similar index documents to the query.
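To make the two query modes concrete, here is a minimal pure-Python sketch of what such an index computes conceptually: cosine similarity between the query and every indexed document. The class `ToySimilarityIndex` and the helper `to_unit_vector` are hypothetical names invented for this illustration; this is not gensim's implementation, which shards the index to disk and works on sparse matrices.

```python
import math
from collections import Counter


def to_unit_vector(tokens):
    """Turn a token list into an L2-normalised sparse count vector (a dict)."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


class ToySimilarityIndex:
    """Hypothetical stand-in for a similarity index; not gensim's code."""

    def __init__(self, tokenised_docs):
        self.vectors = [to_unit_vector(doc) for doc in tokenised_docs]

    def query(self, tokens, num_best=None):
        q = to_unit_vector(tokens)
        # Cosine similarity of unit vectors is just their dot product.
        sims = [sum(w * doc.get(tok, 0.0) for tok, w in q.items())
                for doc in self.vectors]
        if num_best is None:
            return sims  # one float per indexed document
        ranked = sorted(enumerate(sims), key=lambda pair: pair[1], reverse=True)
        return ranked[:num_best]  # (index_of_document, similarity) pairs


docs = [
    ["divorce", "court", "property"],
    ["drugs", "possession", "sentence"],
    ["divorce", "child", "custody", "court"],
]
index = ToySimilarityIndex(docs)
all_scores = index.query(["divorce", "child"])
top_two = index.query(["divorce", "child"], num_best=2)
print(all_scores)
print(top_two)
```

The first call returns one score per indexed document, and the second returns only the best-scoring `(index, score)` pairs, mirroring the full-vector and top-N modes described above.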
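Example code. To make the results easier to read, each training document is prefixed with its index number.
# util_words_cut is the author's own jieba-based word-segmentation helper (not shown here)
from gensim import corpora
from gensim.similarities import Similarity

# Training samples
raw_documents = [
    '10夫妻双方1990年按农村习俗举办婚礼没有结婚证 一方可否起诉离婚',
    # (the other training documents are not shown in the source)
]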
corpora_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_list(item_text)
    corpora_documents.append(item_str)
# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(corpora_documents)
corpus = [dictionary.doc2bow(text) for text in corpora_documents]
similarity = Similarity('-Similarity-index', corpus, num_features=400)
test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
test_corpus_1 = dictionary.doc2bow(test_cut_raw_1)
similarity.num_best = 5
print(similarity[test_corpus_1])  # returns the most similar training documents as (index_of_document, similarity) tuples
test_data_2 = '家人因涉嫌运输毒品被抓,她只是去朋友家探望朋友的,结果就被抓了,还在朋友家收出毒品,可家人的身上和行李中都没有。现在已经拘留10多天了,请问会被判刑吗'
test_cut_raw_2 = util_words_cut.get_class_words_list(test_data_2)
test_corpus_2 = dictionary.doc2bow(test_cut_raw_2)
similarity.num_best = 5
print(similarity[test_corpus_2])  # returns the most similar training documents as (index_of_document, similarity) tuples

Output:
/usr/bin/python3.4 /data/work/python-workspace/test_doc_similarity.py
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.521 seconds.
Prefix dict has been built succesfully.
adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(61 unique tokens: ['丈夫', '法院', '结婚', '住房', '出资']...) from 16 documents (total 89 corpus positions)
starting similarity index under -Similarity-index
[(14, 0.40824830532073975), (15, 0.40824830532073975), (10, 0.35355338454246521)]
creating sparse index
creating sparse matrix from corpus
PROGRESS: at document #0/16
created <16x400 sparse matrix of type ''
with 86 stored elements in Compressed Sparse Row format>
creating sparse shard #0
saving index shard to -Similarity-index.0
saving SparseMatrixSimilarity object under -Similarity-index.0, separately None
loading SparseMatrixSimilarity object from -Similarity-index.0
[(6, 0.50395262241363525), (2, 0.47140452265739441), (4, 0.33333337306976318), (1, 0.29814240336418152), (5, 0.29814240336418152)]
Process finished with exit code 0
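
The dictionary and `doc2bow` step used above can be pictured with a small pure-Python sketch. The helpers `build_token_ids` and `to_bow` are hypothetical names for this illustration, and assigning ids in first-seen order is an assumption of the sketch rather than a statement about how gensim numbers its tokens. What it does mirror is that a bag-of-words vector is a sorted list of `(token_id, count)` pairs and that query tokens missing from the dictionary are dropped.

```python
from collections import Counter


def build_token_ids(tokenised_docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in tokenised_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


train = [["divorce", "court"], ["drugs", "court", "sentence"]]
token2id = build_token_ids(train)
# "child" never occurs in the training documents, so it is ignored.
query_bow = to_bow(["divorce", "divorce", "child"], token2id)
print(token2id)
print(query_bow)
```

This also suggests one way a score such as 0.408 in the output above can arise: if a query matches a document on a single term and that document holds six distinct terms, each occurring once, the cosine similarity is 1/sqrt(6) ≈ 0.408. Whether that is what happened for documents 14 and 15 cannot be confirmed from the log alone.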
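# Using doc2vec to measure similarity
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()
corpora_documents = []
for i, item_text in enumerate(raw_documents):
    words_list = util_words_cut.get_class_words_list(item_text)
    document = TaggedDocument(words=words_list, tags=[i])
    corpora_documents.append(document)
print(corpora_documents[:2])
# size= and iter= are the pre-4.0 gensim names; gensim 4 renamed them to vector_size= and epochs=
model = Doc2Vec(size=89, min_count=1, iter=10)
model.build_vocab(corpora_documents)
model.train(corpora_documents)  # newer gensim releases also require total_examples= and epochs=
print('#########', model.vector_size)
test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_list(test_data_1)
print(test_cut_raw_1)
inferred_vector = model.infer_vector(test_cut_raw_1)
print(inferred_vector)
sims = model.docvecs.most_similar([inferred_vector], topn=3)
print(sims)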

Output:
Pattern library is not installed, lemmatization won't be available.
'pattern' package not found; tag filters are not available for English
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.513 seconds.
Prefix dict has been built succesfully.
consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
collected 61 word types and 16 unique tags from a corpus of 16 examples and 89 words
min_count=1 retains 61 unique words (drops 0)
min_count leaves 89 word corpus (100% of original 89)
deleting the raw counts dictionary of 61 items
sample=0 downsamples 0 most-common words
downsampling leaves estimated 89 word corpus (100.0% of prior 89)
estimated required memory for 61 words and 89 dimensions: 91828 bytes
constructing a huffman tree from 61 words
built huffman tree with maximum node depth 7
resetting layer weights
training model with 1 workers on 61 vocabulary and 89 features, using sg=0 hs=1 sample=0 negative=0
expecting 16 sentences, matching count from corpus used for vocabulary survey
[TaggedDocument(words=['无偿', '居间', '介绍', '买卖', '毒品', '定性'], tags=[0]), TaggedDocument(words=['吸毒', '动态', '持有', '毒品', '认定'], tags=[1])]
worker thread finished; awaiting finish of 0 more threads
training on 890 raw words (1050 effective words) took 0.0s, 506992 effective words/s
under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
######### 89
['离婚', '孩子', '自动', '生效', '离婚']
[ 2.54629389e-03 1.87756249e-03 -9.76708368e-04 -5.15014399e-03
-7.54948880e-04 -3.74549557e-03 5.37392031e-03 3.35739669e-03
-3.50345811e-03 2.63415743e-03 -1.32059853e-03 -4.15759953e-03
-2.39425618e-03 -6.20105816e-03 -1.42006821e-03 -4.64246795e-03
3.78829846e-03 1.47493952e-03 4.49652784e-03 -5.57655795e-03
-1.40081509e-04 -7.10823014e-03 -5.34327468e-04 -4.21888893e-03
-2.96280603e-03 6.52066898e-04 5.98943839e-03 -4.01164964e-03
2.49637989e-03 -9.08742077e-04 4.65002051e-03 9.24886088e-04
1.67128560e-03 -1.93383044e-03 -4.58135502e-03 1.78024184e-03
-9.60796722e-04 7.26479106e-04 4.50814469e-03 2.58095766e-04
-4.53767460e-03 -1.72883295e-03 -3.89566552e-03 4.85864235e-03
5.90517826e-04 4.30173194e-03 3.37816169e-03 -1.08716707e-03
1.85196218e-03 1.94042712e-03 1.20989932e-03 -4.69703926e-03
-5.35873650e-03 -1.35291950e-03 -4.62053996e-03 2.15436472e-03
4.05823253e-03 8.01778078e-05 -3.84314684e-03 1.11574796e-03
-4.36050585e-03 -3.31182266e-03 -2.15692003e-03 -2.09038518e-03
4.50274721e-03 -1.85286190e-04 -5.09306230e-03 -1.12043330e-04
8.25022871e-04 2.60405545e-03 -1.73542544e-03 5.14509249e-03
-9.16058663e-04 1.01291772e-03 -7.90049613e-04 4.20650374e-03
-3.00139328e-03 3.34924040e-03 -2.11520446e-03 4.79168072e-03
2.11459701e-03 -3.07943812e-03 -5.09956060e-03 -2.34926818e-03
7.30032055e-03 -5.31428820e-03 -2.96888268e-03 4.95154131e-03
[(15, 0.2670447528362274), (14, 0.18831682205200195), (10, 0.07022987306118011)]
precomputing L2-norms of doc weight vectors
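
The ranking step behind `most_similar` can be sketched in plain Python. The log line about precomputing L2-norms suggests the stored document vectors are normalised so that a dot product gives the cosine similarity; the sketch below makes that assumption explicit. The functions `l2_normalise` and `most_similar` here are hypothetical helpers written for this illustration, not gensim's code, and the two-dimensional vectors are made up.

```python
import math


def l2_normalise(vec):
    """Scale vec to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def most_similar(query_vec, doc_vecs, topn=3):
    """Rank stored document vectors by cosine similarity to query_vec."""
    # A real index would normalise the stored vectors once, not per query.
    unit_docs = [l2_normalise(v) for v in doc_vecs]
    unit_query = l2_normalise(query_vec)
    scores = [sum(q * d for q, d in zip(unit_query, doc)) for doc in unit_docs]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]  # (tag, similarity) pairs, best first


doc_vecs = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # made-up document vectors
result = most_similar([2.0, 1.0], doc_vecs, topn=2)
print(result)
```

With only 16 short training documents and 10 passes, the learned vectors stay close to their random initialisation, which may explain why the similarities in the output above are so low.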
Related documentation: https://radimrehurek.com/gensim/models/doc2vec.html
LSH Forest: Locality Sensitive Hashing forest [1] is an alternative method for vanilla approximate nearest neighbor search methods. LSH forest data structure has been implemented using sorted arrays and binary search and 32 bit fixed-length hashes. Random projection is used as the hash family which approximates cosine distance.
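
The ingredients named in this description (random-projection hashes, fixed 32-bit length, sorted arrays, binary search) can be sketched as follows. This is a simplified single-array illustration under stated assumptions, not scikit-learn's implementation: a real LSH forest keeps several such arrays and re-ranks the candidates by their true distance. All function names and the example vectors are hypothetical.

```python
import bisect
import random

N_BITS = 32  # fixed hash length, as in the description above


def make_hyperplanes(dim, n_bits=N_BITS, seed=42):
    """Draw n_bits random Gaussian directions (the random-projection family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_vector(vec, hyperplanes):
    """One bit per hyperplane: '1' if vec is on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(h * x for h, x in zip(plane, vec))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_index(vectors, hyperplanes):
    """Return (sorted hashes, matching row ids) for binary search."""
    pairs = sorted((hash_vector(v, hyperplanes), i) for i, v in enumerate(vectors))
    return [h for h, _ in pairs], [i for _, i in pairs]


def candidates(query, hashes, row_ids, hyperplanes, min_candidates=1):
    """Shorten the query's hash prefix until enough indexed rows share it."""
    q_hash = hash_vector(query, hyperplanes)
    for prefix_len in range(len(q_hash), -1, -1):
        prefix = q_hash[:prefix_len]
        lo = bisect.bisect_left(hashes, prefix)
        hi = bisect.bisect_left(hashes, prefix + "2")  # "2" sorts after "0" and "1"
        if hi - lo >= min_candidates:
            return sorted(row_ids[lo:hi])
    return []


vectors = [[1.0, 2.0, 0.0], [-1.0, -2.0, 0.5], [0.0, 0.1, 3.0]]
planes = make_hyperplanes(dim=3)
hashes, row_ids = build_index(vectors, planes)
# A positive multiple of row 0 points the same way, so it gets the same hash.
found = candidates([2.0, 4.0, 0.0], hashes, row_ids, planes)
print(found)
```

Each bit records only which side of a random hyperplane a vector falls on, so the hash depends on direction and not on length. Two vectors at angle θ agree on any single bit with probability 1 - θ/π, which is why long shared prefixes indicate a small cosine distance.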
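# Using LSH
# LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21, so this code needs an older release
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2), use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)
test_data_1 = '你好,我想问一下我想离婚他不想离,孩子他说不要,是六个月就自动生效离婚'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])
lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())  # the index must be fitted before it can be queried
distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)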

Output:
[[ 0.42264973 0.42264973 0.48875208]]
[[10 15 14]]
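
Unlike the two gensim examples, this output holds distances, so smaller values mean closer matches. Assuming `kneighbors` reports cosine distances, as the LSHForest documentation describes, they convert to similarities by subtracting from 1. The short sketch below does this for the values printed above; the variable names are chosen for this illustration.

```python
# Values copied from the output above.
distances = [0.42264973, 0.42264973, 0.48875208]
indices = [10, 15, 14]

# Cosine similarity = 1 - cosine distance (assumes cosine distances).
similarities = [(idx, round(1.0 - d, 8)) for idx, d in zip(indices, distances)]
print(similarities)
```

All three approaches in this post return documents 10, 14 and 15 for the first test query, although the scores and the ordering differ from method to method.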