Finding similar words for a seed word set in a corpus with Word2Vec

Day n of an NLP beginner's exploration......

Today I'm recording and sharing how to use gensim.models.word2vec.Word2Vec to find words similar to a seed word set within a corpus.

First, the task setup:

The full corpus contains both labeled and unlabeled text. Keywords are extracted from the labeled portion; we call this vocabulary the seed word set. The task is then to find, across the full corpus, words that are similar to the seed word set.

Steps:

1) Build a word2vec embedding model over the corpus;

2) Use gensim's word-similarity utilities to find, for each word in the seed set, its nearest neighbors in the corpus.

Sample data:

File name: userdic

[Figure 1: sample entries from userdic]

Notice that the entries in userdic are really phrases. After word segmentation, noise tokens such as "了" appear, so we need stopword removal and noise filtering.

Code:

Import the required libraries

# -*- coding: utf-8 -*-

import gensim
import jieba
import jieba.posseg as psg

Loading the stopword list

# Load the stopword list: one word per line, read line by line.
def get_stopword_list():
    # Open with an explicit encoding so matching is reliable.
    stop_word_path = './data/stopword.txt'
    with open(stop_word_path, encoding='utf-8') as f:
        stopword_list = [sw.strip() for sw in f.readlines()]
    return stopword_list

Removing noise words

# Filter out stopwords and noise tokens.
def word_filter(seg_list, pos=False):
    stopword_list = get_stopword_list()
    filter_list = []
    # The pos parameter controls part-of-speech filtering:
    # with pos=False every token is tagged 'n', so nothing is
    # dropped on POS grounds and all tokens pass through.
    for seg in seg_list:
        if not pos:
            word = seg
            flag = 'n'
        else:
            word = seg.word
            flag = seg.flag
        # When POS filtering is on, keep nouns only.
        if not flag.startswith('n'):
            continue
        # Drop stopwords and words shorter than 2 characters.
        if word not in stopword_list and len(word) > 1:
            filter_list.append(word)

    return filter_list
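Since word_filter depends on a stopword file on disk, here is a self-contained sketch of the same filtering rule with the stopword list inlined. (The tokens and stopwords below are invented for illustration; filter_tokens is a hypothetical stand-in, not part of the original code.)

```python
# Inlined stand-in for get_stopword_list(); the real code reads ./data/stopword.txt.
stopwords = {"了", "的", "是"}

def filter_tokens(tokens, stopwords):
    # Mirror word_filter(seg_list, pos=False): keep tokens that are
    # not stopwords and are at least 2 characters long.
    return [t for t in tokens if t not in stopwords and len(t) > 1]

tokens = ["机器", "学习", "了", "的", "词", "向量"]
print(filter_tokens(tokens, stopwords))  # → ['机器', '学习', '向量']
```

Here "了" and "的" are removed as stopwords, and the single-character "词" is removed by the length rule.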

Data processing

model is the w2v_model trained earlier on the full corpus.

(For details, see my earlier post: 使用gensim.models.word2vec.LineSentence之前的语料预处理_Papaya沐的博客-CSDN博客)

In model.wv.most_similar(), the parameter topn = k returns the k most similar words.

def dic_more(file_path, reduce_path):
    # Load the seed-word file.
    dicfile_read = open(file_path, encoding='utf-8')
    dic = dicfile_read.read()
    dicfile_write = open(reduce_path, 'w+', encoding='utf-8')

    seg_dic = jieba.lcut(dic, cut_all=False)  # segment the seed phrases
    # Remove stopwords and noise tokens.
    pos = False
    seg_dic = word_filter(seg_dic, pos)

    # Load the w2v model trained earlier on the full corpus.
    model = gensim.models.Word2Vec.load('model_corpus_w2v.word2vec')

    # For each seed word, look up its nearest neighbours in the
    # pretrained embedding space.
    i = 0  # running word counter
    for word in seg_dic:
        if word in model.wv:  # skip words missing from the model's vocabulary
            i += 1
            print(i, ".", word, file=dicfile_write)
            print(model.wv.most_similar(word, topn=5), file=dicfile_write)

    dicfile_read.close()
    dicfile_write.close()


# try it out
path_userdic = "./data/userdic.txt"
path_0_userdic = "./data/0_userdic.txt"
path_userdic_more = "./data/userdic_more.txt"
path_0_userdic_more = "./data/0_userdic_more.txt"

dic_more(path_userdic, path_userdic_more)
dic_more(path_0_userdic, path_0_userdic_more)

Results:

[Figure 2: sample output of similar words found for the seed set]



Original content takes effort; please credit the source when quoting!
