word2vec and GloVe

A quick summary of two commonly used text features: word2vec and GloVe.

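SNLI tasks usually use GloVe features such as glove.840B.300d, whose file is organized as follows:
word vector(300d), one per line
I wanted to try swapping in word2vec features to see how the results compare. The pretrained word2vec vectors are stored in binary (.bin) format, so the first step is to convert the .bin file into a .txt file with the same layout: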

import codecs
import gensim


def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    # Load the binary word2vec vectors.
    model = gensim.models.KeyedVectors.load_word2vec_format(path_to_model, binary=True)
    print('Successfully loaded Word2Vec')
    # gensim >= 4.0 exposes the vocabulary as key_to_index; older versions use vocab.
    vocab = model.key_to_index if hasattr(model, 'key_to_index') else model.vocab
    with codecs.open(output_file, 'w', 'utf-8') as output:
        for word in vocab:
            # One line per word: the token, then its components, separated by spaces.
            vector_str = " ".join(str(dimension) for dimension in model[word])
            output.write(word + " " + vector_str + "\n")


if __name__ == "__main__":
    main()
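
To check the exported file, or to read a GloVe-style text file in general, a loader along the following lines may be enough. This is only a minimal sketch: the function name `load_text_vectors` and its `dim` parameter are made up for illustration and are not part of gensim or GloVe. It assumes each line is a token followed by exactly `dim` space-separated numbers, and it takes the last `dim` fields as the vector so that a token containing spaces would still parse. Lines with too few fields (for example a "count dim" header) are skipped.

```python
def load_text_vectors(path, dim):
    """Read a 'word v1 v2 ... v_dim' text file into a dict of float lists.

    Hypothetical helper for illustration; `dim` is the expected vector size.
    """
    vectors = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            # Skip blank lines and header lines that cannot hold a full vector.
            if len(parts) < dim + 1:
                continue
            # Everything before the last `dim` fields is the token itself.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(value) for value in parts[-dim:]]
    return vectors
```

For the file produced above, the call would be `load_text_vectors('GoogleNews-vectors-negative300.txt', 300)`. Keeping everything in a plain dict of Python lists is memory-hungry for a vocabulary of this size, so a real experiment would likely filter to the task vocabulary first.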
